Link to this document's Jupyter Notebook

In order to successfully complete this assignment you need to participate both individually and in groups during class. If you attend class in person, have one of the instructors check your notebook and sign you out before leaving class on Monday April 19. If you are attending asynchronously, turn in your assignment using D2L no later than 11:59pm on Monday April 19.


In-Class Assignment: Checkpoint Restart

Comic by Jorge Cham about how, for a graduate student, saving data is more important than their own life. It is funny because it is true!

Agenda for today's class (70 minutes)

  1. What is checkpoint_restart and what problems does it solve?
  2. Using DMTCP on the HPCC.
  3. SIRS Forms

1. What is checkpoint_restart and what problems does it solve?

Flow chart showing how checkpoint/restart works with the DMTCP program. See details in link below.


2. Using DMTCP on the HPCC.

As a class we are going to look at the following submission script and try to figure out everything that it does. This is an opportunity to review what we learned at the beginning of the semester.

If you understand it, try to get an example working on the HPCC.

#!/bin/bash -login

## make job description and resource requests for short partial task:
#SBATCH -J count-longjob                  # Job Name
#SBATCH --time=04:00:00                   # Run time (hh:mm:ss): 4 hours
#SBATCH -N 1 -c 1 --mem=20MB              # requested resource
#SBATCH --constraint=lac

# Set a limited stack size so that DMTCP can work
ulimit -s 8192

# current working directory should contain the source code count.c
cd ${SLURM_SUBMIT_DIR}

# script name. This script will be resubmitted multiple times
export SLURM_JOBSCRIPT="TEMP_longjob.sb"

cp $0 $SLURM_JOBSCRIPT

# start dmtcp_coordinator
fname=port.$SLURM_JOBID                                                                 # to store port number
dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file $fname $@ 1>/dev/null 2>&1   # start coordinator
h=`hostname`                                                                            # get host name
p=`cat $fname`
export DMTCP_COORD_HOST=$h
export DMTCP_COORD_PORT=$p
#rm $fname

# print out some information
#echo "coordinator is on host $DMTCP_COORD_HOST "
#echo "port number is $DMTCP_COORD_PORT "
#echo " working directory: ${SLURM_SUBMIT_DIR} "
#echo " job script is $SLURM_JOBSCRIPT "

####################### BODY of the JOB ######################
# prepare work environment of the job
module swap GNU/6.4.0-2.28 GCC/4.9.2

# build the program if it does not exist
if [ ! -f count.exe ] 
then
    cc count.c -o count.exe
fi

# run the program count.exe. 
# To run interactively: 
# $ ./count.exe n num.odd 1> num.even 
# it will count to number n and generate 2 files: 
# num.odd contains all the odd numbers;
# num.even contains all the even numbers.

# To run with DMTCP, use dmtcp commands.
# if first time launch, use "dmtcp_launch"
# otherwise use "dmtcp_restart"

# set checkpoint interval. After dmtcp_launch starts the job, this script
# waits for the interval (in seconds), then starts the checkpoint.
export CKPT_WAIT_SEC=$(( (3*60+55) * 60 ))

# Launch or restart the execution
if [ ! -f ckpt_*.dmtcp ]         # no ckpt file exists, use dmtcp_launch
then
  # first run: start the job with dmtcp_launch
  echo " call dmtcp_launch "
  dmtcp_launch -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --rm --ckpt-open-files ./count.exe 800 num.odd 1> num.even 10>&- 11>&- &

  # wait for the checkpoint interval (in seconds) before checkpointing
  sleep $CKPT_WAIT_SEC

  # start checkpointing
  # echo " start dmtcp checkpointing"
  dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --ckpt-open-files --bcheckpoint
  # echo " finish dmtcp checkpointing"

  # kill the running job after checkpointing
  # echo " terminate job after checkpoint "
  dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --quit
  # echo " terminate job after checkpoint "

  # resubmit the job
  echo resubmit ${SLURM_JOBSCRIPT}
  sbatch $SLURM_JOBSCRIPT

else
  # restart job with checkpoint files
  echo " call dmtcp_restart "
  dmtcp_restart -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT ckpt_*.dmtcp 1> num.even &
  # echo " restarted "

  # wait for a checkpoint interval to start checkpointing
  sleep $CKPT_WAIT_SEC

  # if program is running, do the checkpoint and resubmit
  if dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT -s 1>/dev/null 2>&1
  then   
    # echo " start checkpointing again "
    # clean up old ckpt files before starting a new ckpt
    rm -r ckpt_*.dmtcp
    dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --ckpt-open-files -bc
    # echo " finish checkpointing again "
    # kill the running program
    dmtcp_command -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT --quit
    # resubmit this script to slurm
    echo resubmit $SLURM_JOBSCRIPT
    sbatch $SLURM_JOBSCRIPT
  else
    echo "job finished"
  fi
fi

scontrol show job $SLURM_JOB_ID
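
The wait interval set in the script is easier to reason about with the arithmetic written out. The sketch below only repeats the script's own expression and compares it with the requested wall time. The names WALLTIME_SEC and MARGIN_SEC are introduced here for illustration and do not appear in the script.

```bash
# Same expression as in the submission script: 3 hours 55 minutes, in seconds.
CKPT_WAIT_SEC=$(( (3*60+55) * 60 ))

# Hypothetical helper names (not in the script): the wall time requested by
# "#SBATCH --time=04:00:00" and the time left over once the sleep has ended.
WALLTIME_SEC=$(( 4 * 60 * 60 ))
MARGIN_SEC=$(( WALLTIME_SEC - CKPT_WAIT_SEC ))

echo "wait before checkpoint: ${CKPT_WAIT_SEC} s"
echo "wall time requested:    ${WALLTIME_SEC} s"
echo "margin for checkpoint and resubmit: ${MARGIN_SEC} s"
```

With these numbers the job sleeps for 14100 seconds and leaves 300 seconds (5 minutes) of the 4 hour allocation to write the checkpoint, stop the program, and resubmit the script.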

QUESTION: What does the #SBATCH --constraint=lac line do in the above script? Why is it needed? How did you figure this out?

Put your answer to the above question here

QUESTION: What does the line if [ ! -f count.exe ] do in the above script?

Put your answer to the above question here
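
One way to explore the file test is to try it in an empty scratch directory. The sketch below is an illustration only and is not part of the assignment script: the function name has_ckpt, the scratch directory, and the file names are made up here. It also shows a loop-based check that keeps working when a pattern such as ckpt_*.dmtcp matches more than one file, a case in which [ -f ckpt_*.dmtcp ] receives too many arguments.

```bash
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# [ ! -f NAME ] is true when NAME is not an existing regular file.
if [ ! -f count.exe ]; then
    before="missing"
fi
touch count.exe                      # stand-in for the compiled program
if [ -f count.exe ]; then
    after="present"
fi
echo "count.exe was $before, now $after"

# Hypothetical helper: succeeds if at least one checkpoint file exists,
# no matter how many files the pattern matches.
has_ckpt() {
    for f in ckpt_*.dmtcp; do
        [ -f "$f" ] && return 0
    done
    return 1
}

has_ckpt && found_before="yes" || found_before="no"
touch ckpt_a.dmtcp ckpt_b.dmtcp      # two made-up checkpoint files
has_ckpt && found_after="yes" || found_after="no"
echo "checkpoint found: before=$found_before after=$found_after"

# Remove the scratch directory again.
cd / && rm -rf "$workdir"
```

The loop relies on the shell leaving an unmatched pattern as literal text, which then fails the -f test, so the helper returns a non-zero status when no checkpoint file is present.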

QUESTION: The above script declares a large number of variables. List the variables here:

Put your answer to the above question here

QUESTION: Some variables in the above script are declared using the export command and some are not. What is the difference?

Put your answer to the above question here
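
A small experiment can make the difference visible. The sketch below uses two made-up variable names, LOCAL_ONLY and SHARED, and starts a child shell with sh -c to see which of them the child receives. It is an illustration under those assumptions, not part of the script.

```bash
# Plain assignment: visible only inside the current shell.
LOCAL_ONLY="set without export"

# Exported assignment: copied into the environment of child processes.
export SHARED="set with export"

# Single quotes keep the current shell from expanding the variables,
# so the child shell looks them up in its own environment.
child_local=$(sh -c 'echo "${LOCAL_ONLY:-<unset>}"')
child_shared=$(sh -c 'echo "${SHARED:-<unset>}"')

echo "child sees LOCAL_ONLY as: $child_local"
echo "child sees SHARED as:     $child_shared"
```

The same distinction applies to the submission script: any program the script starts can read a variable that was exported, while a variable assigned without export stays inside the script's own shell.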

QUESTION: The above script can be intimidating when you don't understand what all of the lines of code are doing. However, that doesn't mean you can't use it in your own research. Assume that you want to run a python script using the line python mylongprogram.py. What lines would you try to modify in the above script to try to run this command using checkpoint/restart?

Put your answer to the above question here

QUESTION: Will the above checkpointing script work with OpenMP, MPI, and/or CUDA parallel programs? How do you know? What would be the best way to check?

Put your answer to the above question here


3. SIRS Forms

Let's use this time to fill out SIRS forms for this and other classes.


Congratulations, we're done!

If you attend class in-person then have one of the instructors check your notebook and sign you out before leaving class. If you are attending asynchronously, turn in your assignment using D2L.

Course Resources:

Written by Dr. Dirk Colbry, Michigan State University Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.