To successfully complete this assignment, you need to participate both individually and in groups during class. Have one of the instructors check your notebook and sign you out before leaving class. Turn in your assignment using D2L.
ICA 36: Scavenger Queue and Checkpointing

Agenda for today’s class (70 minutes)
(20 minutes) Introduction to the Scavenger Queue
(50 minutes) Checkpointing in Python
1. Introduction to the Scavenger Queue
The scavenger queue is a special queue on the HPCC that lets jobs fill gaps in the schedule, which improves the overall efficiency of the system. The catch is that jobs scheduled through the scavenger queue can be stopped and restarted at any point. For many jobs this would be detrimental, but jobs that checkpoint can restart without much loss of progress.
✅ DO THIS: Check out the documentation on the scavenger queue: https://docs.icer.msu.edu/Scavenger_Queue/
✅ QUESTION: What types of problems do you think would work well on the scavenger queue? Are there any examples we covered throughout the course that you think could work well here?
✅ DO THIS: Write a job script that uses the scavenger queue.
2. Checkpointing
Checkpointing is the process of periodically saving the state of a program so that it can resume from the saved state instead of starting over. While it is especially useful when working with the scavenger queue, checkpointing helps in any scenario where it is important not to lose substantial progress if a program is interrupted for any reason.
✅ QUESTION: Think of some scenarios beyond using the scavenger queue where checkpointing would be valuable.
✅ DO THIS: Take a look at the code below, which demonstrates a basic implementation of checkpointing in Python. Copy the code onto the HPCC and run it. At some random point, kill it with Ctrl+C, then restart it and see what happens.
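import os
import sys
import time

import dill


def checkpointSave(name, data):
    """Serialize data to the checkpoint file, overwriting any previous one."""
    with open(str(name), "wb") as file:
        dill.dump(data, file)


def checkpointLoad(name, data):
    """Return the saved data if a checkpoint exists; otherwise return data unchanged."""
    if os.path.exists(str(name)):
        print("\n Checkpoint Loading... \n")
        with open(str(name), "rb") as file:
            data = dill.load(file)
        print("\n Loaded Data: ", data, "\n")
    return data


if __name__ == "__main__":
    # Checkpoint file name: first command-line argument, or "0" by default.
    if len(sys.argv) > 1:
        name = sys.argv[1]
    else:
        name = 0

    data = 0  # starting value when no checkpoint exists
    data = checkpointLoad(name, data)
    while data < 100:
        data += 1
        # data % 1 == 0 is always true, so this saves on every iteration.
        if data % 1 == 0:
            checkpointSave(name, data)
        print("Data=", data)
        time.sleep(1)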
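If the program is killed while checkpointSave is in the middle of writing, the checkpoint file can be left truncated, and the next load may fail. One common way to reduce this risk is to write to a temporary file and then rename it over the old checkpoint. The sketch below is a possible variant, not part of the example above; it uses the standard-library pickle module in place of dill so that it runs without extra packages, and the names atomic_checkpoint_save, checkpoint_load, and demo.ckpt are hypothetical.

```python
import os
import pickle
import tempfile


def atomic_checkpoint_save(name, data):
    """Write data to a temporary file, then rename it over the checkpoint."""
    directory = os.path.dirname(os.path.abspath(name))
    # Create the temporary file in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp_file:
            pickle.dump(data, tmp_file)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())  # push the bytes to disk before renaming
        # Rename over the old checkpoint (atomic on POSIX filesystems).
        os.replace(tmp_path, name)
    except BaseException:
        # Clean up the partial temporary file, then re-raise the error.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def checkpoint_load(name, default):
    """Return the saved data if the checkpoint exists, else the default."""
    if os.path.exists(name):
        with open(name, "rb") as file:
            return pickle.load(file)
    return default


# Small demonstration in a scratch directory (hypothetical file name).
with tempfile.TemporaryDirectory() as scratch:
    path = os.path.join(scratch, "demo.ckpt")
    first_load = checkpoint_load(path, 0)   # no file yet, so the default
    atomic_checkpoint_save(path, 42)
    second_load = checkpoint_load(path, 0)  # value read back from disk
    leftovers = [f for f in os.listdir(scratch) if f.endswith(".tmp")]
```

With this pattern, a reader sees either the complete old checkpoint or the complete new one. How strictly the rename is atomic on a shared or network filesystem is an assumption worth checking for the system you run on.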
✅ QUESTION: Notice that the code has an if statement around the checkpointSave call. In this example, the state is saved every iteration, but you may not always want to do this. Thinking about this specific example, why would it be a bad idea to checkpoint every iteration if we removed the time.sleep(1) line? (NOTE: This mistake was responsible for taking down the HPCC for about a week.)
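One way to checkpoint less often is to save only every N iterations, or only after a minimum amount of wall-clock time has passed since the last save. The helper below is a hypothetical sketch of that idea; the class name CheckpointThrottle and its parameters are assumptions for illustration and are not part of the example code.

```python
import time


class CheckpointThrottle:
    """Decide when a checkpoint is due, by iteration count or elapsed time."""

    def __init__(self, every_n_steps=None, min_seconds=None, clock=time.monotonic):
        self.every_n_steps = every_n_steps
        self.min_seconds = min_seconds
        self.clock = clock
        self.last_save_time = clock()

    def due(self, step):
        """Return True if a checkpoint should be written at this step."""
        if self.every_n_steps is not None and step % self.every_n_steps == 0:
            self.last_save_time = self.clock()
            return True
        if self.min_seconds is not None:
            now = self.clock()
            if now - self.last_save_time >= self.min_seconds:
                self.last_save_time = now
                return True
        return False


# Count how many checkpoints a 100-step loop would write with each policy.
every_iteration = CheckpointThrottle(every_n_steps=1)
every_tenth = CheckpointThrottle(every_n_steps=10)
saves_every_iteration = sum(every_iteration.due(step) for step in range(1, 101))
saves_every_tenth = sum(every_tenth.due(step) for step in range(1, 101))
```

In a loop like the one in the example, the condition `if data%1==0:` could be replaced by a call such as `if throttle.due(data):`. The right interval is a trade-off between how much work you are willing to redo after an interruption and how much file I/O you generate.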
✅ DO THIS: Now that you have an idea of how checkpointing works in Python, revisit your 1D Wave Solver code from HW 1 and see if you can get checkpointing working with it. Once you confirm that it works, try submitting it to the HPCC using your job submission script from Part 1. (NOTE: If you did not implement your HW 1 in Python, there is an example using Python in the public repository for this course.)
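A solver usually has more than one piece of state to save: at minimum the current step number and the arrays needed to compute the next step. One option is to bundle everything into a single dictionary and checkpoint that. The sketch below illustrates the idea on a toy finite-difference wave update written with plain Python lists; it is not the HW 1 solver, and the function names, grid size, and the constant c2 are all hypothetical. It uses the standard-library pickle module so it runs without extra packages.

```python
import os
import pickle
import tempfile


def initial_state(n_points=21):
    """Hypothetical starting state: a single bump in the middle of a string."""
    u = [0.0] * n_points
    u[n_points // 2] = 1.0
    return {"step": 0, "u_prev": list(u), "u": u}


def advance(state, c2=0.25):
    """Advance the toy wave one time step with fixed (zero) end points."""
    u, u_prev = state["u"], state["u_prev"]
    u_next = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        u_next[i] = 2 * u[i] - u_prev[i] + c2 * (u[i + 1] - 2 * u[i] + u[i - 1])
    return {"step": state["step"] + 1, "u_prev": u, "u": u_next}


def run(path, total_steps, save_every=10, stop_after=None):
    """Resume from path if it exists, then step until total_steps.

    stop_after simulates an interruption after that many steps in this call.
    """
    if os.path.exists(path):
        with open(path, "rb") as file:
            state = pickle.load(file)
    else:
        state = initial_state()
    steps_this_call = 0
    while state["step"] < total_steps:
        if stop_after is not None and steps_this_call >= stop_after:
            break  # pretend the job was killed here
        state = advance(state)
        steps_this_call += 1
        if state["step"] % save_every == 0:
            with open(path, "wb") as file:
                pickle.dump(state, file)
    return state


with tempfile.TemporaryDirectory() as scratch:
    # Reference run with no interruption.
    reference = run(os.path.join(scratch, "reference.ckpt"), total_steps=50)
    # Interrupted run: stops after 27 steps, so the last checkpoint is step 20.
    resumed_path = os.path.join(scratch, "resumed.ckpt")
    partial = run(resumed_path, total_steps=50, stop_after=27)
    resumed = run(resumed_path, total_steps=50)
```

A useful check for any checkpointed solver is the one made here: an interrupted and resumed run should end in the same state as an uninterrupted run. The steps between the last checkpoint and the interruption are simply recomputed.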
Congratulations, we’re done!
Have one of the instructors check your notebook and sign you out before leaving class. Turn in your assignment using D2L.
Written by Dr. Nathan Haut, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.