This is an open Internet exam. Feel free to use anything on the Internet with one important exception...
You will be given 60 minutes (wishful thinking by the instructor; you will be given the entire exam time if needed) to complete this exam. Use your time wisely.
I agree to neither give nor receive any help on this exam from other people. I understand that providing answers to questions on this exam to other students is an academic misconduct violation, as is live communication about the exam or receiving answers to questions on this exam from other people. It is important to me to be a person of integrity, and that means that ALL ANSWERS on this exam are my answers.
✅ DO THIS: Include your name in the line below to acknowledge the above statement:
Put your name here.
Image from https://innovativegenomics.org/free-covid-19-illustrations/
The following is a simple Python program (called runlevelsim.py) that runs the levelsim.analysis function 100 times (NOTE: levelsim is a program made up by your instructor, and any relationship to an existing program of the same or similar name is just a coincidence).
import levelsim as ls
import sys

# Run the level simulator starting from 100 different input states.
for i in range(0, 100):
    print(f"Running simulation with {i} input")
    ls.analysis(i)
Each iteration of the loop is independent and takes around 2 hours and 23 minutes to run (the entire program takes ~9.9 days). To speed up the program, we would like to replace the loop and run the iterations as a job array.
✅ Question 1.a: Modify the above Python program to unroll the loop and pass in the i variable as an input argument, such that the program can be run as follows:

python runlevelsim.py 45

where 45 is just the input number passed in as a string.
Put the answer to the above question here
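For orientation, a minimal sketch of such an unrolled version (reading the input state from sys.argv and converting it back to an integer; not necessarily the only acceptable answer):

import levelsim as ls
import sys

# The input state now comes from the command line instead of a loop.
i = int(sys.argv[1])
print(f"Running simulation with {i} input")
ls.analysis(i)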
✅ Question 1.b: Modify the following SLURM submission script (used by the original python program) to now run the unrolled loop as a pleasantly parallel job array.
Put the answer to the above question here
#!/bin/bash
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --mem=500mb
#SBATCH --time=240:00:00

time python runlevelsim.py
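For comparison, a hedged sketch of a job-array version of the above script (assuming 100 tasks indexed 0 through 99; the per-task time limit is a judgment call based on one iteration taking about 2 hours and 23 minutes):

#!/bin/bash
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --mem=500mb
#SBATCH --time=03:00:00
#SBATCH --array=0-99

# Each array task runs exactly one iteration of the original loop.
time python runlevelsim.py $SLURM_ARRAY_TASK_ID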
✅ Question 1.c: What is the best possible time for the entire job array to run? Explain the situation in which this optimal time is achieved.
Put the answer to the above question here
✅ Question 1.d: Now assume the levelsim library has been updated using numba and is using a parallel jit function on the main internal loop of the analysis function. When the updated function is run with an optimal 4 processors (as determined by timing studies), the entire code is sped up by 68%. Write a new submission script that will support this new function by:
Copy and paste your script from above and modify it to use shared-memory parallelism (one hedged sketch is given below).
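For orientation, a hedged sketch of such a script (assuming "sped up by 68%" means each task drops from roughly 2 hours 23 minutes to about 46 minutes, and that NUMBA_NUM_THREADS is honored by the updated library; both are assumptions):

#!/bin/bash
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=500mb
#SBATCH --time=01:00:00
#SBATCH --array=0-99

# Request 4 cores per task and pin numba to 4 threads, matching the
# optimal processor count from the timing studies (an assumption).
export NUMBA_NUM_THREADS=4
time python runlevelsim.py $SLURM_ARRAY_TASK_ID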
FFTW is a C subroutine library for computing the Discrete Fourier Transform (DFT) in one or more dimensions, of arbitrary input size, and of both real and complex data. The Fourier Transform is used in a wide variety of scientific applications and many projects depend on FFTW.
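For context, a minimal sketch of what calling FFTW from C typically looks like (a 1D complex-to-complex transform; the array size and data are placeholders):

#include <fftw3.h>

int main(void) {
    int n = 1024;
    // fftw_malloc returns memory aligned for FFTW's SIMD code paths
    fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

    // Plan once (this is where FFTW optimizes), then execute.
    fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    // ... fill 'in' with data here ...
    fftw_execute(p);

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}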
Question 2.a: What module command would you use on the HPCC to determine which versions of FFTW are installed?
Put the answer to the above question here
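For orientation, a hedged sketch (assuming the HPCC uses an Lmod-style module system, which supports both of these commands; exact module names may differ):

# search the whole module tree, including versions not currently visible
module spider FFTW

# list matching modules in the currently loaded tree
module avail FFTW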
Question 2.b: What gcc compiler command line is needed to link in the FFTW libraries?
Put the answer to the above question here
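For orientation, a hedged sketch (the source file name myfft.c is hypothetical; on the HPCC a module load, or explicit -I/-L paths, may also be needed so gcc can find the FFTW headers and libraries):

gcc myfft.c -o myfft -lfftw3 -lm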
Question 2.c: Review the FFTW library documentation and determine if it already has support for running in parallel. Specifically, figure out whether it can use Shared Memory, Shared Network, and/or GPU acceleration. Answer yes or no for each option, explain how you came to that answer, and give a reference.
Question 2.d: Assume that a researcher has emailed you and asked you to help them debug some code that they have modified in some way to run using FFTW in parallel. However, every time they run the code in parallel it takes much longer than it does in serial (for the same calculation). Given that you know almost nothing about the details of the code, can you give the researcher three reasons why their code might be slower? HINT: Try to give genuinely different reasons rather than variations on the same idea. In other words, try to come up with helpful ideas they can use to isolate and maybe even fix the problem. Provide your answer as a response email (you can use the template below). Give your three reasons and, if you can, include a question or a test the researcher can run that you think would help isolate the problem.
Dear Researcher,
I am sorry you are having trouble with your program. I can think of at least three reasons why you may be having trouble. However, I will need a little more information to debug the problem:
Thank you,
Consider the following 2D wave equation program, which is very similar to the 1D version we did in class.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "png_util.h"

#define min(X,Y) ((X) < (Y) ? (X) : (Y))
#define max(X,Y) ((X) > (Y) ? (X) : (Y))

int main(int argc, char ** argv) {
    int nx = 500;
    int ny = 500;
    int nt = 100000;
    int frame = 0;
    int r, c, it;
    double dx, dy, dt;
    double max, min;
    double tmax;
    double dx2inv, dy2inv;
    char filename[sizeof "./images/file00000.png"];

    image_size_t sz;
    sz.width = nx;
    sz.height = ny;

    //make position grid
    double * h_z = (double *) malloc(nx*ny*sizeof(double));
    double ** z = malloc(ny * sizeof(double*));
    for (r=0; r<ny; r++)
        z[r] = &h_z[r*nx];

    //make velocity grid
    double * h_v = (double *) malloc(nx*ny*sizeof(double));
    double ** v = malloc(ny * sizeof(double*));
    for (r=0; r<ny; r++)
        v[r] = &h_v[r*nx];

    //make acceleration grid
    double * h_a = (double *) malloc(nx*ny*sizeof(double));
    double ** a = malloc(ny * sizeof(double*));
    for (r=0; r<ny; r++)
        a[r] = &h_a[r*nx];

    //allocate space for output image
    char * o_img = (char *) malloc(sz.width*sz.height*sizeof(char));
    char ** output = malloc(sz.height * sizeof(char*));
    for (int r=0; r<sz.height; r++)
        output[r] = &o_img[r*sz.width];

    max = 10.0;
    min = 0.0;
    dx = (max-min)/(double)(nx-1);
    dy = (max-min)/(double)(ny-1);
    tmax = 20.0;
    dt = (tmax-0.0)/(double)(nt-1);

    //initialize the state: a radially symmetric pulse centered at (5,5), at rest
    double x, y;
    for (r=0; r<ny; r++) {
        for (c=0; c<nx; c++) {
            x = min+(double)c*dx;
            y = min+(double)r*dy;
            z[r][c] = exp(-(sqrt((x-5.0)*(x-5.0)+(y-5.0)*(y-5.0))));
            v[r][c] = 0.0;
            a[r][c] = 0.0;
        }
    }

    dx2inv = 1.0/(dx*dx);
    dy2inv = 1.0/(dy*dy);

    //Main loop
    for (it=0; it<nt-1; it++) {
        //update acceleration
        for (r=1; r<ny-1; r++) {
            for (c=1; c<nx-1; c++) {
                double ax = (z[r+1][c]+z[r-1][c]-2.0*z[r][c])*dx2inv;
                double ay = (z[r][c+1]+z[r][c-1]-2.0*z[r][c])*dy2inv;
                a[r][c] = (ax+ay)/2;
            }
        }
        //update velocity and position
        for (r=1; r<ny-1; r++) {
            for (c=1; c<nx-1; c++) {
                v[r][c] = v[r][c] + dt*a[r][c];
                z[r][c] = z[r][c] + dt*v[r][c];
            }
        }
        //every 100 timesteps, normalize the current state and write out a frame
        if (it % 100 == 0)
        {
            double mx, mn;
            mx = -999999;
            mn = 999999;
            for (r=0; r<ny; r++)
                for (c=0; c<nx; c++) {
                    mx = max(mx, z[r][c]);
                    mn = min(mn, z[r][c]);
                }
            for (r=0; r<ny; r++)
                for (c=0; c<nx; c++)
                    output[r][c] = (char) round((z[r][c]-mn)/(mx-mn)*255);
            sprintf(filename, "./images/file%05d.png", frame);
            printf("Writing %s\n", filename);
            write_png_file(filename, o_img, sz);
            frame += 1;
        }
    }

    //find the max and min values to normalize the output.
    double mx, mn;
    mx = -999999;
    mn = 999999;
    for (r=0; r<ny; r++)
        for (c=0; c<nx; c++) {
            mx = max(mx, z[r][c]);
            mn = min(mn, z[r][c]);
        }
    printf("%f, %f\n", mn, mx);
    for (r=0; r<ny; r++)
        for (c=0; c<nx; c++) {
            output[r][c] = (char) round((z[r][c]-mn)/(mx-mn)*255);
        }
    sprintf(filename, "./images/file%05d.png", it);
    printf("Writing %s\n", filename);

    //Write out output image using 1D serial pointer
    write_png_file(filename, o_img, sz);

    return 0;
}
✅ Question 3.a: Modify the above example to add OpenMP pragma commands to efficiently parallelize the loops that update the position, velocity, and acceleration. Since there are two loops (one for each dimension) you should probably include the collapse option.
Put the answer to the above question here
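For orientation, a hedged sketch of the kind of pragma intended, shown here on the acceleration update (the velocity/position loops would get the same treatment; collapse(2) merges the two loops into a single parallel iteration space):

//update acceleration
#pragma omp parallel for collapse(2)
for (r=1; r<ny-1; r++) {
    for (c=1; c<nx-1; c++) {
        double ax = (z[r+1][c]+z[r-1][c]-2.0*z[r][c])*dx2inv;
        double ay = (z[r][c+1]+z[r][c-1]-2.0*z[r][c])*dy2inv;
        a[r][c] = (ax+ay)/2;
    }
}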
✅ Question 3.b: Which OpenMP scheduling option would best be suited for these pragmas and why?
Put the answer to the above question here
✅ Question 3.c: Explain why we can NOT use a parallel pragma on the main (time-stepping) loop.
Put the answer to the above question here
✅ Question 3.d: Modify the above code for the "find array minimum and maximum" step (near the end of the code) to use an OpenMP reduction.
Put the answer to the above question here
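For orientation, a hedged sketch (max and min reductions require OpenMP 3.1 or newer; the reduction clause names the built-in reduction identifiers, not the max/min macros defined at the top of the file):

double mx = -999999;
double mn = 999999;
#pragma omp parallel for collapse(2) reduction(max:mx) reduction(min:mn)
for (r=0; r<ny; r++)
    for (c=0; c<nx; c++) {
        mx = max(mx, z[r][c]);
        mn = min(mn, z[r][c]);
    }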
The GPU-enabled nodes on the HPCC have more than one GPU; however, all of the code we developed in this course, and most of the example GPU code available online, is designed to work on only one GPU. This is because GPUs do not share their memory, and in order to run a kernel on different GPUs we need to copy memory individually to each GPU, which can be tricky.

One common solution is to use a hybrid OpenMP/CUDA program, where every OpenMP CPU thread is given a GPU. The following code is a simple template to test how a basic hybrid OpenMP/CUDA program might work:
#include <stdio.h>
#include <omp.h>

#define CUDA_CALL(x) {cudaError_t cuda_error__ = (x); if (cuda_error__) printf("CUDA error: " #x " returned \"%s\"\n", cudaGetErrorString(cuda_error__));}

int main(int argc, char **argv) {
    int num_gpus, num_cpus;
    CUDA_CALL(cudaGetDeviceCount(&num_gpus));
    num_cpus = omp_get_max_threads();
    printf("There are %d cpus and %d gpus available for this job.\n", num_cpus, num_gpus);

    #pragma omp parallel shared(num_gpus)
    {
        int cpu_thread_id = omp_get_thread_num();
        if (cpu_thread_id < num_gpus) {
            printf("Setting up GPU #%d\n", cpu_thread_id);
            CUDA_CALL(cudaSetDevice(cpu_thread_id));
            //COPY MEMORY FROM HOST TO THIS CUDA DEVICE
            //RUN A KERNEL ON THIS CUDA DEVICE
            //COPY MEMORY FROM THIS CUDA DEVICE BACK TO HOST
        } else
            printf("CPU #%d is more than available GPUs\n", cpu_thread_id);
    }
}
✅ Question 4.a: In your own words, describe what the cudaSetDevice function is doing in the above code.
Put the answer to the above question here
The above program was uploaded to the HPCC and we tried to compile it on the dev-amd20-v100 node using the following command:
nvcc -fopenmp -o multigpu multigpu.cu
However, the above command returned the following error:
-bash: nvcc: command not found
✅ Question 4.b: What is the source of this "command not found" error and how can it be fixed?
Put the answer to the above question here
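For orientation, a hedged sketch of the usual kind of fix (the exact module name and version on the HPCC may differ):

# nvcc is not on the default PATH until a CUDA module is loaded
module load CUDA
nvcc --version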
Assuming the above error is fixed, we tried compiling again but got the following error:
nvcc fatal : Unknown option 'fopenmp'
✅ Question 4.c: What is the cause of this second error and how can it be fixed?
Put the answer to the above question here
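For orientation, a hedged sketch: nvcc does not recognize gcc's -fopenmp flag itself, but it can forward flags to the host compiler with -Xcompiler, e.g.:

nvcc -Xcompiler -fopenmp -o multigpu multigpu.cu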
Assuming we got the above to compile properly, we ran it on the command line and got the following output (truncated):
>./multigpu
There are 48 cpus and 4 gpus available for this job.
Setting up GPU #0
CPU #9 is more than available GPUs
CPU #41 is more than available GPUs
CPU #31 is more than available GPUs
CPU #22 is more than available GPUs
CPU #36 is more than available GPUs
CPU #33 is more than available GPUs
CPU #40 is more than available GPUs
CPU #34 is more than available GPUs
CPU #38 is more than available GPUs
CPU #4 is more than available GPUs
Setting up GPU #3
CPU #5 is more than available GPUs
...
The above seems good. Now we want to test it with SLURM using the following job script:
#!/bin/bash
#SBATCH -N 1
#SBATCH -c 3
#SBATCH --time=00:01:00
#SBATCH --mem=200mb
#SBATCH --gres=gpu:3

srun ./multigpu
✅ Question 4.d: What output would be produced by running the above submission script and the provided code?
Put the answer to the above question here
Your research team is designing an agent-based model for an economic simulator using MPI. The overall design includes the following:

The Rank 0 node will be the head node and act as a "stock ticker" and bank. At the beginning of each iteration this node will broadcast the current stock prices to the other nodes. The stock prices are stored in a 1D array of integers named stocks with NumStocks total stocks.

The other nodes (rank > 0) will run individual models that read in the stocks and then send their purchases back to Rank 0.
So far the team has designed the following basic pseudocode as a starting point for the program. At this point in the design we want to just think through and debug what would go into the MPI portion of the code and not worry about the models themselves. As a starting point, here is a stubbed-out version (the comments are the pseudocode):
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int stocks[5];
    int num_stocks = 5;

    printf("hello world\n");

    // Initialize MPI

    // Start Main Loop
        // If Rank 0
            // Run Stock Model
            // Send stocks vector
            // Loop over other processors
                // Receive and process their purchase
        // else
            // Receive Stock vectors
            // Run Purchase Model
            // Send purchase to Bank

    // Rank 0 prints final results

    // Finalize MPI
}
✅ Question 5.a: What c/c++ commands should be used to initialize the MPI job? I.e., write a code block for standard MPI initialization.
Put your answer to the above question here.
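For orientation, a minimal sketch of the standard initialization boilerplate (the variable names rank and size are conventions, not requirements):

// Initialize MPI and discover this process's rank and the world size.
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);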
✅ Question 5.b: Write c/c++ code to send the current stocks from the Rank 0 node to each of the other processors. At the end of your block of code, each processor should have an array pointer named stocks and an integer num_stocks with all the same values. Just assume that rank 0 already has stocks and num_stocks initialized. Note, you can send using a loop or a broadcast.
Put your answer to the above question here.
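For orientation, a hedged sketch of the broadcast variant (assuming MPI is initialized, rank holds the result of MPI_Comm_rank, and stocks is an int pointer that non-root ranks still need to allocate):

// Everyone learns the size first, then receives rank 0's values.
MPI_Bcast(&num_stocks, 1, MPI_INT, 0, MPI_COMM_WORLD);
if (rank != 0)
    stocks = (int *) malloc(num_stocks * sizeof(int));
MPI_Bcast(stocks, num_stocks, MPI_INT, 0, MPI_COMM_WORLD);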
✅ Question 5.c: Assuming the program is named market.c in your current directory and has no additional linked libraries, what is the command needed to compile the MPI program into an executable named market (using level 3 compiler optimization)?
Put your answer to the above question here.
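For orientation, a hedged sketch (assuming the standard mpicc compiler wrapper is available, e.g. after loading an MPI module):

mpicc -O3 -o market market.c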
✅ Question 5.d: Assume it is 3 months later and the simulation code is finished, is running well, and is producing results. Your team typically runs jobs with 100-200 agents/nodes. Your lead researcher wants to go bigger and faster and has asked whether you should switch from MPI to CUDA. Write down two situations: 1) when CUDA would be a better idea as compared to MPI, and 2) when you should stick with MPI. Do your best to explain your reasoning.
When you may want to use CUDA: Put the answer here
When you want to stick to MPI: Put the answer here
✅ DO THIS: Submit this assignment by uploading it to the course Desire2Learn web page for today's dropbox.
Written by Dr. Dirk Colbry, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.