This is an open Internet exam. Feel free to use anything on the Internet with one important exception...
You will be given 60 minutes (wishful thinking by the instructor; you will be given the entire exam time if needed) to complete this exam. Use your time wisely.
I agree to neither give nor receive any help on this exam from other people. I understand that providing answers to questions on this exam to other students is an academic misconduct violation, as is live communication about the exam or receiving answers to questions on this exam from other people. It is important to me to be a person of integrity, and that means that ALL ANSWERS on this exam are my answers.
✅ DO THIS: Include your name in the line below to acknowledge the above statement:
Put your name here.
Image from https://innovativegenomics.org/free-covid-19-illustrations/
The following is a simple Python program (called runlevelsim.py) that runs the levelsim.analysis function 100 times (NOTE: levelsim is a program made up by your instructor, and any relationship to an existing program of the same or similar name is just a coincidence).
import levelsim as ls
import sys

# Run the level simulator starting from 100 different input states.
for i in range(0, 100):
    print(f"Running simulation with {i} input")
    ls.analysis(i)
Each iteration of the loop is independent and takes around 2 hours and 23 minutes to run (the entire program takes ~9.9 days). To speed up the program, we would like to replace the loop and run the iterations as a job array.
✅ Question 1.a: Modify the above Python program to unroll the loop and pass in the i variable as an input argument, such that the program can be run as follows:

python runlevelsim.py 45

where 45 is just the input number passed in as a string.
Put the answer to the above question here
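For orientation, a minimal sketch of such an unrolled version (reading the input state from sys.argv and converting it back to an integer; not necessarily the only acceptable answer):

import levelsim as ls
import sys

# The input state now comes from the command line instead of a loop.
i = int(sys.argv[1])
print(f"Running simulation with {i} input")
ls.analysis(i)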
✅ Question 1.b: Modify the following SLURM submission script (used by the original python program) to now run the unrolled loop as a pleasantly parallel job array.
Put the answer to the above question here
#!/bin/bash
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --mem=500mb
#SBATCH --time=240:00:00

time python runlevelsim.py
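For comparison, a hedged sketch of a job-array version of the above script (assuming 100 tasks indexed 0 through 99; the per-task time limit is a judgment call based on one iteration taking about 2 hours and 23 minutes):

#!/bin/bash
#SBATCH -N 1
#SBATCH -c 1
#SBATCH --mem=500mb
#SBATCH --time=03:00:00
#SBATCH --array=0-99

# Each array task runs exactly one iteration of the original loop.
time python runlevelsim.py $SLURM_ARRAY_TASK_ID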
✅ Question 1.c: What is the best possible time for the entire job array to run? Explain the situation in which this optimal time is achieved.
Put the answer to the above question here
✅ Question 1.d: Now assume the levelsim library has been updated using numba and is using a parallel jit function on the main internal loop of the analysis function. When the updated function is run with an optimal 4 processors (as determined by timing studies), the entire code is sped up by 68%. Write a new submission script that will support this new function by:
Copy and paste your script from above and modify it to use shared-memory parallelism (one hedged sketch is given below).
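For orientation, a hedged sketch of such a script (assuming "sped up by 68%" means each task drops from roughly 2 hours 23 minutes to about 46 minutes, and that NUMBA_NUM_THREADS is honored by the updated library; both are assumptions):

#!/bin/bash
#SBATCH -N 1
#SBATCH -c 4
#SBATCH --mem=500mb
#SBATCH --time=01:00:00
#SBATCH --array=0-99

# Request 4 cores per task and pin numba to 4 threads, matching the
# optimal processor count from the timing studies (an assumption).
export NUMBA_NUM_THREADS=4
time python runlevelsim.py $SLURM_ARRAY_TASK_ID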
FFTW is a C subroutine library for computing the Discrete Fourier Transform (DFT) in one or more dimensions, of arbitrary input size, and of both real and complex data. The Fourier Transform is used in a wide variety of scientific applications and many projects depend on FFTW.
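For context, a minimal sketch of what calling FFTW from C typically looks like (a 1D complex-to-complex transform; the array size and data are placeholders):

#include <fftw3.h>

int main(void) {
    int n = 1024;
    // fftw_malloc returns memory aligned for FFTW's SIMD code paths
    fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

    // Plan once (this is where FFTW optimizes), then execute.
    fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
    // ... fill 'in' with data here ...
    fftw_execute(p);

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}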
Question 2.a: What module command would you use on the HPCC to determine which versions of FFTW are installed?
Put the answer to the above question here
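For orientation, a hedged sketch (assuming the HPCC uses an Lmod-style module system, which supports both of these commands; exact module names may differ):

# search the whole module tree, including versions not currently visible
module spider FFTW

# list matching modules in the currently loaded tree
module avail FFTW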
Question 2.b: What gcc compiler command line is needed to link in the FFTW libraries?
Put the answer to the above question here
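For orientation, a hedged sketch (the source file name myfft.c is hypothetical; on the HPCC a module load, or explicit -I/-L paths, may also be needed so gcc can find the FFTW headers and libraries):

gcc myfft.c -o myfft -lfftw3 -lm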
Question 2.c: Review the FFTW library documentation and determine if it already has support for running in parallel. Specifically, figure out whether it can use Shared Memory, Shared Network, and/or GPU acceleration. Answer yes or no for each option, explain how you came to that answer, and give a reference.
Question 2.d: Assume that a researcher has emailed you and asked you to help them debug some code that they have modified in some way to run using FFTW in parallel. However, every time they run the code in parallel it takes much longer than it does in serial (for the same calculation). Given that you know almost nothing about the details of the code, can you give the researcher three reasons why their code might be slower? HINT: Try to give genuinely different reasons rather than variations on the same idea. In other words, try to come up with helpful ideas they can use to isolate and maybe even fix the problem. Provide your answer as a response email (you can use the template below). Give your three reasons and, if you can, include a question or a test the researcher can run that you think would help isolate the problem.
Dear Researcher,
I am sorry you are having trouble with your program. I can think of at least three reasons why you may be having trouble. However, I will need a little more information to debug the problem:
Thank you,
Consider the following 2D wave equation program, which is very similar to the 1D version we did in class.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "png_util.h"

#define min(X,Y) ((X) < (Y) ? (X) : (Y))
#define max(X,Y) ((X) > (Y) ? (X) : (Y))

int main(int argc, char ** argv) {
    int nx = 500;
    int ny = 500;
    int nt = 100000;
    int frame = 0;
    int r, c, it;
    double dx, dy, dt;
    double max, min;
    double tmax;
    double dx2inv, dy2inv;
    char filename[sizeof "./images/file00000.png"];

    image_size_t sz;
    sz.width = nx;
    sz.height = ny;

    //make position grid
    double * h_z = (double *) malloc(nx*ny*sizeof(double));
    double ** z = malloc(ny * sizeof(double*));
    for (r=0; r<ny; r++)
        z[r] = &h_z[r*nx];

    //make velocity grid
    double * h_v = (double *) malloc(nx*ny*sizeof(double));
    double ** v = malloc(ny * sizeof(double*));
    for (r=0; r<ny; r++)
        v[r] = &h_v[r*nx];

    //make acceleration grid
    double * h_a = (double *) malloc(nx*ny*sizeof(double));
    double ** a = malloc(ny * sizeof(double*));
    for (r=0; r<ny; r++)
        a[r] = &h_a[r*nx];

    //allocate space for output image
    char * o_img = (char *) malloc(sz.width*sz.height*sizeof(char));
    char ** output = malloc(sz.height * sizeof(char*));
    for (int r=0; r<sz.height; r++)
        output[r] = &o_img[r*sz.width];

    max = 10.0;
    min = 0.0;
    dx = (max-min)/(double)(nx-1);
    dy = (max-min)/(double)(ny-1);
    tmax = 20.0;
    dt = (tmax-0.0)/(double)(nt-1);

    //initialize the state: a radially symmetric pulse centered at (5,5), at rest
    double x, y;
    for (r=0; r<ny; r++) {
        for (c=0; c<nx; c++) {
            x = min+(double)c*dx;
            y = min+(double)r*dy;
            z[r][c] = exp(-(sqrt((x-5.0)*(x-5.0)+(y-5.0)*(y-5.0))));
            v[r][c] = 0.0;
            a[r][c] = 0.0;
        }
    }

    dx2inv = 1.0/(dx*dx);
    dy2inv = 1.0/(dy*dy);

    //Main loop
    for (it=0; it<nt-1; it++) {
        //update acceleration
        for (r=1; r<ny-1; r++) {
            for (c=1; c<nx-1; c++) {
                double ax = (z[r+1][c]+z[r-1][c]-2.0*z[r][c])*dx2inv;
                double ay = (z[r][c+1]+z[r][c-1]-2.0*z[r][c])*dy2inv;
                a[r][c] = (ax+ay)/2;
            }
        }
        //update velocity and position
        for (r=1; r<ny-1; r++) {
            for (c=1; c<nx-1; c++) {
                v[r][c] = v[r][c] + dt*a[r][c];
                z[r][c] = z[r][c] + dt*v[r][c];
            }
        }
        //every 100 timesteps, normalize the current state and write out a frame
        if (it % 100 == 0)
        {
            double mx, mn;
            mx = -999999;
            mn = 999999;
            for (r=0; r<ny; r++)
                for (c=0; c<nx; c++) {
                    mx = max(mx, z[r][c]);
                    mn = min(mn, z[r][c]);
                }
            for (r=0; r<ny; r++)
                for (c=0; c<nx; c++)
                    output[r][c] = (char) round((z[r][c]-mn)/(mx-mn)*255);
            sprintf(filename, "./images/file%05d.png", frame);
            printf("Writing %s\n", filename);
            write_png_file(filename, o_img, sz);
            frame += 1;
        }
    }

    //find the max and min values to normalize the output.
    double mx, mn;
    mx = -999999;
    mn = 999999;
    for (r=0; r<ny; r++)
        for (c=0; c<nx; c++) {
            mx = max(mx, z[r][c]);
            mn = min(mn, z[r][c]);
        }
    printf("%f, %f\n", mn, mx);
    for (r=0; r<ny; r++)
        for (c=0; c<nx; c++) {
            output[r][c] = (char) round((z[r][c]-mn)/(mx-mn)*255);
        }
    sprintf(filename, "./images/file%05d.png", it);
    printf("Writing %s\n", filename);

    //Write out output image using 1D serial pointer
    write_png_file(filename, o_img, sz);

    return 0;
}
✅ Question 3.a: Modify the above example to add OpenMP pragma commands to efficiently parallelize the loops that update the position, velocity, and acceleration. Since there are two loops (one for each dimension) you should probably include the collapse option.
Put the answer to the above question here
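For orientation, a hedged sketch of the kind of pragma intended, shown here on the acceleration update (the velocity/position loops would get the same treatment; collapse(2) merges the two loops into a single parallel iteration space):

//update acceleration
#pragma omp parallel for collapse(2)
for (r=1; r<ny-1; r++) {
    for (c=1; c<nx-1; c++) {
        double ax = (z[r+1][c]+z[r-1][c]-2.0*z[r][c])*dx2inv;
        double ay = (z[r][c+1]+z[r][c-1]-2.0*z[r][c])*dy2inv;
        a[r][c] = (ax+ay)/2;
    }
}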
✅ Question 3.b: Which OpenMP scheduling option would best be suited for these pragmas and why?
Put the answer to the above question here
✅ Question 3.c: Explain why we can NOT use a parallel pragma on the main (time-stepping) loop.
Put the answer to the above question here
✅ Question 3.d: Modify the above code for the "find array minimum and maximum" step (near the end of the code) to use an OpenMP reduction.
Put the answer to the above question here
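For orientation, a hedged sketch (max and min reductions require OpenMP 3.1 or newer; the reduction clause names the built-in reduction identifiers, not the max/min macros defined at the top of the file):

double mx = -999999;
double mn = 999999;
#pragma omp parallel for collapse(2) reduction(max:mx) reduction(min:mn)
for (r=0; r<ny; r++)
    for (c=0; c<nx; c++) {
        mx = max(mx, z[r][c]);
        mn = min(mn, z[r][c]);
    }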
The GPU-enabled nodes on the HPCC have more than one GPU; however, all of the code we developed in this course, and most of the example GPU code available online, is designed to work on only one GPU. This is because GPUs do not share their memory, and in order to run a kernel on different GPUs we need to copy memory individually to each GPU, which can be tricky.

One common solution is to use a hybrid OpenMP/CUDA program, where every OpenMP CPU thread is given a GPU. The following code is a simple template to test how a basic hybrid OpenMP/CUDA program might work:
#include <stdio.h>
#include <omp.h>

#define CUDA_CALL(x) {cudaError_t cuda_error__ = (x); if (cuda_error__) printf("CUDA error: " #x " returned \"%s\"\n", cudaGetErrorString(cuda_error__));}

int main(int argc, char **argv) {
    int num_gpus, num_cpus;
    CUDA_CALL(cudaGetDeviceCount(&num_gpus));
    num_cpus = omp_get_max_threads();
    printf("There are %d cpus and %d gpus available for this job.\n", num_cpus, num_gpus);

    #pragma omp parallel shared(num_gpus)
    {
        int cpu_thread_id = omp_get_thread_num();
        if (cpu_thread_id < num_gpus) {
            printf("Setting up GPU #%d\n", cpu_thread_id);
            CUDA_CALL(cudaSetDevice(cpu_thread_id));
            //COPY MEMORY FROM HOST TO THIS CUDA DEVICE
            //RUN A KERNEL ON THIS CUDA DEVICE
            //COPY MEMORY FROM THIS CUDA DEVICE BACK TO HOST
        } else
            printf("CPU #%d is more than available GPUs\n", cpu_thread_id);
    }
}
✅ Question 4.a: In your own words, describe what the cudaSetDevice function is doing in the above code.
Put the answer to the above question here
The above program was uploaded to the HPCC and we tried to compile it on the dev-amd20-v100 node using the following command:
nvcc -fopenmp -o multigpu multigpu.cu
However, the above command returned the following error:
-bash: nvcc: command not found
✅ Question 4.b: What is the source of this "command not found" error and how can it be fixed?
Put the answer to the above question here
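For orientation, a hedged sketch of the usual kind of fix (the exact module name and version on the HPCC may differ):

# nvcc is not on the default PATH until a CUDA module is loaded
module load CUDA
nvcc --version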
Assuming the above error is fixed, we tried compiling again but got the following error:
nvcc fatal : Unknown option 'fopenmp'
✅ Question 4.c: What is the cause of this second error and how can it be fixed?
Put the answer to the above question here
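For orientation, a hedged sketch: nvcc does not recognize gcc's -fopenmp flag itself, but it can forward flags to the host compiler with -Xcompiler, e.g.:

nvcc -Xcompiler -fopenmp -o multigpu multigpu.cu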
Assuming we got the above to compile properly, we ran it on the command line and got the following output (truncated):
>./multigpu
There are 48 cpus and 4 gpus available for this job.
Setting up GPU #0
CPU #9 is more than available GPUs
CPU #41 is more than available GPUs
CPU #31 is more than available GPUs
CPU #22 is more than available GPUs
CPU #36 is more than available GPUs
CPU #33 is more than available GPUs
CPU #40 is more than available GPUs
CPU #34 is more than available GPUs
CPU #38 is more than available GPUs
CPU #4 is more than available GPUs
Setting up GPU #3
CPU #5 is more than available GPUs
...
The above seems good. Now we want to test it with SLURM using the following job script:
#!/bin/bash
#SBATCH -N 1
#SBATCH -c 3
#SBATCH --time=00:01:00
#SBATCH --mem=200mb
#SBATCH --gres=gpu:3

srun ./multigpu
✅ Question 4.d: What output would be produced by running the above submission script and the provided code?
Put the answer to the above question here
Your research team is designing an agent-based model for an economic simulator using MPI. The overall design includes the following:

The Rank 0 node will be the head node and act as a "stock ticker" and bank. At the beginning of each iteration this node will broadcast the current stock prices to the other nodes. The stock prices are stored in a 1D array of integers named stocks with NumStocks total stocks.

The other nodes (rank > 0) will run individual models that read in the stocks and then send their purchases back to Rank 0.
So far the team has designed the following basic pseudocode as a starting point for the program. At this point in the design we want to just think through and debug what would go into the MPI portion of the code and not worry about the models themselves. As a starting point, here is a stubbed-out version (the comments are the pseudocode):
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int stocks[5];
    int num_stocks = 5;

    printf("hello world\n");

    // Initialize MPI

    // Start Main Loop
        // If Rank 0
            // Run Stock Model
            // Send stocks vector
            // Loop over other processors
                // Receive and process their purchase
        // else
            // Receive Stock vectors
            // Run Purchase Model
            // Send purchase to Bank

    // Rank 0 prints final results

    // Finalize MPI
}
✅ Question 5.a: What c/c++ commands should be used to initialize the MPI job? I.e., write a code block for standard MPI initialization.
Put your answer to the above question here.
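For orientation, a minimal sketch of the standard initialization boilerplate (the variable names rank and size are conventions, not requirements):

// Initialize MPI and discover this process's rank and the world size.
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);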
✅ Question 5.b: Write c/c++ code to send the current stocks from the Rank 0 node to each of the other processors. At the end of your block of code, each processor should have an array pointer named stocks and an integer num_stocks with all the same values. Just assume that rank 0 already has stocks and num_stocks initialized. Note, you can send using a loop or a broadcast.
Put your answer to the above question here.
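For orientation, a hedged sketch of the broadcast variant (assuming MPI is initialized, rank holds the result of MPI_Comm_rank, and stocks is an int pointer that non-root ranks still need to allocate):

// Everyone learns the size first, then receives rank 0's values.
MPI_Bcast(&num_stocks, 1, MPI_INT, 0, MPI_COMM_WORLD);
if (rank != 0)
    stocks = (int *) malloc(num_stocks * sizeof(int));
MPI_Bcast(stocks, num_stocks, MPI_INT, 0, MPI_COMM_WORLD);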
✅ Question 5.c: Assuming the program is named market.c in your current directory and has no additional linked libraries, what is the command needed to compile the MPI program into an executable named market (using level 3 compiler optimization)?
Put your answer to the above question here.
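For orientation, a hedged sketch (assuming the standard mpicc compiler wrapper is available, e.g. after loading an MPI module):

mpicc -O3 -o market market.c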
✅ Question 5.d: Assume it is 3 months later and the simulation code is finished, is running well, and is producing results. Your team typically runs jobs with 100-200 agents/nodes. Your lead researcher wants to go bigger and faster and has asked whether you should switch from MPI to CUDA. Write down two situations: 1) when CUDA would be a better idea as compared to MPI, and 2) when you should stick with MPI. Do your best to explain your reasoning.
When you may want to use CUDA: Put the answer here
When you want to stick to MPI: Put the answer here
✅ DO THIS: Submit this assignment by uploading it to the course Desire2Learn web page for today's dropbox.
Written by Dr. Dirk Colbry, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.