In order to successfully complete this assignment you must do the required reading, watch the provided videos and complete all instructions. The embedded survey form must be entirely filled out and submitted on or before 11:59pm on Tuesday March 9. Students must come to class the next day prepared to discuss the material covered in this assignment.
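A GPU relies far less on automatic memory caching than a CPU does. To get good performance you often need to do much of the memory management yourself, for example by explicitly staging data in the fast on-chip shared memory. The following video gives a brief overview of the concept of tiling.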
from IPython.display import YouTubeVideo
YouTubeVideo("tGu5DyIlofY",width=640,height=360)
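As a rough CPU-side illustration of the tiling idea (plain Python, not GPU code), the sketch below multiplies two small matrices one tile at a time. Matrix multiplication is used here only as a familiar example and is not part of this assignment. The function name `matmul_tiled` and its parameters are hypothetical, and the list slices merely stand in for copies from global memory into shared memory.

```python
def matmul_tiled(a, b, n, tile):
    """Multiply two n-by-n matrices (lists of lists) one tile at a time."""
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # Stage one tile of each input in a small local buffer.
                # On a GPU these copies would go from global to shared memory.
                a_tile = [row[k0:k0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[k0:k0 + tile]]
                # Each staged value is reused several times in the loops below.
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        acc = 0.0
                        for k, a_val in enumerate(a_row):
                            acc += a_val * b_tile[k][j]
                        c[i0 + i][j0 + j] += acc
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul_tiled(a, b, 2, 1))  # [[19.0, 22.0], [43.0, 50.0]]
```

The point of the sketch is the reuse: once a tile has been staged, its values are read many times from the small buffer instead of from the large arrays.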
✅ QUESTION: Which of the following two code snippets (foo or bar) would benefit from tiling?
__global__ void foo(float out[], float A[], float B[], float C[], float D[], float E[])
{
int i = threadIdx.x;
out[i] = (A[i] + B[i] + C[i] + D[i] + E[i]) / 5.0f;
}
__global__ void bar(float out[], float in[])
{
int i = threadIdx.x;
out[i] = (in[i-2] + in[i-1] + in[i] + in[i+1] + in[i+2]) / 5.0f;
}
✅ QUESTION: Explain your answer to the above question.
Put your answer to the above question here.
Consider the following inefficient CUDA transpose code.
✅ DO THIS: Copy the code to the HPC, debug any errors and get it to run.
%%writefile NCode/transpose.cu
#include <iostream>
#include <cstdio>   // fprintf, fflush
#include <cstdlib>  // malloc, free, exit
#include <cuda.h>
#include <chrono>
#define CUDA_CALL(x) {cudaError_t cuda_error__ = (x); if (cuda_error__) { fprintf(stderr, "CUDA error: " #x " returned \"%s\"\n", cudaGetErrorString(cuda_error__)); fflush(stderr); exit(cuda_error__); } }
using namespace std;
__global__ void transpose(double *in_d, double * out_d, int row, int col)
{
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
out_d[y+col*x] = in_d[x+row*y];
}
int main(int argc,char **argv)
{
int sz_x=32*300;
int sz_y=32*300;
int nBytes = sz_x*sz_y*sizeof(double);
int block_size;
double *m_h = (double *)malloc(nBytes);
double * in_d;
double * out_d;
int count = 0;
for (int i=0; i < sz_x*sz_y; i++){
m_h[i] = count;
count++;
}
std::cout << "Allocating device memory on host..\n";
auto start_d = std::chrono::high_resolution_clock::now();
CUDA_CALL(cudaMalloc((void **)&in_d,nBytes));
CUDA_CALL(cudaMalloc((void **)&out_d,nBytes));
//Set up blocks
block_size=32;
dim3 dimBlock(block_size,block_size,1);
dim3 dimGrid(sz_x/block_size,sz_y/block_size,1);
std::cout << "Doing GPU Transpose\n";
CUDA_CALL(cudaMemcpy(in_d,m_h,nBytes,cudaMemcpyHostToDevice));
transpose<<<dimGrid,dimBlock>>>(in_d,out_d,sz_y,sz_x);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
fprintf(stderr, "\n\nError: %s\n\n", cudaGetErrorString(err)); fflush(stderr); exit(err);
}
CUDA_CALL(cudaMemcpy(m_h,out_d,nBytes,cudaMemcpyDeviceToHost));
auto end_d = std::chrono::high_resolution_clock::now();
std::cout << "Doing CPU Transpose\n";
auto start_h = std::chrono::high_resolution_clock::now();
for (int y=0; y < sz_y; y++){
for (int x=y; x < sz_x; x++){
double temp = m_h[x+sz_x*y];
//std::cout << temp << " ";
m_h[x+sz_x*y] = m_h[y+sz_y*x];
m_h[y+sz_y*x] = temp;
}
//std::cout << "\n";
}
auto end_h = std::chrono::high_resolution_clock::now();
//Checking errors (should be same values as start)
count = 0;
int errors = 0;
for (int i=0; i < sz_x*sz_y; i++){
if (m_h[i] != count)
errors++;
count++;
}
std::cout << errors << " Errors found in transpose\n";
//Print Timing
std::chrono::duration<double> time_d = end_d - start_d;
std::cout << "Device time: " << time_d.count() << " s\n";
std::chrono::duration<double> time_h = end_h - start_h;
std::cout << "Host time: " << time_h.count() << " s\n";
cudaFree(in_d);
cudaFree(out_d);
free(m_h); // release the host buffer allocated with malloc
return 0;
}
Overwriting NCode/transpose.cu
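To make the kernel's index arithmetic easier to follow while debugging, here is a small Python emulation of the line `out_d[y+col*x] = in_d[x+row*y]`. It is only a sketch: the function name `emulate_transpose_kernel` is hypothetical, the two loops stand in for the GPU threads, and it assumes a square matrix (as in the listing, where `sz_x` equals `sz_y`), so `row` and `col` are both `n`.

```python
def emulate_transpose_kernel(in_flat, n):
    """Emulate out_d[y + col*x] = in_d[x + row*y] for an n-by-n matrix (row == col == n)."""
    out_flat = [0] * (n * n)
    for y in range(n):      # stands in for blockIdx.y * blockDim.y + threadIdx.y
        for x in range(n):  # stands in for blockIdx.x * blockDim.x + threadIdx.x
            out_flat[y + n * x] = in_flat[x + n * y]
    return out_flat


n = 3
matrix = list(range(n * n))  # same fill pattern as m_h: 0, 1, 2, ...
transposed = emulate_transpose_kernel(matrix, n)
print(transposed)  # [0, 3, 6, 1, 4, 7, 2, 5, 8]

# Transposing twice should give back the original values, which is the
# same property the C++ listing checks after the GPU and CPU transposes.
print(emulate_transpose_kernel(transposed, n) == matrix)  # True
```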
#Ignore this cell; it will only work on a CUDA-enabled server. Most likely you will get a "Command not found" error.
#Compile CUDA
!nvcc -std=c++11 -o transpose NCode/transpose.cu
nvcc: Command not found.
#Ignore this cell; it will only work on a CUDA-enabled server. Most likely you will get a "No such file or directory" error.
#Run Example
!./transpose
./transpose: Command not found.
✅ QUESTION: What is the "speedup" of running on the GPU vs. the CPU? Calculate the speedup using the following equation:
$$\text{speedup} = \frac{\text{time}_{\text{host}}}{\text{time}_{\text{device}}}$$
Put your answer to the above question here.
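If it helps, the speedup equation can be evaluated with a few lines of Python. The timings below are made-up placeholders, not measurements; substitute the "Host time" and "Device time" values printed by your own run. Keep in mind that in the listing the device timer starts before `cudaMalloc` and stops after the copy back to the host, so the device time includes allocation and data transfer as well as the kernel itself.

```python
def speedup(time_host, time_device):
    """Return time_host / time_device, both given in seconds."""
    if time_device <= 0:
        raise ValueError("time_device must be positive")
    return time_host / time_device


# Hypothetical timings for illustration only; use your own measured values.
host_seconds = 2.0
device_seconds = 0.5
print(speedup(host_seconds, device_seconds))  # 4.0
```

A value above 1 means the GPU run was faster than the CPU run; a value below 1 means it was slower.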
The following video describes how you might improve the performance of the matrix transpose using tiling.
from IPython.display import YouTubeVideo
YouTubeVideo("pP-1nJEp4Qc",width=640,height=360)
✅ QUESTION: How could you modify the transpose code to take advantage of tiling?
Put your answer to the above question here.
Please fill out the form that appears when you run the code below. You must fill it out completely in order to receive credit for the assignment!
If you have trouble with the embedded form, please make sure you log on with your MSU google account at googleapps.msu.edu and then click on the direct link above.
✅ Assignment-Specific QUESTION: Were you able to get the transpose example working? If not, where did you get stuck?
Put your answer to the above question here
✅ QUESTION: Summarize what you did in this assignment.
Put your answer to the above question here
✅ QUESTION: What questions do you have, if any, about any of the topics discussed in this assignment after working through the Jupyter notebook?
Put your answer to the above question here
✅ QUESTION: How well do you feel this assignment helped you to achieve a better understanding of the above-mentioned topic(s)?
Put your answer to the above question here
✅ QUESTION: What was the most challenging part of this assignment for you?
Put your answer to the above question here
✅ QUESTION: What was the least challenging part of this assignment for you?
Put your answer to the above question here
✅ QUESTION: What kind of additional questions or support, if any, do you feel you need to have a better understanding of the content in this assignment?
Put your answer to the above question here
✅ QUESTION: Do you have any further questions or comments about this material, or anything else that's going on in class?
Put your answer to the above question here
✅ QUESTION: Approximately how long did this pre-class assignment take?
Put your answer to the above question here
from IPython.display import HTML
HTML(
"""
<iframe
src="https://cmse.msu.edu/cmse401-pc-survey"
width="100%"
height="500px"
frameborder="0"
marginheight="0"
marginwidth="0">
Loading...
</iframe>
"""
)
To get credit for this assignment you must fill out and submit the above survey form on or before the assignment due date.
Written by Dr. Dirk Colbry, Michigan State University
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.