Link to this document's Jupyter Notebook

In order to successfully complete this assignment you need to participate both individually and in groups during class. If you attend class in person, have one of the instructors check your notebook and sign you out before leaving class on Monday, March 8. If you are attending asynchronously, turn in your assignment using D2L no later than 11:59pm on Monday, March 8.


In-Class Assignment: CUDA Practice

Fancy rendering of a 3D image often used as a CUDA icon

Image from: https://www.nvidia.com

Agenda for today's class (70 minutes)

  1. (20 minutes) Pre-class Review
  2. (30 minutes) 1D Wave Example
  3. (10 minutes) Amdahl's law
  4. (10 minutes) Strong vs Weak Scaling

1. Pre-class Review

0307--CUDA_Memory_pre-class-assignment


2. 1D Wave Example

Individually and as a class, let's see if we can run the 1D wave example on the GPU and measure the performance:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char ** argv) {
    int nx = 500;        /* number of grid points */
    int nt = 100000;     /* number of time steps */
    int i,it;
    double x[nx];        /* grid point positions */
    double y[nx];        /* wave amplitude */
    double v[nx];        /* velocity (dy/dt) */
    double dvdt[nx];     /* acceleration (d^2y/dt^2) */
    double dt;           /* time step size */
    double dx;           /* grid spacing */
    double max,min;
    double dx2inv;
    double tmax;
    int nxm1;

    /* Set up the spatial grid on [min, max] */
    max=10.0;
    min=0.0;
    dx = (max-min)/(double)(nx);
    x[0] = min;
    for(i=1;i<nx-1;i++) {
        x[i] = min+(double)i*dx;
    }
    x[nx-1] = max;
    tmax=10.0;
    dt= (tmax-0.0)/(double)(nt);

    /* Initial condition: Gaussian pulse centered at x = 5, zero velocity */
    for (i=0;i<nx;i++)  {
        y[i] = exp(-(x[i]-5.0)*(x[i]-5.0));
        v[i] = 0.0;
        dvdt[i] = 0.0;
    }

    dx2inv=1.0/(dx*dx);
    nxm1=nx-1;
    /* Time stepping: central difference in space, forward Euler in time */
    for(it=0;it<nt-1;it++) {
        /* Acceleration from the discrete second spatial derivative */
        for(i=1;i<nxm1;i++)
            dvdt[i]=(y[i+1]+y[i-1]-2.0*y[i])*(dx2inv);

        /* Update velocity, then position (boundary points are not updated) */
        for(i=1; i<nxm1; i++)  {
            v[i] = v[i] + dt*dvdt[i];
            y[i] = y[i] + dt*v[i];
        }

    }

    /* Print the 20 grid points around the center of the domain */
    for(i=nx/2-10; i<nx/2+10; i++) {
        printf("%g %g\n",x[i],y[i]);
    }

    return 0;
}
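
As a starting point for the group exercise, here is a minimal sketch of how the two inner update loops might be moved onto the GPU. The kernel names (`accel`, `update`), the block size of 256, and the use of `cudaMallocManaged` (unified memory) are our own choices, not the only way to do it, and error checking is omitted to keep the sketch short.

#include <stdio.h>
#include <math.h>
#include <cuda_runtime.h>

/* Kernel: compute the acceleration at each interior grid point. */
__global__ void accel(int nx, double dx2inv, const double *y, double *dvdt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < nx - 1)
        dvdt[i] = (y[i+1] + y[i-1] - 2.0*y[i]) * dx2inv;
}

/* Kernel: advance velocity, then position, by one time step. */
__global__ void update(int nx, double dt, const double *dvdt, double *v, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < nx - 1) {
        v[i] += dt * dvdt[i];
        y[i] += dt * v[i];
    }
}

int main() {
    int nx = 500, nt = 100000;
    double min = 0.0, max = 10.0, tmax = 10.0;
    double dx = (max - min) / (double)nx;
    double dt = tmax / (double)nt;
    double dx2inv = 1.0 / (dx * dx);

    /* Managed (unified) memory is accessible from both host and device. */
    double *x, *y, *v, *dvdt;
    cudaMallocManaged(&x,    nx * sizeof(double));
    cudaMallocManaged(&y,    nx * sizeof(double));
    cudaMallocManaged(&v,    nx * sizeof(double));
    cudaMallocManaged(&dvdt, nx * sizeof(double));

    /* Same initial condition as the CPU version. */
    for (int i = 0; i < nx; i++) {
        x[i] = min + (double)i * dx;
        y[i] = exp(-(x[i] - 5.0) * (x[i] - 5.0));
        v[i] = 0.0;
        dvdt[i] = 0.0;
    }
    x[nx-1] = max;

    int threads = 256;                        /* block size: an assumption, tune as needed */
    int blocks = (nx + threads - 1) / threads;

    for (int it = 0; it < nt - 1; it++) {
        accel<<<blocks, threads>>>(nx, dx2inv, y, dvdt);
        update<<<blocks, threads>>>(nx, dt, dvdt, v, y);
    }
    cudaDeviceSynchronize();                  /* wait for the GPU before reading y on the host */

    for (int i = nx/2 - 10; i < nx/2 + 10; i++)
        printf("%g %g\n", x[i], y[i]);

    cudaFree(x); cudaFree(y); cudaFree(v); cudaFree(dvdt);
    return 0;
}

Compiled with nvcc (e.g. `nvcc wave.cu -o wave`), this should produce the same output as the serial version; timing both (for example with `time`) lets you see whether the GPU actually helps at this small problem size, and how the launch overhead of 2 kernels per time step compares to the arithmetic being done.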

3. Amdahl's law

The following examples come from here: https://en.wikipedia.org/wiki/Amdahl%27s_law

Amdahl's law can be formulated in the following way:

$$S_\text{latency}(s)=\frac {1}{(1-p)+{\frac {p}{s}}}$$

$S_\text{latency}$ is the theoretical speedup of the execution of the whole task, $s$ is the speedup of the part of the task that benefits from the improvement, and $p$ is the proportion of the original execution time occupied by that part.

Furthermore,

$${\displaystyle {\begin{cases}S_{\text{latency}}(s)\leq {\dfrac {1}{1-p}}\\[8pt]\lim \limits _{s\to \infty }S_{\text{latency}}(s)={\dfrac {1}{1-p}}.\end{cases}}} $$

If 30% of the execution time may be the subject of a speedup, $p$ will be 0.3; if the improvement makes the affected part twice as fast, $s$ will be 2. Amdahl's law states that the overall speedup of applying the improvement will be:

$${\displaystyle S_{\text{latency}}={\frac {1}{1-p+{\frac {p}{s}}}}={\frac {1}{1-0.3+{\frac {0.3}{2}}}}=1.18.}$$

For example, assume that we are given a serial task which is split into four consecutive parts, whose percentages of execution time are $p_1 = 0.11$, $p_2 = 0.18$, $p_3 = 0.23$, and $p_4 = 0.48$ respectively. Then we are told that the 1st part is not sped up, so $s_1 = 1$, while the 2nd part is sped up 5 times, so $s_2 = 5$, the 3rd part is sped up 20 times, so $s_3 = 20$, and the 4th part is sped up 1.6 times, so $s_4 = 1.6$. By using Amdahl's law, the overall speedup is

$${\displaystyle S_{\text{latency}}={\frac {1}{{\frac {p_1}{s_1}}+{\frac {p_2}{s_2}}+{\frac {p_3}{s_3}}+{\frac {p_4}{s_4}}}}={\frac {1}{{\frac {0.11}{1}}+{\frac {0.18}{5}}+{\frac {0.23}{20}}+{\frac {0.48}{1.6}}}}=2.19} $$

Notice how the 5 times and 20 times speedup on the 2nd and 3rd parts respectively don't have much effect on the overall speedup when the 4th part (48% of the execution time) is accelerated by only 1.6 times.
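
To check numbers like these quickly, here is a small helper, written as plain C with names of our own choosing (a sketch, not part of the original assignment), that evaluates Amdahl's law for a task split into parts; the two calls reproduce the 1.18 and 2.19 results above.

#include <stdio.h>

/* Amdahl's law for a task split into n parts: p[i] is the fraction of the
 * original execution time spent in part i, s[i] is the speedup applied to it. */
double amdahl(int n, const double p[], const double s[]) {
    double denom = 0.0;
    for (int i = 0; i < n; i++)
        denom += p[i] / s[i];
    return 1.0 / denom;
}

int main() {
    /* Single improved part: 30% of the time sped up 2x, the rest unchanged (s = 1). */
    double pa[] = {0.7, 0.3}, sa[] = {1.0, 2.0};
    printf("one part:   %.2f\n", amdahl(2, pa, sa));   /* ~1.18 */

    /* The four consecutive parts from the example above. */
    double pb[] = {0.11, 0.18, 0.23, 0.48};
    double sb[] = {1.0, 5.0, 20.0, 1.6};
    printf("four parts: %.2f\n", amdahl(4, pb, sb));   /* ~2.19 */
    return 0;
}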


4. Strong vs Weak Scaling

STRONG SCALING In this case the problem size stays fixed but the number of processing elements is increased. This is used as justification for programs that take a long time to run (something that is cpu-bound). The goal in this case is to find a "sweet spot" that allows the computation to complete in a reasonable amount of time, yet does not waste too many cycles due to parallel overhead. In strong scaling, a program is considered to scale linearly if the speedup (in terms of work units completed per unit time) is equal to the number of processing elements used ($N$). In general, it is harder to achieve good strong scaling at larger process counts, since the communication overhead for many/most algorithms increases in proportion to the number of processes used.

If the amount of time to complete a work unit with 1 processing element is $t_1$, and the amount of time to complete the same unit of work with $N$ processing elements is $t_N$, the strong scaling efficiency (as a percentage of linear) is given as:

$$ \frac{t_1}{N \times t_N} \times 100\% $$

WEAK SCALING In this case the problem size (workload) assigned to each processing element stays constant and additional elements are used to solve a larger total problem (one that wouldn't fit in RAM on a single node, for example). Therefore, this type of measurement is justification for programs that take a lot of memory or other system resources (something that is memory-bound). In the case of weak scaling, linear scaling is achieved if the run time stays constant while the workload is increased in direct proportion to the number of processors. Most programs running in this mode should scale well to larger core counts, as they typically employ nearest-neighbour communication patterns where the communication overhead is relatively constant regardless of the number of processes used; exceptions include algorithms that employ heavy use of global communication patterns, e.g. FFTs and transposes.

If the amount of time to complete a work unit with 1 processing element is $t_1$, and the amount of time to complete $N$ of the same work units with $N$ processing elements is $t_N$, the weak scaling efficiency (as a percentage of linear) is given as:

$$ \frac{t_1}{t_N} \times 100\% $$

from: https://www.sharcnet.ca/help/index.php/Measuring_Parallel_Scaling_Performance
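
If you time your own runs (for example the wave code above on different GPU/CPU configurations), the two efficiency formulas are straightforward to evaluate. The sketch below uses hypothetical timings chosen purely for illustration; the function names are our own.

#include <stdio.h>

/* Strong scaling: fixed total problem solved with N processing elements.
 * t1 = time with 1 element, tN = time with N elements. */
double strong_scaling_efficiency(double t1, double tN, int N) {
    return t1 / (N * tN) * 100.0;
}

/* Weak scaling: the workload per element stays fixed, so the total problem
 * grows with N. t1 = time for 1 work unit on 1 element, tN = time for N units on N elements. */
double weak_scaling_efficiency(double t1, double tN) {
    return t1 / tN * 100.0;
}

int main() {
    /* Hypothetical timings in seconds, just to show the calculation. */
    double t1 = 100.0;          /* 1 processing element */
    double t8_strong = 16.0;    /* same problem on 8 elements */
    double t8_weak = 110.0;     /* 8x larger problem on 8 elements */
    printf("strong scaling efficiency: %.1f%%\n", strong_scaling_efficiency(t1, t8_strong, 8));
    printf("weak scaling efficiency:   %.1f%%\n", weak_scaling_efficiency(t1, t8_weak));
    return 0;
}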


Congratulations, we're done!

If you attend class in-person then have one of the instructors check your notebook and sign you out before leaving class. If you are attending asynchronously, turn in your assignment using D2L.

Course Resources:

Written by Dr. Dirk Colbry, Michigan State University.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.