In order to successfully complete this assignment you need to participate both individually and in groups during class. Have one of the instructors check your notebook and sign you out before leaving class. Turn in your assignment using D2L.
ICA 21: CUDA Practice#

Image from: https://www.nvidia.com
Agenda for today’s class (70 minutes)#
(20 minutes) Pre-class Review
(30 minutes) 1D Wave Example
(10 minutes) Amdahl’s law
(10 minutes) Strong vs Weak Scaling
1. Pre-class Review#
Briefly discuss the pre-class assignment (PCA) with your group and work through any challenges you had with its content.
2. 1D Wave Example#
Based on the coding demo, finish updating your wave code to work with CUDA:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(int argc, char **argv) {
    int nx = 500;     // number of grid points
    int nt = 100000;  // number of time steps
    int i, it;
    double x[nx];     // grid positions
    double y[nx];     // displacement of the string
    double v[nx];     // velocity
    double dvdt[nx];  // acceleration
    double dt;
    double dx;
    double max, min;
    double dx2inv;
    double tmax;
    int nxm1;

    // Set up the spatial grid on [min, max]
    max = 10.0;
    min = 0.0;
    dx = (max - min) / (double)(nx);
    x[0] = min;
    for (i = 1; i < nx - 1; i++) {
        x[i] = min + (double)i * dx;
    }
    x[nx - 1] = max;

    // Time step size
    tmax = 10.0;
    dt = (tmax - 0.0) / (double)(nt);

    // Initial condition: Gaussian pulse centered at x = 5, starting at rest
    for (i = 0; i < nx; i++) {
        y[i] = exp(-(x[i] - 5.0) * (x[i] - 5.0));
        v[i] = 0.0;
        dvdt[i] = 0.0;
    }

    dx2inv = 1.0 / (dx * dx);
    nxm1 = nx - 1;

    // Main time-stepping loop; the endpoints are held fixed
    for (it = 0; it < nt - 1; it++) {
        // Acceleration from the discrete second spatial derivative
        for (i = 1; i < nxm1; i++)
            dvdt[i] = (y[i + 1] + y[i - 1] - 2.0 * y[i]) * (dx2inv);
        // Update the velocity, then the position (using the new velocity)
        for (i = 1; i < nxm1; i++) {
            v[i] = v[i] + dt * dvdt[i];
            y[i] = y[i] + dt * v[i];
        }
    }

    // Print the solution near the center of the grid
    for (i = nx / 2 - 10; i < nx / 2 + 10; i++) {
        printf("%g %g\n", x[i], y[i]);
    }
    return 0;
}
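As a starting point, here is a minimal sketch of how the two inner loops might map onto CUDA kernels, assuming one thread per grid point. The kernel names (accel, update), the block size of 256, and the device pointer names are illustrative choices, not part of the original code; you still need to add the cudaMalloc/cudaMemcpy calls and error checking yourself.

#include <cuda_runtime.h>

// One thread per interior grid point computes the acceleration.
__global__ void accel(int nx, double dx2inv, const double *y, double *dvdt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < nx - 1)
        dvdt[i] = (y[i + 1] + y[i - 1] - 2.0 * y[i]) * dx2inv;
}

// Separate kernel so every acceleration is computed before positions move.
__global__ void update(int nx, double dt, double *y, double *v, const double *dvdt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < nx - 1) {
        v[i] += dt * dvdt[i];
        y[i] += dt * v[i];
    }
}

// Inside main, after cudaMalloc'ing d_y, d_v, d_dvdt and copying y and v over:
//     int threads = 256;
//     int blocks = (nx + threads - 1) / threads;
//     for (it = 0; it < nt - 1; it++) {
//         accel<<<blocks, threads>>>(nx, dx2inv, d_y, d_dvdt);
//         update<<<blocks, threads>>>(nx, dt, d_y, d_v, d_dvdt);
//     }
//     cudaMemcpy(y, d_y, nx * sizeof(double), cudaMemcpyDeviceToHost);

Splitting the update into two kernels replaces the implicit ordering of the two serial loops: the kernel launch boundary guarantees all of dvdt is written before any thread updates v and y.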
3. Amdahl’s law#
The following examples come from: https://en.wikipedia.org/wiki/Amdahl%27s_law
Amdahl’s law can be formulated in the following way:

\(S_\text{latency}(s) = \dfrac{1}{(1-p) + \dfrac{p}{s}}\)

where:

\(S_\text{latency}\) is the theoretical speedup of the execution of the whole task;

\(s\) is the speedup of the part of the task that benefits from improved system resources;

\(p\) is the proportion of execution time that the part benefiting from improved resources originally occupied.
Furthermore,

\(S_\text{latency}(s) \le \dfrac{1}{1-p}, \qquad \lim_{s \to \infty} S_\text{latency}(s) = \dfrac{1}{1-p},\)

which shows that the theoretical speedup of the whole task is limited by the part of the task that does not benefit from the improvement.

If 30% of the execution time may be the subject of a speedup, \(p\) will be 0.3; if the improvement makes the affected part twice as fast, \(s\) will be 2. Amdahl’s law states that the overall speedup of applying the improvement will be:

\(S_\text{latency} = \dfrac{1}{1 - p + \frac{p}{s}} = \dfrac{1}{1 - 0.3 + \frac{0.3}{2}} \approx 1.18\)
For example, assume that we are given a serial task which is split into four consecutive parts, whose percentages of execution time are \(p_1 = 0.11\), \(p_2 = 0.18\), \(p_3 = 0.23\), and \(p_4 = 0.48\) respectively. Then we are told that the 1st part is not sped up, so \(s_1 = 1\), while the 2nd part is sped up 5 times, so \(s_2 = 5\), the 3rd part is sped up 20 times, so \(s_3 = 20\), and the 4th part is sped up 1.6 times, so \(s_4 = 1.6\). By using Amdahl’s law, the overall speedup is

\(S_\text{latency} = \dfrac{1}{\frac{p_1}{s_1} + \frac{p_2}{s_2} + \frac{p_3}{s_3} + \frac{p_4}{s_4}} = \dfrac{1}{\frac{0.11}{1} + \frac{0.18}{5} + \frac{0.23}{20} + \frac{0.48}{1.6}} \approx 2.19\)
Notice how the 5 times and 20 times speedup on the 2nd and 3rd parts respectively don’t have much effect on the overall speedup when the 4th part (48% of the execution time) is accelerated by only 1.6 times.
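As a quick sanity check on this arithmetic, here is a small C program (purely illustrative, not part of the assignment) that evaluates the generalized Amdahl formula for the four-part example above:

#include <stdio.h>

int main(void) {
    // Execution-time fractions and per-part speedups from the example above.
    double p[4] = {0.11, 0.18, 0.23, 0.48};
    double s[4] = {1.0, 5.0, 20.0, 1.6};
    double denom = 0.0;
    for (int k = 0; k < 4; k++)
        denom += p[k] / s[k];  // each part contributes p_k / s_k
    printf("overall speedup = %.2f\n", 1.0 / denom);  // prints 2.19
    return 0;
}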
4. Strong vs Weak Scaling#
STRONG SCALING In this case the problem size stays fixed but the number of processing elements is increased. This is used as justification for programs that take a long time to run (something that is CPU-bound). The goal in this case is to find a “sweet spot” that allows the computation to complete in a reasonable amount of time, yet does not waste too many cycles due to parallel overhead. In strong scaling, a program is considered to scale linearly if the speedup (in terms of work units completed per unit time) is equal to the number of processing elements used (\(N\)). In general, it is harder to achieve good strong scaling at larger process counts, since the communication overhead for many algorithms increases in proportion to the number of processes used.
If the amount of time to complete a work unit with 1 processing element is \(t_1\), and the amount of time to complete the same unit of work with \(N\) processing elements is \(t_N\), the strong scaling efficiency (as a percentage of linear) is given as:

\(\text{efficiency} = \dfrac{t_1}{N \times t_N} \times 100\%\)
WEAK SCALING In this case the problem size (workload) assigned to each processing element stays constant, and additional elements are used to solve a larger total problem (one that wouldn’t fit in RAM on a single node, for example). Therefore, this type of measurement is justification for programs that take a lot of memory or other system resources (something that is memory-bound). In the case of weak scaling, linear scaling is achieved if the run time stays constant while the workload is increased in direct proportion to the number of processors. Most programs running in this mode should scale well to larger core counts, as they typically employ nearest-neighbour communication patterns where the communication overhead is relatively constant regardless of the number of processes used; exceptions include algorithms that make heavy use of global communication patterns, e.g., FFTs and transposes.
If the amount of time to complete a work unit with 1 processing element is \(t_1\), and the amount of time to complete \(N\) of the same work units with \(N\) processing elements is \(t_N\), the weak scaling efficiency (as a percentage of linear) is given as:

\(\text{efficiency} = \dfrac{t_1}{t_N} \times 100\%\)
from: https://www.sharcnet.ca/help/index.php/Measuring_Parallel_Scaling_Performance
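Putting both definitions into code, here is a small C helper (the timings fed to it are illustrative, not measured data) that computes the two efficiencies:

#include <stdio.h>

// Strong scaling: same total problem solved with N processing elements.
double strong_eff(double t1, double tN, int N) {
    return 100.0 * t1 / ((double)N * tN);
}

// Weak scaling: N times the work solved with N processing elements.
double weak_eff(double t1, double tN) {
    return 100.0 * t1 / tN;
}

int main(void) {
    // Hypothetical timings in seconds.
    printf("strong: %.1f%%\n", strong_eff(100.0, 30.0, 4));  // 83.3%
    printf("weak:   %.1f%%\n", weak_eff(100.0, 110.0));      // 90.9%
    return 0;
}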
✅ DO THIS: Design a weak and strong scaling study for the wave equation. Time-permitting, run the scaling studies and plot the results. Feel free to do this with CUDA or OpenMP.
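If you go the OpenMP route, one possible skeleton for the timing runs is sketched below. The run_wave wrapper, the grid sizes, step counts, and thread counts are all placeholders to adapt; run_wave simply repeats the update loops from the serial code with omp parallel for. Compile with something like gcc -fopenmp -O2.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h>

// Hypothetical driver: runs the wave solver's time loop on a grid of
// size nx for nt steps using the given number of OpenMP threads.
void run_wave(int nx, int nt, int nthreads) {
    double *y = malloc(nx * sizeof(double));
    double *v = calloc(nx, sizeof(double));
    double *dvdt = calloc(nx, sizeof(double));
    double dx = 10.0 / nx, dt = 10.0 / nt, dx2inv = 1.0 / (dx * dx);
    for (int i = 0; i < nx; i++)
        y[i] = exp(-pow(i * dx - 5.0, 2));
    omp_set_num_threads(nthreads);
    for (int it = 0; it < nt - 1; it++) {
        #pragma omp parallel for
        for (int i = 1; i < nx - 1; i++)
            dvdt[i] = (y[i + 1] + y[i - 1] - 2.0 * y[i]) * dx2inv;
        #pragma omp parallel for
        for (int i = 1; i < nx - 1; i++) {
            v[i] += dt * dvdt[i];
            y[i] += dt * v[i];
        }
    }
    free(y); free(v); free(dvdt);
}

int main(void) {
    for (int n = 1; n <= 8; n *= 2) {
        // Strong scaling: fixed problem size, more threads.
        double t0 = omp_get_wtime();
        run_wave(100000, 10000, n);
        printf("strong N=%d t=%g s\n", n, omp_get_wtime() - t0);

        // Weak scaling: problem size grows with the thread count.
        t0 = omp_get_wtime();
        run_wave(n * 100000, 10000, n);
        printf("weak   N=%d t=%g s\n", n, omp_get_wtime() - t0);
    }
    return 0;
}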
Congratulations, we’re done!#
Have one of the instructors check your notebook and sign you out before leaving class. Turn in your assignment using D2L.
Written by Dr. Dirk Colbry, Michigan State University (Updated by Dr. Nathan Haut in Spring 2025)
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.