Graphics Processing Units (GPUs) are currently among the most popular devices for accelerating scientific computing, and CUDA is one of the most widely used languages for writing code that runs efficiently on NVIDIA GPUs, such as those in Garrawarla. For the AMD GPUs in Setonix, refer to the HIP page.
A very short introduction to GPU programming
GPUs are currently one of the most popular devices for accelerating scientific computing, and they are used successfully in disciplines such as machine learning, the life sciences and many more. HPC-focused GPU devices such as NVIDIA Tesla and AMD Instinct offer significantly higher theoretical arithmetic performance than mainstream processors and feature dedicated high-bandwidth memory in addition to the system RAM. For suitable computational workloads these specifications also translate into improved power efficiency.
By design, GPUs support extremely fine-grained parallelism with thousands of threads. As a result, not all algorithms are well suited to their architecture. Adapting code to the GPU architecture usually requires significant modifications to the source code, in particular when using dedicated languages such as Nvidia CUDA and AMD HIP. The situation keeps improving over time, however, thanks to the availability of optimised libraries and of directive-based programming models such as OpenMP and OpenACC. The process of adapting code to a GPU architecture starts with identifying the most computationally expensive parts of the code and assessing their suitability for GPU acceleration. If suitable, those parts can then be rewritten as so-called kernels (routines to be executed on GPUs) or marked with appropriate compiler directives. In both cases the programmer needs to take care of defining data transfers between the host (CPU) and device (GPU) memory spaces.
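As a flavour of the directive-based approach mentioned above, the sketch below offloads a simple loop to the GPU with an OpenMP target directive. The array names and sizes are made up for illustration, and a compiler with OpenMP offloading support (for example the NVIDIA HPC SDK compilers) is assumed; the rest of this page uses the CUDA kernel approach instead.

/* Illustrative sketch only: offloading a loop with OpenMP directives */
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = (double)i; b[i] = 2.0 * i; }

    /* The map clauses define the host-to-device and device-to-host transfers;
       the loop iterations are executed by GPU threads. */
    #pragma omp target teams distribute parallel for \
            map(to: a, b) map(from: c)
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[42] = %f\n", c[42]);
    return 0;
}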
The lifetime of a GPU-accelerated code is schematically depicted in figure 1.
Figure 1. Lifetime of a GPU-accelerated code
This page focuses on the use of the Nvidia CUDA Toolkit to showcase the basic concepts of GPU programming.
CUDA "Hello world" program
Listing 1 is a simple "hello world" program written in C that shows basic CUDA functions. The CUDA kernel helloGPU is to be executed on the GPU. Its definition uses the __global__ keyword, and its invocation uses the triple angle bracket notation: <<<no_blocks,no_threads>>>. Execution threads in CUDA are grouped into so-called thread blocks; threads in the same block are guaranteed to be executed within the same compute unit inside the GPU, thus sharing the same caches. Thread and block indices are accessible within the kernel definition through the built-in variables threadIdx.x and blockIdx.x.
#include <stdio.h>

__global__ void helloGPU()
{
  int tid = threadIdx.x;
  int bid = blockIdx.x;
  printf("Hello from GPU thread %d in block %d\n",tid,bid);
}

int main(int argc, char *argv[])
{
  int no_blocks = 4;
  int no_threads = 5;

  helloGPU<<<no_blocks,no_threads>>>();
  cudaDeviceSynchronize();
}
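In the pi example further down, the block and thread indices are combined into a single global index so that each thread operates on a distinct array element. A minimal, generic sketch of that pattern (the kernel name and array are illustrative only) looks as follows:

__global__ void indexDemo(float *data, int n)
{
  // Combine block and thread indices into one global index per thread
  int i = blockDim.x * blockIdx.x + threadIdx.x;

  // Guard against threads whose index falls beyond the end of the array
  if (i < n) data[i] = (float)i;
}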
This code can be compiled with Nvidia's CUDA C compiler, nvcc, using the following command:
$ nvcc hello.cu -o helloCUDA
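The GPU architecture to compile for can also be stated explicitly with the -arch option; for instance, when targeting a device of compute capability 7.0 (such as a V100) one would use:

$ nvcc -arch=sm_70 hello.cu -o helloCUDA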
For detailed information on how to compile CUDA software on Pawsey systems, see Compiling.
The code can be executed in an interactive SLURM session or within a batch job. An explicit request to use one or more GPUs is required. Terminal 1 shows an interactive session on Garrawarla:
$ salloc -N 1 -n 1 --gres=gpu:1 -p gpuq -t 0:01:00
$ module load cuda/<version>
$ srun -n 1 --gres=gpu:1 --export=all ./helloCUDA
Hello from GPU thread 0 in block 0
Hello from GPU thread 1 in block 0
Hello from GPU thread 2 in block 0
Hello from GPU thread 3 in block 0
Hello from GPU thread 4 in block 0
Hello from GPU thread 0 in block 1
Hello from GPU thread 1 in block 1
Hello from GPU thread 2 in block 1
Hello from GPU thread 3 in block 1
Hello from GPU thread 4 in block 1
Hello from GPU thread 0 in block 3
Hello from GPU thread 1 in block 3
Hello from GPU thread 2 in block 3
Hello from GPU thread 3 in block 3
Hello from GPU thread 4 in block 3
Hello from GPU thread 0 in block 2
Hello from GPU thread 1 in block 2
Hello from GPU thread 2 in block 2
Hello from GPU thread 3 in block 2
Hello from GPU thread 4 in block 2
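Alternatively, the program can be submitted through a batch script. The following is an indicative sketch mirroring the interactive request above; partition, time limit and module version should be adapted to the system and project in use.

#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --partition=gpuq
#SBATCH --time=0:01:00

module load cuda/<version>

srun -n 1 --gres=gpu:1 --export=all ./helloCUDA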
Implementation of the toy problem
The CUDA implementation of the toy computational problem consists of the following steps:
- A memory buffer for random numbers is allocated in the GPU device memory (line 34).
- The CUDA random number generator library is used to generate and store random numbers on the GPU device (lines 37-42).
- An accelerator kernel for counting the points that fall inside the circle is defined (lines 7-22).
- The CUDA kernel is launched on the device with 1000 threads per block and 1000 blocks (line 52). These sizes need to be chosen carefully by the programmer according to the algorithm and the specific GPU type in use; the numbers used in this example were picked arbitrarily.
- The CUDA synchronisation routine is called in line 55 to make sure that the kernel has completed its execution.
- Partial counts are copied from the device memory to the host memory in line 58.
- The final result is computed and printed based on partial counts.
/* Compute pi using CUDA */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>
#include <curand.h>

static long num_trials = 1000000;

__global__ void kernel(int *Ncirc_t_device, float *randnum)
{
  int i;
  double r = 1.0;   // radius of circle
  double r2 = r*r;
  double x, y;

  i = blockDim.x * blockIdx.x + threadIdx.x;

  x = randnum[2*i];
  y = randnum[2*i+1];

  if ((x*x + y*y) <= r2)
    Ncirc_t_device[i] = 1;
  else
    Ncirc_t_device[i] = 0;
}

int main(int argc, char **argv)
{
  int i;
  long Ncirc = 0;
  int *Ncirc_t_device;
  int *Ncirc_t_host;
  float *randnum;
  int threads, blocks;
  double pi;

  // Allocate an array for the random numbers in GPU memory space
  cudaMalloc((void**)&randnum, (2*num_trials)*sizeof(float));

  // Generate random numbers
  int status;
  curandGenerator_t randgen;
  status  = curandCreateGenerator(&randgen, CURAND_RNG_PSEUDO_MRG32K3A);
  status |= curandSetPseudoRandomGeneratorSeed(randgen, 4294967296ULL^time(NULL));
  status |= curandGenerateUniform(randgen, randnum, (2*num_trials));
  status |= curandDestroyGenerator(randgen);

  threads = 1000;
  blocks = num_trials/threads;

  // Allocate hit array on host
  Ncirc_t_host = (int*)malloc(num_trials*sizeof(int));
  // Allocate hit array on device
  cudaMalloc((void**)&Ncirc_t_device, num_trials*sizeof(int));

  kernel <<<blocks, threads>>> (Ncirc_t_device, randnum);

  // Synchronize host and device
  cudaDeviceSynchronize();

  // Copy the hit array to host
  cudaMemcpy(Ncirc_t_host, Ncirc_t_device, num_trials*sizeof(int), cudaMemcpyDeviceToHost);

  // Count hits
  for(i=0; i<num_trials; i++) Ncirc += Ncirc_t_host[i];

  pi = 4.0 * ((double)Ncirc)/((double)num_trials);

  printf("\n \t Computing pi using CUDA: \n");
  printf("\t For %ld trials, pi = %f\n", num_trials, pi);
  printf("\n");

  cudaFree(randnum);
  cudaFree(Ncirc_t_device);
  free(Ncirc_t_host);

  return 0;
}
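Note that the listing above ignores the return codes of the CUDA and cuRAND calls for brevity (the status variable is set but never inspected). In production code it is good practice to check them; the snippet below shows one possible way of doing so for the CUDA runtime calls, using a hypothetical convenience macro. cuRAND calls can be checked analogously by comparing their return value against CURAND_STATUS_SUCCESS.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

// Hypothetical helper: abort with a readable message if a CUDA runtime call fails
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                   \
                    cudaGetErrorString(err), __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Example usage:
//   CUDA_CHECK(cudaMalloc((void**)&randnum, (2*num_trials)*sizeof(float)));
//   CUDA_CHECK(cudaDeviceSynchronize());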
Related pages
- Introduction to HIP
External links
- Beginner guides to CUDA by Nvidia
- For a comprehensive CUDA C programming guide, see the CUDA C Programming Guide by Nvidia
- OpenMP homepage
- For an introduction to OpenMP for GPUs, see OpenMP on GPUs, First Experiences and Best Practices (PDF of a presentation delivered at GTC2018 by Jeff Larkin)
- OpenACC homepage
- For recorded tutorials on OpenACC, see the OpenACC Courses by Nvidia
- AMD HIP Programming Guide