Parallel Programming Models
A variety of parallel programming models is available on Pawsey Centre's systems. This section contains introductory information about the most popular parallel programming techniques, such as MPI, OpenMP, CUDA and HIP.
This part of the documentation is organised as an introductory tutorial on using various parallel programming models. The main idea is to illustrate the basic usage of parallel programming paradigms on Pawsey systems rather than to provide a detailed description of a particular programming model. Each subsection ends with links to useful learning materials for further reading.
The tutorial starts with the description of a toy computational problem and its serial C and Fortran implementations, which are then ported to each programming model. The models covered here are described in the subpages of this page.
There are other common parallelism models and APIs:
- OpenCL is an open standard for cross-platform, parallel programming for heterogeneous compute on diverse accelerators, from GPUs to FPGAs.
- OpenACC is a programming standard for parallel computing that is designed to simplify parallel programming of heterogeneous CPU/GPU systems. As with OpenMP, parallelisation is achieved through the use of directives: #pragma acc (in C/C++) or !$acc (in Fortran). Transition existing OpenACC applications to OpenMP, and use OpenMP instead of OpenACC in new applications, as there is better support for the OpenMP API.
- HPX is a programming model in C++ providing abstractions for parallel execution of code.
- Kokkos is a programming model in C++ providing abstractions for both parallel execution of code and data management. It currently can use CUDA, HIP, SYCL, HPX, OpenMP and C++ threads as backend programming models with several other backends in development.
- Charm++ is a parallel programming framework in C++ supported by an adaptive runtime system that supports both irregular as well as regular applications. It can be used to specify task parallelism as well as data parallelism in a single application.
Toy computational problem
Monte Carlo estimate of pi
The value of pi can be estimated by a simple Monte Carlo algorithm where random points are generated within a square and the proportion of points that lie inside an inscribed circle is counted. The probability of a point landing in the circle equals the ratio of the area of the circle to the area of the square, which is pi/4, so pi is approximately four times the observed proportion.
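The estimator can be written out explicitly. The symbols below are chosen to match the variable names used in the serial codes (r, Ncirc, num_trials). The codes sample points in the unit square and test them against a quarter circle of radius r = 1, which gives the same area ratio as a full circle inscribed in a square of side 2r. The error estimate assumes independent, uniformly distributed samples, which a simple pseudorandom generator only approximates.

```latex
\begin{align}
  P(\text{hit}) &= \frac{\tfrac{1}{4}\pi r^{2}}{r^{2}}
                 = \frac{\pi r^{2}}{(2r)^{2}}
                 = \frac{\pi}{4}, \\
  \pi &\approx 4\,\frac{N_{\text{circ}}}{N_{\text{trials}}}, \\
  \sigma_{\pi} &= 4\sqrt{\frac{P(1-P)}{N_{\text{trials}}}}
               = \sqrt{\frac{\pi(4-\pi)}{N_{\text{trials}}}}
               \approx \frac{1.64}{\sqrt{N_{\text{trials}}}}.
\end{align}
```

For one million trials, the expected statistical error is therefore of the order of 0.002.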
Serial implementation
The following code blocks show the serial implementations of the Monte Carlo pi estimator in the C and Fortran languages. Use these as references when reading through the parallel implementation of the same algorithm in the various subpages of this page.
/* Compute pi in serial */
#include <stdio.h>

// Random number generator -- and not a very good one, either!
static long MULTIPLIER = 1366;
static long ADDEND = 150889;
static long PMOD = 714025;
long random_last = 0;

// This is not a thread-safe random number generator
double lcgrandom() {
    long random_next;
    random_next = (MULTIPLIER * random_last + ADDEND)%PMOD;
    random_last = random_next;
    return ((double)random_next/(double)PMOD);
}

static long num_trials = 1000000;

int main(int argc, char **argv) {
    long i;
    long Ncirc = 0;
    double pi, x, y;
    double r = 1.0; // radius of circle
    double r2 = r*r;

    // for loop with most of the compute
    for (i = 0; i < num_trials; i++) {
        x = lcgrandom();
        y = lcgrandom();
        if ((x*x + y*y) <= r2)
            Ncirc++;
    }

    pi = 4.0 * ((double)Ncirc)/((double)num_trials);
    printf("\n \t Computing pi in serial: \n");
    printf("\t For %ld trials, pi = %f\n", num_trials, pi);
    printf("\n");
    return 0;
}
! Compute pi in serial

! First, the pseudorandom number generator
real function lcgrandom()
  integer*8, parameter :: MULTIPLIER = 1366
  integer*8, parameter :: ADDEND = 150889
  integer*8, parameter :: PMOD = 714025
  integer*8, save :: random_last = 0
  integer*8 :: random_next = 0
  random_next = mod((MULTIPLIER * random_last + ADDEND), PMOD)
  random_last = random_next
  lcgrandom = (1.0*random_next)/PMOD
  return
end

! Now, we compute pi
program darts
  implicit none
  integer*8 :: num_trials = 1000000, i = 0, Ncirc = 0
  real :: pi = 0.0, x = 0.0, y = 0.0, r = 1.0
  real :: r2 = 0.0
  real :: lcgrandom
  r2 = r*r
  do i = 1, num_trials
    x = lcgrandom()
    y = lcgrandom()
    if ((x*x + y*y) .le. r2) then
      Ncirc = Ncirc+1
    end if
  end do
  pi = 4.0*((1.0*Ncirc)/(1.0*num_trials))
  print*, ' '
  print*, ' Computing pi in serial: '
  print*, ' For ', num_trials, ' trials, pi = ', pi
  print*, ' '
end
The above codes can be compiled and executed on a Cray supercomputer in the following way (C example):
$ cc pi.c -o pi.x
$ salloc -N 1 -n 1 -pdebugq -t 0:10:00   # interactive session on debug for 10 minutes
$ srun ./pi.x

  Computing pi in serial:
  For 1000000 trials, pi = 3.14140797
Related pages
For detailed information on how to compile software on Pawsey systems, see Compiling.