MPI+X
The MPI+X approach combines the Message Passing Interface (MPI) for expressing internode parallelism with another parallel programming model ("X") for intranode parallelism or acceleration. MPI+X is the recommended parallelisation strategy on current HPC systems.
Prerequisites
This page builds on top of Message Passing Interface and OpenMP.
A very short introduction to MPI+X
The recommended parallelisation strategy for current HPC systems is the MPI+X approach, which combines the Message Passing Interface for expressing internode parallelism with another parallel programming model (for example, a threading model) for intranode parallelism or acceleration. One of the most popular MPI+X combinations is MPI+OpenMP, where OpenMP is used to express intranode parallelism. Another example is MPI combined with a GPU programming model, such as CUDA or HIP, to make use of the accelerators available within the compute nodes.
There are a number of benefits to using the MPI+X model:
- Reducing the total number of MPI ranks used in the parallel execution
- Obtaining better performance due to a more appropriate expression of intranode parallelism
- Allowing a more convenient way of expressing SIMD parallelism at the loop or function level
- Optimising existing GPU/OpenMP codes to achieve multi-node scalability
"Hello world" programs
Listing 1 shows a simple "hello world" program written in C that uses the basic MPI and OpenMP functions. Listing 2 shows the same program written in Fortran. Each MPI process spawns additional threads for the execution of a parallel region. Note how the value of rank will be different for each process, and how the value of id will be different for each thread of the same process.
#include <stdio.h>
#include "mpi.h"
#include <omp.h>

int main(int argc, char *argv[])
{
    int rank, size;
    int id = 0, nthreads = 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel default(shared) private(id, nthreads)
    {
        nthreads = omp_get_num_threads();
        id = omp_get_thread_num();
        printf("Hello from thread %d out of %d from process %d out of %d\n",
               id, nthreads, rank, size);
    }

    MPI_Finalize();
}
program hello
  use mpi
  implicit none
  integer :: ifail
  integer :: rank, size
  integer :: id, nthreads, omp_get_thread_num, omp_get_num_threads

  call mpi_init(ifail)
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ifail)
  call mpi_comm_size(MPI_COMM_WORLD, size, ifail)

!$omp parallel private(id,nthreads)
  id = omp_get_thread_num()
  nthreads = omp_get_num_threads()
  write (*,*) 'Hello from thread', id, " out of ", nthreads, " from process ", rank, " out of ", size
!$omp end parallel

  call mpi_finalize(ifail)
end program hello
These codes can be compiled on a Cray supercomputer in the following way, assuming the Cray compiler wrappers are set to use the GNU Compiler Collection:
$ cc -fopenmp hello.c -o helloC     # C
$ ftn -fopenmp hello.f90 -o helloF  # Fortran
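On systems that do not provide the Cray cc and ftn wrappers, the wrapper compilers shipped with the MPI library would typically be used instead. The exact wrapper names depend on the MPI implementation installed; the commands below are an assumption based on a typical MPICH or Open MPI installation built with the GNU compilers:

$ mpicc -fopenmp hello.c -o helloC     # C
$ mpif90 -fopenmp hello.f90 -o helloF  # Fortran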
The codes can be executed in an interactive SLURM session or within a batch job. For example, terminal 2 shows an interactive session using 2 MPI processes on a node, each with 12 OpenMP threads:
$ salloc -N 1 -n 2 -c 12 -p debug -t 0:01:00
$ export OMP_NUM_THREADS=12
$ srun -n 2 -c 12 --export=all ./helloC
Hello from thread 0 out of 12 from process 0 out of 2
Hello from thread 8 out of 12 from process 0 out of 2
Hello from thread 10 out of 12 from process 0 out of 2
Hello from thread 6 out of 12 from process 0 out of 2
Hello from thread 9 out of 12 from process 0 out of 2
Hello from thread 3 out of 12 from process 0 out of 2
Hello from thread 1 out of 12 from process 0 out of 2
Hello from thread 2 out of 12 from process 0 out of 2
Hello from thread 4 out of 12 from process 0 out of 2
Hello from thread 7 out of 12 from process 0 out of 2
Hello from thread 5 out of 12 from process 0 out of 2
Hello from thread 11 out of 12 from process 0 out of 2
Hello from thread 0 out of 12 from process 1 out of 2
Hello from thread 4 out of 12 from process 1 out of 2
Hello from thread 10 out of 12 from process 1 out of 2
Hello from thread 7 out of 12 from process 1 out of 2
Hello from thread 8 out of 12 from process 1 out of 2
Hello from thread 1 out of 12 from process 1 out of 2
Hello from thread 5 out of 12 from process 1 out of 2
Hello from thread 6 out of 12 from process 1 out of 2
Hello from thread 9 out of 12 from process 1 out of 2
Hello from thread 3 out of 12 from process 1 out of 2
Hello from thread 11 out of 12 from process 1 out of 2
Hello from thread 2 out of 12 from process 1 out of 2
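The equivalent batch job is sketched below. The script mirrors the interactive example above; the partition name, walltime and binary name are taken from that example and would need to be adapted to the target system:

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=12
#SBATCH --partition=debug
#SBATCH --time=0:01:00

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun -n 2 -c 12 --export=all ./helloC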
Implementation of the toy problem
The parallel MPI+OpenMP version of the toy computational problem can be implemented with the use of the six MPI calls mentioned in the Message Passing Interface introduction, along with some basic OpenMP parallel constructs. This is illustrated in listing 3 for C and in listing 4 for Fortran.
Each process executes the same code. The main loop, which generates random points in the square and counts the hits that fall inside the circle, is parallelised with the OpenMP parallel construct across all threads available to each process. All processes then call the collective MPI_Reduce operation, which sums the partial counts and stores the final result in the process with rank 0. The result is then printed out by the process with rank 0.
Note that the call to the lcgrandom routine is placed inside an OpenMP critical section, because the routine is not thread-safe (it operates on global shared variables). Critical sections can decrease the overall performance of a multithreaded code; they are used here only for demonstration purposes. One alternative for this particular example would be to replace the call to lcgrandom with a thread-safe random number generator, such as rand_r() in C (for another example, see the OpenMP page).
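As a rough illustration only (this is not part of the original listings), the random-point loop of listing 3 below could be rewritten along the following lines, giving each thread its own rand_r() seed so that the critical section is no longer needed; the seeding scheme is an arbitrary choice for this sketch:

#include <stdlib.h>  /* rand_r() and RAND_MAX (POSIX) */

/* Drop-in replacement for the OpenMP region of listing 3:
   each thread draws from its own rand_r() stream, so no
   critical section is required around the random draws. */
#pragma omp parallel
{
    /* per-thread seed derived from the MPI rank and thread id */
    unsigned int myseed = (unsigned int)(rank + 1) * 1000u
                          + (unsigned int)omp_get_thread_num();
    #pragma omp for private(x,y) reduction(+:Ncirc)
    for (i = 0; i < my_trials; i++) {
        x = (double)rand_r(&myseed) / (double)RAND_MAX;
        y = (double)rand_r(&myseed) / (double)RAND_MAX;
        if ((x*x + y*y) <= r2) Ncirc++;
    }
}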
/* Compute pi using hybrid MPI/OpenMP */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

// Random number generator -- and not a very good one, either!
static long MULTIPLIER = 1366;
static long ADDEND = 150889;
static long PMOD = 714025;
long random_last = 0;

// This is not a thread-safe random number generator
double lcgrandom() {
    long random_next;
    random_next = (MULTIPLIER * random_last + ADDEND)%PMOD;
    random_last = random_next;
    return ((double)random_next/(double)PMOD);
}

static long num_trials = 1000000;

int main(int argc, char **argv) {
    long i;
    long Ncirc = 0;
    double pi, x, y;
    double r = 1.0; // radius of circle
    double r2 = r*r;
    int rank, size, manager = 0;
    MPI_Status status;
    long my_trials, temp;
    int provided;

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    my_trials = num_trials/size;
    if (num_trials%(long)size > (long)rank) my_trials++;
    random_last = rank;

    #pragma omp parallel
    {
        #pragma omp for private(x,y) reduction(+:Ncirc)
        for (i = 0; i < my_trials; i++) {
            #pragma omp critical (randoms)
            {
                x = lcgrandom();
                y = lcgrandom();
            }
            if ((x*x + y*y) <= r2) Ncirc++;
        }
    }

    MPI_Reduce(&Ncirc, &temp, 1, MPI_LONG, MPI_SUM, manager, MPI_COMM_WORLD);
    if (rank == manager) {
        Ncirc = temp;
        pi = 4.0 * ((double)Ncirc)/((double)num_trials);
        printf("\n \t Computing pi using hybrid MPI/OpenMP: \n");
        printf("\t For %ld trials, pi = %f\n", num_trials, pi);
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}
! Compute pi hybrid MPI/OpenMP
! Pseudorandom number generator
! (and a bad one at that)
module lcgenerator
  integer*8, save :: random_last = 0
contains

  subroutine seed(s)
    integer :: s
    random_last = s
  end subroutine

  real function lcgrandom()
    integer*8, parameter :: MULTIPLIER = 1366
    integer*8, parameter :: ADDEND = 150889
    integer*8, parameter :: PMOD = 714025
    integer*8 :: random_next = 0
    random_next = mod((MULTIPLIER * random_last + ADDEND), PMOD)
    random_last = random_next
    lcgrandom = (1.0*random_next)/PMOD
    return
  end function

end module lcgenerator

program darts
  use lcgenerator
  implicit none
  include 'mpif.h'
  integer :: num_trials = 1000000, i = 0, Ncirc = 0
  real :: pi = 0.0, x = 0.0, y = 0.0, r = 1.0
  real :: r2 = 0.0
  integer :: rank, np, manager = 0
  integer :: provided, mpistatus, mpierr, j
  integer :: my_trials, temp

  call MPI_Init_thread(MPI_THREAD_MULTIPLE, provided, mpierr)
  call MPI_Comm_size(MPI_COMM_WORLD, np, mpierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, mpierr)

  r2 = r*r
  my_trials = num_trials/np
  if (mod(num_trials, np) .gt. rank) then
    my_trials = my_trials+1
  end if
  call seed(rank)

!$OMP parallel private(x,y) reduction(+:Ncirc)
!$OMP do
  do i = 1, my_trials
!$OMP critical (randoms)
    x = lcgrandom()
    y = lcgrandom()
!$OMP end critical (randoms)
    if ((x*x + y*y) .le. r2) then
      Ncirc = Ncirc+1
    end if
  end do
!$OMP end parallel

  call MPI_Reduce(Ncirc, temp, 1, MPI_INTEGER, MPI_SUM, &
    manager, MPI_COMM_WORLD, mpierr)
  if (rank .eq. manager) then
    Ncirc = temp
    pi = 4.0*((1.0*Ncirc)/(1.0*num_trials))
    print*, ' '
    print*, ' Computing pi using hybrid MPI/OpenMP:'
    print*, ' For ', num_trials, ' trials, pi = ', pi
    print*, ' '
  end if
  call MPI_Finalize(mpierr)
end
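Note that listings 3 and 4 initialise MPI with MPI_Init_thread rather than MPI_Init, requesting MPI_THREAD_MULTIPLE support. Since all MPI calls in this example are made outside of the OpenMP parallel region, the weaker MPI_THREAD_FUNNELED level would also be sufficient. In either case it is good practice to check the thread-support level actually provided by the library, as in this C sketch:

int provided;
MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
if (provided < MPI_THREAD_FUNNELED) {
    /* the library cannot guarantee the requested level of thread support */
    fprintf(stderr, "Insufficient MPI thread support (provided = %d)\n", provided);
    MPI_Abort(MPI_COMM_WORLD, 1);
}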
Related pages
- Introduction to CUDA
- Introduction to HIP
For detailed information on how to compile MPI+X software on Pawsey systems, see Compiling.
External links
- SC13 Tutorial: Hybrid MPI and OpenMP Parallel Programming
- Hybrid MPI/OpenMP Programming (PDF), a presentation by Yun He at NERSC (2013)
- Multi GPU Programming with MPI and CUDA (PDF), a presentation by Jiri Kraus and Peter Messmer from Nvidia (2014)
- Using OpenACC with MPI (PDF), a tutorial by Nvidia PGI (2017)