MPI+X

The MPI+X approach combines the Message Passing Interface (MPI) for expressing internode parallelism with another parallel programming model for intranode parallelism or acceleration. MPI+X is the recommended parallelisation strategy on current HPC systems.

Prerequisites

This page builds on the Message Passing Interface and OpenMP pages.

A very short introduction to MPI+X

The recommended parallelisation strategy for current HPC systems is to use the MPI+X approach, which combines the Message Passing Interface for expressing internode parallelism with another parallel programming model (for example, a threading model) for intranode parallelism or acceleration. One of the most popular MPI+X models is MPI+OpenMP, where OpenMP is used to express intranode parallelism. Another example is combining the MPI parallel code with a GPU programming model, such as CUDA or HIP, to make use of the accelerators available within the compute nodes.

There are a number of benefits in using the MPI+X model:

  • Reducing the total number of MPI ranks used in the parallel execution
  • Obtaining better performance due to a more appropriate expression of intranode parallelism
  • Allowing a more convenient way of expressing SIMD parallelism at the loop or function level (see the sketch after this list)
  • Optimising existing GPU/OpenMP codes to achieve multi-node scalability
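
The sketch below illustrates the SIMD point: within each thread, OpenMP's simd construct asks the compiler to vectorise a loop. It is purely illustrative and not part of the listings on this page; the function and variable names are hypothetical.

/* Sketch only: the simd construct requests vectorisation of the loop.
   n, a, x and y are hypothetical variables used for illustration. */
void saxpy(int n, float a, const float *x, float *y) {
  #pragma omp simd
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}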

"Hello world" programs

Listing 1 shows a simple "hello world" program written in C that uses the basic MPI and OpenMP functions. Listing 2 shows the same program written in Fortran. Each MPI process spawns additional threads for the execution of a parallel region. Note how the value of rank will be different for each process, and how the value of id will be different for each thread of the same process.

Listing 1. MPI+OpenMP "hello world" program in C
#include <stdio.h>
#include "mpi.h"
#include <omp.h>

int main(int argc, char *argv[]) {
  int rank, size;
  int id = 0, nthreads = 1;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  #pragma omp parallel default(shared) private(id, nthreads)
  {
    nthreads = omp_get_num_threads();
    id = omp_get_thread_num();
    printf("Hello from thread %d out of %d from process %d out of %d\n",
           id, nthreads, rank, size);
  }

  MPI_Finalize();
  return 0;
}
Listing 2. MPI+OpenMP "hello world" program in Fortran
program hello
 
  use mpi
  implicit none
 
  integer :: ifail
  integer :: rank, size
  integer :: id, nthreads, omp_get_thread_num, omp_get_num_threads
 
  call mpi_init(ifail)
 
  call mpi_comm_rank(MPI_COMM_WORLD, rank, ifail)
  call mpi_comm_size(MPI_COMM_WORLD, size, ifail)
 
!$omp parallel private(id,nthreads)
    id = omp_get_thread_num()
    nthreads = omp_get_num_threads()
    write (*,*) 'Hello from thread', id, " out of ", nthreads, " from process ", rank, " out of ", size
!$omp end parallel
 
  call mpi_finalize(ifail)
 
end program hello
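
Both hello-world programs initialise MPI with MPI_Init, which is sufficient here because no MPI calls are made inside the OpenMP parallel region. When threads may call MPI routines, the library should instead be initialised with MPI_Init_thread, requesting one of the thread-support levels (MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, MPI_THREAD_SERIALIZED or MPI_THREAD_MULTIPLE) and checking the level actually provided. The C fragment below is a minimal sketch of how the MPI_Init call in listing 1 could be replaced; listing 3 further down uses the same routine.

  /* Sketch: request thread support and verify what the MPI library provides */
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  if (provided < MPI_THREAD_FUNNELED) {
    fprintf(stderr, "MPI library provides insufficient thread support\n");
    MPI_Abort(MPI_COMM_WORLD, 1);
  }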

These codes can be compiled on a Cray supercomputer in the following way, assuming the compiler wrappers are set up to use the GNU Compiler Collection:

Terminal 1. Compile MPI+OpenMP code
$ cc -fopenmp hello.c -o helloC     # C

$ ftn -fopenmp hello.f90 -o helloF  # Fortran

The codes can be executed in an interactive SLURM session or within a batch job. For example, terminal 2 shows an interactive session using 2 MPI processes on a node, each with 12 OpenMP threads (a sample batch script is given after the terminal output):

Terminal 2. Run MPI+OpenMP code
$ salloc -N 1 -n 2 -c 12 -p debug -t 0:01:00

$ export OMP_NUM_THREADS=12

$ srun -n 2 -c 12 --export=all ./helloC
Hello from thread 0 out of 12 from process 0 out of 2
Hello from thread 8 out of 12 from process 0 out of 2
Hello from thread 10 out of 12 from process 0 out of 2
Hello from thread 6 out of 12 from process 0 out of 2
Hello from thread 9 out of 12 from process 0 out of 2
Hello from thread 3 out of 12 from process 0 out of 2
Hello from thread 1 out of 12 from process 0 out of 2
Hello from thread 2 out of 12 from process 0 out of 2
Hello from thread 4 out of 12 from process 0 out of 2
Hello from thread 7 out of 12 from process 0 out of 2
Hello from thread 5 out of 12 from process 0 out of 2
Hello from thread 11 out of 12 from process 0 out of 2
Hello from thread 0 out of 12 from process 1 out of 2
Hello from thread 4 out of 12 from process 1 out of 2
Hello from thread 10 out of 12 from process 1 out of 2
Hello from thread 7 out of 12 from process 1 out of 2
Hello from thread 8 out of 12 from process 1 out of 2
Hello from thread 1 out of 12 from process 1 out of 2
Hello from thread 5 out of 12 from process 1 out of 2
Hello from thread 6 out of 12 from process 1 out of 2
Hello from thread 9 out of 12 from process 1 out of 2
Hello from thread 3 out of 12 from process 1 out of 2
Hello from thread 11 out of 12 from process 1 out of 2
Hello from thread 2 out of 12 from process 1 out of 2
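
The same run can also be submitted as a batch job with sbatch. The script below is only a sketch: the partition name and any project or module settings are system-specific and need to be adapted.

#!/bin/bash -l
#SBATCH --job-name=hello-hybrid
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=12
#SBATCH --time=00:01:00
#SBATCH --partition=debug    # placeholder: use a partition available on your system

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

srun -n ${SLURM_NTASKS} -c ${OMP_NUM_THREADS} --export=all ./helloC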

Implementation of the toy problem

The parallel MPI+OpenMP version of the toy computational problem can be implemented with the use of the six MPI calls mentioned in the Message Passing Interface introduction, along with some basic OpenMP parallel constructs. This is illustrated in listing 3 for C and in listing 4 for Fortran.

Each process executes the same code. The main loop, which generates random points in the square and counts the good hits, is parallelised with the OpenMP parallel and worksharing constructs across all threads available to the process. All processes then call the collective MPI_Reduce operation, which sums the partial counts and stores the final result in the process with rank 0. The result is then printed out by the process with rank 0.

Note that the call to the lcgrandom routine is placed inside an OpenMP critical section because the routine is not thread-safe (it operates on global shared variables). Critical sections can decrease the overall performance of multithreaded code; they are used here only for demonstration purposes. One alternative for this particular example is to replace the call to lcgrandom with a thread-safe random number generator such as rand_r() in C, as sketched below. (For another example see the OpenMP page.)
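
The fragment below is a sketch of such an alternative (it is not the version used in listing 3): each thread keeps its own rand_r() seed, so the critical section disappears. rand_r() is a POSIX function declared in <stdlib.h>; the variables my_trials, r2, Ncirc and rank are assumed to be the same as in listing 3.

#include <stdlib.h>   /* rand_r(), RAND_MAX -- add near the other includes */

#pragma omp parallel reduction(+:Ncirc)
{
  /* one independent seed per thread, derived from the MPI rank and thread id */
  unsigned int seed = (unsigned int)(rank * omp_get_num_threads()
                                     + omp_get_thread_num() + 1);
  double x, y;
  long i;

  #pragma omp for
  for (i = 0; i < my_trials; i++) {
    x = (double)rand_r(&seed) / (double)RAND_MAX;
    y = (double)rand_r(&seed) / (double)RAND_MAX;
    if (x*x + y*y <= r2)
      Ncirc++;
  }
}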

Listing 3. MPI+OpenMP toy in C
/* Compute pi using hybrid MPI/OpenMP */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

// Random number generator -- and not a very good one, either!

static long MULTIPLIER = 1366;
static long ADDEND = 150889;
static long PMOD = 714025;
long random_last = 0;

// This is not a thread-safe random number generator

double lcgrandom() {
  long random_next;
  random_next = (MULTIPLIER * random_last + ADDEND)%PMOD;
  random_last = random_next;

  return ((double)random_next/(double)PMOD);
}

static long num_trials = 1000000;

int main(int argc, char **argv) {
  long i;
  long Ncirc = 0;
  double pi, x, y;
  double r = 1.0; // radius of circle
  double r2 = r*r;

  int rank, size, manager = 0;
  MPI_Status status;
  long my_trials, temp;
  int provided;

  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  my_trials = num_trials/size;
  if (num_trials%(long)size > (long)rank) my_trials++;
  random_last = rank;

#pragma omp parallel
{
#pragma omp for private(x,y) reduction(+:Ncirc)
  for (i = 0; i < my_trials; i++) {
#pragma omp critical (randoms)
{
    x = lcgrandom();
    y = lcgrandom();
}

    if ((x*x + y*y) <= r2)
      Ncirc++;
  }
}

  MPI_Reduce(&Ncirc, &temp, 1, MPI_LONG, MPI_SUM, manager, MPI_COMM_WORLD);

  if (rank == manager) {
    Ncirc = temp;
    pi = 4.0 * ((double)Ncirc)/((double)num_trials);
    printf("\n \t Computing pi using hybrid MPI/OpenMP: \n");
    printf("\t For %ld trials, pi = %f\n", num_trials, pi);
    printf("\n");
  } 
  MPI_Finalize();
  return 0;
}
Listing 4. MPI+OpenMP toy in Fortran
! Compute pi using hybrid MPI/OpenMP

! Pseudorandom number generator
! (and a bad one at that)
        module lcgenerator
          integer*8, save :: random_last = 0
          contains
 
          subroutine seed(s)
            integer :: s
            random_last = s
          end subroutine
 
          real function lcgrandom()
            integer*8, parameter :: MULTIPLIER = 1366
            integer*8, parameter :: ADDEND = 150889
            integer*8, parameter :: PMOD = 714025
   
            integer*8 :: random_next = 0
            random_next = mod((MULTIPLIER * random_last + ADDEND), PMOD)
            random_last = random_next
            lcgrandom = (1.0*random_next)/PMOD
            return
          end function
        end module lcgenerator

        program darts
          use lcgenerator
          implicit none
          include 'mpif.h'
          integer :: num_trials = 1000000, i = 0, Ncirc = 0
          real :: pi = 0.0, x = 0.0, y = 0.0, r = 1.0
          real :: r2 = 0.0

          integer :: rank, np, manager = 0
          integer :: provided, mpistatus, mpierr, j
          integer :: my_trials, temp

          call MPI_Init_thread(MPI_THREAD_MULTIPLE, provided, mpierr)
          call MPI_Comm_size(MPI_COMM_WORLD, np, mpierr)
          call MPI_Comm_rank(MPI_COMM_WORLD, rank, mpierr)
          r2 = r*r
          my_trials = num_trials/np
          if (mod(num_trials, np) .gt. rank) then
            my_trials = my_trials+1
          end if
          call seed(rank)

!$OMP parallel private(x,y) reduction(+:Ncirc)
!$OMP do 
          do i = 1, my_trials
!$OMP critical (randoms)
            x = lcgrandom()
            y = lcgrandom()
!$OMP end critical (randoms)
            if ((x*x + y*y) .le. r2) then
              Ncirc = Ncirc+1
            end if
          end do
!$OMP end parallel

          call MPI_Reduce(Ncirc, temp, 1, MPI_INTEGER, MPI_SUM, &
                          manager, MPI_COMM_WORLD, mpierr)

          if (rank .eq. manager) then
              Ncirc = temp
              pi = 4.0*((1.0*Ncirc)/(1.0*num_trials))
              print*, '     '
              print*, '     Computing pi using hybrid MPI/OpenMP:'
              print*, '     For ', num_trials, ' trials, pi = ', pi
              print*, '     '
          end if
          call MPI_Finalize(mpierr)
        end

Related pages

  • Introduction to CUDA
  • Introduction to HIP
  • For detailed information on how to compile MPI+X software on Pawsey systems, see Compiling.

External links