OpenMP

Open Multi-Processing (OpenMP) is an Application Programming Interface (API) for developing parallel applications on shared memory systems.

A very short introduction to OpenMP

OpenMP is an API developed by a group of major hardware and software vendors. It is supported by most compiler families and is a popular, scalable tool for developing parallel applications on shared memory systems. The OpenMP API supports the C/C++ and Fortran programming languages.

All compilers available on Pawsey systems fully support version 4.5 of the OpenMP standard.

OpenMP was developed to support applications on shared memory systems. The main idea is that a single process can spawn multiple threads on the cores available within a compute node. These threads can jointly work on parallel regions of the code; for example, they can each execute a subset of the iterations of a given parallel loop. The OpenMP programming model is based mainly on compiler directives: single-line instructions inserted into the code and interpreted by the compiler.

The main advantages of OpenMP are:

  • step-by-step parallelisation approach: OpenMP enables incremental parallelisation, which can start with the most time-critical parts of the code
  • minor changes to the code: the code size grows only modestly
  • single code base: a single source serves both the OpenMP and non-OpenMP versions of the code; compilers simply ignore the OpenMP directives when the code is compiled without OpenMP support
  • easy-to-read code: the expression of parallelism flows clearly

The lifetime of a parallel OpenMP code is schematically depicted in figure 1.


Figure 1. Lifetime of a parallel OpenMP code

The primary OpenMP construct is the parallel directive, which forms a team of threads and starts the parallel execution. It is most commonly used in conjunction with the for / do construct, which specifies that the iterations of the associated loop or loops will be executed in parallel by the threads in the team. Listing 1 shows the C syntax for the parallel directive and listing 2 shows the Fortran syntax.

Listing 1. OpenMP parallel construct in C
#pragma omp parallel [clause[[,]clause] ...]
{
  structured-block
}


#pragma omp parallel for [clause[[,]clause] ...]
for(int i=0; i<n; i++) 
{
  ...
}

Listing 2. OpenMP parallel construct in Fortran
!$omp parallel [clause[[,]clause] ...]
  structured-block
!$omp end parallel


!$omp parallel do [clause[[,]clause] ...]
  do i = 1, n
    ...
  end do
!$omp end parallel do

Most of the OpenMP parallel directives accept a set of clauses that specify multithreading and memory-management options. The most important clauses describe how variables and arrays are handled within the parallel region: for instance, a variable can be declared private, so that each thread has its own copy, or shared, so that all threads access a single copy. The data-sharing attribute of every variable used in a parallel region should be examined carefully by the programmer to avoid data races.
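
As a quick illustrative sketch (not one of the numbered listings; the variable names n and scratch are hypothetical), the following C fragment declares n as shared and scratch as private:

#include <stdio.h>

int main() {
  int n = 8;      /* shared: all threads read the same copy */
  int scratch;    /* private: each thread works on its own copy */

#pragma omp parallel for shared(n) private(scratch)
  for (int i = 0; i < n; i++) {
    scratch = i * i;  /* safe: no thread touches another thread's copy */
    printf("iteration %d squared is %d\n", i, scratch);
  }
  return 0;
}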

The OpenMP standard also includes a small set of library routines for common tasks, such as omp_get_thread_num, which returns the ID of the calling thread within the team, and omp_get_num_threads, which returns the number of threads executing the current parallel region. Refer to the "hello world" example below for details.

The number of threads used by an OpenMP code is typically decided prior to execution by setting the OMP_NUM_THREADS environment variable. This is also presented in the "hello world" example below.

"Hello world" programs

Listing 3 shows a simple "hello world" program written in C that uses the OpenMP parallel construct and two OpenMP library calls. Listing 4 shows the same program written in Fortran. In both versions the master thread spawns additional threads to execute the parallel region. Note how the value of the id variable (the thread's ID within the team) is different for each thread.

Listing 3. OpenMP hello world in C
#include <omp.h>
#include <stdio.h>

int main() {

#pragma omp parallel
  {
    int id = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    printf("Hello from thread %d of %d\n", id, nthreads);
  }
  return 0;
}

Listing 4. OpenMP hello world in Fortran
program hello90
  use omp_lib
  implicit none
  integer :: id, nthreads
!$omp parallel private(id, nthreads)
  id = omp_get_thread_num()
  nthreads = omp_get_num_threads()
  write (*,*) 'Hello from thread ', id, ' of ', nthreads
!$omp end parallel
end program hello90

These codes can be compiled on a Cray supercomputer as follows, assuming the cc and ftn compiler wrappers invoke the GNU compiler collection:

Terminal 1. Compile OpenMP code
$ cc -fopenmp hello.c -o helloC     # C

$ ftn -fopenmp hello.f90 -o helloF  # Fortran

The codes can be executed in an interactive SLURM session or within a batch job. In both cases the srun launcher is required to start the program on the allocated node; the OpenMP runtime then spawns the threads. For example, for an interactive session (a sketch of an equivalent batch script follows the interactive example):

Terminal 2. Run OpenMP code
$ salloc -N 1 -n 1 -c 4 -p debug -t 0:01:00

$ export OMP_NUM_THREADS=4

$ srun -n 1 -c 4 ./helloF  # Fortran
 Hello from thread           3  of            4
 Hello from thread           0  of            4
 Hello from thread           2  of            4
 Hello from thread           1  of            4
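
For a batch job, a minimal sketch of an equivalent SLURM script (the partition name, time limit and binary name are illustrative, taken from the interactive example above) could look like this:

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --partition=debug
#SBATCH --time=0:01:00

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun -n 1 -c 4 ./helloF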

Implementation of the toy problem

The parallel OpenMP implementation of the toy computational problem is illustrated in listing 5 for C and in listing 6 for Fortran.

The main loop, which generates random points in the square and counts the hits inside the circle, is parallelised across all available threads with the OpenMP parallel and for/do constructs. At the end of the parallel region a collective reduction (summation) is applied to the partial counts of the threads (the Ncirc variable). The overall result is printed out by the master thread.

Note that the call to the lcgrandom routine is placed inside an OpenMP critical section, since the routine is not thread-safe (it operates on the global shared variable random_last). Critical sections serialise execution and may therefore decrease the overall performance of a multithreaded code; the construct is used here only for demonstration purposes. A sketch of one possible remedy, giving each thread its own generator state, follows listing 6.

Listing 5. OpenMP toy in C
/* Compute pi using OpenMP */
#include <omp.h>
#include <stdio.h>

// Random number generator -- and not a very good one, either!

static long MULTIPLIER = 1366;
static long ADDEND = 150889;
static long PMOD = 714025;
long random_last = 0;

// This is not a thread-safe random number generator

double lcgrandom() {
  long random_next;
  random_next = (MULTIPLIER * random_last + ADDEND)%PMOD;
  random_last = random_next;

  return ((double)random_next/(double)PMOD);
}

static long num_trials = 1000000;

int main(int argc, char **argv) {
  long i;
  long Ncirc = 0;
  double pi, x, y;
  double r = 1.0; // radius of circle
  double r2 = r*r;
#pragma omp parallel
{
#pragma omp for private(x,y) reduction(+:Ncirc)
  for (i = 0; i < num_trials; i++) {
#pragma omp critical (randoms)
{
    x = lcgrandom();
    y = lcgrandom();
}
    if ((x*x + y*y) <= r2)
      Ncirc++;
  }
}

  pi = 4.0 * ((double)Ncirc)/((double)num_trials);
  printf("\n \t Computing pi using OpenMP: \n");
  printf("\t For %ld trials, pi = %f\n", num_trials, pi);
  printf("\n");

  return 0;
}

Listing 6. OpenMP toy in Fortran
! Compute pi using OpenMP
! First, the pseudorandom number generator
        real function lcgrandom()
          integer*8, parameter :: MULTIPLIER = 1366
          integer*8, parameter :: ADDEND = 150889
          integer*8, parameter :: PMOD = 714025
          integer*8, save :: random_last = 0

          integer*8 :: random_next = 0
          random_next = mod((MULTIPLIER * random_last + ADDEND), PMOD)
          random_last = random_next
          lcgrandom = (1.0*random_next)/PMOD
          return
        end

! Now, we compute pi
        program darts
          implicit none
          integer*8 :: num_trials = 1000000, i = 0, Ncirc = 0
          real :: pi = 0.0, x = 0.0, y = 0.0, r = 1.0
          real :: r2 = 0.0
          real :: lcgrandom
          r2 = r*r

!$OMP parallel private(x,y) reduction(+:Ncirc)
!$OMP do
          do i = 1, num_trials
!$OMP critical (randoms)
            x = lcgrandom()
            y = lcgrandom()
!$OMP end critical (randoms)
            if ((x*x + y*y) .le. r2) then
              Ncirc = Ncirc+1
            end if
          end do
!$OMP end parallel

          pi = 4.0*((1.0*Ncirc)/(1.0*num_trials))
          print*, '     '
          print*, '     Computing pi using OpenMP:         '
          print*, '     For ', num_trials, ' trials, pi = ', pi
          print*, '     '

        end
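
One way to remove the critical section is to give each thread its own copy of the generator state with the threadprivate directive. The following C fragment is only a minimal sketch of this idea; in particular, the per-thread seeding shown in the final comment is an assumption, not part of the original example:

#include <omp.h>

static long MULTIPLIER = 1366;
static long ADDEND = 150889;
static long PMOD = 714025;

/* Each thread keeps its own copy of random_last, so lcgrandom()
   can be called in parallel without a critical section. */
static long random_last = 0;
#pragma omp threadprivate(random_last)

double lcgrandom() {
  random_last = (MULTIPLIER * random_last + ADDEND) % PMOD;
  return ((double)random_last / (double)PMOD);
}

/* Inside the parallel region, seed each thread's copy differently,
   for example: random_last = omp_get_thread_num() + 1;  (hypothetical seeding) */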

To improve multithreaded performance in this particular example, another alternative is to replace the call to lcgrandom with a thread-safe random number generator, such as rand_r() in C. This is illustrated in listing 7. Thread-safe functions can be used within parallel regions without the danger of data races.

Within the parallel region, an individual seed value is created for each thread. The main program loop is divided between the available threads by the OpenMP for directive. The variables x and y are declared private within the loop, and a reduction operation is performed on the Ncirc variable: it collects the local values of Ncirc from all threads, sums them, and stores the final result in the master thread's copy of Ncirc.

Listing 7. OpenMP toy in C with the use of rand_r()
/* Compute pi using OpenMP */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>


static long num_trials = 1000000;

int main(int argc, char **argv) {
  long i;
  long Ncirc = 0;
  double pi, x, y;
  double r = 1.0; // radius of circle
  double r2 = r*r;
#pragma omp parallel
{
  unsigned int seed = omp_get_thread_num();
#pragma omp for private(x,y) reduction(+:Ncirc)
  for (i = 0; i < num_trials; i++) {
    x = (double)rand_r(&seed)/RAND_MAX;
    y = (double)rand_r(&seed)/RAND_MAX;
    if ((x*x + y*y) <= r2)
      Ncirc++;
  }
}

  pi = 4.0 * ((double)Ncirc)/((double)num_trials);
  printf("\n \t Computing pi using OpenMP: \n");
  printf("\t For %ld trials, pi = %f\n", num_trials, pi);
  printf("\n");

  return 0;
}

An implementation for Fortran is not given here, since there is no direct equivalent of the rand_r() function in Fortran. Although the random_number() Fortran subroutine is thread-safe, it contains a critical region that decreases the performance of the resulting OpenMP implementation. A scalable OpenMP version of the Fortran code would involve using external libraries.

Useful environment variables

When the OMP_DISPLAY_AFFINITY environment variable is set to TRUE, the OpenMP runtime prints formatted affinity information for all OpenMP threads the first time the code enters a parallel region, and again whenever the affinity of any thread changes for a subsequent parallel region. Further information can be found on the OMP_DISPLAY_AFFINITY page (external site).

Terminal 3. Displaying the affinity of an OpenMP code
$ export OMP_DISPLAY_AFFINITY=TRUE
$ srun ./test
CCE OMP: host nid001011 pid 239539 tid 239539 thread 0 affinity:  2-6 130-134
CCE OMP: host nid001011 pid 239539 tid 239539 thread 0 affinity:  2 130
CCE OMP: host nid001011 pid 239539 tid 239544 thread 5 affinity:  4 132
CCE OMP: host nid001011 pid 239539 tid 239540 thread 1 affinity:  2 130
CCE OMP: host nid001011 pid 239539 tid 239548 thread 9 affinity:  6 134
CCE OMP: host nid001011 pid 239539 tid 239546 thread 7 affinity:  5 133
CCE OMP: host nid001011 pid 239539 tid 239543 thread 4 affinity:  4 132
CCE OMP: host nid001011 pid 239539 tid 239541 thread 2 affinity:  3 131
CCE OMP: host nid001011 pid 239539 tid 239542 thread 3 affinity:  3 131
CCE OMP: host nid001011 pid 239539 tid 239545 thread 6 affinity:  5 133
CCE OMP: host nid001011 pid 239539 tid 239547 thread 8 affinity:  6 134

 	 Computing pi using OpenMP: 
	 For 1000000 trials, pi = 3.140232


Related pages

  • For detailed information on how to compile OpenMP software on Pawsey systems, see Compiling.
