Profiling with Arm Performance Reports

Arm Performance Reports is a commercial profiling tool that provides a high-level report on the performance of parallel programs. It can be used as a first step in understanding the overall performance of a parallel code.

Prerequisite knowledge

You should be familiar with writing, compiling and running parallel codes before starting to profile with Arm Performance Reports.

Introduction to Performance Reports

Arm Performance Reports is a tool from the Arm Forge product family available on Pawsey systems. For a running parallel code, it provides a way to gather summary information on its performance. The license covers profiling of x86-64 parallel applications.

Performance Reports can be used to generate a summary profile report for serial, MPI, OpenMP and mixed-mode executables. In contrast to Arm DDT and MAP, it is not driven by a graphical user interface; instead, it writes the report to a file. Users can choose between txt, html and csv report formats.
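As a sketch, the report format follows from the extension of the filename passed to the -o option (perf-report, srun and the profileme executable are introduced later on this page; 4 MPI ranks are assumed here):

```shell
# The -o extension selects the report format (sketch; 4 MPI ranks assumed)
perf-report -o profile.txt  srun -n 4 ./profileme   # plain-text summary
perf-report -o profile.html srun -n 4 ./profileme   # standalone HTML page
perf-report -o profile.csv  srun -n 4 ./profileme   # machine-readable metrics
```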

Arm Performance Reports is a very convenient tool for understanding the main performance characteristics of a code, and can provide answers to questions such as:

  • Is the code memory bound?
  • Is the code compute bound?
  • What percentage of time is spent in MPI communication?

The following example illustrates the steps to generate profile data for an application running on Setonix. The use of Arm Performance Reports to produce the report is then presented.

Profiling steps

The following is an overview of the process for using Arm Performance Reports to profile your program:

  1. Generate the profile data for the application.

  2. Use Arm Performance Reports to generate a report from the profile data.

Examples

In this section we provide a step-by-step introduction to Arm Performance Reports, working with the example C code profileme.c given below. The code is a parallel MPI implementation of the Pi-darts program introduced within the Intermediate Supercomputing training (external site).

Listing 1. The PI darts program.


/* Compute pi using the six basic MPI functions */
#include <mpi.h>
#include <stdio.h>
 
static long num_trials = 1000000;
static long MULTIPLIER = 1366;
static long ADDEND = 150889;
static long PMOD = 714025;
long random_last = 0;

double lcgrandom() {
  long random_next;
  random_next = (MULTIPLIER * random_last + ADDEND)%PMOD;
  random_last = random_next;
  return ((double)random_next/(double)PMOD);
}

int main(int argc, char **argv) {
  long i;
  long Ncirc = 0;
  double pi, x, y;
  double r = 1.0; // radius of circle
  double r2 = r*r;
 
  int rank, size, manager = 0;
  MPI_Status status;
  long my_trials, temp;
  int j;
 
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  my_trials = num_trials/size;
  if (num_trials%(long)size > (long)rank) my_trials++;
  random_last = rank;
 
  for (i = 0; i < my_trials; i++) {
    x = lcgrandom();
    y = lcgrandom();
    if ((x*x + y*y) <= r2)
      Ncirc++;
  }
 
  if (rank == manager) {
    for (j = 1; j < size; j++) {
      MPI_Recv(&temp, 1, MPI_LONG, j, j, MPI_COMM_WORLD, &status);
      Ncirc += temp;
    }
    pi = 4.0 * ((double)Ncirc)/((double)num_trials);
    printf("\n \t Computing pi using six basic MPI functions: \n");
    printf("\t For %ld trials, pi = %f\n", num_trials, pi);
    printf("\n");
  } else {
    MPI_Send(&Ncirc, 1, MPI_LONG, manager, rank, MPI_COMM_WORLD);
  }
  MPI_Finalize();
  return 0;
}
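
Before the linking step in Step 2, the example can be compiled to an object file with the compiler wrapper (a sketch; cc is the Cray wrapper available after loading the programming environment):

```shell
# Compile only; linking against the profiler libraries happens in Step 2
cc -c profileme.c -o profileme.o
```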


Step 1: Generate the MAP MPI wrapper libraries

In order to properly profile an MPI application, MPI wrapper libraries must be generated. You only need to do this once for a given MPI implementation. In a suitable location, issue the following commands.

Terminal 1. Generating MPI wrapper libraries
$ module load arm-forge/21.1.2
$ make-profiler-libraries
Creating Cray shared libraries in /software/projects/projcode/rsrchr/map-libs
Created the libraries:
   libmap-sampler.so       (and .so.1, .so.1.0, .so.1.0.0)
   libmap-sampler-pmpi.so  (and .so.1, .so.1.0, .so.1.0.0)
 
To instrument a program, add these compiler options:
   compilation for use with MAP - not required for Performance Reports:
      -g (or '-G2' for native Cray Fortran) (and -O3 etc.)
   linking (both MAP and Performance Reports):
      -dynamic -L/software/projects/projcode/rsrchr/map-libs -lmap-sampler-pmpi -lmap-sampler -Wl,--eh-frame-hdr -Wl,-rpath=/software/projects/projcode/rsrchr/map-libs
 
Note: These libraries must be on the same NFS/Lustre/GPFS filesystem as your
program.
 
Before running your program (interactively or from a queue), set
LD_LIBRARY_PATH:
   export LD_LIBRARY_PATH=/software/projects/projcode/rsrchr/map-libs:$LD_LIBRARY_PATH
   map  ...
or add -Wl,-rpath=/software/projects/projcode/rsrchr/map-libs when linking your program.

The instructions printed to the terminal describe the link-stage arguments required to build your code; make a record of this output. The command produces the profiling library files in the chosen directory:

Terminal 2. Generated libraries can be found in your working directory.
$ ls /software/projects/projcode/rsrchr/map-libs
conftest.err              libmap-sampler-pmpi.so.1.0    libmap-sampler.so.1
libmap-sampler-pmpi.so    libmap-sampler-pmpi.so.1.0.0  libmap-sampler.so.1.0
libmap-sampler-pmpi.so.1  libmap-sampler.so             libmap-sampler.so.1.0.0

By default, make-profiler-libraries generates shared libraries. Use the following command to generate static profiler libraries instead:

Terminal 3. How to generate static libraries.
$ module load arm-forge/21.1.2
$ make-profiler-libraries --lib-type=static

Step 2: Compile and link your MPI application

Generate the executable of your program by linking your application with the generated wrapper libraries. The code does not need to be recompiled with the -g option, which is only required by MAP. The linking step is the only part of the application build process that needs modification.

An example instance of the process running on Setonix is illustrated in Terminal 4.

Terminal 4. Linking your application to be profiled with MAP.
$ cc -dynamic -L/software/projects/projcode/rsrchr/map-libs -lmap-sampler-pmpi -lmap-sampler -Wl,--eh-frame-hdr -Wl,-rpath=/software/projects/projcode/rsrchr/map-libs -o profileme profileme.o

If your code uses Make or CMake for building, you will need to modify the link option settings in the build configuration accordingly.
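For a Make-based build, one hedged approach is to pass the extra link flags on the command line rather than editing the Makefile (the MAPLIBS variable and the use of LDFLAGS are assumptions; your Makefile may name its link-flags variable differently):

```shell
# Sketch: inject the profiler link flags into an existing Make build
MAPLIBS=/software/projects/projcode/rsrchr/map-libs
make LDFLAGS="-dynamic -L${MAPLIBS} -lmap-sampler-pmpi -lmap-sampler \
  -Wl,--eh-frame-hdr -Wl,-rpath=${MAPLIBS}"
```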

Step 3: Execute the code to generate profiling information

You can execute your Arm Performance Reports job in one of the two ways described in the following sections. In both cases, before running the perf-report executable you must set and export the MPICC variable to cc:

$ export MPICC=cc

Submit the job to the Slurm queueing system

Write a batch script to submit the profiling job to the scheduler. For instance, Listing 2 describes a 15-minute, single-node job which executes 4 processes in the debug partition on Setonix.

Listing 2. A sample batch script for a MAP job.
#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --partition=debug
#SBATCH --account=[your-project]
#SBATCH --time=0:15:00

module unload cray-libsci
module load arm-forge/21.1.2
export MPICC=cc
perf-report -o profile.html srun --export=all -n 4 ./profileme

Note that the -o option chooses the filename and, through its extension, the file format of the report. You can submit the script using the sbatch command.
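For example, assuming the script in Listing 2 has been saved as profile_job.slurm (a hypothetical filename):

```shell
sbatch profile_job.slurm   # submit the profiling job
squeue --me                # optionally, check its status in the queue
```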

Use the interactive session

Allocate a profiling session in the debug queue by running

$ salloc --nodes=1 --ntasks=4 --partition=debug --account=[your-project] --time=0:15:00 --export=none

You can now run the profiling job.

Terminal 5. Interactive session to profile a program.
$ module load arm-forge/21.1.2
$ perf-report -o profile.html srun --export=all -n 4 ./profileme

Step 4: View the report

A successful execution will produce the profile.html file in the working directory. Copy this file to your local system and open it with your favourite web browser to view the report generated for the profileme.c example code.
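One way to copy the report is with scp, run from your local machine (the username, hostname and remote path below are placeholders):

```shell
# Fetch the report from the remote run directory to the current directory
scp username@setonix.pawsey.org.au:/path/to/rundir/profile.html .
```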


Next steps

Arm Performance Reports provides a great deal of functionality for analysing different aspects of an application's performance, and not all of it is covered by the simple example above. For further information, consult the Arm Performance Reports documentation.


Note: Arm Performance Reports does not support AMD GPUs; see Profiling GPU-Enabled codes for alternative tools.

Related pages