Profiling with ARM MAP

ARM MAP is a commercial profiling tool, and the recommended method of parallel profiling on Pawsey supercomputing systems. It provides a graphical user interface and remote client for analysing profiling information.

Prerequisite knowledge

Before profiling with ARM MAP, you should be familiar with writing, compiling and running parallel codes.


The ARM Forge licence supports a total of 1024 running processes (tasks) at a time, shared across all users. For instance, if user A is debugging a 512-task job and users B and C are each profiling a 256-task job, the licence is fully used and no other user can start a debugging or profiling job.
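
The arithmetic behind this example can be sketched as a small shell function. The limit of 1024 comes from the text above; the function name remaining_tasks is hypothetical and is not part of ARM Forge.

```shell
# Hypothetical helper: how many licensed tasks remain, given the task
# counts of the jobs currently running under ARM Forge.
LICENCE_LIMIT=1024   # total running processes allowed by the licence

remaining_tasks() {
  used=0
  for n in "$@"; do
    used=$((used + n))
  done
  echo $((LICENCE_LIMIT - used))
}

# User A debugs 512 tasks; users B and C profile 256 tasks each.
remaining_tasks 512 256 256   # prints 0: nothing left for another user
```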

Introduction to MAP

Arm MAP gathers detailed information about the performance of parallel code and presents it in a graphical user interface. Use Arm MAP to profile serial, MPI, OpenMP and mixed-mode executables.

There are two usage modes in Arm MAP:

  • Arm MAP Remote Client, executed on your local machine (laptop or desktop)
    In this mode the Remote Client connects to the login node of the compute system and reads the profiling data stored there. The data can then be analysed on your local machine.
  • Arm MAP GUI, executed directly on the login node
    In this mode the profiling session runs directly on the login node of the system.


Best Practice

Use the remote client mode when profiling with ARM MAP.

Profiling steps

The following is an overview of the process for using ARM MAP to profile your program:

  1. Generate the profile data for the application.

  2. Use the ARM MAP Remote Client to analyse the profile data.

Step-by-Step Example

In this section we will provide a step-by-step introduction to Arm MAP.

Step 1: Get the source code

This example profiles an MPI program which calculates the value of pi.

Create a file called darts-mpi.c with the following source code:

Listing 1. Computing PI using MPI.
/* Compute pi using the six basic MPI functions */
#include <mpi.h>
#include <stdio.h>
 
static long num_trials = 1000000;
static long MULTIPLIER = 1366;
static long ADDEND = 150889;
static long PMOD = 714025;
long random_last = 0;

double lcgrandom() {
  long random_next;
  random_next = (MULTIPLIER * random_last + ADDEND)%PMOD;
  random_last = random_next;
  return ((double)random_next/(double)PMOD);
}

int main(int argc, char **argv) {
  long i;
  long Ncirc = 0;
  double pi, x, y;
  double r = 1.0; // radius of circle
  double r2 = r*r;
 
  int rank, size, manager = 0;
  MPI_Status status;
  long my_trials, temp;
  int j;
 
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  my_trials = num_trials/size;
  if (num_trials%(long)size > (long)rank) my_trials++;
  random_last = rank;
 
  for (i = 0; i < my_trials; i++) {
    x = lcgrandom();
    y = lcgrandom();
    if ((x*x + y*y) <= r2)
      Ncirc++;
  }
 
  if (rank == manager) {
    for (j = 1; j < size; j++) {
      MPI_Recv(&temp, 1, MPI_LONG, j, j, MPI_COMM_WORLD, &status);
      Ncirc += temp;
    }
    pi = 4.0 * ((double)Ncirc)/((double)num_trials);
    printf("\n \t Computing pi using six basic MPI functions: \n");
    printf("\t For %ld trials, pi = %f\n", num_trials, pi);
    printf("\n");
  } else {
    MPI_Send(&Ncirc, 1, MPI_LONG, manager, rank, MPI_COMM_WORLD);
  }
  MPI_Finalize();
  return 0;
}
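
When reading the profile later, it helps to know how Listing 1 divides the work: each rank gets an even share of num_trials, and the first (num_trials % size) ranks get one extra trial. The following is a minimal shell sketch of that rule; the function name trials_for_rank is hypothetical and only mirrors the my_trials calculation in the listing.

```shell
# Number of trials assigned to one rank: an even share, plus one extra
# trial for the first (total % size) ranks, as in Listing 1.
trials_for_rank() {
  total=$1
  size=$2
  rank=$3
  share=$((total / size))
  if [ $((total % size)) -gt "$rank" ]; then
    share=$((share + 1))
  fi
  echo "$share"
}

# 10 trials over 4 ranks: ranks 0 and 1 get 3 trials, ranks 2 and 3 get 2.
for r in 0 1 2 3; do
  trials_for_rank 10 4 "$r"
done
```

With the default of 1000000 trials on 4 ranks the remainder is zero, so every rank performs 250000 trials.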

Step 2: Generate a MAP MPI wrapper library

This step needs to be performed only once for a given MPI implementation, so it should be done separately for each Pawsey system.

For example, on Setonix use "map-libs-setonix" in place of "map-libs" in the commands below.

Replace "projectname" and "username" with your Pawsey project code and username, and issue the following commands:

Terminal 1. Generating wrapper libraries.
$ mkdir /software/projects/projectname/username/map-libs
$ cd /software/projects/projectname/username/map-libs
$ module load forge
$ make-profiler-libraries
Creating Cray shared libraries in /software/projects/projectname/username/map-libs
Created the libraries:
   libmap-sampler.so       (and .so.1, .so.1.0, .so.1.0.0)
   libmap-sampler-pmpi.so  (and .so.1, .so.1.0, .so.1.0.0)

To instrument a program, add these compiler options:
   compilation for use with MAP - not required for Performance Reports:
      -g (or '-G2' for native Cray Fortran) (and -O3 etc.)
   linking (both MAP and Performance Reports):
      -dynamic -L/software/projects/projectname/username/map-libs -lmap-sampler-pmpi -lmap-sampler -Wl,--eh-frame-hdr -Wl,-rpath=/software/projects/projectname/username/map-libs

Note: These libraries must be on the same NFS/Lustre/GPFS filesystem as your program.

Before running your program (interactively or from a queue), set
LD_LIBRARY_PATH:
   export LD_LIBRARY_PATH=/software/projects/projectname/username/map-libs:$LD_LIBRARY_PATH
   map  ...
or add -Wl,-rpath=/software/projects/projectname/username/map-libs when linking your program.


The instructions printed to the terminal describe the link-stage arguments required to build your code for profiling. Make a record of this output. The command produces the profiling library files in the working directory:

Terminal 2. Displaying the location of the libraries.
$ ls /software/projects/projectname/username/map-libs
libmap-sampler-pmpi.so  libmap-sampler-pmpi.so.1  libmap-sampler-pmpi.so.1.0
libmap-sampler-pmpi.so.1.0.0  libmap-sampler.so


By default make-profiler-libraries will generate shared libraries. Use the following command to generate static profiler libraries:

$ module load forge
$ make-profiler-libraries --lib-type=static


Please note that separate profiler libraries need to be generated for each supercomputer you want to run the profiler on.
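
One way to keep the per-system libraries apart is to derive the directory name from the system name, following the "map-libs-setonix" convention mentioned earlier. This is only a sketch: the function name map_libs_dir is hypothetical, and "projectname" and "username" are the same placeholders used in Terminal 1.

```shell
# Hypothetical helper: build a per-system directory path for the MAP
# wrapper libraries, following the "map-libs-<system>" naming convention.
map_libs_dir() {
  project=$1
  user=$2
  system=$3
  echo "/software/projects/${project}/${user}/map-libs-${system}"
}

map_libs_dir projectname username setonix
# prints /software/projects/projectname/username/map-libs-setonix
```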

Step 3: Compile and link your MPI application

Generate the executable as follows:

  • compile with the -g option to retain the symbolic (debugging) information that MAP needs,
  • link your application with the link arguments reported by the make-profiler-libraries command.

This process is illustrated below for Setonix, in the directory containing the darts-mpi.c file:

Terminal 3. Compiling the code.
$ cc -g -c darts-mpi.c
$ cc -dynamic -L/software/projects/projectname/username/map-libs -lmap-sampler-pmpi -lmap-sampler -Wl,--eh-frame-hdr -Wl,-rpath=/software/projects/projectname/username/map-libs -o darts-mpi darts-mpi.o


Please note -G2 can be used instead of -g in PrgEnv-cray to allow a higher level of optimisation. (The MAP client will display a meaningful warning message if insufficient debugging information is available from the executable.)
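
If you keep the wrapper libraries in more than one directory, the link arguments can be assembled from the directory name rather than copied by hand. The sketch below simply reproduces the arguments printed by make-profiler-libraries in Terminal 1 for shared libraries; the function name map_link_flags is hypothetical, and the arguments printed on your system should take precedence if they differ.

```shell
# Hypothetical helper: print the link arguments reported by
# make-profiler-libraries (shared libraries) for a given directory.
map_link_flags() {
  libdir=$1
  echo "-dynamic -L${libdir} -lmap-sampler-pmpi -lmap-sampler -Wl,--eh-frame-hdr -Wl,-rpath=${libdir}"
}

map_link_flags /software/projects/projectname/username/map-libs
```

The link step from Terminal 3 could then be written as cc $(map_link_flags "$dir") -o darts-mpi darts-mpi.o, where dir holds the library directory.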

Step 4: Execute the code to generate profiling information

You can execute your profiling job in two ways (choose one of the methods described below):

Option 1: Submit the job to the Slurm scheduling system
Write a Slurm batch script:

Listing 2. The Slurm batch script.
#!/bin/bash --login 
#SBATCH --account=<project>
#SBATCH --partition=debug
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=1
#SBATCH --threads-per-core=1
#SBATCH --time=0:15:00

module load arm-forge/21.1.2
map --profile srun -n 4 ./darts-mpi

The above script describes a 15 minute single node job which executes 4 processes in the debug partition on Setonix. 

Note that we are using the --profile option for map. This causes the profiling data to be generated without opening the Arm MAP GUI.

Save the script as job.slurm and submit it to the scheduler:

$ sbatch job.slurm

Option 2: Use an interactive session
Allocate an interactive session in the debug partition by running:

$ salloc --nodes=1 --tasks-per-node=4 --partition=debug --account=<project> --time=0:15:00
salloc: Granted job allocation 2509285

You can now run the profiling job:

$ module load arm-forge/21.1.2
$ map --profile srun -n 4 ./darts-mpi

A successful execution produces a file with the .map extension in the working directory. This file contains all the profiling information. Its full name contains the name of the executable, the number of processes, nodes and threads, and a timestamp, e.g.:

$ ls *.map
darts-mpi_4p_1n_1t_2022-03-04_10-30.map
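
When several profiles accumulate in one directory, the fields encoded in the file name can be pulled apart with sed. This is a sketch only: the pattern is inferred from the single example name above and may not match names produced by other Arm Forge versions, and the function name parse_map_name is hypothetical.

```shell
# Hypothetical helper: split a MAP profile file name of the form
#   <executable>_<P>p_<N>n_<T>t_<date>_<time>.map
# into its fields. Prints nothing if the name does not match.
parse_map_name() {
  printf '%s\n' "$1" | sed -E -n \
    's/^(.+)_([0-9]+)p_([0-9]+)n_([0-9]+)t_([0-9-]+_[0-9-]+)\.map$/executable=\1 processes=\2 nodes=\3 threads=\4 timestamp=\5/p'
}

parse_map_name darts-mpi_4p_1n_1t_2022-03-04_10-30.map
# prints executable=darts-mpi processes=4 nodes=1 threads=1 timestamp=2022-03-04_10-30
```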

Step 5: Download and install Arm Forge Remote Client

Visit the Linaro Arm Forge download page and download the Arm Forge Remote Client (available for Windows, macOS and Linux).

Note that the version of the Remote Client needs to be compatible with the Arm Forge version available on the Pawsey system you are planning to use for profiling.
You may need to follow the 'older versions of Linaro Forge' button to download a version that is compatible with the Arm Forge version on Setonix.

Run the module avail arm-forge command to check which versions are available on a particular system, e.g.:


$ module avail arm-forge

--------------------------- /software/pawsey/modulefiles -----------------------------
arm-forge/21.1.2


Install a compatible version of the Arm Forge Remote Client by following the instructions in the installer.

When you run the client on your local machine, select the "ARM Map" tab on the left, then the "Configure" option from the "Remote Launch" menu. Choose "Add" and configure the remote launch settings. Settings for Setonix are shown in the screenshots below:

Choose "OK" to save the configuration.

Note that the correct Remote Installation Directory needs to be specified. The directory name might change with different OS and Arm Forge versions.

The optional remote script entry should be left blank. You can then try "Test Remote Launch", for which you will need to enter your password.

Note that "username" needs to be replaced with your Pawsey username.

Step 6: Execute the Remote Client and connect

Now start the Arm Forge Remote Client on your local machine and connect to the Pawsey system (select the correct option from the "Remote Launch" menu). When connected, select "Load Profile Data File" and choose the appropriate Arm MAP profile file.

 

Step 7: Analyse profiling information

The main Arm MAP profiler window should appear. You can navigate through the code and analyse its performance from different angles. The screenshot below presents the profile information for the example program. We can see that ~90% of the runtime is spent in the random number generator library call, mostly in memory accesses.

 

Next steps

Arm MAP provides a great deal of functionality to analyse different aspects of the application's performance. Not all of the available functionality is described in the example above. For further information, consult the Arm MAP user guide.

Note: Arm MAP does not support AMD GPUs. See Profiling GPU-Enabled codes for alternative tools.
