Profiling with ARM MAP
ARM MAP is a commercial profiling tool, and the recommended method of parallel profiling on Pawsey supercomputing systems. It provides a graphical user interface and remote client for analysing profiling information.
Prerequisite knowledge
You should be familiar with writing, compiling and running parallel codes to start profiling with ARM MAP.
The Arm Forge licence supports a total of 1024 running processes (tasks) at a time, shared across all users. For instance, if user A is debugging a 512-task job and users B and C are each profiling a 256-task job, the licence won't allow any other user to start a debugging or profiling job until one of those jobs finishes.
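The arithmetic behind this limit can be sketched in a few lines of shell. This is only an illustration: the function name `fits_licence` and the hard-coded job sizes are assumptions made for the example and are not part of Arm Forge; the 1024-process limit is the figure quoted above.

```shell
#!/bin/bash
# Illustrative only: checks whether a new job would fit under the licence limit.
LICENCE_LIMIT=1024   # total processes allowed at a time, as stated above

# fits_licence NEW_TASKS RUNNING_TASKS...
# Succeeds (exit status 0) if the new job plus all running jobs stay within the limit.
fits_licence() {
    local new_tasks=$1
    shift
    local total=$new_tasks
    local running
    for running in "$@"; do
        total=$((total + running))
    done
    [ "$total" -le "$LICENCE_LIMIT" ]
}

# Users A, B and C already hold 512 + 256 + 256 = 1024 processes,
# so even a single extra task does not fit.
if fits_licence 1 512 256 256; then
    echo "job can start"
else
    echo "licence limit reached"
fi
```

With the job sizes from the example above, the script reports that the licence limit has been reached.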
Introduction to MAP
Arm MAP gathers detailed information about the performance of parallel code and presents it in a graphical user interface. Use Arm MAP to profile serial, MPI, OpenMP and mixed-mode executables.
There are two usage modes in Arm MAP:
- Arm MAP Remote Client can be executed on the local machine (laptop or desktop).
  In this mode the Remote Client connects to the compute system's login node and reads the profiling data stored there. The data can then be analysed on the local machine.
- Arm MAP GUI can be executed directly on the login node.
  In this mode the profiling session runs directly on the login node of the system.
Best Practice
Profiling steps
The following is an overview of the process for using ARM MAP to profile your program:
- Generate the profile data for the application.
- Use the Arm MAP Remote Client to analyse the profile data.
Step-by-Step Example
In this section we will provide a step-by-step introduction to Arm MAP.
Step 1: Get the source code
This example profiles an MPI program which calculates the value of pi.
Create a file called darts-mpi.c with the following source code:
    /* Compute pi using the six basic MPI functions */
    #include <mpi.h>
    #include <stdio.h>

    static long num_trials = 1000000;
    static long MULTIPLIER = 1366;
    static long ADDEND = 150889;
    static long PMOD = 714025;
    long random_last = 0;

    double lcgrandom() {
        long random_next;
        random_next = (MULTIPLIER * random_last + ADDEND)%PMOD;
        random_last = random_next;
        return ((double)random_next/(double)PMOD);
    }

    int main(int argc, char **argv) {
        long i;
        long Ncirc = 0;
        double pi, x, y;
        double r = 1.0; // radius of circle
        double r2 = r*r;
        int rank, size, manager = 0;
        MPI_Status status;
        long my_trials, temp;
        int j;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        my_trials = num_trials/size;
        if (num_trials%(long)size > (long)rank) my_trials++;
        random_last = rank;

        for (i = 0; i < my_trials; i++) {
            x = lcgrandom();
            y = lcgrandom();
            if ((x*x + y*y) <= r2)
                Ncirc++;
        }

        if (rank == manager) {
            for (j = 1; j < size; j++) {
                MPI_Recv(&temp, 1, MPI_LONG, j, j, MPI_COMM_WORLD, &status);
                Ncirc += temp;
            }
            pi = 4.0 * ((double)Ncirc)/((double)num_trials);
            printf("\n \t Computing pi using six basic MPI functions: \n");
            printf("\t For %ld trials, pi = %f\n", num_trials, pi);
            printf("\n");
        } else {
            MPI_Send(&Ncirc, 1, MPI_LONG, manager, rank, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }
Step 2: Generate a MAP MPI wrapper library
This step needs to be performed only once for a given MPI implementation. Because MPI implementations differ between machines, it must be done separately for each Pawsey system.
For example, on Setonix use "map-libs-setonix" in place of "map-libs" in the commands below.
Replace "projectname" and "username" with your Pawsey project code and username, and issue the following commands:
    $ mkdir /software/projects/projectname/username/map-libs
    $ cd /software/projects/projectname/username/map-libs
    $ module load forge
    $ make-profiler-libraries
    Creating Cray shared libraries in /software/projects/projectname/username/map-libs
    Created the libraries:
       libmap-sampler.so (and .so.1, .so.1.0, .so.1.0.0)
       libmap-sampler-pmpi.so (and .so.1, .so.1.0, .so.1.0.0)

    To instrument a program, add these compiler options:
       compilation for use with MAP - not required for Performance Reports:
          -g (or '-G2' for native Cray Fortran) (and -O3 etc.)
       linking (both MAP and Performance Reports):
          -dynamic -L/software/projects/projectname/username/map-libs -lmap-sampler-pmpi -lmap-sampler -Wl,--eh-frame-hdr -Wl,-rpath=/software/projects/projectname/username/map-libs

    Note: These libraries must be on the same NFS/Lustre/GPFS filesystem as your program.

    Before running your program (interactively or from a queue), set LD_LIBRARY_PATH:
       export LD_LIBRARY_PATH=/software/projects/projectname/username/map-libs:$LD_LIBRARY_PATH
       map ...
    or add -Wl,-rpath=/software/projects/projectname/username/map-libs when linking your program.
The instructions output to the terminal describe the appropriate link stage arguments required to compile your code. Make a record of the output. The command will produce profiling library files in the working directory:
    $ ls /software/projects/projectname/username/map-libs
    libmap-sampler-pmpi.so  libmap-sampler-pmpi.so.1  libmap-sampler-pmpi.so.1.0  libmap-sampler-pmpi.so.1.0.0  libmap-sampler.so
By default make-profiler-libraries will generate shared libraries. Use the following command to generate static profiler libraries:
    $ module load forge
    $ make-profiler-libraries --lib-type=static
Please note that separate profiler libraries need to be generated for each supercomputer you want to run the profiler on.
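Because each system needs its own set of libraries, it can help to derive the directory name from the system name. The following is a minimal sketch, assuming the "map-libs-<system>" naming suggested above; the function `map_libs_dir` and its argument order are hypothetical, and "projectname" and "username" remain placeholders.

```shell
#!/bin/bash
# map_libs_dir PROJECT USER SYSTEM
# Prints a per-system directory for the MAP wrapper libraries, following the
# "map-libs-<system>" naming suggested above. The function itself is hypothetical.
map_libs_dir() {
    local project=$1 user=$2 system=$3
    printf '/software/projects/%s/%s/map-libs-%s\n' "$project" "$user" "$system"
}

# Example: the directory to create and change into on Setonix before
# running make-profiler-libraries.
map_libs_dir projectname username setonix
```

The printed path is the one to pass to mkdir and cd before running make-profiler-libraries on that system.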
Step 3: Compile and link your MPI application
Generate the executable by following instructions given below:
- use the -g compile option to retain symbolic information for compilation,
- link your application with the link arguments generated via the make-profiler-libraries command.
This process is illustrated below for Setonix, in the directory containing the darts-mpi.c file:
    $ cc -g -c darts-mpi.c
    $ cc -dynamic -L/software/projects/projectname/username/map-libs -lmap-sampler-pmpi -lmap-sampler -Wl,--eh-frame-hdr -Wl,-rpath=/software/projects/projectname/username/map-libs -o darts-mpi darts-mpi.o
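The link arguments repeat the library directory twice, which makes them easy to mistype. One way to avoid that is to assemble them from a single variable. This is a sketch only: the function name `map_link_flags` and the variable `MAP_LIBS` are hypothetical, and the flags are copied from the make-profiler-libraries output shown in Step 2.

```shell
#!/bin/bash
# map_link_flags DIR
# Prints the link arguments reported by make-profiler-libraries for the
# wrapper libraries stored in DIR. The function name is hypothetical.
map_link_flags() {
    local dir=$1
    printf '%s\n' "-dynamic -L${dir} -lmap-sampler-pmpi -lmap-sampler -Wl,--eh-frame-hdr -Wl,-rpath=${dir}"
}

# Intended use on the system itself (not executed here):
#   MAP_LIBS=/software/projects/projectname/username/map-libs
#   cc -g -c darts-mpi.c
#   cc $(map_link_flags "$MAP_LIBS") -o darts-mpi darts-mpi.o
map_link_flags /software/projects/projectname/username/map-libs
```

The unquoted command substitution in the commented usage relies on word splitting, so this approach assumes the library path contains no spaces.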
-G2 can be used instead of -g in PrgEnv-cray to allow a higher level of optimisation. (The MAP client will display a meaningful warning message if insufficient debugging information is available from the executable.)
Step 4: Execute the code to generate profiling information
You can execute your profiling job in two ways (choose one of the methods described below):
Option 1: Submit the job to the SLURM scheduling system
Write a Slurm batch script, for example job.slurm:
    #!/bin/bash --login
    #SBATCH --account=<project>
    #SBATCH --partition=debug
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=4
    #SBATCH --cpus-per-task=1
    #SBATCH --threads-per-core=1
    #SBATCH --time=0:15:00

    module load arm-forge/21.1.2
    map --profile srun -n 4 ./darts-mpi
The above script describes a 15 minute single node job which executes 4 processes in the debug partition on Setonix.
Note that we are using the --profile option for map. This causes the profiling data to be generated without using Arm MAP's GUI.
You can now submit the job to the scheduler:
$ sbatch job.slurm
Option 2: Use the interactive session
Allocate a profiling session in the debug partition by running:
    $ salloc --nodes=1 --tasks-per-node=4 --partition=debug --account=<project> --time=0:15:00
    salloc: Granted job allocation 2509285
You can now run the profiling job:
    $ module load arm-forge/21.1.2
    $ map --profile srun -n 4 ./darts-mpi
A successful execution will produce a file with .map extension in the working directory. This file contains all profiling information. The full name of the file contains the name of the executable, number of processes, nodes, threads and the timestamp, e.g.:
    $ ls *.map
    darts-mpi_4p_1n_1t_2022-03-04_10-30.map
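Because every run writes a new timestamped file, a working directory can accumulate several profiles. The sketch below picks the most recently modified .map file; the function name `latest_map_file` is hypothetical, and it selects by file modification time rather than by parsing the timestamp in the file name.

```shell
#!/bin/bash
# latest_map_file [DIR]
# Prints the most recently modified .map file in DIR (default: current directory).
# Returns a non-zero status if DIR contains no .map file.
latest_map_file() {
    local dir=${1:-.}
    local newest="" f
    for f in "$dir"/*.map; do
        [ -e "$f" ] || continue      # skip the unexpanded pattern when nothing matches
        if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
            newest=$f
        fi
    done
    [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

For example, `latest_map_file .` run in the job's working directory prints the profile from the most recent run, which is the file to open in the Remote Client.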
Step 5: Download and install Arm Forge Remote Client
Visit the Linaro Arm Forge download page and download the Arm Forge Remote Client (available for Windows, macOS and Linux).
Note that the version of the Remote Client needs to be compatible with the Arm Forge version available on the Pawsey system you are planning to use for profiling.
You may need to navigate to the 'older versions of Linaro Forge' button to download the version that is compatible with the Setonix ARM-Forge version.
Run the module avail arm-forge command to check which versions are available on the particular system, e.g.:
    $ module avail arm-forge
    --------------------------- /software/pawsey/modulefiles -----------------------------
    arm-forge/21.1.2
Install the correct/compatible version of the Arm Forge Remote Client by following instructions in the installer.
After starting the client on your local machine, select the "Arm MAP" tab on the left, then the "Configure" option from the "Remote Launch" menu. Choose "Add" and configure the remote launch settings. Settings for Setonix are shown in the screenshots below:
Choose "OK" to save the configuration.
Note that the correct Remote Installation Directory needs to be specified. The directory name might change with different OS and Arm Forge versions.
The optional remote script entry should be left blank. You can then try "Test Remote Launch", for which you will need to enter your password.
Note that "username" needs to be replaced with your Pawsey username.
Step 6: Execute the Remote Client and connect
Now start the Arm Forge Remote Client on your local machine and connect to the Pawsey system (select the correct option from the "Remote Launch" menu). When connected, select "Load Profile Data File" and choose the appropriate Arm MAP profile file.
Step 7: Analyse profiling information
The main Arm MAP profiler window should appear. You can navigate through the code and analyse its performance from different angles. The screenshot below presents the profile information for the example program. We can see that ~90% of the runtime is spent in the random number generator call, mostly in memory accesses.
Next steps
Arm MAP provides a great deal of functionality to analyse different aspects of the application's performance. Not all of the available functionality is described in the example above. For further information, consult the Arm MAP user guide.
Note: Arm MAP does not support AMD GPUs. See Profiling GPU-Enabled codes for alternative tools.