Profiling with ARM Performance Reports
Arm Performance Reports is a commercial profiling tool that provides a high-level report on the performance of parallel programs. It can be used as a first step in understanding the overall performance of a parallel code.
Prerequisite knowledge
You should be familiar with writing, compiling and running parallel codes to start profiling with ARM Performance Reports.
Introduction to Arm Performance Reports
Arm Performance Reports is a tool from the Arm product family available on Pawsey systems. For a running parallel code, it provides a way to gather summary information on that code's performance. The license covers profiling of x86-64 parallel applications.
Performance Reports can be used to generate a summary profile report for serial, MPI, OpenMP and mixed-mode executables. In contrast to Arm DDT and MAP, it is not driven by a graphical user interface; it writes the performance report to a file instead. Users can choose between txt, html and csv reports.
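The general invocation simply prefixes the usual launch command with perf-report. The following is a minimal sketch only: the executable name myprogram is a placeholder, and we assume the report format follows the extension given to the -o option, consistent with the -o usage shown later on this page.

$ # Sketch: prefix the normal launch command with perf-report.
$ # The -o extension (.txt, .html or .csv) selects the report format.
$ perf-report -o report.txt srun -n 4 ./myprogram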
Arm Performance Reports is a very convenient tool for understanding the main performance characteristics of the code, and can provide answers to questions such as:
- is the code memory bound?
- is the code compute bound?
- what percentage of time is spent in MPI communication?
The following example illustrates the steps to generate profile data for an application running on Setonix. The use of Arm Performance Reports to produce a report from that data is then presented.
Profiling steps
The following is an overview of the process for using Arm Performance Reports to profile your program:
- Generate the profile data for the application.
- Use Arm Performance Reports to generate a report from the profile data.
Examples
In this section we provide a step-by-step introduction to Arm Performance Reports. We will work with the example C code profileme.c given below. The code is a parallel MPI implementation of the Pi-darts program introduced in the Intermediate Supercomputing training (external site).
/* Compute pi using the six basic MPI functions */
#include <mpi.h>
#include <stdio.h>

static long num_trials = 1000000;

static long MULTIPLIER = 1366;
static long ADDEND = 150889;
static long PMOD = 714025;
long random_last = 0;

double lcgrandom() {
  long random_next;
  random_next = (MULTIPLIER * random_last + ADDEND)%PMOD;
  random_last = random_next;
  return ((double)random_next/(double)PMOD);
}

int main(int argc, char **argv) {
  long i;
  long Ncirc = 0;
  double pi, x, y;
  double r = 1.0; // radius of circle
  double r2 = r*r;

  int rank, size, manager = 0;
  MPI_Status status;
  long my_trials, temp;
  int j;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  my_trials = num_trials/size;
  if (num_trials%(long)size > (long)rank) my_trials++;
  random_last = rank;

  for (i = 0; i < my_trials; i++) {
    x = lcgrandom();
    y = lcgrandom();
    if ((x*x + y*y) <= r2)
      Ncirc++;
  }

  if (rank == manager) {
    for (j = 1; j < size; j++) {
      MPI_Recv(&temp, 1, MPI_LONG, j, j, MPI_COMM_WORLD, &status);
      Ncirc += temp;
    }
    pi = 4.0 * ((double)Ncirc)/((double)num_trials);
    printf("\n \t Computing pi using six basic MPI functions: \n");
    printf("\t For %ld trials, pi = %f\n", num_trials, pi);
    printf("\n");
  } else {
    MPI_Send(&Ncirc, 1, MPI_LONG, manager, rank, MPI_COMM_WORLD);
  }

  MPI_Finalize();
  return 0;
}
Step 1: Generate a MAP MPI wrapper library
To profile an MPI application properly, an MPI wrapper library must first be generated. You only need to do this once for a given MPI implementation. In a suitable location, issue the following commands.
$ module load arm-forge/21.1.2
$ make-profiler-libraries
Creating Cray shared libraries in /software/projects/projcode/rsrchr/map-libs
Created the libraries:
   libmap-sampler.so       (and .so.1, .so.1.0, .so.1.0.0)
   libmap-sampler-pmpi.so  (and .so.1, .so.1.0, .so.1.0.0)

To instrument a program, add these compiler options:
   compilation for use with MAP - not required for Performance Reports:
      -g (or '-G2' for native Cray Fortran) (and -O3 etc.)
   linking (both MAP and Performance Reports):
      -dynamic -L/software/projects/projcode/rsrchr/map-libs -lmap-sampler-pmpi -lmap-sampler -Wl,--eh-frame-hdr -Wl,-rpath=/software/projects/projcode/rsrchr/map-libs

Note: These libraries must be on the same NFS/Lustre/GPFS filesystem as your program.

Before running your program (interactively or from a queue), set LD_LIBRARY_PATH:
   export LD_LIBRARY_PATH=/software/projects/projcode/rsrchr/map-libs:$LD_LIBRARY_PATH
   map ...
or add -Wl,-rpath=/software/projects/projcode/rsrchr/map-libs when linking your program.
The instructions output to the terminal describe the link-stage arguments required for your code; make a record of this output. The command produces the profiling library files in the working directory:
$ ls /software/projects/projcode/rsrchr/map-libs
conftest.err               libmap-sampler-pmpi.so.1.0    libmap-sampler.so.1
libmap-sampler-pmpi.so     libmap-sampler-pmpi.so.1.0.0  libmap-sampler.so.1.0
libmap-sampler-pmpi.so.1   libmap-sampler.so             libmap-sampler.so.1.0.0
By default make-profiler-libraries will generate shared libraries. Use the following command to generate static profiler libraries:
$ module load arm-forge/21.1.2
$ make-profiler-libraries --lib-type=static
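As with the shared-library case, the command prints the exact link-stage options to use; follow those instructions rather than the shared-library flags shown above. For static profiler libraries the link step typically references a linker script instead of -l flags. The following is a sketch only, assuming the same output directory as above; verify the script name against the output printed on your system.

$ # Sketch: static profiler libraries are usually linked via a linker script.
$ cc -o profileme profileme.o \
     -Wl,@/software/projects/projcode/rsrchr/map-libs/allinea-profiler.ld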
Step 2: Compile and link your MPI application
Generate the executable of your program by linking your application objects with the generated wrapper libraries. The code does not need to be recompiled with the -g option; the linking step is the only part of the application build process that needs modification.
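For the profileme.c example, the object file used in the link command below can be produced with the usual Cray compiler wrapper. This is a sketch only; the -O3 flag is an assumption, so use your normal optimisation flags.

$ # Sketch: compile as usual; no special flags are needed for Performance Reports.
$ cc -O3 -c profileme.c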
An example instance of the linking step running on Setonix is illustrated below.
$ cc -dynamic -L/software/projects/projcode/rsrchr/map-libs -lmap-sampler-pmpi -lmap-sampler -Wl,--eh-frame-hdr -Wl,-rpath=/software/projects/projcode/rsrchr/map-libs -o profileme profileme.o
If your code uses Make or CMake for building, you will need to modify the linking option settings accordingly.
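As a sketch only, assuming your Makefile honours an LDFLAGS variable at the link step (and, for CMake, that linker flags pass through CMAKE_EXE_LINKER_FLAGS), the options printed by make-profiler-libraries could be supplied like this:

$ # Sketch: assumes the Makefile uses LDFLAGS when linking.
$ make LDFLAGS="-dynamic -L/software/projects/projcode/rsrchr/map-libs \
    -lmap-sampler-pmpi -lmap-sampler -Wl,--eh-frame-hdr \
    -Wl,-rpath=/software/projects/projcode/rsrchr/map-libs"

$ # CMake equivalent (sketch):
$ cmake -DCMAKE_EXE_LINKER_FLAGS="-dynamic -L/software/projects/projcode/rsrchr/map-libs \
    -lmap-sampler-pmpi -lmap-sampler -Wl,--eh-frame-hdr \
    -Wl,-rpath=/software/projects/projcode/rsrchr/map-libs" .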
Step 3: Execute the code to generate profiling information
You can execute your Arm Performance Reports job in one of the two ways described in the following sections.
Before running the perf-report executable you must set and export the MPICC variable to cc:
$ export MPICC=cc
Submit the job to the SLURM queueing system
Write a batch script to submit the profiling job to the scheduler. For instance, the following script describes a 15-minute single-node job which executes 4 processes in the debug queue on Setonix.
#!/bin/bash --login

#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --partition=debug
#SBATCH --account=[your-project]
#SBATCH --time=0:15:00

module unload cray-libsci
module load arm-forge/21.1.2

export MPICC=cc

perf-report -o profile.html srun --export=all -n 4 ./profileme
Note that we are using the -o option to choose the filename and the file format of the report. You can submit the script using the sbatch command.
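For example, if the script above is saved as profile.slurm (a hypothetical filename), the submission would be:

$ sbatch profile.slurm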
Use an interactive session
Allocate a profiling session in the debug queue by running:
$ salloc --nodes=1 --ntasks=4 --partition=debug --account=[your-project] --time=0:15:00 --export=none
You can now run the profiling job.
$ module load arm-forge/21.1.2
$ perf-report -o profile.html srun --export=all -n 4 ./profileme
Step 4: View the report
A successful execution will produce the profile.html file in the working directory. This file can be copied to your local system and opened with your favourite web browser. Below is an outline of the kind of report generated for the profileme.c example code.
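The excerpt below is illustrative only: the category layout follows a Performance Reports summary, but the percentages are made-up placeholders and the numbers for your run will differ.

Summary: profileme is Compute-bound in this configuration
Compute   85.0%  Time spent running application code
MPI       15.0%  Time spent in MPI calls
I/O        0.0%  Time spent in filesystem I/O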
Next steps
Arm Performance Reports provides a great deal of functionality for analysing different aspects of an application's performance, and not all of it is covered by the simple example above. For further information, consult the Arm Performance Reports documentation.
Note: Arm Performance Reports does not support AMD GPUs; see Profiling GPU-Enabled codes for alternative tools.