Setonix GPU Partition Quick Start

Check this page regularly, as it will be updated frequently over the coming months as the deployment of the software progresses.

In particular, currently:

  • GPU-supported software modules are still in the process of being deployed

This page summarises the information needed to start using the Setonix GPU partitions.


Overview

The GPU partition of Setonix is made up of 192 nodes, 38 of which are high-memory nodes (512 GB RAM instead of 256 GB). Each GPU node features 4 AMD MI250X GPUs, as depicted in Figure 1. Each MI250X comprises 2 Graphics Compute Dies (GCDs), each of which is effectively seen as a standalone GPU by the system. A 64-core AMD Trento CPU is connected to the four MI250X cards with the AMD Infinity Fabric interconnect, the same interconnect used between the GPU cards, with a peak bandwidth of 200 Gb/s. For more information refer to the Setonix General Information. Each GCD can access 64 GB of GPU memory, which totals 128 GB per MI250X and 256 GB per standard GPU node.

Figure 1. A GPU node of Setonix

Supported Applications

Several scientific applications are already able to offload computations to the MI250X GPUs; many others are in the process of being ported to AMD GPUs. Here is a list of the main ones and their current status.

Name | AMD GPU Acceleration | Module on Setonix
Amber | Yes | Yes
Gromacs | Yes | Yes
LAMMPS | Yes | Yes
NAMD | Yes |
NekRS | Yes |
PyTorch | Yes | Yes*
ROMS | No |
Tensorflow | Yes | Yes*

Table 1. List of popular applications. * indicates the module is provided as a container.

Module names of AMD GPU applications end with the suffix amd-gfx90a. The most up-to-date list is given by the module command:

$ module avail gfx90a



Tensorflow

Tensorflow is available as a container at the following location,

/software/setonix/2022.11/containers/sif/amdih/tensorflow/rocm5.0-tf2.7-dev/tensorflow-rocm5.0-tf2.7-dev.sif 

but no module has been created for it yet.
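Until a module is available, the container can be run directly with Singularity. The following is an illustrative sketch only: the Singularity module name and exact flags are assumptions, so check module avail singularity first.

Running the Tensorflow container (illustrative)
module load singularity   # assumed module name; check "module avail singularity"
singularity exec --rocm \
    /software/setonix/2022.11/containers/sif/amdih/tensorflow/rocm5.0-tf2.7-dev/tensorflow-rocm5.0-tf2.7-dev.sif \
    python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"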



Supported Numerical Libraries

Popular numerical routines and functions have been implemented by AMD to run on their GPU hardware. All of the following are available when loading the rocm/5.0.2  module.

Name | Description
rocFFT | Fast Fourier Transform library. Documentation pages (external site).
rocBLAS | The AMD library for Basic Linear Algebra Subprograms (BLAS) on the ROCm platform. Documentation pages (external site).
rocSOLVER | A work-in-progress implementation of a subset of LAPACK functionality on the ROCm platform. Documentation pages (external site).

Table 2. Popular GPU numerical libraries.

Each of the above libraries has an equivalent HIP wrapper that enables compilation on both ROCm and NVIDIA platforms.

A complete list of available libraries can be found on this page (external site).

AMD ROCm installations

The default ROCm installation is rocm/5.2.3, provided by HPE Cray. In addition, Pawsey staff have installed the more recent rocm/5.4.3 from source using ROCm-from-source. It is an experimental installation and users might encounter compilation or linking errors. You are encouraged to explore it during development and to report any issues. For production jobs, however, we currently recommend using rocm/5.2.3.
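For example, to try the experimental installation during development, load it explicitly instead of the default:

$ module load rocm/5.4.3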

Submitting Jobs

You can submit GPU jobs to the gpu, gpu-dev and gpu-highmem Slurm partitions using your GPU allocation.

Note that you will need to use a different project code for the --account/-A option. More specifically, it is your project code followed by the -gpu suffix. For instance, if your project code is project1234, then you will have to use project1234-gpu.
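For example, in the header of a batch script (with project1234 as a placeholder for your own project code):

#SBATCH --account=project1234-gpu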

Pawsey's way of requesting resources on GPU nodes (different from standard Slurm)

The way resources are requested for the GPU nodes has changed dramatically. The main reason for this change is Pawsey's effort to provide a method for optimal binding of the GPUs to the CPU cores in direct physical connection with them for each task. For this, we decided to completely separate the options used to request resources (via salloc or #SBATCH directives) from the options used to manage resources during execution of the code (via srun).

Request for the amount of "allocation-packs" required for the job

With a new CLI filter that Pawsey staff have put in place for the GPU nodes, the request of resources on GPU nodes should be thought of as requesting a number of "allocation-packs". Each "allocation-pack" provides:

  • 1 whole CPU chiplet (8 CPU cores)
  • slightly less than 32 GB of memory (29.44 GB, to be exact, leaving some memory for the system to operate the node) = 1/8 of the total available RAM
  • 1 GCD directly connected to that chiplet

For that, the request of resources only needs the number of nodes (--nodes, -N) and the number of allocation-packs per node (--gres=gpu:number). The total number of allocation-packs requested is the product of these two parameters. Note that the standard Slurm meaning of the second parameter IS NOT used at Pawsey. Instead, Pawsey's CLI filter interprets this parameter as:

  • the number of requested "allocation-packs" per node

Note that the "equivalent" option --gpus-per-node=number (which is also interpreted as the number of "allocation-packs" per node) is not recommended as we have found some bugs with its use.

Furthermore, in the request of resources, users should not indicate any other Slurm allocation option related to memory or CPU cores. Therefore, users should not use --ntasks, --cpus-per-task, --mem, etc. in the request headers of the script (#SBATCH directives), or in the request options given to salloc for interactive sessions. If, for some reason, the requirements for a job are indeed determined by the number of CPU cores or the amount of memory, then users should estimate the number of "allocation-packs" that covers their needs, as shown in the example below. The "allocation-pack" is the minimal unit of resources that can be managed, so all allocation requests must be multiples of this basic unit.
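For instance (a hypothetical sizing exercise), a job that needs about 20 CPU cores and 60 GB of RAM but only 1 GCD should still request 3 allocation-packs, since 3 chiplets (24 cores) and 3 × 29.44 GB = 88.32 GB of memory cover those needs:

#SBATCH --nodes=1
#SBATCH --gres=gpu:3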

Pawsey also has some site-specific recommendations for the use/management of resources with the srun command. Users should explicitly provide a list of several parameters for the use of resources by srun. (The list of these parameters is made clear in the examples below.) Users should not assume that srun will inherit any of these parameters from the allocation request. Therefore, the real management of resources at execution time is performed by the command line options provided to srun. Note that, in the case of srun, options do have the standard Slurm meaning.

--gpu-bind=closest may NOT work for all applications

Within the full explicit srun options for "managing resources", there are some that help achieve optimal binding of GPUs to their directly connected chiplet on the CPU. There are two methods to achieve this optimal binding. Together with the full explicit srun options, either of the following methods can be used:

  1. Include these two Slurm parameters: --gpus-per-task=<number> together with --gpu-bind=closest
  2. "Manual" optimal binding with the use of "two auxiliary techniques" (explained later in the main document).

The first method is simpler, but may still produce execution errors for some codes. "Manual" binding may be the only reliable method for codes that rely on OpenMP or OpenACC pragmas for moving data between host and GPU and that attempt to use GPU-to-GPU enabled MPI communication. An example of such a code is Slate.


The following table provides some examples that will serve as a guide for requesting resources on the GPU nodes. Most of the examples in the table are for typical jobs where multiple GPUs are allocated to the job as a whole, but each of the tasks spawned by srun is bound to, and has direct access to, only 1 GPU. For applications that require multiple GPUs per task, there are three examples (*4, *5 and *7) where tasks are bound to multiple GPUs:

Required Resources per Job | New "simplified" way of requesting resources | Total Allocated resources | Charge per hour

The use of full explicit srun options is now required
(only the 1st method for optimal binding is listed here)

1 CPU task (single CPU thread) controlling 1 GCD (Slurm GPU)
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
1 allocation-pack =
1 GPU, 8 CPU cores (1 chiplet), 29.44 GB CPU RAM
64 SU

*1

export OMP_NUM_THREADS=1
srun -N 1 -n 1 -c 8 --gres=gpu:1 --gpus-per-task=1 --gpu-bind=closest <executable>

1 CPU task (with 14 CPU threads), all threads controlling the same GCD

#SBATCH --nodes=1
#SBATCH --gres=gpu:2

2 allocation-packs=
2 GPUs, 16 CPU cores (2 chiplets), 58.88 GB CPU RAM
128 SU

*2

export OMP_NUM_THREADS=14
srun -N 1 -n 1 -c 16 --gres=gpu:1 --gpus-per-task=1 --gpu-bind=closest <executable>

3 CPU tasks (single thread each), each controlling 1 GCD with GPU-aware MPI communication
#SBATCH --nodes=1
#SBATCH --gres=gpu:3
3 allocation-packs=
3 GPUs, 24 CPU cores (3 chiplets), 88.32 GB CPU RAM
192 SU

*3

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1

srun -N 1 -n 3 -c 8 --gres=gpu:3 --gpus-per-task=1 --gpu-bind=closest <executable>

2 CPU tasks (single thread each), each task controlling 2 GCDs with GPU-aware MPI communication

#SBATCH --nodes=1
#SBATCH --gres=gpu:4

4 allocation-packs=
4 GPUs, 32 CPU cores (4 chiplets), 117.76 GB CPU RAM
256 SU

*4

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1

srun -N 1 -n 2 -c 16 --gres=gpu:4 --gpus-per-task=2 --gpu-bind=closest <executable>

5 CPU tasks (single thread each), all tasks able to see all 5 GPUs

#SBATCH --nodes=1
#SBATCH --gres=gpu:5

5 allocation-packs=
5 GPUs, 40 CPU cores (5 chiplets), 147.2 GB CPU RAM
320 SU

*5

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
srun -N 1 -n 5 -c 8 --gres=gpu:5 <executable>

8 CPU tasks (single thread each), each controlling 1 GCD with GPU-aware MPI communication
#SBATCH --nodes=1
#SBATCH --exclusive
8 allocation-packs=
8 GPUs, 64 CPU cores (8 chiplets), 235 GB CPU RAM
512 SU

*6

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1

srun -N 1 -n 8 -c 8 --gres=gpu:8 --gpus-per-task=1 --gpu-bind=closest <executable>

8 CPU tasks (single thread each), each controlling 4 GCDs with GPU-aware MPI communication
#SBATCH --nodes=4
#SBATCH --exclusive
32 allocation-packs=
4 nodes, each with: 8 GPUs, 64 CPU cores (8 chiplets), 235 GB CPU RAM
2048 SU

*7

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1

srun -N 4 -n 8 -c 32 --gres=gpu:8 --gpus-per-task=4 --gpu-bind=closest <executable>

1 CPU task (single thread) controlling 1 GCD, while preventing other jobs from running on the same node (for an ideal performance test).
#SBATCH --nodes=1
#SBATCH --exclusive
8 allocation-packs=
8 GPUs, 64 CPU cores (8 chiplets), 235 GB CPU RAM
512 SU

*8

export OMP_NUM_THREADS=1
srun -N 1 -n 1 -c 8 --gres=gpu:1 --gpus-per-task=1 --gpu-bind=closest <executable>

Notes for the request of resources:

  • Note that this simplified way of resource request is based on requesting a number of "allocation-packs", so the standard Slurm allocation parameters should not be used for GPU resources.
  • The --nodes (-N) option indicates the number of nodes requested to be allocated.
  • The --gres=gpu:number option indicates the number of allocation-packs requested to be allocated per node. (The "equivalent" option --gpus-per-node=number is not recommended as we have found some bugs with its use.)
  • The --exclusive option requests all the resources from the number of requested nodes. When this option is used, there is no need for the use of --gres=gpu:number during allocation and, indeed, its use is not recommended in this case.
  • Users should not include any other Slurm allocation option that may indicate some "calculation" of required memory or CPU cores. The management of resources should only be performed after allocation via srun options.
  • The same simplified resource request should be used for the request of interactive sessions with salloc.
  • IMPORTANT: In addition to the request parameters shown in the table, users should indeed use other Slurm request parameters related to partition, walltime, job naming, output, email, etc. (Check the examples of the full Slurm batch scripts.)

Notes for the use/management of resources with srun:

  • Note that, for the case of srun, options do have the standard Slurm meaning.
  • The following options need to be explicitly provided to srun and not assumed to be inherited with some default value from the allocation request:
    • The --nodes (-N) option indicates the number of nodes to be used by the srun step.
    • The --ntasks (-n) option indicates the total number of tasks to be spawned by the srun step. By default, tasks are spawned evenly across the number of allocated nodes.
    • The --cpus-per-task (-c) option should be set to multiples of 8 (whole chiplets) to guarantee that srun will distribute the resources in "allocation-packs", thus "reserving" whole chiplets per srun task, even if the real number of threads per task is only 1. The real number of threads is controlled with the OMP_NUM_THREADS environment variable.
    • The --gres=gpu:number option indicates the number of GPUs per node to be used by the srun step. (The "equivalent" option --gpus-per-node=number is not recommended as we have found some bugs with its use.)
    • The --gpus-per-task option indicates the number of GPUs to be bound to each task spawned by the srun step via the -n option. Note that this option prevents the GPUs assigned to a task from being shared with other tasks. (See cases *4, *5 and *7 and their notes for non-intuitive cases.)
  • And for optimal binding, the following should be used:
    • The --gpu-bind=closest option indicates that the GPUs bound to each task should be the optimal ones, that is, those physically closest to the chiplet assigned to that task.
    • IMPORTANT: The use of --gpu-bind=closest will assign optimal binding, but may still NOT work and produce execution errors for codes that rely on OpenMP or OpenACC pragmas for moving data between host and GPU and that attempt to use GPU-to-GPU enabled MPI communication. For those cases, the use of the "manual" optimal binding (method 2) is required. Method 2 is explained later in the main document.


  • (*1) This is the only case where srun may work fine with default inherited option values. Nevertheless, it is good practice to always use full explicit srun options to indicate the resources needed for the executable. In this case, the settings explicitly "reserve" a whole chiplet (-c 8) for the srun task and control the real number of threads with the OMP_NUM_THREADS environment variable. Although the use of --gres=gpu, --gpus-per-task and --gpu-bind is redundant in this case, we keep them to encourage their use, which is strictly needed in most cases (except case *5).
  • (*2) The required number of CPU threads per task is 14, and that is controlled with the OMP_NUM_THREADS environment variable. Nevertheless, two full chiplets (-c 16) are still reserved for the srun task.
  • (*3) The settings explicitly "reserve" a whole chiplet (-c 8) for each srun task. This provides "one-chiplet-long" separation among each of the CPU cores to be allocated for the tasks spawned by srun (-n 3).  The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of GPU to corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.
  • (*4) Each task needs to be in direct communication with 2 GCDs. For that, each CPU task reserves "two full chiplets". IMPORTANT: The use of -c 16 "reserves" a "two-chiplet-long" separation between the two CPU cores that are to be used (one for each of the srun tasks, -n 2). In this way, each task will be in direct communication with the two logical GPUs in the MI250X card that has optimal connection to the chiplets reserved for that task. The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of GPU to corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.
  • (*5) Sometimes the executable (and not the scheduler) performs all the management of GPUs, as in the case of Tensorflow distributed training and other machine learning applications. If all the management logic for the GPUs is performed by the executable, then all the available resources should be exposed to it. IMPORTANT: In this case, the --gpu-bind option should not be provided. Nor should the --gpus-per-task option be provided, as all the available GPUs are to be accessible to all tasks. The real number of threads is controlled with the OMP_NUM_THREADS variable. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1. These last two settings may not be necessary for applications like Tensorflow.
  • (*6) All GPUs in the node are requested, which means all the resources available in the node are allocated via the --exclusive allocation option (there is no need to indicate the number of GPUs per node when using exclusive allocation). The use of -c 8 provides a "one-chiplet-long" separation among each of the CPU cores to be allocated for the tasks spawned by srun (-n 8). The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of GPU to corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.
  • (*7) All resources in each node are requested via the --exclusive allocation option (there is no need to indicate the number of GPUs per node when using exclusive allocation). Each task needs to be in direct communication with 4 GCDs. For that, each CPU task reserves "four full chiplets". IMPORTANT: The use of -c 32 "reserves" a "four-chiplet-long" separation between the two CPU cores that are to be used per node (8 srun tasks in total, -n 8). The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of GPU to corresponding chiplet is indicated with the option --gpu-bind=closest. In this way, each task will be in direct communication with the closest four logical GPUs in the node with respect to the chiplets reserved for that task. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1. The --gres=gpu:8 option assigns 8 GPUs per node to the srun step (32 GPUs in total, as 4 nodes are being assigned).
  • (*8) All GPUs in the node are requested using the --exclusive option, but only one "1 CPU chiplet + 1 GPU" unit (one allocation-pack) is used in the srun step.

General notes:

  • The allocation charge is for the total of allocated resources and not only for the ones that are explicitly used in the execution, so all idle resources will also be charged.

An extensive explanation on the use of the GPU nodes (including request by "allocation packs" and the "manual" binding) is in Example Slurm Batch Scripts for Setonix on GPU Compute Nodes.

Compiling software

If you are using ROCm libraries, such as rocFFT, to offload computations to GPUs, you should be able to use any compiler to link those to your code.
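For instance, a minimal sketch of linking rocFFT into a C code with the Cray cc wrapper; this assumes the rocm module is loaded and that it sets ROCM_PATH, and the source file name is illustrative:

Linking a ROCm library with a standard compiler (illustrative)
# assumes the rocm module defines ROCM_PATH; the file name myfft.c is a placeholder
cc -O2 myfft.c -o myfft -I${ROCM_PATH}/include -L${ROCM_PATH}/lib -lrocfft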

For HIP code, use hipcc. And, for code making use of OpenMP offloading, you must use:

  • hipcc for C/C++
  • ftn (the wrapper for the Cray Fortran compiler from PrgEnv-cray) for Fortran. This compiler also allows GPU offloading with OpenACC.

When using hipcc, note that the location of the MPI headers and libraries is not automatically included (contrary to the automatic inclusion when using the Cray wrapper scripts). Therefore, if your code also requires MPI, the location of the MPI headers and libraries must be provided to hipcc, as well as the GPU Transport Layer (GTL) libraries:

MPI include and library flags for hipcc
-I${MPICH_DIR}/include
-L${MPICH_DIR}/lib -lmpi 
-L${CRAY_MPICH_ROOTDIR}/gtl/lib -lmpi_gtl_hsa
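Putting these flags together, a hypothetical compile line for a HIP source file that also calls MPI might look like the following (the file name is illustrative):

Example hipcc compile line with MPI (illustrative)
hipcc -I${MPICH_DIR}/include myprog.cpp -o myprog \
      -L${MPICH_DIR}/lib -lmpi \
      -L${CRAY_MPICH_ROOTDIR}/gtl/lib -lmpi_gtl_hsa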

Also, to ensure proper GPU-to-GPU MPI communication, codes must be compiled and run with the following environment variable set:

MPI environment variable for GPU-GPU communication
export MPICH_GPU_SUPPORT_ENABLED=1


Accounting

Each MI250X GCD, which corresponds to a Slurm GPU, is charged 64 SU per hour. This means the use of an entire GPU node is charged 512 SU per hour. In general, a job is charged the largest proportion of core, memory, or GPU usage, rounded up to 1/8ths of a node (corresponding to an individual MI250X GCD). Note that GPU node usage is accounted against GPU allocations with the -gpu suffix, which are separate from CPU allocations.
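For example, a hypothetical job allocated 3 GCDs (3 allocation-packs) for 2 hours is charged 3 × 64 × 2 = 384 SU, regardless of whether all allocated resources were actively used.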

Programming AMD GPUs

You can program AMD MI250X GPUs using HIP, which is AMD's programming framework equivalent to NVIDIA's CUDA. The HIP platform is available after loading the rocm module.

The complete AMD documentation on how to program with HIP can be found here (external site).

Example Jobscripts

The following are some brief examples of requesting GPUs via Slurm batch scripts on Setonix. For more detail, particularly regarding using shared nodes and the CPU binding for optimal placement relative to GPUs,  refer to Example Slurm Batch Scripts for Setonix on GPU Compute Nodes.

Example 1 : One process with a single GPU using shared node access
#!/bin/bash --login

#SBATCH --account=project-gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1              #1 node in this example
#SBATCH --gres=gpu:1           #1 GPU per node (1 "allocation-pack" in total for the job)
#SBATCH --time=00:05:00

#----
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm craype-accel-amd-gfx90a
module list

#----
#MPI & OpenMP settings
export OMP_NUM_THREADS=1 #This controls the real number of threads per task

#----
#Execution
srun -N 1 -n 1 -c 8 --gres=gpu:1 ./program
Example 2 : Single CPU process that uses the eight GPUs of the node
#!/bin/bash --login

#SBATCH --account=project-gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1              #1 node in this example
#SBATCH --exclusive            #All resources of the node are exclusive to this job
#                              #8 GPUs per node (8 "allocation-packs" in total for the job)
#SBATCH --time=00:05:00

#----
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm craype-accel-amd-gfx90a
module list

#----
#MPI & OpenMP settings
export OMP_NUM_THREADS=1           #This controls the real CPU-cores per task for the executable

#----
#Execution
srun -N 1 -n 1 -c 64 --gres=gpu:8 ./program
Example 3 : Eight MPI processes each with a single GPU (use exclusive node access)
#!/bin/bash --login

#SBATCH --account=project-gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1              #1 node in this example
#SBATCH --exclusive            #All resources of the node are exclusive to this job
#                              #8 GPUs per node (8 "allocation packs" in total for the job)
#SBATCH --time=00:05:00

#----
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm craype-accel-amd-gfx90a
module list

#----
#MPI & OpenMP settings
export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1           #This controls the real number of threads per task

#----
#Execution
srun -N 1 -n 8 -c 8 --gres=gpu:8 --gpus-per-task=1 --gpu-bind=closest ./program

Method 1 may fail for some applications.

The use of --gpu-bind=closest may not work for all codes. For those codes, "manual" binding may be the only reliable method if they rely on OpenMP or OpenACC pragmas for moving data between host and GPU and attempt to use GPU-to-GPU enabled MPI communication.

Some codes, like OpenMM, also make use of runtime environment variables and require explicitly setting ROCR_VISIBLE_DEVICES:

Setting visible devices manually
export ROCR_VISIBLE_DEVICES=0,1 # selects the first two GCDs (i.e. the first MI250X card).


Full guides