Check this page regularly, as it will be updated frequently over the coming months as the deployment of the software progresses.
In particular, currently:
- GPU-supported software modules are still in the process of being deployed
This page summarises the information needed to start using the Setonix GPU partitions.
Overview
The GPU partition of Setonix is made up of 192 nodes, 38 of which are high-memory nodes (512 GB of RAM instead of 256 GB). Each GPU node features 4 AMD MI250X GPUs, as depicted in Figure 1. Each MI250X comprises 2 Graphics Compute Dies (GCDs), each of which is effectively seen as a standalone GPU by the system. A 64-core AMD Trento CPU is connected to the four MI250X GPUs by the AMD Infinity Fabric interconnect, the same interconnect that links the GPU cards to each other, with a peak bandwidth of 200 Gb/s. For more information refer to the Setonix General Information page.
Figure 1. A GPU node of Setonix
Supported Applications
Several scientific applications are already able to offload computations to the MI250X; many others are in the process of being ported to AMD GPUs. Here is a list of the main ones and their current status.
Name | AMD GPU Acceleration | Module on Setonix |
---|---|---|
Amber | Yes | |
Gromacs | Yes | |
LAMMPS | Yes | |
NAMD | Yes | |
NekRS | Yes | |
PyTorch | Yes | |
ROMS | No | |
Tensorflow | Yes | |
Table 1. List of popular applications
Module names of AMD GPU applications end with the postfix amd-gfx90a. The most accurate list is given by the module command:
$ module avail gfx90a
Tensorflow
Tensorflow is available as a container at the following location:
/software/setonix/2022.11/containers/sif/amdih/tensorflow/rocm5.0-tf2.7-dev/tensorflow-rocm5.0-tf2.7-dev.sif
However, no module has been created for it yet.
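Until a module is provided, the container can be used directly with Singularity. The following is a minimal sketch, assuming a singularity module is available on Setonix and that the installed version supports the --rocm flag; the Python command is only illustrative.

# Minimal sketch: run the Tensorflow container with GPU support via Singularity.
# The singularity module name/version is an assumption; check `module avail singularity`.
module load singularity
singularity exec --rocm \
    /software/setonix/2022.11/containers/sif/amdih/tensorflow/rocm5.0-tf2.7-dev/tensorflow-rocm5.0-tf2.7-dev.sif \
    python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"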
Supported Numerical Libraries
Popular numerical routines and functions have been implemented by AMD to run on their GPU hardware. All of the following are available when loading the rocm/5.0.2 module.
Name | Description |
---|---|
rocFFT | Fast Fourier Transform. Documentation pages (external site). |
rocBLAS | rocBLAS is the AMD library for Basic Linear Algebra Subprograms (BLAS) on the ROCm platform. Documentation pages (external site). |
rocSOLVER | rocSOLVER is a work-in-progress implementation of a subset of LAPACK functionality on the ROCm platform. Documentation pages (external site). |
Table 2. Popular GPU numerical libraries.
Each of the above libraries has an equivalent HIP wrapper that enables compilation on both ROCm and NVIDIA platforms.
A complete list of available libraries can be found on this page (external site).
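As an illustration, the sketch below compiles and links a hypothetical source file that calls rocBLAS routines; my_solver.cpp is a placeholder name, and additional include or library paths may be needed depending on your build setup.

# Minimal sketch: compile and link a hypothetical rocBLAS-based source (my_solver.cpp is a placeholder)
module load rocm/5.0.2
hipcc --offload-arch=gfx90a my_solver.cpp -o my_solver -lrocblas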
AMD ROCm installations
The default ROCm installation is rocm/5.0.2, provided by HPE Cray. In addition, Pawsey staff have installed the more recent rocm/5.4.3 from source using ROCm-from-source. It is an experimental installation, and users might encounter compilation or linking errors; you are encouraged to explore it during development and to report any issues. For production jobs, however, we currently recommend using rocm/5.0.2.
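For example, to list the available ROCm installations and load the experimental one:

# List the available ROCm installations and load the experimental one
module avail rocm
module load rocm/5.4.3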
Submitting Jobs
You can submit GPU jobs to the gpu, gpu-dev and gpu-highmem Slurm partitions using your GPU allocation.
Note that you will need to use a different project code for the --account/-A option; more specifically, it is your project code followed by the -gpu suffix. For instance, if your project code is project1234, then you will have to use project1234-gpu.
GPUs must be explicitly requested from Slurm using the --gres=gpu:<num_gpus>, --gpus-per-task=<num_gpus> or --gpus-per-node=<num_gpus> options. The --gpus-per-node option is recommended.
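For instance, a short interactive session with a single GCD on the gpu-dev partition could be requested as in the sketch below; project1234 is a placeholder and the resource amounts are only indicative.

# Minimal sketch: request an interactive session with 1 GPU (GCD) on the gpu-dev partition
# (project1234 is a placeholder project code)
salloc --partition=gpu-dev --account=project1234-gpu --nodes=1 --gpus-per-node=1 --time=00:10:00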
Compiling software
If you are using ROCm libraries, such as rocFFT, to offload computations to GPUs, you should be able to use any compiler to link them against your code.
For HIP code, as well as code making use of OpenMP offloading, you must use hipcc.
If a HIP or OpenMP offload code also requires MPI, the locations of the MPI headers and libraries, which are usually included automatically by the Cray compiler wrappers, must be passed to hipcc explicitly, together with the GPU Transport Layer (GTL) libraries:
-I${MPICH_DIR}/include -L${MPICH_DIR}/lib -lmpi -L${CRAY_MPICH_ROOTDIR}/gtl/lib -lmpi_gtl_hsa
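For instance, a HIP source that also calls MPI could be compiled as in the following sketch; main.cpp and the output name are placeholders.

# Minimal sketch: compile a HIP+MPI source with hipcc (main.cpp is a placeholder file name)
module load rocm craype-accel-amd-gfx90a
hipcc --offload-arch=gfx90a -I${MPICH_DIR}/include main.cpp -o main \
      -L${MPICH_DIR}/lib -lmpi -L${CRAY_MPICH_ROOTDIR}/gtl/lib -lmpi_gtl_hsa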
OpenACC for Fortran codes is implemented in the Cray Fortran compiler.
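A minimal compilation sketch is shown below; the -hacc flag (which enables OpenACC directives in the Cray Fortran compiler) and the source file name are assumptions, so consult the Cray Fortran documentation for the exact options on Setonix.

# Minimal sketch: compile an OpenACC Fortran source with the Cray compiler wrapper
# (-hacc and acc_prog.f90 are assumptions/placeholders)
module load PrgEnv-cray craype-accel-amd-gfx90a rocm
ftn -hacc acc_prog.f90 -o acc_prog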
Accounting
Each MI250X GCD, which corresponds to a Slurm GPU, is charged 64 SU per hour, so the use of an entire GPU node (8 GCDs) is charged 512 SU per hour. In general, a job is charged the largest proportion of core, memory, or GPU usage, rounded up to the nearest 1/8th of a node (corresponding to an individual MI250X GCD). For example, a job running on 2 GCDs for 3 hours is charged 2 × 64 × 3 = 384 SU. Note that GPU node usage is accounted against GPU allocations with the -gpu suffix, which are separate from CPU allocations.
Programming AMD GPUs
You can program AMD MI250X GPUs using HIP, AMD's equivalent of the NVIDIA CUDA programming framework. The HIP platform is available after loading the rocm module.
The complete AMD documentation on how to program with HIP can be found here (external site).
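As an illustration, the sketch below writes, compiles and runs a trivial HIP program; the file names, project code and srun options are placeholders rather than a prescribed workflow.

# Minimal sketch: build and run a trivial HIP program (file names and project code are placeholders)
module load rocm craype-accel-amd-gfx90a
cat > hello_hip.cpp << 'EOF'
#include <hip/hip_runtime.h>
#include <cstdio>

// Each GPU thread prints its index
__global__ void hello() { printf("Hello from GPU thread %d\n", (int)threadIdx.x); }

int main() {
    hello<<<1, 4>>>();                // launch 1 block of 4 threads
    (void)hipDeviceSynchronize();     // wait for the kernel (and its printf output) to complete
    return 0;
}
EOF
hipcc --offload-arch=gfx90a hello_hip.cpp -o hello_hip
srun --partition=gpu --account=project1234-gpu --nodes=1 --ntasks=1 --gpus-per-node=1 ./hello_hip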
Example Jobscripts
The following are some brief examples of requesting GPUs via Slurm batch scripts on Setonix. For more detail, particularly regarding the allocation of whole slurm-sockets when using shared nodes and the CPU binding for optimal placement relative to GPUs, refer to Example Slurm Batch Scripts for Setonix on GPU Compute Nodes.
#!/bin/bash --login
#SBATCH --account=project-gpu
#SBATCH --partition=gpu
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --sockets-per-node=1
#SBATCH --gpus-per-node=1
#SBATCH --time=00:05:00

#----
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm craype-accel-amd-gfx90a
module list

#----
#MPI & OpenMP settings
export OMP_NUM_THREADS=1 #This controls the real number of threads per task

#----
#Execution
srun -c 8 ./program
#!/bin/bash --login
#SBATCH --account=project-gpu
#SBATCH --partition=gpu
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --sockets-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --time=00:05:00
#SBATCH --exclusive

#----
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm craype-accel-amd-gfx90a
module list

#----
#MPI & OpenMP settings
export OMP_NUM_THREADS=1 #This controls the real number of threads per task

#----
#Execution
srun -c 8 ./program
#!/bin/bash --login
#SBATCH --account=project-gpu
#SBATCH --partition=gpu
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=8
#SBATCH --sockets-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --time=00:05:00
#SBATCH --exclusive

#----
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm craype-accel-amd-gfx90a
module list

#----
#MPI & OpenMP settings
export OMP_NUM_THREADS=1 #This controls the real number of threads per task

#----
#First preliminary "hack": create a selectGPU wrapper to be used for
# binding only 1 GPU to each task spawned by srun
wrapper="selectGPU_${SLURM_JOBID}.sh"
cat << EOF > $wrapper
#!/bin/bash
export ROCR_VISIBLE_DEVICES=\$SLURM_LOCALID
exec \$*
EOF
chmod +x ./$wrapper

#----
#Second preliminary "hack": generate an ordered list of CPU-cores (each on a different slurm-socket)
# to be matched with the correct GPU in the srun command using the --cpu-bind option.
CPU_BIND="map_cpu:48,56,16,24,0,8,32,40"

#----
#Execution
srun -c 8 --cpu-bind=${CPU_BIND} ./$wrapper ./program

#----
#Deleting the wrapper
rm -f ./$wrapper