SLURM upgrade to version 22.05.8
The Slurm scheduler has been updated to version 22.05.8, and a new CLI filter has been installed on the GPU nodes in order to provide optimal binding of GPUs.
Slurm use for CPU-only nodes
The request and use of resources for the CPU-only nodes have not changed, so users may keep using their existing, working Slurm batch scripts for submitting jobs.
The only recommendation that we raise at this point is that Slurm has announced the "separation" of the request of resources from the srun launcher. This implies that, in future versions of Slurm, srun will not inherit the exact parameters requested for the allocation. In recent versions of Slurm this has already happened for the --cpus-per-task (or -c) option, which needs to be explicitly set in each srun command, independently of its setting during the request for resources. Therefore, we recommend that users be aware of the upcoming changes in Slurm and adopt, as a best practice, explicitly setting all the srun parameters that indicate the resources to be used in the command, rather than assuming that these parameters are inherited correctly by default. (Indeed, this practice has now become a requirement for the use of the GPU nodes on Setonix, as you can read in the following section.)
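As an illustration of this best practice, the following is a minimal sketch of a CPU-only batch script in which the srun parameters are given explicitly rather than assumed to be inherited from the allocation. The project code, partition name, executable name and resource amounts are placeholders for illustration only:

#!/bin/bash --login
#SBATCH --account=yourProjectCode
#SBATCH --partition=work
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=8
#SBATCH --time=01:00:00

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
# Pass -N, -n and -c explicitly to srun instead of relying on inheritance:
srun -N 1 -n 4 -c 8 ./my_executable

Note that -c 8 is repeated in the srun line even though it already appears in the request headers; this is precisely the setting that is no longer inherited by default.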
Slurm use for GPU nodes
The request of resources for the GPU nodes has changed dramatically. The main reason for this change has to do with Pawsey's efforts to provide a method for optimal binding of the GPUs to the CPU cores in direct physical connection for each task. For this, we decided to completely separate the options used for the request of resources, via salloc or #SBATCH pragmas, from the options for the use of resources during execution of the code, via srun.
Pawsey's way for requesting resources on GPU nodes (different to standard Slurm)
Request for the amount of "allocation-packs" required for the job
With the new CLI filter that Pawsey staff have put in place for the GPU nodes, the request of resources on the GPU nodes should be thought of as requesting a number of "allocation-packs". Each "allocation-pack" provides 1 GPU (GCD), 8 CPU cores (1 whole chiplet) and 29.44 GB of CPU RAM.
For that, the request of resources only needs the number of nodes (--nodes, -N) and the number of allocation-packs per node (--gres=gpu:number). The total number of allocation-packs requested results from the multiplication of these two parameters. Note that the standard Slurm meaning of the second parameter IS NOT used at Pawsey. Instead, Pawsey's CLI filter interprets this parameter as the number of "allocation-packs" per node. The "equivalent" option --gpus-per-node=number (which is also interpreted as the number of "allocation-packs" per node) is not recommended, as we have found some bugs with its use.
Furthermore, in the request of resources, users should not indicate any other Slurm allocation option related to memory or CPU cores. Therefore, users should not use --ntasks, --cpus-per-task, --mem, etc. in the request headers of the script (#SBATCH directives), or in the request options given to salloc for interactive sessions. If, for some reason, the requirements for a job are indeed determined by the number of CPU cores or the amount of memory, then users should estimate the number of "allocation-packs" that covers their needs. The "allocation-pack" is the minimal unit of resources that can be managed, so all allocation requests should be multiples of this basic unit.
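For example, a request for 3 allocation-packs on a single node (3 GPUs, 24 CPU cores and 88.32 GB of CPU RAM, as in case *3 of the table below) reduces to just the following headers; the project code and partition name shown here are placeholders used for illustration:

#SBATCH --account=yourProjectCode-gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:3

or, equivalently, for an interactive session:

salloc --account=yourProjectCode-gpu --partition=gpu --nodes=1 --gres=gpu:3

No --ntasks, --cpus-per-task or --mem options appear in the request; how tasks, cores and GPUs are actually used is decided later by the explicit srun options.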
Pawsey also has some site-specific recommendations for the use/management of resources with the srun command. Users should explicitly provide a list of several parameters for the use of resources by srun. (The list of these parameters is made clear in the examples below.) Users should not assume that srun will inherit any of these parameters from the allocation request. Therefore, the real management of resources at execution time is performed by the command line options provided to srun. Note that, for the case of srun, the options do have the standard Slurm meaning.
Within the full explicit srun options for "managing resources", there are some that help to achieve optimal binding of the GPUs to their directly connected chiplet on the CPU. There are two methods to achieve this optimal binding. So, together with the full explicit srun options, the following two methods can be used: (1) --gpus-per-task=<number> together with --gpu-bind=closest, and (2) the "manual" binding of GPUs, explained in the documentation linked at the end of this section.
--gpu-bind=closest may NOT work for all applications
The first method is simpler, but may still produce execution errors for some codes. "Manual" binding may be the only useful method for codes that rely on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and attempt to use GPU-to-GPU enabled MPI communication. An example of such a code is Slate.
The following table provides some examples that will serve as a guide for requesting resources on the GPU nodes. Most of the examples in the table are for typical jobs where multiple GPUs are allocated to the job as a whole but each of the tasks spawned by srun is bound to, and has direct access to, only 1 GPU. For applications that require multiple GPUs per task, there are 3 examples (*4, *5 and *7) where tasks are bound to multiple GPUs.
Each example below lists the required resources per job, the new "simplified" way of requesting resources, the total allocated resources, the charge per hour, and the full explicit srun options whose use is now required (only the 1st method for optimal binding is listed here; see the corresponding notes *1 to *8 after the table).
Example *1: 1 CPU task (single CPU thread) controlling 1 GCD (Slurm GPU)
Request:
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
Total allocated resources: 1 allocation-pack = 1 GPU, 8 CPU cores (1 chiplet), 29.44 GB CPU RAM
Charge per hour: 64 SU
Full explicit srun options (note *1):
export OMP_NUM_THREADS=1
srun -N 1 -n 1 -c 8 --gres=gpu:1 --gpus-per-task=1 --gpu-bind=closest <executable>

Example *2: 1 CPU task (with 14 CPU threads), all threads controlling the same 1 GCD
Request:
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
Total allocated resources: 2 allocation-packs = 2 GPUs, 16 CPU cores (2 chiplets), 58.88 GB CPU RAM
Charge per hour: 128 SU
Full explicit srun options (note *2):
export OMP_NUM_THREADS=14
srun -N 1 -n 1 -c 16 --gres=gpu:1 --gpus-per-task=1 --gpu-bind=closest <executable>

Example *3: 3 CPU tasks (single thread each), each controlling 1 GCD with GPU-aware MPI communication
Request:
#SBATCH --nodes=1
#SBATCH --gres=gpu:3
Total allocated resources: 3 allocation-packs = 3 GPUs, 24 CPU cores (3 chiplets), 88.32 GB CPU RAM
Charge per hour: 192 SU
Full explicit srun options (note *3):
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
srun -N 1 -n 3 -c 8 --gres=gpu:3 --gpus-per-task=1 --gpu-bind=closest <executable>

Example *4: 2 CPU tasks (single thread each), each task controlling 2 GCDs with GPU-aware MPI communication
Request:
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
Total allocated resources: 4 allocation-packs = 4 GPUs, 32 CPU cores (4 chiplets), 117.76 GB CPU RAM
Charge per hour: 256 SU
Full explicit srun options (note *4):
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
srun -N 1 -n 2 -c 16 --gres=gpu:4 --gpus-per-task=2 --gpu-bind=closest <executable>

Example *5: 5 CPU tasks (single thread each), all threads/tasks able to see all 5 GPUs
Request:
#SBATCH --nodes=1
#SBATCH --gres=gpu:5
Total allocated resources: 5 allocation-packs = 5 GPUs, 40 CPU cores (5 chiplets), 147.2 GB CPU RAM
Charge per hour: 320 SU
Full explicit srun options (note *5):
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
srun -N 1 -n 5 -c 8 --gres=gpu:5 <executable>

Example *6: 8 CPU tasks (single thread each), each controlling 1 GCD with GPU-aware MPI communication
Request:
#SBATCH --nodes=1
#SBATCH --exclusive
Total allocated resources: 8 allocation-packs = 8 GPUs, 64 CPU cores (8 chiplets), 235 GB CPU RAM
Charge per hour: 512 SU
Full explicit srun options (note *6):
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
srun -N 1 -n 8 -c 8 --gres=gpu:8 --gpus-per-task=1 --gpu-bind=closest <executable>

Example *7: 8 CPU tasks (single thread each), each controlling 4 GCDs with GPU-aware MPI communication
Request:
#SBATCH --nodes=4
#SBATCH --exclusive
Total allocated resources: 32 allocation-packs = 4 nodes, each with 8 GPUs, 64 CPU cores (8 chiplets), 235 GB CPU RAM
Charge per hour: 2048 SU
Full explicit srun options (note *7):
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
srun -N 4 -n 8 -c 32 --gres=gpu:8 --gpus-per-task=4 --gpu-bind=closest <executable>

Example *8: 1 CPU task (single thread) controlling 1 GCD, but avoiding other jobs running on the same node, for an ideal performance test
Request:
#SBATCH --nodes=1
#SBATCH --exclusive
Total allocated resources: 8 allocation-packs = 8 GPUs, 64 CPU cores (8 chiplets), 235 GB CPU RAM
Charge per hour: 512 SU
Full explicit srun options (note *8):
export OMP_NUM_THREADS=1
srun -N 1 -n 1 -c 8 --gres=gpu:1 --gpus-per-task=1 --gpu-bind=closest <executable>
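The charge rates in the table follow directly from the allocation-pack model: each allocation-pack is charged at 64 SU per hour. As a worked example, a job that requests 3 allocation-packs (192 SU per hour) and runs for 5 hours is charged 3 × 64 × 5 = 960 SU.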
Notes for the request of resources:
- The --nodes (-N) option indicates the number of nodes requested to be allocated.
- The --gres=gpu:number option indicates the number of allocation-packs requested to be allocated per node. (The "equivalent" option --gpus-per-node=number is not recommended, as we have found some bugs with its use.)
- The --exclusive option requests all the resources from the number of requested nodes. When this option is used, there is no need for the use of --gres=gpu:number during allocation and, indeed, its use is not recommended in this case.
- No other Slurm allocation option related to memory or CPU cores should be given in the request; the management of resources at execution time is performed with full explicit srun options.
- The same rules apply to batch jobs (#SBATCH directives) and to interactive sessions requested with salloc.

Notes for the use/management of resources with srun:
- Note that, for the case of srun, the options do have the standard Slurm meaning.
- The following options need to be explicitly provided to srun and not assumed to be inherited with some default value from the allocation request:
  - The --nodes (-N) option indicates the number of nodes to be used by the srun step.
  - The --ntasks (-n) option indicates the total number of tasks to be spawned by the srun step. By default, tasks are spawned evenly across the number of allocated nodes.
  - The --cpus-per-task (-c) option should be set to multiples of 8 (whole chiplets) to guarantee that srun will distribute the resources in "allocation-packs", thereby "reserving" whole chiplets per srun task, even if the real number of threads per task is 1. The real number of threads is controlled with the OMP_NUM_THREADS environment variable.
  - The --gres=gpu:number option indicates the number of GPUs per node to be used by the srun step. (The "equivalent" option --gpus-per-node=number is not recommended, as we have found some bugs with its use.)
  - The --gpus-per-task option indicates the number of GPUs to be bound to each task spawned by the srun step via the -n option. Note that this option precludes sharing of the GPUs assigned to a task with other tasks. (See cases *4, *5 and *7 and their notes for non-intuitive cases.)
  - The --gpu-bind=closest option indicates that the GPUs bound to each task should be the optimal ones, that is, those physically closest to the chiplet assigned to each task.
- IMPORTANT: The use of --gpu-bind=closest will assign optimal binding but may still NOT work, and produce execution errors, for codes that rely on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and attempt to use GPU-to-GPU enabled MPI communication. For those cases, the use of the "manual" optimal binding (method 2) is required. Method 2 is explained in the documentation linked at the end of this section.
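To check which GCDs have actually been bound to each task before running a real application, a quick sanity test is to launch a trivial command with the same srun options. This sketch assumes that Slurm exposes the GPUs assigned to each task through the ROCR_VISIBLE_DEVICES environment variable, which is its usual behaviour for AMD GPUs:

export OMP_NUM_THREADS=1
srun -N 1 -n 3 -c 8 --gres=gpu:3 --gpus-per-task=1 --gpu-bind=closest \
     bash -c 'echo "Task ${SLURM_PROCID} on $(hostname) sees ROCR_VISIBLE_DEVICES=${ROCR_VISIBLE_DEVICES}"'

Each task should report a different single GCD; if tasks report the same device or an unexpected number of devices, revisit the srun options or consider the "manual" binding method.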
General notes:
*1 : In this case, srun may work fine with default inherited option values. Nevertheless, it is a good practice to always use full explicit options of srun to indicate the resources needed for the executable. In this case, the settings explicitly "reserve" a whole chiplet (-c 8) for the srun task and control the real number of threads with the OMP_NUM_THREADS environment variable. Although the use of gres=gpu, gpus-per-task & gpu-bind is redundant in this case, we keep them to encourage their use, which is strictly needed in most cases (except case *5).
*2 : The task requires 14 CPU threads, which exceeds the 8 cores of a single chiplet, so two allocation-packs are requested. The real number of threads is controlled with the OMP_NUM_THREADS environment variable, but still the two full chiplets (-c 16) are indicated for the srun task.
*3 : The settings explicitly "reserve" a whole chiplet (-c 8) for each srun task. This provides "one-chiplet-long" separation among each of the CPU cores to be allocated for the tasks spawned by srun (-n 3). The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of each GPU to its corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.
*4 : Each task needs to be in direct communication with 2 GCDs. For that, each CPU task reserves two full chiplets. IMPORTANT: The use of -c 16 "reserves" a "two-chiplets-long" separation among the two CPU cores that are to be used (one for each of the srun tasks, -n 2). In this way, each task will be in direct communication with the two logical GPUs in the MI250X card that has optimal connection to the chiplets reserved for that task. The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of each GPU to its corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.
*5 : In this case all tasks need to be able to see all 5 GPUs, so the --gpu-bind option should not be provided. Neither should the --gpus-per-task option be provided, as all the available GPUs are to be available to all tasks. The real number of threads is controlled with the OMP_NUM_THREADS variable. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1. These last two settings may not be necessary for applications like TensorFlow.
*6 : All the resources in the node are requested with the --exclusive allocation option (there is no need to indicate the number of GPUs per node when using exclusive allocation). The use of -c 8 provides "one-chiplet-long" separation among each of the CPU cores to be allocated for the tasks spawned by srun (-n 8). The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of each GPU to its corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.
*7 : All the resources in the 4 requested nodes are requested with the --exclusive allocation option (there is no need to indicate the number of GPUs per node when using exclusive allocation). Each task needs to be in direct communication with 4 GCDs. For that, each CPU task reserves four full chiplets. IMPORTANT: The use of -c 32 "reserves" a "four-chiplets-long" separation among the two CPU cores that are to be used per node (8 srun tasks in total, -n 8). The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding is indicated with the option --gpu-bind=closest. In this way, each task will be in direct communication with the closest four logical GPUs in the node with respect to the chiplets reserved for that task. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1. The --gres=gpu:8 option assigns 8 GPUs per node to the srun step (32 GPUs in total, as 4 nodes are being assigned).
*8 : All the resources in the node are requested with the --exclusive option, but only 1 CPU chiplet and 1 GPU (that is, one allocation-pack) are used in the srun step.
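Putting the pieces together, a complete batch script for case *3 of the table (3 MPI tasks, each controlling 1 GCD, with GPU-aware MPI communication) might look like the following minimal sketch; the project code, partition name, time limit and executable name are placeholders for illustration only:

#!/bin/bash --login
#SBATCH --account=yourProjectCode-gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:3
#SBATCH --time=01:00:00

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
# Full explicit srun options; method 1 (--gpus-per-task + --gpu-bind=closest) for optimal binding:
srun -N 1 -n 3 -c 8 --gres=gpu:3 --gpus-per-task=1 --gpu-bind=closest ./my_gpu_executable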
An extensive explanation of the use of the GPU nodes (including these updates and the "manual" binding) can be found in Example Slurm Batch Scripts for Setonix on GPU Compute Nodes.