Example Slurm Batch Scripts for Setonix on GPU Compute Nodes

Node architecture

The GPU node architecture is different from that of the CPU-only nodes. The following diagram shows the connections between the CPU and the GPUs on a node, which will assist in understanding the recommendations for Slurm job scripts later on this page. Note that the numbering of the CPU cores follows a slightly different order to that of the GPUs. Each GCD can access 64 GB of GPU memory. This totals 128 GB per MI250X, and 256 GB per standard GPU node.

Each GPU node has 4 MI250X GPU cards, each of which has 2 Graphics Compute Dies (GCDs), which are seen as 2 logical GPUs; so each GPU node has 8 GCDs, equivalent to 8 Slurm GPUs. On the other hand, the single AMD CPU chip has 64 cores organised in 8 groups that share the same L3 cache. Each of these L3 cache groups (or chiplets) has a direct Infinity Fabric connection with just one of the GCDs, providing optimal bandwidth. Each chiplet can communicate with the other GCDs, albeit at a lower bandwidth due to the additional communication hops. (In the examples explained in the rest of this document, we use the numbering of the cores and the bus IDs of the GCDs to identify the allocated chiplets and GCDs, and their binding.)

Important: GCD vs GPU and effective meaning when allocating GPU resources at Pawsey

An MI250X GPU card has two GCDs. Previous generations of GPUs had only 1 GCD per GPU card, so the two terms could be used interchangeably. This interchangeable usage continues even though GPUs now have more than one GCD. Slurm, for instance, only uses the GPU terminology when referring to accelerator resources, so a request such as --gres=gpu:number is equivalent to a request for a certain number of GCDs per node. On Setonix, the maximum number is 8. (Note that the "equivalent" option --gpus-per-node=number is not recommended, as we have found some bugs with its use.)

Furthermore, Pawsey DOES NOT use the standard Slurm meaning for the --gres=gpu:number parameter. The meaning of this parameter has been superseded to represent a request for a number of "allocation-packs". This new representation has been implemented to achieve best performance. Therefore, the current allocation method uses the "allocation-pack" as the basic allocation unit and, as explained in the rest of this document, users should only request the number of "allocation-packs" that fulfils the needs of the job. Each allocation-pack provides:

  • 1 whole CPU chiplet (8 CPU cores)

  • ~32 GB memory (1/8 of the total available RAM)

  • 1 GCD (slurm GPU) directly connected to that chiplet



IMPORTANT: Shared jobs may not receive optimal binding

For jobs that use only a partial set of the resources of a node (non-exclusive jobs that share the rest of the node with other jobs), the current Setonix GPU configuration may not provide perfect allocation and binding, which may impact performance depending on the amount of CPU-GPU communication. This is under active investigation, and the recommendations included in this document will achieve optimal allocations in most cases, but this is not 100% guaranteed. Therefore, if you detect that imperfect binding, or the use of shared nodes (even with optimal binding), is impacting the performance of your jobs, it is recommended to use exclusive nodes where possible, noting that the project will still be charged for the whole node even if part of the resources remain idle. Please also report the observed issues to Pawsey's helpdesk.

Each GPU node also has an attached NVMe device with up to 3500 GiB of usable storage.

Further details of the node architecture are also available on the GPU node architecture page.

Slurm use of GPU nodes

Project name to access the GPU nodes is different

IMPORTANT: Add "-gpu" to your project account in order to access the GPU nodes

The default project name will not give you access to the GPU nodes. In order to access the GPU nodes, users need to append the suffix "-gpu" to their project name and explicitly indicate it in the resource request options:

#SBATCH -A <projectName>-gpu

So, for example, if your project name is "rottnest0001" the setting would be:

#SBATCH -A rottnest0001-gpu

This applies to all GPU partitions (gpu, gpu-dev & gpu-highmem).



Pawsey's way for requesting resources on GPU nodes (different to standard Slurm)

The request of resources for the GPU nodes has changed dramatically. The main reason for this change is Pawsey's effort to provide a method for optimal binding of the GPUs to the CPU cores in direct physical connection for each task. To this end, we decided to completely separate the options used for the resource request (via salloc or #SBATCH directives) from the options for the use of resources during execution of the code (via srun).

Request for the amount of "allocation-packs" required for the job

With a new CLI filter that Pawsey staff have put in place for the GPU nodes, the request of resources on the GPU nodes should be thought of as a request for a number of "allocation-packs". Each "allocation-pack" provides:

  • 1 whole CPU chiplet (8 CPU cores)

  • a bit less than 32 GB of memory (29.44 GB, to be exact, leaving some memory for the system to operate the node) = 1/8 of the total available RAM

  • 1 GCD directly connected to that chiplet

For this, the request of resources only needs the number of nodes (--nodes, -N) and the number of allocation-packs per node (--gres=gpu:number). The total number of allocation-packs requested results from the multiplication of these two parameters. Note that the standard Slurm meaning of the second parameter IS NOT used at Pawsey. Instead, Pawsey's CLI filter interprets this parameter as:

  • the number of requested "allocation-packs" per node

Note that the "equivalent" option --gpus-per-node=number (which is also interpreted as the number of "allocation-packs" per node) is not recommended as we have found some bugs with its use.

Furthermore, in the request of resources, users should not indicate any other Slurm allocation option related to memory or CPU cores. Therefore, users should not use --ntasks, --cpus-per-task, --mem, etc. in the request headers of the script (#SBATCH directives), or in the request options given to salloc for interactive sessions. If, for some reason, the requirements of a job are indeed determined by the number of CPU cores or the amount of memory, then users should estimate the number of "allocation-packs" that covers their needs. The "allocation-pack" is the minimal unit of resources that can be managed, so all allocation requests should indeed be multiples of this basic unit.
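As a quick sketch of this estimation (assuming the per-pack figures listed above: 8 CPU cores and roughly 29 GB of usable memory per allocation-pack; the variable and helper names here are purely illustrative), the number of packs to request is the larger of the two per-resource requirements:

```shell
#!/bin/bash
# Illustrative estimate of the allocation-packs needed for a job, from
# its CPU-core and memory requirements. Each pack provides 8 cores and
# ~29 GB of usable RAM (figures from this page).
cores_needed=20
mem_needed_gb=100

packs_for_cores=$(( (cores_needed + 7) / 8 ))    # ceil(20/8)   = 3
packs_for_mem=$(( (mem_needed_gb + 28) / 29 ))   # ceil(100/29) = 4
packs=$(( packs_for_cores > packs_for_mem ? packs_for_cores : packs_for_mem ))

echo "#SBATCH --gres=gpu:${packs}"               # request 4 allocation-packs
```

Here a job needing 20 cores and 100 GB of memory would round up to 4 allocation-packs, driven by the memory requirement.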

Pawsey also has some site-specific recommendations for the use/management of resources with the srun command. Users should explicitly provide a list of several parameters for the use of resources by srun. (The list of these parameters is made clear in the examples below.) Users should not assume that srun will inherit any of these parameters from the allocation request. Therefore, the real management of resources at execution time is performed via the command line options provided to srun. Note that, in the case of srun, the options do have the standard Slurm meaning.


--gpu-bind=closest may NOT work for all applications

Among the full explicit srun options for "managing resources", there are some that help to achieve optimal binding of GPUs to their directly connected chiplet on the CPU. There are two methods to achieve this optimal binding. Together with the full explicit srun options, the following two methods can be used:

  1. Include these two Slurm parameters: --gpus-per-task=<number> together with --gpu-bind=closest

  2. "Manual" optimal binding with the use of "two auxiliary techniques" (explained later in the main document).

The first method is simpler, but may still produce execution errors for some codes. "Manual" binding may be the only useful method for codes relying on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and attempting to use GPU-to-GPU enabled MPI communication. An example of such a code is Slate.



The following table provides some examples that will serve as a guide for requesting resources on the GPU nodes. Most of the examples in the table are for typical jobs where multiple GPUs are allocated to the job as a whole, but each of the tasks spawned by srun is bound to, and has direct access to, only 1 GPU. For applications that require multiple GPUs per task, there are 3 examples (*4, *5 & *7) where tasks are bound to multiple GPUs:

(In each example, the use of full explicit srun options is now required; only the 1st method for optimal binding is listed here.)

Example *1
Required resources per job: 1 CPU task (single CPU thread) controlling 1 GCD (Slurm GPU)
New "simplified" way of requesting resources:
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
Total allocated resources: 1 allocation-pack = 1 GPU, 8 CPU cores (1 chiplet), 29.44 GB CPU RAM
Charge per hour: 64 SU
Full explicit srun options:
export OMP_NUM_THREADS=1
srun -N 1 -n 1 -c 8 --gres=gpu:1 --gpus-per-task=1 --gpu-bind=closest <executable>

Example *2
Required resources per job: 1 CPU task with 14 CPU threads, all threads controlling the same 1 GCD
New "simplified" way of requesting resources:
#SBATCH --nodes=1
#SBATCH --gres=gpu:2
Total allocated resources: 2 allocation-packs = 2 GPUs, 16 CPU cores (2 chiplets), 58.88 GB CPU RAM
Charge per hour: 128 SU
Full explicit srun options:
export OMP_NUM_THREADS=14
srun -N 1 -n 1 -c 16 --gres=gpu:1 --gpus-per-task=1 --gpu-bind=closest <executable>

Example *3
Required resources per job: 3 CPU tasks (single thread each), each controlling 1 GCD with GPU-aware MPI communication
New "simplified" way of requesting resources:
#SBATCH --nodes=1
#SBATCH --gres=gpu:3
Total allocated resources: 3 allocation-packs = 3 GPUs, 24 CPU cores (3 chiplets), 88.32 GB CPU RAM
Charge per hour: 192 SU
Full explicit srun options:
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
srun -N 1 -n 3 -c 8 --gres=gpu:3 --gpus-per-task=1 --gpu-bind=closest <executable>

Example *4
Required resources per job: 2 CPU tasks (single thread each), each task controlling 2 GCDs with GPU-aware MPI communication
New "simplified" way of requesting resources:
#SBATCH --nodes=1
#SBATCH --gres=gpu:4
Total allocated resources: 4 allocation-packs = 4 GPUs, 32 CPU cores (4 chiplets), 117.76 GB CPU RAM
Charge per hour: 256 SU
Full explicit srun options:
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
srun -N 1 -n 2 -c 16 --gres=gpu:4 --gpus-per-task=2 --gpu-bind=closest <executable>

Example *5
Required resources per job: 5 CPU tasks (single thread each), all threads/tasks able to see all 5 GPUs
New "simplified" way of requesting resources:
#SBATCH --nodes=1
#SBATCH --gres=gpu:5
Total allocated resources: 5 allocation-packs = 5 GPUs, 40 CPU cores (5 chiplets), 147.2 GB CPU RAM
Charge per hour: 320 SU
Full explicit srun options:
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
srun -N 1 -n 5 -c 8 --gres=gpu:5 <executable>

Example *6
Required resources per job: 8 CPU tasks (single thread each), each controlling 1 GCD with GPU-aware MPI communication
New "simplified" way of requesting resources:
#SBATCH --nodes=1
#SBATCH --exclusive
Total allocated resources: 8 allocation-packs = 8 GPUs, 64 CPU cores (8 chiplets), 235 GB CPU RAM
Charge per hour: 512 SU
Full explicit srun options:
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
srun -N 1 -n 8 -c 8 --gres=gpu:8 --gpus-per-task=1 --gpu-bind=closest <executable>

Example *7
Required resources per job: 8 CPU tasks (single thread each), each controlling 4 GCDs with GPU-aware MPI communication
New "simplified" way of requesting resources:
#SBATCH --nodes=4
#SBATCH --exclusive
Total allocated resources: 32 allocation-packs = 4 nodes, each with: 8 GPUs, 64 CPU cores (8 chiplets), 235 GB CPU RAM
Charge per hour: 2048 SU
Full explicit srun options:
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
srun -N 4 -n 8 -c 32 --gres=gpu:8 --gpus-per-task=4 --gpu-bind=closest <executable>

Example *8
Required resources per job: 1 CPU task (single thread), controlling 1 GCD but preventing other jobs from running in the same node, for ideal performance testing
New "simplified" way of requesting resources:
#SBATCH --nodes=1
#SBATCH --exclusive
Total allocated resources: 8 allocation-packs = 8 GPUs, 64 CPU cores (8 chiplets), 235 GB CPU RAM
Charge per hour: 512 SU
Full explicit srun options:
export OMP_NUM_THREADS=1
srun -N 1 -n 1 -c 8 --gres=gpu:1 --gpus-per-task=1 --gpu-bind=closest <executable>


By default, each node will also have 128 GiB of NVMe storage available under the /tmp and /var/tmp directories. A larger amount of storage, up to 3400 GiB, can be requested by adding tmp:<some-number>G to the --gres option.
For example, to request 5 GPUs and 1000 GiB of NVMe storage, use the following in an sbatch script:

#SBATCH --gres=gpu:5,tmp:1000G

Notes for the request of resources:

  • Note that this simplified way of requesting resources is based on requesting a number of "allocation-packs", so the standard use of Slurm parameters for allocation should not be used for GPU resources.

  • The --nodes (-N) option indicates the number of nodes requested to be allocated.

  • The --gres=gpu:number option indicates the number of allocation-packs requested to be allocated per node. (The "equivalent" option --gpus-per-node=number is not recommended as we have found some bugs with its use.)

  • The --exclusive option requests all the resources from the number of requested nodes. When this option is used, there is no need for the use of --gres=gpu:number during allocation and, indeed, its use is not recommended in this case.

  • There is currently an issue with NVMe allocation with --exclusive, where only 128 GiB is made available regardless of the --gres=tmp setting. If more than 128 GiB is required, one should for now request all 8 GCDs explicitly with, for example, --gres=gpu:8,tmp:3500G.

  • There is no maximum NVMe allocation limit enforced for non-exclusive use, but we ask that no more than 2679 GiB be requested in this circumstance, so that other jobs can share the node.

  • Users should not include any other Slurm allocation option that may indicate some "calculation" of required memory or CPU cores. The management of resources should only be performed after allocation via srun options.

  • The same simplified resource request should be used for the request of interactive sessions with salloc.

  • IMPORTANT: In addition to the request parameters shown in the table, users should indeed use other Slurm request parameters related to partition, walltime, job naming, output, email, etc. (Check the examples of the full Slurm batch scripts.)
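Putting these request notes together, a minimal batch script (for a job similar to case *3 in the table above) might look like the following sketch; the account, job name and walltime are placeholders to be replaced with your own values:

```shell
#!/bin/bash --login
#SBATCH --account=rottnest0001-gpu   # placeholder project name, with the -gpu suffix
#SBATCH --partition=gpu
#SBATCH --job-name=3packJob          # placeholder job name
#SBATCH --time=01:00:00              # placeholder walltime
#SBATCH --nodes=1
#SBATCH --gres=gpu:3                 # 3 allocation-packs (no --ntasks, --mem, etc.)

# Management of resources happens here, via full explicit srun options:
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
srun -N 1 -n 3 -c 8 --gres=gpu:3 --gpus-per-task=1 --gpu-bind=closest ./myMPIExecutable
```

Note how the request headers only define the number of allocation-packs (plus partition, account and walltime), while all CPU/GPU management options appear on the srun line.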

Notes for the use/management of resources with srun:

  • Note that, for the case of srun, options do have the standard Slurm meaning.

  • The following options need to be explicitly provided to srun and not assumed to be inherited with some default value from the allocation request:

    • The --nodes (-N) option indicates the number of nodes to be used by the srun step.

    • The --ntasks (-n) option indicates the total number of tasks to be spawned by the srun step. By default, tasks are spawned evenly across the number of allocated nodes.

    • The --cpus-per-task (-c) option should be set to a multiple of 8 (whole chiplets) to guarantee that srun will distribute the resources in "allocation-packs", thereby "reserving" whole chiplets per srun task, even if the real number of threads per task is 1. The real number of threads is controlled with the OMP_NUM_THREADS environment variable.

    • The --gres=gpu:number option indicates the number of GPUs per node to be used by the srun step. (The "equivalent" option --gpus-per-node=number is not recommended as we have found some bugs with its use.)

    • The --gpus-per-task option indicates the number of GPUs to be bound to each task spawned by the srun step via the -n option. Note that this option prevents the sharing of the GPUs assigned to a task with other tasks. (See cases *4, *5 and *7 and their notes for non-intuitive cases.)

  • And for optimal binding, the following should be used:

    • The --gpu-bind=closest option indicates that the GPUs bound to each task should be the optimal ones, i.e. physically closest to the chiplet assigned to each task.

    • IMPORTANT: The use of --gpu-bind=closest will assign optimal binding, but may still NOT work, producing execution errors, for codes relying on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and attempting to use GPU-to-GPU enabled MPI communication. For those cases, the use of the "manual" optimal binding (method 2) is required. Method 2 is explained later in the main document.



  • (*1) This is the only case where srun may work fine with the default inherited option values. Nevertheless, it is good practice to always use the full explicit srun options to indicate the resources needed for the executable. In this case, the settings explicitly "reserve" a whole chiplet (-c 8) for the srun task and control the real number of threads with the OMP_NUM_THREADS environment variable. Although the use of --gres=gpu, --gpus-per-task & --gpu-bind is redundant in this case, we keep them to encourage their use, which is strictly needed in most cases (except case *5).

  • (*2) The required number of CPU threads per task is 14, and that is controlled with the OMP_NUM_THREADS environment variable. Nevertheless, two full chiplets (-c 16) are indicated for the srun task.

  • (*3) The settings explicitly "reserve" a whole chiplet (-c 8) for each srun task. This provides "one-chiplet-long" separation among each of the CPU cores to be allocated for the tasks spawned by srun (-n 3). The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of each GPU to its corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.

  • (*4) Each task needs to be in direct communication with 2 GCDs. For that, each CPU task reserves "two-full-chiplets". IMPORTANT: The use of -c 16 "reserves" a "two-chiplets-long" separation among the two CPU cores that are to be used (one for each of the srun tasks, -n 2). In this way, each task will be in direct communication with the two logical GPUs in the MI250X card that has the optimal connection to the chiplets reserved for that task. The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of each GPU to its corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.

  • (*5) Sometimes the executable (and not the scheduler) performs all the management of GPUs, as in the case of TensorFlow distributed training and other machine learning applications. If all the management logic for the GPUs is performed by the executable, then all the available resources should be exposed to it. IMPORTANT: In this case, the --gpu-bind option should not be provided. Nor should the --gpus-per-task option be provided, as all the available GPUs are to be visible to all tasks. The real number of threads is controlled with the OMP_NUM_THREADS variable. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1. These last two settings may not be necessary for applications like TensorFlow.

  • (*6) All GPUs in the node are requested, which means all the resources available in the node, via the --exclusive allocation option (there is no need to indicate the number of GPUs per node when using exclusive allocation). The use of -c 8 provides "one-chiplet-long" separation among each of the CPU cores to be allocated for the tasks spawned by srun (-n 8). The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of each GPU to its corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.

  • (*7) All resources in each node are requested via the --exclusive allocation option (there is no need to indicate the number of GPUs per node when using exclusive allocation). Each task needs to be in direct communication with 4 GCDs. For that, each CPU task reserves "four-full-chiplets". IMPORTANT: The use of -c 32 "reserves" a "four-chiplets-long" separation among the two CPU cores that are to be used per node (8 srun tasks in total, -n 8). The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of each GPU to its corresponding chiplet is indicated with the option --gpu-bind=closest. In this way, each task will be in direct communication with the four logical GPUs in the node closest to the chiplets reserved for that task. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1. The --gres=gpu:8 option assigns 8 GPUs per node to the srun step (32 GPUs in total, as 4 nodes are being assigned).

  • (*8) All GPUs in the node are requested using the --exclusive option, but only 1 CPU chiplet and 1 GPU (that is, one allocation-pack) are used in the srun step.

General notes:

  • The allocation charge is for the total allocated resources, not only for those explicitly used in the execution, so any idle resources will also be charged.

Note that the examples above are just for quick reference and do not show the use of the 2nd method for optimal binding (which may be the only way to achieve optimal binding for some applications). The rest of this page will describe both methods of optimal binding in detail and also show full job script examples of their use on Setonix GPU nodes.

Methods to achieve optimal binding of GCDs/GPUs

As mentioned above, and as the node diagram at the top of the page suggests, the optimal placement of GCDs and CPU cores for each task is to have direct communication between the CPU chiplet and the GCD in use. So, according to the node diagram, tasks being executed on cores in Chiplet 0 should be using GPU 4 (Bus D1), tasks in Chiplet 1 should be using GPU 5 (Bus D6), etc.

Method 1: Use of srun parameters for optimal binding

This is the most intuitive (and simplest) method for achieving optimal placement of CPUs and GPUs for each task spawned by srun. This method consists of providing the --gpus-per-task and --gpu-bind=closest parameters. So, for example, in a job that requires the use of 8 CPU tasks (single threaded) with 1 GPU per task, the srun command to be used is:

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
srun -N 1 -n 8 -c 8 --gres=gpu:8 --gpus-per-task=1 --gpu-bind=closest myMPIExecutable

The explanation of this method will be completed in the following sections, where a very useful code (hello_jobstep) will be used to confirm optimal (or sub-optimal, or incorrect) binding of GCDs (Slurm GPUs) and chiplets for srun job steps. Other examples of its use are listed in the table in the subsection above, and its use in full scripts is shown at the end of this page.

It is important to be aware that this method works fine for most codes, but not for all. Codes suffering MPI communication errors with this method should try the "manual" binding method described next.

Method 2: "Manual" method for optimal binding

Thanks to CSC center and Lumi staff

We acknowledge that the use of this method to control CPU and GCD placement was initially taken from the LUMI supercomputer documentation at CSC. From there, we have further automated parts of it for use on shared GPU nodes. We are very thankful to the LUMI staff for their collaborative support in the use and configuration of Setonix.

For codes relying on OpenMP or OpenACC pragmas for moving data from/to host to/from GCD and attempting to use GPU-to-GPU (GCD-to-GCD) enabled MPI communication, the first method may fail, giving errors similar to:

Terminal N. Example error message for some GPU-aware MPI
$ srun -N 1 -n 8 -c 8 --gres=gpu:8 --gpus-per-task=1 --gpu-bind=closest ./myCode_mpiGPU.exe
GTL_DEBUG: [0] hsa_amd_ipc_memory_attach (in gtlt_hsa_ops.c at line 1544):
HSA_STATUS_ERROR_INVALID_ARGUMENT: One of the actual arguments does not meet a precondition stated in the documentation of the corresponding formal argument.
MPICH ERROR [Rank 0] [job id 339192.2] [Mon Sep  4 13:00:27 2023] [nid001004] - Abort(407515138) (rank 0 in comm 0):
Fatal error in PMPI_Waitall: Invalid count, error stack:

For these codes, the alternative is to use a "manual" method. This second method is more elaborate than the first but, as said, may be the only option for some codes.

In this "manual" method, the --gpus-per-task and --gpu-bind parameters (the key of the first method) should NOT be provided. Instead of those two parameters, we use two auxiliary techniques:

  1. A wrapper script that sets a single and different value of the ROCR_VISIBLE_DEVICES variable for each srun task, thereby assigning a single and different GCD (logical/Slurm GPU) per task.

  2. An ordered list of CPU cores in the --cpu-bind option of srun to explicitly indicate the CPU cores where each task will be placed.

These two auxiliary techniques are applied together and work in coordination to ensure the best possible match of CPU cores and GCDs.

Auxiliary technique 1: Using a wrapper to select 1 different GCD (logical/Slurm GPU) for each of the tasks spawned by srun

This first auxiliary technique uses the following wrapper script:

Listing N. selectGPU_X.sh wrapper script for "manually" selecting 1 GPU per task
#!/bin/bash

# Expose to this task only the GCD whose index matches the task's local ID
export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID
exec "$@"

(Note that the user is responsible for creating this wrapper in their working directory and for setting the correct permissions for it to work. The wrapper needs to have execution permissions; the command chmod 755 selectGPU_X.sh, or similar, will do the job.)

The wrapper script sets the ROCm environment variable ROCR_VISIBLE_DEVICES to the value of the Slurm environment variable SLURM_LOCALID. It then executes the rest of the parameters given to the script, which are the usual execution instructions for the program intended to be executed. The SLURM_LOCALID variable holds the identification number of the task within each node (not a global identification, but one local to the node). Further details about this variable are available in the Slurm documentation.

The wrapper should be called first and then the executable (and its parameters, if any). For example, in a job that requires the use of 8 CPU tasks (single threaded) with 1 GCD (logical/Slurm GPU) per task, the srun command to be used is:

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
CPU_BIND=$(generate_CPU_BIND.sh map_cpu)
srun -N 1 -n 8 -c 8 --gres=gpu:8 --cpu-bind=${CPU_BIND} ./selectGPU_X.sh myMPIExecutable

The wrapper will be run by each of the 8 tasks spawned by srun (-n 8) and will assign a different single value of ROCR_VISIBLE_DEVICES to each task. Thus, the task with SLURM_LOCALID=0 will receive GCD 0 (Bus C1) as the only visible Slurm GPU for the task. The task with SLURM_LOCALID=1 will receive GPU 1 (Bus C6), and so forth.

The definition of CPU_BIND and its use in the --cpu-bind option of the srun command is part of the second auxiliary technique. As mentioned above, the "manual" method consists of two auxiliary techniques that need to be applied together. The application of the second technique is therefore compulsory, and it is explained in the following sub-section.

Auxiliary technique 2: Using a list of CPU cores to control task placement

This second auxiliary technique uses an ordered list of CPU cores to be bound to each of the tasks spawned by srun. An example of a "hardcoded" ordered list that would correctly bind the 8 GCDs across the 4 GPU cards in a node is:

CPU_BIND="map_cpu:49,57,17,25,0,9,33,41"

("map_cpu" is a Slurm indicator of the type of binding to be used. Please read the Slurm documentation for further details.)

According to the node diagram at the top of this page, this list consists of 1 CPU core per chiplet. What may not be very intuitive is the ordering. But after a second look, it can be seen that the order follows the identification numbers of the GCDs (logical/Slurm GPUs) in the node, so that each of the CPU cores corresponds to the chiplet that is directly connected to each of the GCDs (in order). The set of commands to use for a job that requires 8 CPU tasks (single threaded) with 1 GCD (logical/Slurm GPU) per task would then be:

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
CPU_BIND="map_cpu:49,57,17,25,0,9,33,41"
srun -N 1 -n 8 -c 8 --gres=gpu:8 --cpu-bind=${CPU_BIND} ./selectGPU_X.sh myMPIExecutable

This provides the optimal binding in a job that requires the use of 8 CPU tasks (single threaded) with 1 GCD (logical/Slurm GPU) per task.

For hybrid jobs, that is, jobs that require multiple CPU threads per task, the list needs to be a list of masks instead of CPU core IDs. The use of this list of masks is explained in the next subsection, which also describes the use of an auxiliary script to generate the lists of CPU cores or masks for general cases.
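As a small sketch of how a whole-chiplet mask can be formed (chiplet n covers cores 8n to 8n+7, so its mask is 0xFF shifted into the corresponding byte position; the function name here is purely illustrative and not part of pawseytools):

```shell
#!/bin/bash
# Build the 16-character hexadecimal CPU mask that enables only chiplet n
# (that is, cores 8n..8n+7). Illustrative helper only.
chiplet_mask () {
    printf "%016X\n" $(( 0xFF << (8 * $1) ))
}

chiplet_mask 0   # 00000000000000FF  (cores 0-7,   chiplet0)
chiplet_mask 2   # 0000000000FF0000  (cores 16-23, chiplet2)
```

A hardcoded mask list for a hybrid job binding one whole chiplet per task could then be assembled as, e.g., CPU_BIND="mask_cpu:$(chiplet_mask 0),$(chiplet_mask 1)", with the chiplet order chosen to match the GCD numbering as described above.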

For jobs that request exclusive use of the GPU nodes, the settings described in the example so far are enough to achieve optimal binding with the "manual" method. This works because the identification numbers of all the GPUs and CPU cores that will be assigned to the job are known beforehand (as all the resources of the node are requested). But when a job requires a reduced amount of resources, so that the request shares the rest of the node with other jobs, the GPUs and CPU cores to be allocated to the job are not known before the script is submitted for execution. Therefore, a "hardcoded" list of CPU cores that will always achieve optimal binding cannot be defined beforehand. To avoid this problem, for jobs that request resources on shared nodes, we provide a script that can generate the correct list once the job starts execution.

Use of generate_CPU_BIND.sh script for generating an ordered list of CPU cores for optimal binding

The generation of the ordered list to be used with the --cpu-bind option of srun can be automated with the script generate_CPU_BIND.sh, which is available by default to all users through the pawseytools module (loaded by default).

Use the script generate_CPU_BIND.sh only in GPU nodes

The use of the script generate_CPU_BIND.sh is only meaningful on GPU nodes; it will report errors if executed on CPU nodes, like:

ERROR:root:Driver not initialized (amdgpu not found in modules)

or similar.

The generate_CPU_BIND.sh script receives one parameter (map_cpu OR mask_cpu) and returns the best ordered list of CPU cores or CPU masks for optimal communication between tasks and GPUs.

For a better understanding of what this script generates and how it is useful, we can use an interactive session. Note that the request parameters given to salloc only include the number of nodes and the number of Slurm GPUs (GCDs) per node, to request a number of "allocation-packs" (as described at the top of this page). In this case, 3 "allocation-packs" are requested:

Terminal N. Explaining the use of the script "generate_CPU_BIND.sh" from an salloc session
$ salloc -N 1 --gres=gpu:3 -A yourProject-gpu --partition=gpu-dev
salloc: Granted job allocation 1370877


$ scontrol show jobid $SLURM_JOBID
JobId=1370877 JobName=interactive
   UserId=quokka(20146) GroupId=quokka(20146) MCS_label=N/A
   Priority=16818 Nice=0 Account=rottnest0001-gpu QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:48 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=16:45:41 EligibleTime=16:45:41
   AccrueTime=Unknown
   StartTime=16:45:41 EndTime=17:45:41 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=16:45:41 Scheduler=Main
   Partition=gpu AllocNode:Sid=joey-02:253180
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=nid001004
   BatchHost=nid001004
   NumNodes=1 NumCPUs=48 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
   TRES=cpu=48,mem=88320M,node=1,billing=192,gres/gpu=3
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/scratch/rottnest0001/quokka/hello_jobstep
   Power=
   CpusPerTres=gres:gpu:8
   MemPerTres=gpu:29440
   TresPerNode=gres:gpu:3   


$ rocm-smi --showhw
======================= ROCm System Management Interface =======================
============================ Concise Hardware Info =============================
GPU  DID   GFX RAS   SDMA RAS  UMC RAS   VBIOS           BUS 
0    7408  DISABLED  ENABLED   DISABLED  113-D65201-042  0000:C9:00.0 
1    7408  DISABLED  ENABLED   DISABLED  113-D65201-042  0000:D1:00.0 
2    7408  DISABLED  ENABLED   DISABLED  113-D65201-042  0000:D6:00.0 
================================================================================
============================= End of ROCm SMI Log ==============================


$ generate_CPU_BIND.sh map_cpu
map_cpu:21,2,14


$ generate_CPU_BIND.sh mask_cpu
mask_cpu:0000000000FF0000,00000000000000FF,000000000000FF00

As can be seen, 3 "allocation-packs" were requested, and the total amount of allocated resources is shown in the output of the scontrol command, including the 3 GCDs (logical/Slurm GPUs) and 88.32 GB of memory. The rocm-smi command gives a list of the three allocated devices, listed locally as GPU:0-BUS_ID:C9, GPU:1-BUS_ID:D1 & GPU:2-BUS_ID:D6.

When the generate_CPU_BIND.sh script is used with the parameter map_cpu, it creates a list of CPU cores that can be used in the srun command for optimal binding. In this case, we get map_cpu:21,2,14 which, in order, corresponds to the slurm-sockets chiplet2, chiplet0, chiplet1; these are the ones in direct connection with the C9, D1, D6 GCDs respectively. (Check the GPU node architecture diagram at the top of this page.)

For jobs that require several threads per CPU task, srun needs a list of masks instead of CPU core IDs. The generate_CPU_BIND.sh script can generate this list when the parameter mask_cpu is used. The script then creates a list of hexadecimal CPU masks that can be used for optimally binding a hybrid job. In this case, we get mask_cpu:0000000000FF0000,00000000000000FF,000000000000FF00. These masks, in order, activate only the CPU cores of chiplet2, chiplet0 & chiplet1; these are the ones in direct connection with the C9, D1, D6 GCDs respectively. (Check the GPU node architecture diagram at the top of this page and the external Slurm documentation for a detailed explanation of masks.)

Extensive documentation about the use of masks is available in the online Slurm documentation, but a brief explanation can be given here. The first thing to notice is that the masks have 16 hexadecimal characters, and each character can be understood as a hexadecimal "mini-mask" that corresponds to 4 CPU cores. A pair of characters therefore covers 8 CPU cores, that is, each pair of characters represents a chiplet. So, for example, the second mask in the list (00000000000000FF) disables all the cores of the CPU for use by the second MPI task, and makes available only the first 8 cores, which correspond to chiplet0. (Remember to read the numbers with the usual increase in significance: right to left.) Thus, the first character (right to left) is the hexadecimal mini-mask of CPU cores C00-C03, and the second character (right to left) is the hexadecimal mini-mask of CPU cores C04-C07.
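The right-to-left reading described above can be checked with a short sketch that expands a mask into the core numbers it enables (the function name is purely illustrative):

```shell
#!/bin/bash
# List the CPU cores enabled by a 16-character hexadecimal CPU mask,
# testing bits from the least significant end (right to left).
cores_in_mask () {
    local mask=$(( 16#$1 )) c cores=()
    for (( c = 0; c < 64; c++ )); do
        (( (mask >> c) & 1 )) && cores+=( "$c" )
    done
    echo "${cores[*]}"
}

cores_in_mask 00000000000000FF   # 0 1 2 3 4 5 6 7        (chiplet0)
cores_in_mask 000000000000FF00   # 8 9 10 11 12 13 14 15  (chiplet1)
```

Applied to the masks returned by generate_CPU_BIND.sh above, this confirms that each mask enables exactly the 8 cores of one chiplet.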