...


New way of requesting (#SBATCH) and using (srun) resources on GPU nodes

The way resources are requested for the GPU nodes has changed dramatically. The main reason for this change is Pawsey's effort to provide a method for optimal binding of the GPUs to the CPU cores that are in direct physical connection for each task. For this, we decided to completely separate the options used to request resources via salloc or #SBATCH directives from the options that control the use of resources during execution of the code via srun.

With a new CLI filter that Pawsey staff have put in place for the GPU nodes, the request of resources on GPU nodes should be thought of as requesting a number of "allocation packs". Each "allocation pack" consists of:

  • 1 whole CPU chiplet (8 CPU cores)
  • a bit less than 32 GB of memory (29.44 GB of memory, to be exact, leaving some memory for the system to operate the node)
  • 1 GPU directly connected to that chiplet

For this, the request of resources only needs the number of nodes (--nodes, -N) and the number of GPUs per node (--gpus-per-node). The total number of requested GPUs, resulting from the multiplication of these two parameters, will be interpreted as the total number of requested "allocation packs".

In the request of resources, users should not indicate any other Slurm allocation option related to memory or CPU cores; do not use --ntasks, --cpus-per-task, --mem, etc. in the request headers of the script (#SBATCH directives) or in the request options of salloc. If, for some reason, the job requirements are dictated by the number of CPU cores or the amount of memory, then users should estimate the number of "allocation packs" that meet their needs. For example, a job needing about 100 GB of memory on a node would need 4 allocation packs (4 x 29.44 GB = 117.76 GB), and therefore 4 GPUs, even if it uses fewer. The "allocation pack" is the minimal unit of resources that can be managed, so all allocation requests should be multiples of this basic unit.

The use/management of resources with srun is another story. After the requested resources are allocated, the srun command should be given explicit parameters indicating how resources are to be used by the srun step and the spawned tasks. In other words, the real management of resources is performed by the command line options of srun; do not rely on default parameters for srun.

The following table provides some examples that will serve as a guide for requesting resources in the GPU nodes:

Warning: --gpu-bind=closest may NOT work for all applications

There are now two methods to achieve optimal binding of GPUs:

  1. The use of srun parameters for optimal binding: --gpus-per-task=<number> together with --gpu-bind=closest
  2. "Manual" optimal binding with the use of "two auxiliary techniques".

The first method is simpler, but may not work for all codes. "Manual" binding may be the only useful method for codes that rely on OpenMP or OpenACC pragmas to move data between host and GPU and that attempt to use GPU-aware MPI communication. An example of such a code is Slate.


Required resources per job | New "simplified" way of requesting resources | Total allocated resources | Charge per hour | Use of resources (full explicit srun options are now required; only the 1st method for optimal binding is listed here)

1 CPU task (single CPU thread) controlling 1 GPU (*1)
Request:
    #SBATCH --nodes=1
    #SBATCH --gpus-per-node=1
Total allocated resources: 1 allocation pack = 1 GPU, 8 CPU cores (1 chiplet), 29.44 GB RAM
Charge per hour: 64 SU
Use:
    export OMP_NUM_THREADS=1
    srun -N 1 -n 1 -c 8 --gpus-per-node=1 --gpus-per-task=1

14 CPU threads all controlling the same 1 GPU (*2)
Request:
    #SBATCH --nodes=1
    #SBATCH --gpus-per-node=2
Total allocated resources: 2 allocation packs = 2 GPUs, 16 CPU cores (2 chiplets), 58.88 GB RAM
Charge per hour: 128 SU
Use:
    export OMP_NUM_THREADS=14
    srun -N 1 -n 1 -c 16 --gpus-per-node=1 --gpus-per-task=1

3 CPU tasks (single thread each), each controlling 1 GPU with GPU-aware MPI communication (*3)
Request:
    #SBATCH --nodes=1
    #SBATCH --gpus-per-node=3
Total allocated resources: 3 allocation packs = 3 GPUs, 24 CPU cores (3 chiplets), 88.32 GB RAM
Charge per hour: 192 SU
Use:
    export MPICH_GPU_SUPPORT_ENABLED=1
    export OMP_NUM_THREADS=1
    srun -N 1 -n 3 -c 8 --gpus-per-node=3 --gpus-per-task=1 --gpu-bind=closest

2 CPU tasks (single thread each), each task controlling 2 GPUs with GPU-aware MPI communication (*4)
Request:
    #SBATCH --nodes=1
    #SBATCH --gpus-per-node=4
Total allocated resources: 4 allocation packs = 4 GPUs, 32 CPU cores (4 chiplets), 117.76 GB RAM
Charge per hour: 256 SU
Use:
    export MPICH_GPU_SUPPORT_ENABLED=1
    export OMP_NUM_THREADS=1
    srun -N 1 -n 2 -c 16 --gpus-per-node=4 --gpus-per-task=2 --gpu-bind=closest

8 CPU tasks (single thread each), each controlling 1 GPU with GPU-aware MPI communication
Request:
    #SBATCH --nodes=1
    #SBATCH --exclusive
Total allocated resources: 8 allocation packs = 8 GPUs, 64 CPU cores (8 chiplets), 235 GB RAM
Charge per hour: 512 SU
Use:
    export MPICH_GPU_SUPPORT_ENABLED=1
    export OMP_NUM_THREADS=1
    srun -N 1 -n 8 -c 8 --gpus-per-node=8 --gpus-per-task=1 --gpu-bind=closest

Notes for the request of resources:

  • Note that this simplified way of resource request is based on requesting a number of "allocation packs".
  • Users should not include any other Slurm allocation option that may indicate some "calculation" of required memory or CPU cores. The management of resources should only be performed after allocation via srun options.
  • The same simplified resource request should be used for the request of interactive sessions with salloc.
  • IMPORTANT: In addition to the request parameters shown in the table, users should still use other Slurm request parameters related to partition, walltime, job naming, output, email, etc. (Check the examples of the full Slurm batch scripts.)

Notes for the use/management of resources with srun:

  • IMPORTANT: The use of --gpu-bind=closest may NOT work for codes that rely on OpenMP or OpenACC pragmas to move data between host and GPU and that attempt to use GPU-aware MPI communication. For those cases, the use of the "manual" optimal binding (method 2) is required.
  • The --cpus-per-task (-c) option should be set to multiples of 8 (whole chiplets) to guarantee that srun will distribute the resources in "allocation packs", "reserving" whole chiplets per srun task, even if the real number of threads per task is 1. The real number of threads is controlled with the OMP_NUM_THREADS variable.
  • (*1) This is the only case where srun may work fine with default inherited option values. Nevertheless, it is a good practice to use full explicit options of srun to indicate the resources needed for the executable. In this case, the settings explicitly "reserve" a whole chiplet (-c 8) for the srun task and control the real number of threads with the OMP_NUM_THREADS variable.
  • (*2) The required number of CPU threads per task is 14, so two full chiplets (-c 16) are indicated for each srun task and the real number of threads is controlled with the OMP_NUM_THREADS variable.
  • (*3) The settings explicitly "reserve" a whole chiplet (-c 8) for each srun task. This provides "one-chiplet-long" separation among each of the CPU cores to be allocated for the tasks spawned by srun (-n 3).  The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of GPU to corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.
  • (*4) Note the use of -c 16 to "reserve" a "two-chiplets-long" separation between the two CPU cores that are to be used (one for each of the srun tasks, -n 2). In this way, each task is in direct communication with the two logical GPUs in the MI250X card that has optimal connection to those chiplets.

General notes:

  • The allocation charge is for the total of allocated resources and not only for the resources explicitly used in the execution, so any idle resources will also be charged.

...

Again, there is a difference in the values of the ROCR_VISIBLE_GPU_IDs in the results of the two methods. With the first method, these values are always 0, while in the second method they are the ones given by the wrapper that "manually" selects the GPUs. This difference has proven to be important and may be the reason why "manual" binding is the only option for codes that rely on OpenMP or OpenACC pragmas to move data between host and GPU and that attempt to use GPU-aware MPI communication.

Example scripts for: Exclusive access to the GPU nodes

In this section, a series of example Slurm batch scripts is presented for users to use as a starting point for preparing their own scripts. The examples presented here make use of most of the important concepts, tools and techniques explained in the previous section, so we encourage users to read that section first.

...



Method 1: Optimal binding using srun parameters

For optimal binding using srun parameters, the options --gpus-per-task and --gpu-bind=closest need to be used:



Listing N. exampleScript_2NodesExclusive_16GPUs_bindMethod1.sh
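A minimal sketch of what such a script might look like is given below. The partition name (gpu), the account placeholder (<yourProject>-gpu), the walltime and the test executable (./hello_jobstep) are assumptions for illustration only; the essential lines are the request of 2 exclusive nodes and the fully explicit srun with --gpus-per-task=1 and --gpu-bind=closest.

#!/bin/bash --login
#SBATCH --job-name=16GPUExclusive-bindMethod1
#SBATCH --partition=gpu              # assumed name of the GPU partition
#SBATCH --nodes=2                    # 2 whole GPU nodes
#SBATCH --exclusive                  # 8 allocation packs per node
#SBATCH --time=00:10:00
#SBATCH --account=<yourProject>-gpu  # placeholder project account

# GPU-aware MPI and one CPU thread per task
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1

# Fully explicit srun: 16 tasks (8 per node), one whole chiplet (-c 8) per task,
# one GPU per task, bound to the closest (directly connected) GPU
srun -N 2 -n 16 -c 8 --gpus-per-node=8 --gpus-per-task=1 --gpu-bind=closest ./hello_jobstep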


Now, let's take a look at the output after executing the script:



Terminal N. Output for 16 GPUs job (2 nodes), exclusive access


According to the architecture diagram, this binding configuration is optimal.

Method 1 may fail for some applications.

This first method is simpler, but may not work for all codes. "Manual" binding (method 2) may be the only reliable method for codes that rely on OpenMP or OpenACC pragmas to move data between host and GPU and that attempt to use GPU-aware MPI communication.

Click on the tab above to read the script and output for the other method of GPU binding.


Method 2: "Manual" optimal binding of GPUs and chiplets

For "manual" binding, two auxiliary techniques need to be performed: 1) use of a wrapper that selects the correct GPU and 2) generate an ordered list to be used in the --cpu-bind option of srun:



Listing N. exampleScript_2NodesExclusive_16GPUs_bindMethod2.sh
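A minimal sketch of the "manual" binding technique is given below, under the same assumptions as the previous sketch. The order of cores in the map_cpu list is purely illustrative; the correct order must be taken from the node architecture diagram so that each local task lands on the chiplet directly connected to its GPU.

#!/bin/bash --login
#SBATCH --job-name=16GPUExclusive-bindMethod2
#SBATCH --partition=gpu              # assumed name of the GPU partition
#SBATCH --nodes=2
#SBATCH --exclusive
#SBATCH --time=00:10:00
#SBATCH --account=<yourProject>-gpu  # placeholder project account

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1

# Technique 1: wrapper so that each task only "sees" its own GPU.
# Created with a here-document redirected into cat; the name includes
# SLURM_JOBID so it is unique to this job, and it is removed at the end.
cat << EOF > select_gpu_${SLURM_JOBID}.sh
#!/bin/bash
export ROCR_VISIBLE_DEVICES=\$SLURM_LOCALID
exec "\$@"
EOF
chmod +x ./select_gpu_${SLURM_JOBID}.sh

# Technique 2: ordered list of CPU cores for --cpu-bind (one core per chiplet,
# in the order that matches GPU connectivity). ILLUSTRATIVE ONLY: take the
# correct order from the node architecture diagram.
CPU_BIND="map_cpu:48,56,16,24,0,8,32,40"

srun -N 2 -n 16 -c 8 --gpus-per-node=8 --cpu-bind=${CPU_BIND} ./select_gpu_${SLURM_JOBID}.sh ./hello_jobstep

rm -f ./select_gpu_${SLURM_JOBID}.sh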


Note that the wrapper for selecting the GPUs is created with a redirection "trick" to the cat command. Also note that its name uses the SLURM_JOBID environment variable to make the wrapper unique to this job, and that the wrapper is deleted when execution has finished.

Now, let's take a look at the output after executing the script:



Terminal N. Output for 16 GPUs job (2 nodes), exclusive access


According to the architecture diagram, this binding configuration is optimal.

Click on the tab above to read the script and output for the other method of GPU binding.


Example scripts for: Shared access to the GPU nodes

Shared node 1 GPU job

Jobs that need only one GPU for their execution will share the GPU compute node with other jobs. That is, they will run in shared access, which is the default, so no request for exclusive access is needed. The following script is an example of a job requesting just 1 GPU:

...



Method 1: Optimal binding using srun parameters

For optimal binding using srun parameters, the options --gpus-per-task and --gpu-bind=closest need to be used:



Listing N. exampleScript_1NodeShared_3GPUs_bindMethod1.sh
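A minimal sketch of a shared-access, 3-GPU script using method 1 is given below, under the same assumptions as the previous sketches (partition, account, walltime and executable names are placeholders). The srun line follows the same pattern as the (*3) row of the table above.

#!/bin/bash --login
#SBATCH --job-name=3GPUShared-bindMethod1
#SBATCH --partition=gpu              # assumed name of the GPU partition
#SBATCH --nodes=1
#SBATCH --gpus-per-node=3            # 3 allocation packs, shared access
#SBATCH --time=00:10:00
#SBATCH --account=<yourProject>-gpu  # placeholder project account

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1

# One whole chiplet (-c 8) per task, one GPU per task, bound to the closest GPU
srun -N 1 -n 3 -c 8 --gpus-per-node=3 --gpus-per-task=1 --gpu-bind=closest ./hello_jobstep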


Now, let's take a look at the output after executing the script:



Terminal N. Output for 3 GPUs job, shared access. Method 1 for optimal binding.


The output of the hello_jobstep code tells us that the job ran on node nid001004 and that 3 MPI tasks were spawned. Each of the MPI tasks has only 1 CPU core assigned to it (through the use of the OMP_NUM_THREADS environment variable in the script), which can be identified by its HWT number. Also, each of the MPI tasks has only 1 visible GPU. The hardware identification of the GPU is done via the Bus_ID (as the other GPU IDs are not physical but relative to the job).

After checking the architecture diagram at the top of this page, it can be seen that each of the CPU cores assigned to the job is on a different L3 cache group chiplet (slurm-socket). More importantly, the GPU assigned to each of the MPI tasks is the one directly connected to that chiplet, so the binding is optimal:

  • CPU core "001" is on chiplet:0 and directly connected to GPU with Bus_ID:D1
  • CPU core "008" is on chiplet:1 and directly connected to GPU with Bus_ID:D6
  • CPU core "016" is on chiplet:2 and directly connected to GPU with Bus_ID:C9

According to the architecture diagram, this binding configuration is optimal.

Method 1 may fail for some applications.

This first method is simpler, but may not work for all codes. "Manual" binding (method 2) may be the only reliable method for codes that rely on OpenMP or OpenACC pragmas to move data between host and GPU and that attempt to use GPU-aware MPI communication.

Click on the tab above to read the script and output for the other method of GPU binding.


Method 2: "Manual" optimal binding of GPUs and chiplets

For "manual" binding, two auxiliary techniques need to be performed: 1) use of a wrapper that selects the correct GPU and 2) generate an ordered list to be used in the --cpu-bind option of srun:



Listing N. exampleScript_1NodeShared_3GPUs_bindMethod2.sh
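A minimal sketch of the "manual" binding variant for the shared 3-GPU job is given below, under the same assumptions. The map_cpu list shown would only be correct if the job were granted chiplets 0, 1 and 2; in practice it must be built from the chiplets directly connected to the GPUs actually allocated to the job.

#!/bin/bash --login
#SBATCH --job-name=3GPUShared-bindMethod2
#SBATCH --partition=gpu              # assumed name of the GPU partition
#SBATCH --nodes=1
#SBATCH --gpus-per-node=3            # 3 allocation packs, shared access
#SBATCH --time=00:10:00
#SBATCH --account=<yourProject>-gpu  # placeholder project account

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1

# Technique 1: wrapper (unique to this job) so each task only "sees" its own GPU
cat << EOF > select_gpu_${SLURM_JOBID}.sh
#!/bin/bash
export ROCR_VISIBLE_DEVICES=\$SLURM_LOCALID
exec "\$@"
EOF
chmod +x ./select_gpu_${SLURM_JOBID}.sh

# Technique 2: ordered list of CPU cores for --cpu-bind. ILLUSTRATIVE ONLY:
# the real list must match the chiplets connected to the GPUs granted to the job.
CPU_BIND="map_cpu:0,8,16"

srun -N 1 -n 3 -c 8 --gpus-per-node=3 --cpu-bind=${CPU_BIND} ./select_gpu_${SLURM_JOBID}.sh ./hello_jobstep

rm -f ./select_gpu_${SLURM_JOBID}.sh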


Note that the wrapper for selecting the GPUs is created with a redirection "trick" to the cat command. Also note that its name uses the SLURM_JOBID environment variable to make the wrapper unique to this job, and that the wrapper is deleted when execution has finished.

Now, let's take a look at the output after executing the script:



Terminal N. Output for 3 GPUs job, shared access. "Manual" method (method 2) for optimal binding.


The output of the hello_jobstep code tells us that the job ran on node nid001004 and that 3 MPI tasks were spawned. Each of the MPI tasks has only 1 CPU core assigned to it (through the use of the OMP_NUM_THREADS environment variable in the script), which can be identified by its HWT number. Also, each of the MPI tasks has only 1 visible GPU. The hardware identification of the GPU is done via the Bus_ID (as the other GPU IDs are not physical but relative to the job).

After checking the architecture diagram at the top of this page, it can be seen that each of the CPU cores assigned to the job is on a different L3 cache group chiplet (slurm-socket). More importantly, the GPU assigned to each of the MPI tasks is the one directly connected to that chiplet, so the binding is optimal:


  • CPU core "019" is on chiplet:2 and directly connected to GPU with Bus_ID:C9
  • CPU core "002" is on chiplet:0 and directly connected to GPU with Bus_ID:D1
  • CPU core "009" is on chiplet:1 and directly connected to GPU with Bus_ID:D6

According to the architecture diagram, this binding configuration is optimal.

Click on the tab above to read the script and output for the other method of GPU binding.


Example scripts for: Hybrid jobs (multiple threads) on the CPU side

When the code is hybrid on the CPU side (MPI + OpenMP), the logic is similar to the above examples, except that more than 1 CPU core per L3 cache chiplet (slurm-socket) needs to be accessible to each of the srun tasks. This is controlled with the OMP_NUM_THREADS environment variable and also implies a change in the settings for the optimal binding of resources when "manual" binding is applied.

...



Method 1: Optimal binding using srun parameters

For optimal binding using srun parameters, the options --gpus-per-task and --gpu-bind=closest need to be used:



Listing N. exampleScript_1NodeShared_Hybrid5CPU_3GPUs_bindMethod1.sh
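A minimal sketch of the hybrid (3 tasks, 5 CPU threads each) script using method 1 is given below, under the same assumptions as the previous sketches. Note that the srun line is the same as in the pure-MPI case; only OMP_NUM_THREADS changes.

#!/bin/bash --login
#SBATCH --job-name=Hybrid5CPU3GPUShared-bindMethod1
#SBATCH --partition=gpu              # assumed name of the GPU partition
#SBATCH --nodes=1
#SBATCH --gpus-per-node=3            # 3 allocation packs, shared access
#SBATCH --time=00:10:00
#SBATCH --account=<yourProject>-gpu  # placeholder project account

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=5             # 5 CPU threads per task

# Still one whole chiplet (-c 8) per task; the real number of threads
# is controlled by OMP_NUM_THREADS
srun -N 1 -n 3 -c 8 --gpus-per-node=3 --gpus-per-task=1 --gpu-bind=closest ./hello_jobstep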


Now, let's take a look at the output after executing the script:



Terminal N. Output for hybrid job with 3 tasks, each with 5 CPU threads and 1 GPU, shared access. Method 1 for optimal binding.


The output of the hello_jobstep code tells us that the job ran on node nid001004 and that 3 MPI tasks were spawned. Each of the MPI tasks has 5 CPU cores assigned to it (through the use of the OMP_NUM_THREADS environment variable in the script), which can be identified by their HWT numbers. Also, each of the threads has only 1 visible GPU. The hardware identification of the GPU is done via the Bus_ID (as the other GPU IDs are not physical but relative to the job).

After checking the architecture diagram at the top of this page, it can be seen that the CPU cores assigned to each of the MPI tasks are on a different L3 cache group chiplet (slurm-socket). More importantly, the GPU assigned to each of the MPI tasks is the one directly connected to that chiplet, so the binding is optimal.

Method 1 may fail for some applications.

This first method is simpler, but may not work for all codes. "Manual" binding (method 2) may be the only reliable method for codes that rely on OpenMP or OpenACC pragmas to move data between host and GPU and that attempt to use GPU-aware MPI communication.

Click on the tab above to read the script and output for the other method of GPU binding.


Method 2: "Manual" optimal binding of GPUs and chiplets


Use mask_cpu for hybrid jobs on the CPU side instead of map_cpu

For hybrid jobs on the CPU side, use mask_cpu for the --cpu-bind option and NOT map_cpu. Also, control the number of CPU threads per task with OMP_NUM_THREADS.

For "manual" binding, two auxiliary techniques need to be performed: 1) use of a wrapper that selects the correct GPU and 2) generate an ordered list to be used in the --cpu-bind option of srun. In this case, the list needs to be created using the mask_cpu parameter:



Listing N. exampleScript_1NodeShared_Hybrid5CPU_3GPUs_bindMethod2.sh
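A minimal sketch of the hybrid case with "manual" binding is given below, under the same assumptions. Note the use of mask_cpu instead of map_cpu: the hexadecimal masks shown (0xff, 0xff00, 0xff0000, i.e. chiplets 0, 1 and 2) are illustrative only and must correspond to the chiplets connected to the GPUs actually granted to the job.

#!/bin/bash --login
#SBATCH --job-name=Hybrid5CPU3GPUShared-bindMethod2
#SBATCH --partition=gpu              # assumed name of the GPU partition
#SBATCH --nodes=1
#SBATCH --gpus-per-node=3            # 3 allocation packs, shared access
#SBATCH --time=00:10:00
#SBATCH --account=<yourProject>-gpu  # placeholder project account

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=5             # 5 CPU threads per task

# Wrapper selecting one GPU per task (same technique as in the pure-MPI case)
cat << EOF > select_gpu_${SLURM_JOBID}.sh
#!/bin/bash
export ROCR_VISIBLE_DEVICES=\$SLURM_LOCALID
exec "\$@"
EOF
chmod +x ./select_gpu_${SLURM_JOBID}.sh

# For hybrid jobs the ordered list uses mask_cpu (one hexadecimal mask per task,
# each covering the 8 cores of one chiplet) instead of map_cpu.
# ILLUSTRATIVE ONLY: the correct masks depend on the chiplets actually allocated.
CPU_BIND="mask_cpu:0xff,0xff00,0xff0000"

srun -N 1 -n 3 -c 8 --gpus-per-node=3 --cpu-bind=${CPU_BIND} ./select_gpu_${SLURM_JOBID}.sh ./hello_jobstep

rm -f ./select_gpu_${SLURM_JOBID}.sh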


Note that the wrapper for selecting the GPUs is created with a redirection "trick" to the cat command. Also note that its name uses the SLURM_JOBID environment variable to make the wrapper unique to this job, and that the wrapper is deleted when execution has finished.

Now, let's take a look at the output after executing the script:



Terminal N. Output for hybrid job with 3 tasks, each with 5 CPU threads and 1 GPU, shared access. "Manual" method (method 2) for optimal binding.


The output of the hello_jobstep code tells us that the job ran on node nid001004 and that 3 MPI tasks were spawned. Each of the MPI tasks has 5 CPU cores assigned to it (through the use of the OMP_NUM_THREADS environment variable in the script), which can be identified by their HWT numbers. Also, each thread has only 1 visible GPU. The hardware identification of the GPU is done via the Bus_ID (as the other GPU IDs are not physical but relative to the job).

After checking the architecture diagram at the top of this page, it can be seen that the CPU cores assigned to each of the MPI tasks are on a different L3 cache group chiplet (slurm-socket). More importantly, the GPU assigned to each of the MPI tasks is the one directly connected to that chiplet, so the binding is optimal.

Click on the tab above to read the script and output for the other method of GPU binding.


Example scripts for: Packing GPU jobs

Pack the execution of 8 independent instances each using 1 GPU

...