Excerpt

Pawsey's way for requesting resources on GPU nodes (different to standard Slurm)

The way resources are requested for the GPU nodes has changed dramatically. The main reason for this change is Pawsey's effort to provide a method for optimal binding of the GPUs to the CPU cores that are in direct physical connection with them for each task. For this, we decided to completely separate the options used for resource request via salloc (or #SBATCH directives) and the options for the use of resources during execution of the code via srun.

Note

title	Request for the amount of "allocation-packs" required for the job

With a new CLI filter that Pawsey staff have put in place for the GPU nodes, the request of resources in GPU nodes should be thought of as requesting a number of "allocation-packs". Each "allocation-pack" provides:

1 whole CPU chiplet (8 CPU cores)
a little less than 32 GB of memory (29.44 GB, to be exact, leaving some memory for the system to operate the node) = 1/8 of the total available RAM
1 GCD directly connected to that chiplet

For that, the request of resources only needs the number of nodes (--nodes, -N) and the number of allocation-packs per node (--gres=gpu:number). The total number of allocation-packs requested is the product of these two parameters. Note that the standard Slurm meaning of the second parameter IS NOT used at Pawsey. Instead, Pawsey's CLI filter interprets this parameter as:

the number of requested "allocation-packs" per node

Note that the "equivalent" option --gpus-per-node=number (which is also interpreted as the number of "allocation-packs" per node) is not recommended as we have found some bugs with its use.

Furthermore, in the request of resources, users should not indicate any other Slurm allocation option related to memory or CPU cores. Therefore, users should not use --ntasks, --cpus-per-task, --mem, etc. in the request headers of the script (#SBATCH directives), or in the request options given to salloc for interactive sessions. If, for some reason, the requirements for a job are indeed determined by the number of CPU cores or the amount of memory, then users should estimate the number of "allocation-packs" that cover their needs. The "allocation-pack" is the minimal unit of resources that can be managed, so all allocation requests should be multiples of this basic unit.
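As a rough aid for that estimate, the arithmetic can be sketched as a small shell function. The function name packs_needed is hypothetical (it is not a Pawsey tool); the figures of 8 CPU cores and 29.44 GB per allocation-pack are the ones listed above.

```shell
#!/bin/bash
# Hypothetical helper (not a Pawsey tool): estimate how many allocation-packs
# cover a given number of CPU cores and a given amount of memory (in GB).
# Assumes 8 CPU cores and 29.44 GB of memory per allocation-pack.
packs_needed() {
    local cores=$1 mem_gb=$2
    awk -v c="$cores" -v m="$mem_gb" 'BEGIN {
        cores_per_pack = 8
        mem_per_pack = 29.44
        # Round up the number of packs needed to cover the CPU cores
        by_cores = int((c + cores_per_pack - 1) / cores_per_pack)
        # Round up the number of packs needed to cover the memory
        by_mem = int(m / mem_per_pack)
        if (by_mem * mem_per_pack < m) by_mem++
        # The request must satisfy both needs, with a minimum of 1 pack
        n = (by_cores > by_mem) ? by_cores : by_mem
        if (n < 1) n = 1
        print n
    }'
}

packs_needed 12 100   # 12 cores need 2 packs, 100 GB needs 4 packs: prints 4
```

The printed value would then be used as the number in --gres=gpu:number (for up to 8 allocation-packs on a single node).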

Pawsey also has some site-specific recommendations for the use/management of resources with the srun command. Users should explicitly provide a list of several parameters for the use of resources by srun. (The list of these parameters is made clear in the examples below.) Users should not assume that srun will inherit any of these parameters from the allocation request. Therefore, the real management of resources at execution time is performed by the command line options provided to srun. Note that, for the case of srun, options do have the standard Slurm meaning.

Warning

title	--gpu-bind=closest may NOT work for all applications

Within the full explicit srun options for "managing resources", there are some that help to achieve optimal binding of GPUs to their directly connected chiplet on the CPU. Together with the full explicit srun options, either of the following two methods can be used to achieve this optimal binding:

Include these two Slurm parameters: --gpus-per-task=<number> together with --gpu-bind=closest
"Manual" optimal binding with the use of "two auxiliary techniques" (explained later in the main document).

The first method is simpler, but may still produce execution errors for some codes. "Manual" binding may be the only useful method for codes relying on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and attempting to use GPU-to-GPU enabled MPI communication. An example of such a code is Slate.

The following table provides some examples that will serve as a guide for requesting resources in the GPU nodes (those interested in cases where multiple GPUs are accessible by 1 or more tasks, pay attention to cases 4, 5 & 7). Most of the examples in the table are for typical jobs where multiple GPUs are allocated to the job as a whole but each of the tasks spawned by srun is bound to, and has direct access to, only 1 GPU. For applications that require multiple GPUs per task, there are 3 examples (^*4, ^*5 & ^*7) where tasks are bound to multiple GPUs:

Required Resources per Job	New "simplified" way of requesting resources	Total Allocated resources	Charge per hour	The use of full explicit `srun` options is now required (only the `1st` method for optimal binding is listed here)
1 CPU task (single CPU thread) controlling 1 GCD (Slurm GPU)	`#SBATCH --nodes=1` `#SBATCH --gres=gpu:1`	1 allocation-pack = 1 GPU, 8 CPU cores (1 chiplet), 29.44 GB CPU RAM	64 SU	`^*1` `export OMP_NUM_THREADS=1` `srun -N 1 -n 1 -c 8 --gres=gpu:1 --gpus-per-task=1 --gpu-bind=closest <executable>`
1 CPU task (with 14 CPU threads each) all threads controlling the same 1 GCD	`#SBATCH --nodes=1` `#SBATCH --gres=gpu:2`	2 allocation-packs= 2 GPUs, 16 CPU cores (2 chiplets), 58.88 GB CPU RAM	128 SU	`^*2` `export OMP_NUM_THREADS=14` `srun -N 1 -n 1 -c 16 --gres=gpu:1 --gpus-per-task=1 --gpu-bind=closest <executable>`
3 CPU tasks (single thread each), each controlling 1 GCD with GPU-aware MPI communication	`#SBATCH --nodes=1` `#SBATCH --gres=gpu:3`	3 allocation-packs= 3 GPUs, 24 CPU cores (3 chiplets), 88.32 GB CPU RAM	192 SU	`^*3` `export MPICH_GPU_SUPPORT_ENABLED=1 export OMP_NUM_THREADS=1` `srun -N 1 -n 3 -c 8 --gres=gpu:3 --gpus-per-task=1 --gpu-bind=closest <executable>`
2 CPU tasks (single thread each), each task controlling 2 GCDs with GPU-aware MPI communication	`#SBATCH --nodes=1` `#SBATCH --gres=gpu:4`	4 allocation-packs= 4 GPU, 32 CPU cores (4 chiplets), 117.76 GB CPU RAM	256 SU	^*4 `export MPICH_GPU_SUPPORT_ENABLED=1 export OMP_NUM_THREADS=1` `srun -N 1 -n 2 -c 16 --gres=gpu:4 --gpus-per-task=2 --gpu-bind=closest <executable>`
5 CPU tasks (single thread each) all threads/tasks able to see all 5 GPUs	`#SBATCH --nodes=1` `#SBATCH --gres=gpu:5`	5 allocation-packs= 5 GPUs, 40 CPU cores (5 chiplets), 147.2 GB CPU RAM	320 SU	`^*5` `export MPICH_GPU_SUPPORT_ENABLED=1 export OMP_NUM_THREADS=1` `srun -N 1 -n 5 -c 8 --gres=gpu:5 <executable>`
8 CPU tasks (single thread each), each controlling 1 GCD with GPU-aware MPI communication	`#SBATCH --nodes=1` `#SBATCH --exclusive`	8 allocation-packs= 8 GPU, 64 CPU cores (8 chiplets), 235 GB CPU RAM	512 SU	^*6 `export MPICH_GPU_SUPPORT_ENABLED=1 export OMP_NUM_THREADS=1` `srun -N 1 -n 8 -c 8 --gres=gpu:8 --gpus-per-task=1 --gpu-bind=closest <executable>`
8 CPU tasks (single thread each), each controlling 4 GCD with GPU-aware MPI communication	`#SBATCH --nodes=4` `#SBATCH --exclusive`	32 allocation-packs= 4 nodes, each with: 8 GPU, 64 CPU cores (8 chiplets), 235 GB CPU RAM	2048 SU	^*7 `export MPICH_GPU_SUPPORT_ENABLED=1 export OMP_NUM_THREADS=1` `srun -N 4 -n 8 -c 32 --gres=gpu:8 --gpus-per-task=4 --gpu-bind=closest <executable>`
1 CPU task (single thread), controlling 1 GCD but avoiding other jobs to run in the same node for ideal performance test.	`#SBATCH --nodes=1` `#SBATCH --exclusive`	8 allocation-packs= 8 GPU, 64 CPU cores (8 chiplets), 235 GB CPU RAM	512 SU	^*8 `export OMP_NUM_THREADS=1` `srun -N 1 -n 1 -c 8 --gres=gpu:1 --gpus-per-task=1 --gpu-bind=closest <executable>`

Notes for the request of resources:

Note that this simplified way of resource request is based on requesting a number of "allocation-packs", so that standard use of Slurm parameters for allocation should not be used for GPU resources.
The --nodes (-N) option indicates the number of nodes requested to be allocated.
The --gres=gpu:number option indicates the number of allocation-packs requested to be allocated per node. (The "equivalent" option --gpus-per-node=number is not recommended as we have found some bugs with its use.)
The --exclusive option requests all the resources from the number of requested nodes. When this option is used, there is no need for the use of --gres=gpu:number during allocation and, indeed, its use is not recommended in this case.
Users should not include any other Slurm allocation option that may indicate some "calculation" of required memory or CPU cores. The management of resources should only be performed after allocation via srun options.
The same simplified resource request should be used for the request of interactive sessions with salloc.
IMPORTANT: In addition to the request parameters shown in the table, users should indeed use other Slurm request parameters related to partition, walltime, job naming, output, email, etc. (Check the examples of the full Slurm batch scripts.)
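The rule about not mixing CPU or memory options into the request header can be checked mechanically. The following is only a minimal sketch: the function name check_gpu_request is hypothetical (it is not a Pawsey tool), and it only looks at the long form of the options on #SBATCH lines, so srun lines in the same script are not affected.

```shell
#!/bin/bash
# Hypothetical helper (not a Pawsey tool): warn when the request header of a
# batch script contains CPU or memory options, which should not be used when
# requesting allocation-packs on the GPU nodes.
# Assumption: only long-form options on "#SBATCH" lines are inspected.
check_gpu_request() {
    local script=$1
    # Matches --ntasks*, --cpus-per-task and --mem* on #SBATCH lines only
    if grep -n -E -e '^#SBATCH[[:space:]]+--(ntasks|cpus-per-task|mem)' "$script"; then
        echo "Found CPU/memory options in the request header of ${script}." >&2
        echo "Remove them and request allocation-packs only." >&2
        return 1
    fi
    echo "OK: ${script} requests resources only through allocation-packs"
}
```

It would be called as `check_gpu_request myJobScript.sh` (the script name is a placeholder) before submitting with sbatch.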

Notes for the use/management of resources with srun:

Note that, for the case of srun, options do have the standard Slurm meaning.
The following options need to be explicitly provided to srun and not assumed to be inherited with some default value from the allocation request:
- The --nodes (-N) option indicates the number of nodes to be used by the srun step.
- The --ntasks (-n) option indicates the total number of tasks to be spawned by the srun step. By default, tasks are spawned evenly across the number of allocated nodes.
- The --cpus-per-task (-c) option should be set to multiples of 8 (whole chiplets) to guarantee that srun distributes the resources in "allocation-packs", "reserving" whole chiplets per srun task even if the real number of threads per task is 1. The real number of threads is controlled with the OMP_NUM_THREADS environment variable.
- The --gres=gpu:number option indicates the number of GPUs per node to be used by the srun step. (The "equivalent" option --gpus-per-node=number is not recommended as we have found some bugs with its use.)
- The --gpus-per-task option indicates the number of GPUs to be bound to each task spawned by the srun step via the -n option. Note that this option prevents the GPUs assigned to a task from being shared with other tasks. (See cases ^*4, ^*5 and ^*7 and their notes for non-intuitive cases.)
And for optimal binding, the following should be used:
- The --gpu-bind=closest option indicates that the GPUs chosen to be bound to each task should be the optimal ones (physically closest) for the chiplet assigned to each task.
- IMPORTANT: The use of --gpu-bind=closest will assign optimal binding but may still NOT work and may produce execution errors for codes relying on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and attempting to use GPU-to-GPU enabled MPI communication. For those cases, the use of the "manual" optimal binding (method 2) is required. Method 2 is explained later in the main document.
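For the common pattern where every task gets the same number of GPUs, the explicit srun options follow from simple arithmetic: 8 cores per chiplet times the GPUs per task gives -c, and the tasks times GPUs per task, divided by the nodes, gives the value for --gres=gpu. The sketch below only prints the option string; the function name build_srun_opts is hypothetical, and it assumes tasks are spread evenly across nodes and that the first binding method (--gpu-bind=closest) is wanted. It does not cover cases such as ^*2 or ^*5 in the table.

```shell
#!/bin/bash
# Hypothetical helper (not a Pawsey tool): print explicit srun options for a
# job step where each task is bound to the same number of GPUs.
# Assumptions: tasks are evenly spread across nodes; 8 CPU cores per chiplet;
# one chiplet is "reserved" per GPU assigned to a task.
build_srun_opts() {
    local nodes=$1 ntasks=$2 gpus_per_task=$3
    local cpus_per_task=$(( 8 * gpus_per_task ))
    local gpus_total=$(( ntasks * gpus_per_task ))
    if (( gpus_total % nodes != 0 )); then
        echo "Error: ${gpus_total} GPUs cannot be split evenly across ${nodes} node(s)" >&2
        return 1
    fi
    local gpus_per_node=$(( gpus_total / nodes ))
    echo "-N ${nodes} -n ${ntasks} -c ${cpus_per_task} --gres=gpu:${gpus_per_node} --gpus-per-task=${gpus_per_task} --gpu-bind=closest"
}

build_srun_opts 1 2 2   # same options as case 4 in the table
build_srun_opts 4 8 4   # same options as case 7 in the table
```

The printed string can be compared against the table rows, or pasted after srun in a job script followed by the executable.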

(^*1) This is the only case where srun may work fine with default inherited option values. Nevertheless, it is good practice to always use full explicit options of srun to indicate the resources needed for the executable. In this case, the settings explicitly "reserve" a whole chiplet (-c 8) for the srun task and control the real number of threads with the OMP_NUM_THREADS environment variable. Although the use of gres=gpu, gpus-per-task & gpu-bind is redundant in this case, we keep them to encourage their use, which is strictly needed in most cases (except case ^*5).
(^*2) The required number of CPU threads per task is 14, and that is controlled with the OMP_NUM_THREADS environment variable. Still, two full chiplets (-c 16) are indicated for the srun task.
(^*3) The settings explicitly "reserve" a whole chiplet (-c 8) for each srun task. This provides "one-chiplet-long" separation among each of the CPU cores to be allocated for the tasks spawned by srun (-n 3). The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of GPU to corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.
(^*4) Each task needs to be in direct communication with 2 GCDs. For that, each of the CPU tasks reserves "two-full-chiplets". IMPORTANT: The use of -c 16 "reserves" a "two-chiplets-long" separation between the two CPU cores that are to be used (one for each of the srun tasks, -n 2). In this way, each task will be in direct communication with the two logical GPUs in the MI250X card that has optimal connection to the chiplets reserved for each task. The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of GPU to corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.
(^*5) Sometimes, the executable (and not the scheduler) performs all the management of GPUs, as in the case of Tensorflow distributed training and other Machine Learning applications. If all the management logic for the GPUs is performed by the executable, then all the available resources should be exposed to it. IMPORTANT: In this case, the --gpu-bind option should not be provided. Nor should the --gpus-per-task option be provided, as all the available GPUs are to be available to all tasks. The real number of threads is controlled with the OMP_NUM_THREADS variable. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1. These last two settings may not be necessary for applications like Tensorflow.
(^*6) All GPUs in the node are requested, which means all the resources available in the node, via the --exclusive allocation option (there is no need to indicate the number of GPUs per node when using exclusive allocation). The use of -c 8 provides "one-chiplet-long" separation among each of the CPU cores to be allocated for the tasks spawned by srun (-n 8). The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of GPU to corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.
(^*7) All resources in each node are requested via the --exclusive allocation option (there is no need to indicate the number of GPUs per node when using exclusive allocation). Each task needs to be in direct communication with 4 GCDs. For that, each of the CPU tasks reserves "four-full-chiplets". IMPORTANT: The use of -c 32 "reserves" a "four-chiplets-long" separation between the two CPU cores that are to be used per node (8 srun tasks in total, -n 8). The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of GPU to corresponding chiplet is indicated with the option --gpu-bind=closest. In this way, each task will be in direct communication with the closest four logical GPUs in the node with respect to the chiplets reserved for each task. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1. The --gres=gpu:8 option assigns 8 GPUs per node to the srun step (32 GPUs in total as 4 nodes are being assigned).
(^*8) All GPUs in the node are requested using the --exclusive option, but only 1 CPU chiplet - 1 GPU "unit" (or allocation-pack) is used in the srun step.

General notes:

The allocation charge is for the total of allocated resources and not for the ones that are explicitly used in the execution, so all idle resources will also be charged.
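The charge can be estimated before submitting. The sketch below assumes the rate of 64 SU per allocation-pack per hour shown in the table above and takes the walltime in minutes; the function name su_estimate is hypothetical (it is not a Pawsey tool), and the actual accounting may differ in rounding.

```shell
#!/bin/bash
# Hypothetical helper (not a Pawsey tool): estimate the charge of a GPU job.
# Assumption: 64 SU per allocation-pack per hour, as listed in the table.
# For --exclusive jobs, use 8 allocation-packs per node.
su_estimate() {
    local nodes=$1 packs_per_node=$2 minutes=$3
    awk -v n="$nodes" -v p="$packs_per_node" -v m="$minutes" \
        'BEGIN { printf "%.1f\n", n * p * 64 * m / 60 }'
}

su_estimate 4 8 60   # 4 exclusive nodes for 1 hour: prints 2048.0
```

Note that case 8 in the table is charged for 8 allocation-packs even though the srun step uses only one of them.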

Note that the examples above are just for quick reference and that they do not show the use of the 2nd method for optimal binding (which may be the only way to achieve optimal binding for some applications). So, the rest of this page will describe in detail both methods of optimal binding and also show full job script examples for their use on Setonix GPU nodes.

Methods to achieve optimal binding of GCDs/GPUs

As mentioned above and as the node diagram at the top of the page suggests, the optimal placement of GCDs and CPU cores for each task is to have direct communication between the CPU chiplet and the GCD in use. So, according to the node diagram, tasks being executed in cores in Chiplet 0 should be using GPU 4 (Bus D1), tasks in Chiplet 1 should be using GPU 5 (Bus D6), etc.

...

Some applications, like Tensorflow and other Machine Learning applications, may require access to all the available GPUs in the node. In this case, the optimal binding and communication cannot be granted by the scheduler when assigning resources to the srun launcher. Then, the full responsibility for the optimal use of the resources rests with the code itself.

...

Column

width	900px

Code Block

language	bash
theme	Emacs
title	Listing N. exampleScript_2NodesExclusive_16GPUs_8VisiblePerTask.sh
linenumbers	true

#!/bin/bash --login
#SBATCH --job-name=16GPUExclusiveNode-8GPUsVisiblePerTask
#SBATCH --partition=gpu
#SBATCH --nodes=2              #2 nodes in this example 
#SBATCH --exclusive            #All resources of the node are exclusive to this job
#                              #8 GPUs per node (16 "allocation packs" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=<yourProject>-gpu #IMPORTANT: use your own project and the -gpu suffix

#----
#Tricks used only for the development/debugging of the script.
#This section should be commented or removed from a proper production script.
shopt -s expand_aliases
alias "generate_CPU_BIND.sh"="$MYSOFTWARE/pawseytools/generate_CPU_BIND.sh"
alias

#----
#Loading needed modules (these may not be needed for Tensorflow) (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm craype-accel-amd-gfx90a
echo -e "\n\n#------------------------#"
module list

#----
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}

#----
#Definition of the executable (we assume the example code has been compiled and is available in $MYSCRATCH):
exeDir=$MYSCRATCH/hello_jobstep
exeName=hello_jobstep
theExe=$exeDir/$exeName

#----
#MPI & OpenMP settings if needed (these may not be needed for Tensorflow):
export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1           #This controls the real CPU-cores per task for the executable

#----
#Execution
#Note: srun needs the explicit indication of full parameters for use of resources in the job step.
#      These are independent from the allocation parameters (which are not inherited by srun)
#      No optimal binding is provided by the scheduler.
#      Therefore, "--gpus-per-task" and "--gpu-bind" are not used.
#      Each task has access to all the 8 available GPUs in the node where it's running.
#      Optimal use of resources is now the responsibility of the code.
echo -e "\n\n#------------------------#"
echo "Test code execution:"
srun -l -u -N 2 -n 16 -c 8 --gres=gpu:8 ${theExe} | sort -n

#----
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished job steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20

#----
#Done
echo -e "\n\n#------------------------#"
echo "Done"

...

Column

width	900px

Code Block

language	bash
theme	DJango
title	Terminal N. Output for a 16 GPU job with 16 tasks, each task accessing the 8 GPUs in its running node

$ sbatch exampleScript_2NodesExclusive_16GPUs_8VisiblePerTask.sh
Submitted batch job 7798215

$ cat slurm-7798215.out
...
#------------------------#
Test code execution:
 0: MPI 000 - OMP 000 - HWT 001 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
 1: MPI 001 - OMP 000 - HWT 008 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
 2: MPI 002 - OMP 000 - HWT 016 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
 3: MPI 003 - OMP 000 - HWT 024 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
 4: MPI 004 - OMP 000 - HWT 032 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
 5: MPI 005 - OMP 000 - HWT 040 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
 6: MPI 006 - OMP 000 - HWT 049 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
 7: MPI 007 - OMP 000 - HWT 056 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
 8: MPI 008 - OMP 000 - HWT 000 - Node nid002946 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
 9: MPI 009 - OMP 000 - HWT 008 - Node nid002946 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
10: MPI 010 - OMP 000 - HWT 016 - Node nid002946 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
11: MPI 011 - OMP 000 - HWT 025 - Node nid002946 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
12: MPI 012 - OMP 000 - HWT 032 - Node nid002946 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
13: MPI 013 - OMP 000 - HWT 040 - Node nid002946 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
14: MPI 014 - OMP 000 - HWT 048 - Node nid002946 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
15: MPI 015 - OMP 000 - HWT 056 - Node nid002946 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
...
#------------------------#
Done

...

More importantly for this example, each of the MPI tasks has access to the 8 GCDs (logical/Slurm GPUs) in its node. Proper and optimal GPU management and communication is the responsibility of the code. The hardware identification is done via the Bus_ID (as the other GPU_IDs are not physical but relative to the job).

Shared nodes: Many GPUs requested but 2 GPUs bound to each task

Some applications may require that each of the spawned tasks has access to multiple GPUs. In this case, some optimal binding and communication can still be granted by the scheduler when assigning resources to the srun launcher, although final responsibility for the optimal use of the resources in each task rests with the code itself.

As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. This example considers a job that will make use of 6 GCDs (logical/Slurm GPUs) on 1 node (6 "allocation-packs" in total). The resource request uses the following two parameters:

#SBATCH --nodes=1 #1 node in this example
#SBATCH --gres=gpu:6 #6 GPUs per node (6 "allocation packs" in total for the job)

Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header.

The use/management of the allocated resources is controlled by the srun options and some environmental variables. As mentioned above, some optimal binding can still be achieved by the scheduler providing 2 GPUs to each of the tasks:

Column

width	900px

Code Block

language	bash
theme	Emacs
title	Listing N. exampleScript_1NodeShared_6GPUs_2VisiblePerTask.sh
linenumbers	true

#!/bin/bash --login
#SBATCH --job-name=6GPUSharedNode-2GPUsVisiblePerTask
#SBATCH --partition=gpu
#SBATCH --nodes=1              #1 node in this example
#SBATCH --gres=gpu:6           #6 GPUs per node (6 "allocation packs" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=<yourProject>-gpu #IMPORTANT: use your own project and the -gpu suffix

#----
#Loading needed modules (these may not be needed for Tensorflow) (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm craype-accel-amd-gfx90a
echo -e "\n\n#------------------------#"
module list

#----
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}

#----
#Definition of the executable (we assume the example code has been compiled and is available in $MYSCRATCH):
exeDir=$MYSCRATCH/hello_jobstep
exeName=hello_jobstep
theExe=$exeDir/$exeName

#----
#MPI & OpenMP settings if needed (these may not be needed for Tensorflow):
export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1           #This controls the real CPU-cores per task for the executable

#----
#Execution
#Note: srun needs the explicit indication of full parameters for use of resources in the job step.
#      These are independent from the allocation parameters (which are not inherited by srun)
#      For best possible GPU binding using slurm options,
#      "--gpus-per-task=2" and "--gpu-bind=closest" will provide the best GPUs to the tasks.
#      But best is still not optimal.
#      Each task has access to 2 available GPUs in the node where it's running.
#      Optimal use of resources of each of the 2 GPUs accessible per task is now the responsibility of the code.
#      IMPORTANT: Note the use of "-c 16" to "reserve" 2 chiplets per task and be consistent with
#                 the use of "--gpus-per-task=2" to "reserve" 2 GPUs per task
echo -e "\n\n#------------------------#"
echo "Test code execution:"
srun -l -u -N 1 -n 3 -c 16 --gres=gpu:6 --gpus-per-task=2 --gpu-bind=closest ${theExe} | sort -n

#----
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished job steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20

#----
#Done
echo -e "\n\n#------------------------#"
echo "Done"

And the output after executing this example is:

Column

width	900px

Code Block

language	bash
theme	DJango
title	Terminal N. Output for a 6 GPU job with 3 tasks and 2 GPUs per task

$ sbatch exampleScript_1NodeShared_6GPUs_2VisiblePerTask.sh
Submitted batch job 7842635

$ cat slurm-7842635.out
...
#------------------------#
Test code execution:
0: MPI 000 - OMP 000 - HWT 000 - Node nid002948 - RunTime_GPU_ID 0,1 - ROCR_VISIBLE_GPU_ID 0,1 - GPU_Bus_ID d1,d6
1: MPI 001 - OMP 000 - HWT 016 - Node nid002948 - RunTime_GPU_ID 0,1 - ROCR_VISIBLE_GPU_ID 0,1 - GPU_Bus_ID c9,ce
2: MPI 002 - OMP 000 - HWT 032 - Node nid002948 - RunTime_GPU_ID 0,1 - ROCR_VISIBLE_GPU_ID 0,1 - GPU_Bus_ID d9,de
...
#------------------------#
Done

The output of the hello_jobstep code tells us that the job ran 3 MPI tasks on node nid002948. Each of the MPI tasks has only 1 CPU-core assigned to it (with the use of the OMP_NUM_THREADS environment variable in the script) and can be identified with the HWT number. Clearly, each of the CPU tasks runs on a different chiplet. But more importantly, the spacing of the tasks is every 16 cores (two chiplets), thanks to the "-c 16" setting in the srun command, allowing for the best binding of the 2 GPUs assigned to each task.

Also important for this example, each of the MPI tasks has access to 2 GCDs (logical/Slurm GPUs) in its node. The hardware identification is done via the Bus_ID (as the other GPU_IDs are not physical but relative to the job). The assigned GPUs are indeed the 2 closest to the CPU core. Final proper and optimal GPU management and communication is the responsibility of the code.
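The reading of the output described above (core number to chiplet, plus the GPUs seen by each task) can be automated with a small filter. This is only a sketch: the function name report_chiplets is hypothetical, it assumes the hello_jobstep line layout shown in the terminal output above, and it assumes 8 CPU cores per chiplet numbered consecutively.

```shell
#!/bin/bash
# Hypothetical helper (not a Pawsey tool): summarise hello_jobstep output.
# For each line, report the MPI rank, the CPU core (HWT), the chiplet that
# core belongs to (assuming 8 consecutive cores per chiplet) and the Bus IDs
# of the GPUs visible to the task.
report_chiplets() {
    awk '/HWT/ {
        for (i = 1; i < NF; i++) {
            if ($i == "MPI") rank = $(i + 1)
            if ($i == "HWT") hwt = $(i + 1)
            if ($i == "GPU_Bus_ID") bus = $(i + 1)
        }
        printf "task %d: core %d -> chiplet %d, GPUs %s\n", rank, hwt, int(hwt / 8), bus
    }'
}

# Example with the three lines from the output above
report_chiplets <<'EOF'
0: MPI 000 - OMP 000 - HWT 000 - Node nid002948 - RunTime_GPU_ID 0,1 - ROCR_VISIBLE_GPU_ID 0,1 - GPU_Bus_ID d1,d6
1: MPI 001 - OMP 000 - HWT 016 - Node nid002948 - RunTime_GPU_ID 0,1 - ROCR_VISIBLE_GPU_ID 0,1 - GPU_Bus_ID c9,ce
2: MPI 002 - OMP 000 - HWT 032 - Node nid002948 - RunTime_GPU_ID 0,1 - ROCR_VISIBLE_GPU_ID 0,1 - GPU_Bus_ID d9,de
EOF
```

For the three tasks above, the filter reports chiplets 0, 2 and 4, which makes the two-chiplet spacing explicit. The reported Bus IDs still need to be compared against the node diagram to confirm that they are the GPUs closest to each chiplet.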

Example scripts for: Packing GPU jobs

...

Version	Old Version 195	New Version 196
Changes made by	Alexis Espinosa	Alexis Espinosa
Saved on	Feb 01, 2024	Feb 01, 2024
