...

Section


Column

Figure 1. GPU node architecture. Note that the GPUs shown here are equivalent to a GCD (more information about this is in the Setonix General Information page).


Each GPU node has 4 MI250X GPU cards, and each card has 2 Graphics Compute Dies (GCDs), which are seen as 2 logical GPUs; so each GPU node has 8 GCDs, which are equivalent to 8 Slurm GPUs. On the other hand, the single AMD CPU chip has 64 cores organised in 8 groups that share the same L3 cache. Each of these L3 cache groups (or chiplets) has a direct Infinity Fabric connection with just one of the GCDs, providing optimal bandwidth. Each chiplet can communicate with the other GCDs, albeit at a lower bandwidth due to the additional communication hops. (In the examples explained in the rest of this document, we use the numbering of the cores and the bus IDs of the GCDs to identify the allocated chiplets and GCDs, and their binding.)
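
For instance, one quick way to list the bus IDs of the GCDs from a GPU node is with the ROCm SMI tool (a sketch only; the exact output format varies with the ROCm version):

Code Block
languagebash
themeDJango
titleTerminal N. Listing the bus IDs of the GCDs (sketch)
$ rocm-smi --showbus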

...

Note
titleUse this for GPU-aware MPI codes

To use GPU-aware Cray MPICH, users must load the following modules and set the following environment variable:

module load craype-accel-amd-gfx90a
module load rocm/<VERSION>
export MPICH_GPU_SUPPORT_ENABLED=1
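
As a minimal sketch (assuming a batch script whose executable, here the hypothetical myGpuAwareExe, was built against GPU-aware Cray MPICH), these settings simply precede the srun call:

Code Block
languagebash
themeEmacs
titleListing N. Enabling GPU-aware MPI before srun (sketch)
module load craype-accel-amd-gfx90a
module load rocm/<VERSION>
export MPICH_GPU_SUPPORT_ENABLED=1
#myGpuAwareExe is a hypothetical executable built with GPU-aware Cray MPICH:
srun -N 1 -n 3 -c 8 --gres=gpu:3 --gpus-per-task=1 --gpu-bind=closest ./myGpuAwareExe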

...

Column
width900px


Code Block
languagebash
themeDJango
titleTerminal N. Explaining the use of the "hello_jobstep" code from an salloc session (compiling)
$ cd $MYSCRATCH
$ git clone https://github.com/PawseySC/hello_jobstep.git
Cloning into 'hello_jobstep'...
...
Resolving deltas: 100% (41/41), done.
$ cd hello_jobstep


$ module load PrgEnv-cray craype-accel-amd-gfx90a rocm/<VERSION>
$ make hello_jobstep
CC -std=c++11 -fopenmp --rocm-path=/opt/rocm -x hip -D__HIP_ARCH_GFX90A__=1 --offload-arch=gfx90a -I/opt/rocm/include -c hello_jobstep.cpp
CC -fopenmp --rocm-path=/opt/rocm -L/opt/rocm/lib -lamdhip64 hello_jobstep.o -o hello_jobstep
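
Once compiled, a quick interactive check can be performed from the same salloc session (a sketch; it assumes the session already holds 1 "allocation-pack" on the gpu partition):

Code Block
languagebash
themeDJango
titleTerminal N. Running hello_jobstep interactively (sketch)
$ srun -l -u -N 1 -n 1 -c 8 --gres=gpu:1 --gpus-per-task=1 --gpu-bind=closest ./hello_jobstep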


...

Column
width900px


Code Block
languagebash
themeEmacs
titleListing N. exampleScript_1NodeShared_1GPU.sh
linenumberstrue
#!/bin/bash --login
#SBATCH --job-name=1GPUSharedNode
#SBATCH --partition=gpu
#SBATCH --nodes=1              #1 node in this example 
#SBATCH --gres=gpu:1           #1 GPU per node (1 "allocation-pack" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=<yourProject>-gpu #IMPORTANT: use your own project and the -gpu suffix
#(Note that there is no request for exclusive access to the node)

#----
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm/<VERSION> craype-accel-amd-gfx90a
echo -e "\n\n#------------------------#"
module list

#----
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}

#----
#Definition of the executable (we assume the example code has been compiled and is available in $MYSCRATCH):
exeDir=$MYSCRATCH/hello_jobstep
exeName=hello_jobstep
theExe=$exeDir/$exeName

#----
#MPI & OpenMP settings
#Not needed for 1GPU:export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1           #This controls the real CPU-cores per task for the executable

#----
#Execution
#Note: srun needs the explicit indication of the full parameters for the use of resources in the job step.
#      These are independent of the allocation parameters (which are not inherited by srun).
#      For optimal GPU binding using slurm options,
#      "--gpus-per-task=1" and "--gpu-bind=closest" create the optimal binding of GPUs      
#      (Although in this case this can be avoided as only 1 "allocation-pack" has been requested)
echo -e "\n\n#------------------------#"
echo "Test code execution:"
srun -l -u -N 1 -n 1 -c 8 --gres=gpu:1 --gpus-per-task=1 --gpu-bind=closest ${theExe} | sort -n

#----
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished jobs steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20

#----
#Done
echo -e "\n\n#------------------------#"
echo "Done"

And the output after executing this example is:

Column
width900px


Code Block
languagebash
themeDJango
titleTerminal N. Output for a 1 GPU job (using only 1 allocation-pack in a shared node)
$ sbatch exampleScript_1NodeShared_1GPU.sh
Submitted batch job 323098

$ cat slurm-323098.out
...
#------------------------#
Test code execution:
0: MPI 000 - OMP 000 - HWT 002 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d1
...
#------------------------#
Done

The output of the hello_jobstep code tells us that the CPU-core "002" and the GPU with Bus_ID:D1 were utilised by the job. Optimal binding is guaranteed for a single "allocation-pack", as the memory, the CPU chiplet and the GPU within a pack are optimally connected.

Shared node: 3 MPI tasks each controlling 1 GCD (logical/Slurm GPU)

As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. In this case we ask for 3 allocation-packs with:

#SBATCH --nodes=1           #1 node in this example 
#SBATCH --gres=gpu:3        #3 GPUs per node (3 "allocation-packs" in total for the job)

Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header.

The use/management of the allocated resources is controlled by the srun options and some environment variables. As mentioned above, there are two methods for achieving optimal binding. The method that uses only srun parameters is preferred (method 1), but may not always work; in that case, the "manual" method (method 2) may be needed. The scripts for the two methods of optimal binding are shown in the following tabs:

Ui tabs
Ui tab
titleC. Method 1: Optimal binding using srun parameters

For optimal binding using srun parameters, the options "--gpus-per-task" & "--gpu-bind=closest" need to be used:

Column
width900px

Code Block
languagebash
themeEmacs
titleListing N. exampleScript_1NodeShared_3GPUs_bindMethod1.sh
linenumberstrue

Now, let's take a look at the output after executing the script:

Column
width900px

Code Block
languagebash
themeDJango
titleTerminal N. Output for 3 GPUs job shared access. Method 1 for optimal binding.

The output of the hello_jobstep code tells us that the job ran on node nid001004 and that 3 MPI tasks were spawned. Each of the MPI tasks has only 1 CPU-core assigned to it (controlled with the OMP_NUM_THREADS environment variable in the script), which can be identified with the HWT number. Also, each of the MPI tasks has only 1 visible GCD (logical/Slurm GPU). The hardware identification of the GPU is done via the Bus_ID (as the other GPU_IDs are not physical but relative to the job).

After checking the architecture diagram at the top of this page, it can be clearly seen that each of the assigned CPU-cores for the job is on a different L3 cache group chiplet (slurm-socket). But more importantly, it can be seen that the binding is optimal:

  • CPU core "001" is on chiplet:0 and directly connected to GCD (logical GPU) with Bus_ID:D1
  • CPU core "008" is on chiplet:1 and directly connected to GCD (logical GPU) with Bus_ID:D6
  • CPU core "016" is on chiplet:2 and directly connected to GCD (logical GPU) with Bus_ID:C9

According to the architecture diagram, this binding configuration is optimal.

...

This first method is simpler, but may not work for all codes. "Manual" binding (method 2) may be the only reliable method for codes that rely on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and that attempt to use GPU-to-GPU enabled MPI communication.

...

Ui tab
titleC. Method 2: "Manual" optimal binding of GPUs and chiplets

For "manual" binding, two auxiliary techniques need to be used: 1) a wrapper for the selection of the GPUs and 2) an ordered list for the --cpu-bind option of srun:

Column
width900px

Code Block
languagebash
themeEmacs
titleListing N. exampleScript_1NodeShared_3GPUs_bindMethod2.sh
linenumberstrue

Note that the wrapper for selecting the GCDs (logical/Slurm GPUs) is created with a redirection to the cat command. Also note that its name uses the SLURM_JOBID environment variable to make the wrapper unique to this job, and that the wrapper is deleted when execution is finalised.
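
The full listing is not reproduced here, but the core of the "manual" method can be sketched as follows (a sketch only; the wrapper follows the description above, while the core numbers in the map_cpu list are hypothetical and must match the chiplets paired with the allocated GCDs):

Code Block
languagebash
themeEmacs
titleListing N. Core of the "manual" binding method (sketch)
#Wrapper for the manual selection of the GCDs (unique to this job thanks to SLURM_JOBID):
cat << EOF > selectGPU_${SLURM_JOBID}.sh
#!/bin/bash
export ROCR_VISIBLE_DEVICES=\$SLURM_LOCALID
exec \$*
EOF
chmod +x ./selectGPU_${SLURM_JOBID}.sh

#Ordered list of CPU cores (hypothetical: one core per chiplet paired to each allocated GCD):
CPU_BIND="map_cpu:2,9,19"

#Each task starts through the wrapper, which restricts its visible GCD:
srun -l -u -N 1 -n 3 -c 8 --gres=gpu:3 --cpu-bind=${CPU_BIND} ./selectGPU_${SLURM_JOBID}.sh ${theExe}

#Delete the wrapper when execution is finalised:
rm -f ./selectGPU_${SLURM_JOBID}.sh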


Now, let's take a look at the output after executing the script:

Column
width900px

Code Block
languagebash
themeDJango
titleTerminal N. Output for 3 GPUs job shared access. "Manual" method (method 2) for optimal binding.


The output of the hello_jobstep code tells us that the job ran on node nid001004 and that 3 MPI tasks were spawned. Each of the MPI tasks has only 1 CPU-core assigned to it (controlled with the OMP_NUM_THREADS environment variable in the script), which can be identified with the HWT number. Also, each of the MPI tasks has only 1 visible GCD (logical/Slurm GPU). The hardware identification of the GPU is done via the Bus_ID (as the other GPU_IDs are not physical but relative to the job).

After checking the architecture diagram at the top of this page, it can be clearly seen that each of the assigned CPU-cores for the job is on a different L3 cache group chiplet (slurm-socket). But more importantly, it can be seen that the binding is optimal:

  • CPU core "019" is on chiplet:2 and directly connected to GCD (logical GPU) with Bus_ID:C9
  • CPU core "002" is on chiplet:0 and directly connected to GCD (logical GPU) with Bus_ID:D1
  • CPU core "009" is on chiplet:1 and directly connected to GCD (logical GPU) with Bus_ID:D6

According to the architecture diagram, this binding configuration is optimal.

"Click" in the TAB above to read the script and output for the other method of GPU binding.

Example scripts for: Hybrid jobs (multiple threads) on the CPU side

When the code is hybrid on the CPU side (MPI + OpenMP), the logic is similar to the above examples, except that more than 1 CPU-core needs to be accessible to each srun task. This is controlled with the OMP_NUM_THREADS environment variable and also implies a change in the settings for the optimal binding of resources when the "manual" binding (method 2) is applied.

In the following example, we use 3 GCDs (logical/slurm GPUs) (1 per MPI task) and the number of CPU threads per task is 5. As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. In this case we ask for 3 allocation-packs with:

#SBATCH --nodes=1         #1 node in this example 
#SBATCH --gres=gpu:3      #3 GPUs per node (3 "allocation-packs" in total for the job)

Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header. And the real number of threads per task is controlled with:

export OMP_NUM_THREADS=5           #This controls the real CPU-cores per task for the executable

The use/management of the allocated resources is controlled by the srun options and some environment variables. As mentioned above, there are two methods for achieving optimal binding. The method that uses only srun parameters is preferred (method 1), but may not always work; in that case, the "manual" method (method 2) may be needed. The scripts for the two methods of optimal binding are shown in the following tabs:

Ui tabs


Ui tab
titleD. Method 1: Optimal binding using srun parameters

For optimal binding using srun parameters, the options "--gpus-per-task" & "--gpu-bind=closest" need to be used:

Column
width900px

Code Block
languagebash
themeEmacs
titleListing N. exampleScript_1NodeShared_Hybrid5CPU_3GPUs_bindMethod1.sh
linenumberstrue


Now, let's take a look at the output after executing the script:

Column
width900px

Code Block
languagebash
themeDJango
titleTerminal N. Output for hybrid job with 3 tasks each with 5 CPU threads and 1 GPU shared access. Method 1 for optimal binding.


The output of the hello_jobstep code tells us that the job ran on node nid001004 and that 3 MPI tasks were spawned. Each of the MPI tasks has 5 CPU-cores assigned to it (controlled with the OMP_NUM_THREADS environment variable in the script), which can be identified with the HWT numbers. Also, each of the threads has only 1 visible GCD (logical/Slurm GPU). The hardware identification of the GPU is done via the Bus_ID (as the other GPU_IDs are not physical but relative to the job).

After checking the architecture diagram at the top of this page, it can be clearly seen that each of the assigned CPU-cores for the job is on a different L3 cache group chiplet (slurm-socket). But more importantly, it can be seen that the binding is optimal.

Method 1 may fail for some applications.

This first method is simpler, but may not work for all codes. "Manual" binding (method 2) may be the only reliable method for codes that rely on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and that attempt to use GPU-to-GPU enabled MPI communication.

"Click" in the TAB above to read the script and output for the other method of GPU binding.


Ui tab
titleD. Method 2: "Manual" optimal binding of GPUs and chiplets


Use mask_cpu for hybrid jobs on the CPU side instead of map_cpu

For hybrid jobs on the CPU side, use mask_cpu for the --cpu-bind option and NOT map_cpu. Also, control the number of CPU threads per task with the OMP_NUM_THREADS environment variable.

For "manual" binding, two auxiliary techniques need to be performed: 1) use of a wrapper and 2) generate an ordered list to be used in the --cpu-bind option of srun. In this case, the list needs to be created using the mask_cpu parameter:

Column
width900px

Code Block
languagebash
themeEmacs
titleListing N. exampleScript_1NodeShared_Hybrid5CPU_3GPUs_bindMethod2.sh
linenumberstrue


Note that the wrapper for selecting the GPUs (logical/Slurm GPUs) is created with a redirection to the cat command. Also note that its name uses the SLURM_JOBID environment variable to make the wrapper unique to this job, and that the wrapper is deleted when execution is finalised.
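
The full listing is not reproduced here, but the difference with respect to the pure-MPI case can be sketched as follows (a sketch only; the hexadecimal masks are hypothetical and must cover the 5 cores per chiplet paired with each allocated GCD):

Code Block
languagebash
themeEmacs
titleListing N. Core of the "manual" binding method with mask_cpu (sketch)
export OMP_NUM_THREADS=5           #This controls the real CPU-cores per task for the executable

#Ordered list of CPU masks (hypothetical: cores 0-4, 8-12 and 16-20, one chiplet per task):
CPU_BIND="mask_cpu:0x1f,0x1f00,0x1f0000"

#Each task starts through the same kind of wrapper used for the pure-MPI case:
srun -l -u -N 1 -n 3 -c 8 --gres=gpu:3 --cpu-bind=${CPU_BIND} ./selectGPU_${SLURM_JOBID}.sh ${theExe}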

Now, let's take a look at the output after executing the script:

Column
width900px

Code Block
languagebash
themeDJango
titleTerminal N. Output for hybrid job with 3 tasks each with 5 CPU threads and 1 GPU shared access. "Manual" method (method 2) for optimal binding.


The output of the hello_jobstep code tells us that the job ran on node nid001004 and that 3 MPI tasks were spawned. Each of the MPI tasks has 5 CPU-cores assigned to it (controlled with the OMP_NUM_THREADS environment variable in the script), which can be identified with the HWT numbers. Also, each thread has only 1 visible GCD (logical/Slurm GPU). The hardware identification of the GPU is done via the Bus_ID (as the other GPU_IDs are not physical but relative to the job).

After checking the architecture diagram at the top of this page, it can be clearly seen that each of the assigned CPU-cores for the job is on a different L3 cache group chiplet (slurm-socket). But more importantly, it can be seen that the binding is optimal.

"Click" in the TAB above to read the script and output for the other method of GPU binding.


Example scripts for: Jobs where each task needs access to multiple GPUs

Exclusive nodes: all 8 GPUs in each node accessible to all 8 tasks in the node

Some applications, like TensorFlow and other machine learning applications, may require access to all the available GPUs in the node. In this case, optimal binding and communication cannot be granted by the scheduler when assigning resources to the srun launcher, and the full responsibility for the optimal use of the resources lies with the code itself.

As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. This example considers a job that will make use of the 8 GCDs (logical/Slurm GPUs) on each of 2 nodes (16 "allocation-packs" in total). The resource request uses the following two parameters:

#SBATCH --nodes=2   #2 nodes in this example
#SBATCH --exclusive #All resources of each node are exclusive to this job
#                   #8 GPUs per node (16 "allocation-packs" in total for the job)

Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header.

The use/management of the allocated resources is controlled by the srun options and some environment variables. As mentioned above, optimal binding cannot be achieved by the scheduler, so no settings for optimal binding are given to the launcher. Also, all the GPUs in the node are available to each of the tasks:

Column
width900px


Code Block
languagebash
themeEmacs
titleListing N. exampleScript_2NodesExclusive_16GPUs_8VisiblePerTask.sh
linenumberstrue
#!/bin/bash --login
#SBATCH --job-name=16GPUExclusiveNode-8GPUsVisiblePerTask
#SBATCH --partition=gpu
#SBATCH --nodes=2              #2 nodes in this example 
#SBATCH --exclusive            #All resources of each node are exclusive to this job
#                              #8 GPUs per node (16 "allocation-packs" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=<yourProject>-gpu #IMPORTANT: use your own project and the -gpu suffix

#----
#Loading needed modules (adapt this for your own purposes):
#For the hello_jobstep example:
module load PrgEnv-cray
module load rocm/<VERSION> craype-accel-amd-gfx90a
#OR for a tensorflow example:
#module load tensorflow/<version>
echo -e "\n\n#------------------------#"
module list

#----
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}

#----
#Definition of the executable (we assume the example code has been compiled and is available in $MYSCRATCH):
exeDir=$MYSCRATCH/hello_jobstep
exeName=hello_jobstep
theExe=$exeDir/$exeName

#----
#MPI & OpenMP settings if needed (these won't work for Tensorflow):
export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1           #This controls the real CPU-cores per task for the executable

#----
#TensorFlow settings if needed:
#  The following two variables control the real number of threads in Tensorflow code:
#export TF_NUM_INTEROP_THREADS=1    #Number of threads for independent operations
#export TF_NUM_INTRAOP_THREADS=1    #Number of threads within individual operations 

#----
#Execution
#Note: srun needs the explicit indication of the full parameters for the use of resources in the job step.
#      These are independent of the allocation parameters (which are not inherited by srun).
#      Each task needs access to all the 8 available GPUs in the node where it's running.
#      So, no optimal binding can be provided by the scheduler.
#      Therefore, "--gpus-per-task" and "--gpu-bind" are not used.
#      Optimal use of resources is now responsibility of the code.
#      "-c 8" is used to force allocation of 1 task per CPU chiplet. Then, the REAL number of threads
#         for the code SHOULD be defined by the environment variables above.
#      (The "-l" option is for displaying, at the beginning of each line, the taskID that generates the output.)
#      (The "-u" option is for unbuffered output, so that output is displayed as soon as it's generated.)
#      (If the output needs to be sorted for clarity, then add "| sort -n" at the end of the command.)
echo -e "\n\n#------------------------#"
echo "Test code execution:"
srun -l -u -N 2 -n 16 -c 8 --gres=gpu:8 ${theExe} | sort -n
#srun -l -u -N 2 -n 16 -c 8 --gres=gpu:8 python3 ${tensorFlowScript}

#----
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished job steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20

#----
#Done
echo -e "\n\n#------------------------#"
echo "Done"


...

Column
width900px


Code Block
languagebash
themeDJango
titleTerminal N. Output for a 16 GPU job with 16 tasks, each task accessing the 8 GPUs of its running node
$ sbatch exampleScript_2NodesExclusive_16GPUs_8VisiblePerTask.sh
Submitted batch job 7798215

$ cat slurm-7798215.out
...
#------------------------#
Test code execution:
 0: MPI 000 - OMP 000 - HWT 001 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
 1: MPI 001 - OMP 000 - HWT 008 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
 2: MPI 002 - OMP 000 - HWT 016 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
 3: MPI 003 - OMP 000 - HWT 024 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
 4: MPI 004 - OMP 000 - HWT 032 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
 5: MPI 005 - OMP 000 - HWT 040 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
 6: MPI 006 - OMP 000 - HWT 049 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
 7: MPI 007 - OMP 000 - HWT 056 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
 8: MPI 008 - OMP 000 - HWT 000 - Node nid002946 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
 9: MPI 009 - OMP 000 - HWT 008 - Node nid002946 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
10: MPI 010 - OMP 000 - HWT 016 - Node nid002946 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
11: MPI 011 - OMP 000 - HWT 025 - Node nid002946 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
12: MPI 012 - OMP 000 - HWT 032 - Node nid002946 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
13: MPI 013 - OMP 000 - HWT 040 - Node nid002946 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
14: MPI 014 - OMP 000 - HWT 048 - Node nid002946 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
15: MPI 015 - OMP 000 - HWT 056 - Node nid002946 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
...
#------------------------#
Done

The output of the hello_jobstep code tells us that the job ran 8 MPI tasks on node nid002944 and another 8 MPI tasks on node nid002946. Each of the MPI tasks has only 1 CPU-core assigned to it (controlled with the OMP_NUM_THREADS environment variable in the script), which can be identified with the HWT number. Clearly, each of the CPU tasks runs on a different chiplet.

More importantly for this example, each of the MPI tasks has access to the 8 GCDs (logical/Slurm GPUs) in its node. Proper and optimal GPU management and communication is the responsibility of the code. The hardware identification is done via the Bus_ID (as the other GPU_IDs are not physical but relative to the job).

Shared nodes: Many GPUs requested but 2 GPUs bound to each task

Some applications may require each of the spawned tasks to have access to multiple GPUs. In this case, some optimal binding and communication can still be granted by the scheduler when assigning resources with the srun launcher, although the final responsibility for the optimal use of the multiple GPUs assigned to each task lies with the code itself.

As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. This example considers a job that will make use of 6 GCDs (logical/Slurm GPUs) on 1 node (6 "allocation-packs" in total). The resource request uses the following two parameters:

#SBATCH --nodes=1     #1 node in this example
#SBATCH --gres=gpu:6  #6 GPUs per node (6 "allocation packs" in total for the job)

Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header.

The use/management of the allocated resources is controlled by the srun options and some environment variables. As mentioned above, the best possible binding can still be achieved by the scheduler, providing 2 GPUs to each of the tasks:


Column
width900px


Code Block
languagebash
themeEmacs
titleListing N. exampleScript_1NodeShared_6GPUs_2VisiblePerTask.sh
linenumberstrue
#!/bin/bash --login
#SBATCH --job-name=6GPUSharedNode-2GPUsVisiblePerTask
#SBATCH --partition=gpu
#SBATCH --nodes=1              #1 node in this example 
#SBATCH --gres=gpu:6           #6 GPUs per node (6 "allocation packs" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=<yourProject>-gpu #IMPORTANT: use your own project and the -gpu suffix

#----
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm/<VERSION> craype-accel-amd-gfx90a
echo -e "\n\n#------------------------#"
module list

#----
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}

#----
#Definition of the executable (we assume the example code has been compiled and is available in $MYSCRATCH):
exeDir=$MYSCRATCH/hello_jobstep
exeName=hello_jobstep
theExe=$exeDir/$exeName

#----
#MPI & OpenMP settings if needed (these won't work for Tensorflow):
export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1           #This controls the real CPU-cores per task for the executable

#----
#Execution
#Note: srun needs the explicit indication of the full parameters for the use of resources in the job step.
#      These are independent of the allocation parameters (which are not inherited by srun).
#      For best possible GPU binding using slurm options,
#      "--gpus-per-task=2" and "--gpu-bind=closest" will provide the best GPUs to the tasks.
#      But best is still not optimal.
#      Each task has access to 2 of the available GPUs in the node where it's running.
#      Optimal use of the resources of each of the 2 GPUs accessible per task is now responsibility of the code.
#      IMPORTANT: Note the use of "-c 16" to "reserve" 2 chiplets per task, consistent with
#                 the use of "--gpus-per-task=2" to "reserve" 2 GPUs per task. Then, the REAL number of
#                 threads for the code SHOULD be defined by the environment variables above.
#      (The "-l" option is for displaying, at the beginning of each line, the taskID that generates the output.)
#      (The "-u" option is for unbuffered output, so that output is displayed as soon as it's generated.)
#      (If the output needs to be sorted for clarity, then add "| sort -n" at the end of the command.)
echo -e "\n\n#------------------------#"
echo "Test code execution:"
srun -l -u -N 1 -n 3 -c 16 --gres=gpu:6 --gpus-per-task=2 --gpu-bind=closest ${theExe} | sort -n

#----
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished jobs steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20

#----
#Done
echo -e "\n\n#------------------------#"
echo "Done"


...

Column
width900px


Code Block
languagebash
themeEmacs
titleListing N. exampleScript_1NodeExclusive_8GPUs_jobPacking.sh
linenumberstrue
#!/bin/bash --login
#SBATCH --job-name=JobPacking8GPUsExclusive-bindMethod1
#SBATCH --partition=gpu
#SBATCH --nodes=1              #1 node in this example 
#SBATCH --exclusive            #All resources of the node are exclusive to this job
#                              #8 GPUs per node (8 "allocation-packs" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=<yourProject>-gpu #IMPORTANT: use your own project and the -gpu suffix

#----
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm/<VERSION> craype-accel-amd-gfx90a
echo -e "\n\n#------------------------#"
module list

#----
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}

#----
#Job Packing Wrapper: Each srun-task will use a different instance of the executable.
jobPackingWrapper="jobPackingWrapper.sh"

#----
#MPI & OpenMP settings
#No need for 1GPU steps:export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1           #This controls the real CPU-cores per task for the executable

#----
#Execution
#Note: srun needs the explicit indication of the full parameters for the use of resources in the job step.
#      These are independent of the allocation parameters (which are not inherited by srun).
#      "-c 8" is used to force allocation of 1 task per CPU chiplet. Then, the REAL number of threads
#         for the code SHOULD be defined by the environment variables above.
#      (The "-l" option is for displaying, at the beginning of each line, the taskID that generates the output.)
#      (The "-u" option is for unbuffered output, so that output is displayed as soon as it's generated.)
echo -e "\n\n#------------------------#"
echo "Test code execution:"
srun -l -u -N 1 -n 8 -c 8 --gres=gpu:8 --gpus-per-task=1 --gpu-bind=closest ./${jobPackingWrapper}

#----
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished jobs steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20

#----
#Done
echo -e "\n\n#------------------------#"
echo "Done" 


...