...

Exclusive Node Multi-GPU job: 8 GPUs, each of them controlled by one MPI task

As for all scripts, we provide the parameters for requesting the necessary "allocation packs" for the job. This example considers a job that will make use of the 8 GPUs on 1 node

...

(8 "allocation packs"). The resources request use the following two parameters:

#SBATCH --nodes=1   #1 node in this example
#SBATCH --exclusive #All resources of the node are exclusive to this job
#                   #8 GPUs per node (8 "allocation packs" in total for the job)

Note that only these two allocation parameters are needed to provide the information for the requested number of allocation packs, and no other parameter related to memory or CPU cores should be provided in the request header.

The use/management of the allocated resources is controlled by the srun options and some environment variables. As mentioned above, there are two methods for achieving optimal binding. The method that uses only srun parameters (method 1) is preferred, but it may not always work; in that case, the "manual" method (method 2) may be needed. The scripts for the two binding methods are given in the following tabs:

Ui tabs


Ui tab
titleMethod 1: Optimal binding using srun parameters

For optimal binding using srun parameters, the options "--gpus-per-task" and "--gpu-bind=closest" need to be used.
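To see how these options fit together, here is a minimal sketch of the corresponding srun invocation for this 8-task, single-node case (assuming the hello_jobstep test code and the MPI/OpenMP settings used elsewhere on this page):

export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1           #This controls the real CPU-cores per task for the executable
srun -l -u -N 1 -n 8 -c 8 --gpus-per-node=8 --gpus-per-task=1 --gpu-bind=closest ${theExe}

The full script is given in the following listing: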

Column
width900px


Code Block
languagebash
themeEmacs
titleListing N. exampleScript_1NodeExclusive_8GPUs_bindMethod1.sh
linenumberstrue


Now, let's take a look at the output after executing the script:

Column
width900px


Code Block
languagebash
themeDJango
titleTerminal N. Output for 8 GPUs job exclusive access


The output of the hello_jobstep code tells us that the job ran on node nid001000 and that 8 MPI tasks were spawned. Each of the MPI tasks has only 1 CPU-core assigned to it (set with the OMP_NUM_THREADS environment variable in the script), which can be identified by its HWT number. Also, each MPI task has only 1 visible GPU. The hardware identification of the GPU is given by its Bus_ID (the other GPU IDs are not physical but relative to the job).

Checking the architecture diagram at the top of this page, it can be clearly seen that each of the CPU-cores assigned to the job is on a different L3 cache group chiplet (slurm-socket). More importantly, the GPU assigned to each MPI task is the one directly connected to that chiplet, so the binding is optimal:

  • CPU core "001" is on chiplet:0 and directly connected to GPU with Bus_ID:D1
  • CPU core "008" is on chiplet:1 and directly connected to GPU with Bus_ID:D6
  • CPU core "016" is on chiplet:2 and directly connected to GPU with Bus_ID:C9
  • CPU core "024" is on chiplet:3 and directly connected to GPU with Bus_ID:CE
  • CPU core "032" is on chiplet:4 and directly connected to GPU with Bus_ID:D9
  • CPU core "040" is on chiplet:5 and directly connected to GPU with Bus_ID:DE
  • CPU core "048" is on chiplet:6 and directly connected to GPU with Bus_ID:C1
  • CPU core "056" is on chiplet:7 and directly connected to GPU with Bus_ID:C6

According to the architecture diagram, this binding configuration is optimal.

Method 1 may fail for some applications.

This first method is simpler, but may not work for all codes. "Manual" binding (method 2) may be the only reliable method for codes that rely on OpenMP or OpenACC pragmas to move data between host and GPU and that also attempt to use GPU-to-GPU enabled MPI communication.

"Click" in the TAB above to read the script and output for the other method of GPU binding.


Ui tab
titleMethod 2: "Manual" optimal binding of GPUs and chiplets

For "manual" binding, two auxiliary techniques need to be performed: 1) use of a wrapper that selects the correct GPU and 2) generate an ordered list to be used in the --cpu-bind option of srun:

Column
width900px


Code Block
languagebash
themeEmacs
titleListing N. exampleScript_1NodeExclusive_8GPUs_bindMethod2.sh
linenumberstrue


Note that the wrapper for selecting the GPUs is created with a redirection "trick" applied to the cat command. Also note that its name uses the SLURM_JOBID environment variable to make the wrapper unique to this job, and that the wrapper is deleted when execution is finalised.
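As a rough sketch of how these two techniques can fit together (the full listing above is the reference; the use of ROCR_VISIBLE_DEVICES and the particular list of cores are illustrative assumptions, chosen to be consistent with the output shown below):

#Create the wrapper with a redirection to cat; the escaped \$ delays expansion so these lines are evaluated per task at run time
cat << EOF > selectGPU_${SLURM_JOBID}.sh
#!/bin/bash
export ROCR_VISIBLE_DEVICES=\$SLURM_LOCALID #Assumption: each task sees only the GPU matching its node-local rank
exec \$*
EOF
chmod +x selectGPU_${SLURM_JOBID}.sh

#Ordered list with one core per chiplet, ordered to match the node's GPU enumeration (illustrative values)
CPU_BIND="map_cpu:54,63,18,26,6,13,33,47"

srun -l -u -N 1 -n 8 -c 8 --gpus-per-node=8 --cpu-bind=${CPU_BIND} ./selectGPU_${SLURM_JOBID}.sh ${theExe}

#Remove the temporary wrapper once execution is finalised
rm -f ./selectGPU_${SLURM_JOBID}.sh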

Now, let's take a look at the output after executing the script:

Column
width900px


Code Block
languagebash
themeDJango
titleTerminal N. Output for 8 GPUs job exclusive access


The output of the hello_jobstep code tells us that the job ran on node nid001000 and that 8 MPI tasks were spawned. Each of the MPI tasks has only 1 CPU-core assigned to it (set with the OMP_NUM_THREADS environment variable in the script), which can be identified by its HWT number. Also, each MPI task has only 1 visible GPU. The hardware identification of the GPU is given by its Bus_ID (the other GPU IDs are not physical but relative to the job).

Checking the architecture diagram at the top of this page, it can be clearly seen that each of the CPU-cores assigned to the job is on a different L3 cache group chiplet (slurm-socket). More importantly, the GPU assigned to each MPI task is the one directly connected to that chiplet, so the binding is optimal:

  • CPU core "054" is on chiplet:6 and directly connected to GPU with Bus_ID:C1
  • CPU core "063" is on chiplet:7 and directly connected to GPU with Bus_ID:C6
  • CPU core "018" is on chiplet:2 and directly connected to GPU with Bus_ID:C9
  • CPU core "026" is on chiplet:3 and directly connected to GPU with Bus_ID:CE
  • CPU core "006" is on chiplet:0 and directly connected to GPU with Bus_ID:D1
  • CPU core "013" is on chiplet:1 and directly connected to GPU with Bus_ID:D6
  • CPU core "033" is on chiplet:4 and directly connected to GPU with Bus_ID:D9
  • CPU core "047" is on chiplet:5 and directly connected to GPU with Bus_ID:DE

According to the architecture diagram, this binding configuration is optimal.

"Click" in the TAB above to read the script and output for the other method of GPU binding.


N Exclusive Nodes Multi-GPU job: 8*N GPUs, each of them controlled by one MPI task

As for all scripts, we provide the parameters for requesting the necessary "allocation packs" for the job. The same procedure described above for the single exclusive node job applies to multi-node exclusive jobs; the only difference when requesting resources is the number of exclusive nodes. So, for example, for a job requiring 2 exclusive nodes (16 GPUs or 16 "allocation packs"), the resource request uses the following two parameters:

#SBATCH --nodes=2   #2 nodes in this example

And, for using the resources, the srun command line needs to be updated to handle the correct number of nodes and tasks:
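For example, a minimal sketch of how the srun line then looks for 2 nodes and 16 tasks (assuming the same hello_jobstep executable and the method 1 binding options used elsewhere on this page):

srun -l -u -N 2 -n 16 -c 8 --gpus-per-node=8 --gpus-per-task=1 --gpu-bind=closest ${theExe}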

...

#SBATCH --exclusive #All resources of the node are exclusive to this job
#                   #8 GPUs per node (16 "allocation packs" in total for the job)

Note that only these two allocation parameters are needed to provide the information for the requested number of allocation packs, and no other parameter related to memory or CPU cores should be provided in the request header.

The use/management of the allocated resources is controlled by the srun options and some environment variables. As mentioned above, there are two methods for achieving optimal binding. The method that uses only srun parameters (method 1) is preferred, but it may not always work; in that case, the "manual" method (method 2) may be needed. The scripts for the two binding methods are given in the following tabs:


Ui tabs


Ui tab
titleMethod 1: Optimal binding using srun parameters

For optimal binding using srun parameters, the options "--gpus-per-task" and "--gpu-bind=closest" need to be used:

Column
width900px


Code Block
languagebash
themeEmacs
titleListing N. exampleScript_2NodesExclusive_16GPUs_bindMethod1.sh
linenumberstrue


Now, let's take a look at the output after executing the script:

Column
width900px


Code Block
languagebash
themeDJango
titleTerminal N. Output for 16 GPUs job (2 nodes) exclusive access


According to the architecture diagram, this binding configuration is optimal.

Method 1 may fail for some applications.

This first method is simpler, but may not work for all codes. "Manual" binding (method 2) may be the only reliable method for codes that rely on OpenMP or OpenACC pragmas to move data between host and GPU and that also attempt to use GPU-to-GPU enabled MPI communication.

"Click" in the TAB above to read the script and output for the other method of GPU binding.


Ui tab
titleMethod 2: "Manual" optimal binding of GPUs and chiplets

For "manual" binding, two auxiliary techniques need to be performed: 1) use of a wrapper that selects the correct GPU and 2) generate an ordered list to be used in the --cpu-bind option of srun:

Column
width900px


Code Block
languagebash
themeEmacs
titleListing N. exampleScript_2NodesExclusive_16GPUs_bindMethod2.sh
linenumberstrue


Note that the wrapper for selecting the GPUs is created with a redirection "trick" applied to the cat command. Also note that its name uses the SLURM_JOBID environment variable to make the wrapper unique to this job, and that the wrapper is deleted when execution is finalised.
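The wrapper sketched for the single-node case also applies here without modification, since SLURM_LOCALID is local to each node; only the srun geometry changes. A hedged sketch of the corresponding srun line (the CPU_BIND list and the wrapper name are the same illustrative assumptions used in the single-node sketch):

srun -l -u -N 2 -n 16 -c 8 --gpus-per-node=8 --cpu-bind=${CPU_BIND} ./selectGPU_${SLURM_JOBID}.sh ${theExe}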

Now, let's take a look at the output after executing the script:

Column
width900px


Code Block
languagebash
themeDJango
titleTerminal N. Output for 16 GPUs job (2 nodes) exclusive access


According to the architecture diagram, this binding configuration is optimal.

"Click" in the TAB above to read the script and output for the other method of GPU binding.


...

Shared node 1 GPU job

Jobs that need only 1 GPU for their execution will share the GPU compute node with other jobs. That is, they will run in shared access, which is the default, so no request for exclusive access is needed. The following script is an example of a job requesting just 1 GPU:

As for all scripts, we provide the parameters for requesting the necessary "allocation packs" for the job. In this case we ask for 1 allocation pack with:

#SBATCH --nodes=1              #1 node in this example
#SBATCH --gpus-per-node=1      #1 GPU per node (1 "allocation pack" in total for the job)

Note that only these two allocation parameters are needed to provide the information for the requested number of allocation packs, and no other parameter related to memory or CPU cores should be provided in the request header.

The use/management of the allocated resources is controlled by the srun options and some environment variables. As only 1 allocation pack is requested, there is no need to take any other action, as optimal binding of CPU chiplet and GPU is guaranteed:

Column
width900px


Code Block
languagebash
themeEmacs
titleListing N. exampleScript_1NodeShared_1GPU.sh
linenumberstrue
#!/bin/bash --login
#SBATCH --job-name=1GPUSharedNode
#SBATCH --partition=gpu
#SBATCH --nodes=1              #1 node in this example
#SBATCH --gpus-per-node=1      #1 GPU per node (1 "allocation pack" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=<yourProject>-gpu #IMPORTANT: use your own project and the -gpu suffix
#(Note that there is no request for exclusive access to the node)

#----
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm craype-accel-amd-gfx90a
echo -e "\n\n#------------------------#"
module list

#----
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}

#----
#Definition of the executable (we assume the example code has been compiled and is available in $MYSCRATCH):
exeDir=$MYSCRATCH/hello_jobstep
exeName=hello_jobstep
theExe=$exeDir/$exeName

#----
#MPI & OpenMP settings
#Not needed for 1GPU:export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1           #This controls the real CPU-cores per task for the executable

#----
#Execution
#Note: srun needs explicit indication of the full parameters for the use of resources in the job step.
#      These are independent from the allocation parameters (which are not inherited by srun)
#      For optimal GPU binding using slurm options,
#      "--gpus-per-task=1" and "--gpu-bind=closest" create the optimal binding of GPUs      
#      (Although in this case this can be avoided as only 1 "allocation pack" has been requested)
echo -e "\n\n#------------------------#"
echo "Test code execution:"
srun -l -u -N 1 -n 1 -c 8 --gpus-per-node=1 ${theExe} | sort -n

#----
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished jobs steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20

#----
#Done
echo -e "\n\n#------------------------#"
echo "Done"


...

#SBATCH --nodes=1              #1 node in this example
#SBATCH --gpus-per-node=3      #3 GPUs per node (3 "allocation packs" in total for the job)

Note that only these two allocation parameters are needed to provide the information for the requested number of allocation packs, and no other parameter related to memory or CPU cores should be provided in the request header.
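For reference, a hedged sketch of a method 1 style srun line matching this 3-GPU shared request (assuming the hello_jobstep executable used throughout this page):

export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
srun -l -u -N 1 -n 3 -c 8 --gpus-per-node=3 --gpus-per-task=1 --gpu-bind=closest ${theExe}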

The use/management of the allocated resources is controlled by the srun options and some environment variables. As mentioned above, there are two methods for achieving optimal binding. The method that uses only srun parameters (method 1) is preferred, but it may not always work; in that case, the "manual" method (method 2) may be needed. The scripts for the two binding methods are given in the following tabs:

...

When the code is hybrid on the CPU side (MPI + OpenMP), the logic is similar to the above examples, except that more than 1 CPU-core per L3 cache chiplet (slurm-socket) needs to be accessible to each srun task. This is controlled by the OMP_NUM_THREADS environment variable and also implies a change in the settings for the optimal binding of resources when the "manual" binding (method 2) is applied.

In the following example, we use 3 GPUs (3 "allocation packs", 1 per MPI task) and the number of CPU threads per task is 5:

...

As for all scripts, we provide the parameters for requesting the necessary "allocation packs" for the job. In this case we ask for 3 allocation packs with:

#SBATCH --nodes=1              #1 node in this example
#SBATCH --gpus-per-node=3      #3 GPUs per node (3 "allocation packs" in total for the job)

Note that only these two allocation parameters are needed to provide the information for the requested number of allocation packs, and no other parameter related to memory or CPU cores should be provided in the request header. And the real number of threads per task is controlled with:

export OMP_NUM_THREADS=5           #This controls the real CPU-cores per task for the executable
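A hedged sketch of how this combines with a method 1 style srun line (the -c value and the binding options follow the pattern of the other examples on this page and should be treated as assumptions for this hybrid case):

export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=5           #This controls the real CPU-cores per task for the executable
srun -l -u -N 1 -n 3 -c 8 --gpus-per-node=3 --gpus-per-task=1 --gpu-bind=closest ${theExe}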

The use/management of the allocated resources is controlled by the srun options and some environment variables. As mentioned above, there are two methods for achieving optimal binding. The method that uses only srun parameters (method 1) is preferred, but it may not always work; in that case, the "manual" method (method 2) may be needed. The scripts for the two binding methods are given in the following tabs:

...

This kind of packing can be performed with the help of an additional packing-wrapper script (jobPackWrapper.sh) that governs the independent execution of different codes (or different instances of the same code), one run by each of the tasks spawned by srun. (It is important to understand that these instances do not interact with each other via MPI messaging.) The isolation of each code/instance should be handled by the logic included in this packing-wrapper script.

In the following example, the packing-wrapper creates 8 different output directories and then launches 8 different instances of the hello_nompi code. The output of each execution is saved in a different case directory and file. In this example the executables do not receive any further parameters but, in practice, users should define the logic for their own purposes and, if needed, include the logic to provide different parameters to each instance.
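As a rough illustration of the layout this logic is expected to produce (directory and file names follow the variables defined in the wrapper below; the actual job id will differ):

#Expected layout after the job finishes: one case directory and one log file per srun task, e.g.
#  case_0/log_hello_nompi_<jobid>_0.out
#  case_1/log_hello_nompi_<jobid>_1.out
#  ...
#  case_7/log_hello_nompi_<jobid>_7.out
ls case_*/log_hello_nompi_*.out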

Column
width900px


Code Block
languagebash
themeEmacs
titleListing N. jobPackWrapper.sh
linenumberstrue
#!/bin/bash
#Job Pack Wrapper: Each srun-task will use a different instance of the executable.
#                  For this specific example, each srun-task will run on a different case directory
#                  and create an isolated log file.
#                  (Adapt wrapper script for your own purposes.)

caseHere=case_${SLURM_PROCID}

exeDir=${MYSCRATCH}/hello_jobstep
exeName=hello_nompi #Using the no-MPI version of the code
theExe=${exeDir}/${exeName}

logHere=log_${exeName}_${SLURM_JOBID}_${SLURM_PROCID}.out
mkdir -p $caseHere
cd $caseHere

${theExe} > ${logHere} 2>&1  


Note that, besides the use of the additional packing-wrapper, the rest of the script is very similar to the single-node exclusive examples given above. As for all scripts, we provide the parameters for requesting the necessary "allocation packs" for the job. This example considers a job that will make use of the 8 GPUs on 1 node (8 "allocation packs"). Each pack will be used by one of the instances controlled by the packing-wrapper. The resource request uses the following two parameters:

#SBATCH --nodes=1   #1 node in this example
#SBATCH --exclusive #All resources of the node are exclusive to this job
#                   #8 GPUs per node (8 "allocation packs" in total for the job)

Note that only these two allocation parameters are needed to provide the information for the requested number of allocation packs, and no other parameter related to memory or CPU cores should be provided in the request header.

The use/management of the allocated resources is controlled by the srun options and some environment variables. From the point of view of srun, this is no different from an MPI job with 8 tasks. In reality, however, this is not an MPI job: srun spawns 8 tasks, each executing the packing-wrapper, and the logic of the packing-wrapper allows for 8 independent executions of the desired code(s).

As mentioned above, there are two methods for achieving optimal binding. The method that uses only srun parameters (method 1) is preferred, but it may not always work; in that case, the "manual" method (method 2) may be needed. The scripts for the two binding methods are in the following tabs:

Column
width900px


Code Block
languagebash
themeEmacs
titleListing N. exampleScript_1NodeExclusive_8GPUs_jobPacking.sh
linenumberstrue
#!/bin/bash --login
#SBATCH --job-name=JobPack8GPUsExclusive-bindMethod1
#SBATCH --partition=gpu
#SBATCH --nodes=1              #1 node in this example
#SBATCH --exclusive            #All resources of the node are exclusive to this job
#                              #8 GPUs per node (8 "allocation packs" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=<yourProject>-gpu #IMPORTANT: use your own project and the -gpu suffix

#----
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm craype-accel-amd-gfx90a
echo -e "\n\n#------------------------#"
module list

#----
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}

#----
#Job Pack Wrapper: Each srun-task will use a different instance of the executable.
jobPackWrapper="jobPackWrapper.sh"

#----
#MPI & OpenMP settings
#No need for 1GPU steps:export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1           #This controls the real CPU-cores per task for the executable

#----
#Execution
#Note: srun needs explicit indication of the full parameters for the use of resources in the job step.
#      These are independent from the allocation parameters (which are not inherited by srun)
echo -e "\n\n#------------------------#"
echo "Test code execution:"
srun -l -u -N 1 -n 8 -c 8 --gpus-per-node=8 --gpus-per-task=1 --gpu-bind=closest ./${jobPackWrapper}

#----
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished jobs steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20

#----
#Done
echo -e "\n\n#------------------------#"
echo "Done" 


...