...
Each GPU node has 4 MI250X GPU cards, each of which in turn has 2 Graphics Compute Dies (GCDs) that are seen as 2 logical GPUs; so each GPU node has 8 GCDs, which are equivalent to 8 Slurm GPUs. On the other hand, the single AMD CPU chip has 64 cores organised in 8 groups that share the same L3 cache. Each of these L3 cache groups (or chiplets) has a direct Infinity Fabric connection with just one of the GCDs, providing optimal bandwidth. Each chiplet can also communicate with the other GCDs, albeit at a lower bandwidth due to the additional communication hops. (In the examples explained in the rest of this document, we use the numbering of the cores and the bus IDs of the GCDs to identify the allocated chiplets and GCDs, and their binding.)
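On an allocated GPU node, this layout can be inspected interactively. The following is a minimal sketch, assuming the rocm module (providing rocm-smi) and standard Linux tools are available on the node; exact option names may vary between ROCm versions:
rocm-smi --showbus          # PCI Bus_IDs of the 8 GCDs (logical GPUs)
rocm-smi --showtopo         # GCD-to-GCD links and GCD-to-NUMA affinity
lscpu | grep -i "l3"        # the 8 L3 cache groups (chiplets) of the CPU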
...
Note:
To use GPU-aware Cray MPICH, users must set the following modules and environment variables:
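A typical combination on HPE Cray EX systems with MI250X GPUs is sketched below (the exact module names are an assumption and should be checked with module avail on the system):
module load craype-accel-amd-gfx90a   #compile-time GPU offload support for MI250X
module load rocm                      #ROCm runtime and tools
export MPICH_GPU_SUPPORT_ENABLED=1    #enable GPU-aware communication in Cray MPICH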
...
...
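For this first example (one srun task using a single GCD), a minimal script sketch, assuming the hello_jobstep test code and Slurm's --gpus-per-task and --gpu-bind options, could look like:
#!/bin/bash --login
#SBATCH --nodes=1                #1 node in this example
#SBATCH --gres=gpu:1             #1 GPU (1 "allocation-pack" in total for the job)
#SBATCH --partition=gpu          #hypothetical partition name
#SBATCH --time=00:10:00

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1         #1 CPU core per task

srun -N 1 -n 1 -c 8 --gres=gpu:1 --gpus-per-task=1 --gpu-bind=closest ./hello_jobstep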
And the output after executing this example is:
The output of the hello_jobstep code tells us that the CPU-core "002" and the GPU with Bus_ID:D1 were utilised by the job. Optimal binding is guaranteed for a single "allocation-pack", as the memory, CPU chiplet and GPU of each pack are already optimally matched.
Shared node: 3 MPI tasks, each controlling 1 GCD (logical/Slurm GPU)
As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. In this case we ask for 3 allocation-packs with:
#SBATCH --nodes=1 #1 node in this example
#SBATCH --gres=gpu:3 #3 GPUs per node (3 "allocation-packs" in total for the job)
Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header.
The use/management of the allocated resources is controlled by the srun options and some environment variables. As mentioned above, there are two methods for achieving optimal binding. The method that uses only srun parameters (method 1) is preferred, but may not always work; in that case, the "manual" method (method 2) may be needed. The scripts for the two binding methods are in the following tabs:
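As a rough illustration of method 1 for this case, a minimal sketch, assuming the hello_jobstep test code and Slurm's --gpus-per-task and --gpu-bind options, could be:
#!/bin/bash --login
#SBATCH --nodes=1                #1 node in this example
#SBATCH --gres=gpu:3             #3 GPUs per node (3 "allocation-packs" in total for the job)
#SBATCH --partition=gpu          #hypothetical partition name
#SBATCH --time=00:10:00

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1         #1 CPU core per task

#Method 1: let srun pair each task with the GCD closest to its chiplet
srun -N 1 -n 3 -c 8 --gres=gpu:3 --gpus-per-task=1 --gpu-bind=closest ./hello_jobstep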
And the output after executing this example is:
...
After checking the architecture diagram at the top of this page, it can be clearly seen that each of the CPU-cores assigned to the job is on a different L3 cache group, that is, on a different chiplet (Slurm socket). But more importantly, it can be seen that the binding is optimal:
- CPU core "
001
" is onchiplet:0
and directly connected to GCD (logical GPU) withBus_ID:D1
- CPU core "
008
" is onchiplet:1
and directly connected to GCD (logical GPU) withBus_ID:D6
- CPU core "
016
" is onchiplet:2
and directly connected to GCD (logical GPU) withBus_ID:C9
According to the architecture diagram, this binding configuration is optimal.
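A similar check can also be sketched with standard tools if hello_jobstep is not at hand (this assumes taskset is installed and that Slurm exports ROCR_VISIBLE_DEVICES when --gpu-bind is used; the output format will differ):
#Print, for each task, its CPU affinity and the GCD it can see
srun -N 1 -n 3 -c 8 --gres=gpu:3 --gpus-per-task=1 --gpu-bind=closest \
     bash -c 'echo "rank ${SLURM_PROCID}: cpus=$(taskset -cp $$ | cut -d: -f2), GCD=${ROCR_VISIBLE_DEVICES:-unset}"'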
...
This first method is simpler, but may not work for all codes. "Manual" binding (method 2) may be the only reliable method for codes that rely on OpenMP or OpenACC pragmas to move data between host and GPU and that attempt to use GPU-to-GPU enabled MPI communication.
...
Example scripts for: Hybrid jobs (multiple threads) on the CPU side
When the code is hybrid on the CPU side (MPI + OpenMP), the logic is similar to the above examples, except that more than one CPU core per chiplet needs to be accessible to each srun task. This is controlled by the OMP_NUM_THREADS environment variable and also implies a change in the settings for the optimal binding of resources when "manual" binding (method 2) is applied.
In the following example, we use 3 GCDs (logical/Slurm GPUs), 1 per MPI task, and 5 CPU threads per task. As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. In this case we ask for 3 allocation-packs with:
#SBATCH --nodes=1 #1 node in this example
#SBATCH --gres=gpu:3 #3 GPUs per node (3 "allocation-packs" in total for the job)
Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header. And the real number of threads per task is controlled with:
export OMP_NUM_THREADS=5 #This controls the real CPU-cores per task for the executable
The use/management of the allocated resources is controlled by the srun options and some environment variables. As mentioned above, there are two methods for achieving optimal binding. The method that uses only srun parameters (method 1) is preferred, but may not always work; in that case, the "manual" method (method 2) may be needed. The scripts for the two binding methods are in the following tabs:
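As a rough illustration of method 1 for the hybrid case, a minimal sketch, assuming the hello_jobstep test code and Slurm's --gpus-per-task and --gpu-bind options, could be:
#!/bin/bash --login
#SBATCH --nodes=1                #1 node in this example
#SBATCH --gres=gpu:3             #3 GPUs per node (3 "allocation-packs" in total for the job)
#SBATCH --partition=gpu          #hypothetical partition name
#SBATCH --time=00:10:00

export MPICH_GPU_SUPPORT_ENABLED=1

export OMP_NUM_THREADS=5         #This controls the real CPU-cores per task for the executable
export OMP_PLACES=cores          #Bind each OpenMP thread to its own core
export OMP_PROC_BIND=close       #Keep the threads of a task on neighbouring cores (same chiplet)

#Method 1: reserve a full chiplet (8 cores) per task so the 5 threads stay
#on the chiplet directly connected to the GCD assigned to that task
srun -N 1 -n 3 -c 8 --gres=gpu:3 --gpus-per-task=1 --gpu-bind=closest ./hello_jobstep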
...
For "manual" binding, two auxiliary techniques need to be performed: 1) use of a wrapper and 2) generate an ordered list to be used in the --cpu-bind
option of srun
. In this case, the list needs to be created using the mask_cpu
parameter:
...
Note that the wrapper for selecting the GPUs (logical/Slurm GPUs) is created with a redirection of the cat command. Also note that its name uses the SLURM_JOBID environment variable to make the wrapper unique to this job, and that the wrapper is deleted when execution is finalised.
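The following is a minimal sketch of these two techniques for the 3-task, 5-thread case (the wrapper name and the hexadecimal masks are illustrative; the real masks must follow the chiplet-to-GCD mapping of the architecture diagram):
#Auxiliary technique 1: wrapper that gives each task (SLURM_LOCALID) one GCD
wrapper="selectGPU_${SLURM_JOBID}.sh"
cat << EOF > $wrapper
#!/bin/bash
export ROCR_VISIBLE_DEVICES=\$SLURM_LOCALID
exec \$*
EOF
chmod +x $wrapper

#Auxiliary technique 2: ordered list of CPU masks, one per task, each covering
#5 cores of the chiplet directly connected to that task's GCD (illustrative values)
CPU_BIND="mask_cpu:0x0000001f,0x00001f00,0x001f0000"

srun -N 1 -n 3 -c 8 --gres=gpu:3 --cpu-bind=${CPU_BIND} ./$wrapper ./hello_jobstep

rm -f $wrapper    #the wrapper is deleted when execution is finalised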
Now, let's take a look at the output after executing the script:
...
The output of the hello_jobstep code tells us that the job ran on node nid001004 and that 3 MPI tasks were spawned. Each of the MPI tasks has 5 CPU-cores assigned to it (through the OMP_NUM_THREADS environment variable in the script), which can be identified by their HWT numbers. Also, each thread has only 1 visible GCD (logical/Slurm GPU). The hardware identification of the GPU is done via the Bus_ID (as the other GPU_IDs are not physical but relative to the job).
After checking the architecture diagram at the top of this page, it can be clearly seen that each of the CPU-cores assigned to the job is on a different L3 cache group, that is, on a different chiplet (Slurm socket). But more importantly, it can be seen that the binding is optimal.
"Click" in the TAB above to read the script and output for the other method of GPU binding.
Example scripts for: Jobs where each task needs access to multiple GPUs
Exclusive nodes: all 8 GPUs in each node accessible to all 8 tasks in the node
Some applications, like TensorFlow and other machine learning applications, may require access to all the available GPUs in the node. In this case, optimal binding and communication cannot be guaranteed by the scheduler when assigning resources to the srun launcher, so the full responsibility for the optimal use of the resources lies with the code itself.
As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. This example considers a job that will make use of the 8 GCDs (logical/Slurm GPUs) on each of 2 nodes (16 "allocation-packs" in total). The resource request uses the following two parameters:
#SBATCH --nodes=2 #2 nodes in this example
#SBATCH --exclusive #All resources of each node are exclusive to this job
# #8 GPUs per node (16 "allocation-packs" in total for the job)
Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header.
The use/management of the allocated resources is controlled by the srun options and some environment variables. As mentioned above, optimal binding cannot be achieved by the scheduler in this case, so no binding settings are given to the launcher. Also, all the GPUs in the node are made available to each of the tasks:
...
Listing N. exampleScript_2NodesExclusive_16GPUs_8VisiblePerTask.sh
...
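A minimal sketch of such a script, assuming the hello_jobstep test code and a hypothetical gpu partition name, could be:
#!/bin/bash --login
#SBATCH --nodes=2                #2 nodes in this example
#SBATCH --exclusive              #All resources of each node are exclusive to this job
#SBATCH --partition=gpu          #hypothetical partition name
#SBATCH --time=00:10:00

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1         #1 CPU core per task

#8 tasks per node; no GPU-binding options are given, so every task sees all 8 GCDs of its node
srun -N 2 -n 16 -c 8 --gres=gpu:8 ./hello_jobstep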
The output of the hello_jobstep code tells us that the job ran 8 MPI tasks on node nid002944 and another 8 MPI tasks on node nid002946. Each of the MPI tasks has only 1 CPU-core assigned to it (through the OMP_NUM_THREADS environment variable in the script), which can be identified by its HWT number. Clearly, each of the CPU tasks runs on a different chiplet.
More importantly for this example, each of the MPI tasks has access to all 8 GCDs (logical/Slurm GPUs) in its node. Proper and optimal GPU management and communication is the responsibility of the code. The hardware identification of the GPUs is done via the Bus_ID (as the other GPU_IDs are not physical but relative to the job).
Shared nodes: Many GPUs requested but 2 GPUs bound to each task
Some applications may require each of the spawned tasks to have access to multiple GPUs. In this case, some optimal binding and communication can still be granted by the scheduler when assigning resources with the srun launcher, although the final responsibility for the optimal use of the multiple GPUs assigned to each task lies with the code itself.
As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. This example considers a job that will make use of 6 GCDs (logical/Slurm GPUs) on 1 node (6 "allocation-packs" in total). The resource request uses the following two parameters:
#SBATCH --nodes=1 #1 node in this example
#SBATCH --gres=gpu:6 #6 GPUs per node (6 "allocation packs" in total for the job)
Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header.
The use/management of the allocated resources is controlled by the srun options and some environment variables. As mentioned above, some optimal binding can still be achieved by the scheduler by providing 2 GPUs to each of the tasks:
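A minimal sketch of this case, assuming the hello_jobstep test code and Slurm's --gpus-per-task and --gpu-bind options, could be:
#!/bin/bash --login
#SBATCH --nodes=1                #1 node in this example
#SBATCH --gres=gpu:6             #6 GPUs per node (6 "allocation packs" in total for the job)
#SBATCH --partition=gpu          #hypothetical partition name
#SBATCH --time=00:10:00

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1         #1 CPU core per task

#3 tasks, 2 GCDs each; --gpu-bind=closest asks the scheduler for the two GCDs
#nearest to the chiplets assigned to each task
srun -N 1 -n 3 -c 16 --gres=gpu:6 --gpus-per-task=2 --gpu-bind=closest ./hello_jobstep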
...
- Setonix User Guide
- Example Slurm Batch Scripts for Setonix on CPU Compute Nodes
- Setonix General Information: GPU node architecture