Node architecture
The GPU node architecture is different from that on the CPU-only nodes. The following diagram shows the connections between the CPU and GPUs on the node, which will assist with understanding recommendations for Slurm job scripts later on this page. Note that the numbering of the cores of the CPU has a slightly different order to that of the GPUs.
(Diagram: GPU node architecture, showing the connections between the CPU chiplets and the GPUs on the node.)
Each GPU node has 4 MI250X GPU cards, each of which contains 2 logical GPUs, so each GPU node has 8 GPUs. The single AMD CPU chip has 64 cores organised in 8 groups of 8 cores that share the same L3 cache. More importantly, each of these L3 cache groups (or chiplets) has a direct Infinity Fabric connection to just one of the GPUs, towards which communication is optimal. Communication between a chiplet and any other GPU is not optimal, as it requires at least one additional communication hop. (In the examples below, we use the numbering of the CPU cores and the bus IDs of the GPUs to identify the allocated chiplets and GPUs, and their binding.)
In order to achieve best performance, the current allocation method uses a basic allocation unit called the "allocation pack". Users should then only request a number of "allocation packs". Each allocation pack consists of:
- 1 whole CPU chiplet (8 CPU cores)
- ~32 GB memory
- 1 GPU directly connected to that chiplet
Note: For jobs that only use a partial set of the resources of a node (non-exclusive jobs that share the rest of the node with other jobs), the current Setonix GPU configuration may not provide perfect allocation and binding, which may impact performance depending on the amount of CPU-GPU communication. This is under active investigation, and the recommendations in this document will achieve optimal allocations in most cases, but this is not 100% guaranteed. Therefore, if you detect that imperfect binding, or the use of shared nodes (even with optimal binding), is impacting the performance of your jobs, it is recommended to use exclusive nodes where possible, noting that the project will still be charged for the whole node even if part of the resources remain idle. Also report the observed issues to Pawsey's helpdesk.
Further details of the node architecture are also available on the GPU node architecture page.
Slurm use of GPU nodes
Project name to access the GPU nodes is different
Note: The default project name will not give you access to the GPU nodes. In order to access the GPU nodes, users need to add the postfix "-gpu" to their project name and explicitly indicate it in the resource request options. This applies to all GPU partitions (gpu, gpu-dev & gpu-highmem). So, for example, if your project name is "rottnest0001", the setting would be as in the sketch below.
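A minimal sketch of the corresponding request option (assuming the project name is passed through the standard Slurm --account option) is:
#SBATCH --account=rottnest0001-gpu    #project name with the "-gpu" postfix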
New way of requesting resources ("allocation packs")
Warning: There are now two methods to achieve optimal binding of GPUs: the use of srun parameters (method 1), and the "manual" method (method 2). The first method is simpler, but may not work for all codes. "Manual" binding may be the only useful method for codes relying on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and attempting to use GPU-to-GPU enabled MPI communication. An example of such a code is Slate.
Required resources per job | New "simplified" way of requesting resources | Total allocated resources | Charge per hour | Full explicit srun options for the use/management of resources (see notes below)
---|---|---|---|---
1 CPU task (single CPU thread) controlling 1 GPU | #SBATCH --nodes=1 #SBATCH --gpus-per-node=1 | 1 allocation pack = 1 GPU, 8 CPU cores (1 chiplet), 29.44 GB RAM | 64 SU | export OMP_NUM_THREADS=1 srun -N 1 -n 1 -c 8 --gpus-per-node=1 --gpus-per-task=1 (*1)
14 CPU threads all controlling the same 1 GPU | #SBATCH --nodes=1 #SBATCH --gpus-per-node=2 | 2 allocation packs = 2 GPUs, 16 CPU cores (2 chiplets), 58.88 GB RAM | 128 SU | export OMP_NUM_THREADS=14 srun -N 1 -n 1 -c 16 --gpus-per-node=1 --gpus-per-task=1 (*2)
3 CPU tasks (single thread each), each controlling 1 GPU with GPU-aware MPI communication | #SBATCH --nodes=1 #SBATCH --gpus-per-node=3 | 3 allocation packs = 3 GPUs, 24 CPU cores (3 chiplets), 88.32 GB RAM | 192 SU | export MPICH_GPU_SUPPORT_ENABLED=1 srun -N 1 -n 3 -c 8 --gpus-per-node=3 --gpus-per-task=1 --gpu-bind=closest (*3)
2 CPU tasks (single thread each), each task controlling 2 GPUs with GPU-aware MPI communication | #SBATCH --nodes=1 #SBATCH --gpus-per-node=4 | 4 allocation packs = 4 GPUs, 32 CPU cores (4 chiplets), 117.76 GB RAM | 256 SU | export MPICH_GPU_SUPPORT_ENABLED=1 srun -N 1 -n 2 -c 16 --gpus-per-node=4 --gpus-per-task=2 --gpu-bind=closest (*4)
8 CPU tasks (single thread each), each controlling 1 GPU with GPU-aware MPI communication | #SBATCH --nodes=1 #SBATCH --exclusive | 8 allocation packs = 8 GPUs, 64 CPU cores (8 chiplets), 235 GB RAM | 512 SU | export MPICH_GPU_SUPPORT_ENABLED=1 srun -N 1 -n 8 -c 8 --gpus-per-node=8 --gpus-per-task=1 --gpu-bind=closest
Notes for the request of resources:
- Note that this simplified way of resource request is based on requesting a number of "allocation packs".
- Users should not include any other Slurm allocation option that may indicate some "calculation" of required memory or CPU cores. The management of resources should only be performed after allocation, via srun options.
- The same simplified resource request should be used when requesting interactive sessions with salloc.
- IMPORTANT: In addition to the request parameters shown in the table, users should indeed use other Slurm request parameters related to partition, walltime, job naming, output, email, etc. (Check the examples of the full Slurm batch scripts later in this page; a minimal sketch of a typical request header is also given below.)
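For illustration, a minimal sketch of a typical request header combining these notes (account, partition, walltime and job name are placeholders to be adapted) is:
#!/bin/bash --login
#SBATCH --account=yourproject-gpu     #project name with the "-gpu" postfix
#SBATCH --partition=gpu               #or gpu-dev, gpu-highmem
#SBATCH --job-name=myGPUJob           #job naming
#SBATCH --time=01:00:00               #walltime
#SBATCH --nodes=1                     #number of nodes
#SBATCH --gpus-per-node=3             #number of "allocation packs" per node
Note that, following the notes above, no memory or CPU-core options are included in the header.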
Notes for the use/management of resources with srun:
- IMPORTANT: The use of --gpu-bind=closest may NOT work for codes relying on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and attempting to use GPU-to-GPU enabled MPI communication. For those cases, the use of the "manual" optimal binding (method 2) is required.
- The --cpus-per-task (-c) option should be set to multiples of 8 (whole chiplets) to guarantee that srun distributes the resources in "allocation packs", "reserving" whole chiplets per srun task, even if the real number of threads per task is smaller. The real number of threads is controlled with the OMP_NUM_THREADS variable.
- (*1) This is the only case where srun may work fine with default inherited option values. Nevertheless, it is good practice to use full explicit srun options to indicate the resources needed for the executable. In this case, the settings explicitly "reserve" a whole chiplet (-c 8) for the srun task and the real number of threads is controlled with the OMP_NUM_THREADS variable.
- (*2) The required number of CPU threads per task is 14, so two full chiplets (-c 16) are indicated for each srun task and the number of threads is controlled with the OMP_NUM_THREADS variable.
- (*3) The settings explicitly "reserve" a whole chiplet (-c 8) for each srun task. This provides a "one-chiplet-long" separation among the CPU cores to be allocated for the tasks spawned by srun (-n 3). The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of each GPU to its corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.
- (*4) Note the use of -c 16 to "reserve" a "two-chiplets-long" separation among the two CPU cores that are to be used (one for each of the srun tasks, -n 2). In this way, each task is in direct communication with the two logical GPUs in the MI250X card that has the optimal connection to its chiplet.
General notes:
- The allocation charge is for the total of the allocated resources, not only for the resources explicitly used during execution, so any idle resources within the allocation will also be charged.
Methods to achieve optimal binding of GPUs
As mentioned above, and as the node diagram at the top of the page shows, the optimal placement of GPUs and CPU cores for each task is the one with direct communication between the CPU chiplet and the GPU in use. So, according to the node diagram, tasks executed on cores in chiplet 0 should use GPU 4 (Bus D1), tasks in chiplet 1 should use GPU 5 (Bus D6), and so on.
Method 1: Use of srun parameters for optimal binding
This is the most intuitive (and simplest) method for achieving optimal placement of CPUs and GPUs in each task spawned by srun. It consists of providing the --gpus-per-task and --gpu-bind=closest parameters. So, for example, in a job that requires the use of 8 CPU tasks (single threaded) with 1 GPU per task, the srun command to be used is:
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
srun -N 1 -n 8 -c 8 --gpus-per-node=8 --gpus-per-task=1 --gpu-bind=closest myMPIExecutable
The explanation of this method will be completed in the following sections, where a very useful test code (hello_jobstep) is used to confirm optimal (or sub-optimal, or incorrect) binding of GPUs and chiplets for srun job steps. Other examples of its use are already listed in the table above, and its use in full scripts is provided at the end of this page.
It is important to be aware that this method works fine for most codes, but not for all. Codes suffering MPI communication errors with this method should try the "manual" binding method described next.
Method 2: "Manual" method for optimal binding
Info: We acknowledge that the use of this method to control CPU and GPU placement was initially taken from the LUMI supercomputing documentation at CSC. From there, we have further automated parts of it for use on shared GPU nodes. We are very thankful to LUMI staff for their collaborative support in the use and configuration of Setonix.
For codes relying on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and attempting to use GPU-to-GPU enabled MPI communication, the first method may fail at runtime with MPI communication errors.
For these codes, the alternative is to use the "manual" method. This second method is more elaborate than the first but, as said, may be the only option for some codes.
In this "manual" method, the --gpus-per-task
and the --gpu-bind
parameters (key of the first method) should NOT be provided. And, instead of those two parameters, we use two auxiliary techniques:
- A wrapper script that sets a single and different value of
ROCR_VISIBLE_DEVICE
variable for eachsrun
task, then assigning a single and different GPU per task. - An ordered list of CPU cores in the
--cpu-bind
option ofsrun
to explicitly indicate the CPU cores where each task will be placed.
These two auxiliary techniques work in coordination to ensure the best possible match of CPU cores and GPUs.
Auxiliary technique 1: Using a wrapper to select 1 different GPU for each of the tasks spawned by srun
This first auxiliary technique uses the following wrapper script:
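A minimal sketch of such a wrapper (named selectGPU_X.sh in the examples on this page; the script actually used may differ in detail), consistent with the description below, is:
#!/bin/bash
# Select a single GPU for this task: SLURM_LOCALID is the node-local task ID
# (0, 1, 2, ...), so each task gets a different GPU as its only visible device.
export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID
# Execute the command (and its parameters) passed to the wrapper.
exec "$@"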
The wrapper script defines the value of the ROCm environment variable ROCR_VISIBLE_DEVICES with the value of the Slurm environment variable SLURM_LOCALID. It then executes the rest of the parameters given to the script, which are the usual execution instructions for the program intended to be executed. The SLURM_LOCALID variable has the identification number of the task within each of the nodes (not a global identification, but an identification number local to the node). Further details about the variable are available in the Slurm documentation.
The wrapper should be called first and then the executable (and its parameters, if any). For example, in a job that requires the use of 8 CPU tasks (single threaded) with 1 GPU per task, the srun command to be used is:
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
srun -N 1 -n 8 -c 8 --gpus-per-node=8 --cpu-bind=${CPU_BIND} ./selectGPU_X.sh myMPIExecutable
The wrapper will be run by each of the 8 tasks spawned by srun (-n 8) and will assign a different, single value of ROCR_VISIBLE_DEVICES to each of the tasks. Thus, the task with SLURM_LOCALID=0 will receive GPU 0 (Bus C1) as the only GPU visible to that task, the task with SLURM_LOCALID=1 will receive GPU 1 (Bus C6), and so forth.
As mentioned above, the "manual" method consists of two auxiliary techniques working together. The second technique consists of providing an ordered list of the desired CPU cores to be bound to the tasks. The --cpu-bind=${CPU_BIND} option controls that binding, as detailed in the following sub-section.
Auxiliary technique 2: Using a list of CPU cores to control task placement
This second auxiliary technique uses an ordered list of CPU cores to be bound to each of the tasks spawned by srun. An example of a "hardcoded" ordered list that would correctly bind to the 8 GPUs in a node is:
CPU_BIND="map_cpu:49,57,17,25,0,9,33,41"
("map_cpu" is a Slurm indicator of the type of binding to be used. Please read the Slurm documentation for further details.)
According to the node diagram at the top of this page, this list consists of 1 CPU core per chiplet. What may not be very intuitive is the ordering: the order follows the identification numbers of the GPUs in the node, so that each of the CPU cores corresponds to the chiplet that is directly connected to each of the GPUs (in order). Then, the set of commands to use for a job that requires the use of 8 CPU tasks (single threaded) with 1 GPU per task would be:
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
CPU_BIND="map_cpu:49,57,17,25,0,9,33,41"
srun -N 1 -n 8 -c 8 --gpus-per-node=8 --cpu-bind=${CPU_BIND} ./selectGPU_X.sh myMPIExecutable
This provides the optimal binding in a job that requires the use of 8 CPU tasks (single threaded) with 1 GPU per task.
For hybrid jobs, that is, jobs that require multiple CPU threads per task, the list needs to be a list of masks instead of CPU core IDs. The use of this list of masks is explained in the next subsection, which also describes an auxiliary script that generates the lists of CPU cores or masks for general cases.
For jobs that request exclusive use of the GPU nodes, the settings described so far are enough to achieve optimal binding with the "manual" method. This works because the identification numbers of all the GPUs and CPU cores that will be assigned to the job are known beforehand (all the resources of the node are requested). But when a job requires a reduced amount of resources, sharing the rest of the node with other jobs, the GPUs and CPU cores to be allocated are not known before the job is submitted. Therefore, a "hardcoded" list of CPU cores that always achieves optimal binding cannot be defined beforehand. To avoid this problem, for jobs that request resources on shared nodes, we provide a script that generates the correct list once the job starts execution.
Use of generate_CPU_BIND.sh script for generating an ordered list of CPU cores for optimal binding
The generation of the ordered list to be used with the --cpu-bind option of srun can be automated with the script generate_CPU_BIND.sh, which is available by default to all users through the module pawseytools (loaded by default).
The generate_CPU_BIND.sh script receives one parameter (map_cpu OR mask_cpu) and gives back the best ordered list of CPU cores or CPU masks for optimal communication between tasks and GPUs.
For a better understanding of what this script generates and how it is useful, we can use an interactive session. Note that the request parameters given to salloc only include the number of nodes and the number of GPUs per node, to request a number of "allocation packs" (as described at the top of this page). In this case, 3 "allocation packs" are requested.
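A sketch of the sequence of commands for such an interactive session (partition, project name and walltime are placeholders) is:
salloc -N 1 --gpus-per-node=3 -p gpu -A yourproject-gpu --time=00:30:00
scontrol show job ${SLURM_JOB_ID}     #check the resources granted to the job
rocm-smi                              #list the allocated GPUs and their bus IDs
generate_CPU_BIND.sh map_cpu          #ordered list of CPU cores for optimal binding
generate_CPU_BIND.sh mask_cpu         #ordered list of CPU masks for hybrid jobs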
In this session, 3 "allocation packs" were requested, and the total amount of allocated resources is reported in the output of the scontrol command, including the 3 GPUs and 88.32 GB of memory. The rocm-smi command lists the three allocated GPUs, identified locally as GPU:0-BUS_ID:C9, GPU:1-BUS_ID:D1 & GPU:2-BUS_ID:D6.
When the generate_CPU_BIND.sh script is used with the parameter map_cpu, it creates a list of CPU cores that can be used in the srun command for optimal binding. In this case, we get map_cpu:21,2,14 which, in order, correspond to the Slurm sockets chiplet2, chiplet0 and chiplet1; these are the chiplets in direct connection to the C9, D1 and D6 GPUs, respectively. (Check the GPU node architecture diagram at the top of this page.)
For jobs that require several threads per CPU task, srun needs a list of masks instead of CPU core IDs. The generate_CPU_BIND.sh script can generate this list when the parameter mask_cpu is used. The script then creates a list of hexadecimal CPU masks that can be used to optimally bind a hybrid job. In this case, we get mask_cpu:0000000000FF0000,00000000000000FF,000000000000FF00. These masks, in order, activate only the CPU cores of chiplet2, chiplet0 and chiplet1; these are the chiplets in direct connection to the C9, D1 and D6 GPUs, respectively. (Check the GPU node architecture diagram at the top of this page and external Slurm documentation for a detailed explanation of masks.)
Extensive documentation on the use of masks is available in the online Slurm documentation, but a brief explanation is given here. The first thing to notice is that each mask has 16 hexadecimal characters, and each character can be understood as a hexadecimal "mini-mask" corresponding to 4 CPU cores. A pair of characters therefore covers 8 CPU cores; that is, each pair of characters represents a chiplet. So, for example, the second mask in the list (00000000000000FF) disables all the cores of the CPU for the second MPI task except the first 8 cores, which correspond to chiplet0. (Remember to read numbers with the usual increase in hierarchy: right to left.) The first character (right to left) is the hexadecimal mini-mask of CPU cores C00-C03, and the second character (right to left) is the hexadecimal mini-mask of CPU cores C04-C07.
To understand what a hexadecimal character really means, we need its conversion to a binary number. Consider a hypothetical example: suppose one would like to make available only the third (C02) and fourth (C03) CPU cores of a mini-mask, using binary digits to represent availability. Again increasing hierarchy from right to left, the binary mini-mask would be "1100" (third and fourth cores available). This binary mini-mask represents the decimal number "12", and the hexadecimal mini-mask is "C". Now, if all 4 cores of the mini-mask are to be available to the task, the binary mini-mask would be "1111", which represents the decimal number "15", and the hexadecimal mini-mask is "F". With this in mind, it can be seen that the full masks in the original list make available only the cores in chiplet2 (and nothing else) for the first task (and its threads) spawned by srun, only the cores of chiplet0 for the second task, and only the cores of chiplet1 for the third task.
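As a quick check, the chiplet2 mask in the list above can be reproduced by shifting a full-chiplet mask (0xFF, i.e. 8 cores) up by 16 bit positions (two chiplets of 8 cores):
printf "%016X\n" $(( 0xFF << 16 ))    #prints 0000000000FF0000 (cores C16-C23, i.e. chiplet2)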
In practice, it is common to assign the output of the generate_CPU_BIND.sh script to a variable which is then used within the srun command. So, for a job that requires the use of 8 CPU tasks (single threaded) with 1 GPU per task, the set of commands to be used would be:
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
CPU_BIND=$(generate_CPU_BIND.sh map_cpu)
srun -N 1 -n 8 -c 8 --gpus-per-node=8 --cpu-bind=${CPU_BIND} ./selectGPU_X.sh myMPIExecutable
Note that the selectGPU_X.sh wrapper is part of the first auxiliary technique of the "manual" method of optimal binding and is described in the sub-sections above.
The explanation of the "manual" method will be completed in the following sections, where a very useful test code (hello_jobstep) is used to confirm optimal (or sub-optimal) binding of GPUs and chiplets for srun job steps.
(If users want to list the generation script in order to check the logic within, they can use the following command: cat $(which generate_CPU_BIND.sh) )
MPI & OpenMP settings
Use OMP_NUM_THREADS to control the threads launched per task
As mentioned in the previous section, allocation of resources is granted in "allocation packs" with 8 cores (1 chiplet) per GPU. Also briefly mentioned was the need to "reserve" whole chiplets (multiples of 8 CPU cores) in the srun command via the --cpus-per-task (-c) option. The use of this option in srun is more of a "reservation" parameter for binding the srun tasks to whole chiplets than an indication of the real number of threads to be used by the executable. The real number of threads to be used by the executable needs to be controlled by the OpenMP environment variable OMP_NUM_THREADS. In other words, we use --cpus-per-task to make whole chiplets available to the srun task, but use OMP_NUM_THREADS to control the real number of threads per srun task.
For pure MPI-GPU jobs, it is recommended to set OMP_NUM_THREADS=1 before executing the srun command, to avoid the unexpected use of OpenMP threads:
export OMP_NUM_THREADS=1
srun ... -c 8 ...
For GPU codes with hybrid management on the CPU side (MPI + OpenMP + GPU), the environment variable needs to be set to the required number of threads per MPI task. For example, if 4 threads per task are required, then settings should be:
export OMP_NUM_THREADS=4
srun ... -c 8 ...
Also mentioned above is the example of a case where the real number of threads is 14 (greater than 8), therefore requiring more than one chiplet. In that case, srun should reserve the number of chiplets per task that satisfies the demand, using multiples of 8 in the --cpus-per-task (-c) option, together with setting the real number of threads via the OMP_NUM_THREADS environment variable:
export OMP_NUM_THREADS=14
srun ... -c 16 ...
GPU-Aware MPI
Note: To use GPU-aware Cray MPICH, users must set the following modules and environment variables.
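A sketch of the typical settings for GPU-aware Cray MPICH on an AMD MI250X system is given below; the module name is an assumption to be checked against the modules actually available on Setonix:
module load craype-accel-amd-gfx90a   #assumed Cray module enabling GPU support for MI250X
export MPICH_GPU_SUPPORT_ENABLED=1    #enable GPU-aware communication in Cray MPICH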
Test code: hello_jobstep
Info: On this page, an MPI+OpenMP+HIP "Hello, World" program (hello_jobstep) is used to clarify the placement of tasks on CPU cores and the associated GPU bindings.
Later in this page, full examples of batch scripts for the most common scenarios for executing jobs on GPU compute nodes are presented. In order to show how GPUs are bound to the CPU cores assigned to the job, we make use of the hello_jobstep code within these same examples. For this reason, before presenting the full examples, we use this section to explain important details of the test code. (If researchers want to test the code themselves, the forked repository for Pawsey is the hello_jobstep repository.)
Compilation and basic use of the hello_jobstep test code
The explanation of the test code is provided with the output of an interactive session that uses 3 "allocation packs" to get access to 3 GPUs and 3 full CPU chiplets in different ways.
The first step is to create the session (with salloc, as in the interactive session shown above) and check that the resources were granted as 3 allocation packs.
The code is then compiled within the session, following the instructions in the hello_jobstep repository.
Next, check which GPUs are available, their labels and, more importantly, their BUS_ID for the current allocation (e.g. with the rocm-smi command).
In a first test, we observe what happens when no "management" parameters are given to srun, i.e. a "non-recommended" setting.
In that setting, each MPI task is assigned to a CPU core in a different chiplet, but all three allocated GPUs are visible to every task. Although some codes are able to deal with resources presented in this way, this is not the recommended best practice. The recommended best practice is to provide only 1 GPU per task and, moreover, to provide the optimal GPU, that is, the one in direct connection to the CPU chiplet that handles the task.
Using the hello_jobstep code to test optimal binding for a pure MPI job (single threaded), 1 GPU per task
Starting from the same allocation as above (3 "allocation packs"), all the parameters needed to define the correct use of resources are now provided to srun. In this case, 3 MPI tasks are to be run (single threaded), each task making use of 1 GPU. As described above, there are two methods to achieve optimal binding of the GPUs. The first method only uses Slurm parameters to indicate how the resources are to be used by srun.
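A sketch of the commands for this case (method 1; the executable is the compiled test code) is:
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
srun -N 1 -n 3 -c 8 --gpus-per-node=3 --gpus-per-task=1 --gpu-bind=closest ./hello_jobstep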
In the resulting output, GPU-BUS_ID:D1 has direct communication with a CPU core in chiplet0; GPU-BUS_ID:D6 is in direct communication with chiplet1; and GPU-BUS_ID:C9 with chiplet2, resulting in an optimal 1-to-1 binding.
A similar result can be obtained with the "manual" method for optimal binding. As detailed in the sub-sections above, this method uses a wrapper (selectGPU_X.sh, listed above) to define which GPU is going to be visible to each task, and also uses an ordered list of CPU cores (created with the script generate_CPU_BIND.sh, also described above) to bind the correct CPU core to each task.
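A sketch of the commands for the "manual" method in this case is:
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
CPU_BIND=$(generate_CPU_BIND.sh map_cpu)
srun -N 1 -n 3 -c 8 --gpus-per-node=3 --cpu-bind=${CPU_BIND} ./selectGPU_X.sh ./hello_jobstep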
In the resulting output, GPU-BUS_ID:C9 has direct communication with a CPU core in chiplet2; GPU-BUS_ID:D1 is in direct communication with chiplet0; and GPU-BUS_ID:D6 with chiplet1, again resulting in an optimal 1-to-1 binding. (Note that in the "manual" method, neither --gpus-per-task nor --gpu-bind is provided to srun.)
There are some differences between the results obtained with the first and second methods of optimal binding. A first difference is the order in which the GPUs and chiplets are assigned to each task. This first difference is not important, as long as the communication between CPU chiplet and GPU is optimal, as is the case for both methods. A second difference is in the values of ROCR_VISIBLE_DEVICES. With the first method, these values are always 0, while with the second method they are the values given by the wrapper that "manually" selects the GPUs. This second difference has proven to be important and may be the reason why "manual" binding is the only option for codes relying on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and attempting to use GPU-to-GPU enabled MPI communication.
Using the hello_jobstep code to test optimal binding for a hybrid job (MPI + several OpenMP threads), 1 GPU per MPI task
If the code is hybrid on the CPU side and needs several OpenMP CPU threads per task, we use the OMP_NUM_THREADS environment variable to control the number of threads. So, again starting from the previous session with 3 "allocation packs", consider a case with 3 MPI tasks, 4 OpenMP threads per task and 1 GPU per task.
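With method 1, a sketch of the commands is:
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=4              #4 OpenMP threads per MPI task
srun -N 1 -n 3 -c 8 --gpus-per-node=3 --gpus-per-task=1 --gpu-bind=closest ./hello_jobstep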
When "manual" optimal binding is required, the mask_cpu parameter needs to be used in the generator script (and the resulting masks in the --cpu-bind option of srun).
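A sketch of the commands for the "manual" method in this hybrid case is:
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=4              #4 OpenMP threads per MPI task
CPU_BIND=$(generate_CPU_BIND.sh mask_cpu)
srun -N 1 -n 3 -c 8 --gpus-per-node=3 --cpu-bind=${CPU_BIND} ./selectGPU_X.sh ./hello_jobstep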
As explained in the previous section, the masks provided by the generate_CPU_BIND.sh script make available only the cores of chiplet2 to the first MPI task and its OpenMP threads, only the cores of chiplet0 to the second task, and only the cores of chiplet1 to the third MPI task and its OpenMP threads.
From the output of the hello_jobstep code, it can be noted that the OpenMP threads use CPU cores in the same CPU chiplet as the main thread (or MPI task), and all the CPU cores of the corresponding chiplet are in direct communication with the GPU that has a direct physical connection to it. (Check the architecture diagram at the top of this page.)
Again, there is a difference in the values of ROCR_VISIBLE_DEVICES between the results of the two methods. With the first method, these values are always 0, while with the second method they are the values given by the wrapper that "manually" selects the GPUs. This difference has proven to be important and may be the reason why "manual" binding is the only option for codes relying on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and attempting to use GPU-to-GPU enabled MPI communication.
Example scripts for: Exclusive access to the GPU nodes
In this section, a series of example Slurm job scripts are presented so that users can use them as a point of departure for preparing their own scripts. The examples make use of most of the important concepts, tools and techniques explained in the previous sections, so we encourage users to read the top sections of this page first.
Exclusive Node Multi-GPU job: 8 GPUs, each of them controlled by one MPI task
As for all scripts, we provide the parameters for requesting the necessary "allocation packs" for the job. This example considers a job that will make use of the 8 GPUs of 1 node (8 "allocation packs"). The resource request uses the following two parameters:
#SBATCH --nodes=1 #1 node in this example
#SBATCH --exclusive #All resources of the node are exclusive to this job
# #8 GPUs per node (8 "allocation packs" in total for the job)
Note that only these two allocation parameters are needed to provide the information for the requested number of allocation packs, and no other parameter related to memory or CPU cores should be provided in the request header.
The use/management of the allocated resources is controlled by the srun options and some environment variables. As mentioned above, there are two methods for achieving optimal binding. The method that uses only srun parameters (method 1) is preferred, but may not always work; in those cases, the "manual" method (method 2) may be needed.
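As a point of departure, a sketch of such a batch script using method 1 (account, partition, walltime and executable name are placeholders to be adapted) is:
#!/bin/bash --login
#SBATCH --account=yourproject-gpu
#SBATCH --partition=gpu
#SBATCH --job-name=8GPUExclusiveNode
#SBATCH --time=01:00:00
#SBATCH --nodes=1              #1 node in this example
#SBATCH --exclusive            #All resources of the node are exclusive to this job
#                              #8 GPUs per node (8 "allocation packs" in total for the job)

export MPICH_GPU_SUPPORT_ENABLED=1   #enable GPU-aware MPI communication
export OMP_NUM_THREADS=1             #single CPU thread per task

srun -N 1 -n 8 -c 8 --gpus-per-node=8 --gpus-per-task=1 --gpu-bind=closest ./myMPIExecutable
For method 2, the srun line would instead use CPU_BIND=$(generate_CPU_BIND.sh map_cpu) together with --cpu-bind=${CPU_BIND} and the selectGPU_X.sh wrapper (and without --gpus-per-task and --gpu-bind), as described earlier on this page.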
N Exclusive Nodes Multi-GPU job: 8*N GPUs, each of them controlled by one MPI task
As for all scripts, we provide the parameters for requesting the necessary "allocation packs" for the job. The same procedure mentioned above for the single exclusive node job applies to multi-node exclusive jobs; the only difference when requesting resources is the number of exclusive nodes. So, for example, for a job requiring 2 exclusive nodes (16 GPUs or 16 "allocation packs"), the resource request uses the following two parameters:
#SBATCH --nodes=2 #2 nodes in this example
#SBATCH --exclusive #All resources of the node are exclusive to this job
# #8 GPUs per node (16 "allocation packs" in total for the job)
Note that only these two allocation parameters are needed to provide the information for the requested number of allocation packs, and no other parameter related to memory or CPU cores should be provided in the request header.
The use/management of the allocated resources is controlled by the srun options and some environment variables. As mentioned above, there are two methods for achieving optimal binding. The method that uses only srun parameters (method 1) is preferred, but may not always work; in those cases, the "manual" method (method 2) may be needed.
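A sketch of such a batch script for 2 exclusive nodes using method 1 (placeholders as before; note that the task count is 8 per node) is:
#!/bin/bash --login
#SBATCH --account=yourproject-gpu
#SBATCH --partition=gpu
#SBATCH --job-name=16GPUExclusiveNodes
#SBATCH --time=01:00:00
#SBATCH --nodes=2              #2 nodes in this example
#SBATCH --exclusive            #All resources of the nodes are exclusive to this job
#                              #8 GPUs per node (16 "allocation packs" in total for the job)

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1

srun -N 2 -n 16 -c 8 --gpus-per-node=8 --gpus-per-task=1 --gpu-bind=closest ./myMPIExecutable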
Example scripts for: Shared access to the GPU nodes
Shared node 1 GPU job
Jobs that need only 1 GPU for their execution will share the GPU compute node with other jobs. That is, they run in shared access, which is the default, so no request for exclusive access is performed.
As for all scripts, we provide the parameters for requesting the necessary "allocation packs" for the job. In this case we ask for 1 allocation pack with:
#SBATCH --nodes=1              #1 node in this example
#SBATCH --gpus-per-node=1      #1 GPU per node (1 "allocation pack" in total for the job)
Note that only these two allocation parameters are needed to provide the information for the requested number of allocation packs, and no other parameter related to memory or CPU cores should be provided in the request header.
The use/management of the allocated resources is controlled by the srun options and some environment variables. As only 1 allocation pack is requested, there is no need to take any other action for optimal binding of CPU chiplet and GPU, as it is guaranteed.
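A sketch of such a batch script (account, partition and walltime are placeholders; the hello_jobstep test code is used as the executable, as in the output discussed below) is:
#!/bin/bash --login
#SBATCH --account=yourproject-gpu
#SBATCH --partition=gpu
#SBATCH --job-name=1GPUSharedNode
#SBATCH --time=01:00:00
#SBATCH --nodes=1              #1 node in this example
#SBATCH --gpus-per-node=1      #1 GPU per node (1 "allocation pack" in total for the job)

export OMP_NUM_THREADS=1       #single CPU thread

srun -N 1 -n 1 -c 8 --gpus-per-node=1 --gpus-per-task=1 ./hello_jobstep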
After executing this example, the output of the hello_jobstep code tells us that CPU core "002" and the GPU with Bus_ID:D1 were utilised by the job. Optimal binding is guaranteed for a single "allocation pack", as the memory, CPU chiplet and GPU of each pack are optimally connected.
Shared node 3 MPI tasks each controlling 1 GPU
As for all scripts, we provide the parameters for requesting the necessary "allocation packs" for the job. In this case we ask for 3 allocation packs with:
#SBATCH --nodes=1              #1 node in this example
#SBATCH --gpus-per-node=3      #3 GPUs per node (3 "allocation packs" in total for the job)
Note that only these two allocation parameters are needed to provide the information for the requested number of allocation packs, and no other parameter related to memory or CPU cores should be provided in the request header.
The use/management of the allocated resources is controlled by the srun options and some environment variables. As mentioned above, there are two methods for achieving optimal binding. The method that uses only srun parameters (method 1) is preferred, but may not always work; in those cases, the "manual" method (method 2) may be needed.
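A sketch of such a batch script using method 1 (placeholders as before) is:
#!/bin/bash --login
#SBATCH --account=yourproject-gpu
#SBATCH --partition=gpu
#SBATCH --job-name=3GPUSharedNode
#SBATCH --time=01:00:00
#SBATCH --nodes=1              #1 node in this example
#SBATCH --gpus-per-node=3      #3 GPUs per node (3 "allocation packs" in total for the job)

export MPICH_GPU_SUPPORT_ENABLED=1   #enable GPU-aware MPI communication
export OMP_NUM_THREADS=1             #single CPU thread per task

srun -N 1 -n 3 -c 8 --gpus-per-node=3 --gpus-per-task=1 --gpu-bind=closest ./myMPIExecutable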
Example scripts for: Hybrid jobs (multiple threads) on the CPU side
When the code is hybrid on the CPU side (MPI + OpenMP), the logic is similar to the above examples, except that more than 1 CPU core of each chiplet needs to be accessible per srun task. This is controlled by the OMP_NUM_THREADS environment variable, and it also implies a change in the settings for the optimal binding of resources when "manual" binding (method 2) is applied.
In the following example, we use 3 GPUs (1 per MPI task) and the number of CPU threads per task is 5. As for all scripts, we provide the parameters for requesting the necessary "allocation packs" for the job. In this case we ask for 3 allocation packs with:
#SBATCH --nodes=1              #1 node in this example
#SBATCH --gpus-per-node=3      #3 GPUs per node (3 "allocation packs" in total for the job)
Note that only these two allocation parameters are needed to provide the information for the requested number of allocation packs, and no other parameter related to memory or CPU cores should be provided in the request header. And the real number of threads per task is controlled with:
export OMP_NUM_THREADS=5 #This controls the real CPU-cores per task for the executable
The use/management of the allocated resources is controlled by the srun options and some environment variables. As mentioned above, there are two methods for achieving optimal binding. The method that uses only srun parameters (method 1) is preferred, but may not always work; in those cases, the "manual" method (method 2) may be needed.
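A sketch of such a batch script using method 1 (placeholders as before) is:
#!/bin/bash --login
#SBATCH --account=yourproject-gpu
#SBATCH --partition=gpu
#SBATCH --job-name=3GPUHybridJob
#SBATCH --time=01:00:00
#SBATCH --nodes=1              #1 node in this example
#SBATCH --gpus-per-node=3      #3 GPUs per node (3 "allocation packs" in total for the job)

export MPICH_GPU_SUPPORT_ENABLED=1   #enable GPU-aware MPI communication
export OMP_NUM_THREADS=5             #This controls the real CPU-cores per task for the executable

srun -N 1 -n 3 -c 8 --gpus-per-node=3 --gpus-per-task=1 --gpu-bind=closest ./myMPIExecutable
For method 2, the srun line would instead use CPU_BIND=$(generate_CPU_BIND.sh mask_cpu) together with --cpu-bind=${CPU_BIND} and the selectGPU_X.sh wrapper, as described earlier on this page.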
Example scripts for: Packing GPU jobs
Pack the execution of 8 independent instances each using 1 GPU
This kind of packing can be performed with the help of an additional packing-wrapper script (jobPackWrapper.sh) that rules the independent execution of different codes (or different instances of the same code) to be run by each of the tasks spawned by srun. (It is important to understand that these instances do not interact with each other via MPI messaging.) The isolation of each code/instance should be performed via the logic included in this packing-wrapper script.
In the following example, the packing-wrapper creates 8 different output directories and then launches 8 different instances of the hello_nompi code. The output of each execution is saved in a different case directory and file. In this case, the executable does not receive any further parameters but, in practice, users should define the logic for their own purposes and, if needed, include the logic to receive different parameters for each instance.
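A minimal sketch of such a packing-wrapper, consistent with the description above (the path to the hello_nompi executable is a placeholder), is:
#!/bin/bash
# jobPackWrapper.sh: each srun task runs one independent instance of the code.
# SLURM_PROCID identifies the task and is used to tag its case directory and log file.
caseDir=case_${SLURM_PROCID}
mkdir -p ${caseDir}
cd ${caseDir}
../hello_nompi > log_${SLURM_PROCID}.out 2>&1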
Note that, besides the use of the additional packing-wrapper, the rest of the script is very similar to the single-node exclusive examples given above. As for all scripts, we provide the parameters for requesting the necessary "allocation packs" for the job. This example considers a job that will make use of the 8 GPUs of 1 node (8 "allocation packs"), with each pack used by one of the instances controlled by the packing-wrapper. The resource request uses the following two parameters:
#SBATCH --nodes=1 #1 node in this example
#SBATCH --exclusive #All resources of the node are exclusive to this job
# #8 GPUs per node (8 "allocation packs" in total for the job)
Note that only these two allocation parameters are needed to provide the information for the requested number of allocation packs, and no other parameter related to memory or CPU cores should be provided in the request header.
The use/management of the allocated resources is controlled by the srun options and some environment variables. For srun, this is no different from an MPI job with 8 tasks. But in reality this is not an MPI job: srun spawns 8 tasks, each one executing the packing-wrapper, and the logic of the packing-wrapper allows for 8 independent executions of the desired code(s).
As mentioned above, there are two methods for achieving optimal binding. The method that uses only srun parameters (method 1) is preferred, but may not always work; in those cases, the "manual" method (method 2) may be needed.
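A sketch of the main batch script for this packing example, using method 1 for binding (placeholders as before), is:
#!/bin/bash --login
#SBATCH --account=yourproject-gpu
#SBATCH --partition=gpu
#SBATCH --job-name=8GPUPackJob
#SBATCH --time=01:00:00
#SBATCH --nodes=1              #1 node in this example
#SBATCH --exclusive            #All resources of the node are exclusive to this job
#                              #8 GPUs per node (8 "allocation packs" in total for the job)

export OMP_NUM_THREADS=1       #single CPU thread per instance

srun -N 1 -n 8 -c 8 --gpus-per-node=8 --gpus-per-task=1 --gpu-bind=closest ./jobPackWrapper.sh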
After execution of the main Slurm batch script, 8 case directories are created (each tagged with its corresponding SLURM_PROCID), and within each of them there is a log file corresponding to the execution of the instance that ran according to the logic of the jobPackWrapper.sh script.
Comparing the output of each of the instances of the hello_nompi code to the GPU node architecture diagram, it can be seen that the binding of the allocated GPUs to the L3 cache group chiplets (Slurm sockets) is optimal for each of them.
Related pages
- Setonix User Guide
- Example Slurm Batch Scripts for Setonix on CPU Compute Nodes
- Setonix General Information: GPU node architecture