...
To achieve the best performance, the current allocation method uses a basic allocation unit called an "allocation-pack". Users should request resources only as a number of "allocation-packs". Each allocation-pack consists of:
- 1 whole CPU chiplet (8 CPU cores)
- ~32 GB memory
- 1 GCD (Slurm GPU) directly connected to that chiplet
...
New way of requesting resources:
Warning: There are now two methods to achieve optimal binding of GPUs:
- The first method is simpler, but may not work for all codes.
- "Manual" binding may be the only useful method for codes relying on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and attempting to use GPU-to-GPU enabled MPI communication. An example of such a code is Slate.
Required Resources per Job | New "simplified" way of requesting resources | Total Allocated resources | Charge per hour | The use of full explicit `srun` options
---|---|---|---|---
1 CPU task (single CPU thread) controlling 1 GCD (Slurm GPU) | `#SBATCH --nodes=1`<br>`#SBATCH --gpus-per-node=1` | 1 allocation-pack = 1 GPU, 8 CPU cores (1 chiplet), 29.44 GB RAM | 64 SU | (see note *1 below)
14 CPU threads all controlling the same 1 GCD | `#SBATCH --nodes=1`<br>`#SBATCH --gpus-per-node=2` | 2 allocation-packs = 2 GPUs, 16 CPU cores (2 chiplets), 58.88 GB RAM | 128 SU | (see note *2 below)
3 CPU tasks (single thread each), each controlling 1 GCD with GPU-aware MPI communication | `#SBATCH --nodes=1`<br>`#SBATCH --gpus-per-node=3` | 3 allocation-packs = 3 GPUs, 24 CPU cores (3 chiplets), 88.32 GB RAM | 192 SU | (see note *3 below)
2 CPU tasks (single thread each), each task controlling 2 GCDs with GPU-aware MPI communication | `#SBATCH --nodes=1`<br>`#SBATCH --gpus-per-node=4` | 4 allocation-packs = 4 GPUs, 32 CPU cores (4 chiplets), 117.76 GB RAM | 256 SU | (see note *4 below)
8 CPU tasks (single thread each), each controlling 1 GCD with GPU-aware MPI communication | `#SBATCH --nodes=1`<br>`#SBATCH --exclusive` | 8 allocation-packs = 8 GPUs, 64 CPU cores (8 chiplets), 235 GB RAM | 512 SU | `export MPICH_GPU_SUPPORT_ENABLED=1`<br>`srun -N 1 -n 8 -c 8 --gpus-per-node=8 --gpus-per-task=1 --gpu-bind=closest`
Notes for the request of resources:
- Note that this simplified way of resource request is based on requesting a number of "allocation-packs".
- Users should not include any other Slurm allocation option that may indicate some "calculation" of required memory or CPU cores. The management of resources should only be performed after allocation, via `srun` options.
- The same simplified resource request should be used when requesting interactive sessions with `salloc`.
- IMPORTANT: In addition to the request parameters shown in the table, users should still use the other Slurm request parameters related to partition, walltime, job naming, output, email, etc., as sketched below. (Check the examples of the full Slurm batch scripts.)
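As a sketch only, a request header for 3 "allocation-packs" could combine the two allocation parameters with the usual job-control parameters. The partition name (`gpu`), account, walltime, job name and output file below are placeholders/assumptions, not prescriptions:

```bash
#!/bin/bash --login
#SBATCH --job-name=3pack-job     # job naming (placeholder)
#SBATCH --partition=gpu          # partition name is an assumption; use your site's GPU partition
#SBATCH --nodes=1                # 1 node
#SBATCH --gpus-per-node=3        # 3 "allocation-packs" in total for the job
#SBATCH --time=01:00:00          # walltime (placeholder)
#SBATCH --account=yourproject    # project/account (placeholder)
#SBATCH --output=%x-%j.out       # output file (placeholder)
```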
Notes for the use/management of resources with srun:
- IMPORTANT: The use of `--gpu-bind=closest` may NOT work for codes relying on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and attempting to use GPU-to-GPU enabled MPI communication. For those cases, the use of the "manual" optimal binding (method 2) is required.
- The `--cpus-per-task` (`-c`) option should be set to multiples of 8 (whole chiplets) to guarantee that `srun` distributes the resources in "allocation-packs" and therefore "reserves" whole chiplets per `srun` task, even if the real number of threads per task is 1. The real number of threads is controlled with the `OMP_NUM_THREADS` variable.
- (*1) This is the only case where `srun` may work fine with default inherited option values. Nevertheless, it is good practice to use full explicit `srun` options to indicate the resources needed for the executable (see the sketch after these notes). In this case, the settings explicitly "reserve" a whole chiplet (`-c 8`) for the `srun` task and control the real number of threads with the `OMP_NUM_THREADS` variable.
- (*2) The required number of CPU threads per task is 14, but two full chiplets (`-c 16`) are indicated for each `srun` task, and the real number of threads is controlled with the `OMP_NUM_THREADS` variable.
- (*3) The settings explicitly "reserve" a whole chiplet (`-c 8`) for each `srun` task. This provides "one-chiplet-long" separation among the CPU cores to be allocated for the tasks spawned by `srun` (`-n 3`). The real number of threads is controlled with the `OMP_NUM_THREADS` variable. The requirement of optimal binding of each GPU to its corresponding chiplet is indicated with the option `--gpu-bind=closest`. And, in order to allow GPU-aware MPI communication, the environment variable `MPICH_GPU_SUPPORT_ENABLED` is set to 1.
- (*4) Note the use of `-c 16` to "reserve" a "two-chiplets-long" separation among the two CPU cores that are to be used (one for each of the `srun` tasks, `-n 2`). In this way, each task is in direct communication with the two logical GPUs in the MI250X card that has the optimal connection to its chiplets.
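For illustration only, here is a minimal sketch of case (*1) above, contrasting the inherited defaults with fully explicit `srun` options; the executable name `./program` is a placeholder:

```bash
# Case (*1): 1 CPU task (single thread) controlling 1 GCD.
# srun may work with the options inherited from the allocation:
srun ./program

# Good practice: be explicit, "reserving" the whole chiplet (-c 8) for the task
# and controlling the real number of threads with OMP_NUM_THREADS:
export OMP_NUM_THREADS=1
srun -N 1 -n 1 -c 8 --gpus-per-node=1 ./program
```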
General notes:
- The allocation charge is for the total of the allocated resources, not only for the resources explicitly used during execution, so all idle resources will also be charged. For example, a job that allocates 3 allocation-packs for 5 hours is charged 3 × 64 SU/hour × 5 hours = 960 SU, even if only part of the allocation is actually used.
...
For a better understanding of what this script generates and how it is useful, we can use an interactive session. Note that the request parameters given to `salloc` only include the number of nodes and the number of Slurm GPUs (GCDs) per node, in order to request a number of "allocation-packs" (as described at the top of this page). In this case, 3 "allocation-packs" are requested:
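As a sketch (the partition name `gpu` and the walltime are assumptions), the request and the commands used to inspect the granted resources could look like:

```bash
# Request 3 "allocation-packs" interactively (partition name is an assumption):
salloc --nodes=1 --gpus-per-node=3 --partition=gpu --time=01:00:00
# Inspect what was actually granted:
scontrol show job $SLURM_JOB_ID   # total cores, memory and GPUs of the allocation
rocm-smi                          # list of the allocated GCDs (logical GPUs)
```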
The output of the `scontrol` command shows that 3 "allocation-packs" were granted, listing the total amount of allocated resources, including the 3 GCDs (logical/Slurm GPUs) and 88.32 GB of memory. The `rocm-smi` command gives a list of the three allocated devices, listed locally as `GPU:0` (`BUS_ID:C9`), `GPU:1` (`BUS_ID:D1`) and `GPU:2` (`BUS_ID:D6`).
...
As mentioned in the previous section, allocation of resources is granted in "allocation-packs" with 8 cores (1 chiplet) per GCD. Also briefly mentioned in the previous section is the need to "reserve" chunks of whole chiplets (multiples of 8 CPU cores) in the `srun` command via the `--cpus-per-task` (`-c`) option. The use of this option in `srun` is more a "reservation" parameter that binds the `srun` tasks to whole chiplets than an indication of the "real number of threads" to be used by the executable. The real number of threads to be used by the executable needs to be controlled by the OpenMP environment variable `OMP_NUM_THREADS`. In other words, we use `--cpus-per-task` to make whole chiplets available to each `srun` task, and `OMP_NUM_THREADS` to control the real number of threads per `srun` task.
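A sketch of this distinction, taking the 14-thread case from the table above; the executable name and the GPU option shown are illustrative assumptions, the point being the contrast between `-c` and `OMP_NUM_THREADS`:

```bash
# Reserve two whole chiplets (16 cores) for the single srun task...
export OMP_NUM_THREADS=14   # ...but run only 14 real OpenMP threads on them
srun -N 1 -n 1 -c 16 --gpus-per-node=1 ./program
```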
...
The explanation of the test code will be provided with the output of an interactive session that uses 3 "allocation-packs" to get access to the 3 GCDs (logical/Slurm GPUs) and 3 full CPU chiplets in different ways.
The first part creates the session and checks that the resources were granted as 3 allocation-packs:
...
Starting from the same allocation as above (3 "allocation-packs"), all the parameters needed to define the correct use of resources are now provided to `srun`. In this case, 3 MPI tasks are to be run (single threaded), each task making use of 1 GCD (logical/Slurm GPU). As described above, there are two methods to achieve optimal binding. The first method only uses Slurm parameters to indicate how resources are to be used by `srun`. In this case:
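A minimal sketch of this first method, assuming the settings described in note (*3) at the top of this page and a placeholder executable name:

```bash
export MPICH_GPU_SUPPORT_ENABLED=1   # enable GPU-aware MPI communication
export OMP_NUM_THREADS=1             # single real thread per MPI task
srun -N 1 -n 3 -c 8 --gpus-per-node=3 --gpus-per-task=1 --gpu-bind=closest ./program
```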
...
If the code is hybrid on the CPU side and needs several OpenMP CPU threads, we then use the `OMP_NUM_THREADS` environment variable to control the number of threads. So, again starting from the previous session with 3 "allocation-packs", consider a case with 3 MPI tasks, 4 OpenMP threads per task and 1 GCD (logical/Slurm GPU) per task:
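A hedged sketch for this hybrid case, again assuming the first (Slurm-parameters) binding method and a placeholder executable name; each task still "reserves" a whole chiplet (`-c 8`) while running 4 real threads:

```bash
export MPICH_GPU_SUPPORT_ENABLED=1   # GPU-aware MPI communication (if the code needs it)
export OMP_NUM_THREADS=4             # 4 real OpenMP threads per MPI task
srun -N 1 -n 3 -c 8 --gpus-per-node=3 --gpus-per-task=1 --gpu-bind=closest ./program
```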
...
As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. This example considers a job that will make use of the 8 GCDs (logical/Slurm GPUs) on 1 node (8 "allocation-packs"). The resource request uses the following two parameters:
#SBATCH --nodes=1 #1 node in this example
#SBATCH --exclusive #All resources of the node are exclusive to this job
# #8 GPUs per node (8 "allocation-packs" in total for the job)
Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header.
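The use of the allocated resources is then specified with explicit `srun` options, as in the corresponding row of the table at the top of this page. A sketch of the `srun` invocation (the executable name and the `OMP_NUM_THREADS` setting are placeholders):

```bash
export MPICH_GPU_SUPPORT_ENABLED=1   # enable GPU-aware MPI communication
export OMP_NUM_THREADS=1             # single real thread per MPI task
srun -N 1 -n 8 -c 8 --gpus-per-node=8 --gpus-per-task=1 --gpu-bind=closest ./program
```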
...
As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. The same procedure mentioned above for the single exclusive node job applies to multi-node exclusive jobs. The only difference when requesting resources is the number of exclusive nodes requested. So, for example, for a job requiring 2 exclusive nodes (16 GCDs (logical/Slurm GPUs), i.e. 16 "allocation-packs"), the resource request uses the following two parameters:
#SBATCH --nodes=2 #2 nodes in this example
#SBATCH --exclusive #All resources of the node are exclusive to this job
# #8 GPUs per node (16 "allocation-packs" in total for the job)
Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header.
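Extrapolating the single-node sketch above to two exclusive nodes (this exact `srun` line is an assumption following the same pattern, with 8 tasks per node and a placeholder executable name):

```bash
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
srun -N 2 -n 16 -c 8 --gpus-per-node=8 --gpus-per-task=1 --gpu-bind=closest ./program
```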
...
As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. In this case we ask for 1 allocation-pack with:
#SBATCH --nodes=1 #1 node in this example
#SBATCH --gpus-per-node=1 #1 GPU per node (1 "allocation-pack" in total for the job)
Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header.
The use/management of the allocated resources is controlled by the `srun` options and some environment variables. As only 1 allocation-pack is requested, there is no need to take any further action for optimal binding of CPU chiplet and GPU, as this binding is guaranteed:
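A sketch of the `srun` use for this single allocation-pack case, running the `hello_jobstep` test code discussed below (the path to the executable is a placeholder):

```bash
export OMP_NUM_THREADS=1   # single real CPU thread for the task
srun -N 1 -n 1 -c 8 --gpus-per-node=1 ./hello_jobstep
```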
...
The output of the `hello_jobstep` code tells us that CPU core "002" and the GPU with `Bus_ID:D1` were utilised by the job. Optimal binding is guaranteed for a single "allocation-pack", as the memory, CPU chiplet and GPU within each pack are directly connected to each other.
...
As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. In this case we ask for 3 allocation-packs with:
#SBATCH --nodes=1 #1 node in this example
#SBATCH --gpus-per-node=3 #3 GPUs per node (3 "allocation-packs" in total for the job)
Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header.
...
In the following example, we use 3 GCDs (logical/Slurm GPUs) (1 per MPI task) and the number of CPU threads per task is 5. As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. In this case we ask for 3 allocation-packs with:
#SBATCH --nodes=1 #1 node in this example
#SBATCH --gpus-per-node=3 #3 GPUs per node (3 "allocation-packs" in total for the job)
Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header. And the real number of threads per task is controlled with:
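For this example, the real number of threads per task would be set with something like:

```bash
export OMP_NUM_THREADS=5   # 5 real OpenMP CPU threads per srun task
```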
...