...
**Note:** A MI250X GPU card has two GCDs. Previous generations of GPUs only had one GCD per GPU card, so these terms could be used interchangeably, and the interchangeable usage continues even though GPUs now have more than one GCD. Slurm, for instance, only uses the GPU terminology when referring to accelerator resources, so requests such as
...
- 1 whole CPU chiplet (8 CPU cores)
- ~32 GB memory (1/8 of the total available RAM)
- 1 GCD (Slurm GPU) directly connected to that chiplet
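As an illustration of the allocation-pack model, asking for two GCDs with the simplified request described below implicitly grants two full packs (two chiplets and the corresponding share of memory); the snippet is a sketch only:

```bash
#SBATCH --nodes=1
#SBATCH --gres=gpu:2   # 2 allocation-packs: 2 GCDs, 16 CPU cores (2 chiplets), ~59 GB RAM
```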
**Note:** For jobs that use only a partial set of the resources of a node (non-exclusive jobs that share the rest of the node with other jobs), the current Setonix GPU configuration may not provide perfect allocation and binding, which may impact performance depending on the amount of CPU-GPU communication. This is under active investigation; the recommendations in this document will achieve optimal allocations in most cases, but this is not 100% guaranteed. Therefore, if you detect that imperfect binding or the use of shared nodes (even with optimal binding) is impacting the performance of your jobs, it is recommended to use exclusive nodes where possible (a minimal exclusive request is sketched below), noting that the project will still be charged for the whole node even if part of the resources remain idle. Please also report the observed issues to Pawsey's helpdesk.
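As a point of reference, requesting a node exclusively only requires the `--exclusive` flag in the simplified request described below; no per-GPU count is needed in this case:

```bash
#SBATCH --nodes=1
#SBATCH --exclusive   # whole node: 8 GCDs, 64 CPU cores, ~235 GB RAM allocated (and charged)
```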
...
New way of requesting resources (and the use of full explicit `srun` options)
**Warning:** The use of full explicit `srun` parameters is now required for optimal binding. Together with the full explicit `srun` options, there are two methods to achieve optimal binding of GPUs: the use of `--gpu-bind=closest`, or the "manual" binding of tasks to GPUs. The first method is simpler, but may not work for all codes. "Manual" binding may be the only useful method for codes relying on OpenMP or OpenACC pragmas for moving data between host and GPU while attempting to use GPU-to-GPU enabled MPI communication. An example of such a code is Slate. (A schematic illustration of the idea behind "manual" binding is given below.)
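For orientation only, the sketch below illustrates the general idea behind "manual" binding (method 2): a small wrapper script sets `ROCR_VISIBLE_DEVICES` for each task before launching the executable. The wrapper name (`selectGPU.sh`) and the simple identity mapping are assumptions for illustration; the actual recommended mapping between tasks and GCDs is given in the detailed manual-binding examples.

```bash
#!/bin/bash
# selectGPU.sh (hypothetical wrapper): expose a single GCD to each srun task.
# SLURM_LOCALID is the task's rank within the node; the identity mapping below
# assumes task placement on chiplets already matches the desired GCD order.
export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID
exec "$@"
```

Once made executable (`chmod +x selectGPU.sh`), it would be invoked as, for example, `srun -N 1 -n 8 -c 8 ./selectGPU.sh ./program` instead of relying on `--gpu-bind=closest`.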
| Required Resources per Job | New "simplified" way of requesting resources | Total Allocated resources | Charge per hour | Full explicit `srun` options to be used |
|---|---|---|---|---|
| 1 CPU task (single CPU thread) controlling 1 GCD (Slurm GPU) | `#SBATCH --nodes=1`<br>`#SBATCH --gres=gpu:1` | 1 allocation-pack = 1 GPU, 8 CPU cores (1 chiplet), 29.44 GB RAM | 64 SU | `export OMP_NUM_THREADS=1`<br>`srun -N 1 -n 1 -c 8 --gres=gpu:1 <executable>` (*1) |
| 1 CPU task (with 14 CPU threads), all threads controlling the same 1 GCD | `#SBATCH --nodes=1`<br>`#SBATCH --gres=gpu:2` | 2 allocation-packs = 2 GPUs, 16 CPU cores (2 chiplets), 58.88 GB RAM | 128 SU | `export OMP_NUM_THREADS=14`<br>`srun -N 1 -n 1 -c 16 --gres=gpu:1 <executable>` (*2) |
| 3 CPU tasks (single thread each), each controlling 1 GCD with GPU-aware MPI communication | `#SBATCH --nodes=1`<br>`#SBATCH --gres=gpu:3` | 3 allocation-packs = 3 GPUs, 24 CPU cores (3 chiplets), 88.32 GB RAM | 192 SU | `export MPICH_GPU_SUPPORT_ENABLED=1`<br>`export OMP_NUM_THREADS=1`<br>`srun -N 1 -n 3 -c 8 --gres=gpu:3 --gpu-bind=closest <executable>` (*3) |
| 2 CPU tasks (single thread each), each task controlling 2 GCDs with GPU-aware MPI communication | `#SBATCH --nodes=1`<br>`#SBATCH --gres=gpu:4` | 4 allocation-packs = 4 GPUs, 32 CPU cores (4 chiplets), 117.76 GB RAM | 256 SU | `export MPICH_GPU_SUPPORT_ENABLED=1`<br>`export OMP_NUM_THREADS=1`<br>`srun -N 1 -n 2 -c 16 --gres=gpu:4 --gpu-bind=closest <executable>` (*4) |
| 5 CPU tasks (with 2 CPU threads each), all threads/tasks able to see all 5 GPUs | `#SBATCH --nodes=1`<br>`#SBATCH --gres=gpu:5` | 5 allocation-packs = 5 GPUs, 40 CPU cores (5 chiplets), 147.2 GB RAM | 320 SU | `export MPICH_GPU_SUPPORT_ENABLED=1`<br>`export OMP_NUM_THREADS=2`<br>`srun -N 1 -n 5 -c 8 --gres=gpu:5 <executable>` (*5) |
| 8 CPU tasks (single thread each), each controlling 1 GCD with GPU-aware MPI communication | `#SBATCH --nodes=1`<br>`#SBATCH --exclusive` | 8 allocation-packs = 8 GPUs, 64 CPU cores (8 chiplets), 235 GB RAM | 512 SU | `export MPICH_GPU_SUPPORT_ENABLED=1`<br>`export OMP_NUM_THREADS=1`<br>`srun -N 1 -n 8 -c 8 --gres=gpu:8 --gpu-bind=closest <executable>` (*6) |
| 8 CPU tasks (single thread each), each controlling 4 GCDs with GPU-aware MPI communication (multi-node job) | `#SBATCH --nodes=4`<br>`#SBATCH --exclusive` | 32 allocation-packs = 4 nodes, each with: 8 GPUs, 64 CPU cores (8 chiplets), 235 GB RAM | 2048 SU | |
Notes for the request of resources:
- Note that this simplified way of resource request is based on requesting a number of "allocation-packs".
- Users should not include any other Slurm allocation option that may indicate some "calculation" of required memory or CPU cores. The management of resources should only be performed after allocation, via `srun` options.
- The same simplified resource request should be used for requesting interactive sessions with `salloc`.
- IMPORTANT: In addition to the request parameters shown in the table, users still need to use other Slurm request parameters related to partition, walltime, job naming, output, email, etc. (Check the examples of the full Slurm batch scripts; a minimal sketch is also given below.)
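For completeness, the following sketch shows what a full batch script for the first case in the table (one single-threaded CPU task controlling one GCD) could look like; the account, partition, walltime and executable names are placeholders to be adapted, not prescriptions:

```bash
#!/bin/bash --login
#SBATCH --account=yourproject-gpu   # placeholder: your project's GPU account
#SBATCH --partition=gpu             # placeholder: the GPU partition in use
#SBATCH --nodes=1
#SBATCH --gres=gpu:1                # 1 allocation-pack: 1 GCD, 1 chiplet (8 cores), ~29 GB RAM
#SBATCH --time=00:10:00
#SBATCH --job-name=one-gcd-example
#SBATCH --output=%x-%j.out

# Manage the allocated resources explicitly with srun:
export OMP_NUM_THREADS=1            # real number of threads for the task
srun -N 1 -n 1 -c 8 --gres=gpu:1 ./program   # -c 8 "reserves" the whole chiplet
```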
Notes for the use/management of resources with srun:
- IMPORTANT: The use of `--gpu-bind=closest` may NOT work for codes relying on OpenMP or OpenACC pragmas for moving data between host and GPU and attempting to use GPU-to-GPU enabled MPI communication. For those cases, the use of the "manual" optimal binding (method 2) is required.
- The `--cpus-per-task` (`-c`) option should be set to multiples of 8 (whole chiplets) to guarantee that `srun` will distribute the resources in "allocation-packs", "reserving" whole chiplets per `srun` task even if the real number of threads per task is 1. The real number of threads is controlled with the `OMP_NUM_THREADS` environment variable.
- (*1) This is the only case where `srun` may work fine with default inherited option values. Nevertheless, it is good practice to always use full explicit `srun` options to indicate the resources needed for the executable. In this case, the settings explicitly "reserve" a whole chiplet (`-c 8`) for the `srun` task and control the real number of threads with the `OMP_NUM_THREADS` environment variable.
- (*2) The required number of CPU threads per task is 14, but two full chiplets (`-c 16`) are indicated for the `srun` task and the real number of threads is controlled with the `OMP_NUM_THREADS` variable.
- (*3) The settings explicitly "reserve" a whole chiplet (`-c 8`) for each `srun` task. This provides "one-chiplet-long" separation among each of the CPU cores to be allocated for the tasks spawned by `srun` (`-n 3`). The real number of threads is controlled with the `OMP_NUM_THREADS` variable. The requirement of optimal binding of each GPU to its corresponding chiplet is indicated with the option `--gpu-bind=closest`. And, in order to allow GPU-aware MPI communication, the environment variable `MPICH_GPU_SUPPORT_ENABLED` is set to 1. (A full sketch of this case is given after this list.)
- (*4) Each task needs to be in direct communication with 2 GCDs. For that, each CPU task "reserves" two full chiplets: the use of `-c 16` provides a "two-chiplets-long" separation among the two CPU cores that are to be used (one for each of the `srun` tasks, `-n 2`). In this way, each task will be in direct communication with the two logical GPUs in the MI250X card that has optimal connection to the chiplets reserved for that task.
- (*5) Sometimes the executable performs all the management of the GPUs requested. If all the management logic is performed by the executable, then all the available resources should be exposed to it. In this case, no options for optimal binding are given and only the number of GPUs per node to be exposed to the job (`--gres=gpu:number`) is indicated.
- (*6) All GPUs in the node are requested, which means all the resources available in the node, via the `--exclusive` allocation option (there is no need to indicate the number of GPUs per node when using exclusive allocation). The use of `-c 8` provides "one-chiplet-long" separation among each of the CPU cores to be allocated for the tasks spawned by `srun` (`-n 8`). The real number of threads is controlled with the `OMP_NUM_THREADS` variable. The requirement of optimal binding of each GPU to its corresponding chiplet is indicated with the option `--gpu-bind=closest`. And, in order to allow GPU-aware MPI communication, the environment variable `MPICH_GPU_SUPPORT_ENABLED` is set to 1.
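To make note (*3) concrete, the following is a minimal sketch of the `srun` invocation for three single-threaded MPI tasks, each driving its own GCD with GPU-aware MPI communication; the executable name is a placeholder and the request lines are assumed to follow the third row of the table above.

```bash
# Assumed request in the batch script header (3 allocation-packs):
#   #SBATCH --nodes=1
#   #SBATCH --gres=gpu:3

export MPICH_GPU_SUPPORT_ENABLED=1   # allow GPU-aware MPI communication
export OMP_NUM_THREADS=1             # real number of threads per task

# -n 3 tasks, -c 8 "reserves" one whole chiplet per task;
# --gpu-bind=closest binds each task to the GCD attached to its chiplet.
srun -N 1 -n 3 -c 8 --gres=gpu:3 --gpu-bind=closest ./program
```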
General notes:
- The allocation charge is for the total of the allocated resources and not only for the resources explicitly used in the execution, so idle resources within the allocation are also charged. For example, a job allocated 4 allocation-packs (256 SU per hour) that runs for 5 hours is charged 1280 SU, even if only one of the GCDs is actually used.
...