
...


Figure 1. GPU node architecture. Note that the GPUs shown here are equivalent to GCDs (see here).


Each GPU node has 4 MI250X GPU cards, each of which has 2 Graphics Compute Dies (GCDs) that are seen as 2 logical GPUs; so each GPU node has 8 GCDs, equivalent to 8 Slurm GPUs. On the other hand, the single AMD CPU chip has 64 cores organised in 8 groups that share the same L3 cache. Each of these L3 cache groups (or chiplets) has a direct Infinity Fabric connection with just one of the GCDs, providing optimal bandwidth. Each chiplet can communicate with other GCDs, albeit at a lower bandwidth due to the additional communication hops. (In the examples below, we use the numbering of the cores and the bus IDs of the GCDs to identify the allocated chiplets and GCDs, and their binding.)

Important: GCD vs GPU

An MI250X GPU card has two GCDs. Previous generations of GPUs had only 1 GCD per GPU card, so the two terms could be used interchangeably. The interchangeable usage continues even though GPUs now have more than one GCD. Slurm, for instance, only uses the GPU terminology when referring to accelerator resources, so a request such as --gpus-per-node is equivalent to a request for a certain number of GCDs per node. On Setonix, the maximum number is 8.



To achieve the best performance, the current allocation method uses a basic allocation unit called an "allocation pack". Users should then only request a number of "allocation packs". Each allocation pack consists of:

  • 1 whole CPU chiplet (8 CPU cores)
  • ~32 GB memory
  • 1 GCD (Slurm GPU) directly connected to that chiplet

...


New way to request (#SBATCH) and use (srun) resources on GPU nodes

The request of resources for the GPU nodes has changed dramatically. The main reason for this change is Pawsey's effort to provide a method for optimal binding of the GPUs to the CPU cores in direct physical connection for each task. For this, we decided to completely separate the options used for the request of resources (via salloc or #SBATCH directives) from the options for the use of resources during execution of the code (via srun).

With a new CLI filter that Pawsey staff have put in place for the GPU nodes, the request of resources in GPU nodes should be thought of as requesting a number of "allocation packs". Each "allocation pack" consists of:

  • 1 whole CPU chiplet (8 CPU cores)
  • a bit less than 32 GB of memory (29.44 GB, to be exact, leaving some memory for the system to operate the node)
  • 1 GCD (Slurm GPU) directly connected to that chiplet

For that, the request of resources only needs the number of nodes (--nodes, -N) and the number of GPUs per node (--gpus-per-node). The total number of requested GCDs (equivalent to Slurm GPUs), resulting from the multiplication of these two parameters, will be interpreted as the total number of requested "allocation packs".
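As a minimal sketch of the multiplication described above, the following bash function reports what a given request implies. The function name pack_resources is illustrative (it is not a Pawsey or Slurm tool), and the per-pack figures (8 cores, 29.44 GB) are the ones quoted in the list above.

```shell
#!/usr/bin/env bash
# Sketch: resources implied by a request of N nodes x G GCDs per node.
# Per-pack figures (8 cores, 29.44 GB) come from the text above;
# pack_resources is a hypothetical helper name.
pack_resources() {
    local nodes=$1 gpus_per_node=$2
    local packs=$(( nodes * gpus_per_node ))
    local cores=$(( packs * 8 ))
    local mem_centi=$(( packs * 2944 ))   # memory in hundredths of a GB
    printf '%d packs: %d GCDs, %d CPU cores, %d.%02d GB\n' \
        "$packs" "$packs" "$cores" $(( mem_centi / 100 )) $(( mem_centi % 100 ))
}

# Corresponds to: #SBATCH --nodes=1 and #SBATCH --gpus-per-node=3
pack_resources 1 3
```

Bash only has integer arithmetic, so the memory is tracked in hundredths of a GB and formatted at the end.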

In the request of resources, users should not indicate any other Slurm allocation option related to memory or CPU cores, so don't use --ntasks, --cpus-per-task, --mem, etc. in the request headers of the script (#SBATCH directives) or in the request options of salloc. If, for some reason, the job requirements are dictated by the number of CPU cores or the amount of memory, then users should estimate the number of "allocation packs" that meets their needs. The "allocation pack" is the minimal unit of resources that can be managed, so all allocation requests must be multiples of this basic unit.
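When a job is driven by CPU cores or memory rather than GCDs, the estimate can be sketched as below: take the number of packs needed to cover the cores, the number needed to cover the memory, and request the larger of the two. The helper name packs_needed is hypothetical, and memory is given in whole GB to stay within bash integer arithmetic.

```shell
#!/usr/bin/env bash
# Sketch: estimate the allocation packs needed for a CPU/memory-driven job.
# Assumes 8 cores and 29.44 GB per pack, as stated in the text above.
# packs_needed is a hypothetical helper name.
packs_needed() {
    local cores=$1 mem_gb=$2
    # Ceiling division: packs needed to cover the requested cores
    local by_cores=$(( (cores + 7) / 8 ))
    # Ceiling division in hundredths of a GB (29.44 GB = 2944)
    local by_mem=$(( (mem_gb * 100 + 2943) / 2944 ))
    local packs=$(( by_cores > by_mem ? by_cores : by_mem ))
    # At least one pack is always requested
    echo $(( packs > 0 ? packs : 1 ))
}

# Hypothetical job needing 20 cores and 100 GB of memory
packs_needed 20 100
```

The result would then be spread over nodes with --nodes and --gpus-per-node, keeping in mind that a node holds at most 8 packs.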

The use/management of resources with srun is another story. After the requested resources are allocated, the srun command should be explicitly provided with enough parameters indicating how resources are to be used by the srun step and the spawned tasks. So the real management of resources is performed by the command line options of srun. No default parameters should be considered for srun.

The following table provides some examples that will serve as a guide for requesting resources in the GPU nodes:

Warning: --gpu-bind=closest may NOT work for all applications

There are now two methods to achieve optimal binding of GPUs:

  1. The use of srun parameters for optimal binding: --gpus-per-task=<number> together with --gpu-bind=closest
  2. "Manual" optimal binding with the use of "two auxiliary techniques".

The first method is simpler, but may not work for all codes. "Manual" binding may be the only useful method for codes that rely on OpenMP or OpenACC pragmas for moving data between host and GPU and that attempt to use GPU-to-GPU enabled MPI communication. An example of such a code is Slate.


Each example lists the required resources per job, the new "simplified" way of requesting resources, the total allocated resources, the charge per hour, and the use of resources with srun. The use of full explicit srun options is now required (only the 1st method for optimal binding is listed here).

Required resources per job: 1 CPU task (single CPU thread) controlling 1 GCD (Slurm GPU)

  • New "simplified" way of requesting resources:
      #SBATCH --nodes=1
      #SBATCH --gpus-per-node=1
  • Total allocated resources: 1 allocation pack = 1 GPU, 8 CPU cores (1 chiplet), 29.44 GB RAM
  • Charge per hour: 64 SU
  • Use of resources with srun (*1):
      export OMP_NUM_THREADS=1
      srun -N 1 -n 1 -c 8 --gpus-per-node=1 --gpus-per-task=1

Required resources per job: 14 CPU threads all controlling the same 1 GCD

  • New "simplified" way of requesting resources:
      #SBATCH --nodes=1
      #SBATCH --gpus-per-node=2
  • Total allocated resources: 2 allocation packs = 2 GPUs, 16 CPU cores (2 chiplets), 58.88 GB RAM
  • Charge per hour: 128 SU
  • Use of resources with srun (*2):
      export OMP_NUM_THREADS=14
      srun -N 1 -n 1 -c 16 --gpus-per-node=1 --gpus-per-task=1

Required resources per job: 3 CPU tasks (single thread each), each controlling 1 GCD with GPU-aware MPI communication

  • New "simplified" way of requesting resources:
      #SBATCH --nodes=1
      #SBATCH --gpus-per-node=3
  • Total allocated resources: 3 allocation packs = 3 GPUs, 24 CPU cores (3 chiplets), 88.32 GB RAM
  • Charge per hour: 192 SU
  • Use of resources with srun (*3):
      export MPICH_GPU_SUPPORT_ENABLED=1
      export OMP_NUM_THREADS=1
      srun -N 1 -n 3 -c 8 --gpus-per-node=3 --gpus-per-task=1 --gpu-bind=closest

Required resources per job: 2 CPU tasks (single thread each), each task controlling 2 GCDs with GPU-aware MPI communication

  • New "simplified" way of requesting resources:
      #SBATCH --nodes=1
      #SBATCH --gpus-per-node=4
  • Total allocated resources: 4 allocation packs = 4 GPUs, 32 CPU cores (4 chiplets), 117.76 GB RAM
  • Charge per hour: 256 SU
  • Use of resources with srun (*4):
      export MPICH_GPU_SUPPORT_ENABLED=1
      export OMP_NUM_THREADS=1
      srun -N 1 -n 2 -c 16 --gpus-per-node=4 --gpus-per-task=2 --gpu-bind=closest

Required resources per job: 8 CPU tasks (single thread each), each controlling 1 GCD with GPU-aware MPI communication

  • New "simplified" way of requesting resources:
      #SBATCH --nodes=1
      #SBATCH --exclusive
  • Total allocated resources: 8 allocation packs = 8 GPUs, 64 CPU cores (8 chiplets), 235 GB RAM
  • Charge per hour: 512 SU
  • Use of resources with srun:
      export MPICH_GPU_SUPPORT_ENABLED=1
      export OMP_NUM_THREADS=1
      srun -N 1 -n 8 -c 8 --gpus-per-node=8 --gpus-per-task=1 --gpu-bind=closest

Notes for the request of resources:

  • Note that this simplified way of resource request is based on requesting a number of "allocation packs".
  • Users should not include any other Slurm allocation option that may indicate some "calculation" of required memory or CPU cores. The management of resources should only be performed after allocation via srun options.
  • The same simplified resource request should be used for the request of interactive sessions with salloc.
  • IMPORTANT: In addition to the request parameters shown in the table, users should still use the other Slurm request parameters related to partition, walltime, job naming, output, email, etc. (Check the examples of the full Slurm batch scripts.)

Notes for the use/management of resources with srun:

  • IMPORTANT: The use of --gpu-bind=closest may NOT work for codes that rely on OpenMP or OpenACC pragmas for moving data between host and GPU and that attempt to use GPU-to-GPU enabled MPI communication. For those cases, the use of the "manual" optimal binding (method 2) is required.
  • The --cpus-per-task (-c) option should be set to multiples of 8 (whole chiplets) to guarantee that srun will distribute the resources in "allocation packs", "reserving" whole chiplets per srun task, even if the real number is 1 thread per task. The real number of threads is controlled with the OMP_NUM_THREADS variable.
  • (*1) This is the only case where srun may work fine with default inherited option values. Nevertheless, it is good practice to use full explicit options of srun to indicate the resources needed for the executable. In this case, the settings explicitly "reserve" a whole chiplet (-c 8) for the srun task and control the real number of threads with the OMP_NUM_THREADS variable.
  • (*2) The required number of CPU threads per task is 14, but two full chiplets (-c 16) are indicated for each srun task, and the number of threads is controlled with the OMP_NUM_THREADS variable.
  • (*3) The settings explicitly "reserve" a whole chiplet (-c 8) for each srun task. This provides a "one-chiplet-long" separation between the CPU cores to be allocated for the tasks spawned by srun (-n 3). The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of each GPU to its corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.
  • (*4) Note the use of -c 16 to "reserve" a "two-chiplets-long" separation between the two CPU cores that are to be used (one for each of the srun tasks, -n 2). In this way, each task will be in direct communication with the two logical GPUs (GCDs) in the MI250X card that has the optimal connection to its chiplets.
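The "multiples of 8" rule for -c can be sketched as a small rounding helper: given the real number of threads per task, round up to whole chiplets. The function name cpus_per_task is illustrative only; the srun options themselves are the ones shown in the examples above.

```shell
#!/usr/bin/env bash
# Sketch: round the real thread count up to whole chiplets (8 cores each)
# to obtain the value for srun's -c option.
# cpus_per_task is a hypothetical helper name.
cpus_per_task() {
    local threads=$1
    echo $(( (threads + 7) / 8 * 8 ))
}

# Case (*2): 14 real threads, so two whole chiplets are reserved
export OMP_NUM_THREADS=14
echo "-c $(cpus_per_task "$OMP_NUM_THREADS")"
```

The real thread count still goes in OMP_NUM_THREADS; the rounded value only controls how many cores srun reserves per task.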

General notes:

  • The allocation charge is for the total of allocated resources, not only for those explicitly used in the execution, so all idle resources will also be charged.
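A rough way to anticipate the charge is sketched below. The rate of 64 SU per allocation pack per hour is inferred from the examples on this page (64 SU for 1 pack, 512 SU for 8 packs) rather than quoted from an accounting policy, and su_charge is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: estimate the charge for an allocation.
# Assumption: 64 SU per allocation pack per hour, inferred from the
# examples on this page. Whole hours only (bash integer arithmetic).
# su_charge is a hypothetical helper name.
su_charge() {
    local packs=$1 hours=$2
    echo $(( packs * 64 * hours ))
}

# 2 packs allocated for 1 hour are charged in full,
# even if the job only uses 1 GCD
su_charge 2 1
```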

Methods to achieve optimal binding of GCDs/GPUs

As mentioned above, and as the node diagram at the top of the page suggests, the optimal placement of GCDs and CPU cores for each task is to have direct communication between the CPU chiplet and the GCD in use. So, according to the node diagram, tasks being executed in cores in Chiplet 0 should be using GPU 4 (Bus D1), tasks in Chiplet 1 should be using GPU 5 (Bus D6), etc.
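When inspecting where a task landed, it can help to translate a core number into its chiplet. The sketch below assumes that cores are numbered contiguously from 0 to 63, with 8 consecutive cores per chiplet; that numbering is an assumption to check against the node diagram, and chiplet_of_core is a hypothetical helper name. The chiplet-to-GCD pairing itself should be read from the diagram.

```shell
#!/usr/bin/env bash
# Sketch: map a CPU core number to its chiplet index.
# Assumption: cores 0-63 are numbered contiguously, 8 per chiplet.
# chiplet_of_core is a hypothetical helper name.
chiplet_of_core() {
    local core=$1
    echo $(( core / 8 ))
}

for core in 0 8 17; do
    echo "core $core -> chiplet $(chiplet_of_core "$core")"
done
```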

...