...


Note
Important: GCD vs GPU

An MI250X GPU card has two GCDs (Graphics Compute Dies). Previous generations of GPUs had only one GCD per GPU card, so the two terms could be used interchangeably, and this usage persists even now that GPUs have more than one GCD. Slurm, for instance, only uses the GPU terminology when referring to accelerator resources, so a request such as --gpus-per-node is equivalent to a request for a certain number of GCDs per node. On Setonix, the maximum is 8 GCDs (Slurm GPUs) per node.
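
As an illustration of this equivalence, a request for all the GCDs in a node can be expressed with Slurm's GPU options. This is a sketch only; the exact options recommended on Setonix follow the allocation-pack method explained in the sections below:

#Sketch only: requests 8 GCDs (Slurm GPUs), i.e. the 4 MI250X cards of a node
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8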

In order to achieve the best performance, the current allocation method uses a basic allocation unit called the "allocation pack". Users should then only request a number of "allocation packs". Each allocation pack consists of:

...

The explanation of this method is completed in the following sections, where a useful test code (hello_jobstep) is used to confirm optimal (or sub-optimal, or incorrect) binding of GCDs (Slurm GPUs) and chiplets for srun job steps. Other examples of its use are listed in the table in the subsection above, and its use in full scripts is provided at the end of this page.

...

Info
Thanks to the CSC centre and LUMI staff

We acknowledge that the use of this method to control CPU and GPU (GCD) placement was initially taken from the LUMI supercomputer documentation at CSC. From there, we have further automated parts of it for use on shared GPU nodes. We are very thankful to the LUMI staff for their collaborative support in the use and configuration of Setonix.

For codes that rely on OpenMP or OpenACC pragmas to move data between the host and the GPU (GCD), and that attempt to use GPU-to-GPU (GCD-to-GCD) enabled MPI communication, the first method may fail, giving errors similar to:

...

  1. A wrapper script that sets a single, different value of the ROCR_VISIBLE_DEVICES variable for each srun task, thereby assigning a single, different GCD (logical/Slurm GPU) to each task.
  2. An ordered list of CPU cores in the --cpu-bind option of srun, to explicitly indicate the CPU cores where each task will be placed.

These two auxiliary techniques work in coordination to ensure the best possible match between CPU cores and GCDs (Slurm GPUs), as illustrated in the sketch below.
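
Schematically, the two techniques appear together in the srun command. The following is a sketch only: the wrapper name (select_gpu.sh) and program name (my_gpu_program) are placeholders, and the core ordering in map_cpu is hypothetical; the real ordering must match the GCD-to-chiplet topology of the node, as shown in the full scripts at the end of this page.

#Illustrative sketch only: placeholder names and a hypothetical core ordering
CPU_BIND="map_cpu:48,56,16,24,0,8,32,40"   #one core per task, chosen from the chiplet closest to that task's GCD
srun -N 1 -n 8 --cpu-bind=${CPU_BIND} ./select_gpu.sh ./my_gpu_program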

Auxiliary technique 1: Using a wrapper to select one different GCD (logical or Slurm GPU) for each of the tasks spawned by srun

This first auxiliary technique uses the following wrapper script:

...

The wrapper should be called first, followed by the executable (and its parameters, if any). For example, in a job that requires 8 CPU tasks (single threaded) with 1 GCD (logical/Slurm GPU) per task, the srun command to be used is:

...
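
In general terms, the call follows this pattern (the wrapper and program names used here, select_gpu.sh and my_gpu_program, are placeholders):

srun <srun options> ./select_gpu.sh ./my_gpu_program <program arguments>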

The wrapper is run by each of the 8 tasks spawned by srun (-n 8) and assigns a different, single value of ROCR_VISIBLE_DEVICES to each task. In this way, the task with SLURM_LOCALID=0 will receive GCD 0 (Bus C1) as the only visible Slurm GPU for that task, the task with SLURM_LOCALID=1 will receive GCD 1 (Bus C6), and so forth.
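
Consistent with the behaviour just described, a minimal wrapper of this kind can be sketched as follows. This is a sketch only, assuming that SLURM_LOCALID maps directly to the desired GCD index; the actual wrapper used on Setonix may include additional logic:

#!/bin/bash
#Minimal sketch of a GCD-selection wrapper: expose to this task only the GCD
#whose index equals the task's local ID on the node.
export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID
#Replace the wrapper process with the real executable and its arguments,
#so the program runs with exactly one GCD visible.
exec "$@"

Because exec replaces the wrapper with the real executable, each task ends up running the program with a single, distinct GCD visible to it.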

...