
...

Each GPU node has 4 MI250X GPU cards, each of which in turn has 2 Graphics Compute Dies (GCDs) that are seen as 2 logical GPUs; so each GPU node has 8 GCDs, which are equivalent to 8 Slurm GPUs. On the CPU side, the single AMD CPU chip has 64 cores organised into 8 groups that share the same L3 cache. Each of these L3 cache groups (or chiplets) has a direct Infinity Fabric connection to just one of the GCDs, providing optimal bandwidth. Each chiplet can still communicate with the other GCDs, albeit at a lower bandwidth due to the additional communication hops. (In the examples below, we use the numbering of the cores and the bus IDs of the GCDs to identify the allocated chiplets and GCDs, and their binding.)
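
As a purely illustrative check (not part of the official examples; it assumes rocm-smi and lscpu are available on the compute node), the bus IDs of the GCDs and the L3 cache groups of the cores can be inspected as follows:

Code Block
languagebash
titleExample. Inspecting GCD bus IDs and chiplet topology (illustrative sketch)
# List the accelerators visible on the node together with their PCIe bus IDs;
# on a GPU node this shows the 8 GCDs (logical GPUs).
rocm-smi --showbus

# Show each core with its socket, NUMA node and cache IDs; cores that share
# the same L3 cache belong to the same chiplet.
lscpu -e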

Note
titleImportant: GCD vs GPU
Anchor
gcdgpu

A MI250X GPU card has two GCDs. Previous generations of GPUs had only one GCD per GPU card, so the two terms could be used interchangeably, and this usage persists even though GPUs now have more than one GCD. Slurm, for instance, only uses the GPU terminology when referring to accelerator resources, so a request such as --gpus-per-node is effectively a request for a certain number of GCDs per node. On Setonix, the maximum number is 8.
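
As a minimal illustration of this terminology (a sketch only; the partition name and other settings are assumptions, and the recommended way of requesting resources on Setonix is via allocation packs, as described below):

Code Block
languagebash
titleExample. Requesting GCDs with Slurm's GPU terminology (illustrative sketch)
#!/bin/bash
#SBATCH --partition=gpu        # assumed partition name for the GPU nodes
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8      # "8 GPUs" in Slurm terms = 8 GCDs = 4 MI250X cards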



In order to achieve best performance, the current allocation method uses a basic allocation unit called an "allocation pack". Users should therefore only request a number of "allocation packs". Each allocation pack consists of:

...



Code Block
languagebash
themeEmacs
titleListing N. selectGPU_X.sh wrapper script for "manually" selecting 1 GPU per task
linenumberstrue
#!/bin/bash

# Expose only the GCD whose index matches this task's node-local Slurm ID.
export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID
# Replace the wrapper with the command and arguments passed to it.
exec "$@"

...


The wrapper script sets the ROCm environment variable ROCR_VISIBLE_DEVICES to the value of the Slurm environment variable SLURM_LOCALID. It then executes the remaining arguments given to the script, which are the usual launch command for the program to be run. The SLURM_LOCALID variable holds the identification number of the task within its node (a node-local index, not a global one). Further details about this variable are available in the Slurm documentation.
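
As a usage sketch (the executable name ./my_gpu_program and the exact srun options are placeholders, not the official recommended settings), the wrapper is simply placed in front of the program so that each task only sees the GCD matching its local ID:

Code Block
languagebash
titleExample. Launching tasks through the selectGPU_X.sh wrapper (illustrative sketch)
# 8 tasks on one node, one task per GCD; ./my_gpu_program is a placeholder name.
srun -N 1 -n 8 -c 8 ./selectGPU_X.sh ./my_gpu_program

# Quick check of the binding each task receives:
srun -N 1 -n 8 -c 8 ./selectGPU_X.sh bash -c 'echo "Task $SLURM_LOCALID sees ROCR_VISIBLE_DEVICES=$ROCR_VISIBLE_DEVICES"'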

...