...
Each GPU node has 4 MI250X GPU cards, each of which contains 2 Graphics Compute Dies (GCDs) that are seen as 2 logical GPUs; so each GPU node has 8 GCDs, which are equivalent to 8 Slurm GPUs. On the other hand, the single AMD CPU chip has 64 cores organised in 8 groups, each sharing the same L3 cache. Each of these L3 cache groups (or chiplets) has a direct Infinity Fabric connection to just one of the GCDs, providing optimal bandwidth. Each chiplet can still communicate with the other GCDs, albeit at a lower bandwidth due to the additional communication hops. (In the examples below, we use the numbering of the cores and the bus IDs of the GCDs to identify the allocated chiplets and GCDs, and their binding.)
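To inspect this numbering and binding inside a job, something like the following sketch can be used; the script name is illustrative, and the availability and output format of `taskset` and `rocm-smi` depend on the system and ROCm version installed:

```bash
#!/bin/bash
# check_binding.sh (illustrative name): report the CPU-core and GCD binding
# of each task, launched for example as:
#   srun -N 1 -n 8 -c 8 --gres=gpu:8 ./check_binding.sh
echo "Node $(hostname), local task ${SLURM_LOCALID}:"
echo "  CPU affinity (core numbers identify the chiplet): $(taskset -cp $$)"
echo "  Visible GCDs: ${ROCR_VISIBLE_DEVICES:-unset}"
rocm-smi --showbus   # PCI bus IDs identify the physical GCDs
```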
**Note:** A MI250X GPU card has two GCDs. Previous generations of GPUs only had 1 GCD per GPU card, so the two terms could be used interchangeably. The interchangeable usage continues even though GPUs now have more than one GCD. Slurm, for instance, only uses the GPU terminology when referring to accelerator resources, so requests that ask Slurm for a number of "GPUs" are in practice requesting that number of GCDs.
In order to achieve the best performance, the current allocation method uses a basic allocation unit called an "allocation pack". Users should therefore only request a number of "allocation packs". Each allocation pack consists of:
...
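As a minimal sketch of how such a request might look (assuming the convention above that each Slurm "GPU" is one GCD matched with one 8-core chiplet; the partition and account names are placeholders, not taken from this page), a batch script asking for two allocation packs on one node could be written as:

```bash
#!/bin/bash
# Hypothetical request for 2 allocation packs (= 2 GCDs) on a single GPU node.
#SBATCH --partition=gpu        # placeholder partition name
#SBATCH --account=myproject    # placeholder project/account name
#SBATCH --nodes=1
#SBATCH --gres=gpu:2           # 2 Slurm "GPUs" = 2 GCDs = 2 allocation packs
#SBATCH --time=00:10:00

# One task per GCD, 8 cores per task (one chiplet per GCD, as described above).
srun -N 1 -n 2 -c 8 --gres=gpu:2 ./my_gpu_program
```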
...
The wrapper script sets the ROCm environment variable `ROCR_VISIBLE_DEVICES` to the value of the Slurm environment variable `SLURM_LOCALID`. It then executes the remaining arguments given to the script, which form the usual execution command for the program to be run. The `SLURM_LOCALID` variable holds the identification number of the task within each node (not a global identification number, but one local to the node). Further details about this variable are available in the Slurm documentation.
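A minimal sketch of such a wrapper, consistent with the description above (the script name `wrapper.sh` and the invocation shown in the comments are illustrative only):

```bash
#!/bin/bash
# wrapper.sh (illustrative name): restrict each task to a single GCD.
# The node-local task ID selects the only visible ROCm device, then the
# script replaces itself with the real program and its arguments, e.g.
#   srun -N 1 -n 8 -c 8 --gres=gpu:8 ./wrapper.sh ./my_gpu_program <args>
export ROCR_VISIBLE_DEVICES=${SLURM_LOCALID}
exec "$@"
```

With an invocation like the one in the comment, task 0 only sees GCD 0, task 1 only sees GCD 1, and so on; the exact `srun` options depend on the resources requested.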
...