
...

Each GPU node has 4 MI250X GPU cards, each of which in turn has 2 Graphics Compute Dies (GCDs) that are seen as 2 logical GPUs; so each GPU node has 8 GCDs, which are equivalent to 8 Slurm GPUs. On the CPU side, the single AMD CPU chip has 64 cores organised into 8 groups that share the same L3 cache. Each of these L3 cache groups (or chiplets) has a direct Infinity Fabric connection to just one of the GCDs, providing optimal bandwidth. Each chiplet can still communicate with the other GCDs, albeit at a lower bandwidth due to the additional communication hops. (In the examples below, we use the numbering of the cores and the bus IDs of the GCDs to identify the allocated chiplets and GCDs, and their binding.)
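
As a purely illustrative check (not part of the official examples; it assumes rocm-smi and lscpu are available on the compute node), the bus IDs of the GCDs and the L3 cache groups of the cores can be inspected as follows:

Code Block
languagebash
titleExample. Inspecting GCD bus IDs and chiplet topology (illustrative sketch)
# List the accelerators visible on the node together with their PCIe bus IDs;
# on a GPU node this shows the 8 GCDs (logical GPUs).
rocm-smi --showbus

# Show each core with its socket, NUMA node and cache IDs; cores that share
# the same L3 cache belong to the same chiplet.
lscpu -e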

Note
titleImportant: GCD vs GPU
Anchor
gcdgpu

A MI250X GPU card has two GCDs. Previous generations of GPUs had only one GCD per GPU card, so the two terms could be used interchangeably, and this usage persists even though GPUs now have more than one GCD. Slurm, for instance, only uses the GPU terminology when referring to accelerator resources, so a request such as --gpus-per-node is effectively a request for a certain number of GCDs per node. On Setonix, the maximum number is 8.
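
As a minimal illustration of this terminology (a sketch only; the partition name and other settings are assumptions, and the recommended way of requesting resources on Setonix is via allocation packs, as described below):

Code Block
languagebash
titleExample. Requesting GCDs with Slurm's GPU terminology (illustrative sketch)
#!/bin/bash
#SBATCH --partition=gpu        # assumed partition name for the GPU nodes
#SBATCH --nodes=1
#SBATCH --gpus-per-node=8      # "8 GPUs" in Slurm terms = 8 GCDs = 4 MI250X cards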



In order to achieve best performance, the current allocation method uses a basic allocation unit called an "allocation pack". Users should therefore only request a number of "allocation packs". Each allocation pack consists of:

...



Code Block
languagebash
themeEmacs
titleListing N. selectGPU_X.sh wrapper script for "manually" selecting 1 GPU per task
linenumberstrue
#!/bin/bash

# Expose only the GCD whose index matches this task's node-local Slurm ID.
export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID
# Replace the wrapper with the command and arguments passed to it.
exec "$@"

...


The wrapper script sets the ROCm environment variable ROCR_VISIBLE_DEVICES to the value of the Slurm environment variable SLURM_LOCALID. It then executes the remaining arguments given to the script, which are the usual launch command for the program to be run. The SLURM_LOCALID variable holds the identification number of the task within its node (a node-local index, not a global one). Further details about this variable are available in the Slurm documentation.
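
As a usage sketch (the executable name ./my_gpu_program and the exact srun options are placeholders, not the official recommended settings), the wrapper is simply placed in front of the program so that each task only sees the GCD matching its local ID:

Code Block
languagebash
titleExample. Launching tasks through the selectGPU_X.sh wrapper (illustrative sketch)
# 8 tasks on one node, one task per GCD; ./my_gpu_program is a placeholder name.
srun -N 1 -n 8 -c 8 ./selectGPU_X.sh ./my_gpu_program

# Quick check of the binding each task receives:
srun -N 1 -n 8 -c 8 ./selectGPU_X.sh bash -c 'echo "Task $SLURM_LOCALID sees ROCR_VISIBLE_DEVICES=$ROCR_VISIBLE_DEVICES"'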

...