...

Note
titleImportant: GCD vs GPU
Anchor
gcdgpu

An MI250X GPU card has two GCDs (Graphics Compute Dies). Previous generations of GPUs had only one GCD per card, so the two terms could be used interchangeably, and that usage continues even though GPUs now have more than one GCD. Slurm, for instance, only uses the GPU terminology when referring to accelerator resources, so a request such as --gres=gpu:number is equivalent to requesting that number of GCDs per node. On Setonix, the maximum number is 8. (Note that the "equivalent" option --gpus-per-node=number is not recommended, as we have found some bugs with its use.)



In order to achieve best performance, the current allocation method uses a basic allocation unit called the "allocation-pack". Users should then only request a number of "allocation-packs". Each allocation-pack consists of:

...

Excerpt

New way of requesting (#SBATCH) and using (srun) resources on the GPU nodes

The request of resources for the GPU nodes has changed dramatically. The main reason for this change is Pawsey's effort to provide a method for optimal binding of the GPUs to the CPU cores in direct physical connection for each task. For this, we decided to completely separate the options used to request resources (via salloc or #SBATCH directives) from the options used to manage resources during execution of the code (via srun).

With a new CLI filter that Pawsey staff have put in place for the GPU nodes, the request of resources on GPU nodes should be thought of as requesting a number of "allocation-packs". Each "allocation-pack" consists of:

  • 1 whole CPU chiplet (8 CPU cores)
  • a bit less than 32 GB of memory (29.44 GB, to be exact, leaving some memory for the system to operate the node) = 1/8 of the total available RAM
  • 1 GCD directly connected to that chiplet

For that, the request of resources only needs the number of nodes (--nodes, -N) and the number of GPUs per node (--gres=gpu:number). The total number of requested GCDs (equivalent to Slurm GPUs), resulting from the multiplication of these two parameters, is interpreted as the total number of requested "allocation-packs".
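For example, the following request headers (a sketch of the allocation-related lines only; partition, walltime, account, etc. are still required, as in the full example scripts further below) ask for 2 x 4 = 8 allocation-packs in total:

#SBATCH --nodes=2              #2 nodes in this example
#SBATCH --gres=gpu:4           #4 GPUs (GCDs) per node (8 "allocation-packs" in total for the job)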

In the request of resources, users should not indicate any other Slurm allocation option related to memory or CPU cores; that is, do not use --ntasks, --cpus-per-task, --mem, etc. in the request headers of the script (#SBATCH directives) or in the request options given to salloc for interactive sessions. If, for some reason, the requirements for a job are indeed determined by the number of CPU cores or the amount of memory, then users should estimate the number of "allocation-packs" that covers their needs. The "allocation-pack" is the minimal unit of resources that can be managed, so all allocation requests should be multiples of this basic unit.
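As an illustration of this estimation (a sketch only, not a Pawsey tool; the requirement of 24 CPU cores and 100 GB of memory is an assumed example), the number of allocation-packs is the larger of the core-based and memory-based estimates:

needed_cores=24                                          #assumed example requirement
needed_mem_gb=100                                        #assumed example requirement
packs_by_cores=$(( (needed_cores + 7) / 8 ))             #ceil(cores / 8 cores per pack)
packs_by_mem=$(( (needed_mem_gb * 100 + 2943) / 2944 ))  #ceil(mem / 29.44 GB per pack)
packs=$(( packs_by_cores > packs_by_mem ? packs_by_cores : packs_by_mem ))
echo "Request ${packs} allocation-packs"                 #here: max(3,4) = 4, e.g. --nodes=1 --gres=gpu:4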

The use/management of resources with srun is another story. No default parameters should be assumed to be inherited by srun from the allocation. Therefore, after the requested resources are allocated, the srun command should be given explicit parameters indicating how resources are to be used by the srun step and the spawned tasks. The real management of resources at execution time is therefore performed by the command line options of srun.

The following table provides some examples that serve as a guide for requesting resources on the GPU nodes:

Warning
title--gpu-bind=closest may NOT work for all applications

Among the full explicit srun options for "managing resources", some help to achieve optimal binding of GPUs to their directly connected chiplet on the CPU. There are now two methods to achieve this optimal binding. So, together with the full explicit srun options, either of the following two methods can be used:

  1. Include these two Slurm parameters: --gpus-per-task=<number> together with --gpu-bind=closest
  2. "Manual" optimal binding with the use of "two auxiliary techniques".

The first method is simpler, but may still produce execution errors for some codes. "Manual" binding may be the only workable method for codes that rely on OpenMP or OpenACC pragmas for moving data between host and GPU while also attempting to use GPU-to-GPU enabled MPI communication. An example of such a code is Slate.
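For reference, a minimal sketch of the first method follows (the values are examples only; the table below remains the guide). The two binding options are simply added to the otherwise full explicit srun options:

export OMP_NUM_THREADS=1
srun -N 1 -n 2 -c 8 --gres=gpu:2 --gpus-per-task=1 --gpu-bind=closest <executable>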


Required Resources per Job | New "simplified" way of requesting resources | Total Allocated resources | Charge per hour

(The use of full explicit srun options is now required; only the 1st method for optimal binding is listed here.)

1 CPU task (single CPU thread) controlling 1 GCD (Slurm GPU)

#SBATCH --nodes=1
#SBATCH --gres=gpu:1
1 allocation-pack =
1 GPU, 8 CPU cores (1 chiplet), 29.44 GB RAM
64 SU

*1

export OMP_NUM_THREADS=1
srun -N 1 -n 1 -c 8 --gres=gpu:1 --gpus-per-task=1 --gpu-bind=closest <executable>

1 CPU task (with 14 CPU threads), all threads controlling the same 1 GCD

#SBATCH --nodes=1
#SBATCH --gres=gpu:2

2 allocation-packs=
2 GPUs, 16 CPU cores (2 chiplets), 58.88 GB RAM
128 SU

*2

export OMP_NUM_THREADS=14
srun -N 1 -n 1 -c 16 --gres=gpu:1 --gpus-per-task=1 --gpu-bind=closest <executable>

3 CPU tasks (single thread each), each controlling 1 GCD with GPU-aware MPI communication

#SBATCH --nodes=1
#SBATCH --gres=gpu:3
3 allocation-packs=
3 GPUs, 24 CPU cores (3 chiplets), 88.32 GB RAM
192 SU

*3

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1

srun -N 1 -n 3 -c 8 --gres=gpu:3 --gpus-per-task=1 --gpu-bind=closest <executable>

2 CPU tasks (single thread each), each task controlling 2 GCDs with GPU-aware MPI communication

#SBATCH --nodes=1
#SBATCH --gres=gpu:4

4 allocation-packs=
4 GPUs, 32 CPU cores (4 chiplets), 117.76 GB RAM
256 SU

*4

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1

srun -N 1 -n 2 -c 16 --gres=gpu:4 --gpus-per-task=2 --gpu-bind=closest <executable>

5 CPU tasks (with 2 CPU threads each), all threads/tasks able to see all 5 GPUs

#SBATCH --nodes=1
#SBATCH --gres=gpu:5

5 allocation-packs=
5 GPUs, 40 CPU cores (5 chiplets), 147.2 GB RAM
320 SU

*5

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=2
srun -N 1 -n 5 -c 8 --gres=gpu:5 <executable>

8 CPU tasks (single thread each), each controlling 1 GCD with GPU-aware MPI communication

#SBATCH --nodes=1
#SBATCH --exclusive
8 allocation-packs=
8 GPUs, 64 CPU cores (8 chiplets), 235 GB RAM
512 SU

*6

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1

srun -N 1 -n 8 -c 8 --gres=gpu:8 --gpus-per-task=1 --gpu-bind=closest <executable>

8 CPU tasks (single thread each), each controlling 4 GCDs with GPU-aware MPI communication

#SBATCH --nodes=4
#SBATCH --exclusive
32 allocation-packs=
4 nodes, each with: 8 GPUs, 64 CPU cores (8 chiplets), 235 GB RAM
2048 SU

*7

export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1

srun -N 4 -n 8 -c 32 --gres=gpu:8 --gpus-per-task=4 --gpu-bind=closest <executable>

1 CPU task (single thread), controlling 1 GCD but preventing other jobs from running on the same node, for ideal performance.

#SBATCH --nodes=1
#SBATCH --exclusive
8 allocation-packs=
8 GPUs, 64 CPU cores (8 chiplets), 235 GB RAM
512 SU

*8

export OMP_NUM_THREADS=1
srun -N 1 -n 1 -c 8 --gres=gpu:1 --gpus-per-task=1 --gpu-bind=closest <executable>

Notes for the request of resources:

  • Note that this simplified way of resource request is based on requesting a number of "allocation-packs".
  • The --nodes (-N) option indicates the number of nodes requested to be allocated.
  • The --gres=gpu:number option indicates the number of allocation-packs requested to be allocated per node. (The "equivalent" option --gpus-per-node=number is not recommended as we have found some bugs with its use.)
  • The --exclusive option requests all the resources from the number of requested nodes. When this option is used, there is no need for the use of --gres=gpu:number during allocation and, indeed, its use is not recommended in this case.
  • Users should not include any other Slurm allocation option that may indicate some "calculation" of required memory or CPU cores. The management of resources should only be performed after allocation via srun options.
  • The same simplified resource request should be used for requesting interactive sessions with salloc (see the sketch after these notes).
  • IMPORTANT: In addition to the request parameters shown in the table, users should indeed use other Slurm request parameters related to partition, walltime, job naming, output, email, etc. (Check the examples of the full Slurm batch scripts.)
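For example, an interactive session equivalent to a single allocation-pack could be requested with something like the following (a sketch only; adjust the partition, walltime and project to your own needs):

salloc --partition=gpu --nodes=1 --gres=gpu:1 --time=00:30:00 --account=<yourProject>-gpu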

Notes for the use/management of resources with srun:

  • The --nodes (-N) option indicates the number of nodes to be used by the srun step.
  • The --ntasks (-n) option indicates the total number of tasks to be spawned by the srun step.
  • The --cpus-per-task (-c) option should be set to multiples of 8 (whole chiplets) to guarantee that srun distributes the resources in "allocation-packs", thereby "reserving" whole chiplets per srun task even if the real number of threads per task is 1. The real number of threads is controlled with the OMP_NUM_THREADS environment variable. (A general pattern is sketched after these notes.)
  • The --gres=gpu:number option indicates the number of GPUs per node to be used by the srun step. (The "equivalent" option --gpus-per-node=number is not recommended as we have found some bugs with its use.)
  • The --gpus-per-task option indicates the number of GPUs to be bound to each task spawned by the srun step via the -n option.
  • The --gpu-bind=closest option indicates that the GPUs bound to each task should be the optimal ones, i.e. those physically closest to the chiplet assigned to that task.
  • IMPORTANT: The use of --gpu-bind=closest will assign optimal binding but may still NOT work and may produce execution errors for codes that rely on OpenMP or OpenACC pragmas for moving data between host and GPU while attempting to use GPU-to-GPU enabled MPI communication. For those cases, the use of the "manual" optimal binding (method 2) is required.
  • (*1) This is the only case where srun may work fine with the default inherited option values. Nevertheless, it is good practice to always use the full explicit srun options to indicate the resources needed by the executable. In this case, the settings explicitly "reserve" a whole chiplet (-c 8) for the srun task and control the real number of threads with the OMP_NUM_THREADS environment variable.
  • (*2) The required number of CPU threads per task is 14, and that is controlled with the OMP_NUM_THREADS environment variable. Still, two full chiplets (-c 16) are indicated for the srun task.
  • (*3) The settings explicitly "reserve" a whole chiplet (-c 8) for each srun task. This provides "one-chiplet-long" separation among each of the CPU cores to be allocated for the tasks spawned by srun (-n 3).  The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of GPU to corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.
  • (*4) Each task needs to be in direct communication with 2 GCDs. For that, each CPU task reserves "two full chiplets". The use of -c 16 "reserves" a "two-chiplets-long" separation between the two CPU cores that are to be used (one for each of the srun tasks, -n 2). In this way, each task will be in direct communication with the two logical GPUs on the MI250X card that has the optimal connection to the chiplets reserved for that task. The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of GPU to corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.
  • (*5) Sometimes the executable performs all the management of the GPUs requested. If all the management logic is performed by the executable, then all the available resources should be exposed to it. In this case, no options for optimal binding are given; only the number of GPUs per node to be exposed to the job step (--gres=gpu:number) is given. The real number of threads is controlled with the OMP_NUM_THREADS variable. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.
  • (*6) All GPUs in the node are requested, which means all the resources available in the node, via the --exclusive allocation option (there is no need to indicate the number of GPUs per node when using exclusive allocation). The use of -c 8 provides a "one-chiplet-long" separation among each of the CPU cores to be allocated for the tasks spawned by srun (-n 8). The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of GPU to corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.
  • (*7) All GPUs in each node are requested, which means all the resources available in the nodes, via the --exclusive allocation option (there is no need to indicate the number of GPUs per node when using exclusive allocation). Each task needs to be in direct communication with 4 GCDs. For that, each CPU task reserves "four full chiplets". The use of -c 32 "reserves" a "four-chiplets-long" separation between the two CPU cores that are to be used per node (8 srun tasks in total, -n 8). In this way, each task will be in direct communication with the four logical GPUs in the node that are closest to the chiplets reserved for that task. The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of GPU to corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1. The --gres=gpu:8 option assigns 8 GPUs per node to the srun step (32 GPUs in total, as 4 nodes are being used).
  • (*8) All GPUs in the node are requested using the --exclusive option, but only 1 CPU chiplet and 1 GPU "unit" (i.e. one allocation-pack) are used in the srun step.
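Putting these notes together, the srun options for the common case of several tasks per node, each bound to one or more GCDs, follow the general pattern sketched below (the values shown are hypothetical examples; the table above remains the reference):

# -c is 8 x (GCDs per task), --gres=gpu: is the number of GCDs per node used by the step,
# and --gpus-per-task together with --gpu-bind=closest give optimal binding (method 1).
export MPICH_GPU_SUPPORT_ENABLED=1   #only if GPU-aware MPI communication is needed
export OMP_NUM_THREADS=1             #real number of CPU threads per task
srun -N 2 -n 4 -c 16 --gres=gpu:4 --gpus-per-task=2 --gpu-bind=closest <executable>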

General notes:

  • The allocation charge is for the total of allocated resources and not only for the ones explicitly used in the execution, so all idle resources will also be charged (see the worked example below).
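As a worked example (the job size and duration are assumptions), a job allocated 3 allocation-packs that runs for 5 hours is charged 3 allocation-packs x 64 SU/hour x 5 hours = 960 SU, independently of how much of the allocation the srun steps actually use:

packs=3 ; hours=5                            #assumed example values
echo "Charge: $(( packs * 64 * hours )) SU"  #3 packs x 64 SU/hour x 5 hours = 960 SU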

...

#SBATCH --nodes=1              #1 node in this example 
#SBATCH --gres=gpu:1           #1 GPU per node (1 "allocation-pack" in total for the job)

...



Code Block
languagebash
themeEmacs
titleListing N. exampleScript_1NodeShared_1GPU.sh
linenumberstrue
#!/bin/bash --login
#SBATCH --job-name=1GPUSharedNode
#SBATCH --partition=gpu
#SBATCH --nodes=1              #1 node in this example 
#SBATCH --gres=gpu:1           #1 GPU per node (1 "allocation-pack" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=<yourProject>-gpu #IMPORTANT: use your own project and the -gpu suffix
#(Note that there is no request for exclusive access to the node)

#----
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm craype-accel-amd-gfx90a
echo -e "\n\n#------------------------#"
module list

#----
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}

#----
#Definition of the executable (we assume the example code has been compiled and is available in $MYSCRATCH):
exeDir=$MYSCRATCH/hello_jobstep
exeName=hello_jobstep
theExe=$exeDir/$exeName

#----
#MPI & OpenMP settings
#Not needed for 1 GPU: export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1           #This controls the real CPU-cores per task for the executable

#----
#Execution
#Note: srun needs the full explicit indication of the parameters for the use of resources in the job step.
#      These are independent from the allocation parameters (which are not inherited by srun).
#      For optimal GPU binding using Slurm options,
#      "--gpus-per-task=1" and "--gpu-bind=closest" create the optimal binding of GPUs.
#      (Although in this case this can be avoided, as only 1 "allocation-pack" has been requested.)
echo -e "\n\n#------------------------#"
echo "Test code execution:"
srun -l -u -N 1 -n 1 -c 8 --gres=gpu:1 ${theExe} | sort -n

#----
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished jobs steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20

#----
#Done
echo -e "\n\n#------------------------#"
echo "Done"


...

#SBATCH --nodes=1              #1 node in this example 
#SBATCH --gres=gpu:3           #3 GPUs per node (3 "allocation-packs" in total for the job)

...

#SBATCH --nodes=1              #1 node in this example 
#SBATCH --gres=gpu:3           #3 GPUs per node (3 "allocation-packs" in total for the job)

...