...

To achieve the best performance, the current allocation method uses a basic allocation unit called the "allocation-pack". Users should only request a number of "allocation-packs" (a minimal request sketch follows the list below). Each allocation-pack consists of:

  • 1 whole CPU chiplet (8 CPU cores)
  • ~32 GB memory
  • 1 GCD (Slurm GPU) directly connected to that chiplet
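For example, a minimal request header for a job needing 2 allocation-packs on a single node would look like the following sketch (the project name is a placeholder; partition, walltime, etc. follow the usual Slurm conventions shown in the full examples further down this page):

#SBATCH --nodes=1              #1 node
#SBATCH --gpus-per-node=2      #2 GPUs per node (2 "allocation-packs" in total for the job)
#SBATCH --partition=gpu
#SBATCH --account=<yourProject>-gpu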

...

Excerpt

New way of requesting (#SBATCH) and using (srun) resources for GPU nodes

The way resources are requested for the GPU nodes has changed dramatically. The main reason for this change is Pawsey's effort to provide a method for optimal binding of the GPUs to the CPU cores that are in direct physical connection for each task. For this, we decided to completely separate the options used for the resource request (via salloc or #SBATCH directives) from the options used for the management of resources during execution of the code (via srun).

With the new CLI filter that Pawsey staff have put in place for the GPU nodes, the request of resources on GPU nodes should be thought of as requesting a number of "allocation-packs". Each "allocation-pack" consists of:

  • 1 whole CPU chiplet (8 CPU cores)
  • a bit less than 32 GB of memory (29.44 GB, to be exact, leaving some memory for the system to operate the node)
  • 1 GCD directly connected to that chiplet

Therefore, the request of resources only needs the number of nodes (--nodes, -N) and the number of GPUs per node (--gpus-per-node). The total number of requested GCDs (equivalent to Slurm GPUs), resulting from the multiplication of these two parameters, will be interpreted as the total number of requested "allocation-packs".

In the request of resources, users should not indicate any other Slurm allocation option related to memory or CPU cores; that is, do not use --ntasks, --cpus-per-task, --mem, etc. in the request headers of the script (#SBATCH directives) or in the request options of salloc. If, for some reason, the job requirements are dictated by the number of CPU cores or the amount of memory, then users should estimate the number of "allocation-packs" that meets their needs (see the sketch below). The "allocation-pack" is the minimal unit of resources that can be managed, so all allocation requests should indeed be multiples of this basic unit.
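As a rough guide, the number of "allocation-packs" is the larger of two estimates: required CPU cores divided by 8, and required memory divided by 29.44 GB (both rounded up). The following is a minimal bash sketch of that arithmetic; the requirement values are made-up examples:

#Hypothetical requirements used only to illustrate the estimate:
required_cores=20      #the job needs 20 CPU cores
required_mem_gb=100    #the job needs 100 GB of memory

packs_by_cores=$(( (required_cores + 7) / 8 ))                                        #ceil(20/8)      = 3
packs_by_mem=$(python3 -c "import math; print(math.ceil(${required_mem_gb}/29.44))")  #ceil(100/29.44) = 4

#The request must satisfy both constraints, so take the maximum (here 4 "allocation-packs",
#i.e. #SBATCH --gpus-per-node=4 on a single node):
echo $(( packs_by_cores > packs_by_mem ? packs_by_cores : packs_by_mem ))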

The use/management of resources with srun is another story. After the requested resources have been allocated, the srun command should be explicitly provided with enough parameters indicating how the resources are to be used by the job step and the spawned tasks. So the real management of resources is performed via the command-line options of srun; no default values should be assumed for srun.

The following table provides some examples that will serve as a guide for requesting resources in the GPU nodes:

Warning
title--gpu-bind=closest may NOT work for all applications

There are now two methods to achieve optimal binding of GPUs:

  1. The use of srun parameters for optimal binding: --gpus-per-task=<number> together with --gpu-bind=closest
  2. "Manual" optimal binding with the use of "two auxiliary techniques".

The first method is simpler, but may not work for all codes. "Manual" binding may be the only reliable method for codes that rely on OpenMP or OpenACC pragmas to move data between host and GPU and that attempt to use GPU-to-GPU enabled MPI communication. An example of such a code is Slate.


Each example below lists the required resources per job, the new "simplified" way of requesting resources (#SBATCH), the total allocated resources, the charge per hour, and the use of resources with srun. The use of full explicit srun options is now required (only the 1st method for optimal binding is listed here).

(*1) 1 CPU task (single CPU thread) controlling 1 GCD (Slurm GPU)
Request:
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
Total allocated resources: 1 allocation-pack = 1 GPU, 8 CPU cores (1 chiplet), 29.44 GB RAM
Charge per hour: 64 SU
Use (srun):
export OMP_NUM_THREADS=1
srun -N 1 -n 1 -c 8 --gpus-per-node=1 --gpus-per-task=1

(*2) 14 CPU threads all controlling the same 1 GCD
Request:
#SBATCH --nodes=1
#SBATCH --gpus-per-node=2
Total allocated resources: 2 allocation-packs = 2 GPUs, 16 CPU cores (2 chiplets), 58.88 GB RAM
Charge per hour: 128 SU
Use (srun):
export OMP_NUM_THREADS=14
srun -N 1 -n 1 -c 16 --gpus-per-node=1 --gpus-per-task=1

(*3) 3 CPU tasks (single thread each), each controlling 1 GCD with GPU-aware MPI communication
Request:
#SBATCH --nodes=1
#SBATCH --gpus-per-node=3
Total allocated resources: 3 allocation-packs = 3 GPUs, 24 CPU cores (3 chiplets), 88.32 GB RAM
Charge per hour: 192 SU
Use (srun):
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
srun -N 1 -n 3 -c 8 --gpus-per-node=3 --gpus-per-task=1 --gpu-bind=closest

(*4) 2 CPU tasks (single thread each), each task controlling 2 GCDs with GPU-aware MPI communication
Request:
#SBATCH --nodes=1
#SBATCH --gpus-per-node=4
Total allocated resources: 4 allocation-packs = 4 GPUs, 32 CPU cores (4 chiplets), 117.76 GB RAM
Charge per hour: 256 SU
Use (srun):
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
srun -N 1 -n 2 -c 16 --gpus-per-node=4 --gpus-per-task=2 --gpu-bind=closest

8 CPU tasks (single thread each), each controlling 1 GCD with GPU-aware MPI communication
Request:
#SBATCH --nodes=1
#SBATCH --exclusive
Total allocated resources: 8 allocation-packs = 8 GPUs, 64 CPU cores (8 chiplets), 235.52 GB RAM
Charge per hour: 512 SU
Use (srun):
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
srun -N 1 -n 8 -c 8 --gpus-per-node=8 --gpus-per-task=1 --gpu-bind=closest

Notes for the request of resources:

  • Note that this simplified way of resource request is based on requesting a number of "allocation-packs".
  • Users should not include any other Slurm allocation option that may indicate some "calculation" of required memory or CPU cores. The management of resources should only be performed after allocation via srun options.
  • The same simplified resource request should be used for the request of interactive sessions with salloc.
  • IMPORTANT: In addition to the request parameters shown in the table, users should indeed use other Slurm request parameters related to partition, walltime, job naming, output, email, etc. (Check the examples of the full Slurm batch scripts.)

Notes for the use/management of resources with srun:

  • IMPORTANT: The use of --gpu-bind=closest may NOT work for codes that rely on OpenMP or OpenACC pragmas to move data between host and GPU and that attempt to use GPU-to-GPU enabled MPI communication. For those cases, the use of the "manual" optimal binding (method 2) is required.
  • The --cpus-per-task (-c) option should be set to multiples of 8 (whole chiplets) to guarantee that srun distributes the resources in "allocation-packs", "reserving" whole chiplets per srun task even if the real number is 1 thread per task. The real number of threads is controlled with the OMP_NUM_THREADS variable.
  • (*1) This is the only case where srun may work fine with default inherited option values. Nevertheless, it is a good practice to use full explicit options of srun to indicate the resources needed for the executable. In this case, the settings explicitly "reserve" a whole chiplet (-c 8) for the srun task and control the real number of threads with the OMP_NUM_THREADS variable.
  • (*2) The required CPU threads per task is 14 but two full chiplets (-c 16) are indicated for each srun task and the number of threads is controlled with the OMP_NUM_THREADS variable.
  • (*3) The settings explicitly "reserve" a whole chiplet (-c 8) for each srun task. This provides "one-chiplet-long" separation among each of the CPU cores to be allocated for the tasks spawned by srun (-n 3).  The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of GPU to corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.
  • (*4) Note the use of -c 16 to "reserve" a "two-chiplets-long" separation among the two CPU cores that are to be used (one for each of the srun tasks, -n 2). In this way, each task will be in direct communication with the two logical GPUs (GCDs) that have optimal connection to its chiplets.

General notes:

  • The allocation charge is for the total of the allocated resources and not only for the resources explicitly used during execution, so all idle resources will also be charged. For example, a job that runs for 5 hours with 3 "allocation-packs" allocated is charged 3 × 64 SU/hour × 5 hours = 960 SU, even if only one of the GCDs is actually used.

...

For a better understanding of what this script generates and how it is useful, we can use an interactive session. Note that the request parameters given to salloc only include the number of nodes and the number of Slurm GPUs (GCDs) per node to request a number of "allocation-packs" (as described at the top of this page). In this case, 3 "allocation-packs" are requested:

Column
width900px


Code Block
languagebash
themeDJango
titleTerminal N. Explaining the use of the script "generate_CPU_BIND.sh" from an salloc session
$ salloc -N 1 --gpus-per-node=3 -A yourProject-gpu --partition=gpu-dev
salloc: Granted job allocation 1370877


$ scontrol show jobid $SLURM_JOBID
JobId=1370877 JobName=interactive
   UserId=quokka(20146) GroupId=quokka(20146) MCS_label=N/A
   Priority=16818 Nice=0 Account=rottnest0001-gpu QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:48 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=16:45:41 EligibleTime=16:45:41
   AccrueTime=Unknown
   StartTime=16:45:41 EndTime=17:45:41 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=16:45:41 Scheduler=Main
   Partition=gpu AllocNode:Sid=joey-02:253180
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=nid001004
   BatchHost=nid001004
   NumNodes=1 NumCPUs=48 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
   TRES=cpu=48,mem=88320M,node=1,billing=192,gres/gpu=3
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/scratch/rottnest0001/quokka/hello_jobstep
   Power=
   CpusPerTres=gres:gpu:8
   MemPerTres=gpu:29440
   TresPerNode=gres:gpu:3   


$ rocm-smi --showhw
======================= ROCm System Management Interface =======================
============================ Concise Hardware Info =============================
GPU  DID   GFX RAS   SDMA RAS  UMC RAS   VBIOS           BUS 
0    7408  DISABLED  ENABLED   DISABLED  113-D65201-042  0000:C9:00.0 
1    7408  DISABLED  ENABLED   DISABLED  113-D65201-042  0000:D1:00.0 
2    7408  DISABLED  ENABLED   DISABLED  113-D65201-042  0000:D6:00.0 
================================================================================
============================= End of ROCm SMI Log ==============================


$ generate_CPU_BIND.sh map_cpu
map_cpu:21,2,14


$ generate_CPU_BIND.sh mask_cpu
mask_cpu:0000000000FF0000,00000000000000FF,000000000000FF00


As can be seen, 3 "allocation-packs" were requested, and the total amount of allocated resources is shown in the output of the scontrol command, including the 3 GCDs (logical/Slurm GPUs) and 88.32 GB of memory. The rocm-smi command lists the three allocated devices, identified locally as GPU:0-BUS_ID:C9, GPU:1-BUS_ID:D1 & GPU:2-BUS_ID:D6.
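The generated map_cpu (or mask_cpu) list is in the format accepted by srun's --cpu-bind option for the "manual" binding method. The following is only an illustrative sketch of how that list might be combined with a small per-task wrapper that selects the matching GCD; the wrapper name (select_gpu_wrapper.sh) and its selection logic are assumptions made here for illustration, not the documented Pawsey auxiliary scripts:

#Illustrative sketch only (hypothetical wrapper, not the documented Pawsey method):
CPU_BIND="map_cpu:21,2,14"          #output of generate_CPU_BIND.sh map_cpu in this session

#Hypothetical wrapper: each task sees only the GCD whose index equals its local task ID,
#then launches the real executable:
cat > select_gpu_wrapper.sh << 'EOF'
#!/bin/bash
export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID
exec "$@"
EOF
chmod +x select_gpu_wrapper.sh

srun -N 1 -n 3 --cpu-bind=${CPU_BIND} ./select_gpu_wrapper.sh ./hello_jobstep | sort -n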

...

As mentioned in the previous section, allocation of resources is granted in "allocation-packs" with 8 cores (1 chiplet) per GCD. Also briefly mentioned in the previous section is the need for "reserving" chunks of whole chiplets (multiples of 8 CPU cores) in the srun command via the --cpus-per-task (-c) option. But the use of this option in srun is still more a "reservation" parameter for the srun tasks to be bound to whole chiplets, rather than an indication of the "real number of threads" to be used by the executable. The real number of threads to be used by the executable needs to be controlled by the OpenMP environment variable OMP_NUM_THREADS. In other words, we use --cpus-per-task to make whole chiplets available to the srun task, but use OMP_NUM_THREADS to control the real number of threads per srun task.
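As a minimal sketch of this pattern (consistent with the examples in the table above), a single task that should run with 5 OpenMP threads still "reserves" a whole chiplet with -c 8, while OMP_NUM_THREADS sets the real thread count:

export OMP_NUM_THREADS=5    #real number of threads used by the executable
srun -N 1 -n 1 -c 8 --gpus-per-node=1 --gpus-per-task=1 ./hello_jobstep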

...

The explanation of the test code will be provided with the output of an interactive session that uses 3 "allocation-packs" to access the 3 GCDs (logical/Slurm GPUs) and 3 full CPU chiplets in different ways.

The first part is creating the session and checking that the resources were granted as 3 "allocation-packs":

Column
width900px


Code Block
languagebash
themeDJango
titleTerminal N. Explaining the use of the "hello_jobstep" code from an salloc session (allocation and check)
$ salloc -N 1 --gpus-per-node=3 -A <yourProject>-gpu --partition=gpu-dev
salloc: Granted job allocation 339185

$ scontrol show jobid $SLURM_JOBID
JobId=339185 JobName=interactive
   UserId=quokka(20146) GroupId=quokka(20146) MCS_label=N/A
   Priority=16818 Nice=0 Account=rottnest0001-gpu QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:48 TimeLimit=01:00:00 TimeMin=N/A
   SubmitTime=16:45:41 EligibleTime=16:45:41
   AccrueTime=Unknown
   StartTime=16:45:41 EndTime=17:45:41 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=16:45:41 Scheduler=Main
   Partition=gpu AllocNode:Sid=joey-02:253180
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=nid001004
   BatchHost=nid001004
   NumNodes=1 NumCPUs=48 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
   TRES=cpu=48,mem=88320M,node=1,billing=192,gres/gpu=3
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=(null)
   WorkDir=/scratch/rottnest0001/quokka/hello_jobstep
   Power=
   CpusPerTres=gres:gpu:8
   MemPerTres=gpu:29440
   TresPerNode=gres:gpu:3   


...

Starting from the same allocation as above (3 "allocation-packs"), now all the parameters needed to define the correct use of resources are provided to srun. In this case, 3 MPI tasks are to be run (single threaded), each task making use of 1 GCD (logical/Slurm GPU). As described above, there are two methods to achieve optimal binding. The first method only uses Slurm parameters to indicate how resources are to be used by srun. In this case:
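A minimal sketch of such a job step, following the corresponding entry (*3) in the table of examples above and using the hello_jobstep test code:

export MPICH_GPU_SUPPORT_ENABLED=1   #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1             #1 thread per MPI task
srun -N 1 -n 3 -c 8 --gpus-per-node=3 --gpus-per-task=1 --gpu-bind=closest ./hello_jobstep | sort -n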

...

If the code is hybrid on the CPU side and needs several OpenMP CPU threads, the OMP_NUM_THREADS environment variable is used to control the number of threads. So, again, starting from the previous session with 3 "allocation-packs", consider a case with 3 MPI tasks, 4 OpenMP threads per task and 1 GCD (logical/Slurm GPU) per task:
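A minimal sketch of the corresponding job step; only OMP_NUM_THREADS changes with respect to the single-threaded case, while the -c 8 "reservation" of the whole chiplet stays the same:

export MPICH_GPU_SUPPORT_ENABLED=1   #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=4             #4 OpenMP threads per MPI task
srun -N 1 -n 3 -c 8 --gpus-per-node=3 --gpus-per-task=1 --gpu-bind=closest ./hello_jobstep | sort -n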

...

As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. This example considers a job that will make use of the 8 GCDs (logical/Slurm GPUs) on 1 node (8 "allocation-packs"). The resource request uses the following two parameters:

#SBATCH --nodes=1   #1 node in this example
#SBATCH --exclusive #All resources of the node are exclusive to this job
#                   #8 GPUs per node (8 "allocation-packs" in total for the job)

Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header.

...

As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. The same procedure mentioned above for the single exclusive node job should be applied for multi-node exclusive jobs. The only difference when requesting resources is the number of exclusive nodes requested. So, for example, for a job requiring 2 exclusive nodes (16 GCDs (logical/Slurm GPUs), or 16 "allocation-packs"), the resource request uses the following two parameters:

#SBATCH --nodes=2   #2 nodes in this example
#SBATCH --exclusive #All resources of the node are exclusive to this job
#                   #8 GPUs per node (16 "allocation-packs" in total for the job)

Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header.
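A minimal sketch of the matching job step for this 2-node request, extending the exclusive-node pattern from the table of examples above (8 single-threaded MPI tasks per node, one GCD each, 16 tasks in total):

export MPICH_GPU_SUPPORT_ENABLED=1   #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1             #1 thread per MPI task
srun -N 2 -n 16 -c 8 --gpus-per-node=8 --gpus-per-task=1 --gpu-bind=closest ./hello_jobstep | sort -n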

...

As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. In this case we ask for 1 allocation-pack with:

#SBATCH --nodes=1              #1 node in this example 
#SBATCH --gpus-per-node=1      #1 GPU per node (1 "allocation-pack" in total for the job)

Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header.

The use/management of the allocated resources is controlled by the srun options and some environment variables. As only 1 "allocation-pack" is requested, there is no need to take any further action for optimal binding of the CPU chiplet and GPU, as it is already guaranteed:

Column
width900px


Code Block
languagebash
themeEmacs
titleListing N. exampleScript_1NodeShared_1GPU.sh
linenumberstrue
#!/bin/bash --login
#SBATCH --job-name=1GPUSharedNode
#SBATCH --partition=gpu
#SBATCH --nodes=1              #1 node in this example 
#SBATCH --gpus-per-node=1      #1 GPU per node (1 "allocation-pack" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=<yourProject>-gpu #IMPORTANT: use your own project and the -gpu suffix
#(Note that there is no request for exclusive access to the node)

#----
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm craype-accel-amd-gfx90a
echo -e "\n\n#------------------------#"
module list

#----
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}

#----
#Definition of the executable (we assume the example code has been compiled and is available in $MYSCRATCH):
exeDir=$MYSCRATCH/hello_jobstep
exeName=hello_jobstep
theExe=$exeDir/$exeName

#----
#MPI & OpenMP settings
#Not needed for 1 GPU: export MPICH_GPU_SUPPORT_ENABLED=1   #This would allow GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1           #This controls the real CPU-cores per task for the executable

#----
#Execution
#Note: srun needs the explicit indication of the full parameters for the use of resources in the job step.
#      These are independent from the allocation parameters (which are not inherited by srun).
#      For optimal GPU binding using Slurm options,
#      "--gpus-per-task=1" and "--gpu-bind=closest" create the optimal binding of GPUs.
#      (Although in this case this can be avoided, as only 1 "allocation-pack" has been requested.)
echo -e "\n\n#------------------------#"
echo "Test code execution:"
srun -l -u -N 1 -n 1 -c 8 --gpus-per-node=1 ${theExe} | sort -n

#----
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished jobs steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20

#----
#Done
echo -e "\n\n#------------------------#"
echo "Done"


...

Column
width900px


Code Block
languagebash
themeDJango
titleTerminal N. Output for a 1 GPU job (using only 1 allocation-pack in a shared node)
$ sbatch exampleScript_1NodeShared_1GPU.sh
Submitted batch job 323098

$ cat slurm-323098.out
...
#------------------------#
Test code execution:
0: MPI 000 - OMP 000 - HWT 002 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d1
...
#------------------------#
Done


The output of the hello_jobstep code tells us that CPU core "002" and the GPU with Bus_ID:D1 were utilised by the job. Optimal binding is guaranteed for a single "allocation-pack", as the memory, CPU chiplet and GPU within each pack are directly connected.

...

As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. In this case we ask for 3 allocation-packs with:

#SBATCH --nodes=1              #1 node in this example 
#SBATCH --gpus-per-node=3      #3 GPUs per node (3 "allocation-packs" in total for the job)

Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header.

...

In the following example, we use 3 GCDs (logical/Slurm GPUs) (1 per MPI task) and the number of CPU threads per task is 5. As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. In this case we ask for 3 allocation-packs with:

#SBATCH --nodes=1              #1 node in this example 
#SBATCH --gpus-per-node=3      #3 GPUs per node (3 "allocation-packs" in total for the job)

Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header. The real number of threads per task is then controlled with:

...