Node architecture
The GPU node architecture is different from that on the CPU-only nodes. The following diagram shows the connections between the CPU and GPUs on the node, which will assist with understanding recommendations for Slurm job scripts later on this page. Note that the numbering of the cores of the CPU has a slightly different order to that of the GPUs. Each GCD can access 64GB of GPU memory. This totals to 128GB per MI250X, and 256GB per standard GPU node.

Figure 1. GPU node architecture. Note that each GPU shown here is equivalent to a GCD (more information about this is in the Setonix General Information).
Each GPU node has 4 MI250X GPU cards, each of which has 2 Graphics Compute Dies (GCDs) that are seen as 2 logical GPUs; so each GPU node has 8 GCDs, equivalent to 8 Slurm GPUs. On the other hand, the single AMD CPU chip has 64 cores organised in 8 groups, each sharing the same L3 cache. Each of these L3 cache groups (or chiplets) has a direct Infinity Fabric connection with just one of the GCDs, providing optimal bandwidth. Each chiplet can communicate with the other GCDs, albeit at a lower bandwidth due to the additional communication hops. (In the examples explained in the rest of this document, we use the numbering of the cores and the bus IDs of the GCDs to identify the allocated chiplets and GCDs, and their binding.)
An MI250X GPU card has two GCDs. Previous generations of GPUs had only 1 GCD per GPU card, so the two terms could be used interchangeably, and the interchangeable usage continues even though GPU cards now have more than one GCD. Slurm, for instance, only uses the GPU terminology when referring to accelerator resources, so a request such as --gres=gpu:number is equivalent to a request for a certain number of GCDs per node. On Setonix, the maximum number is 8. (Note that the "equivalent" option --gpus-per-node=number is not recommended, as we have found some bugs with its use.)
Furthermore, Pawsey DOES NOT use the standard Slurm meaning for the --gres=gpu:number parameter. The meaning of this parameter has been superseded to represent a request for a number of "allocation-packs". The new representation has been implemented to achieve the best performance. Therefore, the current allocation method uses the "allocation-pack" as the basic allocation unit and, as explained in the rest of this document, users should only request the number of "allocation-packs" that fulfil the needs of the job. Each allocation-pack provides:
- 1 whole CPU chiplet (8 CPU cores)
- ~32 GB memory (1/8 of the total available RAM)
- 1 GCD (Slurm GPU) directly connected to that chiplet
For jobs that only use a partial set of the resources of the node (non-exclusive jobs that share the rest of the node with other jobs), the current Setonix GPU configuration may not provide perfect allocation and binding, which may impact performance depending on the amount of CPU-GPU communication. This is under active investigation. The recommendations included in this document will achieve optimal allocations in most cases, but this is not 100% guaranteed. Therefore, if you detect that imperfect binding or the use of shared nodes (even with optimal binding) is impacting the performance of your jobs, it is recommended to use exclusive nodes where possible, noting that the project will still be charged for the whole node even if part of the resources remain idle. Please also report the observed issues to Pawsey's helpdesk.
Further details of the node architecture are also available on the GPU node architecture page.
Slurm use of GPU nodes
Project name to access the GPU nodes is different
The default project name will not give you access to the GPU nodes. To access the GPU nodes, users need to add the suffix "-gpu" to their project name and explicitly indicate it in the resource request options: #SBATCH -A <projectName>-gpu
So, for example, if your project name is "rottnest0001" the setting would be: #SBATCH -A rottnest0001-gpu
This applies to all GPU partitions (gpu, gpu-dev & gpu-highmem).
Pawsey's way for requesting resources on GPU nodes (different to standard Slurm)
The request of resources for the GPU nodes has changed dramatically. The main reason for this change has to do with Pawsey's efforts to provide a method for optimal binding of the GPUs to the CPU cores in direct physical connection for each task. For this, we decided to completely separate the options used for resource request via salloc (or #SBATCH pragmas) and the options for the use of resources during execution of the code via srun.
With a new CLI filter that Pawsey staff have put in place for the GPU nodes, the request of resources in GPU nodes should be thought of as requesting a number of "allocation-packs". Each "allocation-pack" provides:
- 1 whole CPU chiplet (8 CPU cores)
- a bit less than 32 GB of memory (29.44 GB, to be exact, allowing some memory for the system to operate the node) = 1/8 of the total available RAM
- 1 GCD directly connected to that chiplet
For that, the request of resources only needs the number of nodes (--nodes, -N) and the number of allocation-packs per node (--gres=gpu:number). The total number of allocation-packs requested is the product of these two parameters. Note that the standard Slurm meaning of the second parameter IS NOT used at Pawsey. Instead, Pawsey's CLI filter interprets this parameter as:
- the number of requested "allocation-packs" per node
Note that the "equivalent" option --gpus-per-node=number (which is also interpreted as the number of "allocation-packs" per node) is not recommended as we have found some bugs with its use.
Furthermore, in the request of resources, users should not indicate any other Slurm allocation option related to memory or CPU cores. Therefore, users should not use --ntasks , --cpus-per-task , --mem , etc. in the request headers of the script ( #SBATCH directives), or in the request options given to salloc for interactive sessions. If, for some reason, the requirements for a job are indeed determined by the number of CPU cores or the amount of memory, then users should estimate the number of "allocation-packs" that cover their needs. The "allocation-pack" is the minimal unit of resources that can be managed, so that all allocation requests should be indeed multiples of this basic unit. Pawsey also has some site specific recommendations for the use/management of resources with srun command. Users should explicitly provide a list of several parameters for the use of resources by srun . (The list of these parameters is made clear in the examples below.) Users should not assume that srun will inherit any of these parameters from the allocation request. Therefore, the real management of resources at execution time is performed by the command line options provided to srun . Note that, for the case of srun , options do have the standard Slurm meaning.
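When the needs of a job are driven by CPU cores or memory rather than by GPUs, the number of allocation-packs can be estimated with simple arithmetic. The following is a minimal sketch and not a Pawsey tool: the function name packs_needed is hypothetical, and the per-pack figures (8 CPU cores and 29.44 GB) are the ones quoted above.

```bash
#!/bin/bash
# Hypothetical helper (not part of Slurm or pawseytools): estimate how many
# allocation-packs per node cover a given need of CPU cores and memory.
# Assumes 8 CPU cores and 29.44 GB of RAM per allocation-pack, as stated above.
packs_needed() {
    local cores=$1      # CPU cores required on the node
    local mem_gb=$2     # memory required on the node, in whole GB
    # Round up to whole chiplets (8 cores each).
    local by_cores=$(( (cores + 7) / 8 ))
    # Round up to whole 29.44 GB slices; work in hundredths of a GB to
    # stay within integer arithmetic.
    local by_mem=$(( (mem_gb * 100 + 2943) / 2944 ))
    # The larger of the two requirements decides the request.
    local packs=$(( by_cores > by_mem ? by_cores : by_mem ))
    (( packs < 1 )) && packs=1
    echo "$packs"
}

# Example: a job that needs 20 CPU cores and 100 GB of memory on one node.
echo "#SBATCH --nodes=1"
echo "#SBATCH --gres=gpu:$(packs_needed 20 100)"
```

A result above 8 would mean that the need does not fit in a single node, so the request would have to be spread over more nodes.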
Within the full explicit srun options for "managing resources", there are some that help to achieve optimal binding of GPUs to their directly connected chiplet on the CPU. Together with the full explicit srun options, either of the following two methods can be used to achieve this optimal binding:
- Include these two Slurm parameters: --gpus-per-task=<number> together with --gpu-bind=closest
- "Manual" optimal binding with the use of "two auxiliary techniques" (explained later in the main document).
The first method is simpler, but may still produce execution errors for some codes. "Manual" binding may be the only useful method for codes relying on OpenMP or OpenACC pragmas for moving data between host and GPU while attempting to use GPU-to-GPU enabled MPI communication. An example of such a code is Slate.
The following table provides some examples that will serve as a guide for requesting resources in the GPU nodes. Most of the examples in the table are for typical jobs where multiple GPUs are allocated to the job as a whole, but each of the tasks spawned by srun is bound to, and has direct access to, only 1 GPU. For applications that require multiple GPUs per task, there are 3 examples (*4, *5 & *7) where tasks are bound to multiple GPUs:
| Required resources per job | New "simplified" way of requesting resources | Total allocated resources | Charge per hour | The use of full explicit srun options is now required (only the 1st method for optimal binding is listed here) |
| 1 CPU task (single CPU thread) controlling 1 GCD (Slurm GPU) | #SBATCH --nodes=1; #SBATCH --gres=gpu:1 | 1 allocation-pack = 1 GPU, 8 CPU cores (1 chiplet), 29.44 GB CPU RAM | 64 SU | *1 srun -N 1 -n 1 -c 8 --gres=gpu:1 --gpus-per-task=1 --gpu-bind=closest <executable> |
| 1 CPU task (with 14 CPU threads), all threads controlling the same 1 GCD | #SBATCH --nodes=1; #SBATCH --gres=gpu:2 | 2 allocation-packs = 2 GPUs, 16 CPU cores (2 chiplets), 58.88 GB CPU RAM | 128 SU | *2 srun -N 1 -n 1 -c 16 --gres=gpu:1 --gpus-per-task=1 --gpu-bind=closest <executable> |
| 3 CPU tasks (single thread each), each controlling 1 GCD with GPU-aware MPI communication | #SBATCH --nodes=1; #SBATCH --gres=gpu:3 | 3 allocation-packs = 3 GPUs, 24 CPU cores (3 chiplets), 88.32 GB CPU RAM | 192 SU | *3 srun -N 1 -n 3 -c 8 --gres=gpu:3 --gpus-per-task=1 --gpu-bind=closest <executable> |
| 2 CPU tasks (single thread each), each task controlling 2 GCDs with GPU-aware MPI communication | #SBATCH --nodes=1; #SBATCH --gres=gpu:4 | 4 allocation-packs = 4 GPUs, 32 CPU cores (4 chiplets), 117.76 GB CPU RAM | 256 SU | *4 export MPICH_GPU_SUPPORT_ENABLED=1; srun -N 1 -n 2 -c 16 --gres=gpu:4 --gpus-per-task=2 --gpu-bind=closest <executable> |
| 5 CPU tasks (single thread each), all threads/tasks able to see all 5 GPUs | #SBATCH --nodes=1; #SBATCH --gres=gpu:5 | 5 allocation-packs = 5 GPUs, 40 CPU cores (5 chiplets), 147.2 GB CPU RAM | 320 SU | *5 srun -N 1 -n 5 -c 8 --gres=gpu:5 <executable> |
| 8 CPU tasks (single thread each), each controlling 1 GCD with GPU-aware MPI communication | #SBATCH --nodes=1; #SBATCH --exclusive | 8 allocation-packs = 8 GPUs, 64 CPU cores (8 chiplets), 235 GB CPU RAM | 512 SU | *6 export MPICH_GPU_SUPPORT_ENABLED=1; srun -N 1 -n 8 -c 8 --gres=gpu:8 --gpus-per-task=1 --gpu-bind=closest <executable> |
| 8 CPU tasks (single thread each), each controlling 4 GCDs with GPU-aware MPI communication | #SBATCH --nodes=4; #SBATCH --exclusive | 32 allocation-packs = 4 nodes, each with: 8 GPUs, 64 CPU cores (8 chiplets), 235 GB CPU RAM | 2048 SU | *7 export MPICH_GPU_SUPPORT_ENABLED=1; srun -N 4 -n 8 -c 32 --gres=gpu:8 --gpus-per-task=4 --gpu-bind=closest <executable> |
| 1 CPU task (single thread), controlling 1 GCD but preventing other jobs from running in the same node, for an ideal performance test | #SBATCH --nodes=1; #SBATCH --exclusive | 8 allocation-packs = 8 GPUs, 64 CPU cores (8 chiplets), 235 GB CPU RAM | 512 SU | *8 export OMP_NUM_THREADS=1; srun -N 1 -n 1 -c 8 --gres=gpu:1 --gpus-per-task=1 --gpu-bind=closest <executable> |
Notes for the request of resources:
- Note that this simplified way of requesting resources is based on requesting a number of "allocation-packs", so the standard use of Slurm parameters for allocation should not be used for GPU resources.
- The --nodes (-N) option indicates the number of nodes requested to be allocated.
- The --gres=gpu:number option indicates the number of allocation-packs requested to be allocated per node. (The "equivalent" option --gpus-per-node=number is not recommended as we have found some bugs with its use.)
- The --exclusive option requests all the resources from the number of requested nodes. When this option is used, there is no need for the use of --gres=gpu:number during allocation and, indeed, its use is not recommended in this case.
- Users should not include any other Slurm allocation option that may indicate some "calculation" of required memory or CPU cores. The management of resources should only be performed after allocation via srun options.
- The same simplified resource request should be used for the request of interactive sessions with salloc.
- IMPORTANT: In addition to the request parameters shown in the table, users should indeed use other Slurm request parameters related to partition, walltime, job naming, output, email, etc. (Check the examples of the full Slurm batch scripts.)
Notes for the use/management of resources with srun:
- Note that, for the case of srun, options do have the standard Slurm meaning.
- The following options need to be explicitly provided to srun and not assumed to be inherited with some default value from the allocation request:
  - The --nodes (-N) option indicates the number of nodes to be used by the srun step.
  - The --ntasks (-n) option indicates the total number of tasks to be spawned by the srun step. By default, tasks are spawned evenly across the number of allocated nodes.
  - The --cpus-per-task (-c) option should be set to multiples of 8 (whole chiplets) to guarantee that srun distributes the resources in "allocation-packs", "reserving" whole chiplets per srun task even if the real number is 1 thread per task. The real number of threads is controlled with the OMP_NUM_THREADS environment variable.
  - The --gres=gpu:number option indicates the number of GPUs per node to be used by the srun step. (The "equivalent" option --gpus-per-node=number is not recommended as we have found some bugs with its use.)
  - The --gpus-per-task option indicates the number of GPUs to be bound to each task spawned by the srun step via the -n option. Note that this option does not allow the GPUs assigned to a task to be shared with other tasks. (See cases *4, *5 and *7 and their notes for non-intuitive cases.)
- And for optimal binding, the following should be used:
  - The --gpu-bind=closest option indicates that the GPUs chosen to be bound to each task should be the optimal (physically closest) ones to the chiplet assigned to each task.
  - IMPORTANT: The use of --gpu-bind=closest will assign optimal binding but may still NOT work, producing execution errors for codes relying on OpenMP or OpenACC pragmas for moving data between host and GPU while attempting to use GPU-to-GPU enabled MPI communication. For those cases, the use of the "manual" optimal binding (method 2) is required. Method 2 is explained later in the main document.
- (*1) This is the only case where srun may work fine with default inherited option values. Nevertheless, it is good practice to always use full explicit options of srun to indicate the resources needed for the executable. In this case, the settings explicitly "reserve" a whole chiplet (-c 8) for the srun task and control the real number of threads with the OMP_NUM_THREADS environment variable. Although the use of gres=gpu, gpus-per-task & gpu-bind is redundant in this case, we keep them to encourage their use, which is strictly needed in most cases (except case *5).
- (*2) The required number of CPU threads per task is 14, and that is controlled with the OMP_NUM_THREADS environment variable. But two full chiplets (-c 16) are still indicated for each srun task.
- (*3) The settings explicitly "reserve" a whole chiplet (-c 8) for each srun task. This provides "one-chiplet-long" separation among each of the CPU cores to be allocated for the tasks spawned by srun (-n 3). The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of GPU to corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.
- (*4) Each task needs to be in direct communication with 2 GCDs. For that, each CPU task reserves "two-full-chiplets". IMPORTANT: The use of -c 16 "reserves" a "two-chiplets-long" separation among the two CPU cores that are to be used (one for each of the srun tasks, -n 2). In this way, each task will be in direct communication with the two logical GPUs in the MI250X card that has optimal connection to the chiplets reserved for each task. The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of GPU to corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.
- (*5) Sometimes, the executable (and not the scheduler) performs all the management of GPUs, as in the case of Tensorflow distributed training and other Machine Learning applications. If all the management logic for the GPUs is performed by the executable, then all the available resources should be exposed to it. IMPORTANT: In this case, the --gpu-bind option should not be provided. Nor should the --gpus-per-task option be provided, as all the available GPUs are to be available to all tasks. The real number of threads is controlled with the OMP_NUM_THREADS variable. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1. These last two settings may not be necessary for applications like Tensorflow.
- (*6) All GPUs in the node are requested, which means all the resources available in the node, via the --exclusive allocation option (there is no need to indicate the number of GPUs per node when using exclusive allocation). The use of -c 8 provides "one-chiplet-long" separation among each of the CPU cores to be allocated for the tasks spawned by srun (-n 8). The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of GPU to corresponding chiplet is indicated with the option --gpu-bind=closest. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1.
- (*7) All resources in each node are requested via the --exclusive allocation option (there is no need to indicate the number of GPUs per node when using exclusive allocation). Each task needs to be in direct communication with 4 GCDs. For that, each CPU task reserves "four-full-chiplets". IMPORTANT: The use of -c 32 "reserves" a "four-chiplets-long" separation among the two CPU cores that are to be used per node (8 srun tasks in total, -n 8). The real number of threads is controlled with the OMP_NUM_THREADS variable. The requirement of optimal binding of GPU to corresponding chiplet is indicated with the option --gpu-bind=closest. In this way, each task will be in direct communication with the closest four logical GPUs in the node with respect to the chiplets reserved for each task. And, in order to allow GPU-aware MPI communication, the environment variable MPICH_GPU_SUPPORT_ENABLED is set to 1. The --gres=gpu:8 option assigns 8 GPUs per node to the srun step (32 GPUs in total as 4 nodes are being assigned).
- (*8) All GPUs in the node are requested using the --exclusive option, but only 1 CPU chiplet - 1 GPU "unit" (or allocation-pack) is used in the srun step.
General notes: - The allocation charge is for the total of allocated resources and not for the ones that are explicitly used in the execution, so all idle resources will also be charged
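The charge arithmetic behind the table can be sketched as follows. This is only an illustration: the function name gpu_job_charge is hypothetical, and the rate of 64 SU per allocation-pack per hour is read off the "Charge per hour" column above.

```bash
#!/bin/bash
# Hypothetical helper: estimate the charge of a GPU job in Service Units (SU).
# Assumes 64 SU per allocation-pack per hour, as in the table above, and that
# an exclusive node always counts as 8 allocation-packs.
gpu_job_charge() {
    local nodes=$1            # number of nodes allocated
    local packs_per_node=$2   # allocation-packs per node (8 if --exclusive)
    local hours=$3            # walltime in whole hours
    echo $(( nodes * packs_per_node * 64 * hours ))
}

gpu_job_charge 1 3 1   # 3 allocation-packs for 1 hour
gpu_job_charge 4 8 1   # 4 exclusive nodes for 1 hour
```

The charge depends only on what was allocated, so the last case in the table (1 task on an exclusive node) costs the same as a job that uses all 8 GCDs of the node.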
Note that the examples above are just for quick reference and that they do not show the use of the 2nd method for optimal binding (which may be the only way to achieve optimal binding for some applications). So, the rest of this page describes both methods of optimal binding in detail and also shows full job script examples for their use on Setonix GPU nodes.
Methods to achieve optimal binding of GCDs/GPUs
As mentioned above, and as the node diagram at the top of the page suggests, the optimal placement of GCDs and CPU cores for each task is the one that gives direct communication between the CPU chiplet and the GCD in use. So, according to the node diagram, tasks being executed on cores in Chiplet 0 should be using GPU 4 (Bus D1), tasks in Chiplet 1 should be using GPU 5 (Bus D6), etc.
Method 1: Use of srun parameters for optimal binding
This is the most intuitive (and simple) method for achieving optimal placement of CPUs and GPUs in each task spawned by srun. This method consists of providing the --gpus-per-task and the --gpu-bind=closest parameters. So, for example, in a job that requires the use of 8 CPU tasks (single threaded) with 1 GPU per task, the srun command to be used is:
srun -N 1 -n 8 -c 8 --gres=gpu:8 --gpus-per-task=1 --gpu-bind=closest myMPIExecutable
The explanation of this method will be completed in the following sections, where a very useful code (hello_jobstep) will be used to confirm optimal (or sub-optimal, or incorrect) binding of GCDs (Slurm GPUs) and chiplets for srun job steps. Other examples of its use are already listed in the table in the subsection above, and its use in full scripts will be provided at the end of this page.
It is important to be aware that this method works fine for most codes, but not for all. Codes suffering MPI communication errors with this method should try the "manual" binding method described next.
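The pattern followed by the method 1 srun lines in the table can be sketched as a small helper that prints (but does not run) the command. The function name srun_line_method1 is hypothetical, and the sketch assumes that tasks are spread evenly across nodes and that each task reserves whole chiplets for both its GCDs and its threads.

```bash
#!/bin/bash
# Hypothetical helper: print (not run) the full explicit srun line for method 1.
# Arguments: nodes, total tasks, GPUs (GCDs) per task, CPU threads per task.
srun_line_method1() {
    local nodes=$1 ntasks=$2 gpus_per_task=$3 threads=$4
    # Chiplets to reserve per task: enough for its threads (8 cores each) ...
    local chiplets=$(( (threads + 7) / 8 ))
    # ... and at least one chiplet per GCD bound to the task.
    (( chiplets < gpus_per_task )) && chiplets=$gpus_per_task
    # Tasks are assumed to be spread evenly, so the step uses this many GPUs per node.
    local gpus_per_node=$(( ntasks / nodes * gpus_per_task ))
    echo "srun -N $nodes -n $ntasks -c $(( chiplets * 8 ))" \
         "--gres=gpu:$gpus_per_node --gpus-per-task=$gpus_per_task" \
         "--gpu-bind=closest"
}

# Example: 8 single-threaded tasks on 1 node, 1 GCD per task.
export OMP_NUM_THREADS=1
srun_line_method1 1 8 1 "$OMP_NUM_THREADS"
```

The printed line can then be completed with the executable name. Printing instead of executing keeps the sketch usable outside a Slurm allocation.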
Method 2: "Manual" method for optimal binding
We acknowledge that the use of this method to control CPU and GCD placement was initially taken from the LUMI supercomputing documentation at CSC. From there, we have further automated parts of it for its use in shared GPU nodes. We are very thankful to LUMI staff for their collaborative support in the use and configuration of Setonix.
For codes relying on OpenMP or OpenACC pragmas for moving data between host and GCD while attempting to use GPU-to-GPU (GCD-to-GCD) enabled MPI communication, the first method may fail, giving errors similar to:
$ srun -N 1 -n 8 -c 8 --gres=gpu:8 --gpus-per-task=1 --gpu-bind=closest ./myCode_mpiGPU.exe
GTL_DEBUG: [0] hsa_amd_ipc_memory_attach (in gtlt_hsa_ops.c at line 1544):
HSA_STATUS_ERROR_INVALID_ARGUMENT: One of the actual arguments does not meet a precondition stated in the documentation of the corresponding formal argument.
MPICH ERROR [Rank 0] [job id 339192.2] [Mon Sep 4 13:00:27 2023] [nid001004] - Abort(407515138) (rank 0 in comm 0):
Fatal error in PMPI_Waitall: Invalid count, error stack:
For these codes, the alternative is to use a "manual" method. This second method is more elaborate than the first but, as said, may be the only option for some codes.
In this "manual" method, the --gpus-per-task and the --gpu-bind parameters (key of the first method) should NOT be provided. Instead of those two parameters, we use two auxiliary techniques:
- A wrapper script that sets a single and different value of the ROCR_VISIBLE_DEVICES variable for each srun task, thus assigning a single and different GCD (logical/Slurm GPU) per task.
- An ordered list of CPU cores in the --cpu-bind option of srun to explicitly indicate the CPU cores where each task will be placed.
These two auxiliary techniques work in coordination to ensure the best possible match of CPU cores and GCDs.
Auxiliary technique 1: Using a wrapper to select 1 different GCD (logical/Slurm GPU) for each of the tasks spawned by srun
This first auxiliary technique uses the following wrapper script:
#!/bin/bash
export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID
exec $*
(Note that the wrapper needs to have execution permissions. The command "chmod 755", or similar, will do the job.)
The wrapper script sets the ROCm environment variable ROCR_VISIBLE_DEVICES to the value of the Slurm environment variable SLURM_LOCALID. It then executes the rest of the parameters given to the script, which are the usual execution instructions for the program intended to be executed. The SLURM_LOCALID variable holds the identification number of the task within each of the nodes (not a global identification, but an identification number local to the node). Further details about the variable are available in the Slurm documentation.
The wrapper should be called first and then the executable (and its parameters, if any). For example, in a job that requires the use of 8 CPU tasks (single threaded) with 1 GCD (logical/Slurm GPU) per task, the srun command to be used is:
srun -N 1 -n 8 -c 8 --gres=gpu:8 --cpu-bind=${CPU_BIND} ./ myMPIExecutable
The wrapper will be run by each of the 8 tasks spawned by srun (-n 8) and will assign a different and single value to ROCR_VISIBLE_DEVICES for each of the tasks. The task with SLURM_LOCALID=0 will receive GCD 0 (Bus C1) as the only visible Slurm GPU for the task, the task with SLURM_LOCALID=1 will receive GCD 1 (Bus C6), and so forth.
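The effect of the wrapper can be tried without a GPU node by emulating the value that Slurm would give to SLURM_LOCALID. The following is a minimal sketch written as a shell function (the name select_gcd_and_run is hypothetical); on Setonix the logic lives in a separate wrapper script launched by srun, and SLURM_LOCALID is set by Slurm for each task.

```bash
#!/bin/bash
# Sketch of the wrapper logic as a shell function, so that it can be tried
# without a GPU node. The function name is hypothetical.
select_gcd_and_run() {
    # Expose only the GCD whose index equals the node-local task ID.
    export ROCR_VISIBLE_DEVICES="${SLURM_LOCALID:?SLURM_LOCALID is not set}"
    # Run the real command together with its arguments.
    "$@"
}

# Dry run: emulate three tasks on one node by setting SLURM_LOCALID by hand.
for local_id in 0 1 2; do
    (
        export SLURM_LOCALID=$local_id
        select_gcd_and_run bash -c \
            'echo "task $SLURM_LOCALID -> ROCR_VISIBLE_DEVICES=$ROCR_VISIBLE_DEVICES"'
    )
done
```

The sketch passes the command through with "$@" so that arguments containing spaces reach the executable intact, and it stops with an error message if it is run outside an srun step where SLURM_LOCALID is undefined.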
As mentioned above, the "manual" method consists of two auxiliary techniques working together. The second technique consists of providing an ordered list of the desired CPU cores to be bound to the tasks. The --cpu-bind=${CPU_BIND} option controls that binding, as detailed in the following sub-section.
Auxiliary technique 2: Using a list of CPU cores to control task placement
This second auxiliary technique uses an ordered list of CPU cores to be bound to each of the tasks spawned by srun. An example of a "hardcoded" ordered list that would correctly bind the 8 GCDs across the 4 GPU cards in a node would be:
(Note that the "map_cpu:" prefix of such a list is a Slurm indicator of the type of binding to be used. Please read the Slurm documentation for further details.)
According to the node diagram at the top of this page, it is clear that this list consists of 1 CPU core per chiplet. What may not be very intuitive is the ordering. But after a second look, it can be seen that the order follows the identification numbers of the GCDs (logical/Slurm GPUs) in the node, so that each of the CPU cores correspond to the chiplet that is directly connected to each of the GCDs (in order). Then, the set of commands to use for a job that requires the use of 8 CPU tasks (single threaded) with 1 GCD (logical/Slurm GPU) per task would be:
srun -N 1 -n 8 -c 8 --gres=gpu:8 --cpu-bind=${CPU_BIND} ./ myMPIExecutable
This provides the optimal binding in a job that requires the use of 8 CPU tasks (single threaded) with 1 GCD (logical/Slurm GPU) per task.
For jobs that are hybrid, that is, that require multiple CPU threads per task, the list needs to be modified to be a list of masks instead of CPU core IDs. The explanation of the use of this list of masks will be given in the next subsection that also describes the use of an auxiliary script to generate the lists of CPU cores or mask for general cases.
For jobs that request exclusive use of the GPU nodes, the settings described in the example so far are enough for achieving optimal binding with the "manual" method. This works because the identification numbers of all the GPUs and the CPU cores that will be assigned to the job are known before hand (as all the resources of the node are what is requested). But when the job requires a reduced amount of resources, so that the request shares the rest of the node with other jobs, the GPUs and CPU cores that are to be allocated to the job are not known before submitting the script for execution. And, therefore, a "hardcoded" list of CPU cores that will always work to achieve optimal binding cannot be defined beforehand. To avoid this problem, for jobs that request resources in shared nodes, we provide a script that can generate the correct list once the job starts execution.
Use of script for generating an ordered list of CPU cores for optimal binding
The generation of the ordered list to be used with the --cpu-bind option of srun can be automated with a script, which is available by default to all users through the module pawseytools (loaded by default).
The use of the script is only meaningful on GPU nodes, and it will report errors if executed on CPU nodes, like: ERROR:root:Driver not initialized (amdgpu not found in modules) or similar.
The script receives one parameter (map_cpu OR mask_cpu) and gives back the best ordered list of CPU cores or CPU masks for optimal communication between tasks and GPUs.
For a better understanding of what this script generates and how it is useful, we can use an interactive session. Note that the request parameters given to salloc only include the number of nodes and the number of Slurm GPUs (GCDs) per node, to request a number of "allocation-packs" (as described at the top of this page). In this case, 3 "allocation-packs" are requested:
$ salloc -N 1 --gres=gpu:3 -A yourProject-gpu --partition=gpu-dev
salloc: Granted job allocation 1370877
$ scontrol show jobid $SLURM_JOBID
JobId=1370877 JobName=interactive
UserId=quokka(20146) GroupId=quokka(20146) MCS_label=N/A
Priority=16818 Nice=0 Account=rottnest0001-gpu QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:48 TimeLimit=01:00:00 TimeMin=N/A
SubmitTime=16:45:41 EligibleTime=16:45:41
StartTime=16:45:41 EndTime=17:45:41 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=16:45:41 Scheduler=Main
Partition=gpu AllocNode:Sid=joey-02:253180
ReqNodeList=(null) ExcNodeList=(null)
NumNodes=1 NumCPUs=48 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
$ rocm-smi --showhw
======================= ROCm System Management Interface =======================
============================ Concise Hardware Info =============================
0 7408 DISABLED ENABLED DISABLED 113-D65201-042 0000:C9:00.0
1 7408 DISABLED ENABLED DISABLED 113-D65201-042 0000:D1:00.0
2 7408 DISABLED ENABLED DISABLED 113-D65201-042 0000:D6:00.0
============================= End of ROCm SMI Log ==============================
$ map_cpu
$ mask_cpu
mask_cpu:0000000000FF0000,00000000000000FF,000000000000FF00
As can be seen, 3 "allocation-packs" were requested, and the total amount of allocated resources is written in the output of the scontrol command, including the 3 GCDs (logical/Slurm GPUs) and 88.32GB of memory. The rocm-smi command gives a list of the three allocated devices, listed locally as GPU:0-BUS_ID:C9, GPU:1-BUS_ID:D1 and GPU:2-BUS_ID:D6.
When the script is used with the parameter map_cpu, it creates a list of CPU cores that can be used in the srun command for optimal binding. In this case, we get: map_cpu:21,2,14, whose entries, in order, correspond to the slurm-sockets chiplet2,chiplet0,chiplet1, which are the ones in direct connection to the C9,D1,D6 GCDs respectively. (Check the GPU node architecture diagram at the top of this page.)
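The correspondence between the cores in a map_cpu list and their chiplets can be checked with integer division. This is a minimal sketch with a hypothetical function name, assuming (as the masks discussed on this page imply) that chiplet N holds CPU cores 8*N to 8*N+7.

```bash
#!/bin/bash
# Hypothetical helper: translate a Slurm "map_cpu:" list into chiplet numbers,
# assuming chiplet N holds CPU cores 8*N to 8*N+7.
map_cpu_to_chiplets() {
    local list=${1#map_cpu:}    # drop the "map_cpu:" prefix
    local core chiplets=()
    local IFS=,                 # split (and later join) on commas
    for core in $list; do
        chiplets+=( "chiplet$(( core / 8 ))" )
    done
    echo "${chiplets[*]}"       # elements joined with the first character of IFS
}

# The list obtained in the interactive session above.
map_cpu_to_chiplets "map_cpu:21,2,14"
```

For the list from the interactive session, the result is chiplet2,chiplet0,chiplet1, which is the order of the C9, D1 and D6 GCDs reported by rocm-smi.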
For jobs that require several threads per CPU task, srun would need a list of masks instead of CPU core IDs. The script can generate this list when the parameter mask_cpu is used: it then creates a list of hexadecimal CPU masks that can be used for optimally binding a hybrid job. In this case, we get: mask_cpu:0000000000FF0000,00000000000000FF,000000000000FF00. These masks, in order, activate only the CPU cores of chiplet2, chiplet0 & chiplet1, which are the ones in direct connection to the C9,D1,D6 GCDs respectively. (Check the GPU node architecture diagram at the top of this page and the external Slurm documentation for a detailed explanation of masks.)
An extensive documentation about the use of masks is in the online documentation of Slurm, but a brief explanation can be given here. First thing to notice is that masks have 16 hexadecimal characters and each of the characters can be understood as an hexadecimal "mini-mask" that correspond to 4 CPU-cores. Then, a pair of characters will cover 8 CPU-cores, that is: each pair of characters represents a chiplet. Then, for example, the second mask in the list (00000000000000FF
) disables all the cores of the CPU for their use by the second MPI task, and only make available the first 8 cores, which correspond to chiplet0
. (Remember to read numbers with the usual increase in hierarchy: right to left.) Then, the first character (right to left) is the hexadecimal mini-mask of CPU cores C00-C03
, and the second character (right to left) is the hexadecimal mini-mask of CPU cores C04-C07
To understand what each hexadecimal character means, we need to convert it to a binary number. Consider first a hypothetical example: suppose one would like to make available only the third (C02) and the fourth (C03) CPU-cores, and uses a binary number as a mini-mask in which each bit marks a core as available (1) or unavailable (0). Again with increasing hierarchy from right to left, the binary mini-mask would be "1100" (third and fourth cores available). This binary mini-mask represents the decimal number "12", and the hexadecimal mini-mask is "C". If all 4 cores of the mini-mask are to be available to the task, the binary mini-mask is "1111", which represents the decimal number "15", and the hexadecimal mini-mask is "F". With this in mind, it can be seen that the full masks in the original list make available only the cores of chiplet2 (and nothing else) to the first task (and its threads) spawned by srun, only the cores of chiplet0 to the second task, and only the cores of chiplet1 to the third task.
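The construction of the masks described above can be sketched in a few lines of shell. This is only an illustration: the helper names mini_mask and chiplet_mask are hypothetical (they are not part of the generator script provided by Pawsey), and the sketch assumes the layout explained above, that is, 16 hexadecimal characters per mask, 4 CPU-cores per character and 8 CPU-cores per chiplet.

```bash
# Illustrative sketch only: mini_mask and chiplet_mask are hypothetical helper
# names, not part of the generator script provided by Pawsey.

# Hexadecimal mini-mask for one group of 4 cores.
# Arguments: positions (0-3) of the cores to make available within the group.
mini_mask() (
    value=0
    for pos in "$@"; do
        value=$(( value | (1 << pos) ))   # set the bit of each available core
    done
    printf '%X\n' "$value"
)

# Full 16-character mask that makes available only the 8 cores of one chiplet.
# Argument: chiplet index (0-7).
chiplet_mask() (
    chiplet=$1
    mask=""
    k=7
    while [ "$k" -ge 0 ]; do              # written left to right: chiplet7 ... chiplet0
        if [ "$k" -eq "$chiplet" ]; then
            mask="${mask}FF"              # both mini-masks of this chiplet fully enabled
        else
            mask="${mask}00"              # all cores of this chiplet disabled
        fi
        k=$(( k - 1 ))
    done
    printf '%s\n' "$mask"
)

mini_mask 2 3        # third and fourth cores of the group (binary 1100)
mini_mask 0 1 2 3    # all four cores of the group (binary 1111)
chiplet_mask 2       # only the cores of chiplet2
```

The three calls print C, F and 0000000000FF0000, which match the mini-masks and the first mask of the list discussed above.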
In practice, it is common to use the output provided by the
script and assign it to a variable which is then used within the srun
command. So, for a job that requires 8 CPU tasks (single threaded) with 1 GCD (logical/Slurm GPU) per task, the set of commands to be used would be:
CPU_BIND=$( map_cpu)
srun -N 1 -n 8 -c 8 --gres=gpu:8 --cpu-bind=${CPU_BIND} ./ myMPIExecutable
Note that the
wrapper is part of the first auxiliary technique of the "manual" method of optimal binding and is described in the sub-sections above.
The explanation of the "manual" method will be completed in the following sections where a very useful code (hello_jobstep
) will be used to confirm optimal (or sub-optimal) binding of GCDs (logical Slurm GPUs) and chiplets for srun
job steps.
(If users want to list the generation script in order to check the logic within, they can use the following command:
cat $(which
MPI & OpenMP settings
to control the threads launched per task
As mentioned in the previous section, resources are allocated in "allocation-packs" with 8 cores (1 chiplet) per GCD. Also briefly mentioned in the previous section is the need to "reserve" whole chiplets (multiples of 8 CPU cores) in the srun command via the --cpus-per-task (-c) option. In srun, however, this option acts as a "reservation" parameter that binds the srun tasks to whole chiplets, rather than as an indication of the real number of threads to be used by the executable. The real number of threads needs to be controlled with the OpenMP environment variable OMP_NUM_THREADS. In other words, we use --cpus-per-task to make whole chiplets available to each srun task, but use OMP_NUM_THREADS to control the real number of threads per srun task.
For pure MPI-GPU jobs, it is recommended to set OMP_NUM_THREADS=1 before executing the srun command, to avoid unexpected use of OpenMP threads:
export OMP_NUM_THREADS=1
srun ... -c 8 ...
For GPU codes with hybrid management on the CPU side (MPI + OpenMP + GPU), the environment variable needs to be set to the required number of threads per MPI task. For example, if 4 threads per task are required, then the settings should be:
export OMP_NUM_THREADS=4
srun ... -c 8 ...
Also mentioned above is the case where the real number of threads is 14 (which is greater than 8) and therefore requires more than one chiplet. In that case, srun should reserve the number of chiplets per task that satisfies the demand, using multiples of 8 in the --cpus-per-task (-c) option, together with setting the real number of threads with the OMP_NUM_THREADS environment variable:
export OMP_NUM_THREADS=14
srun ... -c 16 ...
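The rule of reserving whole chiplets can be written as a small calculation: round the number of threads up to the next multiple of 8. The following is a minimal sketch under that assumption; the helper name cpus_per_task is hypothetical and is introduced here only for illustration.

```bash
# Illustrative sketch only: cpus_per_task is a hypothetical helper name.
# It rounds the real number of threads up to whole chiplets (8 cores each)
# to obtain the value for the -c (--cpus-per-task) option of srun.
cpus_per_task() (
    threads=$1
    cores_per_chiplet=8
    chiplets=$(( (threads + cores_per_chiplet - 1) / cores_per_chiplet ))
    echo $(( chiplets * cores_per_chiplet ))
)

export OMP_NUM_THREADS=14
echo "srun ... -c $(cpus_per_task "$OMP_NUM_THREADS") ..."
```

With 14 threads this prints "srun ... -c 16 ...", while 1, 4 or 8 threads all give "-c 8".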
To use GPU-aware Cray MPICH, users must load the following modules and set the following environment variable:
module load craype-accel-amd-gfx90a
module load rocm/<VERSION>
export MPICH_GPU_SUPPORT_ENABLED=1
Test code: hello_jobstep
In this page, an MPI+OpenMP+HIP "Hello, World" program (hello_jobstep) will be used to clarify the placement of tasks on CPU-cores and the associated GPU bindings.
We acknowledge Tom Papatheodore and Oak Ridge National Lab (ORNL) for allowing Pawsey to fork the repository of this useful code and to use it within our own documentation and training material. We also acknowledge the very useful information available in the ORNL documentation for systems similar to Setonix, particularly the Crusher system.
Later in this page, some full examples of batch scripts for the most common scenarios for executing jobs on GPU nodes are presented. In order to show how GCDs are bound to the CPU cores assigned to the job, we make use of the hello_jobstep
code within these same examples. For this reason, before presenting the full example, we use this section to explain important details of the test code. (If researchers want to test the code by themselves, this is the forked repository for Pawsey: hello_jobstep repository.)
Compilation and basic use of hello_jobstep
test code
The test code is explained using the output of an interactive session that uses 3 "allocation-packs" to get access to 3 GCDs (logical/Slurm GPUs) and 3 full CPU chiplets in different ways.
The first step is to create the session and check that the resources were granted as 3 allocation-packs:
$ salloc -N 1 --gres=gpu:3 -A <yourProject>-gpu --partition=gpu-dev
salloc: Granted job allocation 339185
$ scontrol show jobid $SLURM_JOBID
JobId=339185 JobName=interactive
UserId=quokka(20146) GroupId=quokka(20146) MCS_label=N/A
Priority=16818 Nice=0 Account=rottnest0001-gpu QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
RunTime=00:00:48 TimeLimit=01:00:00 TimeMin=N/A
SubmitTime=16:45:41 EligibleTime=16:45:41
StartTime=16:45:41 EndTime=17:45:41 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=16:45:41 Scheduler=Main
Partition=gpu AllocNode:Sid=joey-02:253180
ReqNodeList=(null) ExcNodeList=(null)
NumNodes=1 NumCPUs=48 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:1 CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
TresPerNode=gres:gpu:3
Now compile the code:
$ git clone
Cloning into 'hello_jobstep'...
Resolving deltas: 100% (41/41), done.
$ cd hello_jobstep
$ module load PrgEnv-cray craype-accel-amd-gfx90a rocm/<VERSION>
$ make hello_jobstep
CC -std=c++11 -fopenmp --rocm-path=/opt/rocm -x hip -D__HIP_ARCH_GFX90A__=1 --offload-arch=gfx90a -I/opt/rocm/include -c hello_jobstep.cpp
CC -fopenmp --rocm-path=/opt/rocm -L/opt/rocm/lib -lamdhip64 hello_jobstep.o -o hello_jobstep
Now check the devices available in the current allocation, specifically their BUS_ID:
$ rocm-smi --showhw
======================= ROCm System Management Interface =======================
============================ Concise Hardware Info =============================
0 7408 DISABLED ENABLED DISABLED 113-D65201-042 0000:C9:00.0
1 7408 DISABLED ENABLED DISABLED 113-D65201-042 0000:D1:00.0
2 7408 DISABLED ENABLED DISABLED 113-D65201-042 0000:D6:00.0
============================= End of ROCm SMI Log ==============================
Using hello_jobstep
code for testing a non-recommended practice
In a first test, we observe what happens when no "management" parameters are given to srun
. So, in this "non-recommended" setting, the output is:
$ export OMP_NUM_THREADS=1; srun -N 1 -n 3 ./hello_jobstep | sort -n
MPI 000 - OMP 000 - HWT 000 - Node nid001004 - RunTime_GPU_ID 0,1,2 - ROCR_VISIBLE_GPU_ID 0,1,2 - GPU_Bus_ID c9,d1,d6
MPI 001 - OMP 000 - HWT 001 - Node nid001004 - RunTime_GPU_ID 0,1,2 - ROCR_VISIBLE_GPU_ID 0,1,2 - GPU_Bus_ID c9,d1,d6
MPI 002 - OMP 000 - HWT 002 - Node nid001004 - RunTime_GPU_ID 0,1,2 - ROCR_VISIBLE_GPU_ID 0,1,2 - GPU_Bus_ID c9,d1,d6
As can be seen, the scheduler may place all the MPI tasks on the same chiplet (here, HWT 000-002 all belong to chiplet0), which is not a recommended practice. Also, all three allocated GCDs (logical/Slurm GPUs) are visible to each of the tasks. Although some codes are able to deal with resources presented this way, it is not the recommended best practice. The recommended best practice is to assign CPU tasks to different chiplets, to provide only 1 GCD per task and, furthermore, to provide the optimal bandwidth between CPU and GCD.
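A quick way to spot this situation is to derive the chiplet of each task from its HWT number and count how many MPI tasks land on each chiplet. The following is a minimal sketch, not a Pawsey tool: the function name find_shared_chiplets is hypothetical, and the sketch assumes the hello_jobstep output format shown above and that chiplet N holds cores 8*N to 8*N+7 (HWT values in the range 0-63).

```bash
# Illustrative sketch only: find_shared_chiplets is a hypothetical helper name.
# Reads hello_jobstep output lines and reports chiplets used by more than one MPI task.
find_shared_chiplets() {
    awk '
        {
            for (i = 1; i < NF; i++) {
                if ($i == "MPI") rank = $(i + 1)
                if ($i == "HWT") chiplet = int(($(i + 1) + 0) / 8)   # 8 cores per chiplet
            }
            if (!((chiplet, rank) in seen)) {   # count each MPI task once per chiplet
                seen[chiplet, rank] = 1
                ranks[chiplet]++
            }
        }
        END {
            for (c in ranks)
                if (ranks[c] > 1)
                    printf "chiplet%d is shared by %d MPI tasks\n", c, ranks[c]
        }'
}

# Demonstration with the three output lines of the non-recommended test above:
printf '%s\n' \
    'MPI 000 - OMP 000 - HWT 000 - Node nid001004 - RunTime_GPU_ID 0,1,2 - ROCR_VISIBLE_GPU_ID 0,1,2 - GPU_Bus_ID c9,d1,d6' \
    'MPI 001 - OMP 000 - HWT 001 - Node nid001004 - RunTime_GPU_ID 0,1,2 - ROCR_VISIBLE_GPU_ID 0,1,2 - GPU_Bus_ID c9,d1,d6' \
    'MPI 002 - OMP 000 - HWT 002 - Node nid001004 - RunTime_GPU_ID 0,1,2 - ROCR_VISIBLE_GPU_ID 0,1,2 - GPU_Bus_ID c9,d1,d6' \
    | find_shared_chiplets
```

For these three lines the sketch prints "chiplet0 is shared by 3 MPI tasks". In a real session, the output of srun could be piped into the function in the same way; an empty result means every MPI task is on its own chiplet.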
Using hello_jobstep
code for testing optimal binding for a pure MPI job (single threaded) 1 GPU per task
Starting from the same allocation as above (3 "allocation-packs"), now all the parameters needed to define the correct use of resources are provided to srun
. In this case, 3 MPI tasks are to be run (single threaded), each task making use of 1 GCD (logical/Slurm GPU). As described above, there are two methods to achieve optimal binding. The first method only uses Slurm parameters to indicate how resources are to be used by srun
. In this case:
$ export OMP_NUM_THREADS=1; srun -N 1 -n 3 -c 8 --gres=gpu:3 --gpus-per-task=1 --gpu-bind=closest ./hello_jobstep | sort -n
MPI 000 - OMP 000 - HWT 002 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d1
MPI 001 - OMP 000 - HWT 009 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d6
MPI 002 - OMP 000 - HWT 017 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
As can be seen, GPU-BUS_ID:D1 is in direct communication with a CPU-core in chiplet0, GPU-BUS_ID:D6 with a CPU-core in chiplet1, and GPU-BUS_ID:C9 with a CPU-core in chiplet2, resulting in an optimal 1 chiplet to 1 GCD binding.
A similar result can be obtained with the "manual" method for optimal binding. As detailed in sub-sections above, this method uses a wrapper (
, listed above) to define which GCD (logical/Slurm GPU) is going to be visible to each task, and also uses an ordered list of CPU cores (created with the script
, also described above) to bind the correct CPU core to each task. In this case:
$ CPU_BIND=$( map_cpu)
$ echo $CPU_BIND
$ export OMP_NUM_THREADS=1; srun -N 1 -n 3 -c 8 --gres=gpu:3 --cpu-bind=${CPU_BIND} ./ ./hello_jobstep | sort -n
MPI 000 - OMP 000 - HWT 016 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
MPI 001 - OMP 000 - HWT 003 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 1 - GPU_Bus_ID d1
MPI 002 - OMP 000 - HWT 015 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 2 - GPU_Bus_ID d6
As can be seen, GPU-BUS_ID:C9 is in direct communication with a CPU-core in chiplet2, GPU-BUS_ID:D1 with a CPU-core in chiplet0, and GPU-BUS_ID:D6 with a CPU-core in chiplet1, again resulting in an optimal 1-to-1 binding. (Note that in the "manual" method neither --gpus-per-task nor --gpu-bind is provided to srun.)
There are some differences between the results of the first and second methods of optimal binding shown above. Although the chiplet assigned to a given rank is different here, this is not important since the CPU-to-GCD affinity is optimal. The key difference is in the values of the ROCR_VISIBLE_GPU_IDs. With the first method, these values are always 0 while, in the second method, these values are the ones given by the wrapper that "manually" selects the GPUs. This second difference has proven to be important and may be the reason why the "manual" binding is the only option for codes that rely on OpenMP or OpenACC pragmas for moving data between host and GPU and that attempt to use GPU-to-GPU enabled MPI communication.
Using hello_jobstep
code for testing optimal binding for a hybrid (MPI + several OpenMP threads), 1 GCD (logical/Slurm GPU) per MPI task
If the code is hybrid on the CPU side and needs the use of several OpenMP CPU threads, we then use the OMP_NUM_THREADS
environment variable to control the number of threads. So, again, starting from the previous session with 3 "allocation-packs", consider a case for 3 MPI tasks, 4 OpenMP threads per task and 1 GCD (logical/Slurm GPU) per task:
$ export OMP_NUM_THREADS=4; srun -N 1 -n 3 -c 8 --gres=gpu:3 --gpus-per-task=1 --gpu-bind=closest ./hello_jobstep | sort -n
MPI 000 - OMP 000 - HWT 000 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d1
MPI 000 - OMP 001 - HWT 003 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d1
MPI 000 - OMP 002 - HWT 005 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d1
MPI 000 - OMP 003 - HWT 006 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d1
MPI 001 - OMP 000 - HWT 008 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d6
MPI 001 - OMP 001 - HWT 011 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d6
MPI 001 - OMP 002 - HWT 013 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d6
MPI 001 - OMP 003 - HWT 014 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d6
MPI 002 - OMP 000 - HWT 016 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
MPI 002 - OMP 001 - HWT 019 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
MPI 002 - OMP 002 - HWT 021 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
MPI 002 - OMP 003 - HWT 022 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
When the "manual" optimal binding is required, the mask_cpu parameter needs to be used in the generator script (and in the --cpu-bind option of srun):
$ CPU_BIND=$( mask_cpu)
$ echo $CPU_BIND
$ export OMP_NUM_THREADS=4; srun -N 1 -n 3 -c 8 --gres=gpu:3 --cpu-bind=${CPU_BIND} ./ ./hello_jobstep | sort -n
MPI 000 - OMP 000 - HWT 016 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
MPI 000 - OMP 001 - HWT 018 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
MPI 000 - OMP 002 - HWT 021 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
MPI 000 - OMP 003 - HWT 022 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
MPI 001 - OMP 000 - HWT 000 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 1 - GPU_Bus_ID d1
MPI 001 - OMP 001 - HWT 003 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 1 - GPU_Bus_ID d1
MPI 001 - OMP 002 - HWT 005 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 1 - GPU_Bus_ID d1
MPI 001 - OMP 003 - HWT 006 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 1 - GPU_Bus_ID d1
MPI 002 - OMP 000 - HWT 008 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 2 - GPU_Bus_ID d6
MPI 002 - OMP 001 - HWT 011 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 2 - GPU_Bus_ID d6
MPI 002 - OMP 002 - HWT 013 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 2 - GPU_Bus_ID d6
MPI 002 - OMP 003 - HWT 014 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 2 - GPU_Bus_ID d6
As explained in the previous section, the masks provided by the
script make available only the cores of chiplet2
to the first MPI task and its OpenMP threads, only the cores of chiplet0
to the second task and only the cores of chiplet1
to the third MPI task and its OpenMP threads.
From the output of the hello_jobstep
code, it can be noted that the OpenMP threads use CPU-cores in the same CPU chiplet as the main thread (or MPI task). And all the CPU-cores of the corresponding chiplet are in direct communication with the GCD (logical/Slurm GPU) that has a direct physical connection to it. (Check the architecture diagram at the top of this page.)
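To see which chiplet each mask of a mask_cpu list makes available, the list can be decoded pair by pair, reading from right to left as explained earlier. The following is a minimal sketch under the same assumptions (16 hexadecimal characters per mask, one pair of characters per chiplet); the function name decode_masks is hypothetical and is not part of the Pawsey tools.

```bash
# Illustrative sketch only: decode_masks is a hypothetical helper name.
# Argument: a list in the form "mask_cpu:<mask>,<mask>,..." (16 hex characters per mask).
# Prints, for each task, the chiplets with at least one core enabled.
decode_masks() (
    list=${1#mask_cpu:}                   # drop the "mask_cpu:" prefix
    task=0
    IFS=,                                 # split the list on commas
    for mask in $list; do
        k=0
        while [ "$k" -le 7 ]; do
            start=$(( 15 - 2 * k ))       # chiplet0 is the right-most pair of characters
            end=$(( 16 - 2 * k ))
            pair=$(printf '%s' "$mask" | cut -c "${start}-${end}")
            if [ "$pair" != "00" ]; then
                echo "task $task: chiplet$k (mini-masks $pair)"
            fi
            k=$(( k + 1 ))
        done
        task=$(( task + 1 ))
    done
)

decode_masks "mask_cpu:0000000000FF0000,00000000000000FF,000000000000FF00"
```

For the list used in this page, the sketch reports chiplet2 for task 0, chiplet0 for task 1 and chiplet1 for task 2, in agreement with the HWT numbers in the output above.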
Again, there is a difference in the values of the ROCR_VISIBLE_GPU_IDs in the results of both methods. With the first method, these values are always 0 while, in the second method, these values are the ones given by the wrapper that "manually" selects the GCDs (logical/Slurm GPUs). This difference has proven to be important and may be the reason why the "manual" binding is the only option for codes that rely on OpenMP or OpenACC pragmas for moving data between host and GPU and that attempt to use GPU-to-GPU enabled MPI communication.
Using hello_jobstep
code for testing visibility of all allocated GPUs to each of the tasks
Some codes, like TensorFlow and other machine learning engines, require visibility of all GPU resources so that the code can manage them internally. In that case, optimal binding cannot be provided to the code, and the responsibility for optimal binding and communication among the resources is left entirely to the code. The recommended settings for the srun
command are:
$ export OMP_NUM_THREADS=1; srun -N 1 -n 3 -c 8 --gres=gpu:3 ./hello_jobstep | sort -n
MPI 000 - OMP 000 - HWT 000 - Node nid001004 - RunTime_GPU_ID 0,1,2 - ROCR_VISIBLE_GPU_ID 0,1,2 - GPU_Bus_ID c9,d1,d6
MPI 001 - OMP 000 - HWT 008 - Node nid001004 - RunTime_GPU_ID 0,1,2 - ROCR_VISIBLE_GPU_ID 0,1,2 - GPU_Bus_ID c9,d1,d6
MPI 002 - OMP 000 - HWT 016 - Node nid001004 - RunTime_GPU_ID 0,1,2 - ROCR_VISIBLE_GPU_ID 0,1,2 - GPU_Bus_ID c9,d1,d6
As can be seen, each MPI task is assigned to a different chiplet. Also, all three GCDs (logical/Slurm GPUs) that have been allocated are visible to each of the tasks which, for these codes, is what they need to run properly.
Example scripts for: Exclusive access to the GPU nodes with optimal binding
In this section, a series of example Slurm job scripts is presented for users to use as a starting point for preparing their own scripts. The examples make use of most of the important concepts, tools and techniques explained in the previous section, so we encourage users to read that section at the top of this page first.
Exclusive Node Multi-GPU job: 8 GCDs (logical/Slurm GPUs), each of them controlled by one MPI task
As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. This example considers a job that will make use of the 8 GCDs (logical/Slurm GPUs) on 1 node (8 "allocation-packs"). The resource request uses the following two parameters:
#SBATCH --nodes=1 #1 node in this example
#SBATCH --exclusive #All resources of the node are exclusive to this job
# #8 GPUs per node (8 "allocation-packs" in total for the job)
Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header.
The use/management of the allocated resources is controlled by the srun
options and some environmental variables. As mentioned above, there are two methods for achieving optimal binding. The method that uses only srun
parameters is preferred (method 1), but may not always work and, in that case, the "manual" method (method 2) may be needed. The two scripts for the different methods for optimal binding are in the following tabs:
For optimal binding using srun parameters, the options "--gpus-per-task" & "--gpu-bind=closest" need to be used:
#!/bin/bash --login
#SBATCH --job-name=8GPUExclusiveNode-bindMethod1
#SBATCH --partition=gpu
#SBATCH --nodes=1 #1 node in this example
#SBATCH --exclusive #All resources of the node are exclusive to this job
# #8 GPUs per node (8 "allocation-packs" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=<yourProject>-gpu #IMPORTANT: use your own project and the -gpu suffix
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm/<VERSION> craype-accel-amd-gfx90a
echo -e "\n\n#------------------------#"
module list
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}
#Definition of the executable (we assume the example code has been compiled and is available in $MYSCRATCH):
#MPI & OpenMP settings
export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1 #This controls the real CPU-cores per task for the executable
#Note: srun needs the explicit indication of the full parameters for use of resources in the job step.
# These are independent from the allocation parameters (which are not inherited by srun)
# For optimal GPU binding using slurm options,
# "--gpus-per-task=1" and "--gpu-bind=closest" create the optimal binding of GPUs
# "-c 8" is used to force allocation of 1 task per CPU chiplet. Then, the REAL number of threads
# for the code SHOULD be defined by the environment variables above.
# (The "-l" option is for displaying, at the beginning of each line, the taskID that generates the output.)
# (The "-u" option is for unbuffered output, so that output is displayed as soon as it's generated.)
# (If the output needs to be sorted for clarity, then add "| sort -n" at the end of the command.)
echo -e "\n\n#------------------------#"
echo "Test code execution:"
srun -l -u -N 1 -n 8 -c 8 --gres=gpu:8 --gpus-per-task=1 --gpu-bind=closest ${theExe}
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished jobs steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20
echo -e "\n\n#------------------------#"
echo "Done"
Now, let's take a look at the output after executing the script:
$ sbatch
Submitted batch job 339314
$ cat slurm-339314.out
Test code execution:
0: MPI 000 - OMP 000 - HWT 001 - Node nid001000 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d1
1: MPI 001 - OMP 000 - HWT 008 - Node nid001000 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d6
2: MPI 002 - OMP 000 - HWT 016 - Node nid001000 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
3: MPI 003 - OMP 000 - HWT 024 - Node nid001000 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID ce
4: MPI 004 - OMP 000 - HWT 032 - Node nid001000 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d9
5: MPI 005 - OMP 000 - HWT 040 - Node nid001000 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID de
6: MPI 006 - OMP 000 - HWT 048 - Node nid001000 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c1
7: MPI 007 - OMP 000 - HWT 056 - Node nid001000 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c6
Done
The output of the hello_jobstep code tells us that the job ran on node nid001000 and that 8 MPI tasks were spawned. Each of the MPI tasks has only 1 CPU-core assigned to it (with the use of the OMP_NUM_THREADS environment variable in the script), which can be identified by the HWT number. Also, each of the MPI tasks has only 1 visible GCD (logical/Slurm GPU). The hardware identification of the GCD is done via the Bus_ID (as the other GPU_IDs are not physical but relative to the job). Checking against the architecture diagram at the top of this page, each of the CPU-cores assigned to the job is on a different L3 cache group chiplet (slurm-socket). More importantly, the GCD (logical GPU) assigned to each MPI task is the one directly connected to that chiplet, so the binding is optimal:
- CPU core "001" is on chiplet:0 and directly connected to GPU with Bus_ID:D1
- CPU core "008" is on chiplet:1 and directly connected to GPU with Bus_ID:D6
- CPU core "016" is on chiplet:2 and directly connected to GPU with Bus_ID:C9
- CPU core "024" is on chiplet:3 and directly connected to GPU with Bus_ID:CE
- CPU core "032" is on chiplet:4 and directly connected to GPU with Bus_ID:D9
- CPU core "040" is on chiplet:5 and directly connected to GPU with Bus_ID:DE
- CPU core "048" is on chiplet:6 and directly connected to GPU with Bus_ID:C1
- CPU core "056" is on chiplet:7 and directly connected to GPU with Bus_ID:C6
According to the architecture diagram, this binding configuration is optimal. This first method is simpler, but may not work for all codes. "Manual" binding (method 2) may be the only reliable method for codes that rely on OpenMP or OpenACC pragmas for moving data between host and GPU and that attempt to use GPU-to-GPU enabled MPI communication.
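The comparison against the architecture diagram can also be automated. The following is a minimal sketch, not a Pawsey tool: the function name check_binding is hypothetical, and the chiplet-to-Bus_ID table is copied from the pairs observed in this page (chiplet0 to chiplet7 paired with d1, d6, c9, ce, d9, de, c1, c6). It also assumes HWT values in the range 0-63, with 8 cores per chiplet.

```bash
# Illustrative sketch only: check_binding is a hypothetical helper name.
# Reads hello_jobstep output lines and reports whether the visible GCD of each
# line is the one directly connected to the chiplet of its CPU core.
check_binding() {
    awk '
        BEGIN {
            # Entries 1..8 hold the Bus_ID paired with chiplet0..chiplet7 in this page
            split("d1 d6 c9 ce d9 de c1 c6", bus_of_chiplet, " ")
        }
        {
            for (i = 1; i < NF; i++) {
                if ($i == "HWT")        core = $(i + 1) + 0
                if ($i == "GPU_Bus_ID") bus  = $(i + 1)
            }
            chiplet = int(core / 8)       # 8 cores per chiplet
            verdict = (bus == bus_of_chiplet[chiplet + 1]) ? "optimal" : "NOT optimal"
            printf "core %03d chiplet%d bus %s: %s\n", core, chiplet, bus, verdict
        }'
}

# Demonstration with the first line of the output above:
printf '%s\n' \
    '0: MPI 000 - OMP 000 - HWT 001 - Node nid001000 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d1' \
    | check_binding
```

For the sample line the sketch prints "core 001 chiplet0 bus d1: optimal". A line where several GCDs are visible to a task (for example "c9,d1,d6") is reported as not optimal, since no single GCD is bound to the task.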
"Click" in the TAB above to read the script and output for the other method of GPU binding. |
For "manual" binding, two auxiliary techniques are needed: 1) use a wrapper that selects the correct GCD (logical/Slurm GPU), and 2) generate an ordered list to be used in the --cpu-bind option of srun:
#!/bin/bash --login
#SBATCH --job-name=8GPUExclusiveNode-bindManual
#SBATCH --partition=gpu
#SBATCH --nodes=1 #1 node in this example
#SBATCH --exclusive #All resources of the node are exclusive to this job
# #8 GPUs per node (8 "allocation-packs" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=<yourProject>-gpu #IMPORTANT: use your own project and the -gpu suffix
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm/<VERSION> craype-accel-amd-gfx90a
echo -e "\n\n#------------------------#"
module list
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}
#Definition of the executable (we assume the example code has been compiled and is available in $MYSCRATCH):
#---- Needed for "manual" optimal binding of GPUs and chiplets
#First "aux technique": create a selectGPU wrapper to be used for
# binding only 1 GPU per each task spawned by srun
# Here we use ROCR_VISIBLE_DEVICES environment variable for this purpose
# but, depending on the type of application, some other variables may need to be set too
# (check documentation).
cat << EOF > $wrapper
#!/bin/bash
export ROCR_VISIBLE_DEVICES=\$SLURM_LOCALID
exec \$*
EOF
chmod +x ./$wrapper
#---- Needed for "manual" optimal binding of GPUs and chiplets
#Second "aux technique": generate an ordered list of CPU-cores (each on a different slurm-socket)
# to be matched with the correct GPU in the srun command using --cpu-bind option.
# Script "" serves this purpose. This script is available
# to all users through the module pawseytools, which is loaded by default.
CPU_BIND=$( map_cpu)
lastResult=$?
if [ $lastResult -ne 0 ]; then
    echo "Exiting as the map generation for CPU_BIND failed" 1>&2
    rm -f ./$wrapper #deleting the wrapper
    exit 1
fi
echo -e "\n\n#------------------------#"
echo "The chosen CPU_BIND is:"
echo "${CPU_BIND}"
#MPI & OpenMP settings
export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1 #This controls the real CPU-cores per task for the executable
#Note: srun needs the explicit indication of the full parameters for use of resources in the job step.
# These are independent from the allocation parameters (which are not inherited by srun)
# For "manual" binding you should NOT use "--gpus-per-task" NOR "--gpu-bind"
# "-c 8" is used to force allocation of 1 task per CPU chiplet. Then, the REAL number of threads
# for the code SHOULD be defined by the environment variables above.
# "--cpu-bind=${CPU_BIND} ./$wrapper" create the optimal binding of GPUs "manually"
# (The "-l" option is for displaying, at the beginning of each line, the taskID that generates the output.)
# (The "-u" option is for unbuffered output, so that output is displayed as soon as it's generated.)
# (If the output needs to be sorted for clarity, then add "| sort -n" at the end of the command.)
echo -e "\n\n#------------------------#"
echo "Test code execution:"
srun -l -u -N 1 -n 8 -c 8 --gres=gpu:8 --cpu-bind=${CPU_BIND} ./$wrapper ${theExe}
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished jobs steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20
#Deleting wrappers
rm -f ./$wrapper #deleting the wrapper of the first auxiliary technique for "manual" binding
echo -e "\n\n#------------------------#"
echo "Done"
Note that the wrapper for selecting the GCDs (logical GPUs) is created from within the job script with a here-document redirected through the cat command. Also note that its name uses the SLURM_JOBID environment variable to make the wrapper unique to this job, and that the wrapper is deleted when execution finishes. Now, let's take a look at the output after executing the script:
$ sbatch
Submitted batch job 339313
$ cat slurm-339313.out
The chosen CPU_BIND is:
Test code execution:
0: MPI 000 - OMP 000 - HWT 054 - Node nid001000 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c1
1: MPI 001 - OMP 000 - HWT 063 - Node nid001000 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 1 - GPU_Bus_ID c6
2: MPI 002 - OMP 000 - HWT 018 - Node nid001000 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 2 - GPU_Bus_ID c9
3: MPI 003 - OMP 000 - HWT 026 - Node nid001000 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 3 - GPU_Bus_ID ce
4: MPI 004 - OMP 000 - HWT 006 - Node nid001000 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 4 - GPU_Bus_ID d1
5: MPI 005 - OMP 000 - HWT 013 - Node nid001000 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 5 - GPU_Bus_ID d6
6: MPI 006 - OMP 000 - HWT 033 - Node nid001000 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 6 - GPU_Bus_ID d9
7: MPI 007 - OMP 000 - HWT 047 - Node nid001000 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 7 - GPU_Bus_ID de
Done
The output of the hello_jobstep code tells us that the job ran on node nid001000 and that 8 MPI tasks were spawned. Each of the MPI tasks has only 1 CPU-core assigned to it (with the use of the OMP_NUM_THREADS environment variable in the script), which can be identified by the HWT number. Also, each of the MPI tasks has only 1 visible GCD (logical/Slurm GPU). The hardware identification is done via the Bus_ID (as the other GPU_IDs are not physical but relative to the job). Checking against the architecture diagram at the top of this page, each of the CPU-cores assigned to the job is on a different L3 cache group chiplet (slurm-socket). More importantly, the affinity is optimal:
- CPU core "054" is on chiplet:6 and directly connected to GPU with Bus_ID:C1
- CPU core "063" is on chiplet:7 and directly connected to GPU with Bus_ID:C6
- CPU core "018" is on chiplet:2 and directly connected to GPU with Bus_ID:C9
- CPU core "026" is on chiplet:3 and directly connected to GPU with Bus_ID:CE
- CPU core "006" is on chiplet:0 and directly connected to GPU with Bus_ID:D1
- CPU core "013" is on chiplet:1 and directly connected to GPU with Bus_ID:D6
- CPU core "033" is on chiplet:4 and directly connected to GPU with Bus_ID:D9
- CPU core "047" is on chiplet:5 and directly connected to GPU with Bus_ID:DE
"Click" in the TAB above to read the script and output for the other method of GPU binding. |
N Exclusive Nodes Multi-GPU job: 8*N GCDs (logical/Slurm GPUs), each of them controlled by one MPI task
As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. The same procedure described above for the single exclusive node job applies to multi-node exclusive jobs. The only difference when requesting resources is the number of exclusive nodes requested. So, for example, for a job requiring 2 exclusive nodes (16 GCDs (logical/Slurm GPUs) or 16 "allocation-packs"), the resource request uses the following two parameters:
#SBATCH --nodes=2 #2 nodes in this example
#SBATCH --exclusive #All resources of the node are exclusive to this job
# #8 GPUs per node (16 "allocation-packs" in total for the job)
Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header.
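Since each exclusive node provides 8 GCDs and these examples use one MPI task per GCD, the total number of tasks for srun follows directly from the number of nodes. The following is a minimal sketch of that arithmetic; the helper name tasks_for_nodes is hypothetical, and the default of 2 nodes is only there so that the sketch can be tried outside a job (inside a job, Slurm sets SLURM_JOB_NUM_NODES).

```bash
# Illustrative sketch only: tasks_for_nodes is a hypothetical helper name.
# Assumes exclusive GPU nodes with 8 GCDs each and 1 MPI task per GCD.
tasks_for_nodes() (
    nodes=$1
    gcds_per_node=8
    echo $(( nodes * gcds_per_node ))
)

# SLURM_JOB_NUM_NODES is set by Slurm inside a job; default to 2 for a dry run.
nodes=${SLURM_JOB_NUM_NODES:-2}
echo "srun -N ${nodes} -n $(tasks_for_nodes "$nodes") -c 8 --gres=gpu:8 ..."
```

For 2 nodes this gives -n 16, while --gres=gpu:8 stays at 8 because it is a per-node value.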
The use/management of the allocated resources is controlled by the srun
options and some environmental variables. As mentioned above, there are two methods for achieving optimal binding. The method that uses only srun
parameters is preferred (method 1), but may not always work and, in that case, the "manual" method (method 2) may be needed. The two scripts for the different methods for optimal binding are in the following tabs:
For optimal binding using srun parameters, the options "--gpus-per-task" & "--gpu-bind=closest" need to be used:
#!/bin/bash --login
#SBATCH --job-name=16GPUExclusiveNode-bindMethod1
#SBATCH --partition=gpu
#SBATCH --nodes=2 #2 nodes in this example
#SBATCH --exclusive #All resources of the node are exclusive to this job
# #8 GPUs per node (16 "allocation-packs" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=<yourProject>-gpu #IMPORTANT: use your own project and the -gpu suffix
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm/<VERSION> craype-accel-amd-gfx90a
echo -e "\n\n#------------------------#"
module list
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}
#Definition of the executable (we assume the example code has been compiled and is available in $MYSCRATCH):
#MPI & OpenMP settings
export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1 #This controls the real CPU-cores per task for the executable
#Note: srun needs the full set of resource parameters to be given explicitly for the job step.
# These are independent from the allocation parameters (which are not inherited by srun)
# For optimal GPU binding using slurm options,
# "--gpus-per-task=1" and "--gpu-bind=closest" create the optimal binding of GPUs
# "-c 8" is used to force allocation of 1 task per CPU chiplet. Then, the REAL number of threads
# for the code SHOULD be defined by the environment variables above.
# (The "-l" option is for displaying, at the beginning of each line, the taskID that generates the output.)
# (The "-u" option is for unbuffered output, so that output is displayed as soon as it's generated.)
# (If the output needs to be sorted for clarity, then add "| sort -n" at the end of the command.)
echo -e "\n\n#------------------------#"
echo "Test code execution:"
srun -l -u -N 2 -n 16 -c 8 --gres=gpu:8 --gpus-per-task=1 --gpu-bind=closest ${theExe}
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished jobs steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20
echo -e "\n\n#------------------------#"
echo "Done"
Now, let's look at the output after executing the script:
$ sbatch
Submitted batch job 339319
$ cat slurm-339319.out
Test code execution:
0: MPI 000 - OMP 000 - HWT 000 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d1
1: MPI 001 - OMP 000 - HWT 008 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d6
2: MPI 002 - OMP 000 - HWT 016 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
3: MPI 003 - OMP 000 - HWT 024 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID ce
4: MPI 004 - OMP 000 - HWT 037 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d9
5: MPI 005 - OMP 000 - HWT 040 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID de
6: MPI 006 - OMP 000 - HWT 048 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c1
7: MPI 007 - OMP 000 - HWT 056 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c6
8: MPI 008 - OMP 000 - HWT 000 - Node nid001006 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d1
9: MPI 009 - OMP 000 - HWT 008 - Node nid001006 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d6
10: MPI 010 - OMP 000 - HWT 023 - Node nid001006 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
11: MPI 011 - OMP 000 - HWT 024 - Node nid001006 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID ce
12: MPI 012 - OMP 000 - HWT 032 - Node nid001006 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d9
13: MPI 013 - OMP 000 - HWT 041 - Node nid001006 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID de
14: MPI 014 - OMP 000 - HWT 048 - Node nid001006 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c1
15: MPI 015 - OMP 000 - HWT 056 - Node nid001006 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c6
Done
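The pairing of CPU cores and GCDs in an output like the one above can be checked mechanically. The following is a minimal sketch, not a site-provided tool: the function names are hypothetical, it assumes that cores 8k to 8k+7 belong to chiplet k, and the chiplet to Bus_ID table is simply the pairing seen in the example outputs on this page.

```shell
# Sketch: check whether a CPU core (HWT) and a GCD Bus_ID form an optimal pair.
# Assumptions: cores 8*k .. 8*k+7 belong to chiplet k; the chiplet/Bus_ID pairs
# are the ones observed in the example outputs on this page.
expected_bus_id() {
    core=$(printf '%s' "$1" | sed 's/^0*//')   # strip leading zeros ("037" -> "37")
    chiplet=$(( ${core:-0} / 8 ))
    case $chiplet in
        0) echo d1 ;; 1) echo d6 ;; 2) echo c9 ;; 3) echo ce ;;
        4) echo d9 ;; 5) echo de ;; 6) echo c1 ;; 7) echo c6 ;;
        *) echo unknown ;;
    esac
}

check_binding() {   # usage: check_binding <HWT core number> <Bus_ID>
    if [ "$(expected_bus_id "$1")" = "$2" ]; then
        echo "core $1 -> $2: optimal"
    else
        echo "core $1 -> $2: NOT optimal"
    fi
}

check_binding 037 d9
check_binding 016 d1
```

The first call reports an optimal pair (core 37 is on chiplet 4) and the second a non-optimal one (core 16 is on chiplet 2, which pairs with c9).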
According to the architecture diagram, this binding configuration is optimal. This first method is simpler, but may not work for all codes. "Manual" binding (method 2) may be the only reliable method for codes that rely on OpenMP or OpenACC pragmas to move data between host and GPU and that also attempt to use GPU-to-GPU enabled MPI communication.
For "manual" binding, two auxiliary techniques are needed: 1) use of a wrapper and 2) generation of an ordered list to be used in the --cpu-bind option of srun:
#!/bin/bash --login
#SBATCH --job-name=16GPUExclusiveNode-bindManual
#SBATCH --partition=gpu
#SBATCH --nodes=2 #2 nodes in this example
#SBATCH --exclusive #All resources of the node are exclusive to this job
# #8 GPUs per node (16 "allocation-packs" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=<yourProject>-gpu #IMPORTANT: use your own project and the -gpu suffix
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm/<VERSION> craype-accel-amd-gfx90a
echo -e "\n\n#------------------------#"
module list
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}
#Definition of the executable (we assume the example code has been compiled and is available in $MYSCRATCH):
#---- Needed for "manual" optimal binding of GPUs and chiplets
#First "aux technique": create a selectGPU wrapper to be used for
# binding only 1 GPU per each task spawned by srun
# Here we use ROCR_VISIBLE_DEVICES environment variable for this purpose
# but, depending on the type of application, some other variables may need to be set too
# (check documentation).
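# (The wrapper name and the use of SLURM_LOCALID below are assumptions; adapt them to your case.)
wrapper=selectGPU_${SLURM_JOBID}.sh
cat << EOF > $wrapper
#!/bin/bash
export ROCR_VISIBLE_DEVICES=\$SLURM_LOCALID
exec \$*
EOF
chmod +x ./$wrapper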
#---- Needed for "manual" optimal binding of GPUs and chiplets
#Second "aux technique": generate an ordered list of CPU-cores (each on a different slurm-socket)
# to be matched with the correct GPU in the srun command using --cpu-bind option.
# Script "" serves this purpose. This script is available
# to all users through the module pawseytools, which is loaded by default.
CPU_BIND=$( map_cpu)
lastResult=$?
if [ $lastResult -ne 0 ]; then
echo "Exiting as the map generation for CPU_BIND failed" 1>&2
rm -f ./$wrapper #deleting the wrapper
exit 1
fi
echo -e "\n\n#------------------------#"
echo "The chosen CPU_BIND is:"
#MPI & OpenMP settings
export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1 #This controls the real CPU-cores per task for the executable
#Note: srun needs the full set of resource parameters to be given explicitly for the job step.
# These are independent from the allocation parameters (which are not inherited by srun)
# For "manual" binding you should NOT use "--gpus-per-task" NOR "--gpu-bind"
# "--cpu-bind=${CPU_BIND} ./$wrapper" creates the optimal binding of GPUs "manually"
# "-c 8" is used to force allocation of 1 task per CPU chiplet. Then, the REAL number of threads
# for the code SHOULD be defined by the environment variables above.
# (The "-l" option is for displaying, at the beginning of each line, the taskID that generates the output.)
# (The "-u" option is for unbuffered output, so that output is displayed as soon as it's generated.)
# (If the output needs to be sorted for clarity, then add "| sort -n" at the end of the command.)
echo -e "\n\n#------------------------#"
echo "Test code execution:"
srun -l -u -N 2 -n 16 -c 8 --gres=gpu:8 --cpu-bind=${CPU_BIND} ./$wrapper ${theExe}
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished jobs steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20
#Deleting wrappers
rm -f ./$wrapper #deleting the wrapper of the first auxiliary technique for "manual" binding
echo -e "\n\n#------------------------#"
echo "Done"
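The wrapper step of the script above can be tried in isolation, without Slurm. The sketch below is an illustration under stated assumptions: the wrapper file name, the use of SLURM_LOCALID (the node-local task ID that srun sets for each task) as the GPU index, and the "demo" fallback are all choices made here for the example. It uses "$@" rather than $* so that quoted arguments survive, and /bin/sh so that it runs on any system.

```shell
# Sketch of the wrapper technique, runnable outside a Slurm job.
# Assumptions: the file name and the use of SLURM_LOCALID are illustrative;
# the "demo" fallback only lets the example run when SLURM_JOBID is not set.
workdir=$(mktemp -d)
wrapper="${workdir}/selectGPU_${SLURM_JOBID:-demo}.sh"

# Unescaped variables are expanded now, while the file is being written.
# "\$" defers expansion until each task actually runs the wrapper.
cat << EOF > "$wrapper"
#!/bin/sh
export ROCR_VISIBLE_DEVICES=\$SLURM_LOCALID
exec "\$@"
EOF
chmod +x "$wrapper"

# Emulate what the task with local ID 3 would see:
visible=$(SLURM_LOCALID=3 "$wrapper" sh -c 'echo $ROCR_VISIBLE_DEVICES')
echo "task 3 sees GPU index: ${visible}"

rm -rf "$workdir"   # delete the wrapper once it is no longer needed
```

The escaped dollar signs are the important detail: without them, the variables would be frozen to the values of the batch shell at creation time, and every task would get the same setting.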
Note that the wrapper for selecting the GPUs is created by redirecting a here-document through the cat command. Also note that its name uses the SLURM_JOBID environment variable to make the wrapper unique to this job, and that the wrapper is deleted when execution finishes. Now, let's look at the output after executing the script:
$ sbatch
Submitted batch job 339313
$ cat slurm-339313.out
The chosen CPU_BIND is:
Test code execution:
0: MPI 000 - OMP 000 - HWT 053 - Node nid002924 - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID c1
1: MPI 001 - OMP 000 - HWT 061 - Node nid002924 - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID c6
2: MPI 002 - OMP 000 - HWT 016 - Node nid002924 - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID c9
3: MPI 003 - OMP 000 - HWT 029 - Node nid002924 - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID ce
4: MPI 004 - OMP 000 - HWT 004 - Node nid002924 - RT_GPU_ID 0 - GPU_ID 4 - Bus_ID d1
5: MPI 005 - OMP 000 - HWT 009 - Node nid002924 - RT_GPU_ID 0 - GPU_ID 5 - Bus_ID d6
6: MPI 006 - OMP 000 - HWT 034 - Node nid002924 - RT_GPU_ID 0 - GPU_ID 6 - Bus_ID d9
7: MPI 007 - OMP 000 - HWT 040 - Node nid002924 - RT_GPU_ID 0 - GPU_ID 7 - Bus_ID de
8: MPI 008 - OMP 000 - HWT 053 - Node nid002926 - RT_GPU_ID 0 - GPU_ID 0 - Bus_ID c1
9: MPI 009 - OMP 000 - HWT 061 - Node nid002926 - RT_GPU_ID 0 - GPU_ID 1 - Bus_ID c6
10: MPI 010 - OMP 000 - HWT 016 - Node nid002926 - RT_GPU_ID 0 - GPU_ID 2 - Bus_ID c9
11: MPI 011 - OMP 000 - HWT 029 - Node nid002926 - RT_GPU_ID 0 - GPU_ID 3 - Bus_ID ce
12: MPI 012 - OMP 000 - HWT 004 - Node nid002926 - RT_GPU_ID 0 - GPU_ID 4 - Bus_ID d1
13: MPI 013 - OMP 000 - HWT 009 - Node nid002926 - RT_GPU_ID 0 - GPU_ID 5 - Bus_ID d6
14: MPI 014 - OMP 000 - HWT 034 - Node nid002926 - RT_GPU_ID 0 - GPU_ID 6 - Bus_ID d9
15: MPI 015 - OMP 000 - HWT 040 - Node nid002926 - RT_GPU_ID 0 - GPU_ID 7 - Bus_ID de
Done
According to the architecture diagram, this binding configuration is optimal.
Example scripts for: Shared access to the GPU nodes with optimal binding
Shared node 1 GPU job
Jobs that need only 1 GCD (logical/Slurm GPU) for their execution share the GPU node with other jobs. That is, they run in shared access, which is the default, so no request for exclusive access is made.
As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. In this case we ask for 1 allocation-pack with:
#SBATCH --nodes=1 #1 node in this example
#SBATCH --gres=gpu:1 #1 GPU per node (1 "allocation-pack" in total for the job)
Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header.
The use/management of the allocated resources is controlled by the srun options and some environment variables. As only 1 allocation-pack is requested, no further action is needed for optimal binding of the CPU chiplet and the GPU, as it is guaranteed:
#!/bin/bash --login
#SBATCH --job-name=1GPUSharedNode
#SBATCH --partition=gpu
#SBATCH --nodes=1 #1 node in this example
#SBATCH --gres=gpu:1 #1 GPU per node (1 "allocation-pack" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=<yourProject>-gpu #IMPORTANT: use your own project and the -gpu suffix
#(Note that there is no request for exclusive access to the node)
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm/<VERSION> craype-accel-amd-gfx90a
echo -e "\n\n#------------------------#"
module list
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}
#Definition of the executable (we assume the example code has been compiled and is available in $MYSCRATCH):
#MPI & OpenMP settings
#Not needed for 1GPU:export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1 #This controls the real CPU-cores per task for the executable
#Note: srun needs the full set of resource parameters to be given explicitly for the job step.
# These are independent from the allocation parameters (which are not inherited by srun)
# For optimal GPU binding using slurm options,
# "--gpus-per-task=1" and "--gpu-bind=closest" create the optimal binding of GPUs
# (Although in this case this can be avoided as only 1 "allocation-pack" has been requested)
# "-c 8" is used to force allocation of 1 task per CPU chiplet. Then, the REAL number of threads
# for the code SHOULD be defined by the environment variables above.
# (The "-l" option is for displaying, at the beginning of each line, the taskID that generates the output.)
# (The "-u" option is for unbuffered output, so that output is displayed as soon as it's generated.)
# (If the output needs to be sorted for clarity, then add "| sort -n" at the end of the command.)
echo -e "\n\n#------------------------#"
echo "Test code execution:"
srun -l -u -N 1 -n 1 -c 8 --gres=gpu:1 --gpus-per-task=1 --gpu-bind=closest ${theExe}
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished jobs steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20
echo -e "\n\n#------------------------#"
echo "Done"
And the output after executing this example is:
$ sbatch
Submitted batch job 323098
$ cat slurm-323098.out
Test code execution:
0: MPI 000 - OMP 000 - HWT 002 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d1
Done
The output of the hello_jobstep code tells us that the CPU-core "002" and the GPU with Bus_ID:D1 were utilised by the job. Optimal binding is guaranteed for a single "allocation-pack", as the memory, CPU chiplet and GPU of each pack are optimally matched.
Shared node 3 MPI tasks each controlling 1 GCD (logical/Slurm GPU)
As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. In this case we ask for 3 allocation-packs with:
#SBATCH --nodes=1 #1 node in this example
#SBATCH --gres=gpu:3 #3 GPUs per node (3 "allocation-packs" in total for the job)
Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header.
The use/management of the allocated resources is controlled by the srun options and some environment variables. As mentioned above, there are two methods for achieving optimal binding. The method that uses only srun parameters (method 1) is preferred, but it may not always work; in that case, the "manual" method (method 2) may be needed. The scripts for the two binding methods are given below:
For optimal binding using srun parameters, the options "--gpus-per-task" and "--gpu-bind=closest" need to be used:
#!/bin/bash --login
#SBATCH --job-name=3GPUSharedNode-bindMethod1
#SBATCH --partition=gpu
#SBATCH --nodes=1 #1 node in this example
#SBATCH --gres=gpu:3 #3 GPUs per node (3 "allocation-packs" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=<yourProject>-gpu #IMPORTANT: use your own project and the -gpu suffix
#(Note that there is no request for exclusive access to the node)
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm/<VERSION> craype-accel-amd-gfx90a
echo -e "\n\n#------------------------#"
module list
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}
#Definition of the executable (we assume the example code has been compiled and is available in $MYSCRATCH):
#MPI & OpenMP settings
export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1 #This controls the real CPU-cores per task for the executable
#Note: srun needs the full set of resource parameters to be given explicitly for the job step.
# These are independent from the allocation parameters (which are not inherited by srun)
# For optimal GPU binding using slurm options,
# "--gpus-per-task=1" and "--gpu-bind=closest" create the optimal binding of GPUs
# "-c 8" is used to force allocation of 1 task per CPU chiplet. Then, the REAL number of threads
# for the code SHOULD be defined by the environment variables above.
# (The "-l" option is for displaying, at the beginning of each line, the taskID that generates the output.)
# (The "-u" option is for unbuffered output, so that output is displayed as soon as it's generated.)
# (If the output needs to be sorted for clarity, then add "| sort -n" at the end of the command.)
echo -e "\n\n#------------------------#"
echo "Test code execution:"
srun -l -u -N 1 -n 3 -c 8 --gres=gpu:3 --gpus-per-task=1 --gpu-bind=closest ${theExe}
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished jobs steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20
echo -e "\n\n#------------------------#"
echo "Done"
Now, let's look at the output after executing the script:
$ sbatch
Submitted batch job 339314
$ cat slurm-339314.out
Test code execution:
0: MPI 000 - OMP 000 - HWT 000 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d1
1: MPI 001 - OMP 000 - HWT 008 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d6
2: MPI 002 - OMP 000 - HWT 016 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
Done
The output of the hello_jobstep code tells us that the job ran on node nid001004 and that 3 MPI tasks were spawned. Each MPI task has only 1 CPU core assigned to it (through the OMP_NUM_THREADS environment variable in the script), which can be identified by the HWT number. Also, each MPI task has only 1 visible GCD (logical/Slurm GPU). The hardware identification of the GPU is done via the Bus_ID (as the other GPU_IDs are not physical but relative to the job). Checking against the architecture diagram at the top of this page shows that each CPU core assigned to the job is on a different L3 cache group chiplet (slurm-socket). More importantly, the binding is optimal:
- CPU core "000" is on chiplet:0 and directly connected to GCD (logical GPU) with Bus_ID:D1
- CPU core "008" is on chiplet:1 and directly connected to GCD (logical GPU) with Bus_ID:D6
- CPU core "016" is on chiplet:2 and directly connected to GCD (logical GPU) with Bus_ID:C9
According to the architecture diagram, this binding configuration is optimal. This first method is simpler, but may not work for all codes. "Manual" binding (method 2) may be the only reliable method for codes that rely on OpenMP or OpenACC pragmas to move data between host and GPU and that also attempt to use GPU-to-GPU enabled MPI communication.
For "manual" binding, two auxiliary techniques are needed: 1) use of a wrapper and 2) generation of an ordered list to be used in the --cpu-bind option of srun:
#!/bin/bash --login
#SBATCH --job-name=3GPUSharedNode-bindManual
#SBATCH --partition=gpu
#SBATCH --nodes=1 #1 node in this example
#SBATCH --gres=gpu:3 #3 GPUs per node (3 "allocation-packs" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=<yourProject>-gpu #IMPORTANT: use your own project and the -gpu suffix
#(Note that there is no request for exclusive access to the node)
#Tricks used only for the development/debugging of the script.
#This section should be commented or removed from a proper production script.
#shopt -s expand_aliases
#alias ""="$MYSOFTWARE/pawseytools/"
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm/<VERSION> craype-accel-amd-gfx90a
echo -e "\n\n#------------------------#"
module list
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}
#Definition of the executable (we assume the example code has been compiled and is available in $MYSCRATCH):
#---- Needed for "manual" optimal binding of GPUs and chiplets
#First "aux technique": create a selectGPU wrapper to be used for
# binding only 1 GPU per each task spawned by srun
# Here we use ROCR_VISIBLE_DEVICES environment variable for this purpose
# but, depending on the type of application, some other variables may need to be set too
# (check documentation).
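# (The wrapper name and the use of SLURM_LOCALID below are assumptions; adapt them to your case.)
wrapper=selectGPU_${SLURM_JOBID}.sh
cat << EOF > $wrapper
#!/bin/bash
export ROCR_VISIBLE_DEVICES=\$SLURM_LOCALID
exec \$*
EOF
chmod +x ./$wrapper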
#---- Needed for "manual" optimal binding of GPUs and chiplets
#Second "aux technique": generate an ordered list of CPU-cores (each on a different slurm-socket)
# to be matched with the correct GPU in the srun command using --cpu-bind option.
# Script "" serves this purpose. This script is available
# to all users through the module pawseytools, which is loaded by default.
CPU_BIND=$( map_cpu)
lastResult=$?
if [ $lastResult -ne 0 ]; then
echo "Exiting as the map generation for CPU_BIND failed" 1>&2
rm -f ./$wrapper #deleting the wrapper
exit 1
fi
echo -e "\n\n#------------------------#"
echo "The chosen CPU_BIND is:"
#MPI & OpenMP settings
export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1 #This controls the real CPU-cores per task for the executable
#Note: srun needs the full set of resource parameters to be given explicitly for the job step.
# These are independent from the allocation parameters (which are not inherited by srun)
# For "manual" binding you should NOT use "--gpus-per-task" NOR "--gpu-bind"
# "--cpu-bind=${CPU_BIND} ./$wrapper" creates the optimal binding of GPUs "manually"
# "-c 8" is used to force allocation of 1 task per CPU chiplet. Then, the REAL number of threads
# for the code SHOULD be defined by the environment variables above.
# (The "-l" option is for displaying, at the beginning of each line, the taskID that generates the output.)
# (The "-u" option is for unbuffered output, so that output is displayed as soon as it's generated.)
# (If the output needs to be sorted for clarity, then add "| sort -n" at the end of the command.)
echo -e "\n\n#------------------------#"
echo "Test code execution:"
srun -l -u -N 1 -n 3 -c 8 --gres=gpu:3 --cpu-bind=${CPU_BIND} ./$wrapper ${theExe}
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished jobs steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20
#Deleting wrappers
rm -f ./$wrapper #deleting the wrapper of the first auxiliary technique for "manual" binding
echo -e "\n\n#------------------------#"
echo "Done"
Note that the wrapper for selecting the GCDs (logical/Slurm GPUs) is created by redirecting a here-document through the cat command. Also note that its name uses the SLURM_JOBID environment variable to make the wrapper unique to this job, and that the wrapper is deleted when execution finishes. Now, let's look at the output after executing the script:
$ sbatch
Submitted batch job 339313
$ cat slurm-339313.out
The chosen CPU_BIND is:
Test code execution:
0: MPI 000 - OMP 000 - HWT 019 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
1: MPI 001 - OMP 000 - HWT 002 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 1 - GPU_Bus_ID d1
2: MPI 002 - OMP 000 - HWT 009 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 2 - GPU_Bus_ID d6
Done
The output of the hello_jobstep code tells us that the job ran on node nid001004 and that 3 MPI tasks were spawned. Each MPI task has only 1 CPU core assigned to it (through the OMP_NUM_THREADS environment variable in the script), which can be identified by the HWT number. Also, each MPI task has only 1 visible GCD (logical/Slurm GPU). The hardware identification of the GPU is done via the Bus_ID (as the other GPU_IDs are not physical but relative to the job). Checking against the architecture diagram at the top of this page shows that each CPU core assigned to the job is on a different L3 cache group chiplet (slurm-socket). More importantly, the binding is optimal:
- CPU core "019" is on chiplet:2 and directly connected to GCD (logical GPU) with Bus_ID:C9
- CPU core "002" is on chiplet:0 and directly connected to GCD (logical GPU) with Bus_ID:D1
- CPU core "009" is on chiplet:1 and directly connected to GCD (logical GPU) with Bus_ID:D6
According to the architecture diagram, this binding configuration is optimal.
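The idea behind the ordered list passed to --cpu-bind in the "manual" method can be sketched as follows: for each GCD, in the order in which the job sees them, pick a core on the chiplet paired with that GCD. This is a hypothetical illustration and NOT the site-provided script: the function names are invented here, the chiplet to Bus_ID table is the pairing seen in the outputs on this page, and the sketch always picks the first core of each chiplet, whereas the outputs above show that the real tool may pick other cores of the same chiplet.

```shell
# Sketch: build a map_cpu list so that task i is pinned to the chiplet paired with GCD i.
# Assumptions: Bus_IDs are given in the order in which the job sees its GCDs; the
# chiplet/Bus_ID pairs are those observed in the outputs on this page; the first
# core of each chiplet is used. This is NOT the site-provided script.
chiplet_of_bus_id() {
    case $1 in
        d1) echo 0 ;; d6) echo 1 ;; c9) echo 2 ;; ce) echo 3 ;;
        d9) echo 4 ;; de) echo 5 ;; c1) echo 6 ;; c6) echo 7 ;;
        *)  return 1 ;;
    esac
}

make_map_cpu() {   # usage: make_map_cpu <Bus_ID of GCD 0> <Bus_ID of GCD 1> ...
    list=""
    for bus in "$@"; do
        chiplet=$(chiplet_of_bus_id "$bus") || return 1
        core=$((chiplet * 8))                 # first core of that chiplet
        list="${list:+${list},}${core}"       # comma-separated, no leading comma
    done
    echo "map_cpu:${list}"
}

make_map_cpu c9 d1 d6
```

For the GCD order of the example above (c9, d1, d6) the sketch prints "map_cpu:16,0,8", so task 0 lands on chiplet 2, task 1 on chiplet 0 and task 2 on chiplet 1, the same chiplets as in the output.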
Example scripts for: Hybrid jobs (multiple threads) on the CPU side
When the code is hybrid on the CPU side (MPI + OpenMP), the logic is similar to the above examples, except that more than 1 CPU core of the chiplet needs to be accessible per srun task. This is controlled by the OMP_NUM_THREADS environment variable and also implies a change in the settings for the optimal binding of resources when the "manual" binding (method 2) is applied.
In the following example, we use 3 GCDs (logical/Slurm GPUs), 1 per MPI task, and the number of CPU threads per task is 5. As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. In this case we ask for 3 allocation-packs with:
#SBATCH --nodes=1 #1 node in this example
#SBATCH --gres=gpu:3 #3 GPUs per node (3 "allocation-packs" in total for the job)
Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header. And the real number of threads per task is controlled with:
export OMP_NUM_THREADS=5 #This controls the real CPU-cores per task for the executable
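Since each task is given one chiplet through "-c 8", the thread count has to fit within 8 cores. A small guard like the following could catch a mismatch before srun is called. It is only a sketch: the function name check_threads is hypothetical, and the limit of 8 assumes one MPI task per chiplet as in the examples on this page.

```shell
# Sketch: refuse thread counts that do not fit in the 8 cores reserved by "-c 8".
# Assumption: one MPI task per chiplet, as in the examples on this page.
check_threads() {
    threads=$1
    cores_per_task=8
    if [ "$threads" -ge 1 ] && [ "$threads" -le "$cores_per_task" ]; then
        echo "OK: ${threads} threads fit in ${cores_per_task} cores"
    else
        echo "ERROR: ${threads} threads do not fit in ${cores_per_task} cores" >&2
        return 1
    fi
}

export OMP_NUM_THREADS=5
check_threads "$OMP_NUM_THREADS"
```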
The use/management of the allocated resources is controlled by the srun options and some environment variables. As mentioned above, there are two methods for achieving optimal binding. The method that uses only srun parameters (method 1) is preferred, but it may not always work; in that case, the "manual" method (method 2) may be needed. The scripts for the two binding methods are given below:
For optimal binding using srun parameters, the options "--gpus-per-task" and "--gpu-bind=closest" need to be used:
#!/bin/bash --login
#SBATCH --job-name=3GPU-Hybrid5CPU-bindMethod1
#SBATCH --partition=gpu
#SBATCH --nodes=1 #1 node in this example
#SBATCH --gres=gpu:3 #3 GPUs per node (3 "allocation-packs" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=<yourProject>-gpu #IMPORTANT: use your own project and the -gpu suffix
#(Note that there is no request for exclusive access to the node)
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm/<VERSION> craype-accel-amd-gfx90a
echo -e "\n\n#------------------------#"
module list
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}
#Definition of the executable (we assume the example code has been compiled and is available in $MYSCRATCH):
#MPI & OpenMP settings
export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=5 #This controls the real CPU-cores per task for the executable
#Note: srun needs the full set of resource parameters to be given explicitly for the job step.
# These are independent from the allocation parameters (which are not inherited by srun)
# For optimal GPU binding using slurm options,
# "--gpus-per-task=1" and "--gpu-bind=closest" create the optimal binding of GPUs
# "-c 8" is used to force allocation of 1 task per CPU chiplet. Then, the REAL number of threads
# for the code SHOULD be defined by the environment variables above.
# (The "-l" option is for displaying, at the beginning of each line, the taskID that generates the output.)
# (The "-u" option is for unbuffered output, so that output is displayed as soon as it's generated.)
# (If the output needs to be sorted for clarity, then add "| sort -n" at the end of the command.)
echo -e "\n\n#------------------------#"
echo "Test code execution:"
srun -l -u -N 1 -n 3 -c 8 --gres=gpu:3 --gpus-per-task=1 --gpu-bind=closest ${theExe}
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished jobs steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20
echo -e "\n\n#------------------------#"
echo "Done"
Now, let's look at the output after executing the script:
$ sbatch
Submitted batch job 339314
$ cat slurm-339314.out
Test code execution:
0: MPI 000 - OMP 000 - HWT 000 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d1
0: MPI 000 - OMP 001 - HWT 003 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d1
0: MPI 000 - OMP 002 - HWT 004 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d1
0: MPI 000 - OMP 003 - HWT 005 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d1
0: MPI 000 - OMP 004 - HWT 007 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d1
1: MPI 001 - OMP 000 - HWT 008 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d6
1: MPI 001 - OMP 001 - HWT 011 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d6
1: MPI 001 - OMP 002 - HWT 012 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d6
1: MPI 001 - OMP 003 - HWT 013 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d6
1: MPI 001 - OMP 004 - HWT 015 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d6
2: MPI 002 - OMP 000 - HWT 017 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
2: MPI 002 - OMP 001 - HWT 018 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
2: MPI 002 - OMP 002 - HWT 020 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
2: MPI 002 - OMP 003 - HWT 022 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
2: MPI 002 - OMP 004 - HWT 023 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9 ...
Done
The output of the hello_jobstep code tells us that the job ran on node nid001004 and that 3 MPI tasks were spawned. Each MPI task has 5 CPU cores assigned to it (through the OMP_NUM_THREADS environment variable in the script), which can be identified by the HWT numbers. Also, each thread has only 1 visible GCD (logical/Slurm GPU). The hardware identification of the GPU is done via the Bus_ID (as the other GPU_IDs are not physical but relative to the job). Checking against the architecture diagram at the top of this page shows that the CPU cores assigned to each MPI task are on a different L3 cache group chiplet (slurm-socket). More importantly, the binding is optimal. This first method is simpler, but may not work for all codes. "Manual" binding (method 2) may be the only reliable method for codes that rely on OpenMP or OpenACC pragmas to move data between host and GPU and that also attempt to use GPU-to-GPU enabled MPI communication.
For hybrid jobs on the CPU side, use mask_cpu for the --cpu-bind option and NOT map_cpu. Also, control the number of CPU threads per task with OMP_NUM_THREADS.
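The difference between the two forms is that map_cpu pins each task to a single core, while mask_cpu gives each task a set of cores expressed as a hexadecimal bit mask, which is what several threads per task need. The following sketch shows how such masks could be built. It is an illustration, not the site-provided script: the function names are hypothetical, and it assumes that each task may use all 8 cores of its chiplet, so that the mask for chiplet k is 0xff followed by 2*k zeros.

```shell
# Sketch: build a mask_cpu list giving each task the 8 cores of one chiplet.
# Assumption: full-chiplet masks; 8 cores per chiplet means 8 bits (2 hex digits)
# per chiplet, so the mask for chiplet k is 0xff followed by 2*k zeros.
chiplet_mask() {
    mask="0xff"
    i=0
    while [ "$i" -lt "$1" ]; do
        mask="${mask}00"
        i=$((i + 1))
    done
    echo "$mask"
}

make_mask_cpu() {   # usage: make_mask_cpu <chiplet for task 0> <chiplet for task 1> ...
    list=""
    for chiplet in "$@"; do
        list="${list:+${list},}$(chiplet_mask "$chiplet")"
    done
    echo "mask_cpu:${list}"
}

make_mask_cpu 2 0 1
```

For tasks on chiplets 2, 0 and 1 this prints "mask_cpu:0xff0000,0xff,0xff00". Building the masks as strings avoids integer overflow for the highest chiplet.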
For "manual" binding, two auxiliary techniques are needed: 1) use a wrapper that selects the GPU for each task, and 2) generate an ordered list to be used in the --cpu-bind option of srun. In this case, the list needs to be created using the mask_cpu parameter:
#!/bin/bash --login
#SBATCH --job-name=3GPU-Hybrid5CPU-bindManual
#SBATCH --partition=gpu
#SBATCH --nodes=1 #1 node in this example
#SBATCH --gres=gpu:3 #3 GPUs per node (3 "allocation-packs" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=<yourProject>-gpu #IMPORTANT: use your own project and the -gpu suffix
#(Note that there is no request for exclusive access to the node)
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm/<VERSION> craype-accel-amd-gfx90a
echo -e "\n\n#------------------------#"
module list
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}
#Definition of the executable (we assume the example code has been compiled and is available in $MYSCRATCH):
#---- Needed for "manual" optimal binding of GPUs and chiplets
#First "aux technique": create a selectGPU wrapper to be used for
# binding only 1 GPU per each task spawned by srun
# Here we use ROCR_VISIBLE_DEVICES environment variable for this purpose
# but, depending on the type of application, some other variables may need to be set too
# (check documentation).
cat << EOF > $wrapper
exec \$*
EOF
chmod +x ./$wrapper
#---- Needed for "manual" optimal binding of GPUs and chiplets
#Second "aux technique": generate an ordered list of CPU-MASKS (each on a different slurm-socket)
# to be matched with the correct GPU in the srun command using --cpu-bind option.
# Script "" serves this purpose. This script is available
# to all users through the module pawseytools, which is loaded by default.
CPU_BIND=$( mask_cpu)
lastResult=$?
if [ $lastResult -ne 0 ]; then
echo "Exiting as the map generation for CPU_BIND failed" 1>&2
rm -f ./$wrapper #deleting the wrapper
exit 1
fi
echo -e "\n\n#------------------------#"
echo "The chosen CPU_BIND is:"
echo "${CPU_BIND}"
#MPI & OpenMP settings
export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=5 #This controls the real CPU-cores per task for the executable
#Note: srun needs the full set of resource parameters to be given explicitly for the job step.
# These are independent from the allocation parameters (which are not inherited by srun)
# For "manual" binding you should NOT use "--gpus-per-task" NOR "--gpu-bind"
# "--cpu-bind=${CPU_BIND} ./$wrapper" creates the optimal binding of GPUs "manually"
# "-c 8" is used to force allocation of 1 task per CPU chiplet. Then, the REAL number of threads
# for the code SHOULD be defined by the environment variables above.
# (The "-l" option is for displaying, at the beginning of each line, the taskID that generates the output.)
# (The "-u" option is for unbuffered output, so that output is displayed as soon as it's generated.)
# (If the output needs to be sorted for clarity, then add "| sort -n" at the end of the command.)
echo -e "\n\n#------------------------#"
echo "Test code execution:"
srun -l -u -N 1 -n 3 -c 8 --gres=gpu:3 --cpu-bind=${CPU_BIND} ./$wrapper ${theExe}
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished jobs steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20
#Deleting wrappers
rm -f ./$wrapper #deleting the wrapper of the first auxiliary technique for "manual" binding
echo -e "\n\n#------------------------#"
echo "Done"
Note that the wrapper for selecting the GPUs (logical/Slurm GPUs) is created by redirecting a here-document from the cat command into a file. Also note that its name uses the SLURM_JOBID environment variable to make the wrapper unique to this job, and that the wrapper is deleted when execution finishes. Now, let's take a look at the output after executing the script:
$ sbatch
Submitted batch job 339313
$ cat slurm-339313.out
The chosen CPU_BIND is:
Test code execution:
0: MPI 000 - OMP 000 - HWT 016 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
0: MPI 000 - OMP 001 - HWT 019 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
0: MPI 000 - OMP 002 - HWT 020 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
0: MPI 000 - OMP 003 - HWT 021 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
0: MPI 000 - OMP 004 - HWT 023 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
1: MPI 001 - OMP 000 - HWT 000 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 1 - GPU_Bus_ID d1
1: MPI 001 - OMP 001 - HWT 002 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 1 - GPU_Bus_ID d1
1: MPI 001 - OMP 002 - HWT 004 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 1 - GPU_Bus_ID d1
1: MPI 001 - OMP 003 - HWT 005 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 1 - GPU_Bus_ID d1
1: MPI 001 - OMP 004 - HWT 007 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 1 - GPU_Bus_ID d1
2: MPI 002 - OMP 000 - HWT 008 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 2 - GPU_Bus_ID d6
2: MPI 002 - OMP 001 - HWT 011 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 2 - GPU_Bus_ID d6
2: MPI 002 - OMP 002 - HWT 012 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 2 - GPU_Bus_ID d6
2: MPI 002 - OMP 003 - HWT 013 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 2 - GPU_Bus_ID d6
2: MPI 002 - OMP 004 - HWT 015 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 2 - GPU_Bus_ID d6
Done
The output of the hello_jobstep code tells us that the job ran on node nid001004 and that 3 MPI tasks were spawned. Each MPI task has 5 CPU cores assigned to it (set with the OMP_NUM_THREADS environment variable in the script), and each core can be identified by its HWT number. Also, each thread sees only 1 GCD (logical/Slurm GPU). The GPU hardware is identified by the Bus_ID (the other GPU_IDs are not physical but relative to the job). Comparing with the architecture diagram at the top of this page shows that the CPU cores assigned to each task are on a different L3 cache group chiplet (slurm-socket) and, more importantly, that each chiplet is paired with its directly connected GCD, so the binding is optimal. Click the tab above to read the script and output for the other method of GPU binding.
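For illustration, a wrapper of the kind described in the first auxiliary technique can be sketched as follows. This is a minimal sketch, assuming that each task should see the GCD whose index equals its node-local rank (the SLURM_LOCALID variable set by srun). The function name, the file name and the choice of SLURM_LOCALID are assumptions made for this example and are not necessarily what the script on this page uses.

```bash
#!/bin/bash
# Sketch only: make_select_gpu_wrapper and the use of SLURM_LOCALID are assumptions.

# Write an executable wrapper that restricts the task to one GCD and then
# runs the command given as its arguments.
make_select_gpu_wrapper() {
    local wrapper=$1
    # The escaped \$ keeps the variables unexpanded until the wrapper runs,
    # so every srun task evaluates its own SLURM_LOCALID.
    cat << EOF > "$wrapper"
#!/bin/bash
export ROCR_VISIBLE_DEVICES=\$SLURM_LOCALID
exec "\$@"
EOF
    chmod +x "$wrapper"
}

# Quick check outside srun: emulate the task with node-local rank 1.
wrapper="./selectGPU_${SLURM_JOBID:-nojob}.sh"
make_select_gpu_wrapper "$wrapper"
SLURM_LOCALID=1 "$wrapper" sh -c 'echo "visible GCD index: $ROCR_VISIBLE_DEVICES"'
# prints: visible GCD index: 1
rm -f "$wrapper"
```

With a wrapper like this, the GCD seen by each task is fixed by its local rank, so it is the order of the masks in the --cpu-bind list that decides which chiplet is paired with which GCD.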
Example scripts for: Jobs where each task needs access to multiple GPUs
Exclusive nodes: all 8 GPUs in each node accessible to all 8 tasks in the node
Some applications, like TensorFlow and other machine learning applications, may require access to all the available GPUs in the node. In this case, optimal binding and communication cannot be provided by the scheduler when assigning resources to the srun launcher, so full responsibility for the optimal use of the resources lies with the code itself.
As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. This example considers a job that will make use of the 8 GCDs (logical/Slurm GPUs) on each of 2 nodes (16 "allocation-packs" in total). The resource request uses the following two parameters:
#SBATCH --nodes=2 #2 nodes in this example
#SBATCH --exclusive #All resources of each node are exclusive to this job
# #8 GPUs per node (16 "allocation-packs" in total for the job)
Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header.
The use/management of the allocated resources is controlled by the srun options and some environment variables. As mentioned above, optimal binding cannot be achieved by the scheduler, so no settings for optimal binding are given to the launcher. Also, all the GPUs in the node are available to each of the tasks:
#!/bin/bash --login
#SBATCH --job-name=16GPUExclusiveNode-8GPUsVisiblePerTask
#SBATCH --partition=gpu
#SBATCH --nodes=2 #2 nodes in this example
#SBATCH --exclusive #All resources of the node are exclusive to this job
# #8 GPUs per node (16 "allocation packs" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=<yourProject>-gpu #IMPORTANT: use your own project and the -gpu suffix
#Loading needed modules (adapt this for your own purposes):
#For the hello_jobstep example:
module load PrgEnv-cray
module load rocm/<VERSION> craype-accel-amd-gfx90a
#OR for a tensorflow example:
#module load tensorflow/<version>
echo -e "\n\n#------------------------#"
module list
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}
#Definition of the executable (we assume the example code has been compiled and is available in $MYSCRATCH):
#MPI & OpenMP settings if needed (these won't work for Tensorflow):
export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1 #This controls the real CPU-cores per task for the executable
#TensorFlow settings if needed:
# The following two variables control the real number of threads in Tensorflow code:
#export TF_NUM_INTEROP_THREADS=1 #Number of threads for independent operations
#export TF_NUM_INTRAOP_THREADS=1 #Number of threads within individual operations
#Note: srun needs the full set of resource parameters to be given explicitly for the job step.
# These are independent from the allocation parameters (which are not inherited by srun)
# Each task needs access to all the 8 available GPUs in the node where it's running.
# So, no optimal binding can be provided by the scheduler.
# Therefore, "--gpus-per-task" and "--gpu-bind" are not used.
# Optimal use of resources is now the responsibility of the code.
# "-c 8" is used to force allocation of 1 task per CPU chiplet. Then, the REAL number of threads
# for the code SHOULD be defined by the environment variables above.
# (The "-l" option is for displaying, at the beginning of each line, the taskID that generates the output.)
# (The "-u" option is for unbuffered output, so that output is displayed as soon as it's generated.)
# (If the output needs to be sorted for clarity, then add "| sort -n" at the end of the command.)
echo -e "\n\n#------------------------#"
echo "Test code execution:"
srun -l -u -N 2 -n 16 -c 8 --gres=gpu:8 ${theExe}
#srun -l -u -N 2 -n 16 -c 8 --gres=gpu:8 python3 ${tensorFlowScript}
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished jobs steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20
echo -e "\n\n#------------------------#"
echo "Done"
And the output after executing this example is:
$ sbatch
Submitted batch job 7798215
$ cat slurm-7798215.out
Test code execution:
0: MPI 000 - OMP 000 - HWT 001 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
1: MPI 001 - OMP 000 - HWT 008 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
2: MPI 002 - OMP 000 - HWT 016 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
3: MPI 003 - OMP 000 - HWT 024 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
4: MPI 004 - OMP 000 - HWT 032 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
5: MPI 005 - OMP 000 - HWT 040 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
6: MPI 006 - OMP 000 - HWT 049 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
7: MPI 007 - OMP 000 - HWT 056 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
8: MPI 008 - OMP 000 - HWT 000 - Node nid002946 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
9: MPI 009 - OMP 000 - HWT 008 - Node nid002946 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
10: MPI 010 - OMP 000 - HWT 016 - Node nid002946 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
11: MPI 011 - OMP 000 - HWT 025 - Node nid002946 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
12: MPI 012 - OMP 000 - HWT 032 - Node nid002946 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
13: MPI 013 - OMP 000 - HWT 040 - Node nid002946 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
14: MPI 014 - OMP 000 - HWT 048 - Node nid002946 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
15: MPI 015 - OMP 000 - HWT 056 - Node nid002946 - RunTime_GPU_ID 0,1,2,3,4,5,6,7 - ROCR_VISIBLE_GPU_ID 0,1,2,3,4,5,6,7 - GPU_Bus_ID c1,c6,c9,ce,d1,d6,d9,de
Done
The output of the hello_jobstep code tells us that the job ran 8 MPI tasks on node nid002944 and another 8 MPI tasks on node nid002946. Each MPI task has only 1 CPU core assigned to it (set with the OMP_NUM_THREADS environment variable in the script), which can be identified by its HWT number. Clearly, within each node, each task runs on a different chiplet.
More importantly for this example, each MPI task has access to the 8 GCDs (logical/Slurm GPUs) in its node. Proper and optimal GPU management and communication is the responsibility of the code. The hardware identification is done via the Bus_ID (as the other GPU_IDs are not physical but relative to the job).
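Since each chiplet groups 8 consecutive cores, the chiplet used by a task can be obtained from its HWT number by integer division by 8. The following minimal sketch applies this to hello_jobstep output; it assumes the line layout shown above, and the function name task_chiplets is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: read hello_jobstep lines on stdin and print
# "<node> <MPI rank> <chiplet>" for each line, where chiplet = HWT / 8.
# Assumptions: 8 consecutive cores per chiplet; fields laid out as in the output above.
task_chiplets() {
    awk '{
        rank = ""; hwt = ""; node = ""
        for (i = 1; i < NF; i++) {
            if ($i == "MPI")  rank = $(i + 1) + 0
            if ($i == "HWT")  hwt  = $(i + 1) + 0
            if ($i == "Node") node = $(i + 1)
        }
        # Lines without an HWT field (headers, "Done") are skipped.
        if (hwt != "") print node, rank, int(hwt / 8)
    }'
}

# Example with one line taken from the output above:
echo "6: MPI 006 - OMP 000 - HWT 049 - Node nid002944 - RunTime_GPU_ID 0,1,2,3,4,5,6,7" | task_chiplets
# prints: nid002944 6 6
```

Feeding the whole Slurm output file to the function (for example with input redirection) gives one line per task, which makes it easy to confirm that no two tasks on the same node share a chiplet.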
Shared nodes: Many GPUs requested but 2 GPUs bound to each task
Some applications may require each of the spawned tasks to have access to multiple GPUs. In this case, a good (although not fully optimal) binding and communication can still be provided by the scheduler when assigning resources with the srun launcher, although final responsibility for the optimal use of the multiple GPUs assigned to each task lies with the code itself.
As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. This example considers a job that will make use of 6 GCDs (logical/Slurm GPUs) on 1 node (6 "allocation-packs" in total). The resource request uses the following two parameters:
#SBATCH --nodes=1 #1 node in this example
#SBATCH --gres=gpu:6 #6 GPUs per node (6 "allocation packs" in total for the job)
Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header.
The use/management of the allocated resources is controlled by the srun options and some environment variables. As mentioned above, the best available binding can still be achieved by the scheduler, which provides 2 GPUs to each of the tasks:
#!/bin/bash --login
#SBATCH --job-name=6GPUSharedNode-2GPUsVisiblePerTask
#SBATCH --partition=gpu
#SBATCH --nodes=1 #1 node in this example
#SBATCH --gres=gpu:6 #6 GPUs per node (6 "allocation packs" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=<yourProject>-gpu #IMPORTANT: use your own project and the -gpu suffix
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm/<VERSION> craype-accel-amd-gfx90a
echo -e "\n\n#------------------------#"
module list
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}
#Definition of the executable (we assume the example code has been compiled and is available in $MYSCRATCH):
#MPI & OpenMP settings if needed (these won't work for Tensorflow):
export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1 #This controls the real CPU-cores per task for the executable
#Note: srun needs the full set of resource parameters to be given explicitly for the job step.
# These are independent from the allocation parameters (which are not inherited by srun)
# For best possible GPU binding using slurm options,
# "--gpus-per-task=2" and "--gpu-bind=closest" will provide the best GPUs to the tasks.
# But best is still not optimal.
# Each task has access to 2 available GPUs in the node where it's running.
# Optimal use of each of the 2 GPUs accessible per task is now the responsibility of the code.
# IMPORTANT: Note the use of "-c 16" to "reserve" 2 chiplets per task, which is consistent with
# the use of "--gpus-per-task=2" to "reserve" 2 GPUs per task. Then, the REAL number of
# threads for the code SHOULD be defined by the environment variables above.
# (The "-l" option is for displaying, at the beginning of each line, the taskID that generates the output.)
# (The "-u" option is for unbuffered output, so that output is displayed as soon as it's generated.)
# (If the output needs to be sorted for clarity, then add "| sort -n" at the end of the command.)
echo -e "\n\n#------------------------#"
echo "Test code execution:"
srun -l -u -N 1 -n 3 -c 16 --gres=gpu:6 --gpus-per-task=2 --gpu-bind=closest ${theExe}
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished jobs steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20
echo -e "\n\n#------------------------#"
echo "Done"
And the output after executing this example is:
$ sbatch
Submitted batch job 7842635
$ cat slurm-7842635.out
Test code execution:
0: MPI 000 - OMP 000 - HWT 000 - Node nid002948 - RunTime_GPU_ID 0,1 - ROCR_VISIBLE_GPU_ID 0,1 - GPU_Bus_ID d1,d6
1: MPI 001 - OMP 000 - HWT 016 - Node nid002948 - RunTime_GPU_ID 0,1 - ROCR_VISIBLE_GPU_ID 0,1 - GPU_Bus_ID c9,ce
2: MPI 002 - OMP 000 - HWT 032 - Node nid002948 - RunTime_GPU_ID 0,1 - ROCR_VISIBLE_GPU_ID 0,1 - GPU_Bus_ID d9,de
Done
The output of the hello_jobstep code tells us that the job ran 3 MPI tasks on node nid002948. Each MPI task has only 1 CPU core assigned to it (set with the OMP_NUM_THREADS environment variable in the script), which can be identified by its HWT number. Clearly, each task runs on a different chiplet. Moreover, the tasks are spaced 16 cores (two chiplets) apart, thanks to the "-c 16" setting in the srun command, which allows for the best binding of the 2 GPUs assigned to each task.
More importantly for this example, each MPI task has access to 2 GCDs (logical/Slurm GPUs) in its node. The hardware identification is done via the Bus_ID (as the other GPU_IDs are not physical but relative to the job). The assigned GPUs are indeed the 2 closest to the CPU core, as can be verified with the architecture diagram provided at the top of this page. Final proper and optimal GPU management and communication is the responsibility of the code.
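The consistency rule used in the srun line of this example (one whole chiplet, that is 8 cores, reserved for every GCD given to a task) can be written as a small calculation. The sketch below is only an illustration under the assumptions of 8 cores per chiplet, 8 GCDs per node and a single-node job step; the function name step_options is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: print srun options for a single-node job step in which
# every task gets <gpus_per_task> GCDs and the matching number of whole chiplets.
# Assumptions: 8 cores per chiplet and at most 8 GCDs per node.
step_options() {
    local ntasks=$1 gpus_per_task=$2
    local total_gpus=$(( ntasks * gpus_per_task ))
    if [ "$total_gpus" -gt 8 ]; then
        echo "step_options: ${total_gpus} GCDs requested, but a node has only 8" 1>&2
        return 1
    fi
    # "-c" reserves 8 cores (1 chiplet) for each GCD assigned to the task.
    echo "-N 1 -n ${ntasks} -c $(( 8 * gpus_per_task )) --gres=gpu:${total_gpus} --gpus-per-task=${gpus_per_task} --gpu-bind=closest"
}

# 3 tasks with 2 GCDs each, as in the example above:
step_options 3 2
# prints: -N 1 -n 3 -c 16 --gres=gpu:6 --gpus-per-task=2 --gpu-bind=closest
```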
Example scripts for: Packing GPU jobs
Packing the execution of 8 independent instances each using 1 GCD (logical/Slurm GPU)
This kind of packing can be performed with the help of an additional job-packing-wrapper script that governs the independent execution of different codes (or different instances of the same code), each run by one of the tasks spawned by srun. (It is important to understand that these instances do not interact with each other via MPI messaging.) The isolation of each code/instance should be handled by the logic included in this job-packing-wrapper script.
In the following example, the job-packing-wrapper creates 8 different output directories and then launches 8 different instances of the hello_nompi code. The output of each execution is saved in a different case directory and file. In this case, the executable does not receive any further parameters but, in practice, users should define the logic for their own purposes and, if needed, include the logic to pass different parameters to each instance.
#Job Packing Wrapper: Each srun-task will use a different instance of the executable.
# For this specific example, each srun-task will run on a different case directory
# and create an isolated log file.
# (Adapt wrapper script for your own purposes.)
echo "Executing job-packing-wrapper instance with caseHere=${caseHere}"
exeName=hello_nompi #Using the no-MPI version of the code
mkdir -p $caseHere
cd $caseHere
${theExe} > ${logHere} 2>&1
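A complete wrapper along these lines might look like the following minimal sketch. The case_ directory prefix follows the listing command used later on this page, and SLURM_PROCID is the global task rank set by srun; the log file name, the run_case function and the way the executable is passed as an argument are assumptions made for illustration.

```bash
#!/bin/bash
# Sketch of a job-packing wrapper (names other than SLURM_PROCID are hypothetical).
# Each srun task runs one independent instance in its own case directory.
run_case() {
    local rank=$1; shift               # task rank, normally ${SLURM_PROCID}
    local caseHere="case_${rank}"      # one directory per instance
    local logHere="log_${rank}.txt"    # assumed log file name
    echo "Executing job-packing-wrapper instance with caseHere=${caseHere}"
    mkdir -p "$caseHere" || return 1
    # Run in a subshell so the caller's working directory is left unchanged.
    ( cd "$caseHere" && "$@" > "$logHere" 2>&1 )
}

# When launched by srun, SLURM_PROCID is set and the wrapper runs the command
# given as its arguments; outside Slurm this block does nothing.
if [ -n "${SLURM_PROCID:-}" ]; then
    run_case "${SLURM_PROCID}" "$@"
fi
```

Different parameters per instance could be selected inside run_case from the rank, for example by indexing into a list of input files.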
Note that, besides the use of the additional job-packing-wrapper, the rest of the script is very similar to the single-node exclusive examples given above. As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. This example considers a job that will make use of the 8 GCDs (logical/Slurm GPUs) on 1 node (8 "allocation-packs"). Each allocation-pack of GPU resources will be used by one of the instances controlled by the job-packing-wrapper. The resource request uses the following two parameters:
#SBATCH --nodes=1 #1 node in this example
#SBATCH --exclusive #All resources of the node are exclusive to this job
# #8 GPUs per node (8 "allocation-packs" in total for the job)
Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header.
The use/management of the allocated resources is controlled by the srun options and some environment variables. From the point of view of srun, this is no different from an MPI job with 8 tasks, but in reality this is not an MPI job: srun will spawn 8 tasks, each of them executing the job-packing-wrapper, and the logic of the job-packing-wrapper allows for 8 independent executions of the desired code(s).
As mentioned above, there are two methods for achieving optimal binding. The method that uses only srun parameters is preferred (method 1), but may not always work and, in that case, the "manual" method (method 2) may be needed. The two scripts for the different methods for optimal binding are in the following tabs:
#!/bin/bash --login
#SBATCH --job-name=JobPacking8GPUsExclusive-bindMethod1
#SBATCH --partition=gpu
#SBATCH --nodes=1 #1 node in this example
#SBATCH --exclusive #All resources of the node are exclusive to this job
# #8 GPUs per node (8 "allocation-packs" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=<yourProject>-gpu #IMPORTANT: use your own project and the -gpu suffix
#Loading needed modules (adapt this for your own purposes):
module load PrgEnv-cray
module load rocm/<VERSION> craype-accel-amd-gfx90a
echo -e "\n\n#------------------------#"
module list
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}
#Job Packing Wrapper: Each srun-task will use a different instance of the executable.
#MPI & OpenMP settings
#No need for 1GPU steps:export MPICH_GPU_SUPPORT_ENABLED=1 #This allows for GPU-aware MPI communication among GPUs
export OMP_NUM_THREADS=1 #This controls the real CPU-cores per task for the executable
#Note: srun needs the full set of resource parameters to be given explicitly for the job step.
# These are independent from the allocation parameters (which are not inherited by srun)
# "-c 8" is used to force allocation of 1 task per CPU chiplet. Then, the REAL number of threads
# for the code SHOULD be defined by the environment variables above.
# (The "-l" option is for displaying, at the beginning of each line, the taskID that generates the output.)
# (The "-u" option is for unbuffered output, so that output is displayed as soon as it's generated.)
echo -e "\n\n#------------------------#"
echo "Test code execution:"
srun -l -u -N 1 -n 8 -c 8 --gres=gpu:8 --gpus-per-task=1 --gpu-bind=closest ./${jobPackingWrapper}
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished jobs steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20
echo -e "\n\n#------------------------#"
echo "Done"
After execution of the main Slurm batch script, 8 case directories are created (each tagged with its corresponding SLURM_PROCID). Within each of them there is a log file corresponding to the execution of the instance that ran according to the logic of the job-packing-wrapper script:
$ sbatch
Submitted batch job 339328
$ startDir=$PWD; for iDir in $(ls -d case_*); do echo $iDir; cd $iDir; ls; cat *; cd $startDir; done
MAIN 000 - OMP 000 - HWT 002 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d1
MAIN 000 - OMP 000 - HWT 009 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d6
MAIN 000 - OMP 000 - HWT 017 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
MAIN 000 - OMP 000 - HWT 025 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID ce
MAIN 000 - OMP 000 - HWT 032 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d9
MAIN 000 - OMP 000 - HWT 044 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID de
MAIN 000 - OMP 000 - HWT 049 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c1
MAIN 000 - OMP 000 - HWT 057 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c6
Comparing the output of each of the instances of the hello_nompi code with the GPU node architecture diagram, it can be seen that the binding of the allocated GCDs (logical/Slurm GPUs) to the L3 cache group chiplets (slurm-sockets) is optimal for each of them.
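The comparison can also be automated. The sketch below hard-codes the chiplet-to-GCD pairing observed in the output above (chiplet 0 with bus ID d1, chiplet 1 with d6, and so on) and flags any line whose core and GCD do not match it. The function name check_binding and the use of awk are illustrative assumptions, the check only applies to tasks with a single visible GCD, and the table should be verified against the architecture diagram before relying on it.

```bash
#!/bin/bash
# Hypothetical checker: reads hello_jobstep / hello_nompi lines on stdin and
# reports whether the GCD bus ID matches the chiplet of the reported core.
# The pairing is taken from the example output on this page (1 GCD per task).
check_binding() {
    awk 'BEGIN {
        # bus[1] pairs with chiplet 0, bus[2] with chiplet 1, and so on.
        split("d1 d6 c9 ce d9 de c1 c6", bus, " ")
    }
    {
        hwt = ""; id = ""
        for (i = 1; i < NF; i++) {
            if ($i == "HWT")        hwt = $(i + 1) + 0
            if ($i == "GPU_Bus_ID") id  = $(i + 1)
        }
        if (hwt == "" || id == "") next     # skip lines without both fields
        chiplet = int(hwt / 8)              # 8 consecutive cores per chiplet
        status = (bus[chiplet + 1] == id) ? "optimal" : "NOT optimal"
        print "HWT " hwt " (chiplet " chiplet ") - GCD " id ": " status
    }'
}

# Example with one line taken from the output above:
echo "MAIN 000 - OMP 000 - HWT 017 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9" | check_binding
# prints: HWT 17 (chiplet 2) - GCD c9: optimal
```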
Related pages