...
These two auxiliary techniques work in coordination to ensure the best possible match of CPU cores and GCDs.
Auxiliary technique 1: Using a wrapper to select 1 different GCD (logical/Slurm GPU) for each of the tasks spawned by srun
This first auxiliary technique uses the following wrapper script:
...
This second auxiliary technique uses an ordered list of CPU cores to be bound to each of the tasks spawned by srun. An example of a "hardcoded" ordered list that would correctly bind the 8 GCDs across the 4 GPU cards in a node is:
CPU_BIND="map_cpu:49,57,17,25,0,9,33,41"
...
According to the node diagram at the top of this page, it is clear that this list consists of 1 CPU core per chiplet. What may not be very intuitive is the ordering. On a second look, it can be seen that the order follows the identification numbers of the GCDs (logical/Slurm GPUs) in the node, so that each CPU core belongs to the chiplet that is directly connected to the corresponding GCD (in order). Then, the set of commands to use for a job that requires 8 single-threaded CPU tasks with 1 GCD (logical/Slurm GPU) per task would be:
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
CPU_BIND="map_cpu:49,57,17,25,0,9,33,41"
srun -N 1 -n 8 -c 8 --gpus-per-node=8 --cpu-bind=${CPU_BIND} ./selectGPU_X.sh myMPIExecutable
This provides optimal binding for a job that uses 8 single-threaded CPU tasks with 1 GCD (logical/Slurm GPU) per task.
For hybrid jobs, that is, jobs that require multiple CPU threads per task, the list needs to be a list of masks instead of CPU core IDs. The use of this list of masks is explained in the next subsection, which also describes an auxiliary script that generates the lists of CPU cores or masks for general cases.
...
For a better understanding of what this script generates and how it is useful, we can use an interactive session. Note that the request parameters given to salloc only include the number of nodes and the number of Slurm GPUs (GCDs) per node, in order to request a number of "allocation packs" (as described at the top of this page). In this case, 3 "allocation packs" are requested:
...
As can be seen, 3 "allocation packs" were requested, and the total amount of allocated resources are written in the output of the scontrol
command, including the 3 GCDs (logical/Slurm GPUs) and 88.32GB of memory. The rocm-smi
command gives a list of the three allocated GPUsdevices, listed locally as GPU:0-BUS_ID:C9
, GPU:1-BUS_ID:D1
& GPU:2-BUS_ID:D6
.
When the generate_CPU_BIND.sh script is used with the parameter map_cpu, it creates a list of CPU cores that can be used in the srun command for optimal binding. In this case, we get: map_cpu:21,2,14 which, in order, corresponds to the slurm-sockets chiplet2,chiplet0,chiplet1; these are the chiplets in direct connection to the C9,D1,D6 GCDs respectively. (Check the GPU node architecture diagram at the top of this page.)
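As a quick illustration (a sketch only; the executable name myMPIExecutable is a placeholder), this generated list can be used directly within the srun command of the same interactive session:
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=1
# Ordered list produced by generate_CPU_BIND.sh map_cpu for this 3-pack allocation
CPU_BIND="map_cpu:21,2,14"
srun -N 1 -n 3 -c 8 --gpus-per-node=3 --cpu-bind=${CPU_BIND} ./selectGPU_X.sh myMPIExecutable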
For jobs that require several threads per CPU task, srun needs a list of masks instead of CPU core IDs. The generate_CPU_BIND.sh script can generate this list when the parameter mask_cpu is used. The script then creates a list of hexadecimal CPU masks that can be used for optimally binding a hybrid job. In this case, we get: mask_cpu:0000000000FF0000,00000000000000FF,000000000000FF00. These masks, in order, activate only the CPU cores of chiplet2, chiplet0 & chiplet1; these are the chiplets in direct connection to the C9,D1,D6 GCDs respectively. (Check the GPU node architecture diagram at the top of this page and external Slurm documentation for a detailed explanation of masks.)
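As an illustration (again a sketch only; the executable name myMPIExecutable is a placeholder), these masks can be passed to srun through the --cpu-bind option, together with the GCD-selection wrapper, for a hybrid run with 4 threads per task:
export MPICH_GPU_SUPPORT_ENABLED=1
export OMP_NUM_THREADS=4
# Ordered list of masks produced by generate_CPU_BIND.sh mask_cpu for this 3-pack allocation;
# each hexadecimal mask enables the 8 cores of one chiplet (chiplet2, chiplet0, chiplet1)
CPU_BIND="mask_cpu:0000000000FF0000,00000000000000FF,000000000000FF00"
srun -N 1 -n 3 -c 8 --gpus-per-node=3 --cpu-bind=${CPU_BIND} ./selectGPU_X.sh myMPIExecutable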
...
In practice, it is common to assign the output of the generate_CPU_BIND.sh script to a variable, which is then used within the srun command. So, for a job that requires 8 single-threaded CPU tasks with 1 GCD (logical/Slurm GPU) per task, the set of commands to be used would be:
...
The explanation of the "manual" method will be completed in the following sections where a very useful code (hello_jobstep
) will be used to confirm optimal (or sub-optimal) binding of GCDs (logical Slurm GPUs) and chiplets for srun
job steps.
...
As mentioned in the previous section, allocation of resources is granted in "allocation packs" with 8 cores (1 chiplet) per GCD. Also briefly mentioned in the previous section is the need to "reserve" chunks of whole chiplets (multiples of 8 CPU cores) in the srun command via the --cpus-per-task (-c) option. However, this option acts more as a "reservation" parameter for binding the srun tasks to whole chiplets, rather than as an indication of the real number of threads to be used by the executable. The real number of threads needs to be controlled with the OpenMP environment variable OMP_NUM_THREADS. In other words, we use --cpus-per-task to make whole chiplets available to each srun task, but OMP_NUM_THREADS to control the real number of threads per srun task.
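For instance, a minimal sketch of this distinction (the executable name myHybridExecutable is a placeholder) for 3 tasks, each with a whole chiplet reserved but only 4 OpenMP threads actually used, would be:
# Reserve a whole chiplet (8 cores) per task with -c 8, but spawn only 4 OpenMP threads per task
export OMP_NUM_THREADS=4
srun -N 1 -n 3 -c 8 --gpus-per-node=3 --gpus-per-task=1 --gpu-bind=closest ./myHybridExecutable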
...
Later in this page, some full examples of batch scripts for the most common scenarios for executing jobs on GPU compute nodes are presented. In order to show how GCDs are bound to the CPU cores assigned to the job, we make use of the hello_jobstep code within these same examples. For this reason, before presenting the full examples, we use this section to explain important details of the test code. (If researchers want to test the code by themselves, this is the forked repository for Pawsey: hello_jobstep repository.)
...
The explanation of the test code will be provided with the output of an interactive session that uses 3 "allocation packs" to get access to 3 GCDs (logical/Slurm GPUs) and 3 full CPU chiplets in different ways.
...
Terminal N. Explaining the use of the "hello_jobstep" code from an salloc session (compiling)
$ cd $MYSCRATCH
$ git clone https://github.com/PawseySC/hello_jobstep.git
Cloning into 'hello_jobstep'...
...
Resolving deltas: 100% (41/41), done.
$ cd hello_jobstep
$ module load PrgEnv-cray craype-accel-amd-gfx90a rocm
$ make hello_jobstep
CC -std=c++11 -fopenmp --rocm-path=/opt/rocm -x hip -D__HIP_ARCH_GFX90A__=1 --offload-arch=gfx90a -I/opt/rocm/include -c hello_jobstep.cpp
CC -fopenmp --rocm-path=/opt/rocm -L/opt/rocm/lib -lamdhip64 hello_jobstep.o -o hello_jobstep
Now check which GPU devices are available in the current allocation, their labels and, more importantly, their BUS_ID:
Terminal N. Explaining the use of the "hello_jobstep" code from an salloc session (list allocated GPUs)
$ rocm-smi --showhw
======================= ROCm System Management Interface =======================
============================ Concise Hardware Info =============================
GPU DID GFX RAS SDMA RAS UMC RAS VBIOS BUS
0 7408 DISABLED ENABLED DISABLED 113-D65201-042 0000:C9:00.0
1 7408 DISABLED ENABLED DISABLED 113-D65201-042 0000:D1:00.0
2 7408 DISABLED ENABLED DISABLED 113-D65201-042 0000:D6:00.0
================================================================================
============================= End of ROCm SMI Log ==============================
...
As can be seen, each MPI task has been assigned a CPU core in a different chiplet. But all three GCDs (logical/Slurm GPUs) that have been allocated are visible to each of the tasks. Although some codes are able to deal with resources presented in this way, this is not the recommended best practice. The recommended best practice is to make only 1 GCD visible to each task and, furthermore, to provide the GCD that is in direct connection to the CPU chiplet that handles the task, for optimal bandwidth between CPU and GCD.
Using hello_jobstep code for testing optimal binding for a pure MPI job (single threaded), 1 GCD (logical/Slurm GPU) per task
Starting from the same allocation as above (3 "allocation packs"), now all the parameters needed to define the correct use of resources are provided to srun. In this case, 3 single-threaded MPI tasks are to be run, each task making use of 1 GCD (logical/Slurm GPU). As described above, there are two methods to achieve optimal binding of the GCDs. The first method only uses Slurm parameters to indicate how resources are to be used by srun. In this case:
...
As can be seen, GPU-BUS_ID:D1 is in direct communication with a CPU core in chiplet0. Also, GPU-BUS_ID:D6 is in direct communication with chiplet1, and GPU-BUS_ID:C9 with chiplet2, resulting in an optimal 1-chiplet to 1-GCD binding.
A similar result can be obtained with the "manual" method for optimal binding. As detailed in the sub-sections above, this method uses a wrapper (selectGPU_X.sh, listed above) to define which GCD (logical/Slurm GPU) is going to be visible to each task, and also uses an ordered list of CPU cores (created with the script generate_CPU_BIND.sh, also described above) to bind the correct CPU core to each task. In this case:
...
There are some differences between the results of the first and second methods of optimal binding. The first difference is the order in which the GCDs and chiplets are assigned to each task; this is not important, as long as the communication between each CPU chiplet and its GCD is optimal, which is the case for both methods. The key difference is in the values of the ROCR_VISIBLE_GPU_IDs. With the first method, these values are always 0, while in the second method they are the values given by the wrapper that "manually" selects the GPUs. This second difference has proven to be important and may be the reason why "manual" binding is the only option for codes that rely on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and that attempt to use GPU-to-GPU enabled MPI communication.
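For reference, a minimal sketch of what such a selection wrapper typically looks like (an illustration only, not necessarily identical to the selectGPU_X.sh wrapper listed above) is:
#!/bin/bash
# Assumption: one task per GCD, so the node-local task ID can index the GCD directly.
# Make only the GCD matching this task's local ID visible to the task:
export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID
# Then execute the command passed as arguments (e.g. the MPI executable):
exec "$@"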
Using hello_jobstep code for testing optimal binding for a hybrid job (MPI + several OpenMP threads), 1 GCD (logical/Slurm GPU) per MPI task
If the code is hybrid on the CPU side and needs several OpenMP CPU threads, we use the OMP_NUM_THREADS environment variable to control the number of threads. So, again, starting from the previous session with 3 "allocation packs", consider a case with 3 MPI tasks, 4 OpenMP threads per task and 1 GCD (logical/Slurm GPU) per task:
Terminal N. Testing srun settings (method 1) for optimal binding for a case with 4 CPU threads per task and 1 GPU per task
$ export OMP_NUM_THREADS=4; srun -N 1 -n 3 -c 8 --gpus-per-node=3 --gpus-per-task=1 --gpu-bind=closest ./hello_jobstep | sort -n
MPI 000 - OMP 000 - HWT 000 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d1
MPI 000 - OMP 001 - HWT 003 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d1
MPI 000 - OMP 002 - HWT 005 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d1
MPI 000 - OMP 003 - HWT 006 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d1
MPI 001 - OMP 000 - HWT 008 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d6
MPI 001 - OMP 001 - HWT 011 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d6
MPI 001 - OMP 002 - HWT 013 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d6
MPI 001 - OMP 003 - HWT 014 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID d6
MPI 002 - OMP 000 - HWT 016 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
MPI 002 - OMP 001 - HWT 019 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
MPI 002 - OMP 002 - HWT 021 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
MPI 002 - OMP 003 - HWT 022 - Node nid001004 - RunTime_GPU_ID 0 - ROCR_VISIBLE_GPU_ID 0 - GPU_Bus_ID c9
...
From the output of the hello_jobstep
code, it can be noted that the OpenMP threads use CPU-cores in the same CPU chiplet as the main thread (or MPI task). And all the CPU-cores of the corresponding chiplet are in direct communication with the GCD (logical/Slurm GPU) that has a direct physical connection to it. (Check the architecture diagram at the top of this page.)
Again, there is a difference in the values of the ROCR_VISIBLE_GPU_IDs in the results of the two methods. With the first method, these values are always 0, while in the second method they are the values given by the wrapper that "manually" selects the GCDs (logical/Slurm GPUs). This difference has proven to be important and may be the reason why "manual" binding is the only option for codes that rely on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and that attempt to use GPU-to-GPU enabled MPI communication.
...
In this section, a series of example Slurm job scripts are presented so that users can take them as a starting point for preparing their own scripts. The examples make use of most of the important concepts, tools and techniques explained in the previous sections, so we encourage users to read the top sections of this page first.
Exclusive Node Multi-GPU job: 8 GCDs (logical/Slurm GPUs), each of them controlled by one MPI task
As for all scripts, we provide the parameters for requesting the necessary "allocation packs" for the job. This example considers a job that will make use of the 8 GCDs (logical/Slurm GPUs) on 1 node (8 "allocation packs"). The resource request uses the following two parameters:
...
A. Method 1: Optimal binding using srun parameters
For optimal binding using srun parameters, the options --gpus-per-task & --gpu-bind=closest need to be used:
Listing N. exampleScript_1NodeExclusive_8GPUs_bindMethod1.sh
Now, let's take a look at the output after executing the script:
Terminal N. Output for 8 GPUs job exclusive access
The output of the hello_jobstep code tells us that the job ran on node nid001000 and that 8 MPI tasks were spawned. Each of the MPI tasks has only 1 CPU-core assigned to it (thanks to the use of the OMP_NUM_THREADS environment variable in the script), identified by the HWT number. Also, each of the MPI tasks has only 1 visible GCD (logical/Slurm GPU). The hardware identification of the GCD is done via the Bus_ID (as the other GPU_IDs are not physical but relative to the job). After checking the architecture diagram at the top of this page, it can be clearly seen that each of the assigned CPU-cores is on a different L3 cache group (chiplet/slurm-socket). More importantly, the GCD assigned to each of the MPI tasks is the one directly connected to that chiplet, so the binding is optimal:
- CPU core "001" is on chiplet:0 and directly connected to GPU with Bus_ID:D1
- CPU core "008" is on chiplet:1 and directly connected to GPU with Bus_ID:D6
- CPU core "016" is on chiplet:2 and directly connected to GPU with Bus_ID:C9
- CPU core "024" is on chiplet:3 and directly connected to GPU with Bus_ID:CE
- CPU core "032" is on chiplet:4 and directly connected to GPU with Bus_ID:D9
- CPU core "040" is on chiplet:5 and directly connected to GPU with Bus_ID:DE
- CPU core "048" is on chiplet:6 and directly connected to GPU with Bus_ID:C1
- CPU core "056" is on chiplet:7 and directly connected to GPU with Bus_ID:C6
According to the architecture diagram, this binding configuration is optimal. Note that method 1 may fail for some applications: this first method is simpler, but may not work for all codes. "Manual" binding (method 2) may be the only reliable method for codes that rely on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and that attempt to use GPU-to-GPU enabled MPI communication. Click on the tab above to read the script and output for the other method of GPU binding.
A. Method 2: "Manual" optimal binding of GPUs and chiplets
For "manual" binding, two auxiliary techniques need to be applied: 1) use of a wrapper that selects the correct GCD (logical/Slurm GPU) for each task, and 2) generation of an ordered list to be used in the --cpu-bind option of srun:
Listing N. exampleScript_1NodeExclusive_8GPUs_bindMethod2.sh
Note that the wrapper for selecting the GCDs (logical GPUs) is created with a redirection "trick" to the cat command. Also note that its name uses the SLURM_JOBID environment variable to make the wrapper unique to this job, and that the wrapper is deleted when execution is finalised. Now, let's take a look at the output after executing the script:
Terminal N. Output for 8 GPUs job exclusive access
The output of the hello_jobstep code tells us that the job ran on node nid001000 and that 8 MPI tasks were spawned. Each of the MPI tasks has only 1 CPU-core assigned to it (thanks to the use of the OMP_NUM_THREADS environment variable in the script), identified by the HWT number. Also, each of the MPI tasks has only 1 visible GCD (logical/Slurm GPU). The hardware identification of the GCD is done via the Bus_ID (as the other GPU_IDs are not physical but relative to the job). After checking the architecture diagram at the top of this page, it can be clearly seen that each of the assigned CPU-cores is on a different L3 cache group (chiplet/slurm-socket). More importantly, the GCD assigned to each of the MPI tasks is the one directly connected to that chiplet, so the affinity is optimal:
- CPU core "054" is on chiplet:6 and directly connected to GPU with Bus_ID:C1
- CPU core "063" is on chiplet:7 and directly connected to GPU with Bus_ID:C6
- CPU core "018" is on chiplet:2 and directly connected to GPU with Bus_ID:C9
- CPU core "026" is on chiplet:3 and directly connected to GPU with Bus_ID:CE
- CPU core "006" is on chiplet:0 and directly connected to GPU with Bus_ID:D1
- CPU core "013" is on chiplet:1 and directly connected to GPU with Bus_ID:D6
- CPU core "033" is on chiplet:4 and directly connected to GPU with Bus_ID:D9
- CPU core "047" is on chiplet:5 and directly connected to GPU with Bus_ID:DE
According to the architecture diagram, this binding configuration is optimal. Click on the tab above to read the script and output for the other method of GPU binding.
N Exclusive Nodes Multi-GPU job: 8*N GCDs (logical/Slurm GPUs), each of them controlled by one MPI task
As for all scripts, we provide the parameters for requesting the necessary "allocation packs" for the job. The same procedure mentioned above for the single exclusive node job should be applied for multi-node exclusive jobs. The only difference when requesting resources is the number of exclusive nodes requested. So, for example, for a job requiring 2 exclusive nodes (16 GCDs (logical/Slurm GPUs), or 16 "allocation packs") the resource request uses the following two parameters:
...
B. Method 1: Optimal binding using srun parameters
For optimal binding using srun parameters, the options --gpus-per-task & --gpu-bind=closest need to be used:
Listing N. exampleScript_2NodesExclusive_16GPUs_bindMethod1.sh
Now, let's take a look at the output after executing the script:
Terminal N. Output for 16 GPUs job (2 nodes) exclusive access
According to the architecture diagram, this binding configuration is optimal. Note that method 1 may fail for some applications: this first method is simpler, but may not work for all codes. "Manual" binding (method 2) may be the only reliable method for codes that rely on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and that attempt to use GPU-to-GPU enabled MPI communication. Click on the tab above to read the script and output for the other method of GPU binding.
B. Method 2: "Manual" optimal binding of GPUs and chiplets
For "manual" binding, two auxiliary techniques need to be applied: 1) use of a wrapper that selects the correct GPU for each task, and 2) generation of an ordered list to be used in the --cpu-bind option of srun:
Listing N. exampleScript_2NodesExclusive_16GPUs_bindMethod2.sh
Note that the wrapper for selecting the GPUs is created with a redirection "trick" to the cat command. Also note that its name uses the SLURM_JOBID environment variable to make the wrapper unique to this job, and that the wrapper is deleted when execution is finalised. Now, let's take a look at the output after executing the script:
Terminal N. Output for 16 GPUs job (2 nodes) exclusive access
According to the architecture diagram, this binding configuration is optimal. Click on the tab above to read the script and output for the other method of GPU binding.
...
Shared node 1 GPU job
Jobs that need only 1 GCD (logical/Slurm GPU) for their execution will share the GPU compute node with other jobs. That is, they will run in shared access, which is the default, so no request for exclusive access is performed.
...
The output of the hello_jobstep code tells us that the CPU-core "002" and the GPU with Bus_ID:D1 were utilised by the job. Optimal binding is guaranteed for a single "allocation pack", as the memory, CPU chiplet and GCD within each pack are already optimally matched.
Shared node 3 MPI tasks each controlling 1 GCD (logical/Slurm GPU)
As for all scripts, we provide the parameters for requesting the necessary "allocation packs" for the job. In this case we ask for 3 allocation packs with:
...
C. Method 1: Optimal binding using srun parameters
For optimal binding using srun parameters, the options --gpus-per-task & --gpu-bind=closest need to be used:
Listing N. exampleScript_1NodeShared_3GPUs_bindMethod1.sh
Now, let's take a look at the output after executing the script:
Terminal N. Output for 3 GPUs job shared access. Method 1 for optimal binding.
The output of the hello_jobstep code tells us that the job ran on node nid001004 and that 3 MPI tasks were spawned. Each of the MPI tasks has only 1 CPU-core assigned to it (thanks to the use of the OMP_NUM_THREADS environment variable in the script), identified by the HWT number. Also, each of the MPI tasks has only 1 visible GCD (logical/Slurm GPU). The hardware identification of the GCD is done via the Bus_ID (as the other GPU_IDs are not physical but relative to the job). After checking the architecture diagram at the top of this page, it can be clearly seen that each of the assigned CPU-cores is on a different L3 cache group (chiplet/slurm-socket). More importantly, the GCD assigned to each of the MPI tasks is the one directly connected to that chiplet, so the binding is optimal:
- CPU core "001" is on chiplet:0 and directly connected to GCD (logical GPU) with Bus_ID:D1
- CPU core "008" is on chiplet:1 and directly connected to GCD (logical GPU) with Bus_ID:D6
- CPU core "016" is on chiplet:2 and directly connected to GCD (logical GPU) with Bus_ID:C9
According to the architecture diagram, this binding configuration is optimal. Note that method 1 may fail for some applications: this first method is simpler, but may not work for all codes. "Manual" binding (method 2) may be the only reliable method for codes that rely on OpenMP or OpenACC pragmas for moving data from/to host to/from GPU and that attempt to use GPU-to-GPU enabled MPI communication. Click on the tab above to read the script and output for the other method of GPU binding.
C. Method 2: "Manual" optimal binding of GPUs and chiplets
For "manual" binding, two auxiliary techniques need to be applied: 1) use of a wrapper that selects the correct GPU for each task, and 2) generation of an ordered list to be used in the --cpu-bind option of srun:
Listing N. exampleScript_1NodeShared_3GPUs_bindMethod2.sh
Note that the wrapper for selecting the GCDs (logical/Slurm GPUs) is created with a redirection "trick" to the cat command. Also note that its name uses the SLURM_JOBID environment variable to make the wrapper unique to this job, and that the wrapper is deleted when execution is finalised. Now, let's take a look at the output after executing the script:
Terminal N. Output for 3 GPUs job shared access. "Manual" method (method 2) for optimal binding.
The output of the hello_jobstep code tells us that the job ran on node nid001004 and that 3 MPI tasks were spawned. Each of the MPI tasks has only 1 CPU-core assigned to it (thanks to the use of the OMP_NUM_THREADS environment variable in the script), identified by the HWT number. Also, each of the MPI tasks has only 1 visible GCD (logical/Slurm GPU). The hardware identification of the GCD is done via the Bus_ID (as the other GPU_IDs are not physical but relative to the job). After checking the architecture diagram at the top of this page, it can be clearly seen that each of the assigned CPU-cores is on a different L3 cache group (chiplet/slurm-socket). More importantly, the GCD assigned to each of the MPI tasks is the one directly connected to that chiplet, so the binding is optimal:
- CPU core "019" is on chiplet:2 and directly connected to GCD (logical GPU) with Bus_ID:C9
- CPU core "002" is on chiplet:0 and directly connected to GCD (logical GPU) with Bus_ID:D1
- CPU core "009" is on chiplet:1 and directly connected to GCD (logical GPU) with Bus_ID:D6
According to the architecture diagram, this binding configuration is optimal. Click on the tab above to read the script and output for the other method of GPU binding.
...