...
In the following example, we use 3 GCDs (logical/slurm GPUs) (1 per MPI task) and the number of CPU threads per task is 5. As for all scripts, we provide the parameters for requesting the necessary "allocation packs" for the job. In this case we ask for 3 allocation packs with:
...
D. Method 1: Optimal binding using srun parameters
For optimal binding using srun parameters, the options "--gpus-per-task" and "--gpu-bind=closest" need to be used:
Listing N. exampleScript_1NodeShared_Hybrid5CPU_3GPUs_bindMethod1.sh
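Since the listing body is not reproduced here, the key elements of such a script can be sketched as follows. This is a minimal sketch: the resource-request lines and the per-task core count are assumptions, and only the --gpus-per-task and --gpu-bind=closest options are taken from the text above; adapt the rest to your site's allocation-pack conventions.

```shell
#!/bin/bash
# Sketch of a Method-1 batch script (fragment). The resource-request lines
# and the "-c 8" core count are assumptions for illustration;
# --gpus-per-task and --gpu-bind=closest are the options this method relies on.
#SBATCH --nodes=1
#SBATCH --gres=gpu:3          # 3 allocation packs (3 GCDs) on a shared node

export OMP_NUM_THREADS=5      # 5 CPU threads per MPI task

srun -N 1 -n 3 -c 8 --gres=gpu:3 \
     --gpus-per-task=1 --gpu-bind=closest \
     ./hello_jobstep
```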
Now, let's take a look at the output after executing the script:
Terminal N. Output for hybrid job with 3 tasks, each with 5 CPU threads and 1 GPU (shared access). Method 1 for optimal binding.
The output of the hello_jobstep code tells us that the job ran on node nid001004 and that 3 MPI tasks were spawned. Each MPI task has only 5 CPU cores assigned to it (via the OMP_NUM_THREADS environment variable in the script), and each core can be identified by its HWT number. Also, each of the threads has only 1 visible GCD (logical/Slurm GPU). The hardware identification of the GPU is done via the Bus_ID (the other GPU_IDs are not physical but relative to the job). Checking against the architecture diagram at the top of this page, it can be clearly seen that each of the CPU cores assigned to the job is on a different L3 cache group chiplet (slurm-socket). More importantly, the GPU assigned to each MPI task is the GPU directly connected to that chiplet, so the binding is optimal. Note that Method 1 may fail for some applications: this first method is simpler, but may not work for all codes. "Manual" binding (Method 2) may be the only reliable method for codes that rely on OpenMP or OpenACC pragmas to move data between host and GPU and that attempt to use GPU-to-GPU enabled MPI communication. Click the tab above to read the script and output for the other method of GPU binding.
D. Method 2: "Manual" optimal binding of GPUs and chiplets
Note: for hybrid jobs on the CPU side, use mask_cpu for the cpu-bind option and NOT map_cpu, and control the number of CPU threads per task with OMP_NUM_THREADS.

For "manual" binding, two auxiliary techniques need to be performed: 1) use of a wrapper that selects the correct GPU, and 2) generation of an ordered list to be used in the --cpu-bind option of srun. In this case, the list needs to be created using the mask_cpu parameter:
Listing N. exampleScript_1NodeShared_Hybrid5CPU_3GPUs_bindMethod2.sh
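The wrapper-creation idiom used in the listing can be sketched as follows. This is a minimal sketch, assuming one GCD (logical/Slurm GPU) per MPI task selected by the task's local rank; the wrapper name and the use of ROCR_VISIBLE_DEVICES are assumptions following common practice on AMD-GPU systems, and the demo invocation at the end is for illustration only.

```shell
#!/bin/bash
# Minimal sketch of a per-job GPU-selection wrapper (Method 2).
# Assumption: one GCD (logical/Slurm GPU) per MPI task, chosen by local rank.
SLURM_JOBID=${SLURM_JOBID:-demo}   # set by Slurm inside a real job

# Create the wrapper with a redirection "trick" to the cat command;
# SLURM_JOBID in the name makes it unique to this job.
cat << EOF > selectGPU_${SLURM_JOBID}.sh
#!/bin/bash
export ROCR_VISIBLE_DEVICES=\$SLURM_LOCALID
exec "\$@"
EOF
chmod +x selectGPU_${SLURM_JOBID}.sh

# In the real script the wrapper precedes the executable in the srun line:
#   srun ... --cpu-bind=mask_cpu:<list> ./selectGPU_${SLURM_JOBID}.sh ./hello_jobstep

# Local demonstration that the wrapper exports the expected GPU for rank 2:
gpu_out=$(SLURM_LOCALID=2 ./selectGPU_${SLURM_JOBID}.sh sh -c 'echo "GPU: $ROCR_VISIBLE_DEVICES"')
echo "$gpu_out"   # prints: GPU: 2

rm -f selectGPU_${SLURM_JOBID}.sh  # wrapper is deleted when execution is finalised
```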
Note that the wrapper for selecting the GPUs (logical/Slurm GPUs) is created with a redirection "trick" to the cat command. Also note that its name uses the SLURM_JOBID environment variable, which makes the wrapper unique to this job, and that the wrapper is deleted when execution is finalised. Now, let's take a look at the output after executing the script:
Terminal N. Output for hybrid job with 3 tasks, each with 5 CPU threads and 1 GPU (shared access). "Manual" method (Method 2) for optimal binding.
The output of the hello_jobstep code tells us that the job ran on node nid001004 and that 3 MPI tasks were spawned. Each MPI task has 5 CPU cores assigned to it (via the OMP_NUM_THREADS environment variable in the script), and each core can be identified by its HWT number. Also, each thread has only 1 visible GCD (logical/Slurm GPU). The hardware identification of the GPU is done via the Bus_ID (the other GPU_IDs are not physical but relative to the job). Checking against the architecture diagram at the top of this page, it can be clearly seen that each of the CPU cores assigned to the job is on a different L3 cache group chiplet (slurm-socket). More importantly, the GPU assigned to each MPI task is the GPU directly connected to that chiplet, so the binding is optimal. Click the tab above to read the script and output for the other method of GPU binding.
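The ordered mask_cpu list mentioned above can also be generated programmatically. Below is a minimal sketch, assuming 8 CPU cores per L3 cache group chiplet and one task bound to the first 5 cores of each chiplet; the core-to-chiplet layout is an assumption, so check it against your node's architecture diagram before use.

```shell
#!/bin/bash
# Sketch: build a mask_cpu list for 3 tasks with 5 CPU threads each,
# binding each task to the first 5 cores of a different 8-core chiplet.
# The 8-cores-per-chiplet layout is an assumption; adapt to your node.
CORES_PER_CHIPLET=8
THREADS_PER_TASK=5
NTASKS=3

masks=""
for (( t=0; t<NTASKS; t++ )); do
  base=$(( t * CORES_PER_CHIPLET ))       # first core of this task's chiplet
  mask=0
  for (( i=0; i<THREADS_PER_TASK; i++ )); do
    mask=$(( mask | (1 << (base + i)) ))  # set one bit per CPU thread
  done
  masks="${masks:+${masks},}$(printf '0x%x' "${mask}")"
done

echo "${masks}"   # prints: 0x1f,0x1f00,0x1f0000
# The list is then passed to srun as: --cpu-bind=mask_cpu:${masks}
```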
Example scripts for: Packing GPU jobs
Pack the execution of 8 independent instances each using 1 GCD (logical/Slurm GPU)
This kind of packing can be performed with the help of an additional packing-wrapper script (jobPackWrapper.sh) that controls the independent execution of different codes (or different instances of the same code) to be run by each of the tasks spawned by srun. (It is important to understand that these instances do not interact with each other via MPI messaging.) The isolation of each code/instance should be performed via the logic included in this packing-wrapper script.
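A minimal sketch of what such a packing wrapper might contain is shown below. The instance-selection logic via SLURM_PROCID, the directory names, and the commented-out executable are assumptions for illustration, not the actual jobPackWrapper.sh contents.

```shell
#!/bin/bash
# Sketch of a packing wrapper (jobPackWrapper.sh): each srun task runs its
# own independent instance, selected by the task rank. Directory names and
# the commented-out executable are assumptions for illustration.
TASK_ID=${SLURM_PROCID:-0}   # srun sets SLURM_PROCID for each spawned task

# Isolate each instance in its own working directory:
caseDir="case_${TASK_ID}"
mkdir -p "${caseDir}"
cd "${caseDir}"

echo "Task ${TASK_ID} running its own instance in ${caseDir}"
# exec ./hello_nompi > log_${TASK_ID}.out 2>&1   # real executable goes here
```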
...
Note that, besides the use of the additional packing-wrapper, the rest of the script is very similar to the single-node exclusive examples given above. As for all scripts, we provide the parameters for requesting the necessary "allocation packs" for the job. This example considers a job that will make use of the 8 GCDs (logical/Slurm GPUs) on 1 node (8 "allocation packs"). Each pack will be used by one of the instances controlled by the packing-wrapper. The resource request uses the following two parameters:
...
Comparing the output of each of the instances of the hello_nompi code to the GPU node architecture diagram, it can be seen that the binding of the allocated GCDs (logical/Slurm GPUs) to the L3 cache group chiplets (slurm-sockets) is optimal for each of them.
...