...
The use and management of the allocated resources is controlled by the srun
options and some environment variables. As mentioned above, there are two methods for achieving optimal binding. The method that uses only srun
parameters (method 1) is preferred, but may not always work; in that case, the "manual" method (method 2) may be needed. The scripts for the two binding methods are in the following tabs:
A. Method 1: Optimal binding using srun parameters
For optimal binding using srun parameters, the options --gpus-per-task and --gpu-bind=closest need to be used:
Listing N. exampleScript_1NodeExclusive_8GPUs_bindMethod1.sh
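As a rough sketch of what such a script can look like (a minimal sketch only, assuming one exclusive node with 8 GPUs and 8 chiplets of 8 cores; the resource-request lines are illustrative and may differ from the listing above):

#!/bin/bash -l
# Illustrative sketch; partition, account and other request flags are omitted.
#SBATCH --nodes=1
#SBATCH --exclusive            # whole node: all 8 GPUs and all 64 cores
#SBATCH --ntasks-per-node=8    # 1 MPI task per GPU
#SBATCH --gpus-per-node=8
#SBATCH --time=00:05:00

export OMP_NUM_THREADS=1       # 1 CPU-core (1 thread) per MPI task

# -c 8 reserves a full chiplet per task; --gpus-per-task and --gpu-bind=closest
# then give each task the GPU directly attached to its chiplet.
srun -N 1 -n 8 -c 8 --gpus-per-task=1 --gpu-bind=closest ./hello_jobstep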
Now, let's take a look at the output after executing the script:
Terminal N. Output for 8 GPUs job, exclusive access
The output of the hello_jobstep code tells us that the job ran on node nid001000 and that 8 MPI tasks were spawned. Each of the MPI tasks has only 1 CPU-core assigned to it (enforced with the OMP_NUM_THREADS environment variable in the script) and can be identified by its HWT number. Also, each of the MPI tasks has only 1 visible GPU. The hardware identification of the GPU is done via the Bus_ID (the other GPU IDs are not physical but relative to the job). Checking against the architecture diagram at the top of this page, it can be clearly seen that each of the assigned CPU-cores is on a different L3 cache group chiplet (slurm-socket). More importantly, the GPU assigned to each MPI task is the GPU directly connected to that chiplet, so the binding is optimal:
- CPU core "001" is on chiplet:0 and directly connected to GPU with Bus_ID:D1
- CPU core "008" is on chiplet:1 and directly connected to GPU with Bus_ID:D6
- CPU core "016" is on chiplet:2 and directly connected to GPU with Bus_ID:C9
- CPU core "024" is on chiplet:3 and directly connected to GPU with Bus_ID:CE
- CPU core "032" is on chiplet:4 and directly connected to GPU with Bus_ID:D9
- CPU core "040" is on chiplet:5 and directly connected to GPU with Bus_ID:DE
- CPU core "048" is on chiplet:6 and directly connected to GPU with Bus_ID:C1
- CPU core "056" is on chiplet:7 and directly connected to GPU with Bus_ID:C6
According to the architecture diagram, this binding configuration is optimal. Method 1 may fail for some applications. This first method is simpler, but may not work for all codes; "manual" binding (method 2) may be the only reliable method for codes that rely on OpenMP or OpenACC pragmas to move data between host and GPU and that attempt to use GPU-to-GPU enabled MPI communication. Click on the tab above to read the script and output for the other method of GPU binding.
A. Method 2: "Manual" optimal binding of GPUs and chiplets
| For "manual" binding, two auxiliary techniques need to be performed: 1) use of a wrapper that selects the correct GPU and 2) generate an ordered list to be used in the --cpu-bind option of srun : 900px
bashEmacsListing N. exampleScript_1NodeExclusive_8GPUs_bindMethod2.shtrue
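As a rough sketch of the two techniques (the wrapper name, the ROCm variable ROCR_VISIBLE_DEVICES and the exact core list are illustrative assumptions and may differ from the listing above; the core order must follow the node's actual GPU-to-chiplet mapping):

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8

export OMP_NUM_THREADS=1

# 1) Wrapper that makes each task see only "its" GPU, created with the
#    redirection "trick" to cat; SLURM_JOBID makes the file name unique.
#    (ROCR_VISIBLE_DEVICES assumes AMD/ROCm GPUs.)
cat << EOF > select_gpu_${SLURM_JOBID}.sh
#!/bin/bash
export ROCR_VISIBLE_DEVICES=\$SLURM_LOCALID
exec \$*
EOF
chmod +x ./select_gpu_${SLURM_JOBID}.sh

# 2) Ordered list of CPU cores, one per chiplet, ordered so that task i sits on
#    the chiplet directly attached to GPU i (illustrative values only).
CPU_BIND="map_cpu:48,56,16,24,0,8,32,40"

srun -N 1 -n 8 --cpu-bind=${CPU_BIND} ./select_gpu_${SLURM_JOBID}.sh ./hello_jobstep

rm -f ./select_gpu_${SLURM_JOBID}.sh   # wrapper removed once execution finishes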
Note that the wrapper for selecting the GPUs is created with a redirection "trick" to the cat command. Also note that its name uses the SLURM_JOBID environment variable to make the wrapper unique to this job, and that the wrapper is deleted when execution finishes. Now, let's take a look at the output after executing the script:
Terminal N. Output for 8 GPUs job, exclusive access
The output of the hello_jobstep code tells us that the job ran on node nid001000 and that 8 MPI tasks were spawned. Each of the MPI tasks has only 1 CPU-core assigned to it (enforced with the OMP_NUM_THREADS environment variable in the script) and can be identified by its HWT number. Also, each of the MPI tasks has only 1 visible GPU. The hardware identification of the GPU is done via the Bus_ID (the other GPU IDs are not physical but relative to the job). Checking against the architecture diagram at the top of this page, it can be clearly seen that each of the assigned CPU-cores is on a different L3 cache group chiplet (slurm-socket). More importantly, the GPU assigned to each MPI task is the GPU directly connected to that chiplet, so the binding is optimal:
- CPU core "054" is on chiplet:6 and directly connected to GPU with Bus_ID:C1
- CPU core "063" is on chiplet:7 and directly connected to GPU with Bus_ID:C6
- CPU core "018" is on chiplet:2 and directly connected to GPU with Bus_ID:C9
- CPU core "026" is on chiplet:3 and directly connected to GPU with Bus_ID:CE
- CPU core "006" is on chiplet:0 and directly connected to GPU with Bus_ID:D1
- CPU core "013" is on chiplet:1 and directly connected to GPU with Bus_ID:D6
- CPU core "033" is on chiplet:4 and directly connected to GPU with Bus_ID:D9
- CPU core "047" is on chiplet:5 and directly connected to GPU with Bus_ID:DE
According to the architecture diagram, this binding configuration is optimal. Click on the tab above to read the script and output for the other method of GPU binding.
...
The use and management of the allocated resources is controlled by the srun
options and some environment variables. As mentioned above, there are two methods for achieving optimal binding. The method that uses only srun
parameters (method 1) is preferred, but may not always work; in that case, the "manual" method (method 2) may be needed. The scripts for the two binding methods are in the following tabs:
B. Method 1: Optimal binding using srun parameters
For optimal binding using srun parameters, the options --gpus-per-task and --gpu-bind=closest need to be used:
Listing N. exampleScript_2NodesExclusive_16GPUs_bindMethod1.sh
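Compared with the single-node case, only the amount of requested resources changes; a minimal sketch (with illustrative request values, which may differ from the listing above) could look like:

#!/bin/bash -l
#SBATCH --nodes=2              # 2 exclusive nodes
#SBATCH --exclusive
#SBATCH --ntasks-per-node=8    # 8 MPI tasks per node, 16 in total
#SBATCH --gpus-per-node=8

export OMP_NUM_THREADS=1

srun -N 2 -n 16 -c 8 --gpus-per-task=1 --gpu-bind=closest ./hello_jobstep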
Now, let's take a look at the output after executing the script:
Terminal N. Output for 16 GPUs job (2 nodes), exclusive access
According to the architecture diagram, this binding configuration is optimal. Method 1 may fail for some applications. This first method is simpler, but may not work for all codes; "manual" binding (method 2) may be the only reliable method for codes that rely on OpenMP or OpenACC pragmas to move data between host and GPU and that attempt to use GPU-to-GPU enabled MPI communication. Click on the tab above to read the script and output for the other method of GPU binding.
B. Method 2: "Manual" optimal binding of GPUs and chiplets
| For "manual" binding, two auxiliary techniques need to be performed: 1) use of a wrapper that selects the correct GPU and 2) generate an ordered list to be used in the --cpu-bind option of srun : 900px
bashEmacsListing N. exampleScript_2NodesExclusive_16GPUs_bindMethod2.shtrue
Note that the wrapper for selecting the GPUs is created with a redirection "trick" to the cat command. Also note that its name uses the SLURM_JOBID environment variable to make the wrapper unique to this job, and that the wrapper is deleted when execution finishes. Now, let's take a look at the output after executing the script:
Terminal N. Output for 16 GPUs job (2 nodes), exclusive access
According to the architecture diagram, this binding configuration is optimal. Click on the tab above to read the script and output for the other method of GPU binding.
...
The use and management of the allocated resources is controlled by the srun
options and some environment variables. As mentioned above, there are two methods for achieving optimal binding. The method that uses only srun
parameters (method 1) is preferred, but may not always work; in that case, the "manual" method (method 2) may be needed. The scripts for the two binding methods are in the following tabs:
C. Method 1: Optimal binding using srun parameters
For optimal binding using srun parameters, the options --gpus-per-task and --gpu-bind=closest need to be used:
Listing N. exampleScript_1NodeShared_3GPUs_bindMethod1.sh
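For shared access only part of the node is requested; a minimal sketch (assuming 8 cores are granted per requested GPU, with illustrative request flags that may differ from the listing above):

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=3             # 1 MPI task per requested GPU
#SBATCH --gpus-per-node=3      # only 3 of the 8 GPUs; the node remains shared

export OMP_NUM_THREADS=1

srun -n 3 -c 8 --gpus-per-task=1 --gpu-bind=closest ./hello_jobstep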
Now, let's take a look at the output after executing the script:
Terminal N. Output for 3 GPUs job, shared access. Method 1 for optimal binding.
The output of the hello_jobstep code tells us that the job ran on node nid001004 and that 3 MPI tasks were spawned. Each of the MPI tasks has only 1 CPU-core assigned to it (enforced with the OMP_NUM_THREADS environment variable in the script) and can be identified by its HWT number. Also, each of the MPI tasks has only 1 visible GPU. The hardware identification of the GPU is done via the Bus_ID (the other GPU IDs are not physical but relative to the job). Checking against the architecture diagram at the top of this page, it can be clearly seen that each of the assigned CPU-cores is on a different L3 cache group chiplet (slurm-socket). More importantly, the GPU assigned to each MPI task is the GPU directly connected to that chiplet, so the binding is optimal:
- CPU core "001" is on chiplet:0 and directly connected to GPU with Bus_ID:D1
- CPU core "008" is on chiplet:1 and directly connected to GPU with Bus_ID:D6
- CPU core "016" is on chiplet:2 and directly connected to GPU with Bus_ID:C9
According to the architecture diagram, this binding configuration is optimal. Method 1 may fail for some applications. This first method is simpler, but may not work for all codes; "manual" binding (method 2) may be the only reliable method for codes that rely on OpenMP or OpenACC pragmas to move data between host and GPU and that attempt to use GPU-to-GPU enabled MPI communication. Click on the tab above to read the script and output for the other method of GPU binding.
C. Method 2: "Manual" optimal binding of GPUs and chiplets
| For "manual" binding, two auxiliary techniques need to be performed: 1) use of a wrapper that selects the correct GPU and 2) generate an ordered list to be used in the --cpu-bind option of srun : 900px
bashEmacsListing N. exampleScript_1NodeShared_3GPUs_bindMethod2.shtrue
Note that the wrapper for selecting the GPUs is created with a redirection "trick" to the cat command. Also note that its name uses the SLURM_JOBID environment variable to make the wrapper unique to this job, and that the wrapper is deleted when execution finishes. Now, let's take a look at the output after executing the script:
Terminal N. Output for 3 GPUs job, shared access. "Manual" method (method 2) for optimal binding.
The output of the hello_jobstep code tells us that the job ran on node nid001004 and that 3 MPI tasks were spawned. Each of the MPI tasks has only 1 CPU-core assigned to it (enforced with the OMP_NUM_THREADS environment variable in the script) and can be identified by its HWT number. Also, each of the MPI tasks has only 1 visible GPU. The hardware identification of the GPU is done via the Bus_ID (the other GPU IDs are not physical but relative to the job). Checking against the architecture diagram at the top of this page, it can be clearly seen that each of the assigned CPU-cores is on a different L3 cache group chiplet (slurm-socket). More importantly, the GPU assigned to each MPI task is the GPU directly connected to that chiplet, so the binding is optimal:
- CPU core "019" is on chiplet:2 and directly connected to GPU with Bus_ID:C9
- CPU core "002" is on chiplet:0 and directly connected to GPU with Bus_ID:D1
- CPU core "009" is on chiplet:1 and directly connected to GPU with Bus_ID:D6
According to the architecture diagram, this binding configuration is optimal. Click on the tab above to read the script and output for the other method of GPU binding.
...
The use and management of the allocated resources is controlled by the srun
options and some environment variables. As mentioned above, there are two methods for achieving optimal binding. The method that uses only srun
parameters (method 1) is preferred, but may not always work; in that case, the "manual" method (method 2) may be needed. The scripts for the two binding methods are in the following tabs:
D. Method 1: Optimal binding using srun parameters
For optimal binding using srun parameters, the options --gpus-per-task and --gpu-bind=closest need to be used:
Listing N. exampleScript_1NodeShared_Hybrid5CPU_3GPUs_bindMethod1.sh
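The hybrid case differs from the previous one mainly in the number of CPU threads given to each task; a minimal sketch (illustrative request flags, which may differ from the listing above):

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=3             # 3 MPI tasks, each with 1 GPU
#SBATCH --cpus-per-task=8      # a full chiplet per task; threads limited below
#SBATCH --gpus-per-node=3

export OMP_NUM_THREADS=5       # 5 CPU threads per MPI task

srun -n 3 -c 8 --gpus-per-task=1 --gpu-bind=closest ./hello_jobstep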
Now, let's take a look at the output after executing the script:
Terminal N. Output for hybrid job with 3 tasks, each with 5 CPU threads and 1 GPU, shared access. Method 1 for optimal binding.
The output of the hello_jobstep code tells us that the job ran on node nid001004 and that 3 MPI tasks were spawned. Each of the MPI tasks has 5 CPU-cores assigned to it (the number of threads being controlled with the OMP_NUM_THREADS environment variable in the script), and each thread can be identified by its HWT number. Also, each of the threads has only 1 visible GPU. The hardware identification of the GPU is done via the Bus_ID (the other GPU IDs are not physical but relative to the job). Checking against the architecture diagram at the top of this page, it can be clearly seen that the CPU-cores assigned to each MPI task belong to a different L3 cache group chiplet (slurm-socket). More importantly, the GPU assigned to each MPI task is the GPU directly connected to that chiplet, so the binding is optimal. Method 1 may fail for some applications. This first method is simpler, but may not work for all codes; "manual" binding (method 2) may be the only reliable method for codes that rely on OpenMP or OpenACC pragmas to move data between host and GPU and that attempt to use GPU-to-GPU enabled MPI communication. Click on the tab above to read the script and output for the other method of GPU binding.
D. Method 2: "Manual" optimal binding of GPUs and chiplets
Use mask_cpu for hybrid jobs on the CPU side instead of map_cpu: for hybrid jobs, use mask_cpu in the --cpu-bind option and NOT map_cpu. Also, control the number of CPU threads per task with OMP_NUM_THREADS. For "manual" binding, two auxiliary techniques need to be used: 1) a wrapper that selects the correct GPU for each task, and 2) an ordered list passed to the --cpu-bind option of srun. In this case, the list needs to be created using the mask_cpu parameter:
Listing N. exampleScript_1NodeShared_Hybrid5CPU_3GPUs_bindMethod2.sh
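As a rough sketch of the mask_cpu approach (the wrapper name, the ROCm variable ROCR_VISIBLE_DEVICES and the hexadecimal masks are illustrative assumptions; the mask order must follow the actual GPU-to-chiplet mapping of the allocation and may differ from the listing above):

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=8
#SBATCH --gpus-per-node=3

export OMP_NUM_THREADS=5       # number of CPU threads per task

# GPU-selection wrapper, created with the same redirection "trick" to cat
# as in the pure-MPI case (ROCR_VISIBLE_DEVICES assumes AMD/ROCm GPUs).
cat << EOF > select_gpu_${SLURM_JOBID}.sh
#!/bin/bash
export ROCR_VISIBLE_DEVICES=\$SLURM_LOCALID
exec \$*
EOF
chmod +x ./select_gpu_${SLURM_JOBID}.sh

# For hybrid jobs use mask_cpu (NOT map_cpu): one hexadecimal mask per task,
# each covering the 8 cores of one chiplet (illustrative order):
#   chiplet 0 = cores  0-7   -> 0xff
#   chiplet 1 = cores  8-15  -> 0xff00
#   chiplet 2 = cores 16-23  -> 0xff0000
CPU_BIND="mask_cpu:0xff,0xff00,0xff0000"

srun -n 3 -c 8 --cpu-bind=${CPU_BIND} ./select_gpu_${SLURM_JOBID}.sh ./hello_jobstep

rm -f ./select_gpu_${SLURM_JOBID}.sh   # wrapper removed once execution finishes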
Note that the wrapper for selecting the GPUs is created with a redirection "trick" to the cat command. Also note that its name uses the SLURM_JOBID environment variable to make the wrapper unique to this job, and that the wrapper is deleted when execution finishes. Now, let's take a look at the output after executing the script:
Terminal N. Output for hybrid job with 3 tasks, each with 5 CPU threads and 1 GPU, shared access. "Manual" method (method 2) for optimal binding.
The output of the hello_jobstep code tells us that the job ran on node nid001004 and that 3 MPI tasks were spawned. Each of the MPI tasks has 5 CPU-cores assigned to it (the number of threads being controlled with the OMP_NUM_THREADS environment variable in the script), and each thread can be identified by its HWT number. Also, each thread has only 1 visible GPU. The hardware identification of the GPU is done via the Bus_ID (the other GPU IDs are not physical but relative to the job). Checking against the architecture diagram at the top of this page, it can be clearly seen that the CPU-cores assigned to each MPI task belong to a different L3 cache group chiplet (slurm-socket). More importantly, the GPU assigned to each MPI task is the GPU directly connected to that chiplet, so the binding is optimal. Click on the tab above to read the script and output for the other method of GPU binding.
...