...

Method 1: Optimal binding using srun parameters

For optimal binding using srun parameters, the options "--gpus-per-task" and "--gpu-bind=closest" need to be used:

Listing N. exampleScript_1NodeShared_3GPUs_bindMethod1.sh
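As a minimal sketch of the kind of job step such a script contains (not the script itself), the two options can be combined on one srun line. The executable path, the task count layout and the "-c 8" value are assumptions for illustration; the command is only printed here so it can be inspected outside a job allocation.

```bash
# Sketch only: the executable path and the -c value are assumptions.
exe="./hello_jobstep"   # test code named in the text; its path is assumed
ntasks=3                # one MPI task per GPU
cores_per_task=8        # assumption: reserves a whole 8-core chiplet per task

# Real number of CPU cores used per task, as described in the text.
export OMP_NUM_THREADS=1

srun_cmd="srun -N 1 -n $ntasks -c $cores_per_task --gpus-per-task=1 --gpu-bind=closest $exe"

# Print the command for inspection; inside a job allocation it would be run as: $srun_cmd
echo "$srun_cmd"
```

With "--gpus-per-task=1" each task sees a single GPU, and "--gpu-bind=closest" asks Slurm to pick the GPU nearest to the cores the task was given.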


Now, let's take a look at the output after executing the script:

Terminal N. Output for a 3-GPU job with shared access. Method 1 for optimal binding.


The output of the hello_jobstep code shows that the job ran on node nid001004 and that 3 MPI tasks were spawned. Each MPI task has only 1 CPU core assigned to it (through the OMP_NUM_THREADS environment variable set in the script), identified by its HWT number. Each MPI task also has only 1 visible GPU. The GPU hardware is identified by its Bus_ID, because the other GPU_IDs are not physical but relative to the job.

Comparing with the architecture diagram at the top of this page shows that each CPU core assigned to the job is on a different L3 cache group chiplet (slurm-socket). More importantly, the GPU assigned to each MPI task is the one directly connected to that task's chiplet, so the binding is optimal:

  • CPU core "001" is on chiplet:0 and directly connected to GPU with Bus_ID:D1
  • CPU core "008" is on chiplet:1 and directly connected to GPU with Bus_ID:D6
  • CPU core "016" is on chiplet:2 and directly connected to GPU with Bus_ID:C9

According to the architecture diagram, this binding configuration is optimal.
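The check against the diagram can also be sketched in a few lines of shell. This assumes 8 CPU cores per chiplet (cores 0-7 on chiplet 0, 8-15 on chiplet 1, and so on), which is consistent with the cores listed above; the function names are hypothetical and only the three chiplet/Bus_ID pairs quoted in the list are covered.

```bash
# Sketch only. Assumption: 8 CPU cores per chiplet.
chiplet_of_core() {
    core=$1
    # Strip leading zeros so that "008" is not parsed as an invalid octal number.
    while [ "${#core}" -gt 1 ] && [ "${core#0}" != "$core" ]; do
        core=${core#0}
    done
    echo $(( core / 8 ))
}

# Bus_IDs for chiplets 0-2 are the ones quoted in the list above.
expected_bus_id() {
    case "$1" in
        0) echo "D1" ;;
        1) echo "D6" ;;
        2) echo "C9" ;;
        *) echo "unknown" ;;
    esac
}

for hwt in 001 008 016; do
    chiplet=$(chiplet_of_core "$hwt")
    echo "core $hwt -> chiplet $chiplet -> expected Bus_ID $(expected_bus_id "$chiplet")"
done
```

If the Bus_ID reported for a task differs from the expected one for its core, the task is not using the GPU directly connected to its chiplet.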

Method 1 may fail for some applications.

This first method is simpler, but may not work for all codes. "Manual" binding (method 2) may be the only reliable method for codes that rely on OpenMP or OpenACC pragmas to move data between host and GPU and that also attempt to use GPU-to-GPU enabled MPI communication.

Click the tab above to read the script and output for the other method of GPU binding.



Method 2: "Manual" optimal binding of GPUs and chiplets

For "manual" binding, two auxiliary techniques are needed: 1) a wrapper that selects the correct GPU, and 2) an ordered list of CPU cores to pass to the --cpu-bind option of srun:

Listing N. exampleScript_1NodeShared_3GPUs_bindMethod2.sh


Note that the wrapper for selecting the GPUs is created with a redirection "trick" on the cat command. Also note that its name uses the SLURM_JOBID environment variable to make the wrapper unique to this job, and that the wrapper is deleted when execution has finished.

Now, let's take a look at the output after executing the script:
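The wrapper technique can be sketched as follows. This is an illustration under assumptions, not the listing itself: the wrapper name is hypothetical, ROCR_VISIBLE_DEVICES is assumed to be the variable that restricts GPU visibility, and the node-local task ID (SLURM_LOCALID) is assumed to serve as the GPU index. The final lines emulate one task by hand so the effect can be seen outside a Slurm job.

```bash
# Sketch only; names and the GPU-selection variable are assumptions.
jobid=${SLURM_JOBID:-test}            # falls back to "test" outside a Slurm job
wrapper="./selectGPU_${jobid}.sh"     # hypothetical name, unique per job

# The quoted 'EOF' keeps $SLURM_LOCALID and "$@" unexpanded until the wrapper runs.
cat > "$wrapper" << 'EOF'
#!/bin/sh
export ROCR_VISIBLE_DEVICES=$SLURM_LOCALID
exec "$@"
EOF
chmod +x "$wrapper"

# Inside the job this would be launched as: srun ... "$wrapper" ./hello_jobstep
# Here, local task 2 is emulated by hand to show the effect:
visible=$(SLURM_LOCALID=2 "$wrapper" sh -c 'echo "$ROCR_VISIBLE_DEVICES"')
echo "task 2 sees GPU index: $visible"

rm -f "$wrapper"   # delete the wrapper once execution has finished
```

Because each task only exposes the GPU whose index equals its local task ID, the CPU cores must be handed out in the matching order, which is the role of the ordered --cpu-bind list.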

Terminal N. Output for a 3-GPU job with shared access. "Manual" method (method 2) for optimal binding.


The output of the hello_jobstep code shows that the job ran on node nid001004 and that 3 MPI tasks were spawned. Each MPI task has only 1 CPU core assigned to it (through the OMP_NUM_THREADS environment variable set in the script), identified by its HWT number. Each MPI task also has only 1 visible GPU. The GPU hardware is identified by its Bus_ID, because the other GPU_IDs are not physical but relative to the job.

Comparing with the architecture diagram at the top of this page shows that each CPU core assigned to the job is on a different L3 cache group chiplet (slurm-socket). More importantly, the GPU assigned to each MPI task is the one directly connected to that task's chiplet, so the binding is optimal:


  • CPU core "019" is on chiplet:2 and directly connected to GPU with Bus_ID:C9
  • CPU core "002" is on chiplet:0 and directly connected to GPU with Bus_ID:D1
  • CPU core "009" is on chiplet:1 and directly connected to GPU with Bus_ID:D6

According to the architecture diagram, this binding configuration is optimal.
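The ordered list behind this result can be illustrated with a short sketch. It assumes that the cores in the list above appear in task order (task 0, 1, 2) and uses the "map_cpu:" form of srun's --cpu-bind option; the variable names are hypothetical and the way the actual script generates its list may differ.

```bash
# Sketch only. Assumption: cores are given in task order (task 0, 1, 2),
# taken from the output discussed above.
cores="019 002 009"

cpu_list=""
for core in $cores; do
    # Strip leading zeros so plain decimal core IDs are passed on.
    while [ "${#core}" -gt 1 ] && [ "${core#0}" != "$core" ]; do
        core=${core#0}
    done
    # Append with a comma separator, except before the first entry.
    cpu_list="${cpu_list:+$cpu_list,}$core"
done

CPU_BIND="map_cpu:$cpu_list"
echo "--cpu-bind=$CPU_BIND"
```

The order matters more than the exact core numbers: the n-th core in the list must sit on the chiplet directly connected to the GPU that the n-th task selects.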

Click the tab above to read the script and output for the other method of GPU binding.


...