...
Pawsey's way for requesting resources on GPU nodes (different to standard Slurm)
The request of resources for the GPU nodes has changed dramatically. The main reason for this change has to do with Pawsey's efforts to provide a method for optimal binding of the GPUs to the CPU cores in direct physical connection for each task. For this, we decided to completely separate the options used for the request of resources (given in the header of the batch script) from the options used for the use/management of those resources at execution time (given to the srun command).
Furthermore, in the request of resources, users should not indicate any other Slurm allocation option related to memory or CPU cores; such options should not appear in the request header. Pawsey also has some site-specific recommendations for the use/management of resources with srun.
The following table provides some examples that will serve as a guide for requesting resources in the GPU nodes. Most of the examples in the table are for typical jobs where multiple GPUs are accessible by 1 or more tasks; those interested in cases where multiple GPUs are allocated to the job as a whole but each of the tasks spawned by srun is bound to only some of them should pay attention to cases 4, 5 & 7.
Notes for the request of resources:
Notes for the use/management of resources with srun:
General notes:
Note that the examples above are just for quick reference and that they do not show the use of the 2nd method for optimal binding (which may be the only way to achieve optimal binding for some applications). So, the rest of this page will describe in detail both methods of optimal binding and also show full job script examples for their use on Setonix GPU nodes.
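As a quick illustration of this separation, the minimal sketch below requests 2 "allocation-packs" on one node in the header and then manages them explicitly at execution time. The account, partition, walltime and executable name (./program) are placeholders, and the srun options shown are only illustrative; the recommended settings for each use case are those given in the table above and in the full examples below.
#!/bin/bash --login
#SBATCH --account=myproject-gpu    #placeholder project code
#SBATCH --partition=gpu            #GPU partition
#SBATCH --nodes=1                  #1 node
#SBATCH --gres=gpu:2               #2 GPUs per node (2 "allocation packs" in total for the job)
#SBATCH --time=00:10:00            #placeholder walltime
#(no options related to memory or CPU cores in the request header)

#Use/management of the allocated resources is controlled with srun options at execution time:
srun -N 1 -n 2 -c 8 --gres=gpu:2 ./program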
Methods to achieve optimal binding of GCDs/GPUs
As mentioned above, and as the node diagram at the top of the page suggests, the optimal placement of GCDs and CPU cores for each task is to have direct communication between the CPU chiplet and the GCD in use. So, according to the node diagram, tasks being executed on cores in Chiplet 0 should be using GPU 4 (Bus D1), tasks in Chiplet 1 should be using GPU 5 (Bus D6), etc.
...
Some applications, like TensorFlow and other machine learning applications, may require access to all the available GPUs in the node. In this case, the optimal binding and communication cannot be granted by the scheduler when assigning resources to the srun launcher. Then, the full responsibility for the optimal use of the resources lies with the code itself.
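As a hedged sketch of this situation, the job step below exposes all 8 GCDs of one node to a single task and leaves their placement and communication entirely to the application. The account, partition, walltime and script name (train_script.py) are placeholders, and the srun options are only illustrative:
#!/bin/bash --login
#SBATCH --account=myproject-gpu    #placeholder project code
#SBATCH --partition=gpu            #GPU partition
#SBATCH --nodes=1                  #1 node
#SBATCH --gres=gpu:8               #8 GPUs per node (8 "allocation packs", i.e. the whole node)
#SBATCH --time=01:00:00            #placeholder walltime

#A single task receives all 8 GCDs; the application itself (e.g. TensorFlow)
#is responsible for distributing work and communication among them:
srun -N 1 -n 1 -c 64 --gres=gpu:8 python3 train_script.py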
...
...
...
More importantly for this example, each of the MPI tasks has access to the 8 GCDs (logical/Slurm GPUs) in its node. Proper and optimal GPU management and communication is the responsibility of the code. The hardware identification is done via the Bus_ID (as the other GPU_IDs are not physical but relative to the job).
Shared nodes: Many GPUs requested but 2 GPUs bound to each task
Some applications may require that each of the spawned tasks has access to multiple GPUs. In this case, some optimal binding and communication can still be granted by the scheduler when assigning resources to the srun launcher, although the final responsibility for the optimal use of the resources within each task lies with the code itself.
As for all scripts, we provide the parameters for requesting the necessary "allocation-packs" for the job. This example considers a job that will make use of 6 GCDs (logical/Slurm GPUs) on 1 node (6 "allocation-packs" in total). The resource request uses the following two parameters:
#SBATCH --nodes=1 #1 node in this example
#SBATCH --gres=gpu:6 #6 GPUs per node (6 "allocation packs" in total for the job)
Note that only these two allocation parameters are needed to provide the information for the requested number of allocation-packs, and no other parameter related to memory or CPU cores should be provided in the request header.
The use/management of the allocated resources is controlled by the srun options and some environment variables. As mentioned above, some optimal binding can still be achieved by the scheduler by providing 2 GPUs to each of the tasks:
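A minimal sketch of the job step for this case is given below. The GPU-binding options passed to srun (--gpus-per-task and --gpu-bind=closest) are assumptions here, shown only to illustrate letting the scheduler bind 2 GPUs to each task; the complete recommended script corresponds to the relevant case in the examples table above. The executable name follows the hello_jobstep code discussed next.
export OMP_NUM_THREADS=1   #1 CPU core (thread) per task

#3 tasks, each with 16 cores (2 chiplets) and 2 GPUs bound by the scheduler:
srun -N 1 -n 3 -c 16 --gres=gpu:6 --gpus-per-task=2 --gpu-bind=closest ./hello_jobstep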
The output of the hello_jobstep code after executing this example shows that the job ran 3 MPI tasks on node nid002948. Each of the MPI tasks has only 1 CPU core assigned to it (with the use of the OMP_NUM_THREADS environment variable in the script) and can be identified by its HWT number. Clearly, each of the tasks runs on a different chiplet. Note that the tasks are spaced every 16 cores (two chiplets) thanks to the "-c 16" setting in the srun command; as each chiplet has 8 cores, this spacing reserves for every task the two chiplets that are directly connected to its GPUs, allowing for the best binding of the 2 GPUs assigned to each task.
More importantly for this example, each of the MPI tasks has access to 2 GCDs (logical/Slurm GPUs) in its node. The hardware identification is done via the Bus_ID (as the other GPU_IDs are not physical but relative to the job). The assigned GPUs are indeed the 2 closest to the CPU cores of each task. Final proper and optimal GPU management and communication is the responsibility of the code.
Example scripts for: Packing GPU jobs
...