Machine Learning workloads are supported on Setonix through a custom TensorFlow container developed by Pawsey. This page illustrates its usage.
...
$ docker pull quay.io/pawsey/tensorflow:2.12.1.570-rocm5.6.0
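On Setonix, containers are normally executed with Singularity rather than Docker. If a local copy of the image is needed, the same image can also be pulled as a Singularity image file; this is only a sketch, and the output filename is an illustrative choice rather than a Pawsey convention:

$ singularity pull tensorflow_2.12_rocm5.6.sif docker://quay.io/pawsey/tensorflow:2.12.1.570-rocm5.6.0

In the examples on this page, however, the container is accessed through the tensorflow module (for example tensorflow/rocm5.6-tf2.12), so pulling it manually is optional.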
Note: This page is still a work in progress and support for Machine Learning workloads has only recently started. Please check it frequently for updates.
...
Listing 1. distribute_tf.sh : An example batch script to run a TensorFlow distributed training job.
#!/bin/bash --login
#SBATCH --job-name=distribute_tf
#SBATCH --partition=gpu
#SBATCH --nodes=2      #2 nodes in this example
#SBATCH --exclusive    #All resources of the node are exclusive to this job
#                      #8 GPUs per node (16 "allocation packs" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=pawsey12345-gpu #IMPORTANT: use your own project and the -gpu suffix
#----
#Loading needed modules:
module load tensorflow/<version>
echo -e "\n\n#------------------------#"
module list
#----
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}
#----
#If additional python packages have been installed in the user's own virtual environment
VENV_PATH=/software/projects/pawsey12345/matilda/myenv/bin/activate
#----
#Clear definition of the python script containing the tensorflow training case
PYTHON_SCRIPT=/software/projects/pawsey12345/matilda/matilda-machinelearning/models/01_horovod_mnist.py
#----
#TensorFlow settings if needed:
# The following two variables control the real number of threads in Tensorflow code:
export TF_NUM_INTEROP_THREADS=1 #Number of threads for independent operations
export TF_NUM_INTRAOP_THREADS=1 #Number of threads within individual operations
#----
#Execution
#Note: srun needs the explicit indication of the full set of parameters for the use of resources in the job step.
# These are independent from the allocation parameters (which are not inherited by srun)
# Each task needs access to all the 8 available GPUs in the node where it's running.
# So, no optimal binding can be provided by the scheduler.
# Therefore, "--gpus-per-task" and "--gpu-bind" are not used.
# Optimal use of resources is now the responsibility of the code.
# "-c 8" is used to force allocation of 1 task per CPU chiplet. Then, the REAL number of threads
# for the code SHOULD be defined by the environment variables above.
echo -e "\n\n#------------------------#"
echo "Code execution:"
srun -N 2 -n 16 -c 8 --gres=gpu:8 bash -c "source $VENV_PATH && python3 $PYTHON_SCRIPT"
#----
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished jobs steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20
#----
#Done
echo -e "\n\n#------------------------#"
echo "Done"
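Assuming the script is saved as distribute_tf.sh (as in the listing title), it is submitted with sbatch in the usual way, and the state of the job can then be checked with squeue:

$ sbatch distribute_tf.sh
$ squeue -u $USER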
Here, the training is distributed across 8 GPUs per node. Note the use of the TensorFlow environment variables TF_NUM_INTEROP_THREADS and TF_NUM_INTRAOP_THREADS to control the real number of threads used by the code (we recommend leaving them set to 1). The reasoning behind the resource request and the parameters given to the srun command on GPU nodes is explained extensively in Example Slurm Batch Scripts for Setonix on GPU Compute Nodes.
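The batch script sources a user virtual environment (VENV_PATH) to provide any extra Python packages on top of those shipped with the tensorflow module. A minimal sketch of how such an environment might be created is given below; the paths are the ones used in Listing 1, the package name is only a placeholder, and the exact procedure for extending the containerised Python environment should be checked against the Pawsey documentation on installing Python packages:

$ module load tensorflow/<version>
$ python3 -m venv --system-site-packages /software/projects/pawsey12345/matilda/myenv
$ source /software/projects/pawsey12345/matilda/myenv/bin/activate
$ pip install <some-extra-package>

Once the environment exists, the batch script only needs to source its activate script, as done with VENV_PATH in Listing 1.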
...