You can run TensorFlow with Horovod across multiple Setonix nodes. The recommended way is to submit a job to Slurm using sbatch and a batch script. Assume you have prepared a Python script implementing your TensorFlow program and have also created a virtual environment that should be active while the script runs. A sketch of how such a virtual environment might be created is given first, followed by a batch script that implements this scenario.
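As a reference only, a virtual environment on top of the containerized TensorFlow module could be created along the following lines. This is a minimal sketch: it assumes the module exposes python3 and pip on the path, it reuses the environment path defined in the batch script below, and the installed package (pandas) is purely an illustration. Please refer to the documentation pages on Python and virtual environments on Setonix for the supported procedure.

#Load the containerized TensorFlow module:
module load tensorflow/2.17-rocm6.3.3
#Create the virtual environment on /software, reusing the packages already provided by the container:
VENV_PATH=$MYSOFTWARE/manual/software/pythonEnvironments/tensorflowContainer-environments/myenv2.17
python3 -m venv --system-site-packages $VENV_PATH
#Activate it and install any additional packages (pandas is only an example):
source $VENV_PATH/bin/activate
pip install pandas
deactivate

The batch script that implements the multi-node execution is the following: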
#!/bin/bash --login
#SBATCH --job-name=tensorflow_multiGPU
#SBATCH --partition=gpu
#SBATCH --nodes=2 #2 nodes in this example
#SBATCH --exclusive #All resources of the node are exclusive to this job
# #8 GPUs per node (16 "allocation packs" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=pawsey12345-gpu #IMPORTANT: use your own project and the -gpu suffix
#----
#Loading Tensorflow module from the default software stack:
module load tensorflow/2.17-rocm6.3.3
echo -e "\n\n#------------------------#"
module list
#----
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}
#----
#If additional python packages have been installed in user's own virtual environment
VENV_PATH=$MYSOFTWARE/manual/software/pythonEnvironments/tensorflowContainer-environments/myenv2.17
#----
#Definition of the python script containing the tensorflow training case
PYTHON_SCRIPT_DIR=$MYSCRATCH/machinelearning/models
PYTHON_SCRIPT=$PYTHON_SCRIPT_DIR/00_myTensorflowScript.py
#----
#TensorFlow settings if needed:
# The following two variables control the real number of threads in Tensorflow code:
export TF_NUM_INTEROP_THREADS=1 #Number of threads for independent operations
export TF_NUM_INTRAOP_THREADS=1 #Number of threads within individual operations
#----
#Execution
#Note: srun needs the explicit indication of the full parameters for the use of resources in the job step.
# These are independent from the allocation parameters (which are not inherited by srun)
# Each task needs access to all the 8 available GPUs in the node where it's running.
# So, no optimal binding can be provided by the scheduler.
# Therefore, "--gpus-per-task" and "--gpu-bind" are not used.
# Optimal use of resources is now the responsibility of the code.
# "-c 8" is used to force allocation of 1 task per CPU chiplet. Then, the REAL number of threads
# for the code SHOULD be defined by the environment variables above.
echo -e "\n\n#------------------------#"
echo "Code execution:"
#When no virtual environment is needed:
#srun -N 2 -n 16 -c 8 --gres=gpu:8 python3 $PYTHON_SCRIPT
#When using a virtual environment:
srun -N 2 -n 16 -c 8 --gres=gpu:8 bash -c "source $VENV_PATH/bin/activate && python3 $PYTHON_SCRIPT"
#----
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished jobs steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20
#----
#Done
echo -e "\n\n#------------------------#"
echo "Done" |
|
Here, the training distribution takes place over 16 GPUs (8 GPUs per node). Note the use of the TensorFlow environment variables TF_NUM_INTEROP_THREADS and TF_NUM_INTRAOP_THREADS to control the real number of threads used by the code (we recommend leaving them set to 1). Note also that both the resource request for GPU nodes and the parameters given to the srun command differ from the usual Slurm allocation requests; please refer to the page Example Slurm Batch Scripts for Setonix on GPU Compute Nodes for a detailed explanation of resource allocation on GPU nodes.
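Once the batch script has been saved (the file name below is hypothetical), the job is submitted and monitored with the usual Slurm commands:

#Submit the job:
sbatch tensorflow_multiGPU.sh
#Check the state of your queued and running jobs:
squeue -u $USER
#After completion, job steps can be inspected again with sacct (as done at the end of the script):
sacct -j <jobid> -o jobid%20,Start%20,elapsed%20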
Here we use the bash wrapper to call the BASH interpreter within the container and execute a BASH command line that activates the virtual environment and invokes python3 to run the Python script. For a more complex sequence of commands, it is advisable to create a support BASH script and execute it through the bash wrapper, as sketched below. (If no virtual environment is needed, the sections related to the use of additional packages can be omitted from the script.) The name of the Python training script used here is a hypothetical example; for detailed examples of Python scripts, please refer to the related pages on this topic.
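For example, a support script could gather the environment activation and the invocation of the training script. The file name, its location and the use of "$@" to forward arguments are assumptions for illustration only:

#Contents of a hypothetical support script, e.g. $PYTHON_SCRIPT_DIR/run_training.sh:
#!/bin/bash
#Activate the virtual environment (adjust the path to your own setup):
source $MYSOFTWARE/manual/software/pythonEnvironments/tensorflowContainer-environments/myenv2.17/bin/activate
#Run the training script, forwarding any command-line arguments:
python3 $MYSCRATCH/machinelearning/models/00_myTensorflowScript.py "$@"

The execution line of the batch script then simplifies to:

srun -N 2 -n 16 -c 8 --gres=gpu:8 bash $PYTHON_SCRIPT_DIR/run_training.sh

Because the support script is invoked through the bash wrapper provided by the module, all of its commands run within the container, just as the single command line above does.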