
Note: This page is still a work in progress, and support for machine learning workloads has only just started. Please check it frequently for updates.

...

Terminal 3. Running an ML Python script interactively on a compute node

$ salloc -p gpu --nodes=1 --gres=gpu:1 -A yourProjectName-gpu --time=00:20:00
salloc: Granted job allocation 4360927
salloc: Waiting for resource configuration
salloc: Nodes nid002828 are ready for job

$ export ROCM_PATH=/opt/rocm #Workaround for path errors with new CPE. Will be removed after container fix.
$ module load tensorflow/rocm5.6-tf2.12

$ module list
Currently Loaded Modules:
1) craype-x86-milan 7) pawsey 13) cray-libsci/23.09.1.1
2) libfabric/1.15.2.0 8) pawseytools 14) PrgEnv-gnu/8.4.0
3) craype-network-ofi 9) gcc/12.2.0 15) singularity/4.1.0-mpi
4) perftools-base/23.09.0 10) craype/2.7.23 16) tensorflow/rocm5.6-tf2.12
5) xpmem/2.8.4-1.0_7.3__ga37cbd9.shasta 11) cray-dsmml/0.2.2
6) pawseyenv/2024.05 12) cray-mpich/8.1.27

$ python3 01_horovod_mnist.py 
2023-09-07 14:32:18.907641: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:root:This is process with rank 0 and local rank 0
INFO:root:This is process with rank 0 and local rank 0: gpus available are: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
2023-09-07 14:32:23.886297: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 134200961 MB memory:  -> device: 0, name: AMD Instinct MI250X, pci bus id: 0000:d1:00.0
[...]
Epoch 1/40
1875/1875 [==============================] - 7s 1ms/step - loss: 0.3005 - accuracy: 0.9133
Epoch 2/40
1875/1875 [==============================] - 2s 1ms/step - loss: 0.1417 - accuracy: 0.9575
Epoch 3/40
1875/1875 [==============================] - 2s 1ms/step - loss: 0.1066 - accuracy: 0.9681
[...]
Epoch 39/40
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0191 - accuracy: 0.9935
Epoch 40/40
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0186 - accuracy: 0.9938
[...]
INFO:root:This is process with rank 0 and local rank 0: my prediction is [[ -3.5764134  -6.1231604  -1.5476028   2.1744065 -14.56255    -5.4938045
  -20.374353   12.388017   -3.1701622  -1.0773858]]

...

FROM quay.io/pawsey/tensorflow:2.12.1.570-rocm5.6.0

To pull the image to your local desktop with Docker you can use:

$ docker pull quay.io/pawsey/tensorflow:2.12.1.570-rocm5.6.0

To learn more about our recommendations for building containers with Docker and then converting them into Singularity format for use on Setonix, please refer to the Containers Documentation.

...

Listing 1. runTensorflow.sh: An example batch script to run a TensorFlow distributed training job.

#!/bin/bash --login
#SBATCH --job-name=distributedtensorflow_tfmultiGPU
#SBATCH --partition=gpu
#SBATCH --nodes=2              #2 nodes in this example 
#SBATCH --exclusive            #All resources of the node are exclusive to this job
#                              #8 GPUs per node (16 "allocation packs" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=pawsey12345-gpu #IMPORTANT: use your own project and the -gpu suffix

#----
#Loading needed modules:
export ROCM_PATH=/opt/rocm #Workaround for path errors with new CPE. Will be removed after container fix.
module load tensorflow/<version> #Adapt this line for the correct version
echo -e "\n\n#------------------------#"
module list

#----
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}

#----
#If additional python packages have been installed in user's own virtual environment
VENV_PATH=$MYSOFTWARE/manual/software/pythonEnvironments/tensorflowContainer-environments/myenv

#----
#Definition of the python script containing the tensorflow training case
PYTHON_SCRIPT_DIR=$MYSCRATCH/matilda-machinelearning/models/01_horovod_mnist
PYTHON_SCRIPT=$PYTHON_SCRIPT_DIR/00_myTensorflowScript.py

#----
#TensorFlow settings if needed:
#  The following two variables control the real number of threads in Tensorflow code:
export TF_NUM_INTEROP_THREADS=1    #Number of threads for independent operations
export TF_NUM_INTRAOP_THREADS=1    #Number of threads within individual operations 

#----
#Execution
#Note: srun needs the explicit indication of the full parameters for use of resources in the job step.
#      These are independent from the allocation parameters (which are not inherited by srun)
#      Each task needs access to all the 8 available GPUs in the node where it's running.
#      So, no optimal binding can be provided by the scheduler.
#      Therefore, "--gpus-per-task" and "--gpu-bind" are not used.
#      Optimal use of resources is now the responsibility of the code.
#      "-c 8" is used to force allocation of 1 task per CPU chiplet. Then, the REAL number of threads
#         for the code SHOULD be defined by the environment variables above.
echo -e "\n\n#------------------------#"
echo "Code execution:"
#When using a virtual environment:
srun -N 2 -n 16 -c 8 --gres=gpu:8 bash -c "source $VENV_PATH/bin/activate && python3 $PYTHON_SCRIPT"
#When no virtual environment is needed:
#srun -N 2 -n 16 -c 8 --gres=gpu:8 python3 $PYTHON_SCRIPT

#----
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished job steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20

#----
#Done
echo -e "\n\n#------------------------#"
echo "Done"

Here, training is distributed over 16 GPUs (8 GPUs per node). Note the use of the TensorFlow environment variables TF_NUM_INTEROP_THREADS and TF_NUM_INTRAOP_THREADS to control the real number of threads used by the code (we recommend leaving both set to 1). Also note that resource requests for GPU nodes differ from the usual Slurm allocation requests, as do the parameters given to the srun command. Please refer to the page Example Slurm Batch Scripts for Setonix on GPU Compute Nodes for a detailed explanation of resource allocation on GPU nodes.
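As a rough illustration of how the srun parameters in Listing 1 relate to each other, the sketch below derives them from the number of nodes. It assumes, as in the listing, 8 GPUs per node, as many tasks as GPUs, and 8 CPU cores per task. The variable names NODES, GPUS_PER_NODE, CPUS_PER_TASK, NTASKS and SRUN_ARGS are illustrative only and are not defined by Slurm or by the listing.

```bash
#!/bin/bash
# Sketch only: derive the srun geometry used in Listing 1 from the node count.
# All variable names below are illustrative (hypothetical), not Slurm settings.
NODES=2            # same value as "#SBATCH --nodes=2" in Listing 1
GPUS_PER_NODE=8    # GPUs per node, as stated in Listing 1
CPUS_PER_TASK=8    # "-c 8": one task per CPU chiplet

# Listing 1 launches as many tasks as GPUs, so the task count scales with nodes.
NTASKS=$(( NODES * GPUS_PER_NODE ))

# Assemble the resource part of the srun command line as a string.
SRUN_ARGS="-N ${NODES} -n ${NTASKS} -c ${CPUS_PER_TASK} --gres=gpu:${GPUS_PER_NODE}"

# Only print the command; nothing is submitted to the scheduler here.
echo "Total tasks: ${NTASKS}"
echo "srun ${SRUN_ARGS} python3 \$PYTHON_SCRIPT"
```

With NODES=2 this reproduces the parameters used in the listing (-N 2 -n 16 -c 8 --gres=gpu:8). When changing the number of nodes, the "#SBATCH --nodes" line in the batch script would need to be changed to match.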

...

Version	Old Version 46	New Version Current
Changes made by	Alexis Espinosa	Alexis Espinosa
Saved on	Dec 05, 2024	Dec 10, 2024
