Note

This page is still a work in progress, and support for Machine Learning workloads has only just started. Please check it frequently for updates.


...

Setonix can support Machine Learning workloads thanks to the large number of AMD GPUs installed on the system. AMD maintains a TensorFlow branch with added support for its GPUs. An official AMD container is also available but, unfortunately, it is unusable on Setonix because it lacks support for both Cray MPI and some core Python packages. For this reason, Pawsey has developed its own TensorFlow container, which runs properly on Setonix. This container is installed on Setonix and is available through the module system. The Pawsey TensorFlow container is the only supported way to run TensorFlow on Setonix.

...

Terminal 2. A simple interaction with the TensorFlow module.
$ module load tensorflow/rocm5.6-tf2.12 

$ python3
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2023-09-07 14:29:15.551224: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> tf.__version__
'2.12.0'
>>> exit()

$ 


Here is another example of running a simple training script on a GPU node during an interactive session:

Terminal 3. Running an ML Python script interactively on a compute node
$ salloc -p gpu --nodes=1 --gres=gpu:1 -A yourProjectName-gpu --time=00:20:00
salloc: Granted job allocation 4360927
salloc: Waiting for resource configuration
salloc: Nodes nid002828 are ready for job

$ module load tensorflow/rocm5.6-tf2.12 

$ module list
Currently Loaded Modules:
  1) craype-x86-milan                       7) pawsey          13) cray-libsci/23.09.1.1
  2) libfabric/1.15.2.0                     8) pawseytools     14) PrgEnv-gnu/8.4.0
  3) craype-network-ofi                     9) gcc/12.2.0      15) singularity/4.1.0-mpi
  4) perftools-base/23.09.0                10) craype/2.7.23   16) tensorflow/rocm5.6-tf2.12
  5) xpmem/2.8.4-1.0_7.3__ga37cbd9.shasta  11) cray-dsmml/0.2.2
  6) pawseyenv/2024.05                     12) cray-mpich/8.1.27

$ python3 01_horovod_mnist.py 
2023-09-07 14:32:18.907641: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:root:This is process with rank 0 and local rank 0
INFO:root:This is process with rank 0 and local rank 0: gpus available are: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
2023-09-07 14:32:23.886297: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 134200961 MB memory:  -> device: 0, name: AMD Instinct MI250X, pci bus id: 0000:d1:00.0
[...]


Epoch 1/40
1875/1875 [==============================] - 7s 1ms/step - loss: 0.3005 - accuracy: 0.9133
Epoch 2/40
1875/1875 [==============================] - 2s 1ms/step - loss: 0.1417 - accuracy: 0.9575
Epoch 3/40
1875/1875 [==============================] - 2s 1ms/step - loss: 0.1066 - accuracy: 0.9681
[...]
Epoch 39/40
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0191 - accuracy: 0.9935
Epoch 40/40
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0186 - accuracy: 0.9938
[...]
INFO:root:This is process with rank 0 and local rank 0: my prediction is [[ -3.5764134  -6.1231604  -1.5476028   2.1744065 -14.56255    -5.4938045
  -20.374353   12.388017   -3.1701622  -1.0773858]]



Note that when requesting the interactive allocation, users should replace the "yourProjectName" placeholder used in the example with their own project name. Also notice the "-gpu" suffix added to the project name, which is required to access any partition with GPU nodes. The resource request for GPU nodes, and the parameters to be given to the srun command, differ from the usual Slurm allocation requests. Please refer to the page Example Slurm Batch Scripts for Setonix on GPU Compute Nodes for a detailed explanation of resource allocation on GPU nodes.
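As a minimal sketch of the account naming rule described above, the following shell helper appends the "-gpu" suffix to a project name when it is missing and prints the interactive request from Terminal 3 without submitting it. The function names gpu_account and print_salloc are hypothetical and are not Pawsey tools; pawsey12345 is the placeholder project used elsewhere on this page.

```bash
#!/bin/bash
# Hypothetical helper: append the "-gpu" suffix to a project name
# unless the name already ends with it.
gpu_account() {
    local project="$1"
    case "$project" in
        *-gpu) printf '%s\n' "$project" ;;
        *)     printf '%s\n' "${project}-gpu" ;;
    esac
}

# Print (do not run) the interactive request shown in Terminal 3.
print_salloc() {
    local account
    account="$(gpu_account "$1")"
    echo "salloc -p gpu --nodes=1 --gres=gpu:1 -A ${account} --time=00:20:00"
}

# Placeholder project name; replace it with your own project.
print_salloc pawsey12345
```

Printing the command first makes it easy to confirm the account name before requesting the allocation.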

Installing additional Python packages

There are at least two ways in which users can install additional Python packages that they require and the container lacks. The first is to build their own container image from the Pawsey TensorFlow container. The second is to use a virtual environment saved on Setonix itself, in which the user can install the additional Python packages and from which they can be loaded. The second way is our recommended procedure.

The trick is to create a virtual environment using the Python installation within the container. This ensures that your packages are installed taking into account what is already installed in the container, and not what is installed on Setonix. However, the virtual environment itself is created on the host filesystem, ideally Setonix's /software. Setonix filesystems are mounted by default in containers and are writable from within the container, so pip can install additional packages there. Additionally, virtual environments are preserved from one container run to the next. We recommend installing these virtual environments in a clearly named path under $MYSOFTWARE/manual/software.

To do so, you will need to open a BASH shell within the container. Because the TensorFlow container is installed as a module, there is no need to call the singularity command explicitly. Instead, the containerised installation provides a bash wrapper that does all the work and then provides an interactive bash session inside the Singularity container. Here is a practical example that installs the xarray package into a virtual environment named myenv (it is key to use the --system-site-packages option to preserve access to the Python installation and packages present in the container):

Terminal 4. Installing additional Python packages using virtual environments
$ module load tensorflow/rocm5.6-tf2.12 
$ mkdir -p $MYSOFTWARE/manual/software/pythonEnvironments/tensorflowContainer-environments
$ cd $MYSOFTWARE/manual/software/pythonEnvironments/tensorflowContainer-environments
$ bash

Singularity> python3 -m venv --system-site-packages myenv  
Singularity> source myenv/bin/activate

(myenv) Singularity> python3 -m pip install xarray
Collecting xarray
  Using cached xarray-2023.8.0-py3-none-any.whl (1.0 MB)
Requirement already satisfied: packaging>=21.3 in /usr/local/lib/python3.10/dist-packages (from xarray) (23.1)
Collecting pandas>=1.4
  Using cached pandas-2.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.7 MB)
Requirement already satisfied: numpy>=1.21 in /usr/local/lib/python3.10/dist-packages (from xarray) (1.23.5)
Collecting pytz>=2020.1
  Using cached pytz-2023.3.post1-py2.py3-none-any.whl (502 kB)
Collecting python-dateutil>=2.8.2
  Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting tzdata>=2022.1
  Using cached tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas>=1.4->xarray) (1.16.0)
Installing collected packages: pytz, tzdata, python-dateutil, pandas, xarray
Successfully installed pandas-2.1.0 python-dateutil-2.8.2 pytz-2023.3.post1 tzdata-2023.3 xarray-2023.8.0

# Now test the use of the installed package
(myenv) Singularity> python3
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
2023-09-07 14:59:00.339696: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> import xarray
>>> exit()

(myenv) Singularity> exit

$ ls -l
drwxr-sr-x 5 matilda pawsey12345 4096 Apr 22 16:33 myenv


As you can see, the environment stays on the filesystem and can be used in later runs.

Terminal 5. The environment can be used once again.
$ module load tensorflow/rocm5.6-tf2.12   

$ bash

Singularity> source $MYSOFTWARE/manual/software/pythonEnvironments/tensorflowContainer-environments/myenv/bin/activate
(myenv) Singularity>
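
Before activating a saved environment in a later session, it can help to confirm that it is still where you expect it. The following is a minimal sketch of such a check; find_activate is a hypothetical helper name, and the fallback to $HOME when MYSOFTWARE is unset is only there so the snippet can run outside Setonix.

```bash
#!/bin/bash
# Hypothetical helper: print the path of the activate script of a
# virtual environment, or fail with a message if it does not exist.
find_activate() {
    local venv_dir="$1"
    local activate="${venv_dir}/bin/activate"
    if [ -f "$activate" ]; then
        printf '%s\n' "$activate"
    else
        echo "No virtual environment found at ${venv_dir}" >&2
        return 1
    fi
}

# Example call using the path recommended on this page.
ENV_DIR="${MYSOFTWARE:-$HOME}/manual/software/pythonEnvironments/tensorflowContainer-environments/myenv"
if activate_script="$(find_activate "$ENV_DIR")"; then
    source "$activate_script"
fi
```

Run inside the container shell (after module load and bash), this activates the environment only when its activate script is present.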




Note

If users consider that a Python package should be included in the Pawsey TensorFlow image, instead of being installed in a host virtual environment (as suggested above), they can submit a ticket to the Help Desk. Pawsey staff can then evaluate the request based on the importance of the package and how widely it is used in the Machine Learning community.


Users' own containers built on top of the Pawsey TensorFlow container

The Pawsey TensorFlow container image is publicly distributed on quay.io (external site). We recommend using a local installation of Docker on your own desktop to build your container on top of Pawsey's TensorFlow image, starting your Dockerfile with the line:

FROM quay.io/pawsey/tensorflow:2.12.1.570-rocm5.6.0

To pull the image to your local desktop with Docker you can use:

$ docker pull quay.io/pawsey/tensorflow:2.12.1.570-rocm5.6.0

To learn more about our recommendations for building containers with Docker and later converting them into Singularity format for use on Setonix, please refer to the Containers Documentation.
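
As an illustration only, the following shell sketch writes a minimal Dockerfile that starts from the Pawsey image and adds one extra Python package. The helper name write_dockerfile, the output directory, and the choice of xarray are assumptions made for the example; adapt them to your own needs.

```bash
#!/bin/bash
# Hypothetical helper: write a minimal Dockerfile that extends the
# Pawsey TensorFlow image. The package installed (xarray) is only an example.
write_dockerfile() {
    local build_dir="$1"
    mkdir -p "$build_dir"
    cat > "${build_dir}/Dockerfile" <<'EOF'
FROM quay.io/pawsey/tensorflow:2.12.1.570-rocm5.6.0

# Install the extra Python packages your workflow needs
RUN python3 -m pip install --no-cache-dir xarray
EOF
    echo "${build_dir}/Dockerfile"
}

# Illustrative output location on the local desktop
write_dockerfile "${TMPDIR:-/tmp}/my-tensorflow-image"
```

From that directory, a command such as docker build -t my-tensorflow:custom . would then build the image; the tag name is a placeholder.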

Distributed training

You can run TensorFlow on multiple Setonix nodes. The best way is to submit a job to Slurm using sbatch and a batch script. Let's assume you have prepared a Python script implementing your TensorFlow program, and that you have also created a virtual environment that you want to be active while the script runs. Here is a batch script that implements such a scenario.

Listing 1. runTensorflow.sh: An example batch script to run a TensorFlow distributed training job.
#!/bin/bash --login
#SBATCH --job-name=tensorflow_multiGPU
#SBATCH --partition=gpu
#SBATCH --nodes=2              #2 nodes in this example 
#SBATCH --exclusive            #All resources of the node are exclusive to this job
#                              #8 GPUs per node (16 "allocation packs" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=pawsey12345-gpu #IMPORTANT: use your own project and the -gpu suffix

#----
#Loading needed modules:
module load tensorflow/<version> #Adapt this line for the correct version
echo -e "\n\n#------------------------#"
module list

#----
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}

#----
#If additional python packages have been installed in user's own virtual environment
VENV_PATH=$MYSOFTWARE/manual/software/pythonEnvironments/tensorflowContainer-environments/myenv

#----
#Definition of the python script containing the tensorflow training case
PYTHON_SCRIPT_DIR=$MYSCRATCH/machinelearning/models
PYTHON_SCRIPT=$PYTHON_SCRIPT_DIR/00_myTensorflowScript.py

#----
#TensorFlow settings if needed:
# The following two variables control the real number of threads in Tensorflow code:
export TF_NUM_INTEROP_THREADS=1 #Number of threads for independent operations
export TF_NUM_INTRAOP_THREADS=1 #Number of threads within individual operations 

#----
#Execution
#Note: srun needs the explicit indication of the full parameters for use of resources in the job step.
#      These are independent from the allocation parameters (which are not inherited by srun).
#      Each task needs access to all the 8 available GPUs in the node where it's running.
#      So, no optimal binding can be provided by the scheduler.
#      Therefore, "--gpus-per-task" and "--gpu-bind" are not used.
#      Optimal use of resources is now the responsibility of the code.
#      "-c 8" is used to force allocation of 1 task per CPU chiplet. Then, the REAL number of threads
#       for the code SHOULD be defined by the environment variables above.
echo -e "\n\n#------------------------#"
echo "Code execution:"
#When no virtual environment is needed:
srun -N 2 -n 16 -c 8 --gres=gpu:8 python3 $PYTHON_SCRIPT
#When using a virtual environment:
#srun -N 2 -n 16 -c 8 --gres=gpu:8 bash -c "source $VENV_PATH/bin/activate && python3 $PYTHON_SCRIPT"

#----
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished jobs steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20

#----
#Done
echo -e "\n\n#------------------------#"
echo "Done"


Here, the training is distributed over 16 GPUs (8 GPUs per node). Note the use of the TensorFlow environment variables TF_NUM_INTEROP_THREADS and TF_NUM_INTRAOP_THREADS to control the real number of threads used by the code (we recommend leaving them set to 1). Note also that the resource request for GPU nodes, and the parameters to be given to the srun command, differ from the usual Slurm allocation requests. Please refer to the page Example Slurm Batch Scripts for Setonix on GPU Compute Nodes for a detailed explanation of resource allocation on GPU nodes.
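
The arithmetic behind the srun options in Listing 1 can be sketched as follows. The values of 8 GPUs per node and 8 cores per task come from the listing; the assumption that the same pattern (8 tasks per node) carries over to other node counts is ours, and srun_options is a hypothetical helper name.

```bash
#!/bin/bash
# Values taken from Listing 1: 8 GPUs per node, 8 tasks per node,
# and 8 cores per task (one CPU chiplet per task).
GPUS_PER_NODE=8
TASKS_PER_NODE=8
CORES_PER_TASK=8

# Hypothetical helper: print the srun resource options for a number of nodes.
srun_options() {
    local nodes="$1"
    local tasks=$(( nodes * TASKS_PER_NODE ))
    echo "-N ${nodes} -n ${tasks} -c ${CORES_PER_TASK} --gres=gpu:${GPUS_PER_NODE}"
}

srun_options 2   # prints: -N 2 -n 16 -c 8 --gres=gpu:8
```

For two nodes this reproduces the options used in Listing 1: 16 tasks in total, each with 8 cores, and 8 GPUs visible per node.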

When a virtual environment is used, the bash wrapper calls the BASH interpreter within the container to execute a BASH command line that activates the environment and invokes python3 to execute the Python script. For a more complex sequence of commands, we advise creating a support BASH script to be executed with the bash wrapper. (If no virtual environment is needed, the sections related to the use of additional packages can be omitted from the script.) Here, the name of the Python training script is a hypothetical example. For detailed examples of Python scripts, please refer to the related pages on this topic.
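
A support script of the kind mentioned above might look like the following minimal sketch. The file name run_training.sh and the function name run_training are hypothetical; SLURM_PROCID is the task rank that srun sets for each task, and it defaults to 0 here so the script can also be tried outside Slurm.

```bash
#!/bin/bash
# run_training.sh (hypothetical name): support script meant to be run
# inside the container through the bash wrapper.
# Usage: run_training.sh <venv directory> <python script>

run_training() {
    local venv_dir="$1"
    local python_script="$2"

    # Activate the virtual environment; stop if it cannot be sourced
    source "${venv_dir}/bin/activate" || return 1

    # SLURM_PROCID is set by srun for each task; default to 0 outside Slurm
    echo "Task ${SLURM_PROCID:-0}: running ${python_script}"
    python3 "$python_script"
}

# Run only when both arguments are given
if [ "$#" -eq 2 ]; then
    run_training "$1" "$2"
fi
```

With this file saved next to the batch script, the srun line in Listing 1 could become: srun -N 2 -n 16 -c 8 --gres=gpu:8 bash run_training.sh $VENV_PATH $PYTHON_SCRIPT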

Related pages

External Resources