Machine Learning workloads are supported on Setonix through a custom TensorFlow container developed by Pawsey. This page illustrates its usage.
Introduction
Setonix can support Machine Learning workloads thanks to the large number of AMD GPUs installed on the system. AMD maintains a TensorFlow branch with added support for its GPUs. An official AMD container is also available but lacks both support for Cray MPI and some core Python packages, making it unusable on Setonix. For this reason, Pawsey developed its own TensorFlow container, which is installed on Setonix and available through the module system. The Pawsey TensorFlow container is the only supported way to run TensorFlow on Setonix.
Note: This page is still a work in progress and support for Machine Learning workloads has just started. Please check it frequently for updates.
The TensorFlow module
Currently, there are two TensorFlow containers available on Setonix:
Terminal 1. Look for the TensorFlow module

$ module avail tensorflow
--------------------------------------------------------- /software/setonix/2023.08/containers/views/modules -------------------------------------------------------------
tensorflow/rocm5.5-tf2.11-dev tensorflow/rocm5.6-tf2.12 (D)
The tensorflow/rocm5.5-tf2.11-dev module is deprecated and should not be used. It will be removed at the next maintenance.
The container is deployed using SHPC (Singularity Registry HPC). SHPC generates a module (listed above) providing convenient aliases for key programs within the container. This means you can run the container's python3 executable without explicitly loading and executing Singularity. Here is a very simple example.
Terminal 2. A simple interaction with the TensorFlow module.

$ module load tensorflow/rocm5.6-tf2.12
$ python3
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2023-09-07 14:29:15.551224: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> tf.__version__
'2.12.0'
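For reference, the python3 alias generated by SHPC is roughly equivalent to invoking Singularity against the container image yourself. The following is only a sketch of that equivalent call; the image path is an assumption for illustration, as the actual path is recorded by SHPC inside the module.

# Rough equivalent of the python3 alias (sketch only).
# The image path below is hypothetical; SHPC stores the real path in the module definition.
singularity exec /path/to/tensorflow_rocm5.6-tf2.12.sif python3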
Here is another example of running a simple training script on a GPU node:
Terminal 3. Running an ML Python script interactively on a compute node

$ salloc -p gpu --nodes=1 --gres=gpu:1 --ntasks-per-node=1 -Apawsey0001-gpu
salloc: Granted job allocation 4360927
salloc: Waiting for resource configuration
salloc: Nodes nid002828 are ready for job
$ module load tensorflow/rocm5.6-tf2.12
$ python3 01_horovod_mnist.py
2023-09-07 14:32:18.907641: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:root:This is process with rank 0 and local rank 0
INFO:root:This is process with rank 0 and local rank 0: gpus available are: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
2023-09-07 14:32:23.886297: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 134200961 MB memory: -> device: 0, name: AMD Instinct MI250X, pci bus id: 0000:d1:00.0
[...]
INFO:root:This is process with rank 0 and local rank 0: my prediction is [[ -3.5764134 -6.1231604 -1.5476028 2.1744065 -14.56255 -5.4938045
-20.374353 12.388017 -3.1701622 -1.0773858]]
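Before launching a longer training run, it can be useful to confirm that the container's TensorFlow build sees the GPUs in your allocation. The following one-liner is a minimal check, assuming the module is loaded and you are inside a GPU allocation as in Terminal 3:

# Sanity check: print the GPUs visible to TensorFlow from within the allocation.
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"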
Installing additional Python packages
Note: If you think there is a Python package that should be included in the container because it is widely used in the Machine Learning community, you can submit a ticket to the Help Desk and we will evaluate your request.
You can use a virtual environment to install additional Python packages that you require and the container lacks. The trick is to create the virtual environment using the Python installation within the container, so that packages are installed taking into account what is already present in the container rather than on Setonix. The virtual environment itself is created on the host filesystem, ideally under Setonix's /software. Setonix filesystems are mounted in containers by default and are writable from within the container, hence pip can install additional packages there; moreover, the virtual environment is preserved from one container run to the next.
To do so, you will need to open a BASH shell within the container. Here is a practical example.
Terminal 4. Installing additional Python packages using virtual environments

$ module load tensorflow/rocm5.6-tf2.12
$ bash
Singularity> python3 -m venv --system-site-packages myenv
Singularity> source myenv/bin/activate
(myenv) Singularity> python3 -m pip install xarray
Collecting xarray
Using cached xarray-2023.8.0-py3-none-any.whl (1.0 MB)
Requirement already satisfied: packaging>=21.3 in /usr/local/lib/python3.10/dist-packages (from xarray) (23.1)
Collecting pandas>=1.4
Using cached pandas-2.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.7 MB)
Requirement already satisfied: numpy>=1.21 in /usr/local/lib/python3.10/dist-packages (from xarray) (1.23.5)
Collecting pytz>=2020.1
Using cached pytz-2023.3.post1-py2.py3-none-any.whl (502 kB)
Collecting python-dateutil>=2.8.2
Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting tzdata>=2022.1
Using cached tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas>=1.4->xarray) (1.16.0)
Installing collected packages: pytz, tzdata, python-dateutil, pandas, xarray
Successfully installed pandas-2.1.0 python-dateutil-2.8.2 pytz-2023.3.post1 tzdata-2023.3 xarray-2023.8.0
(myenv) Singularity> python3
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
2023-09-07 14:59:00.339696: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> import xarray
>>> exit
>>>
(myenv) Singularity> exit
/software/projects/pawsey0001/cdipietrantonio> ls
myenv
As you can see, the environment stays on the filesystem and can be reused in later runs.
Terminal 5. The environment can be used once again.

$ module load tensorflow/rocm5.6-tf2.12
$ bash
Singularity> source myenv/bin/activate
(myenv) Singularity>
Distributed training
You can run TensorFlow on multiple Setonix nodes. The best way is to submit a job to Slurm using sbatch and a batch script. Let's assume you have prepared a Python script implementing your TensorFlow program and created a virtual environment that you want active while the script executes. Here is a batch script implementing such a scenario.
Listing 1. An example batch script to run a TensorFlow distributed training job.

#!/bin/bash
#SBATCH --account=pawsey0001-gpu
#SBATCH --partition=gpu
#SBATCH --exclusive
#SBATCH --nodes=2
#SBATCH --tasks-per-node=1
module load tensorflow/rocm5.6-tf2.12
VENV_PATH=/software/projects/pawsey0001/cdipietrantonio/myenv/bin/activate
PYTHON_SCRIPT=/software/projects/pawsey0001/cdipietrantonio/cdipietrantonio-machinelearning/models/01_horovod_mnist.py
srun --tasks-per-node=1 --nodes=2 bash -c "source $VENV_PATH && python3 $PYTHON_SCRIPT"
As you might have guessed, the bash alias calls the BASH interpreter within the container, which executes a command line that activates the environment and invokes python3 to run the script. For a more complex sequence of commands, it is advisable to create a support BASH script and execute it with the bash alias, as sketched below.
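For example, the inline command of Listing 1 could be replaced by a support script along the following lines. This is only a sketch: the script name and the paths are placeholders rather than files that exist on Setonix.

#!/bin/bash
# run_training.sh -- support script executed inside the container via the bash alias.
# All paths are placeholders; point them at your own virtual environment and Python script.
source /software/projects/<project>/<username>/myenv/bin/activate
python3 /software/projects/<project>/<username>/models/my_training_script.py

The srun line in the batch script would then become srun --tasks-per-node=1 --nodes=2 bash /path/to/run_training.sh, with the bash alias again resolving to the interpreter inside the container.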