Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Column
width900px


Code Block
languagebash
themeDJango
titleTerminal 1. Look for the TensorFlow module
$ module avail tensorflow
--------------------------------------------------------- /software/setonix/2023.08/containers/views/modules -------------------------------------------------------------
   tensorflow/rocm5.5-tf2.11-dev       tensorflow/rocm5.6-tf2.12 (D)


The tensorflow/rocm5.5-tf2.11-dev module is deprecated and should not be used; it will be removed at the next maintenance.

The container is deployed using SHPC (Singularity Registry HPC). SHPC generates a module (listed above) that provides convenient aliases for key programs within the container. This means you can run the container's python3 executable without explicitly loading and invoking Singularity. Here is a very simple example.
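Conceptually, an SHPC-generated alias is a thin wrapper that prepends the `singularity exec` invocation to the wrapped program. A rough sketch of the idea follows; the container path is purely illustrative (not the real location on Setonix), and `echo` stands in for the actual `singularity` binary:

```shell
# Mimic an SHPC-style alias: wrap python3 so it would run inside the container.
# "echo" stands in for the real singularity binary, for illustration only.
container=/path/to/tensorflow_rocm5.6-tf2.12.sif   # hypothetical path

python3_alias() {
    echo singularity exec "$container" python3 "$@"
}

python3_alias -c "print('hello')"
```

In the real module, the alias is transparent: typing `python3` simply runs the containerised interpreter.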

Column
width900px


Code Block
languagebash
themeDJango
titleTerminal 2. A simple interaction with the TensorFlow module.
$ module load tensorflow/rocm5.6-tf2.12 
$ python3
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2023-09-07 14:29:15.551224: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> tf.__version__
'2.12.0'
>>> exit()
$ 


Here is another example of running a simple training script on a GPU node:

Column
width900px


Code Block
languagebash
themeDJango
titleTerminal 3. Running a ML Python script interactively on a compute node
$ salloc -p gpu --nodes=1 --gpus-per-node=1 --ntasks-per-node=1 -A yourProjectName-gpu --time=00:20:00
salloc: Granted job allocation 4360927
salloc: Waiting for resource configuration
salloc: Nodes nid002828 are ready for job
$ module load tensorflow/rocm5.6-tf2.12  
$ python3 01_horovod_mnist.py 
2023-09-07 14:32:18.907641: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:root:This is process with rank 0 and local rank 0
INFO:root:This is process with rank 0 and local rank 0: gpus available are: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
2023-09-07 14:32:23.886297: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 134200961 MB memory:  -> device: 0, name: AMD Instinct MI250X, pci bus id: 0000:d1:00.0
[...]
INFO:root:This is process with rank 0 and local rank 0: my prediction is [[ -3.5764134  -6.1231604  -1.5476028   2.1744065 -14.56255    -5.4938045
  -20.374353   12.388017   -3.1701622  -1.0773858]]



(Note that when requesting the interactive allocation, users should use their own project name in place of the "yourProjectName" placeholder. Also note the "-gpu" suffix appended to the project name, which is required to access any partition with GPU nodes.)

Installing additional Python packages

...

Column
width900px


Code Block
languagebash
themeDJango
titleTerminal 4. Installing additional Python packages using virtual environments
$ module load tensorflow/rocm5.6-tf2.12   
$ bash
Singularity> python3 -m venv --system-site-packages myenv  
Singularity> source myenv/bin/activate
(myenv) Singularity> python3 -m pip install xarray
Collecting xarray
  Using cached xarray-2023.8.0-py3-none-any.whl (1.0 MB)
Requirement already satisfied: packaging>=21.3 in /usr/local/lib/python3.10/dist-packages (from xarray) (23.1)
Collecting pandas>=1.4
  Using cached pandas-2.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.7 MB)
Requirement already satisfied: numpy>=1.21 in /usr/local/lib/python3.10/dist-packages (from xarray) (1.23.5)
Collecting pytz>=2020.1
  Using cached pytz-2023.3.post1-py2.py3-none-any.whl (502 kB)
Collecting python-dateutil>=2.8.2
  Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting tzdata>=2022.1
  Using cached tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas>=1.4->xarray) (1.16.0)
Installing collected packages: pytz, tzdata, python-dateutil, pandas, xarray
Successfully installed pandas-2.1.0 python-dateutil-2.8.2 pytz-2023.3.post1 tzdata-2023.3 xarray-2023.8.0
(myenv) Singularity> python3
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
2023-09-07 14:59:00.339696: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> import xarray
>>> exit
>>> 
(myenv) Singularity> 
exit
/software/projects/pawsey12345/matilda> ls
myenv 


As you can see, the virtual environment persists on the filesystem and can be reused in later runs.
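Reusing the environment in a later session only requires activating it again; previously installed packages are picked up automatically. A minimal sketch of the pattern (the Setonix-specific `module load tensorflow/rocm5.6-tf2.12` step is omitted so the snippet runs anywhere):

```shell
# Create the environment once (as in Terminal 4); later sessions only
# need the activate step to pick up previously installed packages.
python3 -m venv --system-site-packages myenv
source myenv/bin/activate
python3 -c "import sys; print(sys.prefix)"   # prints a path inside myenv
deactivate
```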

...

Column
width900px


Code Block
languagebash
themeEmacs
titleListing 1. An example batch script to run a TensorFlow distributed training job.
#!/bin/bash

#SBATCH --account=pawsey12345-gpu
#SBATCH --partition=gpu
#SBATCH --exclusive
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

module load tensorflow/rocm5.6-tf2.12

VENV_PATH=/software/projects/pawsey12345/matilda/myenv/bin/activate
PYTHON_SCRIPT=/software/projects/pawsey12345/matilda/matilda-machinelearning/models/01_horovod_mnist.py
srun --ntasks-per-node=1 --nodes=2 bash -c "source $VENV_PATH && python3 $PYTHON_SCRIPT"


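The srun line wraps the command in `bash -c` because each launched task must source the virtual environment activation script itself before starting Python; activating the venv in the batch script alone would not propagate to tasks on other nodes. The same source-then-run pattern can be demonstrated locally (`demoenv` is a throwaway name used here for illustration):

```shell
# Each process sources the venv activation script, then runs Python;
# sys.prefix confirms the interpreter is using the venv.
python3 -m venv demoenv
bash -c "source demoenv/bin/activate && python3 -c 'import sys; print(sys.prefix)'"
```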
...