Machine Learning workloads are supported on Setonix through a custom TensorFlow container developed by Pawsey. This page illustrates its usage.

...

$ docker pull quay.io/pawsey/tensorflow:2.12.1.570-rocm5.6.0 


Note

This page is still a work in progress, and support for Machine Learning workloads on Setonix has only recently begun. Please check back regularly for updates.


...

The TensorFlow module

Currently, TensorFlow is available on Setonix through a module that makes use of a container:

Terminal 1. Look for the TensorFlow module
$ module avail tensorflow
--------------------------------------------------------- /software/setonix/2023.08/containers/views/modules -------------------------------------------------------------
tensorflow/rocm5.6-tf2.12 (D)


The container is deployed as a module using SHPC (Singularity Registry HPC). SHPC generates a module (listed above) providing convenient aliases for key programs within the container. This means you can run the container's python3 executable without explicitly loading Singularity or running Singularity commands yourself. The Singularity module is loaded as a dependency when the TensorFlow module is loaded, and all Singularity commands are handled by wrappers. Here is a very simple example.
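
To confirm that the python3 alias really runs inside the container, a small script along the following lines may help. It is a sketch and not part of the Pawsey-provided material: the function name container_info is hypothetical, and it assumes the container runtime exports SINGULARITY_CONTAINER (or APPTAINER_CONTAINER) inside the container, which may differ between runtime versions.

```python
import os
import sys


def container_info(environ=os.environ):
    """Describe the running interpreter and any container it runs in.

    Assumption: Singularity/Apptainer sets SINGULARITY_CONTAINER or
    APPTAINER_CONTAINER to the image path inside a container.
    """
    image = environ.get("SINGULARITY_CONTAINER") or environ.get("APPTAINER_CONTAINER")
    return {
        "python_executable": sys.executable,  # interpreter actually in use
        "in_container": image is not None,    # True when an image path was found
        "container_image": image,             # image path, or None outside a container
    }


if __name__ == "__main__":
    for key, value in container_info().items():
        print(f"{key}: {value}")
```

Run with python3 after loading the TensorFlow module, the script should report in_container as True if the wrapper behaves as described. Run without the module loaded, it reports the system interpreter instead.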

...

Terminal 3. Running an ML Python script interactively on a compute node
$ salloc -p gpu --nodes=1 --gres=gpu:1 -A yourProjectName-gpu --time=00:20:00
salloc: Granted job allocation 4360927
salloc: Waiting for resource configuration
salloc: Nodes nid002828 are ready for job
$ module load tensorflow/rocm5.6-tf2.12  
$ python3 01_horovod_mnist.py 
2023-09-07 14:32:18.907641: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:root:This is process with rank 0 and local rank 0
INFO:root:This is process with rank 0 and local rank 0: gpus available are: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
2023-09-07 14:32:23.886297: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 134200961 MB memory:  -> device: 0, name: AMD Instinct MI250X, pci bus id: 0000:d1:00.0
[...]
INFO:root:This is process with rank 0 and local rank 0: my prediction is [[ -3.5764134  -6.1231604  -1.5476028   2.1744065 -14.56255    -5.4938045
  -20.374353   12.388017   -3.1701622  -1.0773858]]
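
Before launching a longer training script, it can be useful to check that TensorFlow sees the allocated GPU, as the "gpus available are" log line above does. The following is a minimal sketch, not the contents of 01_horovod_mnist.py: the function name visible_gpus is hypothetical, and it relies only on tf.config.list_physical_devices.

```python
def visible_gpus():
    """Return the GPUs TensorFlow can see, or None if TensorFlow cannot be imported."""
    try:
        import tensorflow as tf  # provided by the container once the module is loaded
    except ImportError:
        return None
    # Returns a list of PhysicalDevice objects, empty when no GPU is visible.
    return tf.config.list_physical_devices("GPU")


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not available; load the tensorflow module first.")
    elif not gpus:
        print("TensorFlow is available but no GPU is visible.")
    else:
        print(f"{len(gpus)} GPU(s) visible: {gpus}")
```

An empty list inside a GPU allocation would suggest that the job did not request a GPU or that the GPU was not bound to the task.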



...