Excerpt |
---|
Machine Learning workloads are supported on Setonix through a custom TensorFlow container developed by Pawsey. This page illustrates its usage. |
...
$ docker pull quay.io/pawsey/tensorflow:2.12.1.570-rocm5.6.0
Column |
---|
Note |
---|
This page is still a work in progress and support for Machine Learning workloads has just started. Please check it frequently for updates. |
|
...
The TensorFlow module
Currently, TensorFlow is available on Setonix through modules that make use of containers:
Column |
---|
|
Code Block |
---|
language | bash |
---|
theme | DJango |
---|
title | Terminal 1. Look for the TensorFlow module |
---|
| $ module avail tensorflow
--------------------------------------------------------- /software/setonix/2023.08/containers/views/modules -------------------------------------------------------------
tensorflow/rocm5.6-tf2.12 (D)
|
|
The container is deployed as a module using SHPC (Singularity Registry HPC). SHPC generates a module (listed above) providing convenient aliases for key programs within the container. This means you can run the container's python3 executable without explicitly loading Singularity and invoking its commands. The Singularity module is loaded as a dependency when the TensorFlow module is loaded, and all the Singularity commands are taken care of by wrappers. Here is a very simple example.
...
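To see the wrapper mechanism described above at work, one option is to compare how the shell resolves python3 before and after loading the module. The following is only a minimal sketch: describe_cmd is a hypothetical helper introduced here for illustration, the module name is taken from the listing in Terminal 1, and the exact wrapper path reported on Setonix is not shown because it depends on the installation.

```shell
#!/bin/bash
# Sketch: check how `python3` resolves before and after loading the module.
# Assumption: the module name matches the `module avail` listing in Terminal 1.

# Hypothetical helper: print how the shell resolves a command name
# (alias, function, builtin, or file path), or a message if it is unknown.
describe_cmd() {
    if type "$1" >/dev/null 2>&1; then
        type "$1"
    else
        echo "$1: not found"
    fi
}

# Before loading the module this is typically the system interpreter, if any.
describe_cmd python3

# `module` is only defined on systems that provide environment modules,
# such as Setonix; elsewhere the load step is skipped.
if type module >/dev/null 2>&1; then
    module load tensorflow/rocm5.6-tf2.12
    # After loading, python3 is expected to resolve to the SHPC-generated wrapper.
    describe_cmd python3
else
    echo "module command not available; skipping module load"
fi
```

If the second call reports a different location from the first, python3 is being provided by the module rather than by the system.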
Column |
---|
|
Code Block |
---|
language | bash |
---|
theme | DJango |
---|
title | Terminal 3. Running an ML Python script interactively on a compute node |
---|
| $ salloc -p gpu --nodes=1 --gres=gpu:1 -A yourProjectName-gpu --time=00:20:00
salloc: Granted job allocation 4360927
salloc: Waiting for resource configuration
salloc: Nodes nid002828 are ready for job
$ module load tensorflow/rocm5.6-tf2.12
$ python3 01_horovod_mnist.py
2023-09-07 14:32:18.907641: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
INFO:root:This is process with rank 0 and local rank 0
INFO:root:This is process with rank 0 and local rank 0: gpus available are: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
2023-09-07 14:32:23.886297: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 134200961 MB memory: -> device: 0, name: AMD Instinct MI250X, pci bus id: 0000:d1:00.0
[...]
INFO:root:This is process with rank 0 and local rank 0: my prediction is [[ -3.5764134 -6.1231604 -1.5476028 2.1744065 -14.56255 -5.4938045
-20.374353 12.388017 -3.1701622 -1.0773858]]
|
|
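Before launching a longer script, it can be useful to confirm that TensorFlow sees the allocated GPU, as the "gpus available are" log line in Terminal 3 reports. The sketch below assumes it is run inside the interactive allocation with the TensorFlow module already loaded; check_gpus is a hypothetical helper name, and the fallback message is only there so the snippet fails gracefully where TensorFlow is not installed.

```shell
#!/bin/bash
# Sketch: ask TensorFlow which GPUs it can see from the current shell.
# Assumption: the tensorflow module is loaded, so `python3` is the container's interpreter.

# Hypothetical helper: print the list of GPU devices visible to TensorFlow,
# or a short message if TensorFlow cannot be imported here.
check_gpus() {
    python3 -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))' 2>/dev/null \
        || echo "TensorFlow is not importable in this environment"
}

check_gpus
```

An empty list would suggest that no GPU was granted to the session or that it is not visible from inside the container.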
...