Page Comparison

Machine Learning workloads are supported on Setonix through a custom TensorFlow container developed by Pawsey. This page illustrates its usage.

...

$ docker pull quay.io/pawsey/tensorflow:2.12.1.570-rocm5.6.0

Note
This page is still a work in progress, and support for Machine Learning workloads has only just started. Please check it frequently for updates.

...

Listing 1. distribute_tf.sh: An example batch script to run a TensorFlow distributed training job.

#!/bin/bash

#SBATCH --account=pawsey12345-gpu
#SBATCH --partition=gpu
#SBATCH --exclusive
#SBATCH --nodes=2

module load tensorflow/rocm5.6-tf2.12

# Optional: activation script of the user's own virtual environment, if additional Python packages have been installed there
VENV_PATH=/software/projects/pawsey12345/matilda/myenv/bin/activate

# Path to the Python script containing the TensorFlow training case
PYTHON_SCRIPT=/software/projects/pawsey12345/matilda/matilda-machinelearning/models/01_horovod_mnist.py

# Launch the job, passing the resource request explicitly to srun
srun -N 2 -n 16 -c 8 --ntasks-per-node=8 --gres=gpu:8 bash -c "source $VENV_PATH && python3 $PYTHON_SCRIPT"

Here, the training is distributed across 8 GPUs per node. (The reasoning behind the resource request and the options passed to the srun command on GPU nodes is explained in detail in Example Slurm Batch Scripts for Setonix on GPU Compute Nodes.)
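
The counts behind such a request can be worked out with a little shell arithmetic. The sketch below is only illustrative: the variable names are hypothetical, and it assumes one training process (Slurm task) per GPU and 8 CPU cores per process, neither of which is stated explicitly on this page.

```shell
#!/bin/bash
# Hypothetical helper variables, not part of Listing 1.
NODES=2              # matches "#SBATCH --nodes=2"
GPUS_PER_NODE=8      # 8 GPUs per node, as stated above
CORES_PER_TASK=8     # assumption: 8 CPU cores for each training process

# Assumption: one training process (Slurm task) per GPU.
TASKS_PER_NODE=$GPUS_PER_NODE
TOTAL_TASKS=$(( NODES * TASKS_PER_NODE ))
CORES_PER_NODE=$(( TASKS_PER_NODE * CORES_PER_TASK ))

echo "nodes=$NODES tasks=$TOTAL_TASKS tasks_per_node=$TASKS_PER_NODE cores_per_node=$CORES_PER_NODE"
```

Under these assumptions, two nodes give 16 tasks in total, 8 per node, using 64 cores on each node.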

As you might have guessed, we use the bash alias to call the BASH interpreter within the container. It executes a BASH command line that activates the environment and invokes python3 to run the script. For a more complex sequence of commands, it is advisable to create a support BASH script and execute it with the bash alias. (If no virtual environment is needed, the parts of the script related to it can be omitted.)
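
A support script of this kind might look like the following minimal sketch. The file name run_training.sh, its argument order, and the TensorFlow version check are hypothetical choices made for illustration; they are not taken from the Pawsey documentation.

```shell
#!/bin/bash
# Write a hypothetical support script, run_training.sh (name chosen for illustration).
cat > run_training.sh <<'EOF'
#!/bin/bash
# Usage: run_training.sh <python-script> [venv-activate-script]
set -e                      # stop at the first failing command

PYTHON_SCRIPT="$1"
VENV_ACTIVATE="${2:-}"      # optional second argument

# Activate the user's virtual environment only if one was given.
if [ -n "$VENV_ACTIVATE" ]; then
    source "$VENV_ACTIVATE"
fi

# Extra steps can go here, e.g. printing the TensorFlow version seen by each task.
python3 -c "import tensorflow as tf; print('TensorFlow', tf.__version__)"

# Finally, run the training case.
python3 "$PYTHON_SCRIPT"
EOF
chmod +x run_training.sh
```

In Listing 1, the `bash -c "..."` part of the srun command would then become `bash run_training.sh "$PYTHON_SCRIPT" "$VENV_PATH"`, assuming the script is stored in a directory that is visible inside the container.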

Version	Old Version 11	New Version 12
Changes made by	Alexis Espinosa	Alexis Espinosa
Saved on	Jan 30, 2024	Jan 31, 2024
