Introduction

PyTorch is one of the most popular frameworks for developing machine learning and deep learning applications. It provides building blocks for defining neural networks, including a variety of predefined layers, activation functions and optimisation algorithms, as well as utilities to load and store data. It supports GPU acceleration for training and inference on hardware from several vendors, such as NVIDIA, AMD and Intel GPUs.

PyTorch installation on Setonix

Setonix can support deep learning workloads thanks to the large number of AMD GPUs installed on the system. PyTorch must be compiled from source against the Cray MPI library, for distributed training, and against a suitable ROCm version, for GPU support. To make this easier for users, Pawsey has developed a Docker container in which PyTorch has been built with all the necessary dependencies and configuration options to run efficiently on Setonix.

The Docker container is publicly available on the Pawsey repository (external site) on Quay.io. Users can build on top of the container to install additional Python packages. It can be pulled with Docker using the following command:

$ docker pull quay.io/pawsey/pytorch:2.2.0-rocm5.7.3

The container can also be pulled using Singularity:

$ singularity pull pytorch.sif docker://quay.io/pawsey/pytorch:2.2.0-rocm5.7.3

An official AMD container is also available but lacks both support for Cray MPI and some core Python packages, making it unusable on Setonix.

The PyTorch container developed by Pawsey is also available on Setonix as a module installed using SHPC.

Because of software stack deployment policies, the container versions deployed on Setonix might be older than those available in the online repository. New software is installed roughly every six months, but you are free to pull the latest container into your own space.

To check what version is available, use the module avail command.

Terminal 1. Checking what version of PyTorch is available on Setonix.

$ module avail pytorch

-------------- /software/setonix/2023.08/containers/views/modules --------------
   pytorch/2.2.0-rocm5.7.3

SHPC generates a module (listed above) that provides convenient aliases for key programs within the container. This means you can run the container's python3 executable without explicitly loading Singularity and executing Singularity commands. The Singularity module is loaded as a dependency when the PyTorch module is loaded, and all the Singularity commands are taken care of via wrappers. Here is a very simple example.

Terminal 2. Invoking the python3 interpreter within the PyTorch container.

$ module load pytorch/2.2.0-rocm5.7.3
$ python3
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.2.0a0+git8ac9b20'

Here is another example of running a simple training script on a GPU node during an interactive session:

Terminal 3. Using PyTorch on a compute node in an interactive Slurm session.

setonix-05$ salloc -p gpu -A yourProjectCode-gpu --gres=gpu:1 --time=00:20:00
salloc: Pending job allocation 12386179
salloc: job 12386179 queued and waiting for resources
salloc: job 12386179 has been allocated resources
salloc: Granted job allocation 12386179
salloc: Waiting for resource configuration
salloc: Nodes nid002096 are ready for job
nid002096$ module load pytorch/2.2.0-rocm5.7.3 
nid002096$ python3 main.py

Note that when requesting the interactive allocation, users should use their own project name instead of the "yourProjectCode" placeholder used in the example. Also note the "-gpu" suffix appended to the project name, which is required to access any partition with GPU nodes. Please refer to the page Example Slurm Batch Scripts for Setonix on GPU Compute Nodes for a detailed explanation of resource allocation on GPU nodes.
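The contents of main.py are not shown on this page. As a purely illustrative stand-in, a minimal training script might look like the sketch below. Everything in it is an assumption made for the example: the train function name, the synthetic data, the single linear layer and the hyperparameters. It only relies on standard PyTorch calls and falls back to the CPU when no GPU is visible.

```python
import torch
from torch import nn


def train(steps=100, lr=0.1, seed=0):
    """Fit y = 2x + 1 with one linear layer; return (first_loss, last_loss)."""
    torch.manual_seed(seed)
    # Use the GPU when one is visible to PyTorch, otherwise use the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Synthetic, noise-free data so the sketch needs no dataset on disk.
    x = torch.randn(256, 1, device=device)
    y = 2.0 * x + 1.0

    # Hypothetical model and optimiser choices, kept as small as possible.
    model = nn.Linear(1, 1).to(device)
    optimiser = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()

    losses = []
    for _ in range(steps):
        optimiser.zero_grad()            # clear gradients from the previous step
        loss = loss_fn(model(x), y)      # forward pass
        loss.backward()                  # backward pass
        optimiser.step()                 # parameter update
        losses.append(loss.item())
    return losses[0], losses[-1]


if __name__ == "__main__":
    first, last = train()
    print(f"loss: {first:.4f} -> {last:.6f}")
```

Saved as main.py, a script of this kind can be started with python3 main.py after loading the PyTorch module, as in the interactive session above.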

Writing PyTorch code for AMD GPUs

To increase portability and minimise code changes, PyTorch exposes AMD GPUs through the same interface that was originally dedicated to CUDA, so existing torch.cuda calls and the "cuda" device string work unchanged on ROCm builds. More information is available at HIP (ROCm) semantics (external site).
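As a minimal sketch of what this means in practice, the code below selects a device through the torch.cuda interface and reports which backend is behind it. It assumes a recent PyTorch build, where torch.version.hip holds a version string on ROCm builds and is None otherwise. The helper name describe_backend is hypothetical, and the code falls back to the CPU when no GPU is visible, for example on a login node.

```python
import torch


def describe_backend():
    """Return (device, backend_name) for the current PyTorch build."""
    # On ROCm builds torch.version.hip is a version string; on CUDA or
    # CPU-only builds it is None.
    hip_version = getattr(torch.version, "hip", None)
    if torch.cuda.is_available():
        # The "cuda" device string is used for AMD GPUs as well.
        device = torch.device("cuda")
        backend = "ROCm/HIP" if hip_version else "CUDA"
    else:
        device = torch.device("cpu")
        backend = "CPU"
    return device, backend


device, backend = describe_backend()
x = torch.ones(4, 4, device=device)  # allocated on the GPU when one is visible
y = (x @ x).sum().item()             # each entry of x @ x is 4, so the sum is 64
print(f"backend={backend} device={device} result={y}")
```

Because the device is chosen once and passed around, the same script can run unchanged on NVIDIA GPUs, AMD GPUs or a CPU.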