PyTorch

PyTorch is an optimised tensor library for deep learning using GPUs and CPUs.

 


Introduction

PyTorch is one of the most popular frameworks for developing Machine Learning and Deep Learning applications. It provides building blocks for defining neural networks, with a variety of predefined layers, activation functions, optimisation algorithms, and utilities for loading and storing data. It supports GPU acceleration for training and inference on a range of hardware, including NVIDIA, AMD and Intel GPUs.
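As a minimal illustration of these building blocks (a hypothetical toy model, not one of the Setonix examples below), a small network can be defined and trained for one step like this:

```python
import torch
from torch import nn

# A toy two-layer network built from predefined layers and activations
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One training step on random data (runs on CPU; no GPU required)
x = torch.randn(16, 4)
target = torch.randint(0, 2, (16,))
loss = loss_fn(model(x), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```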

PyTorch installation on Setonix

Setonix can support Deep Learning workloads thanks to the large number of AMD GPUs installed on the system. PyTorch must be compiled from source to make use of the Cray MPI library for distributed training, and of a suitable ROCm version to use the GPUs. To make this easier for users, Pawsey has developed a Docker container for PyTorch, built with all the necessary dependencies and configuration options to run efficiently on Setonix.

The PyTorch container developed by Pawsey is available on Setonix as a module installed using SHPC.

Because of software stack deployment policies, container versions deployed on Setonix might be older than those available in the online repository. New software is installed roughly every six months, but you are free to pull the latest container into your own space.

To check which version is available in the default software stack, use the module avail command.

Terminal 1. Checking what version of PyTorch is available on Setonix.
$ module avail pytorch

------------------------------- /software/setonix/2025.08/containers/views/modules --------------------------------
   pytorch/2.7.1-rocm6.3.3 (D)

SHPC generates a module (listed above) providing convenient aliases for key programs within the container. This means you can run the container's python3 executable without explicitly loading and invoking Singularity: the Singularity module is loaded as a dependency when the PyTorch module is loaded, and all Singularity commands are taken care of via wrappers. Here is a very simple example.

Terminal 2. Invoking the python3 interpreter within the PyTorch container.
$ module load pytorch/2.7.1-rocm6.3.3
$ python3
Python 3.12.3 (main, Jun 18 2025, 17:59:45) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.7.1a0+gite2d141d'

Here is another example of running a simple training script on a GPU node during an interactive session:

Redefine your TMPDIR

We have seen crashes when two users on the same node run the same PyTorch script with the containerised PyTorch module. This is due to both jobs trying to create files with the same name in /tmp. We are working towards a definitive solution to this problem. In the meantime, we recommend redefining the TMPDIR environment variable as a workaround:

export TMPDIR=/tmp/${USER}-${SLURM_JOB_ID}

Terminal 3. Using PyTorch on a compute node in an interactive Slurm session.
setonix-05$ salloc -p gpu -A yourProjectCode-gpu --gres=gpu:1 --time=00:20:00
salloc: Pending job allocation 12386179
salloc: job 12386179 queued and waiting for resources
salloc: job 12386179 has been allocated resources
salloc: Granted job allocation 12386179
salloc: Waiting for resource configuration
salloc: Nodes nid002096 are ready for job
nid002096$ module load pytorch/2.7.1-rocm6.3.3
nid002096$ export TMPDIR=/tmp/${USER}-${SLURM_JOB_ID}
nid002096$ python3 main.py
NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)

### Epoch 0/10 ###
loss: 2.289750  [   64/60000]
loss: 2.287861  [ 6464/60000]
loss: 2.263056  [12864/60000]
loss: 2.261112  [19264/60000]
loss: 2.240377  [25664/60000]
loss: 2.208018  [32064/60000]
loss: 2.225265  [38464/60000]
loss: 2.183236  [44864/60000]

Note that when requesting the interactive allocation, users should use their own project name instead of the "yourProjectCode" placeholder used in the example. Also note the use of the "-gpu" suffix on the project name, which is required to access any partition with GPU nodes. Please refer to the page Example Slurm Batch Scripts for Setonix on GPU Compute Nodes for a detailed explanation of resource allocation on GPU nodes.

Installing additional Python packages

There are at least two ways in which users can install additional Python packages that the container lacks. The first is to build your own container image on top of the Pawsey PyTorch container. The second is to use a virtual environment saved on Setonix itself, in which the additional Python packages are installed and from which they are loaded. This second way is our recommended procedure.

The trick is to create a virtual environment using the Python installation within the container. This ensures that your packages are installed taking into account what is already installed in the container, not on Setonix. The virtual environment itself, however, is created on the host filesystem, ideally under Setonix's /software. Setonix filesystems are mounted on containers by default and are writable from within the container, so pip can install additional packages, and virtual environments are preserved from one container run to the next. We recommend installing these virtual environments in an understandable path under $MYSOFTWARE/manual/software.

To do so, you will need to open a Bash shell within the container. Thanks to the installation of the PyTorch container as a module, there is no need to call the singularity command explicitly: the containerised installation provides a bash wrapper that does all the work and provides an interactive shell inside the Singularity container. Here is a practical example that installs the xarray package into a virtual environment. Note that the name of the environment is stored in the variable myEnv, which you must set to a name of your choice:

Terminal 4. Installing additional Python packages using virtual environments
$ module load pytorch/2.7.1-rocm6.3.3
$ mkdir -p $MYSOFTWARE/manual/software/pythonEnvironments/pytorchContainer-environments
$ cd $MYSOFTWARE/manual/software/pythonEnvironments/pytorchContainer-environments
$ export TMPDIR=/tmp/${USER}-${SLURM_JOB_ID}
$ bash
Singularity> myEnv=chooseAName
Singularity> python3 -m venv --system-site-packages $myEnv
Singularity> source $myEnv/bin/activate
(chooseAName) Singularity> python3 -m pip install xarray
Collecting xarray
  Downloading xarray-2024.5.0-py3-none-any.whl (1.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 28.1 MB/s eta 0:00:00
Requirement already satisfied: packaging>=23.1 in /usr/local/lib/python3.10/dist-packages (from xarray) (23.2)
Requirement already satisfied: numpy>=1.23 in /usr/local/lib/python3.10/dist-packages (from xarray) (1.26.3)
Collecting pandas>=2.0
  Downloading pandas-2.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.0/13.0 MB 95.4 MB/s eta 0:00:00
Collecting pytz>=2020.1
  Downloading pytz-2024.1-py2.py3-none-any.whl (505 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 505.5/505.5 KB 59.9 MB/s eta 0:00:00
Collecting python-dateutil>=2.8.2
  Using cached python_dateutil-2.9.0.post0-py2.py3-none-any.whl (229 kB)
Collecting tzdata>=2022.7
  Downloading tzdata-2024.1-py2.py3-none-any.whl (345 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 345.4/345.4 KB 48.1 MB/s eta 0:00:00
Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.8.2->pandas>=2.0->xarray) (1.16.0)
Installing collected packages: pytz, tzdata, python-dateutil, pandas, xarray
Successfully installed pandas-2.2.2 python-dateutil-2.9.0.post0 pytz-2024.1 tzdata-2024.1 xarray-2024.5.0

# Now test the use of the installed package
(chooseAName) Singularity> python3
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import xarray
>>> exit
(chooseAName) Singularity> exit
$ ls -l
drwxr-sr-x 5 matilda pawsey12345 4096 Apr 22 16:33 chooseAName

As you can see, the environment stays on the filesystem and can be used in later runs.

Terminal 5. The environment can be used once again.
$ module load pytorch/2.7.1-rocm6.3.3
$ export TMPDIR=/tmp/${USER}-${SLURM_JOB_ID}
$ bash
Singularity> myEnv=chooseAName
Singularity> source $MYSOFTWARE/manual/software/pythonEnvironments/pytorchContainer-environments/${myEnv}/bin/activate
(chooseAName) Singularity>

Writing PyTorch code for AMD GPUs

To increase portability and minimise code changes, PyTorch implements support for AMD GPUs within the interface originally dedicated to CUDA. More information is available at HIP (ROCm) semantics (external site).
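In practice this means that device-agnostic code written for CUDA runs unchanged on the AMD GPUs of Setonix; a minimal sketch:

```python
import torch

# On ROCm builds of PyTorch, the familiar torch.cuda API drives AMD GPUs,
# so the usual device-selection idiom needs no changes.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(32, 32, device=device)
y = x @ x  # executes on the GPU when one is visible, otherwise on the CPU

# torch.version.hip is a version string on ROCm builds and None otherwise
print("device:", device, "| HIP version:", torch.version.hip)
```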

Example scripts for the use of the PyTorch module

Single GPU training (shared node)

The following Slurm job script submits a PyTorch script (mnist.py) for training using a single GPU. The job script is based on the examples provided for shared nodes in the section Example Slurm Batch Scripts for Setonix on GPU Compute Nodes.

Listing N. single_gpu_training.sbatch (sbatch script for single GPU training)

#!/bin/bash --login
#SBATCH --job-name=pytorch_singleGPU
#SBATCH --partition=gpu
#SBATCH --nodes=1              #1 node in this example
#SBATCH --gres=gpu:1           #1 GPU (1 "allocation pack" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=pawsey12345-gpu   #IMPORTANT: use your own project and the -gpu suffix

#----
#Loading Pytorch module from the default software stack:
module load pytorch/2.7.1-rocm6.3.3
echo -e "\n\n#------------------------#"
module list

#----
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}

#----
#Display rocm hardware information
echo -e "\n\n#------------------------#"
echo "Printing from rocm-smi:"
srun -N 1 -n 1 -c 8 --gres=gpu:1 rocm-smi --showhw

#----
#If the script requires additional python modules not present in the pytorch container provided,
# then users need to install them in a virtual environment,
# and define the path with the variables below:
#export MYENV=chooseAName
#export VENV_PATH=$MYSOFTWARE/manual/software/pythonEnvironments/pytorchContainer-environments/${MYENV}

#----
#Definition of the python script containing the pytorch training case
PYTHON_SCRIPT_DIR=$MYSCRATCH/machinelearning/models
PYTHON_SCRIPT="$PYTHON_SCRIPT_DIR/mnist.py"

#----
#Additional settings needed when using the Pytorch module:
export TMPDIR="/tmp/${USER}-${SLURM_JOB_ID}"
mkdir -p $TMPDIR

#----
#Additional settings needed for this Pytorch script:
export OMP_NUM_THREADS=1   #Effective variable to control the number of CPU threads

#----
#Execution
#Note: srun needs the explicit indication of the full parameters for the use of resources in the job step.
#      These are independent from the allocation parameters (which are not inherited by srun).
#      In this case, there is only 1 task and 1 GPU allocated,
#      so no optimal binding parameters are needed and
#      "--gpus-per-task" and "--gpu-bind" are not used.
#      "-c 8" is used to force allocation of a full CPU chiplet (and corresponding memory) to the job.
#      Then, the REAL number of threads for the code SHOULD be defined by
#      the environment variables above.
echo -e "\n\n#------------------------#"
echo "Code execution:"
#Launch the training script inside the container
#When no virtual environment is needed:
srun -l -u -N 1 -n 1 -c 8 --gres=gpu:1 python3 ${PYTHON_SCRIPT}
#When using a virtual environment to make use of additional python modules:
#srun -l -u -N 1 -n 1 -c 8 --gres=gpu:1 bash -c "source $VENV_PATH/bin/activate && python3 ${PYTHON_SCRIPT}"

#----
#Remove any potentially leftover /tmp directories from this job
rm -rf ${TMPDIR}

#----
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished job steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20

#----
#Done
echo -e "\n\n#------------------------#"
echo "Done"

Some particularities of this Slurm job script for the use of the PyTorch module are:

  • Loading of the pytorch module: (comment/uncomment depending on the desired module to use)

    • Loading of the pytorch module from the default software stack is straightforward.

    • If the desired pytorch module is in another software stack, then change to the other software stack first (instructions for the use of a different software stack were taken from: Setonix Updates: Important Information.)

  • Indicate the path of a python environment to use: (uncomment if needed)

    • When the Pytorch script needs additional python modules not included in the containerised module, users need to install those modules in a separate python environment (instructions for this are included in the above sections).

    • The path indicated here is the path to the python environment that contains these additional modules

    • This path is activated in the srun command (later).

  • Redefine the TMPDIR

    • When different PyTorch jobs are executed at the same time, there can be conflicts because the different jobs may create auxiliary files with the same name. To avoid these conflicts, users should redefine the TMPDIR variable.

  • Define environment variables/settings needed to execute your specific Pytorch script

    • OMP_NUM_THREADS is the effective variable to control the number of CPU threads

  • Choice of srun command:

    • The first option of the srun command is the straightforward option

    • The second option activates a python environment to make use of additional python modules (uncomment if needed)

    • mnist.py is the python script that has the pytorch training instructions (assigned via the PYTHON_SCRIPT variable)

    • Note the use of "-c 8" to agree with the best practices for the use of GPUs on Setonix. This isolates 1 full CPU chiplet and the corresponding host memory for this job. But the effective variable to control the number of CPU threads is OMP_NUM_THREADS (defined above).

  • Execute the pytorch script using the python3 wrapper within the srun command

    • When using the provided containerised PyTorch module, python3 is not a "normal" command but a wrapper

      • This wrapper calls the singularity command to execute the Singularity container together with the internal python3 command (all these calls are invisible to the user)

  • As a best practice, delete the temporary directory created for the job

  • Further explanation of other details can be found in the documentation of Example Slurm Batch Scripts for Setonix on GPU Compute Nodes.

 

Listing N. mnist.py (Pytorch script for single GPU training)

import os

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

# Program parameters
device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = 64
n_epochs = 10


class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits


def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")


if __name__ == "__main__":
    data_path = os.path.join(os.environ['MYSOFTWARE'], 'pytorch_data', 'mnist_data')
    training_data = datasets.FashionMNIST(root=data_path, train=True, download=True, transform=ToTensor())
    test_data = datasets.FashionMNIST(root=data_path, train=False, download=True, transform=ToTensor())

    train_dataloader = DataLoader(training_data, batch_size=batch_size)
    test_dataloader = DataLoader(test_data, batch_size=batch_size)

    model = NeuralNetwork().to(device)
    print(model)

    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for epoch in range(n_epochs):
        print(f"\n### Epoch {epoch}/{n_epochs} ###")
        train(train_dataloader, model, loss_fn, optimizer)

The mnist.py script contains the PyTorch instructions for using a single GPU to train and evaluate a machine learning model that classifies images of clothing items from the Fashion-MNIST dataset.

Fashion-MNIST is a widely used benchmark consisting of 60,000 training images and 10,000 testing images of 28x28 pixel grayscale clothing articles, designed as a drop-in replacement for the original MNIST digits dataset.

No further description of the script is given here, but users can learn about this script and other similar versions elsewhere.

 

Multi GPU training (exclusive nodes)

Listing N. runPytorchDDP.sh (Slurm job script for multi GPU job)

#!/bin/bash --login
#SBATCH --job-name=pytorch_multiGPU
#SBATCH --partition=gpu-dev
##SBATCH --partition=gpu
#SBATCH --nodes=2        #2 nodes in this example
#SBATCH --exclusive      #All resources of the node are exclusive to this job
#                        #8 GPUs per node (16 "allocation packs" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=pawsey12345-gpu   #IMPORTANT: use your own project and the -gpu suffix

#----
#Loading Pytorch module from the default software stack:
module load pytorch/2.7.1-rocm6.3.3
echo -e "\n\n#------------------------#"
module list

#----
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}

#----
#Display rocm hardware information
echo -e "\n\n#------------------------#"
echo "Printing from rocm-smi:"
srun -N 2 -n 2 --ntasks-per-node=1 -c 64 --gres=gpu:8 rocm-smi --showhw

#----
#If the script requires additional python modules not present in the pytorch container provided,
# then users need to install them in a virtual environment,
# and define the path with the variables below:
#export MYENV=chooseAName
#export VENV_PATH=$MYSOFTWARE/manual/software/pythonEnvironments/pytorchContainer-environments/${MYENV}

#----
#Definition of the python script containing the pytorch training case
PYTHON_SCRIPT_DIR=$MYSCRATCH/machinelearning/models
PYTHON_SCRIPT="$PYTHON_SCRIPT_DIR/torchrun_mnist_ddp.py"

#----
#Additional settings needed when using the Pytorch module:
export TMPDIR="/tmp/${USER}-${SLURM_JOB_ID}"
mkdir -p $TMPDIR

#----
#Additional settings needed for this Pytorch script:
export MIOPEN_USER_DB_PATH="$TMPDIR/miopen"
export MIOPEN_CUSTOM_CACHE_DIR="$TMPDIR/miopen_cache"
mkdir -p $MIOPEN_USER_DB_PATH $MIOPEN_CUSTOM_CACHE_DIR
export OMP_NUM_THREADS=8   #Effective variable to control the number of CPU threads
export RDZV_PORT=29500
export RDZV_HOST=$(hostname)

#----
#Execution
#Note: srun needs the explicit indication of the full parameters for the use of resources in the job step.
#      These are independent from the allocation parameters (which are not inherited by srun).
#      In this case, each task needs access to all the 8 available GPUs in each node where it's running,
#      so no optimal binding can be provided by the scheduler and
#      "--gpus-per-task" and "--gpu-bind" are not used.
#      Then, optimal use of resources is now the responsibility of the code.
#      "-c 64" is used to give access to all CPU chiplets (and memory) to the 1 task in each node.
#      But the REAL number of threads for the code SHOULD be defined by the environment variables above
#      and the parameters given to torchrun.
echo -e "\n\n#------------------------#"
echo "Code execution:"
#Launch distributed training through torchrun inside the container
#When no virtual environment is needed:
srun -l -u -N 2 -n 2 -c 64 --ntasks-per-node=1 --gres=gpu:8 \
     singularity exec ${SINGULARITY_CONTAINER} \
     torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=${SLURM_JOB_ID} \
              --rdzv_backend=c10d --rdzv_endpoint="${RDZV_HOST}:${RDZV_PORT}" \
              ${PYTHON_SCRIPT} -e 20 -b 128
#When using a virtual environment to make use of additional python modules:
#srun -l -u -N 2 -n 2 -c 64 --ntasks-per-node=1 --gres=gpu:8 \
#     singularity exec ${SINGULARITY_CONTAINER} \
#     bash -c "source $VENV_PATH/bin/activate && \
#              torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=${SLURM_JOB_ID} \
#                       --rdzv_backend=c10d --rdzv_endpoint=\"${RDZV_HOST}:${RDZV_PORT}\" \
#                       ${PYTHON_SCRIPT} -e 20 -b 128"

#----
#Remove any potentially leftover /tmp directories from this job
rm -rf ${TMPDIR}

#----
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished job steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20

#----
#Done
echo -e "\n\n#------------------------#"
echo "Done"

Some particularities of this Slurm job script for the use of the PyTorch module are:

  • Loading of the pytorch module: (comment/uncomment depending on the desired module to use)

    • Loading of the pytorch module from the default software stack is straightforward.

    • If the desired pytorch module is in another software stack, then change to the other software stack first (instructions for the use of a different software stack were taken from: Setonix Updates: Important Information.)

  • Indicate the path of a python environment to use: (uncomment if needed)

    • When the Pytorch script needs additional python modules not included in the containerised module, users need to install those modules in a separate python environment (instructions for this are included in the above sections).

    • The path indicated here is the path to the python environment that contains these additional modules

    • This path is activated in the srun command (later).

  • Redefine the TMPDIR

    • When different PyTorch jobs are executed at the same time, there can be conflicts because the different jobs may create auxiliary files with the same name. To avoid these conflicts, users should redefine the TMPDIR variable.

  • Define environment variables/settings needed to execute your specific Pytorch script

    • We suggest creating auxiliary paths for MIOpen under TMPDIR and defining the corresponding variables for their use.

    • OMP_NUM_THREADS is the effective variable to control the number of CPU threads

    • Define the rendezvous endpoint for PyTorch Distributed Data Parallel (DDP) coordination

      • RDZV_HOST: Specifies the hostname of the coordination node (typically the first node)

      • RDZV_PORT: Designates the TCP port for inter-node communication (ports 29500-29599 are conventionally used for PyTorch distributed operations.)

      • These variables are utilized by torchrun to establish the distributed training environment

  • Choice of srun command:

    • The first option of the srun command is the straightforward option

    • The second option activates a python environment to make use of additional python modules (uncomment if needed)

    • torchrun_mnist_ddp.py is the python script that has the pytorch training instructions (assigned via the PYTHON_SCRIPT variable)

    • Note that the real management of the node resources is performed by torchrun and not by the parameters given to srun:

      • srun basically assigns 1 task per node with access to all node resources, and that is inherited by torchrun

      • torchrun then performs the real management of resources with the indicated 8 processes per node (each process controlling 1 GPU)

        • Note that each of these processes can also use up to 8 CPU threads, as indicated by the OMP_NUM_THREADS variable set above.

      • The other parameters given to torchrun are:

        • --rdzv_id=${SLURM_JOB_ID}  :  defines a tag identifier for the rendezvous (coordination) process

        • --rdzv_backend=c10d  :   defines the backend for the rendezvous mechanism

        • --rdzv_endpoint="${RDZV_HOST}:${RDZV_PORT}"  :  defines the network endpoint for rendezvous

  • Contrary to the use of python3 in the single-GPU case, which avoids the need for an explicit singularity command, torchrun does not have a wrapper here:

    • Therefore, the use of torchrun requires the full singularity command to be spelled out explicitly

    • The variable SINGULARITY_CONTAINER is defined by the pytorch module itself and contains the path to the pytorch container used by the module

    • Note that the singularity command DOES NOT use the "--rocm" parameter, to avoid binding the host ROCm installation and instead use the container's internal ROCm

  • As a best practice, delete the temporary directory created for the job

  • Further explanation of other details can be found in the documentation of Example Slurm Batch Scripts for Setonix on GPU Compute Nodes.
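For reference, torchrun exports environment variables such as RANK, LOCAL_RANK and WORLD_SIZE to every worker process it spawns; the training script reads LOCAL_RANK in this way. A minimal sketch of reading them (the defaults are only fallbacks so the snippet also runs outside a torchrun launch):

```python
import os

# torchrun sets these variables for each spawned worker
rank = int(os.environ.get("RANK", 0))              # global rank across all nodes
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # rank within this node
world_size = int(os.environ.get("WORLD_SIZE", 1))  # total number of workers

print(f"worker {rank}/{world_size} (local rank {local_rank})")
```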

Listing N. torchrun_mnist_ddp.py (Pytorch script for multi GPU training)

import os
import argparse
from datetime import datetime
from time import time

import torch
import torch.multiprocessing as mp
import torch.nn as nn
import torch.distributed as dist
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor


# Neural network model to predict which number an image represents
class NeuralNetwork(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7*7*32, num_classes)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('-b', '--batch_size', default=128, type=int,
                        help='Batch size. It will be divided into mini-batches for each worker')
    parser.add_argument('-e', '--epochs', default=15, type=int, metavar='N',
                        help='Number of total epochs to run')
    args = parser.parse_args()
    return args.batch_size, args.epochs


# --- Get separate training and validation data sets ---
def get_data(rank, local_rank):
    data_path = os.path.join(os.environ['MYSOFTWARE'], 'pytorch_data', 'mnist_data')
    # Download dataset only on rank 0
    if rank == 0:
        training_data = MNIST(root=data_path, train=True, download=True, transform=ToTensor())
        validation_data = MNIST(root=data_path, train=False, download=True, transform=ToTensor())
    # Ensure all processes wait until dataset is downloaded
    dist.barrier(device_ids=[local_rank])
    training_data = MNIST(root=data_path, train=True, download=False, transform=ToTensor())
    validation_data = MNIST(root=data_path, train=False, download=False, transform=ToTensor())
    return training_data, validation_data


def train_and_validate(nepochs, batch_size, train_dataset, validate_dataset, rank, local_rank):
    # Define and distribute model
    gpu = torch.device('cuda')
    model = NeuralNetwork().to(gpu)
    model = DDP(model, device_ids=[local_rank])
    loss_fn = nn.CrossEntropyLoss().to(gpu)
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    # Partition training dataset among GPUs and load the data for each GPU
    train_sampler = DistributedSampler(train_dataset)
    train_loader = DataLoader(train_dataset, batch_size, shuffle=False,
                              num_workers=0, pin_memory=True, sampler=train_sampler)

    # Partition validation dataset among GPUs and load the data for each GPU
    validate_sampler = DistributedSampler(validate_dataset, shuffle=False)
    validate_loader = DataLoader(validate_dataset, batch_size, shuffle=False,
                                 num_workers=0, pin_memory=True, sampler=validate_sampler)

    # Print statistics, etc. on root rank only
    root_rank = rank == 0

    # Begin training and validation, recording start time
    if root_rank:
        start = datetime.now()
    total_train_step = len(train_loader)
    total_val_step = len(validate_loader)
    N = len(validate_dataset)

    # Iterate over specified number of epochs
    for epoch in range(nepochs):
        ############
        # Training #
        ############
        model.train()
        # Initialise training loss
        train_loss = torch.Tensor([0.]).to(gpu)
        start_train = datetime.now()
        if root_rank: start_train_dataload = time()
        for i, (images, labels) in enumerate(train_loader):
            # Transfer training data to GPU
            images = images.to(gpu, non_blocking=True)
            labels = labels.to(gpu, non_blocking=True)
            if root_rank: stop_train_dataload = time()
            if root_rank: start_training = time()
            # Forward pass
            predictions = model(images)
            loss = loss_fn(predictions, labels)
            # Backward and optimisation
            opt.zero_grad()
            loss.backward()
            opt.step()
            # Update training loss
            train_loss += loss.item()
            if root_rank: stop_training = time()
            # Print training statistics at regular intervals (every 50 training steps per epoch)
            #if (i + 1) % 50 == 0 and root_rank:
            if root_rank:
                print('TRAINING: Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}, Time data load: {:.3f}ms, Time training: {:.3f}ms'.format(
                    epoch + 1, nepochs, i + 1, total_train_step, loss.item(),
                    (stop_train_dataload - start_train_dataload) * 1000,
                    (stop_training - start_training) * 1000
                ))

        ##############
        # Validation #
        ##############
        model.eval()
        val_loss = torch.Tensor([0.]).to(gpu)
        start_val = datetime.now()
        correct = 0
        total = 0
        if root_rank: start_val_dataload = time()
        # Track correct predictions in validation dataset for recording accuracy
        for j, (images, labels) in enumerate(validate_loader):
            images = images.to(gpu, non_blocking=True)
            labels = labels.to(gpu, non_blocking=True)
            if root_rank: stop_val_dataload = time()
            if root_rank: start_validation = time()
            # Don't compute gradients in validation phase
            with torch.no_grad():
                # Forward pass
                predictions = model(images)
                loss = loss_fn(predictions, labels)
                # Weight and cumulate the loss per GPU
                # Use images.size(0) instead of batch_size since last batch may be smaller
                val_loss += loss * images.size(0) / N
                predicted_labels = [list(l).index(max(l)) for l in predictions]
                epoch_correct = sum([float(predicted_labels[i] == labels[i]) for i in range(len(labels))])
                epoch_acc = epoch_correct / len(predictions)
                correct += epoch_correct
                total += len(predictions)
            if root_rank: stop_validation = time()
            # Print validation statistics every 50 steps each epoch
            #if ((j + 1) % 10 == 0) and (root_rank):
            if root_rank:
                print('VALIDATION: Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}, Accuracy: {:.4f}, Time data load: {:.3f}ms, Time validation: {:.3f}ms'.format(
                    epoch + 1, nepochs, j + 1, total_val_step, val_loss.item(), epoch_acc,
                    (stop_val_dataload - start_val_dataload) * 1000,
                    (stop_validation - start_validation) * 1000
                ))

        # Sum weighted averages over all GPUs
        dist.all_reduce(val_loss, op=dist.ReduceOp.SUM)
        # Get accuracy of validation phase
        accuracy = correct / total

        # Print final loss values for each phase
        if root_rank:
            print('Epoch [{}/{}], Training Loss: {:.4f}, Validation Loss: {:.4f}, Validation Accuracy: {:.4f}'.format(
                epoch + 1, nepochs, train_loss.item() / len(train_loader), val_loss.item(), accuracy  # / len(val_loader)
            ))

    # Print total time taken
    if root_rank:
        print('Model training + validation complete in: ' + str(datetime.now() - start))


def main():
    batch_size, nepochs = parse_args()

    # Sanity check for GPUs
    ngpus = torch.cuda.device_count()
    assert ngpus >= 2, f"Requires at least 2 GPUs to run, but got {ngpus}"

    # Get local rank from environment variable
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)  # Set device before init_process_group

    # Initialize the process group
    dist.init_process_group(backend='nccl')
    # Get rank after initialization
    rank = dist.get_rank()

    # Generate training and validation data
    training_dataset, validation_dataset = get_data(rank, local_rank)

    # Perform training and validation
    train_and_validate(nepochs, batch_size, training_dataset, validation_dataset, rank, local_rank)

    # Cleanup
    dist.destroy_process_group()


if __name__ == "__main__":
    main()

The torchrun_mnist_ddp.py script contains the PyTorch instructions for using multiple GPUs to train and evaluate a machine learning model that recognises handwritten digits. The script uses PyTorch's Distributed Data Parallel (DDP), which accelerates training by replicating the model across multiple GPUs, splitting input batches into non-overlapping subsets for parallel processing, and synchronising gradients between devices to maintain consistency.
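The same DDP machinery can be exercised in a single CPU process for testing purposes; a minimal sketch using the gloo backend (on Setonix, the script above uses the nccl backend with one process per GPU instead):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group over the CPU-only gloo backend;
# MASTER_ADDR/MASTER_PORT replace the torchrun rendezvous endpoint here
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# Wrap a toy model in DDP; with world_size 1 there is nothing to synchronise,
# but the code path is the same as in the multi-GPU case
model = DDP(torch.nn.Linear(4, 2))
out = model(torch.randn(8, 4))

dist.destroy_process_group()
```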

The MNIST dataset is a widely-used benchmark consisting of 60,000 training images and 10,000 testing images of size-normalized, centered 28x28 pixel grayscale digits.

No further description of the script is given here, but users can learn about this script and other similar versions elsewhere.

Public Registry

The Docker images of the PyTorch containers are publicly available in the Pawsey repository (external site) on Quay.io. Users can build on top of these images to install additional Python packages for their own purposes if the use of virtual environments is not the preferred option. Currently, building images is not possible on Pawsey clusters, so the build process should be performed on the user's own resources, where the Docker build command can be executed. Simply use:

FROM quay.io/pawsey/pytorch:2.7.1-rocm6.3.3

within your Dockerfile.
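For example, a minimal Dockerfile installing an extra package on top of the Pawsey image (the xarray package is used here purely as an illustration) might look like:

```dockerfile
FROM quay.io/pawsey/pytorch:2.7.1-rocm6.3.3

# Install additional Python packages on top of the Pawsey image
RUN python3 -m pip install --no-cache-dir xarray
```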

If required, users can also pull the image to their own resources using Docker using the following command:

$ docker pull quay.io/pawsey/pytorch:2.7.1-rocm6.3.3

The container can also be pulled using Singularity:

$ singularity pull docker://quay.io/pawsey/pytorch:2.7.1-rocm6.3.3

An official AMD container is also available, but it lacks both support for Cray MPI and some core Python packages, making it unusable on Setonix.

Related pages