PyTorch

PyTorch is an optimised tensor library for deep learning using GPUs and CPUs.

 


Introduction

PyTorch is one of the most popular frameworks for developing Machine Learning and Deep Learning applications. It provides building blocks for defining neural networks, with a variety of predefined layers, activation functions, optimisation algorithms, and utilities for loading and storing data. It supports GPU acceleration for training and inference on a range of hardware, including NVIDIA, AMD and Intel GPUs.
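As a minimal illustration of these building blocks (a hypothetical toy model, not one of the Setonix examples below), a small network can be defined and trained for one step like this:

```python
import torch
from torch import nn

# A toy two-layer network built from predefined layers and activations
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One training step on random data (runs on CPU; no GPU required)
x = torch.randn(16, 4)
target = torch.randint(0, 2, (16,))
loss = loss_fn(model(x), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```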

PyTorch installation on Setonix

Setonix can support Deep Learning workloads thanks to the large number of AMD GPUs installed on the system. PyTorch must be compiled from source to make use of the Cray MPI library for distributed training, and of a suitable ROCm version to use the GPUs. To make this easier for users, Pawsey has developed a Docker container for PyTorch, built with all the necessary dependencies and configuration options to run efficiently on Setonix.

The PyTorch container developed by Pawsey is available on Setonix as a module installed using SHPC.

Because of software stack deployment policies, container versions deployed on Setonix might be older than those available in the online repository. New software is installed roughly every six months, but you are free to pull the latest container into your own space.

To check which version is available in the default software stack, use the module avail command.

Terminal 1. Checking what version of PyTorch is available on Setonix.
$ module avail pytorch

------------------------------- /software/setonix/2025.08/containers/views/modules --------------------------------
   pytorch/2.7.1-rocm6.3.3 (D)

SHPC generates a module (listed above) providing convenient aliases for key programs within the container. This means you can run the container's python3 executable without explicitly loading and invoking Singularity: the Singularity module is loaded as a dependency when the PyTorch module is loaded, and all Singularity commands are taken care of via wrappers. Here is a very simple example.

Terminal 2. Invoking the python3 interpreter within the PyTorch container.
$ module load pytorch/2.7.1-rocm6.3.3
$ python3
Python 3.12.3 (main, Jun 18 2025, 17:59:45) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'2.7.1a0+gite2d141d'

Here is another example of running a simple training script on a GPU node during an interactive session:

Redefine your TMPDIR

We have seen crashes when two users on the same node run the same PyTorch script with the containerised PyTorch module. This is due to both jobs trying to create files with the same name in /tmp. We are working towards a definitive solution to this problem. In the meantime, we recommend redefining the TMPDIR environment variable as a workaround:

export TMPDIR=/tmp/${USER}-${SLURM_JOB_ID}

Terminal 3. Using PyTorch on a compute node in an interactive Slurm session.
setonix-05$ salloc -p gpu -A yourProjectCode-gpu --gres=gpu:1 --time=00:20:00
salloc: Pending job allocation 12386179
salloc: job 12386179 queued and waiting for resources
salloc: job 12386179 has been allocated resources
salloc: Granted job allocation 12386179
salloc: Waiting for resource configuration
salloc: Nodes nid002096 are ready for job
nid002096$ module load pytorch/2.7.1-rocm6.3.3
nid002096$ export TMPDIR=/tmp/${USER}-${SLURM_JOB_ID}
nid002096$ python3 main.py
NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)

### Epoch 0/10 ###
loss: 2.289750  [   64/60000]
loss: 2.287861  [ 6464/60000]
loss: 2.263056  [12864/60000]
loss: 2.261112  [19264/60000]
loss: 2.240377  [25664/60000]
loss: 2.208018  [32064/60000]
loss: 2.225265  [38464/60000]
loss: 2.183236  [44864/60000]

Note that when requesting the interactive allocation, users should use their own project name instead of the "yourProjectCode" placeholder used in the example. Also note the use of the "-gpu" suffix on the project name, which is required to access any partition with GPU nodes. Please refer to the page Example Slurm Batch Scripts for Setonix on GPU Compute Nodes for a detailed explanation of resource allocation on GPU nodes.

Installing additional Python packages

There are at least two ways in which users can install additional Python packages that the container lacks. The first is to build your own container image on top of the Pawsey PyTorch container. The second is to use a virtual environment saved on Setonix itself, in which the additional Python packages are installed and from which they are loaded. This second way is our recommended procedure.

The trick is to create a virtual environment using the Python installation within the container. This ensures that your packages are installed taking into account what is already installed in the container, not on Setonix. The virtual environment itself, however, is created on the host filesystem, ideally under Setonix's /software. Setonix filesystems are mounted on containers by default and are writable from within the container, so pip can install additional packages, and virtual environments are preserved from one container run to the next. We recommend installing these virtual environments in an understandable path under $MYSOFTWARE/manual/software.

To do so, you will need to open a Bash shell within the container. Thanks to the installation of the PyTorch container as a module, there is no need to call the singularity command explicitly: the containerised installation provides a bash wrapper that does all the work and provides an interactive shell inside the Singularity container. Here is a practical example that installs the xarray package into a virtual environment. Note that the name of the environment is stored in the variable myEnv, which you must set to a name of your choice:

Terminal 4. Installing additional Python packages using virtual environments
$ module load pytorch/2.7.1-rocm6.3.3
$ mkdir -p $MYSOFTWARE/manual/software/pythonEnvironments/pytorchContainer-environments
$ cd $MYSOFTWARE/manual/software/pythonEnvironments/pytorchContainer-environments
$ export TMPDIR=/tmp/${USER}-${SLURM_JOB_ID}
$ bash
Singularity> myEnv=chooseAName
Singularity> python3 -m venv --system-site-packages $myEnv
Singularity> source $myEnv/bin/activate
(chooseAName) Singularity> python3 -m pip install xarray
Collecting xarray
  Downloading xarray-2024.5.0-py3-none-any.whl (1.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 28.1 MB/s eta 0:00:00
Requirement already satisfied: packaging>=23.1 in /usr/local/lib/python3.10/dist-packages (from xarray) (23.2)
Requirement already satisfied: numpy>=1.23 in /usr/local/lib/python3.10/dist-packages (from xarray) (1.26.3)
Collecting pandas>=2.0
  Downloading pandas-2.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.0/13.0 MB 95.4 MB/s eta 0:00:00
Collecting pytz>=2020.1
  Downloading pytz-2024.1-py2.py3-none-any.whl (505 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 505.5/505.5 KB 59.9 MB/s eta 0:00:00
Collecting python-dateutil>=2.8.2
  Using cached python_dateutil-2.9.0.post0-py2.py3-none-any.whl (229 kB)
Collecting tzdata>=2022.7
  Downloading tzdata-2024.1-py2.py3-none-any.whl (345 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 345.4/345.4 KB 48.1 MB/s eta 0:00:00
Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.8.2->pandas>=2.0->xarray) (1.16.0)
Installing collected packages: pytz, tzdata, python-dateutil, pandas, xarray
Successfully installed pandas-2.2.2 python-dateutil-2.9.0.post0 pytz-2024.1 tzdata-2024.1 xarray-2024.5.0

# Now test the use of the installed package
(chooseAName) Singularity> python3
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import xarray
>>> exit
(chooseAName) Singularity> exit
$ ls -l
drwxr-sr-x 5 matilda pawsey12345 4096 Apr 22 16:33 chooseAName

As you can see, the environment stays on the filesystem and can be used in later runs.

Terminal 5. The environment can be used once again.
$ module load pytorch/2.7.1-rocm6.3.3
$ export TMPDIR=/tmp/${USER}-${SLURM_JOB_ID}
$ bash
Singularity> myEnv=chooseAName
Singularity> source $MYSOFTWARE/manual/software/pythonEnvironments/pytorchContainer-environments/${myEnv}/bin/activate
(chooseAName) Singularity>

Writing PyTorch code for AMD GPUs

To increase portability and minimise code changes, PyTorch implements support for AMD GPUs within the interface originally dedicated to CUDA. More information is available at HIP (ROCm) semantics (external site).
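In practice this means that device-agnostic code written for CUDA runs unchanged on the AMD GPUs of Setonix; a minimal sketch:

```python
import torch

# On ROCm builds of PyTorch, the familiar torch.cuda API drives AMD GPUs,
# so the usual device-selection idiom needs no changes.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(32, 32, device=device)
y = x @ x  # executes on the GPU when one is visible, otherwise on the CPU

# torch.version.hip is a version string on ROCm builds and None otherwise
print("device:", device, "| HIP version:", torch.version.hip)
```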

Example scripts for the use of the PyTorch module

Single GPU training (shared node)

The following Slurm job script submits a PyTorch script (mnist.py) for training using a single GPU. The job script is based on the examples provided for shared nodes in the section Example Slurm Batch Scripts for Setonix on GPU Compute Nodes.

Listing N. single_gpu_training.sbatch (sbatch script for single GPU training)

#!/bin/bash --login
#SBATCH --job-name=pytorch_singleGPU
#SBATCH --partition=gpu
#SBATCH --nodes=1              #1 node in this example
#SBATCH --gres=gpu:1           #1 GPU (1 "allocation pack" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=pawsey12345-gpu   #IMPORTANT: use your own project and the -gpu suffix

#----
#Loading Pytorch module from the default software stack:
module load pytorch/2.7.1-rocm6.3.3
echo -e "\n\n#------------------------#"
module list

#----
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}

#----
#Display rocm hardware information
echo -e "\n\n#------------------------#"
echo "Printing from rocm-smi:"
srun -N 1 -n 1 -c 8 --gres=gpu:1 rocm-smi --showhw

#----
#If the script requires additional python modules not present in the pytorch container provided,
# then users need to install them in a virtual environment,
# and define the path with the variables below:
#export MYENV=chooseAName
#export VENV_PATH=$MYSOFTWARE/manual/software/pythonEnvironments/pytorchContainer-environments/${MYENV}

#----
#Definition of the python script containing the pytorch training case
PYTHON_SCRIPT_DIR=$MYSCRATCH/machinelearning/models
PYTHON_SCRIPT="$PYTHON_SCRIPT_DIR/mnist.py"

#----
#Additional settings needed when using the Pytorch module:
export TMPDIR="/tmp/${USER}-${SLURM_JOB_ID}"
mkdir -p $TMPDIR

#----
#Additional settings needed for this Pytorch script:
export OMP_NUM_THREADS=1   #Effective variable to control the number of CPU threads

#----
#Execution
#Note: srun needs the explicit indication of the full parameters for the use of resources in the job step.
#      These are independent from the allocation parameters (which are not inherited by srun).
#      In this case, there is only 1 task and 1 GPU allocated,
#      so no optimal binding parameters are needed and
#      "--gpus-per-task" and "--gpu-bind" are not used.
#      "-c 8" is used to force allocation of a full CPU chiplet (and corresponding memory) to the job.
#      Then, the REAL number of threads for the code SHOULD be defined by
#      the environment variables above.
echo -e "\n\n#------------------------#"
echo "Code execution:"
#Launch the training script inside the container
#When no virtual environment is needed:
srun -l -u -N 1 -n 1 -c 8 --gres=gpu:1 python3 ${PYTHON_SCRIPT}
#When using a virtual environment to make use of additional python modules:
#srun -l -u -N 1 -n 1 -c 8 --gres=gpu:1 bash -c "source $VENV_PATH/bin/activate && python3 ${PYTHON_SCRIPT}"

#----
#Remove any potentially leftover /tmp directories from this job
rm -rf ${TMPDIR}

#----
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished job steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20

#----
#Done
echo -e "\n\n#------------------------#"
echo "Done"

Some particularities of this Slurm job script for the use of the PyTorch module are:

  • Loading of the pytorch module: (comment/uncomment depending on the desired module to use)

    • Loading of the pytorch module from the default software stack is straightforward.

    • If the desired pytorch module is in another software stack, then change to the other software stack first (instructions for the use of a different software stack were taken from: Setonix Updates: Important Information.)

  • Indicate the path of a python environment to use: (uncomment if needed)

    • When the Pytorch script needs additional python modules not included in the containerised module, users need to install those modules in a separate python environment (instructions for this are included in the above sections).

    • The path indicated here is the path to the python environment that contains these additional modules

    • This path is activated in the srun command (later).

  • Redefine the TMPDIR

    • When different PyTorch jobs are executed at the same time, there can be conflicts because the different jobs may create auxiliary files with the same name. To avoid these conflicts, users should redefine the TMPDIR variable.

  • Define environment variables/settings needed to execute your specific Pytorch script

    • OMP_NUM_THREADS is the effective variable to control the number of CPU threads

  • Choice of srun command:

    • The first option of the srun command is the straightforward option

    • The second option activates a python environment to make use of additional python modules (uncomment if needed)

    • mnist.py is the python script that has the pytorch training instructions (assigned via the PYTHON_SCRIPT variable)

    • Note the use of "-c 8" to agree with the best practices for the use of GPUs on Setonix. This isolates 1 full CPU chiplet and the corresponding host memory for this job. But the effective variable to control the number of CPU threads is OMP_NUM_THREADS (defined above).

  • Execute the pytorch script using the python3 wrapper within the srun command

    • When using the provided containerised PyTorch module, python3 is not a "normal" command but a wrapper

      • This wrapper calls the singularity command to execute the Singularity container together with the internal python3 command (all these calls are invisible to the user)

  • As a best practice, delete the temporary directory created for the job

  • Further explanation of other details can be found in the documentation of Example Slurm Batch Scripts for Setonix on GPU Compute Nodes.

 

Listing N. mnist.py (Pytorch script for single GPU training)

import os

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

# Program parameters
device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = 64
n_epochs = 10


class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits


def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to(device), y.to(device)

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")


if __name__ == "__main__":
    data_path = os.path.join(os.environ['MYSOFTWARE'], 'pytorch_data', 'mnist_data')
    training_data = datasets.FashionMNIST(root=data_path, train=True, download=True, transform=ToTensor())
    test_data = datasets.FashionMNIST(root=data_path, train=False, download=True, transform=ToTensor())

    train_dataloader = DataLoader(training_data, batch_size=batch_size)
    test_dataloader = DataLoader(test_data, batch_size=batch_size)

    model = NeuralNetwork().to(device)
    print(model)

    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    for epoch in range(n_epochs):
        print(f"\n### Epoch {epoch}/{n_epochs} ###")
        train(train_dataloader, model, loss_fn, optimizer)

The mnist.py script contains the PyTorch instructions for using a single GPU to train and evaluate a machine learning model that classifies images of clothing items from the Fashion-MNIST dataset.

Fashion-MNIST is a widely used benchmark consisting of 60,000 training images and 10,000 testing images of 28x28 pixel grayscale clothing articles, designed as a drop-in replacement for the original MNIST digits dataset.

No further description of the script is given here, but users can learn about this script and other similar versions elsewhere.

 

Multi GPU training (exclusive nodes)

Listing N. runPytorchDDP.sh (Slurm job script for multi GPU job)

#!/bin/bash --login
#SBATCH --job-name=pytorch_multiGPU
#SBATCH --partition=gpu-dev
##SBATCH --partition=gpu
#SBATCH --nodes=2        #2 nodes in this example
#SBATCH --exclusive      #All resources of the node are exclusive to this job
#                        #8 GPUs per node (16 "allocation packs" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=pawsey12345-gpu   #IMPORTANT: use your own project and the -gpu suffix

#----
#Loading Pytorch module from the default software stack:
module load pytorch/2.7.1-rocm6.3.3
echo -e "\n\n#------------------------#"
module list

#----
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}

#----
#Display rocm hardware information
echo -e "\n\n#------------------------#"
echo "Printing from rocm-smi:"
srun -N 2 -n 2 --ntasks-per-node=1 -c 64 --gres=gpu:8 rocm-smi --showhw

#----
#If the script requires additional python modules not present in the pytorch container provided,
# then users need to install them in a virtual environment,
# and define the path with the variables below:
#export MYENV=chooseAName
#export VENV_PATH=$MYSOFTWARE/manual/software/pythonEnvironments/pytorchContainer-environments/${MYENV}

#----
#Definition of the python script containing the pytorch training case
PYTHON_SCRIPT_DIR=$MYSCRATCH/machinelearning/models
PYTHON_SCRIPT="$PYTHON_SCRIPT_DIR/torchrun_mnist_ddp.py"

#----
#Additional settings needed when using the Pytorch module:
export TMPDIR="/tmp/${USER}-${SLURM_JOB_ID}"
mkdir -p $TMPDIR

#----
#Additional settings needed for this Pytorch script:
export MIOPEN_USER_DB_PATH="$TMPDIR/miopen"
export MIOPEN_CUSTOM_CACHE_DIR="$TMPDIR/miopen_cache"
mkdir -p $MIOPEN_USER_DB_PATH $MIOPEN_CUSTOM_CACHE_DIR
export OMP_NUM_THREADS=8   #Effective variable to control the number of CPU threads
export RDZV_PORT=29500
export RDZV_HOST=$(hostname)

#----
#Execution
#Note: srun needs the explicit indication of the full parameters for the use of resources in the job step.
#      These are independent from the allocation parameters (which are not inherited by srun).
#      In this case, each task needs access to all the 8 available GPUs in each node where it's running,
#      so no optimal binding can be provided by the scheduler and
#      "--gpus-per-task" and "--gpu-bind" are not used.
#      Then, optimal use of resources is now the responsibility of the code.
#      "-c 64" is used to give access to all CPU chiplets (and memory) to the 1 task in each node.
#      But the REAL number of threads for the code SHOULD be defined by the environment variables above
#      and the parameters given to torchrun.
echo -e "\n\n#------------------------#"
echo "Code execution:"
#Launch distributed training through torchrun inside the container
#When no virtual environment is needed:
srun -l -u -N 2 -n 2 -c 64 --ntasks-per-node=1 --gres=gpu:8 \
     singularity exec ${SINGULARITY_CONTAINER} \
     torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=${SLURM_JOB_ID} \
              --rdzv_backend=c10d --rdzv_endpoint="${RDZV_HOST}:${RDZV_PORT}" \
              ${PYTHON_SCRIPT} -e 20 -b 128
#When using a virtual environment to make use of additional python modules:
#srun -l -u -N 2 -n 2 -c 64 --ntasks-per-node=1 --gres=gpu:8 \
#     singularity exec ${SINGULARITY_CONTAINER} \
#     bash -c "source $VENV_PATH/bin/activate && \
#              torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=${SLURM_JOB_ID} \
#                       --rdzv_backend=c10d --rdzv_endpoint=\"${RDZV_HOST}:${RDZV_PORT}\" \
#                       ${PYTHON_SCRIPT} -e 20 -b 128"

#----
#Remove any potentially leftover /tmp directories from this job
rm -rf ${TMPDIR}

#----
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished job steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20

#----
#Done
echo -e "\n\n#------------------------#"
echo "Done"

Some particularities of this Slurm job script for the use of the PyTorch module are:

  • Loading of the pytorch module: (comment/uncomment depending on the desired module to use)

    • Loading of the pytorch module from the default software stack is straightforward.

    • If the desired pytorch module is in another software stack, then change to the other software stack first (instructions for the use of a different software stack were taken from: Setonix Updates: Important Information.)

  • Indicate the path of a python environment to use: (uncomment if needed)

    • When the Pytorch script needs additional python modules not included in the containerised module, users need to install those modules in a separate python environment (instructions for this are included in the above sections).

    • The path indicated here is the path to the python environment that contains these additional modules

    • This path is activated in the srun command (later).

  • Redefine the TMPDIR

    • When different PyTorch jobs are executed at the same time, there can be conflicts because the different jobs may create auxiliary files with the same name. To avoid these conflicts, users should redefine the TMPDIR variable.

  • Define environment variables/settings needed to execute your specific Pytorch script

    • We suggest creating auxiliary paths for MIOpen under TMPDIR and defining the corresponding variables for their use.

    • OMP_NUM_THREADS is the effective variable to control the number of CPU threads

    • Define the rendezvous endpoint for PyTorch Distributed Data Parallel (DDP) coordination

      • RDZV_HOST: Specifies the hostname of the coordination node (typically the first node)

      • RDZV_PORT: Designates the TCP port for inter-node communication (ports 29500-29599 are conventionally used for PyTorch distributed operations.)

      • These variables are utilized by torchrun to establish the distributed training environment

  • Choice of srun command:

    • The first option of the srun command is the straightforward option

    • The second option activates a python environment to make use of additional python modules (uncomment if needed)

    • torchrun_mnist_ddp.py is the python script that has the pytorch training instructions (assigned via the PYTHON_SCRIPT variable)

    • Note that the real management of the node resources is performed by torchrun and not by the parameters given to srun:

      • srun basically assigns 1 task per node with access to all node resources, and that is inherited by torchrun

      • torchrun then performs the real management of resources with the indicated 8 processes per node (each process controlling 1 GPU)

        • Note that each of these processes can also use up to 8 CPU threads, as indicated by the OMP_NUM_THREADS variable set above.

      • The other parameters given to torchrun are:

        • --rdzv_id=${SLURM_JOB_ID}  :  defines a tag identifier for the rendezvous (coordination) process

        • --rdzv_backend=c10d  :   defines the backend for the rendezvous mechanism

        • --rdzv_endpoint="${RDZV_HOST}:${RDZV_PORT}"  :  defines the network endpoint for rendezvous

  • Contrary to the use of python3 in the single-GPU case, which avoids the need for an explicit singularity command, torchrun does not have a wrapper here:

    • Therefore, the use of torchrun requires the full singularity command to be spelled out explicitly

    • The variable SINGULARITY_CONTAINER is defined by the pytorch module itself and contains the path to the pytorch container used by the module

    • Note that the singularity command DOES NOT use the "--rocm" parameter, to avoid binding the host ROCm installation and instead use the container's internal ROCm

  • As a best practice, delete the temporary directory created for the job

  • Further explanation of other details can be found in the documentation of Example Slurm Batch Scripts for Setonix on GPU Compute Nodes.
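For reference, torchrun exports environment variables such as RANK, LOCAL_RANK and WORLD_SIZE to every worker process it spawns; the training script reads LOCAL_RANK in this way. A minimal sketch of reading them (the defaults are only fallbacks so the snippet also runs outside a torchrun launch):

```python
import os

# torchrun sets these variables for each spawned worker
rank = int(os.environ.get("RANK", 0))              # global rank across all nodes
local_rank = int(os.environ.get("LOCAL_RANK", 0))  # rank within this node
world_size = int(os.environ.get("WORLD_SIZE", 1))  # total number of workers

print(f"worker {rank}/{world_size} (local rank {local_rank})")
```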

Listing N. torchrun_mnist_ddp.py (Pytorch script for multi GPU training)

import os
import argparse
from datetime import datetime
from time import time

import torch
import torch.multiprocessing as mp
import torch.nn as nn
import torch.distributed as dist
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel as DDP
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor


# Neural network model to predict which number an image represents
class NeuralNetwork(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc = nn.Linear(7*7*32, num_classes)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc(out)
        return out


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('-b', '--batch_size', default=128, type=int,
                        help='Batch size. It will be divided into mini-batches for each worker')
    parser.add_argument('-e', '--epochs', default=15, type=int, metavar='N',
                        help='Number of total epochs to run')
    args = parser.parse_args()
    return args.batch_size, args.epochs


# --- Get separate training and validation data sets ---
def get_data(rank, local_rank):
    data_path = os.path.join(os.environ['MYSOFTWARE'], 'pytorch_data', 'mnist_data')
    # Download dataset only on rank 0
    if rank == 0:
        training_data = MNIST(root=data_path, train=True, download=True, transform=ToTensor())
        validation_data = MNIST(root=data_path, train=False, download=True, transform=ToTensor())
    # Ensure all processes wait until dataset is downloaded
    dist.barrier(device_ids=[local_rank])
    training_data = MNIST(root=data_path, train=True, download=False, transform=ToTensor())
    validation_data = MNIST(root=data_path, train=False, download=False, transform=ToTensor())
    return training_data, validation_data


def train_and_validate(nepochs, batch_size, train_dataset, validate_dataset, rank, local_rank):
    # Define and distribute model
    gpu = torch.device('cuda')
    model = NeuralNetwork().to(gpu)
    model = DDP(model, device_ids=[local_rank])
    loss_fn = nn.CrossEntropyLoss().to(gpu)
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    # Partition training dataset among GPUs and load the data for each GPU
    train_sampler = DistributedSampler(train_dataset)
    train_loader = DataLoader(train_dataset, batch_size, shuffle=False,
                              num_workers=0, pin_memory=True, sampler=train_sampler)

    # Partition validation dataset among GPUs and load the data for each GPU
    validate_sampler = DistributedSampler(validate_dataset, shuffle=False)
    validate_loader = DataLoader(validate_dataset, batch_size, shuffle=False,
                                 num_workers=0, pin_memory=True, sampler=validate_sampler)

    # Print statistics, etc. on root rank only
    root_rank = rank == 0

    # Begin training and validation, recording start time
    if root_rank:
        start = datetime.now()
    total_train_step = len(train_loader)
    total_val_step = len(validate_loader)
    N = len(validate_dataset)

    # Iterate over specified number of epochs
    for epoch in range(nepochs):
        ############
        # Training #
        ############
        model.train()
        # Initialise training loss
        train_loss = torch.Tensor([0.]).to(gpu)
        start_train = datetime.now()
        if root_rank: start_train_dataload = time()
        for i, (images, labels) in enumerate(train_loader):
            # Transfer training data to GPU
            images = images.to(gpu, non_blocking=True)
            labels = labels.to(gpu, non_blocking=True)
            if root_rank: stop_train_dataload = time()
            if root_rank: start_training = time()
            # Forward pass
            predictions = model(images)
            loss = loss_fn(predictions, labels)
            # Backward and optimisation
            opt.zero_grad()
            loss.backward()
            opt.step()
            # Update training loss
            train_loss += loss.item()
            if root_rank: stop_training = time()
            # Print training statistics at regular intervals (every 50 training steps per epoch)
            #if (i + 1) % 50 == 0 and root_rank:
            if root_rank:
                print('TRAINING: Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}, Time data load: {:.3f}ms, Time training: {:.3f}ms'.format(
                    epoch + 1, nepochs, i + 1, total_train_step, loss.item(),
                    (stop_train_dataload - start_train_dataload) * 1000,
                    (stop_training - start_training) * 1000
                ))

        ##############
        # Validation #
        ##############
        model.eval()
        val_loss = torch.Tensor([0.]).to(gpu)
        start_val = datetime.now()
        correct = 0
        total = 0
        if root_rank: start_val_dataload = time()
        # Track correct predictions in validation dataset for recording accuracy
        for j, (images, labels) in enumerate(validate_loader):
            images = images.to(gpu, non_blocking=True)
            labels = labels.to(gpu, non_blocking=True)
            if root_rank: stop_val_dataload = time()
            if root_rank: start_validation = time()
            # Don't compute gradients in validation phase
            with torch.no_grad():
                # Forward pass
                predictions = model(images)
                loss = loss_fn(predictions, labels)
                # Weight and cumulate the loss per GPU
                # Use images.size(0) instead of batch_size since last batch may be smaller
                val_loss += loss * images.size(0) / N
                predicted_labels = [list(l).index(max(l)) for l in predictions]
                epoch_correct = sum([float(predicted_labels[i] == labels[i]) for i in range(len(labels))])
                epoch_acc = epoch_correct / len(predictions)
                correct += epoch_correct
                total += len(predictions)
            if root_rank: stop_validation = time()
            # Print validation statistics every 50 steps each epoch
            #if ((j + 1) % 10 == 0) and (root_rank):
            if root_rank:
                print('VALIDATION: Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}, Accuracy: {:.4f}, Time data load: {:.3f}ms, Time validation: {:.3f}ms'.format(
                    epoch + 1, nepochs, j + 1, total_val_step, val_loss.item(), epoch_acc,
                    (stop_val_dataload - start_val_dataload) * 1000,
                    (stop_validation - start_validation) * 1000
                ))

        # Sum weighted averages over all GPUs
        dist.all_reduce(val_loss, op=dist.ReduceOp.SUM)
        # Get accuracy of validation phase
        accuracy = correct / total

        # Print final loss values for each phase
        if root_rank:
            print('Epoch [{}/{}], Training Loss: {:.4f}, Validation Loss: {:.4f}, Validation Accuracy: {:.4f}'.format(
                epoch + 1, nepochs, train_loss.item() / len(train_loader), val_loss.item(), accuracy  # / len(val_loader)
            ))

    # Print total time taken
    if root_rank:
        print('Model training + validation complete in: ' + str(datetime.now() - start))


def main():
    batch_size, nepochs = parse_args()

    # Sanity check for GPUs
    ngpus = torch.cuda.device_count()
    assert ngpus >= 2, f"Requires at least 2 GPUs to run, but got {ngpus}"

    # Get local rank from environment variable
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)  # Set device before init_process_group

    # Initialize the process group
    dist.init_process_group(backend='nccl')
    # Get rank after initialization
    rank = dist.get_rank()

    # Generate training and validation data
    training_dataset, validation_dataset = get_data(rank, local_rank)

    # Perform training and validation
    train_and_validate(nepochs, batch_size, training_dataset, validation_dataset, rank, local_rank)

    # Cleanup
    dist.destroy_process_group()


if __name__ == "__main__":
    main()

The torchrun_mnist_ddp.py script contains the PyTorch instructions for using multiple GPUs to train and evaluate a machine learning model that recognises handwritten digits. The script uses PyTorch's Distributed Data Parallel (DDP), which accelerates training by replicating the model across multiple GPUs, splitting input batches into non-overlapping subsets for parallel processing, and synchronising gradients between devices to maintain consistency.
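The same DDP machinery can be exercised in a single CPU process for testing purposes; a minimal sketch using the gloo backend (on Setonix, the script above uses the nccl backend with one process per GPU instead):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group over the CPU-only gloo backend;
# MASTER_ADDR/MASTER_PORT replace the torchrun rendezvous endpoint here
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

# Wrap a toy model in DDP; with world_size 1 there is nothing to synchronise,
# but the code path is the same as in the multi-GPU case
model = DDP(torch.nn.Linear(4, 2))
out = model(torch.randn(8, 4))

dist.destroy_process_group()
```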

The MNIST dataset is a widely-used benchmark consisting of 60,000 training images and 10,000 testing images of size-normalized, centered 28x28 pixel grayscale digits.

No further description of the script is given here, but users can learn about this script and other similar versions elsewhere.

Public Registry

The Docker images of the PyTorch containers are publicly available in the Pawsey repository (external site) on Quay.io. Users can build on top of these images to install additional Python packages for their own purposes if the use of virtual environments is not the preferred option. Currently, building images is not possible on Pawsey clusters, so the build process should be performed on the user's own resources, where the Docker build command can be executed. Simply use:

FROM quay.io/pawsey/pytorch:2.7.1-rocm6.3.3

within your Dockerfile.
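For example, a minimal Dockerfile installing an extra package on top of the Pawsey image (the xarray package is used here purely as an illustration) might look like:

```dockerfile
FROM quay.io/pawsey/pytorch:2.7.1-rocm6.3.3

# Install additional Python packages on top of the Pawsey image
RUN python3 -m pip install --no-cache-dir xarray
```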

If required, users can also pull the image to their own resources using Docker using the following command:

$ docker pull quay.io/pawsey/pytorch:2.7.1-rocm6.3.3

The container can also be pulled using Singularity:

$ singularity pull docker://quay.io/pawsey/pytorch:2.7.1-rocm6.3.3

An official AMD container is also available, but it lacks both support for Cray MPI and some core Python packages, making it unusable on Setonix.

Related pages