PyTorch is an optimised tensor library for deep learning using GPUs and CPUs.



...

$ docker pull quay.io/pawsey/pytorch:2.2.0-rocm5.7.3

The container can also be pulled with Singularity:
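For reference, a typical pull looks like the following (a minimal sketch; the exact command and any module setup used on Setonix may differ):

Code Block
languagebash
titlePulling the container with Singularity (a sketch)
$ singularity pull docker://quay.io/pawsey/pytorch:2.2.0-rocm5.7.3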

...

Here is another example of running a simple training script on a GPU node during an interactive session:



Warning: Redefine your TMPDIR

We have seen crashes when two users on the same node run the same PyTorch script with the containerised PyTorch module. This is due to both jobs trying to create files with the same name in /tmp. We are working towards a definitive solution to this problem; in the meantime, we recommend redefining the TMPDIR environment variable as a workaround:

export TMPDIR=/tmp/$SLURM_JOB_ID
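Depending on the node configuration, this per-job directory may not exist yet. Creating it explicitly before launching the script (a suggested precaution, not part of the original workaround) avoids temporary-file errors:

Code Block
languagebash
titleCreating the per-job temporary directory (a suggested addition)
# Create the directory TMPDIR now points to, if it does not already exist
mkdir -p $TMPDIR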





Code Block
languagebash
titleTerminal 2. Using PyTorch on a compute node in an interactive Slurm session.
setonix-05$ salloc -p gpu -A yourProjectCode-gpu --gres=gpu:1 --time=00:20:00
salloc: Pending job allocation 12386179
salloc: job 12386179 queued and waiting for resources
salloc: job 12386179 has been allocated resources
salloc: Granted job allocation 12386179
salloc: Waiting for resource configuration
salloc: Nodes nid002096 are ready for job
nid002096$ export TMPDIR=/tmp/$SLURM_JOB_ID
nid002096$ module load pytorch/2.2.0-rocm5.7.3 
nid002096$ python3 main.py 
NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)

### Epoch 0/10 ###
loss: 2.289750 [   64/60000]
loss: 2.287861 [ 6464/60000]
loss: 2.263056 [12864/60000]
loss: 2.261112 [19264/60000]
loss: 2.240377 [25664/60000]
loss: 2.208018 [32064/60000]
loss: 2.225265 [38464/60000]
loss: 2.183236 [44864/60000]


...
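For non-interactive runs, the same steps can be wrapped in a batch script. Below is a minimal sketch assuming the pytorch/2.2.0-rocm5.7.3 module and the main.py training script from the example above; the account, partition, and resource requests are placeholders to adjust for your project:

Code Block
languagebash
titleExample batch script (a minimal sketch)
#!/bin/bash --login
#SBATCH --account=yourProjectCode-gpu
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=00:20:00

# Avoid /tmp filename clashes between jobs sharing a node (see the warning above)
export TMPDIR=/tmp/$SLURM_JOB_ID

module load pytorch/2.2.0-rocm5.7.3
python3 main.py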