PyTorch is an optimised tensor library for deep learning using GPUs and CPUs.



...

$ docker pull quay.io/pawsey/pytorch:2.2.0-rocm5.7.3

The container can also be pulled with Singularity:
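For reference, a typical pull looks like the following (a minimal sketch; the exact command and any module setup used on Setonix may differ):

Code Block
languagebash
titlePulling the container with Singularity (a sketch)
$ singularity pull docker://quay.io/pawsey/pytorch:2.2.0-rocm5.7.3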

...

Here is another example of running a simple training script on a GPU node during an interactive session:



Warning: Redefine your TMPDIR

We have seen crashes when two users on the same node run the same PyTorch script with the containerised PyTorch module. This is due to both jobs trying to create files with the same name in /tmp. We are working towards a definitive solution to this problem; in the meantime, we recommend redefining the TMPDIR environment variable as a workaround:

export TMPDIR=/tmp/$SLURM_JOB_ID
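Depending on the node configuration, this per-job directory may not exist yet. Creating it explicitly before launching the script (a suggested precaution, not part of the original workaround) avoids temporary-file errors:

Code Block
languagebash
titleCreating the per-job temporary directory (a suggested addition)
# Create the directory TMPDIR now points to, if it does not already exist
mkdir -p $TMPDIR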





Code Block
languagebash
titleTerminal 2. Using PyTorch on a compute node in an interactive Slurm session.
setonix-05$ salloc -p gpu -A yourProjectCode-gpu --gres=gpu:1 --time=00:20:00
salloc: Pending job allocation 12386179
salloc: job 12386179 queued and waiting for resources
salloc: job 12386179 has been allocated resources
salloc: Granted job allocation 12386179
salloc: Waiting for resource configuration
salloc: Nodes nid002096 are ready for job
nid002096$ export TMPDIR=/tmp/$SLURM_JOB_ID
nid002096$ module load pytorch/2.2.0-rocm5.7.3 
nid002096$ python3 main.py 
NeuralNetwork(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=512, bias=True)
    (1): ReLU()
    (2): Linear(in_features=512, out_features=512, bias=True)
    (3): ReLU()
    (4): Linear(in_features=512, out_features=10, bias=True)
  )
)

### Epoch 0/10 ###
loss: 2.289750 [   64/60000]
loss: 2.287861 [ 6464/60000]
loss: 2.263056 [12864/60000]
loss: 2.261112 [19264/60000]
loss: 2.240377 [25664/60000]
loss: 2.208018 [32064/60000]
loss: 2.225265 [38464/60000]
loss: 2.183236 [44864/60000]


...
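For non-interactive runs, the same steps can be wrapped in a batch script. Below is a minimal sketch assuming the pytorch/2.2.0-rocm5.7.3 module and the main.py training script from the example above; the account, partition, and resource requests are placeholders to adjust for your project:

Code Block
languagebash
titleExample batch script (a minimal sketch)
#!/bin/bash --login
#SBATCH --account=yourProjectCode-gpu
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=00:20:00

# Avoid /tmp filename clashes between jobs sharing a node (see the warning above)
export TMPDIR=/tmp/$SLURM_JOB_ID

module load pytorch/2.2.0-rocm5.7.3
python3 main.py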