Excerpt |
---|
PyTorch is an optimised tensor library for deep learning using GPUs and CPUs. |
...
$ docker pull quay.io/pawsey/pytorch:2.2.0-rocm5.7.3
The container can also be pulled using Singularity:
...
Here is another example of running a simple training script on a GPU node during an interactive session:
Column |
---|
|
Warning |
---|
title | Redefine your TMPDIR |
---|
| We have seen crashes when two users on the same node run the same PyTorch script with the containerised pytorch module. This is due to both jobs trying to create files with the same name in /tmp. We are working towards a definitive solution to this problem; in the meantime, we recommend redefining the TMPDIR environment variable as a workaround: export TMPDIR=/tmp/$SLURM_JOB_ID
|
|
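The workaround above can be placed at the top of a job script; a minimal sketch (the mkdir -p line is our addition, not from the warning, to ensure the per-job directory exists before anything writes to it):

```shell
# Give each Slurm job its own temporary directory.
# $SLURM_JOB_ID is set by Slurm inside any allocation.
export TMPDIR=/tmp/$SLURM_JOB_ID
# Create the directory so programs writing temporary files do not fail.
mkdir -p "$TMPDIR"
```

Because TMPDIR now includes the job ID, two jobs on the same node can no longer collide on identically named temporary files.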
Column |
---|
|
Code Block |
---|
language | bash |
---|
theme | DJango |
---|
title | Terminal 2. Using PyTorch on a compute node in an interactive Slurm session. |
---|
| setonix-05$ salloc -p gpu -A yourProjectCode-gpu --gres=gpu:1 --time=00:20:00
salloc: Pending job allocation 12386179
salloc: job 12386179 queued and waiting for resources
salloc: job 12386179 has been allocated resources
salloc: Granted job allocation 12386179
salloc: Waiting for resource configuration
salloc: Nodes nid002096 are ready for job
nid002096$ export TMPDIR=/tmp/$SLURM_JOB_ID
nid002096$ module load pytorch/2.2.0-rocm5.7.3
nid002096$ python3 main.py
NeuralNetwork(
(flatten): Flatten(start_dim=1, end_dim=-1)
(linear_relu_stack): Sequential(
(0): Linear(in_features=784, out_features=512, bias=True)
(1): ReLU()
(2): Linear(in_features=512, out_features=512, bias=True)
(3): ReLU()
(4): Linear(in_features=512, out_features=10, bias=True)
)
)
### Epoch 0/10 ###
loss: 2.289750 [ 64/60000]
loss: 2.287861 [ 6464/60000]
loss: 2.263056 [12864/60000]
loss: 2.261112 [19264/60000]
loss: 2.240377 [25664/60000]
loss: 2.208018 [32064/60000]
loss: 2.225265 [38464/60000]
loss: 2.183236 [44864/60000]
|
|
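The model printed in the session above matches the network from the PyTorch quickstart tutorial. As an assumption about what main.py contains (not the actual script), a minimal sketch of such a model would be:

```python
# Hypothetical sketch of a main.py producing the model printout above,
# modelled on the PyTorch quickstart tutorial; not the actual script.
import torch
from torch import nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        # Flatten a 28x28 image into a 784-element vector (start_dim=1
        # keeps the batch dimension intact).
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28 * 28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        return self.linear_relu_stack(x)

# ROCm builds of PyTorch expose the GPU through the "cuda" device name,
# so the usual device selection idiom works unchanged on AMD GPUs.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = NeuralNetwork().to(device)
print(model)
```

Running this under the containerised module prints the same NeuralNetwork(...) structure shown in the terminal session.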
...