Example of using pytorch on AMD when it uses CUDA extensions

In this step-by-step account of a migration, we will see how a machine learning workflow originally running on CSIRO's Bracewell using NVIDIA CUDA was ported to Setonix, using AMD GPUs via HIP. After the migration, the same code runs approximately 20 times faster on Setonix, and, as will be seen, the effort required to migrate is minor.

If you have a similar case and would like some help with your migration or you have a similar example that we can help you with while documenting it for other users, please get in touch via Pawsey’s help desk.

The original code can be found here: https://github.com/Nikhel1/Gal-DINO. The code aims to detect galaxies in radio observations, match them with infrared observations, and classify them.

 

[Image from the Gal-DINO NeurIPS 2023 poster]

Source: https://nips.cc/media/PosterPDFs/NeurIPS%202023/76102.png?t=1701025439.3578622

The training dataset can be found here: https://data.csiro.au/collection/csiro%3A61068v1

Overview

Setting up to run this code consists of the following steps:

Cloning the repo

In more detail, for the cloning we pick a directory under /scratch for both the code and the data. On Pawsey systems the handy environment variable $MYSCRATCH points to your scratch directory, so we can use it to create the workspace:

cd $MYSCRATCH
git clone https://github.com/knservis/Gal-DINO.git
cd Gal-DINO

Setting up the correct environment

Ensure you have conda; if you don't, install it* with:

tmp_dir=$(mktemp -d)
cd $tmp_dir
curl -LO https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
cd -
bash $tmp_dir/Miniconda3-latest-Linux-x86_64.sh -b -p $MYSOFTWARE/miniconda3/
rm -rf $tmp_dir

*There is currently a limitation on the number of files allowed in the /software filesystem at Pawsey, and you may have issues using $MYSOFTWARE. If at any point you run into issues with space (the error will say something like 'disk quota exceeded'), come back and use $MYSCRATCH instead.

Alternatively, you can try removing all other environments and running:

conda clean --all --force-pkgs-dirs && export CONDA_PKGS_DIRS=$(mktemp -d) && conda install <blah> -y && rm -rf $CONDA_PKGS_DIRS

This approach has the disadvantage of being impermanent, and you may have to reinstall if this directory gets purged.

Once you have conda installed (re-login if you just installed it using the snippet above and ran conda init bash), you need to create the correct environment. To compile and verify, you can use the gpu-dev partition. To get an interactive shell there, simply* do:
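A typical request looks like the following (a sketch only: the resource numbers and walltime are up to you, and the account pawsey0123-gpu is the hypothetical example from the note below):

salloc --partition=gpu-dev --nodes=1 --gpus-per-node=1 --account=pawsey0123-gpu --time=01:00:00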

*Simply: your project code on Pawsey systems is usually something like pawseyXXXX (e.g. pawsey0123) or directorXXXX, and if it's not you will definitely know (talk to your PI if you don't, or check the Pawsey Origin portal). As a reminder, you will need to append -gpu to your project code to submit jobs to the gpu or gpu-dev partitions on Pawsey, so your account becomes pawseyXXXX-gpu (with the earlier example, pawsey0123-gpu).
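One way to create a suitable environment is sketched below; the environment name gal-dino, the Python version and the ROCm wheel index are assumptions, so pick a PyTorch/ROCm combination that matches the ROCm module you will load in the next step:

conda create -n gal-dino python=3.9 -y
conda activate gal-dino
# ROCm build of PyTorch from the official wheel index; the rocm5.6 suffix is illustrative
pip install torch torchvision --index-url https://download.pytorch.org/whl/rocm5.6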

Compiling the CUDA extensions

Now let us activate the right version of ROCm:
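The version below is illustrative; check what is actually installed first:

module spider rocm          # list the available ROCm versions
module load rocm/5.2.3      # illustrative version; pick one compatible with your PyTorch ROCm build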

Now, assuming you have the conda env activated and that the current working directory is Gal-DINO, you can install the requirements and compile the extensions with:
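A sketch of what that typically looks like (the ops path follows the upstream DINO layout this repo is based on; adjust if the repo differs):

pip install -r requirements.txt
# build the (hipified) deformable-attention extension
cd models/dino/ops
python setup.py build install
cd ../../..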

Optionally, you can test the CUDA extensions (now converted via HIP) with python test.py, but this may run out of memory at higher iteration counts, so you may need to edit the test.

Downloading the dataset

A quick note on the training data: if you follow the code repo instructions on how to obtain the data you will reach https://data.csiro.au/collection/csiro%3A61068v1. Go to the “Files” tab and click “Download”; on Setonix the natural choice is the S3 method, at which point you will get a command string for rclone (the client recommended and supported on Setonix, available via module load rclone/1.62.2, or whichever version module spider rclone reports if the admins have updated it) that looks something like:
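The string itself is not reproduced here; its shape is roughly the following, where every angle-bracketed value is a placeholder that the download page fills in for you:

rclone config create <remote> s3 provider=Other access_key_id=<key> secret_access_key=<secret> endpoint=<endpoint>
rclone copy <remote>:<bucket>/<collection-path>/ <destination>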

Assuming that the current directory is still Gal-DINO, in order to get the correct structure and skip some unused files that are also in there, you will need to modify it to:
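With the same placeholders as above, the modified command ends up looking like:

rclone config create <remote> s3 provider=Other access_key_id=<key> secret_access_key=<secret> endpoint=<endpoint>
rclone copy <remote>:<bucket>/<collection-path>/data/RadioGalaxyNET ./RadioGalaxyNET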

Notice the last line is modified by adding data/RadioGalaxyNET to the source path (to skip the fluff) and by replacing the destination directory with ./RadioGalaxyNET, which is what the default config expects.

Training the network

At this stage, you can go ahead and run the training* by:
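An illustrative form of the command (this follows the upstream DINO conventions the repo is based on; check the repo's README and scripts/ for the exact entry point and arguments):

# single GPU; reduce the number of epochs in the config for a quick verification run
python main.py -c config/DINO/DINO_4scale.py --coco_path ./RadioGalaxyNET --output_dir logs/gal-dino-test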

In this and most cases, you would only run this for 1-2 epochs just to verify the setup, as you will likely not have enough time left within the dev-node limit (1 hour) for more.

*There is a newer way of launching torch training runs which is not yet used in this repo; the details can be found at https://pytorch.org/docs/stable/elastic/run.html

Once you are ready to run your training in anger, log out of the dev node and, from the login node, do:
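A batch script along the following lines can be submitted with sbatch; every name, version and resource number below is illustrative, and the conda environment gal-dino is the hypothetical one from the setup sketch above:

#!/bin/bash --login
#SBATCH --job-name=gal-dino
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --account=pawsey0123-gpu    # your project code with -gpu appended
#SBATCH --time=04:00:00

module load rocm/5.2.3              # illustrative version
source $MYSOFTWARE/miniconda3/etc/profile.d/conda.sh
conda activate gal-dino

cd $MYSCRATCH/Gal-DINO
srun python main.py -c config/DINO/DINO_4scale.py --coco_path ./RadioGalaxyNET --output_dir logs/gal-dino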

Or modify it to suit your needs.

In this case, the effort is definitely worth it as the above network was trained in ~2 hours on Setonix vs ~40 hours on Bracewell!

You are welcome to read through the commit log, but essentially there was only one check stopping the compilation, which was addressed here: https://github.com/knservis/Gal-DINO/commit/3fe802adcad24a42700261ac4a320191c3418df7#diff-e4a3b64f2e6e2a5324a08042473a1459fe78999c0136574af10ea16303045965. In short, if hasattr(torch.version, 'hip') and torch.version.hip is not None passes, we pretend that we are using CUDA, and it works via HIP.
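A minimal sketch of the pattern (not the exact diff; see the commit for the real change):

# If this is a ROCm (HIP) build of PyTorch, treat it the same as a CUDA build,
# so the extension build/availability check is not skipped.
import torch

is_hip = hasattr(torch.version, "hip") and torch.version.hip is not None
if torch.cuda.is_available() or is_hip:
    # proceed as if CUDA were present; on ROCm builds torch.utils.cpp_extension
    # hipifies the CUDA sources automatically when compiling the extension
    pass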

Other than that, the effort was mainly finding the correct version combination and working around the file quotas on Setonix, and even that was not significant.

Comparison with Virga

Virga is the most recent cluster at CSIRO, and it uses NVIDIA H100 GPUs.

In brief, running on Virga is done using the following instructions:
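The original instructions are not reproduced here, but as a rough, hypothetical outline (module names and versions on Virga are assumptions), the CUDA path needs no HIP conversion at all:

module load cuda                   # default version; check module avail first
conda create -n gal-dino python=3.9 -y
conda activate gal-dino
pip install torch torchvision      # standard CUDA wheels
cd Gal-DINO
pip install -r requirements.txt
cd models/dino/ops && python setup.py build install && cd -
python main.py -c config/DINO/DINO_4scale.py --coco_path ./RadioGalaxyNET --output_dir logs/gal-dino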

Although it may appear to be a more efficient and simplified process compared to Setonix, certain administrative decisions on Virga contribute to this perception. For instance, the login node is equipped with 4 GPUs, conda/mamba is supported, the /home directory quota is larger, and modules have default versions. While these factors do streamline operations considerably, there are also limitations to consider.

Access to Virga is restricted to CSIRO members and requires approval from your line manager (LM). Additionally, SSH access is only possible when connected through the CSIRO VPN, and outgoing connections from cluster nodes are not permitted except from the login nodes.

In the scenario involving conda/mamba, it is possible to encounter disk quota limitations in the default location (your /home directory). If this occurs, executing conda clean -a may resolve the issue. In rare situations where this does not provide a solution, alternative approaches such as those used in the Setonix case may need to be considered.

 

Training time on Virga is almost identical to that on Setonix for this case (we would need a few repetitions to be sure).

 
