The Pawsey Supercomputing Centre has installed a new GPU-enabled system called Garrawarla, a Wajarri word meaning Spider, to enable our Murchison Widefield Array (MWA) researchers to produce scientific outcomes while the Pawsey Supercomputing System is being procured. This MWA compute cluster provides the latest generation of CPUs and GPUs, high memory bandwidth, and increased memory per node to allow MWA researchers to effectively process large datasets.
System overview
System configuration
Garrawarla has the following hardware characteristics:
78 "HPE XL190 Gen10" compute nodes, each with:
- 2x Intel Xeon Gold 6230 20 core 2.1GHz CPU, code name "Cascade Lake".
- 384GB RAM
- 1x 240GB SSD boot drive
- 1x 960GB NVMe drive
- 1x HDR100/Ethernet 100Gb ConnectX-6 with single QSFP56 port
- 1x V100 32GB NVIDIA GPU
Filesystems and data management
The Astronomy filesystem, mounted as /astro, which is dedicated to MWA, is the only Lustre filesystem mounted on Garrawarla. It is provided by HPE, with 3 PB of usable space and capable of reading/writing at 30 GB/s.
More information about filesystems and data management can be found in File Management.
Available Software
Garrawarla includes the following software characteristics:
- SLES12 with SP5 Operating System
- Slurm queueing system
- Compilers: gcc, intel, pgi and clang
- Python
- CUDA
- MPI: OpenMPI and IntelMPI
- Singularity
- Profilers: ARM Forge, Intel VTune and NVIDIA Nsight
Python/2.7.17 and its associated packages are installed in the /pawsey/mwa/software/mwa_sles12sp4 directory.
Run "module use /pawsey/mwa/software/mwa_sles12sp4/modulefiles" before loading these modules.
ddeeptimahanti@garrawarla-1:~> module use /pawsey/mwa/software/mwa_sles12sp4/modulefiles
ddeeptimahanti@garrawarla-1:~> module avail
---------------------------- /pawsey/mwa/software/mwa_sles12sp4/modulefiles -----------------------------
argparse/1.4.0 (D)                         distribute/0.7.3 (D)       pyparsing/2.4.7 (D)
astropy/2.0.16                             ephem/3.7.7.1 (D)          pytest/4.6.9
attrs/19.3.0 (D)                           funcsigs/1.0.2 (D)         python-dateutil/2.8.1 (D)
backports.functools_lru_cache/1.6.1 (D)    functools32/3.2.3-2 (D)    python/2.7.17
backports_abc/0.5 (D)                      h5py/2.9.0                 pytz/2019.3 (D)
boost/1.66.0 (D)                           healpy/1.13.0 (D)          scipy/1.2.3
casacore/2.4.1                             matplotlib/2.1.0           setuptools/38.2.1
casacore/3.2.1 (D)                         mpi4py/3.0.3 (D)           singledispatch/3.4.0.3 (D)
certifi/2020.4.5.1 (D)                     numpy/1.13.3               sip/4.19.8
configparser/4.0.2                         pluggy/0.13.1 (D)          six/1.14.0 (D)
cycler/0.10.0 (D)                          psycopg2/2.8.5 (D)         subprocess32/3.5.4 (D)
cython/0.29.14 (D)                         py/1.8.1 (D)               tornado/6.0.4 (D)
d2to1/0.2.12.post1 (D)                     pyfits/3.5 (D)             zipp/1.2.0
...
Logging in
Interaction with Garrawarla is done remotely using SSH (Secure Shell version 2, SSH-2):
localComputer:~> ssh username@garrawarla.pawsey.org.au
For more information about SSH-based access, see Use of SSH Keys for Authentication.
Compiling
There are two families of supported software compilers on Garrawarla:
- Intel
- GNU: GNU Compiler Collection 8.3.0 is loaded by default.
It is up to you, as the user, to decide which programming environment is most suitable for the task at hand. To list the available gcc versions:
ddeeptimahanti@garrawarla-1:~> module avail gcc
------------------------------- /pawsey/mwa_sles12sp4/modulefiles/devel --------------------------------
gcc/4.8.5    gcc/5.5.0    gcc/8.3.0 (L,D)    gcc/10.1.0
In the round brackets in the output above, L means the module is currently loaded, and D means it is the default version, used when no version is specified in the module load command.
Compiler executables are named as follows:
Language | Intel compiler executable | GNU compiler executable |
---|---|---|
C | icc | gcc |
C++ | icpc | g++ |
Fortran | ifort | gfortran |
Type the man command followed by the compiler name to load the corresponding manual page.
Compiling MPI code
MPI libraries can be loaded using the corresponding modules. Use of OpenMPI with Unified Communication X is recommended for normal use cases, and can be achieved by loading the appropriate module:
$ module load openmpi-ucx/4.0.3
Once the MPI library is loaded, MPI wrappers are available for the currently selected compiler. Example commands follow:
Language | OpenMPI-UCX with Intel | OpenMPI-UCX with GNU | Intel-MPI with Intel | Intel-MPI with GNU |
---|---|---|---|---|
C | mpicc hello_mpi.c | mpicc hello_mpi.c | mpiicc hello_mpi.c | mpicc hello_mpi.c |
C++ | mpicxx hello_mpi.cpp | mpicxx hello_mpi.cpp | mpiicpc hello_mpi.cpp | mpicxx hello_mpi.cpp |
Fortran | mpif90 hello_mpi.f90 | mpif90 hello_mpi.f90 | mpiifort hello_mpi.f90 | mpif90 hello_mpi.f90 |
Always use srun to launch an MPI executable, regardless of whether it is OpenMPI or Intel-MPI.
Compiling OpenMP code
To compile code for OpenMP multi-threading, add specific flags at compile time. The syntax differs depending on the selected compiler:
Language | Intel command | GNU command |
---|---|---|
C | icc -qopenmp hello_omp.c | gcc -fopenmp hello_omp.c |
C++ | icpc -qopenmp hello_omp.cpp | g++ -fopenmp hello_omp.cpp |
Fortran | ifort -qopenmp hello_omp.f90 | gfortran -fopenmp hello_omp.f90 |
Refer here for useful compiler options while compiling code on Garrawarla.
Compiling GPU code
All nodes in Garrawarla are equipped with NVIDIA V100 GPUs, which are based on the NVIDIA Volta architecture and are accessible from the gpuq partition. GPU code compilation should occur on the compute nodes in the gpuq partition, either interactively for simple programs or via a job script for larger software suites.
Compiler compatibility notice
CUDA versions up to 10.2 are compatible with Intel compilers and GCC compilers.
Compiling a CUDA application
Compiling interactively
1. To compile interactively, submit a job request with salloc:
$ salloc --partition gpuq --time 1:00:00 --nodes 1 --gres=gpu:1
The terminal appears to hang until the job starts. Once the job has started, your login prompt displays the name of the gpuq compute node you are now on.
2. Ensure that the module for the desired compiler is loaded. The current default on Garrawarla is gcc/8.3.0. A different GNU version or the Intel compiler can be loaded with module swap, e.g.:
$ module swap gcc gcc/5.5.0
3. Load the CUDA module:
$ module load cuda
4. To compile MPI-enabled CUDA code, load the OPENMPI-UCX-GPU module as well:
$ module load openmpi-ucx-gpu
5. Execute compile and link stages jointly or separately:
5a. Execute compilation commands jointly, e.g.:
$ srun nvcc -O2 -arch=sm_70 code_host.c code_cuda.cu
5b. Alternatively, if you require separate compilation and link stages, compile with the "-c" option first, e.g.:
$ srun g++ -O2 -c code_host.cpp
$ srun nvcc -O2 -arch=sm_70 -c code_cuda.cu
5c. Continue with the link stage. Make sure the link stage takes place using the host compiler, and includes the CUDA run time library via "-lcudart":
$ srun g++ code_cuda.o code_host.o -lcudart
Compiling using a job script
To compile via a job script:
1. Prepare the script to request a node, load the relevant environment, and execute the compilation commands. For example, create a script file named compile.slurm which contains:
#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --partition=gpuq
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00
#SBATCH --account=[your-account]
module load cuda
srun nvcc -O2 -arch=sm_70 code_host.c code_cuda.cu
2. Submit the script called compile.slurm to the queue:
$ sbatch compile.slurm
Compiling an OpenACC application
The PGI compiler (v20.1) is available for compiling code that contains OpenACC directives. C, C++ and Fortran compilers are invoked using pgcc, pgc++ and pgfortran, respectively.
You can compile either interactively via salloc or in a batch job.
Compiling interactively
$ salloc --partition gpuq --time 1:00:00 --nodes 1 --gres=gpu:1
node$ module swap gcc pgi
node$ srun pgcc -acc code_openacc.c
Compiling using a batch job
#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --partition=gpuq
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00
#SBATCH --account=[your-account]
module swap gcc pgi
srun pgcc -acc code_openacc.c
Queue policy and limits
Garrawarla resources are managed by the Slurm queueing system. For detailed information about Slurm, refer to Job Scheduling.
Garrawarla has two overlapping partitions (queues), with all 78 nodes available in both:
- workq - for CPU-only jobs; only 38 cores are available in each node (mwa001-mwa078), max. 24h walltime. The remaining 2 cores in each node are reserved to support GPU jobs.
- gpuq - for GPU-only jobs; all 40 cores and a single GPU are available in each node (mwa001-mwa078), max. 24h walltime. You can request either the entire node (40 cores and a single GPU) or only 2 cores and a single GPU for your GPU jobs. Any non-GPU job requests are automatically rejected from this partition by Slurm.
Submit your job
Garrawarla compute nodes are in both partitions and are configured as a shared resource. This means that it is especially important, when requesting a GPU, to specify the number of tasks and the amount of main memory required by the job. If these are not specified, a job is by default allocated a single CPU core, no GPU and around 9GB of RAM.
It is recommended that all jobs request the following:
Option | Purpose |
---|---|
--account=account | Set the account to which the job is to be charged. A default account is configured for each user. |
--nodes=nnodes | Specify the total number of nodes. |
--ntasks=number | Specify the total number of tasks (processes). |
--gres=gpu:number | Specify the number of GPUs per node. |
--ntasks-per-node=number | Specify the number of tasks per node. |
--ntasks-per-socket=number | Specify the number of tasks per socket. |
--cores-per-socket=number | Specify the number of cores per socket. Note: Each node has two CPU sockets with 18 cores and 20 cores, respectively, to support GPU workflows. |
--cpus-per-task=number | Specify the number of threads per process for multi-threaded jobs. |
--mem=size | Specify the memory required per node. Note: If this option is not used, the scheduler allocates approximately 9GB of memory per process. |
--partition=partition | Request an allocation on the specified partition. If not specified, jobs are submitted to the default partition. |
--time=hh:mm:ss | Set the wall-clock time limit for the job. |
Refer to the following examples, which demonstrate different job allocation modes, including how to access the local NVMe storage.
Batch Job Examples
Requesting NVMe resources in SLURM
Each node in Garrawarla has an attached NVMe device with 890GB usable space mounted as /nvmetmp.
Request a specific amount of NVMe storage, up to 890GB, in your job script using the --gres=tmp:<some-value>g or --tmp=<some-value>g directive. If both options are used, only --gres is applied. You should not be able to use more NVMe space than has been allocated to you. By default, without any explicit NVMe request, a job should be allocated 1GB of /nvmetmp on the NVMe device.
The NVMe device (or the portion used by a job) is cleaned up after the job completes. IMPORTANT: Migrate any valuable results from the NVMe device before the job completes.
$ salloc -N 1 --tmp=200g
salloc: Nodes mwa001 are ready for job
mwa001$ df -h | grep /nvmetmp
/dev/nvme0n1p1  200G     0  200G   0% /nvmetmp
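Because /nvmetmp is cleaned up when the job completes, a job script would normally copy its results to permanent storage as its last step. The following is a minimal sketch of that step, not a Pawsey-provided tool: the function name stage_out and the example paths are hypothetical, and only the standard mkdir and cp commands are used.

```shell
#!/bin/bash
# Hypothetical helper: copy everything under a scratch directory to a
# permanent destination. Intended to be called at the end of a job script.
stage_out() {
    local src="$1" dest="$2"
    # Refuse to continue if the scratch directory is missing.
    [ -d "$src" ] || { echo "stage_out: no such directory: $src" >&2; return 1; }
    mkdir -p "$dest" || return 1
    # "src/." copies the directory contents, including hidden files.
    cp -a "$src"/. "$dest"/
}

# Example use inside a job script (both paths are placeholders):
#   srun ./my_code --output /nvmetmp/results
#   stage_out /nvmetmp/results /astro/[your-project]/results
```

Calling the helper after the final srun keeps the copy inside the job's wall-clock limit, so the --time request should leave room for it.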
Serial job using a single CPU core
In the following example, we assume that serial_code is a serial application that is to be run on a single core in the workq partition. The amount of memory available for the job is adjusted since by default it would be given only about 9GB per process.
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=380gb
#SBATCH --time=00:01:00
#SBATCH --partition=workq
#SBATCH --account=[your-project]
# load required modules
srun -n 1 ./serial_code
OpenMP code using all available CPU cores per node
In the following example, we assume that cpu_code is an OpenMP code using all 20 CPU cores of a socket. Note: Each node has 2 CPUs, with 20 cores each. A single process with 20 OpenMP threads is run on a single CPU (socket), leaving the other resources (the other CPU) available for other jobs. The amount of memory is adjusted to 180GB since by default the job would be given only 9GB per process. In this example, you are allocated the CPU cores and memory of a single CPU (single NUMA node).
Note: To facilitate GPU workflows, only 38 cores are available on a node in the workq partition, with 18 cores on CPU-1 and 20 on CPU-2. For best performance with OpenMP applications, it is recommended to launch threads in a single CPU/NUMA node.
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cores-per-socket=[some-value]  # up to 18 or 20 to explicitly request CPU socket with 18 or 20 cores respectively
#SBATCH --cpus-per-task=[some-value]     # To run threaded code and should be less than or equal to the above cores-per-socket value
#SBATCH --mem=180gb
#SBATCH --time=00:01:00
#SBATCH --partition=workq
#SBATCH --account=[your-project]
export OMP_NUM_THREADS=20  # This should be equal to cpus-per-task value
srun -n 1 -c ${OMP_NUM_THREADS} ./cpu_code
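The script above sets OMP_NUM_THREADS by hand and relies on a comment to keep it equal to the --cpus-per-task value. One way to keep the two in step is to read SLURM_CPUS_PER_TASK, which Slurm sets inside the job when --cpus-per-task is given. The sketch below is illustrative only: the function name set_omp_threads is hypothetical, and the fallback to one thread when the variable is unset is an assumption.

```shell
#!/bin/bash
# Hypothetical helper: take the OpenMP thread count from the value Slurm
# exports for --cpus-per-task. Falls back to 1 thread (an assumed default)
# when the variable is unset, e.g. outside a Slurm allocation.
set_omp_threads() {
    export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
}

set_omp_threads
echo "Running with ${OMP_NUM_THREADS} OpenMP thread(s)"
```

In a job script this would take the place of the hard-coded export line, with the srun line left unchanged.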
Non-MPI code using a single GPU
In the following example, we assume that gpu_code is a non-MPI application that can use a single GPU. The amount of memory available for the job is adjusted, since by default the job is given about 9GB per process.
Note: For best performance with OpenMP applications, it is recommended to launch threads within a single CPU (or NUMA node or socket).
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --ntasks-per-node=1
#SBATCH --cores-per-socket=[some-value]  # up to 20 to explicitly request cores from a single CPU socket
#SBATCH --cpus-per-task=[some-value]     # To run threaded code and should be less than or equal to the above cores-per-socket value
#SBATCH --mem=380gb
#SBATCH --time=00:01:00
#SBATCH --partition=gpuq
#SBATCH --account=[your-project]
module load cuda
export OMP_NUM_THREADS=20  # This should be equal to cpus-per-task value
srun -n 1 -c ${OMP_NUM_THREADS} ./gpu_code
MPI code using more than one GPU
In the following example, we assume that gpu_code is an MPI application that can use a single GPU per process. Two processes are run, one per node, and the amount of memory per node is adjusted since by default the job is given 9GB per process.
#!/bin/bash -l
#SBATCH --nodes=2
#SBATCH --gres=gpu:1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=380gb
#SBATCH --time=00:01:00
#SBATCH --partition=gpuq
#SBATCH --account=[your-project]
module load cuda
srun -n 2 -N 2 ./gpu_code
OpenMP code using a single GPU and all available CPU cores
In the following example, we assume that gpu_code is an OpenMP code using a single GPU and all 20 CPU cores of a socket. Note: Each node has 2 CPUs, with 20 cores each. A single process with 20 OpenMP threads is run on a single CPU (socket), leaving the other resources (the other CPU) available for other jobs. The amount of memory is adjusted to 180GB since by default the job is given 9GB per process. In this example, the job is allocated the CPU cores and memory of a single CPU (single NUMA node).
Note: For best performance with OpenMP applications, it is recommended to launch threads within a single CPU (or NUMA node or socket).
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cores-per-socket=[some-value]  # up to 20 to explicitly request cores from a single CPU socket
#SBATCH --cpus-per-task=[some-value]     # To run threaded code and should be less than or equal to the above cores-per-socket value
#SBATCH --gres=gpu:1
#SBATCH --mem=180gb
#SBATCH --time=00:01:00
#SBATCH --partition=gpuq
#SBATCH --account=[your-project]
module load cuda
export OMP_NUM_THREADS=20  # This should be equal to cpus-per-task value
srun -n 1 -c ${OMP_NUM_THREADS} ./gpu_code
MPI + OpenMP code using the GPU and all available CPU cores per node
In the following example, we assume that gpu_code is an MPI + OpenMP code using a single GPU per process and capable of using OpenMP multi-threading to additionally use all CPU cores in a node. Note: There are 2 CPUs per node, with 20 cores each. The code is run on two nodes with one process per node, each using a single GPU. The amount of memory is adjusted to 180gb since by default the job would be given 9gb per process.
#!/bin/bash -l
#SBATCH --nodes=2
#SBATCH --gres=gpu:1
#SBATCH --ntasks-per-node=1
#SBATCH --cores-per-socket=[some-value]  # up to 20 to explicitly request cores from a single CPU socket
#SBATCH --cpus-per-task=[some-value]     # To run threaded code and should be less than or equal to the above cores-per-socket value
#SBATCH --mem=180gb
#SBATCH --time=00:01:00
#SBATCH --partition=gpuq
#SBATCH --account=[your-project]
module load cuda
export OMP_NUM_THREADS=20  # This should be equal to cpus-per-task value
srun -n 2 -c ${OMP_NUM_THREADS} ./gpu_code
Run a job using interactive mode
As on other Pawsey systems, you can use the salloc command to run interactive sessions. You can use the #SBATCH options mentioned above to specify various interactive job parameters. For example, to run an OpenMP code using 1 GPU, you can open an interactive session with the following command:
$ salloc --nodes=1 --gres=gpu:1 --ntasks-per-socket=1 --cores-per-socket=20 --cpus-per-task=20 --mem=180gb --time=00:05:00 --partition=gpuq --account=[your-project]
For all interactive sessions, after salloc has run and you are on a compute node, use the srun command to execute your commands. This is valid for all commands. For example, use srun to run the nvidia-smi command on the interactive node:
ddeeptimahanti@mwa041:~> srun nvidia-smi
Sun May 24 16:48:39 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   34C    P0    23W / 250W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
ddeeptimahanti@mwa041:~>
Resource Accounting
Pawsey provides a tailored suite of tools called pawseytools which is already configured to be a default module upon login. The pawseyAccountBalance utility in pawseytools shows the current state of the default group's usage against their allocation and also the /astro storage quota and usage. For example:
ddeeptimahanti@garrawarla-1:~> pawseyAccountBalance --cluster=garrawarla -p mwaops -storage
Compute Information
-------------------
Project ID    Allocation    Usage    % used
----------    ----------    -----    ------
mwaops            100000     1381       1.4
Storage Information
-------------------
/astro usage for mwaops, used = 2.09 TiB, quota = 20.00 TiB
Troubleshooting & Good Practices
Using Singularity with GPUs
Use the --nv option when using Singularity on Garrawarla compute nodes.
Segmentation fault while running CUDA/OpenACC applications with UCX support
A segmentation fault can occur if an application is statically linked to CUDA libraries or allocates memory before MPI_Init. As a workaround, disable the memory type cache by exporting UCX_MEMTYPE_CACHE=n.
Using ramdisk support on the compute nodes
Each node on Garrawarla has up to 50% of its memory (~185GB) mounted at /dev/shm and available as a ramdisk, which can be used to speed up large I/O-intensive computations. This resource is not tracked by Slurm, so you should clean up /dev/shm before the job exits; otherwise the memory available to subsequent jobs on that node is reduced. Also, to be fair to other users, request cores in proportion to your ramdisk usage. For example, by default only 9GB is available per core; therefore, to use 90GB of ramdisk you should request an additional 10 cores to avoid issues for other jobs running on the same node.
ddeeptimahanti@mwa001:~> df -h | grep /dev/shm
tmpfs           189G     0  189G   0% /dev/shm
ddeeptimahanti@mwa001:~>
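The core-count rule and the cleanup requirement described above can be sketched as follows. This is illustrative only: the function name ramdisk_extra_cores and the per-job directory layout under /dev/shm are assumptions, and the 9GB-per-core figure is the default quoted in the text.

```shell
#!/bin/bash
# Hypothetical helper: number of extra cores to request so that the default
# 9GB-per-core allocation covers the planned ramdisk usage (rounded up).
ramdisk_extra_cores() {
    local ramdisk_gb="$1" gb_per_core=9
    echo $(( (ramdisk_gb + gb_per_core - 1) / gb_per_core ))
}

ramdisk_extra_cores 90    # prints 10, matching the example in the text

# Cleanup sketch for a single-node job script: keep all ramdisk files in one
# per-job directory (assumed layout) and remove it when the script exits.
#   workdir="/dev/shm/${SLURM_JOB_ID}"
#   mkdir -p "$workdir"
#   trap 'rm -rf "$workdir"' EXIT
```

Using a trap means the directory is removed even if the application exits with an error, which is the case most likely to leave files behind in /dev/shm.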
Requesting only the required memory to allow jobs on the overlapping partitions
Each compute node has 384GB of CPU memory, of which only ~371GB is available to users' jobs through Slurm. By default, Slurm allocates only 9GB for each core requested. The workq partition provides only 38 cores on each node, so if a job requests all 38 cores of a node from the workq partition, Slurm automatically allocates 342GB of memory (= 38 x 9GB) to that job. This leaves only ~29GB (= 371 - 342) of memory for any GPU job that is going to run on that same node through the overlapping gpuq partition. It is therefore recommended to explicitly request only the amount of memory your job requires, using the --mem directive, so that the nodes can be used effectively by both CPU and GPU workflows.
The following interactive job requested 38 tasks from a single node in the workq partition. Slurm allocated mwa024 and by default provided 342GB (9GB per task) for this job.
ddeeptimahanti@garrawarla-1:~> salloc -p workq --ntasks-per-node=38
salloc: Granted job allocation 612214
salloc: Waiting for resource configuration
salloc: Nodes mwa024 are ready for job
ddeeptimahanti@mwa024:~> scontrol show job 612214 | grep mem
TRES=cpu=38,mem=342G,node=1,billing=38
Now only 29GB (= 371 - 342) remains on mwa024 for any GPU job on this node. Slurm will therefore fail to allocate resources for a gpuq job that requests more than 29GB on this node; it can only honor jobs requesting 29GB of memory or less.
ddeeptimahanti@garrawarla-1:~> salloc -w mwa024 -p gpuq --gres=gpu:1 --mem=29g
salloc: Granted job allocation 612245
salloc: Waiting for resource configuration
salloc: Nodes mwa024 are ready for job
ddeeptimahanti@mwa024:~> scontrol show node mwa024 | grep mem
CfgTRES=cpu=40,mem=380000M,billing=40,gres/gpu=1
AllocTRES=cpu=39,mem=379904M,gres/gpu=1
To allow jobs to run on both overlapping partitions, request only as much memory as your job requires.
Now, the following interactive job requested 38 tasks from a single node in the workq partition but explicitly requested 200g memory using the --mem directive:
ddeeptimahanti@garrawarla-1:~> salloc -w mwa024 -p workq --ntasks-per-node=38 --mem=200g
salloc: Granted job allocation 865353
salloc: Waiting for resource configuration
salloc: Nodes mwa024 are ready for job
ddeeptimahanti@mwa024:~> scontrol show job 865353 | grep mem
TRES=cpu=38,mem=200G,node=1,billing=38
ddeeptimahanti@mwa024:~>
This allowed another job to be launched from the gpuq partition on the same node, requesting up to 171GB (= 371 - 200):
ddeeptimahanti@garrawarla-1:~> salloc -w mwa024 -p gpuq --gres=gpu:1 --mem=171g
salloc: Granted job allocation 865355
salloc: Waiting for resource configuration
salloc: Nodes mwa024 are ready for job
ddeeptimahanti@mwa024:~> vi vtk.cyg
ddeeptimahanti@mwa024:~> scontrol show node mwa024 | grep mem
CfgTRES=cpu=40,mem=380000M,billing=40,gres/gpu=1
AllocTRES=cpu=39,mem=371G,gres/gpu=1
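The arithmetic behind the examples in this section can be written down as a short sketch. The function names are hypothetical; the 9GB-per-core default and the ~371GB usable per node are the figures quoted in the text, not values queried from Slurm.

```shell
#!/bin/bash
# Figures quoted in the text above.
GB_PER_CORE=9        # default memory Slurm allocates per requested core
NODE_USABLE_GB=371   # approximate memory per node available to jobs

# Memory (GB) a job receives by default when --mem is not specified.
default_mem_gb() {
    echo $(( $1 * GB_PER_CORE ))
}

# Memory (GB) left on the node for an overlapping job, given what the
# first job was allocated.
remaining_mem_gb() {
    echo $(( NODE_USABLE_GB - $1 ))
}

remaining_mem_gb "$(default_mem_gb 38)"   # 38 cores, no --mem: prints 29
remaining_mem_gb 200                      # same cores with --mem=200g: prints 171
```

The two calls reproduce the 29GB and 171GB limits seen in the gpuq requests above.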