The Pawsey Supercomputing Centre has installed a new GPU-enabled system called Garrawarla, a Wajarri word meaning Spider, to enable our Murchison Widefield Array (MWA) researchers to produce scientific outcomes while the Pawsey Supercomputing System is being procured. This MWA compute cluster provides the latest generation of CPUs and GPUs, high memory bandwidth, and increased memory per node to allow MWA researchers to effectively process large datasets.

System overview

System configuration

Garrawarla has the following hardware characteristics:

78 "HPE XL190 Gen10" compute nodes, each with:

2x Intel Xeon Gold 6230 20 core 2.1GHz CPU, code name "Cascade Lake".
384GB RAM
1x 240GB SSD boot drive
1x 960GB NVMe drive
1x HDR100/Ethernet 100Gb ConnectX-6 with single QSFP56 port
1x V100 32GB NVIDIA GPU

Filesystems and data management

The Astronomy filesystem, mounted as /astro, which is dedicated to MWA, is the only Lustre filesystem mounted on Garrawarla. It is provided by HPE, with 3 PB of usable space and capable of reading/writing at 30 GB/s.

More information about filesystems and data management can be found in File Management.

Available Software

Garrawarla includes the following software characteristics:

SLES12 with SP5 Operating System
Slurm queueing system
Compilers: gcc, intel, pgi and clang
Python
CUDA
MPI: OpenMPI and IntelMPI
Singularity
Profilers: ARM Forge, Intel VTune and NVIDIA Nsight

Python/2.7.17 and its associated packages are installed in the /pawsey/mwa/software/mwa_sles12sp4 directory.

Run "module use /pawsey/mwa/software/mwa_sles12sp4/modulefiles" before loading these modules.

Terminal 1. Using python2 modules.

ddeeptimahanti@garrawarla-1:~> module use /pawsey/mwa/software/mwa_sles12sp4/modulefiles
ddeeptimahanti@garrawarla-1:~> module avail

---------------------------- /pawsey/mwa/software/mwa_sles12sp4/modulefiles -----------------------------
   argparse/1.4.0                      (D)    distribute/0.7.3    (D)    pyparsing/2.4.7        (D)
   astropy/2.0.16                             ephem/3.7.7.1       (D)    pytest/4.6.9
   attrs/19.3.0                        (D)    funcsigs/1.0.2      (D)    python-dateutil/2.8.1  (D)
   backports.functools_lru_cache/1.6.1 (D)    functools32/3.2.3-2 (D)    python/2.7.17
   backports_abc/0.5                   (D)    h5py/2.9.0                 pytz/2019.3            (D)
   boost/1.66.0                        (D)    healpy/1.13.0       (D)    scipy/1.2.3
   casacore/2.4.1                             matplotlib/2.1.0           setuptools/38.2.1
   casacore/3.2.1                      (D)    mpi4py/3.0.3        (D)    singledispatch/3.4.0.3 (D)
   certifi/2020.4.5.1                  (D)    numpy/1.13.3               sip/4.19.8
   configparser/4.0.2                         pluggy/0.13.1       (D)    six/1.14.0             (D)
   cycler/0.10.0                       (D)    psycopg2/2.8.5      (D)    subprocess32/3.5.4     (D)
   cython/0.29.14                      (D)    py/1.8.1            (D)    tornado/6.0.4          (D)
   d2to1/0.2.12.post1                  (D)    pyfits/3.5          (D)    zipp/1.2.0
...

Logging in

Interaction with Garrawarla is done remotely using SSH (Secure Shell version 2, SSH-2):

localComputer:~> ssh username@garrawarla.pawsey.org.au

For more information about SSH-based access, see Use of SSH Keys for Authentication.

Compiling

There are two families of supported software compilers on Garrawarla:

Intel
GNU: GNU Compiler Collection 8.3.0 is loaded by default.

It is up to you, as the user, to decide which programming environment is most suitable for the task at hand. To list the available GCC versions:

Terminal 2. Available GCC versions.

ddeeptimahanti@garrawarla-1:~> module avail gcc

------------------------------- /pawsey/mwa_sles12sp4/modulefiles/devel --------------------------------
   gcc/4.8.5    gcc/5.5.0    gcc/8.3.0 (L,D)    gcc/10.1.0

In the round brackets above, L means the module is currently loaded, and D means it is the default version, which is loaded when no version is specified in the module load command.

Compiler executables are named as follows:

Language	Intel	GNU
C	icc	gcc
C++	icpc	g++
Fortran	ifort	gfortran

Type the man command followed by the compiler name to load the corresponding manual page.

Compiling MPI code

MPI libraries can be loaded using the corresponding modules. Use of OpenMPI with Unified Communication X is recommended for normal use cases, and can be achieved by loading the appropriate module:

$ module load openmpi-ucx/4.0.3

Once the MPI library is loaded, MPI wrappers are available for the currently selected compiler. Example commands follow:

Language	OpenMPI-UCX (Intel)	OpenMPI-UCX (GNU)	Intel-MPI (Intel)	Intel-MPI (GNU)
C	mpicc hello_mpi.c	mpicc hello_mpi.c	mpiicc hello_mpi.c	mpicc hello_mpi.c
C++	mpicxx hello_mpi.cpp	mpicxx hello_mpi.cpp	mpiicpc hello_mpi.cpp	mpicxx hello_mpi.cpp
Fortran	mpif90 hello_mpi.f90	mpif90 hello_mpi.f90	mpiifort hello_mpi.f90	mpif90 hello_mpi.f90

Always use srun to launch an MPI executable, regardless of whether it is built with OpenMPI or Intel-MPI.

With the Intel compilers and Intel MPI, the wrapper names differ from those of the other three combinations: mpiicc, mpiicpc and mpiifort (in contrast to the usual mpicc, mpicxx and mpif90).

Codes compiled with OpenMPI will not work properly with libraries compiled with Intel MPI and vice versa. Make sure that all linked libraries are compiled with the same MPI implementation used for your parallel MPI code.

Compiling OpenMP code

To compile code for OpenMP multi-threading, add specific flags at compile time. The syntax differs depending on the selected compiler:

Language	Intel	GNU
C	icc -qopenmp hello_omp.c	gcc -fopenmp hello_omp.c
C++	icpc -qopenmp hello_omp.cpp	g++ -fopenmp hello_omp.cpp
Fortran	ifort -qopenmp hello_omp.f90	gfortran -fopenmp hello_omp.f90

Refer here for useful compiler options while compiling code on Garrawarla.

Compiling GPU code

All the nodes in Garrawarla are equipped with NVIDIA V100 GPUs, which are based on the NVIDIA Volta architecture and are accessible from the gpuq partition. GPU code compilation should occur on the compute nodes in the gpuq partition, either interactively for simple programs or via a job script for larger software suites.

Compiler compatibility notice

CUDA versions up to 10.2 are compatible with Intel compilers and GCC compilers.

Compiling a CUDA application

Compiling interactively

1. To compile interactively, submit a job request with salloc:

$ salloc --partition gpuq --time 1:00:00 --nodes 1 --gres=gpu:1

The terminal appears to hang until the job starts. Once the job has started, your login prompt displays the name of the gpuq compute node you are now on.

2. Ensure that the module for the desired compiler is loaded. The current default on Garrawarla is gcc/8.3.0. A different GNU version or the Intel compiler can be loaded with module swap, e.g.:

$ module swap gcc gcc/5.5.0

3. Load the CUDA module:

$ module load cuda

4. To compile MPI-enabled CUDA code, load the openmpi-ucx-gpu module as well:

$ module load openmpi-ucx-gpu

5. Execute compile and link stages jointly or separately:

5a. Execute compilation commands jointly, e.g.:

$ srun nvcc -O2 -arch=sm_70 code_host.c code_cuda.cu

5b. Alternatively, if you require separate compilation and link stages, compile with the "-c" option first, e.g.:

$ srun g++ -O2 -c code_host.cpp

$ srun nvcc -O2 -arch=sm_70 -c code_cuda.cu

5c. Continue with the link stage. Make sure the link stage takes place using the host compiler, and includes the CUDA runtime library via "-lcudart":

$ srun g++ code_cuda.o code_host.o -lcudart

Compiling using a job script

To compile via a job script:

1. Prepare the script to request a node, load the relevant environment, and execute the compilation commands. For example, create a script file named compile.slurm which contains:

Listing 1. Compiling using a job script.

#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --partition=gpuq
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00
#SBATCH --account=[your-account]

module load cuda

srun nvcc -O2 -arch=sm_70 code_host.c code_cuda.cu

2. Submit the script called compile.slurm to the queue:

$ sbatch compile.slurm
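
When compilation is one step in a longer workflow, it can help to capture the job ID that sbatch reports so that later commands can refer to it. The following is a minimal sketch, assuming sbatch prints its usual "Submitted batch job <id>" confirmation line; the helper name job_id_from_sbatch is hypothetical and is not part of Slurm or pawseytools.

```shell
# Hypothetical helper: read sbatch's confirmation line on stdin and print
# only the job ID (the fourth word of "Submitted batch job <id>").
job_id_from_sbatch() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Illustration using a sample confirmation line rather than a real submission.
echo "Submitted batch job 612214" | job_id_from_sbatch
```

On Garrawarla this could be used as jobid=$(sbatch compile.slurm | job_id_from_sbatch), followed by squeue -j "$jobid" to check the state of the compile job.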

Compiling an OpenACC application

The PGI compiler (v20.1) is available for compiling code that contains OpenACC directives. C, C++ and Fortran compilers are invoked using pgcc, pgc++ and pgfortran, respectively.

You can compile either interactively via salloc or in a batch job.

Compiling interactively

Terminal 3. Compiling OpenACC code interactively

$ salloc --partition gpuq --time 1:00:00 --nodes 1 --gres=gpu:1 
node$ module swap gcc pgi
node$ srun pgcc -acc code_openacc.c

Compiling using a batch job

Listing 2. Compiling OpenACC code.

#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --partition=gpuq
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00
#SBATCH --account=[your-account]

module swap gcc pgi

srun pgcc -acc code_openacc.c

Queue policy and limits

Garrawarla resources are managed by the Slurm queueing system. For detailed information about Slurm, refer to Job Scheduling.

Garrawarla has two overlapping partitions (queues), with all 78 nodes available in both:

workq - for CPU-only jobs; with only 38 cores available in each node (mwa001-mwa078), max. 24h walltime. The remaining 2 cores in each node are available to support GPU jobs.
gpuq - for GPU-only jobs; with all 40 cores and single GPU available in each node (mwa001-mwa078), max. 24h walltime. You can either request the entire node (with 40 cores and single GPU) or only 2 cores + single GPU for your GPU jobs. Any non-GPU job requests are automatically rejected from this partition by Slurm.

Submit your job

Garrawarla compute nodes are in both partitions and are configured as a shared resource. This means that it is especially important, when requesting a GPU, to specify the number of tasks and the amount of main memory required by the job. If these are not specified, a job is allocated a single CPU core, no GPU and around 9GB of RAM by default.

It is recommended that all jobs request the following:

Option	Purpose
`--account=account`	Set the account to which the job is to be charged. A default account is configured for each user.
`--nodes=nnodes`	Specify the total number of nodes.
`--ntasks=number`	Specify the total number of tasks (processes).
`--gres=gpu:1`	Specify the number of GPUs per node.
`--ntasks-per-node=number`	Specify the number of tasks per node.
`--ntasks-per-socket=number`	Specify the number of tasks per socket.
`--cores-per-socket=number`	Specify the number of cores per socket. Note: Each node has two CPU sockets with 18 cores and 20 cores, respectively, to support GPU workflows.
`--cpus-per-task=number`	Specify the number of threads per process for multi-threaded jobs.
`--mem=size`	Specify the memory required per node. Note: If this option is not used, the scheduler allocates approximately 9GB of memory per process.
`--partition=partition`	Request an allocation on the specified partition. If not specified, jobs are submitted to the default partition.
`--time=hh:mm:ss`	Set the wall-clock time limit for the job.

Refer to the following examples, which demonstrate different job allocation modes, including how to access the local NVMe storage.

Batch Job Examples

Requesting NVMe resources in SLURM

Each node in Garrawarla has an attached NVMe device with 890GB usable space mounted as /nvmetmp.

Request a specific amount of NVMe storage in your job script using the --gres=tmp:<some-value>g or --tmp=<some-value>g directive; you can request up to 890GB. If both options are used, only --gres is applied. You should not be able to use more NVMe space than has been allocated to you. By default, without an explicit NVMe request, a job should be allocated 1GB of /nvmetmp on the NVMe device.

The NVMe device (or the portion used by a job) is cleaned up after the job completes. IMPORTANT: Migrate any valuable results from the NVMe device before the job completes.

Terminal 4. Requesting 200GB NVMe space

$ salloc -N 1 --tmp=200g
salloc: Nodes mwa001 are ready for job
mwa001$ df -h | grep /nvmetmp
/dev/nvme0n1p1  200G     0  200G   0% /nvmetmp
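
Because the NVMe space is cleaned up when the job ends, a job script typically finishes by copying its results to permanent storage. The following is a minimal sketch of such a stage-out step; the function name stage_out and the directory and file names are hypothetical, and temporary directories stand in for /nvmetmp and /astro so that the sketch can be tried anywhere.

```shell
# Hypothetical helper: copy everything under a scratch directory to a
# permanent location, creating the destination if it does not exist.
stage_out() {
    src=$1
    dest=$2
    mkdir -p "$dest" && cp -r "$src"/. "$dest"/
}

# Demonstration with temporary directories standing in for /nvmetmp (scratch)
# and a project directory on /astro (permanent).
scratch=$(mktemp -d)
permanent=$(mktemp -d)
echo "result data" > "$scratch/output.dat"
stage_out "$scratch" "$permanent/results"
ls "$permanent/results"
```

In a real job script the call would come after the srun command that produces the data, with /nvmetmp as the source and a directory you own on /astro as the destination.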

Serial job using a single CPU core

In the following example, we assume that the serial_code is a serial application that is to be run on a single core in the workq partition. The amount of memory available for the job is adjusted since by default it would be given only about 9GB per process.

Listing 3. Running a serial code.

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=380gb
#SBATCH --time=00:01:00
#SBATCH --partition=workq
#SBATCH --account=[your-project]

#load required modules
srun -n 1 ./serial_code

OpenMP code using all available CPU cores per node

In the following example, we assume that cpu_code is an OpenMP code using all 20 CPU cores of a socket. Note: Each node has 2 CPUs, with 20 cores each. A single process with 20 OpenMP threads is run on one CPU, leaving the other CPU available for other jobs. The amount of memory is adjusted to 180GB since by default the job would be given only 9GB per process. In this example, you are allocated the CPU cores and memory of a single CPU (single NUMA node).

Note: To facilitate GPU workflows, only 38 cores are available on a node in the workq partition, with 18 cores on CPU-1 and 20 on CPU-2. For best performance with OpenMP applications, it is recommended to launch threads in a single CPU/NUMA node.

Listing 4. Running OpenMP code

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1 
#SBATCH --cores-per-socket=[some-value] # up to 18 or 20 to explicitly request CPU socket with 18 or 20 cores respectively 
#SBATCH --cpus-per-task=[some-value] # To run threaded code and should be less than or equal to the above cores-per-socket value
#SBATCH --mem=180gb
#SBATCH --time=00:01:00
#SBATCH --partition=workq
#SBATCH --account=[your-project]

export OMP_NUM_THREADS=20  # This should be equal to cpus-per-task value
srun -n 1 -c ${OMP_NUM_THREADS} ./cpu_code
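
To keep the thread count in step with the --cpus-per-task request, the value can be read from the SLURM_CPUS_PER_TASK environment variable, which Slurm sets inside a job when that option is given. The following is a minimal sketch; the helper name omp_threads_from_slurm is hypothetical, and falling back to one thread when the variable is unset is an assumption chosen for illustration.

```shell
# Hypothetical helper: print the OpenMP thread count to use, taken from
# Slurm's SLURM_CPUS_PER_TASK, or 1 when that variable is unset or empty.
omp_threads_from_slurm() {
    echo "${SLURM_CPUS_PER_TASK:-1}"
}

export OMP_NUM_THREADS="$(omp_threads_from_slurm)"
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```

With this in place, changing the --cpus-per-task directive is enough to change the number of threads, and the srun line in the listing can stay as it is.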

Non-MPI code using a single GPU

In the following example, we assume that gpu_code is a non-MPI application that can use a single GPU. The amount of memory available for the job is adjusted since by default the job is given only about 9GB per process.

Note: For best performance with OpenMP applications, it is recommended to launch threads within a single CPU (or NUMA node or socket).

Listing 5. Non MPI code using a single GPU.

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --ntasks-per-node=1
#SBATCH --cores-per-socket=[some-value] # up to 20 to explicitly request cores from a single CPU socket 
#SBATCH --cpus-per-task=[some-value] # To run threaded code and should be less than or equal to the above cores-per-socket value
#SBATCH --mem=380gb
#SBATCH --time=00:01:00
#SBATCH --partition=gpuq
#SBATCH --account=[your-project]

module load cuda
export OMP_NUM_THREADS=20 # This should be equal to cpus-per-task value
srun -n 1 -c ${OMP_NUM_THREADS} ./gpu_code

MPI code using more than one GPU

In the following example, we assume that gpu_code is an MPI application that can use a single GPU per process. Two processes are run, one per node, and we adjust the amount of memory per node since by default the job is given 9GB per process.

Listing 5. MPI code using more than one GPU

#!/bin/bash -l
#SBATCH --nodes=2
#SBATCH --gres=gpu:1
#SBATCH --ntasks-per-node=1
#SBATCH --mem=380gb
#SBATCH --time=00:01:00
#SBATCH --partition=gpuq
#SBATCH --account=[your-project]

module load cuda
srun -n 2 -N 2 ./gpu_code

OpenMP code using a single GPU and all available CPU cores

In the following example, we assume that gpu_code is an OpenMP code using a single GPU and all 20 CPU cores of a socket. Note: Each node has 2 CPUs, with 20 cores each. A single process with 20 OpenMP threads is run on one CPU, leaving the other CPU available for other jobs. The amount of memory is adjusted to 180GB since by default the job is given 9GB per process. In this example, the job is allocated the CPU cores and memory of a single CPU (single NUMA node).

Note: For best performance with OpenMP applications, it is recommended to launch threads within a single CPU (or NUMA node or socket).

Listing 6. OpenMP code using a single GPU and all available CPU cores

#!/bin/bash -l
#SBATCH --nodes=1 
#SBATCH --ntasks-per-node=1 
#SBATCH --cores-per-socket=[some-value] # up to 20 to explicitly request cores from a single CPU socket  
#SBATCH --cpus-per-task=[some-value] # To run threaded code and should be less than or equal to the above cores-per-socket value
#SBATCH --gres=gpu:1
#SBATCH --mem=180gb
#SBATCH --time=00:01:00
#SBATCH --partition=gpuq
#SBATCH --account=[your-project]

module load cuda
export OMP_NUM_THREADS=20 # This should be equal to cpus-per-task value
srun -n 1 -c ${OMP_NUM_THREADS} ./gpu_code

MPI + OpenMP code using the GPU and all available CPU cores per node

In the following example, we assume that gpu_code is an MPI + OpenMP code using a single GPU per process and capable of using OpenMP multi-threading to additionally use all CPU cores in a node. Note: There are 2 CPUs per node, with 20 cores each. The code is run on two nodes with one process per node, each using a single GPU. The amount of memory is adjusted to 180GB since by default the job would be given 9GB per process.

Listing 7. MPI + OpenMP

#!/bin/bash -l
#SBATCH --nodes=2
#SBATCH --gres=gpu:1
#SBATCH --ntasks-per-node=1
#SBATCH --cores-per-socket=[some-value] # up to 20 to explicitly request cores from a single CPU socket 
#SBATCH --cpus-per-task=[some-value] # To run threaded code and should be less than or equal to the above cores-per-socket value
#SBATCH --mem=180gb
#SBATCH --time=00:01:00
#SBATCH --partition=gpuq
#SBATCH --account=[your-project]

module load cuda
export OMP_NUM_THREADS=20 # This should be equal to cpus-per-task value
srun -n 2 -c ${OMP_NUM_THREADS} ./gpu_code

Run a job using interactive mode

As on other Pawsey systems, you can use the salloc command to run interactive sessions. You can use the #SBATCH options mentioned above to specify various interactive job parameters. For example, to run an OpenMP code using 1 GPU, you can open an interactive session with the following command:

Terminal 5. Running a job in interactive mode.

$ salloc --nodes=1 --gres=gpu:1 --ntasks-per-socket=1 --cores-per-socket=20 --cpus-per-task=20 --mem=180gb --time=00:05:00 --partition=gpuq --account=[your-project]

For all interactive sessions, after salloc has run and you are on a compute node, use the srun command to execute your commands. This is valid for all commands. For example, use srun to run the nvidia-smi command on the interactive node:

Terminal 6. Running nvidia-smi

ddeeptimahanti@mwa041:~> srun nvidia-smi
Sun May 24 16:48:39 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:D8:00.0 Off |                    0 |
| N/A   34C    P0    23W / 250W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
ddeeptimahanti@mwa041:~>

Resource Accounting

Please note the AccountBalance utility does not currently function.

Pawsey provides a tailored suite of tools called pawseytools which is already configured to be a default module upon login. The pawseyAccountBalance utility in pawseytools shows the current state of the default group's usage against their allocation and also the /astro storage quota and usage. For example:

Terminal 7. Query project information.

ddeeptimahanti@garrawarla-1:~> pawseyAccountBalance --cluster=garrawarla -p mwaops -storage
Compute Information
-------------------
          Project ID     Allocation          Usage     % used
          ----------     ----------          -----     ------
              mwaops         100000           1381        1.4

Storage Information
-------------------
/astro usage for mwaops, used = 2.09 TiB, quota = 20.00 TiB

Troubleshooting & Good Practices

Using singularity with GPUs

Use the --nv option when using Singularity on Garrawarla compute nodes.
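
As an illustration, the --nv option is placed directly after the singularity subcommand (exec in this case). The sketch below only assembles and prints such a command line so that it can be inspected before use; the helper name gpu_container_cmd and the image name my_image.sif are hypothetical.

```shell
# Hypothetical helper: print an srun command line that runs a program inside
# a Singularity container with NVIDIA GPU support enabled via --nv.
gpu_container_cmd() {
    image=$1
    shift
    echo "srun singularity exec --nv $image $*"
}

gpu_container_cmd my_image.sif nvidia-smi
```

Running the printed command inside a gpuq job should list the node's V100 if the GPU is visible from inside the container.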

Segmentation fault while running CUDA/OpenACC applications with UCX support

A segmentation fault can occur if an application is statically linked to the CUDA libraries or if memory is allocated before MPI_Init is called. As a workaround, disable the memory type cache by exporting UCX_MEMTYPE_CACHE=n.

Using ramdisk support on the compute nodes

Each node on Garrawarla has up to 50% of its memory (~185GB) mounted at /dev/shm and available as a ramdisk, which can be used to speed up large I/O-intensive computations. This resource is not tracked by Slurm, so you should clean up /dev/shm before your job exits; otherwise, the leftover files reduce the memory available to subsequent jobs on that node. Also, to be fair to other users, request cores in proportion to your ramdisk usage. For example, by default only 9GB is available per core; therefore, to use 90GB of ramdisk you should request an additional 10 cores to avoid issues for other jobs running on the same node.

Terminal 8. Temp space available.

ddeeptimahanti@mwa001:~> df -h | grep /dev/shm
tmpfs                                           189G     0  189G   0% /dev/shm
ddeeptimahanti@mwa001:~>
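
The two recommendations above (request cores in proportion to ramdisk usage, and clean up before the job exits) can be sketched as follows. This is a minimal illustration, not a Pawsey-provided tool: the helper names cores_for_ramdisk and make_ramdisk_dir are hypothetical, and the 9GB-per-core figure is the default quoted in this guide.

```shell
# Hypothetical helper: number of cores whose default 9 GB-per-core memory
# allocation covers the given ramdisk usage in GB (rounded up).
cores_for_ramdisk() {
    echo $(( ($1 + 8) / 9 ))
}

# Hypothetical helper: create a private directory under a ramdisk mount
# (default /dev/shm) and print its path. SLURM_JOB_ID is set by Slurm
# inside a job; "manual" is used when running outside one.
make_ramdisk_dir() {
    mktemp -d "${1:-/dev/shm}/job_${SLURM_JOB_ID:-manual}_XXXXXX"
}

cores_for_ramdisk 90   # prints 10, matching the example in the text
```

In a job script, a line such as scratch=$(make_ramdisk_dir) followed by trap 'rm -rf "$scratch"' EXIT would remove the directory when the script ends, whether it finishes normally or fails.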

Requesting only the required memory to allow jobs on the overlapping partitions

Each compute node has 384GB of CPU memory, of which only ~371GB is available to users' jobs through Slurm. By default, Slurm allocates 9GB of memory for each core requested. The workq partition provides only 38 cores on each node, so if a job requests all 38 cores of a node from the workq partition, Slurm automatically allocates 342GB of memory (= 38x9GB) to that job. This leaves only ~29GB (= 371-342) of memory for any GPU job that runs on that same, overlapping node. It is therefore recommended to explicitly request only the amount of memory your job requires using the --mem directive, so that the nodes can be used effectively by both CPU and GPU workflows.
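
The arithmetic in this paragraph can be reproduced with a few lines of shell. This is only a worked example using the figures quoted above (9GB per core, ~371GB usable per node); the function names are hypothetical.

```shell
MEM_PER_CORE_GB=9     # default Slurm allocation per requested core
NODE_USABLE_GB=371    # approximate memory per node available to jobs

# Memory (GB) that Slurm allocates by default for a given number of cores.
default_mem_gb() {
    echo $(( $1 * MEM_PER_CORE_GB ))
}

# Memory (GB) left on the node once a job holds the given amount.
remaining_mem_gb() {
    echo $(( NODE_USABLE_GB - $1 ))
}

default_mem_gb 38                          # 342
remaining_mem_gb "$(default_mem_gb 38)"    # 29
remaining_mem_gb 200                       # 171
```

The last line shows the effect of an explicit --mem=200g request on a 38-core workq job: 171GB stays free for a GPU job on the same node, instead of 29GB.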

The following interactive job requested 38 tasks from a single node in the workq partition. Slurm allocated mwa024 and by default provided 342GB (9GB per task) for this job.

Terminal 9. Requesting only the required memory.

ddeeptimahanti@garrawarla-1:~> salloc -p workq --ntasks-per-node=38
salloc: Granted job allocation 612214
salloc: Waiting for resource configuration
salloc: Nodes mwa024 are ready for job
ddeeptimahanti@mwa024:~> scontrol show job 612214 | grep mem
   TRES=cpu=38,mem=342G,node=1,billing=38

Now only 29GB (= 371-342) remains on mwa024 for any GPU jobs. Slurm therefore cannot allocate resources on this node to a gpuq job requesting more than 29GB; it can only honor jobs requesting 29GB of memory or less.

Terminal 10.

ddeeptimahanti@garrawarla-1:~> salloc -w mwa024 -p gpuq --gres=gpu:1 --mem=29g
salloc: Granted job allocation 612245
salloc: Waiting for resource configuration
salloc: Nodes mwa024 are ready for job

ddeeptimahanti@mwa024:~> scontrol show node mwa024 | grep mem
   CfgTRES=cpu=40,mem=380000M,billing=40,gres/gpu=1
   AllocTRES=cpu=39,mem=379904M,gres/gpu=1
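
The two TRES lines above can be turned into a single number showing how much memory is still unallocated on the node. The following is a minimal sketch that parses scontrol show node output; the helper names to_mib and node_free_mib are hypothetical, and the sketch assumes memory values carry an M or G suffix, as in the output shown in this guide.

```shell
# Convert a Slurm memory value such as 380000M or 371G to MiB.
# An empty or unrecognised value (e.g. nothing allocated yet) counts as 0.
to_mib() {
    case "$1" in
        *G) echo $(( ${1%G} * 1024 )) ;;
        *M) echo "${1%M}" ;;
        *)  echo 0 ;;
    esac
}

# Read "scontrol show node" output on stdin and print the memory (MiB)
# that is configured on the node but not yet allocated to jobs.
node_free_mib() (
    text=$(cat)
    cfg=$(printf '%s\n' "$text" | sed -n 's/.*CfgTRES=.*mem=\([0-9]*[MG]\).*/\1/p')
    alloc=$(printf '%s\n' "$text" | sed -n 's/.*AllocTRES=.*mem=\([0-9]*[MG]\).*/\1/p')
    echo $(( $(to_mib "$cfg") - $(to_mib "$alloc") ))
)

# Sample input copied from the terminal output above; prints 96.
printf '%s\n' \
    '   CfgTRES=cpu=40,mem=380000M,billing=40,gres/gpu=1' \
    '   AllocTRES=cpu=39,mem=379904M,gres/gpu=1' | node_free_mib
```

On a compute node the same function could be fed live data with scontrol show node mwa024 | node_free_mib.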

So, to allow jobs from both overlapping partitions to run on the same node, users are advised to request only the memory their jobs require.

Now, the following interactive job requested 38 tasks from a single node in the workq partition but explicitly requested 200g memory using the --mem directive:

Terminal 11.

ddeeptimahanti@garrawarla-1:~> salloc -w mwa024 -p workq --ntasks-per-node=38 --mem=200g
salloc: Granted job allocation 865353
salloc: Waiting for resource configuration
salloc: Nodes mwa024 are ready for job
ddeeptimahanti@mwa024:~> scontrol show job 865353 | grep mem
   TRES=cpu=38,mem=200G,node=1,billing=38
ddeeptimahanti@mwa024:~>

This allowed another job to be launched from the gpuq partition on the same node, requesting up to 171GB (= 371-200):

Terminal 12.

ddeeptimahanti@garrawarla-1:~> salloc -w mwa024 -p gpuq --gres=gpu:1 --mem=171g
salloc: Granted job allocation 865355
salloc: Waiting for resource configuration
salloc: Nodes mwa024 are ready for job
ddeeptimahanti@mwa024:~> vi vtk.cyg
ddeeptimahanti@mwa024:~> scontrol show node mwa024 | grep mem
   CfgTRES=cpu=40,mem=380000M,billing=40,gres/gpu=1
   AllocTRES=cpu=39,mem=371G,gres/gpu=1
