Setonix is a supercomputer based on the HPE Cray EX architecture that was commissioned in 2020 and will be delivered over the course of 2022. It will be the next flagship supercomputer of the Pawsey Supercomputing Centre.
Setonix is the scientific name for the Quokka, a very popular animal found on Rottnest Island, Western Australia.
System Overview
The Setonix supercomputer is a heterogeneous system based on the HPE Cray EX architecture, with both its CPUs and GPUs provided by AMD. After its complete delivery, Setonix will have more than 200,000 CPU cores and 750 GPUs, with a peak computational power of 50 petaflops, 40 of which come from the GPU accelerators. Nodes will be interconnected using the Slingshot-10 interconnect, providing 100 Gb/s of bandwidth, later to be upgraded to 200 Gb/s. The AMD Infinity Fabric interconnect provides a direct channel of communication among GPUs, as well as between CPUs and GPUs.
The system will be delivered to the Pawsey Supercomputing Centre by HPE in two phases, conveniently named Phase 1 and Phase 2.
Available during Phase 1 are all of the filesystems, one-third of the CPU-only compute nodes, half of the visualisation and high-memory nodes, and four GPU-enabled nodes. The Phase 1 system has a peak capacity of 2.4 petaflops and is predominantly CPU-only, with each compute node equipped with two AMD Milan CPUs, for a total of 128 cores and 256 GB of RAM per node.
Table 1. Phase 1 of Setonix
Purpose | N. Nodes | CPU | Cores per node | RAM per node |
---|---|---|---|---|
Login | 4 | AMD Milan | 2x 64 | 256 GB |
CPU computing | 504 | AMD Milan (2.45GHz, 280W) | 2x 64 | 256 GB |
CPU high-memory | 8 | AMD Milan (2.45GHz, 280W) | 2x 64 | 1 TB |
Data movement | 8 | AMD 7502P | 1x 32 | 128 GB |
All of the filesystems are made available with Phase 1. Check the filesystem section for more details.
The Phase 2 deployment will upgrade Setonix to its full computational capacity by adding over 1000 CPU nodes and more than 750 AMD MI200 GPUs, as well as login, visualisation and data mover nodes.
Logging in
Users can access Setonix using any SSH client and their Pawsey credentials. The hostname is setonix.pawsey.org.au.
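For example, from a terminal using OpenSSH (replace username with your own Pawsey username):
$ ssh username@setonix.pawsey.org.au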
For more information, visit the page How to log into Setonix.
Hardware architecture
Login and management nodes are placed within air-cooled cabinets, whereas compute nodes are hosted in liquid-cooled cabinets. Each compute cabinet is made of eight chassis, each containing eight custom compute blades. Each compute cabinet also hosts up to 64 Slingshot switches, each in turn having 64 ports of 200 Gbps. Compute blades and network switches are connected orthogonally. All Setonix nodes are connected using the dragonfly topology.
Figure 1. Representation of a chassis in a compute cabinet, showing how switches, compute blades, node cards, and nodes relate to each other.
Each compute blade has two independent node cards, each of which hosts two compute nodes. A compute node has two AMD EPYC CPUs with 64 cores each and 256 GB of RAM. This is pictured in Figure 1.
AMD Zen3 CPU architecture
Figure 2. Cores on a Zen3-based AMD CPU are partitioned in groups of eight, all residing on a Core Chiplet Die (CCD) and sharing the same L3 cache.
Figure 3. Schematic representation of the Zen3 CPU.
The 64 cores of a Zen3 AMD CPU are evenly distributed across eight Core Chiplet Dies (CCDs), each of which has 32 MB of L3 cache shared among all the cores on that CCD. There is no limitation on the use of the L3 cache by a single Zen3 core, which can use all of it. The Zen3 CPU is composed of eight such CCDs, all connected to an additional memory and I/O controller die through the AMD Infinity Fabric. There are eight memory channels connecting to the RAM modules (DIMMs). The CPU supports 128 lanes of PCIe Gen4 and up to 32 SATA or NVMe direct-connect devices. Every two CCDs form a NUMA region. For more information about NUMA regions, check the output of the lstopo-no-gui program.
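For example, running the tool with no arguments on a node prints its full topology, including NUMA nodes, L3 caches and cores:
$ lstopo-no-gui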
Software environment
The operating system of Setonix is Cray OS, based on SLES 15 SP2. The supercomputer comes with key software packages optimised by the vendor for the HPE Cray EX architecture. These are accessible through the module system, like any other software installed system-wide; however, Pawsey staff rely on the vendor for their maintenance.
Job Scheduler
Setonix adopts the Slurm job scheduler to manage resources and grant users fair access to them. To know more about job scheduling with Slurm, visit the page Job Scheduling.
Software stack
Pawsey installs and maintains a predefined set of applications and libraries optimised for Setonix, collectively forming the Pawsey-provided software stack. The list of supported software is available in List of Supported Software. For further information, visit Software Stack.
Programming environments
On an HPE system, the Cray Programming Environment (CPE) determines which set of compilers and libraries are used when compiling and linking code. There are three available programming environments, PrgEnv-aocc, PrgEnv-cray and PrgEnv-gnu, that respectively give access to the AMD, Cray and GNU compilers (the latter loaded by default), along with a consistent set of libraries. It is up to the user to decide which programming environment is most suitable for the task at hand.
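For example, switching from the default GNU environment to the Cray one can be done with a module swap:
$ module swap PrgEnv-gnu PrgEnv-cray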
Vendor-provided libraries
For many of the popular HPC libraries and interfaces, such as BLAS and MPI, HPE Cray provides its own optimised implementations, preinstalled on Setonix. For some of those libraries, such as HDF5 and netCDF, Pawsey maintains its own builds.
There are a couple of known issues with the Cray libraries. Visit the Known Issues section for more information.
Cray MPICH
Cray MPICH is an MPI implementation optimised by Cray to take advantage of the Slingshot-10 interconnect through libfabric, and is tuned for the Cray Programming Environment. It is based on the ANL MPICH implementation, version 3.4. Users access Cray MPICH by loading the cray-mpich module.
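For example, to make the library available in the current session:
$ module load cray-mpich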
Cray LibSci
The Cray Scientific and Math Libraries (CSML, also known as LibSci) are a collection of numerical routines optimised for best performance on Cray systems. All programming environment modules load cray-libsci by default, except where noted. When possible, users should call the CSML routines in their code in place of public-domain or user-written versions. The CSML/LibSci collection contains the following scientific libraries:
- BLAS (Basic Linear Algebra Subroutines)
- LAPACK (Linear Algebra Routines)
- ScaLAPACK (Scalable LAPACK)
- NetCDF (Network Common Data Format)
- FFTW3 (the Fastest Fourier Transforms in the West, release 3)
In addition, the Cray LibSci collection contains the Iterative Refinement Toolkit (IRT) developed by Cray.
Versions are provided for all programming environments. The cray-libsci module is loaded by default and Cray LibSci will link automatically with your code, selecting the appropriate serial or multithreaded variant of the library depending on whether OpenMP is enabled and the call is made inside a parallel region. The OMP_NUM_THREADS environment variable can be used to control threading. The single-threaded version can be enforced by linking with -lsci_cray, -lsci_intel or -lsci_gnu for the Cray, Intel and GNU compilers respectively.
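As a sketch, under PrgEnv-gnu a Fortran code calling BLAS or LAPACK routines normally needs no extra flags, since cray-libsci is linked automatically; the single-threaded variant can be forced explicitly (the source file name here is hypothetical):
$ ftn -o solver solver.f90              # default: cray-libsci linked automatically
$ ftn -o solver solver.f90 -lsci_gnu    # force the single-threaded GNU variant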
Module system
A module system provides users with easy access to a variety of applications, and to different versions of the same application. Setonix adopts the LMOD module system. To see the list of currently installed software, use the command:
$ module avail
For more information on how to interact with modules, visit Modules. For a more general discussion on what software is supported by Pawsey, and how, visit Software Stack.
Environment variables
Pawsey defines a set of environment variables that may be useful when writing batch scripts or simply interacting with the supercomputer.
Table 2. Predefined variables that are available when you log into Pawsey supercomputing systems
Variable name | Purpose | Example values |
---|---|---|
PAWSEY_CLUSTER | Host name of the system | setonix |
PAWSEY_OS | Current operating system | sles15sp1 |
PAWSEY_PROJECT | Default project for the user | pawsey####, director#### |
MYSCRATCH | Default /scratch directory for the user | /scratch/$PAWSEY_PROJECT/$USER |
MYSOFTWARE | Default /software directory for the user | /software/projects/$PAWSEY_PROJECT/$USER |
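As an illustration, a batch script or interactive session can use these variables instead of hard-coded paths:
$ cd $MYSCRATCH          # move to your personal scratch directory
$ ls $MYSOFTWARE         # list your personal software directory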
Cray environment variables
The compilation process on the HPE Cray EX architecture, on which Setonix is based, adopts dynamic linking by default. This is in contrast with previous generations of Cray systems, whose compilers defaulted to static linking. To turn static linking on, set the following environment variable:
$ export CRAYPE_LINK_TYPE=static
Filesystems and data management
Setonix provides users with three main filesystems:
- The /home filesystem, where users can save personal configuration files;
- The /software filesystem, hosting the Pawsey-provided software stack, and where users can install software;
- The /scratch filesystem, a high-performance, parallel filesystem to be used for I/O operations within jobs.
Lustre filesystems are connected to compute nodes through the Slingshot fabric.
Because /scratch is a temporary storage solution, Pawsey provides users with the Acacia storage system to store data for the lifetime of their projects. It is based on the object storage paradigm, as opposed to a file storage system, and users transfer data to and from Acacia using a dedicated command-line tool. Check Pawsey Object Storage: Acacia for more information.
Available filesystems on Setonix are summarised in Table 3.
Table 3. Important filesystems mounted on Setonix
Mount point | Variable | Type | Size | Description |
---|---|---|---|---|
/scratch | $MYSCRATCH | Lustre filesystem | 14.4PB | A high-performance parallel filesystem for data processing. |
/software | $MYSOFTWARE | Lustre filesystem | 393TB | Where system and user software are installed. |
/home | $HOME | NFS | 92TB | Stores relatively small numbers of important files, such as your Linux profile and shell configuration. |
 | | | 2.8PB | Filesystem dedicated to astronomy research. |
More information about filesystems and data management can be found in File Management.
Running jobs
Setonix uses the Slurm workload manager to schedule user programs for execution. To learn the generalities of using Slurm to schedule programs on supercomputers, visit the Job Scheduling page. In addition, please read the following subsections, which discuss the peculiarities of running jobs on Setonix, together with the Example Batch Scripts for Setonix.
Important
It is highly recommended that you specify values for the --nodes, --ntasks, --cpus-per-task and --time options that are optimal for the job and for the system on which it will run. Also, use --mem if the job will not use all of the resources in the node (shared access), or --exclusive to allocate all of the resources in the requested nodes (exclusive access).
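The following is a minimal sketch of a batch script requesting these options for a shared-access job; the project code, partition, memory request and executable name are placeholders to be adapted to your own workflow:
#!/bin/bash --login
# Hypothetical project code; adjust partition, tasks, memory and time to your job.
#SBATCH --account=projectcode
#SBATCH --partition=work
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=1
#SBATCH --mem=58G
#SBATCH --time=01:00:00

srun ./my_program    # my_program is a hypothetical executable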
Overview
By default, each compute node of Setonix shares its resources, so that multiple jobs, submitted by many users from the same or different projects, can run on the node at the same time. We call this configuration shared access and, as mentioned, it is the default for Setonix nodes. Nevertheless, users can use Slurm options to override the default and explicitly request exclusive access to the requested nodes.
Nodes are grouped in partitions. Each partition is characterised by a particular configuration of its resources and is intended for a particular workload or stage of the scientific workflow. Table 4 shows the list of partitions present on Setonix Phase 1 and their available resources per node.
Each job submitted to the scheduler is assigned a Quality of Service (QoS) level, which determines the priority of the job with respect to the others in the queue. Usually, the default normal QoS applies. Users can boost the priority of jobs amounting to up to 10% of their allocation by using the high QoS, in the following way:
$ sbatch --qos=high myscript.sh
Each project has an allocation of a number of service units (SUs) per year, which is broken into quarters. Jobs submitted under a project subtract SUs from the project's allocation. A project that has entirely consumed its SUs for a given quarter of the year will run its jobs in low priority mode for that time period. If a project's SU consumption for a given quarter hits the 150% mark with respect to its granted allocation, no further jobs will be able to run under the project.
Table 4. Slurm partitions on Setonix
Name | N. Nodes | Cores per node | Available node RAM for jobs | Purpose | Wall time |
---|---|---|---|---|---|
long | 8 | 2x 64 | 230 GB | Long-running jobs. | 96h |
debug | 8 | 2x 64 | 230 GB | Development and debugging. | 1h |
work | 308 | 2x 64 | 230 GB | Production workloads. | 24h |
highmem | 8 | 2x 64 | 980 GB | Jobs that require a large amount of memory. | 24h |
copy | 8 | 1x 32 | 118 GB | Copying large amounts of data to and from the supercomputer's filesystems. | 24h |
askaprt | 180 | 2x 64 | 230 GB | Dedicated to the ASKAP project. | 24h |
Table 5. Quality of Service levels applicable to a Slurm job running on Setonix
Name | Priority Level | Description |
---|---|---|
lowest | 0 | Reserved for particular cases. |
low | 3000 | Priority for jobs past the 100% allocation usage. |
normal | 10000 | The default priority for production jobs. |
high | 14000 | Priority boost available to all projects for a fraction (10%) of their allocation. |
highest | 20000 | Assigned to jobs that are of critical interest (e.g. project part of the national response to an emergency). |
exhausted | 0 | QoS for jobs of projects that have consumed more than 150% of their allocation. |
Job Queue Limits
Users can check the limits on the maximum number of jobs that can run at a time (i.e., MaxJobs) and the maximum number of jobs that can be submitted (i.e., MaxSubmitJobs) for each partition on Setonix using the command:
$ sacctmgr show associations user=$USER cluster=setonix
Additional constraints are imposed on projects that have overused their quarterly allocation.
Executing large jobs
When executing large, multinode jobs on Setonix, the use of the --exclusive option in the batch script is recommended. The addition will result in better resource utilisation within each node assigned to the job.
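For example, a multinode job could include the following directives (node and task counts are illustrative only):
#SBATCH --exclusive
#SBATCH --nodes=4
#SBATCH --ntasks=512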
Compiling
The Cray (HPE), GNU, and AMD compilation environments are available on Setonix.
A CPE provides compiler wrappers, shown in Table 6, for both the Cray Compiling Environment (CCE) and third-party compiler drivers. When using the wrappers, the actual compiler invoked is determined by the programming environment (PrgEnv-cray, PrgEnv-aocc or PrgEnv-gnu) loaded through the module system. These compiler wrappers handle common tasks such as linking MPI and numerical libraries like BLAS/LAPACK, and cross-compilation (discussed below). The compiler wrappers compile both serial and parallel code; there is no separate MPI compiler (e.g. mpicc, mpicxx, mpif90). The wrappers also work in Makefiles and build scripts, without the need to modify them.
Use the appropriate compiler wrapper, in conjunction with the correct choice of programming environment, rather than invoking the underlying compilers (e.g. gcc) directly.
Table 6. Compiler wrappers that are available for every programming environment on an HPE Cray supercomputer
Language | Wrapper |
---|---|
C | cc |
C++ | CC |
Fortran | ftn |
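For instance, with any of the programming environments loaded, a serial C or Fortran source (file names here are hypothetical) is compiled through the wrappers listed in Table 6 rather than by invoking the underlying compiler directly:
$ cc -O2 -o hello_c hello.c
$ ftn -O2 -o hello_f hello.f90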
The Fortran compiler coming with the Cray Programming Environment is developed entirely by Cray and supports the Fortran 2018 standard (ISO/IEC 1539:2018), with some exceptions and deferred features. The C/C++ compiler is instead based on Clang/LLVM, with some Cray enhancements. For instance, the OpenMP implementation is HPE Cray proprietary. CCE compilers are documented through their man pages.
The CCE C/C++ compiler supports Unified Parallel C (UPC), an extension of the C programming language designed for high-performance computing on large-scale parallel machines.
Furthermore, the following third-party programming languages are bundled with the Programming Environment: Python 3.8.x, through the module cray-python, and R 4.0, through the module cray-R.
There are three ways to build code optimised for the compute nodes:
- through a Slurm interactive session on the compute nodes (suggested for small codes),
- through a Slurm batch job on the compute nodes, or
- interactively on the login node using the compute node-specific modules and compiler flags (again, for small codes).
We suggest always compiling code on the type of node it will run on, that is, on the compute nodes.
Compiling MPI code
As mentioned above, the wrappers are able to compile both serial and parallel code. Regardless of the selected programming environment, users compile MPI code with the same wrappers shown in Table 6, according to the programming language used.
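As a minimal sketch (the source file and resource figures are hypothetical), an MPI code written in C is compiled with the cc wrapper and launched with srun:
$ cc -o mpi_hello mpi_hello.c
$ srun -N 2 -n 256 ./mpi_hello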
Compiling OpenMP code
Users must use specific flags to compile code that makes use of OpenMP for multithreading, with a different syntax depending on the selected programming environment.
Table 7. Flags enabling OpenMP compilation for the various programming environments.
Language | PrgEnv-cray | PrgEnv-aocc | PrgEnv-gnu |
---|---|---|---|
C | cc -fopenmp hello_omp.c | cc -fopenmp hello_omp.c | cc -fopenmp hello_omp.c |
C++ | CC -fopenmp hello_omp.cpp | CC -fopenmp hello_omp.cpp | CC -fopenmp hello_omp.cpp |
Fortran | ftn -h omp hello_omp.f90 | ftn -fopenmp hello_omp.f90 | ftn -fopenmp hello_omp.f90 |
To execute OpenMP programs, set the OMP_NUM_THREADS environment variable to the number of threads to be created, and request the same number of cores using the -c (--cpus-per-task) option of srun.
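For example, assuming one of the hello_omp sources above was compiled into an executable named hello_omp, it can be run with eight threads as follows:
$ export OMP_NUM_THREADS=8
$ srun -n 1 -c 8 ./hello_omp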
Compiling OpenACC code
OpenACC is only supported by the Cray Fortran compiler and GCC for C and C++.
Compiler manual pages
Executing man cc, man CC or man ftn opens the manual page of the corresponding wrapper. Manual pages for every compiler are also available.
Profiling and optimisation
ARM Forge is available for users to profile their programs; it supports the Cray Debugger Support Tools (CDST). The Cray EX (Shasta) platform ships with numerous debugging tools for the programming environment.
A number of Cray-authored tools are included:
- Gdb4hpc: a command-line interactive parallel debugger that allows debugging of applications at scale. A good all-purpose debugger to track down bugs, analyse hangs, and determine the causes of crashes.
- Valgrind4hpc: a parallel memory debugging tool to detect memory leaks and errors in parallel applications.
- Stack Trace Analysis Tool (STAT): a single merged stack backtrace tool to analyse application behaviour at the function level. Helps trace down the cause of crashes.
- Abnormal Termination Processing (ATP): a scalable core file generation and analysis tool for analysing crashes, with a selection algorithm to determine which core files to dump. Helps determine the cause of crashes.
- Cray Comparative Debugger (CCDB): not a traditional debugger, but rather a tool to run and step through two versions of the same application side by side, to help determine where they diverge.
Accounting
The cost of running a job on Setonix is expressed in Service Units (SUs) and it is given by the following formula.
Partition Charge Rate ✕ Max(Core Proportion, Memory Proportion) ✕ Number of nodes requested ✕ Job elapsed time (hours)
where:
- Partition Charge Rate is a constant value associated with each Slurm partition,
- Core Proportion is the number of CPU cores per node requested divided by the total number of CPU cores per node,
- Memory Proportion is the amount of memory per node requested divided by the total amount of memory available per node.
For Setonix Phase 1, with CPU-only nodes, the partition charge rate is 128, because each node has 128 cores.
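As a worked example, a hypothetical job on the work partition (charge rate 128) that requests 64 of the 128 cores of one node, less than half of the node's memory, and runs for 4 hours would cost:
128 ✕ Max(64/128, Memory Proportion) ✕ 1 ✕ 4 = 128 ✕ 0.5 ✕ 1 ✕ 4 = 256 SUs.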
Maintenance
Due to the novelty of the system, users should expect regular and frequent updates of the software stack during the first year of Setonix's life.
Frequently asked questions
No FAQ at the moment.
Known issues
Instability of the Cray environment
At the moment, vendor-provided libraries such as cray-libsci and cray-fftw are unstable, and Pawsey staff will test them during the coming months. Moreover, the Cray C/C++ compiler seems not to optimise code very well. For these reasons, you should avoid using the Cray stack other than cray-mpich and crayftn, and use the alternatives provided by the Pawsey software stack instead. To compile C/C++ code, we suggest using the GCC compilers.
Linking to Cray libraries different than the default ones
Cray modulefiles do not set the LD_LIBRARY_PATH environment variable, despite the fact that Cray compilers now default to dynamic linking. As a result, programs will link at runtime to the libraries found in /opt/cray/pe/lib64, which are symlinks to the latest deployed versions.
$ cat /etc/ld.so.conf.d/cray-pe.conf
/opt/cray/pe/lib64
/opt/cray/pe/lib64/cce
To avoid this behaviour, once the desired version of the Cray libraries has been loaded with the module command, set the appropriate environment variables accordingly:
$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
$ export LIBRARY_PATH=$CRAY_LIBRARY_PATH:$LIBRARY_PATH
Threads and processes placement on Zen3 cores
Currently, Slurm presents unwanted behaviours that have an impact on the performance of a job when it is submitted without the --exclusive sbatch flag. In particular, Slurm loses awareness of the Zen3 architecture, and threads and/or processes are placed onto cores with no reasonable mapping.
To avoid the issue, pass the -m block:block:block flag to srun within a batch script or an interactive session.
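For example (task and thread counts are illustrative only, and my_program is a placeholder executable):
$ srun -N 1 -n 16 -c 8 -m block:block:block ./my_program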
Memory consumption by Cray OS on compute nodes
We have detected that in some circumstances the memory available to Slurm jobs is significantly less than what is stated in the Slurm configuration. The mismatch can lead to Slurm jobs crashing without an apparent reason. The issue is currently being investigated; any updates will be given through the technical newsletter.
The default project for a user on Setonix may have been subject to a misconfiguration.
A user's default project is determined by the content of the ~/.pawsey_project file, which then populates the $PAWSEY_PROJECT variable. That variable is then used in various processes, such as Spack software installations. Due to a bug, a wrong value may have been written to the ~/.pawsey_project file. Please verify that the content of the file corresponds to your main project (or to any project you want to be your default one).
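A quick way to verify the value and compare it with the environment variable:
$ cat ~/.pawsey_project
$ echo $PAWSEY_PROJECT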