
Setonix User Guide


Work in Progress for Phase-2 Documentation

The content of this section is currently being updated to include material relevant to Phase 2 of Setonix and the use of GPUs.
All existing material related to Phase 1 and the use of CPU compute nodes remains valid and up to date.

Setonix is a supercomputer based on the HPE Cray EX architecture that was commissioned in 2020 and will be delivered over the course of 2022. It will be the next flagship supercomputer of the Pawsey Supercomputing Centre.

Setonix is the scientific name for the Quokka, a very popular animal found on Rottnest Island, Western Australia.


System Overview

The Setonix supercomputer is a heterogeneous system of AMD CPUs and GPUs based on the HPE Cray EX architecture. After its complete delivery, Setonix will have more than 200,000 CPU cores and 750 GPUs, with a peak computational power of 50 petaflops, 40 of which come from the GPU accelerators. Nodes are interconnected using the Slingshot-10 interconnect, providing 100 Gb/s of bandwidth, later to be upgraded to 200 Gb/s. The AMD Infinity Fabric interconnect provides a direct channel of communication between GPUs, as well as between CPUs and GPUs.

The system will be delivered to the Pawsey Supercomputing Centre by HPE in two phases, conveniently named Phase 1 and Phase 2.

Available during Phase 1 are all of the filesystems, one-third of the CPU-only compute nodes, half of the visualisation and high-memory nodes, and four GPU-enabled nodes. The Phase 1 system has a peak capacity of 2.4 petaflops and is predominantly CPU-only, with each compute node equipped with two AMD Milan CPUs for a total of 128 cores and 256 GB of RAM.

Table 1. Phase 1 of Setonix

Reason          | N. Nodes | CPU                       | Cores per node | RAM per node
Log in          | 4        | AMD Milan                 | 2x 64          | 256 GB
CPU computing   | 504      | AMD Milan (2.45GHz, 280W) | 2x 64          | 256 GB
CPU high memory | 8        | AMD Milan (2.45GHz, 280W) | 2x 64          | 1 TB
Data movement   | 8        | AMD 7502P                 | 1x 32          | 128 GB

All of the filesystems are made available with Phase 1. Check the filesystem section for more details.

Phase 2 deployment will upgrade Setonix to its full computational capacity by adding over 1000 CPU nodes and more than 750 AMD MI200 GPUs, as well as additional login, visualisation and data mover nodes.

Logging in

Users can access Setonix using any SSH client and their Pawsey credentials. The hostname is setonix.pawsey.org.au.

For more information, visit the page How to log into Setonix.
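For example, assuming your Pawsey username is username (a placeholder), a connection can be opened from a terminal with:

    $ ssh username@setonix.pawsey.org.au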

Hardware architecture

Login and management nodes are placed within air-cooled cabinets. Compute nodes are hosted in liquid-cooled cabinets instead. Each compute cabinet is made of eight chassis, containing eight custom compute blades each. Each compute cabinet also hosts up to 64 Slingshot switches, each having in turn 64 200Gbps ports. Compute blades and network switches are connected orthogonally. All Setonix nodes are connected using the dragonfly topology.


Figure 1. Representation of a chassis in a compute cabinet, showing how switches, compute blades, node cards, and nodes relate to each other.

Each compute blade has two independent node cards, each of which hosts two compute nodes. A compute node has 2 AMD EPYC CPUs with 64 cores each and 256 GB of RAM. This is pictured in Figure 1.

AMD Zen3 CPU architecture



Figure 2. Cores on a Zen3-based AMD CPU are partitioned into groups of eight, all residing on a Core Chiplet Die (CCD) and sharing the same L3 cache.


Figure 3. Schematic representation of the Zen3 CPU.



The 64 cores of a Zen3 AMD CPU are evenly distributed across eight Core Chiplet Dies (CCDs), each of which has 32 MB of L3 cache shared among all the cores on that CCD. There is no limitation on the use of the L3 cache by a single Zen3 core, which can use all of it. The eight CCDs are connected to an additional memory and I/O controller die through the AMD Infinity Fabric. There are eight memory channels, each connecting to RAM modules (DIMMs). The CPU supports 128 lanes of PCIe Gen4 and up to 32 SATA or NVMe direct-connect devices. Every two CCDs form a NUMA region. For more information about NUMA regions, check the output of the lstopo-no-gui program.
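For example, a quick way to inspect the NUMA layout of the node you are logged into is to filter the lstopo-no-gui output (the exact listing depends on the installed hwloc version):

    $ lstopo-no-gui | grep NUMANode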


Software environment

The operating system of Setonix is Cray OS, based on SLES 15 SP2. The supercomputer comes with key software packages optimised by the vendor for the HPE Cray EX architecture. This software is accessible through the module system, like any other software installed system-wide; however, Pawsey staff rely on the vendor for its maintenance.

Job Scheduler

Setonix adopts the Slurm job scheduler to manage resources and to grant users fair access to them. To learn more about job scheduling with Slurm, visit the page Job Scheduling.

Software stack

Pawsey installs and maintains a predefined set of applications and libraries optimised for Setonix, collectively forming the Pawsey-provided software stack. The list of supported software is available in List of Supported Software. For further information, visit Software Stack.

Programming environments

On an HPE Cray system, the Cray Programming Environment (CPE) determines which set of compilers and libraries is used when compiling and linking code. There are three available programming environments, PrgEnv-aocc, PrgEnv-cray and PrgEnv-gnu, which give access to the AMD, Cray and GNU compilers respectively (PrgEnv-gnu is loaded by default), along with a consistent set of libraries. It is up to the user to decide which programming environment is most suitable for the task at hand.
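For example, to switch from the default GNU environment to the Cray environment, a module swap along these lines can be used (module names are as listed above; installed versions may vary):

    $ module swap PrgEnv-gnu PrgEnv-cray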

Vendor-provided libraries

For many popular HPC libraries and interfaces, such as BLAS and MPI, HPE Cray provides its own optimised implementations, preinstalled on Setonix. For some other libraries, such as HDF5 and netCDF, Pawsey maintains its own builds.

There are a couple of known issues with the Cray libraries. Visit the Known Issues section for more information.

Cray MPICH

Cray MPICH is an MPI implementation optimised by Cray to take advantage of the Slingshot-10 interconnect through libfabric and to integrate with the Cray Programming Environment. It is based on the ANL MPICH implementation, version 3.4. Users access Cray MPICH by loading the module cray-mpich.

Cray LibSci

The Cray Scientific and Math Libraries (CSML, also known as LibSci) are a collection of numerical routines optimised for best performance on Cray systems. All programming environment modules load cray-libsci by default, except where noted. When possible, users should call the CSML routines in their code in place of public-domain or user-written versions. The CSML/LibSci collection contains the following scientific libraries:

  • BLAS (Basic Linear Algebra Subroutines)
  • LAPACK (Linear Algebra Routines)
  • ScaLAPACK (Scalable LAPACK)
  • NetCDF (Network Common Data Format)
  • FFTW3 (the Fastest Fourier Transforms in the West, release 3)

In addition, the Cray LibSci collection contains the Iterative Refinement Toolkit (IRT) developed by Cray.

Versions are provided for all programming environments. The cray-libsci module is loaded by default, and Cray LibSci will link automatically with your code, selecting the appropriate serial or multithreaded variant of the library depending on whether OpenMP is enabled and the call is made inside a parallel region. The OMP_NUM_THREADS environment variable can be used to control threading. The single-threaded version can be enforced by linking with -lsci_cray, -lsci_intel or -lsci_gnu for the Cray, Intel and GNU compilers respectively.
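As an illustrative sketch, assuming PrgEnv-gnu is loaded and mycode.f90 is a placeholder source file, the single-threaded LibSci variant can be requested explicitly at link time:

    $ ftn mycode.f90 -o mycode -lsci_gnu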

Module system

A module system provides users with easy access to a variety of applications, and to different versions of the same application. Setonix adopts the Lmod module system. To see the list of currently installed software, use the command:

    $ module avail

For more information on how to interact with modules, visit Modules. For a more general discussion on what software is supported by Pawsey, and how, visit Software Stack.

Environment variables

Pawsey defines a set of environment variables that may be useful when writing batch scripts or simply interacting with the supercomputer.

Table 2. Predefined variables that are available when you log into Pawsey supercomputing systems

Variable name  | Purpose                                   | Example values
PAWSEY_CLUSTER | Host name of the system                   | setonix
PAWSEY_OS      | Current operating system                  | sles15sp1
PAWSEY_PROJECT | Default project for the user              | pawsey####, director####
MYSCRATCH      | Default /scratch directory for the user   | /scratch/$PAWSEY_PROJECT/$USER
MYSOFTWARE     | Default /software directory for the user  | /software/projects/$PAWSEY_PROJECT/$USER
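These variables are handy in batch scripts and interactive sessions. For example:

    $ cd $MYSCRATCH
    $ echo "Working under project $PAWSEY_PROJECT"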

Cray environment variables

The compilation process on the HPE Cray EX architecture, on which Setonix is based, adopts dynamic linking by default. This is in contrast with previous generations of Cray systems, which defaulted to static linking. To turn static linking on, set the following environment variable:

    $ export CRAYPE_LINK_TYPE=static

Filesystems and data management

The main filesystems available to users are:

  • The /home filesystem, where users can store personal configuration files;
  • The /software filesystem, hosting the Pawsey-provided software stack, and where users can install their own software;
  • The /scratch filesystem, a high-performance parallel filesystem to be used for I/O operations within jobs.

Lustre filesystems are connected to compute nodes through the Slingshot fabric.

Because /scratch is a temporary storage solution, Pawsey provides users with the Acacia storage system to store data for the lifetime of their projects. It is based on the object storage paradigm, as opposed to a file storage system, and users transfer data to and from Acacia using a dedicated command-line tool. Check Pawsey Object Storage: Acacia for more information.

Available filesystems on Setonix are summarised in Table 3.

Table 3. Important filesystems mounted on Setonix

Mount point | Variable   | Type              | Size    | Description
/scratch    | MYSCRATCH  | Lustre filesystem | 14.4 PB | A high-performance parallel filesystem for data processing.
/software   | MYSOFTWARE | Lustre filesystem | 393 TB  | Where system and user software are installed.
/home       | HOME       | NFS               | 92 TB   | Stores relatively small numbers of important files such as your Linux profile and shell configuration.
/astro      |            |                   | 2.8 PB  | Filesystem dedicated to astronomy research.

More information about filesystems and data management can be found in File Management.

Running jobs

Setonix uses the Slurm workload manager to schedule user programs for execution. To learn the generalities of using Slurm to schedule programs on supercomputers, visit the Job Scheduling page. In addition, please read the following subsections, which discuss the peculiarities of running jobs on Setonix, together with the Example Batch Scripts for Setonix.

Important

It is highly recommended that you specify values for the --nodes, --ntasks, --cpus-per-task and --time options that are optimal for the job and for the system on which it will run. Also, use --mem if the job will not use all the resources of a node (shared access), or --exclusive to request all resources in the requested nodes (exclusive access).
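As a minimal sketch of a shared-access job, assuming the work partition and a hypothetical executable ./my_program (all values should be tuned to the actual workload):

    #!/bin/bash --login
    #SBATCH --partition=work
    #SBATCH --nodes=1
    #SBATCH --ntasks=8
    #SBATCH --cpus-per-task=1
    #SBATCH --mem=16G
    #SBATCH --time=01:00:00

    srun -N 1 -n 8 -c 1 ./my_program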

Overview

Each compute node of Setonix shares its resources by default, so that multiple jobs, submitted by users from the same or different projects, can run on the node at the same time. We call this configuration shared access and, as mentioned, it is the default for Setonix nodes. Nevertheless, users can use Slurm options to override the default and explicitly request exclusive access to the requested nodes.

Nodes are grouped in partitions. Each partition is characterised by a particular configuration of its resources and is intended for a particular workload or stage of scientific workflow development. Table 4 shows the list of partitions present on Setonix Phase 1 and their available resources per node.

Each job submitted to the scheduler is assigned a Quality of Service (QoS) level, which determines the priority of the job with respect to the others in the queue. Usually, the default normal QoS applies. Users can boost the priority of their jobs, for up to 10% of their allocation, using the high QoS in the following way:

$ sbatch --qos=high myscript.sh

Each project has an allocation for a number of service units (SUs) in a year, which is broken into quarters. Jobs submitted under a project will subtract SUs from the project's allocation. A project that has entirely consumed its SUs for a given quarter of the year will run its jobs in low priority mode for that time period. If a project's SU consumption (for a given quarter) hits the 150% usage mark with respect to its granted allocation, no further jobs will be able to run under the project.


Table 4. Slurm partitions on Setonix

Name    | N. Nodes | Cores per node | Available node RAM for jobs | Reason                                                          | Wall time
long    | 8        | 2x 64          | 230 GB                      | Long-running jobs                                               | 96h
debug   | 8        | 2x 64          | 230 GB                      | Development and debugging                                       | 1h
work    | 308      | 2x 64          | 230 GB                      | Supports production level                                       | 24h
highmem | 8        | 2x 64          | 980 GB                      | Supports jobs that require a large amount of memory             | 24h
copy    | 8        | 1x 32          | 118 GB                      | Copy of large data to and from the supercomputer's filesystems  | 24h
askaprt | 180      | 2x 64          | 230 GB                      | Dedicated to the ASKAP project                                  | 24h


Table 5. Quality of Service levels applicable to a Slurm job running on Setonix

Name      | Priority Level | Description
lowest    | 0              | Reserved for particular cases.
low       | 3000           | Priority for jobs past the 100% allocation usage.
normal    | 10000          | The default priority for production jobs.
high      | 14000          | Priority boost available to all projects for a fraction (10%) of their allocation.
highest   | 20000          | Assigned to jobs that are of critical interest (e.g. a project part of the national response to an emergency).
exhausted | 0              | QoS for jobs of projects that have consumed more than 150% of their allocation.

Job Queue Limits

Users can check the limits on the maximum number of jobs they can run at a time (MaxJobs) and the maximum number of jobs they can have submitted (MaxSubmitJobs) for each partition on Setonix using the command:

$ sacctmgr show associations user=$USER cluster=setonix

Additional constraints are imposed on projects that have overused their quarterly allocation.

Executing large jobs

When executing large, multinode jobs on Setonix, the use of the --exclusive option in the batch script is recommended; it results in better resource utilisation within each node assigned to the job.
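A minimal sketch of a multinode, exclusive-access request follows; ./my_mpi_app is a placeholder executable and the resource figures are illustrative only:

    #!/bin/bash --login
    #SBATCH --partition=work
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=128
    #SBATCH --exclusive
    #SBATCH --time=02:00:00

    srun ./my_mpi_app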

Compiling

The Cray (HPE), GNU, and AMD compilation environments are available on Setonix. 

The CPE provides compiler wrappers, shown in Table 6, for both the Cray Compiling Environment (CCE) and third-party compiler drivers. When using the wrappers, the actual compiler invoked is determined by the programming environment (PrgEnv-cray, PrgEnv-aocc or PrgEnv-gnu) loaded through the module system. These compiler wrappers handle common tasks such as linking MPI and numerical libraries like BLAS/LAPACK, and cross-compilation (discussed below). The wrappers compile both serial and parallel code; there is no separate set of MPI compiler wrappers (e.g. mpicc, mpicxx, mpif90). The wrappers also work in Makefiles and build scripts without the need to modify them.


Users should not attempt to explicitly invoke specific compilers (for example, gcc). Use the appropriate compiler wrapper in conjunction with the correct choice of the programming environment.

Table 6. Compiler wrappers that are available for every programming environment on an HPE Cray supercomputer

Language | Wrapper
C        | cc
C++      | CC
Fortran  | ftn
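For example, regardless of which programming environment is loaded, serial codes can be compiled through the wrappers as follows (hello.c, hello.cpp and hello.f90 are placeholder sources):

    $ cc hello.c -o hello_c
    $ CC hello.cpp -o hello_cpp
    $ ftn hello.f90 -o hello_ftn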

The Fortran compiler coming with the Cray Programming Environment is developed entirely by Cray and supports the Fortran 2018 standard (ISO/IEC 1539:2018), with some exceptions and deferred features. The C/C++ compiler is instead based on Clang/LLVM, with some Cray enhancements. For instance, the OpenMP implementation is HPE Cray proprietary. CCE compilers are documented through their man pages.


The CCE C/C++ compiler supports Unified Parallel C (UPC), an extension of the C programming language designed for high-performance computing on large-scale parallel machines.

Furthermore, the following third-party programming languages are bundled with the Programming Environment: Python 3.8.x through the module cray-python, and R 4.0 through the module cray-R.

There are three ways to build code optimised for the compute nodes:

  • through a Slurm interactive session on the compute nodes (suggested for small codes),
  • through a Slurm batch job on the compute nodes, or
  • interactively on the login node using the compute node-specific modules and compiler flags (again, for small codes).

We suggest always compiling code on the type of node it will run on, that is, the compute nodes.
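As a sketch of the first approach, an interactive session can be requested with salloc and the compilation launched on the allocated compute node with srun (the partition, resources and the mycode.c source are illustrative):

    $ salloc --partition=work --nodes=1 --ntasks=1 --cpus-per-task=8 --time=00:30:00
    $ srun cc -O2 mycode.c -o mycode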

Compiling MPI code

As mentioned above, the wrappers compile both serial and parallel code. Regardless of the selected programming environment, users compile MPI code using the same wrappers shown in Table 6, chosen according to the programming language used.
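A minimal sketch, assuming the default programming environment and a placeholder source file mpi_hello.c:

    $ cc mpi_hello.c -o mpi_hello        # the wrapper links Cray MPICH automatically
    $ srun -N 2 -n 256 ./mpi_hello       # example launch across two full nodes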

Compiling OpenMP code

Users must use specific flags to compile code that makes use of OpenMP multithreading; the syntax differs depending on the selected programming environment.


In previous versions of the Cray compiler, OpenMP compilation was enabled by default.

Table X. Flags enabling OpenMP compilation for various compilers.

Language | PrgEnv-cray               | PrgEnv-aocc                | PrgEnv-gnu
C        | cc -fopenmp hello_omp.c   | cc -qopenmp hello_omp.c    | cc -fopenmp hello_omp.c
C++      | CC -fopenmp hello_omp.cpp | CC -qopenmp hello_omp.cpp  | CC -fopenmp hello_omp.cpp
Fortran  | ftn -h omp hello_omp.f90  | ftn -qopenmp hello_omp.f90 | ftn -fopenmp hello_omp.f90

To execute OpenMP programs, set the OMP_NUM_THREADS environment variable to the number of threads to be created, and request the same number of cores using the -c / --cpus-per-task option of srun.
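For example, to run a hypothetical hello_omp binary with eight threads on a single node:

    $ export OMP_NUM_THREADS=8
    $ srun -N 1 -n 1 -c 8 ./hello_omp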

Compiling OpenACC code

OpenACC is only supported by the Cray Fortran compiler and GCC for C and C++.

Compiler manual pages

Executing man cc, man CC or man ftn will open the manual page of the corresponding wrapper. Manual pages for every underlying compiler are also available.

Profiling and optimisation

Arm Forge is available for users to profile and debug their programs. In addition, the HPE Cray EX (Shasta) programming environment ships with the Cray Debugger Support Tools (CDST), a collection of debugging tools.

A number of Cray-authored tools are included:

  • Gdb4hpc: a command-line interactive parallel debugger that allows debugging of the application at scale; a good all-purpose debugger to track down bugs, analyse hangs, and determine the causes of crashes.
  • Valgrind4hpc: a parallel memory debugging tool to detect memory leaks and errors in parallel applications.
  • Stack Trace Analysis Tool (STAT): a single merged stack backtrace tool to analyse application behaviour at the function level; helps trace the cause of crashes.
  • Abnormal Termination Processing (ATP): a scalable core file generation and analysis tool for analysing crashes, with a selection algorithm to determine which core files to dump; helps determine the cause of crashes.
  • Cray Comparative Debugger (CCDB): not a traditional debugger, but rather a tool to run and step through two versions of the same application side by side to help determine where they diverge.

Accounting

The cost of running a job on Setonix is expressed in Service Units (SUs) and it is given by the following formula.

    Partition Charge Rate ✕ Max(Cores Proportion, Memory Proportion) ✕ N. of nodes requested ✕ Job Elapsed Time (Hours).

where:

  • Partition Charge Rate is a constant value associated with each Slurm partition,
  • Cores Proportion is the number of CPU cores per node requested divided by the total number of CPU cores per node, and
  • Memory Proportion is the amount of memory per node requested divided by the total amount of memory available per node.

For Setonix Phase 1, with CPU-only nodes, the charge rate is 128 because each node has 128 cores.
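As an illustrative example with hypothetical figures: a job on the work partition requesting 64 of the 128 cores and 115 GB of the 230 GB of usable RAM on a single node, running for 5 hours, would be charged

    128 ✕ Max(64/128, 115/230) ✕ 1 ✕ 5 = 128 ✕ 0.5 ✕ 1 ✕ 5 = 320 SU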

Maintenance

Due to the novelty of the system, users should expect regular and frequent updates of the software stack during the first year of Setonix's life.

Frequently asked questions

No FAQ at the moment.

Known issues

Instability of the Cray environment

At the moment, vendor-provided libraries such as cray-libsci and cray-fftw are unstable, and Pawsey staff will be testing them during the coming months. Moreover, the Cray C/C++ compiler does not seem to optimise code very well. For these reasons, you should avoid using the Cray stack other than cray-mpich and crayftn, and use the alternatives provided by the Pawsey software stack instead. To compile C/C++ code, we suggest using the GCC compilers.

Linking to Cray libraries different than the default ones

Cray modulefiles do not set the LD_LIBRARY_PATH environment variable, despite the fact that Cray compilers now default to dynamic linking. As a result, programs will link to the libraries found in /opt/cray/pe/lib64, which are symlinks to the latest deployed versions.

Terminal 1. Content of the LD config file.
$ cat /etc/ld.so.conf.d/cray-pe.conf
/opt/cray/pe/lib64
/opt/cray/pe/lib64/cce


To avoid this behaviour, once the desired version of the Cray libraries has been loaded with the module command, set the appropriate environment variables accordingly.

$ export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH

$ export LIBRARY_PATH=$CRAY_LIBRARY_PATH:$LIBRARY_PATH

Threads and processes placement on Zen3 cores

Currently, Slurm presents unwanted behaviour that has an impact on the performance of a job when it is submitted without the --exclusive sbatch flag. In particular, Slurm loses awareness of the Zen3 architecture, and threads and/or processes are placed onto cores with no reasonable mapping.

To avoid the issue, pass the -m block:block:block flag to srun within an sbatch script or an interactive session.

Memory consumption by Cray OS on compute nodes

We have detected that in some circumstances the memory available to Slurm jobs is significantly less than what is stated in the Slurm configuration. The mismatch can lead to Slurm jobs crashing without an apparent reason. The issue is currently being investigated; any updates will be communicated through the technical newsletter.

Possible misconfiguration of a user's default project

A user's default project is determined by the content of the ~/.pawsey_project file, which populates the $PAWSEY_PROJECT variable. This variable is then used in various processes, such as Spack software installations. Due to a bug, a wrong value may have been written to the ~/.pawsey_project file. Please verify that the content of the file corresponds to your main project (or any project you want to be your default one).


