Setonix General Information
Setonix is Pawsey's flagship supercomputer based on the HPE Cray EX architecture that was commissioned in 2020 and delivered in two phases over the course of 2022 and 2023.
Setonix is the scientific name for the Quokka, a very popular animal found on Rottnest Island, Western Australia.
System Overview
The Setonix supercomputer is a heterogeneous system consisting of AMD CPUs and GPUs, based on the HPE Cray EX architecture. It has more than 200,000 CPU cores and 750 GPUs, interconnected using the Slingshot-10 interconnect with 200Gb/s bandwidth per connection. The AMD Infinity Fabric interconnect provides a direct channel of communication among GPUs as well as between CPUs and GPUs.
The system was delivered to the Pawsey Supercomputing Centre by HPE in two phases, named Phase 1 and Phase 2. Phase 1 included all of the filesystems, one-third of the CPU-only compute nodes, as well as visualisation and high-memory nodes. Each CPU-only node is equipped with two AMD Milan CPUs for a total of 128 cores and 256GB of RAM; the 8 high-memory nodes have 1TB of RAM each. Phase 2 brings the total number of CPU nodes to 1600, including the high-memory nodes, and adds 192 GPU-enabled nodes, each with one 64-core AMD Trento CPU and 4 AMD MI250X GPU cards providing 8 logical GPUs per node.
Table 1. Setonix Node Overview
Type | N. Nodes | CPU | Cores Per Node | RAM Per Node | GPUs Per Node |
---|---|---|---|---|---|
Login | 9 | AMD Milan | 2x 64 | 256GB | n/a |
CPU computing | 1592 | AMD Milan (2.45GHz, 280W) | 2x 64 | 256GB | n/a |
CPU high memory | 8 | AMD Milan (2.45GHz, 280W) | 2x 64 | 1TB | n/a |
GPU computing | 154 | AMD Trento | 1x 64 | 256GB | 8 GCDs (from 4x "AMD MI250X" cards, each card with 2 GCDs) |
GPU high memory | 38 | AMD Trento | 1x 64 | 512GB | 8 GCDs (from 4x "AMD MI250X" cards, each card with 2 GCDs) |
Data movement | 11 | AMD 7502P | 1x 32 | 128GB | n/a |
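As a rough cross-check of the headline figures, the node counts in Table 1 can be totalled. The sketch below is illustrative only: the tuple layout and the names `NODE_TYPES`, `COMPUTE_TYPES` and `compute_totals` are made up for this example, and counting only the four compute node types is an assumption about what the headline numbers cover.

```python
# Node inventory transcribed from Table 1:
# (type, number of nodes, CPU cores per node, GCDs per node)
NODE_TYPES = [
    ("Login", 9, 128, 0),
    ("CPU computing", 1592, 128, 0),
    ("CPU high memory", 8, 128, 0),
    ("GPU computing", 154, 64, 8),
    ("GPU high memory", 38, 64, 8),
    ("Data movement", 11, 32, 0),
]

# Assumption: only these node types count towards the headline totals.
COMPUTE_TYPES = {
    "CPU computing",
    "CPU high memory",
    "GPU computing",
    "GPU high memory",
}


def compute_totals(node_types, include):
    """Return (total CPU cores, total GCDs) over the selected node types."""
    cores = sum(n * c for name, n, c, _ in node_types if name in include)
    gcds = sum(n * g for name, n, _, g in node_types if name in include)
    return cores, gcds


total_cores, total_gcds = compute_totals(NODE_TYPES, COMPUTE_TYPES)
total_cards = total_gcds // 2  # each MI250X card carries 2 GCDs
```

With the figures from Table 1 this gives 217,088 CPU cores and 1,536 GCDs (768 MI250X cards) across the compute nodes.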
More details regarding the hardware architecture and filesystems are made available in the sections below.
Hardware architecture
Login and management nodes are placed within air-cooled cabinets. Compute nodes are hosted in liquid-cooled cabinets instead. Each compute cabinet is made of eight chassis, containing eight custom compute blades each. Each compute cabinet also hosts up to 64 Slingshot switches, each having in turn 64 200Gbps ports. Compute blades and network switches are connected orthogonally. All Setonix nodes are connected using the dragonfly topology.
Figure 1. Representation of a chassis in a compute cabinet, showing how switches, compute blades, node cards, and nodes relate to each other.
Each compute blade has two independent node cards, each of which hosts two compute nodes.
AMD Zen3 CPU architecture
A CPU compute node has 2 AMD EPYC Milan CPUs with 64 cores each and 256GB of RAM. The 64 cores of a Zen3 AMD CPU (shown in Figure 2 below) are evenly distributed across eight Core Chiplet Dies (CCDs), each of which has 32MB of L3 cache shared among all the cores on that CCD (shown in Figure 3 below). A single Zen3 core is not limited in its use of the L3 cache and can use all of it. The eight CCDs are all connected to an additional memory and I/O controller die through the AMD Infinity Fabric. There are 8 memory channels, each with up to RAM circuits (DIMMs). The CPU supports 128 lanes of PCIe gen4 and up to 32 SATA or NVMe direct connect devices. Every two CCDs form a NUMA region. For more information about NUMA regions, check the output of the lstopo-no-gui program.
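The core, CCD and NUMA layout described above can be sketched as a small lookup. This is a minimal illustration and not a description of how Setonix actually numbers its cores: it assumes physical cores are numbered contiguously from 0 to 127 across both sockets, and the function name `locate_core` is hypothetical. The output of lstopo-no-gui remains the authoritative source.

```python
CORES_PER_CCD = 8      # 64 cores spread evenly across 8 CCDs
CCDS_PER_SOCKET = 8    # 8 CCDs per Zen3 CPU
CCDS_PER_NUMA = 2      # every two CCDs form a NUMA region
SOCKETS_PER_NODE = 2   # two Milan CPUs per CPU compute node


def locate_core(core_id):
    """Return (socket, ccd, numa) for a core, all indices node-wide.

    Assumption: core ids are contiguous, so cores 0-7 sit on CCD 0,
    cores 8-15 on CCD 1, and so on.
    """
    cores_per_node = CORES_PER_CCD * CCDS_PER_SOCKET * SOCKETS_PER_NODE
    if not 0 <= core_id < cores_per_node:
        raise ValueError(f"core_id must be in 0..{cores_per_node - 1}")
    ccd = core_id // CORES_PER_CCD    # 0..15 across the node
    socket = ccd // CCDS_PER_SOCKET   # 0 or 1
    numa = ccd // CCDS_PER_NUMA       # 0..7 across the node
    return socket, ccd, numa
```

Under these assumptions, each NUMA region covers 16 cores sharing two 32MB L3 caches, and a CPU node has 8 NUMA regions in total.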
Figure 2. Schematic representation of the Zen3 CPU.
Figure 3. Cores on a Zen3-based AMD CPU are partitioned in groups of eight, all residing on a Core Chiplet Die (CCD) and sharing the same L3 cache.
GPU node architecture
Each GPU compute node has one AMD Trento EPYC CPU with 64 cores and 256GB of RAM. The Trento CPU architecture is similar to the Milan CPUs in the CPU nodes, with additional support for Infinity Fabric links to the four AMD MI250X GPU cards. Each MI250X card has two "logical GPUs" for a total of 8 GPUs per node. The node architecture of the Setonix GPU nodes is pictured in Figure 4 below. Each L3 cache region is connected to a logical GPU in the MI250X GPU cards via Infinity Fabric connections. These GPUs are also closely inter-connected via numerous Infinity Fabric links, and also connect to the Slingshot NIC cards for data transfer between nodes.
Figure 4. GPU node architecture. Note that each GPU shown here is equivalent to a GCD.
Note that each MI250X has two Graphics Compute Dies (GCD) that are accessible as two logical GPUs, for a total of eight per node.
Important: GCD vs GPU
An MI250X GPU card has two GCDs. Previous generations of GPUs had only 1 GCD per GPU card, so the terms could be used interchangeably. The interchangeable usage continues even though GPU cards now have more than one GCD. Slurm, for instance, only uses the GPU terminology when referring to accelerator resources, so a request such as --gpus-per-node is equivalent to a request for a certain number of GCDs per node. On Setonix, the maximum number is 8.
The GCD architecture is shown in Figure 5 below, and consists of 110 Compute Units (CU) (for 220 per MI250X, or 880 per node) with 64GB of GPU memory (for 128GB per MI250X, or 512GB per node).
Figure 5. MI250X GCD architecture
Each Compute Unit contains 64 Stream Processors and 4 Matrix Cores, as shown below in Figure 6.
Figure 6. MI250X Compute Unit architecture
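The per-GCD, per-card and per-node figures quoted in this section follow from simple multiplication. The sketch below reproduces that arithmetic using only the numbers given above; the constant and function names are illustrative.

```python
CUS_PER_GCD = 110              # Compute Units per GCD
STREAM_PROCESSORS_PER_CU = 64
MATRIX_CORES_PER_CU = 4
MEMORY_GB_PER_GCD = 64
GCDS_PER_CARD = 2              # each MI250X card has two GCDs
CARDS_PER_NODE = 4


def gpu_totals(gcds):
    """Aggregate GPU resources for a given number of GCDs."""
    cus = gcds * CUS_PER_GCD
    return {
        "gcds": gcds,
        "compute_units": cus,
        "stream_processors": cus * STREAM_PROCESSORS_PER_CU,
        "matrix_cores": cus * MATRIX_CORES_PER_CU,
        "memory_gb": gcds * MEMORY_GB_PER_GCD,
    }


per_card = gpu_totals(GCDS_PER_CARD)
per_node = gpu_totals(GCDS_PER_CARD * CARDS_PER_NODE)
```

For a full node this yields 880 Compute Units, 56,320 Stream Processors, 3,520 Matrix Cores and 512GB of GPU memory.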
For more detail regarding the MI250X GPU architecture, refer to the AMD CDNA 2 Architecture Whitepaper.
Filesystems and data management
- The /home filesystem, where the user can save personal configuration files;
- The /software filesystem, hosting the Pawsey-provided software stack, and where users can install software;
- The /scratch filesystem, a high-performance, parallel filesystem to be used for I/O operations within jobs.
Lustre filesystems are connected to compute nodes through the Slingshot fabric.
Because /scratch is a temporary storage solution, Pawsey provides users with the Acacia storage system to store data for the lifetime of their projects. It is based on the object storage paradigm, as opposed to a file storage system, and users transfer data to and from Acacia using a dedicated command-line tool. Check Pawsey Object Storage: Acacia for more information.
Available filesystems on Setonix are summarised in Table 2.
Table 2. Important filesystems mounted on Setonix
Mount point | Variable | Type | Size | Description |
---|---|---|---|---|
/scratch | | Lustre filesystem | 14.4PB | A high-performance parallel filesystem for data processing. |
/software | | Lustre filesystem | 393TB | Where system and user software are installed. |
/home | | NFS | 92TB | Stores a relatively small number of important files, such as your Linux profile and shell configuration. |
 | | | 2.8PB | Filesystem dedicated to astronomy research. |
More information about filesystems and data management can be found in File Management.
Accounting
The cost of running a job on Setonix is expressed in Service Units (SUs) and it is given by the following formula.
Partition Charge Rate ✕ Max(Cores Proportion, Memory Proportion, GPU Proportion) ✕ N. of nodes requested ✕ Job Elapsed Time (Hours).
Where,
- Partition Charge Rate is a constant value associated with each Slurm partition,
- Core proportion is the number of CPU cores per node requested divided by the total number of CPU cores per node,
- Memory proportion is the amount of memory per node requested divided by the total amount of memory available per node,
- GPU proportion is the number of GPUs requested divided by the total number of GPUs available per node (remember that for Slurm, each GPU is equivalent to a GCD, so each GPU node has 8 GPUs available to be requested).
For Setonix CPU nodes, the charge rate is 128 SU per node hour, as each CPU node has 128 cores.
For Setonix GPU nodes, the charge rate is 512 SU per node hour, based on the difference in energy consumption between the CPU and GPU node architectures. Since there are fewer GPU nodes than CPU nodes, these GPU nodes are to be used solely for GPU-enabled codes. Thus, resource requests on GPU nodes are slightly different to CPU nodes as all requests are in units of GCDs, with 1 GCD = 1 Slurm GPU. Requests cannot be made based on memory but must be based on the number of GPUs to be used.
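The formula and charge rates above can be written as a short calculation. This is a minimal sketch of the accounting model as described in this section, not Pawsey's actual accounting code: the function names are hypothetical, and the per-node totals assume a standard 256GB CPU node and an 8-GCD GPU node.

```python
CPU_CHARGE_RATE = 128         # SU per node hour on CPU nodes
GPU_CHARGE_RATE = 512         # SU per node hour on GPU nodes
CPU_CORES_PER_NODE = 128
CPU_MEMORY_GB_PER_NODE = 256  # standard (not high-memory) CPU node
GCDS_PER_NODE = 8             # 1 GCD = 1 Slurm GPU


def cpu_job_cost(cores_per_node, memory_gb_per_node, nodes, hours):
    """Service Units charged for a job on standard CPU nodes."""
    cores_proportion = cores_per_node / CPU_CORES_PER_NODE
    memory_proportion = memory_gb_per_node / CPU_MEMORY_GB_PER_NODE
    # The larger of the two proportions determines the charge.
    proportion = max(cores_proportion, memory_proportion)
    return CPU_CHARGE_RATE * proportion * nodes * hours


def gpu_job_cost(gcds_per_node, nodes, hours):
    """Service Units charged for a job on GPU nodes.

    GPU node requests are made in GCDs only, so the GPU proportion
    is the only proportion that applies.
    """
    gpu_proportion = gcds_per_node / GCDS_PER_NODE
    return GPU_CHARGE_RATE * gpu_proportion * nodes * hours
```

For instance, `cpu_job_cost(64, 64, 2, 3)` charges for half of each of two nodes for three hours, that is 128 x 0.5 x 2 x 3 = 384 SU.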
Worked Examples of the Accounting Model
Let’s work through some CPU examples.
Example 1: Memory heavy
You request the following:
Resource | Request |
---|---|
Cores | 32 |
Memory | 128GB |
Nodes | 1 |
Hours | 4 |
On a CPU node, there are 128 cores per node. Core Proportion is calculated as requested cores / total cores, which is 32/128=0.25
On a normal CPU node (i.e. not the highmem nodes), there is 256GB of memory. Memory Proportion is calculated as requested memory / total memory, which is 128/256=0.5
Which of these is bigger? It’s the memory proportion in this case, so that’s what we will use for the equation. You always use whichever is the biggest proportion. Also remember that the Partition Charge Rate never changes.
128 x 0.5 x 1 x 4 = 256SU
Partition charge rate x memory proportion x nodes x hours = 256SU
Example 2: CPU heavy
You request the following:
Resource | Request |
---|---|
Cores | 128 |
Memory | 64GB |
Nodes | 1 |
Hours | 4 |
On a CPU node, there are 128 cores per node. Core Proportion is calculated as requested cores / total cores, which is 128/128=1
On a normal CPU node (i.e. not the highmem nodes), there is 256GB of memory. Memory Proportion is calculated as requested memory / total memory, which is 64/256=0.25
Which of these is bigger? It’s the core proportion in this case, so that’s what we will use for the equation. You always use whichever is the biggest proportion. Also remember that the Partition Charge Rate never changes.
128 x 1 x 1 x 4 = 512SU
Partition charge rate x core proportion x nodes x hours = 512SU
Example 3: GPU
Resource | Request |
---|---|
GCDs | 3 |
Memory | N/A |
Nodes | 1 |
Hours | 4 |
On a GPU node, there are 8 GCDs per node. The GPU proportion is calculated as requested GCDs / total GCDs, which is 3/8=0.375
You don't need to worry about how much memory to request as this is automatically allocated for you.
512 x 0.375 x 1 x 4 = 768SU
Partition charge rate x GPU proportion x nodes x hours = 768SU
Maintenance
Due to the cutting-edge nature of Setonix, regular and frequent updates of the software stack are expected during the first year of Setonix's operation as further optimisations and improvements are made available.