Job Scheduling
Job scheduling is the process of requesting the execution of a program on a supercomputer. A job consists of one or more related programs that are executed together.
Because a supercomputer is shared among many users, and its architecture is substantially more complex than that of a workstation, programs are not executed immediately. Instead they are submitted to a central scheduling system that decides when to run them.
Introduction
Supercomputers usually have two types of nodes, which are shared by all of the users on the system.
Login nodes. These are dedicated to interactive activities such as logging in, compiling, debugging, file management and job submission. Users shouldn't run compute- or memory-intensive programs on login nodes, as doing so would prevent other users from performing their tasks normally.
Compute nodes. Hundreds, or even thousands, of compute nodes can be available to actually execute jobs submitted by users. A scheduling mechanism enforces a fair usage of the compute nodes by users, who essentially take turns to run their jobs.
Slurm (Simple Linux Utility for Resource Management) is a batch queuing system and scheduler. Slurm is highly scalable and capable of operating a heterogeneous cluster with up to tens of millions of processors. It can sustain job throughputs of more than 120,000 jobs per hour, with bursts of job submissions at several times that rate. It is highly tolerant of system failures, with built-in fault tolerance. Plug-ins can be added to support various interconnects, authentication methods, schedulers, and more. Slurm allows users to submit batch jobs, check on their status, and cancel them if necessary. Interactive jobs can also be submitted, and detailed estimates of expected run times can be viewed.
Currently, all Pawsey systems use Slurm. Other popular scheduling systems include PBS Pro and Torque, which are versions of the PBS codebase. Slurm is based on the same underlying concepts as PBS, though the syntax differs.
Resource allocation
Slurm regulates access to the computing resources on each node of a supercomputer. A resource is a CPU core, a GPU accelerator, or a quantity of RAM, allocated by the scheduler for a specified period of time. You must specify the resources your job needs when you submit it to the Slurm scheduler. You can either specify resource requirements together with the programs to execute in a batch script, or request resources to be used in an interactive shell. Once resources are granted to your job, you can use all or part of them when running an executable. To execute a program within a Slurm job, the srun program launcher is used. The launcher runs the program according to a specified configuration, for instance by distributing multiple processes across multiple cores and nodes. Each program execution launched with srun within a Slurm job is called a job step. This approach requires you to think in advance about your job's resource requirements and their distribution across nodes.
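As a minimal sketch of this workflow (the script name, executables and resource values are placeholders, not a prescription for any particular system), a batch script declares its resource requirements in the header and then launches job steps with srun:

```shell
#!/bin/bash --login
# Hypothetical batch script: resource requests go in #SBATCH directives;
# each srun invocation in the body is a separate job step.
#SBATCH --nodes=1             # request one compute node
#SBATCH --ntasks=4            # four tasks (e.g. MPI processes)
#SBATCH --cpus-per-task=1     # one core per task
#SBATCH --time=00:10:00       # ten-minute wall-time limit

# Step 0: run the (hypothetical) main program on all four tasks.
srun --ntasks=4 ./my_program

# Step 1: reuse part of the same allocation for a smaller follow-up step.
srun --ntasks=2 ./post_process
```

Here ./my_program and ./post_process stand in for your own executables; the point is that one job allocation can host several srun job steps, each using all or a subset of the granted resources.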
Slurm glossary
Table 1. Definition of main terms used by Slurm to identify technical concepts
| Term | Description |
|---|---|
| Account | The entity to which used resources are charged. |
| CPU | The smallest physical consumable. For multi-core machines this is the core; for multi-core machines where hyper-threading is enabled it is a hardware thread. |
| GPU | Stands for Graphics Processing Unit. A specialised type of processor that can greatly accelerate computational tasks that have been optimised for GPUs. |
| GRES | Stands for Generic Resources. A GRES flag in your Slurm script allows you to allocate resources beyond the usual CPU and memory, such as GPUs. |
| Partition | Slurm groups nodes into sets called partitions. Each partition has an associated queue to which jobs are submitted in order to run. Examples include the work, long, debug and copy partitions on Setonix. |
| Task | A task under Slurm is a synonym for a process, and is often the number of MPI processes that are required. |
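To see how these terms fit together in practice, here is a hedged fragment of a job-script header (the partition name, account name and GPU count are illustrative placeholders, not recommendations):

```shell
#SBATCH --account=project0000   # hypothetical account the job is charged to
#SBATCH --partition=gpu         # submit to an (illustrative) GPU partition
#SBATCH --gres=gpu:1            # request one GPU through the GRES mechanism
#SBATCH --ntasks=1              # one task, i.e. one process
```

Check your system's documentation for the exact partition names and the preferred way of requesting GPUs, as these vary between sites.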
Submitting a job to Slurm
To perform a computation on a supercomputer, you need to specify the resource requirements for the job. To do this, you can either request an interactive shell using the salloc command, or create a batch script specifying the sequence of commands and submit it for execution with the sbatch command. salloc and sbatch accept the same set of options for specifying resource requirements; Table 2 describes the most common ones.
Important
It is highly recommended that you specify values for the --nodes, --ntasks, --cpus-per-task and --time options that are optimal for the job and for the system on which it will run. Also use --mem for jobs that will share node resources (shared access), or --exclusive for allocation of all node resources for a single job (exclusive access).
If you know the approximate time that your job will run for, then specify it (with a suitable margin of error) with the --time option. This enables the scheduler to make better choices when backfilling gaps in the job schedule and, ultimately, your job may start running earlier.
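Putting the recommended options together, a sketch of a shared-access job script might look like the following (all values and the executable name are placeholders to adapt to your job):

```shell
#!/bin/bash --login
#SBATCH --nodes=1             # number of nodes
#SBATCH --ntasks=8            # total number of tasks
#SBATCH --cpus-per-task=1     # cores per task
#SBATCH --mem=16G             # memory request for shared node access
#SBATCH --time=01:30:00       # run-time estimate plus a suitable margin

srun ./my_program             # placeholder executable
```

The same options can be passed on the command line to salloc for an interactive session, for example `salloc --nodes=1 --ntasks=8 --time=01:30:00`; for exclusive access, replace --mem with --exclusive.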
For interactive jobs, both standard output (stdout) and standard error (stderr) are displayed on the terminal.
For batch jobs, the default location for stdout and stderr is a file named slurm-<jobid>.out in the working directory from which the job was submitted. (<jobid> is replaced by the numeric Slurm job ID.) To change the name of the standard output file, use the --output=<stdoutfile> option. To change the name of the standard error file, use the --error=<stderrfile> option. You can use the special token %j to include the unique job ID in the filename.
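As an illustration (the filename stem myjob is arbitrary), the following directives use the %j token to embed the job ID in the output filenames, separating stdout and stderr:

```shell
#SBATCH --output=myjob-%j.out   # stdout; %j expands to the numeric Slurm job ID
#SBATCH --error=myjob-%j.err    # stderr, written to a separate file
```

If --error is omitted, stderr is merged into the stdout file.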
When running jobs through Slurm, new files generated by the job are owned by the Unix group corresponding to the account given to Slurm with the --account option (-A). This is usually your project group, so others in your project can read the files provided that the files and the directory (and relevant parent directories) have the group-read permission.
Slurm partitions
It is also important to choose the right partition or queue for your job. Each partition has a different configuration and use-case, as explained in the tables below.
Table 3. Setonix Slurm Partitions for Production Jobs and Data Transfers
| Name | Nodes | Cores per node | Available memory per node | GPUs (logical) | Job suitability | Maximum nodes per job | Maximum job duration (wall time) | Maximum number of concurrent jobs per user | Maximum number of submitted jobs per user |
|---|---|---|---|---|---|---|---|---|---|
| work | 1376 | 128 [2 × 64] | 230 GiB | - | CPU-based production jobs | - | 24 h | 256 | 1024 |
| long | 8 | 128 [2 × 64] | 230 GiB | - | Long-running CPU-based production jobs | 1 | 96 h | 4 | 96 |
| highmem | 16 | 128 [2 × 64] | 980 GiB | - | CPU-based production jobs with large memory requirements | 1 | 96 h | 2 | 96 |
| gpu | 134 | 64 [1 × 64] | 230 GiB | 8 | GPU-based production jobs | - | 24 h | 64 | 1024 |
| gpu-highmem | 38 | 64 [1 × 64] | 460 GiB | 8 | GPU-based production jobs with large host-side memory requirements | - | 24 h | 8 | 256 |
| copy | 7 | 32 [1 × 32] | 115 GiB | - | Copying large amounts of data to and from the supercomputer filesystems | - | 48 h | 4 | 500 |
| askaprt | 180 | 128 [2 × 64] | 230 GiB | - | Dedicated to the ASKAP project | - | 24 h | 8192 | 8192 |
| casda | 1 | 32 [1 × 32] | 115 GiB | - | Dedicated to the CASDA project | - | 24 h | 30 | 40 |
| mwa | 10 | 128 [2 × 64] | 230 GiB | - | Dedicated to the MWA projects | - | 24 h | 1000 | 2000 |
| mwa-asvo | 10 | 128 [2 × 64] | 230 GiB | - | Dedicated to the MWA projects | - | 24 h | 1000 | 2000 |
| mwa-gpu | 10 | 64 [1 × 64] | 230 GiB | 8 | Dedicated to the MWA projects | - | 24 h | 1000 | 2000 |
| mwa-asvocopy | 2 | 32 [1 × 32] | 115 GiB | - | Dedicated to the MWA projects | - | 48 h | 32 | 1000 |
| quantum | 4 | 288 [4 × 72] | 857 GiB | 4 | Dedicated to Setonix-Q merit allocation scheme, quantum computing simulation, and hybrid quantum-classical workflows | 4 | 24 h | 8 | 256 |
Table 4. Setonix Slurm Partitions for Debug and Development
| Name | Nodes | Cores per node | Available memory per node | GPUs (logical) | Job suitability | Maximum nodes per job | Maximum job duration (wall time) | Maximum number of concurrent jobs per user | Maximum number of submitted jobs per user |
|---|---|---|---|---|---|---|---|---|---|
| debug | 8 | 128 [2 × 64] | 230 GiB | - | Development and debugging of CPU-based code | 4 | 1 h | 1 | 4 |
| gpu-dev | 10 | 64 [1 × 64] | 230 GiB | 8 | Development and debugging of GPU-based code | 2 | 4 h | 1 | 4 |
| quantum | 4 | 288 [4 × 72] | 857 GiB | 4 | Development and porting for GH200 architecture | 4 | 24 h | 8 | 256 |