Job Scheduling

Job scheduling is the process of requesting the execution of a program on a supercomputer. A job consists of one or more related programs that are executed together.

Because a supercomputer is shared among many users, and its architecture is substantially more complex than that of a workstation, programs are not executed immediately. Instead they are submitted to a central scheduling system that decides when to run them.

Introduction

Supercomputers usually have two types of nodes, which are shared by all of the users on the system.

  • Login nodes. These are dedicated to interactive activities such as logging in, compiling, debugging, file management and job submission. Users should not run compute- or memory-intensive programs on login nodes, because doing so would prevent other users from performing their tasks normally.

  • Compute nodes. Hundreds, or even thousands, of compute nodes can be available to actually execute jobs submitted by users. A scheduling mechanism enforces a fair usage of the compute nodes by users, who essentially take turns to run their jobs.

Jobs are submitted from login nodes to the scheduler, which releases them to run as compute nodes and other supercomputing resources become available.
Figure 1. Architecture of a supercomputer

Slurm (Simple Linux Utility for Resource Management) is a batch queuing system and scheduler. Slurm is highly scalable and capable of operating a heterogeneous cluster with up to tens of millions of processors. It can sustain a throughput of more than 120,000 jobs per hour, with bursts of job submissions at several times that rate. It has built-in fault tolerance and is highly resilient to system failures. Plug-ins can be added to support various interconnects, authentication methods, schedulers and more. Slurm allows users to submit batch jobs, check on their status, and cancel them if necessary. Interactive jobs can also be submitted, and detailed estimates of expected run times can be viewed.
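
As an illustration of this day-to-day workflow, the sketch below uses the standard Slurm client commands; the script name and job ID are placeholders.

```bash
sbatch myjob.sh      # submit a batch script; Slurm prints the assigned job ID
squeue -u $USER      # list your pending and running jobs
scancel 1234567      # cancel a job using its job ID
```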

Currently, all Pawsey systems use Slurm. Other popular scheduling systems include PBS Pro and Torque, which are versions of the PBS codebase. Slurm is based on the same underlying concepts as PBS, though the syntax differs.

Resource allocation

Slurm regulates access to the computing resources on each node of a supercomputer. A resource is a CPU core, a GPU accelerator, or a quantity of RAM that the scheduler allocates for a specified period of time. You must specify the resources your job needs when you submit it to the Slurm scheduler. You can either specify the resource requirements together with the programs to execute in a batch script, or request resources to be used in an interactive shell. Once resources are granted to your job, you can use all or part of them when running an executable. To execute a program within a Slurm job, use the srun program launcher. The launcher takes care of running the program according to a specified configuration, for instance by distributing multiple processes across multiple cores and nodes. Each program execution launched with srun within a Slurm job is called a job step. This approach requires you to think about the resource requirements of your job, and their distribution across nodes, in advance.
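
As a minimal sketch of this approach (the project code and program names are hypothetical), a batch script declares its resource requirements in #SBATCH directives and then launches one or more job steps with srun:

```bash
#!/bin/bash --login
#SBATCH --account=project123       # hypothetical project code
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --mem=8G
#SBATCH --time=00:10:00

# Each srun invocation below is a separate job step within the same allocation.
srun -N 1 -n 4 ./prepare_input     # step 0: uses all 4 requested tasks
srun -N 1 -n 1 ./summarise_output  # step 1: uses only part of the allocation
```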

Slurm glossary

Table 1. Definition of main terms used by Slurm to identify technical concepts

| Term | Description |
|------|-------------|
| Account | The entity to which used resources are charged. |
| CPU | The smallest physical consumable. For multi-core machines this is the core; for multi-core machines with hyper-threading enabled it is a hardware thread. |
| GPU | Graphics Processing Unit: a specialised type of processor that can greatly accelerate computational tasks that have been optimised for GPUs. |
| GRES | Generic Resources. A GRES flag in your Slurm script allows you to allocate resources beyond the usual CPU and memory, such as GPUs. |
| Partition | Slurm groups nodes into sets called partitions. Each partition has an associated queue to which jobs are submitted in order to run. Examples include the work, long, debug and copy partitions on Setonix. |
| Task | A task under Slurm is a synonym for a process, and is often the number of MPI processes that are required. |

Submitting a job to Slurm

To perform a computation on a supercomputer, you need to specify the resource requirements for the job. To do this you can either request an interactive shell using the salloc command, or create a batch script specifying the sequence of commands and submit it for execution through the sbatch command. salloc and sbatch accept the same set of options for specifying resource requirements. Table 2 describes the most common options accepted by salloc and sbatch.

Important

It is highly recommended that you specify values for the --nodes, --ntasks, --cpus-per-task and --time options that are optimal for the job and for the system on which it will run. Also use --mem for jobs that will share node resources (shared access), or --exclusive for allocation of all node resources for a single job (exclusive access).

If you know the approximate time that your job will run for, then specify it (with a suitable margin of error) with the --time option. This enables the scheduler to make better choices when backfilling gaps in the job schedule and, ultimately, your job may start running earlier.
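
For example (the values below are purely illustrative), a shared-access request for one node might look like the following:

```bash
# Interactive session: 1 node, 16 tasks, 1 core per task, 32 GB of memory, 2 hours.
# The same options can be given to sbatch, either on the command line or as
# #SBATCH directives at the top of the batch script.
salloc --nodes=1 --ntasks=16 --cpus-per-task=1 --mem=32G --time=02:00:00
```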

For interactive jobs, both standard output (stdout) and standard error (stderr) are displayed on the terminal.

For batch jobs, the default location for stdout and stderr is a file named slurm-<jobid>.out in the working directory from which the job was submitted. (<jobid> is replaced by the numeric Slurm job ID.) To change the name of the standard output file, use the --output=<stdoutfile> option. To change the name of the standard error file, use the --error=<stderrfile> option. You can use the special token %j to include the unique job ID in the filename.
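
For example, the following directives (with a hypothetical base name) send standard output and standard error to separate files that embed the job ID:

```bash
#SBATCH --output=myjob-%j.out    # standard output; %j expands to the Slurm job ID
#SBATCH --error=myjob-%j.err     # standard error
```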

 

Table 2. Common options for sbatch and salloc

| Option | Reduced syntax | Purpose |
|--------|----------------|---------|
| --account=<project> | -A <project> | Set the project code to which the job is to be charged. A default project is configured for each Pawsey user. |
| --nodes=<N> | -N <N> | Request N nodes for the job. |
| --ntasks=<n> | -n <n> | Specify the maximum number of tasks or processes that each job step will run. On supercomputers with exclusive access to nodes, specify a multiple of the total number of cores available on a node for efficient use of resources. |
| --ntasks-per-node=<nN> | | Specify the number of tasks per node. |
| --cpus-per-task=<c> | -c <c> | Specify the number of physical or logical cores per task. |
| --mem=<size> | | Specify the real memory required per node. The value should be an integer; units can be given with the suffix K, M or G (default: megabytes). |
| --mem-per-cpu=<size> | | Specify the minimum memory required per CPU core. The value should be an integer; units can be given with the suffix K, M or G (default: megabytes). |
| --exclusive | | Grant exclusive access to all resources of the requested nodes (in contrast to the default scheduling, which shares node resources among different jobs: shared access). |
| --gres=gpu:<nG> | | Specify the required number of GPUs per node. |
| --time=<timeLimit> | -t <timeLimit> | Set the wall-clock time limit for the job (hh:mm:ss). If the job exceeds this limit, it is subject to termination. |
| --job-name=<jobName> | -J <jobName> | Set the job name (as displayed by squeue). Defaults to the name of the batch script. |
| --output=<stdoutFile> | -o <stdoutFile> | (sbatch only) Set the file name for standard output. Use the token %j to include the job ID. |
| --error=<stderrFile> | -e <stderrFile> | (sbatch only) Set the file name for standard error. Use the token %j to include the job ID. |
| --partition=<partition> | -p <partition> | Request an allocation on the specified partition. If this option is not specified, jobs are submitted to the default partition. |
| --qos=<qos> | -q <qos> | Request to run the job with a particular Quality of Service (QoS). |
| --array=<indexList> | -a <indexList> | (sbatch only) Specify an array job with the defined indices. |
| --dependency=<dependencyList> | -d <dependencyList> | Specify a job dependency. |
| --mail-type=<eventList> | | Request e-mail notification for events in eventList. Valid event values include BEGIN, END, FAIL and ALL. Multiple values can be specified in a comma-separated list. |
| --mail-user=<address> | | Specify an e-mail address for event notifications. |
| --export=<variables> | | (sbatch only) Specify which environment variables are propagated to the batch job. Valid only as a command-line option. The recommended value is NONE. |
| --distribution=<distributionMethod> | -m <distributionMethod> | Specify the method used to distribute tasks across the allocated nodes and cores. |

When running jobs through Slurm, the Unix group ownership of new files generated by the job is set to the group given to Slurm with the --account (-A) option. This is usually your project group, so others in your project can read the files if the files and the directory (and relevant parent directories) have the group-read attribute.
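
To confirm this after a job has finished, you can list the output files and check the group column (the file name below is a placeholder):

```bash
ls -l slurm-1234567.out    # the group shown should match the project given with --account
```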

Slurm partitions

It is also important to choose the right partition or queue for your job. Each partition has a different configuration and use-case, as explained in the tables below.
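
For example (the script names are placeholders), a short test run can be directed to the debug partition, while production work goes to the work partition:

```bash
sbatch --partition=debug --time=00:10:00 test_job.sh
sbatch --partition=work myjob.sh
```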

Table 3. Setonix Slurm Partitions for Production Jobs and Data Transfers

| Name | Nodes | Cores per node [sockets × cores/socket] | Available memory per node | GPUs (logical) | Job suitability | Maximum nodes per job | Maximum job duration (wall time) | Maximum concurrent jobs per user | Maximum submitted jobs per user |
|------|-------|-----------------------------------------|---------------------------|----------------|-----------------|-----------------------|----------------------------------|----------------------------------|---------------------------------|
| work | 1376 | 128 [2 × 64] | 230 GiB | - | CPU-based production jobs | - | 24 h | 256 | 1024 |
| long | 8 | 128 [2 × 64] | 230 GiB | - | Long-running CPU-based production jobs | 1 | 96 h | 4 | 96 |
| highmem | 16 | 128 [2 × 64] | 980 GiB | - | CPU-based production jobs with large memory requirements | 1 | 96 h | 2 | 96 |
| gpu | 134 | 64 [1 × 64] | 230 GiB | 8 | GPU-based production jobs | - | 24 h | 64 | 1024 |
| gpu-highmem | 38 | 64 [1 × 64] | 460 GiB | 8 | GPU-based production jobs with large host-side memory requirements | - | 24 h | 8 | 256 |
| copy | 7 | 32 [1 × 32] | 115 GiB | - | Copying large amounts of data to and from the supercomputer filesystems | - | 48 h | 4 | 500 |
| askaprt | 180 | 128 [2 × 64] | 230 GiB | - | Dedicated to the ASKAP project | - | 24 h | 8192 | 8192 |
| casda | 1 | 32 [1 × 32] | 115 GiB | - | Dedicated to the CASDA project | - | 24 h | 30 | 40 |
| mwa | 10 | 128 [2 × 64] | 230 GiB | - | Dedicated to the MWA projects | - | 24 h | 1000 | 2000 |
| mwa-asvo | 10 | 128 [2 × 64] | 230 GiB | - | Dedicated to the MWA projects | - | 24 h | 1000 | 2000 |
| mwa-gpu | 10 | 64 [1 × 64] | 230 GiB | 8 | Dedicated to the MWA projects | - | 24 h | 1000 | 2000 |
| mwa-asvocopy | 2 | 32 [1 × 32] | 115 GiB | - | Dedicated to the MWA projects | - | 48 h | 32 | 1000 |
| quantum | 4 | 288 [4 × 72] | 857 GiB | 4 | Dedicated to the Setonix-Q merit allocation scheme, quantum computing simulation, and hybrid quantum-classical workflows | 4 | 24 h | 8 | 256 |

Table 4. Setonix Slurm Partitions for Debug and Development

| Name | Nodes | Cores per node [sockets × cores/socket] | Available memory per node | GPUs (logical) | Job suitability | Maximum nodes per job | Maximum job duration (wall time) | Maximum concurrent jobs per user | Maximum submitted jobs per user |
|------|-------|-----------------------------------------|---------------------------|----------------|-----------------|-----------------------|----------------------------------|----------------------------------|---------------------------------|
| debug | 8 | 128 [2 × 64] | 230 GiB | - | Development and debugging of CPU-based code | 4 | 1 h | 1 | 4 |
| gpu-dev | 10 | 64 [1 × 64] | 230 GiB | 8 | Development and debugging of GPU-based code | 2 | 4 h | 1 | 4 |
| quantum | 4 | 288 [4 × 72] | 857 GiB | 4 | Development and porting for the GH200 architecture | 4 | 24 h | 8 | 256 |