Job Scheduling
Job scheduling is the process of requesting the execution of a program on a supercomputer. A job consists of one or more related programs that are executed together. Because a supercomputer is shared among many users, and its architecture is substantially more complex than that of a workstation, programs are not executed immediately. Instead, they are submitted to a central scheduling system that decides when to run them.
Introduction
Supercomputers usually have two types of nodes, which are shared by all of the users on the system.
- Login nodes. These are dedicated to interactive activities such as logging in, compiling, debugging, file management and job submission. Users shouldn't run any compute or memory-intensive programs on login nodes because it would prevent other users from performing their tasks normally.
- Compute nodes. Hundreds, or even thousands, of compute nodes can be available to actually execute jobs submitted by users. A scheduling mechanism enforces a fair usage of the compute nodes by users, who essentially take turns to run their jobs.
Figure 1. Architecture of a supercomputer
Slurm (Simple Linux Utility for Resource Management) is a batch queuing system and scheduler. Slurm is highly scalable and capable of operating a heterogeneous cluster with up to tens of millions of processors. It can sustain job throughputs of more than 120,000 jobs per hour, with bursts of job submissions at several times that rate. It has built-in fault tolerance and is highly resilient to system failures. Plug-ins can be added to support various interconnects, authentication methods, schedulers and more. Slurm allows users to submit batch jobs, check on their status, and cancel them if necessary. Interactive jobs can also be submitted, and detailed estimates of expected run times can be viewed.
Currently, all Pawsey systems use Slurm. Other popular scheduling systems include PBS Pro and Torque, which are versions of the PBS codebase. Slurm is based on the same underlying concepts as PBS, though the syntax differs.
Resource allocation
Slurm regulates access to the computing resources on each node of a supercomputer. A resource is a CPU core, a GPU accelerator, or a quantity of RAM that is allocated by the scheduler for a specified period of time. You must specify the resources your job needs when you submit it to the Slurm scheduler. You can either specify resource requirements together with the programs to execute in a batch script, or request resources to be used in an interactive shell. Once resources are granted to your job, you can use all or part of them when running an executable. To execute a program within a Slurm job, the srun program launcher is used. The launcher takes care of running the program according to a specified configuration, for instance by partitioning multiple processes across multiple cores and nodes. Each program execution launched with srun within a Slurm job is called a Slurm step. This approach requires you to think about the resource requirements and their distribution across nodes in advance.
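As an illustrative sketch (the project code, memory request and executable name below are placeholders, not Pawsey-specific values), a single job allocation can run several Slurm steps:

#!/bin/bash --login
#SBATCH --account=project123     # hypothetical project code
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem=8G
#SBATCH --time=00:10:00

# Step 1: a quick serial check using one task of the allocation
srun --ntasks=1 hostname

# Step 2: a parallel run using all four allocated tasks
srun --ntasks=4 ./my_program     # placeholder executable

Each srun invocation above is recorded by Slurm as a separate step of the job (for example <jobid>.0 and <jobid>.1 in the accounting output).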
Slurm glossary
Table 1. Definition of main terms used by Slurm to identify technical concepts
Term | Description |
---|---|
Account | Account is used to describe the entity to which used resources are charged. |
CPU | CPU is used to describe the smallest physical consumable. For multi-core machines this will be the core. For multi-core machines where hyper-threading is enabled this will be a hardware thread. |
GPU | Stands for Graphical Processing Unit. A specialised type of processor that can greatly accelerate certain computational tasks that have been optimised for GPUs. |
GRES | Stands for 'Generic Resources'. A GRES flag in your Slurm script allows you to allocate resources beyond the usual CPU and memory, such as GPUs. |
Partition | Slurm groups nodes into sets called partitions. Each partition has an associated queue where jobs are submitted to in order to run. Examples include the work, long, debug, and copy partitions on Setonix. |
Task | A task under Slurm is a synonym for a process, and is often the number of MPI processes that are required. |
Submitting a job to Slurm
To perform a computation on a supercomputer, you need to specify the resource requirements for the job. To do this you can either request an interactive shell using the salloc command, or create a batch script specifying the sequence of commands and submit it for execution with the sbatch command. salloc and sbatch accept the same set of options for specifying resource requirements. Table 2 describes the most common options accepted by salloc and sbatch.
Important
It is highly recommended that you specify values for the --nodes, --ntasks, --cpus-per-task and --time options that are optimal for the job and for the system on which it will run. Also use --mem for jobs that will share node resources (shared access), or --exclusive to allocate all of a node's resources to a single job (exclusive access).
If you know the approximate time that your job will run for, specify it (with a suitable margin of error) using the --time option. This enables the scheduler to make better choices when backfilling gaps in the job schedule and, ultimately, your job may start running earlier.
For interactive jobs, both standard output (stdout) and standard error (stderr) are displayed on the terminal.
For batch jobs, the default location for stdout and stderr is a file named slurm-<jobid>.out in the working directory from which the job was submitted (<jobid> is replaced by the numeric Slurm job ID). To change the name of the standard output file, use the --output=<stdoutfile> option. To change the name of the standard error file, use the --error=<stderrfile> option. You can use the special token %j to include the unique job ID in the filename.
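For example, a batch script header might redirect output as follows (the filenames here are illustrative):

#SBATCH --output=myjob-%j.out   # stdout goes to, e.g., myjob-123456.out
#SBATCH --error=myjob-%j.err    # stderr goes to, e.g., myjob-123456.err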
Table 2. Common options for sbatch and salloc
When running jobs through Slurm, the Unix group ownership of new files generated by the job is set to the group given to Slurm with the --account (-A) option. This is usually your project group, so others in your project can read the files if the files and the directory (and relevant parent directories) have the group-read attribute.
Slurm partitions
It is also important to choose the right partition or queue for your job. Each partition has a different configuration and use-case, as explained in the table below.
Table 3. Overview of Setonix Slurm Partitions
Name | N. Nodes | Cores per node | Available node-RAM for jobs | GPU chiplets per node | Types of jobs supported | Max nodes per job | Max wall time | Max concurrent jobs per user | Max jobs submitted per user |
---|---|---|---|---|---|---|---|---|---|
work | 1376 | 2x 64 | 230 GB | n/a | Supports CPU-based production jobs. | - | 24h | 256 | 1024 |
long | 8 | 2x 64 | 230 GB | n/a | Long-running CPU-based production jobs. | 1 | 96h | 4 | 96 |
highmem | 8 | 2x 64 | 980 GB | n/a | Supports CPU-based production jobs that require a large amount of memory. | 1 | 96h | 2 | 96 |
debug | 8 | 2x 64 | 230 GB | n/a | Exclusive for development and debugging of CPU code and workflows. | 4 | 1h | 1 | 4 |
gpu | 124 | 1x 64 | 230 GB | 8 | Supports GPU-based production jobs. | - | 24h | - | - |
gpu-highmem | 38 | 1x 64 | 460 GB | 8 | Supports GPU-based production jobs requiring a large amount of host memory. | - | 24h | - | - |
gpu-dev | 20 | 1x 64 | 230 GB | 8 | Exclusive for development and debugging of GPU code and workflows. | - | 4h | - | - |
copy | 7 | 1x 32 | 115 GB | n/a | Copy of large data to and from the supercomputer's filesystems. | - | 48h | 4 | 2048 |
askaprt | 180 | 2x 64 | 230 GB | n/a | Dedicated to the ASKAP project (similar to work partition). | - | 24h | 8192 | 8192 |
casda | 1 | 1x 32 | 115 GB | n/a | Dedicated to the CASDA project (similar to copy partition). | - | 24h | 30 | 40 |
mwa | 10 | 2x 64 | 230 GB | n/a | Dedicated to the MWA projects (similar to work partition). | - | 24h | 1000 | 2000 |
mwa-asvo | 10 | 2x 64 | 230 GB | n/a | Dedicated to the MWA projects (similar to work partition). | - | 24h | 1000 | 2000 |
mwa-gpu | 10 | 1x 64 | 230 GB | 8 | Dedicated to the MWA projects (similar to gpu partition). | - | 24h | 1000 | 2000 |
mwa-asvocopy | 2 | 1x 32 | 115 GB | n/a | Dedicated to the MWA projects (similar to copy partition). | - | 48h | 32 | 1000 |
*Max resources refers to the limit enforced by the Slurm scheduler. Although some partitions do not enforce any limit on the number of nodes a user can request, you still need to have sufficient allocation remaining to request resources.
Debug and Development Partitions Policy
To ensure the debug and development partitions are available for use by Pawsey researchers, they are strictly reserved for the following activities: These partitions must not be used for the following activities: Note: This restriction applies regardless of the execution time of the jobs. For instance, jobs that involve testing for numerical stability, parameter optimization, or early-stage simulations should not be conducted on the debug/development partitions, even if the run times are under the partition's walltime limit.
Batch jobs
A batch job is a sequence of commands submitted for execution that do not need user input and are executed as a single unit. In Slurm terminology, each command execution is called a step and the whole command sequence is specified in a batch script.
In its simplest form, a batch script is a bash script containing the shebang line specifying which shell to use, and a single step that executes a command.
Always use bash with the --login option
The --login option makes bash configure the user environment, or in this case the job environment, so that the job executes correctly. Omitting --login may result in unwanted behaviour.
For instance, if you wanted to print a simple message using echo, you would write the batch script shown in listing 1.
#!/bin/bash --login

srun echo "Hello world!"
You can save the contents of listing 1 in a file called script.sh, and submit it for execution using sbatch as shown below:
$ sbatch --account=your_account_code --nodes=1 --ntasks=1 --mem=1840M --time=00:01:00 script.sh
While this way of specifying resource requirements and general job information is valid, the established best practice is to provide Slurm with these details within the batch script itself. The advantages are that you don't have to type the same information every time you want to submit a job, and the configuration is documented together with the commands to execute. Slurm directives are special bash comment lines starting with the #SBATCH string followed by an option specification. Directives must appear before any non-comment, non-blank line in the script. The batch script then comprises three sections: the shebang line, the Slurm directives section (also called the header), and the commands to run.
The simple job described in listing 1 and terminal 1 then becomes what is shown in listing 2 and terminal 2.
#!/bin/bash --login
#SBATCH --account=your_account_code
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:01:00

srun echo "Hello world!"
$ sbatch script.sh
You can also use a combination of Slurm directives and command-line options, with the command-line options taking precedence over the directives. This is particularly useful if you occasionally need to override option values specified in the batch script.
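For example, to override the walltime for a single submission of the script from listing 2:

$ sbatch --time=00:05:00 script.sh

Here the --time value given on the command line takes precedence over the #SBATCH --time directive inside script.sh; all other directives in the script still apply.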
The --export option
Specifying the --export option with a value other than NONE within the batch script won't work. If you want to export a variable (a practice that is discouraged), you'll need to pass --export to sbatch on the command line. The reason is that the SLURM_EXPORT_ENV environment variable is set to NONE, and it has a higher precedence than the script directive, but a lower one than the command-line option.
You should specify as many job requirements as possible in the header of the Slurm script, so that any incorrect oversubscription of resources is reported at submission time rather than at execution time.
To see more examples of batch scripts, head to the User Guide of one of our systems where you can find examples tailored for that particular environment (links are at the bottom of this page).
Interactive jobs
It is possible to run serial or parallel jobs interactively through Slurm. This is a very useful feature when developing parallel codes, debugging applications or running X applications (that is, applications with a graphical user interface). You can use the salloc command to obtain an interactive shell over a Slurm job allocation. salloc shares the most important command-line options with sbatch for indicating the required resources, but no script has to be provided. Once the interactive session is allocated, the allocated resources should be used through the srun command, as is done within ordinary Slurm batch scripts.
The following terminal shows an example of a request for an interactive Slurm session using part of the resources of the node (sharing resources):
$ salloc -p work -n 1 -N 1 -c 64 --mem=115G -A <your-project-code>
salloc: Granted job allocation 32272
salloc: Waiting for resource configuration
salloc: Nodes nid001206 are ready for job
@nid001206 $
Notice that the interactive shell is opened on one of the allocated nodes. To exit the shell, which completes the interactive job and releases all of the allocated resources, run the exit command or press Ctrl+D. Also note the use of the --mem option when requesting resources in shared access, which is the default mode. If the amount of memory is not indicated, the request may be rejected.
If exclusive use of the resources is required, it's better to indicate that explicitly (there is then no need to indicate the amount of memory, as all of the node's memory will be granted to the request):
$ salloc -p work -n 1 -N 1 -c 128 --exclusive -A <your-project-code>
salloc: Granted job allocation 32272
salloc: Waiting for resource configuration
salloc: Nodes nid001206 are ready for job
@nid001206 $
Again, the interactive shell is opened on one of the allocated nodes, and running the exit command or pressing Ctrl+D completes the interactive job and releases all of the allocated resources. Also note the use of the --exclusive option when requesting exclusive access. If exclusive access is not indicated explicitly, the request may be rejected.
Or, for example, if the development of GPU-capable code requires two GPUs:
$ salloc -p gpu-dev -N 1 --gres=gpu:2 -A <your-project-code>-gpu
salloc: Granted job allocation 32272
salloc: Waiting for resource configuration
salloc: Nodes nid002948 are ready for job
@nid002948 $
This request asks for two GPU packs. Note that requesting GPU resources on Setonix (and using them with subsequent calls of srun during the interactive session) follows a very specific procedure that is explained in Example Slurm Batch Scripts for Setonix on GPU Compute Nodes.
Launching executables
Slurm provides its own job launcher called srun. The srun command provides similar functionality to other job launchers, such as OpenMPI's mpirun, and runs the specified executable on the resources allocated to an interactive or batch Slurm job. The general syntax is srun [options] executable. The list of options accepted by srun includes the resource-specification options listed in table 2 above, although some of them hold a different meaning:
- --ntasks=<n> (or -n <n>): instructs the launcher to spawn n processes from the specified executable. When used in combination with MPI, it enables parallel and distributed computing.
- --nodes=<N> (or -N <N>): instructs the launcher to spread the spawned processes across N compute nodes.
- --cpus-per-task=<c> (or -c <c>): specifies the number of cores assigned to each process (or task, as defined by the --ntasks option). For OpenMP jobs, and multithreaded programs in general, this implies each task may have up to c threads running in parallel. For OpenMP applications, the value of --cpus-per-task should match the OMP_NUM_THREADS variable.
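As an illustrative sketch of these options working together (the project code, memory request and executable name are placeholders, not a Pawsey-specific recipe), a hybrid MPI/OpenMP job might look like:

#!/bin/bash --login
#SBATCH --account=project123      # hypothetical project code
#SBATCH --nodes=2
#SBATCH --ntasks=8                # 8 MPI processes in total
#SBATCH --cpus-per-task=4         # 4 cores (OpenMP threads) per MPI process
#SBATCH --mem=32G
#SBATCH --time=01:00:00

# Keep the OpenMP thread count consistent with --cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Repeat the resource options on srun so the launcher distributes
# the 8 tasks across the 2 nodes with 4 cores each
srun --nodes=2 --ntasks=8 --cpus-per-task=4 ./hybrid_app   # placeholder executable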
Check the Slurm documentation to find out more about the various possible options.
`srun` options for the use of GPU resources have a very specific procedure
The srun options for the use of GPU resources follow a very specific procedure that is explained in detail in Example Slurm Batch Scripts for Setonix on GPU Compute Nodes.
Job and cluster monitoring
Slurm provides other commands besides the ones already discussed to monitor the status of both jobs and the cluster.
Querying the status of a cluster (sinfo)
The sinfo command queries and prints the state of the supercomputer's nodes and partitions.
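For example (the partition name follows the Setonix naming above; the output will vary by system):

$ sinfo -p work    # state of the nodes in the work partition
$ sinfo -s         # one summary line per partition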
Querying the Slurm queue (squeue)
The squeue command queries and prints the status of all jobs currently in the queue. You can filter the results using a number of options. The information displayed can be formatted using the --format option, or by setting the SQUEUE_FORMAT environment variable.
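For example (the format string shown is just one possible layout):

$ squeue --user=$USER                              # only your jobs
$ squeue --partition=work --states=PENDING         # pending jobs in the work partition
$ squeue --format="%.10i %.9P %.20j %.8T %.10M %.6D"   # job ID, partition, name, state, elapsed time, nodes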
Checking the priority of a job
Use the squeue command to display the priority of all jobs in the queue. A lower-priority job will never delay the scheduling of a higher-priority job. Small jobs may, however, squeeze into the gaps created for larger, higher-priority jobs if Slurm is certain they will finish before the higher-priority job is due to start.
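For example, the %Q field of the --format option prints each job's priority value (the format string is illustrative):

$ squeue --user=$USER --format="%.10i %.9P %.20j %.10Q %.8T"

The sprio command, where available, breaks a job's priority down into its contributing factors, e.g. sprio -j <jobid>.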
Viewing details of a running job (scontrol)
To view detailed information about a queued or running job, use the scontrol show job command. Information such as the resources requested, submit and start times, node lists and more is available. This information is available only for queued and running jobs; to gather information about completed jobs, refer to the sacct command description.
Cancelling a job (scancel)
If you wish to stop a job that is running or delete a job in the queue, use the scancel command. The syntax is scancel [job id [job id] ...]. The command sends a signal to the specified jobs instructing them to stop: a running job is terminated, while a queued job is removed from the queue. Flexible filtering options, such as --account, --name and --user, also allow job IDs to be selected automatically based on account, job name or user name, or any combination of those. Arbitrary signals may also be sent using the --signal=<signal name> option; signals may be specified by either name or number.
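A few illustrative invocations (the job IDs are placeholders):

$ scancel 123456                          # cancel a single job
$ scancel 123456 123457                   # cancel several jobs at once
$ scancel --user=$USER --state=PENDING    # cancel all of your pending jobs
$ scancel --signal=USR1 123456            # send SIGUSR1 to a running job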
Holding jobs (scontrol hold)
To prevent a job in the queue from being scheduled for execution, use the scontrol hold subcommand. The syntax is scontrol hold <jobid>. It is not possible to hold a job that has already begun its execution.
Releasing jobs (scontrol release)
To release a job that was previously (manually) held, use the scontrol release subcommand. The syntax is scontrol release <jobid>.
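For example, with a placeholder job ID:

$ scontrol hold 123456      # job 123456 stays queued but will not be scheduled
$ scontrol release 123456   # job 123456 becomes eligible for scheduling again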
Slurm environment variables
Slurm sets environment variables that your running batch script can use.
See the sbatch man page for more environment variables.
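As a small illustration, a batch script can make use of some of the commonly available variables (these are standard Slurm variables; check the man page for the full list):

# Record where and how the job is running
echo "Job ${SLURM_JOB_ID} (${SLURM_JOB_NAME}) running on ${SLURM_JOB_NODELIST}"
echo "Tasks: ${SLURM_NTASKS}, CPUs per task: ${SLURM_CPUS_PER_TASK:-1}"

# Run from the directory the job was submitted from
cd "${SLURM_SUBMIT_DIR}"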
Reservations
With prior arrangement, for special cases, you may request a resource reservation. This gives you exclusive access to a resource, such as a node, for a predefined amount of time. If your request is successful, a named reservation will be created and you may submit jobs on the reserved resource during the allocated time period by using the --reservation option, without other users being able to do the same. See the Extraordinary Resource Request policy.
You may also view reservations on the system using the command scontrol show reservations, or view jobs in the queue for a reservation using the command squeue -R <reservation>.
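For example, with a placeholder reservation name:

$ sbatch --reservation=myreservation script.sh   # submit a job into the reservation
$ squeue -R myreservation                        # jobs queued against the reservation
$ scontrol show reservations                     # all reservations currently on the system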
Project accounting
There are a couple of ways to check a group's or user's usage against their allocation, be it time or storage.
Computing time allocation usage
You can use Slurm commands such as sacct and sreport to derive the total consumption of computing time. For more details, consult the general Slurm documentation.
Pawsey also provides a tailored suite of tools called pawseytools, which is loaded as a default module at login. The pawseyAccountBalance utility in pawseytools shows the current usage of the allocation granted to your (default) group. It also accepts options to customise the query. For example, terminal 8 lists the quarterly usage by members of project123 in service units.
$ pawseyAccountBalance -p project123 -users
Compute Information
-------------------
Project ID   Allocation     Usage   % used
----------   ----------     -----   ------
project123      1000000    372842     37.3
--user1                     356218     35.6
--user2                       7699      0.8
Terminal 9 shows how to print the usage of project123 for the whole year, by quarter.
$ pawseyAccountBalance -p project123 -yearly
Compute Information
-------------------
Project ID   Period       Usage
----------   ----------   -----
project123   2018Q1       372842
project123   2018Q2       250000
project123   2018Q1-2     622842
Terminal 10 shows how to print usage of GPU allocations by appending the -gpu suffix to the project name.
$ pawseyAccountBalance -p project123-gpu
Compute Information
-------------------
Project ID        Allocation     Usage   % used
----------        ----------     -----   ------
project0123          250000     200000     80.0
project0123-gpu     1000000     500000     50.0
Accounting data for historical, queued and running jobs can be displayed with the sacct command. To display information about a particular job, use the -j option. By default, only jobs submitted during the current day are displayed; to include earlier jobs, pass the -S option to specify an earlier start time. Additional filtering options are available, such as --account, --name and --user.
$ sacct
JobID        JobName    Partition  Account    AllocCPUS  State      ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
136593       PS1_INSR   gpuq       project123         16 TIMEOUT         0:0
136593.batch batch                 project123         16 CANCELLED      0:15
136593.exte+ extern                project123         16 COMPLETED       0:0
136593.0     gmx                   project123         16 FAILED          2:0
The sreport command can also be used to generate similar reports from the accounting data stored in the Slurm database. The output may differ from the sacct information on systems with hyper-threading enabled; in particular, the number of cores reported by sreport might need to be divided by the number of hardware threads per core.
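For example, a per-user usage report for a hypothetical project and date range might be generated with:

$ sreport cluster AccountUtilizationByUser Accounts=project123 Start=2024-01-01 End=2024-03-31 -t Hours

Keep the hyper-threading caveat above in mind when comparing these figures with pawseyAccountBalance or sacct output.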
Allocation underuse and overuse
Director's Share projects cannot execute jobs after having consumed their allocation. Jobs can be queued but they will not run. Projects that overrun their quarterly allocation in any one quarter can still run, albeit with reduced priority. A project may have more than one allocation on different machines. Each allocation is specific to a supercomputer, so the consumption of one allocation does not affect the other. For information specific to a supercomputing system, check the relevant system user guide.
High priority mode
You can submit jobs with a higher priority than usual, subject to some limitations. This is similar to the express queue feature at other supercomputing centres. There is no extra charging rate for this feature; that is, no multiplier is applied to your service unit usage. This feature is intended for short test jobs before running a large simulation, or for running short test jobs during code development. It complements reservations and should be considered before requesting one.
You do not need to contact the helpdesk to use this feature. You can pass the --qos=high option to sbatch or salloc. If the mode is not available, you will receive this error:
sbatch: error: Batch job submission failed: Invalid qos specification
Check the system-specific pages for details on how high-priority mode is implemented on various supercomputers at Pawsey.
Storage allocation usage
Several filesystems impose limits on the number of files and the amount of space a user or group is allowed to create or consume. Refer to File Management for more information.
Job provenance
Provenance is important for trusting simulation outputs or data analysis. It is necessary to have concise and complete records of the transformations of the input data to be able to reproduce the output data. This not only includes recording input data and batch scripts, but also what versions of the software were used and where the job was run. Different software versions and different hardware may give different output if the algorithm is sensitive to precision.
Batch scripts or workflow engines are preferable to interactive sessions.
Batch scripts
You should follow Best Practices on batch script Reproducibility. In particular, start the job with a pristine environment that is not inherited from your login shell, and do not use Shell Initialisation scripts to alter the job environment.
Slurm keeps some information regarding a job, such as resources used, so you do not need to separately record it in your output (unless required for convenience).
Print out the currently loaded modules in Bash at the beginning of your batch script, using the following command:
module list 2>&1
The 2>&1 sequence sends the module output to stdout instead of stderr. Print the current working directory using pwd.
The scontrol command shows helpful information for a running job, including the list of nodes, the working directory, and the input and output filenames. You can record this information before the executable starts. Not all of this information is retained for querying with sacct.
scontrol show job $SLURM_JOBID
If your batch script is short, you can copy it to stdout so that you only have one file to keep. Execute the following command near the top of the batch script:
cat $0
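Putting these pieces together, a provenance-minded batch script might begin with a preamble like the following sketch (the project code, resource values and executable name are placeholders):

#!/bin/bash --login
#SBATCH --account=project123
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=1840M
#SBATCH --time=01:00:00

# --- Provenance preamble ---
cat $0                           # copy this script into the job output
module list 2>&1                 # record the loaded modules on stdout
pwd                              # record the working directory
scontrol show job $SLURM_JOBID   # record nodes, directories and filenames

# --- Actual work ---
srun ./my_program                # placeholder executable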
It can be helpful to know if your jobs have a consistent runtime. Add the command shown in terminal 12 to the end of your batch script to record the execution time. The invoked command will display the start time and elapsed time in the output.
$ sacct -j $SLURM_JOBID -o jobid%20,Start,elapsed
               JobID               Start    Elapsed
-------------------- ------------------- ----------
             3109174 2019-07-02T08:34:36   03:32:09
      3109174.extern 2019-07-02T08:34:36   03:32:09
Note that if there are multiple srun commands in the batch script, a line will be added for each of them.
Conclusion
You should now be able to interact with the Slurm scheduler and submit jobs to it. Slurm also allows complex sequences of jobs to be orchestrated and executed, for instance by providing facilities to declare dependencies between jobs. For more information, visit Example Workflows and the example batch scripts within the user guides listed under Guides per Supercomputer.
Related pages
- Example Slurm Batch Scripts for Setonix on CPU Compute Nodes
- Example Slurm Batch Scripts for Setonix on GPU Compute Nodes
- Example Batch Scripts for Garrawarla
- File Management
- Extraordinary Resource Request