Job Scheduling

Job scheduling is the process of requesting the execution of a program on a supercomputer. A job consists of one or more related programs that are executed together. Because a supercomputer is shared among many users, and its architecture is substantially more complex than that of a workstation, programs are not executed immediately. Instead, they are submitted to a central scheduling system that decides when to run them.



Introduction

Supercomputers usually have two types of nodes, which are shared by all of the users on the system.

  • Login nodes. These are dedicated to interactive activities such as logging in, compiling, debugging, file management and job submission. Users shouldn't run any compute- or memory-intensive programs on login nodes, because doing so would prevent other users from performing their tasks normally.
  • Compute nodes. Hundreds, or even thousands, of compute nodes can be available to actually execute jobs submitted by users. A scheduling mechanism enforces a fair usage of the compute nodes by users, who essentially take turns to run their jobs.


Jobs are submitted from login nodes to the scheduler, which releases them to run as compute nodes and other supercomputing resources become available.

Figure 1. Architecture of a supercomputer

Slurm (Simple Linux Utility for Resource Management) is a batch queuing system and scheduler. Slurm is highly scalable and capable of operating a heterogeneous cluster with up to tens of millions of processors. It can sustain job throughputs of more than 120,000 jobs per hour, with bursts of job submissions at several times that rate. It is highly tolerant of system failures, with built-in fault tolerance. Plug-ins can be added to support various interconnects, authentication methods, schedulers, and more. Slurm allows users to submit batch jobs, check on their status, and cancel them if necessary. Interactive jobs can also be submitted, and estimates of expected job start times can be viewed.

Currently, all Pawsey systems use Slurm. Other popular scheduling systems include PBS Pro and Torque, which are versions of the PBS codebase. Slurm is based on the same underlying concepts as PBS, though the syntax differs.

Resource allocation

Slurm regulates access to computing resources on each node of a supercomputer. A resource is a CPU core, a GPU accelerator, or a quantity of RAM that is allocated by the scheduler for a specified period of time. You must specify the resources your job needs when submitting it to the Slurm scheduler. You can either specify resource requirements together with the programs to execute in a batch script, or request resources to be used in an interactive shell. Once resources are granted to your job, you can use all or part of them when running an executable. To execute a program within a Slurm job, the srun program launcher is used. The launcher takes care of running the program according to a specified configuration, for instance by distributing multiple processes across cores and nodes. Each program execution launched with srun within a Slurm job is called a job step. This approach requires you to think about resource requirements and their distribution across nodes in advance.
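As an illustration, here is a minimal sketch of a batch script that runs two job steps within a single allocation (./prepare_input and ./my_program are hypothetical executables):

#!/bin/bash --login
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --time=00:10:00

# First step: a serial preparation run on a single core
srun --ntasks=1 ./prepare_input

# Second step: the main computation across all four allocated tasks
srun --ntasks=4 ./my_program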

Slurm glossary


Table 1. Definition of main terms used by Slurm to identify technical concepts

Term       Description
Account    The entity to which used resources are charged.
CPU        The smallest physical consumable. For multi-core machines this is the core; on machines where hyper-threading is enabled, it is a hardware thread.
GPU        Graphics Processing Unit. A specialised type of processor that can greatly accelerate certain computational tasks that have been optimised for GPUs.
GRES       'Generic Resources'. A GRES flag in your Slurm script allows you to allocate resources beyond the usual CPU and memory, such as GPUs.
Partition  Slurm groups nodes into sets called partitions. Each partition has an associated queue to which jobs are submitted in order to run. Examples include the work, long, debug and copy partitions on Setonix.
Task       A task under Slurm is a synonym for a process, and is often the number of MPI processes that are required.

Submitting a job to Slurm

To perform a computation on a supercomputer, you need to specify the resource requirements for the job. To do this you can either request an interactive shell using the salloc command, or create a batch script specifying the sequence of commands and submit it for execution through the sbatch command. salloc and sbatch accept the same set of options for specifying resource requirements. Table 2 describes the most common options accepted by salloc and sbatch.


Important

It is highly recommended that you specify values for the --nodes, --ntasks, --cpus-per-task and --time options that are optimal for the job and for the system on which it will run. Also use --mem for jobs that will share node resources (shared access), or --exclusive for allocation of all node resources for a single job (exclusive access).

If you know the approximate time that your job will run for, then specify it (with a suitable margin of error) with the --time option. It will enable the scheduler to make better choices for backfilling gaps in the job schedule and, ultimately, your job may start running earlier.

For interactive jobs, both standard output (stdout) and standard error (stderr) are displayed on the terminal.

For batch jobs, the default location for stdout and stderr is a file named slurm-<jobid>.out in the working directory from which the job was submitted. (<jobid> is replaced by the numeric Slurm job ID.) To change the name of the standard output file, use the --output=<stdoutfile> option. To change the name of the standard error file, use the --error=<stderrfile> option. You can use the special token %j to include the unique job ID in the filename.
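For example, the following directives (a sketch; the file names are arbitrary) name both files after the job ID:

    #SBATCH --output=myjob-%j.out
    #SBATCH --error=myjob-%j.err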

Table 2. Common options for sbatch  and salloc

--account=<project> (-A <project>)
    Set the project code to which the job is to be charged. A default project is configured for each Pawsey user.

--nodes=<N> (-N <N>)
    Request N nodes for the job.

--ntasks=<n> (-n <n>)
    Specify the maximum number of tasks or processes that each job step will run. On supercomputers with exclusive access to nodes, specify a multiple of the total number of cores available on a node for efficient use of resources.

--ntasks-per-node=<n>
    Specify the number of tasks per node.

--cpus-per-task=<c> (-c <c>)
    Specify the number of physical or logical cores per task.

--mem=<size>
    Specify the real memory required per node. The given value should be an integer. Different units can be specified using the suffixes K, M or G. The default unit is megabytes.

--mem-per-cpu=<size>
    Specify the minimum memory required per CPU core. The given value should be an integer. Different units can be specified using the suffixes K, M or G. The default unit is megabytes.

--exclusive
    Indicate that all resources of the requested nodes are to be granted with exclusive access (in contrast to the default scheduling, which shares node resources among different jobs: shared access).

--gres=gpu:<N>
    Specify the required number of GPUs per node.

--time=<timeLimit> (-t <timeLimit>)
    Set the wall-clock time limit for the job (hh:mm:ss). If the job exceeds this time limit, it will be subject to termination.

--job-name=<jobName> (-J <jobName>)
    Set the job name (as it will be displayed by squeue). This defaults to the name of the batch script.

--output=<stdoutFile> (-o <stdoutFile>)
    (sbatch only) Set the file name for standard output. Use the token %j to include the job ID.

--error=<stderrFile> (-e <stderrFile>)
    (sbatch only) Set the file name for standard error. Use the token %j to include the job ID.

--partition=<partition> (-p <partition>)
    Request an allocation on the specified partition. If this option is not specified, jobs will be submitted to the default partition.

--qos=<qos> (-q <qos>)
    Request to run the job with a particular Quality of Service (QoS).

--array=<indexList> (-a <indexList>)
    (sbatch only) Specify an array job with the defined indices.

--dependency=<dependencyList> (-d <dependencyList>)
    Specify a job dependency.

--mail-type=<eventList>
    Request e-mail notifications for the events in the event list. Valid event values include BEGIN, END, FAIL and ALL. Multiple values can be specified in a comma-separated list.

--mail-user=<address>
    Specify an e-mail address for event notifications.

--export=<variables>
    (sbatch only) Specify which environment variables are propagated to the batch job. Valid only as a command-line option. The recommended value is NONE.

--distribution=<distributionMethod> (-m <distributionMethod>)
    Specify the distribution method for allocating processes across nodes and cores.

When running jobs through Slurm, the Unix group ownership of new files generated by the job corresponds to the project given to Slurm with the --account option (-A). This is usually your project group, so others in your project can read the files if the files and the directory (and relevant parent directories) have the group-read attribute.

Slurm partitions

It is also important to choose the right partition or queue for your job. Each partition has a different configuration and use-case, as explained in the table below. 

Table 3. Overview of Setonix Slurm Partitions


Name         | Nodes | Cores per node | Node RAM for jobs | GPU chiplets per node | Types of jobs supported                                                     | Max nodes per job | Max walltime | Max concurrent jobs per user | Max submitted jobs per user
work         | 1376  | 2x 64          | 230 GB            | n/a                   | Supports CPU-based production jobs.                                         | -                 | 24h          | 256                          | 1024
long         | 8     | 2x 64          | 230 GB            | n/a                   | Long-running CPU-based production jobs.                                     | 1                 | 96h          | 4                            | 96
highmem      | 8     | 2x 64          | 980 GB            | n/a                   | Supports CPU-based production jobs that require a large amount of memory.   | 1                 | 96h          | 2                            | 96
debug        | 8     | 2x 64          | 230 GB            | n/a                   | Exclusive for development and debugging of CPU code and workflows.          | 4                 | 1h           | 1                            | 4
gpu          | 124   | 1x 64          | 230 GB            | 8                     | Supports GPU-based production jobs.                                         | -                 | 24h          | -                            | -
gpu-highmem  | 38    | 1x 64          | 460 GB            | 8                     | Supports GPU-based production jobs requiring a large amount of host memory. | -                 | 24h          | -                            | -
gpu-dev      | 20    | 1x 64          | 230 GB            | 8                     | Exclusive for development and debugging of GPU code and workflows.          | -                 | 4h           | -                            | -
copy         | 7     | 1x 32          | 115 GB            | n/a                   | Copying of large data to and from the supercomputer's filesystems.          | -                 | 48h          | 4                            | 2048
askaprt      | 180   | 2x 64          | 230 GB            | n/a                   | Dedicated to the ASKAP project (similar to the work partition).             | -                 | 24h          | 8192                         | 8192
casda        | 1     | 1x 32          | 115 GB            | n/a                   | Dedicated to the CASDA project (similar to the copy partition).             | -                 | 24h          | 30                           | 40
mwa          | 10    | 2x 64          | 230 GB            | n/a                   | Dedicated to the MWA projects (similar to the work partition).              | -                 | 24h          | 1000                         | 2000
mwa-asvo     | 10    | 2x 64          | 230 GB            | n/a                   | Dedicated to the MWA projects (similar to the work partition).              | -                 | 24h          | 1000                         | 2000
mwa-gpu      | 10    | 1x 64          | 230 GB            | 8                     | Dedicated to the MWA projects (similar to the gpu partition).               | -                 | 24h          | 1000                         | 2000
mwa-asvocopy | 2     | 1x 32          | 115 GB            | n/a                   | Dedicated to the MWA projects (similar to the copy partition).              | -                 | 48h          | 32                           | 1000

*Max resources refers to the limit enforced by the Slurm scheduler. Although some partitions do not enforce any limit on the number of nodes a user can request, you still need to have sufficient allocation remaining to request resources.

Debug and Development Partitions Policy

To ensure the debug and development partitions are available for use by Pawsey researchers, they are strictly reserved for the following activities:

  • Code porting
  • Code debugging
  • Code development
  • Job script/workflow management script porting, debugging and/or development

These partitions must not be used for the following activities:

  • Production runs (i.e., jobs that are intended to generate final results or data for publication, reporting, or use in further analysis)
  • Preparatory or test runs, including but not limited to:
    • Warm-up/generation of initial conditions for simulations
    • Testing configurations, searching for optimal/stability parameters, or setting up simulations, even if the results will not be used directly.
    • Running simulations or experiments to determine production parameters for AI/ML model training (e.g., hyperparameter tuning, configuration testing, validation of stability under different settings).
    • Testing code or scripts in ways that mimic production workloads, such as large-scale simulations or model training, that are not explicitly part of the development or debugging process.

Note: This restriction applies regardless of the execution time of the jobs. For instance, jobs that involve testing for numerical stability, parameter optimization, or early-stage simulations should not be conducted on the debug/development partitions, even if the run times are under the partition's walltime limit.

Batch jobs

A batch job is a sequence of commands submitted for execution that do not need user input and are executed as a single unit. In Slurm terminology, each command execution is called a step and the whole command sequence is specified in a batch script.

In its simplest form, a batch script is a bash script containing the shebang line specifying which shell to use, and a single step that executes a command.

Always use bash with the --login option

The --login option makes bash configure the user environment (in this case, the job environment) so that the job executes correctly.
Omitting --login may result in unwanted behaviour.

For instance, if you wanted to print a simple message using echo, you would write the batch script shown in listing 1.

Listing 1. A simple batch script
#!/bin/bash --login

srun echo "Hello world!"

You can save the contents of listing 1 in a file called script.sh, and submit it for execution using sbatch as shown below:

Terminal 1. Executing a batch script using command-line options
$ sbatch --account=your_account_code --nodes=1 --ntasks=1 --mem=1840M --time=00:01:00 script.sh

While this way of specifying resource requirements and general job information is valid, the established best practice is to provide Slurm with these details within the batch script itself. The advantages are that you don't have to type the same information every time you submit a job, and that the configuration is documented together with the commands to execute. Slurm directives are special bash comment lines starting with the #SBATCH string followed by an option specification. Directives must appear before any non-comment, non-blank line in the script. The batch script then comprises three sections: the shebang line, the Slurm directives section (also called the header), and the commands to run.

The simple job described in listing 1 and terminal 1 then becomes what is shown in listing 2 and terminal 2.

Listing 2. A batch script making use of Slurm directives
#!/bin/bash --login
#SBATCH --account=your_account_code
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:01:00

srun echo "Hello world!"
Terminal 2. Executing a batch script without specifying options
$ sbatch script.sh

You can also use a combination of Slurm directives and command-line options, with the latter taking precedence over the former. This is particularly useful if you occasionally need to override option values specified in the batch script.
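For example, to override the time limit declared in listing 2 for a single submission:

    $ sbatch --time=00:05:00 script.sh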

The --export option

Specifying the --export option with a value other than NONE within the batch script won't work. If you want to export a variable, a practice that is discouraged, you'll need to pass --export to sbatch on the command line. The reason is that the SBATCH_EXPORT environment variable is set to NONE, and it takes precedence over the script directive but not over the command-line option.
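For instance, to propagate a single (hypothetical) variable MY_VAR to the job, pass it on the command line:

    $ sbatch --export=MY_VAR=42 script.sh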

You should specify as many job requirements as possible in the header of the Slurm script, so that any incorrect oversubscription of resources is reported at submission time rather than at execution time.

To see more examples of batch scripts, head to the User Guide of one of our systems where you can find examples tailored for that particular environment (links are at the bottom of this page).

Interactive jobs

It is possible to run serial or parallel jobs interactively through Slurm. This is a very useful feature when developing parallel codes, debugging applications or running X applications (that is, applications with a Graphical User Interface). You can use the salloc command to obtain an interactive shell over a Slurm job allocation. salloc shares the most important command-line options with the sbatch command for indicating the required resources, but no script has to be provided. Once the interactive session is allocated, the use of the allocated resources should be managed with the srun command, as within usual Slurm batch job scripts.

The following terminal shows an example of a request for an interactive Slurm session using part of the resources of the node (sharing resources):

Terminal 3A. Requesting an interactive session to Slurm
$ salloc -p work -n 1 -N 1 -c 64 --mem=115G -A <your-project-code>
salloc: Granted job allocation 32272
salloc: Waiting for resource configuration
salloc: Nodes nid001206 are ready for job
@nid001206 $ 
  

Notice that the interactive shell is opened on one of the allocated nodes. To exit the shell, ending the interactive job and releasing all of the allocated resources, run the exit command or press Ctrl+D. Also note the use of the --mem option when requesting resources in shared access, which is the default mode. If the amount of memory is not indicated, the request may be rejected.

If exclusive use of the resources is required, indicate that explicitly (there is then no need to indicate the amount of memory, as all of the node's memory will be granted to the request):

Terminal 3B. Requesting an interactive session to Slurm
$ salloc -p work -n 1 -N 1 -c 128 --exclusive -A <your-project-code>
salloc: Granted job allocation 32272
salloc: Waiting for resource configuration
salloc: Nodes nid001206 are ready for job
@nid001206 $ 
  

Again, the interactive shell is opened on one of the allocated nodes, and running exit or pressing Ctrl+D ends the interactive job and releases all of the allocated resources. Also note the use of the --exclusive option when requesting exclusive access. If neither exclusive access nor an amount of memory is indicated, the request may be rejected.

Or, for example, if the development of a GPU capable code requires two GPUs:

Terminal 3C. Requesting an interactive session to a GPU node
$ salloc -p gpu-dev -N 1 --gres=gpu:2 -A <your-project-code>-gpu
salloc: Granted job allocation 32272
salloc: Waiting for resource configuration
salloc: Nodes nid002948 are ready for job
@nid002948 $ 
  

This request asks for the use of two GPU-packs. Note that requesting GPU resources on Setonix (and using them with subsequent calls of srun during the interactive session) follows a very specific procedure, which is explained in Example Slurm Batch Scripts for Setonix on GPU Compute Nodes.

Launching executables

Slurm provides its own job launcher called srun. The srun command provides similar functionality to other job launchers, such as OpenMPI's mpirun, and runs the specified executable on resources allocated to an interactive or batch Slurm job. The general syntax is srun [options] executable. The options accepted by the srun command include the resource-specification options listed in table 2 above, but some of them take on a different meaning:

  • --ntasks=<n> (or -n <n>): instructs the launcher to spawn n processes from the specified executable. When used in combination with MPI, it allows parallel and distributed computing.
  • --nodes=<N> (or -N <N>): instructs the launcher to spawn processes across N compute nodes.
  • --cpus-per-task=<c> (or -c <c>): specifies the number of cores assigned to each process (each task of the --ntasks option). For OpenMP jobs, and multithreaded programs in general, this means each task may run up to c threads in parallel. For OpenMP applications, the value of --cpus-per-task should correspond to the value of the OMP_NUM_THREADS variable, as shown in the sketch below.

Check the Slurm documentation to find out more about the various possible options.
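As a minimal sketch tying these options together (assuming a hypothetical hybrid MPI/OpenMP executable ./hybrid_app), a batch script could look like this:

#!/bin/bash --login
#SBATCH --account=your_account_code
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=4
#SBATCH --time=00:10:00

# Give each task as many OpenMP threads as it has allocated cores
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# Launch 8 MPI processes across the 2 allocated nodes, 4 cores per process
srun --ntasks=8 --cpus-per-task=4 ./hybrid_app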

srun options for the use of GPU resources have a very specific procedure

The srun options for the use of GPU resources follow a very specific procedure, which is explained in detail in Example Slurm Batch Scripts for Setonix on GPU Compute Nodes.

Job and cluster monitoring

Slurm provides other commands besides the ones already discussed to monitor the status of both jobs and the cluster.

Querying the status of a cluster (sinfo)

The sinfo command queries and prints the state of the supercomputer nodes and partitions.

Terminal 4. Querying the status of a Slurm cluster
cdipietrantonio@nid001206:~> sinfo 
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
work*        up 1-00:00:00      2  down$ nid[001070-001071]
work*        up 1-00:00:00      5  down* nid[001192,001205,001212-001213,001267]
work*        up 1-00:00:00     11   down nid[001100-001103,001135,001145,001214-001215,001264-001266]
work*        up 1-00:00:00      1    mix nid001206
work*        up 1-00:00:00    297   idle nid[001008-001069,001072-001099,001104-001134,001136-001144,001146-001191,001193-001204,001207-001211,001216-001263,001268-001323]
long         up 4-00:00:00      8   idle nid[001316-001323]
copy         up 2-00:00:00      7  down* dm[02-08]
askaprt      up 1-00:00:00      3  maint nid[001488-001489,001491]
askaprt      up 1-00:00:00      1  down$ nid001490
askaprt      up 1-00:00:00    176   idle nid[001324-001487,001492-001503]
debug        up    1:00:00      3  maint nid[001004-001005,001007]
debug        up    1:00:00      1  down$ nid001006
debug        up    1:00:00      4   idle nid[001000-001003]
highmem      up 1-00:00:00      8   idle nid[001504-001511]

Each entry in the table represents a set of nodes within a partition that share the same state. The suffix * in the state code of a node indicates that the node is not responding and won't accept any new job. If the problem persists, Slurm will change its state to down. A node is in the drained state when made unavailable by the system administrator. The $ suffix indicates that the nodes or partition are in maintenance. For more information, consult the manpage of sinfo.

Querying the Slurm queue (squeue)

The squeue command queries and prints the status of all jobs currently in a queue.

You can filter the results in a number of ways.


Table 4. Filters that you can apply to the squeue command

squeue option               Description
--me                        Show only your jobs.
--account=<account list>    Filter results based on an account.
--array                     Display job arrays one element per line.
--jobs=<job list>           Comma-separated list of job IDs to display.
--long                      Display output in long format.
--name=<name list>          Filter results based on job name.
--partition=<partition>     Comma-separated list of partitions to display.
--user=<user>               Display results based on the listed user names.

The information displayed can be formatted using the --format option, or by setting the SQUEUE_FORMAT environment variable.

Terminal 5. Executing squeue with formatted output
$ export SQUEUE_FORMAT="%.6i %.10P %.8u %15a %.15j %.3t %9r %19S %.10M %.10L %.5D %.4C %Q %N"
$ squeue
JOBID PARTITION USER  ACCOUNT     NAME         ST REASON START_TIME        TIME    TIME_LEFT NODES CPUS PRIORITY NODELIST
4679  work     	user1 director100 run          R  None 2018-04-17T00:00:46 9:03:52 2:56:08   16    256   3736    nid000[39-54]
4680  work     	user1 director100 run          R  None 2018-04-17T00:01:14 9:03:24 2:56:36   32    512   3737    nid00[151-182]
4682  work     	user2 director100 script.20.00 R  None 2018-04-17T07:24:18 1:40:20 5:19:40   2     32    3764    nid00[144-145]

Checking the priority of a job

Use the squeue command to display the priority of all jobs in the queue. A lower priority job will never delay the scheduling of a higher priority job. Small jobs may squeeze into gaps created for larger, higher priority jobs if Slurm is certain they will finish before the higher priority job is due to start.

Terminal 6. Display priority of all jobs
$ squeue -o "%8i %8u %15a %.10r %.10L %.5D %.10Q" | less
JOBID    USER     ACCOUNT             REASON  TIME_LEFT NODES   PRIORITY
300731   deplazes m72                   None   22:00:13     1       4179
293572   prey     ga6              Resources 1-00:00:00     8       5406
296619   tnataraj partner981        Priority 1-00:00:00    35       1498

In the above output, job 300731 is already running. Job 293572 is higher in the queue than job 296619 due to its higher priority, and will run when resources become available. Use sprio to determine the components of the priority:

Terminal 7. List job priority details
$ sprio -j 293572 -l
  JOBID     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS   NICE
 293572     prey       5406        430       3971          5       1000          0      0
$ sprio -j 296619 -l
  JOBID     USER   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS   NICE
 296619 tnataraj       1498        246        230         23       1000          0      0

In the above output, FAIRSHARE is the dominant factor in job 293572 having the higher priority. The project that this job belongs to has used less of its quarterly allocation, so its priority is higher. The AGE column shows that the job has been in the queue longer too. Job 296619 gets a slight boost from a larger JOBSIZE, but it is not a significant factor.


Viewing details of a running job (scontrol)

To view detailed information about a queued or running job, use the scontrol subcommand show job. Information such as the resources requested, submit/start time, node list and more is available.


This information is available only for queued and running jobs. To gather information about completed jobs, refer to the sacct command description.
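For example, with a hypothetical job ID:

    $ scontrol show job 12345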

Cancelling a job (scancel)

If for some reason you wish to stop a job that is running or remove a job from the queue, use the scancel command. The syntax is scancel [job id [job id] ...]. The command sends a signal to the specified jobs to instruct them to stop: a running job will be terminated; a queued job will be removed. Flexible filtering options, such as --account, --name and --user, also permit job IDs to be selected automatically based on account, job name or user name, or any combination of those. Arbitrary signals may be sent using the --signal=<signal name> option; signals may be specified by either name or number.
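For example, to cancel two hypothetical jobs by ID, or all of your jobs with a given (hypothetical) name:

    $ scancel 12345 12346
    $ scancel --user=$USER --name=myjob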

Holding jobs (scontrol hold)

To prevent a job in the queue from being scheduled for execution, use the scontrol hold subcommand. The syntax is scontrol hold <jobid> . It is not possible to hold a job that has already begun its execution.

Releasing jobs (scontrol release)

To release a job that was previously (manually) held, use the scontrol release subcommand. The syntax is scontrol release <jobid> .
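For example, to hold and later release a hypothetical queued job:

    $ scontrol hold 12345
    $ scontrol release 12345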

Slurm environment variables 

Slurm sets environment variables that your running batch script can use.


Table 5. Environment variables Slurm defines automatically when a job is submitted for execution

Variable                  Description
SLURM_SUBMIT_DIR          The directory that the job was submitted from.
SLURM_JOB_NAME            The name of the job (as specified with --job-name=).
SLURM_JOB_ID              The unique identifier (job ID) for this job.
SLURM_JOB_NODELIST        List of node names assigned to the job.
SLURM_NTASKS              Number of tasks allocated to the job.
SLURM_JOB_CPUS_PER_NODE   Number of CPUs per node available to the job.
SLURM_JOB_NUM_NODES       Number of nodes allocated to the job.
SLURM_ARRAY_TASK_ID       This task's ID in the job array.
SLURM_ARRAY_JOB_ID        The master job ID for the job array.
SLURM_PROCID              Uniquely identifies each task; ranges from 0 to the number of tasks minus 1.

See the sbatch man page for more environment variables.
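For instance, a batch script (a sketch; ./my_program is a hypothetical executable) can use these variables to record where and how it ran:

#!/bin/bash --login
#SBATCH --ntasks=1
#SBATCH --time=00:01:00

# Record the job identity and placement in the job output
echo "Job ${SLURM_JOB_ID} (${SLURM_JOB_NAME}) is running on ${SLURM_JOB_NODELIST}"
echo "Submitted from ${SLURM_SUBMIT_DIR} with ${SLURM_NTASKS} task(s)"

srun ./my_program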

Reservations 

With prior arrangement, for special cases, you may request a resource reservation. This gives you exclusive access to a resource such as a node for a predefined amount of time. If your request is successful a named reservation will be created, and you may submit jobs on the reserved resource during the allocated time period by using the --reservation option, without other users being able to do the same. See the Extraordinary Resource Request policy.

You may also view reservations on the system using the command scontrol show reservations, or view jobs in the queue for a reservation using the command squeue -R <reservation> .
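For example, with a hypothetical reservation named courseq:

    $ sbatch --reservation=courseq script.sh
    $ squeue -R courseq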

Project accounting 

There are a couple of ways to check a group's or user's usage against their allocation, be it time or storage.

Computing time allocation usage

You can use Slurm commands such as sacct and sreport to derive the total consumption of computing time. For more details on that, consult the general Slurm documentation.

Pawsey also provides a tailored suite of tools called pawseytools which is already configured to be a default module upon login. The pawseyAccountBalance utility in pawseytools shows you the current usage of the granted allocation by your (default) group. It also accepts options to customise the query. For example, terminal 8 gives a list of the quarterly usage of members of project123 in service units.

Terminal 8. Using pawseytools to determine allocation usage
$ pawseyAccountBalance -p project123 -users

Compute Information
-------------------
Project ID   Allocation    Usage   % used
----------   ----------   ------   ------
project123      1000000   372842     37.3
--user1                   356218     35.6
--user2                     7699      0.8

Terminal 9 shows how to print the usage of project123 for the whole year, by quarter.

Terminal 9. Using pawseytools to determine quarterly usage of the allocation
$ pawseyAccountBalance -p project123 -yearly

Compute Information
-------------------
Project ID   Period    Usage
---------- ----------  ----- 
project123  2018Q1      372842
project123  2018Q2      250000
project123  2018Q1-2    622842

Terminal 10 shows how to print usage of GPU allocations by appending the -gpu suffix to the project name.

Terminal 10. Using pawseytools to determine GPU allocation usage
$ pawseyAccountBalance -p project123-gpu  

Compute Information
-------------------
     Project ID     Allocation          Usage     % used
     ----------     ----------          -----     ------
    project0123         250000         200000       80.0
project0123-gpu        1000000         500000       50.0
 

Accounting data for historical, queued and running jobs can be displayed with the sacct command. To display information regarding a particular job, use the -j option. By default, only jobs submitted during the current day will be displayed. To find jobs earlier than that, you can pass the -S option to specify a different start time. Additional filtering options are available, like --account, --name, and --user.

Terminal 11. Using sacct to query jobs information
$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
136593         PS1_INSR       gpuq project123         16    TIMEOUT      0:0 
136593.batch      batch            project123         16  CANCELLED     0:15 
136593.exte+     extern            project123         16  COMPLETED      0:0 
136593.0            gmx            project123         16     FAILED      2:0

The sreport command can also be used to generate similar reports from the accounting data stored in the Slurm database. The output may vary from the sacct information on systems with hyper-threading enabled. In particular, the number of cores reported by sreport might need to be divided by the number of hardware threads per core.

Allocation underuse and overuse

Director's Share projects cannot execute jobs after having consumed their allocation. Jobs can be queued but they will not run. Projects that overrun their quarterly allocation in any one quarter can still run, albeit with reduced priority. A project may have more than one allocation on different machines. Each allocation is specific to a supercomputer, so the consumption of one allocation does not affect the other. For information specific to a supercomputing system, check the relevant system user guide.

High priority mode

You can submit jobs with a higher priority than usual, subject to some limitations. This is similar to the express queue feature at other supercomputing centres. There is no extra charging rate for this feature; that is, there is no multiplier on your service unit usage. The feature is intended for short test jobs before running a large simulation, or for running short test jobs during code development. It complements reservations and should be considered before requesting one.

You do not need to contact the helpdesk to use this feature. You can pass the --qos=high option to sbatch  or salloc. If the mode is not available, you will receive this error:

    sbatch: error: Batch job submission failed: Invalid qos specification

Check the system-specific pages for details on how high-priority mode is implemented on various supercomputers at Pawsey.

Storage allocation usage

Several filesystems impose limits on how many files a user or group may create and how much space they may consume. Refer to File Management for more information.

Job provenance

Provenance is important for trusting simulation outputs or data analysis. It is necessary to have concise and complete records of the transformations of the input data to be able to reproduce the output data. This not only includes recording input data and batch scripts, but also what versions of the software were used and where the job was run. Different software versions and different hardware may give different output if the algorithm is sensitive to precision.

Batch scripts or workflow engines are preferable to interactive sessions.

Batch scripts

You should follow Best Practices on batch script Reproducibility. In particular, start the job with a pristine environment that is not inherited from your login shell, and do not use Shell Initialisation scripts to alter the job environment.

Slurm keeps some information regarding a job, such as resources used, so you do not need to separately record it in your output (unless required for convenience).

Print out the currently loaded modules in Bash at the beginning of your batch script, using the following command:

    module list 2>&1

The 2>&1 sequence redirects the module output from stderr to stdout, so it is captured together with the rest of the job output. Print the current working directory using pwd.

The scontrol command shows helpful information for a running job, including the list of nodes, working directory, input filename and output filenames. You can add this information before the executable starts. Not all of this information is retained for querying with sacct.

    scontrol show job $SLURM_JOBID

If your batch script is short, you can copy it to stdout so you only have one file to keep. Execute the following command near the top of the batch script:

    cat $0

It can be helpful to know if your jobs have a consistent runtime. Add the command shown in terminal 12 to the end of your batch script to record the execution time. The invoked command will display the start time and elapsed time in the output.

Terminal 12. Using sacct to show start and end times of the job
$ sacct -j $SLURM_JOBID -o jobid%20,Start,elapsed
 
              JobID               Start    Elapsed
-------------------- ------------------- ----------
             3109174 2019-07-02T08:34:36   03:32:09
      3109174.extern 2019-07-02T08:34:36   03:32:09

Note that if there are multiple srun commands in the batch script, a line will be added for each one of those.

Conclusion

You should now be able to interact with the Slurm scheduler and submit jobs to it. Slurm allows complex sequences of jobs to be orchestrated and executed, for instance by providing facilities to declare dependencies between jobs. For more information, visit Example Workflows and the specific example batch scripts within the user guides for each of the Guides per Supercomputer.
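For instance, a follow-up job can be held until a first (hypothetical) job completes successfully:

    $ sbatch --dependency=afterok:12345 postprocess.sh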

Related pages

External links