
Example Workflows


Slurm is able to orchestrate the execution of complex job workflows. This page shows you, for instance, how to schedule the execution of a job only after another one has ended, or how to execute different programs within a single job to make use of all the allocated resources.


Job arrays 

An array job provides a mechanism to run a (possibly large) number of similar jobs from a single batch script. The number of instances of the job is controlled via the SLURM directive --array:

#SBATCH --array=<list>

Here <list> may be a comma-separated list of numbers, or a range of numbers specified using a dash "-". For example, one may have "--array=0,1,2,3" or "--array=0-3" to specify four instances. These two formats can be combined: "--array=0-2,4,8". An optional stride may be introduced when specifying a range using a colon ":". For example, "--array=0-7:2" is equivalent to "--array=0,2,4,6".
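
For instance, each of the following directives requests four array tasks (the first two are equivalent, and the third uses a stride of 2):

#SBATCH --array=0,1,2,3
#SBATCH --array=0-3
#SBATCH --array=0-7:2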

All of the other SLURM directives specified in the script are common to all of the jobs, specifically the number of nodes and the time limit.

The following simple example runs two instances of a 32 MPI task job, each on one node.

Listing 1. Job array MPI example
#!/bin/bash --login
#This example uses general SBATCH settings, but please refer to the specific guide
#of the intended cluster for any needed changes

# SLURM directives
#
# This is an array job with two subtasks 0 and 1 (--array=0,1).
#
# The output for each subtask will be sent to a separate file
# identified by the jobid (--output=array-%j.out)
# 
# Each subtask will occupy one node (--nodes=1) with
# a wall-clock time limit of 20 minutes (--time=00:20:00)
#
# Replace [your-project] with the appropriate project name
# following --account (e.g., --account=project123)

#SBATCH --account=[your-project]
#SBATCH --array=0,1
#SBATCH --output=array-%j.out
#SBATCH --nodes=1
#SBATCH --ntasks=32         #this needs to be indicated for shared access nodes
#SBATCH --cpus-per-task=1   #confirm the number of cpus per task
#SBATCH --mem=58G           #specify when asking for shared access to compute nodes (or use --exclusive for exclusive access)
#SBATCH --time=00:20:00 

# ---
# To launch the job, we specify to srun 32 MPI tasks (-n 32)
# to run on the node
#
# Note we avoid any inadvertent OpenMP threading by setting
# OMP_NUM_THREADS=1
#
# The input to the executable is the unique array task identifier
# $SLURM_ARRAY_TASK_ID which will be either 0 or 1

export OMP_NUM_THREADS=1

# ---
# Set MPI related environment variables. Not all need to be set
# main variables for multi-node jobs (uncomment for multinode jobs)
#export MPICH_OFI_STARTUP_CONNECT=1
#export MPICH_OFI_VERBOSE=1
#Ask MPI to provide useful runtime information (uncomment if debugging)
#export MPICH_ENV_DISPLAY=1
#export MPICH_MEMORY_REPORT=1

#--- 
#Specific settings for the cluster you are on
#(Check the specific guide of the cluster for additional settings)

#---
echo This job shares a SLURM array job ID with the parent job: $SLURM_ARRAY_JOB_ID
echo This job has a SLURM job ID: $SLURM_JOBID
echo This job has a unique SLURM array index: $SLURM_ARRAY_TASK_ID

#----
#Execute command:
srun -N 1 -n 32 -c 1 -m block:block:block ./code_mpi.x $SLURM_ARRAY_TASK_ID

The job is submitted as normal:

$ sbatch array_script.sh
Submitted batch job 212681

The "parent" job will initially appear in the queue with an underscore appended to the job ID, for example: 212681_. The first sub-job, when started, will appear with the same job ID as the parent but without the underscore. Subsequent sub-jobs have consecutive job IDs which in this case give output, for example: array-212681.out and array-212682.out.

Below is another example for a slightly different use of job arrays. Here, you have a task you want to perform on many input files with a consistent file naming pattern. For example, all input files end in input.txt. Rather than performing the task in series (one file at a time), this script will run multiple tasks in parallel (all at the same time).

Listing 2. Example of a job array with many files
#!/bin/bash --login
#This example uses general SBATCH settings, but please refer to the specific guide
#of the intended cluster for any needed changes

#
# SLURM directives
#
# This is an array job with 35 subtasks (--array=0-34).
#
# The output for each subtask will be sent to a separate file
# identified by the jobid (--output=array-%j.out)
# 
# Each subtask will occupy one node (--nodes=1) with
# a wall-clock time limit of twenty minutes (--time=00:20:00)
#
# Replace [your-project] with the appropriate project name
# following --account (e.g., --account=project123)


#SBATCH --account=[your-project]
#SBATCH --output=array-%j.out
#SBATCH --array=0-34       #this should match the number of input files
#SBATCH --nodes=1
#SBATCH --ntasks=1          #One main task per file to be processed in each array-subtask
#SBATCH --cpus-per-task=1 	#this will vary depending on the requirements of the task
#SBATCH --mem=1840M         #Needed memory per array-subtask (or use --exclusive for exclusive access)
#SBATCH --time=00:20:00

#---  
echo "All jobs in this array have:"
echo "- SLURM_ARRAY_JOB_ID=${SLURM_ARRAY_JOB_ID}"
echo "- SLURM_ARRAY_TASK_COUNT=${SLURM_ARRAY_TASK_COUNT}"
echo "- SLURM_ARRAY_TASK_MIN=${SLURM_ARRAY_TASK_MIN}"
echo "- SLURM_ARRAY_TASK_MAX=${SLURM_ARRAY_TASK_MAX}"
 
echo "This job in the array has:"
echo "- SLURM_JOB_ID=${SLURM_JOB_ID}"
echo "- SLURM_ARRAY_TASK_ID=${SLURM_ARRAY_TASK_ID}"

#--- 
#Specific settings for the cluster you are on
#(Check the specific guide of the cluster for additional settings)

#---  
# grab our filename from a directory listing
FILES=($(ls -1 *.input.txt)) #this pulls in all the files ending with .input.txt
FILENAME=${FILES[$SLURM_ARRAY_TASK_ID]} #this selects the input file corresponding to this array task
echo "My input file is ${FILENAME}" #this will print the file name into the log file 

#---  
#example job using the above variables
srun -N 1 -n 1 -c 1 ExpansionHunterDenovo-v0.8.7-linux_x86_64/scripts/casecontrol.py locus \
        --manifest ${FILENAME} \
        --output ${FILENAME}.CC_locus.tsv

Again, the job would be submitted as normal:

$ sbatch array_script2.sh
Submitted batch job 212682

Job dependencies 

It is possible to specify dependencies between two jobs using a unique SLURM job ID. Suppose Job 2 cannot start until the successful completion of Job 1, but we want to submit Job 1 and Job 2 at the same time. So, submit Job 1 as usual:

$ sbatch job1-script.sh
Submitted batch job 206842

We can then immediately submit Job 2 specifying the dependency on Job 1 (ID 206842) using the -d option:

$ sbatch -d afterok:206842 job2-script.sh
Submitted batch job 206845

At this point Job 2 will enter the queue in the pending state and appear under squeue as having a dependency. The clause afterok:id means Job 2 will not become eligible to run until Job 1 has finished successfully (that is, job1-script.sh exits with exit code zero). If Job 1 does exit successfully, Job 2 will become eligible to run and will run at the next opportunity. However, if Job 1 fails, Job 2 can never run, and will be silently removed from the queue.

A number of additional types of dependency are available. These include:

Option                  Purpose
-d afterany:jobid       Dependent job may run after any exit status
-d afternotok:jobid     Dependent job may run only after a non-zero exit status

A group of jobs can be submitted one after the other, using the previous job ID to create a dependency between the individual jobs.

The following example has four jobs, each dependent on the previously submitted job. It uses the --parsable option of sbatch to obtain a uniform output in which the first field is the job ID.

Terminal 1. Submit multiple dependent jobs
$ jobid=`sbatch --parsable first_job.sh | cut -d ";" -f 1` 									#this submits the 1st job and captures the jobid, for use in the next line
$ jobid=`sbatch --parsable --dependency=afterok:$jobid second_job.sh | cut -d ";" -f 1`  	#this submits the 2nd job and captures the jobid, for use in the next line
$ jobid=`sbatch --parsable --dependency=afterok:$jobid third_job.sh | cut -d ";" -f 1`   	#this submits the 3rd job and captures the jobid, for use in the next line
$ sbatch --dependency=afterok:$jobid fourth_job.sh 											#this submits the 4th and last job

In the more complicated example below, the first three jobs are independent, but the last job runs only after the previous three have completed successfully.

Terminal 2. Submit parallel and dependent jobs
$ jobid1=`sbatch --parsable first_job.sh | cut -d ";" -f 1` 								#this submits the 1st job and captures the jobid, for use in the last line
$ jobid2=`sbatch --parsable second_job.sh | cut -d ";" -f 1` 								#this submits the 2nd job and captures the jobid, for use in the last line
$ jobid3=`sbatch --parsable third_job.sh | cut -d ";" -f 1` 								#this submits the 3rd job and captures the jobid, for use in the last line
$ sbatch --dependency=afterok:$jobid1:$jobid2:$jobid3 fourth_job.sh

Another common method of creating dependencies between jobs is to have the batch job submit a new job at the end of the job script. This is known as job chaining.

Listing 3. Example job chaining script
#!/bin/bash -l
#This example uses general SBATCH settings, but please refer to the specific guide
#of the intended cluster for any needed changes

#SBATCH --account=[your-project]
#SBATCH --nodes=xx
#SBATCH --ntasks=yy 			#this directive is required on setonix to request yy tasks
#SBATCH --cpus-per-task=zz
#SBATCH --exclusive             #For exclusive access to node resources (or use --mem for shared access)
#SBATCH --time=00:05:00

srun -N xx -n yy -c zz ./a.out # fill in the srun options '-N xx/-n yy/etc' to be appropriate to run the job
sbatch next_job.sh

Dependencies may be used within a SLURM script itself by making use of the SLURM variable $SLURM_JOB_ID to identify the current job. For example:

Listing 4. Example dependency job script
#!/bin/bash -l
#This example uses general SBATCH settings, but please refer to the specific guide
#of the intended cluster for any needed changes

#SBATCH --account=[your-project]
#SBATCH --nodes=xx
#SBATCH --ntasks=yy 					#this directive is required on setonix to request yy tasks
#SBATCH --cpus-per-task=zz
#SBATCH --exclusive             #For exclusive access to node resources (or use --mem for shared access)
#SBATCH --time=00:05:00

sbatch --dependency=afternotok:${SLURM_JOB_ID} next_job.sh
srun -N xx -n yy -c zz ./code.x 		# fill in the srun options '-N xx -n yy' etc. to be appropriate to run the job

Job dependencies and job priorities

Note that submitting multiple jobs and using dependencies will not obtain a higher queue priority for the dependent jobs just because they were submitted earlier. Accrual of job age priority starts from the eligible time, not the submission time. Jobs with dependencies only become eligible when the dependency is removed/completed.

Terminal 3. View dependent jobs in queue
$ squeue -j 4979452,4979463,4979465 -O "jobid,submittime,eligibletime,reason,dependency"
JOBID               SUBMIT_TIME         ELIGIBLE_TIME       REASON              DEPENDENCY
4979452             2020-05-22T09:47:45 2020-05-22T09:47:45 Priority
4979463             2020-05-22T09:49:19 2020-05-22T10:20:54 Priority
4979465             2020-05-22T09:49:38 N/A                 Dependency          afternotok:4979463

In the above example, job 4979452 was submitted with no dependency and became eligible to run immediately. Job 4979463 was submitted with a dependency; that dependency completed at 10:20:54, at which point the job became eligible to run. Job 4979465 is still not eligible to run.

Recursive jobs

When a job cannot be completed within the walltime of 24 hours, it will need to be restarted from its last checkpoint. If the job needs to be restarted several times before reaching completion, it is convenient to allow subsequent jobs to restart automatically rather than submitting them manually. For this, the initial job script needs to contain the logic for submitting subsequent jobs automatically. Note that the code that is being executed needs to be able to restart from an existing checkpoint generated from the previous run. Usually, it also needs an updated version of the restart parameters. Therefore, the initial job script also needs to be equipped with the necessary updating procedure to allow the use of the existing checkpoint file and the new input parameters.

Figure 1 shows the logic flow of the following recursive job script example. Three check levels are included in the example.

  • Check 1: Checks for the existence of a file (the stop file), which is used as an indicator for the process to terminate.
  • Check 2: Counts how many job output files have been generated. If the specified maximum has been reached then the job is terminated.
  • Check 3: Examines how many job submissions (iterations) have been performed. If the specified maximum has been reached, no new jobs will be submitted. Instead, the current script will run until completion.


Figure 1. Logic flow of the recursive job script example


From the three check levels suggested above, check 3 is the most intuitive from a "programming" logic perspective. The first two have been added as additional "safety net" checks. The second one avoids the creation of an infinite loop of resubmissions when there is some bug in the logic of the script. The first check allows the user (or any other subprocess or script) to flag termination by creating a dummy "flag-to-stop" file.

In the following paragraphs we'll explain the logic with an example.


First of all, if the main executable (code.x) in the job script needs to read its input parameters from a file (input.dat in this case), then the script may need to adapt the input parameters for each job execution. To deal with that, assume as an example that the input.dat file is something like this:

Listing 5. input.dat
starttime=0
endtime=10

These parameters are used by code.x to define the initial and final times of the numerical simulation it executes. As the job will be submitted many times recursively, the file of input parameters needs to be updated at every recursion before running the executable. For that, we'll make use of a template file named input.template:

Listing 6. input.template
starttime=VAR_START_TIME
endtime=VAR_END_TIME

This template file will replace the input.dat file, and the strings VAR_START_TIME and VAR_END_TIME will be replaced by the needed values in each recursion (using the sed command) before running the executable in the current job. This logic is in the section "##Setup/Update of parameters/input for the current script" in listing 7.
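
As a minimal sketch of this substitution step (using the example values starttime=0 and endtime=10), the commands in the script amount to:

cp input.template input.dat
sed -i "s,VAR_START_TIME,0," input.dat
sed -i "s,VAR_END_TIME,10," input.dat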

For this process to work properly, the correct values of the input parameters need to be set and "sent" to the following job submission. This is performed when the submission of the following dependent job is done. This logic is in the section "##Submitting the dependent job" in listing 7.

The example script iterative.sh performs the logic described above. Comments within explain the reasoning of each section.

Listing 7. iterative.sh
#!/bin/bash -l

#This example uses general SBATCH settings, but please refer to the specific guide
#of the intended cluster for any needed changes

#-----------------------
##Defining the needed resources with SLURM parameters (modify as needed)
#SBATCH --account=[your-project]
#SBATCH --job-name=iterativeJob
#SBATCH --ntasks=128 
#SBATCH --ntasks-per-node=128
#SBATCH --cpus-per-task=1
#SBATCH --exclusive           #Will use exclusive access to nodes (Or use --mem for shared access)
#SBATCH --time=05:00:00

#-----------------------
##Setting modules
#Add the needed modules (uncomment and adapt the following lines)
#module swap the-module-to-swap the-module-i-need
#module load the-modules-i-need

#-----------------------
##Setting the variables for controlling recursion
#Job iteration counter. Its default value is 1 (for the first submission). For a subsequent submission, it will receive its value through the "sbatch --export" command from the "parent job".
: ${job_iteration:="1"}
this_job_iteration=${job_iteration}

#Maximum number of job iterations. It is always good to have a reasonable number here
job_iteration_max=5

echo "This jobscript is calling itself in recursively. This is iteration=${this_job_iteration}."
echo "The maximum number of iterations is set to job_iteration_max=${job_iteration_max}."
echo "The slurm job id is: ${SLURM_JOB_ID}"

#-----------------------
##Defining the name of the dependent script.
#This "dependentScript" is the name of the next script to be executed in workflow logic. The most common and more utilised is to re-submit the same script:
thisScript=`squeue -h -j $SLURM_JOBID -o %o`
export dependentScript=${thisScript}

#-----------------------
##Safety-net checks before proceeding to the execution of this script

#Check 1: If the file with the exact name 'stopSlurmCycle' exists in the submission directory, then stop execution.
#         Users can create a file with this name if they need to interrupt the submission cycle by using the following command:
#             touch stopSlurmCycle
#         (Remember to remove the file before submitting this script again.)
if [[ -f stopSlurmCycle ]]; then
   echo "The file \"stopSlurmCycle\" exists, so the script \"${thisScript}\" will exit."
   echo "Remember to remove the file before submitting this script again, or the execution will be stopped."
   exit 1
fi

#Check 2: If the number of output files has reached a limit, then stop execution.
#         The existence of a large number of output files could be a sign of an infinite recursive loop.
#         In this case we check for the number of "slurm-XXXX.out" files.
#         (Remember to check your output files regularly and remove old ones that are no longer needed, or the execution may be stopped.)
maxSlurmies=25
slurmyBaseName="slurm" #Use the base name of the output file
slurmies=$(find . -maxdepth 1 -name "${slurmyBaseName}*" | wc -l)
if [ $slurmies -gt $maxSlurmies ]; then
   echo "There are slurmies=${slurmies} ${slurmyBaseName}-XXXX.out files in the directory."
   echo "The maximum allowed number of output files is maxSlurmies=${maxSlurmies}"
   echo "This could be a sign of an infinite loop of slurm resubmissions."
   echo "So the script ${thisScript} will exit."
   exit 2
fi

#Check 3: Add some other adequate checks to guarantee the correct execution of your workflow
#Check 4: etc.

#-----------------------
##Setup/Update of parameters/input for the current script

#The following variables will receive a value with the "sbatch --export" submission from the parent job.
#If this is the first time this script is called, then they will start with the default values given here:
: ${var_start_time:="0"}
: ${var_end_time:="10"}
: ${var_increment:="10"}

#Replacing the current values in the parameter/input file used by the executable:
paramFile=input.dat
templateFile=input.template
cp $templateFile $paramFile
sed -i "s,VAR_START_TIME,$var_start_time," $paramFile
sed -i "s,VAR_END_TIME,$var_end_time," $paramFile

#Creating the backup of the parameter file utilised in this job
cp $paramFile $paramFile.$SLURM_JOB_ID

#-----------------------
##Verify that everything that is needed is ready
#This section is IMPORTANT. For example, it can be used to verify that the results from the parent submission are there. If not, stop execution.

#-----------------------
##Submitting the dependent job
#IMPORTANT: Never use cycles that could fall into infinite loops. Numbered cycles are the best option.

#The following variable needs to be "true" for the cycle to proceed (it can be set to false to avoid recursion when testing):
useDependentCycle=true

#Check if the current iteration is within the limits of the maximum number of iterations, then submit the dependent job:
if [ "$useDependentCycle" = "true" ] && [ ${job_iteration} -lt ${job_iteration_max} ]; then
   #Update the counter of cycle iterations
   (( job_iteration++ ))
   #Update the values needed for the next submission
   var_start_time=$var_end_time
   (( var_end_time += $var_increment ))
   #Dependent Job submission:
   #                         (Note that next_jobid has the ID given by the sbatch)
   #                         For the correct "--dependency" flag:
   #                         "afterok", when each job is expected to properly finish.
   #                         "afterany", when each job is expected to reach walltime.
   #                         "singleton", similar to afterany, when all jobs will have the same name
   #                         Check documentation for other available dependency flags.
   #IMPORTANT: The --export="list_of_exported_vars" guarantees that the values are passed on to the dependent job
   next_jobid=$(sbatch --parsable --export="job_iteration=${job_iteration},var_start_time=${var_start_time},var_end_time=${var_end_time},var_increment=${var_increment}" --dependency=afterok:${SLURM_JOB_ID} ${dependentScript} | cut -d ";" -f 1)
   echo "Dependent with slurm job id ${next_jobid} was submitted"
   echo "If you want to stop the submission chain it is recommended to use scancel on the dependent job first"
   echo "Or create a file named: \"stopSlurmCycle\""
   echo "And then you can scancel this job if needed too"
else
   echo "This is the last iteration of the cycle, no more dependent jobs will be submitted"
fi

#-----------------------
##Run the main executable.
#(Modify as needed)
#Syntax should allow restart from a checkpoint
srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK ./code.x

You can adapt this script to your needs, paying special attention to the safety-net checks, the type of dependency needed, and the appropriate syntax for restarting from a previous checkpoint.


Submit the initial job in the usual way:

$ sbatch iterative.sh

In the first iteration, the job will run with the default value of job_iteration=1 and the default input parameters (var_start_time=0, var_end_time=10, var_increment=10), and use them to create the input.dat file for the current run. Before the second job is submitted, those values are updated and then "sent through" in the sbatch submission command of the dependent job. The updated values of those variables and parameters will be received by the second job and used instead of the defaults.

Note how the sbatch submission uses the --dependency directive:

--dependency=afterok:${SLURM_JOB_ID}

This means that the dependent job will be submitted to the queue, but Slurm will wait for the parent job to finish properly. Only if the parent job finishes properly (afterok) will the dependent job be kept in the queue and continue its process. Another common dependency option is afterany, which is useful if the job is expected to reach the walltime in each submission. For other dependency options, see Job dependencies.


When a job is submitted with recursive capabilities, the squeue command may show a running job and a job waiting to be processed due to dependency. The dependent job will not be eligible to start until the running job has finished. As explained in the Job dependencies and job priorities section, the dependent job will not accrue age priority until the first job has completed.

Terminal 4. View jobs in queue for user
$ squeue -u espinosa
JOBID    USER     ACCOUNT                   NAME EXEC_HOST ST     REASON   START_TIME     END_TIME  TIME_LEFT NODES   PRIORITY
3483798  espn     pawsey            iterativeJob  nid00017  R       None     14:44:27     14:47:27       2:50     1       5269
3483799  espn     pawsey            iterativeJob       n/a PD Dependency          N/A          N/A       5:00     1       5269

Alternatively, if the running job has completed and the dependent job has not yet started, then you will only see the dependent job in the squeue output, and the REASON will be either Priority or Resources.


Figure 2 summarises what to expect when looking at the job queue for this recursive job script.

  • On initial job submission, there will be one job in the queue
  • When a job is running, you will see another job appear in the queue that is held with REASON=Dependency
  • When each iteration of the job has finished, there will again only be one pending job in the queue



Figure 2. Flow of execution in the recursive job script example


If you want to stop the recursive submission without cancelling the current running job, you can create a file named stopSlurmCycle (check the example script above in the section "##Safety-net checks") by using the touch command:

$ touch stopSlurmCycle

Or you could cancel the dependent job:

$ scancel 3483799

The jobID of the dependent job was taken from the display of the squeue command above.

To cancel the whole process, you should cancel both the dependent job and the running job. (It is always wise to cancel the dependent job first.)
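
For example, using the job IDs shown in Terminal 4:

$ scancel 3483799    #cancel the dependent job first
$ scancel 3483798    #then cancel the running job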

Multi-cluster jobs

The SLURM queueing system offers the ability to launch commands on other clusters instead of, or in addition to, the local cluster on which the command is invoked. A classic example of data staging is presented on the Data Workflows page, which runs a computation in the work partition of Setonix and, upon completion, launches a data-copying job to the copy partition of the Setonix cluster.
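
As a minimal sketch (the script name copy_data.sh is illustrative, and the partition name should be adapted to your cluster), a compute job could submit such a data-copying job at the end of its script using the -M (--clusters) and -p (--partition) options of sbatch, combined with a dependency:

#Submit a data-copying job to the copy partition of the setonix cluster,
#to start only after the current compute job has finished successfully
sbatch -M setonix -p copy --dependency=afterok:${SLURM_JOB_ID} copy_data.sh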

Interactive jobs 

For code development, debugging and light-weight visualisation purposes, it is sometimes convenient to run on the back-end "interactively". This can be done using the SLURM command salloc. For example, from the front-end we can enter salloc to ask for one node to be allocated in the debug partition:

Terminal 5. Launch interactive session
$ salloc -p debug --nodes=1 --ntasks=32 --cpus-per-task=1 --mem=58G
salloc: Pending job allocation 206121
salloc: job 206121 queued and waiting for resources
salloc: Granted job allocation 206121
setonix@nid00200:~>

While interactive access to the work partition is available via salloc, interactive jobs do not get additional priority. This may mean long wait times for interactive requests to be satisfied if the machine is busy.

Note the change in prompt from $ to setonix@nid00200:~>, which indicates that you are now logged into the compute node (nid00200 in this case).

The --ntasks option should also be used on Setonix to explicitly specify the number of tasks required for the interactive session, and the --mem option to indicate the required memory.

You must use srun to run multiple instances of your executable in parallel.

For example:


Terminal 6. Move into $MYSCRATCH directory
$ cd $MYSCRATCH
setonix@nid00200:/scratch/[project]/[username]>
setonix@nid00200:/scratch/[project]/[username]> srun -N 1 -n 4 -c 1 ./code_test.x
...

When finished, type exit to leave the interactive queue and rejoin the front-end.

Terminal 7. Exit interactive session
$ exit
exit
salloc: Relinquishing job allocation 206121
salloc: Job allocation 206121 has been revoked

Note that X11 forwarding is enabled by default in the interactive queue.

We recommend users to use FastX, a web-based remote visualisation service on Topaz, to launch compute-intensive visualisation packages such as ParaView, VisIt or VMD. Refer to the Remote Visualisation - Topaz support page for more information.

Packing serial/small multithreaded jobs 

Implementing parallelism for a given workflow sometimes means running many copies of a serial code with different input parameters or data. Outputs must be stored separately and in an identifiable way. The same is true of jobs that support threads but which do not scale to a full node. In this case we might want to run, say, six jobs each of four threads at the same time. For purposes of efficiency, we would like to pack a number of such instances in one node to make use of all cores available within the node (for example, the 128 cores on Setonix CPU-only nodes).

There are a number of ways in which this can be done:

  1. For "trivial" parallelism, where all the tasks are completely independent, individual tasks can be uniquely identified by the environment variable SLURM_PROCID which takes on a value from 0 to <ntasks-1> when an application is launched using srun -n ntasks. For more information, see Method 1: Using SLURM_PROCID.
  2. For more complex workflows, where there may be some dependencies between tasks, we recommend considering mpibash. For more information, see Method 2: Using mpibash.
  3. For complex scripting tasks requiring parallelism, we suggest considering Python and message passing via mpi4py. See Method 3: Using Python and mpi4py.

Method 1: Using SLURM_PROCID

Packing Serial Jobs

This section shows how to pack a workflow consisting of multiple serial (single core) instances of work. The individual "instances of work" here might represent a serial binary executable or a separate serial script. In the example below, we use the environment variable SLURM_PROCID to identify input files and output files for each of 64 requested instances of a serial executable serial-code.x. Instead of launching the executable directly, an intermediate (wrapper) shell script is launched by srun. Inside the wrapper script one has access to $SLURM_PROCID and can construct the serial workflow that is intended to execute a given instance:

Listing 8. Parallel serial jobs example
#!/bin/bash --login
#This example uses general SBATCH settings, but please refer to the specific guide
#of the intended cluster for any needed changes

# SLURM directives
#
# Here we specify to SLURM we want 64 tasks in a single node with
# a wall-clock time limit of 1 hour (--time=01:00:00).
#
# Replace [your-project] with the appropriate project name
# following --account (e.g., --account=project123).

#SBATCH --account=[your-project]
#SBATCH --nodes=1
#SBATCH --ntasks=64  			
#SBATCH --ntasks-per-node=64 	
#SBATCH --cpus-per-task=1
#SBATCH --mem=117G              #Needed memory per node when share access (or use --exclusive for exclusive access)
#SBATCH --time=01:00:00 

#--- 
#Specific settings for the cluster you are on
#(Check the specific guide of the cluster for additional settings)

#---  
# Launch 64 instances of the wrapper script (make sure it is executable).
srun -N 1 -n 64 -c 1 -m block:block:block ./wrapper.sh

The wrapper.sh may look something like this:

Listing 9. wrapper.sh example
#!/bin/bash
#
# This is a standard bash script which has access to the environment variable SLURM_PROCID
# This is used to construct input filenames of the form input-0 input-1 ... input-63
# and similarly named output files

INFILE="input-$SLURM_PROCID"
OUTFILE="output-$SLURM_PROCID.out"

# Assuming all the input files exist in the current directory, we run the executable.
# Each instance will use the appropriate input and produce the relevant output.

./serial-code.x < $INFILE > $OUTFILE

Java Jobs

Example 1: Serial Java application

Here, we run a serial Java application (class Application) on one node.

Listing 10. Serial Java example
#!/bin/bash --login
#This example uses general SBATCH settings, but please refer to the specific guide
#of the intended cluster for any needed changes

# SLURM directives
#
# Here we specify to SLURM we want one node (--nodes=1) with
# a wall-clock time limit of 1 hr (--time=01:00:00).
#
# Replace [your-project] with the appropriate project name
# following --account (e.g., --account=project123).

#SBATCH --account=[your-project]
#SBATCH --nodes=1
#SBATCH --ntasks=1 		
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G           #specify when asking for shared access to compute nodes (or use --exclusive for exclusive access) 
#SBATCH --time=00:10:00
 
# Launch the job.
# There is one task to run java in serial (-n 1).

srun -N 1 -n 1 -c 1 java Application

Example 2: Two Java instances on one node

Running a single Java application on one node will not make use of all cores on that node (although it might require the entire available RAM). To run a number of instances of an application on the same node, an intermediate (or wrapper) application must be used via srun. The following example uses two instances, which are identified via the environment variable SLURM_PROCID. This variable takes on a unique value (starting at zero) for each instance specified to srun.

The SLURM script is as follows:

Listing 11. Dual Java instances example script
#!/bin/bash --login
#This example uses general SBATCH settings, but please refer to the specific guide
#of the intended cluster for any needed changes


# Here we specify to SLURM we want one node (--nodes=1) with
# a wall-clock time limit of ten minutes (--time=00:10:00).
#
# Replace [your-project] with the appropriate project name
# following --account (e.g., --account=project123).

#SBATCH --account=[your-project]
#SBATCH --nodes=1
#SBATCH --ntasks=2 				#this directive is required on setonix to request 2 tasks
#SBATCH --cpus-per-task=1
#SBATCH --mem=3680M   #specify when asking for shared access to compute nodes (or use --exclusive for exclusive access)
#SBATCH --time=00:10:00


#--- 
#Specific settings for the cluster you are on
#(Check the specific guide of the cluster for additional settings)

# We request two instances "-n 2" to be placed on the node
srun -N 1 -n 2 -c 1 java Wrapper

The Wrapper.java application takes the following form. Two instances of the Wrapper class are run (asynchronously), identical except for the value of SLURM_PROCID obtained from the environment. Appropriate program logic can be used to arrange, for example, specific input to an instance of an underlying application. Here, we simply report the value of SLURM_PROCID to standard output.

Listing 12. Java wrapper example
/* Wrapper to differentiate instances produced by srun */
/* The resulting "rank" may be used in conjunction with
* program logic to run different tasks from within java. */
import java.io.*;
class Wrapper {
	public static void main(String argv[]) {
		int rank;
		try {
			String slurm_proc_id = System.getenv("SLURM_PROCID");
			rank = Integer.parseInt(slurm_proc_id);
		}
		catch (NumberFormatException e) {
			rank = -1;
		}
		if (rank == 0) {
			System.out.println("Running with SLURM_PROCID zero");
		}
		if (rank == 1) {
			System.out.println("Running with SLURM_PROCID one");
		}
		/* ...and so on */
		return;
	}
}

R Jobs

Interactive access to R for testing and development is available through the queue system.

Terminal 8. Launch interactive session
$ salloc --nodes=1 --ntasks=1 --cpus-per-task=1 --mem=1840M --time=06:00:00
salloc: Granted job allocation 291021
setonix@nid00294:~> module load cray-R
setonix@nid00294:~> srun -N 1 -n 1 -c 1 R --no-save

R version 3.3.3 (2017-03-06) -- "Another Canoe"
Copyright (C) 2017 The R Foundation for Statistical Computing
...

Notes

  • The srun is required to ensure that the R executable runs on the back end.
  • The default time limit for the interactive queue is one hour, at the end of which you will be logged out automatically and without warning. Be sure to specify a time limit that is long enough to complete the task.
  • For short jobs of up to one hour, you can use salloc -p debug if the default work partition is busy.

Trivial parallelism may be introduced by the following mechanism. An intermediate "wrapper" script is required between the SLURM submission script and the R script itself. A simple example is:


Listing 13. R trivial parallelism example
#!/bin/bash --login

# SLURM script requesting one node with a time limit of 20 minutes.
# Replace [your-project] with the appropriate project budget.

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=24 		#this is required on setonix to request 24 tasks on a node 
#SBATCH --time=00:20:00
#SBATCH --account=[your-project]
#SBATCH --export=NONE
#SBATCH --mem=96G
#SBATCH --cpus-per-task=1

# Launch 24 instances (-n 24) on 24 cores of the wrapper script
srun -N 1 -n 24 -c 1 ./r-wrapper.sh

The wrapper script is r-wrapper.sh. This script must be in the same location as the submission script and it must be executable (chmod 740 r-wrapper.sh):

Listing 14. r-wrapper.sh
#!/bin/bash
#
# This script, running on the back-end, has access to the
# environment variable $SLURM_PROCID, which will take on
# values 0-23 when launched via srun -n 24.
#
# This is used as input to the R script, and to differentiate the output
# as r-job-<jobid>-<instance>.out (where the jobid is the same for each
# separate batch submission $SLURM_JOBID)

R --no-save "--args $SLURM_PROCID " < my-script.R > r-$SLURM_JOBID-$SLURM_PROCID.log

Finally, the R script (my-script.R) can be based on:

Listing 15. my-script.R
# The R script identifies its "rank" via the command line argument

args <- commandArgs(TRUE)

print ("This R script instance has input ")
print (args)

Packing Small Multithreaded Jobs

If your application supports multithreading, you can use the srun -c option to request the number of threads per instance on a node. You can also request a number of instances, as long as the number of threads times the number of instances does not exceed the total number of cores available within a node. Here is an example using OpenMP on Setonix. A similar approach can be used for pthreaded applications.

Listing 16. Job packing small multithreaded tasks example
#!/bin/bash --login

# SLURM directives
#
# Here we specify to SLURM we want two nodes (--nodes=2) with
# a wall-clock time limit of twenty minutes (--time=00:20:00).
#
# Replace [your-project] with the appropriate project name
# following --account (e.g., --account=project123).


#SBATCH --account=[your-project]
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4 	#this directive is required on setonix to request 4 tasks on each node
#SBATCH --ntasks-per-socket=2   #this directive indicates that maximum 2 tasks per socket are to be allocated
#SBATCH --ntasks=8 				#this directive is required on setonix to request a total of 8 tasks 
#SBATCH --cpus-per-task=32 		#this directive is required on setonix to request 32 CPU cores for each task for the OpenMP threads
#SBATCH --exclusive             #always use when all node resources are needed (or use --mem for shared access)
#SBATCH --time=00:20:00 

# Set number of OpenMP threads
export OMP_NUM_THREADS=32

#--- 
#Specific settings for the cluster you are on
#(Check the specific guide of the cluster for additional settings) 


# Launch the job.
# Here we use 8 instances (-n 8) with 4 per node with two instances per socket,
# or NUMA region. Each instance requests 32 cores -c 32 (via -c $OMP_NUM_THREADS)
srun -N 2 -n 8 -c ${OMP_NUM_THREADS} -m block:block:block ./wrapper.sh

The wrapper script in this case will invoke an OpenMP code. Again, the wrapper script must use SLURM_PROCID to differentiate the individual tasks (here 0-7) in an appropriate way for the workflow, as in the sketch below. For pthreaded applications, the number of threads must be communicated to the wrapper script and must be consistent with the value specified by the -c option. (Note that in the case above, the number of threads is available to the wrapper script in the exported environment variable OMP_NUM_THREADS.)
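
A minimal sketch of such a wrapper (the executable and file names are illustrative) might be:

#!/bin/bash
#
# Wrapper for the multithreaded packing example above.
# SLURM_PROCID (0-7 here) identifies the instance, and OMP_NUM_THREADS (32)
# is inherited from the batch script that launched srun.

INFILE="input-$SLURM_PROCID"
OUTFILE="output-$SLURM_PROCID.out"

./openmp-code.x < $INFILE > $OUTFILE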

Method 2: Using mpibash

An MPI implementation for bash is available through the module system. mpibash provides an implementation of a limited number of key MPI routines and is described here. Using mpibash presents a simple way to parallelise workflows based on standard bash scripts.

A simple example is given in the following snippet. Programmers who have used MPI should be familiar with the idiom.

Listing 17. MPIbash example
#!/usr/bin/env mpibash

# Note the mpibash shebang
#
# The following command informs bash of the location of the mpi_init command
# which can then be used to initialise MPI

enable -f mpibash.so mpi_init

mpi_init

mpi_comm_rank rank
mpi_comm_size size

echo "Hello from bash mpi rank $rank of $size"

mpi_finalize

Remember to change the file permissions on the script so that anyone can execute it. For example, if you called the script mpi-bash.sh:

$ chmod a+x mpi-bash.sh

The script can be launched with the following SLURM script on Setonix:

Listing 18. Launch mpi-bash example
#!/bin/bash --login

# We must load the mpibash module
#
# This particular example uses 48 MPI tasks on 2 nodes
#
# Note --export=none is necessary to avoid error messages of the form:
# _pmi_inet_listen_socket_setup:socket setup failed

#SBATCH --account=[your-project]
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24	#this directive is required on setonix to request 24 tasks on each node
#SBATCH --ntasks=48 			#this directive is required on setonix to request a total of 48 tasks
#SBATCH --mem=96G
#SBATCH --time=00:02:00
#SBATCH --export=none


module swap PrgEnv-cray PrgEnv-gnu 	#this is required for setonix
module load mpibash
export PMI_NO_PREINITIALIZE=1 		#this is required for setonix
export PMI_NO_FORK=1				#this is required for setonix
srun -N 2 -n 48 -c 1 ./mpi-bash.sh

The bash script can contain anything appropriate for a normal workflow. It cannot, however, attempt to launch a standalone MPI executable.
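
As a sketch of how a workflow could be divided between ranks (the executable and file-naming pattern are illustrative), each rank can process its own subset of input files:

#!/usr/bin/env mpibash

enable -f mpibash.so mpi_init
mpi_init

mpi_comm_rank rank
mpi_comm_size size

# Each rank processes every size-th file, starting at its own rank
FILES=($(ls -1 *.input.txt))
for (( i=rank; i<${#FILES[@]}; i+=size )); do
   ./serial-code.x < "${FILES[$i]}" > "${FILES[$i]}.out"
done

mpi_finalize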

Method 3: Using Python and mpi4py

The example in listing 19 runs a single serial Python script on a single node.

Listing 19. Single serial Python example
#!/bin/bash --login

# SLURM directives
#
# Here we specify to SLURM we want one node (--nodes=1) with
# a wall-clock time limit of ten minutes (--time=00:10:00).
#
# Replace [your-project] with the appropriate project name
# following --account (e.g., --account=project123).

#SBATCH --nodes=1
#SBATCH --ntasks=1 		#this directive is required on setonix to request 1 task
#SBATCH --time=00:10:00
#SBATCH --account=[your-project]
#SBATCH --export=NONE
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G

# Launch the job.
#
# Serial python script. Load the default python module with
#
# module load python
#
# Launch the script on the back end with srun -n 1

module load python
srun -N 1 -n 1 -c 1 python ./serial-python.py

#
# If you have an executable python script with a "bang path",
# make sure the path is of the form
#
# #!/usr/bin/env python

srun -N 1 -n 1 -c 1 ./serial-python-x.py

Suggestions on how to pack many serial tasks on a single node using mpi4py are given below.

Listing 20. Python serial task packing
#!/bin/bash --login

# SLURM directives
#
# Here we specify to SLURM we want two nodes (--nodes=2) with
# a wall-clock time limit of ten minutes (--time=00:10:00).
#
# Replace [your-project] with the appropriate project name
# following --account (e.g., --account=project123).

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24	#this directive is required on setonix to request 24 tasks on each node
#SBATCH --ntasks=48 			#this directive is required on setonix to request a total of 48 tasks
#SBATCH --time=00:10:00
#SBATCH --account=[your-account]
#SBATCH --export=NONE
#SBATCH --mem=96G   
#SBATCH --cpus-per-task=1

# Launch the job.
# This python script uses the python module mpi4py, which we need
# to load with
#
# module load mpi4py
#
# (which will also load the default python module as a dependency).
#
# The script is launched via srun -n 48, which specifies 48 MPI
# tasks, and is invoked via the interpreter.

module load mpi4py
srun -N 2 -n 48 -c 1 python ./mpi-python.py

# If you have an executable python script with a "bang path",
# make sure the path is of the form
#
# #!/usr/bin/env python

srun -N 2 -n 48 -c 1 ./mpi-python-x.py

Arrays or packing of many jobs requiring GPUs

Implementing parallelism for a given workflow sometimes means running many copies of a GPU code with different input parameters or data. This can be achieved with the two approaches already described in the sections above: job arrays and job packing.

When to use job arrays: For nodes that can be shared, the best practice is to use job arrays. A disadvantage of job packing on shared nodes is that unbalanced steps might lead to resources being held unnecessarily. When using arrays this problem does not exist because, as soon as any job finishes or fails, the resources for that job are freed for use by another user.

In the following example eight jobs are submitted as a job array, each using one GPU:

Listing 21. GPU job array example
#!/bin/bash --login

#SBATCH --account=[your-account]-gpu
#SBATCH --array=0-7
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00


#Go to the right directory for this instance of the job array using SLURM_ARRAY_TASK_ID as the identifier:
#We are assuming all the input files needed for each specific job reside in the corresponding working directory
cd workingDir_${SLURM_ARRAY_TASK_ID}

#Run the GPU executable (assuming the same executable will be used by each job, and that it resides in the submission directory):
srun -u -N 1 -n 1 ${SLURM_SUBMIT_DIR}/main_hip

When to use job packing: For nodes where resources are exclusive and cannot be shared among different users/jobs at the same time (like the nvlinkq partition on Topaz), the best practice is to use job packing. Ideally, multiple jobs should be packed in order to make use of the four available GPUs in the node. (Obviously, if a single job can make use of the four GPUs, that is also desirable and would not need packing.) We do not recommend packing jobs across multiple nodes with the same job script due to possible load-balancing issues: all resources will be held and unavailable to other users/jobs until the last substep (job) in the packing finishes.

Plan for balanced execution times between packed tasks

All allocated resources remain allocated until the last task finishes its execution. No partial resources are freed for other users when an individual task finishes. Therefore, users should plan these kinds of jobs very carefully and aim for all tasks to have very similar execution times. For example, if many of the tasks finish quickly but one task keeps executing until it reaches the walltime, there is the danger that most of the resources will remain idle for a long time. Your project is still charged for the resources that remain idle, and the creation of idle allocations is a very bad practice that must be avoided at all costs.

In the following example, four srun steps are run simultaneously in a single node, each step using one GPU. The SBATCH header requests all the resources of the node, and each srun command requests the specific resources for its step.


Listing 22. GPU job packing example using multiple steps simultaneously
#!/bin/bash --login


#SBATCH --account=[your-account]-gpu
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --ntasks-per-socket=2   #maximum 2 tasks per socket (each socket has 2 GPUs in this partition)
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:4
#SBATCH --time=00:10:00
 
#Default loaded compiler module is gcc module
 
module load cuda

for tagID in $(seq 0 3); do
   #Go to the right directory for this step of the job pack using tagID as the identifier:
   #We are assuming all the input files needed for each specific job reside in the corresponding working directory
   cd ${SLURM_SUBMIT_DIR}/workingDir_${tagID}

   #Defining an output file for this step
   outputFile=results_${tagID}.out
   echo "Starting" > $outputFile

   #Run the cuda executable (assuming the same executable will be used by each step, and that it resides in the submission directory):
   srun -u -N 1 -n 1 --mem=56G --gres=gpu:1 --exact ${SLURM_SUBMIT_DIR}/main_cuda >> $outputFile &
done
wait

Notes

  • In the header a total of four GPUs is requested. For each job step the specific number of GPUs to be used (1 in this case) is indicated. The use of --mem=56G indicates the amount of memory to be allocated for each step, and --exact allows access to only the resources requested for the step.
  • Note the use of "&" to execute each step in the background, and of "wait" to pause the script until all steps have finished before ending the job.
  • In the loop, the iterator (numeric identifier) for each step is defined to start at 0 in order to be equivalent to the natural numbering of Slurm, but you can use any start and end value to be consistent with your own naming of directories, input files and output files.


Exactly the same effect (packing) can also be achieved by using the --gpu-bind option of the Slurm scheduler and a wrapper:

Listing 23. GPU job packing with --gpu-bind
#!/bin/bash --login

#SBATCH --account=[your-account]
#SBATCH --partition=nvlinkq
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --ntasks-per-socket=2    #maximum 2 tasks per socket (each socket has 2 GPUs in this partition)
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:4
#SBATCH --gpu-bind=map_gpu:0,1,2,3
#SBATCH --time=00:10:00

 
#Default loaded compiler module is gcc module
 
module load cuda

#Run the cuda executable from a wrapper:
srun -u -N 1 -n 4 -c 1 wrapper.sh

And the wrapper.sh in this case is:

Listing 24. GPU wrapper example
#!/bin/bash

#Go to the right directory for this instance of the job pack using tagID as the identifier:
#We are assuming all the input files needed for each specific job reside in the corresponding working directory
cd ${SLURM_SUBMIT_DIR}/workingDir_${SLURM_PROCID}

#Defining an output file for this process
outputFile=results_${SLURM_PROCID}.out
echo "Starting" > $outputFile

#Check that the settings for this process are correct
echo "SLURM_PROCID=$SLURM_PROCID" >> outputFile
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES" >> $outputFile
echo "" >> $outputFile

#Run the cuda executable (assuming the same executable will be used by each job, and that it resides in the submission directory):
${SLURM_SUBMIT_DIR}/main_cuda >> $outputFile

The --gpu-bind setting will define the correct value for the environment variable CUDA_VISIBLE_DEVICES for each process to work on a different GPU. So this variable will get the value of 0 for the first instance of the wrapper running in the node, 1 for the second, 2 for the third and 3 for the last one. In this way the four instances will run simultaneously, each instance utilising a different GPU in the node.
