Example Workflows
Slurm is able to orchestrate the execution of complex job workflows. This page shows you, for instance, how to schedule the execution of a job only after another one has ended, or how to execute different programs within a single job to make use of all the allocated resources.
Job arrays
An array job provides a mechanism to run a (possibly large) number of similar jobs from a single batch script. The number of instances of the job is controlled via the SLURM directive --array
:
#SBATCH --array=<list>
Here <list>
may be a comma-separated list of numbers, or a range of numbers specified using a dash "-". For example, one may have "--array=0,1,2,3" or "--array=0-3" to specify four instances. These two formats can be combined: "--array=0-2,4,8". An optional stride may be introduced when specifying a range using a colon ":". For example, "--array=0-7:2", is equivalent to "--array=0,2,4,6".
All of the other SLURM directives specified in the script are common to all of the jobs, specifically the number of nodes and the time limit.
The following simple example runs two instances of a 32 MPI task job, each on one node.
#!/bin/bash --login #This example use general SBATCH settings, but please refer to the specific guide #of intended the cluster for possible needed changes # SLURM directives # # This is an array job with two subtasks 0 and 1 (--array=0,1). # # The output for each subtask will be sent to a separate file # identified by the jobid (--output=array-%j.out) # # Each subtask will occupy one node (--nodes=1) with # a wall-clock time limit of 20 minutes (--time=00:20:00) # # Replace [your-project] with the appropriate project name # following --account (e.g., --account=project123) #SBATCH --account=[your-project] #SBATCH --array=0,1 #SBATCH --output=array-%j.out #SBATCH --nodes=1 #SBATCH --ntasks=32 #this need to be indicated for shared access nodes #SBATCH --cpus-per-task=1 #confirm the number of cpus per task #SBATCH --mem=58G #specify when asking for shared access to compute nodes (or use --exclusive for exclusive access) #SBATCH --time=00:20:00 # --- # To launch the job, we specify to srun 24 MPI tasks (-n 24) # to run on the node # # Note we avoid any inadvertent OpenMP threading by setting # OMP_NUM_THREADS=1 # # The input to the execuatable is the unique array task identifier # $SLURM_ARRAY_TASK_ID which will be either 0 or 1 export OMP_NUM_THREADS=1 # --- # Set MPI related environment variables. Not all need to be set # main variables for multi-node jobs (uncomment for multinode jobs) #export MPICH_OFI_STARTUP_CONNECT=1 #export MPICH_OFI_VERBOSE=1 #Ask MPI to provide useful runtime information (uncomment if debugging) #export MPICH_ENV_DISPLAY=1 #export MPICH_MEMORY_REPORT=1 #--- #Specific settings for the cluster you are on #(Check the specific guide of the cluster for additional settings) #--- echo This job shares a SLURM array job ID with the parent job: $SLURM_ARRAY_JOB_ID echo This job has a SLURM job ID: $SLURM_JOBID echo This job has a unique SLURM array index: $SLURM_ARRAY_TASK_ID #---- #Execute command: srun -N 1 -n 32 -c 1 -m block:block:block ./code_mpi.x $SLURM_ARRAY_TASK_ID
The job is submitted as normal:
$ sbatch array_script.sh
Submitted batch job 212681
The "parent" job will initially appear in the queue with an underscore appended to the job ID, for example: 212681_
. The first sub-job, when started, will appear with the same job ID as the parent but without the underscore. Subsequent sub-jobs have consecutive job IDs which in this case give output, for example: array-212681.out
and array-212682.out
.
Below is another example for a slightly different use of job arrays. Here, you have a task you want to perform on many input files with a consistent file naming pattern. For example, all input files end in input.txt
. Rather than performing the task in series (one file at a time), this script will run multiple tasks in parallel (all at the same time).
#!/bin/bash --login #This example use general SBATCH settings, but please refer to the specific guide #of intended the cluster for possible needed changes # # SLURM directives # # This is an array job with 35 subtasks, (--array=0-34). # # The output for each subtask will be sent to a separate file # identified by the jobid (--output=array-%j.out) # # Each subtask will occupy one node (--nodes=1) with # a wall-clock time limit of twenty minutes (--time=00:20:00) # # Replace [your-project] with the appropriate project name # following --account (e.g., --account=project123) #SBATCH --account=[your-project] #SBATCH --output=array-%j.out #SBATCH --array=0-34 #this should match the number of input files #SBATCH --nodes=1 #SBATCH --ntasks=1 #One main task per file to be processed in each array-subtask #SBATCH --cpus-per-task=1 #this will vary depending on the requirements of the task #SBATCH --mem=1840M #Needed memory per array-subtask (or use --exclusive for exclusive access) #SBATCH --time=00:20:00 #--- echo "All jobs in this array have:" echo "- SLURM_ARRAY_JOB_ID=${SLURM_ARRAY_JOB_ID}" echo "- SLURM_ARRAY_TASK_COUNT=${SLURM_ARRAY_TASK_COUNT}" echo "- SLURM_ARRAY_TASK_MIN=${SLURM_ARRAY_TASK_MIN}" echo "- SLURM_ARRAY_TASK_MAX=${SLURM_ARRAY_TASK_MAX}" echo "This job in the array has:" echo "- SLURM_JOB_ID=${SLURM_JOB_ID}" echo "- SLURM_ARRAY_TASK_ID=${SLURM_ARRAY_TASK_ID}" #--- #Specific settings for the cluster you are on #(Check the specific guide of the cluster for additional settings) #--- # grab our filename from a directory listing FILES=($(ls -1 *.input.txt)) #this pulls in all the files ending with input.txt FILENAME=${FILES[$SLURM_ARRAY_TASK_ID]} #this allows the slurm to enter the input.txt files into the job array echo "My input file is ${FILENAME}" #this will print the file name into the log file #--- #example job using the above variables srun -N 1 -n 1 -c 1 ExpansionHunterDenovo-v0.8.7-linux_x86_64/scripts/casecontrol.py locus \ --manifest ${FILENAME} \ --output ${FILENAME}.CC_locus.tsv
Again, the job would be submitted as normal:
$ sbatch array_script2.sh
Submitted batch job 212682
Job dependencies
It is possible to specify dependencies between two jobs using a unique SLURM job ID. Suppose Job 2 cannot start until the successful completion of Job 1, but we want to submit Job 1 and Job 2 at the same time. So, submit Job 1 as usual:
$ sbatch job1-script.sh
Submitted batch job 206842
We can then immediately submit Job 2 specifying the dependency on Job 1 (ID 206842) using the -d
option:
$ sbatch -d afterok:206842 job2-script.sh
Submitted batch job 206845
At this point Job 2 will enter the queue in the pending state and appear under squeue as having a dependency. The clause afterok:id
means Job 2 will not become eligible to run until Job 1 has finished successfully (that is, job1-script.sh
exits with exit code zero). If Job 1 does exit successfully, Job 2 will become eligible to run and will run at the next opportunity. However, if Job 1 fails, Job 2 can never run, and will be silently removed from the queue.
A number of additional types of dependency are available. These include:
Option | Purpose |
---|---|
-d afterany:jobid | Dependent job may run after any exit status |
-d afternotok:jobid | Dependent job may run only after non-zero exit status |
A group of jobs can be submitted one after the other, using the previous job ID to create a dependency between the individual jobs.
The following example has four jobs that are dependent on the previously submitted job. It uses the --parsable
option of sbatch
to receive an always uniform output of the command where the first field is the job ID.
$ jobid=`sbatch --parsable first_job.sh | cut -d ";" -f 1` #this submits the 1st job and captures the jobid, for use in the next line $ jobid=`sbatch --parsable --dependency=afterok:$jobid second_job.sh | cut -d ";" -f 1` #this submits the 2nd job and captures the jobid, for use in the next line $ jobid=`sbatch --parsable --dependency=afterok:$jobid third_job.sh | cut -d ";" -f 1` #this submits the 3rd job and captures the jobid, for use in the next line $ sbatch --dependency=afterok:$jobid fourth_job.sh #this submits the 4th and last job
The more complicated example below has the first three jobs as being independent, but the last job only runs after the previous three have completed successfully.
$ jobid1=`sbatch --parsable first_job.sh | cut -d ";" -f 1` #this submits the 1st job and captures the jobid, for use in the last line $ jobid2=`sbatch --parsable second_job.sh | cut -d ";" -f 1` #this submits the 2nd job and captures the jobid, for use in the last line $ jobid3=`sbatch --parsable third_job.sh | cut -d ";" -f 1` #this submits the 3rd job and captures the jobid, for use in the last line $ sbatch --dependency=afterok:$jobid1:$jobid2:$jobid3 fourth_job.sh
Another common method of creating dependencies between jobs is to have the batch job submit a new job at the end of the job script. This is known as job chaining.
#!/bin/bash -l #This example use general SBATCH settings, but please refer to the specific guide #of intended the cluster for possible needed changes #SBATCH --account=[your-project] #SBATCH --nodes=xx #SBATCH --ntasks=yy #this directive is required on setonix to request yy tasks #SBATCH --cpus-per-task=zz #SBACTH --exclusive #For exclusive access to node resources (or use --mem for shared access) #SBATCH --time=00:05:00 srun -N xx -n yy -c zz ./a.out # fill in the srun options '-N xx/-n yy/etc' to be appropriate to run the job sbatch next_job.sh
Dependencies may be used within a SLURM script itself by making use of the SLURM variable $SLURM_JOB_ID to identify the current job. For example:
#!/bin/bash -l #This example use general SBATCH settings, but please refer to the specific guide #of intended the cluster for possible needed changes #SBATCH --account=[your-project] #SBATCH --nodes=xx #SBATCH --ntasks=yy #this directive is required on setonix to request yy tasks #SBATCH --cpus-per-task=zz #SBACTH --exclusive #For exclusive access to node resources (or use --mem for shared access) #SBATCH --time=00:05:00 sbatch --dependency=afternotok:${SLURM_JOB_ID} next_job.sh srun -N xx -n yy -c zz ./code.x # fill in the srun options '-N xx -n yy' etc. to be appropriate to run the job
Job dependencies and job priorities
Note that submitting multiple jobs and using dependencies will not obtain a higher queue priority for the dependent jobs just because they were submitted earlier. Accrual of job age priority starts from the eligible time, not the submission time. Jobs with dependencies only become eligible when the dependency is removed/completed.
$ squeue -j 4979452,4979463,4979465 -O "jobid,submittime,eligibletime,reason,dependency" JOBID SUBMIT_TIME ELIGIBLE_TIME REASON DEPENDENCY 4979452 2020-05-22T09:47:45 2020-05-22T09:47:45 Priority 4979463 2020-05-22T09:49:19 2020-05-22T10:20:54 Priority 4979465 2020-05-22T09:49:38 N/A Dependency afternotok:4979463
In the above example, job 4979452 was submitted with no dependency, and became eligible to run immediately. Job 4979463 was submitted with a dependency, but that dependency finished at 10:20:54 so this job was now eligible to run. Job 4979465 is still not eligible to run.
Recursive jobs
When a job cannot be completed within the walltime of 24 hours, it will need to be restarted from its last checkpoint. If the job needs to be restarted several times before reaching completion, it is convenient to allow subsequent jobs to restart automatically rather than submitting them manually. For this, the initial job script needs to contain the logic for submitting subsequent jobs automatically. Note that the code that is being executed needs to be able to restart from an existing checkpoint generated from the previous run. Usually, it also needs an updated version of the restart parameters. Therefore, the initial job script also needs to be equipped with the necessary updating procedure to allow the use of the existing checkpoint file and the new input parameters.
Figure 1 shows the logic flow of the following recursive job script example. Three check levels are included in the example.
- Check 1: Checks for the existence of a file (the stop file), which is used as an indicator for the process to terminate.
- Check 2: Counts how many job output files have been generated. If the specified maximum has been reached then the job is terminated.
- Check 3: Examines how many job submissions (iterations) have been performed. If the specified maximum has been reached, no new jobs will be submitted. Instead, the current script will run until completion.
Figure 1. Logic flow of the recursive job script example
From the three check levels suggested above, check 3 is the most intuitive from a "programming" logic perspective. The first two have been added as additional "safety net" checks. The second one has been included to avoid the creation of an infinite loop of resubmissions when there is some bug in logic of the script. And the first check allows the user (or any other sub process or script) to raise a flag of termination when a dummy "flag-to-stop" file is created.
In the following paragraphs we'll explain the logic with an example.
First of all, if the main executable (code.x
) in the job script needs to read its input parameters from a file (input.dat
in this case), then the script may need to adapt the input parameters for each job execution. To deal with that, assume as an example that the input.dat
file is something like this:
starttime=0 endtime=10
These parameters are used by code.x
to define the initial and final times of the numerical simulation it executes. As the job will be submitted many times recursively, the file of input parameters needs to be updated at every recursion before running the executable. For that, we'll make use of a template file named input.template
:
starttime=VAR_START_TIME endtime=VAR_END_TIME
This template file will replace the input.dat
file and the strings VAR_START_TIME
and VAR_END_TIME
will be replaced by the needed values in each recursion (using the command sed
) before running the executable in the current job. This logic is in the section "##Setup/Update of parameters/input for this current script
" in listing 7.
For this process to work properly, the correct values of the input parameters need to be set and "sent" to the following job submission. This is performed when the submission of the following dependent job is done. This logic is in the section "##Submitting the dependent job"
in listing 7.
The example script iterative.sh
performs the logic described above. Comments within explain the reasoning of each section.
#!/bin/bash -l #This example use general SBATCH settings, but please refer to the specific guide #of intended the cluster for possible needed changes #----------------------- ##Defining the needed resources with SLURM parameters (modify as needed) #SBATCH --account=[your-project] #SBATCH --job-name=iterativeJob #SBATCH --ntasks=128 #SBATCH --ntasks-per-node=128 #SBATCH --cpus-per-task=1 #SBATCH --exclusive #Will use exclusive access to nodes (Or use --mem for shared access) #SBATCH --time=05:00:00 #----------------------- ##Setting modules #Add the needed modules (uncomment and adapt the follwing lines) #module swap the-module-to-swap the-module-i-need #module load the-modules-i-need #----------------------- ##Setting the variables for controlling recursion #job iteration counter. It's default value is 1 (as for the first submission). For a subsequent submission, it will receive it value through the "sbatch --export" command from the "parent job". : ${job_iteration:="1"} this_job_iteration=${job_iteration} #Maximum number of job iterations. It is always good to have a reasonable number here job_iteration_max=5 echo "This jobscript is calling itself in recursively. This is iteration=${this_job_iteration}." echo "The maximum number of iterations is set to job_iteration_max=${job_iteration_max}." echo "The slurm job id is: ${SLURM_JOB_ID}" #----------------------- ##Defining the name of the dependent script. #This "dependentScript" is the name of the next script to be executed in workflow logic. The most common and more utilised is to re-submit the same script: thisScript=`squeue -h -j $SLURM_JOBID -o %o` export dependentScript=${thisScript} #----------------------- ##Safety-net checks before proceding to the execution of this script #Check 1: If the file with the exact name 'stopSlurmCycle' exists in the submission directory, then stop execution. # Users can create a file with this name if they need to interrupt the submission cycle by using the following command: # touch stopSlurmCycle # (Remember to remove the file before submitting this script again.) if [[ -f stopSlurmCycle ]]; then echo "The file \"stopSlurmCycle\" exists, so the script \"${thisScript}\" will exit." echo "Remember to remove the file before submitting this script again, or the execution will be stopped." exit 1 fi #Check 2: If the number of output files has reached a limit, then stop execution. # The existence of a large number of output files could be a sign of an infinite recursive loop. # In this case we check for the number of "slurm-XXXX.out" files. # (Remember to check your output files regularly and remove the not needed old ones or the execution may be stoppped.) maxSlurmies=25 slurmyBaseName="slurm" #Use the base name of the output file slurmies=$(find . -maxdepth 1 -name "${slurmyBaseName}*" | wc -l) if [ $slurmies -gt $maxSlurmies ]; then echo "There are slurmies=${slurmies} ${slurmyBaseName}-XXXX.out files in the directory." echo "The maximum allowed number of output files is maxSlurmies=${maxSlurmies}" echo "This could be a sign of an infinite loop of slurm resubmissions." echo "So the script ${thisScript} will exit." exit 2 fi #Check 3: Add some other adequate checks to guarantee the correct execution of your workflow #Check 4: etc. #----------------------- ##Setup/Update of parameters/input for the current script #The following variables will receive a value with the "sbatch --export" submission from the parent job. #If this is the first time this script is called, then they will start with the default values given here: : ${var_start_time:="0"} : ${var_end_time:="10"} : ${var_increment:="10"} #Replacing the current values in the parameter/input file used by the executable: paramFile=input.dat templateFile=input.template cp $templateFile $paramFile sed -i "s,VAR_START_TIME,$var_start_time," $paramFile sed -i "s,VAR_END_TIME,$var_end_time," $paramFile #Creating the backup of the parameter file utilised in this job cp $paramFile $paramFile.$SLURM_JOB_ID #----------------------- ##Verify that everything that is needed is ready #This section is IMPORTANT. For example, it can be used to verify that the results from the parent submission are there. If not, stop execution. #----------------------- ##Submitting the dependent job #IMPORTANT: Never use cycles that could fall into infinite loops. Numbered cycles are the best option. #The following variable needs to be "true" for the cycle to proceed (it can be set to false to avoid recursion when testing): useDependentCycle=true #Check if the current iteration is within the limits of the maximum number of iterations, then submit the dependent job: if [ "$useDependentCycle" = "true" ] && [ ${job_iteration} -lt ${job_iteration_max} ]; then #Update the counter of cycle iterations (( job_iteration++ )) #Update the values needed for the next submission var_start_time=$var_end_time (( var_end_time += $var_increment )) #Dependent Job submission: # (Note that next_jobid has the ID given by the sbatch) # For the correct "--dependency" flag: # "afterok", when each job is expected to properly finish. # "afterany", when each job is expected to reach walltime. # "singleton", similar to afterany, when all jobs will have the same name # Check documentation for other available dependency flags. #IMPORTANT: The --export="list_of_exported_vars" guarantees that values are inherited to the dependent job next_jobid=$(sbatch --parsable --export="job_iteration=${job_iteration},var_start_time=${var_start_time},var_end_time=${var_end_time},var_increment=${var_increment}" --dependency=afterok:${SLURM_JOB_ID} ${dependentScript} | cut -d ";" -f 1) echo "Dependent with slurm job id ${next_jobid} was submitted" echo "If you want to stop the submission chain it is recommended to use scancel on the dependent job first" echo "Or create a file named: \"stopSlurmCycle\"" echo "And then you can scancel this job if needed too" else echo "This is the last iteration of the cycle, no more dependent jobs will be submitted" fi #----------------------- ##Run the main executable. #(Modify as needed) #Syntax should allow restart from a checkpoint srun -N $SLURM_JOB_NUM_NODES -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK ./code.x
You can adapt this script to your needs, with special attention to the security checks, type of dependency needed and appropriate syntax for restarting from previous checkpoint.
Submit the initial job in the usual way:
$ sbatch iterative.sh
In the first iteration, the job will run with the default value of job_iteration=1,
and of the input parameters (var_start_time=0, var_start_time=10, var_increment=10
) and use them to create the input.dat
file for the current run. Before submitting the second job, those values will be updated. Then, they will be "sent through" in the sbatch
submission command of the dependent job. The updated values of those variables and parameters will be received in the second job and will be utilised instead of the defaults.
Note how the sbatch
submission uses the –-dependency
directive:
--dependency=afterok:${SLURM_JOB_ID}
This means that the dependent job will be submitted to the queue, but Slurm will still wait for the parent job to finish properly. And only if the parent job did finish properly (afterok
) the dependent job will be kept in the queue and continue its process. Other common dependency option is afterany
, which is of common use if the job is expected to reach the walltime in each submission. For other dependency options, see Job dependencies.
When a job is submitted with recursive capabilities, the squeue
command may show a running job and a job waiting to be processed due to dependency. The dependent job will not be eligible to start until the running job has finished. As explained in the Job dependencies and job priorities section, the dependent job will not accrue age priority until the first job has completed.
$ squeue -u matilda JOBID USER ACCOUNT NAME EXEC_HOST ST REASON START_TIME END_TIME TIME_LEFT NODES PRIORITY 3483798 matilda pawsey iterativeJob nid00017 R None 14:44:27 14:47:27 2:50 1 5269 3483799 matilda pawsey iterativeJob n/a PD Dependency N/A N/A 5:00 1 5269
Alternatively, if the running job has completed and the dependent job has not yet started, then you will only see the dependent job in the squeue
output, and the REASON will be either Priority or Resources.
Figure 2 summarises what to expect when looking at the job queue for this recursive job script.
- On initial job submission, there will be one job in the queue
- When a job is running, you will see another job appear in the queue that is held with REASON=Dependency
- When each iteration of the job has finished, there will again only be one pending job in the queue
Figure 2. Flow of execution in the recursive job script example
If you want to stop the recursive submission without cancelling the current running job, you can create a file named stopSlurmCycle
(check the example script above in the section "##Safety-net checks
") by using the touch
command:
$ touch stopSlurmCycle
Or you could cancel the dependent job:
$ scancel 3483799
The jobID of the dependent job was taken from the display of the squeue
command above.
To cancel the whole process, you should cancel both the dependent job and the running job. (It is always wise to cancel first the dependent job.)
Multi-cluster jobs
SLURM queueing system offers the ability to launch commands on other clusters instead of, or in addition to, the local cluster on which the command is invoked. A classic example of data staging is presented on the Data Workflows page, which runs a computation in the work
partition of setonix
and upon completion launches a data copying job to the copy
partition of the setonix
cluster.
Interactive jobs
For code development, debugging and light-weight visualisation purposes, it is sometimes convenient to run on the back-end "interactively". This can be done using the SLURM command salloc
. For example, from the front-end we can enter salloc
to ask for one node to be allocated in the debug
partition:
$ salloc -p debug --nodes=1 --ntasks=32 --cpus-per-task=1 --mem=58G salloc: Pending job allocation 206121 salloc: job 206121 queued and waiting for resources salloc: Granted job allocation 206121 setonix@nid00200:~>
While interactive access to the work
partition is available via salloc
, interactive jobs do not get additional priority. This may mean long wait times for interactive requests to be satisfied if the machine is busy.
Note the change in prompt from $
to setonix@nid00200:~>
, which indicates that you are now logged into the compute node (nid00200 in this case).
The --ntasks
option should also be used on Setonix to explicitly specify the number of tasks required for the interactive session, and the --mem
option to indicate the required memory.
You must use srun
to run multiple instances of your executable in parallel.
For example:
$ cd $MYSCRATCH setonix@nid00200:/scratch/[project]/[username]> setonix@nid00200:/scratch/[project]/[username]> srun -N 1 -n 4 -c 1 ./code_test.x ...
When finished, type exit
to leave the interactive queue and rejoin the front-end.
$ exit exit salloc: Relinquishing job allocation 206121 salloc: Job allocation 206121 has been revoked
Note that X11 forwarding is enabled by default in the interactive queue.
Nevertheless, we recommend users to Pawsey's web-based remote-visualisation service, to launch compute-intensive visualisation packages such as ParaView, VisIt or VMD. Please refer to Setonix Remote Visualisation documentation for detailed instructions of access and use.
Packing serial/small multithreaded jobs
Implementing parallelism for a given workflow sometimes means running many copies of a serial code with different input parameters or data. Outputs must be stored separately and in an identifiable way. The same is true of jobs that support threads but which do not scale to a full node. In this case we might want to run, say, six jobs each of four threads at the same time. For purposes of efficiency, we would like to pack a number of such instances in one node to make use of all cores available within the node (for example, the 128 cores on Setonix CPU-only nodes).
There are a number of ways in which this can be done:
- For "trivial" parallelism, where all the tasks are completely independent, individual tasks can be uniquely identified by the environment variable SLURM_PROCID which takes on a value from 0 to <ntasks-1> when an application is launched using
srun -n ntasks
. For more information, see Method 1: Using SLURM_PROCID. - For more complex workflows, where there may be some dependencies between tasks, we recommend considering mpibash. For more information, see Method 2: Using mpibash.
- For complex scripting tasks requiring parallelism, we suggest considering Python and message passing via mpi4py. See Method3: Using Python and mpi4py.
Method 1: Using SLURM_PROCID
Packing Serial Jobs
This section shows how to pack a workflow consisting of multiple serial (single core) instances of work. The individual "instances of work" here might represent a serial binary executable or a separate serial script. In the example below, we use the environment variable SLURM_PROCID to identify input files and output files for each of 64 requested instances of a serial executable serial-code.x
. Instead of launching the executable directly, an intermediate (wrapper) shell script is launched by srun
. Inside the wrapper script one has access to $SLURM_PROCID and can construct the serial workflow that is intended to execute a given instance:
#!/bin/bash --login #This example use general SBATCH settings, but please refer to the specific guide #of intended the cluster for possible needed changes # SLURM directives # # Here we specify to SLURM we want 64 tasks in a single node with # a wall-clock time limit of 1 hour (--time=01:00:00). # # Replace [your-project] with the appropriate project name # following --account (e.g., --account=project123). #SBATCH --account=[your-project] #SBATCH --nodes=1 #SBATCH --ntasks=64 #SBATCH --ntasks-per-node=64 #SBATCH --cpus-per-task=1 #SBATCH --mem=117G #Needed memory per node when share access (or use --exclusive for exclusive access) #SBATCH --time=01:00:00 #--- #Specific settings for the cluster you are on #(Check the specific guide of the cluster for additional settings) #--- # Launch 64 instances of wrapper script (make sure it's executable), srun -N 1 -n 64 -c 1 -m block:block:block ./wrapper.sh
The wrapper.sh
may look something like this:
#!/bin/bash # # This is a standard bash script which has access to the environment variable SLURM_PROCID # This is used to construct input filenames of the form input-0 input-1 ... input-63 # and similarly named output files INFILE="input-$SLURM_PROCID" OUTFILE="output-$SLURM_PROCID.out" # Assuming all the input files exist in the current directory, we run the executable. # Each instance will use the appropriate input and produce the relevant output. ./serial-code.x < $INFILE > $OUTFILE
Java Jobs
Example 1: Serial Java application
Here, we run a serial Java application (class Application) on one node.
#!/bin/bash --login #This example use general SBATCH settings, but please refer to the specific guide #of intended the cluster for possible needed changes # SLURM directives # # Here we specify to SLURM we want one node (--nodes=1) with # a wall-clock time limit of 1 hr (--time=01:00:00). # # Replace [your-project] with the appropriate project name # following --account (e.g., --account=project123). #SBATCH --account=[your-project] #SBATCH --nodes=1 #SBATCH --ntasks=1 #SBATCH --cpus-per-task=1 #SBATCH --mem=4G #specify when asking for shared access to compute nodes (or use --exclusive for exclusive access) #SBATCH --time=00:10:00 # Launch the job. # There is one task to run java in serial (-n 1). srun -N 1 -n 1 -c 1 java Application
Example 2: Two Java instances on one node
Running a single Java application on one node will not make use of all cores on that node (although it might require the entire available RAM). To run a number of instances of an application on the same node, an intermediate (or wrapper) application must be used via srun
. The following example uses two instances, which are identified via the environment variable SLURM_PROCID. This variable takes on a unique value (starting at zero) for each instance specified to srun
.
The SLURM script is as follows:
#!/bin/bash --login #This example use general SBATCH settings, but please refer to the specific guide #of intended the cluster for possible needed changes # Here we specify to SLURM we want one node (--nodes=1) with # a wall-clock time limit of ten minutes (--time=00:10:00). # # Replace [your-project] with the appropriate project name # following --account (e.g., --account=project123). #SBATCH --account=[your-project] #SBATCH --nodes=1 #SBATCH --ntasks=2 #this directive is required on setonix to request 2 tasks #SBATCH --cpus-per-task=1 #SBATCH --mem=3680M #specify when asking for shared access to compute nodes (or use --exclusive for exclusive access) #SBATCH --time=00:10:00 #--- #Specific settings for the cluster you are on #(Check the specific guide of the cluster for additional settings) # We request two instances "-n 2" to be placed on the node srun -N 1 -n 2 -c 1 java Wrapper
The Wrapper.java
application takes the form. Two instances of the Wrapper class are run (asynchronously)) which will be identical except for the value of SLURM_PROCID obtained from the environment. Appropriate program logic can be used to arrange, for example, specific input to an instance of an underlying application. Here, we simply report the value of SLURM_PROCID to standard output.
/* Wrapper to differentiate instances produced by srun */ /* The resulting "rank" may be used in conjunction with * program logic to run different tasks from within java. */ import java.io.*; class Wrapper { public static void main(String argv[]) { int rank; try { String slurm_proc_id = System.getenv("SLURM_PROCID"); rank = Integer.parseInt(slurm_proc_id); } catch (NumberFormatException e) { rank = -1; } if (rank == 0) { System.out.println("Running with SLURM_PROCID zero"); } if (rank == 1) { System.out.println("Running with SLURM_PROCID one"); } /* ...and so on */ return; } }
R Jobs
Interactive access to R for testing and development is available through the queue system.
$ salloc --nodes=1 --ntasks=1 --cpus-per-task=1 --mem=1840M --time=06:00:00 salloc: Granted job allocation 291021 setonix@nid00294:~> module load cray-R setonix@nid00294:~> srun -N 1 -n 1 -c 1 R --no-save R version 3.3.3 (2017-03-06) -- "Another Canoe" Copyright (C) 2017 The R Foundation for Statistical Computing ...
Trivial parallelism may be introduced by the following mechanism. An intermediate "wrapper" script is required between the SLURM submission script and the R script itself. A simple example is:
#!/bin/bash --login # SLURM script requesting one node with a time limit of 20 minutes. # Replace [your-project] with the appropriate project budget. #SBATCH --nodes=1 #SBATCH --ntasks-per-node=24 #this is required on setonix to request 24 tasks on a node #SBATCH --time=00:20:00 #SBATCH --account=[your-project] #SBATCH --export=NONE #SBATCH --mem=96G #SBATCH --cpus-per-task=1 # Launch 24 instances (-n 24) on 24 cores of the wrapper script srun -N 1 -n 24 -c 1 ./r-wrapper.sh
The wrapper script is r-wrapper.sh
. This script must be in the same location as the submission script and it must be executable (chmod 740 r-wrapper.sh
):
#!/bin/bash # # This script, running on the back-end, has access to the # environment variable $SLURM_PROCID, which will take on # values 0-23 when launched via srun -n 24. # # This is used as input to the R script, and to differentiate the output # as r-job-<jobid>-<instance>.out (where the jobid is the same for each # separate batch submission $SLURM_JOBID) R --no-save "--args $SLURM_PROCID " < my-script.R > r-$SLURM_JOBID-$SLURM_PROCID.log
Finally, the R script (my-script.R
) can be based on:
# The R script identifies its "rank" via the command line argument args <- commandArgs(TRUE) print ("This R script instance has input ") print (args)
Packing Small Multithreaded Jobs
If your application supports multithreading, you can use the srun -c
option to request the number of threads per instance on a node. You can also request a number of instances, as long as the number of threads times the number of instances does not exceed the total number of cores available within a node. Here is an example using OpenMP on Setonix. A similar approach can be used for pthreaded applications.
#!/bin/bash --login # SLURM directives # # Here we specify to SLURM we want two nodes (--nodes=2) with # a wall-clock time limit of twenty minutes (--time=00:20:00). # # Replace [your-project] with the appropriate project name # following --account (e.g., --account=project123). #SBATCH --account=[your-project] #SBATCH --nodes=2 #SBATCH --ntasks-per-node=4 #this directive is required on setonix to request 4 tasks on each node #SBATCH --ntasks-per-socket=2 #this directive indicates that maximum 2 tasks per socket are to be allocated #SBATCH --ntasks=8 #this directive is required on setonix to request a total of 8 tasks #SBATCH --cpus-per-task=32 #this directive is required on setonix to request 32 CPU cores for each task for the OpenMP threads #SBATCH --exclusive #always use when all node resources are needed (or use --mem for shared access) #SBATCH --time=00:20:00 # Set number of OpenMP threads export OMP_NUM_THREADS=32 #--- #Specific settings for the cluster you are on #(Check the specific guide of the cluster for additional settings) # Launch the job. # Here we use 8 instances (-n 8) with 4 per node with two instances per socket, # or NUMA region. Each instance requests 32 cores -c 32 (via -c $OMP_NUM_THREADS) srun -N 2 -n 8 -c ${OMP_NUM_THREADS} -m block:block:block ./wrapper.sh
The wrapper script in this case will invoke an OpenMP code. Again, the wrapper script must use SLURM_PROCID to differentiate the individual tasks (here 0-7) in an appropriate way for the workflow. For pthreaded applications, the number of threads must be communicated to the wrapper script and must be consistent with the value specified by the -c
option. (Note that in the case above, the number of threads is available to the wrapper script in the exported environment variable OMP_NUM_THREADS.)
Method 2: Using mpibash
An MPI implementation for bash is available through the module system. mpibash
provides an implementation of a limited number of key MPI routines and is described here. Using mpibash
presents a simple way to parallelise workflows based on standard bash scripts.
A simple example is given in the following snippet. Programmers who have used MPI should be familiar with the idiom.
#!/usr/bin/env mpibash # Note the mpibash shebang # # The following command informs bash of the location of the mpi_init commnd # which can then be used to initialise MPI enable -f mpibash.so mpi_init mpi_init mpi_comm_rank rank mpi_comm_size size echo "Hello from bash mpi rank $rank of $size" mpi_finalize
Remember to change the file permissions on the script so that anyone can execute it. For example, if you called the script mpi-bash.sh
:
$ chmod a+x mpi-bash.sh
The script can be launched with the following SLURM script on Setonix:
#!/bin/bash --login # We must load the mpibash module # # This particular example uses 48 MPI tasks on 2 nodes # # Note --export=none is necessary to avoid error messages of the form: # _pmi_inet_listen_socket_setup:socket setup failed #SBATCH --account=[your-project] #SBATCH --nodes=2 #SBATCH --ntasks-per-node=24 #this directive is required on setonix to request 24 tasks on each node #SBATCH --ntasks=48 #this directive is required on setonix to request a total of 48 tasks #SBATCH --mem=96G #SBATCH --time=00:02:00 #SBATCH --export=none module swap PrgEnv-cray PrgEnv-gnu #this is required for setonix module load mpibash export PMI_NO_PREINITIALIZE=1 #this is required for setonix export PMI_NO_FORK=1 #this is required for setonix srun -N 2 -n 48 -c ${OMP_NUM_THREADS} ./mpi-bash.sh
The bash script can contain anything appropriate for a normal workflow. It cannot, however, attempt to launch a standalone MPI executable.
Method 3: Using Python and mpi4py
The example in listing 19 runs a single serial Python script on a single node.
#!/bin/bash --login # SLURM directives # # Here we specify to SLURM we want one node (--nodes=1) with # a wall-clock time limit of ten minutes (--time=00:10:00). # # Replace [your-project] with the appropriate project name # following --account (e.g., --account=project123). #SBATCH --nodes=1 #SBATCH --ntasks=1 #this directive is required on setonix to request 1 task #SBATCH --time=00:10:00 #SBATCH --account=[your-project] #SBATCH --export=NONE #SBATCH --cpus-per-task=1 #SBATHC --mem=4G # Launch the job. # # Serial python script. Load the default python module with # # module load python # # Launch the script on the back end with srun -n 1 module load python srun -N 1 -n 1 -c 1 python ./serial-python.py # # If you have an executable python script with a "bang path", # make sure the path is of the form # # #!/usr/bin/env python srun -N 1 -n 1 -c 1 ./serial-python-x.py
Suggestions on how to pack many serial tasks on a single node using mpi4py
are given below.
#!/bin/bash --login # SLURM directives # # Here we specify to SLURM we want two nodes (--nodes=2) with # a wall-clock time limit of ten minutes (--time=00:10:00). # # Replace [your-project] with the appropriate project name # following --account (e.g., --account=project123). #SBATCH --nodes=2 #SBATCH --ntasks-per-node=24 #this directive is required on setonix to request 24 tasks on each node #SBATCH --ntasks=48 #this directive is required on setonix to request a total of 48 tasks #SBATCH --time=00:10:00 #SBATCH --account=[your-account] #SBATCH --export=NONE #SBATCH --mem=96G #SBATCH --cpus-per-task=c # Launch the job. # This python script uses the python module mpi4py, which we need # to load with # # module load mpi4py # # (which will also load the default python module as a dependency). # # The script is launched via srun -n 48, which specifies 48 MPI # tasks, and is invoked via the interpreter. module load mpi4py srun -N 2 -n 48 -c $OMP_NUM_THREADS python ./mpi-python.py # If you have an executable python script with a "bang path", # make sure the path is of the form # # #!/usr/bin/env python srun -N 2 -n 48 -c $OMP_NUM_THREADS ./mpi-python-x.py
Arrays or packing of many jobs requiring GPUs
Implementing parallelism for a given workflow sometimes means running many copies of a GPU code with different input parameters or data. This can be achieved with the two approaches already described in the sections above: job arrays and job packing.
When to use job arrays: For nodes that can be shared, the best practice is to use job arrays. A disadvantage of job packing on shared nodes is that unbalanced steps might lead to resources being held unnecessarily. When using arrays this problem does not exist because, as soon as any job finishes or fails, the resources for that job are freed for use by another user.
In the following example eight jobs are submitted as a job array, each using one GPU:
#!/bin/bash --login #SBATCH --account=[your-account]-gpu #SBATCH --array=0-7 #SBATCH --partition=gpu #SBATCH --nodes=1 #SBATCH --ntasks=1 #SBATCH --gpu=1 #SBATCH --time=00:10:00 #Go to the right directory for this instance of the job array using SLURM_ARRAY_TASK_ID as the identifier: #We are assuming all the input files needed for each specific job reside in the corresponding working directory cd workingDir_${SLURM_ARRAY_TASK_ID} #Run the hip executable (assuming the same executable will be used by each job, and that it resides in the submission directory): srun -u -N 1 -n 1 ${SLURM_SUBMIT_DIR}/main_hip
When to use job packing: When your workflow requieres the execution of several jobs, but each individual job does not requiere the whole resources of a node. In that case, you might consider to use job packing and execute your several jobs in the same node at the same time. Job packing allows to use all (or most) of the resources of the node. Ideally, multiple jobs should be packed in order to make use of all available GPUs in the node and jobs should have a similar estimated execution time to avoid load balancing issues. (Obviously if a single job can make use of all the GPUs, that is also desirable and that would not need packing.) We do not recommend packing jobs across multiple nodes with the same job script due to possible load balancing issues: all resources will be held and unavailable to other users/jobs until the last substep (job) in the packing finishes.
Plan for balanced execution times between packed tasks
In the following example, four srun steps are ran simultaneously in a single node, each step using one GPU. The header of SBATCH allocation request asks for all the resources in the node. And the srun command asks for specific resources for each step.
#!/bin/bash --login #SBATCH --account=[your-account]-gpu #SBATCH --partition=gpu #SBATCH --nodes=1 #SBATCH --ntasks=4 #SBATCH --ntasks-per-socket=2 #maximum 2 tasks per socket (each socket has 2 GPUs in this partition) #SBATCH --cpus-per-task=1 #SBATCH --gres=gpu:4 #SBATCH --time=00:10:00 #Default loaded compiler module is gcc module module load cuda for tagID in $(seq 0 3); do #Go to the right directory for this step of the job pack using tagID as the identifier: #We are assuming all the input files needed for each specific job reside in the corresponding working directory cd ${SLURM_SUBMIT_DIR}/workingDir_${tagID} #Defining an output file for this step outputFile=results_${tagID}.out echo "Starting" > $outputFile #Run the cuda executable (asuming the same executable will be used by each step, and that it resides in the submission directory): srun -u -N 1 -n 1 --mem=56G --gres=gpu:1 --exact ${SLURM_SUBMIT_DIR}/main_cuda >> $outputFile & done wait
Exactly the same effect (packing) can also be achieved by using the --gpu-bind
option of the Slurm scheduler and a wrapper:
#!/bin/bash --login #SBATCH --account=[your-account] #SBATCH --partition=nvlinkq #SBATCH --nodes=1 #SBATCH --ntasks=4 #SBATCH --ntasks-per-socket=2 #maximum 2 tasks per socket (each socket has 2 GPUs in this partition) #SBATCH --cpus-pert-task=1 #SBATCH --gres=gpu:4 #SBATCH --gpu-bind=map_gpu:0,1,2,3 #SBATCH --time=00:10:00 #Default loaded compiler module is gcc module module load cuda #Run the cuda executable from a wrapper: srun -u -N 1 -n 4 -c 1 wrapper.sh
And the
wrapper.sh
in this case is:
#!/bin/bash #Go to the right directory for this instance of the job pack using tagID as the identifier: #We are assuming all the input files needed for each specific job reside in the corresponding working directory cd ${SLURM_SUBMIT_DIR}/workingDir_${SLURM_PROCID} #Defining an output file for this process outputFile=results_${SLURM_PROCID}.out echo "Starting" > $outputFile #Check that the settings for this process are correct echo "SLURM_PROCID=$SLURM_PROCID" >> outputFile echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES" >> $outputFile echo "" >> $outputFile #Run the cuda executable (asuming the same executable will be used by each job, and that it resides in the submission directory): ${SLURM_SUBMIT_DIR}/main_cuda >> $outputFile
The --gpu-bind
setting will define the correct value for the environment variable CUDA_VISIBLE_DEVICES
for each process to work on a different GPU. So this variable will get the value of 0 for the first instance of the wrapper running in the node, 1 for the second, 2
for the third and 3
for the last one. In this way the four instances will run simultaneously, each instance utilising a different GPU in the node.