Nextflow
Nextflow enables scalable and reproducible scientific workflows, often using software containers. It allows the adaptation of pipelines written in the most common scripting languages, and it is particularly useful in high-throughput domains such as bioinformatics and radio astronomy.
Versions installed in Pawsey systems
To check the currently installed versions, use the module avail command (current versions may differ from those shown here):
$ module avail nextflow

--------------------------- /software/setonix/2024.05/modules/zen3/gcc/12.2.0/utilities ---------------------------
   nextflow/22.10.0    nextflow/23.10.0 (D)
How to use Nextflow at Pawsey
Nextflow is currently installed on Pawsey systems.
To start using Nextflow, load the corresponding module:
$ module load nextflow/<VERSION>
Besides setting the PATH to the Nextflow executable, the module also sets NXF_HOME to $MYSOFTWARE/.nextflow. The default value, "$HOME/.nextflow", would not be suitable due to the home quota in place at Pawsey. For similar reasons, the module also sets NXF_SINGULARITY_CACHEDIR to $MYSOFTWARE/.nextflow_singularity.
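After loading the module, you can confirm these defaults by inspecting the two variables:

$ module load nextflow/<VERSION>
$ echo $NXF_HOME
$ echo $NXF_SINGULARITY_CACHEDIR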
Do not start Nextflow pipelines from the login node. Instead, submit a dedicated Slurm script, as in this example:
#!/bin/bash -l

#SBATCH --job-name=nextflow-master
#SBATCH --time=1-00:00:00
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1

module load nextflow/<VERSION>

nextflow run hello
If a pipeline needs to run for longer than the maximum allowed wall time, you can resume it by passing the -resume option to nextflow run, or by setting resume = true in the config file (see the template below).
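For example, a resubmission script that picks up where the previous run stopped could look like this (using the hello pipeline from the example above):

#!/bin/bash -l

#SBATCH --job-name=nextflow-master
#SBATCH --time=1-00:00:00
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1

module load nextflow/<VERSION>

# -resume reuses the cached results of the previous, timed-out run
nextflow run hello -resume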
If using Singularity containers to provide the software required by the Nextflow pipeline, you will need to also load the Singularity module (check our Singularity documentation for further information about the use of containers):
#!/bin/bash -l

#SBATCH --job-name=nextflow-master
#SBATCH --time=1-00:00:00
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1

module load nextflow/<VERSION>
module load singularity/<VERSION>

nextflow run <pipeline-using-containers>
Nextflow can also be configured to submit each step of the pipeline as a separate Slurm job; this is usually the best way to run it on a shared HPC cluster.
When this setup is used in conjunction with Singularity containers, the Slurm script must unset the Slurm variable SBATCH_EXPORT, otherwise Singularity will not be available in the child jobs:
#!/bin/bash -l

#SBATCH --job-name=nextflow-master
#SBATCH --time=1-00:00:00
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1

module load nextflow/<VERSION>
module load singularity/<VERSION>

unset SBATCH_EXPORT

nextflow run <pipeline-using-slurm-jobs-and-containers>
See the config example below for more details.
Template config file for Pawsey
Pipeline configuration properties are defined in a file named nextflow.config in the pipeline execution directory.
Here is a template nextflow.config file that defines a profile containing some default settings for starting Nextflow pipelines.
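The following is a minimal sketch of such a file, assembled from the settings discussed in the notes below; the profile name (setonix), the partition name, the queue size, the default resource values, the example process name bigTask and the whitelisted Singularity variable names are assumptions that you should adapt to your project:

// Sketch of a nextflow.config profile for Pawsey; adapt the assumed values before use.

// Dedicated parameter holding your Slurm account (placeholder value).
params.slurm_account = 'pawseyXXXX'

resume = true                                        // resume pipelines by default
workDir = '/scratch/<project>/<user>/nxf_work'       // keep work files under /scratch

profiles {
    setonix {                                        // assumed profile name
        process {
            cache = 'lenient'                        // required for "resume" on Lustre
            stageInMode = 'symlink'                  // saves disk space in the workDir
            executor = 'slurm'
            // Closure so that the account parameter can still be overridden at run time.
            clusterOptions = { "--account=${params.slurm_account}" }
            queue = 'work'                           // assumed partition name
            cpus = 1                                 // assumed default resources
            time = '1h'
            memory = '4GB'

            // Process-specific resources for a hypothetical process name.
            withName: 'bigTask' {
                cpus = 16
                memory = '32GB'
                time = '12h'
            }
        }
        executor {
            queueSize = 512                          // assumed limit on queued Slurm jobs
        }
        singularity {
            enabled = true
            // Assumed names of the three variables exported by the singularity module;
            // confirm them with: module show singularity/<VERSION>
            envWhitelist = 'SINGULARITY_BINDPATH, SINGULARITYENV_LD_LIBRARY_PATH, SINGULARITYENV_LD_PRELOAD'
            // Matches the NXF_SINGULARITY_CACHEDIR default set by the nextflow module.
            cacheDir = "${System.getenv('MYSOFTWARE')}/.nextflow_singularity"
            runOptions = '--rocm'                    // enables GPU support; remove if not using GPUs
        }
    }
}

With a named profile like this one, select it at run time with nextflow run <pipeline> -profile setonix.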
Notes on nextflow.config
Here are comments on some of the settings in this config template. If you need to know more about Nextflow syntax, see Nextflow docs (external site).
General settings
- Use resume = true to resume pipelines by default.
- Set the Nextflow cache mode to "lenient", otherwise "resume" won't work on parallel filesystems such as Lustre.
- Set the workDir for your pipeline to a path under /scratch.
- Set stageInMode to "symlink" if you want to save disk space in the workDir.
Container usage
- Nextflow has native support for Singularity containers, which can further increase the reproducibility and portability of pipelines.
- Use envWhitelist to pass three Singularity variables, defined by the module, into Nextflow containerised processes; these are crucial, as they enable the bind mounting of directories such as /scratch, as well as a proper configuration for MPI applications.
- cacheDir defines a path for Singularity container images; the default behaviour is to store them in the "workDir" of each workflow, so putting them under a unified location reduces the total number of images you have to download.
- It is possible to specify a different container for each process in your pipeline script. By using the withName selector, a particular container can be used for a process with a certain name; see the sketch below and the Nextflow documentation for more examples.
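As a sketch (the process name bigTask and the image reference are hypothetical placeholders), a per-process container can be selected in nextflow.config like this:

process {
    withName: 'bigTask' {                        // hypothetical process name
        container = '<repository>/<image>:<tag>' // replace with the required image
    }
}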
Slurm interface
- Turn on Slurm usage with "executor = 'slurm'".
- Configure your Slurm account with a dedicated parameter, subsequently used by "clusterOptions".
- Specify default resources with "queue", "cpus", "time" and "memory".
- You can specify process-specific resources using the "withName" selector, as in the "withName" block of the template sketch above.
- Avoid overloading the queue by specifying a size with "queueSize".
GPU partition usage
The above template now supports running jobs on the GPU partition. Firstly, the `runOptions = "--rocm"` setting tells Singularity to enable GPU support; you can delete that line if you are not using GPUs. Secondly, if you provide a gpu label to the relevant processes in your main.nf file, you can control how they will be executed via your config file. See the example below:
process bigTask {
    label 'gpu'

    '''
    <task script>
    '''
}

process anotherBigTask {
    label 'gpu'

    '''
    <task script>
    '''
}
The same label can be applied to more than one process, and multiple labels can be applied to the same process by using the label directive more than once. A label must consist of alphanumeric characters or _, must start with an alphabetic character and must end with an alphanumeric character. See Job Scheduling for more information about the GPU partition, and Example Slurm Batch Scripts for Setonix on GPU Compute Nodes for more information about requesting appropriate GPU resources on Setonix.
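On the configuration side, labelled processes can be matched with the withLabel selector. The following is only a sketch: the GPU partition name, the account suffix and the GPU request options are assumptions, so check the pages linked above for the correct values on Setonix.

process {
    withLabel: 'gpu' {
        queue = 'gpu'   // assumed name of the GPU partition
        // Assumed account suffix and GPU request; adjust following the linked documentation.
        clusterOptions = { "--account=${params.slurm_account}-gpu --gpus-per-node=1" }
    }
}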
External links
Nextflow homepage ("Data-driven computational pipelines")