Nextflow

Nextflow is a reactive workflow framework and a domain-specific language (DSL) that eases the writing of data-intensive computational pipelines.

Nextflow enables scalable and reproducible scientific workflows, often using software containers. It allows the adaptation of pipelines written in the most common scripting languages. It can be particularly useful in high-throughput domains such as bioinformatics or radio-astronomy.

How to use Nextflow at Pawsey

Nextflow is currently installed on Pawsey systems.

To start using Nextflow, load the corresponding module:

$ module load nextflow/22.04.3

Besides setting the PATH to the Nextflow executable, the module also sets the value of NXF_HOME to $MYSOFTWARE/.nextflow. The default value, "$HOME/.nextflow", would not be suitable due to the home quota in place at Pawsey. For similar reasons, the module sets the value of NXF_SINGULARITY_CACHEDIR to $MYSOFTWARE/.nextflow_singularity.
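
As a quick check, you can print these variables after loading the module (the exact paths depend on how $MYSOFTWARE expands for your project and username):

$ module load nextflow/22.04.3
$ echo $NXF_HOME                    # should point at $MYSOFTWARE/.nextflow
$ echo $NXF_SINGULARITY_CACHEDIR    # should point at $MYSOFTWARE/.nextflow_singularity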

Do not start Nextflow pipelines from the login node. Instead, submit a dedicated Slurm script, as in this example:

Listing 1. job.sh
#!/bin/bash -l
 
#SBATCH --job-name=nextflow-master
#SBATCH --time=1-00:00:00
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1
 
module load nextflow/22.04.3

nextflow run hello
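
Submit the master script from the login node in the usual way:

$ sbatch job.sh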

If a pipeline needs to run for longer than the maximum allowed wall time, you can resubmit the job and resume from cached results by passing the -resume option to nextflow run, or by setting resume = true in the config file (see Listing 4).
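
For example, to resume the hello pipeline from Listing 1 after resubmitting the job (a sketch; resumption relies on the cached work directory being intact):

nextflow run hello -resume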

If you use Singularity containers to provide the software required by the Nextflow pipeline, you will also need to load the Singularity module:

Listing 2. job_singularity.sh
#!/bin/bash -l
 
#SBATCH --job-name=nextflow-master
#SBATCH --time=1-00:00:00
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1
  
module load nextflow/22.04.3
module load singularity/3.8.6

nextflow run <pipeline-using-containers>

Nextflow can also be configured to submit each step of the pipeline as a separate Slurm job; this is usually the best way to run it on a shared HPC cluster.

When this setup is used in conjunction with Singularity containers, the Slurm script must unset the Slurm variable SBATCH_EXPORT, otherwise Singularity will not be available in the child jobs:

Listing 3. job_slurm_singularity.sh
#!/bin/bash -l
 
#SBATCH --job-name=nextflow-master
#SBATCH --time=1-00:00:00 
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1

module load nextflow/22.04.3
module load singularity/3.8.6
unset SBATCH_EXPORT

nextflow run <pipeline-using-slurm-jobs-and-containers>

See the config example below for more details.

Template config file for Pawsey

Pipeline configuration properties are defined in a file named nextflow.config in the pipeline execution directory.

Here is a template nextflow.config file that defines a profile containing some default settings for starting Nextflow pipelines. 

Listing 4. Configuration file template nextflow.config
resume = true
 
profiles {
 
  setonix {
     
    process {
      cache = 'lenient'
      stageInMode = 'symlink'
    }
    workDir = "$MYSCRATCH/nxf_work"
     
    singularity {
      enabled = true
      envWhitelist = 'SINGULARITY_BINDPATH, SINGULARITYENV_LD_LIBRARY_PATH, SINGULARITYENV_LD_PRELOAD'
      cacheDir = "$MYSOFTWARE/.nextflow_singularity"
      runOptions = "--rocm"
    }
   
    process {
      withName: 'process1'            { container = 'repo1/container:1' }
    }
     
    params.slurm_account = 'pawseyXXXX'
    process {
      executor = 'slurm'
      clusterOptions = "--account=${params.slurm_account}"
      queue = 'work'
      cpus = 1
      time = '1h'
      memory = '1800MB'
       
      withName: 'process1' {
        cpus = 64
        time = '1d'
        memory = '120GB'
      }

      withLabel: 'gpu' {
        clusterOptions = "--account=${params.slurm_account}-gpu --gpus-per-node=1 --gpus-per-task=1"
        executor = 'slurm'
        queue = 'gpu'
      }
    }
    executor {
      $slurm {
        queueSize = 1024
      }
    }
   
  }
 
}
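
With this file in the pipeline execution directory, the profile defined above can be selected at run time with the -profile option; for example, inside the Slurm script of Listing 1 (the pipeline name is a placeholder):

nextflow run <pipeline> -profile setonix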

Notes on nextflow.config

Here are comments on some of the settings in this config template. If you need to know more about Nextflow syntax, see Nextflow docs (external site).

General settings

  • Use resume = true to resume pipelines by default.
  • Set the Nextflow cache mode to "lenient", otherwise "resume" won't work on parallel filesystems such as Lustre.
  • Set the workDir for your pipeline to a path under your scratch path.
  • Set the stageInMode to "symlink" if you want to save disk space in the "workDir".

Container usage

  • Nextflow has native support for Singularity containers, which can further increase the reproducibility and portability of pipelines.
  • Use envWhitelist to pass three Singularity variables, defined by the module, into Nextflow containerised processes; these are crucial, as they enable bind mounting of directories such as /scratch and ensure a proper configuration for MPI applications.
  • cacheDir defines a path for Singularity container images; by default they would be stored in the workDir of each workflow, so using a single shared location reduces the total number of images you have to download.
  • It is possible to specify a different container for each process in your pipeline script: by using the withName process selector, a particular container can be assigned to a process with a given name. See the Nextflow documentation for more examples; a minimal sketch follows this list.
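
As a minimal sketch (process names and container repositories are placeholders, in the style of the template above), different containers can be assigned to different processes:

process {
  withName: 'process1' { container = 'repo1/container:1' }
  withName: 'process2' { container = 'repo2/container:2' }
}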

Slurm interface

  • Turn on Slurm usage with "executor = 'slurm'".
  • Configure your Slurm account with a dedicated parameter, subsequently used by "clusterOptions".
  • Specify default resources with "queue", "cpus", "time", "memory".
  • You can specify process-specific resources using the syntax "withName".
  • Avoid overloading the queue by specifying a size with "queueSize".

GPU partition usage

The above template supports running jobs on the GPU partition. Firstly, `runOptions = "--rocm"` tells Singularity to enable GPU support; you can delete that line if you are not using GPUs. Secondly, if you assign a `gpu` label to the relevant processes in your main.nf file, you can control how they are executed via your config file. See the example below:

Listing 5. Example Nextflow processes using labels
process bigTask {
  label 'gpu'

  '''
  <task script>
  '''
} 

process anotherBigTask {
  label 'gpu'

  '''
  <task script>
  '''
} 


The same label can be applied to more than one process, and multiple labels can be applied to the same process by using the label directive more than once, as in the sketch below. A label must consist of alphanumeric characters or underscores, must start with an alphabetic character and must end with an alphanumeric character. See Job Scheduling for more information about the GPU partition, and Example Slurm Batch Scripts for Setonix on GPU Compute Nodes for more information about requesting appropriate GPU resources on Setonix.
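
As a brief sketch (the process name and the second label are hypothetical), a process can carry several labels by repeating the directive:

process yetAnotherBigTask {
  label 'gpu'
  label 'big_mem'

  '''
  <task script>
  '''
}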

External links