Nimbus for Bioinformatics
- Audrey Stott
Summary
This page covers information on how to use the 'Pawsey Bio - Ubuntu 22.04 - 2023-XX' image for Nimbus. Instructions on how to choose this image when creating your instance can be found here. This Bio-image is created to cater to bioinformatics users who prefer to have their instances pre-installed with software, tools and datasets commonly used in the bioinformatics domain, including over 8000 Biocontainer tools.
Commonly used software
Software | Information | Usage/Notes |
---|---|---|
Ansible | An automation platform that Pawsey uses to automate a number of software deployments | |
CernVM-FS | A read-only file system for accessing files on shared repositories | See 1. Biocontainers and Reference Genome data |
Docker | A popular container engine | |
Google Chrome | A web browser | Make sure to SSH in to your instance with X11 forwarding, i.e. ssh -X (or -Y) ubuntu@146.XX.XXX, and have XQuartz installed if you are using macOS |
Jupyter Notebook | For using an interactive Jupyter Notebook | See Run Jupyter Notebook Interactively |
Lmod | A modules environment that we use at Pawsey for loading software | |
Nextflow | A popular workflow manager - "Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages." | Nextflow can leverage the containers from /cvmfs/singularity.galaxyproject.org or container modules - See 4. Using Nextflow |
Pip | A Python package installer | |
Python3 | A popular programming language used by many bioinformatics tools | |
RStudio | For using R interactively | See Run RStudio Interactively |
Singularity | A popular container engine that can be used on HPC | |
Singularity-HPC | A container modules installer | See 2. Using Biocontainers |
Spack | A package management tool | |
X2go | A virtual desktop application | X2go server has been pre-installed on the image. To use it, you will need to install X2go client on your local machine - see Installing X2go Client |
Instructions
On this page, we only cover instructions for how to access and use Biocontainers and reference genome data sets. For instructions on the software listed above, please see the software's original documentation page.
1. Biocontainers and Reference Genome data
CernVM-FS is a read-only file system that was developed at CERN. It allows files such as container tools, reference datasets and other shared resources that are commonly used by many researchers to be accessed, added to, and updated in one location. At Pawsey, we currently cache the Biocontainer tools and reference genome datasets that are on Galaxy Project's repositories. The list of Biocontainer tools available can be searched on https://biocontainers.pro/registry.
If you only want to use the Biocontainer tools, you can skip this step and proceed to the next section, 2. Using the Biocontainer tools.
To view these repositories, you can do the following:
List the Biocontainer tools repository:
$ ls -la /cvmfs/singularity.galaxyproject.org
total 140
drwxr-xr-x 33 cvmfs cvmfs 4096 Mar 25 2020 .
drwxr-xr-x 3 cvmfs cvmfs 4096 May 12 2022 1
drwxr-xr-x 3 cvmfs cvmfs 4096 Jun 29 2020 2
drwxr-xr-x 4 cvmfs cvmfs 4096 Feb 10 07:09 3
drwxr-xr-x 24 cvmfs cvmfs 4096 Feb 23 2021 a
drwxr-xr-x 4 cvmfs cvmfs 4096 Mar 7 19:06 all
drwxr-xr-x 23 cvmfs cvmfs 4096 Sep 2 2022 b
drwxr-xr-x 26 cvmfs cvmfs 4096 Feb 10 15:05 c
drwxr-xr-x 22 cvmfs cvmfs 4096 Feb 10 15:05 d
drwxr-xr-x 23 cvmfs cvmfs 4096 May 1 2020 e
drwxr-xr-x 21 cvmfs cvmfs 4096 Feb 10 19:06 f
drwxr-xr-x 25 cvmfs cvmfs 4096 Feb 5 2022 g
drwxr-xr-x 19 cvmfs cvmfs 4096 Feb 10 21:21 h
drwxr-xr-x 18 cvmfs cvmfs 4096 Feb 10 21:21 i
drwxr-xr-x 14 cvmfs cvmfs 4096 Jul 10 2020 j
drwxr-xr-x 21 cvmfs cvmfs 4096 Feb 25 2021 k
drwxr-xr-x 16 cvmfs cvmfs 4096 Feb 10 22:54 l
drwxr-xr-x 26 cvmfs cvmfs 4096 Feb 10 22:54 m
drwxr-xr-x 20 cvmfs cvmfs 4096 Feb 11 05:13 n
drwxr-xr-x 14 cvmfs cvmfs 4096 May 7 2020 o
drwxr-xr-x 24 cvmfs cvmfs 4096 Aug 24 2021 p
drwxr-xr-x 12 cvmfs cvmfs 4096 Feb 27 2021 q
drwxr-xr-x 27 cvmfs cvmfs 4096 Jun 28 2022 r
drwxr-xr-x 26 cvmfs cvmfs 4096 Apr 7 2020 s
drwxr-xr-x 24 cvmfs cvmfs 4096 Feb 11 23:17 t
drwxr-xr-x 13 cvmfs cvmfs 4096 Feb 11 23:17 u
drwxr-xr-x 18 cvmfs cvmfs 4096 Feb 11 23:17 v
drwxr-xr-x 17 cvmfs cvmfs 4096 Jul 28 2021 w
drwxr-xr-x 16 cvmfs cvmfs 4096 Apr 7 2020 x
drwxr-xr-x 4 cvmfs cvmfs 4096 Feb 28 2021 y
drwxr-xr-x 10 cvmfs cvmfs 4096 Feb 11 23:17 z
Note: It may take a minute or two to load the folders. When you have done it once, it will not take as long to show again. The /cvmfs/singularity.galaxyproject.org/all subdirectory is where the entire list of 8000+ Biocontainers can be found, with the alphabetical subdirectories being symlinks to them.
List the reference genome sets and other data files:
$ ls -la /cvmfs/data.galaxyproject.org/
total 14
drwxr-xr-x 4 cvmfs cvmfs 4096 Mar 31 2016 .
-rw-r--r-- 1 cvmfs cvmfs 21 Oct 24 2018 .cvmfsdirtab
drwxr-xr-x 210 cvmfs cvmfs 4096 Apr 21 2022 byhand
drwxr-xr-x 18 cvmfs cvmfs 4096 Nov 24 2020 managed
Please note that the data sets may not be comprehensive, and this service is not meant to replace your current methods for accessing public datasets.
To use these data files for your analyses, copy the absolute file path into your workflow/pipeline. For example, with the reference genome Hg38, the file can be found in the following location, and specifically under the seq subdirectory:
$ ls -la /cvmfs/data.galaxyproject.org/byhand/hg38
total 46
drwxrwxr-x 10 cvmfs cvmfs 4096 Apr 22 2016 .
drwxr-xr-x 210 cvmfs cvmfs 4096 Apr 21 2022 ..
-rw-r--r-- 1 cvmfs cvmfs 0 Apr 22 2016 .cvmfscatalog
drwxrwxr-x 3 cvmfs cvmfs 4096 Jan 21 2015 download
drwxrwxr-x 6 cvmfs cvmfs 4096 Jan 20 2015 hg38canon
drwxrwxr-x 6 cvmfs cvmfs 4096 Jan 20 2015 hg38female
drwxrwxr-x 6 cvmfs cvmfs 4096 Jan 20 2015 hg38full
drwxrwxr-x 2 cvmfs cvmfs 4096 Mar 18 2014 liftOver
drwxrwxr-x 2 cvmfs cvmfs 4096 Mar 18 2014 picard_index
drwxrwxr-x 2 cvmfs cvmfs 4096 Mar 18 2014 sam_index
drwxrwxr-x 2 cvmfs cvmfs 4096 Apr 1 2016 seq
$ ls -la /cvmfs/data.galaxyproject.org/byhand/hg38/seq
total 10108046
drwxrwxr-x 2 cvmfs cvmfs 4096 Apr 1 2016 .
drwxrwxr-x 10 cvmfs cvmfs 4096 Apr 22 2016 ..
-rw-rw-r-- 1 cvmfs cvmfs 136 Mar 18 2014 README
-rw-rw-r-- 1 cvmfs cvmfs 835393456 Mar 18 2014 hg38.2bit
lrwxrwxrwx 1 cvmfs cvmfs 11 May 17 2014 hg38.fa -> hg38full.fa
-rw-r--r-- 1 cvmfs cvmfs 19327 Aug 24 2015 hg38.fa.fai
-rw-rw-r-- 1 cvmfs cvmfs 3150052305 Mar 17 2014 hg38canon.fa
-rw-rw-r-- 1 cvmfs cvmfs 3091680335 Mar 17 2014 hg38female.fa
-rw-r--r-- 1 cvmfs cvmfs 757 Apr 1 2016 hg38female.fa.fai
-rw-rw-r-- 1 cvmfs cvmfs 3273481150 Mar 18 2014 hg38full.fa
So the full absolute path for the Hg38 sequence file would be:
/cvmfs/data.galaxyproject.org/byhand/hg38/seq/hg38full.fa
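Since hg38.fa in that directory is a symlink to hg38full.fa, it can be useful to resolve the real file before passing it to a pipeline. The following is a minimal sketch, assuming GNU readlink is available (as on Ubuntu); the function name resolve_ref and the variables REF_DIR and REF are hypothetical names chosen for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: resolve a reference path to the real file behind any symlink.
resolve_ref() {
    local ref="$1"
    if [ ! -r "$ref" ]; then
        printf 'Reference not readable: %s\n' "$ref" >&2
        return 1
    fi
    readlink -f "$ref"
}

# REF_DIR is taken from the listing on this page; hg38.fa is a symlink to hg38full.fa.
REF_DIR=/cvmfs/data.galaxyproject.org/byhand/hg38/seq
if REF=$(resolve_ref "$REF_DIR/hg38.fa" 2>/dev/null); then
    echo "Reference genome: $REF"
else
    echo "Reference not available; check that CernVM-FS is mounted."
fi
```

The resolved value of REF can then be pasted or passed into your workflow as the absolute path.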
If you run into any errors with accessing the file system, run the following to re-install it:
sudo apt-get autoremove cvmfs
sudo apt-get purge cvmfs
sudo rm -rf /etc/cvmfs/
cd /home/ubuntu
git clone https://github.com/PawseySC/Pawsey-CernVM-FS.git
cd Pawsey-CernVM-FS
sudo ./install-cvmfs.sh install
If it is still causing errors, you may need to reboot your instance.
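Before or after re-installing, you may want a quick way to confirm that both repositories can be listed. The sketch below is one possible check, not part of the Pawsey tooling; check_repos is a hypothetical function name, and it only tests whether each directory can be listed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether each given directory can be listed.
# Listing a CernVM-FS path also triggers its on-demand mount.
check_repos() {
    local status=0 repo
    for repo in "$@"; do
        if ls "$repo" >/dev/null 2>&1; then
            echo "OK      $repo"
        else
            echo "FAILED  $repo"
            status=1
        fi
    done
    return "$status"
}

check_repos /cvmfs/singularity.galaxyproject.org /cvmfs/data.galaxyproject.org \
    || echo "At least one repository is not accessible."
```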
2. Using the Biocontainer tools
Singularity-HPC (SHPC) is a tool for managing container modules. In this Pawsey Bio - Ubuntu 22.04 - 2023-XX image, we have integrated the use of SHPC seamlessly with CernVM-FS. This means that you can easily access and use over 8000 Biocontainers (including the latest versions) without needing to understand container syntax.
If you are using the now deprecated 'Pawsey Bio - Ubuntu 20.04 - 2021-11' image, you will not be able to seamlessly use Biocontainer tools without first installing them using SHPC, as described in section 3. Adding a local SHPC registry. To avoid that, we recommend that you create a new instance with the 'Pawsey Bio - Ubuntu 22.04 - 2023-XX' image.
Data directories
When using Biocontainer tools, you will be required to export the paths for your data directory(ies) to Singularity, so that they can be readable by the container. For example, if your data directory is /data, then you would run the following to add it to the Singularity bind path:
export SINGULARITY_BINDPATH=/data
echo 'export SINGULARITY_BINDPATH=/data' >> ~/.bashrc
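If you have more than one data directory, SINGULARITY_BINDPATH accepts a comma-separated list. The following sketch shows one way to build that list; join_bind_paths is a hypothetical helper, and /data and /scratch are placeholder directories.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join directories into a comma-separated bind list.
join_bind_paths() {
    local IFS=,
    printf '%s\n' "$*"
}

# /data and /scratch are placeholder directories for this sketch.
export SINGULARITY_BINDPATH="$(join_bind_paths /data /scratch)"
echo "$SINGULARITY_BINDPATH"
```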
To search for versions and information on a particular tool, e.g. for cuttlefish, use shpc show:
$ shpc show quay.io/biocontainers/cuttlefish
url: https://biocontainers.pro/tools/cuttlefish
maintainer: '@vsoch'
description: shpc-registry automated BioContainers addition for cuttlefish
latest:
  2.2.0--hf1761c0_0: sha256:63cdd7778b144a37684ae53b8e760ed00852f3010aa79292b3f1a6a6470f0992
tags:
  2.1.0--hf1761c0_0: sha256:aa009abd48c372125e060d39f49f1690be74b6dac276d451bf1cc4c847a914d6
  2.1.1--hf1761c0_0: sha256:8bccced83dd6bbf87843cf08851563c812ef7c36afda2efbcf0d54f9102b913f
  2.2.0--hf1761c0_0: sha256:63cdd7778b144a37684ae53b8e760ed00852f3010aa79292b3f1a6a6470f0992
docker: quay.io/biocontainers/cuttlefish
aliases:
  cuttlefish: /usr/local/bin/cuttlefish
To check for availability and to load the tool, use the module avail and module load commands:
$ module avail cuttlefish
------------------------------- /home/ubuntu/singularity-hpc/modules -------------------------------
quay.io/biocontainers/cuttlefish/2.2.0--hf1761c0_0/module
$ module load quay.io/biocontainers/cuttlefish/2.2.0--hf1761c0_0/module
$ cuttlefish --version
cuttlefish 2.2.0
Supported commands: `build`, `help`, `version`.
Usage:
cuttlefish build [options]
Note that on first use, the tool might take 30 seconds or so to run the command, as the container is being accessed from the filesystem for the first time.
To check for the list of modules loaded:
$ module list
Currently Loaded Modules:
1) quay.io/biocontainers/cuttlefish/2.2.0--hf1761c0_0/module
This list will be cleared whenever you log out of your instance. After logging back in, you will need to reload the module for it to be on the list for use.
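Because loaded modules are cleared at logout, it can help to script the load step. This is a minimal sketch, assuming module names follow the quay.io/biocontainers/<tool>/<tag>/module pattern shown by module avail on this page; biocontainer_module is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the module name for a Biocontainer tool and tag,
# following the naming pattern shown by `module avail` on this page.
biocontainer_module() {
    local tool="$1" tag="$2"
    printf 'quay.io/biocontainers/%s/%s/module\n' "$tool" "$tag"
}

mod=$(biocontainer_module cuttlefish 2.2.0--hf1761c0_0)
echo "$mod"

# Only attempt the load where the Lmod `module` function is defined.
if type module >/dev/null 2>&1; then
    module load "$mod"
fi
```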
To install another version not available as a module:
To use another version of the tool available from the above shpc show list, use shpc to install the module, ensuring to use and keep the cvmfs path of the container:
$ sudo shpc install quay.io/biocontainers/cuttlefish:2.1.1--hf1761c0_0 /cvmfs/singularity.galaxyproject.org/all/cuttlefish:2.1.1--hf1761c0_0 --keep-path
Module quay.io/biocontainers/cuttlefish:2.1.1--hf1761c0_0 was created.
Update the modules cache, then run the same module commands to load this version of the tool:
$ /usr/local/lmod/lmod/libexec/update_lmod_system_cache_files -d /opt/mData/cacheDir -t /opt/mData/cacheTS.txt /home/ubuntu/singularity-hpc/modules
$ module avail cuttlefish
------------------------------- /home/ubuntu/singularity-hpc/modules -------------------------------
quay.io/biocontainers/cuttlefish/2.1.1--hf1761c0_0/module
quay.io/biocontainers/cuttlefish/2.2.0--hf1761c0_0/module (L,D)
Where:
L: Module is loaded
D: Default Module
$ module load quay.io/biocontainers/cuttlefish/2.1.1--hf1761c0_0/module
The following have been reloaded with a version change:
1) quay.io/biocontainers/cuttlefish/2.2.0--hf1761c0_0/module => quay.io/biocontainers/cuttlefish/2.1.1--hf1761c0_0/module
You will notice that the previous version of the tool has been swapped out for the version you just loaded:
$ module list
Currently Loaded Modules:
1) quay.io/biocontainers/cuttlefish/2.1.1--hf1761c0_0/module
If you prefer to use the biocontainers without SHPC, you can do so by using the absolute path for each of the biocontainers. Note that you would require knowledge on how to use Singularity to do so. The version of Singularity installed on the Nimbus Bio image is 3.8.7 and instructions can be found here: Singularity exec.
For example, to use cuttlefish version 2.2.0--hf1761c0_0:
$ ls /cvmfs/singularity.galaxyproject.org/all/cuttlefish:2.2.0*
/cvmfs/singularity.galaxyproject.org/all/cuttlefish:2.2.0--h6a68c12_1
/cvmfs/singularity.galaxyproject.org/all/cuttlefish:2.2.0--h6a68c12_2
/cvmfs/singularity.galaxyproject.org/all/cuttlefish:2.2.0--hf1761c0_0
/cvmfs/singularity.galaxyproject.org/all/cuttlefish:2.2.0--hf1761c0_1
$ singularity exec /cvmfs/singularity.galaxyproject.org/all/cuttlefish:2.2.0--hf1761c0_0 cuttlefish
cuttlefish 2.2.0
Supported commands: `build`, `help`, `version`.
Usage:
cuttlefish build [options]
When using Singularity, if you run into an issue with no loop devices found, please use the solution provided here: Using Containers#Commonissues
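If you call containers directly like this often, a small wrapper can save retyping the CernVM-FS path. The sketch below is one possible approach under stated assumptions: run_biocontainer, CONTAINER_ROOT and DRY_RUN are hypothetical names, and images are assumed to be named <tool>:<tag> under the all subdirectory, as in the listing on this page.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command inside a Biocontainer image from CernVM-FS.
# Set DRY_RUN=1 to print the command instead of executing it.
CONTAINER_ROOT=/cvmfs/singularity.galaxyproject.org/all

run_biocontainer() {
    local tool="$1" tag="$2"
    shift 2
    local cmd=(singularity exec "$CONTAINER_ROOT/$tool:$tag" "$@")
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}

# Print the command that would be run, without needing Singularity installed.
DRY_RUN=1 run_biocontainer cuttlefish 2.2.0--hf1761c0_0 cuttlefish version
```

Dropping DRY_RUN=1 runs the command for real, with your data directories still taken from SINGULARITY_BINDPATH.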
3. Adding a local SHPC registry
If there are versions (usually older ones) of a Biocontainer tool that are present in the cvmfs repository but not on the shpc show list (i.e. the default recipe), you can create a local SHPC registry and add/update a recipe file for the Biocontainer tool.
Clone the remote SHPC-registry and add it as a local registry:
$ cd /home/ubuntu
$ git clone https://github.com/singularityhub/shpc-registry.git
$ sudo shpc config add registry /home/ubuntu/shpc-registry/
Warning: Check with shpc config edit - ordering of list can change.
Added registry to /home/ubuntu/shpc-registry/
Look up the available versions of the tool, e.g. cuttlefish:
$ ls /cvmfs/singularity.galaxyproject.org/all/cuttlefish*
/cvmfs/singularity.galaxyproject.org/all/cuttlefish:1.0.0--h2e03b76_0
/cvmfs/singularity.galaxyproject.org/all/cuttlefish:1.0.0--h2e03b76_1
/cvmfs/singularity.galaxyproject.org/all/cuttlefish:2.0.0--h95f258a_0
/cvmfs/singularity.galaxyproject.org/all/cuttlefish:2.0.0--hf1761c0_1
/cvmfs/singularity.galaxyproject.org/all/cuttlefish:2.1.0--hf1761c0_0
/cvmfs/singularity.galaxyproject.org/all/cuttlefish:2.1.1--hf1761c0_0
/cvmfs/singularity.galaxyproject.org/all/cuttlefish:2.2.0--hf1761c0_0
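To get just the version tags rather than full paths, the image names can be split at the colon. This is a minimal sketch, assuming images are named <tool>:<tag> as in the listing; list_tags is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the version tags available for a tool in a
# directory of container images named <tool>:<tag>.
list_tags() {
    local dir="$1" tool="$2" image
    for image in "$dir/$tool":*; do
        [ -e "$image" ] || continue   # skip the unexpanded pattern when nothing matches
        printf '%s\n' "${image##*:}"  # keep only the text after the last colon
    done
}

list_tags /cvmfs/singularity.galaxyproject.org/all cuttlefish
```

Matching on the tool name followed by a colon also avoids picking up other tools whose names merely start with the same letters.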
To add a different version to the recipe file for the tool, e.g. 2.0.0--hf1761c0_1:
$ shpc add /cvmfs/singularity.galaxyproject.org/all/cuttlefish:2.0.0--hf1761c0_1 quay.io/biocontainers/cuttlefish:2.0.0--hf1761c0_1 --registry /home/ubuntu/shpc-registry/
quay.io/biocontainers/cuttlefish:2.0.0--hf1761c0_1 already exists and will be updated!
Registry entry quay.io/biocontainers/cuttlefish was added! Before shpc install, edit:
/home/ubuntu/shpc-registry/quay.io/biocontainers/cuttlefish/container.yaml
To install the tool:
If you are using the now deprecated 'Pawsey Bio - Ubuntu 20.04 - 2021-11' image, run the following steps first:
cd /home/ubuntu
git clone https://github.com/singularityhub/singularity-hpc.git
mkdir /home/ubuntu/singularity-hpc/modules
sudo shpc config set module_base /home/ubuntu/singularity-hpc/modules
cat >> ~/.bashrc <<'EOF'
module use /home/ubuntu/singularity-hpc/modules
EOF
source ~/.bashrc
By setting your module_base to this new location, all new container modules will be installed to this path.
$ sudo shpc install quay.io/biocontainers/cuttlefish:2.0.0--hf1761c0_1 /cvmfs/singularity.galaxyproject.org/all/cuttlefish:2.0.0--hf1761c0_1 --keep-path
Module biocontainers/cuttlefish:2.0.0--hf1761c0_1 was created.
Now when you do a module avail, the newly installed 2.0.0--hf1761c0_1 version will be available:
$ module avail cuttlefish
------------------------------- /home/ubuntu/singularity-hpc/modules -------------------------------
quay.io/biocontainers/cuttlefish/2.0.0--hf1761c0_1/module
Since you have created your own local registry, shpc will default to your local registry whenever you do a look up with shpc show. To look up the full list of Biocontainer tools with the latest versions, you will need to add a flag to point to the remote (Github) shpc-registry in your search:
$ shpc show quay.io/biocontainers/cuttlefish --registry https://github.com/singularityhub/shpc-registry
The shpc-registry is kept up-to-date with the latest versions of all Biocontainers on a nightly update.
4. Using Nextflow
Nextflow makes use of containers to run your workflows sequentially. Each step of your workflow is called a process. For each process, Nextflow pulls the appropriate container required for that step to run it. To save time and space, you can prevent Nextflow from pulling the container and have it use what is already present on your instance instead.
To do so, you would create an additional config file to point Nextflow to either 1) the paths of the containers on /cvmfs/singularity.galaxyproject.org, or 2) the module paths on your instance. Nextflow prioritises this custom config file above the default nextflow.config file(s), if present, in other directories for your workflow.
Nextflow pipelines for Bioinformatics
Nextflow has a repository of pipelines that are available through https://nf-co.re. These are becoming increasingly popular, as more peer-reviewed pipelines are added by the community. A couple of popular ones include:
Configuring Nextflow to use Biocontainers
Please note this is only an example of how you can configure your Nextflow workflow to use the containers available from your instance.
Suppose you are using the nf-core/rnaseq pipeline. Note that in the main.nf for the FASTQC process, there are a few parameters for the tool. These will be overridden by the config file that you will create in the next step.
$ cat rnaseq/modules/nf-core/fastqc/main.nf
process FASTQC {
    tag "$meta.id"
    label 'process_medium'

    conda "bioconda::fastqc=0.11.9"
    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
        'https://depot.galaxyproject.org/singularity/fastqc:0.11.9--0' :
        'quay.io/biocontainers/fastqc:0.11.9--0' }"

    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("*.html"), emit: html
    tuple val(meta), path("*.zip") , emit: zip
    path "versions.yml" , emit: versions

    when:
    task.ext.when == null || task.ext.when

    script:
    def args = task.ext.args ?: ''
    def prefix = task.ext.prefix ?: "${meta.id}"
    // Make list of old name and new name pairs to use for renaming in the bash while loop
    def old_new_pairs = reads instanceof Path || reads.size() == 1 ? [[ reads, "${prefix}.${reads.extension}" ]] : reads.withIndex().collect { entry, index -> [ entry, "${prefix}_${index + 1}.${entry.extension}" ] }
    def rename_to = old_new_pairs*.join(' ').join(' ')
    def renamed_files = old_new_pairs.collect{ old_name, new_name -> new_name }.join(' ')
    """
    printf "%s %s\\n" $rename_to | while read old_name new_name; do
        [ -f "\${new_name}" ] || ln -s \$old_name \$new_name
    done
    fastqc $args --threads $task.cpus $renamed_files

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        fastqc: \$( fastqc --version | sed -e "s/FastQC v//g" )
    END_VERSIONS
    """

    stub:
    def prefix = task.ext.prefix ?: "${meta.id}"
    """
    touch ${prefix}.html
    touch ${prefix}.zip

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        fastqc: \$( fastqc --version | sed -e "s/FastQC v//g" )
    END_VERSIONS
    """
}
Nextflow prioritises a custom config file over any other config files or values defined in the workflow files. To ensure that Nextflow uses the existing container for fastqc, you would create and use a custom config file, choosing either of the two ways described above (container paths on /cvmfs/singularity.galaxyproject.org, or module paths on your instance). A *_path.config file ensures that every process has the path for the existing container to run that part of the workflow from. More info on config files for processes can be found here: https://www.nextflow.io/docs/latest/config.html#scope-process
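As an illustration of the container-path option, the sketch below writes a small config file and shows how it might be passed to Nextflow. The file name fastqc_path.config and the image path for fastqc 0.11.9--0 are assumptions for this example (check that the image exists under /cvmfs/singularity.galaxyproject.org/all before relying on it), and the process selector may need adjusting for your pipeline.

```shell
#!/usr/bin/env bash
# Write a custom config that points the FASTQC process at an existing image.
# The file name and the image path are assumptions for this sketch.
cat > fastqc_path.config <<'EOF'
singularity.enabled = true

process {
    withName: 'FASTQC' {
        container = '/cvmfs/singularity.galaxyproject.org/all/fastqc:0.11.9--0'
    }
}
EOF

# Then pass it to Nextflow with -c, for example:
#   nextflow run nf-core/rnaseq -c fastqc_path.config <other options>
```

The same pattern can be repeated with one withName block per process that should use an existing container.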