Summary


This page describes how to use the new 'Pawsey Bio - Ubuntu 20.04 - 2021-11' image for Nimbus. Instructions on how to choose this image when creating your instance can be found here. The Bio image caters to bioinformatics users who prefer their instances to come pre-installed with software commonly used in the bioinformatics domain. Some of this software is part of Pawsey's ongoing effort to improve the experience of bioinformatics users at Pawsey.

Before you begin


You may be required to enter your SSH public key, or the path to it on your local machine, while using some of this software. Please have it ready. Instructions for generating one can be found at /wiki/spaces/SUP/pages/55410887.

Pre-installed software

The list of pre-installed software on this image is as follows:

Commonly used software

Software          Information
Ansible           An automation platform that Pawsey uses to automate a number of software deployments
CernVM-FS         A read-only file system for accessing files on shared repositories
Docker            A popular container engine
Google Chrome     A web browser
Jupyter Notebook  For using an interactive Jupyter Notebook - see Run Jupyter Notebook Interactively
Lmod              A modules environment that we use at Pawsey for loading software
Nextflow          A popular workflow manager
Pip               A Python package installer
Python3           A popular programming language used in many bioinformatics software packages
RStudio           For using R interactively - see Run RStudio Interactively
Singularity       A popular container engine that can be used on HPC
Singularity-HPC   A container modules installer
Spack             A package management tool
X2go              A virtual desktop application - see Setting up a virtual desktop for your instance

    Instructions


    On this page, we only cover instructions for using CernVM-FS, Jupyter Notebook, RStudio, and Singularity-HPC. For the other software listed above, please see the links provided above, or the software's original documentation page.

    CernVM-FS

    CernVM-FS is a read-only file system that was developed at CERN. It allows files such as container tools, reference datasets and other shared resources that are commonly used by many researchers to be accessed, added to, and updated in one location. At Pawsey, we currently cache the Biocontainer tools and reference genome datasets that are on Galaxy Project's repositories. Please note that the datasets may not be comprehensive, and this service is not meant to replace your current methods for accessing public datasets.

    Note

    Due to a change in the CernVM-FS proxy at Pawsey on 27 July 2022, if you are using the 'Pawsey Bio - Ubuntu 20.04 - 2021-11' image, run the following to re-install CVMFS before proceeding:

    Code Block
    sudo apt-get autoremove cvmfs
    sudo apt-get purge cvmfs
    sudo rm -rf /etc/cvmfs/
    git clone https://github.com/PawseySC/Pawsey-CernVM-FS.git

    Then run the following to set up the repositories from Galaxy and AARNet, respectively:

    Code Block
    cd Pawsey-CernVM-FS
    
    # for Galaxy repos
    sudo ./cvmfs-client-setup.sh \
        --stratum-1 cvmfs1-mel0.gvl.org.au \
        --stratum-1 cvmfs1-ufr0.galaxyproject.eu \
        --stratum-1 cvmfs1-tacc0.galaxyproject.org \
        --stratum-1 cvmfs1-iu0.galaxyproject.org \
        --stratum-1 cvmfs1-psu0.galaxyproject.org \
        --proxy cvmfs-cachingproxy.pawsey.org.au \
        pubkeys/cvmfs-config.galaxyproject.org.pub \
        pubkeys/data.galaxyproject.org.pub \
        pubkeys/main.galaxyproject.org.pub \
        pubkeys/sandbox.galaxyproject.org.pub \
        pubkeys/singularity.galaxyproject.org.pub \
        pubkeys/test.galaxyproject.org.pub \
        pubkeys/usegalaxy.galaxyproject.org.pub
    
    # for AARNet repos
    sudo ./cvmfs-client-setup.sh \
        --stratum-1 bcws.test.aarnet.edu.au \
        --proxy cvmfs-cachingproxy.pawsey.org.au \
        pubkeys/containers.biocommons.aarnet.edu.au.pub \
        pubkeys/data.biocommons.aarnet.edu.au.pub \
        pubkeys/tools.biocommons.aarnet.edu.au.pub
     

    You can then refer to and use the paths to the datasets as follows:

    Code Block
    ls /cvmfs/data.galaxyproject.org


    The /cvmfs/singularity.galaxyproject.org/all subdirectory is where the entire list of 8000+ Biocontainers can be found; the alphabetical subdirectories are symlinks to them.

    Note: It may take a minute or two to load the folders the first time. Subsequent listings will be faster.

    To view the entire list of Biocontainer tools from the repository:

    Code Block
    $ ls -la /cvmfs/singularity.galaxyproject.org
    total 140
    drwxr-xr-x 33 cvmfs cvmfs 4096 Mar 25  2020 .
    drwxr-xr-x  3 cvmfs cvmfs 4096 May 12  2022 1
    drwxr-xr-x  3 cvmfs cvmfs 4096 Jun 29  2020 2
    drwxr-xr-x  4 cvmfs cvmfs 4096 Feb 10 07:09 3
    drwxr-xr-x 24 cvmfs cvmfs 4096 Feb 23  2021 a
    drwxr-xr-x  4 cvmfs cvmfs 4096 Mar  7 19:06 all
    drwxr-xr-x 23 cvmfs cvmfs 4096 Sep  2  2022 b
    drwxr-xr-x 26 cvmfs cvmfs 4096 Feb 10 15:05 c
    drwxr-xr-x 22 cvmfs cvmfs 4096 Feb 10 15:05 d
    drwxr-xr-x 23 cvmfs cvmfs 4096 May  1  2020 e
    drwxr-xr-x 21 cvmfs cvmfs 4096 Feb 10 19:06 f
    drwxr-xr-x 25 cvmfs cvmfs 4096 Feb  5  2022 g
    drwxr-xr-x 19 cvmfs cvmfs 4096 Feb 10 21:21 h
    drwxr-xr-x 18 cvmfs cvmfs 4096 Feb 10 21:21 i
    drwxr-xr-x 14 cvmfs cvmfs 4096 Jul 10  2020 j
    drwxr-xr-x 21 cvmfs cvmfs 4096 Feb 25  2021 k
    drwxr-xr-x 16 cvmfs cvmfs 4096 Feb 10 22:54 l
    drwxr-xr-x 26 cvmfs cvmfs 4096 Feb 10 22:54 m
    drwxr-xr-x 20 cvmfs cvmfs 4096 Feb 11 05:13 n
    drwxr-xr-x 14 cvmfs cvmfs 4096 May  7  2020 o
    drwxr-xr-x 24 cvmfs cvmfs 4096 Aug 24  2021 p
    drwxr-xr-x 12 cvmfs cvmfs 4096 Feb 27  2021 q
    drwxr-xr-x 27 cvmfs cvmfs 4096 Jun 28  2022 r
    drwxr-xr-x 26 cvmfs cvmfs 4096 Apr  7  2020 s
    drwxr-xr-x 24 cvmfs cvmfs 4096 Feb 11 23:17 t
    drwxr-xr-x 13 cvmfs cvmfs 4096 Feb 11 23:17 u
    drwxr-xr-x 18 cvmfs cvmfs 4096 Feb 11 23:17 v
    drwxr-xr-x 17 cvmfs cvmfs 4096 Jul 28  2021 w
    drwxr-xr-x 16 cvmfs cvmfs 4096 Apr  7  2020 x
    drwxr-xr-x  4 cvmfs cvmfs 4096 Feb 28  2021 y
    drwxr-xr-x 10 cvmfs cvmfs 4096 Feb 11 23:17 z

    To access the data files:

    Code Block
    $ ls -la /cvmfs/data.galaxyproject.org
    ls /cvmfs/singularity.galaxyproject.org
    ls /cvmfs/main.galaxyproject.org
    ls /cvmfs/cvmfs-config.galaxyproject.org
    
    ls /cvmfs/containers.biocommons.aarnet.edu.au
    ls /cvmfs/data.biocommons.aarnet.edu.au
    ls /cvmfs/tools.biocommons.aarnet.edu.au
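As a quick check that the repositories are reachable, a small shell helper along the following lines can list each one and report the result. This is a minimal sketch, not part of the Pawsey tooling: the function name check_cvmfs_repos is hypothetical, and it assumes the repositories appear as directories under a common base directory such as /cvmfs.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the Pawsey image).
# Usage: check_cvmfs_repos BASE_DIR REPO [REPO ...]
# Prints "ok <repo>" or "missing <repo>" for each repository and
# returns non-zero if any repository could not be listed.
check_cvmfs_repos() {
    local base="$1"; shift
    local repo status=0
    for repo in "$@"; do
        # On clients that mount on demand, listing the directory
        # is also what triggers the mount.
        if ls "$base/$repo" >/dev/null 2>&1; then
            echo "ok $repo"
        else
            echo "missing $repo"
            status=1
        fi
    done
    return "$status"
}
```

For example, `check_cvmfs_repos /cvmfs data.galaxyproject.org singularity.galaxyproject.org` prints one line per repository and returns a non-zero status if any of them cannot be listed.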


    Jupyter Notebook 

    Jupyter Notebooks are a very popular way of running bioinformatics analyses because of their interactive nature. We have enabled an automated way of creating such notebooks from a container. As containers do not store files, all notebooks created during the interactive session are stored on your Nimbus instance under /data.

    Note
    Open port 8888 on the Nimbus dashboard

    From the Nimbus dashboard:

    1. Navigate to Network > Security Groups.
    2. Click on + Create Security Group, name it 'port 8888', and then select the Create Security Group button.
    3. Select + Add Rule.
    4. Enter the port number 8888 under 'Port', and click on the Add button.
    5. Navigate back to Compute > Instances, click on the down arrow button for your instance, and select Edit Security Groups. Ensure that you select the port 8888 security group that you have just created, i.e. it should appear in the right-hand list of Instance Security Groups.

    Then, to start a Jupyter Notebook, simply run the following:

    Code Block
    ansible-playbook /jupyter-on-nimbus/ansible-jupyternotebook.yaml

    Notes:

  • The playbook will prompt you to choose a version of the Jupyter Datascience Notebook (https://hub.docker.com/r/jupyter/datascience-notebook/tags/)
  • Pulling the container will take at least 3-5 minutes. Once pulled, it will start instantly each time you want to use it
  • From time to time, you may want to re-clone the jupyter-on-nimbus repo to pick up updates. Nimbus users will only be notified of essential updates.

    Code Block
    git clone https://github.com/PawseySC/jupyter-on-nimbus
    sudo rm -rf /jupyter-on-nimbus
    sudo mv jupyter-on-nimbus /
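If the notebook page does not load in your browser, it can help to first confirm on the instance that something is listening on port 8888. The following is a minimal sketch under stated assumptions: has_listener is a hypothetical helper name, it reads the output of `ss -tln` on standard input, and it assumes the container's port is published on the host so that it shows up as a LISTEN socket.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the Pawsey image).
# Usage: ss -tln | has_listener PORT
# Succeeds if any LISTEN line has a field ending in ":PORT".
has_listener() {
    local port="$1"
    awk -v port="$port" '
        $1 == "LISTEN" {
            for (i = 2; i <= NF; i++)
                if ($i ~ (":" port "$")) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}
```

For example, `ss -tln | has_listener 8888 && echo "port 8888 is listening"`. If the check succeeds but the page still does not load, the security group steps above are the next thing to review. The same check works for RStudio with port 8787.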

    RStudio

    RStudio is another popular interactive tool for bioinformatics analysis. Here we have also automated starting an RStudio server session. As containers do not store files, all R sessions created during the interactive session are stored on your Nimbus instance under /data.

    Note
    Open port 8787 on the Nimbus dashboard

    From the Nimbus dashboard:

    1. Navigate to Network > Security Groups.
    2. Click on + Create Security Group, name it 'port 8787', and then select the Create Security Group button.
    3. Select + Add Rule.
    4. Enter the port number 8787 under 'Port', and click on the Add button.
    5. Navigate back to Compute > Instances, click on the down arrow button for your instance, and select Edit Security Groups. Ensure that you select the port 8787 security group that you have just created, i.e. it should appear in the right-hand list of Instance Security Groups.

    Then, to start an RStudio server session, simply run the following:

    Code Block
    ansible-playbook /rstudio-on-nimbus/ansible-rstudio.yaml -i /rstudio-on-nimbus/vars_list

    Notes:

  • The playbook will prompt you to choose a version of R (https://hub.docker.com/r/rocker/tidyverse/tags - note that only 4.1.0 is supported at present)
  • You can also enter any R libraries or BiocManager tools you require - be sure to follow the prompts accurately
  • Pulling the container will take at least 3-5 minutes. Once pulled, it will start instantly each time you want to use it
  • From time to time, you may want to re-clone the rstudio-on-nimbus repo to pick up updates. Nimbus users will only be notified of essential updates.

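    Code Block
    git clone https://github.com/PawseySC/rstudio-on-nimbus
    sudo rm -rf /rstudio-on-nimbus
    sudo mv rstudio-on-nimbus /

    CernVM-FS data files

    The top level of the /cvmfs/data.galaxyproject.org repository contains the following:

    Code Block
    $ ls -la /cvmfs/data.galaxyproject.org/
    total 14
    drwxr-xr-x   4 cvmfs cvmfs 4096 Mar 31  2016 .
    -rw-r--r--   1 cvmfs cvmfs   21 Oct 24  2018 .cvmfsdirtab
    drwxr-xr-x 210 cvmfs cvmfs 4096 Apr 21  2022 byhand
    drwxr-xr-x  18 cvmfs cvmfs 4096 Nov 24  2020 managed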

    To use these data files in your analyses, use the absolute file path in your workflow or pipeline. For example, the reference genome hg38 can be found in the following location, specifically under the seq subdirectory:

    Code Block
    $ ls -la /cvmfs/data.galaxyproject.org/byhand/hg38
    total 46
    drwxrwxr-x  10 cvmfs cvmfs 4096 Apr 22  2016 .
    drwxr-xr-x 210 cvmfs cvmfs 4096 Apr 21  2022 ..
    -rw-r--r--   1 cvmfs cvmfs    0 Apr 22  2016 .cvmfscatalog
    drwxrwxr-x   3 cvmfs cvmfs 4096 Jan 21  2015 download
    drwxrwxr-x   6 cvmfs cvmfs 4096 Jan 20  2015 hg38canon
    drwxrwxr-x   6 cvmfs cvmfs 4096 Jan 20  2015 hg38female
    drwxrwxr-x   6 cvmfs cvmfs 4096 Jan 20  2015 hg38full
    drwxrwxr-x   2 cvmfs cvmfs 4096 Mar 18  2014 liftOver
    drwxrwxr-x   2 cvmfs cvmfs 4096 Mar 18  2014 picard_index
    drwxrwxr-x   2 cvmfs cvmfs 4096 Mar 18  2014 sam_index
    drwxrwxr-x   2 cvmfs cvmfs 4096 Apr  1  2016 seq
    
    
    $ ls -la /cvmfs/data.galaxyproject.org/byhand/hg38/seq
    total 10108046
    drwxrwxr-x  2 cvmfs cvmfs       4096 Apr  1  2016 .
    drwxrwxr-x 10 cvmfs cvmfs       4096 Apr 22  2016 ..
    -rw-rw-r--  1 cvmfs cvmfs        136 Mar 18  2014 README
    -rw-rw-r--  1 cvmfs cvmfs  835393456 Mar 18  2014 hg38.2bit
    lrwxrwxrwx  1 cvmfs cvmfs         11 May 17  2014 hg38.fa -> hg38full.fa
    -rw-r--r--  1 cvmfs cvmfs      19327 Aug 24  2015 hg38.fa.fai
    -rw-rw-r--  1 cvmfs cvmfs 3150052305 Mar 17  2014 hg38canon.fa
    -rw-rw-r--  1 cvmfs cvmfs 3091680335 Mar 17  2014 hg38female.fa
    -rw-r--r--  1 cvmfs cvmfs        757 Apr  1  2016 hg38female.fa.fai
    -rw-rw-r--  1 cvmfs cvmfs 3273481150 Mar 18  2014 hg38full.fa

    So the full absolute path for the Hg38 sequence file would be:

    Code Block
    /cvmfs/data.galaxyproject.org/byhand/hg38/seq/hg38full.fa
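Because the file system is mounted over the network, it can be worth checking that a reference file is readable before starting a long pipeline. The sketch below is illustrative only: require_readable and the variable name REF_GENOME are hypothetical, and the path is the hg38 path given above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the Pawsey image).
# Usage: require_readable PATH
# Succeeds only if PATH exists, is readable, and is not empty.
require_readable() {
    local path="$1"
    if [ -r "$path" ] && [ -s "$path" ]; then
        return 0
    fi
    echo "reference not readable: $path" >&2
    return 1
}

# Absolute path from the listing above; the variable name is illustrative.
REF_GENOME="/cvmfs/data.galaxyproject.org/byhand/hg38/seq/hg38full.fa"
```

In a pipeline script, calling `require_readable "$REF_GENOME" || exit 1` before the first step that reads the genome makes a mount problem fail early instead of partway through an analysis.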


    Note

    If you run into any errors with accessing the file system, run the following to re-install it:

    Code Block
    sudo apt-get autoremove cvmfs
    sudo apt-get purge cvmfs
    sudo rm -rf /etc/cvmfs/
    git clone https://github.com/PawseySC/Pawsey-CernVM-FS.git

    cd Pawsey-CernVM-FS
    sudo ./install-cvmfs.sh install


    Singularity-HPC

    Tip

    Our upcoming March 2023 update of the Pawsey Bio image will include integration of Singularity-HPC with our CVMFS repositories, so that you are able to make use of all 8000+ Biocontainers seamlessly.

    Singularity-HPC is a tool that exposes containers as modules. If you are already familiar with containers, it is a convenient addition to your workflow; if you are not, it is a good way to start using them. Container syntax can be messy and confusing, and loading containers as modules removes the need for it. Singularity-HPC was created by one of the original developers of Singularity, and its registry includes many bioinformatics containers that can be readily pulled and used.

    Before you begin, move your containers folder to your storage volume (i.e. /data), then update the container base path:

    Code Block
    mv /home/ubuntu/containers /data/containers
    shpc config set container_base:/data/containers

    To see the entire list of containers available on the registry, run the following command:

    Code Block
    shpc list 

    At Pawsey, we recommend using S-HPC's registry of quay.io/biocontainers containers, as Biocontainers are a reliable source of well-built containers, with versions that are seamlessly matched to BioConda's tools. To narrow down the list to biocontainers, run:

    Code Block
    shpc show -f quay.io/biocontainers

    To install any of these containers, run:

    Code Block
    shpc install quay.io/biocontainers/TOOL_NAME

    To use installed containers, run the following and use the tool as you normally would (no container syntax required):

    Code Block
    module load quay.io/biocontainers/TOOL_NAME
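When several tools are needed, the two commands above can be generated for each tool name and reviewed before running them. This is a minimal sketch: biocontainer_commands is a hypothetical helper that only prints commands, and the tool names used in the example (samtools, bwa) are assumptions that should be checked against the output of shpc show first.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Singularity-HPC).
# Usage: biocontainer_commands TOOL [TOOL ...]
# Prints, without running them, the install and load commands
# for each tool under quay.io/biocontainers.
biocontainer_commands() {
    local tool
    for tool in "$@"; do
        echo "shpc install quay.io/biocontainers/$tool"
        echo "module load quay.io/biocontainers/$tool"
    done
}
```

For example, `biocontainer_commands samtools bwa` prints four lines. Run the printed lines in your current shell rather than piping them to a new one, since module load only affects the shell in which it is run.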

    Notes:

    • The list in the registry is not exhaustive; more packages are being added each day by the community
    • Pawsey is working to progressively add to the quay.io/biocontainers list
    • You may want to re-clone the repo so that your list stays up to date, as follows:

      Code Block
      git clone https://github.com/singularityhub/singularity-hpc
      sudo mv /singularity-hpc /shpc

