How to Choose a Flavour
- Audrey Stott
- Luke Edwards
- Marco De La Pierre
Nimbus, like all Pawsey systems, is a shared research resource, so as a general rule we encourage you to request resources that match, as closely as possible, what you will actually use. This gives more users access to an allocation, lets the infrastructure be used effectively across all research domains, and leaves capacity to respond to requests for increased usage over time. However, we understand that choosing what you require can be tricky at the start, especially if you are not yet sure which software packages you will be using or what their computational requirements are. Below are a few scenarios to assist you in deciding on a resource size for your research project.
Flavours
Each instance has a 'flavour', which determines how much RAM and how many virtual CPU cores it has access to. If you request a flavour now and later realise you need a smaller or larger one, you can request a resize and then migrate your instance accordingly. These are the available flavours on Nimbus:
| Flavour | CPU Cores | RAM | Notes |
|---|---|---|---|
| n3.1c4r | 1 | 4GB | |
| n3.2c8r | 2 | 8GB | |
| n3.4c16r | 4 | 16GB | |
| n3.8c32r | 8 | 32GB | |
| n3.16c64r | 16 | 64GB | Requires the most detailed justification in your application |
Scenarios
Biological sciences
John is starting a research project that involves phylogenetic analyses of drought-resistance genes from a total of 20 samples across five different crop plants. He is using raw DNA sequencing reads of roughly 7GB per sample, and the output files will be less than 1MB per sample, which brings his total storage requirement to about 140GB. To run his analyses, he will do quality checks on the raw reads, perform adapter trimming, align the trimmed reads to a reference genome, annotate the alignments, and finally construct a phylogenetic tree. For this pipeline, typical tools John would use are FastQC, Cutadapt, BLAST, SPAdes, and MrBayes.
Of these tools, BLAST and SPAdes are the most resource-intensive; depending on input file sizes, they can require at least 8 CPU cores and 10-30GB of RAM. However, as John will be selecting out only the drought-resistance genes, the file sizes per BLAST and SPAdes run will be orders of magnitude smaller, so each run will need only 2-4 CPU cores and no more than 10GB of RAM. In this instance, John's best-fit flavour is n3.2c8r, with 2 CPU cores and 8GB RAM. If John later extends the analysis to a larger number of genes, the n3.4c16r flavour, with 4 CPU cores and 16GB RAM, would be better suited to those needs.
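If it helps to see the arithmetic laid out, the short Python sketch below reproduces the storage estimate from the figures above. The 10% headroom and the rounding to 150GB are illustrative assumptions rather than fixed rules.

```python
# Rough storage estimate for John's project, using the figures quoted above.
N_SAMPLES = 20
RAW_GB_PER_SAMPLE = 7          # raw DNA sequencing reads, ~7GB per sample
OUTPUT_GB_PER_SAMPLE = 0.001   # output files are <1MB per sample

estimate_gb = N_SAMPLES * (RAW_GB_PER_SAMPLE + OUTPUT_GB_PER_SAMPLE)  # ~140GB

# Add ~10% headroom and round to the nearest 10GB (illustrative assumption).
request_gb = round(estimate_gb * 1.1 / 10) * 10                       # ~150GB
print(f"Estimated data: {estimate_gb:.0f}GB, suggested request: {request_gb}GB")
```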
Example suggested application:
- 1x n3.2c8r instance
- 150GB storage
Biomedical sciences
Georgie is starting to compare human exome datasets. She needs to process her data from FASTQ to VCF format following best practices for the GATK4 toolkit, plus quality control, and then analyse the results. The raw FASTQ files range from 4GB to 7GB each, with two files per sample. The intermediate BAM files generated from these will be about 15GB per sample, and these will be kept. Other intermediate files will not be kept, but they tend to consume about the same amount of space as the BAM files.
Although the RAM usage of the GATK toolkit can be limited with command-line flags, in general an n3.8c32r or n3.16c64r instance would be useful, depending on the number of exomes (e.g. 15 or 30). A larger flavour also helps with multithreading, which is available in many of these tools and will decrease processing time. Multithreading lets a process use multiple CPU cores: for example, a 16-core instance can support 16 threads, which processes a job faster than a single thread/core would. Not all GATK tools let you set the thread count explicitly, but they will still make use of the available infrastructure.
With the above in mind and 15 exomes to analyse, Georgie can request 15 × (7GB + 7GB + 15GB) × 2 = 870GB of data storage and one n3.8c32r instance. She may wish to request a second instance and some additional storage at the same time, to experiment with improving her workflow. In future she may consider running multiple machines in an orchestrated fashion to further speed up processing; she can contact the Pawsey Service Desk to discuss this if she wants to pursue it.
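The sketch below lays out the same storage arithmetic in Python and shows one way to check how many threads an instance offers. The per-sample sizes and the ×2 allowance for discarded intermediates come from the scenario above; `os.cpu_count()` is simply a convenient way to read the core count of the flavour.

```python
import os

# Storage estimate for Georgie's 15-exome analysis, using the figures above:
# two FASTQ files of up to 7GB each plus a ~15GB BAM per sample, doubled to
# allow for intermediate files that take roughly as much space again.
N_SAMPLES = 15
PER_SAMPLE_GB = 7 + 7 + 15

storage_gb = N_SAMPLES * PER_SAMPLE_GB * 2   # 15 x 29 x 2 = 870GB
print(f"Estimated working storage: {storage_gb}GB")

# The number of usable threads matches the virtual cores of the flavour
# (e.g. 8 on n3.8c32r, 16 on n3.16c64r).
threads = os.cpu_count()
print(f"Threads available for multithreaded GATK steps: {threads}")
```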
Example suggested application:
- 2x n3.8c32r instances
- 1TB storage
Chemistry
Martha needs to run a number of quantum chemistry calculations on a set of small to medium-sized molecules (20-30 atoms). In particular, she needs to run geometry optimisations, then compute vibrational and optical spectra and perform transition-state searches.
These tasks are not particularly memory-intensive (they need less than the 4GB per core available on Nimbus instances), but each of them may need to run for several days. Depending on the checkpoint/restart capabilities of the chosen code, it can be effective to use the Nimbus cloud rather than an HPC system, trading some performance for the absence of wall-time limits.
Considering that most quantum chemistry codes implement parallel algorithms, it is a good idea to aim for one of the two largest flavours, n3.8c32r or n3.16c64r.
Regarding disk space, the inputs and outputs of these simulations do not exceed a few MB. However, some scratch space on the order of tens of GB will be required at runtime. It is sensible to start with 40GB of storage, which can be increased later with justification.
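Because the scratch requirement is the main storage consideration here, a simple pre-flight check can be worth running before launching a multi-day calculation. The sketch below is illustrative only: the `/scratch` path and the 30GB threshold are assumptions, not Nimbus defaults.

```python
import shutil

# Pre-flight check that enough scratch space is free before starting a
# multi-day calculation. The /scratch path and 30GB threshold are
# illustrative assumptions, not Nimbus defaults.
SCRATCH_DIR = "/scratch"
REQUIRED_GB = 30   # "tens of GB" of runtime scratch, per the scenario above

free_gb = shutil.disk_usage(SCRATCH_DIR).free / 1024**3
if free_gb < REQUIRED_GB:
    raise SystemExit(
        f"Only {free_gb:.0f}GB free in {SCRATCH_DIR}; "
        f"need roughly {REQUIRED_GB}GB of scratch before starting."
    )
print(f"{free_gb:.0f}GB free in {SCRATCH_DIR}: enough scratch to start the run.")
```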
Example suggested application:
- 1x n3.8c32r or n3.16c64r instance
- 40GB storage
Materials science
As part of her research project, Hannah regularly needs to run a large number (thousands) of lattice dynamics calculations on a variety of solid-state compounds, for example as part of systematic investigations over some parameter space.
Each of these calculations has minimal hardware requirements and takes no more than a few minutes to complete, but there are many of them.
She might be tempted to run them all on her laptop. However, using dedicated, publicly available research computing infrastructure can be a much more effective solution: Nimbus has less downtime (calculations keep running while she carries her laptop around), and her laptop stays free for what it was designed for (office work rather than sustained scientific computing).
In this case, a minimal allocation can satisfy her needs.
Example suggested application:
- 1x n3.1c4r instance
- 10GB storage