Blast+
Blast+ is a program for comparing biological sequence information, such as amino-acid or DNA/RNA sequences.
Because Blast+ usage varies widely and the program offers only limited parallelism, we provide only general advice on running Blast+ searches. We focus on two example cases.
Running Blast+
Blast+ is provided by system modules. To load the Blast+ module, use the following command:
$ module load blast/2.12.0--pl5262h3289130_0
NCBI Indexed Databases
Standard NCBI indexed databases, as used by the Blast+ executables, are currently installed centrally in the directory /scratch/references/blastdb_update/blast-YYYY-MM-DD/db, where 'YYYY-MM-DD' is the date on which the database files were downloaded. The Blast+ databases are downloaded every scheduled maintenance period. To check the date of a given nucleotide or protein database, use the Blast+ utility blastdbcmd, which reports the date the database was last updated in the central NCBI repository (the anonymous FTP download site at ftp://ftp.ncbi.nlm.nih.gov/blast/db/):
$ blastdbcmd -info -db /scratch/references/blastdb_update/blast-2021-09-01/db/nt
Database: Nucleotide collection (nt)
    72,899,005 sequences; 510,954,263,840 total bases
Date: Aug 24, 2021 2:13 AM    Longest sequence: 99,791,824 bases
BLASTDB Version: 5
Volumes:
    /scratch/references/blastdb_update/blast-2021-09-01/db/nt.00
    ...
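Because the snapshot directory name changes after each maintenance period, it can be convenient to look up the newest snapshot rather than hard-coding its date. The following is a minimal sketch, not a site-provided tool: the function name latest_blastdb_dir is hypothetical, and it assumes that every snapshot follows the blast-YYYY-MM-DD/db layout shown above.

```shell
# Print the db directory of the newest blast-YYYY-MM-DD snapshot under a root.
# Assumption: snapshots live directly under the root and each contains "db".
latest_blastdb_dir() {
    local root="${1:-/scratch/references/blastdb_update}"
    local latest="" d
    # Glob results are sorted, and ISO dates sort chronologically as strings,
    # so the last matching directory is the most recent snapshot.
    for d in "$root"/blast-[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]; do
        [ -d "$d/db" ] || continue
        latest="$d/db"
    done
    # Return non-zero (and print nothing) if no snapshot was found.
    [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Possible use once the blast module is loaded:
#   blastdbcmd -info -db "$(latest_blastdb_dir)/nt"
```

Checking the result with blastdbcmd -info remains worthwhile, since the directory date is the download date and not the date the database was updated at NCBI.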
How to use Blast+ effectively (on HPC systems)
Running typical Blast+ queries against an indexed database can be time-consuming and can produce large result files. The following tips can help to improve the efficiency of Blast+ queries:
- Ensure that the query set does not contain exact duplicate sequences.
- Limit the number of search results. The defaults are large: 500 for -num_descriptions and for the alternative -max_target_seqs, and 250 for -num_alignments. Reduce these values as far as possible to cut both the search time and the size of the output file.
- Use a job array to run many Blast+ queries at once rather than one at a time (i.e. change from running in serial to running "embarrassingly parallel"). For more details, see the section about job arrays on the Example Workflows page.
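The last two tips can be combined by splitting a large query file into chunks and letting each array task search one chunk with a reduced hit limit. The sketch below is illustrative only: the function names split_fasta and run_chunk, the chunk file naming, the limit of 10 target sequences and the tabular output format are all assumptions, and the usage comment assumes a Slurm-style SLURM_ARRAY_TASK_ID variable, which may differ on your scheduler.

```shell
# Split a multi-FASTA file ($1) into chunks of at most $2 sequences,
# written as chunk_1.fa, chunk_2.fa, ... in directory $3.
split_fasta() {
    local fasta="$1" per_chunk="$2" outdir="$3"
    mkdir -p "$outdir"
    awk -v n="$per_chunk" -v dir="$outdir" '
        /^>/ {
            # Start a new chunk file every n header lines.
            if (count % n == 0) {
                if (out) close(out)
                out = sprintf("%s/chunk_%d.fa", dir, count / n + 1)
            }
            count++
        }
        out { print > out }
    ' "$fasta"
}

# Search one chunk with a reduced number of reported targets.
# Set DRY_RUN=1 to print the command instead of running it.
run_chunk() {
    local task_id="$1" chunk_dir="$2" db="$3"
    ${DRY_RUN:+echo} blastn -query "$chunk_dir/chunk_${task_id}.fa" -db "$db" \
        -max_target_seqs 10 -outfmt 6 -out "$chunk_dir/chunk_${task_id}.tsv"
}

# In an array job script (assuming Slurm), each task would then call:
#   run_chunk "$SLURM_ARRAY_TASK_ID" chunks /path/to/db/nt
```

Splitting once before submission keeps the array tasks independent, so each task only reads its own chunk and writes its own result file.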
Output format
By default, Blast+ searches write results in a simple text format. Blast+ also offers some structured output data formats: XML, JSON and ASN.1. These formats can offer advantages in some circumstances.
You can use the Blast+ option -outfmt 11 to produce ASN.1 format. This option saves the search results in an archive that blast_formatter can convert to other output formats without re-running the query (provided the relevant database is available). Depending on the exact type of search, the .asn1 file may be significantly larger (perhaps by a factor of 2-4) than the corresponding default text format. However, a compressed .asn1 file (produced using gzip, for example) is usually smaller, making it a reasonable choice when large quantities of search data are to be archived.
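The archive workflow described above might look like the following sketch. The function names search_to_archive and reformat_archive and all file names are hypothetical; the sketch uses blastn as the example search program and assumes that a compressed archive is decompressed again before it is passed to blast_formatter.

```shell
# Run a search once and keep the result as a Blast+ archive (-outfmt 11).
# Arguments: query file, database, archive file.
# Set DRY_RUN=1 to print the command instead of running it.
search_to_archive() {
    ${DRY_RUN:+echo} blastn -query "$1" -db "$2" -outfmt 11 -out "$3"
}

# Convert an existing archive to another output format without re-running
# the search. Arguments: archive file, output format number, output file.
reformat_archive() {
    ${DRY_RUN:+echo} blast_formatter -archive "$1" -outfmt "$2" -out "$3"
}

# Possible sequence of steps (file names are placeholders):
#   search_to_archive query.fa /path/to/db/nt results.asn1
#   reformat_archive results.asn1 6 results.tsv   # tabular view of the same hits
#   gzip results.asn1                             # compress for archiving
#   gunzip results.asn1.gz                        # restore before re-formatting
```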
External links
- For details of Blast+ applications, in all cases refer to the Blast Command Line Applications User Manual.
- For advice using Blast+ on HPC systems, see Running NCBI-BLAST jobs in parallel.