-
Notifications
You must be signed in to change notification settings - Fork 18
Setting Up BLAST Databases
GeneValidator requires a protein BLAST database in order to fully analyse all sequences. The GeneValidator package includes the Swissprot BLAST database.
Choosing the right BLAST database is essential in order to obtain the best results with GeneValidator. For example, SwissProt contains high quality, manually curated genes, but is a lot smaller than a larger database such as Non-Redundant database. On the hand, the Non-Redundant database contains lower-quality genes. An alternative would be creating BLAST database from a database of a closely related species.
GeneValidator contains a script (ncbi-blast-dbs) that downloads pre-formatted BLAST databases from NCBI.
# To see available BLAST databases
genevalidator ncbi-blast-dbs
# To download BLAST database(s)
genevalidator ncbi-blast-dbs nr refseq_protein
Features include:
- Database files (volumes) are downloaded in parallel - number of threads to use is determined automatically.
- MD5 checksum is verified and the database volume extracted upon download.
- Database volumes are not downloaded in a particular order.
- The volumes are updated if a newer version is available on the server, or re-downloaded if corrupt.
- Aborted downloads are safely resumed.
However, this script requires wget
and md5sum
to work. For macOS users, md5sum
can be installed using Homebrew
brew install md5sha1sum
It is also possible to create a custom BLAST database from a Fasta file by using the included makeblastdb command from BLAST+. In particular, it is important to note that GeneValidator requires that blast databases are set up with the -parse_seqids argument.
makeblastdb -in input_db -dbtype prot -parse_seqids
For example, if we were to download the swissprot database as a fasta file, we could format it as a BLAST database ourselves as follows.
cd genevalidator/blast_db/
# Delete existing version of SwissProt (which is included in the GeneValidator package)
rm swissprot*
# download the latest version in fasta format - from http://www.uniprot.org/downloads
wget ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
# unzip the fasta file
gunzip uniprot_sprot.fasta.gz
# use the packaged makeblastdb tool to create the BLAST DB
makeblastdb -in uniprot_sprot.fasta -dbtype prot -parse_seqids
It is possible to pass BLAST output files (XML or tabular files) to GV. This can be particularly useful if BLAST results have already been produced, when BLAST can be run on a cluster, or when BLAST results are also required for other analyses.
GeneValidator supports the XML and tabular BLAST output formats.
# Run BLAST (XML output)
blast(p/x) -db DATABASE_PATH -num_threads NUM_THREADS -outfmt 5 -out BLAST_XML_FILE -query INPUT_FASTA_FILE
# Run GeneValidator
genevalidator -d DATABASE_PATH -n NUM_THREADS -x BLAST_XML_FILE INPUT_FASTA_FILE
This is the same, but using the BLAST tabular output.
# Run BLAST (tabular output)
blast(p/x) -db DATABASE_PATH -num_threads NUM_THREADS -outfmt '7 qseqid sseqid sacc slen qstart qend sstart send length qframe pident nident evalue qseq sseq' -out BLAST_TAB_FILE -query INPUT_FASTA_FILE
# Run GeneValidator
genevalidator -n NUM_THREADS -t BLAST_TAB_FILE -o 'qseqid sseqid sacc slen qstart qend sstart send length qframe pident nident evalue qseq sseq' INPUT_FASTA_FILE