Tuesday, December 2, 2014

Building blast+ databases with taxonomy ID (taxid_map)

Building NCBI BLAST+ databases with linked taxonomy is far more difficult than it should be.
For example, in taxonomy-based tools such as Kraken, mapping
1) taxonomy id to sequence id (gi or accession) and
2) taxonomy id to a human-readable taxonomy tree,
are built-in and transparent to the user.
Unfortunately, with BLAST+ these steps must be completed manually and are included in two separate programs, makeblastdb for (1) and blastn/blastp/blastx for (2).

(1) Taxonomy id <–> sequence id

In BLAST+, a taxid_map file file must be created and passed to makeblastdb
makeblastdb -in <FASTA file> -dbtype nucl -parse_seqids -taxid_map taxid_map.txt 
where taxid_map.txt is a space or tab separated list of sequence ids (either gi or accession) and taxonomy ids.
For example, with gi:
556927176 4570
556926995 4573
501594995 3914
Alternatively with accession:
NC_022714.1 4570
NC_022666.1 4573
NC_021092.1 3914
There is no turn-key way to generate this mapping taxid to sequence_id for a moderately large set of sequencing.
Fortunately, there is always a hack work-around. NCBI allows export of both the FASTA and GenBank files. The former are used as the default input for makeblastdb, and the latter contain both the sequence_id and taxid. They can both be obtained from the NCBI, searching and exporting with Send to:
enter image description here
This simple Python code snippet will do the trick for small and moderately large datasets.
from Bio import SeqIO
genbankfile = "DNA.gb"
f = open('taxid_map.txt','w')
for gb in SeqIO.parse(genbankfile,"gb"):
        annotations = gb.annotations['gi']
        taxid = gb.features[0].qualifiers['db_xref'][0].split(':')[1]
        f.write("{} {}\n".format(annotations, taxid))
For large datasets, the bandwidth cost of of downloading the GenBank from NCBI becomes prohibitive, and the dictionary approach would probably be warranted.
Download both the FASTA and GenBank, alternatively extract the FASTA from GenBank, e.g. with BioPython.

(2) Taxonomy id <–> Taxonomy tree

Simply include this NCBI database in the same directory as your database for the look-up to work with blastn/blastp/blastx: ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz

Eating your cake

blastn -db <DATABASE> -query <QUERY> -outfmt "10 qseqid sseqid pident staxids sscinames scomnames sblastnames sskingdoms"
1,gi|312233363|ref|NC_014692.1|,86.26,310261,Sus scrofa taiwanensis,Sus scrofa taiwanensis,even-toed ungulates,Eukaryota
1,gi|223976078|ref|NC_012095.1|,86.26,9825,Sus scrofa domesticus,domestic pig,even-toed ungulates,Eukaryota
1,gi|5835862|ref|NC_000845.1|,86.26,9823,Sus scrofa,pig,even-toed ungulates,Eukaryota
Your comma separated file (-outfmt) showing human-readable taxonomy info.


BioInfo said...

Hi, it was really nice but unfortunately I am unable to have a taxon id in BLAST output. It says N/A. I have downloaded FASTA sequences and I am doing BLAST all vs all.
makeblastdb -in file.fasta -db prot -out ./database/db
TaxDB is in the same folder of BLAST database. BLAST query:
blastp -db ./database/db -query file.fasta -out results.out -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend staxid sscinames scomnames stitle sstart send evalue bitscore"

Any suggestions? Your help will be appreciated.

Nathan Brown said...

Very helpful post! Thanks a million!
You may also want to add a test like this, to make sure that the taxid is what you think it is, rather than some sort of numbder from the Human Microbiome Project or ATCC. I ran into a couple of those in my set of genbank files.

if gb.features[0].qualifiers['db_xref'][0].split(':')[0] == "taxon":
taxid = gb.features[0].qualifiers['db_xref'][0].split(':')[1]