How to generate an id_to_taxonomy file from the NCBI database?


binbin

Mar 24, 2011, 5:16:59 PM
to Qiime Forum
I am using QIIME to analyze functional genes from 454 sequencing. Because
they are not 16S genes, RDP and Greengenes cannot be used, so I use BLAST
against the nt database (one of the NCBI databases).

My problem now is that I do not have an id_to_taxonomy file for
assign_taxonomy.py.
Is there any method to generate the id_to_taxonomy file from the NCBI
databases?

Thanks!
Binbin



Jeff Werner

Mar 24, 2011, 9:34:16 PM
to qiime...@googlegroups.com, binbin
Hi Binbin,

I hate to point you elsewhere, but I think the best way to get a consensus taxonomy for metagenomic sequences is to use a BLASTX search vs NCBI-nr, and input the blast results into MEGAN. If this is 454 data, however, that blast search can be very computationally intensive.

Another possibility is that, perhaps, the QIIME folks have the reference data posted for their application of "shotgun UniFrac?" And, if the MEGAN folks have a taxonomy mapped to NCBI accession numbers, then it MUST be possible to build an appropriate metagenomics database and mapping file for QIIME-based taxonomy classification.

Good luck,
Jeff

binbin

Mar 25, 2011, 8:34:21 AM
to Qiime Forum
Thanks for your answer Jeff.

Actually, I am not working with metagenomic sequences. My sequences are
amplicons of single functional genes; like 16S, they all come from one
gene. So I only need blastn (which I think is what QIIME uses).

I use QIIME because of its fascinating downstream functions, such as PCoA,
alpha rarefaction, beta diversity, etc. I have tried MEGAN before, and it
does not have as many functions. The good thing is that MEGAN has an NCBI
taxonomy mapping file, while QIIME does not.

Now I am thinking of writing a script to make an id_to_taxonomy file by
extracting information from the NCBI taxonomy files
(ftp.ncbi.nlm.nih.gov/pub/taxonomy). I do not know yet whether this is
possible... any tips?

One stupid question I have now: in the id_to_taxonomy file, the first
column is supposed to be the sample ID. Should this be the ID of my
sequences, or the ID in the database (the database used for BLAST)?


Thanks
Binbin




Jeff Werner

Mar 25, 2011, 9:53:12 AM
to qiime...@googlegroups.com, binbin
Hi Binbin,

Oh, I see!  Yes, QIIME would be a great way to process those seqs. To make a custom training set for assigning taxonomy with QIIME, you need two files:

1. A fasta file of reference sequences (ones that you know the taxonomy for).

2. A taxonomy file that maps a taxonomic hierarchy to each of the reference sequences. The format is one line per reference sequence. Each line starts with the ID of a reference sequence (e.g., its accession number), followed by a tab, and then the taxonomic hierarchy, where each taxonomic level is separated by a semicolon and a space. E.g.:

1187    Bacteria; Aquificae; Aquificae_(class); Aquificales; Aquificaceae; Aquifex
1220    Bacteria; Thermotogae; Thermotogae_(class); Thermotogales; Thermotogaceae; Marinitoga

where "1187" and "1220" are the FASTA IDs for sequences in the reference sequences. I'm not sure if Google Groups will preserve the correct text format above, but that is supposed to be a tab between the accession number and the taxonomy string. And, there are some other issues. If you want to use the RDP Classifier with a custom training set, then each taxonomy outline has to have exactly six levels. With the BLAST method, however, this depth stipulation is not an issue. Good luck!
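
For anyone who wants to build or check such a file programmatically, here is a minimal Python sketch of the format described above. The function names are made up for illustration and are not part of QIIME; the six-level check only reflects the RDP Classifier requirement mentioned in this post.

```python
def format_taxonomy_line(seq_id, levels):
    """Build one line: reference sequence ID, a tab, then levels joined by '; '."""
    return "%s\t%s" % (seq_id, "; ".join(levels))


def parse_taxonomy_line(line):
    """Split one line back into (sequence ID, list of taxonomic levels)."""
    seq_id, taxonomy = line.rstrip("\n").split("\t", 1)
    return seq_id, [level.strip() for level in taxonomy.split(";")]


def has_rdp_depth(levels, depth=6):
    """True when the lineage has exactly `depth` levels (six for RDP, per the post)."""
    return len(levels) == depth


# The two example records from the post above.
records = [
    ("1187", ["Bacteria", "Aquificae", "Aquificae_(class)",
              "Aquificales", "Aquificaceae", "Aquifex"]),
    ("1220", ["Bacteria", "Thermotogae", "Thermotogae_(class)",
              "Thermotogales", "Thermotogaceae", "Marinitoga"]),
]

lines = [format_taxonomy_line(seq_id, levels) for seq_id, levels in records]
for line in lines:
    seq_id, levels = parse_taxonomy_line(line)
    print(seq_id, len(levels), has_rdp_depth(levels))
```

Writing the file through a helper like this avoids the tab-versus-spaces problem that copy and paste from a web page can introduce.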

Cheers,
Jeff

Daniel McDonald

Mar 28, 2011, 10:21:31 PM
to qiime...@googlegroups.com, Jeff Werner, binbin
Hi Binbin,

PyCogent (http://pycogent.sourceforge.net) has parsers in place for
the NCBI taxonomy and provides rich objects for the structure.
Unfortunately, the documentation on that is a bit light right now.
PyCogent is the main dependency for QIIME and is included in the
virtual machine.

Here is a quick summary of how to get started; I can provide further
notes tomorrow (not on my main machine right now...). You'll need to
download nodes.dmp and names.dmp from the NCBI. The method to look at
is cogent.parse.ncbi_taxonomy.NcbiTaxonomyFromFiles, which returns
the root node of the NCBI taxonomy.
-Daniel

Daniel McDonald

Mar 29, 2011, 4:45:21 PM
to qiime...@googlegroups.com, binbin
Hey all, below are some quick notes on playing with the NCBI taxonomy. You can get back relatively easily all of the NCBI Taxon IDs for, say, all species that descend from Bacteria. However, what would probably be most useful for this case is to:

1) BLAST sequences against your favorite NCBI db
2) obtain accessions from the best hits
3) use 'fastacmd' with the '-T' option to get back taxonomic information (NCBI taxon ids)
4) run through the NCBI taxonomy pulling out rank information for each of your NCBI taxon ids

...this would allow you to produce a file that maps your IDs to the NCBI taxonomy.
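
As a rough illustration of step 2, here is a Python sketch that keeps the highest-scoring hit per query. It assumes the BLAST results were saved in the standard 12-column tabular format (query ID in column 1, subject ID in column 2, bit score in column 12); the function name and the IDs in the example are made up. Steps 3 and 4 are not covered here.

```python
import csv
import io


def best_hits(blast_tab_lines):
    """Map each query ID to the subject ID of its highest bit-score hit."""
    best = {}
    for row in csv.reader(blast_tab_lines, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        query, subject, bitscore = row[0], row[1], float(row[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


# Made-up tabular BLAST rows, for illustration only.
example = io.StringIO(
    "read1\tgi|111|gb|AAA.1|\t98.5\t200\t3\t0\t1\t200\t10\t209\t1e-100\t360\n"
    "read1\tgi|222|gb|BBB.1|\t99.0\t200\t2\t0\t1\t200\t5\t204\t1e-105\t375\n"
    "read2\tgi|333|gb|CCC.1|\t90.0\t180\t18\t0\t1\t180\t1\t180\t1e-60\t220\n"
)
hits = best_hits(example)
for query in sorted(hits):
    print(query, hits[query])
```

The subject IDs collected this way are what you would then look up to get NCBI taxon IDs.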

Hope this helps!
-Daniel

NCBI_taxonomy_notes.txt

Mabeuf

Apr 20, 2011, 11:16:23 AM
to Qiime Forum
The way I have done it is by using the TaxCollector program
(https://github.com/audy/TaxCollector) that is associated with Pangea. I was
already looking at Pangea when I realised how this would help out the
qiiming.

TaxCollector takes in a reference FASTA database (e.g. from RDP) and
re-heads the reference database with the taxonomic hierarchy using
nodes.dmp and names.dmp, i.e.:

>S000432353 uncultured delta proteobacterium; DGGE gel band M5-2; AF544067
gttgggttaagtcccgcaacgagcgcaacccctgntnctagttgccaacaggttaagctgagcactctacagggactgcc
tgggcaaccaggaggaaggcggggatgacgtcaagtcctcatggcccttatgnncagggctacacacgtgctacaatggg
cggtacagagngcagcnaactcgcgagagcaagcnaatcncacaaagccgtcctcagttcngattgcaggctgcaactcg
actgcatgaagctggaatcgctagtaatcggagatcagcacnctccggtgaatacgttcccgggccttgtacacac

becomes

>[1]Bacteria;[2]Proteobacteria;[3]Deltaproteobacteria;[4]null;[5]null;[6]null;[7]uncultured_delta;[8]uncultured_delta_proteobacterium
gttgggttaagtcccgcaacgagcgcaacccctgntnctagttgccaacaggttaagctgagcactctacagggactgcc
tgggcaaccaggaggaaggcggggatgacgtcaagtcctcatggcccttatgnncagggctacacacgtgctacaatggg
cggtacagagngcagcnaactcgcgagagcaagcnaatcncacaaagccgtcctcagttcngattgcaggctgcaactcg
actgcatgaagctggaatcgctagtaatcggagatcagcacnctccggtgaatacgttcccgggccttgtacacac


then I just wrote a perl script which split this into 2 files,
id_taxon:
1	Bacteria;Proteobacteria;Deltaproteobacteria;null;null;null;uncultured_delta;uncultured_delta_proteobacterium
2	Bacteria;Proteobacteria;Gammaproteobacteria;null;null;null;uncultured_gamma;uncultured_gamma_proteobacterium
etc...

and the modified blast db:
>1
*sequence*
>2
*sequence*
etc...


and I've never looked back. To summarise, check out taxcollector!
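
For readers who prefer Python, the splitting step could look roughly like the sketch below. This is not the Perl script mentioned above, just an illustration of the same idea; it assumes headers of the form shown in this post (">[1]Bacteria;[2]Proteobacteria;...") and a tab between the ID and the taxonomy. The function name and the short example sequence are made up.

```python
import re


def split_taxcollector_fasta(lines):
    """Return (id_to_taxonomy lines, renumbered FASTA lines)."""
    taxonomy_lines, fasta_lines = [], []
    counter = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            counter += 1
            # Drop the "[n]" rank prefixes carried by the re-headed FASTA.
            lineage = re.sub(r"\[\d+\]", "", line[1:])
            taxonomy_lines.append("%d\t%s" % (counter, lineage))
            fasta_lines.append(">%d" % counter)
        elif line:
            # Sequence lines are copied through unchanged.
            fasta_lines.append(line)
    return taxonomy_lines, fasta_lines


example = [
    ">[1]Bacteria;[2]Proteobacteria;[3]Deltaproteobacteria;[4]null;[5]null;"
    "[6]null;[7]uncultured_delta;[8]uncultured_delta_proteobacterium\n",
    "gttgggttaagtcccgcaacgagc\n",
]
id_taxon, blast_db = split_taxcollector_fasta(example)
print("\n".join(id_taxon))
print("\n".join(blast_db))
```

The first list is written out as the id_to_taxonomy file and the second as the FASTA file used to build the BLAST database.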

Daniel McDonald

Sep 20, 2013, 10:25:57 AM
to qiime...@googlegroups.com
That method sounds reasonable, but I thought I'd bring your attention to the NCBI taxonomy parser available in cogent, which provides some rich functionality for navigating the tree:


Please note though that the NCBI taxonomy is not authoritative (as they state), and rife with errors particularly with 16S. It is possible to go from GI back to Greengenes IDs if you're interested by querying the Greengenes MySQL database (this includes NCBI taxonomy), a dump of which can be obtained here:


Best,
Daniel



On Fri, Sep 20, 2013 at 5:34 AM, jrvalverde <txo...@gmail.com> wrote:
I am trying to pythonize our workflow for insertion into any suitable existing package (I currently use both, QIIME and metAMOS), and one of the things I worry about is taxonomy.

This is a very simple and fast solution I came up with using python:
- first I extract the scientific names from names.dmp,
- then process all tax-id,name pairs into a dictionary
- next, I process nodes.dmp tax-id, parent pairs into a second dictionary
- finally I generate the taxonomy:
-- for each key in the tax-id,name dictionary
--- lookup the hierarchy of parents and store in a stack
--- print the tax-id, the parents stack and the name
- as a bonus I also generate a sorted taxonomy file
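
Since the draft programs are not attached here, the following is only a guess at what the first part could look like, not the poster's actual code. It assumes the usual NCBI dump layout (fields separated by tab, pipe, tab; names.dmp holding tax-id, name and name class; nodes.dmp holding tax-id and parent tax-id in its first two fields; the root being its own parent). The function names are made up, and the toy node lines are truncated to three fields.

```python
def split_dmp(line):
    """Split one NCBI taxonomy dump line into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the end-of-record marker
    return line.split("\t|\t")


def load_names(lines):
    """tax-id -> scientific name, from names.dmp lines."""
    names = {}
    for line in lines:
        fields = split_dmp(line)
        if len(fields) >= 4 and fields[3] == "scientific name":
            names[fields[0]] = fields[1]
    return names


def load_parents(lines):
    """tax-id -> parent tax-id, from nodes.dmp lines."""
    parents = {}
    for line in lines:
        fields = split_dmp(line)
        if len(fields) >= 2:
            parents[fields[0]] = fields[1]
    return parents


def lineage(tax_id, names, parents):
    """Walk up the parents and return the names from the root down to tax_id."""
    stack, seen = [], set()
    while tax_id in parents and tax_id not in seen:
        seen.add(tax_id)  # guards against cycles in malformed input
        stack.append(names.get(tax_id, tax_id))
        if parents[tax_id] == tax_id:  # the root is its own parent
            break
        tax_id = parents[tax_id]
    return stack[::-1]


# Tiny excerpt in dump layout (node lines truncated to three fields).
names = load_names([
    "1\t|\troot\t|\t\t|\tscientific name\t|\n",
    "1\t|\tall\t|\t\t|\tsynonym\t|\n",
    "2\t|\tBacteria\t|\tBacteria <bacteria>\t|\tscientific name\t|\n",
    "131567\t|\tcellular organisms\t|\t\t|\tscientific name\t|\n",
])
parents = load_parents([
    "1\t|\t1\t|\tno rank\t|\n",
    "2\t|\t131567\t|\tsuperkingdom\t|\n",
    "131567\t|\t1\t|\tno rank\t|\n",
])
taxonomy_line = "2\t" + "; ".join(lineage("2", names, parents))
print(taxonomy_line)
```

With the real files, the same functions would be fed open file handles for names.dmp and nodes.dmp instead of the short lists.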

Using the same approach it is trivial now to process the GI:tax-id file:
- load taxonomy.txt (tax-id:tax-hierarchy pairs) into a dictionary
- for each line in the GI:tax-id file
-- look up the hierarchy of the associated tax.id
-- write the GI and the tax-hierarchy

Once you have that one, then it is trivial to process any file with GIs
- load GI:tax-hierarchy file in a dictionary
- for each line in the target file
-- extract the GI
-- lookup the GI associated tax-hierarchy
-- assign and do whatever.
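
The two lookup steps above might be sketched like this. Again this is an illustration under assumptions, not the poster's programs: it assumes taxonomy.txt holds one "tax-id<TAB>hierarchy" pair per line, that the GI:tax-id file has two whitespace-separated columns, and that GIs appear in FASTA headers as "gi|NUMBER|". All function names and example values are made up.

```python
import re

GI_PATTERN = re.compile(r"gi\|(\d+)\|")


def load_taxonomy(lines):
    """tax-id -> hierarchy string, from 'tax-id<TAB>hierarchy' lines."""
    taxonomy = {}
    for line in lines:
        line = line.rstrip("\n")
        if "\t" in line:
            tax_id, hierarchy = line.split("\t", 1)
            taxonomy[tax_id] = hierarchy
    return taxonomy


def gi_to_hierarchy(gi_taxid_lines, taxonomy):
    """Yield (GI, hierarchy) for every GI whose tax-id is in the taxonomy."""
    for line in gi_taxid_lines:
        fields = line.split()
        if len(fields) == 2 and fields[1] in taxonomy:
            yield fields[0], taxonomy[fields[1]]


def extract_gi(header):
    """Return the GI number found in a FASTA header, or None."""
    match = GI_PATTERN.search(header)
    return match.group(1) if match else None


# Made-up inputs, for illustration only.
taxonomy = load_taxonomy(["2\troot; cellular organisms; Bacteria\n"])
gi_table = dict(gi_to_hierarchy(["111\t2\n", "222\t999\n"], taxonomy))
gi = extract_gi(">gi|111|gb|AAA.1| some description")
print(gi, gi_table.get(gi))
```

The GI file is read as a stream here, so only the GIs that resolve to a known tax-id end up in memory.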

If there is interest in the corresponding draft programs, I may upload them. The approach is simple and fast enough to process all genbank IDs and assign them a taxonomy in a reasonable time (330965563 lines in gi_taxid_nucl.dmp take about 1m 58s over an NFS mounted filesystem)

BTW, no SQL database involved or any other such trick. Just trivial python scripts.


KC

Jun 5, 2014, 5:03:16 PM
to qiime...@googlegroups.com
Hi Daniel

I downloaded the MySQL database and deployed it. While randomly checking entries, I looked up GG_ID 1133362 (Vulcanisaeta distributa), which gave me NCBI_TAX_ID 2871757. This seems to be wrong, because according to the Greengenes website it is 572478:

http://greengenes.lbl.gov/cgi-bin/show_one_record_v2.pl?prokMSA_id=1133362

When I search using NCBI_TAX_ID=572478, it gives me another organism: uncultured Sphingobacteriales bacterium

Do you know why this is happening?

Thank you very much.

best rgds
keng

Daniel McDonald

Jun 6, 2014, 1:52:26 PM
to qiime...@googlegroups.com
Interesting, not sure. Looking into it. Thanks for flagging!

