PyCogent (http://pycogent.sourceforge.net) has parsers in place for
the NCBI taxonomy and provides rich objects for the structure.
Unfortunately, the documentation on that is a bit light right now.
PyCogent is the main dependency for QIIME and is included in the
virtual machine.
Here is a quick summary of how to get started; I can provide further
notes tomorrow (not on my main machine right now...). You'll need to
download nodes.dmp and names.dmp from the NCBI. The method to look at
is cogent.parse.ncbi_taxonomy.NcbiTaxonomyFromFiles, which returns
the root node of the NCBI taxonomy.
-Daniel
1) BLAST sequences against your favorite NCBI db
2) obtain accessions from the best hits
3) use 'fastacmd' with the '-T' option to get back taxonomic information (NCBI taxon ids)
4) run through the NCBI taxonomy pulling out rank information for each of your NCBI taxon ids
This would let you produce a file that maps your sequence IDs to the NCBI taxonomy.
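Step 4 could be sketched roughly as follows in plain Python, without PyCogent. This is only an illustration: the function names (parse_dmp_line, load_nodes, ranked_lineage) are made up for this example, and it assumes the usual nodes.dmp layout in which fields are separated by "\t|\t" and the first three fields are tax id, parent tax id and rank.

```python
def parse_dmp_line(line):
    """Split one NCBI .dmp line into its fields."""
    # Fields are separated by "\t|\t"; the line itself ends with "\t|".
    fields = line.rstrip("\n").split("\t|\t")
    fields[-1] = fields[-1].rstrip("\t|")
    return fields


def load_nodes(lines):
    """Return {tax_id: (parent_tax_id, rank)} from nodes.dmp lines."""
    nodes = {}
    for line in lines:
        fields = parse_dmp_line(line)
        nodes[fields[0]] = (fields[1], fields[2])
    return nodes


def ranked_lineage(tax_id, nodes):
    """Walk from tax_id up to the root, returning (rank, tax_id) pairs, root first."""
    lineage = []
    while tax_id in nodes:
        parent, rank = nodes[tax_id]
        lineage.append((rank, tax_id))
        if parent == tax_id:  # the root node is its own parent
            break
        tax_id = parent
    lineage.reverse()
    return lineage
```

With the real file this would be used as nodes = load_nodes(open("nodes.dmp")), followed by ranked_lineage(tax_id, nodes) for each taxon id returned by fastacmd. Unknown taxon ids simply give an empty list.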
Hope this helps!
-Daniel
I am trying to pythonize our workflow so that it can be inserted into any suitable existing package (I currently use both QIIME and metAMOS), and one of the things I worry about is taxonomy.
This is a very simple and fast solution I came up with using Python:
- first, I extract the scientific names from names.dmp
- then I process all tax-id,name pairs into a dictionary
- next, I process the nodes.dmp tax-id,parent pairs into a second dictionary
- finally, I generate the taxonomy:
-- for each key in the tax-id,name dictionary
--- look up the hierarchy of parents and store it in a stack
--- print the tax-id, the parents stack and the name
- as a bonus, I also generate a sorted taxonomy file
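For readers who want something concrete, the steps above might look roughly like this. It is a sketch, not the poster's actual scripts: the function names are invented here, the output layout (tab-separated columns, parents joined with ";") is an assumption, and the parser assumes the standard "\t|\t" field separator of the NCBI dump files.

```python
def parse_dmp_line(line):
    """Split one NCBI .dmp line into its fields."""
    fields = line.rstrip("\n").split("\t|\t")
    fields[-1] = fields[-1].rstrip("\t|")
    return fields


def load_scientific_names(lines):
    """Return {tax_id: name}, keeping only 'scientific name' rows of names.dmp."""
    names = {}
    for line in lines:
        tax_id, name, _unique_name, name_class = parse_dmp_line(line)[:4]
        if name_class == "scientific name":
            names[tax_id] = name
    return names


def load_parents(lines):
    """Return {tax_id: parent_tax_id} from nodes.dmp lines."""
    parents = {}
    for line in lines:
        fields = parse_dmp_line(line)
        parents[fields[0]] = fields[1]
    return parents


def parent_stack(tax_id, parents):
    """Return the ancestors of tax_id, root first, excluding tax_id itself."""
    stack = []
    current = tax_id
    while True:
        parent = parents.get(current)
        if parent is None or parent == current:  # unknown id, or the root
            break
        stack.append(parent)
        current = parent
    stack.reverse()
    return stack


def taxonomy_lines(names, parents):
    """Yield 'tax_id<TAB>parent;parent;...<TAB>name', sorted by numeric tax id."""
    for tax_id in sorted(names, key=int):
        path = ";".join(names.get(p, p) for p in parent_stack(tax_id, parents))
        yield "\t".join((tax_id, path, names[tax_id]))
```

Writing the result is then a loop over taxonomy_lines(names, parents) that prints each line or writes it to taxonomy.txt. Sorting by numeric tax id is one possible reading of the "sorted taxonomy file"; sorting by the hierarchy string would work just as well.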
Using the same approach it is trivial now to process the GI:tax-id file:
- load taxonomy.txt (tax-id:tax-hierarchy pairs) into a dictionary
- for each line in the GI:tax-id file
-- look up the hierarchy of the associated tax-id
-- write the GI and the tax-hierarchy
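A minimal sketch of this GI step, under a few assumptions: taxonomy.txt is taken to have the tax id in its first tab-separated column with the hierarchy in the rest of the line, gi_taxid_nucl.dmp is taken to hold one whitespace-separated "GI tax-id" pair per line, and the function names and the "unknown" placeholder are invented for the example. The GI file is streamed line by line, so only the taxonomy dictionary is held in memory.

```python
def load_taxonomy(lines):
    """Return {tax_id: hierarchy} from lines of the form 'tax_id<TAB>hierarchy'."""
    taxonomy = {}
    for line in lines:
        tax_id, _, hierarchy = line.rstrip("\n").partition("\t")
        if tax_id:
            taxonomy[tax_id] = hierarchy
    return taxonomy


def map_gi_to_hierarchy(gi_taxid_lines, taxonomy, missing="unknown"):
    """Yield (gi, hierarchy) for each 'gi tax_id' line, streaming the input."""
    for line in gi_taxid_lines:
        parts = line.split()
        if len(parts) != 2:  # skip blank or malformed lines
            continue
        gi, tax_id = parts
        yield gi, taxonomy.get(tax_id, missing)


def write_gi_taxonomy(gi_taxid_path, taxonomy_path, out_path):
    """Write 'gi<TAB>hierarchy' lines for every GI in the GI:tax-id file."""
    with open(taxonomy_path) as handle:
        taxonomy = load_taxonomy(handle)
    with open(gi_taxid_path) as src, open(out_path, "w") as out:
        for gi, hierarchy in map_gi_to_hierarchy(src, taxonomy):
            out.write(f"{gi}\t{hierarchy}\n")
```

A call such as write_gi_taxonomy("gi_taxid_nucl.dmp", "taxonomy.txt", "gi_taxonomy.txt") would then produce the GI:tax-hierarchy file; the output file name is just a placeholder.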
Once you have that file, it is trivial to process any file containing GIs:
- load the GI:tax-hierarchy file into a dictionary
- for each line in the target file
-- extract the GI
-- look up the tax-hierarchy associated with the GI
-- assign and do whatever.
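As an illustration of this last step, here is a hedged sketch. The target file format is not specified above, so this assumes the GI appears in each relevant line in the old NCBI style "gi|12345|..." (for example in FASTA headers or BLAST hit ids). It also assumes the GI:tax-hierarchy file is tab-separated with the GI in the first column. All names here are hypothetical.

```python
import re

# Assumption: GIs appear as "gi|<digits>" somewhere in the line.
GI_PATTERN = re.compile(r"gi\|(\d+)")


def load_gi_hierarchy(lines):
    """Return {gi: hierarchy} from lines of the form 'gi<TAB>hierarchy'."""
    table = {}
    for line in lines:
        gi, _, hierarchy = line.rstrip("\n").partition("\t")
        if gi:
            table[gi] = hierarchy
    return table


def annotate(target_lines, gi_hierarchy, missing="unknown"):
    """Yield (gi, hierarchy) for every target line that contains a GI."""
    for line in target_lines:
        match = GI_PATTERN.search(line)
        if match is None:  # no GI on this line, e.g. a sequence line
            continue
        gi = match.group(1)
        yield gi, gi_hierarchy.get(gi, missing)
```

Note that loading the complete GI:tax-hierarchy file into one dictionary needs a lot of memory; if that becomes a problem, one option is to collect the GIs from the target file first and keep only those entries.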
If there is interest in the corresponding draft programs, I may upload them. The approach is simple and fast enough to process all GenBank IDs and assign them a taxonomy in a reasonable time (the 330965563 lines in gi_taxid_nucl.dmp take about 1m 58s over an NFS-mounted filesystem).
BTW, no SQL database or any other such trick is involved. Just trivial Python scripts.