Create a COI reference database for plankton samples


Shannon Williams

Mar 26, 2014, 1:20:06 PM
to qiime...@googlegroups.com
Hi lovely QIIME masters,

I am a relatively new QIIME user and have two 1/8th 454 runs on water samples from Monterey Bay for 28S and COI. I have worked through the 28S samples using the SILVA database but am now stuck on making a reference database for my COI data. I really want to stay within the QIIME environment because I really like the statistics it performs. Is there an (easier??) way to download the appropriate sequences from GenBank and create the rep_set.fasta and id_to_taxonomy.txt files? I am interested in pretty much everything in the water, from bacteria to fungi, algae, and zooplankton (and even an occasional vertebrate!).

Thanks!
Shannon


Greg Caporaso

Mar 27, 2014, 11:35:56 PM
to qiime...@googlegroups.com
Hi Shannon,
I'm not aware of anyone having used one with QIIME, but you might check into the BOLD project:

I am fairly certain that they compile and annotate COI. See their data release page here:

Please do post back and let us know if this does or doesn't work out, as I'm sure that other users will be interested in the future.

Greg
 

Shannon Williams

Apr 24, 2014, 1:54:51 PM
to qiime...@googlegroups.com
Hi Greg,

Thanks for your reply! I am lucky enough to work with software engineers, so I now have a Python script that can output QIIME fasta and taxonomy files to re-train the RDP classifier. We can process fasta, BOLD, or GenBank files.
Everything is working, BUT I think I have some sort of new problem. After I re-train the RDP classifier (assign_taxonomy.py), the assignment files I get are either all unclassified (which I know is not true, because I've either played with these datasets in other programs, BLASTed them, or run them against a truncated SILVA LSU dataset in QIIME), or I get a combination of one random hit many times (like a virus or something...) with a bunch of unclassified hits. So what the heck is going on? Are my training sets too big? Is there some syntax in the training set that is choking the analysis? I have to increase the memory a lot (-m 100000), and it takes around an hour to query my rep_set, but it does eventually finish. It is almost like the classifier gives up and just picks something (usually the same thing) over and over...

any ideas??
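
One inexpensive thing to rule out before blaming database size is a mismatch between the two training files. The sketch below is only a sanity check under stated assumptions, not a confirmed diagnosis of the problem above: it assumes the taxonomy file is tab-delimited (`seq_id<TAB>Level1;Level2;...`) and that FASTA IDs are the first whitespace-delimited token of each header. The function names are hypothetical. It reports duplicate IDs, IDs present in only one file, and how many lineages exist at each depth. As far as I know, the QIIME 1 documentation asks for lineages of a fixed depth when retraining RDP, so a mixed depth histogram is worth a closer look; please verify that against the docs for your version.

```python
from collections import Counter


def read_fasta_ids(lines):
    """Return the ID (first token after '>') of every FASTA header line."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.append(tokens[0])
    return ids


def read_taxonomy(lines):
    """Parse 'seq_id<TAB>A;B;C' lines into {seq_id: [A, B, C]}."""
    taxonomy = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        seq_id, _, lineage = line.partition("\t")
        taxonomy[seq_id.strip()] = [t.strip() for t in lineage.split(";") if t.strip()]
    return taxonomy


def check_training_set(fasta_lines, taxonomy_lines):
    """Summarise common inconsistencies between a FASTA and its taxonomy map."""
    ids = read_fasta_ids(fasta_lines)
    taxonomy = read_taxonomy(taxonomy_lines)
    id_counts = Counter(ids)
    return {
        "duplicate_ids": sorted(i for i, n in id_counts.items() if n > 1),
        "missing_taxonomy": sorted(set(ids) - set(taxonomy)),
        "unused_taxonomy": sorted(set(taxonomy) - set(ids)),
        # Maps lineage depth -> number of reference sequences at that depth.
        "depth_counts": dict(Counter(len(v) for v in taxonomy.values())),
    }
```

With real files, pass open file handles, e.g. `check_training_set(open("rep_set.fasta"), open("id_to_taxonomy.txt"))`.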

Shannon Williams

Apr 24, 2014, 2:09:42 PM
to qiime...@googlegroups.com
PS: the smaller BOLD file works A-OK... :)

Patricia Jones

Jul 9, 2014, 8:08:31 AM
to qiime...@googlegroups.com
Dear Shannon,
I have the same problem! I want to make my own reference database from COI sequences for a wide range of vertebrates and invertebrates from GenBank. I downloaded the sequences as a fasta file, but I am stumped by QIIME's need for a taxonomy assignment .txt file. Does your Python script solve this problem? I would greatly appreciate any help you might have.
Thanks!
Patty

Paul Czechowski

Jul 17, 2014, 5:59:39 AM
to qiime...@googlegroups.com
Dear Shannon and QIIME community,

I would like to build a custom COI reference database for Antarctic soil samples from which I'd like to identify invertebrates. Someone helped me on this forum, and by now I have a Python script that does the job for NCBI data, i.e. with GI numbers.

However, I would like to take advantage of the BOLD DB releases, which are regularly provided as large .tsv tables, as pointed out by Greg:


Here I saw your post:

"I am lucky enough to work with software engineers so I now have a python script that can output QIIME fasta and taxonomy files to re-train the RDP classifier. We can process [...]  BOLD [...] files.."

I was wondering if you are processing the whole DB release or just individual downloads? In any case, would you be willing to share the Python script that does that? I am not very good at programming, but I might be able to make some small adaptations. I would be happy to hear from you; perhaps I could receive a copy of the necessary code?

Kind regards,

Paul
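
For readers wondering what converting a BOLD .tsv table into QIIME-style files might look like, here is a minimal sketch. It is not Shannon's script. The column names (`processid`, `markercode`, `nucleotides`, and the `*_name` rank columns) follow the TSV layout I know from BOLD's public data downloads and are an assumption for any particular data release, so check them against your file's header row. The function name `bold_tsv_to_qiime` is hypothetical.

```python
import csv

# Assumed BOLD column names, ordered from broadest to most specific rank.
RANK_COLUMNS = [
    "phylum_name", "class_name", "order_name",
    "family_name", "genus_name", "species_name",
]


def bold_tsv_to_qiime(tsv_handle, marker="COI-5P"):
    """Return (fasta_records, taxonomy_lines) for rows matching `marker`.

    Every lineage gets the same depth; empty ranks become 'unclassified'.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    fasta_records, taxonomy_lines = [], []
    for row in reader:
        if row.get("markercode") != marker:
            continue
        seq_id = (row.get("processid") or "").strip()
        # Drop alignment gaps and normalise case.
        seq = (row.get("nucleotides") or "").replace("-", "").strip().upper()
        if not seq_id or not seq:
            continue
        lineage = [(row.get(col) or "").strip() or "unclassified"
                   for col in RANK_COLUMNS]
        fasta_records.append(">%s\n%s" % (seq_id, seq))
        taxonomy_lines.append("%s\t%s" % (seq_id, ";".join(lineage)))
    return fasta_records, taxonomy_lines
```

Because rows are streamed one at a time, the same loop could write directly to two output files instead of building lists, which would matter for a full data release.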

Shannon Williams

Jul 17, 2014, 12:57:17 PM
to qiime...@googlegroups.com
Hi Paul and Patricia, email me at sbwilliams216 at gmail and I can give you the scripts! :)



Jamal

Sep 10, 2014, 4:35:47 AM
to qiime...@googlegroups.com
Dear Patricia,

I have the same reference database and the same problem. Could you please guide me through how you did it? My email is: jamal.momeni at agrsci.dk

Thank you very much in advance.
Jamal

Catherine Breton

Dec 26, 2014, 10:06:47 AM
to qiime...@googlegroups.com
Hi all members of the QIIME community,
I have the same problem. I am looking for a script or the file for alignment with PyNAST for COI.
Do you have a solution?

Thanks,
Cathy

Shannon Williams

Dec 31, 2014, 1:42:20 PM
to qiime...@googlegroups.com
Hi Cathy,

Please check out this script. I haven't tried it with the new QIIME yet, but it worked great for the old one... good luck and let me know how it goes!

:)
shan

What I do is search Entrez for whatever gene/group you want (but remember: build too big a database and QIIME freaks out), then download the GenBank-formatted file. That is what you use the script for. You can also do this with the BOLD data release (works better, but for some reason their taxonomy is pretty high level). Good luck, we are working on a publication for this soon!
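
The attached gb2qiime.py is not reproduced in this thread, so the following is only an independent sketch of the kind of conversion described above: read a GenBank flat file and emit a FASTA record plus an `id<TAB>lineage` line per entry. It uses only the standard library and assumes the usual flat-file layout (keywords in the first 12 columns, lineage on the continuation lines under ORGANISM, sequence between ORIGIN and `//`). Function names are hypothetical, and rare cases such as organism names wrapped over several lines are not handled.

```python
def parse_genbank(lines):
    """Yield (accession, lineage_list, sequence) for each GenBank record."""
    accession, lineage_parts, seq_parts, section = None, [], [], None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):  # end of record
            if accession and seq_parts:
                text = " ".join(lineage_parts).rstrip(".")
                lineage = [t.strip() for t in text.split(";") if t.strip()]
                yield accession, lineage, "".join(seq_parts).upper()
            accession, lineage_parts, seq_parts, section = None, [], [], None
            continue
        if section == "ORIGIN":
            # Sequence lines look like '        1 acgtacgtac gt'; keep letters only.
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
            continue
        keyword = line[:12].strip()
        if keyword:
            section = keyword
            value = line[12:].strip()
            if keyword == "ACCESSION" and value:
                accession = value.split()[0]
        elif section == "ORGANISM":
            # Continuation lines under ORGANISM hold the ';'-separated lineage.
            lineage_parts.append(line.strip())


def genbank_to_qiime(lines):
    """Return (fasta_records, taxonomy_lines) built from a GenBank flat file."""
    fasta_records, taxonomy_lines = [], []
    for accession, lineage, seq in parse_genbank(lines):
        fasta_records.append(">%s\n%s" % (accession, seq))
        taxonomy_lines.append(
            "%s\t%s" % (accession, ";".join(lineage) or "unclassified"))
    return fasta_records, taxonomy_lines
```

Note that GenBank lineages vary in depth from record to record, so the taxonomy lines produced this way are not guaranteed to have a uniform number of levels.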

--
gbtoqiimecommand.txt
gb2qiime.py

Shannon Williams

Dec 31, 2014, 1:43:32 PM
to qiime...@googlegroups.com
Jamal,

Did I ever reply to you? If not, I am so sorry! Here are the script and the commands.

:)
Shannon




--
gbtoqiimecommand.txt
gb2qiime.py

ibis

Jun 9, 2015, 11:11:37 AM
to qiime...@googlegroups.com
Dear Shannon,

Many thanks for posting this script! So useful when you need to use custom databases.
I've used it on my COI MiSeq data and it looks like it is working, but with some exceptions.
I ran assign_taxonomy.py with BLAST on my OTU table, using the .txt and .fasta database files that were output by the gb2qiime.py script. The output was in the correct format for downstream analysis, but the taxonomy was a bit off. By that I mean that OTUs were assigned different taxonomies because the different hits had various levels of taxonomy assigned to them. E.g.

OTU_1	Diptera;Nematocera;Chironomoidea;Chironomidae
OTU_2	Endopterygota;Diptera;Nematocera;Chironomoidea;Chironomidae




So when using this taxonomy assignment table for making plots downstream in QIIME, they were considered different taxa even though they are the same. I used a COI database .gb file downloaded from NCBI to BLAST against. Any ideas on how to modify this script to avoid this problem?
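
One possible workaround, sketched here without access to gb2qiime.py itself, is to post-process the taxonomy file so every lineage has the same set of ranks. The idea is to keep only names that sit at a chosen rank and fill any missing rank with a placeholder. The `name_to_rank` lookup is an assumption: in practice it would have to be built from a taxonomy source such as the NCBI taxonomy dump, and the three-entry dictionary in the usage note is purely illustrative. The function name is hypothetical.

```python
# Ranks to keep, in output order. Intermediate groupings (for example
# suborders or superfamilies) are dropped because they are absent from this list.
RANKS = ["phylum", "class", "order", "family", "genus"]


def normalize_lineage(lineage, name_to_rank, ranks=RANKS):
    """Reduce a ';'-separated lineage to one name per rank in `ranks`.

    Names with an unknown or unwanted rank are discarded, and ranks with
    no matching name are filled with 'unclassified', so every output
    lineage has exactly len(ranks) levels.
    """
    by_rank = {}
    for name in lineage.split(";"):
        name = name.strip()
        rank = name_to_rank.get(name)
        if rank in ranks and rank not in by_rank:
            by_rank[rank] = name
    return ";".join(by_rank.get(rank, "unclassified") for rank in ranks)
```

With an illustrative lookup such as `{"Insecta": "class", "Diptera": "order", "Chironomidae": "family"}`, both example lineages above reduce to the same string, so downstream plots would treat them as one taxon. Applying this to the reference id-to-taxonomy file before running assign_taxonomy.py keeps the database consistent at the source.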