Create a COI reference database for plankton samples

Shannon Williams

Mar 26, 2014, 1:20:06 PM

to qiime...@googlegroups.com

Hi lovely Qiime masters,

I am a relatively new QIIME user and have two 1/8th 454 runs on water samples from Monterey Bay for 28S and COI. I have worked through the 28S samples using the SILVA database, but am now stuck on making a reference database for my COI data. I really want to stay within the QIIME environment because I really like the statistics it performs. Is there an (easier??) way to download the appropriate sequences from GenBank and create the rep_set.fasta and id_to_taxonomy.txt files? I am interested in pretty much everything in the water, from bacteria to fungi, algae, and zooplankton (and even an occasional vertebrate!).

Thanks!
Shannon

Greg Caporaso

Mar 27, 2014, 11:35:56 PM

to qiime...@googlegroups.com

Hi Shannon,

I'm not aware of anyone having used one with QIIME, but you might check into the BOLD project:

http://www.barcodinglife.org/

I am fairly certain that they compile and annotate COI. See their data release page here:

http://www.barcodinglife.org/index.php/datarelease

Please do post back and let us know if this does or doesn't work out, as I'm sure that other users will be interested in the future.

Greg

Shannon Williams

Apr 24, 2014, 1:54:51 PM

to qiime...@googlegroups.com

Hi Greg,

Thanks for your reply! I am lucky enough to work with software engineers, so I now have a Python script that can output QIIME FASTA and taxonomy files to re-train the RDP classifier. We can process FASTA, BOLD, or GenBank files.
Everything is working, BUT I think I have a new problem. After I re-train the RDP classifier (assign_taxonomy.py), the assignment files I get are either all unclassified (which I know is not true, because I have played with these datasets in other programs, BLASTed them, or run them against a truncated SILVA LSU dataset in QIIME), or I get one random hit many times (like a virus or something...) alongside a bunch of unclassified hits. So what the heck is going on? Are my training sets too big? Is there some syntax in the training set that is choking the analysis?? I have to increase the memory a lot (-m 100000), and it takes around an hour to query my rep_set, but it does eventually finish. It is almost like the classifier gives up and just picks something (usually the same thing) over and over...

any ideas??
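
On the "syntax in the training set" question, a quick consistency check of the two training files can at least rule out some formatting issues. The following is a minimal sketch, not part of QIIME or RDP: the function names are hypothetical, and it assumes the taxonomy file is tab-separated (sequence ID, then a semicolon-separated lineage). Whether any of the issues it reports is the actual cause of the behaviour described above is not established in this thread.

```python
from collections import Counter


def read_fasta_ids(lines):
    """Return the sequence IDs (first word of each '>' header line)."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            ids.append(fields[0] if fields else "")
    return ids


def read_taxonomy(lines):
    """Parse 'seq_id<TAB>lineage' lines into a dict (assumed layout)."""
    taxonomy = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        seq_id, _, lineage = line.partition("\t")
        taxonomy[seq_id.strip()] = lineage.strip()
    return taxonomy


def check_training_set(fasta_lines, taxonomy_lines):
    """Summarise mismatches between a FASTA file and its taxonomy map."""
    fasta_ids = read_fasta_ids(fasta_lines)
    taxonomy = read_taxonomy(taxonomy_lines)
    id_counts = Counter(fasta_ids)
    # Count how many lineages have 2 levels, 3 levels, and so on.
    depth_counts = Counter(len(lin.split(";")) for lin in taxonomy.values())
    return {
        "missing_taxonomy": sorted(set(fasta_ids) - set(taxonomy)),
        "unused_taxonomy": sorted(set(taxonomy) - set(fasta_ids)),
        "duplicate_ids": sorted(i for i, n in id_counts.items() if n > 1),
        "lineage_depths": dict(depth_counts),
    }


if __name__ == "__main__":
    # Tiny made-up example; real use would pass open file handles instead.
    fasta = [">seq1 some description", "ACGT", ">seq2", "GGCC", ">seq3", "TTAA"]
    tax = ["seq1\tEukaryota;Metazoa;Arthropoda", "seq2\tEukaryota;Metazoa"]
    for key, value in check_training_set(fasta, tax).items():
        print(key, value)
```

Both arguments only need to be iterable over lines, so open file handles for the FASTA and taxonomy files can be passed directly. More than one entry in lineage_depths means the lineages are ragged, which is worth knowing before training regardless of the cause here.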

Shannon Williams

Apr 24, 2014, 2:09:42 PM

to qiime...@googlegroups.com

PS, the smaller BOLD file works A-OK... :)

Patricia Jones

Jul 9, 2014, 8:08:31 AM

to qiime...@googlegroups.com

Dear Shannon,

I have the same problem! I want to make my own reference database from CO1 sequences for a wide range of vertebrates and invertebrates from GenBank. I downloaded the sequences as a FASTA file, but I am stumped by QIIME's need for a taxonomy assignment .txt file. Does your Python script solve this problem? I would greatly appreciate any help you might have.

Thanks!

Patty

Paul Czechowski

Jul 17, 2014, 5:59:39 AM

to qiime...@googlegroups.com

Dear Shannon and QIIME community,

I would like to build a custom COI reference database for Antarctic soil samples from which I'd like to identify invertebrates. Someone helped me on this forum, and by now I have a Python script that does the job for NCBI data, i.e. with GI numbers.

However, I would like to take advantage of the BOLD database releases, which are provided regularly as large .tsv tables, as pointed out by Greg:

http://www.barcodinglife.org/index.php/datarelease

Here I saw your post:

"I am lucky enough to work with software engineers so I now have a python script that can output QIIME fasta and taxonomy files to re-train the RDP classifier. We can process [...] BOLD [...] files.."

I was wondering whether you are processing the whole DB release or just individual downloads? In any case, would you be willing to provide the Python script that does that? I am not very good at programming, but I might be able to make some small adaptations. I would be happy to hear from you; perhaps I could receive a copy of the necessary code?

Kind regards,

Paul
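
For the BOLD .tsv route mentioned above, the conversion to the two QIIME files can be sketched with the standard library alone. This is not the script discussed in this thread. The column names (processid, nucleotides, phylum_name and so on) are assumptions about the BOLD table layout and should be checked against the header of the actual file; the function name is hypothetical.

```python
import csv
import io

# Assumed BOLD column names; verify against the header row of your file.
ID_COLUMN = "processid"
SEQ_COLUMN = "nucleotides"
RANK_COLUMNS = ["phylum_name", "class_name", "order_name",
                "family_name", "genus_name", "species_name"]


def bold_tsv_to_qiime(tsv_handle, fasta_handle, taxonomy_handle):
    """Write a FASTA record and a taxonomy line for each usable TSV row.

    Returns the number of records written.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    written = 0
    for row in reader:
        seq_id = (row.get(ID_COLUMN) or "").strip()
        # Drop alignment gap characters and normalise case.
        seq = (row.get(SEQ_COLUMN) or "").replace("-", "").strip().upper()
        if not seq_id or not seq:
            continue  # skip rows without an ID or a sequence
        ranks = []
        for column in RANK_COLUMNS:
            name = (row.get(column) or "").strip()
            # Pad empty ranks so every lineage has the same depth; replacing
            # spaces with underscores is a cautious choice, not a known rule.
            ranks.append(name.replace(" ", "_") if name else "unclassified")
        fasta_handle.write(">%s\n%s\n" % (seq_id, seq))
        taxonomy_handle.write("%s\t%s\n" % (seq_id, ";".join(ranks)))
        written += 1
    return written


if __name__ == "__main__":
    # Made-up two-row table standing in for a BOLD download.
    header = "\t".join([ID_COLUMN] + RANK_COLUMNS + [SEQ_COLUMN])
    rows = [
        "\t".join(["EX001", "Arthropoda", "Insecta", "Diptera",
                   "Chironomidae", "Chironomus", "Chironomus riparius",
                   "ac-gtac"]),
        "\t".join(["EX002", "Arthropoda", "Insecta", "", "", "", "", ""]),
    ]
    fasta_out, tax_out = io.StringIO(), io.StringIO()
    n = bold_tsv_to_qiime(io.StringIO("\n".join([header] + rows) + "\n"),
                          fasta_out, tax_out)
    print(n)
    print(fasta_out.getvalue(), end="")
    print(tax_out.getvalue(), end="")
```

Reading row by row keeps memory use flat, which matters for a full data release. Padding missing ranks with a placeholder is one way to keep every lineage at the same depth; dropping incompletely identified rows would be the stricter alternative.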

Shannon Williams

Jul 17, 2014, 12:57:17 PM

to qiime...@googlegroups.com

Hi Paul and Patricia, email me at sbwilliams216 at gmail and I can give you the scripts! :)

Jamal

Sep 10, 2014, 4:35:47 AM

to qiime...@googlegroups.com

Dear Patricia,

I have the same reference database and the same problem. Could you please tell me how you solved it? My email is: jamal.momeni at agrsci.dk

Thank you very much in advance.

Jamal

Catherine Breton

Dec 26, 2014, 10:06:47 AM

to qiime...@googlegroups.com

Hi all members of the QIIME community,

I have the same problem. I am looking for a script or the file for the alignment with PyNAST with COI.

Do you have a solution?

Thanks

Cathy

Shannon Williams

Dec 31, 2014, 1:42:20 PM

to qiime...@googlegroups.com

Hi Cathy,

Please check out this script. I haven't tried it with the new QIIME yet, but it worked great for the old one... Good luck and let me know how it goes!

:)

shan

What I do is search Entrez for whatever gene/group you want (but remember: build too big a database and QIIME freaks out), then download the GenBank-formatted file. That file is what you run the script on. You can also do this with the BOLD data release (works better, but for some reason their taxonomy is pretty high level). Good luck, we are working on a publication for this soon!

--

gbtoqiimecommand.txt

gb2qiime.py
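
The workflow described in this post (download a GenBank-formatted file, then convert it to a QIIME FASTA file plus an ID-to-taxonomy map) can be sketched roughly as below. This is not the attached gb2qiime.py, whose contents are not shown in this thread; it is a minimal standard-library illustration with hypothetical function names. It assumes the usual GenBank flat-file layout (keywords in the first 12 columns, lineage on the lines following ORGANISM, sequence between ORIGIN and //) and that organism names fit on one line.

```python
import io
import re


def parse_genbank(handle):
    """Yield (accession, lineage, sequence) for each GenBank record."""
    accession, organism, lineage_parts, seq_parts, section = None, "", [], [], None
    for raw in handle:
        line = raw.rstrip("\r\n")
        if line.startswith("//"):  # end-of-record marker
            sequence = "".join(seq_parts).upper()
            if accession and sequence:
                joined = " ".join(lineage_parts).rstrip(".")
                lineage = [t.strip() for t in joined.split(";") if t.strip()]
                if organism:
                    # Append the organism name as the last level.
                    lineage.append(organism.replace(" ", "_"))
                yield accession, lineage, sequence
            accession, organism, lineage_parts, seq_parts, section = None, "", [], [], None
            continue
        if section == "ORIGIN":
            # Sequence lines carry position numbers and spaces; keep letters.
            seq_parts.append(re.sub(r"[^A-Za-z]", "", line))
            continue
        keyword = line[:12].strip()
        if keyword:
            section = keyword
            value = line[12:].strip()
            if keyword == "ACCESSION" and value:
                accession = value.split()[0]
            elif keyword == "ORGANISM":
                organism = value
        elif section == "ORGANISM":
            # Continuation lines under ORGANISM hold the lineage.
            lineage_parts.append(line.strip())


def genbank_to_qiime(gb_handle, fasta_handle, taxonomy_handle):
    """Write QIIME-style FASTA and taxonomy files; return the record count."""
    count = 0
    for accession, lineage, sequence in parse_genbank(gb_handle):
        fasta_handle.write(">%s\n%s\n" % (accession, sequence))
        taxonomy_handle.write("%s\t%s\n" % (accession, ";".join(lineage)))
        count += 1
    return count


if __name__ == "__main__":
    # Made-up record in GenBank flat-file layout.
    record = (
        "LOCUS       EX000001                  12 bp    DNA     linear   INV 01-JAN-2000\n"
        "ACCESSION   EX000001\n"
        "SOURCE      Chironomus riparius\n"
        "  ORGANISM  Chironomus riparius\n"
        "            Eukaryota; Metazoa; Arthropoda; Insecta; Diptera;\n"
        "            Chironomidae; Chironomus.\n"
        "ORIGIN\n"
        "        1 acgtacgtac gt\n"
        "//\n"
    )
    fasta_out, tax_out = io.StringIO(), io.StringIO()
    print(genbank_to_qiime(io.StringIO(record), fasta_out, tax_out))
    print(fasta_out.getvalue(), end="")
    print(tax_out.getvalue(), end="")
```

Note that GenBank lineages differ in depth from record to record, so the taxonomy strings produced this way are not guaranteed to have a uniform number of levels.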

Shannon Williams

Dec 31, 2014, 1:43:32 PM

to qiime...@googlegroups.com

Jamal,

Did I ever reply to you? If not, I am so sorry! Here are the script and the commands...

:)

Shannon

--

gbtoqiimecommand.txt

gb2qiime.py

ibis

Jun 9, 2015, 11:11:37 AM

to qiime...@googlegroups.com

Dear Shannon,

Many thanks for posting this script! So useful when you need to use custom databases.
I've used it on my COI MiSeq data and it looks like it is working, but with some exceptions.
I ran assign_taxonomy.py with BLAST on my OTU table, using the .txt and .fasta database files that were output by the gb2qiime.py script. The output was in the correct format for downstream analysis, but the taxonomy was a bit off. By that I mean that OTUs were assigned different taxonomies because the different hits had various levels of taxonomy assigned to them. E.g.

OTU_1    Diptera;Nematocera;Chironomoidea;Chironomidae
OTU_2    Endopterygota;Diptera;Nematocera;Chironomoidea;Chironomidae

So when using this taxonomy-assigned table for making plots downstream in QIIME, they were considered different taxa even though they are the same... I used a COI database .gb file downloaded from NCBI to BLAST against. Any ideas on how to modify this script to avoid this problem?
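
One possible way to deal with lineages that differ only in how many higher levels they include, as in the OTU_1 / OTU_2 example, is to harmonise the taxonomy strings before plotting or before building the database. The sketch below is a heuristic of my own for illustration, not a feature of gb2qiime.py or QIIME, and the function name is hypothetical: any lineage that matches the tail end of a longer lineage in the same file is replaced by that longer lineage. It does not help when lineages differ in the middle, and if two different longer lineages share the same tail it simply picks the first in sorted order.

```python
def harmonise_lineages(id_to_lineage):
    """Replace each lineage by the longest lineage that ends with it."""
    split = {
        key: tuple(level.strip() for level in lineage.split(";"))
        for key, lineage in id_to_lineage.items()
    }
    # Longest lineages first; ties broken alphabetically for determinism.
    unique = sorted(set(split.values()), key=lambda lin: (-len(lin), lin))
    canonical = {}
    for lineage in unique:
        depth = len(lineage)
        canonical[lineage] = lineage
        for candidate in unique:
            if len(candidate) <= depth:
                break  # only strictly longer lineages can contain this one
            if candidate[-depth:] == lineage:
                canonical[lineage] = candidate
                break
    return {key: ";".join(canonical[lin]) for key, lin in split.items()}


if __name__ == "__main__":
    assignments = {
        "OTU_1": "Diptera;Nematocera;Chironomoidea;Chironomidae",
        "OTU_2": "Endopterygota;Diptera;Nematocera;Chironomoidea;Chironomidae",
        "OTU_3": "Coleoptera;Carabidae",
    }
    for otu, lineage in sorted(harmonise_lineages(assignments).items()):
        print(otu + "\t" + lineage)
```

The comparison is quadratic in the number of distinct lineages, which is fine for an assignment table but may be slow on a very large reference taxonomy. A more robust alternative would be to map every name to a fixed set of ranks using an external taxonomy, which is beyond this sketch.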
