Using RDP classifier database


Maria

Apr 25, 2017, 10:38:56 AM
to Qiime 1 Forum
Hi QIIME users,

I am having trouble with a command. When I run pick_open_reference_otus.py, I include a parameter file to change the assignment method to rdp:

assign_taxonomy:assignment_method rdp
assign_taxonomy:confidence 0.8

In previous QIIME versions, when I used -m rdp with assign_taxonomy.py, it returned sequences classified against the RDP classifier database. But now, when I run pick_open_reference_otus.py with the parameter file, the results look as if I had used the Greengenes database.
In fact, at http://qiime.org/1.9.0/scripts/assign_taxonomy.html I noticed that -t and -r default to Greengenes, while http://qiime.org/1.5.0/scripts/assign_taxonomy.html does not specify defaults for -t and -r. Just out of curiosity, I opened my rep_set_tax_assignments.txt from my previous analysis and saw this:


RdpTaxonAssigner parameters:

Application:RDP classfier

Citation:Wang, Q, G. M. Garrity, J. M. Tiedje, and J. R. Cole. 2007. Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy. Appl Environ Microbiol. 73(16):5261-7.

Taxonomy:RDP

Confidence:0.5

id_to_taxonomy_filepath:None

id_to_taxonomy_fp:None

max_memory:1500M

real_rdp_version:2.5

reference_sequences_fp:None

training_data_properties_fp:None


So, my question is this: I want to classify my sequences with the RDP classifier method and the RDP database using the command pick_open_reference_otus.py. How can I do that?

Thank you!

Maria

Apr 25, 2017, 1:14:07 PM
to Qiime 1 Forum
Hello,

I found this topic https://groups.google.com/d/msg/qiime-forum/i8gIYxqsKZo/itmENqhACAAJ and I am using this database. Am I on the right path?

Thank you again! Sometimes, we just need to know how to look for the answers in the QIIME group! \o/

Maria

Colin Brislawn

Apr 25, 2017, 3:04:03 PM
to Qiime 1 Forum
Hello Maria,

Thanks for getting in touch with us. I'm glad you found that other thread, but I'll see if I can help you over here.

Your parameter file looks good; that should effectively change the assignment algorithm to the RDP k-mer classifier. 

> In previous qiime versions, when I used -m rdp in the assign_taxonomy.py, it already returned me sequences classified with rdp classifier database.

I'm more familiar with newer versions of qiime (from 1.6.0 forward), but I'm pretty sure qiime has been using the greengenes database as the default for a very long time. I read the 1.5.0 assign_taxonomy page you linked, and it mentions the greengenes database. As you probably already know, you can use any taxonomy assignment method with any database, so you could have used the RDP assigner with the Greengenes database. I think RDP assignment + greengenes database was the default in qiime 1.6.0.

If you want to retrain the RDP classifier to work with something other than greengenes, this page might help:

I hope this is helpful. Perhaps a qiime developer who is familiar with qiime 1.5.0 could offer some advice.

Colin

Maria

Apr 25, 2017, 3:35:44 PM
to Qiime 1 Forum
Hi Colin!

Thank you for your help! I am not sure which version of QIIME I was using, but it did return sequences classified using the RDP database, as you can see in this excerpt from my rep_set_tax_assignments.txt:

denovo1667 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus 0.990
denovo10164 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus 0.890
denovo6244 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus 0.930
denovo5336 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus 0.790
denovo30610 Bacteria 1.000
denovo17932 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus 0.940
denovo15293 Bacteria;Firmicutes;Bacilli;Lactobacillales;Leuconostocaceae;Weissella 0.850
denovo20145 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus 0.800
denovo18372 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus 0.810
denovo6191 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus 0.500
denovo20857 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus 0.670
denovo5145 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus 0.540
denovo21020 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus 0.960
denovo22091 Bacteria;Bacteroidetes;Sphingobacteria;Sphingobacteriales;Sphingobacteriaceae 0.550
denovo16497 Bacteria;Proteobacteria;Alphaproteobacteria;Rhodospirillales;Acetobacteraceae 0.610
denovo16165 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Paralactobacillus 0.690

But that is OK! I am using QIIME 1.9 and, with the file from this topic (https://groups.google.com/d/msg/qiime-forum/i8gIYxqsKZo/itmENqhACAAJ), which seems to be the most recent one from RDP, I was able to run pick_open_reference_otus.py (which gave me more reliable data than the de novo method) with the RDP database on a small fraction of my data, and it worked. Now I am going to run it on the whole dataset!

Again, thank you. I just wanted to explain what I did because I think it may help others; as I noticed, this is not the first topic about using other databases with assign_taxonomy.py.

Best,
Maria

Colin Brislawn

Apr 25, 2017, 9:23:58 PM
to Qiime 1 Forum
Thanks for the update Maria,

> I just wanna to explain what I did because I think it may help other,

I totally agree! Your feedback is fantastic.

Let me know how this goes,
Colin
 

Maria

Apr 26, 2017, 10:37:29 AM
to Qiime 1 Forum
Hi Colin,

My analysis seemed to work pretty well! The results look consistent and I feel confident about it!

I only changed -r and -t for the assign_taxonomy.py step, using the parameter file passed to pick_open_reference_otus.py.

If I may ask one more thing: in your experience, does this database look right to you? https://groups.google.com/forum/#!msg/qiime-forum/i8gIYxqsKZo/itmENqhACAAJ I mean, it has the right format and I was able to classify my sequences according to the RDP database. I only ask because I am a beginner here! haha

Thank you again!
Maria

Colin Brislawn

Apr 26, 2017, 11:56:33 AM
to Qiime 1 Forum
Hello Maria,

The script will throw many errors if there is a problem, so it sounds like it worked for you. When using the default method -m uclust, you should be able to drop in any new database with -t and -r. Did you also change the algorithm with -m, or just the database with -r and -t?

Colin

Maria

Apr 26, 2017, 12:09:43 PM
to Qiime 1 Forum
Hi Colin,

I used the default in the algorithm method, so it was uclust. I only changed -r and -t.

So it's good to know! :) Thank you!

Maria

Apr 26, 2017, 7:13:35 PM
to Qiime 1 Forum
Hi Colin,

OK, so I have a bit of a problem using uclust with the RDP database. When I compare the assignments for the sequences in rep_set.fna with those from the online RDP classifier (http://rdp.cme.msu.edu/classifier/classifier.jsp), I get different results. Is this expected? Should I use -m rdp to obtain the same results?

I tried to run assign_taxonomy.py with -r and -t pointing to the RDP database and, as expected, I had an issue with the format because my tax file was not 6 levels deep. So I fixed that myself, and when I ran assign_taxonomy.py with this new file using uclust, it worked. But when I tried to run it with -m rdp, it returned a message saying "Illegal taxonomy format at...".
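For readers who run into the same format issue, one quick sanity check is to count how many levels each lineage in the taxonomy mapping file has. The sketch below is not part of QIIME; the function name `taxonomy_depths` and the third example line are hypothetical, and it assumes each line holds a sequence ID, whitespace, and a semicolon-separated lineage. Whether uneven depth is what triggers the "Illegal taxonomy format" message is not established in this thread.

```python
from collections import Counter


def taxonomy_depths(lines):
    """Count how many lineages have each number of taxonomic levels."""
    depths = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split into "<sequence id>" and "<lineage>" at the first whitespace.
        parts = line.split(None, 1)
        if len(parts) < 2:
            # An ID with no lineage at all counts as depth 0.
            depths[0] += 1
            continue
        # Ignore empty fields such as the one left by a trailing ';'.
        levels = [lvl.strip() for lvl in parts[1].split(";") if lvl.strip()]
        depths[len(levels)] += 1
    return depths


example_lines = [
    "AJ000684|S000004347\tBacteria;Actinobacteria;Actinobacteria;Actinomycetales;Mycobacteriaceae;Mycobacterium;",
    "EF599163|S000871589\tBacteria;Proteobacteria;Gammaproteobacteria;Vibrionales;Vibrionaceae;Vibrio;",
    "seqX\tBacteria;Firmicutes;Bacilli",  # hypothetical short lineage
]

depths = taxonomy_depths(example_lines)
for depth, count in sorted(depths.items()):
    print(f"{count} lineage(s) with {depth} level(s)")
```

If more than one depth is reported, the shorter lineages are the ones to inspect first.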

What should I check?

Thank you!
Maria

Colin Brislawn

Apr 27, 2017, 1:21:56 AM
to Qiime 1 Forum
Hello Maria,

If you want matching results, you should use the RDP algorithm too by passing -m rdp. (Keep in mind that the version of RDP on the website may be different from the one that ships with qiime, so results may still vary.)

The RDP assignment method requires you to retrain the database if you switch to something other than the default (greengenes). You can read more about retraining the database here. I think this should address that error.

Or, you could use the default -m uclust method. I have an easier time understanding the underlying idea and limitations of the uclust method, so I think using it with your RDP database may be a good fit too. 

Let me know how I can help,
Colin

Maria

Apr 27, 2017, 12:40:51 PM
to Qiime 1 Forum
Hi Colin,

Let me check whether I understand what retraining means. Does retraining my database mean providing the -r and -t paths instead of using the defaults? I am a little lost on this point.

I have also attached the .fasta and .tax files that I am using with assign_taxonomy.py. It does not run when I use -m rdp, but it does when I use -m uclust. Do you see any problem with my files?

Thank you!
Maria
teste.tax

Colin Brislawn

Apr 27, 2017, 1:00:30 PM
to Qiime 1 Forum, Antonio González Peña
Hello Maria,

Thanks for posting that tax file. I wanted to check that first to see if its formatting could be causing a problem, and I think that could be the issue!

Here is what a valid format looks like (from greengenes)
367523 k__Bacteria; p__Bacteroidetes; c__Flavobacteriia; o__Flavobacteriales; f__Flavobacteriaceae; g__Flavobacterium; s__
187144 k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__
836974 k__Bacteria; p__Cyanobacteria; c__Chloroplast; o__Cercozoa; f__; g__; s__
310669 k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__

Here is how that taxonomy looks:
AJ000684|S000004347 Bacteria;Actinobacteria;Actinobacteria;Actinomycetales;Mycobacteriaceae;Mycobacterium;
EF599163|S000871589 Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales;Vibrionaceae;Vibrio;
AY859683|S000631792 Bacteria;Actinobacteria;Actinobacteria;Actinomycetales;Mycobacteriaceae;Mycobacterium;

In your reference database, are the reads named 'AJ000684|S000004347' or just 'AJ000684'? If the full name with the bar symbol | is used, then I think this looks good, and retraining is needed.
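The ID question above can also be checked mechanically. The following is a minimal sketch, not a QIIME utility: the function names are hypothetical, the FASTA records are made-up shortened examples in which one header lacks the "|S..." part, and it assumes the ID is the first whitespace-delimited field in both files.

```python
def fasta_ids(lines):
    """Collect sequence IDs: the header text after '>' up to the first whitespace."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def taxonomy_ids(lines):
    """Collect the first whitespace-delimited field of each non-empty line."""
    ids = set()
    for line in lines:
        fields = line.split()
        if fields:
            ids.add(fields[0])
    return ids


taxonomy_lines = [
    "AJ000684|S000004347\tBacteria;Actinobacteria;Actinobacteria;Actinomycetales;Mycobacteriaceae;Mycobacterium;",
    "EF599163|S000871589\tBacteria;Proteobacteria;Gammaproteobacteria;Vibrionales;Vibrionaceae;Vibrio;",
]

# Hypothetical FASTA records with shortened sequences.
fasta_lines = [
    ">AJ000684|S000004347",
    "gaacgctggcggcgtgcttaacacatgcaag",
    ">EF599163",
    "gtttgatcctggctcagattgaacgctggcg",
]

tax = taxonomy_ids(taxonomy_lines)
ref = fasta_ids(fasta_lines)

# IDs present in one file but not in the other point to a naming mismatch.
missing_from_fasta = sorted(tax - ref)
missing_from_taxonomy = sorted(ref - tax)
print("In taxonomy but not in FASTA:", missing_from_fasta)
print("In FASTA but not in taxonomy:", missing_from_taxonomy)
```

With real files, the two lists would be built by passing open file handles instead of the in-memory lists used here.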


Searching for matching text is a hard problem in computer science. Searching for text that is almost matching is harder, and clever algorithms have been developed to make this process faster. When you try to match the DNA sequence from your OTUs to reads in a database, this is a text search problem, so we can make use of these algorithms.
Both the RDP and uclust methods make use of a process called k-mer counting as part of their algorithms. The k-mer method makes use of strings of DNA of length k; so ATG is three letters long and would be a 3mer, while ACTCGTAA is eight letters long and would be called an 8mer. How does counting 8mers or 3mers speed up searches? Take a look at how the uclust algorithm uses it: 

The RDP classifier also uses a k-mer method, and these k-mers need to be counted before the search can begin. The RDP devs call this process 'retraining,' but you could also call it 'indexing.' 
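To make the k-mer idea concrete, here is a small sketch of k-mer counting in plain Python. It only illustrates the concept described above; it is not how the RDP classifier or uclust is implemented, and the function names and the second example sequence are made up for illustration.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping substring of length k in a DNA sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def shared_kmers(seq_a, seq_b, k):
    """Number of k-mers two sequences have in common (counting repeats)."""
    counts_a = count_kmers(seq_a, k)
    counts_b = count_kmers(seq_b, k)
    # Counter intersection keeps the minimum count of each shared k-mer.
    return sum((counts_a & counts_b).values())


# The 8-letter example from the post contains six overlapping 3-mers.
kmers = count_kmers("ACTCGTAA", 3)
print(sorted(kmers))

# A hypothetical second sequence that differs at one position.
print(shared_kmers("ACTCGTAA", "ACTCGTTA", 3))
```

Counting the k-mers of every reference sequence once, ahead of time, is the step that corresponds to the 'retraining' or 'indexing' mentioned above: later queries can be compared against the stored counts instead of being aligned against each full sequence.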

Before we dive into this, I want to check with a qiime dev about the best way to do this.
Keep in touch!
Colin

Maria

Apr 27, 2017, 1:12:26 PM
to Qiime 1 Forum, antg...@gmail.com
Hi Colin,

So, this is what my .fasta file looks like:

>AJ000684|S000004347 Root;Bacteria;"Actinobacteria";Actinobacteria;Actinobacteridae;Actinomycetales;Corynebacterineae;Mycobacteriaceae;Mycobacterium
gaacgctggcggcgtgcttaacacatgcaagtcgaacggaaaggtctcttcggagatactcgagtggcgaacgggtgagtaacacgtgggtaatctgccctgcacatcgggataagcctgggaaactgggtctaataccgaataggacctcgaggcgcatgccttgtggtggaaagcttttgcggtgtgggatgggcccgcggcctatcagcttgttggtggggtgacggcctaccaaggcgacgacgggtagccggcctgagagggtgtccggccacactgggactgagatacggcccagactcctacgggaggcagcagtggggaatattgcacaatgggcgcaagcctgatgcagcgacgccgcgtgggggatgacggncttcgggttgtaaacctctttcagcagggacgaagcgcaagtgacggtacctgcagaagaagcaccggccaactacgtgccagcagccgcggtaatacgtagggtgcgagcgttgtccggaattactgggcgtaaagagctcgtaggtggtttgtcgcgttgttcgtgaaaaccgggggcttaaccctcggcgtgcgggcgatacgggcagactggagtactgcaggggagactggaattcctggtgtagcggtggaatgcgcagatatcaggaggaacaccggtggcgaaggcgggtctctgggcagtaactgacgctgaggagcgaaagcgtggggagcgaacaggattagataccctggtagtccacgccgtaaacggtgggtactaggtgtgggtttccttccttgggatccgtgccgtagctaacgcattaagtaccccgcctggggagtacggccgcaaggctaaaactcaaaggaattgacgggggcccgcacaagcggcggagcatgtggattaattcgatgcaacgcgaagaaccttacctgggtttgacatgcacaggacgccggcagagatgtcggttcccttgtggcctgtgtgcaggtggtgcatggctgtcgtcagctcgtgtcgtgagatgttgggttaagtcccgcaacgagcgcaacccttgtctcatgttgccagcgggtaatgccggggactcgtgagagactgccggggtcaactcggaggaaggtggggatgacgtcaagtcatcatgccccttatgtccagggcttcacacatgctacaatggccggtacaaagggctgcgatgccgcaaggttaagcgaatccttttaaagccggtctcagttcggatcggggtctgcaactcgaccccgtgaagtcggagtcgctagtaatcgcagatcagcaacgctgcggtgaatacgttcccgggccttgtacacaccgcccgtcacgtcatgaaagtcggtaacacccgaagccagtggcctaacctttgggagggagctgtcgaaggtgggatcggcgattgggacgaagtcgt
>EF599163|S000871589 Root;Bacteria;"Proteobacteria";Gammaproteobacteria;"Vibrionales";Vibrionaceae;Vibrio
gtttgatcctggctcagattgaacgctggcggcaggcctaacacatgcaagtcgagcggaaacgacactaacaatccttcgggtgcgttaatgggcgtcgagcggcggacgggtgagtaatgcctaggaaattgccttgatgtgggggataaccattggaaacgatggctaataccgcataatgcctacgggccaaagagggggaccttcgggcctctcgcgtcaagatatgcctaggtgggattagctagttggtgaggtaatggctcaccaaggcgacgatccctagctggtctgagaggatgatcagccacactggaactgagacacggtccagactcctacgggaggcagcagtggggaatattgcacaatgggcgaaagcctgatgcagccatgccgcgtgtatgaagaaggccttcgggttgtaaagtactttcagttgtgaggaagggtgtgtagttaatagctgcamatcttgacgttagcaacagaagaagcaccggctaactccgtgccagcagccgcggtaatacggagggtgcgagcgttaatcggaattactgggcgtaaagcgcatgcaggtggttcattaagtcagatgtgaaagcccggggctcaacctcggaactgcatttgaaactggtgaactagagtgctgtagaggggggtagaatttcaggtgtagcggtgaaatgcgtagagatctgaaggaataccagtggcgaaggcggccccctggacagacactgacactcagatgcgaaagcgtggggagcaaacaggattagataccctggtagtccacgccgtaaacgatgtctacttggaggttgtggccttgagccgtggctttcggagctaacgcgttaagtagaccgcctggggagtacggtcgcaagattaaaactcaaatgaattgacgggggcccgcacaagcggtggagcatgtggtttaattcgatgcaacgcgaagaaccttacctactcttgacatccagagaagccagcggagacgcaggtgtgccttcgggagctctgagacaggtgctgcatggctgtcgtcagctcgtgttgtgaaatgttgggttaagtcccgcaacgagcgcaacccttatccttgtttgccagcgagtaatgtcgggaactccagggagactgccggtgataaaccggaggaaggtggggacgacgtcaagtcatcatggcccttacgagtagggctacacacgtgctacaatggcgcatacagagggcagcaagctagcgatagtgagcgaatcccaaaaagtgcgtcgtagtccggattggagtctgcaactcgactccatgaagtcggaatcgctagtaatcgtgaatcagaatgtcacg

Just so you know.

I will wait for your answer. Thank you for looking into this for me!

Maria

Jai Ram Rideout

May 2, 2017, 4:10:50 PM
to Qiime 1 Forum
Hi Maria,

In order to help you I'll need a few more details:

1. The exact command you're running to perform taxonomy assignment with the RDP database and RDP classifier

2. The entire error message you're receiving when running that command

3. Your QIIME parameters file as an attachment

4. The output from running print_qiime_config.py -tf

5. Where did you obtain your reference sequence FASTA file and taxonomy mapping file? I'd like to download the original files so I can look through them.

Thanks,
Jai

Maria

May 11, 2017, 9:46:28 AM
to Qiime 1 Forum
Hi Jai,

It turns out that using the RDP training sequences as the database with the UCLUST method gave me a satisfactory result when I used a --min_consensus_fraction of 0.7.
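For readers wondering what a consensus fraction does, the sketch below shows the general idea of consensus-based assignment: a taxonomic level is kept only while a large enough fraction of the database hits agree on it. This is an illustration under stated assumptions, not QIIME's code; the function name, the three example hits, and the 0.51 value used for comparison (which, as far as I know, is the QIIME 1.9 default) are assumptions.

```python
from collections import Counter


def consensus_lineage(hit_lineages, min_consensus_fraction):
    """Keep each level while the most common lineage prefix reaches the fraction."""
    if not hit_lineages:
        return []
    n_hits = len(hit_lineages)
    # Only compare down to the depth that every hit provides.
    depth = min(len(lineage) for lineage in hit_lineages)
    consensus = []
    for level in range(depth):
        prefixes = Counter(tuple(lineage[:level + 1]) for lineage in hit_lineages)
        prefix, count = prefixes.most_common(1)[0]
        if count / n_hits >= min_consensus_fraction:
            consensus = list(prefix)
        else:
            break
    return consensus


# Hypothetical hits for one query: two agree at genus level, one does not.
family = ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales", "Lactobacillaceae"]
hits = [
    family + ["Lactobacillus"],
    family + ["Lactobacillus"],
    family + ["Paralactobacillus"],
]

loose = consensus_lineage(hits, 0.51)
strict = consensus_lineage(hits, 0.7)
print(";".join(loose))
print(";".join(strict))
```

With two of three hits agreeing on the genus (about 0.67), the lower threshold keeps the genus while 0.7 stops at the family, which is why raising the fraction gives more conservative assignments.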

But thank you for your help, anyway! :)

Maria