Using RDP classifier database

Maria

Apr 25, 2017, 10:38:56 AM

to Qiime 1 Forum

Hi QIIME users,

I am having trouble with a command. When I run pick_open_reference_otus.py, I include a parameter file to change the assignment method to rdp:

assign_taxonomy:assignment_method rdp

assign_taxonomy:confidence 0.8

In previous QIIME versions, when I used -m rdp in assign_taxonomy.py, it returned sequences classified with the RDP classifier database. But now, when I use pick_open_reference_otus.py with the parameter file, the results look as if I had used the Greengenes database.

In fact, at http://qiime.org/1.9.0/scripts/assign_taxonomy.html I realized that -t and -r default to Greengenes, while http://qiime.org/1.5.0/scripts/assign_taxonomy.html does not specify defaults for -t and -r. Just out of curiosity, I opened the rep_set_tax_assignments.txt from my previous analysis and saw this:

RdpTaxonAssigner parameters:

Application:RDP classfier

Citation:Wang, Q, G. M. Garrity, J. M. Tiedje, and J. R. Cole. 2007. Naive Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy. Appl Environ Microbiol. 73(16):5261-7.

Taxonomy:RDP

Confidence:0.5

id_to_taxonomy_filepath:None

id_to_taxonomy_fp:None

max_memory:1500M

real_rdp_version:2.5

reference_sequences_fp:None

training_data_properties_fp:None

So my question is this: I want to classify my sequences with the RDP classifier method and the RDP database using the command pick_open_reference_otus.py. How can I do that?

Thank you!

Maria

Apr 25, 2017, 1:14:07 PM

to Qiime 1 Forum

Hello,

I found this topic https://groups.google.com/d/msg/qiime-forum/i8gIYxqsKZo/itmENqhACAAJ and I am using this database. Am I on the right path?

Thank you again! Sometimes, we just need to know how to look for the answers in the QIIME group! \o/

Maria

Colin Brislawn

Apr 25, 2017, 3:04:03 PM

to Qiime 1 Forum

Hello Maria,

Thanks for getting in touch with us. I'm glad you found that other thread, but I'll see if I can help you over here.

Your parameter file looks good; that should effectively change the assignment algorithm to the RDP k-mer classifier.

> In previous QIIME versions, when I used -m rdp in assign_taxonomy.py, it returned sequences classified with the RDP classifier database.

I'm more familiar with newer versions of qiime (from 1.6.0 forward), but I'm pretty sure qiime has been using the greengenes database as the default for a very long time. I read the 1.5.0 assign_taxonomy page you linked, and it mentions the greengenes database. As you probably already know, you can use any taxonomy assignment method with any database, so you could have used the RDP assigner with the Greengenes database. I think RDP assignment + greengenes database was the default in qiime 1.6.0.

If you want to retrain the RDP classifier to work with something other than greengenes, this page might help:

http://qiime.org/tutorials/retraining_rdp.html

I hope this is helpful. Perhaps a qiime developer who is familiar with qiime 1.5.0 could offer some advice.

Colin

Maria

Apr 25, 2017, 3:35:44 PM

to Qiime 1 Forum

Hi Colin!

Thank you for your help! I am not sure which version of QIIME I was using, but it did return sequences classified using the RDP database, as you can see in this copy of my rep_set_tax_assignments.txt:

denovo1667 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus 0.990

denovo10164 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus 0.890

denovo6244 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus 0.930

denovo5336 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus 0.790

denovo30610 Bacteria 1.000

denovo17932 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus 0.940

denovo15293 Bacteria;Firmicutes;Bacilli;Lactobacillales;Leuconostocaceae;Weissella 0.850

denovo20145 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus 0.800

denovo18372 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus 0.810

denovo6191 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus 0.500

denovo20857 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus 0.670

denovo5145 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus 0.540

denovo21020 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus 0.960

denovo22091 Bacteria;Bacteroidetes;Sphingobacteria;Sphingobacteriales;Sphingobacteriaceae 0.550

denovo16497 Bacteria;Proteobacteria;Alphaproteobacteria;Rhodospirillales;Acetobacteraceae 0.610

denovo16165 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Paralactobacillus 0.690

But that is OK! I am using QIIME 1.9 and, with the file from this topic (https://groups.google.com/d/msg/qiime-forum/i8gIYxqsKZo/itmENqhACAAJ), which seems to be the most recent one from RDP, I was able to run pick_open_reference_otus.py (which gave me more reliable data than the de novo method) with the RDP database on a small fraction of my data, and it worked. Now I am going to run it on the whole dataset!

Again, thank you. I just wanted to explain what I did because I think it may help others; as I noticed, this is not the first topic about using other databases with assign_taxonomy.py.

Best,

Maria

Colin Brislawn

Apr 25, 2017, 9:23:58 PM

to Qiime 1 Forum

Thanks for the update Maria,

> I just wanted to explain what I did because I think it may help others,

I totally agree! Your feedback is fantastic.

Let me know how this goes,

Colin

Maria

Apr 26, 2017, 10:37:29 AM

to Qiime 1 Forum

Hi Colin,

My analysis seemed to work pretty well! The results look consistent and I feel confident about it!

I just changed -r and -t for the assign_taxonomy.py step, using the parameter file passed to pick_open_reference_otus.py.
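For readers following along, a parameters file that overrides the reference database for the assign_taxonomy.py step might be generated as in the minimal sketch below. The file names rdp_refs.fasta and rdp_taxonomy.txt are placeholders, the helper build_params is hypothetical, and the option names reference_seqs_fp and id_to_taxonomy_fp are assumed to be the long forms of -r and -t.

```python
# Sketch: build the text of a QIIME 1 parameters file that points the
# assign_taxonomy.py step at a custom reference database.
# All paths are placeholders; build_params is a hypothetical helper.

def build_params(reference_seqs_fp, id_to_taxonomy_fp, method="uclust"):
    """Return parameters-file text, one 'script:option value' entry per line."""
    lines = [
        "assign_taxonomy:assignment_method %s" % method,
        # Assumed long form of -r
        "assign_taxonomy:reference_seqs_fp %s" % reference_seqs_fp,
        # Assumed long form of -t
        "assign_taxonomy:id_to_taxonomy_fp %s" % id_to_taxonomy_fp,
    ]
    return "\n".join(lines) + "\n"


params_text = build_params("rdp_refs.fasta", "rdp_taxonomy.txt")
print(params_text)
```

The printed text would be saved to a file and handed to pick_open_reference_otus.py as its parameter file.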

If I may ask one more thing: in your experience, does this database seem right to you? https://groups.google.com/forum/#!msg/qiime-forum/i8gIYxqsKZo/itmENqhACAAJ I mean, it has the right format and, well, I was able to classify my sequences according to the RDP database. I only ask because I am a beginner here! haha

Thank you again!

Maria

Colin Brislawn

Apr 26, 2017, 11:56:33 AM

to Qiime 1 Forum

Hello Maria,

The script will throw many errors if there is a problem, so it sounds like it worked for you. When using the default method -m uclust, you should be able to drop in any new database with -t and -r. Did you also change the algorithm with -m, or just the database with -r and -t?

Colin

Maria

Apr 26, 2017, 12:09:43 PM

to Qiime 1 Forum

Hi Colin,

I used the default algorithm, so it was uclust. I only changed -r and -t.

So it's good to know! :) Thank you!

Maria

Apr 26, 2017, 7:13:35 PM

to Qiime 1 Forum

Hi Colin,

Ok, so, I have a little bit of a problem using uclust with the RDP database. When I compare the sequences of rep_set.fna with the online RDP classifier (http://rdp.cme.msu.edu/classifier/classifier.jsp), I get different results. Is this expected? Should I use -m rdp to obtain the same results?

I tried to run assign_taxonomy.py changing -r and -t to the RDP database and, as expected, I had an issue with the format because my tax file wasn't 6 levels deep. So I fixed that myself, and when I ran assign_taxonomy.py with this new file using uclust, it worked. But when I tried to run it with -m rdp, it returned a message saying "Illegal taxonomy format at...".

What should I check?

Thank you!

Maria

Colin Brislawn

Apr 27, 2017, 1:21:56 AM

to Qiime 1 Forum

Hello Maria,

If you want matching results, you should use the RDP algorithm too by passing -m rdp. (Keep in mind that the version of RDP on the website may be different from the one that ships with qiime, so results may still vary.)

The RDP assignment method requires you to retrain the database if you switch to something other than the default (greengenes). You can read more about retraining the database here. I think this should address that error.

http://qiime.org/tutorials/retraining_rdp.html

Or, you could use the default -m uclust method. I have an easier time understanding the underlying idea and limitations of the uclust method, so I think using it with your RDP database may be a good fit too.

Let me know how I can help,

Colin

Maria

Apr 27, 2017, 12:40:51 PM

to Qiime 1 Forum

Hi Colin,

Just let me check that I understand what retraining means. Does retraining my database mean providing the -r and -t paths instead of using the defaults? I am a little lost on this issue.

I have also attached the .fasta and .tax files that I am using with assign_taxonomy.py. It does not run when I use -m rdp, but it does when I use -m uclust. Do you see any problem with my files?

Thank you!

Maria

teste.tax

Colin Brislawn

Apr 27, 2017, 1:00:30 PM

to Qiime 1 Forum, Antonio González Peña

Hello Maria,

Thanks for posting that tax file. I wanted to check that first to see if its formatting could be causing a problem, and I think that could be the issue!

Here is what a valid format looks like (from greengenes):
367523 k__Bacteria; p__Bacteroidetes; c__Flavobacteriia; o__Flavobacteriales; f__Flavobacteriaceae; g__Flavobacterium; s__
187144 k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__
836974 k__Bacteria; p__Cyanobacteria; c__Chloroplast; o__Cercozoa; f__; g__; s__
310669 k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__; g__; s__

Here is how your taxonomy looks:

AJ000684|S000004347 Bacteria;Actinobacteria;Actinobacteria;Actinomycetales;Mycobacteriaceae;Mycobacterium;

EF599163|S000871589 Bacteria;Proteobacteria;Gammaproteobacteria;Vibrionales;Vibrionaceae;Vibrio;

AY859683|S000631792 Bacteria;Actinobacteria;Actinobacteria;Actinomycetales;Mycobacteriaceae;Mycobacterium;

In your reference database, are the reads named 'AJ000684|S000004347' or just 'AJ000684'? If the full name with the bar symbol (|) is used, then I think this looks good, and retraining is needed.
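A quick way to answer that question is to compare the IDs in the FASTA headers against the IDs in the taxonomy file. The sketch below is only an illustration: the helper names are hypothetical, the two taxonomy lines are copied from this thread, and the third FASTA header (a bare accession) is made up to show a mismatch. It also reports lineages with an empty rank, such as the one produced by a trailing semicolon; whether that matters to the RDP retraining step is not established here.

```python
# Sketch: sanity-check an id-to-taxonomy mapping against reference FASTA headers.
# Helper names are hypothetical; the last FASTA header is invented for illustration.

def parse_taxonomy(lines):
    """Map sequence ID -> list of ranks from 'ID<whitespace>lineage' lines."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        seq_id, lineage = line.split(None, 1)
        mapping[seq_id] = [rank.strip() for rank in lineage.split(";")]
    return mapping


def fasta_ids(lines):
    """Return the first whitespace-delimited token of each FASTA header."""
    return [line[1:].split()[0] for line in lines if line.startswith(">")]


def report(taxonomy, ids):
    """Return (IDs missing from the mapping, IDs with empty ranks, depth per ID)."""
    missing = [i for i in ids if i not in taxonomy]
    empty_ranks = sorted(i for i, ranks in taxonomy.items() if "" in ranks)
    depths = {i: len([r for r in ranks if r]) for i, ranks in taxonomy.items()}
    return missing, empty_ranks, depths


tax_lines = [
    "AJ000684|S000004347\tBacteria;Actinobacteria;Actinobacteria;"
    "Actinomycetales;Mycobacteriaceae;Mycobacterium;",
    "EF599163|S000871589\tBacteria;Proteobacteria;Gammaproteobacteria;"
    "Vibrionales;Vibrionaceae;Vibrio;",
]
fasta_lines = [
    ">AJ000684|S000004347 Root;Bacteria",
    "gaacgctggcgg",
    ">EF599163|S000871589 Root;Bacteria",
    "gtttgatcctgg",
    ">AJ000684 made-up header with a bare accession",
    "acgtacgt",
]

missing, empty_ranks, depths = report(parse_taxonomy(tax_lines), fasta_ids(fasta_lines))
print("IDs without a taxonomy entry:", missing)
print("Lineages with an empty rank:", empty_ranks)
print("Non-empty ranks per ID:", depths)
```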

Searching for matching text is a hard problem in computer science. Searching for text that is almost matching is harder, and clever algorithms have been developed to make this process faster. When you try to match the DNA sequence from your OTUs to reads in a database, this is a text search problem, so we can make use of these algorithms.

Both the RDP and uclust methods make use of a process called k-mer counting as part of their algorithms. The k-mer method makes use of strings of DNA of length k: ATG is three letters long and would be a 3-mer, while ACTCGTAA is eight letters long and would be called an 8-mer. How does counting 8-mers or 3-mers speed up searches? Take a look at how the uclust algorithm uses it:

http://drive5.com/usearch/manual/usearch_algo.html

The RDP classifier also uses a k-mer method, and these k-mers need to be counted before the search can begin. The RDP devs call this process 'retraining,' but you could also call it 'indexing.'
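As a rough illustration of the k-mer idea, the sketch below counts the k-mers in a short DNA string and then counts how many distinct k-mers two sequences share. It is a toy example, not how uclust or the RDP classifier is actually implemented; the function names and the second test sequence are made up.

```python
# Sketch: toy k-mer counting. Not the uclust or RDP implementation.
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def shared_kmers(seq_a, seq_b, k):
    """Number of distinct k-mers that occur in both sequences."""
    return len(set(count_kmers(seq_a, k)) & set(count_kmers(seq_b, k)))


counts = count_kmers("ACTCGTAA", 3)
print(sorted(counts.items()))

# Sequences that share many k-mers are likely to be similar, so a search
# can use shared k-mer counts to pick candidates before aligning anything.
print(shared_kmers("ACTCGTAA", "TTCGTAAC", 3))
```

Counting the reference k-mers once, up front, is the part that corresponds to 'indexing' in the explanation above.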

Before we dive into this, I want to check with a qiime dev about the best way to do this.

Keep in touch!

Colin

Maria

Apr 27, 2017, 1:12:26 PM

to Qiime 1 Forum, antg...@gmail.com

Hi Colin,

So, this is what my .fasta file looks like:

>AJ000684|S000004347 Root;Bacteria;"Actinobacteria";Actinobacteria;Actinobacteridae;Actinomycetales;Corynebacterineae;Mycobacteriaceae;Mycobacterium

gaacgctggcggcgtgcttaacacatgcaagtcgaacggaaaggtctcttcggagatactcgagtggcgaacgggtgagtaacacgtgggtaatctgccctgcacatcgggataagcctgggaaactgggtctaataccgaataggacctcgaggcgcatgccttgtggtggaaagcttttgcggtgtgggatgggcccgcggcctatcagcttgttggtggggtgacggcctaccaaggcgacgacgggtagccggcctgagagggtgtccggccacactgggactgagatacggcccagactcctacgggaggcagcagtggggaatattgcacaatgggcgcaagcctgatgcagcgacgccgcgtgggggatgacggncttcgggttgtaaacctctttcagcagggacgaagcgcaagtgacggtacctgcagaagaagcaccggccaactacgtgccagcagccgcggtaatacgtagggtgcgagcgttgtccggaattactgggcgtaaagagctcgtaggtggtttgtcgcgttgttcgtgaaaaccgggggcttaaccctcggcgtgcgggcgatacgggcagactggagtactgcaggggagactggaattcctggtgtagcggtggaatgcgcagatatcaggaggaacaccggtggcgaaggcgggtctctgggcagtaactgacgctgaggagcgaaagcgtggggagcgaacaggattagataccctggtagtccacgccgtaaacggtgggtactaggtgtgggtttccttccttgggatccgtgccgtagctaacgcattaagtaccccgcctggggagtacggccgcaaggctaaaactcaaaggaattgacgggggcccgcacaagcggcggagcatgtggattaattcgatgcaacgcgaagaaccttacctgggtttgacatgcacaggacgccggcagagatgtcggttcccttgtggcctgtgtgcaggtggtgcatggctgtcgtcagctcgtgtcgtgagatgttgggttaagtcccgcaacgagcgcaacccttgtctcatgttgccagcgggtaatgccggggactcgtgagagactgccggggtcaactcggaggaaggtggggatgacgtcaagtcatcatgccccttatgtccagggcttcacacatgctacaatggccggtacaaagggctgcgatgccgcaaggttaagcgaatccttttaaagccggtctcagttcggatcggggtctgcaactcgaccccgtgaagtcggagtcgctagtaatcgcagatcagcaacgctgcggtgaatacgttcccgggccttgtacacaccgcccgtcacgtcatgaaagtcggtaacacccgaagccagtggcctaacctttgggagggagctgtcgaaggtgggatcggcgattgggacgaagtcgt

>EF599163|S000871589 Root;Bacteria;"Proteobacteria";Gammaproteobacteria;"Vibrionales";Vibrionaceae;Vibrio

gtttgatcctggctcagattgaacgctggcggcaggcctaacacatgcaagtcgagcggaaacgacactaacaatccttcgggtgcgttaatgggcgtcgagcggcggacgggtgagtaatgcctaggaaattgccttgatgtgggggataaccattggaaacgatggctaataccgcataatgcctacgggccaaagagggggaccttcgggcctctcgcgtcaagatatgcctaggtgggattagctagttggtgaggtaatggctcaccaaggcgacgatccctagctggtctgagaggatgatcagccacactggaactgagacacggtccagactcctacgggaggcagcagtggggaatattgcacaatgggcgaaagcctgatgcagccatgccgcgtgtatgaagaaggccttcgggttgtaaagtactttcagttgtgaggaagggtgtgtagttaatagctgcamatcttgacgttagcaacagaagaagcaccggctaactccgtgccagcagccgcggtaatacggagggtgcgagcgttaatcggaattactgggcgtaaagcgcatgcaggtggttcattaagtcagatgtgaaagcccggggctcaacctcggaactgcatttgaaactggtgaactagagtgctgtagaggggggtagaatttcaggtgtagcggtgaaatgcgtagagatctgaaggaataccagtggcgaaggcggccccctggacagacactgacactcagatgcgaaagcgtggggagcaaacaggattagataccctggtagtccacgccgtaaacgatgtctacttggaggttgtggccttgagccgtggctttcggagctaacgcgttaagtagaccgcctggggagtacggtcgcaagattaaaactcaaatgaattgacgggggcccgcacaagcggtggagcatgtggtttaattcgatgcaacgcgaagaaccttacctactcttgacatccagagaagccagcggagacgcaggtgtgccttcgggagctctgagacaggtgctgcatggctgtcgtcagctcgtgttgtgaaatgttgggttaagtcccgcaacgagcgcaacccttatccttgtttgccagcgagtaatgtcgggaactccagggagactgccggtgataaaccggaggaaggtggggacgacgtcaagtcatcatggcccttacgagtagggctacacacgtgctacaatggcgcatacagagggcagcaagctagcgatagtgagcgaatcccaaaaagtgcgtcgtagtccggattggagtctgcaactcgactccatgaagtcggaatcgctagtaatcgtgaatcagaatgtcacg

Just so you know.
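For anyone building a taxonomy mapping from headers like these, the sketch below pulls the sequence ID and the lineage out of one header, dropping the quotation marks and the leading Root entry. The helper name parse_rdp_header is hypothetical. Note that the two headers above yield different numbers of ranks, so the sketch deliberately does not decide which ranks to keep to reach a fixed depth.

```python
# Sketch: split an RDP-style FASTA header into (ID, list of ranks).
# parse_rdp_header is a hypothetical helper, not part of QIIME.

def parse_rdp_header(header):
    """Parse '>ID Root;rank;"rank";...' and return (ID, ranks without Root or quotes)."""
    seq_id, lineage = header.lstrip(">").split(None, 1)
    ranks = [rank.strip().strip('"') for rank in lineage.split(";")]
    ranks = [rank for rank in ranks if rank and rank != "Root"]
    return seq_id, ranks


header_a = ('>AJ000684|S000004347 Root;Bacteria;"Actinobacteria";Actinobacteria;'
            'Actinobacteridae;Actinomycetales;Corynebacterineae;Mycobacteriaceae;'
            'Mycobacterium')
header_b = ('>EF599163|S000871589 Root;Bacteria;"Proteobacteria";'
            'Gammaproteobacteria;"Vibrionales";Vibrionaceae;Vibrio')

for header in (header_a, header_b):
    seq_id, ranks = parse_rdp_header(header)
    print(seq_id, len(ranks), ";".join(ranks))
```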

And I will wait for your answer. Thank you for looking that up for me!

Maria

Jai Ram Rideout

May 2, 2017, 4:10:50 PM

to Qiime 1 Forum

Hi Maria,

In order to help you I'll need a few more details:

1. The exact command you're running to perform taxonomy assignment with the RDP database and RDP classifier

2. The entire error message you're receiving when running that command

3. Your QIIME parameters file as an attachment

4. The output from running print_qiime_config.py -tf

5. Where did you obtain your reference sequence FASTA file and taxonomy mapping file? I'd like to download the original files so I can look through them.

Thanks,

Jai

Maria

May 11, 2017, 9:46:28 AM

to Qiime 1 Forum

Hi Jai,

It turns out that using the RDP training sequences as the database with the UCLUST method gave me a satisfactory result when I used a --min_consensus_fraction of 0.7.
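To show what a consensus fraction does, here is a simplified sketch of consensus taxonomy assignment over several database hits. It is an illustration of the general idea only, not QIIME's implementation; the function name, the default value of 0.51 and the three example hits are assumptions made for this example.

```python
# Sketch: consensus lineage over several hits. Simplified illustration,
# not QIIME's code; the default fraction and the hits are assumptions.
from collections import Counter


def consensus_lineage(hit_lineages, min_consensus_fraction=0.51):
    """Keep each rank only while enough hits agree on the lineage so far."""
    total = len(hit_lineages)
    consensus = []
    if total == 0:
        return consensus
    max_depth = max(len(lineage) for lineage in hit_lineages)
    for level in range(max_depth):
        # Compare whole prefixes so identical names under different parents stay separate.
        prefixes = Counter(
            tuple(lineage[:level + 1])
            for lineage in hit_lineages
            if len(lineage) > level
        )
        prefix, count = prefixes.most_common(1)[0]
        if float(count) / total < min_consensus_fraction:
            break
        consensus = list(prefix)
    return consensus


lactobacillus = ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
                 "Lactobacillaceae", "Lactobacillus"]
weissella = ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales",
             "Leuconostocaceae", "Weissella"]
hits = [lactobacillus, lactobacillus, weissella]

print(";".join(consensus_lineage(hits, 0.51)))
print(";".join(consensus_lineage(hits, 0.7)))
```

In this toy example with three hits, two agreeing hits give a fraction of about 0.67, so a threshold of 0.7 stops the assignment at the order level, while 0.51 carries it through to genus.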