assign_taxonomy.py

591 views
Skip to first unread message

kristin oosthuizen

unread,
Jul 27, 2017, 9:24:58 AM7/27/17
to Qiime 1 Forum
Hi,

Which method is better for ITS data: RDP or BLAST? 

Thanks,
Kristin

Colin Brislawn

unread,
Jul 27, 2017, 10:57:26 AM7/27/17
to Qiime 1 Forum
Hello Kristin,

Great question! I would recommend RDP or uclust instead of BLAST, but that's because of how the results are used by qiime (not due to some problem with the BLAST program itself).

In qiime, the blast taxonomy assignment method is really simple: for each OTU centroid, a blast search is performed and taxonomy is assigned based on the top 1 blast hit. You can probably already see the problem with this; if the top hit is a 94.9% match and the second hit is a 94.5% match, you could reasonably assume that the OTU could be assigned to either taxonomy. But qiime ignores the second blast hit! You end up getting overly specific taxonomy.

In qiime, the uclust taxonomy assigner is a lot like blast, but it looks at the top 3 hits, then assigned taxonomy based on the taxonomy levels all three hits share. This is a simple Last Common Ancestor (LCA) method, and you get much more reasonable results. (Note that blast and uclust both do a search and return results that are similar (or exactly the same!), but qiime makes use of the uclust results in a better way by inferring LCA, not just taking the top hit.)

The RDP method does a k-mer search (which uclust does too), then infers taxonomy based on matching k-mer profiles. 

I hope that helps,
Colin


PS Taking the top 1 hit is even worse in some cases. If blast finds multiple 100% hits, the top hit will depend on the order of the database, essentially giving you a random taxonomy that is overly specific. In qiime 1.9.1, blast is not a great way to assign taxonomy.

kristin oosthuizen

unread,
Jul 28, 2017, 9:03:40 AM7/28/17
to qiime...@googlegroups.com
Hi Colin,

Thanks for the information and advice. I was just wondering when working with the internal transcribed spacer region for fungal identification, what database would I need to specify. I have heard about a fungalits_warcup_trainingdata_V2 for RDP. So then I should probably just specify this with:
-t, --id_to_taxonomy_fp
Path to tab-delimited file mapping sequences to assigned taxonomy. Each assigned taxonomy is provided as a semicolon-separated list. For assignment with rdp, each assigned taxonomy must be exactly 6 levels deep. [default: /Users/caporaso/.virtualenvs/qiime/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt]

-r, --reference_seqs_fp
Path to reference sequences. For assignment with blast, these are used to generate a blast database. For assignment with rdp, they are used as training sequences for the classifier. [default: /Users/caporaso/.virtualenvs/qiime/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta]

Is this correct? Is there anything else I should be specifying, or do to the training data. Does qiime retrain the dataset? Or is this a step that I should do before running assign_taxonomy.py?

The fungalits_warcup_trainingdata_V2 can be found at: https://sourceforge.net/projects/rdp-classifier/files/RDP_Classifier_TrainingData/

I downloaded the data and now I have two files: Warcup_v2.fasta and Warcup_v2_RDP_Taxonomy. 

I guess I am confused by the term "trainingdata'. So now I don't know if this will have to be 'retrained' for RDP in qiime, or if it is already fine and I just specify -t Warcup_v2_RDP_Taxonomy -r Warcup_v2.fasta

I am not sure I understand what this entails.

Thanks!
Kristin


Message has been deleted

Colin Brislawn

unread,
Jul 28, 2017, 10:20:49 AM7/28/17
to qiime...@googlegroups.com
Another great question, Kristin,

Yes, you will need to pass the -t and -r if you are not using the default database. However, qiime does not retrain the RDP classifier automatically. You will have to do this step manually, as described over here: 

The 'uclust' taxonomy classifier does not require retraining, so if you would like to try this method, you can simply pass -t and -r to change the database.

Colin

kristin oosthuizen

unread,
Jul 30, 2017, 12:34:45 PM7/30/17
to Qiime 1 Forum
Hi Colin,

I had a look and now I am still confused. I took a snapshot of the link:


I guess I still don't understand, because here the retraining looks exactly the same as taxonomy assignment, it specifies -i otus.fasta -t trainingdata.txt -r trainingdata.fasta
So I don't understand why this is the retraining step, when it seems the same as normal assign_taxonomy.py.

Obviously, I don't use the greengenes data, I am using the fungal training data I downloaded, but I know that it needs retraining, but I just don't understand how to do the retraining before assigning taxonomy, when it seems this link is explaining that it is doing both at once.

Thank you for all your help, and sorry for all my questions.

Kristin
Auto Generated Inline Image 1

Colin Brislawn

unread,
Jul 30, 2017, 5:20:16 PM7/30/17
to qiime...@googlegroups.com
Hello Kristin,

Yeah, I see what you mean. I've never tried to use a different database with RDP, so this caught me by surprise. I guess I though RDP needed an extra step, but that page makes it looks like the RDP method works just like uclust does ('retraining rdp' or 'indexing uclust' on-the-fly).

What happens when you pass -t -r -m rdp? Does the script finish? Do the outputs make sense?

Maybe this is easier then I thought!

Colin

kristin oosthuizen

unread,
Jul 31, 2017, 11:00:48 AM7/31/17
to Qiime 1 Forum
Hi Colin,

I must admit, I have not tried it yet, guess I just wanted to check on this forum if another step is needed before running the script. Will give it a go, and let you know.

Thanks again for your assistance!
Kristin

kristin oosthuizen

unread,
Aug 1, 2017, 10:45:28 AM8/1/17
to qiime...@googlegroups.com
Hi Colin,

Tried it:

module load python/2.7.11
module load app/RDP/2.2

assign_taxonomy.py -i otus.fasta -r Warcup_v2.fasta -t Warcup_v2_RDP_Taxonomy.txt -c 0.85 -o tax_assingments -m rdp 

Got the following error:

Traceback (most recent call last):
  File "/apps/python/2.7.11-gcc-4.8.5/bin/assign_taxonomy.py", line 417, in <module>
    main()
  File "/apps/python/2.7.11-gcc-4.8.5/bin/assign_taxonomy.py", line 394, in main
    log_path=log_path)
  File "/apps/python/2.7.11-gcc-4.8.5/lib/python2.7/site-packages/qiime/assign_taxonomy.py", line 855, in __call__
    taxonomy_file, training_seqs_file = self._generate_training_files()
  File "/apps/python/2.7.11-gcc-4.8.5/lib/python2.7/site-packages/qiime/assign_taxonomy.py", line 895, in _generate_training_files
    seq_id, lineage_str = map(strip, line.split('\t'))
ValueError: need more than 1 value to unpack

So I am not sure if this error is related to the retraining (that it cannot do it), or related to the actual process of assigning taxonomy.

I attached the Warcup training data .fasta and .txt files for -r and -t, respectively.

Have you seen this error before?

Thanks!
Kristin
Warcup_v2_RDP_Taxonomy.txt
Warcup_v2.fasta

Colin Brislawn

unread,
Aug 1, 2017, 12:25:46 PM8/1/17
to qiime...@googlegroups.com
Good morning,

Thanks for trying out this script and posting the full workflow. I've never seen this error from this script, but I bet it has something to do with taxonomy.

So warcup fasta file looks like this:
>JX090119 Root;Fungi;Basidiomycota;Agaricomycotina;Agaricomycetes;Agaricomycetes_Incertae sedis;Polyporales;Polyporaceae;Ryvardenia;Ryvardenia campyla
ACTT.....ACC
>HQ446352 Root;Fungi;Ascomycota;Pezizomycotina;Sordariomycetes;Sordariomycetidae;Sordariomycetidae_Incertae sedis;Sordariomycetidae_Incertae sedis;Savoryella;Savoryella verrucos
AAA....TG

>FN435791 Root;Fungi;Ascomycota;Pezizomycotina;Sordariomycetes;Xylariomycetidae;Xylariales;Xylariaceae;Xylaria;Xylaria hypoxylon

>FN435791 Root;Fungi;Ascomycota;Pezizomycotina;Sordariomycetes;Xylariomycetidae;Xylariales;Xylariaceae;Xylaria;Xylaria hypoxylon 
AA....TAAAG

The good news is that all the taxonomy is here. The bad news is that this is not the format qiime needs. Take a look at the greengenes database for an example of what works, or read up on the qiime format here:
http://qiime.org/documentation/file_formats.html#sequence-id-to-taxonomy-mapping-files 

You could also try using an ITS database that is already formatted for use with qiime. The UNITE database has worked pretty well for me. You can download it over here:
https://unite.ut.ee/repository.php  

Colin

Chris B

unread,
Aug 2, 2017, 1:00:29 PM8/2/17
to Qiime 1 Forum
Hi Kristin,

I'm guessing a bit here but it looks to me like this error is caused by the taxonomy file you supplied at -t. In particular, this line of the error

 seq_id, lineage_str = map(strip, line.split('\t'))

suggests that it's trying to split each line of the taxonomy file at the tab, but there are no tabs in your taxonomy file, so it fails. I think the taxonomy file you have here also isn't formatted appropriately for QIIME to use (each line should give the taxonomy string for a single sequence ID, whereas your file is essentially giving the structure of the whole tree) but assign_taxonomy is failing before it even gets to that point.

To use Warcup, here are two options.

First option is to extract the taxon info from the deflines of the fasta file and supply that as the taxonomy file at -t. Here is one way you could do it. Start by extracting the taxonomy information for each sequence. e.g. using bash on mac os x:

grep '^>' Warcup_v2.fasta | tr -d '>' | tr ';' '\t' >Warcup_v2_taxonomy.txt

This should give you a file where each line consists of a sequence identifier, followed by 10 taxonomy fields, all tab delimited. Since I believe the RDP classifier needs exactly 6 taxonomy fields, you will need to pare this down. So open your new Warcup_v2_taxonomy.txt file in Excel, delete the Root and Fungi columns (since they don't give you useful information), and then delete two further taxonomy columns -- whichever you think you can afford to lose. Save this file as Warcup_v2_taxonomy2.txt. So now you have a file where each line consists of a sequence identifier, followed by 6 taxonomy fields, all tab delimited. But QIIME needs the taxonomy fields to be separated by semicolons, so replace those tabs with semicolons. You could do that in bash on os x:

tr '\t' ';' <Warcup_v2_taxonomy2.txt | sed -E "s/;/$(printf '\t')/1" >Warcup_v2_taxonomy3.txt

Now the Warcup_v2_taxonomy3.txt file should be formatted appropriately to supply to assign_taxonomy at -t. Kind of complicated but hopefully works.

Alternatively, a second option would seem to be to supply pre-compiled training data to assign_taxonomy using -p. You can download a pre-compiled dataset for warcup from https://rdp.cme.msu.edu/download/rdpclassifiertraindata/data.tgz. If you decompress the download, the folder data/classifier/fungalits_warcup should contain the files you need. Keep the files in the same folder, but supply the filepath for the .properties file to assign_taxonomy at -p. Don't supply anything to -r or -t because all the info should be in the precompiled database.

Hope that one of those solutions works for you. Let us know how you get on, and feel free to ping me if you want to chat further about assigning taxonomy to fungal datasets using QIIME.

Cheers,
Chris

----------------------
Chris Baker
Postdoctoral Research Associate
Pringle and Tarnita Labs
Department of Ecology and Evolutionary Biology
107A Guyot Hall, Princeton University 
Princeton NJ 08544 USA
----------------------









Colin Brislawn

unread,
Aug 2, 2017, 1:07:30 PM8/2/17
to Qiime 1 Forum
Thanks for describing this process, Chris. Re-formatting databases is hard work and I appreciate your detailed guide.

Kristin, let us know if you get warcup up and running.

Colin

kristin oosthuizen

unread,
Aug 3, 2017, 7:20:40 AM8/3/17
to Qiime 1 Forum
Hi Chris,

Thank you for your assistance! Will try this, but I just realised another potential problem. Does qiime support RDP classifier 2.10 or above? Because I think from what I can tell from literature and ftp://197.155.77.5/sourceforge/r/rd/rdp-classifier/rdp-classifier/releaseNotes/release_notes.txt using warcup or UNITE RDP training data, and assigning taxonomy may not work with RDP version 2.2, or perhaps I don't understand it right. I know there are various versions of the classifier, and of the training data, but not sure how the two should be used together.

Not sure if I understand it right. I got this from ftp://197.155.77.5/sourceforge/r/rd/rdp-classifier/rdp-classifier/releaseNotes/release_notes.txt:

RDP Classifier 2.12 (June 2016) Release Note:

The Bacteria and Archaea hierarchy model used by RDP Classifier has been updated to training set No. 16. The new version has over 300 new genera and 2000 new sequences added. There are some rearrangements in genera Gp1, Gp3 and Gp4 of the Acidobacteria. 

Fungal ITS Warcup has been updated to Version 2. Version 2 of the Warcup training set provides more accurate and consistent classification of fungal ITS sequences. These improvements result from an effort to find and correct a number of small errors and inconsistencies that were present in the reference data used to build the V1 training set.



Thanks!
Kristin


kristin oosthuizen

unread,
Aug 3, 2017, 7:27:07 AM8/3/17
to Qiime 1 Forum
Hi Colin,

Will try and get it up and running, but I have noticed another potential problem. I am not sure what version of RDP is needed for warcup (or UNITE training data - considering using UNITE). It seems like qiime only uses RDP 2.2, but I may need 2.11 for warcup v2. Can qiime assign_taxonomy.py work with 2.11?

I know there are various version of the training data, and the classifier itself. Just don't know the compatibility between the two, what works with what. See: ftp://197.155.77.5/sourceforge/r/rd/rdp-classifier/rdp-classifier/releaseNotes/release_notes.txt

I am sorry, I am so confused. I really like working through qiime, so don't really want to do the taxonomic classification through another program, or even through RDP, and then have issues with compatibility for doing downstream qiime steps, which is why i like running tax assignments with qiime, as RDP is implemented in it. But now it seems 2.2 may not be the version I need. Not sure if I understand wrong.

Thanks for all your help,

Kristin

Chris Baker

unread,
Aug 3, 2017, 12:00:17 PM8/3/17
to Qiime 1 Forum
Hi Kristin,

I think if you're going to retrain the classifier in QIIME then 2.2 or whatever you have installed should be fine. The RDP version numbering is indeed confusing :) My understanding is that the RDP classifier comes pre-trained on a set of databases, but you can always re-train it if you have a suitable dataset, either via QIIME or by interacting directly with the classifier. Your dataset wouldn't have to be one of the ones that RDP has been released with -- you could, for example, train it on a database that you had constructed from fungal cultures that you'd ID'd and sequenced.

Perhaps just give it a try and see whether it works? I've attached the parsed taxonomy file here if you want to try with that.

Cheers,
Chris


Warcup_v2_taxonomy_out.txt

kristin oosthuizen

unread,
Aug 4, 2017, 6:09:36 AM8/4/17
to Qiime 1 Forum
Hi Chris,

Thanks so much, I actually followed your instructions to format the taxonomy file for RDP, but thanks so much for the attached file. I compared it, just to make sure mine is ok, and it looks exactly like yours, minus the root and fungi and 2 more fields. I actually tried running it now with RDP now, but got the following error:

script: 

module load python
module load app/RDP/2.2

assign_taxonomy.py -i ITS2_nonchimeras_rename_uparse.fasta -r Warcup_v2_RDP.fasta -t Warcup_v2_RDP_taxonomy.txt -c 0.8 -o ITS2_RDP_taxonomy_Warcup_v2 -m rdp --rdp_max_memory 4000

Error:

Traceback (most recent call last):
  File "/apps/python/2.7.11-gcc-4.8.5/bin/assign_taxonomy.py", line 417, in <module>
    main()
  File "/apps/python/2.7.11-gcc-4.8.5/bin/assign_taxonomy.py", line 394, in main
    log_path=log_path)
  File "/apps/python/2.7.11-gcc-4.8.5/lib/python2.7/site-packages/qiime/assign_taxonomy.py", line 860, in __call__
    max_memory=max_memory, tmp_dir=tmp_dir)
  File "/apps/python/2.7.11-gcc-4.8.5/lib/python2.7/site-packages/bfillings/rdp_classifier.py", line 515, in train_rdp_classifier_and_assign_taxonomy
    tmp_dir=tmp_dir)
  File "/apps/python/2.7.11-gcc-4.8.5/lib/python2.7/site-packages/bfillings/rdp_classifier.py", line 485, in train_rdp_classifier
    return app(training_seqs_file)
  File "/apps/python/2.7.11-gcc-4.8.5/lib/python2.7/site-packages/bfillings/rdp_classifier.py", line 327, in __call__
    remove_tmp=remove_tmp)
  File "/apps/python/2.7.11-gcc-4.8.5/lib/python2.7/site-packages/burrito/util.py", line 285, in __call__
    'StdErr:\n%s\n' % open(errfile).read())
burrito.util.ApplicationError: Unacceptable application exit status: 1
Command:
cd "/home/vitis/Kristin/ITS2_assign_taxonomy_1.0/RDP/"; java -Xmx4000M -cp "/apps/RDP/2.2/rdp_classifier-2.2.jar" edu.msu.cme.rdp.classifier.train.ClassifierTraineeMaker "/scratch/59218.launch.hpc/RdpTaxonomy_s0u4AE.txt" "/scratch/59218.launch.hpc/tmp5yJa24L7Ksh58EIs7SWZ.txt" 1 version1 cogent "/scratch/59218.launch.hpc/RdpTrainer_3U70Gi" > "/scratch/59218.launch.hpc/tmpAq4UCpYKPFsKibIh4QtF.txt" 2> "/scratch/59218.launch.hpc/tmppUNhL1b9yN0IiAjoEIc4.txt"
StdOut:

Trying it the other -p way now...

Thank you for all your help, really appreciate it.

Kristin

kristin oosthuizen

unread,
Aug 4, 2017, 6:20:27 AM8/4/17
to qiime...@googlegroups.com
Hi Chris,

Now I tried it with the data download, specifying .properties file for -p, and got strange results, so I don't even know if this data file is right. Now I am only getting k__bacteria or unidentified as my results, when I know for a fact this is fungal ITS data2 (extracted fungal ITS2 with ITSx software), there is no k__fungi result. It is weird that the .properties file is called rRNAClassifier.properties...seems like it is bacterial data

I don't know, I am also not getting anything past kingdom level.

This is obviously very strange and has me worried.

module load python
module load app/RDP/2.2

assign_taxonomy.py -i ITS2_nonchimeras_rename_uparse.fasta -p rRNAClassifier.properties -c 0.8 -o ITS2_RDP_taxonomy_Warcup_v2_precompiled -m rdp --rdp_max_memory 4000

OTU91631128872213 k__Bacteria 0.850
OTU13676221727560 k__Bacteria 0.940
OTU78588875380355 k__Bacteria 0.840
OTU97907261483915 Unclassified 1.000
OTU39168056327220 k__Bacteria 0.830
OTU91590272747188 Unclassified 1.000
OTU61607860005037 k__Bacteria 0.870
OTU16589795945220 Unclassified 1.000
OTU72792423751812 Unclassified 1.000
OTU49165773154835 Unclassified 1.000
OTU80303357145737 k__Bacteria 0.820
OTU57027789675433 Unclassified 1.000

I extracted ITSx, used BLAST in qiime against UNITE db, and then just perfomred conventional NCBI-BLAST, and never have I detected bacteria in any of the cases.

There must a something wrong with the pre-compiled training data.

Do you have any clue what might be wrong here?

Thanks so much for assisting me with all of this.
Kristin 

Chris Baker

unread,
Aug 6, 2017, 11:15:42 PM8/6/17
to Qiime 1 Forum
Hi Kristin,

Right, I just tried running these commands myself and got the same errors.

Regarding the -p method, assign_taxonomy is using the greengenes database instead of the warcup database that you're supplying at -p. Although you're not supplying anything for -t or -r, I think what is happening here is that assign_taxonomy is getting the greengenes file path from the qiime default parameters, and so when it runs those paths override the path that you're supplying at -p. I get the same thing on my qiime installation when I run it. For the script to use your filepath from -p, the values for -t and -r need to be null. I've tried to work out how to make them null using a qiime parameters file or something similar but without success. There's probably some proper way to do this but the somewhat sketchy workaround I came up with was to copy the assign_taxonomy.py script to a new file, insert a couple of lines that just delete those values within the script, and then run my modified script copy. You just need to insert a couple of extra lines at line 266:

    ...
    if assignment_method == 'rdp':
        try:
            validate_rdp_version()
        except RuntimeError as e:
            option_parser.error(e)

        opts.id_to_taxonomy_fp = None
        
        opts.reference_seqs_fp = None
        
        if opts.id_to_taxonomy_fp is not None:
    ....

But perhaps a less sketchy way to do this is to use the -t and -r arguments. After playing around with this, it seems that the main problem is the taxonomy information that follows the sequence ID in each defline of the fasta file. So create a new fasta file without the taxonomy information:

sed -E 's/Root.*//' Warcup_v2.fasta | tr -d '\t' >Warcup_v2_parse.fasta

Then extra the taxonomy information as before. Actually, despite the QIIME instructions for using RDP, it seems that you can have more than 6 levels of taxonomy, so no need to delete those columns in excel after all -- just extract the taxonomy info into a new file:

grep '^>' Warcup_v2.fasta | tr -d '>'>Warcup_v2_parse.txt

Then just run assign_taxonomy with those two files:

assign_taxonomy.py -r Warcup_v2_parse.fasta -t Warcup_v2_parse.txt -i yourdata.fasta -c 0.8 -m rdp --rdp_max_memory 8000 -o your_output_folder

These worked for me (parsing the files in terminal on my mac -- syntax might differ slightly on a different platform) and I was able to assign some correct-looking warcup taxonomy info to some fungal ITS sequences that I have here. Hopefully this works for you too! Apologies for the earlier misleading suggestions.

Cheers,
Chris

kristin oosthuizen

unread,
Aug 23, 2017, 7:42:33 AM8/23/17
to Qiime 1 Forum
Hi Chris,

Terribly sorry for only getting back to you now. I was on leave, and then started with other work, so I haven't looked at this yet, but I thank you for all your effort in helping me with this.

Will have a look,

Thanks!
Kristin
Reply all
Reply to author
Forward
0 new messages