Uclust Assigned Taxonomy

1,472 views
Skip to first unread message

D'Harris

unread,
May 28, 2014, 1:31:38 PM5/28/14
to qiime...@googlegroups.com
Hi,

I executed the following (using my own dataset) on the cluster (that has QIIME installed):

pick_de_novo_otus.py -i split_library_output/seqs.fna -o otus

which I understand does the following:
  1. Picking OTUs
  2. Picking a representative sequence set, one sequence from each OTU
  3. Aligning the representative sequence set
  4. Assigning taxonomy to the representative sequence set
  5. Filtering the alignment prior to tree building - removing positions which are all gaps, or not useful for phylogenetic inference
  6. Building a phylogenetic tree
  7. Building an OTU table

I got all the expected files except for the "uclust_assigned_taxonomy". Could it be because I do not have a database for comparison? Or am I required to conduct another step to taxonomically classify the rep sequences? I decided to work off the cluster (now using the virtual box) using my own dataset instead of the mice reads (in the QIIME tutorial), but still I could not get the taxonomy file, yet if I repeat the mice analyses I do get the taxonomy output. I intend to do a de novo OTU picking. Kindly assist.



Tony Walters

unread,
May 28, 2014, 3:20:29 PM5/28/14
to qiime...@googlegroups.com
Hello,

The taxonomy may have been assigned differently (pre QIIME 1.8.0 the default is RDP). Your OTU table likely has taxonomy already in it-you can convert it to a tab separated format and examine the taxonomy with a command like this (you may need to change the name based upon your OTU table file name):
biom convert -i table.biom -o table.from_biom_w_taxonomy.txt -b --header-key taxonomy

You should then be able to open the table. from_biom_w_taxonomy.txt file in Excel and see the taxonomy column (final column).

-Tony


--

---
You received this message because you are subscribed to the Google Groups "Qiime Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to qiime-forum...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Tony Walters

unread,
May 28, 2014, 3:21:05 PM5/28/14
to qiime...@googlegroups.com
Sorry I forgot to link the page dealing with biom conversions: http://biom-format.org/documentation/biom_conversion.html

DoH

unread,
May 29, 2014, 6:43:18 AM5/29/14
to qiime...@googlegroups.com
Hi,

Thanks. Maybe I was not clear enough, but after running the pick_de_novo_otus.py, I NEVER get the directory written the "uclust_assigned_taxonomy". The only folders/directories I get are rep_set and unclust_picked_otus.I also get a log..txt file whose detaiols are shown below. Kindly assist.

========================================================================================================================
Logging started at 17:58:06 on 27 May 2014
QIIME version: 1.8.0-dev

qiime_config values:
jobs_to_start    1
sc_queue    all.q
denoiser_min_per_core    50
temp_dir    /tmp/
torque_queue    friendlyq
blastall_fp    blastall
seconds_to_sleep    2

parameter file values:
parallel:jobs_to_start    1

Input file md5 sums:
split_library_outputTEST/SMA_137-10/seqs.fna: 8cedd09b9ace1672b231da5d5dd3cdf4

Executing commands.

# Pick OTUs command
pick_otus.py -i split_library_outputTEST/SMA_137-10/seqs.fna -o otus/SMA-137-10/uclust_picked_otus

Stdout:

Stderr:

# Pick representative set command
pick_rep_set.py -i otus/SMA-137-10/uclust_picked_otus/seqs_otus.txt -f split_library_outputTEST/SMA_137-10/seqs.fna -l otus/SMA-137-10/rep_set//seqs_rep_set.log -o otus/SMA-137-10/rep_set//seqs_rep_set.fasta

Stdout:

Stderr:

# Assign taxonomy command
assign_taxonomy.py -o otus/SMA-137-10/uclust_assigned_taxonomy -i otus/SMA-137-10/rep_set//seqs_rep_set.fasta



*** ERROR RAISED DURING STEP: Assign taxonomy
Command run was:
 assign_taxonomy.py -o otus/SMA-137-10/uclust_assigned_taxonomy -i otus/SMA-137-10/rep_set//seqs_rep_set.fasta
Command returned exit status: 2
Stdout:

Stderr
Usage: assign_taxonomy.py [options] {-i/--input_fasta_fp INPUT_FASTA_FP}

[] indicates optional input (order unimportant)
{} indicates required input (order unimportant)

Contains code for assigning taxonomy, using several techniques.

Given a set of sequences, assign_taxonomy.py attempts to assign the taxonomy of each sequence. Currently the methods implemented are assignment with BLAST, the RDP classifier, RTAX, tax2tree, mothur, and uclust. The output of this step is an observation metadata mapping file of input sequence identifiers (1st column of output file) to taxonomy (2nd column) and quality score (3rd column). There may be method-specific information in subsequent columns.

Reference data sets and id-to-taxonomy maps for 16S rRNA sequences can be found in the Greengenes reference OTU builds. To get the latest build of the Greengenes OTUs (and other marker gene OTU collections), follow the "Resources" link from http://qiime.org. After downloading and unzipping you can use the following files as -r and -t, where <otus_dir> is the name of the new directory after unzipping the reference OTUs tgz file.

-r <otus_dir>/rep_set/97_otus.fasta
-t <otus_dir>/taxonomy/97_otu_taxonomy.txt


Example usage:
Print help message and exit
 assign_taxonomy.py -h

Assign taxonomy with the uclust consensus taxonomy assigner (default): Perform database search with uclust to retrive up to uclust_max_accepts hits for each query sequence. Then assign the most specific taxonomic label that is associated with at least uclust_min_consensus_fraction of the hits.
 assign_taxonomy.py -i repr_set_seqs.fasta -r ref_seq_set.fna -t id_to_taxonomy.txt

Assignment with BLAST: Taxonomy assignments are made by searching input sequences against a blast database of pre-assigned reference sequences. If a satisfactory match is found, the reference assignment is given to the input sequence. This method does not take the hierarchical structure of the taxonomy into account, but it is very fast and flexible. If a file of reference sequences is provided, a temporary blast database is built on-the-fly. The quality scores assigned by the BLAST taxonomy assigner are e-values.

To assign the sequences to the representative sequence set, using a reference set of sequences and a taxonomy to id assignment text file, where the results are output to default directory "blast_assigned_taxonomy", you can run the following command
 assign_taxonomy.py -i repr_set_seqs.fasta -r ref_seq_set.fna -t id_to_taxonomy.txt -m blast

Optionally, the user could changed the E-value ("-e"), using the following command
 assign_taxonomy.py -i repr_set_seqs.fasta -r ref_seq_set.fna -t id_to_taxonomy.txt -e 0.01 -m blast

Assignment with the RDP Classifier: The RDP Classifier program (Wang, Garrity, Tiedje, & Cole, 2007) assigns taxonomies by matching sequence segments of length 8 to a database of previously assigned sequences. It uses a naive bayesian algorithm, which means that for each potential assignment, it attempts to calculate the probability of the observed matches, assuming that the assignment is correct and that the sequence segments are completely independent. The RDP Classifier is distributed with a pre-built database of assigned sequence, which is used by default. The quality scores provided by the RDP classifier are confidence values.

Note: If a reference set of sequences and taxonomy to id assignment file are provided, the script will use them to generate a new training dataset for the RDP Classifier on-the-fly.  Because of the RDP Classifier's implementation, all lineages in the training dataset must contain the same number of ranks.

To assign the representative sequence set, where the output directory is "rdp_assigned_taxonomy", you can run the following command
 assign_taxonomy.py -i repr_set_seqs.fasta -m rdp

Alternatively, the user could change the minimum confidence score ("-c"), using the following command
 assign_taxonomy.py -i repr_set_seqs.fasta -m rdp -c 0.85

Assignment with RTAX: Taxonomy assignments are made by searching input sequences against a fasta database of pre-assigned reference sequences. All matches are collected which match the query within 0.5% identity of the best match.  A taxonomy assignment is made to the lowest rank at which more than half of these hits agree.  Note that both unclustered read fasta files are required as inputs in addition to the representative sequence file.

To make taxonomic classifications of the representative sequences, using a reference set of sequences and a taxonomy to id assignment text file, where the results are output to default directory "rtax_assigned_taxonomy", you can run the following command
 assign_taxonomy.py -i rtax_repr_set_seqs.fasta -m rtax --read_1_seqs_fp read_1.seqs.fna --read_2_seqs_fp read_2.seqs.fna -r rtax_ref_seq_set.fna -t rtax_id_to_taxonomy.txt

Assignment with Mothur: The Mothur software provides a naive bayes classifier similar to the RDP Classifier.A set of training sequences and id-to-taxonomy assignments must be provided.  Unlike the RDP Classifier, sequences in the training set may be assigned at any level of the taxonomy.

To make taxonomic classifications of the representative sequences, where the results are output to default directory "mothur_assigned_taxonomy", you can run the following command
 assign_taxonomy.py -i mothur_repr_set_seqs.fasta -m mothur -r mothur_ref_seq_set.fna -t mothur_id_to_taxonomy.txt

assign_taxonomy.py: error: Option --id_to_taxonomy_fp is required when assigning with uclust.


Logging stopped at 18:43:37 on 27 May 2014
===========================================================================================================

Tony Walters

unread,
May 29, 2014, 10:24:44 AM5/29/14
to qiime...@googlegroups.com
That's a different story, as it's halting on an error (the initial description said that "all the expected files" are being created except the taxonomic assignments, but it looks like it's only actually completing up to step 2).

Right now, you do not have the taxonomy mapping file specified (probably the matching reference sequence fasta file isn't available either).

You want to run print_qiime_config.py on your cluster, and look for the lines:
assign_taxonomy_reference_seqs_fp:
and
assign_taxonomy_id_to_taxonomy_fp:

You will want to download the reference fasta files and taxonomy mapping files, and specify these in your qiime_config file.
Reference data sets and id-to-taxonomy maps for 16S rRNA sequences can be found in the Greengenes reference OTU builds. To get the latest build of the Greengenes OTUs (and other marker gene OTU collections), follow the "Resources" link from http://qiime.org. After downloading and unzipping you can use the following files as -r and -t, where <otus_dir> is the name of the new directory after unzipping the reference OTUs tgz file.

Details about modifying the qiime_config file are here: http://qiime.org/install/qiime_config.html
You will want to specify the absolute filepaths when adding the assign_taxonomy_reference_seqs_fp and  assign_taxonomy_id_to_taxonomy_fp values to your .qiime_config file, and the added files should show up when you run print_qiime_config.py.

-Tony

Roger Huerlimann

unread,
Jul 12, 2015, 10:44:33 PM7/12/15
to qiime...@googlegroups.com
I'm currently getting the same problem as the original poster whenever I run pick_de_novo_otus.py.
I get the rep_set and uclust_picked_otus folders, but not the taxonomy folder, and I get the same error message about --id_to_taxonomy_fp.

I've run print_qiime_config.py and got the output below. The two files you mentioned seem to be linked properly, and I checked in the respective folders and the data files are present there.
Do you have an idea what else could cause this problem?


 > print_qiime_config.py

System information
==================
         Platform:    linux2
   Python version:    2.7.6 (default, Jun 22 2015, 17:58:13)  [GCC 4.8.2]
Python executable:    /usr/bin/python

QIIME default reference information
===================================
For details on what files are used as QIIME's default references, see here:
 https://github.com/biocore/qiime-default-reference/releases/tag/0.1.1

Dependency versions
===================
          QIIME library version:    1.9.0
           QIIME script version:    1.9.0+dfsg-0biolinux5
qiime-default-reference version:    0.1.1
                  NumPy version:    1.8.2
                  SciPy version:    0.13.3
                 pandas version:    0.13.1
             matplotlib version:    1.3.1
            biom-format version:    2.1.4
                   h5py version:    2.2.1 (HDF5 version: 1.8.11)
                   qcli version:    0.1.0
                   pyqi version:    0.3.2
             scikit-bio version:    0.2.3
                 PyNAST version:    1.2.2
                Emperor version:    0.9.51
                burrito version:    0.9.0
       burrito-fillings version:    Installed.
              sortmerna version:    SortMeRNA version 2.0, 29/11/2014
              sumaclust version:    SUMACLUST Version 1.0.01
                  swarm version:    Swarm 1.2.20 [Feb  1 2015 09:42:15]
                          gdata:    Not installed.

QIIME config values
===================
For definitions of these settings and to learn how to configure QIIME, see here:
 http://qiime.org/install/qiime_config.html
 http://qiime.org/tutorials/parallel_qiime.html

                     blastmat_dir:    /usr/share/ncbi/data
                  cluster_jobs_fp:    None
      pick_otus_reference_seqs_fp:    /usr/lib/python2.7/dist-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta
                    jobs_to_start:    1
pynast_template_alignment_blastdb:    None
                qiime_scripts_dir:    /usr/lib/qiime/bin/
                      working_dir:    .
     pynast_template_alignment_fp:    /usr/share/qiime/data/core_set_aligned.fasta.imputed
                    python_exe_fp:    python
                         temp_dir:    /tmp
assign_taxonomy_reference_seqs_fp:    # /usr/share/qiime/data/gg_13_8_otus/rep_set/97_otus.fasta
                      blastall_fp:    blastall
                 seconds_to_sleep:    60
assign_taxonomy_id_to_taxonomy_fp:    # /usr/share/qiime/data/gg_13_8_otus/taxonomy/97_otu_taxonomy.txt

Tony Walters

unread,
Jul 12, 2015, 10:58:24 PM7/12/15
to qiime...@googlegroups.com
Roger, I think there is something awry with the .qiime_config file that's coming on (some?) of the biolinux installs, the "#" characters before the assign_taxonomy_ filepaths in the bottom of your print_qiime_config.py output looks like this thread: https://groups.google.com/forum/#!searchin/qiime-forum/I$20am$20running$20QIIME$20on$20a$20biolinux$20platform.$20I$20try$20to$20run$20the$20pick_closed_reference_otus.py$20script$2C$20but$20I$20always$20get$20the$20same$20error$20message/qiime-forum/OpjaZ_XE-hA/ra5Ftb9CKQAJ

Roger Huerlimann

unread,
Jul 12, 2015, 11:28:09 PM7/12/15
to qiime...@googlegroups.com
It worked! Thanks!

The file was in /etc/QIIME and was read-only, but I deleted the # and now the command works fine.

--

---
You received this message because you are subscribed to a topic in the Google Groups "Qiime Forum" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/qiime-forum/Kq_v9XOXEYk/unsubscribe.
To unsubscribe from this group and all its topics, send an email to qiime-forum...@googlegroups.com.

Richard Rodrigues

unread,
Oct 9, 2015, 4:28:44 PM10/9/15
to Qiime Forum
I got a quick question about the similarity parameter in assign_taxonomy.py

What happens when the reference files for -t 97_otu_taxonomy.txt and -r 97_otus.fasta, while the similarity is 0.9? This is as per the defaults for the assign taxnonomy method, with uclust, qiime 1.8.


I am guessing instead of discarding the sequences with 91% similarity, it would assign it the taxanomy of the sequence from the reference. In that case would you call it 97 or 90% similarity.

Thanks.

Colin Brislawn

unread,
Oct 12, 2015, 2:13:05 AM10/12/15
to Qiime Forum
That's a pretty good question Richard. I want to start by mentioning that while OTU clustering and taxonomic assignment both use the uclust search algorithm, they do very different things with the results. 

During OTU clustering, a read is clustered to the OTU with the highest similarity (over 97% by default). Each read goes into the single, best OTU it matches to. This is called greedy, heuristic clustering.
During taxonomy assignment, an OTU centroid is searched against a database and the top hits (over 90%) are recorded. Each OTU is given a taxonomy based on the multiple, good OTUs it matches to. The process of assigning a tax label is called LCA for lowest common ancestor. More info on LCA here:


Let me know if that helps answer your question!

Colin

Richard Rodrigues

unread,
Oct 12, 2015, 10:05:46 AM10/12/15
to Qiime Forum
Just to make sure I understand it correctly, e.g. for OTU id 4453317, we have three taxonomy hits (>90% similarity) 3568713, 4453317, 4453316

H       72986   329     98.8    +       0       0       133I22MD306M1606I       4453317 8F_4095 3568713
H       93892   329     98.8    +       0       0       257I22MD306M1589I       4453317 8F_4095 4453317
H       93891   329     98.8    +       0       0       183I22MD306M1440I       4453317 8F_4095 4453316

So using LCA we decide that taxonomy to be k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__mitochondria

4453317 k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rickettsiales; f__mitochondria       1.00    3


So, during taxonomy assignment, an OTU centroid is searched against a database and the top (uclust_max_accepts=3) hits (over 90%) are recorded, not all hits above 90%. So this makes it ok to use a lower similarity threshold (90%) because anyways the algorithm would choose the top similarity hits. However, using this lower threshold also helps us to atleast get a (generic) taxonomy rather than calling it Unassigned.

Please correct me if I am misunderstanding something.

Thanks for your help.

Colin Brislawn

unread,
Oct 12, 2015, 12:48:45 PM10/12/15
to Qiime Forum
Hello Richard,

Ah ha! You have found the .uc results file for taxonomy assignment. I did not mention that file as it's too technical for many users, so I'm happy to learn that you are comfortable diving in.

Your description of the LCA method is completely correct. You make a great distinction that .90 is the minimum similarity needed for a hit, and many OTU centroids will have hits above .9. 

If you use the script parallel_assign_taxonomy_uclust.py, you can try changing these parameters.

If you reduce --similarity (say to 70%) your results should not change much because most OTUs have hits in the high 90s. 
If you increase --uclust_max_accep (to say 5) more 3 of 5 hits much match at a given assignment level, instead of 2 of 3. 

Let's say you wanted to use all hits above 90%, instead of just the top 3. You could pass --uclust_max_accep to 0 to return all hits over 90%, and the script could return the taxonomy that 51% of your hits have.


Let me know if you have any other questions!
Colin
 

Jenna Morgan Lang

unread,
Jul 1, 2016, 1:04:05 PM7/1/16
to Qiime 1 Forum
Hi Colin,

I have a follow-up question to this, concerning your statement that "If you increase --uclust_max_accep (to say 5) more 3 of 5 hits much match at a given assignment level, instead of 2 of 3". Does this mean that in order to get taxonomic assignment at, say, species level, using the default value of --uclust_max_accept, that you must have two representatives of speciesA in your reference database in order to ever have anything classified as speciesA?

Thanks for your help!
Jenna

Colin Brislawn

unread,
Jul 1, 2016, 6:24:14 PM7/1/16
to Qiime 1 Forum
Hello Jenna,

Does this mean that in order to get taxonomic assignment at, say, species level, using the default value of --uclust_max_accept, that you must have two representatives of speciesA in your reference database in order to ever have anything classified as speciesA?
That's a great question! I'm pretty sure the answer is no, because the taxonomy assigner does not require three hits.

Let's say you have a really unique OTU, with these hits in the database. 

speciesA 98%
speciesX 88%
speciesY 87%
speciesZ 74%

In this example, only one read it over the default of 90%.
speciesA 98%
Because 1 of 1 hits has the same taxonomy, this is chosen as the taxonomy for your OTU.

Does that help answer your question? 
Colin

Jenna Morgan Lang

unread,
Jul 1, 2016, 6:36:16 PM7/1/16
to Qiime 1 Forum
Yes, that makes sense. But, if you are interested in discriminating between species that are more than 97% similar (like, with a marker other than 16S, for example), or if you want to get sub-species resolution. Then, if you have

speciesA 100%
speciesX 98%
speciesY 97%


Nothing would ever be assigned to speciesA, right?

Thanks for your time, Colin!

Colin Brislawn

unread,
Jul 1, 2016, 7:02:27 PM7/1/16
to Qiime 1 Forum
I've extended the example a bit with actual taxonomy strings from greengenes.

If these are your top three hits:
speciesA 100%  k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhodospirillales; f__Rhodospirillaceae; g__Azospirillum; s__
speciesX 98%  k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhodospirillales; f__Rhodospirillaceae; g__Constrictibacter; s__antarcticus
speciesY 97% k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhodospirillales; f__Rhodospirillaceae; g__; s__

your OTU would get the lowest taxonomic level at which two out of three agree, which is f__Rhodospirillaceae in this example. 

This has an interesting implication: if you add more things to your database, this can reduce the taxonomy classifications of some OTUs. More hits == less certainty about if all levels agree. 

Colin

Jenna Morgan Lang

unread,
Jul 1, 2016, 7:13:39 PM7/1/16
to Qiime 1 Forum
So, if you decrease uclust_max_accep to 2 and decrease min_consensus_fraction to 0.50, would you necessarily get the g__Azospirillum assignment?

Colin Brislawn

unread,
Jul 2, 2016, 9:52:28 PM7/2/16
to Qiime 1 Forum
Hello Jenna,

That's correct! 


I've worked through each step, just for provide an example.
decrease uclust_max_accep to 2
OK. So you could get this:
speciesA 100%  k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhodospirillales; f__Rhodospirillaceae; g__Azospirillum; s__
speciesX 98%  k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhodospirillales; f__Rhodospirillaceae; g__Constrictibacter; s__antarcticus

decrease min_consensus_fraction to 0.50
So you only need 50% (1 of 2) of the reads to agree to a given level. 
This would be the selected taxonomy:
speciesA 100%  k__Bacteria; p__Proteobacteria; c__Alphaproteobacteria; o__Rhodospirillales; f__Rhodospirillaceae; g__Azospirillum; s__

Have a good day! Happy 4th.
Colin

Reply all
Reply to author
Forward
0 new messages