Understanding assign_taxonomy.py output

Eddi Lin

Oct 20, 2016, 11:42:26 AM

to Qiime 1 Forum

Hi Qiime,

I got the output from assign_taxonomy.py and am trying to make sense of what the 3rd and 4th columns mean. I read http://qiime.org/scripts/assign_taxonomy.html and learned that

The output of this step is an observation metadata mapping file of input sequence identifiers (1st column of output file) to taxonomy (2nd column) and quality score (3rd column). There may be method-specific information in subsequent columns.

I used the default classification method, UCLUST, but I could not find any explanation for the last two numbers on each line of the results:

denovo36730 k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Ruminococcaceae; g__Oscillospira; s__ 1.00 3
denovo36737 Unassigned 1.00 1
denovo36736 Unassigned 1.00 1
denovo36735 Unassigned 1.00 1
denovo36734 Unassigned 1.00 1
denovo36739 Unassigned 1.00 1
denovo36738 Unassigned 1.00 1
denovo35282 Unassigned 1.00 1

I can only guess that the 3rd column is either the

--min_consensus_fraction
Minimum fraction of database hits that must have a specific taxonomic assignment to assign that taxonomy to a query, only used for sortmerna and uclust methods [default: 0.51]

or

--similarity
Minimum percent similarity (expressed as a fraction between 0 and 1) to consider a database match a hit, only used for sortmerna and uclust methods [default: 0.9]

Since all the numbers are above 0.51 (some of mine are 0.67), I am putting my money on the first one, --min_consensus_fraction. Was my guess correct? And what does the second number mean? I even looked at the UCLUST website, but it is very hard to find any information there about the parameters used in QIIME.

If you have any idea what those numbers are, please let me know. Thanks a lot!

Huaiying

Daniel McDonald

Nov 4, 2016, 12:32:09 AM

to Qiime 1 Forum

Hi Huaiying,

I apologize for the long delay. The third column is the quality score. I'm less sure what the 4th column is, but I suspect it's the number of matches.

Best,

Daniel

Colin Brislawn

Nov 4, 2016, 10:57:58 AM

to Qiime 1 Forum

Hello Huaiying,

The fourth and final column is the number of hits selected during the uclust search. The default --uclust_max_accepts is 3, but you can increase it.

The third column is the fraction of hits that match the taxonomy shown in the second column (so 1.00 means 100%). Because --uclust_max_accepts is usually 3 and --min_consensus_fraction is 0.51, this output is common:

denovo36730 k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Ruminococcaceae; g__Oscillospira; s__ 1.00 3

So this OTU had three hits, and all three (100%) matched the taxonomy listed.

This is also common:

denovo1521 k__Bacteria; p__Firmicutes; c__Clostridia; o__Clostridiales; f__Ruminococcaceae; g__; s__ 0.66 3

This OTU also had three hits, but only 2 of the 3 (about two-thirds) matched this taxonomy down to the family level. The three hits may have disagreed at the g__ level, so that level is returned as undefined.
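If you want to pull these columns apart in a script, here is a minimal Python sketch. It assumes the output file is tab-separated with exactly four columns, as in the uclust examples above; the names Assignment, parse_assignment_line and agreeing_hits are made up for this illustration and are not part of QIIME.

```python
from collections import namedtuple

# Hypothetical record type for one line of the uclust assignment output.
Assignment = namedtuple(
    "Assignment", "otu_id taxonomy consensus_fraction num_hits"
)


def parse_assignment_line(line):
    """Split one tab-separated output line into its four columns.

    Assumes: column 1 = sequence ID, column 2 = taxonomy string,
    column 3 = fraction of hits agreeing with that taxonomy,
    column 4 = number of hits.
    """
    otu_id, taxonomy, fraction, hits = line.rstrip("\n").split("\t")
    return Assignment(otu_id, taxonomy.split("; "), float(fraction), int(hits))


def agreeing_hits(record):
    """Estimate how many hits agreed with the reported taxonomy."""
    # Column 3 is rounded to two decimals, so round the product back
    # to the nearest whole number of hits.
    return round(record.consensus_fraction * record.num_hits)


rec = parse_assignment_line("denovo36737\tUnassigned\t1.00\t1")
print(rec.otu_id, rec.taxonomy, rec.consensus_fraction, rec.num_hits)
```

Multiplying column 3 by column 4 is a quick way to turn a value like 0.66 with 3 hits back into "2 of 3".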

I hope this helps!

Colin

PS Qiime uses a specially licensed version of uclust, v1.2.21q. You can get the full manual here: http://www.drive5.com/uclust/downloads1_2_21q.html

Eddi Lin

Nov 4, 2016, 11:12:47 AM

to qiime...@googlegroups.com

Hi Daniel and Colin,

Thanks a lot for your explanation. I think I've got it now. I wish you could put this on the help page of assign_taxonomy.py.

Happy Friday!

Huaiying

Colin Brislawn

Nov 4, 2016, 12:00:56 PM

to Qiime 1 Forum

Hey Huaiying,

Yeah, perhaps we should describe more of the technical details on the website. I think there is a balance between an easy simplification and a complicated explanation. The qiime website tends to be simpler and leaves out the technical details you asked about. The technical stuff is mostly in the code itself. Check out these lines from the source code of the script:

https://github.com/biocore/qiime/blob/master/qiime/assign_taxonomy.py#L1244-L1259
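For anyone who does not want to dig through the source, here is a simplified Python sketch of the per-level consensus idea discussed in this thread. It is an illustration written for this explanation, not a copy of the code at that link: the function name consensus_taxonomy and its arguments are hypothetical, and reporting 1.0 for an unassigned result is an assumption chosen to match the "Unassigned 1.00" lines in the original question.

```python
from collections import Counter


def consensus_taxonomy(hits, min_consensus_fraction=0.51,
                       unassigned="Unassigned"):
    """Return (taxonomy levels, consensus fraction, number of hits).

    hits: non-empty list of lineages, each a list of level names,
          e.g. ["k__Bacteria", "p__Firmicutes", "c__Clostridia"].
    """
    if not hits:
        raise ValueError("need at least one hit")
    num_hits = len(hits)
    # Only go as deep as the shortest lineage among the hits.
    depth = min(len(h) for h in hits)

    consensus = []
    fraction = 1.0
    for level in range(depth):
        # Compare whole prefixes, so identical names under different
        # parents are counted as different taxa.
        counts = Counter(tuple(h[:level + 1]) for h in hits)
        prefix, count = counts.most_common(1)[0]
        level_fraction = count / num_hits
        if level_fraction < min_consensus_fraction:
            break  # not enough agreement: stop at the previous level
        consensus.append(prefix[-1])
        fraction = level_fraction  # fraction at the deepest accepted level

    if not consensus:
        # Assumption: report 1.0 when nothing could be assigned.
        return [unassigned], 1.0, num_hits
    return consensus, fraction, num_hits


hits = [
    ["k__Bacteria", "p__Firmicutes", "c__Clostridia"],
    ["k__Bacteria", "p__Firmicutes", "c__Clostridia"],
    ["k__Bacteria", "p__Firmicutes", "c__Bacilli"],
]
taxonomy, fraction, n = consensus_taxonomy(hits)
print("; ".join(taxonomy), "%.2f" % fraction, n)
```

In this sketch the reported fraction is the one at the deepest level that still met --min_consensus_fraction, and the last number is simply how many hits were considered.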