OTU clustering algorithms: are they all based on similarity to centroids?

283 views
Skip to first unread message

Ahmed Abdelfattah

unread,
Nov 23, 2015, 9:26:09 AM11/23/15
to Qiime 1 Forum

Dear All,

I have recently started working with barcoding mainly on fungi using ITS. Qiime as well as other softwares rely essentially on the clustering sequences into OTUs. I have always thought that meaning of OTUs clustered at 97% is that sequences that are at least 97% similar to each other are grouped in one OTU. However, when I read more about  it I found that almost all available follow generally similar algorithms that makes centroids (seeds), then compares sequences to the centroids, if they are at least 97% similar they are grouped together. Taking UCLUS as an example http://drive5.com/usearch/manual/uclust_algo.html  the centroid is the center of circle were the radius (T) is the similarity threshold e.g. 97%, so identical sequences can be plotted on the centroid and any point (sequence) at the circumference is 3% dissimilar to the centroid, is that right? if that's correct, it means it's possible to have 2 sequences that are 3% dissimilar from the centroid but 6% dissimilar (94% similar) from each other. If all mentioned here is true, 
how can I justify classifying organisms in an OTU that contain sequences that are less than 97% similar?
Is there other algorithms that cluster sequences at 97% similarity to each other not to centroids? 

Any thought will be highly appreciated

Thanks 

Ahmed 
uclust_algo2.jpg

Colin Brislawn

unread,
Nov 23, 2015, 3:20:20 PM11/23/15
to Qiime 1 Forum
Hello Ahmed,

if that's correct, it means it's possible to have 2 sequences that are 3% dissimilar from the centroid but 6% dissimilar (94% similar) from each other.
That is correct. Way to read the methods! (I wish more people did that.) 

If all mentioned here is true, how can I justify classifying organisms in an OTU that contain sequences that are less than 97% similar?
People familiar with the clustering algorithms already know that greedy heuristic algorithms, like uclust, produce OTUs with 6% different reads in them. I'm not sure you have to justify it; it's already the convention in this field. 
You could use uclust to cluster at 99%, and reads in the resulting clusters would be within 2% of each other. Would that be better?

Is there other algorithms that cluster sequences at 97% similarity to each other not to centroids? 
Yes. It sounds like you may be looking for a 'hierarchical' clustering algorithm, instead of uclust which is a 'greedy heuristic' clustering algorithm.
This paper does an excellent job comparing different methods of clustering: http://aem.asm.org/content/77/10/3219.full
Here, Robert Edgar, the maker of uclust and that figure you posted, argues that "I don't know of a biological application where agglomerative clustering gives better results than the greedy clustering approach" http://drive5.com/usearch/manual/agg.html 


Let me know what you think.
The methods of OTU clustering are some of the most interesting and contentious topics in this field. 
Colin

Ahmed Abdelfattah

unread,
Nov 24, 2015, 12:16:35 PM11/24/15
to Qiime 1 Forum
Hi Colin,

Thank you very much for the explanation. The paper and the link you suggested were very useful. You were right, hierarchal clustering is what I was looking for, especially complete linkage, although Lance-Williams algorithm is very interesting. However, taking into consideration what Edgar says about agglomerative clustering in general, makes me wish I could test it in my case, which I don't know how :) However, going back to Qiime I couldn't find any method that implements complete linkage hierarchal clustering, except of mothur which unfortunately requires an alignment. As you may already know, unlike 16S,  fungal ITS sequences are difficult or shouldn't be aligned. 
I also tried using hierarchal clustering in usearch8, but it outputs only a cluster.txt (contains two columns 1- is a number and 2- is something like rep. seq. ids) and a tree. I wish it gave something like OTU table (which instances in which clusters). if I can't find my way I will just go with a higher similarity threshold as you said. 

Please let me know if there is any script in qiime that implement hierarchal clustering or if you have any comment on what I just said.

Thank you very much 

Ahmed 

Colin Brislawn

unread,
Nov 24, 2015, 2:06:54 PM11/24/15
to Qiime 1 Forum
Hello Ahmed,

Glad I could help! While qiime implements several greedy heuristic clustering algorithms, I don't think it has any hierarchal ones. You could email Robert Edgar directly and ask him about it in usearch8: robert at drive5.com. He's great over email. He says he is interested in working with ITS reads! http://www.drive5.com/usearch/manual/upp_its.html

For a test case, do you have any truth data or positive controls with your biological samples? You could try different methods on your positive controls and see what method is closest to the expected result. You could also use positive controls from another study. 

Colin

Ahmed Abdelfattah

unread,
Nov 24, 2015, 2:26:56 PM11/24/15
to Qiime 1 Forum
Hi Colin,

Thanks again. I will contact Edgar and I will post his answer here. Just to conclude this discussion and of course correct me if I'm wrong; 
There is no correct or incorrect answer in OTU clustering, it always depends on what is the question we trying to answer. Even though, there is a trend to use heuristic algorithms particularly in ecological studies. 
The similarity threshold (T) doesn't necessarily mean that all members of the same cluster are T similar, it always depends on the algorithm selected. although it is true when using hierarchal complete linkage clustering. 
Partick D. http://aem.asm.org/content/77/10/3219.full found that average neighbor clustering (unweighted arithmetic average clustering) outperformed other algorithms.


Colin, I appreciate the time you have given me, you helped me understand a lot of things. I'm sorry if my questions were sometimes too obvious :) I hope this thread will help others as well

All the best,

Ahmed

Colin Brislawn

unread,
Nov 24, 2015, 2:41:20 PM11/24/15
to Qiime 1 Forum
Hello Ahmed,

There is no correct or incorrect answer in OTU clustering, it always depends on what is the question we trying to answer. Even though, there is a trend to use heuristic algorithms particularly in ecological studies. 
The similarity threshold (T) doesn't necessarily mean that all members of the same cluster are T similar, it always depends on the algorithm selected. although it is true when using hierarchal complete linkage clustering. 
Bravo! Very well said. 

OTU clustering remains and interesting and contentious field. My new favorite algorithm is Swarm. It's hierarchal single linkage clustering, with a twist! Also runs in linear time (!!) and is open source. https://peerj.com/articles/593/ https://github.com/torognes/swarm 


Have a great day Ahmed,
Happy Qiimeing!
Colin

Reply all
Reply to author
Forward
0 new messages