clarification on UCLUST max_rejects

Constantino Schillebeeckx

Apr 11, 2016, 11:17:08 AM
to Qiime 1 Forum
I've been working on optimizing our OTU picking strategy by looking into the "optimal" settings of UCLUST. 

The option max_accepts is clear to me; however, I'm struggling to fully understand the meaning and implications of the option max_rejects. The documentation says:
"Keep searching until n rejects have occurred, then report a failure to find a hit. Default 8. Zero means infinity, i.e. keep searching until a hit is found or [all] database sequences have been tested."

When using a reference database like GreenGenes, am I correct in interpreting this setting as follows: given a subject sequence, if the alignment to the first 8 reads in the database does not meet the identity threshold, then the subject read is rejected?

I find it hard to believe that this interpretation is correct given the size of GreenGenes; furthermore, it would mean that every subject read must match one of the first 8 reads or otherwise be reported as a failure to find a hit.

I've searched the forum and the web, but nothing clear seems to come up.  Can someone help?

Colin Brislawn

Apr 11, 2016, 1:11:51 PM
to Qiime 1 Forum
Hello Constantino, 

I'm glad you are diving into the uclust parameters. This greedy heuristic clustering strategy is at the core of QIIME, so understanding it is important.

"...the alignment to the first 8 reads in the database does not meet the identity threshold, then the subject read is rejected?"
So close, but not quite right.

The uclust algorithm works in two stages. 
1. A k-mer filter to quickly list all similar reads in the database.
2. Glocal alignment to select the best hit among the similar reads. 

Setting max_rejects to 8 refers to the number of alignments (step 2) to try from the list of similar reads (from step 1) before giving up. This setting works well because if a good alignment is not generated from one of the top k-mer hits (from step 1), it probably means that there are no good alignments in the database at all.

So it's the top 8 reads from the k-mer hit list, not from the database as a whole. 
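
To make the two stages concrete, here is a minimal Python sketch of the idea. It is not the actual uclust implementation: the function names, the word length of 8, and the position-by-position identity function (a crude stand-in for a real alignment) are all assumptions made for illustration.

```python
def kmers(seq, k=8):
    """Return the set of all k-mers (words of length k) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def identity(a, b):
    """Crude stand-in for an alignment: fraction of matching positions.

    A real tool would compute an alignment here; this is an assumption
    made only to keep the sketch short.
    """
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def search(query, database, threshold=0.97, max_accepts=1, max_rejects=8, k=8):
    """Hypothetical two-stage search: k-mer ranking, then alignment."""
    query_kmers = kmers(query, k)

    # Stage 1: rank database sequences by the number of shared k-mers.
    ranked = sorted(
        database.items(),
        key=lambda item: len(query_kmers & kmers(item[1], k)),
        reverse=True,
    )

    # Stage 2: try alignments in ranked order, counting accepts and rejects.
    accepts, rejects = [], 0
    for name, target in ranked:
        if identity(query, target) >= threshold:
            accepts.append(name)
            if len(accepts) >= max_accepts:
                break
        else:
            rejects += 1
            # max_rejects == 0 means "never give up early".
            if max_rejects and rejects >= max_rejects:
                break
    return accepts


# Toy database: ten unrelated decoys come first, the real match is 11th.
query = "ACGTTGCAAGCTTAGGCTAACGGATCCGTA" * 3
decoy_units = ["AA", "CC", "GG", "TT", "AC", "AG", "AT", "CG", "CT", "GT"]
database = {f"decoy_{unit}": unit * 45 for unit in decoy_units}
database["near_match"] = query[:40] + "T" + query[41:]  # one substitution

print(search(query, database))  # ['near_match']
```

Even though near_match sits after the first 8 database entries, it is the first candidate aligned because it shares the most k-mers with the query, so the reject limit is never reached.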

(P.S. This is why uclust works much better and faster at high similarities, say 80% or 90%, and kind of sucks at low identities. At 20% similarity, the k-mer hit list will be useless, while at 97% all top hits will be identified through common k-mers.)
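
A rough back-of-the-envelope check of that point, under two simplifying assumptions that are mine rather than uclust's: mismatches are spread independently along the read, and the word length is 8. Under that model, a given k-mer survives intact with probability identity ** k.

```python
def expected_kmer_survival(identity, k=8):
    """Expected fraction of k-mers left untouched by mismatches.

    Assumes independent, evenly likely mismatches at every position,
    which is a simplification for illustration only.
    """
    return identity ** k


for pct in (0.97, 0.90, 0.80, 0.20):
    survival = expected_kmer_survival(pct)
    print(f"{pct:.0%} identity: about {survival:.6f} of k-mers shared")
```

With these assumptions, roughly 78% of words are shared at 97% identity, about 43% at 90%, about 17% at 80%, and only about 0.0003% at 20%, which is why the k-mer hit list carries almost no signal at very low identities.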

I hope that helps! 
Colin 