I'll take a stab at this. Others please jump in and/or pile on.
Do keep in mind that the 97% threshold is based on alignments of the ENTIRE 16S gene. The V4 region amplified with the 515/806 primers is only about 250 bases, with above-average entropy, so really a lower threshold should be considered. As to taxonomic assignments, I'm not sure why you were getting more deep assignments at 99%, but I would hold any assignment below family with this region (sometimes genus is OK) with relative suspicion and contempt. If I can hazard a guess, MiSeq amplicon data has enough sequencing errors, especially toward the 3' end, that at a higher threshold there were more representatives of spurious OTU sequences, which offered the taxonomy assigner the chance to provide assignments where it otherwise wouldn't.

At any rate, clustering at a high threshold is a great way to produce diversity estimates that are wickedly inflated. I particularly like the article by Kunin et al. (2010) on this topic (using 454 data and a different region, but I think the result should still make MiSeq users cautious), where they sequenced a SINGLE E. coli isolate as a "community" and then clustered at different thresholds. Data that was untrimmed (as with QIIME split_libraries_fastq.py default settings -- not entirely true, but close) and clustered at 99% yielded nearly 100 OTUs. They found that aggressive filtering, requiring every base to have a <0.2% probability of being an error, was needed to recover the expected result (a single OTU), and that with such filtering they could cluster at up to 98%. This means filtering data at q27 or greater (set -q to 26 in split_libraries_fastq.py). In doing this you will lose a LOT of data! However, if that data causes you to say something you shouldn't ("we observed a billion jillion OTUs!"), it's data you should be happy to shed.
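For reference, the q27 ↔ 0.2% correspondence is just the Phred definition; here's a quick sketch (plain Python, nothing QIIME-specific):

```python
# Phred quality score Q -> per-base error probability: P = 10^(-Q/10).
# This is why "a base must have <0.2% probability of being an error"
# translates to keeping only q27-or-better bases (-q 26 truncates at
# the first base with quality <= 26).

def phred_to_error_prob(q: int) -> float:
    """Convert a Phred quality score to a probability of error."""
    return 10 ** (-q / 10)

# q26 still allows just over 0.25% error per base; q27 drops under 0.2%.
print(f"q26: {phred_to_error_prob(26):.4%}")  # ~0.2512%
print(f"q27: {phred_to_error_prob(27):.4%}")  # ~0.1995%
```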
On a typical data set I will retain about 85% of reads using default settings, or maybe 25% of reads if I set -q to 19 (q20 or better), -r to 0, and -p to 0.95. Personally I get much better results when all the reads are at or near the same length. I suspect there is a reference for this effect, but I've not found one. Anyone else know?
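For the curious, here is roughly how I understand the -q / -r / -p interaction, as a simplified Python sketch (this is NOT the actual QIIME implementation, just my mental model of it): truncate the read at the first run of low-quality bases longer than -r allows, then discard it if less than fraction -p of the original length survives.

```python
def filter_read(quals, min_q=19, max_bad_run=0, min_frac=0.95):
    """Return the kept read length, or None if the read is discarded.

    quals       : list of per-base Phred scores
    min_q       : bases with quality <= min_q count as "bad" (like -q)
    max_bad_run : consecutive bad bases tolerated before truncation (like -r)
    min_frac    : minimum surviving fraction of the read (like -p)
    """
    bad_run = 0
    keep = len(quals)
    for i, q in enumerate(quals):
        if q <= min_q:
            bad_run += 1
            if bad_run > max_bad_run:
                keep = i - bad_run + 1  # truncate before the bad run
                break
        else:
            bad_run = 0
    if keep < min_frac * len(quals):
        return None  # read discarded
    return keep
```

With -r set to 0, a single low-quality base truncates the read, which is part of why the retained fraction drops so sharply at stringent settings.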
Nelson et al. (2014) put out a cool paper in PLOS (yeah yeah, those guys suck right now -- PLOS, not Nelson et al.) where they reanalyzed MiSeq data, substituting a mock community for the PhiX component. PhiX is usually how run metrics are determined during sequencing. They found that the amplicon data did not behave like the balanced PhiX sequence during sequencing, and that the error rate was indeed higher than reported by the instrument. They go on to do a nice analysis of how one should process amplicon data, but the error-rate observation was the really interesting part to me.
Maybe, like me, your data isn't good enough to filter at q27. If you did that, you would have nothing left to process, and then you still couldn't say anything. Some runs are good enough to filter at q30, others not so much. Because I don't have a mock community to more accurately assess error rates during sequencing, I have to rely on the rates reported by the MiSeq, for better or (more likely) for worse. q20 leaves me enough data to do substantive work, but I acknowledge that my amplicon size is 253 bases (V4). If I filter at q20 (1% per-base error rate), I might expect up to ~2.5 errors per sequence.

Because of the issues with similarity clustering (attracting divergent reads to the same cluster seed), I favor distance-based clustering. This is available in Swarm (Mahé et al. 2014), which is de novo, is included in QIIME 1.9.1, and is super fast due to its use of edit distances rather than local alignments. Another notable distance-based method is minimum entropy decomposition (Eren et al. 2015). The approach used in DADA2 (Callahan et al. 2015) is the same (I think) as the Tikhonov reference you mention, but it also offers a denoising function to address the systematic errors that occur on the MiSeq. Sure, we used to hate waiting to denoise 454 data, but it may also have been a mistake for the microbial ecology community to essentially ignore the analogous problem for so long on the new platform. I haven't had time to make DADA2 work for myself yet, but I have had some success with the BayesHammer denoiser (Nikolenko et al. 2013), available in SPAdes 3.6 (Bankevich et al. 2012). That, and using read overlap to "correct" errors through consensus, can really improve the outcome of Illumina data.
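The ~2.5 errors figure is just the per-base rate times the amplicon length, assuming a worst case where every base sits exactly at the quality floor:

```python
# Back-of-the-envelope upper bound on expected errors per read:
# a q20 floor means at most a 1% chance of error per base, so a
# 253 bp V4 amplicon carries up to ~2.5 expected errors per sequence.

def max_expected_errors(length: int, q_floor: int) -> float:
    """Upper bound on expected errors if every base is at the quality floor."""
    return length * 10 ** (-q_floor / 10)

print(round(max_expected_errors(253, 20), 2))  # 2.53
print(round(max_expected_errors(253, 27), 2))  # 0.5
```

In practice most bases are well above the floor, so the true expectation is lower, but it shows why per-base thresholds matter more as amplicons get longer.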
Here are some screenshots from a presentation I gave on data treatment. The first shows quality plots of the two raw reads (from FastQC). You can see how terrible read 2 is all by itself. Some might find this enough cause to use only the first read, but I was determined. The next shows read 1 again (for comparison) alongside the joined data. If you relax fastq-join enough (I use -p 30 -m 30) and do your quality filtering afterward, you should get a lot of reads to overlap without sacrificing quality. As indicated in the image, I then trimmed the data to 220 bases prior to analysis. This gave me plenty of reads to play with, and the starting average quality was much better than with read 1 alone. These are data from fungal ITS2, so there wasn't complete overlap for 2x250 data as there is with 16S V4. I outline some of this process in my wiki on ITS analysis considerations here:
https://github.com/alk224/akutils-v1.2/wiki/ITS-analysis
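The overlap-consensus idea is simple enough to sketch: where the two reads agree, confidence effectively improves; where they disagree, keep the higher-quality call. A toy Python illustration (my own sketch of the concept, not fastq-join's actual algorithm):

```python
# Resolve an already-aligned overlap base-by-base using quality scores.
# Agreement keeps the base at the better score; disagreement keeps
# whichever read's call has the higher quality.

def join_overlap(seq1, q1, seq2, q2):
    """Return (consensus sequence, consensus qualities) for an aligned overlap."""
    joined_seq, joined_q = [], []
    for b1, s1, b2, s2 in zip(seq1, q1, seq2, q2):
        if b1 == b2:
            joined_seq.append(b1)
            joined_q.append(max(s1, s2))  # agreement: take the better score
        elif s1 >= s2:
            joined_seq.append(b1)        # disagreement: read 1 wins on quality
            joined_q.append(s1)
        else:
            joined_seq.append(b2)        # disagreement: read 2 wins on quality
            joined_q.append(s2)
    return "".join(joined_seq), joined_q

seq, quals = join_overlap("ACGT", [30, 12, 30, 30], "ACTT", [28, 28, 11, 30])
print(seq)  # "ACGT": position 3 disagrees, read 1's q30 'G' wins
```

This is why a relaxed join followed by quality filtering works so well for me: the miserable 3' end of read 2 gets rescued by the much better 5' end of read 1 wherever the reads overlap.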
References below. Hopefully you find this helpful.


Bankevich A., Nurk S., Antipov D., Gurevich AA., Dvorkin M., Kulikov AS., Lesin VM., Nikolenko SI., Pham S., Prjibelski AD., Pyshkin AV., Sirotkin AV., Vyahhi N., Tesler G., Alekseyev MA., Pevzner PA. 2012. SPAdes: A new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology 19:455–477.
Callahan BJ., McMurdie PJ., Rosen MJ., Han AW., Johnson AJ., Holmes SP. 2015. DADA2: High resolution sample inference from amplicon data. bioRxiv preprint.
Eren AM., Morrison HG., Lescault PJ., Reveillaud J., Vineis JH., Sogin ML. 2015. Minimum entropy decomposition: Unsupervised oligotyping for sensitive partitioning of high-throughput marker gene sequences. ISME J 9:968–979.
Kunin V., Engelbrektson A., Ochman H., Hugenholtz P. 2010. Wrinkles in the rare biosphere: Pyrosequencing errors can lead to artificial inflation of diversity estimates. Environmental Microbiology 12:118–123.
Mahé F., Rognes T., Quince C., de Vargas C., Dunthorn M. 2014. Swarm: robust and fast clustering method for amplicon-based studies. PeerJ 2:e593.
Nelson MC., Morrison HG., Benjamino J., Grim SL., Graf J. 2014. Analysis, optimization and verification of Illumina-generated 16S rRNA gene amplicon surveys. PLoS ONE 9:e94249.
Nikolenko SI., Korobeynikov AI., Alekseyev MA. 2013. BayesHammer: Bayesian clustering for error correction in single-cell sequencing. BMC Genomics 14:1–11.