Method of rarefaction before statistical tests


Lee Sze Han

unread,
Dec 9, 2013, 8:03:01 AM12/9/13
to qiime...@googlegroups.com
Hi all

I have been using QIIME for the past few months, and it has been very useful to my project. Cheers to the developers!

Right now, I have finished the process of OTU picking, and I intend to run otu_category_significance.py to see which OTUs differ between categories. The script's documentation page (http://qiime.org/scripts/otu_category_significance.html) recommends rarefying the OTU table prior to running these significance tests.

As my samples range from 12k to 100k reads per sample, I intend to run single_rarefaction.py at a depth of 10k reads per sample. My concern is that sampling 10k reads may randomly omit certain OTUs from the statistical test (especially in samples with more reads).

Is the above concern valid? Is there a better way to rarefy before conducting statistical tests? Would multiple_rarefactions_even_depth.py be appropriate here?
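To make the concern concrete, here is a quick pure-Python sketch (toy counts, not a QIIME command) of what subsampling without replacement does to low-abundance OTUs:

```python
import random

def rarefy(counts, depth, seed=0):
    """Subsample one sample's OTU counts to a fixed depth without replacement."""
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    random.seed(seed)
    rarefied = {}
    for otu in random.sample(pool, depth):
        rarefied[otu] = rarefied.get(otu, 0) + 1
    return rarefied

# Hypothetical 10,000-read sample: one dominant OTU plus 100 singletons.
sample = {"OTU_dominant": 9900}
sample.update({"OTU_rare_%d" % i: 1 for i in range(100)})

rarefied = rarefy(sample, depth=1000)
print(len(sample), "OTUs before,", len(rarefied), "OTUs after rarefying to 1000 reads")
```

Each singleton here survives with probability 1000/10000 = 0.1, so roughly 90 of the 100 rare OTUs vanish from the rarefied table, which is exactly the worry above.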

Thank you for taking the time to review this concern.

Regards
Sze Han

Luke Ursell

unread,
Dec 9, 2013, 11:38:58 AM12/9/13
to qiime...@googlegroups.com
Hi Sze,

You are correct that when you rarefy your OTU table, low-abundance OTUs are likely to be removed from the table, and those lower-abundance OTUs may be significantly different between your samples.

There are a few things to keep in mind:
1) If an OTU is very low abundance, how confident are you that it is genuinely present in your samples, and at appreciable levels to make a biological difference?
2) Rarefying your table will omit OTUs that do not make up a certain proportion of a sample's overall sequences. The more abundant an OTU is in your sample, the more likely it is to find its way into a rarefied table. You are of course more than welcome to run multiple_rarefactions_even_depth.py and compare your results across multiple rarefactions; I expect the results will be broadly similar.
3) Another very popular step before running otu_category_significance.py is to use the filter_otus_from_otu_table.py script to remove singletons (i.e., OTUs that are only present in one sample), or to remove very low-abundance OTUs by specifying the minimum number of sequences and the minimum number of samples an OTU must have to remain in the table. Removing OTUs in this manner before rarefaction can be very helpful.
4) The fewer OTUs in your OTU table, the less the multiple-comparison correction will affect your p-values.
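The filtering in point 3 amounts to one pass over the table; a minimal sketch with toy data (the real options belong to filter_otus_from_otu_table.py, if I remember its flags correctly, not to this function):

```python
def filter_otus(table, min_count=2, min_samples=2):
    """Keep only OTUs with at least min_count total reads that appear in at
    least min_samples samples (the idea behind filter_otus_from_otu_table.py's
    minimum-count / minimum-samples options)."""
    kept = {}
    for otu, per_sample in table.items():
        total = sum(per_sample.values())
        n_present = sum(1 for c in per_sample.values() if c > 0)
        if total >= min_count and n_present >= min_samples:
            kept[otu] = per_sample
    return kept

# Toy table: OTU -> {sample: count}
table = {
    "OTU_a": {"s1": 50, "s2": 40, "s3": 60},
    "OTU_b": {"s1": 1,  "s2": 0,  "s3": 0},   # singleton: dropped
    "OTU_c": {"s1": 3,  "s2": 2,  "s3": 0},
}
filtered = filter_otus(table)
print(sorted(filtered))
```

Doing this before rarefaction shrinks the table, which also helps with point 4 (fewer tests to correct for).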

Best,
Luke
--
 
---
You received this message because you are subscribed to the Google Groups "Qiime Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to qiime-forum...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Lee Sze Han

unread,
Dec 9, 2013, 8:47:32 PM12/9/13
to qiime...@googlegroups.com
Hi Luke

That is very insightful. I will try filtering out singletons and then rarefying the OTU table before applying statistical tests.

With regard to statistical tests, does otu_category_significance.py offer any non-parametric tests (e.g. Kruskal–Wallis), as well as repeated-measures tests (e.g. repeated-measures ANOVA, Friedman test)?

I am thinking of using non-parametric tests because some of my categories only have 3-4 samples per category. I am also following all the categories through 4 time points.
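For reference, the Kruskal-Wallis statistic itself is simple to compute; a bare-bones sketch (average ranks for ties, but no tie correction and no p-value) rather than anything QIIME ships:

```python
def kruskal_wallis(*groups):
    """Kruskal-Wallis H statistic: a rank-based one-way test across groups."""
    labeled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    values = [v for v, _ in labeled]
    n = len(values)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j < n and values[j] == values[i]:
            j += 1
        for k in range(i, j):              # tied values share the average rank
            ranks[k] = (i + j + 1) / 2.0
        i = j
    rank_sums = [0.0] * len(groups)
    for (v, gi), r in zip(labeled, ranks):
        rank_sums[gi] += r
    return 12.0 / (n * (n + 1)) * sum(
        rs * rs / len(g) for rs, g in zip(rank_sums, groups)
    ) - 3 * (n + 1)

# Three hypothetical groups of per-sample OTU abundances
print(kruskal_wallis([1, 2, 3], [4, 5, 6], [7, 8, 9]))
```

For three fully separated groups like these, H works out to 7.2; being rank-based, the test only needs the 3-4 samples per group to be orderable, not normally distributed.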


Thanks for your input!

Luke Ursell

unread,
Dec 9, 2013, 11:14:36 PM12/9/13
to qiime...@googlegroups.com
Hi,

The next iteration of QIIME, version 1.8, which is soon to be released (before the end of the year), does have nonparametric tests and bootstrapping procedures, and corrects for multiple comparisons, so hold on just a bit.
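As an illustration of what a multiple-comparison correction does to per-OTU p-values, here is the Benjamini-Hochberg FDR step-up procedure (a common choice; I would not promise 1.8 uses exactly this one):

```python
def benjamini_hochberg(pvals):
    """Benjamini-Hochberg FDR-adjusted p-values (step-up procedure)."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])   # smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    for pos in range(n - 1, -1, -1):                   # walk from largest p down
        i = order[pos]
        running_min = min(running_min, pvals[i] * n / (pos + 1))
        adjusted[i] = running_min
    return adjusted

# Four hypothetical per-OTU p-values
print(benjamini_hochberg([0.01, 0.02, 0.03, 0.5]))
```

Note how the three small p-values all inflate to 0.04 after adjustment: the more OTUs (tests) in the table, the harsher this gets, which is Luke's point about filtering the table first.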

Luke

Lee Sze Han

unread,
Dec 9, 2013, 11:21:42 PM12/9/13
to qiime...@googlegroups.com
Hi Luke

Good to hear that.

Thank you for your advice!

Meric

unread,
Jan 23, 2014, 6:36:27 AM1/23/14
to qiime...@googlegroups.com
Hi Luke,

I also have a question related to the removal of singletons and rare OTUs, and to rarefaction. After denoising, chimera checking, and removal of PyNAST failures, I still observed a lot of rare OTUs, which makes my data difficult to interpret. I therefore removed singletons and then rare OTUs below 1% with the option --min_count_fraction 0.01. As a second step, I determined the smallest count per sample and used that number as the depth in the rarefaction step. As a last step, I summarized taxa with bar chart plots. Is this a correct approach to obtain normalized OTU abundances? If so, can I use the same approach to produce PCoA plots and heatmaps for more reliable results?

I look forward to your reply.
Best wishes,
Meric    

Jai Ram Rideout

unread,
Jan 24, 2014, 9:24:15 AM1/24/14
to qiime...@googlegroups.com
Hi Meric,

For taxa summary plots, we usually don't recommend rarefying the table beforehand since the taxa summaries are converted to relative abundances. However, rarefying your table is important for alpha and beta diversity. The two workflows alpha_rarefaction.py and beta_diversity_through_plots.py will rarefy your table for you as part of the steps they perform.

Regarding your question about even sampling depth, that is entirely up to you. It sounds like you chose the smallest number of sequences per sample, in which case you will retain all samples in your rarefied table. However, you need to consider what the depth actually is; for example, if your smallest number of seqs/sample is 10, you'd likely not want to use that as an even sampling depth. So, it's a tradeoff between keeping samples in your study and maintaining a sufficient sampling depth for your downstream analyses.
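That tradeoff is easy to tabulate from per-sample read counts; a small sketch with made-up numbers:

```python
def samples_retained(reads_per_sample, even_depth):
    """Count samples that survive rarefaction at a given even sampling depth
    (samples with fewer reads than the depth are dropped)."""
    return sum(1 for reads in reads_per_sample.values() if reads >= even_depth)

# Hypothetical per-sample sequence counts
reads = {"s1": 10, "s2": 1355, "s3": 3000, "s4": 5000, "s5": 7565}
for depth in (10, 1000, 1355, 3000, 5000):
    print("depth %5d keeps %d of %d samples"
          % (depth, samples_retained(reads, depth), len(reads)))
```

In this toy example a depth of 10 keeps every sample but is far too shallow to say anything about diversity, while 5000 keeps only two samples; the useful depth sits somewhere in between.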

-Jai

Meric

unread,
Jan 27, 2014, 4:54:23 PM1/27/14
to qiime...@googlegroups.com
Hi Jai,

Thanks a lot for your reply. I will do rarefaction only for alpha and beta diversity then.

For the even sampling depth, my number of sequences per sample is actually not that low: my lowest count is 1355 and the highest is 7565; the mean is 3065 and the std dev is 1622. I chose a depth of 1200 for this dataset. Is this rarefaction depth OK?
Thanks in advance,
Meric

Jai Ram Rideout

unread,
Jan 28, 2014, 12:25:49 PM1/28/14
to qiime...@googlegroups.com
Hi Meric,

I can't give a definitive answer, as it depends on how many samples you'd want to sacrifice in order to get a higher even sampling depth. It also depends on what type of question you're trying to answer and what sort of effect size you're investigating. For example, if you were to look at differences between human body sites (e.g., gut vs. palm), you likely won't need a very high sampling depth to continue to see separation in e.g. beta diversity plots.

This paper might be of interest:

Direct sequencing of the human microbiome readily reveals community differences.
J Kuczynski et al. Genome Biology (2011).

You can try out a sampling depth of 1355, and compare that to a higher depth if necessary. I don't recommend choosing a sampling depth of 1200, since the lowest depth necessary to retain all samples is 1355.

-Jai

sunil mundra

unread,
May 11, 2014, 4:45:20 PM5/11/14
to qiime...@googlegroups.com
Hello Qiimers,
I have a rather similar question for the community.
I have 182 samples in a project, sequenced on Illumina, and the per-sample read depth varies from 200 to 200000. In this project I need to look at alpha and beta diversity for different treatments.
I rarefied the data at a depth of 1800 reads per sample (a depth where I lose very few samples), but after this the total read count for 650 OTUs became zero,
meaning I have to take them out of my community structure and taxonomic analysis...!!
I am reluctant to lose so many OTUs at this stage. Is this normal for other people as well?
What would be the consequences of rarefying versus not rarefying my dataset?
Any suggestions would be very helpful for the further analysis.
regards
Sunil

Raul Miranda

unread,
May 11, 2014, 5:35:48 PM5/11/14
to qiime...@googlegroups.com

Hi Sunil,

That’s a difference of three orders of magnitude. Depending on how many samples you have in the lower range, I would suggest excluding the samples with a low number of reads. You can sequence them again to try to get a higher read count. It would also be a good idea to quantify your libraries by qPCR before sequencing, so that you can make sure each sample is similarly represented.

Cheers

Raul


Barvaz Sini

unread,
May 14, 2014, 12:59:12 AM5/14/14
to qiime...@googlegroups.com
Hi,
I think it really depends on your analysis goals.
One option is to subsample at 1800 reads/sample for the alpha/beta diversity analysis, and also to subsample at a higher reads/sample threshold for looking at the contribution of the very low-frequency OTUs (albeit at the price of losing more samples).
Note that in general the low-frequency OTUs usually do not contribute much to the Bray-Curtis distance, so if you intend to use it as the beta diversity metric, it shouldn't matter if you subsample to 1800 reads/sample. However, for presence/absence metrics (such as binary Jaccard or unweighted UniFrac), the low-abundance OTUs can contribute a lot (if their number is high).
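This point about metrics can be checked directly; a toy comparison with made-up counts:

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two OTU count dicts."""
    otus = set(a) | set(b)
    diff = sum(abs(a.get(o, 0) - b.get(o, 0)) for o in otus)
    total = sum(a.get(o, 0) + b.get(o, 0) for o in otus)
    return diff / total

def binary_jaccard(a, b):
    """Binary Jaccard distance: presence/absence only, counts ignored."""
    present_a = {o for o, c in a.items() if c > 0}
    present_b = {o for o, c in b.items() if c > 0}
    return 1 - len(present_a & present_b) / len(present_a | present_b)

# Two samples sharing the same abundant OTUs; b also has 8 rare singletons.
a = {"big1": 500, "big2": 500}
b = dict(a, **{"rare%d" % i: 1 for i in range(8)})

print("Bray-Curtis:   ", round(bray_curtis(a, b), 3))
print("binary Jaccard:", round(binary_jaccard(a, b), 3))
```

The 8 singletons move Bray-Curtis by only 8/2008 ≈ 0.004, but push the binary Jaccard distance to 0.8, because presence/absence metrics weight a singleton the same as a dominant OTU.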
How many OTUs do you have to begin with (before the subsampling)? Is 650 OTUs a big number?
How many samples (out of what total) do you lose if you subsample to ~10000 reads/sample?

Cheers
Amnon




sunil mundra

unread,
May 14, 2014, 5:56:06 AM5/14/14
to qiime...@googlegroups.com
Hi Amnon,
My average read count per sample is approximately 25000.
In total I have 2324 OTUs from 168 samples before rarefaction, and if I rarefy at 1800 reads per sample I lose 520 OTUs.
If I subsample at, for example, 5000 reads per sample I lose 25 samples in total, and at 10000 reads per sample I lose 45 samples.
I checked very briefly, and it looks like I am losing OTUs from the high-read samples. The average number of OTUs per sample was 297 before rarefaction and 110 after. Reads per sample and species richness (OTU count) were moderately correlated (R² = 0.59), while species richness before and after rarefaction were not related (R² = 0.23).
I am a bit worried about this situation and need suggestions.
My experimental setup (see below) does not allow me to throw away even 20 samples. My objective is to see whether there is a difference between the community structure of Experimental and Control over time, and whether similar trends hold at the deeper soil depth.

Sampling time point   Sampling depth 2 cm       Sampling depth 5 cm
                      Experimental  Control     Experimental  Control
Week 1                6 samples     6 samples   6 samples     6 samples
Week 2                6 samples     6 samples   --            --
Week 3                6 samples     6 samples   6 samples     6 samples
Week 4                6 samples     6 samples   --            --
Week 5                6 samples     6 samples   6 samples     6 samples
Week 6                6 samples     6 samples   --            --
Week 7                6 samples     6 samples   6 samples     6 samples
Week 8                6 samples     6 samples   --            --
Week 9                6 samples     6 samples   6 samples     6 samples

Looking forward to hear from you...
Regards
Sunil

Barvaz Sini

unread,
May 15, 2014, 2:00:23 AM5/15/14
to qiime...@googlegroups.com
Hi Sunil,
Whenever you subsample you lose reads and therefore OTUs (and likewise, if you had run your experiment on twice the number of sequencing runs and gotten twice the reads per sample, you would have observed additional OTUs that you are not seeing in your current experiment...). In typical samples, subsampling to 1800 is not that bad (even though soil samples are very rich in low-abundance OTUs).
If you cannot tolerate losing samples, I would suggest doing your analysis at 1800 reads/sample. In typical scenarios this should suffice to see the differences in community structure (unless the differences are driven only by very low-abundance OTUs, which is not common).
To be on the safe side, you can also run the same analysis subsampled to 10000 reads per sample (where you lose ~25% of your samples?), look at the PCoAs, and see if there is a drastic shift compared to the PCoAs from the 1800 reads/sample run. If they look similar, I would say the rare OTUs thrown away by the subsampling do not affect the results.
Also, when checking species richness (number of OTUs per sample), you MUST subsample all samples to the same depth, since, as you observed, there is a correlation between the number of reads in a sample and the number of OTUs in it (more reads allow more rare OTUs to be detected).
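A quick simulation of why: one toy community "sequenced" at two different depths gives very different raw richness, and the counts only become comparable after subsampling both samples to the same depth.

```python
import random

def subsample(counts, depth, seed):
    """Draw `depth` reads without replacement from an OTU count dict."""
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    random.seed(seed)
    sub = {}
    for otu in random.sample(pool, depth):
        sub[otu] = sub.get(otu, 0) + 1
    return sub

# One underlying community of 100,000 reads: 50 common OTUs + 100 rare OTUs
community = {"common%d" % i: 1800 for i in range(50)}
community.update({"rare%d" % i: 100 for i in range(100)})

deep = subsample(community, 20000, seed=1)      # high-coverage sample
shallow = subsample(community, 2000, seed=2)    # low-coverage sample
print("raw richness:     ", len(deep), "vs", len(shallow))

deep_rarefied = subsample(deep, 2000, seed=3)   # rarefy the deep sample down
print("rarefied richness:", len(deep_rarefied), "vs", len(shallow))
```

The deep sample detects nearly all of the rare OTUs while the shallow one misses many of them; after rarefying the deep sample to 2000 reads, the two richness values land close together, so only the rarefied numbers are a fair comparison.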

What do you think?
Amnon