Rarefying sequences with large standard deviations...more complicated than it sounds

129 views
Skip to first unread message

Jay T

unread,
Jul 12, 2017, 2:25:43 PM7/12/17
to Qiime 1 Forum
I have a total of 192 samples... 4 treatments that were replicated 4 times. Each treatment had a total of 12 individuals. From those 12 individuals we took 4 samples.....so a total of 48 samples per treatment. So 4 treatments X 3 individuals X 4 samples per individual x 4 replicated treatments.  

Originally I wanted to see if there was differences between between samples in the same sample and between sample categories existed. I did not detect any. I wanted to run core diversity analyses but when I summarized my biom table it says: 

Counts/sample summary:

 Min: 3,345.000

 Max: 62,083.000

 Median: 16,981.000

 Mean: 19,973.891

 Std. dev.: 13,152.330

 Sample Metadata Categories: collapsed_ids

 Observation Metadata Categories: taxonomy


Given the individual samples have such a large standard deviation is there a way to merge samples within the same treatment before rarefying while having an adequate number of samples per treatment to run stats on?


I hope this makes sense.  




Colin Brislawn

unread,
Jul 12, 2017, 7:14:15 PM7/12/17
to Qiime 1 Forum
Hello Jay T,

Good to hear from you again. I hope all is well.

Is there a way to merge samples within the same treatment?

In cases like this, I think that having the samples separate (not merged / not collapsed) could be more helpful than merging them and increasing your reads per sample. While you may pick a rarefaction depth of 4,000 reads per sample you may lose some replicates. However, you will still have multiple replicates for each treatment type in your anova design. This lets you measure variation within the same treatment, which is super important because it lets you argue that the treatment effect size is larger than the normal individual variation and thus is biologically relevant.

Finding good setting for normalization on a given data set is always a bit tricky, so let me know what balance works well for you.

Colin

Jay T

unread,
Jul 12, 2017, 8:14:45 PM7/12/17
to Qiime 1 Forum
Hi again Colin, 

As always I appreciate you answering my endless questions regarding microbial ecology. Anyhow, given the nested ANOVA design of the experiment I decided to just separate the samples by their origin and compare those within the same category per treatment. 

That's really the main question of the experiment...Do we see significant differential changes in OTUs between samples of the same category and across treatments.

One thing I would ask though is how is your familiarity with the G-test of Independence? I noticed that when analyzing my data using the default FDR corrected Kruskal Wallis test that the difference between groups is non-significant but for several taxa the results are significant with the G-test. I know KW works on ranks, and the G-test tests uses a very different strategy. Is one test more suitable than the other? 

Lastly is there a way to default the test to G-test in the Core Div analyses? 

Thanks,
JT

Colin Brislawn

unread,
Jul 12, 2017, 9:13:50 PM7/12/17
to Qiime 1 Forum, Yoshiki Vázquez Baeza
Hello Jay,

how is your familiarity with the G-test of Independence
I don't know much about the G-test. :-( 

difference between groups is non-significant but for several taxa the results are significant
Now this makes sense to me. Large groups could be similar, while specific features could be consistently different. So you would have two tests; one for groups, one for individual features.

To compare groups, I currently use the adonis test on distances matrices (unifrac or weighted unifrac distances). The ancon test is newer, and maybe better, but adonis is commonly used and simple to understand. This tells me how much variance in my distances matrices could be explained by these given groups.

To compare individual OTUs, I use DESeq2. Originally designed for RNA seq data, this method compares two groups of samples and identified specific OTUs which are consistently more abundant in one group, and OTUs which are more abundant in the other. 

I do this in R using the phyloseq package. I can share my code, if that's interesting.

Looks like Yoshiki is monitoring the forms now, so perhaps he can comment on the G-test or ancon vs adonis. 

Colin

Message has been deleted

Jay T

unread,
Jul 12, 2017, 10:24:51 PM7/12/17
to Qiime 1 Forum, yosh...@gmail.com
Definitely. I like phyloseq a lot. Very flexible package. Please share when you get a chance.

Thanks!

Yoshiki Vázquez Baeza

unread,
Jul 13, 2017, 1:38:28 PM7/13/17
to Qiime 1 Forum
Hello,

Regarding one of the original points in this thread:

Given the individual samples have such a large standard deviation is there a way to merge samples within the same treatment before rarefying while having an adequate number of samples per treatment to run stats on?

The large variation in the counts per sample is more common than what you think (usually a side-effect of the sequencing instrument or of low-biomass samples).

Note that the hypotheses you test with these tests are all slightly different, and that each make some assumptions. ANCOM is a differential abundance test, while ADONIS is a permutation-based clustering strength test (are groups of treatment types closer than others).

If you would like to test differential abundance in your samples, I would recommend any of the packages that have been mentioned above, also taking into consideration the discussion in these papers:

Finally, if you feel comfortable using R/Python trying out one of the new compositionally-aware packages would likely be a better approach:

 
Thanks!

Yoshiki.


Colin Brislawn

unread,
Jul 13, 2017, 2:48:20 PM7/13/17
to Qiime 1 Forum
Thanks for the update Yoshiki. Sounds like I should lean more about ANCOM!

Jay, both my methods for adonis and DESeq2 are based on the excellent Phyloseq vignettes by Paul McMurdie. 
ADONIS: https://joey711.github.io/phyloseq-demo/Restroom-Biogeography.html (No direct link. Seach the page for 'adonis')

Here's one example of adonis on a published project, but Phyloseq comes with better tutorials. 

Colin

Jay T

unread,
Jul 13, 2017, 4:07:04 PM7/13/17
to Qiime 1 Forum
I actually just finished the DESeq2 example along with running the differential abundance testing in QIIME and I feel like it looks excellent. 

Thanks Colin!

JT

Colin Brislawn

unread,
Jul 13, 2017, 5:02:32 PM7/13/17
to Qiime 1 Forum
Glad it's working well!
Colin

Jay T

unread,
Jul 13, 2017, 5:38:44 PM7/13/17
to qiime...@googlegroups.com
Colin - One last thing. So I was able to make a nice Log2FoldChange output but how exactly can I get the raw output from running it in R?

I ran the same otu table in QIIME using the differential abundance script and its output from the .txt looks very different than what I visualized in R. 

Thanks,
JT


Colin Brislawn

unread,
Jul 13, 2017, 5:59:09 PM7/13/17
to Qiime 1 Forum
I'm not sure how to take the deseq output from qiime and import it into R. When I've used this, I've simply done it directly in R.

I bet differences between outputs have to do with the details of implementation. Perhaps you could compare the phyloseq code to the qiime code to find where differences could emerge. 

One potential difference is this step, in which a pseudocount of 1 is added to every value of your biom table. This prevents any rows or columns from being totally empty, but also changes the shape of the negative binomial model. 

Colin

Jay T

unread,
Jul 13, 2017, 6:01:28 PM7/13/17
to Qiime 1 Forum
Sorry I misspoke. What I want are the Deseq2 results from the R output only...something I can put in an excel table.  

Colin Brislawn

unread,
Jul 13, 2017, 6:11:56 PM7/13/17
to Qiime 1 Forum
Got it.

OK, so do you want to varianceStabilized counts from the deseq object, or do you want the result of a contrast() between two groups?

Results of counts:
deseqobject %>% assay() %>% write.table()

Results of contrasts:
deseqobject %>% results(contrast = c("groups", "numerator-group", "denominator-group") %>% write.table()

I think those should work. The final arguments for write.table() might be something like this:
write.table(file="folder/filename.csv", quote=FALSE, sep=",", row.names=FALSE, col.names=TRUE)

I hope this helps. There are lots of ways to do this in R. 

Colin

Jay T

unread,
Jul 13, 2017, 6:15:44 PM7/13/17
to Qiime 1 Forum
That is exactly what I need. Thanks!

Jay T

unread,
Jul 13, 2017, 7:10:54 PM7/13/17
to qiime...@googlegroups.com
Here is another way to do it if someone else is following the phyloseq deseq2 conversion tutorial....

Save the "sigtab" deseq2 object listed in the tutorial:

write.csv(as.data.frame(sigtab), 
          file="Control_High_results.csv")

Change the file name of course.

Thanks Colin! 

Jay T

unread,
Jul 14, 2017, 3:29:36 PM7/14/17
to Qiime 1 Forum
Colin - Have you ever plotted by groups instead of using geom_points?

I think this looks superior to the output in the phyloseq deseq2 tutorial. I can't figure out how to get my code to work though. 

Colin Brislawn

unread,
Jul 14, 2017, 4:37:30 PM7/14/17
to Qiime 1 Forum
Hey Jay,

I like stacking up my points as shown in the deseq tutorial, but that's just me.

It looks like geom_col() + geom_errorbar() should make that graph. 

See the example at the very bottom of this page:


Colin

Jay T

unread,
Jul 14, 2017, 4:50:09 PM7/14/17
to Qiime 1 Forum
Yea I could see both ways. One thing I noticed in my data is that Fusobacteria is more abundant in Group 1 vs Group 2....but when plotting the relative abundance it looks Group 2 has more....

I guess this makes sense if Deseq2 is working on absolute abundances and not relative abundances. 

Colin Brislawn

unread,
Jul 14, 2017, 7:39:47 PM7/14/17
to Qiime 1 Forum
Hello Jay,

Interesting... Trying out multiple normalization methods could help test this. DESeq does require raw data, but you can also build graphs of abundance using rarified (subsampled) counts, or relative counts (out of 1), or raw counts just to compare. 

You could also check the contrasts function that's being called. Take this example:
results(deseqobject, contrast = c("Treatment", "A", "B"))
So this will calculate A over B.

Deseq let's you set up this A vs B comparison in a couple of different ways, but I like this best because it makes it clear which is the numerator and the denominator. If you are using another way, it's possible the software swapped your levels.

Colin

Jay T

unread,
Jul 14, 2017, 8:24:08 PM7/14/17
to Qiime 1 Forum
I'm actually slightly embarrassed. I had been looking at the DeSeq2 plot "upside down". I thought my "Control" group was on the positive log2 scale at the top of the plot and my "High" group was at the bottom. I kept looking at it funny because it didn't match my relative abundance plots in the slightest...then I remembered the treatment groups are arranged alphabetically from the bottom up unless specified. 

Thanks for that command Colin!

JT 

Colin Brislawn

unread,
Jul 15, 2017, 12:12:35 AM7/15/17
to Qiime 1 Forum
Yep, that exact same thing has happened to me. That's why I use the extra step of results() to list exatly what group is on top and on bottom, instead of leaving it up to R when using the ~ formula. 

Thanks for posting the result here to help future users. 
results(deseqobject, contrast = c("Treatment", "A", "B")) # A over B
This is the way to go.

Colin

Reply all
Reply to author
Forward
0 new messages