category significance (group significance) assumptions

Andrew Krohn

Jun 11, 2014, 9:35:16 PM

to qiime...@googlegroups.com

Hello,

I am still on v1.7.0, so I am using the otu_category_significance script, and I have a question about some of the statistical processing. I work in an ecological system that has been somewhat characterized, so we know where to expect differences. In my beta and alpha diversity plots those differences are apparent. But when I run group (OTU category) significance, nothing is significant, so I was perplexed about what could be driving the separations.

For beta diversity I tend to run UniFrac and Jaccard distances, both of which are useful data transformations (into matrices) that can deal with multiple zeros for some factor (such as an OTU). Since my sampling is blind (common with soil and marine collections; I take cores and sequence whatever I get), I can expect to see zero abundance for some taxon that we know to be present, so the assumption has to be that sampling is always incomplete.

The group significance script defaults to ANOVA, which operates under several assumptions, including normally distributed data and homogeneous variances. I was thinking about this last night, so this morning I ran Shapiro-Wilk tests on a few of my OTUs (across all samples), and indeed none of them satisfy the assumption of normality (and thus probably violate the variance assumption as well). I assume a data transformation is needed to deal with the zeros, but I'm not sure of the best way to proceed. Can anyone make a reasonable suggestion here? Is there a way to perform a transformation directly on a biom table?

Will Van Treuren

Jun 12, 2014, 12:24:54 PM

to qiime...@googlegroups.com

Hi Andrew,

We have also found that most OTU-level abundance tables are not normally distributed. In QIIME 1.8 and above, the new script group_significance.py replaces otu_category_significance.py, and it defaults to the Kruskal-Wallis test rather than ANOVA.
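To illustrate why a rank-based test is less sensitive to zero-heavy, non-normal counts, here is a minimal sketch of the tie-corrected Kruskal-Wallis H statistic in plain Python. It is not QIIME's implementation; the function names and the example counts are hypothetical, and a p-value would still have to be obtained by comparing H against a chi-square distribution with (number of groups - 1) degrees of freedom.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of groups."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)

    # Accumulate (rank sum)^2 / group size, walking ranks in group order.
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    tie_counts = {}
    for v in pooled:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    tie_sum = sum(t ** 3 - t for t in tie_counts.values())
    correction = 1.0 - tie_sum / float(n ** 3 - n)
    # If every value is identical the statistic is undefined; this sketch
    # returns 0.0 in that case (an arbitrary choice for illustration).
    return h / correction if correction > 0 else 0.0


# Hypothetical counts of one OTU in two sample classes, with many zeros.
class_a = [0, 0, 0, 5]
class_b = [0, 3, 8, 9]
h_stat = kruskal_wallis_h([class_a, class_b])
```

Because only the ranks enter the calculation, the zeros simply share a tied rank and no normality assumption is involved.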

Testing for significant differences between feature means across sample classes is a fundamentally different problem from testing for differences between samples via things like beta diversity. The transformation you describe (UniFrac/Jaccard) eliminates the features and seeks to embed the samples in some other space (rather than the feature space). The reason you may not be seeing individual differences with otu_category_significance.py but do see differences with beta_diversity.py is that no single feature is discriminatory between the samples; only through a combination of your features will you see differences between your samples. If you want a more quantitative look at which features are causing separation between your samples, I would suggest looking at the feature importance scores from the supervised_learning.py script.

If you are concerned about the normality of your data and want to transform it, you can make those transformations using the biom API and Python functions, or you can convert your table to classic format, manipulate it in Excel or R, and then convert it back. Log transforms have worked well for me to normalize the data (casting the -inf values that result from log(0) to 0).
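A minimal sketch of the log transform described above, assuming the table has already been pulled out into a plain list of rows (one row per OTU, one column per sample). The function name and the example table are hypothetical; the same per-value logic could be applied through the biom API or in R.

```python
import math


def log_transform_counts(counts):
    """Natural-log transform counts; zeros stay 0 instead of becoming -inf."""
    return [math.log(c) if c > 0 else 0.0 for c in counts]


# Hypothetical OTU table: one row per OTU, one column per sample.
otu_table = [
    [0, 10, 100],
    [1, 0, 1000],
]
transformed = [log_transform_counts(row) for row in otu_table]
```

Note that a count of 1 also maps to 0 under this scheme, so absent OTUs and singletons become indistinguishable after the transform.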

Hope this helps,

Will

Andrew Krohn

Jun 12, 2014, 1:37:08 PM

to qiime...@googlegroups.com

Thanks Will,

I have a lot of zero values (most OTU tables do), so I would think a log transformation is inappropriate, since log(0) is undefined. I did some other things (sqrt(average(allsamples)+(value for sample))) to eliminate zero values, but I'm not sure how justified I am, even though it seems to push the taxa that are visually different to the top. I will give supervised learning a try instead.
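One possible reading of the sqrt(average(allsamples)+(value for sample)) expression, sketched in plain Python. The post does not say how the average is taken, so this assumes it is the mean of a single OTU's counts across all samples; the function name and example counts are hypothetical.

```python
import math


def sqrt_shift_transform(row):
    """Apply sqrt(row mean + value) to each sample in one OTU's row."""
    mean = sum(row) / float(len(row))
    return [math.sqrt(mean + v) for v in row]


# Hypothetical counts of one OTU across three samples.
shifted = sqrt_shift_transform([0, 0, 9])
```

Under this reading, a zero becomes sqrt(mean) rather than 0, so zeros disappear for any OTU observed at least once, while an OTU absent from every sample stays at 0.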
