Changing MetaPhlAn2's normalized output


XG Yang

Oct 23, 2019, 11:37:05 AM
to MetaPhlAn-users
As far as I understand, the output of MetaPhlAn2 is normalized taxa abundances, i.e., relative abundances that add up to one. Since the output is compositional (sum = 1), we cannot use many standard statistical tests for downstream analysis (e.g., the Mann-Whitney U test to compare taxa abundances in case/control studies). I know that quantifying absolute taxa abundances is not possible, but I'm wondering if there is any way to circumvent this issue, e.g., by having MetaPhlAn2 output the identified taxa abundances in a non-compositional format?

XG Yang

Nov 6, 2019, 10:45:17 AM
to MetaPhlAn-users
Any feedback on this is greatly appreciated!

Valentina Galata

Nov 7, 2019, 4:19:12 AM
to MetaPhlAn-users
I do the following: I use MetaPhlAn2 to get the estimated counts (using the option "-t rel_ab_w_read_stats") and then use the R package ALDEx2 to apply the CLR transformation to the counts and to compare two sample groups.
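As a rough illustration of the first step, the estimated counts can be pulled out of such a profile before handing them to a downstream tool. This is only a sketch: the column name `estimated_number_of_reads_from_the_clade`, the header layout, and the numbers in `EXAMPLE_PROFILE` are assumptions made for illustration and should be checked against a real output file. `read_estimated_counts` is a hypothetical helper, not part of MetaPhlAn2.

```python
import csv
import io

# Hypothetical excerpt of a profile produced with "-t rel_ab_w_read_stats".
# The header layout is an assumption; the values are made up.
EXAMPLE_PROFILE = (
    "#SampleID\tMetaphlan2_Analysis\n"
    "#clade_name\trelative_abundance\tcoverage\t"
    "average_genome_length_in_the_clade\testimated_number_of_reads_from_the_clade\n"
    "s__Escherichia_coli\t60.0\t1.2\t5000000\t600\n"
    "s__Bacteroides_fragilis\t40.0\t0.8\t5200000\t400\n"
)

COUNT_COLUMN = "estimated_number_of_reads_from_the_clade"  # assumed column name


def read_estimated_counts(handle):
    """Return {clade_name: estimated read count} from a profile-like TSV."""
    counts = {}
    count_index = None
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        if row[0].startswith("#"):
            # The header is the commented line that names the count column.
            if COUNT_COLUMN in row:
                count_index = row.index(COUNT_COLUMN)
            continue
        if count_index is None:
            raise ValueError("count column not found; check the profile header")
        counts[row[0]] = int(float(row[count_index]))
    return counts


counts = read_estimated_counts(io.StringIO(EXAMPLE_PROFILE))
print(counts)
```

Looking the column up by name rather than by position keeps the sketch from silently reading the wrong field if the layout differs between versions.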

XG Yang

Dec 12, 2019, 10:23:57 AM
to MetaPhlAn-users
Thanks Valentina for the feedback (I had missed your response before). My understanding is that count data are not normalized and that ALDEx2 uses the CLR transformation for compositional data. So, I wonder why you suggest using CLR on the count data and not on the normalized (compositional) abundance data.
I see two options here (please correct me if I'm wrong):
1. Use the CLR transformation on the normalized abundances and then use any standard statistical test as needed.
2. Just work with the count data (instead of normalized abundances) and use any standard statistical test as needed.
If these make sense to you, do you think that one of these two options might be more appropriate than the other?

Valentina Galata

Dec 14, 2019, 1:33:06 PM
to MetaPhlAn-users
> Thanks Valentina for the feedback (I had missed your response before). My understanding is that count data are not normalized and that ALDEx2 uses the CLR transformation for compositional data. So, I wonder why you suggest using CLR on the count data and not on the normalized (compositional) abundance data.
 
I am not sure what you mean. ALDEx2 expects counts as input (not relative abundances) for the CLR-transformation (ALDEx2::aldex.clr) and the obtained object can be used to compute effect sizes (ALDEx2::aldex.effect) and perform significance tests (ALDEx2::aldex.ttest).
 
> I see two options here (please correct me if I'm wrong):
> 1. Use the CLR transformation on the normalized abundances and then use any standard statistical test as needed.
> 2. Just work with the count data (instead of normalized abundances) and use any standard statistical test as needed.
> If these make sense to you, do you think that one of these two options might be more appropriate than the other?

I am not sure what you mean by "normalized abundances". Also, you cannot just apply standard statistical tests to compositional count data.

As a note: Besides ALDEx2, I would also like to mention such packages as selbal (https://github.com/UVic-omics/selbal) and propr (https://github.com/tpq/propr) for the analysis of compositional data.

XG Yang

Dec 18, 2019, 12:57:25 PM
to MetaPhlAn-users
Thanks Valentina, and sorry if I wasn't clear before. By "normalized abundance" I actually meant "relative abundance", which is compositional data, and I thought this was the input to ALDEx2.
I do know that standard statistical methods cannot be applied to compositional data; that is the whole purpose of this discussion. My understanding (e.g., from PMID: 29187837) was that one way to handle compositional data is to first apply the CLR transformation and then use standard statistical techniques, since the transformed data are no longer compositional and we work with real numbers (is this correct?).
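A minimal sketch of that two-step idea (CLR first, then a standard test) is shown here in plain Python. The toy counts are made up, the pseudocount of 0.5 is an assumption used only to avoid log(0), and `clr` and `mann_whitney_u` are illustrative helpers. This is not what ALDEx2 does internally: ALDEx2 works with Monte Carlo instances drawn from a Dirichlet distribution rather than a single pseudocount.

```python
import math


def clr(counts, pseudocount=0.5):
    """Centered log-ratio: log of each part minus the mean log of all parts."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


def mann_whitney_u(a, b):
    """U statistic for group a: pairs (x in a, y in b) with x > y; ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


# Toy data (made up): each row is a sample, each column one of three taxa.
cases = [[120, 30, 850], [200, 45, 755], [150, 25, 825]]
controls = [[20, 40, 940], [35, 60, 905], [10, 30, 960]]

taxon = 0  # compare the first taxon between the two groups
case_values = [clr(sample)[taxon] for sample in cases]
control_values = [clr(sample)[taxon] for sample in controls]

u = mann_whitney_u(case_values, control_values)
print(u)  # 9.0 here: every case value exceeds every control value
```

Only the U statistic is computed; a p-value would need the null distribution of U, which a statistics library would normally supply.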
Now, I'm wondering why ALDEx2 applies the CLR transformation to the count data, which are not compositional (i.e., it seems to me that ALDEx2 does not even deal with compositional data).

Valentina Galata

Dec 19, 2019, 2:15:18 AM
to MetaPhlAn-users
First of all, depending on how the data were generated, feature counts should be viewed as compositional. Since a sequencer produces a certain total number of reads, that total is predefined and therefore not informative, which makes such data compositional. Compositional data impose some constraints, which can be reduced using an appropriate transformation, e.g., the CLR transformation. Then, standard statistical approaches can be applied to the transformed data.
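One way to see why the total is uninformative: the CLR values of a sample do not change when all its counts are rescaled, so raw counts, relative abundances, and a deeper sequencing run of the same community give the same result. The sketch below uses made-up counts with no zeros. With zeros, a pseudocount or similar adjustment is needed, and the equality then holds only approximately.

```python
import math


def clr(parts):
    """Centered log-ratio of strictly positive parts."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


counts = [600, 300, 100]               # made-up read counts for three taxa
total = sum(counts)
rel_ab = [c / total for c in counts]   # the same sample as relative abundances
deeper = [10 * c for c in counts]      # the same sample at 10x sequencing depth

clr_counts = clr(counts)
clr_rel_ab = clr(rel_ab)
clr_deeper = clr(deeper)


def close(xs, ys):
    """True if two vectors agree up to floating-point error."""
    return all(math.isclose(x, y, abs_tol=1e-9) for x, y in zip(xs, ys))


print(close(clr_counts, clr_rel_ab), close(clr_counts, clr_deeper))
```

The scaling factor adds the same constant to every log value, and subtracting the mean log removes it again, which is why only the ratios between taxa survive.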

ALDEx2 was designed to work with compositional datasets in the form of feature counts. I do not quite understand why you think that it does not do that.
I would highly recommend that you read this paper: "Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis". It describes very well the nature of compositional data and how ALDEx2 works.

Hope that helps to answer your questions. :)

XG Yang

Dec 19, 2019, 11:31:22 AM
to MetaPhlAn-users
Thanks so much, Valentina, for the comments. I didn’t know that count data are considered compositional as well (similar to relative abundances), and that was the source of my confusion about ALDEx2. This simply shows how shallow my knowledge of next-gen sequencing is! Also, thanks for referring me to that paper. It’s certainly a much-needed read for me!