Metadata groups with several levels

jesse harrison

Nov 21, 2018, 11:40:06 AM
to MaAsLin-users
Dear MaAsLin users,

I am working on a data set that includes metadata groups with several levels (e.g. subgroups A, B and C within a given metadata category).

When I check the output files, I can see that there are taxon abundance shifts associated with levels B and C (with negative coefficients for both groups), but no results are listed for level A. Plotting the raw data shows that the relative abundances (%) are indeed the highest for level A.

Even if this seems logical, I'm a little uncertain about interpreting the model coefficients in a case like this. Let's take a situation where taxon X is associated with Level B and has a coefficient of -0.02. Is this coefficient determined in relation to level A only, or both A and C?

Many thanks for your insights and help with this.

Best wishes,
Jesse Harrison

Himel Mallick

Nov 21, 2018, 8:29:27 PM
to jesse harrison, MaAsLin-users
Hi Jesse,

You are correct about the interpretation. When you have a categorical variable, one of the levels will always be a reference category (MaAsLin determines this alphabetically, which in your case is level A), and the coefficient estimates should always be interpreted in relation to the reference category. For your particular case, the coefficient -0.02 refers to the effect of level B in relation to level A. Hope this makes sense.

Many thanks,
Himel 
--
You received this message because you are subscribed to the Google Groups "MaAsLin-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to maaslin-user...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

jesse harrison

Nov 22, 2018, 4:30:22 AM
to MaAsLin-users
Hi Himel,

Thanks, that clarifies things. Good to know also for future data sets. :-)

Best wishes,
Jesse

XG Yang

Nov 7, 2019, 11:24:12 AM
to MaAsLin-users
Hi Himel,
I'm wondering if it is always the case that we won't observe any associations for the reference level because of the way MaAsLin works. After reading your comment, I noticed in my own results that all the metadata levels for which I don't observe any associated taxa are the ones that should have been used as a reference because they alphabetically come before other levels. For example, for the delivery mode (C-section vs. Vaginal), I don't see any associations for the C-section delivery while MaAsLin has found several taxa positively or negatively associated with Vaginal delivery, which is contrary to my expectation that C-section delivery should be associated with multiple taxa according to literature. This puzzles me on whether no association for the C-section is merely because C-section is used as the reference level ('C' comes before 'V' alphabetically) or if this is just the peculiarity of my data (like I said, I see this pattern for all of my metadata). Also, if the former is true, does "Taxon X is positively associated with Vaginal delivery" indirectly imply that "Taxon X is negatively associated with the C-section delivery" even if MaAsLin doesn't find any association between X and C-section?
Thanks!


Mallick, Himel

Nov 7, 2019, 12:13:51 PM
to XG Yang, MaAsLin-users

Hi XG – I am giving a detailed answer in case anyone else has a similar question in the future.

I am assuming that your categorical metadata ‘delivery mode’ has two levels: C-section and Vaginal. As you rightly guessed, MaAsLin (or R) will typically treat one level (determined alphabetically) as reference and model the other level as a dummy variable. In general, a categorical variable with K (K>=2) categories will have (K-1) dummy variables. 

Coming back to your question on how you should interpret results from a categorical variable, let me first confirm that the reference level will never be included in your output table because, by definition, dummy variables are included only for the non-reference levels. In other words, what MaAsLin (or MaAsLin2) is modeling under the hood is a set of dummy variables (K-1, to be precise, if you have K levels) along with other metadata (if specified).

To give you a specific example, suppose we are interested in a categorical variable that might assume three values - UC, CD, and non-IBD (assuming that the variable denotes the disease status of IBD patients). We could represent this variable with two dummy variables:

X1 = 1, if UC; X1 = 0, otherwise.

X2 = 1, if CD; X2 = 0, otherwise.

In this example, notice that we don't have to create another dummy variable to represent the "non-IBD" category of disease status which becomes redundant once X1 and X2 are defined. If X1 equals zero and X2 equals zero, we know that the patient is neither UC nor CD. Therefore, the patient must be non-IBD. The value of the categorical variable that is not represented explicitly by a dummy variable is called the reference group. In this example, the reference group consists of non-IBD patients.
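The dummy coding described above can be sketched in a few lines of plain Python. This is only an illustration, not MaAsLin's actual code: the function name `dummy_code` and the sample values are made up, and Python's `sorted` (code-point order) stands in for the alphabetical ordering mentioned in the thread, which may not match R's locale-aware ordering. The reference is passed explicitly here so that non-IBD is the reference group, as in the example.

```python
def dummy_code(values, reference=None):
    """Return (reference, columns) with one 0/1 column per non-reference level.

    Hypothetical helper for illustration only. If no reference is given,
    the first level in sorted order is used.
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]
    if reference not in levels:
        raise ValueError(f"unknown reference level: {reference!r}")
    columns = {
        level: [1 if v == level else 0 for v in values]
        for level in levels
        if level != reference
    }
    return reference, columns


# Made-up disease status for five samples.
status = ["UC", "CD", "non-IBD", "CD", "non-IBD"]
ref, cols = dummy_code(status, reference="non-IBD")

print(ref)   # non-IBD
print(cols)  # {'CD': [0, 1, 0, 1, 0], 'UC': [1, 0, 0, 0, 0]}
```

Three levels yield two columns (K-1), and a sample with zeros in both columns belongs to the reference group.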

Let's assume that Y in this case is microbial species abundance and the downstream linear regression models X1 and X2 along with other variables in the model. In modeling notation, the per-feature model boils down to Y ~ X1 + X2 + ...
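Written out, the per-feature model might look like the following. The symbols are assumptions introduced for illustration (the thread only names the coefficients informally): \beta_0 is an intercept, \beta_{UC} and \beta_{CD} are the coefficients of X1 and X2, the dots stand for any other covariates held fixed, and \varepsilon is the error term. Y is on whatever scale the model is fitted.

```latex
\begin{aligned}
Y &= \beta_0 + \beta_{\mathrm{UC}} X_1 + \beta_{\mathrm{CD}} X_2 + \dots + \varepsilon \\
E[Y \mid \text{non-IBD}] &= \beta_0 + \dots \\
E[Y \mid \text{UC}] &= \beta_0 + \beta_{\mathrm{UC}} + \dots \\
E[Y \mid \text{CD}] &= \beta_0 + \beta_{\mathrm{CD}} + \dots
\end{aligned}
```

Subtracting the first expectation from the second gives \beta_{UC} = E[Y | UC] - E[Y | non-IBD] at fixed values of the other covariates, which is why each coefficient is read relative to the reference group only.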

Now, let’s assume that you get a significant positive coefficient corresponding to X1 (say, beta1) and a significant negative one corresponding to X2 (say, negative beta2) for taxa A (for other taxa it can be interpreted in a similar fashion). You can interpret the above in several ways, depending on which group you treat as the baseline of the comparison:

1. After the effects of other covariates are taken into account, the UC patients have beta1 times higher mean abundance of taxa A than the reference group (non-IBD).

2. After the effects of other covariates are taken into account, non-IBD patients have beta1 times lower mean abundance of taxa A than the UC group.

3. After the effects of other covariates are taken into account, the CD group has beta2 times lower mean abundance of taxa A than the reference group (non-IBD).

4. After the effects of other covariates are taken into account, non-IBD patients have beta2 times lower mean abundance of taxa A than the CD group.

As you can notice, 1 and 2 correspond to the same statement with the direction of association flipped, and so do 3 and 4.
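The sign flip can be checked numerically in a simplified setting. The sketch below assumes an ordinary least-squares fit with a single dummy-coded categorical predictor and no other covariates; in that case each fitted coefficient equals the mean of its level minus the mean of the reference level. The data and the function name `coefficients_vs_reference` are made up, and real MaAsLin output will differ once normalization, transformation, or other metadata are involved.

```python
from statistics import mean


def coefficients_vs_reference(abundance, group, reference):
    """Mean of each non-reference level minus the mean of the reference level.

    For a least-squares fit with one dummy-coded categorical predictor and
    no other covariates, these differences equal the fitted coefficients.
    Hypothetical helper for illustration only.
    """
    by_level = {}
    for y, g in zip(abundance, group):
        by_level.setdefault(g, []).append(y)
    ref_mean = mean(by_level[reference])
    return {
        level: mean(ys) - ref_mean
        for level, ys in by_level.items()
        if level != reference
    }


# Made-up relative abundances (%) of one taxon in six samples.
abundance = [10.0, 12.0, 8.0, 9.0, 5.0, 7.0]
group = ["A", "A", "B", "B", "C", "C"]

vs_a = coefficients_vs_reference(abundance, group, reference="A")
vs_b = coefficients_vs_reference(abundance, group, reference="B")

print(vs_a)  # {'B': -2.5, 'C': -5.0}
print(vs_b)  # {'A': 2.5, 'C': -2.5}
```

With A as the reference, the value for B depends only on the samples in A and B, not on C. Switching the reference to B gives A the same magnitude with the opposite sign, and the reference level itself never gets a coefficient.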

All the above is to finally come back to your question: does "Taxon X is positively associated with Vaginal delivery" indirectly imply that "Taxon X is negatively associated with the C-section delivery"? The answer is yes. Based on the above, I would interpret it as “Women who went through a vaginal delivery have a higher mean abundance of Taxon X than women with C-section delivery” or “Women who went through a C-section delivery have a lower mean abundance of Taxon X than women with vaginal delivery”.

Lastly, if you want to change the reference category to “C-section”, you can simply rename these levels such that “C-section” comes first alphabetically. For example, if you rename your levels as C-section = “A” and Vaginal = “B”, C-section will be modeled as the reference category.

I hope this clears the air a bit.

Many thanks,

Himel


XG Yang

Nov 7, 2019, 3:11:54 PM
to MaAsLin-users
Hi Himel,
Thank you very much for this comprehensive answer! It is indeed a great help to better interpret the MaAsLin results. Just two minor things:
- I think there is a typo in 4. I guess what you meant was that ".... non-IBD patients have beta2 times HIGHER mean abundance of taxa A than the CD group."
- As for your last comment "Lastly, if you want to change ...", C-section is taken as the reference by default because alphabetically C (in C-section) comes before V (in Vaginal). So, probably what you meant to say was how to have Vaginal to serve as the reference level.
BTW it would have been nice if MaAsLin could provide a way of specifying the reference level without having to rename metadata values (though it's not really a big deal!).
Thanks!
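The renaming workaround for making Vaginal the reference can be sketched as follows. This is a rough illustration under stated assumptions: `reference_level` is a made-up helper, Python's `sorted` (code-point order) stands in for the alphabetical rule described in the thread, and the prefixed labels are hypothetical.

```python
def reference_level(values):
    """Return the level that sorts first (stand-in for the alphabetical rule)."""
    return sorted(set(values))[0]


# Made-up delivery-mode metadata for four samples.
delivery = ["Vaginal", "C-section", "Vaginal", "C-section"]
print(reference_level(delivery))  # C-section

# Hypothetical labels chosen so that the desired reference sorts first.
rename = {"Vaginal": "A_Vaginal", "C-section": "B_C-section"}
renamed = [rename[v] for v in delivery]
print(reference_level(renamed))  # A_Vaginal
```

Keeping the original name inside the new label makes the output table easier to read than bare "A" and "B".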


Mallick, Himel

Nov 7, 2019, 3:19:50 PM
to XG Yang, MaAsLin-users

Hi XG – thanks for catching the typos. Those are indeed copy-paste errors that I overlooked before clicking the send button 😊

About the possibility of specifying the reference level directly in MaAsLin, you can check out MaAsLin2, which can be run either from the command line or as an R function. Specifically, if you work in an R environment, you might be able to set the reference level using the ‘relevel’ function before running MaAsLin2. I have not tested it myself, but let me know if this approach does not work with the MaAsLin2 R function.

Many thanks again,
