Interpreting LDA scores and significance when comparing 4 different classes

Metasaur

Mar 28, 2016, 4:26:01 PM

to LEfSe-users

Hi All,

So I believe this was discussed before, but even after reading the explanations I was not able to make much sense of what to do in my scenario.

I'm using LEfSe to analyze 4 different classes, which correspond to 4 time points: 2, 21, 42, and 56 days old.

I am not using subclasses, and I set the strategy to one-against-all. I often set the LDA threshold very high (e.g., >4) to get the most significant differences. At any rate, when I get my output I often see significant LDA scores for bacterial groups at 2 days, at 21 days, at 42 days, at 56 days, and so on.

So my question is twofold. First, is there a way to understand why it has generated an LDA score for, say, day 2? Is this particular group supposed to be significantly more abundant at day 2 than at days 21, 42, and 56? Or is it statistically more abundant than just one other time point? For instance, LEfSe says that Bacteroidia has a highly significant LDA score at day 56, but when I look at the median % of sequences for days 2, 21, 42, and 56, the respective values are 0.57, 37.14, 37.04, and 39.39.

It seems likely that, by some calculation, day 56 simply had the highest percentage of sequences. But it doesn't appear to be significantly higher than days 42 and 21. I imagine that day 2 is what drove the difference, but then why was day 56 the only one to show a high LDA score? How would you write this up as a result? "Bacteroidia were differentially abundant at day 56 compared to the rest of the time points"? (And does that just mean it was the most abundant of them all?)

The second part of my question: by not using subclasses, I guess I'm essentially only performing a Kruskal-Wallis test, so does the LDA score really add any value?

Sorry if this is confusing.

Let me know if I can help clarify anything.

Kind regards,

Metasaur

Apr 7, 2016, 3:34:16 PM

to LEfSe-users

Still looking for help here... has anyone tried similar analyses?

jfg

Apr 8, 2016, 11:27:42 AM

to LEfSe-users

Metasaur,

You say you are using the one-against-all method: which day (your 'one') are you comparing against all others ('all')?

If you are using 'one-against-all', then I would assume that your significant points are significantly different in abundance with respect to the 'one' you are comparing them all against (i.e. is Day 2 significantly different from Day 21; is Day 2 significantly different from Day 42; is Day 2 significantly different from Day 56; blah-blah...).

I'd also guess that day 56 in your example reaches the threshold for significant abundance (at the threshold you set..), whereas days 42/21 do not reach that threshold, although they appear numerically close (i.e. 37, 37, 39).

KW w/o subclasses vs. LDA: I've seen this asked before. I think you're right and it's just a standard KW test; don't take my word for it though, give the older posts a lurk.
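To make the KW part concrete, here is a minimal sketch of the Kruskal-Wallis H statistic in plain Python. It is not LEfSe's code, and the per-sample values are made up (only loosely shaped like the medians quoted earlier in the thread); the function name `kruskal_wallis_h` is just for illustration. The 7.815 and 5.991 cutoffs are the chi-square 0.05 critical values for 3 and 2 degrees of freedom, which is only a rough approximation with this few samples.

```python
from itertools import chain


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (mid-ranks for ties, no tie correction)."""
    pooled = sorted(chain.from_iterable(groups))
    n = len(pooled)
    # Assign each distinct value the mean of the ranks it occupies.
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    # Sum over groups of (rank sum squared / group size).
    rank_term = sum(sum(rank_of[x] for x in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)


# Hypothetical relative abundances (%) of one feature, 3 samples per time point.
day2 = [0.4, 0.6, 0.7]
day21 = [35.0, 37.1, 38.2]
day42 = [36.0, 37.0, 38.0]
day56 = [37.5, 39.4, 40.1]

h_all = kruskal_wallis_h([day2, day21, day42, day56])  # all four classes
h_late = kruskal_wallis_h([day21, day42, day56])       # day 2 left out

print(f"all four days: H = {h_all:.2f} (0.05 cutoff, df=3: 7.815)")
print(f"days 21/42/56: H = {h_late:.2f} (0.05 cutoff, df=2: 5.991)")
```

With these made-up numbers the four-class test clears the cutoff, but dropping day 2 leaves nothing significant among the later days. The test on its own says that some class differs, not which one or by how much.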

jfg

Metasaur

Apr 11, 2016, 5:55:17 PM

to LEfSe-users

Good point. OK, I feel like I should have realized that a one-against-all comparison would only compare everything against one group. So how do I go about figuring out which "one" it compares all the others against? I mean, day 2 was the first batch of columns in my data set, so maybe that is the one? Also, when I look at the numbers for each day where it identifies a differentially abundant feature, they are indeed much different from day 2, but not necessarily different from any of the other time points.

This was helpful, thanks!

jfg

Apr 12, 2016, 10:05:57 AM

to LEfSe-users

This thread might help clarify: https://groups.google.com/forum/#!topic/lefse-users/SU2sxI0DIyc

I've never seen this cleared up, but my understanding is that when you use "all v all", the score for each column is how significantly different that sample is from the others (i.e. a vs. b, c, d, e; b vs. a, c, d, e; c vs. a, b, d, e; etc.), for each of the rows.
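If that reading is right, the comparisons would look like the sketch below. It only enumerates the "each class vs. the rest" pairs as described above; it is an assumption about the scheme, not something taken from the LEfSe source, and `one_vs_rest` is a made-up helper name.

```python
def one_vs_rest(classes):
    """Yield (one, rest) pairs: each class against all the others."""
    for c in classes:
        yield c, [other for other in classes if other != c]


days = ["day2", "day21", "day42", "day56"]
comparisons = list(one_vs_rest(days))

for one, rest in comparisons:
    print(f"{one} vs. {', '.join(rest)}")
```

Under this scheme there is one comparison per class, four in total here, rather than one per pair of classes.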

From what you're saying, it sounds like:

your community changes 'significantly' at some point between days 2 and 21 (explaining why day 2 stands out from the others),
and after that it does not change 'significantly' between later days (i.e. that's why you are not seeing significant differences between later days).
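
As a quick sanity check of that reading, the sketch below uses the four Bacteroidia medians quoted in the first post and looks at the change between consecutive time points. It is plain arithmetic on medians, not a significance test, and the variable names are only for illustration.

```python
days = [2, 21, 42, 56]
median_pct = [0.57, 37.14, 37.04, 39.39]  # median % of sequences, from the first post

# Change in median between each pair of consecutive time points.
steps = [
    (days[i], days[i + 1], round(median_pct[i + 1] - median_pct[i], 2))
    for i in range(len(days) - 1)
]
for start, end, delta in steps:
    print(f"day {start} -> day {end}: {delta:+.2f}")

# Time point with the highest median.
top_day = days[median_pct.index(max(median_pct))]
print(f"highest median at day {top_day}")
```

The jump happens between days 2 and 21, and day 56 has the highest median, which is consistent with (though it does not prove) the feature being reported under day 56.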