Normalization in format_input.py seems to impact the final results


Jincheng Wang

Oct 4, 2017, 10:48:39 AM
to LEfSe-users
Hello developers,

I have a question regarding the normalization in the first step, format_input.py. The problem is that different values of the -o parameter change how many features I am able to detect in the end.

My input dataset is proportions; the original counts were rarefied to 10000 reads per sample. When I run LEfSe without any normalization (using the proportions, -o -1), no features are selected in the 2nd step. If I use -o 10000, 15 features are selected, but if I use -o 1000000, all 97 features from the 1st step are selected.

Although I think this should not be the intended behavior, I understand why it happens: a very high normalization parameter artificially inflates the differences between groups, so more features end up exceeding the pre-set effect size threshold.
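
Just to make that concrete, here is a tiny made-up Python illustration of the per-sample rescaling I understand format_input.py to perform when -o is set (divide each feature by the sample total, multiply by the -o value). This is not the actual LEfSe code, and the numbers and feature names are invented:

# made-up illustration of the -o rescaling as I understand it: each feature
# is divided by its sample total and multiplied by the chosen value
samples = {
    "sampleA": {"otu1": 0.02, "otu2": 0.98},  # proportions summing to 1
    "sampleB": {"otu1": 0.04, "otu2": 0.96},
}

def rescale(sample, norm_value):
    total = sum(sample.values())
    return {feat: val / total * norm_value for feat, val in sample.items()}

for norm in (1, 10000, 1000000):
    scaled = {name: rescale(s, norm) for name, s in samples.items()}
    diff = scaled["sampleB"]["otu1"] - scaled["sampleA"]["otu1"]
    # the absolute between-sample difference for otu1 grows linearly with -o:
    # 0.02 at -o 1, 200 at -o 10000, 20000 at -o 1000000
    print("-o %9d: otu1 difference = %g" % (norm, diff))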

But on the other hand, how small should this normalization factor be? For my dataset, using the original proportions does not seem to work. Using 10000 seems good because I get a manageable number of selected features, but I worry the results reflect data manipulation, or at best data manipulation and true biology combined. I wonder if there is a better way to select a parameter that better represents the biology.
Thanks!

Jincheng

jfg

Nov 3, 2017, 9:17:45 AM
to LEfSe-users
Heyo Jincheng et al.,


   Thought maybe it's better to continue this below Jincheng's original post, if only so others can find it later on (based on the subject).

TL;DR: for questions regarding the purpose of transformations and which ones to use, check out the handy micro-eco/stats site GUSTA-ME and their quick page on choosing transformations (with OTUs, we are considering ecologically motivated transformations). Also see this recent-ish paper by Weiss et al. regarding different normalising transformations and differential abundance tests.
 
   Can I also suggest you check (search for, not ask anew!) the QIIME1 user forums for similar discussions? That forum has been a goldmine of info for the community for a long time and I expect this issue has been dealt with repeatedly.  


    I agree that the choice of normalisation is an important issue both generally and in particular for the outcome in LEfSe. The problem is that there is no 'correct' solution to fit all cases, as all datasets are unique. The default normalisation in LEfSe is a functional, robust and transparent way of dealing with uneven sample sizes, but it is not an ultimate solution. As you rightly point out, the magnitude of the transformation affects the outcome, and there are additional problems with compositional transformations like this (see the Weiss paper re: compositional data and alternative methods).
   As such, the method galaxy-LEfSe defaults to is better than raw data, but not ideal.
   However, the severity of any transformation will affect any differential abundance test outcome - it might be better to think of each LEfSe normalisation value (1, 10,000, 1,000,000 etc) as a different normalisation method using the same technique (relative proportions). The one that works best is the one that maximises comparability between samples but distorts the order (magnitude, scale) of your abundances the least.

   As before, Jincheng's question about how to choose the correct transformation is a really good one that I don't have a ready answer for. Often I've resorted to seeing how different transformations balance sampling depth (i.e. ~uniform bar chart) and/or alter ordination grouping of samples according to some known experimental factor (time, site, treatment etc.) rather than sequence depth. I acknowledge that this can be considered data mining - however I don't see any other way to explore one's data in a field that has not settled on a best approach.
   Ideally, transformation methods for comparison/testing of microbiomes should bring all samples into a similar sampling depth/range (min - max), similar variation around the mean (be homoscedastic), and usually remain positive (all observations >0, required by some downstream methods). There are some defensible/popular alternatives to LEfSe's approach, such as the Hellinger transformation (implemented via decostand() in R:vegan) or DESeq2's VST, which should generally be applicable to microbial ecology data (good with sparse data / data with lots of zero abundances).
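
   If it's any use, here is a minimal Python sketch of the Hellinger transformation mentioned above (square root of each feature's relative abundance per sample, the same idea as decostand(x, method = "hellinger") in R:vegan); the count table is invented:

import math

# toy count table: each row is a sample, each column a feature (invented numbers)
counts = {
    "sampleA": [120, 30, 0, 850],
    "sampleB": [40, 10, 5, 945],
}

def hellinger(row):
    # Hellinger transform: square root of each feature's relative abundance
    total = float(sum(row))
    return [math.sqrt(x / total) for x in row]

for name, row in counts.items():
    print(name, [round(v, 3) for v in hellinger(row)])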



Love to be corrected on any of this!
jfg




From the previous conversation on Galaxy v. Docker:
On Tuesday, October 31, 2017 at 7:00:06 PM UTC, Jincheng Wang wrote:
Hi jfg,

As Robert P noted in his latest message, his problem is indeed due to the normalization: in the Docker version of LEfSe the default for -o is no normalization, while on the web server, if you select yes (the default), the normalization is 1M.

To continue this discussion, please forgive me for posting in this thread, since nobody has replied to my message. I think the problem I have seen is not just a debatable issue about normalization. For example, if one has a set of samples with 4000 reads per sample, it is probably wrong to normalize to 1000000, because the current LEfSe algorithm will likely inflate the between-group differences of features and deem them significant.

So my question about what to choose as the normalization factor really means: what is the statistically correct way to choose the normalization factor? I think it is not something to debate; it is something we need to figure out how to do correctly.

Jincheng

On Tuesday, October 31, 2017 at 9:09:54 AM UTC-4, jfg wrote:
Yo chaps,

   This stuff remains outside my scope, but at a guess: 

   TL;DR: very small numbers are difficult for computers to manage reliably, so two different computers are very unlikely to agree on outcomes for very small calculations (such as your/our! normalised abundances).

   If you have some wobble in the strength of biomarkers (score, LDA, 'p' of OTUs), but are seeing the same patterns overall, it could be a difference in the way the different platforms (different computers, different operating systems, different versions of R or LEfSe, or a combination of all of these) are handling the small numbers - when people are being nice about it, this is often referred to as a 'floating point arithmetic problem'. See this short stackoverflow discussion and this homework-for-the-weekend length article. Kind of mental.
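
   If you want a feel for it, a few lines of Python show the kind of thing that goes on (nothing to do with LEfSe specifically):

# classic floating point surprises: not every decimal fraction has an exact
# binary representation, and the order of operations can change the result
a = 0.1 + 0.2
print(a == 0.3)   # False
print(a)          # 0.30000000000000004

vals = [1e-16, 1.0, -1.0]
print(sum(vals))          # 0.0  (the tiny value is swallowed by 1.0)
print(sum(sorted(vals)))  # ~1.1e-16: same numbers, different order, different answer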

   If this sounds like your issue, treat it as an acknowledged problem/limitation in 21st century computing: try to reduce its effects to a level you are comfortable defending (e.g. try lowering the normalisation value, -o), and don't worry about having to solve it on Society's behalf.
   Alternatively: choose your own normalisation method before LEfSe - see the attached paper/presentation on 'Ecological Resemblance' as a starter, or vegan's decostand() function in R - but be prepared to spend the rest of your life worrying about whether or not a 'right' method exists.

   In Leah's link above I think Mr. Wang's question relates to the scale of differences (no normalisation, normalisation to 10,000, normalisation to 1,000,000) his normalisations are causing (are 1 and 2 significantly different / are 10,000 and 20,000 significantly different / are 1,000,000 and 2,000,000 significantly different? etc). 
   His related question about picking the most suitable normalisation method is a really important and open one, but is now frequently circumvented by using normalisation/standardisation/transformation methods worked into the published peer-reviewed software we use, as an intellectual shortcut we happily live with (e.g. the x1000000 relative abundance in LEfSe or the more complex but brilliant DESeq2 which uses a combination of methods to intelligently compensate for your data's distribution). I... have no strong answers  O_o  


Happy Halloween!


Hello all - we have recently started running LEfSe in a docker window and we are getting different results than when we run through Galaxy.  As best we can tell, the analysis parameters are the same, although options exist in the command-line version that do not appear on Galaxy.  We have not changed these from their default settings.  The absolute LDA values differ depending on which platform we use, although they are generally congruent (i.e., the LDA of feature x is greater than that of feature y regardless of platform, and the rank order differs only marginally).  More features with LDA>2 are identified in the Galaxy-based analysis.  Any suggestions as to what we should adjust to get better agreement between the two approaches?
Rob P

Jincheng Wang

Nov 7, 2017, 11:31:57 AM
to LEfSe-users
Hi jfg,

Thanks a lot for the explanations. I agree with most of what you are discussing, but I still have one doubt about the default normalization factor (i.e. 1 M) that I hope I can get an answer to.

Maybe I did not make myself clear enough: my concern is not which normalization method/technique is best, which I agree is debatable and complicated; instead, my concern is whether the "normalization in LEfSe" is "robust". I can understand and accept that different normalization techniques (e.g. TSS vs. CSS, in Joseph Paulson's words) would result in different outcomes. But here we have a problem where using TSS (the same technique) at different scales produces different outcomes, which I think can hardly be called robust. I was trying to understand what causes the issue, and I think this should be a statistical question with a clear answer.

As said earlier, one of my guesses is that normalizing to 1M changes the magnitude of the counts. This should be fine for the univariate analyses (i.e. the KW and Wilcoxon tests) in LEfSe: in my experience, the "Number of significantly discriminative features:  before internal wilcoxon" did not change with a different normalization factor, which supports my point that the issue does not affect the univariate analysis. But I don't have enough knowledge to determine whether it is also fine for the linear discriminant analysis in the final step. This is where I think the developers need to provide an answer.
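
To illustrate why I believe the rank-based tests are unaffected while the class means are not, here is a small made-up check in Python (it assumes scipy is installed; the abundances are invented):

# a rank-based test such as Kruskal-Wallis gives the same p-value after
# multiplying every value by a constant, whereas the raw difference in
# class means scales linearly with that constant
from scipy.stats import kruskal

group1 = [0.010, 0.012, 0.015, 0.011]  # made-up relative abundances
group2 = [0.020, 0.024, 0.019, 0.022]

for scale in (1, 10000, 1000000):
    g1 = [x * scale for x in group1]
    g2 = [x * scale for x in group2]
    stat, p = kruskal(g1, g2)
    mean_diff = sum(g2) / len(g2) - sum(g1) / len(g1)
    print("scale %7d: KW p = %.4f, class-mean difference = %g" % (scale, p, mean_diff))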

Furthermore, I think this is a serious problem with practical implications, because it is not unusual (actually pretty common) for studies to have a sequencing depth lower than 1 M. Say a user inputs a plain abundance table (i.e. no hierarchical taxa aggregation): whether they input percentages or raw counts, the default normalization factor will inflate the counts a lot. And if this inflation does indeed affect the LDA analysis, the results of a lot of publications would be problematic. So I really think the developers should clarify this concern.

Thanks!

Jincheng

jfg

Nov 7, 2017, 12:53:44 PM
to LEfSe-users
Hey Jincheng, 

   I hadn't stopped to consider at which point in LEfSe normalisation might make the biggest difference - it's definitely an important point! From your post above, it certainly sounds like the effect is on the LDA output, but I'm already far enough out on my twig-like understanding of statistics to not start guessing as to why that should be. 

   Again, I think the normalisation method we are discussing is intended as a placeholder, and although the normalisation affects the LDA score, these scores are only calculated on taxa which are already detected to be biomarkers. LDA scores are also relative, thus only meaningful for comparisons within that particular test. 
   However, as you point out, LDA cutoffs decide the final output of the LEfSe program - so in practical terms it matters a lot. What sort of differences are you seeing in LDA score under different normalisation values?

Developer input appreciated! 


 ~As for 'robustness', I meant this more in terms of it being universally applicable and unimpeded by 0/sparse data; it is not 'consistent' (in so far as the same method (i.e. TSS) with different 'Total Sums' produces different outcomes).

Jincheng Wang

Nov 7, 2017, 4:54:26 PM
to jfg, LEfSe-users
Hi jfg,

Thank you for your response. I re-read the methods of the LEfSe paper and found how they calculate the LDA score: "This model is used to estimate their effect sizes, which are obtained by averaging the differences between class means (using unmodified feature values) with the differences between class means along the first linear discriminant axis, which equally weights features' variability and discriminatory power." Apparently, using a larger normalization factor would increase "the differences between class means (using unmodified feature values)". It is not as obvious how normalization would change the class means along the first linear discriminant axis, but I would assume it would increase that difference as well. I remember seeing the equation for the LDA score somewhere, but I can't find it right now.
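
Purely to make that sentence concrete for myself, I sketched the calculation in Python with scikit-learn. This is not the actual LEfSe code (which I believe calls R's MASS lda), and I am only assuming the reported score is roughly the log10 of the averaged difference:

# rough sketch only: average the between-class difference in raw feature
# means with the between-class difference along the first LD axis, then
# take log10; the feature values are simulated and rescaled as -o would do
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
y = np.array([0] * 20 + [1] * 20)
base = np.concatenate([rng.normal(0.010, 0.002, 20),
                       rng.normal(0.014, 0.002, 20)])  # one made-up feature

for norm in (10000, 1000000):
    X = (base * norm).reshape(-1, 1)
    raw_diff = abs(X[y == 1].mean() - X[y == 0].mean())
    axis = LinearDiscriminantAnalysis(n_components=1).fit(X, y).transform(X).ravel()
    lda_diff = abs(axis[y == 1].mean() - axis[y == 0].mean())
    effect = (raw_diff + lda_diff) / 2.0
    # raw_diff grows with the normalization value, lda_diff does not,
    # so the log10 "score" shifts upward with -o
    print("-o %7d: raw diff = %.3g, LD-axis diff = %.3g, log10 score = %.2f"
          % (norm, raw_diff, lda_diff, np.log10(effect)))

If that reading of the method is roughly right, it would explain why a larger -o pushes more features over a fixed LDA cutoff.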

Also, from what you are describing, I just realized that it seems LEfSe uses the KW test and Wilcoxon test only to screen the features, and everything passing that screen is deemed a significantly discriminative feature? Then LDA is just a calculation of "effect size" and a further screen using a threshold? If that is the case, I think using univariate tests to select features might not be a good idea, considering the features themselves are not independent of each other.

The difference I saw on a test dataset is that 101 features were deemed significantly discriminative; if I normalize my data to 10000, the LDA step picks 36 features, but if I use 100000, it picks 97.

Thanks!

Jincheng 

Best regards!
---------------------------------------------
Jincheng Wang, Ph.D.
Postdoctoral fellow,
New York University School of Medicine,
Dr. Dominguez-Bello lab,
VAMC 6009W
423 East 23rd Street
New York, NY 10016
