TL;DR: for questions about the purpose of transformations and which ones to use, check out the handy micro-ecology/stats site GUSTA-ME and its quick page on choosing transformations (with OTUs, we are considering ecologically motivated transformations). Also see this recent-ish paper by Weiss et al. on different normalising transformations and differential abundance tests.
Can I also suggest you check (search for, not ask anew!) the QIIME1 user forums for similar discussions? That forum has been a goldmine of info for the community for a long time and I expect this issue has been dealt with repeatedly.
Hi jfg,

As Robert P noted in his latest message, his problem is indeed due to the normalization: in the Docker version of LEfSe, the default for -o is no normalization, while on the web server, if you select yes (the default), the normalization is 1M.

To continue this discussion, please forgive me for posting in this thread, since nobody has replied to my message. I think the problem I have seen is not just a debatable issue about normalization. For example, if one has a set of samples with 4,000 reads per sample, it is probably wrong to normalize to 1,000,000, because the algorithm LEfSe currently uses is then likely to inflate the between-group differences of any feature and deem them significant.

So my question about what to choose as the normalization factor really means: what is the statistically correct way to choose the normalization factor? I don't think this is something to debate; it is something we need to figure out how to do correctly.

Jincheng
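A tiny sketch of the scaling point above, using made-up counts (not real data): total-sum scaling preserves relative differences, but the absolute gap between groups grows with the normalization total, which is what a rank-based test on the scaled values can end up reacting to.

```python
# Two hypothetical samples with 4,000 reads each; feature counts are invented.
counts_a = [40, 3960]   # sample from group A
counts_b = [80, 3920]   # sample from group B

def normalize(counts, total):
    """Total-sum scaling: rescale counts so they sum to `total`."""
    s = sum(counts)
    return [c * total / s for c in counts]

for total in (1_000, 1_000_000):
    a = normalize(counts_a, total)
    b = normalize(counts_b, total)
    # Relative difference is identical either way (2x), but the absolute
    # gap for the first feature is 10 at 1,000 vs 10,000 at 1,000,000.
    print(total, b[0] - a[0])
```

The ratio between groups never changes; only the magnitude of the gap does, which is why the choice of normalization total looks consequential.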
On Tuesday, October 31, 2017 at 9:09:54 AM UTC-4, jfg wrote:

Yo chaps,

This stuff remains outside my scope, but at a guess:

TL;DR: very small numbers are difficult for computers to handle reliably, so two different computers are very unlikely to agree on the outcomes of very small calculations (such as your/our! normalised abundances).

If you have some wobble in the strength of biomarkers (score, LDA, 'p' of OTUs), but are seeing the same patterns overall, it could be a difference in the way the different platforms (different computers, different operating systems, different versions of R or LEfSe, or a combination of all these) are handling the small numbers. When people are being nice about it, this is often referred to as a 'floating point arithmetic problem'. See this short Stack Overflow discussion and this homework-for-the-weekend-length article. Kind of mental.

If this sounds like your issue, treat it as an acknowledged problem/limitation of 21st-century computing: try to reduce its effects to a level you are comfortable defending (e.g. try lowering the normalisation value, -o), and don't worry about having to solve it on Society's behalf.

Alternatively: choose your own normalisation method before LEfSe. See the attached paper/presentation on 'Ecological Resemblance' as a starter, or vegan's decostand() function in R, but be prepared to spend the rest of your life worrying about whether or not a 'right' method exists.

In Leah's link above, I think Mr. Wang's question relates to the scale of the differences his normalisations are causing (no normalisation, normalisation to 10,000, normalisation to 1,000,000): are 1 and 2 significantly different? Are 10,000 and 20,000 significantly different? Are 1,000,000 and 2,000,000 significantly different? Etc.

His related question about picking the most suitable normalisation method is a really important and open one, but it is now frequently circumvented by using the normalisation/standardisation/transformation methods worked into the published, peer-reviewed software we use, as an intellectual shortcut we happily live with (e.g. the x1,000,000 relative abundance in LEfSe, or the more complex but brilliant DESeq2, which uses a combination of methods to intelligently compensate for your data's distribution). I... have no strong answers O_o

Happy Halloween!
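To make the floating-point point concrete, here's a minimal, generic illustration (not tied to LEfSe internally): decimal fractions like 0.1 have no exact binary representation, and even the order in which tiny values are summed can change the result, so two platforms evaluating "the same" arithmetic in different orders can legitimately disagree at the last few digits.

```python
# 0.1 and 0.2 are not exactly representable in binary floating point,
# so their sum is not exactly 0.3.
print(0.1 + 0.2 == 0.3)  # False

# Summation order matters: adding the tiny values first lets them
# accumulate; adding them after 1.0 loses each one to rounding.
vals = [1e-16] * 10 + [1.0]
print(sum(vals) == sum(reversed(vals)))  # False
```

If the Docker and Galaxy runs differ only in the trailing digits of scores near a threshold (like LDA = 2), this kind of rounding noise is a plausible culprit.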
Hello all - we have recently started running LEfSe in a Docker container and we are getting different results than when we run it through Galaxy. As best we can tell, the analysis parameters are the same, although options exist in the command-line version that do not appear on Galaxy; we have not changed these from their default settings. The absolute LDA values differ depending on which platform we use, although they are generally congruent (i.e., the LDA of feature x is greater than that of feature y regardless of platform, and the rank order differs only marginally). More features with LDA > 2 are identified in the Galaxy-based analysis. Any suggestions as to what we should adjust to get better agreement between the two approaches?

Rob P