Hey Nicola & Co.,
First, thanks for all the maintenance work going into the LEfSe package, and this group.
Second - Happy St Patrick's Day!
Third: I passed an OTU table (eleven subjects, with two classes) through LEfSe. My OTUs are identified by their taxonomic 'path': e.g.
Bacteria|Proteobacteria|Gammaproteobacteria|Pseudomonadales|Moraxellaceae|Psychrobacter,
to genus level where possible. The '|' delimiter ensures (I think) that all OTUs from any shared clade will be weighed together when examining their strength as biomarkers: useful for establishing when e.g. Pseudomonadales is a biomarker for either class.
However, where genus is the last level, this means that all OTUs assigned to the same genus are lumped together and then evaluated as one, despite being different OTUs with (very) different abundances. To illustrate, I've attached a subset where 11 OTUs collapse to the Psychrobacter genus. One of them is fairly abundant, at 5% of all my sequences - but this signal is lost when averaged with the other OTUs sharing the same genus name, as shown in the attached differential plot, psychro.png.
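To make the concern concrete, here is a minimal Python sketch. The counts are made up for illustration (they are not the attached data), and it assumes that rows sharing an identical path are merged by summing per sample, which may not be exactly what LEfSe does internally. The function name collapse_by_path is hypothetical.

```python
# Toy illustration only: hypothetical counts, not the attached dataset.
GENUS = ("Bacteria|Proteobacteria|Gammaproteobacteria|"
         "Pseudomonadales|Moraxellaceae|Psychrobacter")

# Each row: (taxonomic path, counts for 3 class-A samples then 3 class-B samples)
otus = [
    (GENUS, [50, 55, 60, 5, 4, 6]),     # OTU enriched in class A
    (GENUS, [10, 8, 12, 58, 57, 62]),   # different OTU, enriched in class B
]

def collapse_by_path(rows):
    """Sum per-sample counts of all rows that share the same path.

    Assumption: homonymous rows are merged by summation.
    """
    merged = {}
    for path, counts in rows:
        if path in merged:
            merged[path] = [a + b for a, b in zip(merged[path], counts)]
        else:
            merged[path] = list(counts)
    return merged

merged = collapse_by_path(otus)
genus_counts = merged[GENUS]            # [60, 63, 72, 63, 61, 68]
mean_a = sum(genus_counts[:3]) / 3      # 65.0
mean_b = sum(genus_counts[3:]) / 3      # 64.0
print(genus_counts, mean_a, mean_b)
```

Each toy OTU differs strongly between classes on its own, but the merged genus row is nearly flat (class means of 65 vs 64), which is the kind of masking I suspect is happening.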
I understand that LEfSe measures differences between abundances, so 5% would be uninformative if it were evenly spread. However, am I right in thinking that this set-up will obscure the signal of strong marker taxa where many homonyms are present? Could this be amended by LEfSe not combining OTUs when they are the last informative taxonomic level (i.e. not combining leaves)?
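In the meantime, one possible workaround on the input side would be to append a unique final level to every path before running LEfSe, so that no two rows share a leaf name. The sketch below is only an assumption about what might help: the function name make_leaves_unique and the "OTU" prefix are hypothetical, and whether LEfSe then keeps these as separate leaves while still aggregating the shared parent clades is exactly the behaviour I am unsure about.

```python
def make_leaves_unique(paths, sep="|", prefix="OTU"):
    """Append a per-row suffix level so identical paths become distinct leaves.

    Hypothetical helper: 'prefix' plus the 1-based row index is used as the
    new last level; any scheme that yields unique names would do.
    """
    return [f"{path}{sep}{prefix}{i}" for i, path in enumerate(paths, start=1)]

paths = [
    "Bacteria|Proteobacteria|Gammaproteobacteria|Pseudomonadales|Moraxellaceae|Psychrobacter",
    "Bacteria|Proteobacteria|Gammaproteobacteria|Pseudomonadales|Moraxellaceae|Psychrobacter",
]
for unique_path in make_leaves_unique(paths):
    print(unique_path)
```

The two rows then end in ...|Psychrobacter|OTU1 and ...|Psychrobacter|OTU2, so the genus would remain a shared parent rather than a shared leaf.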
Thanks for taking the time to read; I hope I've gotten this straight.
jfg