homonymous OTUs mask LDA Effect Size


jfg

Mar 18, 2015, 2:09:58 PM
to lefse...@googlegroups.com
Hey Nicola & Co.,

   First, thanks for all the maintenance work going into the LEfSe package, and this group.

   Second - Happy St Patrick's Day!

   Third: I passed an OTU table (eleven subjects, with two classes) through LEfSe. My OTUs are identified by their taxonomic 'path': e.g.
Bacteria|Proteobacteria|Gammaproteobacteria|Pseudomonadales|Moraxellaceae|Psychrobacter, 
to genus level where possible. The '|' delimiter ensures (I think) that all OTUs from any shared clade will be weighed together when examining their strength as biomarkers: useful for establishing when e.g. Pseudomonadales is a biomarker for either class.

   However, where genus is the last level, all OTUs assigned to the same genus are lumped together and then evaluated, despite being different OTUs with (very) different abundances. To illustrate, I've attached a subset where 11 OTUs collapse to the Psychrobacter genus. One has a decent abundance at 5% of all my sequences, but this signal is lost when averaged between homonymous genera, as shown in the attached differential plot, psychro.png.

   I understand that LEfSe measures differences between abundances, so 5% would be uninformative if it were evenly spread. However, am I right in thinking that this set-up will obscure the signal of strong marker taxa where many homonyms are present? Could this be amended by LEfSe not combining OTUs when they are the last informative taxonomy (i.e. not combining leaves)?

thanks for taking the time to read, hoping I've gotten this straight.
jfg
0_Bacteria-Proteobacteria-Gammaproteobacteria-Pseudomonadales-Moraxellaceae-Psychrobacter.png
homonym_taxa.txt
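
To make the masking effect concrete, here is a minimal Python sketch with invented counts (not the attached data). It assumes that rows with identical feature names are merged by summing; how LEfSe actually combines them is not confirmed in this thread, and averaging would hide the signal in the same way. `collapse_by_name` is a hypothetical helper, not part of LEfSe.

```python
def collapse_by_name(rows):
    """Merge rows that share a feature name by summing their counts.

    ASSUMPTION: summation stands in for whatever LEfSe does internally
    with duplicated feature names.
    """
    merged = {}
    for name, counts in rows:
        if name in merged:
            merged[name] = [a + b for a, b in zip(merged[name], counts)]
        else:
            merged[name] = list(counts)
    return merged


genus = ("Bacteria|Proteobacteria|Gammaproteobacteria|Pseudomonadales|"
         "Moraxellaceae|Psychrobacter")

# Invented counts; sample order is classA_1, classA_2, classB_1, classB_2.
rows = [
    (genus, [10, 12, 1, 0]),   # OTU enriched in class A
    (genus, [0, 1, 11, 10]),   # different OTU, same name, enriched in class B
]

collapsed = collapse_by_name(rows)
print(collapsed[genus])  # the two opposite signals cancel out after merging
```

Each OTU on its own separates the two classes clearly, but the merged row is nearly flat across all four samples.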

jfg

Mar 19, 2015, 8:07:44 AM
to lefse...@googlegroups.com
   As a workaround, could I use unique IDs for each OTU (e.g. denovo1234), then take this as the significance at the genus level, using it alongside the original (as in first post above) set of named taxonomies? 
   Does LEfSe consider the significance of abundances on a taxon-by-taxon basis (i.e. all phyla tested together, then all classes tested together, then orders...), or all abundances of all taxa taken together? (Or is there perhaps a third strategy?...)

-

   Additionally, if I remove '|', and separate the taxa with (e.g.) '_', as below:
Bacteria_Proteobacteria_Gammaproteobacteria_Pseudomonadales_Moraxellaceae_Psychrobacter

I get an error after Get Data > Upload File (tabular format), when I try step (A) Format Data for LEfSe:

Internal Server Error

Galaxy was unable to successfully complete your request

An error occurred.

This may be an intermittent problem due to load or other unpredictable factors, reloading the page may address the problem.

The error has been logged to our team.

Deleting the '_'-separated dataset and replacing it with a '|'-separated or shorter dataset resolves the issue, so I'm assuming LEfSe is refusing to consider taxonomies that are this long (i.e. it sees the taxonomy as one long cumbersome string).
lefse error.png

jfg

Mar 26, 2015, 7:54:40 AM
to lefse...@googlegroups.com
Hello, 

   Just wondering if there were any thoughts on the effects of homonymous OTUs? Or a clarification on how LEfSe analyses the different taxa levels in relation to each other?

thanks
jfg

Nicola Segata

Mar 26, 2015, 4:17:02 PM
to jfg, lefse...@googlegroups.com
Hi,
 sorry for the delay in replying. You should not use duplicated feature names. The right approach is to attach the OTU ID at the end of each feature (after a "|"). In this way, all the internal levels are computed correctly.
I hope this helps
thanks
Nicola
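
A minimal Python sketch of this renaming, assuming the taxonomy strings are held in a dict keyed by OTU ID. The function name `add_otu_ids` and the `denovo1`/`denovo2` IDs are hypothetical, chosen for illustration only.

```python
def add_otu_ids(taxonomy_by_otu, sep="|"):
    """Return {otu_id: 'taxonomic|path|otu_id'} so every leaf name is unique."""
    unique = {}
    for otu_id, path in taxonomy_by_otu.items():
        # Drop empty levels, e.g. those produced by a trailing separator.
        levels = [level for level in path.split(sep) if level]
        unique[otu_id] = sep.join(levels + [otu_id])
    return unique


psychrobacter = ("Bacteria|Proteobacteria|Gammaproteobacteria|"
                 "Pseudomonadales|Moraxellaceae|Psychrobacter")

# Hypothetical OTU IDs sharing the same genus-level path.
renamed = add_otu_ids({"denovo1": psychrobacter, "denovo2": psychrobacter})
for name in renamed.values():
    print(name)
```

The two OTUs keep their shared path down to genus, so the internal levels can still be aggregated, while the final level makes each feature name distinct.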

jfg

Mar 27, 2015, 6:14:31 AM
to lefse...@googlegroups.com
Cheers Nicola, 

   That makes sense and changes the number of discriminant taxa:
  • number of discriminants LDA above 2.5 with 'Ph|Cl|Or|Fa|Gen' format: 108
  • number of discriminants LDA above 2.5 with 'inflexible' 'denovo#' format: 170
  • number of discriminants LDA above 2.5 with combined 'Ph|Cl|Or|Fa|Gen|denovo#' format, as per your suggestion: 303
This seems partially due to having more levels to analyse, but there are some differences in the markers showing up too. 
thanks again,
jfg
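
For anyone wanting to reproduce counts like these, here is a rough Python sketch. It assumes the LEfSe results file is tab-separated with the LDA score in the fourth column, left empty for non-discriminant features; please check this against your own output before relying on it. `count_discriminants` and the example rows are made up for illustration.

```python
def count_discriminants(res_lines, threshold=2.5):
    """Count features whose LDA score is strictly above `threshold`.

    ASSUMPTION: tab-separated rows with the LDA score in column 4
    (index 3), blank when the feature is not discriminant.
    """
    count = 0
    for line in res_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or not fields[3].strip():
            continue  # no LDA score reported for this feature
        try:
            lda = float(fields[3])
        except ValueError:
            continue  # header line or unexpected content
        if lda > threshold:
            count += 1
    return count


# Invented rows, for illustration only.
example = [
    "Bacteria.Proteobacteria\t4.5\tclassA\t3.2\t0.01",
    "Bacteria.Firmicutes\t4.1\t\t\t-",
    "Bacteria.Actinobacteria\t3.9\tclassB\t2.4\t0.03",
]
print(count_discriminants(example))
```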

am...@usc.edu

Mar 20, 2019, 6:26:21 PM
to LEfSe-users
Hi Jfg and Nicola, 
I have tried to understand this by reading the different conversations in the group, but I am still not sure what the proper way is to input the different classification levels to LEfSe.

I used Mothur for my analysis and my lefse input file has the taxa shown as below 
Bacteria|Firmicutes|Bacilli|Lactobacillales|Lactobacillaceae|Lactobacillus|Otu001|
Bacteria|Bacteria|Bacteria|Bacteria|Bacteria|Bacteria|Otu002|
Bacteria|Cloacimonetes|Candidatus_Cloacamonas|Candidatus_Cloacamonas|Candidatus_Cloacamonas|Candidatus_Cloacamonas|Otu030|

I realized that the figures I get from plotting the LEfSe results and cladogram have repeated LDA scores for the same OTU at different levels. I understand that this is because LEfSe computes scores for each classification level independently. I was wondering if there is any way I can adjust the data to get a more meaningful cladogram? My current solution for the LDA scores is to filter manually, take one score for each OTU, and draw my own diagram.

Also, following up on your previous conversation: are you saying that, in order to get accurate LDA scores at the higher classification levels, we need to add the OTU number or a unique number after each level, as below?

Bacteria|Firmicutes|Firmicutes_Otu0022|Firmicutes_Otu0022|Firmicutes_Otu0022|Firmicutes_Otu0022|Otu022|

Thank you so much for your time. 
Best, 
Yamrot Amha 
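
The manual filtering described above (one score per OTU) could be sketched in Python as follows. This assumes the scores are already loaded as (feature name, LDA score) pairs, that levels are separated by `|` (pass a different `sep` if your results use another separator), and that OTU labels look like `Otu001`. `leaf_scores` is a hypothetical helper and the scores are invented.

```python
import re

# ASSUMPTION: OTU labels follow the "Otu" + digits pattern shown in the post.
OTU_PATTERN = re.compile(r"^Otu\d+$")


def leaf_scores(rows, sep="|"):
    """Keep only the rows whose deepest level is an OTU label."""
    scores = {}
    for feature, lda in rows:
        # Ignore empty levels, e.g. from a trailing separator.
        levels = [level for level in feature.split(sep) if level]
        if levels and OTU_PATTERN.match(levels[-1]):
            scores[levels[-1]] = lda
    return scores


# Invented scores, for illustration only.
rows = [
    ("Bacteria|Firmicutes", 3.1),
    ("Bacteria|Firmicutes|Bacilli|Lactobacillales|Lactobacillaceae|Lactobacillus", 3.4),
    ("Bacteria|Firmicutes|Bacilli|Lactobacillales|Lactobacillaceae|Lactobacillus|Otu001|", 3.4),
]
print(leaf_scores(rows))
```

Only the row ending in an OTU label survives, which gives one score per OTU; the scores for the internal levels are dropped rather than corrected.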