Formatting input file for LEfSe Galaxy


dagmar.s...@gmail.com

Sep 20, 2018, 11:06:10 AM
to LEfSe-users
Hi,
I am wondering why I get different results when I delete some characters after the taxonomy-separating pipes in the input file. The reason I delete those characters is simple: I obtain a summarized taxa chart from Qiime. In this txt file some taxonomic levels contain only the prefix and underscores without further specification (e.g. k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__). Since LEfSe shows only the text after the last pipe, in the results I get only "g__" and obviously I do not know where it belongs.
My problem is that when I remove the last unspecified taxonomy level (e.g. k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae) I obtain different results. I enclose both input files (original and corrected) and both LEfSe results (from the original file and from the corrected file). In the input file I change only the taxonomic text; I do not touch the numerical values at all.
Can someone help me explain this?
Thank you

otu_table.cat_sorted_L6_2.txt
otu_table.cat_sorted_L6_1.txt
Plot_LEfSe_Results_2.png
Plot_LEfSe_Results_1.png

Jamie FitzG

Sep 20, 2018, 11:58:39 AM
to dagmar.s...@gmail.com, LEfSe-users
Heya Dagmar, 

PROBLEM
You are changing the composition of your OTU table by doing as you have explained. For testing, LEfSe combines all the 'c__Actinobacteria' abundances together, all the 'f__Ruminococcaceae' abundances together, all the 'g__Bacillus' abundances together, and all the 'g__' abundances together, to check if any of these taxonomic levels are different between samples. 

The issue is that all the unidentified 'g__'s (and 'p__'s, 'f__'s, 'c__'s etc.) in your first set can be totally unrelated but will be combined by LEfSe, and therefore your analysis involves spurious abundances - it's wrong!
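
To make the collision concrete, here is a toy sketch of the grouping described above. It is not LEfSe's actual code, just an illustration that assumes abundances are summed by the text after the last pipe; the function name sum_by_last_label and the abundance values are made up for the example.

```python
from collections import defaultdict

def sum_by_last_label(rows):
    """Sum abundances by the label after the last pipe (illustration only)."""
    totals = defaultdict(float)
    for lineage, abundance in rows:
        last_label = lineage.split("|")[-1]
        totals[last_label] += abundance
    return dict(totals)

# Made-up abundances; the first two lineages are unrelated but both end in 'g__'.
rows = [
    ("k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__", 0.02),
    ("k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__", 0.05),
    ("k__Bacteria|p__Firmicutes|c__Bacilli|o__Bacillales|f__Bacillaceae|g__Bacillus", 0.10),
]
totals = sum_by_last_label(rows)
for label, total in totals.items():
    print(label, round(total, 4))
```

Under that assumption, the two unrelated 'g__' rows end up in a single bucket, while 'g__Bacillus' stays on its own.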

A similar thing happens in your second example, where you have removed all the 'g__' entries: the families they leave behind are grouped together without a corresponding OTU - closer, but still no good! 

As a result, your outcomes change between runs.


SOLUTION
Add an extra level AFTER g__ which gives each OTU a unique number, e.g. k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__|OTU_1234 . This will allow you to identify the biomarkers (use the table outputs and make your own graphics!) while still including the unknown taxa levels, and will give LEfSe the data it expects. 
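
If you want to script this rather than edit by hand, something like the sketch below might do it. It assumes a tab-separated table whose taxonomy rows start with 'k__' in the first field, and that any other rows (class or sample headers) should pass through unchanged; the helper name tag_lineages and the 'OTU_' prefix are just placeholders, so adjust them to your file.

```python
def tag_lineages(lines, prefix="OTU_"):
    """Append a unique '|OTU_n' level to every taxonomy row (first field starts with 'k__')."""
    tagged_lines = []
    counter = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0].startswith("k__"):
            counter += 1
            fields[0] = f"{fields[0]}|{prefix}{counter}"
        # Header/metadata rows are kept exactly as they were.
        tagged_lines.append("\t".join(fields))
    return tagged_lines

# Tiny made-up table: one header row, two taxonomy rows that both end in 'g__'.
example = [
    "Group\tcontrol\ttreated\n",
    "k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__\t0.02\t0.03\n",
    "k__Bacteria|p__Firmicutes|c__Clostridia|o__Clostridiales|f__Ruminococcaceae|g__\t0.05\t0.01\n",
]
tagged = tag_lineages(example)
print("\n".join(tagged))
```

The numerical columns are not touched; only the first field gains a unique last level, so two rows that used to end in the same bare 'g__' no longer share a final label.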

NOTE: the results you get for unknown taxa (e.g. 'p__', 'c__', 'f__', 'g__' etc.) are still ~meaningless, as you cannot attach an identity to them and (depending on your input) they will come from several unrelated sources. You could also give numbers to these unknowns to separate/partition them, but seeing as they are unknowable it will only tell you so much. 

All the best! 


dagmar.s...@gmail.com

Sep 26, 2018, 4:44:40 AM
to LEfSe-users
Thank you for the clarification!
Dagmar

On Thursday, September 20, 2018 at 17:58:39 UTC+2, jfg wrote: