Mothur formatted Lefse input


Kristjan Oopkaup

Feb 4, 2014, 3:06:51 AM
to lefse...@googlegroups.com
Hi
I formatted my data with Mothur (it recently added a command to do so: http://www.mothur.org/wiki/Make.lefse ), where every feature is an OTU phylotype. But Mothur attaches the OTU ID to every clade, and I think this confuses LEfSe, as it now considers every feature unique (and the LDA score analysis step did not finish). If I understand correctly, when two or more identical clades (for example Proteobacteria) are marked as biomarkers, they are shown as one on the plot (with LDA scores averaged, maybe)? So far I have removed the IDs, but it would be interesting to see which OTUs are statistically important among samples.

Kristjan

Nicola Segata

Feb 8, 2014, 2:45:42 AM
to lefse...@googlegroups.com
Hi Kristjan,
 thanks for getting in touch. I agree that adding the OTU ID at every clade is confusing LEfSe.

The line
Bacteria_Otu001|"Bacteroidetes"_Otu001|"Bacteroidia"_Otu001|"Bacteroidales"_Otu001|"Porphyromonadaceae"_Otu001|unclassified
should be
Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales|Porphyromonadaceae|Otu001

So in addition to removing the OTU IDs from the internal levels, you have to add the ID at the final level of each phylotype.
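The relabelling described above can be sketched in Python. This is only an illustration, not part of mothur or LEfSe: the function name `relabel` is hypothetical, the `_Otu<digits>` suffix pattern is assumed from the single example line, and appending the OTU ID as an extra level when the last level is a real name (rather than "unclassified") is an assumption the thread does not cover.

```python
import re

# Assumed pattern: mothur appends "_Otu" plus digits to each level name.
OTU_SUFFIX = re.compile(r"_(Otu\d+)$")

def relabel(label, sep="|"):
    """Strip the OTU ID from every level and put it at the final level."""
    levels = []
    otu_id = None
    for level in label.split(sep):
        match = OTU_SUFFIX.search(level)
        if match:
            otu_id = match.group(1)
            level = level[:match.start()]
        # Remove the double quotes mothur puts around some names.
        levels.append(level.strip('"'))
    if otu_id is None:
        return sep.join(levels)  # no OTU ID found, nothing to move
    if levels[-1] == "unclassified":
        levels[-1] = otu_id      # as in the example above
    else:
        levels.append(otu_id)    # assumption: keep the name, add the ID below it
    return sep.join(levels)

line = ('Bacteria_Otu001|"Bacteroidetes"_Otu001|"Bacteroidia"_Otu001|'
        '"Bacteroidales"_Otu001|"Porphyromonadaceae"_Otu001|unclassified')
print(relabel(line))
```

Applied to the first line above, this should give the second line above.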

LEfSe computes the abundances of the internal levels (e.g. Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales|Porphyromonadaceae) by summing the abundances of all OTUs at that level (such as Otu001 above) and at lower levels (e.g. OTUs in the Porphyromonas genus).
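A rough Python sketch of that summation, for a single sample. It is not LEfSe's actual code; the function name `sum_internal_levels`, the second OTU and both abundance values are made up for illustration.

```python
from collections import defaultdict

def sum_internal_levels(leaf_abundances, sep="|"):
    """Add each OTU's abundance to every ancestor clade in its label."""
    totals = defaultdict(float)
    for label, abundance in leaf_abundances.items():
        levels = label.split(sep)
        # Every prefix of the label is a clade that contains this OTU.
        for depth in range(1, len(levels) + 1):
            totals[sep.join(levels[:depth])] += abundance
    return dict(totals)

family = "Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales|Porphyromonadaceae"
# Hypothetical abundances for one sample.
leaves = {
    family + "|Otu001": 0.25,
    family + "|Porphyromonas|Otu002": 0.5,
}
totals = sum_internal_levels(leaves)
print(totals[family])                     # both OTUs contribute
print(totals[family + "|Porphyromonas"])  # only Otu002 contributes
```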

I hope this helps,
Nicola

Kristjan Oopkaup

Feb 11, 2014, 3:13:52 AM
to lefse...@googlegroups.com
Thank you, that worked great. One suggestion for the differential features plot: if sample IDs are in the original table, they could be added to the plot.

Nicola Segata

Feb 11, 2014, 8:22:13 AM
to lefse...@googlegroups.com
Yep, that's a good suggestion, it is now on our to-do list!
Thanks
Nicola

O. AlZahal

Nov 18, 2015, 9:55:46 PM
to LEfSe-users
Hello folks, I would greatly appreciate some more clarification on this.
"""
The line
Bacteria_Otu001|"Bacteroidetes"_Otu001|"Bacteroidia"_Otu001|"Bacteroidales"_Otu001|"Porphyromonadaceae"_Otu001|unclassified
should be
Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales|Porphyromonadaceae|Otu001
"""
I understand why Otu001 was removed from each clade, but I don't understand why "unclassified" was removed and replaced with Otu001.
This is how I dealt with the unclassified entries, since they can appear at any taxonomic level.
For:
Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales|Porphyromonadaceae|unclassified
I replaced it with:
Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales|Porphyromonadaceae|g_unclassified
and for:
Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales|unclassified|unclassified
I replaced it with:
Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales|f_unclassified|g_unclassified


I also have another question. Shouldn't there be total lines, i.e.

Bacteria XX
Bacteria|Bacteroidetes XX 
Bacteria|Bacteroidetes|Bacteroidia XX

many many thanks 

OA

jfg

Nov 19, 2015, 5:26:14 AM
to LEfSe-users
OA,

   In your first question, the word 'unclassified' is probably removed because it is largely superfluous when there is no name for that taxonomic level: a level can be understood as unclassified if it lacks a name. It also adds a few more characters to the string, making it that bit more awkward to display and move around.

   'Unclassified' is replaced with the OTU number because the final level in your taxonomy needs to be unique for each OTU. By this I mean
Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales|Porphyromonadaceae|g_unclassified-OTU1110
 or
Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales|f_unclassified|g_OTU437  etc...

   The reason is that every entry with the same label (e.g. 'Bacteria|Unclassified_P|Unclassified_C|Unclassified_O|Unclassified_F|Unclassified_G', of which there may be hundreds or thousands) will be treated as the same OTU, even though they might be totally unrelated!
   Adding the OTU number to the end of the Genus clade is a really good way to keep track of each individual OTU.
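   A small Python sketch of the point about uniqueness: it counts how many OTUs share a label, then appends the OTU number to the final level. The function names, OTU numbers and labels here are hypothetical, and the '-' separator simply follows the first example above.

```python
from collections import Counter

def duplicated_labels(otu_labels):
    """Return the labels that are shared by more than one OTU."""
    counts = Counter(otu_labels.values())
    return {label: n for label, n in counts.items() if n > 1}

def make_unique(otu_labels):
    """Append each OTU's ID to the final level of its label."""
    return {otu: f"{label}-{otu}" for otu, label in otu_labels.items()}

# Hypothetical input: two unrelated OTUs with the same lineage string.
otu_labels = {
    "OTU1110": "Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales|Porphyromonadaceae|g_unclassified",
    "OTU2001": "Bacteria|Bacteroidetes|Bacteroidia|Bacteroidales|Porphyromonadaceae|g_unclassified",
}
print(duplicated_labels(otu_labels))               # one label, shared by 2 OTUs
print(duplicated_labels(make_unique(otu_labels)))  # empty after adding the IDs
```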

   This leads into your second question: LEfSe automatically sums the abundances at each level independently (i.e. at Bacteria, at Bacteroidetes, at Bacteroidia, at Bacteroidales, at Porphyromonadaceae, and at g_unclassified-OTU1110), tests each level for significance, and then tells you which are 'significant'. Because LEfSe splits the clades at the '|' characters and tests each one, you don't need a total line: the machine is clever enough to do it for you (because you were clever enough to do it already..).

   See also this earlier question:


jfg

O. AlZahal

Nov 19, 2015, 11:37:09 AM
to LEfSe-users
Thanks Jfg, it is much clearer now, but what confuses me is the example provided on the LEfSe site, i.e.:
bodysite                                mucosal         mucosal         mucosal         mucosal         mucosal         non_mucosal     non_mucosal     non_mucosal     non_mucosal     non_mucosal
subsite                                 oral            gut             oral            oral            gut             skin            nasal           skin            ear             nasal
id                                      1023            1023            1672            1876            1672            159005010       1023            1023            1023            1672
Bacteria                                0.99999         0.99999         0.999993        0.999989        0.999997        0.999927        0.999977        0.999987        0.999997        0.999993
Bacteria|Actinobacteria                 0.311037        0.000864363     0.00446132      0.0312045       0.000773642     0.359354        0.761108        0.603002        0.95913         0.753688
Bacteria|Bacteroidetes                  0.0689602       0.804293        0.00983343      0.0303561       0.859838        0.0195298       0.0212741       0.145729        0.0115617       0.0114511
Bacteria|Firmicutes                     0.494223        0.173411        0.715345        0.813046        0.124552        0.177961        0.189178        0.188964        0.0226835       0.192665
Bacteria|Proteobacteria                 0.0914284       0.0180378       0.265664        0.109549        0.00941215      0.430869        0.0225884       0.0532684       0.00512034      0.0365453
Bacteria|Firmicutes|Clostridia          0.090041        0.170246        0.00483188      0.0465328       0.122702        0.0402301       0.0460614       0.135201        0.0115835       0.0537381
This example and the example file available for download both have totals. I found that without the totals, the “Relative abundance” on the LDA graph is not correct. Also, without the total lines you can't restrict a cladogram to a clade, i.e. if you input “bacteria.firmicutes” in “Root of the tree”, it won’t work. Please give it a try.

many thanks 
OA

jfg

Nov 24, 2015, 10:27:01 AM
to LEfSe-users
 OA,

   When you say the relative abundance is not correct, do you mean that it does not add up to 1? This could be something to do with the way MOTHUR is outputting data (you are using MOTHUR, yes?) - perhaps taxa with uncertain identity are not included in the list of abundances? If this were the case, there would be a small-to-medium shortfall in the total abundances when summed (depending on how the totals were calculated). I have not used MOTHUR, so my insight is particularly limited.
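   One way to look for such a shortfall is to compare a parent row with the sum of its direct children, sample by sample. A minimal Python sketch, with a hypothetical function name `level_shortfall`, using the first two sample columns of the phylum rows quoted from the LEfSe site example (that excerpt may well omit some phyla, so a gap here proves nothing about MOTHUR):

```python
def level_shortfall(table, parent, sep="|"):
    """Per-sample difference between a parent row and the sum of its direct children."""
    child_depth = parent.count(sep) + 1  # number of separators in a direct child label
    children = [
        row for label, row in table.items()
        if label.startswith(parent + sep) and label.count(sep) == child_depth
    ]
    # zip(*children) walks the table column by column, i.e. sample by sample.
    child_sums = [sum(column) for column in zip(*children)]
    return [p - c for p, c in zip(table[parent], child_sums)]

# First two samples of the rows quoted in this thread.
table = {
    "Bacteria": [0.99999, 0.99999],
    "Bacteria|Actinobacteria": [0.311037, 0.000864363],
    "Bacteria|Bacteroidetes": [0.0689602, 0.804293],
    "Bacteria|Firmicutes": [0.494223, 0.173411],
    "Bacteria|Proteobacteria": [0.0914284, 0.0180378],
    "Bacteria|Firmicutes|Clostridia": [0.090041, 0.170246],
}
print(level_shortfall(table, "Bacteria"))
```

   A value near zero means the children account for the parent; a clearly positive value means something is missing at the child level.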

   Also, when you say 'Root at Firmicutes', do you mean restricting a cladogram of LEfSe-computed biomarkers to display only Firmicutes? I'm not sure where you are accessing the 'root of the tree' option, nor am I sure which cladogram you mean. Also, I don't have access to the resources to experiment with your data set, I'm afraid.


jfg