Re: Lefse taxa names

Curtis Huttenhower

Jan 29, 2018, 1:57:45 PM

to Wylie, Todd, nicola...@unitn.it, Wylie, Kristine, lefse...@googlegroups.com

Thanks Todd - I'm not confident enough personally to answer this without digging into the code a bit, so I'll CC the LEfSe users list to help take a look as well. I _think_ it has to do with a combination of 1) the interpretation of |s as meaning hierarchy and 2) the scrambling of special characters that can occur in the rpy interface, but I'll let the experts answer if possible.

Thanks a bunch -

Curtis

On Mon, Jan 29, 2018 at 1:06 PM, Wylie, Todd <twy...@wustl.edu> wrote:

Looks like one of the email addresses I mailed bounced back; sorry if this email finds you twice. We have quite a bit of analysis wrapped up in Lefse. Any help would be greatly appreciated...

Thanks,

Todd

Todd N. Wylie
Assistant Professor
Department of Pediatrics | Division of Infectious Diseases
McDonnell Genome Institute
Washington University School of Medicine
660 S. Euclid Avenue
Campus Box 8208
St. Louis, MO 63110

314.747.4069 (Pediatrics office)

314.286.1450 (MGI office)
314.286.2895 (FAX)
twy...@wustl.edu

Begin forwarded message:

From: Todd Wylie <twy...@wustl.edu>

Subject: Lefse taxa names

Date: January 23, 2018 at 2:43:25 PM CST

To: nicola...@unitn.it, nse...@hsph.harvard.edu

Cc: Todd Wylie <twy...@wustl.edu>, "Wylie, Kristine" <kwy...@wustl.edu>

Greetings, Nicola:

I have a few questions regarding the naming convention for taxa fields in Lefse input files. I would be very grateful for any guidance you may provide. For the examples outlined below, I've included all of my command line instructions and associated files in the attached zip file.

My taxa names are formatted at a specific level (genus) without any pipes or hierarchical information. As such, I notice I get varying (but reproducible) results depending on the formatting of the taxa names, though class, subject ids, and abundance values all remain the same. For example, the following naming conventions alter LDA results (see attached PDF):

1) taxa names with underscores (e.g. Clostridium_sensu_stricto_g)

2) taxa names with underscores removed (e.g. Clostridiumsensustrictog)

3) taxa names with underscores changed to "q" characters (e.g. Clostridiumqsensuqstrictoqg)

4) taxa names with underscores changed to "x" characters (e.g. Clostridiumxsensuxstrictoxg)

5) taxa names with underscores changed to double underscores (e.g. Clostridium__sensu__stricto__g)

Also, I made a version of taxa names with full lineage (e.g. Bacteria_k|Firmicutes_p|Clostridia_c|Clostridiales_o|Clostridiaceae_1_f|Clostridium_sensu_stricto_g) and results vary from the specific genus-level LDA results.

How does taxa naming convention alter LDA results? What is the best practice?

I'm afraid I'm missing something fundamental on my end... apologies in advance.

Very best,

Todd

PS: I'm running from the command line using https://hub.docker.com/r/biobakery/lefse/ docker version.


Samantha

Feb 13, 2018, 3:04:10 PM

to LEfSe-users

Hello,

I've noticed a similar issue, and am wondering if anyone has found an answer yet? My differences come from removing ";__". For example, I have two files with the taxa represented either as:

1. Bacteria|Acidobacteria|Acidobacteriia|Acidobacteriales|Acidobacteriaceae;__;__

2. Bacteria|Acidobacteria|Acidobacteriia|Acidobacteriales|Acidobacteriaceae

The first run through lefse gives me 24 significant hits, while the second file gives me 31.

I am using command line lefse with these commands:

format_input.py rel_freq_POB_no_bb-lev7-lefse.txt formatted_lefse.in -c 1 -o 1000000
run_lefse.py formatted_lefse.in lefse_out.res
plot_res.py --dpi 300 lefse_out.res LDA_plot_out.png

Regarding Dr. Huttenhower's second hypothesis: are semicolons and underscores considered special characters that need to be avoided when formatting the file? Any assistance with the correct formatting of the file would be appreciated so I know I'm getting the most accurate results!

Thank you,

Samantha

Jamie FitzG

Feb 14, 2018, 5:25:54 AM

to LEfSe-users, samantha...@gmail.com, twy...@wustl.edu, kwy...@wustl.edu, nicola...@unitn.it

Dear Samantha and Todd and lefsers,

Just another Lefse user, but from my own experience, formatting as the following would give different results:

1. Bacteria|Acidobacteria|Acidobacteriia|Acidobacteriales|Acidobacteriaceae

versus

2. Bacteria|Acidobacteria|Acidobacteriia|Acidobacteriales|Acidobacteriaceae|;__|;__

as Lefse understands the extra pipe characters " | " at the end to mean a new taxonomic level. I see however in Samantha's post, there are no extra pipes at the end of her taxonomies so this is unlikely to be the case.
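
To make the point about pipes concrete, here is a rough sketch in Python. It is only an illustration of splitting names on "|", not LEfSe's actual parsing code, and the helper name `taxonomic_levels` is made up for this example:

```python
def taxonomic_levels(name):
    """Split a taxon name on '|', as a hierarchy-aware tool might do."""
    return name.split("|")


plain = "Bacteria|Acidobacteria|Acidobacteriia|Acidobacteriales|Acidobacteriaceae"
piped = plain + "|;__|;__"   # extra pipes: two additional levels appear
stubbed = plain + ";__;__"   # no extra pipes: the stub sticks to the last level

for name in (plain, piped, stubbed):
    levels = taxonomic_levels(name)
    # Print the number of levels and the text of the deepest one.
    print(len(levels), repr(levels[-1]))
```

With extra pipes the name gains two levels; without them the depth stays the same, but the deepest level is no longer called plain 'Acidobacteriaceae'.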

Can I ask how severe the changes in LDA are? And what sorts of 'new' biomarkers are you seeing when you add/remove extra terms/characters (i.e. new significant phyla, significant genera, significant OTUs etc.?) I might also suggest you provide an example of the dataset you're looking at to the LEfSe group, so others can better understand the issue.

Underscores are used widely, so they are not special characters; semicolons are unlikely to be special characters either, but the developers will know better.

jfg

Samantha

Feb 14, 2018, 11:40:55 AM

to LEfSe-users

Hello all,

I'm not sure what you mean by how severe the changes are, but the LDA scores in my first example (with ";__") range from about -4 to 4 and in my second example they range from about -6 to 5. The new biomarkers range across all domains, there are "new" phyla all the way down to "new" species.

I know in general underscores aren't considered special characters but I was out of ideas!

I've attached both relative abundance tables (table 1 has the semicolons and underscores, while table 2 does not) and the corresponding LDA plot outputs.

Thanks,
Samantha

rel_freq_table_1.txt

rel_freq_table_2.txt

LDA_plot_out_2.png

LDA_plot_out.png

jfg

Feb 14, 2018, 12:32:08 PM

to LEfSe-users

By severe, I simply meant how strong is the H score / LDA signal you're seeing: I was wondering if these were minor changes, but 7 to 4 is fairly large!

That said, looking at LDA plot out #1 (the smaller image), there's obviously something amiss as Bacteria are significant markers for both cases (Bacteria_____, Bacteria).

LEfSe sees these as two different taxonomic objects. One is the domain Bacteria, which is inferred from all the bacterial taxa beginning with 'Bacteria|'. The second is a weird, stand-alone entity called 'Bacteria;__;__;__;' (as one word) that has no taxonomic rank and whose abundance is quite different from any pattern relating to the bacterial domain. This is why it is giving you a weird double, differing reading for this taxon. Note how this doesn't happen to 'Unidentified;__;__;__;' as it only appears once in your dataset.

The same applies to your Acidobacteriaceae case above: one is the Acidobacteriaceae family, and another is a weird stand-alone entity.

These weirdos appearing throughout your taxonomy table are changing how LEfSe tests abundances at different taxonomic levels, giving the discrepancies you see.
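
A simplified picture of why this matters, as a Python sketch. This is not LEfSe's code: it only assumes that a hierarchy-aware tool sums each row into every ancestor of its "|"-separated name. The function name `clade_totals` and the read counts are made up for illustration:

```python
from collections import defaultdict


def clade_totals(abundances):
    """Add each row's value to every ancestor prefix of its '|'-separated name."""
    totals = defaultdict(int)
    for name, value in abundances.items():
        levels = name.split("|")
        for depth in range(1, len(levels) + 1):
            totals["|".join(levels[:depth])] += value
    return dict(totals)


# Hypothetical read counts for a single sample.
sample = {
    "Bacteria|Acidobacteria|Acidobacteriia": 30,
    "Bacteria|Firmicutes|Clostridia": 50,
    "Bacteria;__;__;__;__": 20,  # unclassified stub: no pipes, so no parent
}

totals = clade_totals(sample)
print(totals["Bacteria"])              # 80: the domain total misses the stub
print(totals["Bacteria;__;__;__;__"])  # 20: the stub is its own top-level feature
```

The stub never contributes to 'Bacteria', so the two end up as separate features with different abundance patterns.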

I'd guess these taxa with ;__;__;__; were OTUs which could not be identified beyond e.g. Bacteria, Acidobacteriaceae etc. by whatever method you chose (SilvaNGS?), but when you re-formatted for LEfSe you did not account for the way these taxa were entered lacking any appropriate text.

TL;DR: rel_freq_table_2 is more correct! You can improve it by fixing the stubbed taxonomic string ;__;__;__;__ wherever you find it. I usually use something like '...|unknown_family_OTU#|unknown_genus_OTU#' so that you keep each such entry separate and can ID it later.
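
One way to do that relabelling, as a minimal Python sketch. Everything here is an assumption for illustration: the function name `relabel_stubs`, the `otu_id` argument, and the seven-rank list (adjust it to however many levels your table actually has). It also assumes the unclassified levels appear as ";__" stubs after the last known name, as in the tables above:

```python
# Assumed rank order; the first '|'-separated level is treated as the kingdom/domain.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def relabel_stubs(name, otu_id):
    """Replace trailing ';__' stubs with explicit '|unknown_<rank>_<otu_id>' levels."""
    known, *rest = name.split(";")
    levels = known.split("|")
    # Count only the '__' placeholders; a trailing ';' leaves an empty piece we ignore.
    n_stubs = sum(1 for part in rest if part == "__")
    for _ in range(n_stubs):
        depth = len(levels)
        rank = RANKS[depth] if depth < len(RANKS) else f"rank{depth + 1}"
        levels.append(f"unknown_{rank}_{otu_id}")
    return "|".join(levels)


print(relabel_stubs(
    "Bacteria|Acidobacteria|Acidobacteriia|Acidobacteriales|Acidobacteriaceae;__;__",
    "OTU17",
))
```

Names without any stub come back unchanged, and giving each stubbed row its own `otu_id` keeps the unknown entries from being merged into one feature.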

Note this is likely a different problem from Todd's above, where he seems to have removed all of the |'s in his dataset and to be experimenting with different characters at the genus level.

Samantha

Feb 14, 2018, 4:47:09 PM

to LEfSe-users

Hi,

Thanks so much for clarifying! I did understand that Bacteria;__;__;__;__ was an instance where the OTU couldn't be identified beyond the bacterial domain, but I didn't realize that LEfSe looked at all of the taxonomic levels like that. I was under the impression that it looked for differences using whatever was after the last bar. Therefore, I had been making separate tables for each domain when I wanted to see how they changed. I assigned taxonomy using Greengenes in QIIME 2.

I will definitely be adding the "unknown" tags that you mentioned!!

Thanks!

Samantha
