Re: Humann2 Pathway Analysis Question

Curtis Huttenhower

unread,

Jun 22, 2015, 1:44:33 PM6/22/15

to Battaglia, Thomas, humann...@googlegroups.com, Blaser, Martin

Hi Thomas - thanks for getting in touch, and I wanted to pass this along to the HUMAnN developers, who should be able to help out in more detail. HUMAnN2 folks, I spoke with Marty about this a bit in NYC and recommended ptools for visualization of the MetaCyc output, but we need to make sure the structured pathway abundances are pinned down first (ASAP!!) and provide some recommendations for filtering and using the output with MaAsLin (given how big the per-bug abundances are, as per Thomas' email).

Many thanks -

Curtis

On Mon, Jun 22, 2015 at 11:56 AM, Battaglia, Thomas <Thomas.B...@nyumc.org> wrote:

Hello Dr. Huttenhower,
I am a researcher at Blaser lab and I am working on meta-transcriptomic data from cecal contents of the hibernating 13-lined ground squirrel. Humann2 has shown to be extremely useful in the microbial functional profiling of this data. It is a very user-friendly and powerful program, but I have just a few questions about the data.

Firstly, is there any particular way you would recommend performing statical analysis on the data? I see the gene family data is in RPK/CPM normalized values, so I cannot use DESeq2, and the file was much too large for LEfSe. Are the pathway abundance values in more testable format?

Secondly, Dr. Blaser mentioned a program one can use for pathway analysis on the output files. I know the pathway abundance and converage files are calculated using metacyc and the gene families data. I have also regrouped and renamed the data into GO ontology and KEGG orthology terms.
Is there a particular tool or pathway database you recommend to handle the enormous amount of data generated from humann2?

Thank you very much for your time,

Thomas W. Battaglia

Blaser Lab

6026 W VAMC

423 East 23rd Street

New York, NY 10010

Eric Franzosa

unread,

Jun 22, 2015, 4:05:14 PM6/22/15

to humann...@googlegroups.com, Battaglia, Thomas

Hi Thomas,

I've been meaning to write an entry in the HUMAnN2 manual on downstream analysis, so your email gives me a chance to draft out some of those ideas!

* As you found, there are _a lot_ of gene families in the output -- too many to test en masse, probably even with aggressive filtering. The full table is provided for the sake of being comprehensive and to help with grouping genes into different pathway systems (e.g. metacyc or uniprot's own pathway definitions). You can also use the gene table for strain-level analysis (as described in the manual) or to pull out specific genes of interest for testing. Most folks will want to do their testing at the pathway level however.

* Even after grouping uniref gene families into reactions or pathways, the stratified output can still be quite large if your community contained a lot of different species. For this reason, we recommend doing your testing on the community totals, and then digging into the stratified output to understand the mechanisms underlying significant functional changes. For example, given that the abundance of pathway A changed significantly, was it because a particular A-contributing species expanded, or because new A-contributing species appeared (including potential "unclassified" species)? The only problem with this approach is that it fails to capture situations where community-level function was constant but individual species' contributions changed. However, I would argue that such situations are better described as a change in community composition, which one can assay directly by analyzing the MetaPhlAn profiles from the samples.

* After collapsing to pathways and considering community totals, you may still have a lot of pathways to analyze. In this case, we recommend prioritizing based on a combination of pathway (i) mean/median abundance and (ii) variance. For example, you might first select pathways with median abundance in the top ~50% to exclude rare pathways, and then among those select pathways with variance in the top ~25% to exclude housekeeping functions. Other selection schemes with a similar spirit (removing rare/invariant pathways) would also be fair game. The goal here is mostly to maintain statistical power, so if you have more samples you can be less stringent with your filtering.

In the future we can add in another utility script to help with these trimming procedures.

Thanks,

Eric

Tom Battaglia

unread,

Jul 21, 2015, 6:00:29 PM7/21/15

to HUMAnN Users, Thomas.B...@nyumc.org

Hey Eric, thanks for getting back to me.

Since your response, I have tried a different variations of data filtering. Using the approach you recommended, I subsetted by top 50% mean and top 25% highest variance. This brought the number of gene families and pathways down to a manageable number. But I wasn't quite sure what would be a correct significant test. Would LEfSe be appropriate for this type of data? I saw in your paper, "Relating the metagenomic and metatranscriptome of the human gut" that you filtered the functional KO's by 0.01% abundance and then performed an arcsine transformation and Witten-Bell smoothing on the relative abundances. Is this something you recommend doing before any significant testing?

Also. I have kept an eye on biobakery for changes to Humann2 and I have recently re-ran the data using version 0.2.0 which includes the metacyc structured and filtered pathway database. This drastically brought down the number of pathways and gene families for each sample. Now the total number of pathways across all samples are 77 (w/ taxonomy), down from 10,516. Is this due to the increased accuracy of the pathways?

And one last question. Do you know if it is at all possible to incorporate QIIME 16s metagenomic data for the "--taxonomic-profile" option? I know metaphlan uses the genus/specific identifier as the output, so I'm not sure if it is being recognized by humann2 because of the OTU table format of Qiime that lists each hierarchical level.

Thank you very much for all of you help and your time!

-TB

Eric Franzosa

unread,

Jul 22, 2015, 1:59:04 PM7/22/15

to humann...@googlegroups.com

Hi Tom,

* LEfSe should work well for this. Assuming you're just interested in 2 or more major classes (and not subclasses) the first stage of LEfSe is a Kruskal-Wallis (non-parametric ANOVA) which you could try independent of LEfSe.

* In the paper you're referring to, the 0.01% filter was another way of trimming down the list of features to test. I don't think you need to do this IN ADDITION to the filter you already did. The arcsin sqrt transform applies to relative abundance (sum=1) data and helps to variance stabilize such data for tests that assume normality. In that case we were doing a traditional two-way ANOVA, so this transform was helpful, but it won't be needed for a non-parametric test. Lastly, the Witten-Bell smoothing was necessary for normalizing RNA abundance by DNA abundance (it avoids divide-by-zero errors). If you're not performing this form of normalization then you won't need the Witten-Bell smoothing.

* We recently changed HUMAnN2 to be a lot more conservative about pathway calls. We have implemented structured versions of all the pathways (similar to KEGG modules) and now require a complete "path" through the pathway to be satisfied to call it as present. For example, a pathway might be structured like A->(B or C)->D, where A,B,C,D are enzymes. If A, B, and D were detected OR A, C, and D were detected, then HUMAnN2 would consider this pathway "satisfied." If A or D were zero, or both B and C were zero, the pathway could not be satisfied. The latest HUMAnN2 commit relaxes this procedure a bit to avoid penalizing reactions that can't be quantified (because they have no associated genes). You will probably get more pathways if you try this latest version.

* I'm a little confused about the QIIME/16S question - since HUMAnN2 is designed for profiling a metagenome, I would strongly recommend doing the taxonomic profiling directly from the metagenome. That said, you can give HUMAnN2 any taxonomic profile you like provided that it conforms to the format spelled out in the manual. :-)

Thanks,

Eric

Tom Battaglia

unread,

Jul 23, 2015, 11:14:44 AM7/23/15

to HUMAnN Users, fran...@hsph.harvard.edu

Hey Eric,

Thanks for the suggestions! Would I need to generate per-sample relative abundances, because the RPK data is considered "raw" count data, before analyzing the data in LEfSe? And I will pull the latest commit and check out the output for the difference in structured pathways!

When I attempted the use a custom taxonomic input from QIIME's 16s metagenome data, I get the error: CRITICAL ERROR: The directory provided for ChocoPhlAn does not contain files of the expected format (ie '^[g__][s__]'). To generate my abundance table I used summarize_taxa.py to generate relative abundances for each samples and then multiple by to get percentages. But I see that the taxonomy strings from greengenes are not as specific as MetaPhlAn2's, as seen below. The documentation states that the taxonomic profile file must, "match the naming convention used by the full ChocoPhlAn database." I think the names like Other|Other and g__|s__, are affecting the way the taxonomy is interpreted. Would I need to remove every instance the shows an unknown genus or species and Other?

k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae|g__Actinomyces|s__    0    0    0    0.016202203    0    0
k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Corynebacteriaceae|g__Corynebacterium|s__    0    0    0.086956522    0.064808814    0    0
k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Coriobacteriaceae|Other|Other    0.033852404    0.02257846    0    0.113415424    0.0578369    0.114442664
k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Coriobacteriaceae|g__|s__    0.059241706    0.293519982    0.724637681    0.226830849    0.0578369    0.251773861
k__Bacteria|p__Actinobacteria|c__Coriobacteriia|o__Coriobacteriales|f__Coriobacteriaceae|g__Adlercreutzia|s__    0.423155044    0.112892301    0.376811594    0.858716785    0.212068633    0.64087892
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|Other|Other|Other    0    0    0    0    0    0
k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|f__|g__|s__    0    0    0    0    0    0

Thank you very much!

O. AlZahal

unread,

Oct 28, 2015, 2:27:58 PM10/28/15

to HUMAnN Users, Thomas.B...@nyumc.org

Thanks Eric for the comprehensive answer. If i understand this, you can prioritize by filtering the medians/means and variances and this is done by gene/pathway across samples. To do that, i need to combine the abundances from all 8 samples into one file by gene/pathway. Can HUMAnN2 do that?

many thanks

OA

On Monday, June 22, 2015 at 4:05:14 PM UTC-4, Eric Franzosa wrote:

Eric Franzosa

unread,

Oct 28, 2015, 3:58:05 PM10/28/15

to humann...@googlegroups.com

Yep, we have a join_tables script that can do exactly that:

http://huttenhower.sph.harvard.edu/humann2/manual#markdown-header-2-join-tables

We don't have tools for the different filtering options, so that's something you would need to experiment with separately.

Thanks,

Eric

O. AlZahal

unread,

Nov 5, 2015, 3:52:05 PM11/5/15

to HUMAnN Users

hello Eric, i bringing this back to ask related questions. I would greatly appreciate the clarification.

If i decided to follow the above filter (i.e. 50% top mean and 20% top variance) and follow that with either parametric or non-paramteric stats, do i need to renormalize that data to sum back to 1. If yes, can i add instead "others" group to do that (i.e. if the filtered sum to 0.8, can i add "others"=i.2).

My other question is about the best way to filter stratified data (i.e. function by bug). Do i do that by per function or just indiscriminately.

I am afraid at some point that a given total row would make it through,but none of the stratified row that belong it would make it.

many thanks

OA

Eric Franzosa

unread,

Nov 5, 2015, 4:52:32 PM11/5/15

to humann...@googlegroups.com

* No need to renormalize after feature selection: you would just do your tests on the rows you selected as potentially interesting.

* With HUMAnN2, I recommend doing statistical tests on the community totals only. When you find a significantly different feature, you can then examine the stratifications of that feature to determine which bugs are driving it. This will improve your power quite a bit relative to testing every stratification separately.

Thanks,

Eric

Ousama AlZahal

unread,

Nov 5, 2015, 8:19:54 PM11/5/15

to humann...@googlegroups.com

Many thanks Eric, as usual great answers. I agree, stats should be done on the totals. But I am wondering, if i were to use LEfSe (as the n0n-parametric option) on the unstratified file (i.e. totals), would it still be possible to create a Cladogram?

many thanks

OA

Silvia Raineri

unread,

Jul 5, 2017, 4:53:14 AM7/5/17

to HUMAnN Users

Hello,

I am trying to analyse my Humann2 results, and after reading a bit the discussions in this forum, I split my genefamilies, pathwayabundance and pathway coverage tables into stratified and unstratified. I then normalized the unstratified tables to both relative abundances (for further analysis on Lefse) and copies per million. I have started the analysis from the pathway abundance, unstratifed table normalized with relative abundances. As Eric suggested a couple of posts above, I was trying to calculate the median abundance for each pathway, in order to select those in the top 50%. My problem is now that in my table, if I were to filter out pathways with less than 50% median abundance I would only have the unclassified and unmapped left. Should I try a different approach, like transforming to log10 (I did this for the heatmap on the Metaphlan2 results)? Should I just remove the unclassified and unmapped rows and do the analysis on all the other rows?

Kind regards,

Silvia

Eric Franzosa

unread,

Jul 7, 2017, 1:01:35 PM7/7/17

to humann...@googlegroups.com

Hi Silvia,

50% relative abundance is a very big value for filtering. For pathways, I would probably filter based on something like >0.01% abundance in at least N samples. N could be 2 for a small dataset (i.e. only consider pathways seen at least twice) or a percentage of your samples (e.g. 10% or 25%) if you have a larger dataset.