Re: Normalization with HUMAnN

Curtis Huttenhower

Apr 29, 2015, 7:06:55 PM

to Clémence Defois, humann...@googlegroups.com

Hi HUMAnN folks - I wanted to forward this along, any input for Clemence?

Thanks a bunch -

Curtis

On Tue, Apr 28, 2015 at 5:32 AM Clémence Defois <clemenc...@live.fr> wrote:

Dear Curtis Huttenhower

I am a French PhD student and I am currently working on metatranscriptomic data with HUMAnN.

I have a question about normalization. I have 8 samples, and each sample has a different number of reads. For example, after the SortMeRNA step there are 25000 sequences for one sample and 75000 for another.

Do I have to normalize the number of sequences (for example, to 25000 reads) before running BLAST against the KEGG DB? Or does HUMAnN take this bias into account?

Thank you

Best regards

Clémence Defois - Doctorat Université de Médecine

clemenc...@live.fr

EA 4678 CIDAM, CBRV 5ème étage
28 Place Henri Dunant
63001 Clermont-Ferrand
tél: +33 (0)4 73 17 83 09

Eric Franzosa

Apr 30, 2015, 12:47:40 PM

to humann...@googlegroups.com

Hi Clémence,

There's no need to down-sample your reads to a constant depth. You can map all quality reads in all samples against the KEGG database. HUMAnN will internally sum-normalize your samples to make them comparable. Specifically, the 01b file contains sum-normalized gene family abundance and the 04b file contains sum-normalized pathway abundance.

Thanks,

Eric

Patrick

Sep 14, 2015, 4:56:01 PM

to HUMAnN Users

Hello,

Can you explain what sum normalization is? Is it the same as total sum scaling normalization?

Eric Franzosa

Sep 14, 2015, 5:08:47 PM

to humann...@googlegroups.com

Hi Patrick,

Sorry for not being more specific. HUMAnN uses Total Sum Scaling (TSS) to normalize for differences in read depth across samples, i.e. within each sample, each raw value is divided by the sum of all raw values in that sample, so that the transformed values sum to 1.
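
As a minimal illustration of total sum scaling (a sketch only, not HUMAnN's internal code; the function name `total_sum_scale` and the per-feature counts are made up for this example), two samples with the same composition but different depths, 25,000 and 75,000 reads as in the original question, end up with the same normalized profile:

```python
def total_sum_scale(values, target_sum=1.0):
    """Divide each raw value by the sample total so the results sum to target_sum."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize a sample whose values sum to zero")
    return [v * target_sum / total for v in values]


# Hypothetical raw abundances for three features in two samples.
sample_a = [10000, 10000, 5000]   # 25,000 reads in total
sample_b = [30000, 30000, 15000]  # 75,000 reads in total, same proportions

relab_a = total_sum_scale(sample_a)
relab_b = total_sum_scale(sample_b)

# After scaling, read depth no longer distinguishes the two samples.
print(relab_a)
print(relab_b)
```

Passing `target_sum=1e6` instead would give values that sum to one million rather than 1.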

Thanks,

Eric

helenal...@googlemail.com

May 6, 2016, 9:32:12 AM

to HUMAnN Users

Hi Eric and Patrick,

I found this conversation on normalization very useful - thank you.

I was wondering whether you advise using the arcsine square-root transformation on HUMAnN2-normalized data, as you suggest for MetaPhlAn relative abundance data (for example, in the MaAsLin methodology)?

Thank you,

Best wishes,

Helen

Eric Franzosa

May 6, 2016, 5:31:23 PM

to humann...@googlegroups.com

Hi Helen,

I haven't specifically validated the arcsin sqrt transformation in the context of HUMAnN2 data, but it's generally applicable to compositional (total sum normalized) data, so it should be OK to use. In the past I used this transformation to perform two-way ANOVA using KOs from HUMAnN1 and it worked nicely.
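
For reference, the arcsine square-root transformation maps a proportion p in [0, 1] to asin(sqrt(p)). A minimal sketch, assuming the input has already been total-sum scaled to proportions (the helper name `arcsine_sqrt` and the example profile are hypothetical, not part of HUMAnN2 or MaAsLin):

```python
import math


def arcsine_sqrt(p):
    """Arcsine square-root transform of a proportion in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("expected a proportion between 0 and 1")
    return math.asin(math.sqrt(p))


# Hypothetical relative abundances (sum = 1) for one sample.
relab = [0.25, 0.25, 0.5]
transformed = [arcsine_sqrt(p) for p in relab]
print(transformed)
```

Values in CPM units (sum = 1e6) would need to be divided by 1e6 first, since the transform is only defined on [0, 1].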

Thanks,

Eric

clémence defois

Nov 3, 2016, 12:04:35 PM

to HUMAnN Users

Hello,

I am coming back to the normalization point. HUMAnN2 gives us gene family/pathway abundances in RPKM or TPM/CPM values, although these normalization methods raise questions for differential expression analysis across multiple samples.

They are accepted for comparing the abundances of transcripts within the same sample, but not for comparing a transcript's abundance across multiple samples.

Normalization methods such as TMM or DESeq seem to be considered more appropriate by the community. Do you have any comments on this normalization issue? Is it really not recommended to compare multiple samples with RPKM or TPM values?

Thanks

Clemence

Eric Franzosa

Nov 3, 2016, 11:00:21 PM

to humann...@googlegroups.com

Hi Clemence,

Some comments:

1) HUMAnN2 abundance outputs are in RPK (reads per kilobase) units. That is, we normalize raw hits by the (alignable) length of gene sequences, which accounts for the fact that longer sequences recruit more reads.

2) The tool script humann2_renorm_table performs total-sum scaling (TSS) normalization. This can output traditional relative abundance units (sum=1) or the sometimes-more-convenient "copies per million" (CPM) units (sum=1e6). The latter form is equivalent to the idea of TPMs in RNAseq. (NOTE: CPM is sometimes used to refer to "counts per million" -- i.e. counts normalized by sequencing depth but not gene length. HUMAnN2 CPMs are always normalized by gene length.)

3) HUMAnN2 does not interact with RPKM units. Some recent work in the RNAseq field has demonstrated that a sample's underlying transcript length distribution can affect RPKM/FPKM units in a non-ideal way, and so TPMs are becoming the preferred measure there. We've adopted the equivalent CPM units for HUMAnN2.

4) Comparing relative abundance/CPMs/TPMs across samples is fairly common practice. While this approach is not without issues, I don't think there is a consensus in the microbiome field that another method is strictly better, though alternatives have certainly been proposed (including CSS, DESeq-style normalization, and genome-size normalization).
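
To make the units in points 1) and 2) concrete, here is a rough sketch of the arithmetic, not the HUMAnN2 implementation. The gene names, hit counts, and lengths are invented, and plain gene length stands in for the alignable length that HUMAnN2 uses. It also shows how length-normalized "copies per million" can differ from depth-only "counts per million":

```python
def reads_per_kilobase(hits, length_bp):
    """Length-normalize a raw hit count: hits per 1,000 bp of gene length."""
    return hits / (length_bp / 1000.0)


def scale_to_million(values):
    """Total-sum scale a list of values so that they sum to 1e6."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize values that sum to zero")
    return [v / total * 1e6 for v in values]


# Hypothetical genes: name -> (raw hits, gene length in bp).
genes = {"geneA": (100, 1000), "geneB": (100, 2000), "geneC": (50, 500)}

# Step 1: RPK, i.e. hits divided by gene length in kilobases.
rpk = {name: reads_per_kilobase(hits, length) for name, (hits, length) in genes.items()}

# Step 2: total-sum scale the RPK values to sum to 1e6 ("copies per million").
copies_per_million = dict(zip(rpk, scale_to_million(list(rpk.values()))))

# For contrast: depth-only "counts per million", which ignores gene length.
raw_hits = [hits for hits, _ in genes.values()]
counts_per_million = dict(zip(genes, scale_to_million(raw_hits)))

# geneA and geneC tie after length normalization; geneA and geneB tie without it.
print(rpk)
print(copies_per_million)
print(counts_per_million)
```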
