providing raw counts


Nicholas Youngblut

unread,
Dec 17, 2017, 2:21:26 AM12/17/17
to HUMAnN Users
This has been touched upon in other topics, but I want to propose it directly: it would be helpful to have an option to write the humann2 output tables as raw counts, without normalization.

While RPK or TSS normalization does make sense in many situations, other situations may not warrant its usage. Moreover, differential abundance detection software such as EdgeR and DESeq2 assume/require raw counts. Therefore, a researcher who wants to use tools such as EdgeR and DESeq2 cannot directly use humann2 output for this purpose. Instead, the user must either: 1) try to back-calculate raw counts, 2) abandon humann2 altogether and instead use a simpler read-mapping-to-database approach, or 3) try to modify the humann2 code to bypass the normalization step(s).

I haven't assessed the humann2 code thoroughly, but naively it seems like a rather small change to the code is needed in order to provide raw counts. More specifically, why not write out the raw count tables prior to the normalization, or, if that's not possible, just skip the normalization step (this may require changing the code to handle integer values versus floats, but it shouldn't be a huge change)? Again, this is a naive assumption about how the code for humann2 was written.
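
For illustration, back-calculating approximate counts from RPK values is mostly arithmetic, assuming the gene lengths used for normalization are known. This is a hypothetical sketch (the gene names, RPK values, and lengths are invented, and the resulting counts would still be non-integer because humann2 splits read weights across hits):

```python
# Hypothetical sketch: back-calculating approximate read counts from
# RPK (reads per kilobase) values, given the gene lengths (bp) that
# were used for normalization. All names and numbers are illustrative.

def rpk_to_counts(rpk_table, gene_lengths_bp):
    """Approximate raw counts as RPK * (gene_length_bp / 1000).

    The result is generally non-integer, since humann2 distributes
    fractional read weights over database sequences.
    """
    counts = {}
    for gene, rpk in rpk_table.items():
        length_kb = gene_lengths_bp[gene] / 1000.0
        counts[gene] = rpk * length_kb
    return counts

rpk_table = {"geneA": 50.0, "geneB": 200.0}      # illustrative RPK values
gene_lengths_bp = {"geneA": 2000, "geneB": 500}  # illustrative lengths (bp)

print(rpk_to_counts(rpk_table, gene_lengths_bp))
# geneA: 50 * 2.0 kb = 100.0; geneB: 200 * 0.5 kb = 100.0
```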

If implemented, I believe humann2 would become more attractive to researchers who either 1) want to experiment with EdgeR, DESeq2, or similar software for differential feature abundance detection, or 2) need to use EdgeR, DESeq2, etc. because a reviewer wants to see the results, or to directly compare the analysis to a previous analysis or to work from another study.

Eric Franzosa

unread,
Dec 18, 2017, 11:36:47 AM12/18/17
to humann...@googlegroups.com
Hi Nicholas,

Thanks for your suggestion. Adding an option to _not_ normalize by database sequence length is very possible. However, based on the way HUMAnN2 works, the output would _still_ not take the form of integer counts, as HUMAnN2 divides read weights over the database sequences they hit in proportion to alignment quality. Demonstrating that this approach is more accurate than simply taking the best hit was one of the main contributions of HUMAnN1. Do you think these non-integer raw counts would still work for your purposes? It sounds like recent implementations of EdgeR and DESeq2 _can_ operate on weighted (i.e. non-integer) counts, whereas some earlier versions could not.
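
The weight-splitting Eric describes can be sketched with toy numbers: one read's unit weight is divided over the database sequences it hits, in proportion to an alignment-quality score. This is only an illustration of the idea (the sequence IDs and scores are made up, and the actual HUMAnN2 weighting scheme is more involved):

```python
# Minimal sketch of dividing a read's weight over its database hits
# in proportion to alignment quality. Scores and IDs are invented;
# this is not the actual HUMAnN2 implementation.

def split_read_weight(hit_scores):
    """Distribute one read's unit weight across hits by relative score."""
    total = sum(hit_scores.values())
    return {seq: score / total for seq, score in hit_scores.items()}

# One read hits two database sequences with different alignment scores:
weights = split_read_weight({"seq1": 90.0, "seq2": 30.0})
print(weights)  # seq1 receives 0.75 of the read, seq2 receives 0.25
```

Summing these fractional weights over many reads yields the non-integer "raw counts" discussed above.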

Thanks,
Eric

Nicholas Youngblut

unread,
Dec 19, 2017, 2:51:42 AM12/19/17
to HUMAnN Users
Thanks, Eric, for the quick reply! Although the humann2 output still wouldn't be integers (raw counts), getting the values closer to raw counts would probably still be useful for those who want to use the data processing methods of DESeq2, EdgeR, etc.

On the other hand, I'm wondering about the validity of the assumptions for DESeq2, EdgeR, etc., which were designed for transcription data from model eukaryote organisms. For instance, I believe that these methods assume that gene length doesn't vary greatly in the population (e.g., among a group of humans or mice), but this assumption may not hold when assessing genes shared across a genus or a larger portion of the tree of life. From my own experience in archaeal comparative genomics, gene length can vary quite a bit even within an archaeal "species", which would influence the number of reads mapped to the gene and thus the estimates of abundance (copy number). Maybe it is more valid to use RPK values when looking at genes shared across broad taxonomic groups?

Nicholas Youngblut

unread,
Dec 20, 2017, 2:23:26 PM12/20/17
to HUMAnN Users
I'm curious why humann & humann2 correct for alignment quality, whereas the standard transcriptomics + DESeq2/EdgeR workflow does not (raw counts are used). The datasets aren't the same, but multiple alignments of varying quality are also an aspect of transcriptomics. Is it necessary to correct for alignment quality, especially given the potential limitations/challenges of using humann2 output for DESeq2, EdgeR, etc.?



Nicholas Youngblut

unread,
Dec 21, 2017, 1:18:24 AM12/21/17
to HUMAnN Users
To elaborate a bit more: why not use essentially the same methods as used for standard RNA-seq workflows (eg., B.3 – An RNA-seq Work Flow or RNA-seq workflow: gene-level exploratory analysis and differential expression)? For these RNA-seq workflows, neither gene length nor alignment quality is corrected for, so I'm wondering why it is necessary for humann2. Any clarification on this would really be appreciated.

Eric Franzosa

unread,
Dec 21, 2017, 3:41:50 PM12/21/17
to humann...@googlegroups.com
Correcting for gene length allows you to compare genes within the same sample. So, for example, if you want to estimate the coverage of a pathway in a sample by the mean coverage of its component genes, then you wouldn't want a single large gene in a pathway (that recruits a lot of reads) to bias your estimate of the pathway's abundance. Similarly, if you want to profile a strain of a species in a metagenome, it's helpful to check for uniform coverage of its component genes (which requires adjusting for their individual lengths).
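
The large-gene bias Eric mentions can be made concrete with toy numbers (the genes, lengths, and read counts below are invented). Without length normalization, the long gene dominates the pathway's mean abundance even though both genes have the same per-base coverage:

```python
# Toy example of why length normalization matters for pathway-level
# summaries. Gene names, lengths, and counts are invented.

genes = {
    # gene: (length_bp, reads_recruited)
    "long_gene":  (10000, 1000),
    "short_gene": (1000,  100),
}

# Mean of raw counts: dominated by the long gene.
raw_mean = sum(reads for _, reads in genes.values()) / len(genes)

# RPK (reads per kilobase): both genes have equal per-base coverage.
rpk = {g: reads / (length / 1000.0) for g, (length, reads) in genes.items()}
rpk_mean = sum(rpk.values()) / len(rpk)

print(raw_mean)   # 550.0 -- pulled toward the long gene
print(rpk)        # both genes: 100.0 RPK
print(rpk_mean)   # 100.0 -- a fairer pathway-level summary
```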

If you just want to know if a gene/transcript is differentially abundant between two types of samples (which is what those workflows are focused on), you have to correct for uneven sample read depth, but not necessarily gene length. For example, if all of my samples have 1M reads, and gene X recruits a mean of 2000 reads in cases and 1000 reads in controls, then its abundance is doubled in cases. You could normalize both means to the length of gene X, but it wouldn't change that conclusion. However, if another gene Y had mean abundance 2000 in controls, it would not be appropriate to say it was "twice as abundant as X" without normalizing for their respective lengths (Y might simply be twice as long as X).
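
The gene X / gene Y example above can be worked through numerically. The lengths below are assumed for illustration (Y taken to be twice as long as X); the read counts are the ones from the example:

```python
# Worked version of the gene X / gene Y example. Lengths are assumed
# (Y twice as long as X); counts follow the example in the post.

len_x, len_y = 1000, 2000            # bp; assumed for illustration
x_cases, x_controls = 2000, 1000     # mean reads for gene X
y_controls = 2000                    # mean reads for gene Y

# Within gene X, the cases-vs-controls fold change is the same
# whether or not we normalize by X's length:
fc_raw = x_cases / x_controls
fc_rpk = (x_cases / (len_x / 1000)) / (x_controls / (len_x / 1000))
print(fc_raw, fc_rpk)  # 2.0 2.0

# Comparing Y to X in controls: raw counts suggest Y is twice as
# abundant, but after length normalization they are equal.
print(y_controls / x_controls)                                        # 2.0
print((y_controls / (len_y / 1000)) / (x_controls / (len_x / 1000)))  # 1.0
```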

The alignment quality issue is more specialized to metagenomics. In closed-reference RNA-seq you typically say that a read maps to one of N genes or it doesn't. When you're dealing with a pool of potentially novel organisms, the decision of whether a read mapped with sufficient homology (and if so, where) is more nuanced. That said, dividing read weights over candidate database sequences can also arise in RNA-seq. See, for example, EM approaches to dealing with overlapping isoforms.

Hope this helps!

Thanks,
Eric

Nicholas Youngblut

unread,
Jan 3, 2018, 6:52:28 AM1/3/18
to HUMAnN Users
That helps a lot! Thanks! 

I see the usefulness of normalizing the read count data. However, that makes it hard to use the "standard" RNA-seq analysis software for differential feature detection. I've been looking into other analysis methods that can handle normalized count data and allow flexible experimental designs (i.e., more complicated than comparing treatment vs. control with no covariates). I'm debating whether to use ANCOM, gneiss, or something else that uses log ratios to deal with the compositional nature of the data (and the high number of zeros). At least for 16S rRNA OTU datasets, EdgeR, DESeq2, limma-voom, etc. can have really high false positive rates (see https://www.ncbi.nlm.nih.gov/pubmed/28968702). I don't know of any studies that have compared the accuracy of these analysis methods for metagenome data (either raw read counts or normalized data). Any thoughts?
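
For readers unfamiliar with the log-ratio idea behind methods like ANCOM and gneiss, a minimal sketch of the centered log-ratio (CLR) transform is below. The sample values and the pseudocount for zeros are purely illustrative, not a recommendation:

```python
# Minimal sketch of the centered log-ratio (CLR) transform used by
# compositional methods. The pseudocount handling of zeros and the
# sample values are illustrative only.
import math

def clr(counts, pseudocount=0.5):
    """CLR: log of each value relative to the sample's geometric mean."""
    vals = [c + pseudocount for c in counts]
    log_vals = [math.log(v) for v in vals]
    gmean_log = sum(log_vals) / len(log_vals)
    return [lv - gmean_log for lv in log_vals]

sample = [100, 0, 50, 10]   # one sample's feature counts, with a zero
transformed = clr(sample)
print(transformed)
# CLR values sum to ~0 by construction:
print(round(sum(transformed), 10))  # 0.0
```

Working in this transformed space sidesteps the sum-to-constant constraint of relative abundances, which is one reason these methods are argued to control false positives better on compositional data.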

Eric Franzosa

unread,
Jan 3, 2018, 6:17:13 PM1/3/18
to humann...@googlegroups.com
We recently published a review that covers post-meta'omic profiling statistical analyses, which may be of some use to you:


See especially Table 3 and the associated text.

Thanks,
Eric

Josh Espinoza

unread,
Aug 8, 2018, 7:57:31 PM8/8/18
to HUMAnN Users
Thanks for posting this.  This publication and convo is really useful b/c I was trying to do the same as the OP.