TMM normalized FPKM vs TPM: which metric to use?

ken

Sep 27, 2012, 4:09:27 AM

to rsem-...@googlegroups.com

Hi,

I've run RSEM on a number of samples, and retrieved the TPM values.

I've also run a script from the trinity package that calculates TMM normalized FPKM values based on the expected counts from the same dataset.

I am comparing these but am not sure which to use, as they vary quite a bit depending on the transcript. I'm not doing DE yet, but I want to get an idea of which transcripts may vary between my samples.

The library sizes of my samples vary quite a bit, with one having almost double the number of reads of all the other samples, so I'm a bit lost on which method to use.

The RSEM paper suggests using TPM because it is more comparable than FPKM, but is this still the case when TMM is applied, especially with varying library sizes?

My understanding of these methods is still very rudimentary, so apologies in advance.

Thanks,

Ken

Colin Dewey

Sep 27, 2012, 11:35:51 AM

to rsem-...@googlegroups.com

Hi Ken,

No worries, there are a lot of subtle issues here that are poorly understood. Here is the brief summary of what you should know:

* If you want to compare *relative abundances*, then you should be using TPM, which is simply a fraction. As we (and others) have noted in our papers, FPKM/RPKM are not good measures of relative abundance because the FPKM/RPKM of a transcript can change between two samples even if its relative abundance stays the same.

* The trouble with looking at relative abundances (which is what RNA-Seq directly measures) is that the abundance of one gene affects the relative abundances of all other genes. For example, if a very highly expressed gene increases in its abundance, then the relative abundances of all other genes will go down, even though their *absolute* abundances may remain the same. Thus, a number of "normalization" schemes (e.g., TMM, third-quartile normalization) have been devised that effectively transform counts or FPKM/RPKM from RNA-Seq into *absolute* measures of abundance (or more accurately, they put measures from several samples onto a common absolute scale). Note that you cannot apply these normalization schemes to TPM values because they are relative values and, by definition, the TPM values of all transcripts must sum to 10^6.

So an even briefer summary is:

if you want to compare relative abundances: use TPM
if you want to compare absolute abundances: use normalized read count or normalized FPKM values (where "normalized" = the results of TMM or a similar method)

Hopefully that makes things a bit clearer,
Colin
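
To make the first point concrete, here is a minimal sketch with made-up counts and lengths. It assumes plain transcript lengths rather than the effective lengths RSEM uses, and the function names `fpkm` and `tpm` are illustrative only. Transcript A makes up the same fraction of transcript molecules in both samples, but the mix of short and long transcripts around it differs, so its FPKM changes while its TPM does not.

```python
import numpy as np

def fpkm(counts, lengths):
    """Fragments per kilobase of transcript per million mapped fragments."""
    counts = np.asarray(counts, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    return counts * 1e9 / (lengths * counts.sum())

def tpm(counts, lengths):
    """Transcripts per million: per-base rates rescaled to sum to 1e6."""
    counts = np.asarray(counts, dtype=float)
    lengths = np.asarray(lengths, dtype=float)
    rate = counts / lengths
    return rate / rate.sum() * 1e6

# Hypothetical transcripts A, B, C (lengths in bases).
lengths = np.array([1000.0, 1000.0, 5000.0])

# Counts chosen so that A is 20% of transcript molecules in both samples,
# while B and C swap places (70%/10% versus 10%/70%).
counts_1 = np.array([200.0, 700.0, 500.0])
counts_2 = np.array([200.0, 100.0, 3500.0])

fpkm_1, fpkm_2 = fpkm(counts_1, lengths), fpkm(counts_2, lengths)
tpm_1, tpm_2 = tpm(counts_1, lengths), tpm(counts_2, lengths)

print("TPM of A: ", tpm_1[0], tpm_2[0])    # identical in both samples
print("FPKM of A:", fpkm_1[0], fpkm_2[0])  # differs between samples
```

Within a single sample the two metrics differ only by a constant factor, since `tpm == fpkm / fpkm.sum() * 1e6`; the discrepancy appears only when comparing across samples.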

Erik Aronesty

Sep 27, 2012, 1:46:05 PM

to rsem-...@googlegroups.com

I've found upper quartile normalization to be

- adequate

- stable across a wide variety of experimental conditions (from human, to miRNA counts, to bacterial transcriptomes)

- very easy to implement

- easy to explain in a paper

- robust in response to situations where there is both high and low replicate variation, etc.

FPKM is, essentially, "total count normalization" and has many issues:

http://www.biomedcentral.com/1471-2105/11/94/

- Erik
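
For reference, a minimal sketch of upper quartile normalization on a genes-by-samples count matrix. The function names are hypothetical, and rescaling the factors by their mean is a convenience assumed here so the normalized values stay on roughly the original count scale; it is not something taken from the thread.

```python
import numpy as np

def upper_quartile_factors(counts):
    """Return one scale factor per sample (column) of a genes x samples matrix."""
    counts = np.asarray(counts, dtype=float)
    # Ignore genes with zero counts in every sample before taking the quantile.
    expressed = counts[counts.sum(axis=1) > 0]
    uq = np.percentile(expressed, 75, axis=0)
    # Assumption: divide by the mean so the factors average to 1.
    return uq / uq.mean()

def upper_quartile_normalize(counts):
    """Divide each sample's counts by its upper quartile factor."""
    counts = np.asarray(counts, dtype=float)
    return counts / upper_quartile_factors(counts)

# Toy data: sample 2 is sample 1 sequenced at twice the depth.
counts = np.array([
    [10, 20],
    [0, 0],
    [50, 100],
    [200, 400],
    [5, 10],
])
normalized = upper_quartile_normalize(counts)
print(normalized)  # the two columns agree after normalization
```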

Julien Roux

Jun 30, 2015, 5:41:07 AM

to rsem-...@googlegroups.com

Dear Colin,
Your explanation seems quite clear.
Just one question: I was wondering if it is possible to obtain "absolute" abundances by normalizing TPM values (e.g., with TMM)?
My intuition is that this should perform better than normalizing RPKM values with TMM.
What do you think?
Best,
Julien

Bo Li

Jul 2, 2015, 4:11:06 AM

to rsem-...@googlegroups.com

Hi Julien,

Yes, you can use TPM.

Best,
Bo

> --
> RSEM website: http://deweylab.biostat.wisc.edu/rsem/ [1]
> ---
> You received this message because you are subscribed to the Google
> Groups "RSEM Users" group.
> To unsubscribe from this group and stop receiving emails from it,
> send an email to rsem-users+...@googlegroups.com.
> To post to this group, send email to rsem-...@googlegroups.com.
> Visit this group at http://groups.google.com/group/rsem-users [2].
>
>
> Links:
> ------
> [1] http://deweylab.biostat.wisc.edu/rsem/
> [2] http://groups.google.com/group/rsem-users

Jason

Jan 31, 2018, 12:33:00 AM

to RSEM Users

Hi Colin and Bo,

I have read this post and I'm confused about whether you can use TMM normalization on TPM values.

Colin wrote: "Note that you cannot apply these normalization schemes to TPM values because they are relative values and, by definition, the TPM values of all transcripts must sum to 10^6."

and

Bo replied "Yes, you can use TPM." to Julien's question: "I was wondering if it was possible to obtain "absolute" abundances by normalizing TPM values (e.g., with TMM)? My intuition is that this should perform better than normalizing RPKM values with TMM"

So these two answers seem inconsistent. Can we do TMM-normalized TPM?

My goal is to be able to compare the expression of a set of genes both within one sample and across samples from different treatment groups. I'd like to plot these expression values and visually inspect these genes in different samples.

I found "TMM-normalized TPM" mentioned in the following page and paper:

https://github.com/trinityrnaseq/trinityrnaseq/wiki/Trinity-Transcript-Quantification

https://www.biorxiv.org/content/biorxiv/early/2017/05/28/143289.full.pdf

It's quite confusing, and your clarification would be greatly appreciated.

Thanks,

Jason

Christopher Conley

Feb 1, 2018, 10:46:14 AM

to RSEM Users

Hi Jason,

See this post from Rob Patro (author of Salmon) for clarification:

https://groups.google.com/forum/#!topic/sailfish-users/jBf9SGiH1AM

In short, TMM technically operates on raw read counts because it needs the library size, which is not available from TPM values.

Best,

Chris
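
A small sketch of why library size cannot be recovered from TPM, using made-up counts and lengths (plain lengths are assumed instead of effective lengths, and `tpm` is an illustrative helper): two libraries with the same composition but different sequencing depth give the same TPM vector.

```python
import numpy as np

def tpm(counts, lengths):
    """Per-base rates rescaled to sum to 1e6 within a sample."""
    rate = np.asarray(counts, dtype=float) / np.asarray(lengths, dtype=float)
    return rate / rate.sum() * 1e6

lengths = np.array([1000.0, 2000.0, 500.0])
shallow = np.array([100.0, 400.0, 50.0])
deep = shallow * 2  # same composition, twice the sequencing depth

tpm_shallow = tpm(shallow, lengths)
tpm_deep = tpm(deep, lengths)

print(shallow.sum(), deep.sum())  # library sizes differ
print(tpm_shallow)
print(tpm_deep)                   # same values as tpm_shallow
```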

Brian Haas

Feb 1, 2018, 3:05:36 PM

to rsem-...@googlegroups.com

In case it helps, we use TMM normalization in a couple of ways in the Trinity pipeline. It's used as part of the differential expression analysis internally within the Bioconductor tools (i.e., edgeR), where it's applied to count data. Later on, separately, we use it to scale the TPM matrix, with the same goal: scale the matrix so that most genes are not differentially expressed. The scaled TPM values (via TMM normalization or scaling) are used for making heatmaps and various other expression plots. We rely entirely on the Bioconductor tools for any of the statistical significance and estimated fold change calculations.

The rescaled TPM values are no longer proper TPM values (they won't sum to 1 million per sample), but continue to be treated as relative expression values.

Now, the difference between using FPKM vs. TPM: within a sample, FPKM values and TPM values have an exact linear relationship, and it's my understanding that it really doesn't matter which one you use once you've performed cross-sample scaling using TMM normalization. TPM, in general, has better aesthetic properties in that the values (prior to TMM normalization) can be thought of as the concentration of transcripts in the cell, and it is, in general, the preferred metric to use these days for individual sample reporting.

disclaimer: I'm not much of a 'math guy'. The above just reflects my current understanding and I try to adhere to best practices. I'm always interested in a better way to do it, though, if something is awry here. (and I do often consult with others along those lines).

best,

~brian
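
The scaling idea above can be sketched roughly as follows. This is a simplified, unweighted trimmed mean of log ratios, not edgeR's `calcNormFactors` and not the Trinity script: edgeR additionally weights genes by estimated precision and rescales the factors across samples. The function name `simple_tmm_factor` and the toy data are assumptions made for illustration.

```python
import numpy as np

def simple_tmm_factor(sample, ref, trim_m=0.3, trim_a=0.05):
    """Unweighted trimmed mean of log2 ratios of `sample` against `ref`.

    Simplification: a plain mean over the untrimmed genes, with no
    precision weights.
    """
    sample = np.asarray(sample, dtype=float)
    ref = np.asarray(ref, dtype=float)
    keep = (sample > 0) & (ref > 0)
    p_s = sample[keep] / sample.sum()
    p_r = ref[keep] / ref.sum()
    m = np.log2(p_s / p_r)        # per-gene log fold change
    a = 0.5 * np.log2(p_s * p_r)  # per-gene average log abundance
    lo_m, hi_m = np.quantile(m, [trim_m, 1 - trim_m])
    lo_a, hi_a = np.quantile(a, [trim_a, 1 - trim_a])
    core = (m >= lo_m) & (m <= hi_m) & (a >= lo_a) & (a <= hi_a)
    return 2.0 ** m[core].mean()

# Toy "absolute" abundances: one gene goes up 10-fold, the rest are unchanged.
absolute_ref = np.linspace(10, 200, 20)
absolute_sample = absolute_ref.copy()
absolute_sample[-1] *= 10

# What TPM reports: each sample rescaled to sum to 1e6.
tpm_ref = absolute_ref / absolute_ref.sum() * 1e6
tpm_sample = absolute_sample / absolute_sample.sum() * 1e6

factor = simple_tmm_factor(tpm_sample, tpm_ref)
scaled_sample = tpm_sample / factor

print(tpm_sample[0] / tpm_ref[0])     # below 1: unchanged gene looks down in raw TPM
print(scaled_sample[0] / tpm_ref[0])  # close to 1 after scaling
print(scaled_sample.sum())            # no longer 1e6
```

In this toy case the unchanged genes line up with the reference again after scaling, and the scaled column no longer sums to one million, which matches the description of the rescaled values above.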

sent...@gmail.com

Feb 4, 2018, 12:04:46 AM

to RSEM Users

In my opinion, TMM-TPM is not an appropriate way of reporting gene expression because TPM is already normalized (it is a proportion within a sample), and on top of that you are normalizing (scaling) again using TMM. So you are basically double-normalizing the expression values, which doesn't look right.

Brian Haas

Feb 4, 2018, 7:31:33 AM

to rsem-...@googlegroups.com

TPM is length-normalized and is fine for comparing expression within a sample, but for cross-sample comparisons you still have the issue that differences in sample composition can cause average log fold changes to differ from zero, so the need for cross-sample normalization continues to apply. Details are well described here:

https://genomebiology.biomedcentral.com/articles/10.1186/gb-2010-11-3-r25

--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

Jason

Feb 18, 2018, 11:08:46 AM

to RSEM Users

Thanks for the link, Chris, it's helpful!

Thanks for the detailed explanation, Brian! I agree that TPM and FPKM values have a linear relationship, which was discussed in the 2010 RSEM paper. I think I've read somewhere that FPKM doesn't really have as much biological meaning because it has the total number of mapped reads in the denominator, which varies from one library to another, whereas TPM always gives you a relative measurement of transcript abundance within a sample/library.
