How to Normalize Salmon TPM output?

Nick Bernstein

Sep 18, 2015, 9:42:46 AM
to Sailfish Users Group
Hello all,

I'm fairly new to RNA-seq analysis and have some basic questions. Once I have quantified my samples using Salmon, what are the various methodologies for normalizing transcript expression when working with the TPM values it reports? Is normalization always needed, and how do I determine whether it is?

I apologize if this is not the correct forum for such a post.

Best,
Nick

Rob

Sep 18, 2015, 1:19:29 PM
to Sailfish Users Group
Hi Nick,

  Welcome to the user-group.  I assume that you're interested in normalizing the expression estimates across samples?  The answer really depends on what these estimates will be used for.  If you're simply trying to get an idea of how relative abundance looks across different samples, then the TPM estimates themselves will let you explore that.  However, it is important to recognize that TPM (and FPKM/RPKM, etc.) are purely relative abundance estimates, and therefore cannot be used directly to compare abundances across samples.  To use such estimates across samples, another level of normalization is required (e.g. TMM).  Many DE packages, like edgeR and DESeq, implement such normalization methods (or provide their own).  This paper (from a few years ago) compares some techniques commonly used for normalization between samples.
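To make the "purely relative" point concrete, here is a toy sketch in Python (the function and numbers are illustrative, not Salmon's internals): if one transcript goes up in absolute terms, the TPM of every other transcript goes down, even though nothing else about them changed.

```python
import numpy as np

def tpm(counts, eff_len):
    # reads per base, rescaled so the values sum to one million
    rate = counts / eff_len
    return rate / rate.sum() * 1e6

eff_len = np.array([1000.0, 1000.0, 1000.0])

# sample B expresses transcript 0 five-fold higher than sample A;
# transcripts 1 and 2 are unchanged in absolute terms
sample_a = np.array([100.0, 100.0, 100.0])
sample_b = np.array([500.0, 100.0, 100.0])

print(tpm(sample_a, eff_len))  # approx. [333333.3 333333.3 333333.3]
print(tpm(sample_b, eff_len))  # approx. [714285.7 142857.1 142857.1]
# transcripts 1 and 2 *appear* down in B purely because of the
# compositional shift -- this is why TPMs can't be compared directly
```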

Best,
Rob

Nick Bernstein

Sep 18, 2015, 2:13:26 PM
to Sailfish Users Group
Thanks so much for the explanation. Exactly what I was looking for.

Nick Bernstein

Oct 8, 2015, 12:13:01 PM
to Sailfish Users Group
Hello Rob, 

Sorry for the late follow-up. I've been looking into normalization more, and I was wondering about a few things that you might be able to answer or discuss.

So we have within-sample normalization (TPM or others) and between-sample normalization (TMM or others), but is it ever necessary to do both, i.e., is it ever necessary to normalize relative abundances across a cohort?

I don't think it would be, but another scenario that seems quite common is filtering out isoforms that have no expression in 90% (or some other threshold) of the samples when working with a large cohort. If you do this while working with TPM, though, the sum of TPM over all isoforms will no longer be equal for every subject. Would it make sense to use TMM after such a filtering step? I think it would.

Do you think such filtering of isoforms is flawed in some manner? My guess is that it's used because people are worried about the sensitivity of RNA-seq, and because, biologically, most think that in a specific tissue type a good percentage of genes are not expressed. So I think it makes some sense.

It seems like all between-sample normalizations require raw counts as input and leave it there. I read Harold Pimentel's blog post about it (https://haroldpimentel.wordpress.com/2014/12/08/in-rna-seq-2-2-between-sample-normalization/, very informative), but I haven't seen a follow-up about this problem, if it is a problem.

What are your thoughts?

Best,
Nick

Rob

Oct 13, 2015, 12:48:42 PM
to Sailfish Users Group
Hi Nick,

  No problem — sorry for the slow response myself.  My thoughts are below:

So we have within-sample normalization (TPM or others) and between-sample normalization (TMM or others), but is it ever necessary to do both, i.e., is it ever necessary to normalize relative abundances across a cohort?

TPM and TMM are very different types of normalization.  Specifically, between-sample TMM normalization is carried out on estimated read counts, not TPM estimates.  The reason is that when performing TMM normalization, one typically must take into account the library sizes of the experiments being compared.  TPM normalization explicitly erases information about library size.  That is, it estimates the relative abundance of each transcript as a proportion of the total population of transcripts sampled in the experiment.  Thus, you can imagine TPM, in a way, as a partition of unity: we want to assign a fraction of the total expression (whatever that may be) to each transcript, regardless of whether our library is 10M fragments or 100M fragments.
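As a quick illustration of the partition-of-unity point (again a toy sketch, not Salmon code): sequencing the same sample ten times deeper changes every count, but not a single TPM value.

```python
import numpy as np

counts = np.array([500.0, 1200.0, 80.0])
eff_len = np.array([1500.0, 3000.0, 400.0])

def tpm(c, l):
    rate = c / l
    return rate / rate.sum() * 1e6

# a 10x deeper library scales every count, but TPM is unchanged:
# library-size information has been erased
print(np.allclose(tpm(counts, eff_len), tpm(10 * counts, eff_len)))  # True
```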

TMM, on the other hand, explicitly calculates "normalization" factors between different libraries, trying to make, e.g., a 20M read library in one sample and a 30M read library in another comparable.  Thus, it doesn't so much "erase" library-size information as TPM does; rather, it explicitly tries to scale the expected counts so that they are comparable.
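If it helps to see the mechanics, below is a rough, unweighted sketch of the TMM computation.  edgeR's calcNormFactors additionally applies precision weights and other refinements, so treat this purely as illustration.

```python
import numpy as np

def tmm_factor(sample, ref, m_trim=0.30, a_trim=0.05):
    """Simplified TMM scaling factor for `sample` relative to `ref`
    (both are vectors of estimated read counts)."""
    n_s, n_r = sample.sum(), ref.sum()
    keep = (sample > 0) & (ref > 0)        # drop transcripts absent in either
    p_s, p_r = sample[keep] / n_s, ref[keep] / n_r

    m = np.log2(p_s / p_r)                 # per-transcript log fold-changes
    a = 0.5 * np.log2(p_s * p_r)           # average log-abundances

    # trim the extremes of M and A before averaging, as TMM prescribes
    m_lo, m_hi = np.quantile(m, [m_trim, 1 - m_trim])
    a_lo, a_hi = np.quantile(a, [a_trim, 1 - a_trim])
    core = (m > m_lo) & (m < m_hi) & (a > a_lo) & (a < a_hi)

    return 2.0 ** m[core].mean()           # scaling factor for `sample`
```

Multiplying each sample's library size by its factor gives an effective library size under which the counts become comparable.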


I don't think it would be, but another scenario that seems quite common is filtering out isoforms that have no expression in 90% (or some other threshold) of the samples when working with a large cohort. If you do this while working with TPM, though, the sum of TPM over all isoforms will no longer be equal for every subject. Would it make sense to use TMM after such a filtering step? I think it would.

Yes, you're right, which is exactly why TMM should be computed in terms of expected read counts.  If you filter out low-abundance transcripts, the TPM values in each sample will no longer add up to 1,000,000 (or, if you re-normalize them, the scaled values will depend on what you have removed).  However, if lowly-expressed transcripts are filtered out prior to computing the TMM estimate, the fact that fewer mapped reads are being accounted for can be considered explicitly in the normalization calculation.
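As a sketch of that workflow (the threshold and toy data here are made up): filter on the counts matrix first, then compute between-sample factors from whatever remains.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy matrix of estimated read counts: 1,000 transcripts x 20 samples
counts = rng.poisson(0.1, size=(1000, 20)).astype(float)

# drop transcripts with zero counts in >= 90% of the samples
expressed_frac = (counts > 0).mean(axis=1)
filtered = counts[expressed_frac >= 0.10]

# between-sample factors (e.g. the TMM sketch above) are then computed
# on `filtered`; library sizes are recomputed from the remaining reads,
# so the filtering is accounted for explicitly
lib_sizes = filtered.sum(axis=0)
print(filtered.shape, lib_sizes[:3])
```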
 

Do you think such filtering of isoforms is flawed in some manner? My guess is that it's used because people are worried about the sensitivity of RNA-seq, and because, biologically, most think that in a specific tissue type a good percentage of genes are not expressed. So I think it makes some sense.

It seems like all between-sample normalizations require raw counts as input and leave it there. I read Harold Pimentel's blog post about it (https://haroldpimentel.wordpress.com/2014/12/08/in-rna-seq-2-2-between-sample-normalization/, very informative), but I haven't seen a follow-up about this problem, if it is a problem.

Between-sample normalization is an inherently tricky problem, because RNA-seq, by its nature, gives us relative estimates of abundance.  That said, between-sample normalization methods like TMM and the DESeq normalization attempt to make estimated counts between samples comparable and, in general, do a reasonably good job.

One way people try to mitigate the problem experimentally is by including a spike-in in their library.  The spike-in has a known concentration, so you can use its estimated relative abundance to calibrate the relative abundance estimates in a sample onto a more absolute scale.  However, if you're dealing with experimental data that has already been gathered without a spike-in, this isn't really an option.  You can find information about how to incorporate spike-ins in the manual of the DE-testing tool you're using (if it is supported).

Finally, one relatively new option for performing differential expression testing (as well as visualizing changes in transcript-level expression across conditions / cohorts) with Salmon / Sailfish is to use it in conjunction with the sleuth tool; a walk-through of how to do this is described in a recent post on my blog.  This takes advantage of estimates of uncertainty / variance in the estimated read counts for transcripts (computed via bootstrapping).  I expect more methods will begin to incorporate such variance information as the ability to compute it becomes more prevalent.
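Going back to the spike-in idea, for intuition only, here is a back-of-the-envelope version of that calibration; the names and shapes below are mine, and real pipelines are considerably more careful about variance and dropout in the spike-ins.

```python
import numpy as np

def spikein_factors(tpm_matrix, spike_rows, spike_conc):
    """Per-sample calibration factors from spike-ins of known concentration.

    tpm_matrix : transcripts x samples matrix of TPM estimates
    spike_rows : row indices of the spike-in transcripts
    spike_conc : their known absolute concentrations, in the same order
    """
    spike_conc = np.asarray(spike_conc, dtype=float)
    observed = tpm_matrix[spike_rows, :]      # relative abundances of spike-ins
    # per sample: the median ratio of known concentration to observed TPM;
    # multiplying a sample's TPMs by its factor puts them on a
    # (roughly) absolute scale
    return np.median(spike_conc[:, None] / observed, axis=0)
```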

Best,
Rob