Salmon NumReads quantification among isoforms

445 views
Skip to first unread message

Pedro Blecua

unread,
Jun 29, 2015, 6:47:35 PM6/29/15
to sailfis...@googlegroups.com, blec...@mskcc.org
Dear Salmon developers,

I am a research associate at MSKCC in NYC. I have an inquiry regarding the way NumReads is quantified in Salmon. It has troubled me for a while now and cannot figure it out yet in a comprehensive way. I will illustrate my question with one concrete example, including only protein coding transcripts. Please see below. This example is for a gene I am very interested in, RPL31, and in particular three of its isoforms.

Example of data:


Gene       Transcript ID (gencode_v19)              Length             TPM           NumReads    Transcript abundance within the gene   Number of Exons     Overlapping exons (may be partial overlap) with most abundant transcript

RPL31  ENST00000456292.1                  357       1.70232e+03     3.03235e+04                   7.389296e+01%                                       3                                                      3 (exons 1-3, itself)
RPL31  ENST00000409711.1                  2328      1.99106e-17      2.31280e-15                    8.642636e-19 %                                      2                                                      2 (exons 1-2, sense-wise)
RPL31  ENST00000441435.1                  717       4.17974e+00    1.49533e+02                    1.814308e-01 %                                       4                                                      3 (exons 1-3, sense-wise) 

In this particular case, the coverage of this gene is very high. See attached IGV/Sashimi plot snapshots from the bam file I generated for this sample and take a look at the scale: it ranges from 0 to approximately 25k reads across the selected range of the gene.

When I observe the Salmon output, I immediately realized that the first transcript, ENST00000456292.1, the  most abundant one, has a NumRead estimate of the order of magnitud of the real coverage in the bam file (again, please refer to IGV snapshot/Sashimi plot). However, the other two transcripts, say, ENST00000409711.1 and ENST00000441435.1 have NumReads from Salmon ~0 and ~149, respectively. What I do not understand, is why is so, as as specified in my example and easily seen in the IGV snapshot these three transcripts share highly covered exons. It seems as if only the full coverage of each exon was calculated by Salmon for the most abundant transcript. I would (very) naively expect the other two transcripts, regarding at least the shared exons, to have a way higher estimate of NumReads. This might be related to the way Salmon treats overlapping reads among Transcripts/shared exons? Specially striking is the ~0 NumReads prediction  for transcript ENST00000409711.1, that shares two highly covered full exons with the most abundant transcript. What is the origin of this discrepancy between the isoform NumReads estimate from Salmon and the coverage of the transcripts when visualizing the bam files? Why is only one transcript with a 'reasonable' estimated NumReads and why the other two, sharing highly covered exons with the former, have a way lower estimate?

This is not the only example I have noted thus far, but the other ones follow a very similar reasoning. 

I would highly appreciate some input with this, as I rely very much on Salmon for my research. I am sure I am missing out something, but still do not know what. 

Thanks for providing the community with such an excellent piece of software,

Yours sincerely,

Pedro

Sashimi_Salmon.png
igv_snapshot_Salmon.png

Rob

unread,
Jun 29, 2015, 7:36:52 PM6/29/15
to sailfis...@googlegroups.com, pedro....@gmail.com, blec...@mskcc.org
Hi Pedro,

  Thank's for your question! I'll do my best to explain what's happening, but please let me know if my explanation is unclear or if you need further details (by the way, you can read a detailed pre-print of the salmon method now at bioRxiv).  I'd also encourage you to upgrade to the latest version of salmon if you haven't already done so, as it contains some improvements and optimizations detailed in the paper.

  What you're seeing looks consistent with the expected behavior of Salmon (and, for that matter, other methods for the estimation of relative transcript abundance like RSEM, eXpress, etc.).  Further, assigning a large number of reads to these other transcripts is, likely, an example of the most common failure mode of count-based methods (e.g. HT-seq, etc.).  Looking at the plots you provided, there is very little evidence for the existence of transcript ENST00000409711.1 in the absence of ENST00000456292.1.  As you mention, ENST00000409711.1 shares its only highly-expressed regions with ENST00000456292.1.  The way methods such as Salmon, RSEM, etc. work is to assign (conceptually) each sequenced fragment to a single transcriptomic locus.  This means that a read cannot come from ENST00000456292.1, ENST00000409711.1 and ENST00000441435.1 --- rather it must be assigned a locus.  In reality, because we are dealing with a probabilistic model, the assignments aren't "hard", but we might say that the read has a probability of 0.95 of coming from ENST00000456292.1, 0.04 of coming from ENST00000441435.1 and 0.01 of coming from ENST00000409711.1.

  The way these probabilities are determined are by considering, simultaneously, all of the other reads in the sequencing experiment (again, for many more details, see the preprint).  Specifically, we look for the parameters (the assignments of reads to transcripts and therefore, indirectly, transcript abundances), that maximize the likelihood of the observed data.  This means that the assignments your seeing reflected in the # of mapped reads produce, likely, a much higher likelihood of observing all of the reads in your sequencing experiment than if, for example, many more reads had been assigned to ENST00000409711.1.  While the algorithms work completely in terms of a probabilistic model, you can also think of this intuitively (again, this is not how the method actually works --- it is based on a full probabilistic model, not a parsimony model) as a sort of parsimony condition.  If I have a transcript (A) with 3 exons, and another transcript (B) with 4 exons --- 3 of which are shared with (A)  --- and all of my reads map to the exons of (A), it is not parsimonious to posit the presence of (B), even though it shares the high-coverage exons with (A).  Further, the lack of high coverage of B's remaining exon is strong evidence that it is not, in fact, expressed.  So, what I believe you're seeing here is that the majority of your reads can be explained by ENST00000456292.1, and most of the remaining reads are explained by ENST00000441435.1.  Thus, given strong lack of evidence for ENST00000409711.1 (the fact that pretty much any reads mapping to it can be explained, with higher likelihood, by the other two isoforms), it is assigned a read count of essentially 0.

  I'm not sure if these are the only genes / transcripts in this genomic region, but, again, I'd like to stress that Salmon takes into account all of the fragments (reads) and transcripts when optimizing fragment assignment.  So, there may, in fact, be even stronger evidence in favor of the output you're seeing.  Please let me know if you have any questions about what I've explained above or the manner in which I've explained it.  Thanks for your interest in Salmon, for using the software, and for your feedback!

Best,
Rob

Pedro Blecua

unread,
Jun 30, 2015, 11:12:17 AM6/30/15
to sailfis...@googlegroups.com, blec...@mskcc.org
Dear Rob,

Thank you very much for your great and 'easy' explanation. I actually ended up with the preprint last Sunday, but I have very strong time constraints. However, as I am formally trained in mathematical and statistical physics, I will make a gap in my busy schedule to study the method in detail. I am actually eager to do it, now that you provided me with an excellent 'intuition' of how Salmon works. Thanks you very much once again.

Following up your discussion, when you say "Looking at the plots you provided, there is very little evidence for the existence of transcript ENST00000409711.1 in the absence of ENST00000456292.1" just looking at the igv snapshot and sashimi plots, how do you arrive at that conclusion? Because of the very low number of reads spanning the junction between the two exons of ENST00000409711.1, that the sashimi plot picks up? Sorry if this question is trivial, but I am transitioning from physics to genomics and have still lots to learn and intuition to develop.

Another case I found interesting is attached (igv and Sashimi plots). It is the gene DNM2. As one can clearly see, the last exon of transcript ENST00000591818.1 (Salmon NumReads ~ 0) overlaps with the antisense transcribed gene TMED1. In the sashimi plot, it seems that the software picks up on the fact that supports that this transcript is hardly expressed, judging for the low number of reads spanning the junction between the two exons. However, when I take a look at the igv snapshot, the second exon of this DNM2 transcript seems 'inflated' with reads coming from the gene TMED1. I wonder, therefore, how does Salmon treats this gene/transcript overlaps? Say, how does Salmon discriminate between this DNM2 transcript (and its assigned reads) and the transcripts from TMED1 with overlapping exons? Once more, pending reading the paper, I would like an intuitive as possible answer, as you greatly did yesterday.

Thank you, Rob, very much once more, and be sure I will update to the latest Salmon version!

Regards,

Pedro

On Monday, June 29, 2015 at 6:47:35 PM UTC-4, Pedro Blecua wrote:
Sashimi_Salmon_DNM2.png
igv_snapshot_Salmon_DNM2.png
Reply all
Reply to author
Forward
0 new messages