Hi Pedro,
Thanks for your question! I'll do my best to explain what's happening, but please let me know if my explanation is unclear or if you need further details (by the way, you can now read a detailed pre-print of the Salmon method on bioRxiv). I'd also encourage you to upgrade to the latest version of Salmon if you haven't already done so, as it contains some improvements and optimizations detailed in the paper.
What you're seeing looks consistent with the expected behavior of Salmon (and, for that matter, other methods for the estimation of relative transcript abundance, like RSEM, eXpress, etc.). Further, assigning a large number of reads to these other transcripts is likely an example of the most common failure mode of count-based methods (e.g. HTSeq). Looking at the plots you provided, there is very little evidence for the existence of transcript ENST00000409711.1 in the absence of ENST00000456292.1. As you mention, ENST00000409711.1 shares its only highly-expressed regions with ENST00000456292.1.

The way methods such as Salmon, RSEM, etc. work is to assign (conceptually) each sequenced fragment to a single transcriptomic locus. This means that a read cannot simultaneously come from ENST00000456292.1, ENST00000409711.1, and ENST00000441435.1 --- rather, it must be assigned to a single locus. In reality, because we are dealing with a probabilistic model, the assignments aren't "hard"; instead, we might say that a read has a probability of 0.95 of coming from ENST00000456292.1, 0.04 of coming from ENST00000441435.1, and 0.01 of coming from ENST00000409711.1.
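To make the soft-assignment idea concrete, here is a minimal sketch (illustrative only; Salmon's actual model also accounts for effective lengths, fragment lengths, positions, and bias terms) of how a single fragment's assignment probabilities follow from the current abundance estimates. The abundance numbers are the hypothetical ones from the example above.

```python
def soft_assignment(abundances, compatible):
    """Probability that a fragment arose from each compatible transcript,
    in proportion to the current abundance estimates."""
    total = sum(abundances[t] for t in compatible)
    return {t: abundances[t] / total for t in compatible}

# Hypothetical current abundance estimates for the three isoforms:
abund = {
    "ENST00000456292.1": 0.95,
    "ENST00000441435.1": 0.04,
    "ENST00000409711.1": 0.01,
}

# A fragment compatible with all three splits 0.95 / 0.04 / 0.01, as above;
# one compatible with only two of them renormalizes over just those two.
p = soft_assignment(abund, ["ENST00000456292.1", "ENST00000409711.1"])
# p["ENST00000456292.1"] is about 0.9896, p["ENST00000409711.1"] about 0.0104
```

Note the renormalization in the second case: the less a transcript is supported by the rest of the data, the less of each ambiguous fragment it receives.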
These probabilities are determined by considering, simultaneously, all of the other reads in the sequencing experiment (again, for many more details, see the preprint). Specifically, we look for the parameters (the assignments of reads to transcripts and, therefore, indirectly, the transcript abundances) that
maximize the likelihood of the observed data. This means that the assignments you're seeing reflected in the number of mapped reads likely yield a much higher likelihood of observing all of the reads in your sequencing experiment than if, for example, many more reads had been assigned to ENST00000409711.1. While the algorithm works entirely in terms of a probabilistic model, you can also think of this intuitively as a sort of parsimony condition (again, this is not how the method actually works --- it is based on a full probabilistic model, not a parsimony model). If I have a transcript (A) with 3 exons, and another transcript (B) with 4 exons --- 3 of which are shared with (A) --- and all of my reads map to the exons of (A), it is not parsimonious to posit the presence of (B), even though it shares the high-coverage exons with (A). Further, the
lack of high coverage on B's remaining exon is strong evidence that it is not, in fact, expressed. So, what I believe you're seeing here is that the majority of your reads can be explained by ENST00000456292.1, and most of the remaining reads are explained by ENST00000441435.1. Thus, given the strong lack of evidence for ENST00000409711.1 (the fact that essentially any read mapping to it can be explained, with higher likelihood, by the other two isoforms), it is assigned a read count of essentially 0.
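The A-vs-B example above can be reproduced with a toy EM sketch (an illustration only, not Salmon's actual inference, which uses a much richer model). Here transcript "A" has 3 exons, "B" has those 3 plus a unique 4th, so B has a larger (hypothetical) effective length; every read maps to the shared exons, and none covers B's unique exon. Because each ambiguous read is less probable per-position under the longer transcript, the likelihood is maximized by giving essentially all of the abundance to A.

```python
def em(reads, efflen, iters=200):
    """Simple EM for transcript abundance estimation.

    reads:  list of lists; each inner list names the transcripts a read
            is compatible with.
    efflen: dict of (hypothetical) effective lengths per transcript.
    """
    transcripts = list(efflen)
    theta = {t: 1.0 / len(transcripts) for t in transcripts}  # uniform start
    for _ in range(iters):
        counts = {t: 0.0 for t in transcripts}
        # E-step: a read from transcript t at a given position has
        # likelihood proportional to theta[t] / efflen[t]; softly assign
        # each read among its compatible transcripts accordingly.
        for compat in reads:
            w = {t: theta[t] / efflen[t] for t in compat}
            z = sum(w.values())
            for t in compat:
                counts[t] += w[t] / z
        # M-step: re-estimate abundances from the expected counts.
        total = sum(counts.values())
        theta = {t: c / total for t, c in counts.items()}
    return theta

# 100 reads, all ambiguous between A and B (none touch B's unique exon);
# effective lengths 300 and 400 are made-up numbers for illustration.
theta = em([["A", "B"]] * 100, {"A": 300.0, "B": 400.0})
# Virtually all of the abundance ends up on A; B is driven to ~0.
```

In other words, "no coverage on B's unique exon" is exactly the situation where the ambiguous reads are better explained by A alone, and the iterations drive B's abundance toward zero, just as ENST00000409711.1 receives essentially no reads in your output.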
I'm not sure if these are the only genes / transcripts in this genomic region, but, again, I'd like to stress that Salmon takes into account all of the fragments (reads) and transcripts when optimizing fragment assignment. So, there may, in fact, be even stronger evidence in favor of the output you're seeing. Please let me know if you have any questions about what I've explained above or the manner in which I've explained it. Thanks for your interest in Salmon, for using the software, and for your feedback!
Best,
Rob