Nested Transcripts and Salmon transcript estimates

103 views
Skip to first unread message

Daniel Gaston

unread,
Jun 9, 2016, 10:08:48 AM6/9/16
to Sailfish Users Group
I've been testing using Salmon on some data from an Ion Torrent dataset from a viral system, so we have a mix of human and viral transcripts. I've constructed a transcriptome representing both and obviously the speed was quite impressive. In looking over the data with my collaborator we are noticing that many of the viral transcripts that are overlapping with one another have TPM and counts of zero, and we don't expect them too. Some of these alternative transcripts only have small differences with alternative exons that may differ in splice site by a relatively small number of nucleotides.

My guess would be that unless the quasi-mapping results in a relatively high probability of a read coming from that small differentiating bit of sequence this would occur. Am I correct in this assumption? I am also running STAR + Bowtie on the dataset to see if explicit mapping aligns reads to these regions. Any insight or suggestions are much appreciated. 

Rob

unread,
Jun 9, 2016, 10:26:01 PM6/9/16
to Sailfish Users Group
Hi Daniel,

  While quasi-mapping doesn't produce a traditional alignment, it's not actually less precise than traditional alignment.  What I mean by this is that, if there is even a small differentiating region of sequence between the overlapping transcripts, reads that originate from that region should be distinguished and mapped to that transcript that produces them.  However, for the large majority of reads, coming from regions where the containing transcript and the contained transcript have identical sequence, quasi-mappings will be generated for both of these transcripts.  This multi-mapping will then be resolved by the inference algorithm.

  Basically, if no reads come from only the contained transcript, but there exist reads that can be assigned to only the containing transcript, then it is very likely that having the contained transcript with an abundance close to zero is a high likelihood solution.  Consider the following example:

 -- --    --  -- -- -- -- -- --       -- --      --   --      --
|====================================| X
         |==============*========|                    Y

In the "diagram" above, X is the containing transcript, Y is the contained transcript, "--" represents a read and "*" represents the small region of Y that is distinct from X.
In this case, since no reads overlap "*", every read that can be accounted for by Y can also be accounted for by X.  Further, there are reads that only X can account for.
So, in this case, Y is likely to be assigned very low abundance (possibly 0).  Of course, if there are reads overlapping "*", then that shouldn't happen.

If you want to know which (read, transcript) pairs are being generated by quasi-mapping (and hence, considered by salmon), you can run RapMap on the data and take a look at the resulting SAM file.  Then, you'll be able to see which reads are multi-mapping to X and Y, and which reads, if any, are mapping only to Y.

Best,
Rob 

Daniel Gaston

unread,
Jun 10, 2016, 10:13:56 AM6/10/16
to Sailfish Users Group
Thanks Rob, that was my exact intuition about what would be going on with the inference algorithm. I am getting ready to run RapMap on the data now because I came across it through GitHub. I have some STAR + Bowtie2 alignments as well and just haven't had a chance to look at them visually yet. According to my collaborator the transcript should be there based on other things they had done previously with the data but I'm not yet convinced. Thanks for the suggestion with RapMap.

Daniel Gaston

unread,
Jun 14, 2016, 8:19:07 AM6/14/16
to Sailfish Users Group
So I used RapMap on the same data and have been looking at the output. It seems, when I look into it with samtools view or tview, and assuming converted correctly from the genomic coordinates to the appropriate transcript level positions, that there are reads in the position of interest. I was incorrect though in that this position isn't actually unique to that transcript, another transcript that covers the whole locus basically the locus looks like this:

|-----------------------------------------------------------------------------| locus
|-----------------------------------------------------------------------------> transcript 1
|-----------------------------                               ----------------------> transcript 2
|---------------------------------------                   ----------------------> transcript 3

Based on some quick inspection all alignments to transcript 1 are the primary alignments. Reads for transcripts 2 and 3 are marked as supplementary alignments. Transcript 2 has non-zero read count and TPM values, transcript 3 is 0 and 0. 

For visualization does anyone have a good tool for viewing the alignments against the transcriptome in the genomic context? 


On Thursday, June 9, 2016 at 11:26:01 PM UTC-3, Rob wrote:

Rob

unread,
Jun 14, 2016, 1:14:30 PM6/14/16
to Sailfish Users Group
Hi Dan,

  As you can imagine, such a situation is among the trickiest in terms of transcript-level abundance estimation.  When a transcript has no uniquely mapping reads, the nature of the exact maximum likelihood solution will depend on how reads map to other transcripts that share the multimapping reads with this one, as well as the lengths of all of the transcripts involved.  I wouldn't worry about the "primary" vs. "secondary" alignment information in the RapMap mappings; it simply chooses the first alignment as the primary one and the rest as secondary.  It doesn't necessarily imply that the mappings to transcripts 2 and 3 are any worse.

  One thing you might consider doing, to try and determine how much variance might be the result of the read sampling and the inference procedure, is to let salmon perform some posterior Gibbs sampling (--numGibbsSamples) or some bootstrap samples (--numBootstraps).  From these, you could get an estimate of the variability of the estimates for these transcript abundances (as you can imagine, lower abundance transcripts tend to have higher relative variability).  It's probably unlikely, however, that you'll see transcript 3 go from 0 expression to substantial expression across the different posterior samples, but it's probably worth looking at.

  Regarding tools for viewing transcriptomic alignments in the genomic context, I'm not aware of any off the top of my head, but I'm looking for something similar, so please let me know if you find something good!

Best,
Rob

Daniel Gaston

unread,
Jun 16, 2016, 1:52:57 PM6/16/16
to Sailfish Users Group
Thanks Rob. I had done bootstrapping runs but haven't looked at the variance yet for those particular transcripts. I'll take a look.
Reply all
Reply to author
Forward
0 new messages