Bimodal TPMs


Daniel Hyduke

Sep 9, 2016, 5:01:37 PM
to Sailfish Users Group
One thing that I've noticed with salmon is that it generates a bimodal distribution of TPM estimates, whereas kallisto does not (kallisto does have a long tail on the left).  Since the mean of the lower mode is << 1 TPM, these transcript abundances are probably not relevant to cells with ~1-2 million transcripts, but I'm still interested in why this is happening (perhaps I've set things incorrectly?).

In the attached figure, I show the transcript distributions for salmon run on Illumina paired-end 100x100 reads with the following settings:

0) Index generation: salmon index -t gencode.v25.pc_transcripts.fa -i k_31 --type quasi -p 100 -k 21
1) Uncorrected='/usr/local/bin/salmon quant -l IU -i k_31 -1 [set of left fastqs] -2 [corresponding set of right fastqs] -p 49 --useVBOpt'
2) Seq_Corrected="${Uncorrected} --seqBias"
3) GC_Seq_Corrected="${Seq_Corrected} --gcBias"

For Kallisto:
0) Index generation: kallisto index -i kallisto_31 gencode.v25.pc_transcripts.fa
1) kallisto quant -t 49 -n 100 -i kallisto_31 -o KALLISTO fastqs/*gz | samtools view -Sb - > kallisto.bam

I've seen the same behavior with 4 other data sets.
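
For reference, a minimal sketch of the kind of comparison shown in the attached figure (the output paths are placeholders; it assumes the standard tab-separated outputs, Salmon's quant.sf with a 'TPM' column and kallisto's abundance.tsv with a 'tpm' column):

# Minimal sketch: overlay log-scale TPM histograms from the two quantifiers.
# Output paths are placeholders for the salmon and kallisto output directories.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

salmon = pd.read_csv("salmon_out/quant.sf", sep="\t")
kallisto = pd.read_csv("KALLISTO/abundance.tsv", sep="\t")

# A small pseudo-count keeps zero-TPM transcripts visible on the log scale.
eps = 1e-3
plt.hist(np.log10(salmon["TPM"] + eps), bins=100, alpha=0.5, label="salmon")
plt.hist(np.log10(kallisto["tpm"] + eps), bins=100, alpha=0.5, label="kallisto")
plt.xlabel("log10(TPM + 1e-3)")
plt.ylabel("number of transcripts")
plt.legend()
plt.savefig("tpm_distributions.png")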



[Attachment: salmon_bimodal_vs_kallisto.png]

Rob

Sep 9, 2016, 5:21:19 PM
to Sailfish Users Group
Hi Daniel,

  Thanks for providing a thorough description of your observations (and how you processed the data).  The difference you're seeing is almost certainly a result of running salmon with the --useVBOpt option.  When run with the variational Bayesian EM, salmon places a weak but uniform prior of 1e-3 per nucleotide across the entire transcriptome.  The practical effect of this prior is that it reduces the variance among very low expression transcripts (i.e. it acts as a sort of regularizer); however, it does this by preferring to allow these transcripts to be expressed at a consistent but low abundance.  Thus, the overall results returned by the default optimizer (the EM) will tend to be sparser, while the overall results returned by the VBEM optimizer will tend to be less sparse (in exactly the manner you are seeing here).
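
  To make the effect concrete, here is a toy sketch (emphatically not Salmon's actual implementation; the read counts, transcript length, and prior scaling are made up for illustration) of how a weak uniform Dirichlet prior in a VBEM keeps a truly absent transcript at a small but non-zero abundance, while the plain EM drives it to zero:

# Toy illustration (not Salmon's actual code): EM vs. VBEM with a weak uniform
# Dirichlet prior, on a tiny two-transcript example where transcript B is truly
# absent but shares some reads ambiguously with transcript A.
import numpy as np
from scipy.special import digamma

# Conditional probabilities P(read class | transcript): 90 reads unique to A,
# 10 reads equally compatible with A and B.  Rows = read classes, cols = (A, B).
cond = np.array([[1.0, 0.0],
                 [0.5, 0.5]])
counts = np.array([90.0, 10.0])            # reads per class
# Toy per-transcript pseudo-count: per-nucleotide prior (1e-3) times an assumed
# 2 kb length.  The exact scaling inside Salmon may differ.
prior = 1e-3 * np.array([2000.0, 2000.0])

def em(n_iter=500):
    theta = np.array([0.5, 0.5])
    for _ in range(n_iter):
        resp = cond * theta                              # fractional assignments
        resp /= resp.sum(axis=1, keepdims=True)
        expected = counts @ resp                         # expected reads per transcript
        theta = expected / expected.sum()
    return theta

def vbem(n_iter=500):
    alpha = prior + counts.sum() * np.array([0.5, 0.5])  # initial pseudo-counts
    for _ in range(n_iter):
        w = np.exp(digamma(alpha))                       # VB weights: exp(E[log theta])
        resp = cond * w
        resp /= resp.sum(axis=1, keepdims=True)
        alpha = prior + counts @ resp                    # posterior Dirichlet parameters
    return alpha / alpha.sum()

print("EM   estimate:", em())      # B collapses to (essentially) zero
print("VBEM estimate:", vbem())    # B stays at a small but non-zero abundance

  In the toy run, the EM estimate for B goes to ~0, while the VBEM estimate settles at a small, stable, non-zero value, which is qualitatively the same behavior as the low-abundance peak in your plot.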

  I would venture to guess that if you drop the --useVBOpt option, you will see the very low abundance peak go away.  Alternatively, I would venture to guess that you would also see it go away if you set a sparser prior (controlled by the --vbPrior option; the argument is interpreted as the default per-nucleotide, transcriptome-wide prior).  Interestingly, the difference you see is consistent with the general differences you'd tend to observe between such optimization approaches depending on the strength of the prior (e.g. the default VB prior for Salmon should give you sparsity similar to BitSeq, while the EM algorithm should give you sparsity similar to RSEM / kallisto).  Please let me know if this explains the difference, and if my description makes sense to you.

  As an aside, I would argue it is still an active research question which approach is "better".  A (uniform) prior tends to stabilize variance, especially at low abundance, but the cost paid for this is a bias toward less sparse estimates and the calling of some truly absent transcripts as expressed at very low levels.  Conversely, the EM algorithm tends to produce much sparser estimates and therefore calls 0's as 0's more often, but at the price of substantially higher variance at the low end of expression.  We (and others) are exploring the downstream effects of these different optimization approaches, but I've yet to find a single, comprehensive metric that properly captures a perfectly intuitive and practical notion of "accuracy".
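
  If you do rerun things both ways, a quick way to see the difference in sparsity is to compare the fraction of zero-TPM transcripts in the two runs.  A rough sketch (the output directory names are placeholders; it assumes Salmon's standard tab-separated quant.sf with a TPM column):

# Compare how many transcripts are estimated at exactly zero TPM under the
# default EM run vs. the --useVBOpt run.  Paths are placeholders.
import pandas as pd

for label, path in [("EM", "quant_em/quant.sf"), ("VBEM", "quant_vb/quant.sf")]:
    df = pd.read_csv(path, sep="\t")
    zero_frac = (df["TPM"] == 0).mean()
    print(f"{label}: {zero_frac:.1%} of transcripts have TPM == 0")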

Best,
Rob

Vasisht Tadigotla

Sep 12, 2016, 10:52:34 AM
to Rob, Sailfish Users Group
Hi Rob,

That certainly seems to be the case with the dataset I've been looking at. Not using the --useVBOpt option results in a long tail on the left. On a separate note, I'm not sure how 'real' the very low abundance estimates are. We've been exploring different dilutions of ERCC transcripts to look at the limit of detection for transcripts.

Cheers,
Vasisht




--
To find the limits of the possible, one must attempt the impossible.

Rob

Sep 12, 2016, 1:58:16 PM
to Sailfish Users Group
Hi Vasisht,

  Thanks for the input!  Regarding the very low expression transcripts (specifically with the default VB inference), to be clear, I'm not suggesting that the spike (e.g. in Daniel's plot) is real.  In fact, I'd venture to guess that if you define a false positive as returning non-zero expression when the true expression is 0, then the VB optimization with the default prior will have a much higher false positive rate than the default inference algorithm (though if you adopt a more lenient definition, e.g. estimated TPM > 0.1 when the true expression is 0, most of these would go away).

  One question that becomes important, though, is the effect these very low but consistent abundances have on downstream analysis (e.g. DE detection).  The argument in favor of VB is that these transcripts will be at too low an abundance to call as DE anyway, and the enhanced consistency of their abundance estimates may reduce the potential for false detection.  However, I think that more testing on a host of different experimental and simulated datasets will be required to provide a firm, evidence-based answer to this question (and certainly, some part of the effect will depend on the tools being used downstream for DE etc.).
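
  As a rough sanity check on how much of a VB run's output sits in that very low abundance regime, one can count the transcripts that are non-zero but would not survive the lenient cutoff above (the path is a placeholder; it assumes Salmon's standard quant.sf format):

# Count transcripts in a VB run that are non-zero but at or below the lenient
# TPM cutoff discussed above.  Path and cutoff are illustrative.
import pandas as pd

vb = pd.read_csv("quant_vb/quant.sf", sep="\t")
nonzero = (vb["TPM"] > 0).sum()
very_low = ((vb["TPM"] > 0) & (vb["TPM"] <= 0.1)).sum()

print(f"non-zero transcripts:      {nonzero}")
print(f"of those, 0 < TPM <= 0.1:  {very_low}")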

Thanks,
Rob 


