Effective Length and TPM calculation

Will Gammerdinger

Aug 3, 2018, 11:01:11 AM
to Sailfish Users Group
I have been playing around with Salmon and I have a question regarding effective length. My goal is to find genes (paralogs) that Salmon has trouble distinguishing between. I simulated reads (using ART) from a transcriptome and then used Salmon to get TPM estimates. Since I knew how many reads were produced for each gene, I was able to compute ground-truth TPM values assuming perfect alignment (using the method found here https://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/). I noticed a discrepancy between my estimates and what Salmon gave back for all genes; in particular, short transcripts deviated strongly from my expectation. However, my estimates were based on the entire transcript length and not the effective transcript length. My question is: can I just replace transcript length with effective transcript length (as produced by Salmon) to reproduce the way Salmon calculates TPM? This will hopefully give me an expected value for each gene's TPM, so that I can exclude genes that deviate too far from it in my actual analyses. Thank you!!!

Response:

You can have Salmon ignore effective length correction by passing the flag `--noEffectiveLengthCorrection` to the quant command. However, effective length correction generally makes sense to do. Specifically, for each given transcript, the effective length is equal to the actual length minus the mean of the conditional fragment length distribution, that is, the fragment length distribution restricted to fragments no longer than the transcript. For example, if I have a fragment length distribution with a mean of 200, then the effective length of a transcript of length 1000 will be about 800, since almost every fragment is shorter than 1000 and the conditioning changes little. However, the effective length of a transcript of length 200 will be 200 - mean( frag_length_dist_{conditioned on max frag of 200} ): consider the fragment length distribution from 0 to 200, renormalize it, and take the mean of that distribution. The correction therefore matters most for short transcripts, where the subtracted mean is a large fraction of the length.

If you have the perfect (true) read counts, you can simply take Salmon's effective length estimates and use them in place of the full transcript lengths to compute what the "true" TPMs should be, accounting for the fragment length distribution. You can, of course, use the true fragment length distribution mean (if you have it); however, I expect Salmon's estimate of that quantity to be very accurate and quite stable, so there is probably not much trouble with just using the Salmon effective lengths.
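To make the arithmetic concrete, here is a minimal Python sketch of the two calculations described above. It is an illustration of the description in this reply, not Salmon's actual implementation. The function names (`effective_length`, `tpm`) and the `frag_pmf` representation (a list where index k holds the probability of fragment length k) are made up for this example, and raising an error when no fragment fits in the transcript is an assumption of the sketch.

```python
from typing import List, Sequence


def effective_length(length: int, frag_pmf: Sequence[float]) -> float:
    """Transcript length minus the mean of the fragment length
    distribution conditioned on fragments no longer than the transcript.

    frag_pmf[k] is the probability of a fragment of length k
    (hypothetical representation chosen for this sketch).
    """
    # Keep only fragment lengths 0..length (or the whole pmf if shorter).
    truncated = frag_pmf[: min(length, len(frag_pmf) - 1) + 1]
    mass = sum(truncated)
    if mass <= 0:
        # Assumption of this sketch: treat "no fragment fits" as an error.
        raise ValueError("no fragment length fits inside this transcript")
    # Renormalize the truncated distribution and take its mean.
    cond_mean = sum(k * p for k, p in enumerate(truncated)) / mass
    return length - cond_mean


def tpm(counts: Sequence[float], eff_lengths: Sequence[float]) -> List[float]:
    """TPM from read counts, using effective lengths as the denominator."""
    if any(l <= 0 for l in eff_lengths):
        raise ValueError("effective lengths must be positive")
    # Reads per effective base for each transcript.
    rates = [c / l for c, l in zip(counts, eff_lengths)]
    total = sum(rates)
    # Scale so that the values sum to one million.
    return [1e6 * r / total for r in rates]


# Toy fragment length distribution: uniform on 100..300, so the mean is 200.
pmf = [0.0] * 100 + [1.0 / 201] * 201

long_eff = effective_length(1000, pmf)   # all fragments fit: 1000 - 200
short_eff = effective_length(200, pmf)   # only fragments 100..200 fit
expected_tpm = tpm([800, 50], [long_eff, short_eff])
```

With real data you would skip `effective_length` and pass the values from the `EffectiveLength` column of Salmon's `quant.sf` directly to `tpm`, together with your known true read counts.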

