RSEM multiple reads vs accurate abundance estimates


Devinder Kaur

Jul 15, 2017, 8:33:37 AM
to RSEM Users
Hello Bo Li,

RSEM is known for its highly accurate abundance estimates. It handles reads that map ambiguously (multi-mapping reads) to achieve the best estimation accuracy.

I ran RSEM to look at the abundance of LINE1 repeat sequences (Long Interspersed Nuclear Elements), of which there are 900+ copies across the whole genome. Because they are repeats, the copies share up to 95% sequence similarity and differ by SNPs. RSEM reports read counts for them, but when I generated the wiggle plot file with unique reads highlighted ('--show-unique'), I found many LINE1 transcripts whose density plots contain only red bars (i.e., only multi-mapping reads).

My question is: on one hand, RSEM reports an expected_count for such a transcript (supposedly an accurate abundance, derived by ML under a detailed statistical model), which suggests that the transcript is expressed in the subject. On the other hand, the wiggle plot shows that only multi-mapping reads align to it. How can RSEM say that such a transcript is expressed, when the expression might actually come from another, similar copy (see attached picture)? Can the expected_count be considered an accurate abundance for LINE1 transcripts? Can we say that all 4 LINE1 copies are expressed at the expected_count values shown (see picture)? Or is there another way to assess the expression of such repeat sequences accurately?


Picture2.jpg

Colin Dewey

Sep 19, 2017, 3:21:17 PM
to RSEM Users
Hi Devinder,

Given the large number and high similarity of the LINE elements you are examining, there is a lot of uncertainty associated with their abundance estimates.  As such, I highly recommend that you run RSEM with the --calc-ci option, which will give you 95% credibility intervals for the abundance estimates (and other quantities that will help you examine uncertainty).  You could use these in your assessment of which elements might actually be expressed.
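
As a rough sketch of how those intervals could be used, the Python snippet below keeps only the elements whose lower 95% bound on TPM clears a cutoff. It assumes a tab-separated results file with transcript_id, TPM_ci_lower_bound and TPM_ci_upper_bound columns, as written when --calc-ci is used. The function name, the 1.0 TPM cutoff, and the toy rows are made up for illustration and are not part of RSEM.

```python
import csv
import io


def credibly_expressed(results_tsv, min_lower_tpm=1.0):
    """Return IDs of transcripts whose CI lower bound on TPM exceeds a cutoff.

    results_tsv: an open text file (or file-like object) with a header row.
    min_lower_tpm: arbitrary illustrative cutoff, not an RSEM default.
    """
    reader = csv.DictReader(results_tsv, delimiter="\t")
    return [
        row["transcript_id"]
        for row in reader
        if float(row["TPM_ci_lower_bound"]) > min_lower_tpm
    ]


# Synthetic rows for illustration only (not real RSEM output).
toy = io.StringIO(
    "transcript_id\tTPM\tTPM_ci_lower_bound\tTPM_ci_upper_bound\n"
    "L1_copy_A\t12.0\t8.5\t15.2\n"
    "L1_copy_B\t3.0\t0.0\t6.1\n"
)
print(credibly_expressed(toy))
```

With the toy rows, only L1_copy_A is kept, because the interval for L1_copy_B reaches down to zero. On real data you would pass an open handle to the results file instead of the StringIO object, and pick the cutoff to suit your own analysis.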

Also, it should be noted that the RSEM maximum likelihood estimates are the abundances one would assign to maximize the probability of the observed RNA-seq reads, which is related to, but not the same as, the problem of determining which genes are truly expressed.
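
To illustrate the distinction, here is a toy EM sketch in Python. It is a deliberately simplified model, not RSEM's actual one: it assumes equal-length transcripts and treats every compatible alignment as equally good, ignoring alignment quality, fragment lengths, and strand. The function name and the read sets are hypothetical.

```python
def em_abundance(compat, n_transcripts, n_iter=200, init=None):
    """Toy EM for the fraction of reads originating from each transcript.

    compat: list of tuples; each tuple holds the indices of the transcripts
            that one read is compatible with.
    init:   optional starting fractions (defaults to a uniform start).
    """
    if init is not None:
        theta = list(init)
    else:
        theta = [1.0 / n_transcripts] * n_transcripts
    for _ in range(n_iter):
        counts = [0.0] * n_transcripts
        # E-step: split each read among its compatible transcripts
        # in proportion to the current abundance estimates.
        for hits in compat:
            total = sum(theta[t] for t in hits)
            for t in hits:
                counts[t] += theta[t] / total
        # M-step: re-estimate abundances from the fractional counts.
        theta = [c / len(compat) for c in counts]
    return theta


# Ten reads, all compatible with both of two near-identical copies.
multi_only = [(0, 1)] * 10
print(em_abundance(multi_only, 2))                   # uniform start
print(em_abundance(multi_only, 2, init=[0.9, 0.1]))  # skewed start

# Same reads plus a single read that is unique to copy 0.
print(em_abundance(multi_only + [(0,)], 2))
```

With only multi-mapping reads, the uniform start returns an even split and the skewed start stays at roughly 0.9/0.1. Every split explains the reads equally well in this toy model, so a nonzero estimate for a copy does not show that the copy is expressed. Adding one unique read moves nearly all of the abundance to copy 0.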

Another issue with attempting to quantify LINE expression from short RNA-seq data is that you don't necessarily know what the transcription start and end sites are for these elements (assuming they are actually expressed).  An inaccurate annotation of the transcriptional boundaries of these elements will reduce the accuracy of the abundance estimates.

You might consider analysis pipelines more specialized for repetitive elements, e.g., 

Hope that helps,
Colin

Devinder Kaur

Feb 16, 2018, 8:20:13 AM
to RSEM Users
Hi Colin,

As per your suggestion, I used the --calc-ci option and obtained 95% credibility intervals for the abundance estimates. I used TPM_ci_lower_bound and TPM_ci_upper_bound to assess the expression of LINE elements, looking at expression from both the sense and antisense strands. For the sense strand, the TPM values lie within the lower and upper bounds, whereas for the antisense data, the TPM value comes out considerably higher than the range between the lower and upper bounds. An example is in the attached table.
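
The comparison described above can be sketched in Python as follows. This is only an illustration of the check, assuming a tab-separated results file with transcript_id, TPM, TPM_ci_lower_bound and TPM_ci_upper_bound columns. The function name and the toy rows are hypothetical and are not the values from the attached table.

```python
import csv
import io


def tpm_outside_interval(results_tsv):
    """List transcripts whose reported TPM falls outside its credibility interval."""
    flagged = []
    for row in csv.DictReader(results_tsv, delimiter="\t"):
        tpm = float(row["TPM"])
        lower = float(row["TPM_ci_lower_bound"])
        upper = float(row["TPM_ci_upper_bound"])
        if not lower <= tpm <= upper:
            flagged.append((row["transcript_id"], tpm, lower, upper))
    return flagged


# Synthetic rows for illustration only (not the attached table).
toy = io.StringIO(
    "transcript_id\tTPM\tTPM_ci_lower_bound\tTPM_ci_upper_bound\n"
    "L1_sense\t5.0\t3.2\t7.4\n"
    "L1_antisense\t9.0\t1.1\t4.3\n"
)
for record in tpm_outside_interval(toy):
    print(record)
```

Here the toy sense row passes the check and the toy antisense row is flagged, which mirrors the pattern described above.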


I find this strange and am confused by the result. Could it be a bug in the program? I would appreciate your comments or suggestions on this. How should I go ahead and interpret such data?
