Effective Length calculation

2035 views
Skip to first unread message

Matt Newman

unread,
Jan 6, 2013, 10:28:10 AM1/6/13
to rsem-...@googlegroups.com
Hi-

I am trying to fully understand the effective length calculation for RSEM.

My understanding was that it was as follows:

'length' is this transcript's sequence length (poly(A) tail is not counted). 'effective_length' counts only the positions that can generate a valid fragment. If no poly(A) tail is added, 'effective_length' is equal to transcript length - mean fragment length + 1. If one transcript's effective length is less than 1, this transcript's both effective length and abundance estimates are set to 0.

However, if I look at my data (no poly(A) tail added), I find that for most transcripts, the difference between transcript length and effective length is 85 (which generally makes sense for my data).  However, for shorter transcripts, it seems to vary.

I understand that when the effective length would end up being less than 0, it gets set to 0.  But in my case, it does vary for these smaller transcripts.

Thanks in advance for your help!

--Matt Newman

transcript_id gene_id length effective_length expected_count TPM FPKM IsoPct
NR_003498_chr15 HBII-52-45_chr15 53 1.02 0 0 0 0 51.98
NR_036157_chr19 MIR320E_chr19 53 1.02 0 0 0 0 51.98
NR_036093_chr7 MIR548T_chr7 53 1.02 0 0 0 0 51.98
NR_031740_chr1 MIR1976_chr1 52 0 0 0 0 0 52
NR_036172_chr22 MIR3201_chr22 52 0 0 0 0 0 52
NR_031694_chr22 MIR1281_chr22 54 1.12 0 0 0 0 52.88
NR_036226_chr2 MIR4262_chr2 54 1.12 0 0 0 0 52.88
NR_003944_chr1 SNORD78_chr1 54 1.12 0 0 0 0 52.88
NR_030386_chr20 MIR663_chr20 93 11.06 0 0 0 0 81.94
NR_036144_chr16 MIR3180-3_chr16 94 11.5 0 0 0 0 82.5

NR_030294_chr3 MIR551B_chr3 96 12.41 0 0 0 0 83.59
NR_015380_chr19 A1BG-AS1_chr19 2130 2045 0 0 0 0 85
NM_130786_chr19 A1BG_chr19 1766 1681 0 0 0 0 85
NM_001198818_chr10 A1CF_chr10 9362 9277 0 0 0 0 85
NM_001198819_chr10 A1CF_chr10 9529 9444 0 0 0 0 85
NM_001198820_chr10 A1CF_chr10 9412 9327 0 0 0 0 85
NM_014576_chr10 A1CF_chr10 9269 9184 0 0 0 0 85
NM_138932_chr10 A1CF_chr10 9293 9208 0 0 0 0 85
NM_138933_chr10 A1CF_chr10 9364 9279 0 0 0 0 85

b...@cs.wisc.edu

unread,
Jan 6, 2013, 9:26:23 PM1/6/13
to rsem-...@googlegroups.com
Hi Matt,

In order for me to learn more about this issue, can you answer me several
questions?

1) Are your data single-end or paired-end?

2) If your data are single-end, do all reads have the same length or do
you use the '--fragment-length-mean'?

3) RSEM's transcript results should contain 8 fields per line but the
output you pasted seems contain 9 fields. Can you double check it?

Thanks,
Bo

Matt Newman

unread,
Jan 6, 2013, 9:48:13 PM1/6/13
to rsem-...@googlegroups.com
Hi Bo-

Thanks for responding.

1. Single end data

2. The reads have different lengths (they've been trimmed for quality during alignment using another alignment algorithm--I did not use Bowtie or rsem for this). Do I need to calculate fragment-length-mean myself? How about fragment-length-sd? I'm sure this explains somewhat what I'm seeing, but I just want to make sure I understand whats happening.

3. Please ignore the 9th column. That was me trying to figure out what was being subtracted.

Thanks again for your help!

--Matt

b...@cs.wisc.edu

unread,
Jan 6, 2013, 11:51:12 PM1/6/13
to rsem-...@googlegroups.com
Hi Matt,

I see. Since not all reads in your data have the same length, the
effective length is calculated according to the read length distribution
instead of a particular read length. I guess that there are very short
reads due to the trimming process, that's why for the short transcripts
the effective length is not 0 but a small number.

Best,
Bo

Matt Newman

unread,
Jan 7, 2013, 8:26:04 AM1/7/13
to rsem-...@googlegroups.com, b...@cs.wisc.edu
Hi Bo-

Thanks again.

So, is there a way for me to figure out manually what the algorithm is doing? I just want to make sure I understand what you mean by "according to the read length distribution" instead of a particular read length.

I appreciate all your help.

--Matt

b...@cs.wisc.edu

unread,
Jan 7, 2013, 10:21:46 AM1/7/13
to rsem-...@googlegroups.com
Hi Matt,

Sure. If we assume the fragment length distribution or read length
distribution be p(x), where x is a particular read length. Also, assume we
have a function t_i(x), which maps x to the number of valid start
positions for transcript i if read length is x. Then the expected
effective length l_i is defined as

l_i = \sum_x p(x)t_i(x).

If l_i < 1.0, RSEM will set l_i to 0.

Hope it helps,
Bo

edge

unread,
Mar 12, 2013, 1:01:00 AM3/12/13
to rsem-...@googlegroups.com, b...@cs.wisc.edu
Hi Bo,

Similar question like Matt.
But I can't really link the length, effective_length of my gene data with the transcript data :(
transcripts.results
comp100000_c0_seq1 comp100000_c0 815 761.64 21.00 1.00 0.78 79.82
comp100000_c0_seq2 comp100000_c0 343 289.84 2.00 0.25 0.20 20.18
genes.results
comp100000_c0  comp100000_c0_seq1,comp100000_c0_seq2 719.77 666.45 23.00 1.25 0.98

I just not sure how to calculate the length, effective_length result shown in genes.results T_T

Do you mind to give some ideas based on my situation?

Thanks.

best regards
edge 

Colin Dewey

unread,
Mar 12, 2013, 3:00:34 PM3/12/13
to rsem-...@googlegroups.com
Hi edge,

The calculations of the gene-level length and effective length are described in the rsem-calculate-expression help:

"A gene’s ’length’ and ’effective_length’ are defined as the weighted average of its transcripts’ lengths and effective lengths (weighted by ’IsoPct’)"

So, for example, in your output the gene length is computed as 

815 * 79.82% + 343 * 20.18% = 719.75 (this is off by 0.02 due to the rounding of the RSEM IsoPct values)

Best,
Colin

--
You received this message because you are subscribed to the Google Groups "RSEM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rsem-users+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

edge

unread,
Mar 18, 2013, 11:28:37 AM3/18/13
to rsem-...@googlegroups.com

edge

unread,
Mar 18, 2013, 11:33:02 AM3/18/13
to rsem-...@googlegroups.com
Hi Colin,

Sorry for my late reply.
Connection down at my institute past few days.

I'm understand effective length right now.
I will ask you if I got further inquiry about rsem calculation later on.

Apart from that, is it I should run rsem-generate-ngvector, rsem-generate-data-matrix, followed by rsem-find-DE once I get the sampleA,isoforms.results , sampleB,isoforms.results by running rsem?

I have only two transcriptome data set. Same genus and same species but just treated with two different stress condition.
I don't have any biological replicates for my experiment.

Thanks and looking forward to hear from you.

best regards
Edge

edge

unread,
Mar 20, 2013, 12:20:21 AM3/20/13
to rsem-...@googlegroups.com
Hi Colin,

In my case shown below,
transcripts.results
transcript_idgene_idlengtheffective_lengthexpected_countTPMFPKMIsoPct
comp100000_c0_seq1 comp100000_c0  815 761.64 21.00 1.00 0.78 79.82
comp100000_c0_seq2 comp100000_c0 343 289.84 2.00 0.25 0.20 20.18
genes.results
gene_id                transcript_id(s)                                                length       effective_length     expected_count    TPM    FPKM
comp100000_c0  comp100000_c0_seq1,comp100000_c0_seq2 719.77 666.45 23.00 1.25 0.98

The difference between transcript length and effective length is around 66 in my case. 
Do you mind to share more by using my example above to find out how to calculate effective_length and IsoPct that obtained from "transcripts.results"?

I still can't really understand how to get the effective_length and IsoPct even read through website and forum :(
I just curious why all the difference between transcript length and effective length is range between 66.00 to 66.99 too.

Really thanks and appreciate for knowledge sharing.

best regards
Edge

Erik Aronesty

unread,
Sep 24, 2014, 1:28:06 PM9/24/14
to rsem-...@googlegroups.com, b...@cs.wisc.edu
So, if 500 bp poly-A tails are added, and you're using,say 180bp fragments, then your longest reads can align past the end of the gene with poly-a in the paired.  so there's always, essentially, <transcript-length> valid start positions for every read.   Does this mean the effective lengths should always be the same?

Or is there some heuristic used to calucate the number of possible "valid start positions" for a fragment?


On Monday, January 7, 2013 10:21:46 AM UTC-5, b...@cs.wisc.edu wrote:
Hi Matt,

Bo Li

unread,
Sep 24, 2014, 1:30:22 PM9/24/14
to Erik Aronesty, rsem-...@googlegroups.com
Hi Erik,

In that case, the effective length should be equal to the transcript
length itself.

Best,
Bo
Reply all
Reply to author
Forward
0 new messages