Effective Length calculation

Matt Newman

unread,

Jan 6, 2013, 10:28:10 AM1/6/13

to rsem-...@googlegroups.com

Hi-

I am trying to fully understand the effective length calculation for RSEM.

My understanding was that it was as follows:

'length' is this transcript's sequence length (poly(A) tail is not counted). 'effective_length' counts only the positions that can generate a valid fragment. If no poly(A) tail is added, 'effective_length' is equal to transcript length - mean fragment length + 1. If one transcript's effective length is less than 1, this transcript's both effective length and abundance estimates are set to 0.

However, if I look at my data (no poly(A) tail added), I find that for most transcripts, the difference between transcript length and effective length is 85 (which generally makes sense for my data). However, for shorter transcripts, it seems to vary.

I understand that when the effective length would end up being less than 0, it gets set to 0. But in my case, it does vary for these smaller transcripts.

Thanks in advance for your help!

--Matt Newman

transcript_id

gene_id

length

effective_length

expected_count

TPM

FPKM

IsoPct

NR_003498_chr15	HBII-52-45_chr15	53	1.02	51.98
NR_036157_chr19	MIR320E_chr19	53	1.02	51.98
NR_036093_chr7	MIR548T_chr7	53	1.02	51.98
NR_031740_chr1	MIR1976_chr1	52	0	52
NR_036172_chr22	MIR3201_chr22	52	0	52
NR_031694_chr22	MIR1281_chr22	54	1.12	52.88
NR_036226_chr2	MIR4262_chr2	54	1.12	52.88
NR_003944_chr1	SNORD78_chr1	54	1.12	52.88

NR_030386_chr20

MIR663_chr20

93

11.06

0

81.94

NR_036144_chr16

MIR3180-3_chr16

94

11.5

0

82.5

NR_030294_chr3

MIR551B_chr3

96

12.41

0

83.59

NR_015380_chr19	A1BG-AS1_chr19	2130	2045	85
NM_130786_chr19	A1BG_chr19	1766	1681	85
NM_001198818_chr10	A1CF_chr10	9362	9277	85
NM_001198819_chr10	A1CF_chr10	9529	9444	85
NM_001198820_chr10	A1CF_chr10	9412	9327	85
NM_014576_chr10	A1CF_chr10	9269	9184	85
NM_138932_chr10	A1CF_chr10	9293	9208	85
NM_138933_chr10	A1CF_chr10	9364	9279	85

b...@cs.wisc.edu

unread,

Jan 6, 2013, 9:26:23 PM1/6/13

to rsem-...@googlegroups.com

Hi Matt,

In order for me to learn more about this issue, can you answer me several
questions?

1) Are your data single-end or paired-end?

2) If your data are single-end, do all reads have the same length or do
you use the '--fragment-length-mean'?

3) RSEM's transcript results should contain 8 fields per line but the
output you pasted seems contain 9 fields. Can you double check it?

Thanks,
Bo

Matt Newman

unread,

Jan 6, 2013, 9:48:13 PM1/6/13

to rsem-...@googlegroups.com

Hi Bo-

Thanks for responding.

1. Single end data

2. The reads have different lengths (they've been trimmed for quality during alignment using another alignment algorithm--I did not use Bowtie or rsem for this). Do I need to calculate fragment-length-mean myself? How about fragment-length-sd? I'm sure this explains somewhat what I'm seeing, but I just want to make sure I understand whats happening.

3. Please ignore the 9th column. That was me trying to figure out what was being subtracted.

Thanks again for your help!

--Matt

b...@cs.wisc.edu

unread,

Jan 6, 2013, 11:51:12 PM1/6/13

to rsem-...@googlegroups.com

Hi Matt,

I see. Since not all reads in your data have the same length, the
effective length is calculated according to the read length distribution
instead of a particular read length. I guess that there are very short
reads due to the trimming process, that's why for the short transcripts
the effective length is not 0 but a small number.

Best,
Bo

Matt Newman

unread,

Jan 7, 2013, 8:26:04 AM1/7/13

to rsem-...@googlegroups.com, b...@cs.wisc.edu

Hi Bo-

Thanks again.

So, is there a way for me to figure out manually what the algorithm is doing? I just want to make sure I understand what you mean by "according to the read length distribution" instead of a particular read length.

I appreciate all your help.

--Matt

b...@cs.wisc.edu

unread,

Jan 7, 2013, 10:21:46 AM1/7/13

to rsem-...@googlegroups.com

Hi Matt,

Sure. If we assume the fragment length distribution or read length
distribution be p(x), where x is a particular read length. Also, assume we
have a function t_i(x), which maps x to the number of valid start
positions for transcript i if read length is x. Then the expected
effective length l_i is defined as

l_i = \sum_x p(x)t_i(x).

If l_i < 1.0, RSEM will set l_i to 0.

Hope it helps,
Bo

edge

unread,

Mar 12, 2013, 1:01:00 AM3/12/13

to rsem-...@googlegroups.com, b...@cs.wisc.edu

Hi Bo,

Colin Dewey

unread,

Mar 12, 2013, 3:00:34 PM3/12/13

to rsem-...@googlegroups.com

Hi edge,

The calculations of the gene-level length and effective length are described in the rsem-calculate-expression help:

"A gene’s ’length’ and ’effective_length’ are defined as the weighted average of its transcripts’ lengths and effective lengths (weighted by ’IsoPct’)"

So, for example, in your output the gene length is computed as

815 * 79.82% + 343 * 20.18% = 719.75 (this is off by 0.02 due to the rounding of the RSEM IsoPct values)

Best,

Colin

--
You received this message because you are subscribed to the Google Groups "RSEM Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rsem-users+...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

edge

unread,

Mar 18, 2013, 11:28:37 AM3/18/13

to rsem-...@googlegroups.com

edge

unread,

Mar 18, 2013, 11:33:02 AM3/18/13

to rsem-...@googlegroups.com

Hi Colin,

Sorry for my late reply.

Connection down at my institute past few days.

I'm understand effective length right now.

I will ask you if I got further inquiry about rsem calculation later on.

Apart from that, is it I should run rsem-generate-ngvector, rsem-generate-data-matrix, followed by rsem-find-DE once I get the sampleA,isoforms.results , sampleB,isoforms.results by running rsem?

I have only two transcriptome data set. Same genus and same species but just treated with two different stress condition.

I don't have any biological replicates for my experiment.

Thanks and looking forward to hear from you.

best regards

Edge

edge

unread,

Mar 20, 2013, 12:20:21 AM3/20/13

to rsem-...@googlegroups.com

Hi Colin,

In my case shown below,

transcripts.results

transcript_id

gene_id

length

effective_length

expected_count

TPM

FPKM

IsoPct

comp100000_c0_seq1 comp100000_c0 815 761.64 21.00 1.00 0.78 79.82

comp100000_c0_seq2 comp100000_c0 343 289.84 2.00 0.25 0.20 20.18

genes.results

gene_id transcript_id(s) length effective_length expected_count TPM FPKM

comp100000_c0 comp100000_c0_seq1,comp100000_c0_seq2 719.77 666.45 23.00 1.25 0.98

The difference between transcript length and effective length is around 66 in my case.

Do you mind to share more by using my example above to find out how to calculate effective_length and IsoPct that obtained from "transcripts.results"?

I still can't really understand how to get the effective_length and IsoPct even read through website and forum :(

I just curious why all the difference between transcript length and effective length is range between 66.00 to 66.99 too.

Really thanks and appreciate for knowledge sharing.

best regards

Edge

Erik Aronesty

unread,

Sep 24, 2014, 1:28:06 PM9/24/14

to rsem-...@googlegroups.com, b...@cs.wisc.edu

So, if 500 bp poly-A tails are added, and you're using,say 180bp fragments, then your longest reads can align past the end of the gene with poly-a in the paired. so there's always, essentially, <transcript-length> valid start positions for every read. Does this mean the effective lengths should always be the same?

Or is there some heuristic used to calucate the number of possible "valid start positions" for a fragment?

On Monday, January 7, 2013 10:21:46 AM UTC-5, b...@cs.wisc.edu wrote:

Hi Matt,

Bo Li

unread,

Sep 24, 2014, 1:30:22 PM9/24/14

to Erik Aronesty, rsem-...@googlegroups.com

Hi Erik,

In that case, the effective length should be equal to the transcript
length itself.

Best,
Bo

Reply all

Reply to author

Forward