Expected count and expected length


Ashish Kumar Pathak

Feb 22, 2015, 11:52:03 PM
to trinityrn...@googlegroups.com
Greetings all,

Could you help me understand what expected length and expected count mean? We get these values in the RSEM results.





Ashish Kumar Pathak
DBT- JRF 
National Agri-Food Biotechnology Institute
C-127, Industrial Area
SAS Nagar,Phase 8
Mohali-160071
Punjab,India



Brian Haas

Feb 23, 2015, 8:19:45 AM
to Ashish Kumar Pathak, trinityrn...@googlegroups.com
Expected length (reported by RSEM as effective_length) represents the number of positions within a transcript from which an RNA-Seq fragment could have been derived. It's generally equal to the raw transcript length minus the mean fragment length, plus 1. See RSEM for exact details.
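
As a rough illustration of the approximation above, here is a minimal Python sketch. The function name and the mean fragment length of 200 are assumptions made for illustration only. RSEM derives the value from the full fragment length distribution, so its reported numbers will differ, especially for short transcripts.

```python
def approx_effective_length(transcript_length, mean_fragment_length):
    """Approximate effective length: transcript length - mean fragment length + 1.

    Clamping at zero is an assumption of this sketch, so that transcripts
    shorter than the mean fragment length do not get a negative value.
    """
    return max(transcript_length - mean_fragment_length + 1.0, 0.0)


# Hypothetical example: a 522-base transcript and a mean fragment length of 200.
print(approx_effective_length(522, 200.0))  # 323.0
```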

Expected counts represent the maximum likelihood estimate of the number of RNA-Seq fragments derived from a given transcript, taking into account read mapping uncertainty. These are the counts you would use with Bioconductor tools such as edgeR or DESeq2 for differential expression analysis. Also see the RSEM paper as well as our protocol paper for details (linked from trinityrnaseq.github.io).
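
To make "taking into account read mapping uncertainty" concrete, here is a toy expectation-maximization sketch in Python. It is not RSEM's actual model (which also accounts for things like fragment length, read position and sequencing errors). The function name, the input layout and the example numbers are all assumptions for illustration. Each read is shared among the transcripts it maps to in proportion to their current estimated abundance, and the abundances are then re-estimated from those fractional counts.

```python
def toy_expected_counts(reads, eff_lengths, n_iter=200):
    """Toy EM. `reads` is a list of tuples of transcript ids each read maps to;
    `eff_lengths` maps transcript id -> effective length (assumed inputs)."""
    transcripts = list(eff_lengths)
    # theta[t]: estimated fraction of all fragments that come from transcript t
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    counts = {t: 0.0 for t in transcripts}
    for _ in range(n_iter):
        counts = {t: 0.0 for t in transcripts}
        # E-step: split each read among its compatible transcripts
        for compatible in reads:
            weights = {t: theta[t] / eff_lengths[t] for t in compatible}
            total = sum(weights.values())
            for t, w in weights.items():
                counts[t] += w / total
        # M-step: re-estimate the fragment fractions from the fractional counts
        n = sum(counts.values())
        theta = {t: counts[t] / n for t in transcripts}
    return counts


# Hypothetical data: 8 reads unique to A, 2 unique to B, 10 mapping to both.
reads = [("A",)] * 8 + [("B",)] * 2 + [("A", "B")] * 10
counts = toy_expected_counts(reads, {"A": 100.0, "B": 100.0})
print(counts)  # A converges to about 16, B to about 4
```

In this toy case the 10 ambiguous reads end up split 8:2, following the ratio set by the uniquely mapped reads, which is why expected counts are generally not integers.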

Best

-Brian
(by iPhone)

--
You received this message because you are subscribed to the Google Groups "trinityrnaseq-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to trinityrnaseq-u...@googlegroups.com.
To post to this group, send email to trinityrn...@googlegroups.com.
Visit this group at http://groups.google.com/group/trinityrnaseq-users.
For more options, visit https://groups.google.com/d/optout.

Ashish Kumar Pathak

Feb 27, 2015, 12:36:50 AM
to trinityrn...@googlegroups.com
Greetings to all,

I executed Trinity as instructed. After running RSEM, I have an output file as follows:
 
transcript_id  gene_id    length  effective_length  expected_count  TPM   FPKM  IsoPct
c0_g1_i1       c0_g1      223     36.71             1               2     1.83  100
c10001_g1_i1   c10001_g1  522     310.18            7               1.65  1.51  100
c10002_g1_i1   c10002_g1  220     34.74             1               2.11  1.93  100

I want to know the actual count for a contig. For example, what is the raw count for contig c0_g1_i1, and how can this information be obtained?
My other question is how to calculate TPM and FPKM manually for isoforms. As I understand it, TPM and FPKM are derived from the effective count (which is different from the expected count). Could I get an example calculation of FPKM and TPM?
This would be a great help to me.


Thanks and regards

Ashish Kumar Pathak
DBT- JRF 
National Agri-Food Biotechnology Institute
C-127, Industrial Area
SAS Nagar,Phase 8
Mohali-160071
Punjab,India




Brian Haas

Feb 28, 2015, 2:27:39 PM
to Ashish Kumar Pathak, trinityrn...@googlegroups.com
Hi Ashish,

For genes, I think you'll only get expected counts, not raw counts, because the alignments are all going to the transcripts - and the gene simply represents an aggregate of those transcripts.

FPKM = expected counts per kb of effective length per million fragments mapped.

TPM_i = FPKM_i / (sum of all FPKM) * 1,000,000
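
The two formulas above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the counts are taken from the expected_count column, and "million fragments mapped" is taken to be the sum of those counts over all transcripts. The function name is made up for illustration. The usage example reuses the three rows posted earlier in this thread, so its totals cover only those three transcripts and the results will not match the TPM and FPKM columns RSEM reported for the full transcriptome.

```python
def fpkm_and_tpm(expected_counts, effective_lengths):
    """Return (fpkm, tpm) lists computed from per-transcript expected counts
    and effective lengths (in bases, all assumed to be greater than zero)."""
    # Assumption: total mapped fragments = sum of expected counts
    total_fragments = sum(expected_counts)
    fpkm = [
        count / (eff_len / 1e3) / (total_fragments / 1e6)
        for count, eff_len in zip(expected_counts, effective_lengths)
    ]
    fpkm_sum = sum(fpkm)
    # TPM rescales FPKM so that the values sum to one million
    tpm = [value / fpkm_sum * 1e6 for value in fpkm]
    return fpkm, tpm


# Rows c0_g1_i1, c10001_g1_i1 and c10002_g1_i1 from the table in this thread.
fpkm, tpm = fpkm_and_tpm([1, 7, 1], [36.71, 310.18, 34.74])
for name, f, t in zip(["c0_g1_i1", "c10001_g1_i1", "c10002_g1_i1"], fpkm, tpm):
    print(name, round(f, 2), round(t, 2))
```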

best,

~brian

--
Brian J. Haas
The Broad Institute
http://broadinstitute.org/~bhaas

 

Ashish Kumar Pathak

Mar 13, 2015, 6:40:15 AM
to Brian Haas, trinityrn...@googlegroups.com

Dear Sir / Madam,

 

Greetings!

 

I am a PhD student at an academic institute in India. I have gone through the Trinity protocol for transcriptome analysis and am trying to understand the pipeline Trinity uses. I have the following queries:

 

1)      For expression analysis, RSEM uses the reads mapped to the reference to calculate TPM, FPKM, effective length, expected count and length. I am confused about how the expected count is calculated. According to the literature, RSEM calculates the expected count using a "rescue all reads" method, in which non-unique reads are also mapped, a posterior probability is calculated from their distribution, and this is assigned as the expected count. If a contig is derived only from reads that are distributed equally (in mapping) among all the contigs, then this contig is assigned an expected count of zero.

 

2)      Kindly help me understand how the expected count is calculated from the raw counts.

 

3)      Are quality scores (Phred scores in the FASTQ file) also considered when calculating the expected count?

 

4)      After executing RSEM, Trinity uses the expected counts to calculate log fold change and log counts per million with the edgeR package in R. In this package the library size is calculated as the sum of the expected counts of all contigs. To obtain the effective library size, a trimmed mean is used to calculate a normalization factor. Kindly help me understand how the normalization factors are calculated.

 

5)      Multiplying the normalization factor by the library size gives the effective library size, which is then used to calculate the normalized expected count. Kindly help me understand how the normalized expected count is calculated.

 

6)      Kindly also explain how the TMM_normalized FPKM is calculated.
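
The quantities named in questions 4 and 5 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name is made up, the normalization factor is passed in as a given number (the TMM calculation itself is not shown), and "normalized count" is taken to mean counts per million of the effective library size, which is how edgeR's cpm values are scaled.

```python
def normalized_cpm(expected_counts, norm_factor):
    """Counts per million using the effective library size.

    expected_counts: expected counts for all contigs in one sample.
    norm_factor: that sample's normalization factor (assumed to be given).
    """
    library_size = sum(expected_counts)
    effective_library_size = library_size * norm_factor
    return [count / effective_library_size * 1e6 for count in expected_counts]


# Hypothetical sample with three contigs and a normalization factor of 1.25.
print(normalized_cpm([50, 150, 800], 1.25))
```

Here the library size is 1000, the effective library size is 1250, and the first contig's value is 50 / 1250 * 1,000,000, that is about 40,000.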

 

To process some of the transcriptome data for my doctoral study, I need to resolve the doubts above.

 

I look forward to your response.

 

With regards

  


Ashish Kumar Pathak
DBT- JRF 
National Agri-Food Biotechnology Institute
C-127, Industrial Area
SAS Nagar,Phase 8
Mohali-160071
Punjab,India




Milton Yutaka Nishiyama Junior

Sep 4, 2015, 6:40:34 PM
to trinityrnaseq-users
Hi Brian and All,

I have what may be a very simple question: why does your Trinity paper talk about FPKM and RPKM values, while for differential expression analysis and clustering you use the "expected_count"?

Does the expected_count reflect the gene expression profile better than FPKM?

Also, I would like to compare transcriptome and proteome expression values. Would the expected_count be a better measure for that?

Thank You,

Milton

Brian Haas

Sep 4, 2015, 8:26:17 PM
to Milton Yutaka Nishiyama Junior, trinityrnaseq-users
Hi Milton,

This is because the DE analysis tools such as edgeR are based on read count statistics.  So, the counts are used for DE analysis, and the normalized expression values (FPKM, RPKM, TPM) are used when generating heatmaps or other comparisons.

best,

~b


Milton Yutaka Nishiyama Junior

Sep 4, 2015, 8:38:34 PM
to trinityrnaseq-users, yuta...@gmail.com
Hi Brian,

Thank you for the explanation. I hadn't paid attention to the different steps for DE and for the comparisons.

Best,




