Re: [rsem-users] Differential expression analysis on RSEM processed data

Ning Leng

Mar 26, 2013, 11:31:00 AM
to rsem-...@googlegroups.com
Hi Dvir,

You might consider EBSeq
http://www.biostat.wisc.edu/~kendzior/EBSEQ/

RSEM-EBSeq pipeline
http://deweylab.biostat.wisc.edu/rsem/README.html#de

Best,
Ning

On Tue, Mar 26, 2013 at 1:10 AM, Dvir <dvir...@gmail.com> wrote:
> Hello,
>
> Which methods can be applied on RSEM output files to detect differentially
> expressed genes?
> It seems that methods like EdgeR and DESeq require gene level raw count and
> therefore cannot be applied to RSEM outputs.
>
> I have RSEM files of gene level downloaded from TCGA web site.
>
> Many thanks,
> Dvir
>
> --
> You received this message because you are subscribed to the Google Groups
> "RSEM Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to rsem-users+...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
>
>



--
Ning Leng
University of Wisconsin Madison
Department of Statistics
4720 Medical Sciences Center
1300 University Avenue
Madison, Wisconsin 53706

Colin Dewey

Mar 26, 2013, 11:36:03 AM
to rsem-...@googlegroups.com
Hi Dvir,

In the RSEM output, you should find an "expected count" column. After rounding these values to the nearest integer, you can use them with differential expression packages such as EBSeq, edgeR, and DESeq.
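
The rounding step described above can be sketched in Python as follows. This is only an illustration: the function name is made up for this example, and rounding halves upward is an assumption (Python's built-in `round` rounds halves to the nearest even integer instead). The values are the `raw_count` entries quoted later in this thread.

```python
import math


def round_expected_counts(expected_counts):
    """Round expected counts to the nearest integer (halves round up)."""
    return [int(math.floor(x + 0.5)) for x in expected_counts]


# Values copied from the raw_count column posted later in the thread.
counts = [0, 12, 123, 1360, 17614.94, 20212.19, 17775.81]
print(round_expected_counts(counts))
# prints [0, 12, 123, 1360, 17615, 20212, 17776]
```

The rounded integers are what count-based packages such as edgeR or DESeq would then take as input.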

If that column is not available in the TCGA data, you might consider contacting TCGA to have them provide that data.

Best,
Colin

b...@cs.wisc.edu

Mar 26, 2013, 11:49:48 AM
to rsem-...@googlegroups.com
Hi Dvir,

As Ning mentioned, RSEM includes EBSeq and associated scripts to detect DE
genes for you. Also, you should be able to give the gene level expected
counts (rounded to nearest integer) to edgeR or DESeq. Although it is not
ideal, it is what most people do.

Best,
Bo

Dvir

Mar 28, 2013, 12:35:09 PM
to rsem-...@googlegroups.com
Hi and thanks...
 
As I wrote to Colin, my problem is that I need to work with specific data files made available by TCGA. They are in some kind of RSEM output format, but I'm not sure whether they contain the 'expected count' values.
 
My other option is to download the extremely large sequencing data files and run RSEM and then EBSeq locally, but we're talking about hundreds of very large files, so this may be challenging. I'd rather use the already provided Level 3 TCGA output files, even though I'm not sure how to interpret their format.
 
Thanks,
Dvir

Dvir

Mar 28, 2013, 1:01:18 PM
to rsem-...@googlegroups.com
Hi again,
 
It seems like my previous post didn't get through so I'll repost...
 
The following files are available to me on the TCGA web site; all are supposed to be RSEM output files:

unc.edu__IlluminaHiSeq_RNASeqV2__TCGA-A7-A0D9-01A-31R-A056-07__expression_exon.txt

unc.edu__IlluminaHiSeq_RNASeqV2__TCGA-A7-A0D9-01A-31R-A056-07__expression_junction.txt

unc.edu__IlluminaHiSeq_RNASeqV2__TCGA-A7-A0D9-01A-31R-A056-07__expression_rsem_gene.txt

unc.edu__IlluminaHiSeq_RNASeqV2__TCGA-A7-A0D9-01A-31R-A056-07__expression_rsem_gene_normalized.txt

unc.edu__IlluminaHiSeq_RNASeqV2__TCGA-A7-A0D9-01A-31R-A056-07__expression_rsem_isoforms.txt

unc.edu__IlluminaHiSeq_RNASeqV2__TCGA-A7-A0D9-01A-31R-A056-07__expression_rsem_isoforms_normalized.txt

 
For the differential analysis I wish to conduct, I have focused on the '..._expression_rsem_gene.txt' files, which contain the gene-level unnormalized output. These files have the following format:
 

barcode                       gene_id      raw_count  scaled_estimate  transcript_id
TCGA-AN-A03X-01A-21R-A00Z-07  ?|100130426  0          0                uc011lsn.1
TCGA-AN-A03X-01A-21R-A00Z-07  ?|100133144  12         5.65E-07         uc010unu.1,uc010uoa.1
TCGA-AN-A03X-01A-21R-A00Z-07  ?|100134869  0          0                uc002bgz.2,uc002bic.2
TCGA-AN-A03X-01A-21R-A00Z-07  ?|10357      123        1.34E-05         uc010zzl.1
TCGA-AN-A03X-01A-21R-A00Z-07  ?|10431      1360       6.68E-05         uc001jiu.2,uc010qhg.1
TCGA-AN-A03X-01A-21R-A00Z-07  ?|136542     0          0                uc011krn.1
TCGA-AN-A03X-01A-21R-A00Z-07  HLA-A|3105   17614.94   0.00063316       uc003nok.2,uc003nol.2,uc003nom.2,uc003non.2,uc003noo.2,uc010jrq.2,uc010jrr.2,uc010klp.2,uc011dmc.1,uc011dmd.1
TCGA-AN-A03X-01A-21R-A00Z-07  HLA-B|3106   20212.19   0.000956254      uc003ntf.2,uc003ntg.1,uc003nth.2,uc003nti.1,uc010jsm.1,uc010jsn.1,uc010jso.2,uc011dnk.1
TCGA-AN-A03X-01A-21R-A00Z-07  HLA-C|3107   17775.81   0.000668874      uc003nsx.2,uc003nsy.2,uc003nsz.2,uc003nta.2,uc003ntb.2,uc003ntc.1,uc010jsl.2,uc011dnj.1,uc011dnl.1

 

You mentioned that I should use the 'expected count' values, but the only similar thing is the 'raw_count' column in the above file.
 
I wonder if anyone knows this file format and can tell me whether I can use the 'raw_count' column for DE analysis (I'm not sure whether this is the original RSEM output format or a TCGA format).
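
For anyone wanting to pull the 'raw_count' column out of such a file programmatically, here is a minimal Python sketch. It assumes the file is tab-delimited with a header row naming the columns as shown above; the delimiter, the function name, and the shortened transcript_id values in the sample text are assumptions made for illustration.

```python
import csv
import io


def read_raw_counts(handle):
    """Return {gene_id: raw_count} from a tab-delimited gene-level file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["gene_id"]: float(row["raw_count"]) for row in reader}


# In-memory stand-in for a real file; transcript lists are shortened.
sample = io.StringIO(
    "barcode\tgene_id\traw_count\tscaled_estimate\ttranscript_id\n"
    "TCGA-AN-A03X-01A-21R-A00Z-07\t?|10357\t123\t1.34E-05\tuc010zzl.1\n"
    "TCGA-AN-A03X-01A-21R-A00Z-07\tHLA-A|3105\t17614.94\t0.00063316\tuc003nok.2\n"
)
counts = read_raw_counts(sample)
print(counts["HLA-A|3105"])
```

Because the columns are looked up by name, the same function should work whether or not the barcode column is present in the downloaded file.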
 
Many thanks,
Dvir.
 
 


Bo Li

Mar 28, 2013, 1:28:47 PM
to rsem-...@googlegroups.com
Hi Dvir,

Can you tell me where I can find the RSEM-like outputs you mentioned? Then maybe I can give you some suggestions on how to run DE tools on them.

Best,
Bo

Bo Li

Mar 28, 2013, 1:33:26 PM
to rsem-...@googlegroups.com, Dvir
Hi Dvir,

I think that you are safe to use the "raw_count" column. These are the expected counts that RSEM generates. You can see that there are two digits after the decimal point, which suggests these are not raw counts (integers) but expected counts (real numbers). I'm not sure why they renamed the fields in that way...
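
The same check can be done mechanically. The sketch below (a hypothetical helper, not part of RSEM) reports whether a column contains any non-integer values, which is the hint used above to tell expected counts from raw integer counts.

```python
def has_fractional_values(values, tol=1e-9):
    """Return True if any value differs from its nearest integer by more than tol."""
    return any(abs(v - round(v)) > tol for v in values)


# raw_count values from the file excerpt posted earlier in the thread.
raw_count = [0, 12, 0, 123, 1360, 0, 17614.94, 20212.19, 17775.81]
print(has_fractional_values(raw_count))
# prints True
```

A result of False would not prove the values are raw counts, since expected counts can happen to be whole numbers for genes with no ambiguous reads.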

Best,
Bo

Dvir

Mar 28, 2013, 4:31:33 PM
to rsem-...@googlegroups.com
Thanks Bo.
 
So I will use the raw_count column as input for DESeq/EdgeR after rounding the values. Can I also use them for EBSeq or do I need the entire RSEM output for that?
 
One more question, if I may: for clustering analysis, should I use the normalized files? They have the following format:
 

barcode                       gene_id      normalized_count
TCGA-A8-A082-01A-11R-A00Z-07  ?|100130426  0
TCGA-A8-A082-01A-11R-A00Z-07  ?|100133144  9.372
TCGA-A8-A082-01A-11R-A00Z-07  ?|100134869  1.028
TCGA-A8-A082-01A-11R-A00Z-07  ?|10357      130.64
TCGA-A8-A082-01A-11R-A00Z-07  ?|10431      1015.2

Thanks,
Dvir

Ning Leng

Mar 28, 2013, 9:18:42 PM
to rsem-...@googlegroups.com
Hi Dvir,

EBSeq doesn't require integer inputs. So you could use the raw_count matrix.
And I think it makes more sense to use normalized values for clustering analysis.
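
As a rough illustration of preparing normalized values for clustering, the Python sketch below log-transforms two samples and computes a correlation-based distance between them. The log2(x + 1) transform and the 1 - Pearson correlation distance are common choices assumed here, not something prescribed in this thread, and the second sample's values are invented for the example.

```python
import math


def log2_transform(values):
    """Apply log2(x + 1) so that zeros stay finite and large values are damped."""
    return [math.log2(v + 1.0) for v in values]


def pearson_distance(x, y):
    """Return 1 - Pearson correlation between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return 1.0 - cov / (sd_x * sd_y)


# normalized_count values posted above, plus a made-up second sample.
sample_a = log2_transform([0, 9.372, 1.028, 130.64, 1015.2])
sample_b = log2_transform([0, 7.5, 2.0, 150.0, 900.0])
print(pearson_distance(sample_a, sample_b))
```

A matrix of such pairwise distances is the usual input to a hierarchical clustering routine.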

Best,
Ning


Ning Leng

Mar 29, 2013, 10:11:40 AM
to rsem-...@googlegroups.com
Hi Dvir,

I think raw_count is the expected count from RSEM.

In the DESCRIPTION.txt in the data folder:

RSEM abundance estimation results in two files, gene and isoform level quantification. More
information regarding the content of these output files can be found on the RSEM website.
The format indicates the feature name in column 1, estimated count in column 2, scaled
estimate in column 3, and contributing isoforms in column 4 (gene level only). These files
will have the following extensions:

rsem.genes.results
rsem.isoforms.results

I believe the "estimated count" is the expected count.

Best,
Ning




On Fri, Mar 29, 2013 at 1:34 AM, Dvir <dvir...@gmail.com> wrote:
Hi,
 
I use the Data Portal's data matrix to download the relevant RNA-Seq data.
 
 
Using this link -
I choose
Data Type -> RNASeqV2
Data Level -> Level 3
Availability -> Available
 
Then, in the data matrix, I select a few samples and can build an archive containing the files I mentioned for each sample.
 
Thanks !
Dvir
 


b...@cs.wisc.edu

Mar 29, 2013, 1:15:35 PM
to rsem-...@googlegroups.com
Hi Dvir,

It appears that they changed the format somehow. But the 'raw_count' field
should be the expected count field, and 'scaled_estimate' should be the tau
values.
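
If scaled_estimate does hold tau (the estimated fraction of transcripts belonging to each gene), it can be converted to TPM by multiplying by one million. A small Python sketch, with a function name chosen for this example:

```python
def tau_to_tpm(tau):
    """Convert a transcript fraction (tau) to transcripts per million."""
    return tau * 1e6


# scaled_estimate for gene ?|100133144 in the excerpt posted earlier.
print(tau_to_tpm(5.65e-07))
```

This relies on the assumption stated above that TCGA left the tau values unchanged apart from renaming the column.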

Unfortunately, you cannot use 'rsem-generate-data-matrix' to collect the
data you need. But after you have the data matrix, all following steps
should be the same as RSEM's tutorial. In addition, you should be able to
modify 'rsem-generate-data-matrix' to collect the required data from TCGA
files.
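
As an alternative to modifying the script, the data matrix can be assembled directly from the per-sample TCGA files. The Python sketch below is not a port of 'rsem-generate-data-matrix'; it is an independent illustration that assumes tab-delimited files with 'gene_id' and 'raw_count' columns. The sample names and the second sample's counts are made up.

```python
import csv
import io


def build_count_matrix(sample_handles):
    """Merge per-sample gene files into one genes x samples matrix.

    sample_handles maps a sample name to an open tab-delimited file with
    'gene_id' and 'raw_count' columns (an assumed layout).
    """
    per_sample = {}
    for name, handle in sample_handles.items():
        reader = csv.DictReader(handle, delimiter="\t")
        per_sample[name] = {r["gene_id"]: float(r["raw_count"]) for r in reader}
    if not per_sample:
        return [], [], []
    sample_names = list(per_sample)
    # Keep only genes reported in every sample, in a stable sorted order.
    shared = set.intersection(*(set(c) for c in per_sample.values()))
    gene_ids = sorted(shared)
    rows = [[per_sample[s][g] for s in sample_names] for g in gene_ids]
    return gene_ids, sample_names, rows


header = "gene_id\traw_count\n"
files = {
    "sample_A": io.StringIO(header + "?|10357\t123\nHLA-A|3105\t17614.94\n"),
    "sample_B": io.StringIO(header + "?|10357\t98\nHLA-A|3105\t15002.50\n"),
}
gene_ids, sample_names, rows = build_count_matrix(files)
print("\t".join(["gene_id"] + sample_names))
for gene, row in zip(gene_ids, rows):
    print("\t".join([gene] + [str(v) for v in row]))
```

With real data, the in-memory strings would be replaced by the opened '..._expression_rsem_gene.txt' files, one per sample.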

Best,
Bo

> Hi and thanks for your help.
>
> I'm not sure how easy it would be to get the TCGA data in a different
> format so I want to be sure the currently available one isn't the one I
> need.
> TCGA data contains the following files for each sample:
>
> unc.edu__IlluminaHiSeq_RNASeqV2__TCGA-A7-A0D9-01A-31R-A056-07__expression_exon.txt
>
> unc.edu__IlluminaHiSeq_RNASeqV2__TCGA-A7-A0D9-01A-31R-A056-07__expression_junction.txt
>
> unc.edu__IlluminaHiSeq_RNASeqV2__TCGA-A7-A0D9-01A-31R-A056-07__expression_rsem_gene.txt
>
> unc.edu__IlluminaHiSeq_RNASeqV2__TCGA-A7-A0D9-01A-31R-A056-07__expression_rsem_gene_normalized.txt
>
> unc.edu__IlluminaHiSeq_RNASeqV2__TCGA-A7-A0D9-01A-31R-A056-07__expression_rsem_isoforms.txt
>
> unc.edu__IlluminaHiSeq_RNASeqV2__TCGA-A7-A0D9-01A-31R-A056-07__expression_rsem_isoforms_normalized.txt
>
> I focused on the *..._expression_rsem_gene.txt* files for the differential
> analysis because they contain a column called 'raw_count' which seems
> closest to the 'expected count' column you mentioned.
>
>
> Here's the format of this file :
>
>
> barcode
>
> gene_id
>
> *raw_count*
> Is this format familiar to you or is it some kind of TCGA format? Do you
> think the raw_count column on the above file format is the one I can use
> instead of expected_count to run DESeq/EdgeR ?
>
> Many thanks,
> Dvir