About data on COADREAD

Hiromasa Morikawa

Feb 16, 2016, 10:53:18 AM

to GDAC Pipeline Role Account, David Heiman

Dear GDAC support

Thank you for your contribution to TCGA.

I have a question about data management, especially on COADREAD.

I analyzed the COADREAD data (data set "20141206") last year.
When I analyzed the same data again now, I found that the number of samples in the RNASeqV2 normalized data had dropped from 376 to 263.

Even the latest version of COADREAD includes only around 264 samples in total in the normalized RNA-seq data.
According to information on the web, TCGA includes around 600 RNA-seq samples for COADREAD.
http://firebrowse.org/?cohort=COADREAD&download_dialog=true#

My questions are:
1. Were around 100 RNA-seq samples removed from the data set?
2. Does this kind of data management affect not only upcoming data sets but also already published ones?

Thank you.

Best,

Hiromasa Morikawa
--------------------------------------------------------
Hiromasa Morikawa, Med.Dr/PhD
Unit of Computational Medicine
Center for Molecular Medicine
Department of Medicine, Karolinska Institutet
Karolinska University Hospital, L8:05
S-171 76 Stockholm
Cell phone: + 46 76-292 38 21
Skype: hsmorikawa
Linkedin: http://jp.linkedin.com/in/hmorikawa/
Email: hiromasa...@ki.se

David Heiman

Feb 16, 2016, 11:09:34 AM

to Hiromasa Morikawa, GDAC Pipeline Role Account

Dear Hiromasa,

Please note that there are two types of RNASeqV2 data, Illumina HiSeq and Illumina GA. If you download the mRNAseq_Preprocess archive from the firebrowse.org link you mentioned, you will get them merged together, with all 624 samples.

We also provide ways to access this data other than direct archive download. You may be interested in using the FireBrowse API for direct programmatic access to this data. Please see our tutorial for details.

Regards,

David

--

David Heiman

Run Operations Engineer

TCGA Genome Data Analysis Center

The Broad Institute of MIT and Harvard

Hiromasa Morikawa

Feb 16, 2016, 11:17:22 AM

to David Heiman, GDAC Pipeline Role Account

Dear David

Thank you very much for your quick reply.
I would be grateful if you could explain how the samples are managed.
I am using RTCGAToolbox and downloaded
"http://gdac.broadinstitute.org/runs/stddata__2015_04_02/data/COADREAD/20150402/gdac.broadinstitute.org_COADREAD.Merge_rnaseqv2__illuminaga_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level_3.2015040200.0.0.tar.gz"
through this package in R.

Maybe this is the merged archive you mentioned.
But I can see only 263 samples.
Where does this discrepancy come from?

Sorry for the quick reply and the questions.
But sometimes the results differ from the previous ones for only a specific cancer type, which is critical for publication.

Thanks.

Best,

Hiromasa Morikawa

David Heiman

Feb 16, 2016, 11:29:38 AM

to Hiromasa Morikawa, GDAC Pipeline Role Account

We do not support RTCGAToolbox, so I can't help you with its use. However, you should note from the filename that you have downloaded only the Illumina GA RNASeqV2 data, and not the Illumina HiSeq portion, which explains the missing samples. The "Merge" portion of the filename indicates that it is the merger of all samples of the type illuminaga_rnaseqv2. More detailed nomenclature can be found in our FAQ. The tutorial I linked to has methods to access our data, including older runs. The archive I linked to is the latest merged version of Illumina GA and Illumina HiSeq RNASeqV2 data.

Regards,

David
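
Editorial note on the filename convention described in the message above: the following is a minimal Python sketch, not part of GDAC's tooling, that splits the archive name quoted earlier in this thread into labelled fields so the platform is easy to spot. The field order (data type, platform, center, level, protocol) is an assumption inferred from that single filename, and `describe_merge_archive` is a hypothetical helper name. The `"hiseq"` substring check is likewise only a heuristic.

```python
# Archive name quoted earlier in this thread.
ARCHIVE = (
    "gdac.broadinstitute.org_COADREAD.Merge_rnaseqv2__illuminaga_rnaseqv2__"
    "unc_edu__Level_3__RSEM_genes_normalized__data.Level_3.2015040200.0.0.tar.gz"
)


def describe_merge_archive(filename):
    """Split a Merge archive name into labelled fields.

    Assumption: the fields after ".Merge_" are separated by double
    underscores, in the order data type, platform, center, level, protocol.
    This is inferred from one example, not from a published specification.
    """
    prefix, marker, rest = filename.partition(".Merge_")
    if not marker:
        raise ValueError("not a Merge archive name: " + filename)
    fields = rest.split("__")
    if len(fields) < 5:
        raise ValueError("unexpected field layout: " + filename)
    data_type, platform, center, level, protocol = fields[:5]
    return {
        # The cohort is the last underscore-separated token before ".Merge_".
        "cohort": prefix.rsplit("_", 1)[-1],
        "data_type": data_type,
        "platform": platform,
        "center": center,
        "level": level,
        "protocol": protocol,
    }


info = describe_merge_archive(ARCHIVE)
print(info["cohort"], info["platform"], info["protocol"])
# Heuristic (assumption): a HiSeq archive would carry "hiseq" in its platform field.
print("contains HiSeq samples:", "hiseq" in info["platform"].lower())
```

For the archive in question the platform field reads `illuminaga_rnaseqv2`, which matches the explanation that only the Illumina GA portion was downloaded.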

Hiromasa Morikawa

Feb 16, 2016, 11:34:57 AM

to David Heiman, GDAC Pipeline Role Account

Dear David

Thank you again for the quick reply.
I understand completely.
So, I downloaded only the GA part, without HiSeq.

Sorry for one more question.
Does "merge" mean simply combining the two sample sets, or re-normalizing them for the merge?

Best,

Hiromasa Morikawa

David Heiman

Feb 16, 2016, 11:43:51 AM

to Hiromasa Morikawa, GDAC Pipeline Role Account

Hi Hiromasa,

Most TCGA data comes as one set of files per sample. Our Merge pipelines take all samples of a given sample type for a cohort and merge them into one file (e.g. RNASeq, with Illumina GA and HiSeq, each further subdivided into genes, isoforms, etc.).

When I use "merge" in the context of mRNAseq_Preprocess, I mean the initial processing of data for use in downstream analyses: in this case, merging both Illumina types and doing any necessary processing/normalization (see our Documentation).

Regards,

David
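
Editorial note on the "merge" step described in the message above: the sketch below illustrates only the simplest reading, combining per-platform tables into one table keyed by sample. It uses toy data and hypothetical names (`merge_sample_tables`, the sample and gene labels), and it does not reproduce the processing/normalization that mRNAseq_Preprocess performs, which is covered in the GDAC documentation. Raising an error on a duplicated sample is a choice made for this sketch, not a statement about how GDAC resolves such cases.

```python
def merge_sample_tables(*tables):
    """Combine several {sample_id: {gene: value}} tables into one.

    Values are copied unchanged; no re-normalization is attempted here.
    """
    merged = {}
    for table in tables:
        for sample_id, values in table.items():
            if sample_id in merged:
                # Sketch-only policy: refuse to guess which copy to keep.
                raise ValueError("sample appears in more than one table: " + sample_id)
            merged[sample_id] = dict(values)
    return merged


# Toy stand-ins for the Illumina GA and Illumina HiSeq sample sets.
ga_table = {"ga_sample_1": {"GENE_A": 10.0, "GENE_B": 0.5}}
hiseq_table = {
    "hiseq_sample_1": {"GENE_A": 8.0, "GENE_B": 1.5},
    "hiseq_sample_2": {"GENE_A": 12.0, "GENE_B": 0.0},
}

merged = merge_sample_tables(ga_table, hiseq_table)
print(len(merged), "samples after merging")
```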

Hiromasa Morikawa

Feb 16, 2016, 12:08:08 PM

to David Heiman, GDAC Pipeline Role Account

Dear David

Thanks again.
I understand GDAC much better than I did an hour ago!
All that remains for me to do is read the documentation you mentioned.

Thanks a lot.
Have a nice day.

Hiromasa Morikawa

> On 2016/02/16 17:43, David Heiman <dhe...@broadinstitute.org> wrote:

Hiromasa Morikawa

Feb 17, 2016, 6:14:24 AM

to David Heiman, GDAC Pipeline Role Account

Dear David

Thank you for your help, and sorry for one more question.
From
gdac.broadinstitute.org_COADREAD.mRNAseq_Preprocess.Level_3.2015110100.0.0.tar.gz
I get several files, including
COADREAD.uncv2.mRNAseq_RSEM_normalized_log2_PARADIGM.txt
COADREAD.uncv2.mRNAseq_RSEM_normalized_log2.txt
What is the difference between these two?
I found on your home page that PARADIGM is a pipeline for analyzing expression data and copy number data.
Does this mean that
COADREAD.uncv2.mRNAseq_RSEM_normalized_log2_PARADIGM.txt
is an output of this pipeline, and that I should normally use
COADREAD.uncv2.mRNAseq_RSEM_normalized_log2.txt
as the normalized mRNA expression?

David Heiman

Feb 17, 2016, 2:24:48 PM

to Hiromasa Morikawa, GDAC Pipeline Role Account

Hi Hiromasa,

The PARADIGM file is actually meant as input to PARADIGM; it is not produced by PARADIGM.

Unless you plan on running PARADIGM analyses yourself, you are correct in choosing COADREAD.uncv2.mRNAseq_RSEM_normalized_log2.txt.

Regards,

David
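
Editorial note on the file choice discussed above: the following Python sketch picks the non-PARADIGM expression file from the two names quoted in this thread and counts the sample columns in a table header, which is the check that started this discussion. The helper names are hypothetical. The header layout (one tab-delimited header row whose first cell labels the gene column) is an assumption, and the in-memory table is toy data rather than real TCGA content.

```python
import csv
import io

# File names quoted earlier in this thread.
MEMBERS = [
    "COADREAD.uncv2.mRNAseq_RSEM_normalized_log2_PARADIGM.txt",
    "COADREAD.uncv2.mRNAseq_RSEM_normalized_log2.txt",
]


def pick_expression_file(names):
    """Return the single name ending in '_normalized_log2.txt'.

    The PARADIGM input file ends in '_PARADIGM.txt', so it is skipped.
    """
    matches = [name for name in names if name.endswith("_normalized_log2.txt")]
    if len(matches) != 1:
        raise ValueError("expected exactly one match, found %d" % len(matches))
    return matches[0]


def count_samples(handle):
    """Count distinct sample columns in the first row of a tab-delimited table.

    Assumption: the first cell labels the gene column and every remaining
    cell is a sample identifier.
    """
    header = next(csv.reader(handle, delimiter="\t"))
    return len(set(header[1:]))


print(pick_expression_file(MEMBERS))

# Toy table standing in for the real file.
toy = io.StringIO("gene\tsample_1\tsample_2\tsample_3\nGENE_A\t1.0\t2.0\t3.0\n")
print(count_samples(toy), "samples in the toy table")
```

On a real download, the same `count_samples` call could be given a file opened in text mode in place of the `io.StringIO` object, provided the header assumption holds for that file.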

Hiromasa Morikawa

Feb 17, 2016, 2:48:46 PM

to David Heiman, GDAC Pipeline Role Account

Dear David

Thank you very much.
I will use the data you mentioned.
I am now analyzing the latest version of the data set and getting good results.

Thanks.

Hiromasa M.

> On 2016/02/17 20:24, David Heiman <dhe...@broadinstitute.org> wrote:
