How to interrogate / download RNA-seq data from the prostate TCGA, Robinson 2015, and Trento/Cornell datasets

Alfonso Urbanucci

Feb 16, 2018, 9:02:30 AM

to cBioPortal for Cancer Genomics Discussion Group

Hi,

I would like to compare gene expression between primary and metastatic tumours across the three datasets mentioned above (TCGA, Robinson 2015, and Trento/Cornell). I know that directly comparing z-scores or RPKM values per gene across studies is not the way to go. Moreover, at the moment the data do not seem to be accessible on cBioPortal.

Can you please indicate the fastest way to interrogate single genes (if there is a publicly available tool)?

Alternatively, could you indicate where to download the total number of reads per gene (e.g. RSEM/STAR outputs)? I can then normalize and compare the data myself.

Thank you,
Alfonso

Nikolaus Schultz

Feb 16, 2018, 9:34:06 AM

to Alfonso Urbanucci, cBioPortal for Cancer Genomics Discussion Group

Hi Alfonso,

There is currently no way to merge expression data from different studies in cBioPortal (unless they are from TCGA - in this case, since they are all processed and normalized the same way, we show expression levels of individual genes in a separate tab of cross-cancer queries).

You can download gene-level expression data for each study from our data hub:

https://github.com/cBioPortal/datahub/tree/master/public
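As a rough sketch of how a single gene could be pulled out of one of these downloaded files: the snippet below assumes the usual tab-delimited layout of the data hub expression files (a `Hugo_Symbol` column, an optional `Entrez_Gene_Id` column, then one column per sample). The function name `gene_expression` and the toy table are made up for illustration, and the values are not real data.

```python
import csv
import io

# Columns that identify the gene rather than a sample (assumed layout).
ID_COLUMNS = {"Hugo_Symbol", "Entrez_Gene_Id"}


def gene_expression(handle, gene):
    """Return {sample_id: value} for one gene from a tab-delimited table.

    Cells that are not numeric (e.g. "NA" or empty) are returned as None.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        if row["Hugo_Symbol"] != gene:
            continue
        values = {}
        for column, cell in row.items():
            if column in ID_COLUMNS:
                continue
            try:
                values[column] = float(cell)
            except (TypeError, ValueError):
                values[column] = None
        return values
    raise KeyError(f"{gene} not found")


# Toy table standing in for a downloaded file; the values are invented.
toy = io.StringIO(
    "Hugo_Symbol\tEntrez_Gene_Id\tSAMPLE-01\tSAMPLE-02\n"
    "AR\t367\t1520.5\t980.0\n"
    "KLK3\t354\t30211.0\tNA\n"
)
print(gene_expression(toy, "KLK3"))
```

For a real file, the same function could be called on an opened file handle, e.g. `open("data_RNA_Seq_v2_expression_median.txt", newline="")`. The values come back in whatever unit that study used, so this only helps with looking up a gene within one study.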

Niki.

--
You received this message because you are subscribed to the Google Groups "cBioPortal for Cancer Genomics Discussion Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cbioportal+...@googlegroups.com.
To post to this group, send email to cbiop...@googlegroups.com.
Visit this group at https://groups.google.com/group/cbioportal.
To view this discussion on the web visit https://groups.google.com/d/msgid/cbioportal/85d60e46-faaf-440c-ac95-e86f542d95c6%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Alfonso Urbanucci

Feb 16, 2018, 2:39:28 PM

to Nikolaus Schultz, cBioPortal for Cancer Genomics Discussion Group

Thank you Nikolaus,

I have now downloaded the files:

TCGA:
From what I can tell, the “data_RNA_Seq_v2_expression_median.txt” file contains the RSEM values, described here:
https://wiki.nci.nih.gov/display/TCGA/RNASeq+Version+2
The meta file says:
profile_description: Expression levels for 20532 genes in 550 prad cases (RNA Seq V2 RSEM).
Something to keep in mind: you can alternatively get different versions of the TCGA data at these locations:
https://portal.gdc.cancer.gov/ to download the STAR 2-pass HTSeq counts or FPKM data.
http://gdac.broadinstitute.org/ to download RNAseq-v2 RSEM (I believe this should be similar to what is at cBioPortal)

Robinson:
They have two versions of RNA expression based on two different types of libraries. One is polyA, where the library method includes a polyA-selection step and generates a sequencing library from the mRNA. The other library instead “captures” the coding genes from the total RNA using exonic probes, which is a different way of removing ribosomal RNA but might also retain unprocessed/unspliced RNA.
I typically use the polyA library data, but you could compare and decide which one you think is better. They describe this in the extended version of the article (http://www.cell.com/cell/pdfExtended/S0092-8674(15)00548-6)
data_RNA_Seq_expression_median.txt
profile_name: mRNA expression / polyA (RNA Seq RPKM)
data_RNA_Seq_expression_capture.txt
profile_description: mRNA expression from capture (RNA Seq RPKM)

Trento/Cornell:
data_expression_median.txt
profile_description: Expression levels.
I believe these are FPKMs, pipeline described here: https://www.nature.com/articles/nm.4045
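To keep track of which unit each file claims, the `key: value` lines of the meta files can be read programmatically. The sketch below is only an illustration: `parse_meta` and `stated_unit` are hypothetical helper names, the study labels are informal, and the strings are just the description lines quoted above. A result of "not stated" means the unit has to be confirmed from the publication.

```python
# Unit keywords to look for in the profile text.
UNITS = ("RSEM", "RPKM", "FPKM")


def parse_meta(text):
    """Parse 'key: value' lines (as in the meta files quoted above) into a dict."""
    meta = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split at the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def stated_unit(meta):
    """Return the first unit keyword named in the profile text, else 'not stated'."""
    description = " ".join(
        meta.get(key, "") for key in ("profile_name", "profile_description")
    )
    return next((unit for unit in UNITS if unit in description), "not stated")


# Description lines copied from the message above.
quoted = {
    "TCGA": "profile_description: Expression levels for 20532 genes "
            "in 550 prad cases (RNA Seq V2 RSEM).",
    "Robinson polyA": "profile_name: mRNA expression / polyA (RNA Seq RPKM)",
    "Robinson capture": "profile_description: mRNA expression from capture (RNA Seq RPKM)",
    "Trento/Cornell": "profile_description: Expression levels.",
}

for study, text in quoted.items():
    print(f"{study}: {stated_unit(parse_meta(text))}")
```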

In your opinion, would I be able to compare these datasets as they are, using the FPKM data per gene? Or should I download the BAM or FASTQ files somewhere (it would be great if you could indicate a place where I could download them in bulk) and run the analysis all over again?
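For reference, RPKM/FPKM values can be rescaled within each sample to TPM by dividing by the sample's total and multiplying by one million. The sketch below shows only that arithmetic on invented numbers (`fpkm_to_tpm` and the gene names are hypothetical). It puts each sample on the same within-sample scale, but it does not make differently processed studies comparable and does not remove batch effects.

```python
def fpkm_to_tpm(fpkm_by_gene):
    """Rescale one sample's FPKM (or RPKM) values so that they sum to 1e6.

    The input must contain all genes quantified for the sample; rescaling a
    subset of genes would not give true TPM values.
    """
    total = sum(fpkm_by_gene.values())
    if total <= 0:
        raise ValueError("sample has no positive expression values")
    return {gene: value / total * 1e6 for gene, value in fpkm_by_gene.items()}


# Invented values for a three-gene "sample", for illustration only.
sample = {"GENE_A": 10.0, "GENE_B": 30.0, "GENE_C": 60.0}
tpm = fpkm_to_tpm(sample)
print(tpm)
```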

Thank you in advance,

Alfonso

Pichai Raman

Feb 16, 2018, 2:44:39 PM

to Alfonso Urbanucci, Nikolaus Schultz, cBioPortal for Cancer Genomics Discussion Group

Hi Alfonso,

My recommendation: if you want to do a formal analysis that combines data from different studies, it's best to start from FASTQ or BAM files. This way you can confirm that the pipeline (lots of pipelines can report FPKM) and the parameters/reference files (hg19, hg38, etc.) are the same. Alternatively, if there are enough samples in each study, you can do more of a meta-analysis, especially if you have a few more studies. Hope this helps.

Cheers,

Pichai

--

The purpose of a fish trap is to catch fish, and when the fish are caught, the trap is forgotten. The purpose of a rabbit snare is to catch rabbits. When the rabbits are caught, the snare is forgotten. The purpose of words is to convey ideas. When the ideas are grasped, the words are forgotten. Where can I find a man who has forgotten words? He is the one I would like to talk to. -Chuang Tzu...

Nikolaus Schultz

Feb 16, 2018, 3:26:51 PM

to Alfonso Urbanucci, Pichai Raman, cBioPortal for Cancer Genomics Discussion Group

Hi Alfonso,

We don’t keep track of that information - but you can go to the original publications and find the dbGaP accession numbers. TCGA data are available through the Genomic Data Commons.

However, even when reprocessing all data uniformly, you will likely still have batch effects related to the data source. See our study here:

https://www.biorxiv.org/content/early/2017/08/04/110734

Niki.

On Feb 16, 2018, at 3:06 PM, Alfonso Urbanucci <alfonsou...@gmail.com> wrote:

Thank you Pichai,
could you please indicate the fastest way to get the fastq/bam files for all these studies then?
Do you have a repository I could get the link to?

Thank you in advance,
Alfonso

Pichai Raman

Feb 16, 2018, 3:41:24 PM

to Alfonso Urbanucci, Nikolaus Schultz, cBioPortal for Cancer Genomics Discussion Group

Hi Alfonso,

For most of these data sets you will need to get access to the study through either dbGaP or EGA, depending on where the authors deposited it. For TCGA specifically, once you have access to the raw data through dbGaP, you can query and download from the GDC.

Cheers,

Pichai

Pichai Raman

Feb 16, 2018, 3:56:25 PM

to Alfonso Urbanucci, Nikolaus Schultz, cBioPortal for Cancer Genomics Discussion Group

Hi Alfonso,

You will first need a dbGaP/eRA Commons account, and you will need to formally apply through dbGaP for access to that specific study. It can take a little while to get access after applying, and your IT director will have to be on board.

https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=GeneralAAInstructions.pdf

Cheers,

Pichai

On Fri, Feb 16, 2018 at 3:49 PM, Alfonso Urbanucci <alfonsou...@gmail.com> wrote:

Thank you Pichai and Nikolaus for your help,

so: e.g. once I am here:
https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000909.v1.p1&phv=252257&phd=&pha=&pht=5250&phvf=&phdf=&phaf=&phtf=&dssp=1&consent=&temp=1
how do I manage to get access to this particular study?
thank you,
Alfonso

Alfonso Urbanucci

Feb 17, 2018, 8:43:56 PM

to Pichai Raman, Nikolaus Schultz, cBioPortal for Cancer Genomics Discussion Group

Thank you Pichai,

will do!

have a nice weekend,

Alfonso
