TCGA TARGET GTEx dataset


Christian Mazzeo

May 23, 2018, 11:14:42 AM
to UCSC Xena and Cancer Genomics Browser
Hi!
My question is which of the gene expression RNAseq datasets I should use for full compatibility between TCGA and GTEx in a differential expression analysis:
RSEM norm_count, RSEM expected_count, or RSEM expected_count (DESeq2 standardized)?
Thanks in advance!

Chris

Mary Goldman

May 23, 2018, 3:36:15 PM
to Christian Mazzeo, UCSC Xena and Cancer Genomics Browser
Hi Chris,

All of those will allow full compatibility between TCGA and GTEx. 

Best,
Mary
-------------
Mary Goldman
UCSC Xena
UC Santa Cruz Genomics Institute

--
You received this message because you are subscribed to the Google Groups "UCSC Xena and Cancer Genomics Browser" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ucsc-cancer-genomics-browser+unsubs...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Christian Mazzeo

Jun 21, 2018, 12:35:25 PM
to UCSC Xena and Cancer Genomics Browser
Hi Mary, 
Thanks for the answer. I read the paper carefully; let me check that I understand it. Basically, it says that all RNA-seq files have been processed with the same pipeline to make them compatible.
Here are my questions:
Is the RSEM expected_count file the output of the pipeline?
Is the RSEM expected_count file the output of DESeq2 with normalization, when the input is RSEM expected_count?
What I don't get is which tool or transform generates the RSEM norm_count file.
Thanks in advance!
Chris 

Mary Goldman

Jun 26, 2018, 11:13:47 AM
to Christian Mazzeo, UCSC Xena and Cancer Genomics Browser
Hi Chris,

Please see my answers below. Write in again if you have any further questions!

Best,
Mary
-------------
Mary Goldman
UCSC Xena
UC Santa Cruz Genomics Institute

---------- Forwarded message ----------
From: Christian Mazzeo <christi...@icloud.com>
Date: Thu, Jun 21, 2018 at 5:09 AM
Subject: Re: [ucsc-cancer-genomics-browser] TCGA TARGET GTEx dataset
To: UCSC Xena and Cancer Genomics Browser <ucsc-cancer-genomics-browser@googlegroups.com>


Hi Mary, 
Thanks for the answer. I read the paper carefully; let me check that I understand it. Basically, it says that all RNA-seq files have been processed with the same pipeline to make them compatible.
Here are my questions:
Is the RSEM expected_count file the output of the pipeline?

-- RSEM expected_count file is one of the outputs.

Is the RSEM expected_count file the output of DESeq2 with normalization, when the input is RSEM expected_count?

-- RSEM expected_count is NOT "the output of DESeq2 with normalization". It is one of the standard outputs from RSEM, log2(x+1) transformed. The RSEM expected_count file was the input for DESeq2.

What I don't get is which tool or transform generates the RSEM norm_count file.

-- Upper quartile normalization.

Thanks in advance!
Chris 
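
To illustrate the answer about the log transform: assuming the downloadable RSEM expected_count values are log2(x+1) transformed (base 2 is stated elsewhere in this thread), they can be mapped back to the count scale with 2^x - 1. The sketch below is plain Python with made-up values; the function names are hypothetical and are not part of Xena or RSEM. The rounding step is an assumption for tools that expect integer counts, since RSEM expected counts are generally fractional.

```python
import math


def log2p1(count: float) -> float:
    """Forward transform as described in this thread: log2(x + 1)."""
    return math.log2(count + 1.0)


def inverse_log2p1(value: float) -> float:
    """Undo log2(x + 1) to get back to the count scale."""
    return 2.0 ** value - 1.0


def to_integer_counts(log_values):
    """Back-transform and round to non-negative integers (an assumption,
    for downstream tools that require integer counts)."""
    return [max(0, round(inverse_log2p1(v))) for v in log_values]


# Hypothetical log2(expected_count + 1) values for one gene in four samples.
example = [0.0, 1.0, 3.0, log2p1(99.6)]
print(to_integer_counts(example))  # [0, 1, 7, 100]
```

Whether rounding is acceptable for a given analysis is a separate question that this thread does not settle.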


Helical Joe

Dec 3, 2018, 5:19:08 PM
to UCSC Xena and Cancer Genomics Browser
Hi.

It's great to have TCGA, GTEx and TARGET data in the same realm and including normal samples, but it would be great to have (links to) more detail on the transformations behind expression data right on the pages where the files are downloadable. For example, I've looked:
  1. in the BD2KGenomics/toil-rnaseq repo
  2. in the xenaPython repo
  3. in the toil-lib repo
  4. in the ucsc-xena-server repo
  5. in the DataBiosphere/toil repo
  6. in the Toil article, and 
  7. in John Vivian's thesis
... and I can't find the call to DESeq2 that would produce the DESeq2-standardized/normalized TCGA/GTEx/TARGET data posted here. I see run_rsem and run_rsem_postprocess in tools/quantifiers.py, and the calls to these in toil_rnaseq.py, but no indication of DESeq2 or any normalization. This is less important, but likewise I'm not able to see where the upper-quartile normalization takes place for Hugo "normalized" counts. It looks like you're rescaling by setting the upper quartile of non-zero values to 1000, as in GDAC, then taking the log, but it's not quite exact. It would be useful to see the exact methods.

Thanks.

Mary Goldman

Dec 4, 2018, 2:34:41 PM
to adnan...@gmail.com, ucsc-cancer-ge...@googlegroups.com, John V, Lon Blauvelt, David Steinberg
Hello,

Yes, we have the inputs for most of the Toil pipeline here: https://github.com/BD2KGenomics/toil-rnaseq/wiki/Workflow-Inputs, but not for DESeq2. I am cc'ing some folks from the Toil team to see if they know where this is stored. 

I believe they were the ones who did the upper quartile normalization, so I'll ask them to weigh in on that as well. I know that we take the log2 as a pre-processing step for Xena. 


Best,
Mary
-------------
Mary Goldman
UCSC Xena
UC Santa Cruz Genomics Institute

---------- Forwarded message ---------
From: Helical Joe <adnan...@gmail.com>
Date: Mon, Dec 3, 2018 at 2:19 PM
Subject: Re: [ucsc-cancer-genomics-browser] TCGA TARGET GTEx dataset

John V

Dec 5, 2018, 6:29:34 PM
to Mary Goldman, adnan...@gmail.com, ucsc-cancer-ge...@googlegroups.com, Lon Blauvelt, David Steinberg

Hi Adnan,

> and I can’t find the call to DESeq2 that would produce the DESeq2-standardized/normalized TCGA/GTEx/TARGET data posted here

This is a good point: this was added about a year after the initial recompute, by request, and I didn’t think to send along the commands to the Xena team. The DESeq2 normalized counts were obtained using a method recommended by the author. I wrote a wrapper function to create the counts here.

> This is less important, but likewise I’m not able to see where the upper-quartile normalization takes place for Hugo “normalized” counts. It looks like you’re rescaling by setting the upper-quartile of non-zero values to 1000, as in GDAC, then taking the log, but it’s not quite exact. It would be useful to see the exact methods.

The script to normalize the counts is actually embedded inside a Docker container which is why you don’t see it in the source code, but you can see its contents here. I ported it over from the original TCGA RNA-seq workflow but ended up removing it from the toil-rnaseq workflow after I realized they were running it on every RSEM output (like TPM, which doesn’t really make sense). Currently the workflow just includes the raw RSEM output with no normalization.

Let me know if you have any further questions and apologies for making you dig around so much. 

Cheers,
John
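
For context on what "DESeq2 normalized counts" usually means: the thread does not say which method the wrapper function used, so the following is only an assumption. DESeq2's default normalization divides each sample's counts by a size factor estimated with the median-of-ratios method. The sketch below reimplements that idea in plain Python on a made-up matrix; it is not John's wrapper and not DESeq2 itself, and the function names are hypothetical.

```python
import math
import statistics


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of per-sample counts.
    Returns one size factor per sample.
    """
    n_samples = len(counts[0])
    # Keep genes with a positive count in every sample so that the
    # log geometric mean across samples is finite.
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has a positive count in every sample")
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        # Log ratio of each gene's count to that gene's geometric mean.
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(statistics.median(log_ratios)))
    return factors


def normalized_counts(counts):
    """Divide every count by the size factor of its sample."""
    factors = size_factors(counts)
    return [[c / f for c, f in zip(row, factors)] for row in counts]


# Made-up matrix: sample 2 was sequenced about twice as deeply as sample 1.
counts = [[10, 20], [100, 200], [0, 5], [30, 60]]
factors = size_factors(counts)
print(factors)
print(normalized_counts(counts))
```

In this toy matrix the second size factor is twice the first, so the normalized values of the fully expressed genes agree across the two samples.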

Helical Joe

Dec 5, 2018, 6:59:51 PM
to UCSC Xena and Cancer Genomics Browser
John & Mary,

Thank you both for the details, and the great resource.

Adnan

Jing Zhu

Dec 6, 2018, 7:14:00 PM
to adnan...@gmail.com, ucsc-cancer-ge...@googlegroups.com
>This is less important, but likewise I’m not able to see where the upper-quartile normalization takes place for Hugo “normalized” counts. It looks like you’re rescaling by setting the upper-quartile of non-zero values to 1000, as in GDAC, then taking the log, but it’s not quite exact. It would be useful to see the exact methods.

You are pretty much correct. To be specific, the log transformation is log2(x+1), so the unit of the downloadable dataset is log2(norm_count+1).


Jing
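
The procedure discussed above (rescale each sample so that the upper quartile of its non-zero values is 1000, then take log2(x+1)) can be sketched as follows. This is a rough illustration and not the script from the Docker container: the quantile interpolation rule and the function names are assumptions, and the original poster noted that this description does not reproduce the published values exactly.

```python
import math
import statistics


def upper_quartile(values):
    """75th percentile of the non-zero values (interpolation rule is an assumption)."""
    nonzero = sorted(v for v in values if v > 0)
    if not nonzero:
        raise ValueError("sample has no non-zero values")
    if len(nonzero) == 1:
        return nonzero[0]
    # quantiles(..., n=4) returns the three quartile cut points; take the third.
    return statistics.quantiles(nonzero, n=4, method="inclusive")[2]


def upper_quartile_normalize(counts, target=1000.0):
    """Rescale one sample so its upper quartile of non-zero values equals target."""
    uq = upper_quartile(counts)
    return [c * target / uq for c in counts]


def log2p1(values):
    """log2(x + 1), the unit reported for the downloadable dataset."""
    return [math.log2(v + 1.0) for v in values]


# Made-up counts for one sample.
sample = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0]
normalized = upper_quartile_normalize(sample)
print(normalized)  # [0.0, 250.0, 500.0, 750.0, 1000.0, 1250.0]
print(log2p1(normalized))
```

Because each sample is rescaled using only its own values, this is a per-sample normalization.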


Chad Smith

Apr 29, 2021, 4:16:41 PM
to UCSC Xena and Cancer Genomics Browser
Can you comment on what batch correction method was used to integrate the combined cohort of TCGA, TARGET, and GTEx samples and, if possible, provide a link to the code? Thanks!

Mary Goldman

Apr 29, 2021, 4:23:10 PM
to Chad Smith, UCSC Xena and Cancer Genomics Browser
Hi Chad,

No batch correction method was used beyond running the raw sequencing data from the 3 projects through the same computational pipeline. You can see more information about the pipeline that the group at UCSC used to generate the data on our hub page: https://xenabrowser.net/datapages/?host=https%3A%2F%2Ftoil.xenahubs.net and in their publication: https://www.nature.com/articles/nbt.3772.

Best,
Mary
-----
Mary Goldman, Design and Outreach Engineer
Revealing life's code


Sheila Zúñiga

Sep 20, 2021, 11:49:01 AM
to UCSC Xena and Cancer Genomics Browser
Hi,
Following the code provided here (as mentioned above) for applying DESeq2, I saw that the applied transformation does not account for covariates such as sample type (cell line or tissue) or project (TCGA, GTEx, TARGET). Shouldn't these differences be accounted for in some way in the data transformation? This dataset contains both cell lines and tissues, which are completely different.

On the other hand, the Gitbook says "To compare tumor vs normal, you will need to filter down to just the samples you want to compare and then compare gene expression between your groups of samples." Is the normalization displayed in the Xena browser the RSEM norm_count based on upper quartile normalization? Does this transformation account for tissue type, sample_type or project?

Thanks in advance.

Best regards,

Sheila

Mary Goldman

Sep 22, 2021, 5:48:56 PM
to Sheila Zúñiga, UCSC Xena and Cancer Genomics Browser
Hi Sheila,

Apologies for the delay in my reply - been at a conference all week.

No, none of the normalization or analyses take into account tissue type, sample_type or project. The normalizations in the data are per-sample normalizations.

Best,
Mary
-----
Mary Goldman, Design and Outreach Engineer
Revealing life's code
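
As a small illustration of the "filter down to the samples you want, then compare groups" workflow quoted from the Gitbook: the sketch below selects two groups from made-up records and compares one gene's log2 expression between them. The sample IDs, field layout, and function names are hypothetical, and Welch's t statistic is only one possible comparison, chosen here for brevity. Since the normalizations do not account for tissue type, sample type, or project, any such adjustment would be the analyst's responsibility.

```python
import math
import statistics

# Hypothetical records: (sample_id, study, primary_site, log2 expression of one gene).
samples = [
    ("TCGA-01", "TCGA", "Lung", 8.0),
    ("TCGA-02", "TCGA", "Lung", 9.0),
    ("TCGA-03", "TCGA", "Lung", 10.0),
    ("GTEX-01", "GTEX", "Lung", 5.0),
    ("GTEX-02", "GTEX", "Lung", 6.0),
    ("GTEX-03", "GTEX", "Lung", 7.0),
    ("GTEX-04", "GTEX", "Skin", 2.0),
]


def select(records, study, site):
    """Expression values for samples from one study and one primary site."""
    return [value for _, s, p, value in records if s == study and p == site]


def welch_t(a, b):
    """Welch's t statistic for two groups with possibly unequal variances."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)


tumor = select(samples, "TCGA", "Lung")
normal = select(samples, "GTEX", "Lung")

# Difference of mean log2 values; with the +1 pseudocount this only
# approximates a log2 fold change.
diff = statistics.mean(tumor) - statistics.mean(normal)
t_stat = welch_t(tumor, normal)
print(diff, t_stat)
```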


Shen He

Aug 26, 2025, 4:47:42 PM
to UCSC Xena and Cancer Genomics Browser
Thanks for your great work.
I wonder whether PyDESeq2 or DESeq2 can handle these data (e.g. RSEM expected_count (DESeq2 standardized)). Is a raw read count strictly required as input for PyDESeq2/DESeq2?

Thanks in advance and really looking forward to hearing from you.

Mary Goldman

Aug 26, 2025, 4:51:32 PM
to Shen He, UCSC Xena and Cancer Genomics Browser
Hi Shen,

Sorry, but we are not funded to advise about tools other than UCSC Xena. 

Personally I have found Biostars to be helpful for these types of questions: https://www.biostars.org/ PyDESeq2 or DESeq2 may also have a mailing list or forum.

Best,
Mary
-----
Mary Goldman (she/her), Design and Outreach Engineer 
