TCGA Endometrial cancer RNAseq sample difference?

Kaiwen Lin

Aug 4, 2022, 5:41:53 PM

to UCSC Xena and Cancer Genomics Browser

Hi there,

I was looking for TCGA UCEC data. Looking through the Xena browser, when I use 'TCGA Endometrioid cancer (UCEC)', only around 200 of the ~600 samples have gene expression data:

If I use the ‘GDC TCGA Endometrioid Cancer (UCEC)’ data instead, it shows that most of the samples have gene expression data:

I looked into the dataset descriptions on Xena. The older version (https://xenabrowser.net/datapages/?dataset=EB%2B%2BAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena&host=https%3A%2F%2Fpancanatlas.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443) seems to have the larger, more complete set of samples, while the Toil reprocessed one (https://xenabrowser.net/datapages/?dataset=tcga_RSEM_gene_tpm&host=https%3A%2F%2Ftoil.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443) seems to contain only a smaller subset of the available samples.

Questions are:

1. Are these differences caused by the reprocessing procedure with Toil?

2. If so, why were some samples not reprocessed? Is there a biological reason that only these were reprocessed, or was it simply for logistical reasons?

3. Is there a way to get the larger set of samples back?

Thank you!

Best,

Kevin

Mary Goldman

Aug 4, 2022, 5:59:17 PM

to Kaiwen Lin, UCSC Xena and Cancer Genomics Browser

Hi Kevin,

There are 4 sets of TCGA data on Xena: the legacy TCGA data, the data from the GDC, the data from the PanCan Atlas project, and the data from the UCSC Toil recompute project. More information can be found here: https://ucsc-xena.gitbook.io/project/public-data-we-host/tcga. You mention all 4 of these sources in this email.

As a note, you can always get more information about the data behind a column by clicking on the 3 dot column menu at the top of the column and choosing 'About'.

The top screenshot of the TCGA UCEC cohort shows the dataset here: https://xenabrowser.net/datapages/?host=https%3A%2F%2Ftcga.xenahubs.net&dataset=TCGA.UCEC.sampleMap%2FHiSeqV2. This is the legacy data from the original TCGA DCC.

The bottom screenshot of the GDC TCGA UCEC cohort shows the dataset here: https://xenabrowser.net/datapages/?host=https%3A%2F%2Fgdc.xenahubs.net&dataset=TCGA-UCS.htseq_fpkm-uq.tsv. The GDC is the new home of the TCGA data and this data has been run through the GDC pipelines.

The first dataset you linked to (https://xenabrowser.net/datapages/?dataset=EB%2B%2BAdjustPANCAN_IlluminaHiSeq_RNASeqV2.geneExp.xena&host=https%3A%2F%2Fpancanatlas.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443) is the data from the PanCan Atlas project.

The second dataset you linked to (https://xenabrowser.net/datapages/?dataset=tcga_RSEM_gene_tpm&host=https%3A%2F%2Ftoil.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443) is the Toil recompute data.

The 4 data sources overlap in which samples have which types of data, but the exact set of samples with a given data type varies from source to source.

If you ultimately decide to use the PanCan Atlas data or the Toil data, you will need to filter to just UCEC samples. More information about how to do this is here: https://ucsc-xena.gitbook.io/project/how-do-i/how-do-i-filter-to-just-one-cancer-type.
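If you are working with downloaded files rather than in the browser, the same filtering idea can be sketched in Python. This is only a minimal illustration, assuming a tab-separated matrix with genes as rows and samples as columns plus a phenotype table that maps each sample to a cancer type. The helper `filter_matrix_to_cohort`, the column names `sample` and `cancer_type`, and the toy values are all hypothetical and will need adjusting to the actual files.

```python
import csv
import io


def filter_matrix_to_cohort(matrix_tsv, phenotype_tsv, cohort,
                            sample_col="sample", type_col="cancer_type"):
    """Keep only the matrix columns whose sample belongs to `cohort`.

    Column names are assumptions; check the real phenotype file header.
    """
    # Collect the sample IDs annotated with the requested cancer type.
    pheno = csv.DictReader(io.StringIO(phenotype_tsv), delimiter="\t")
    keep = {row[sample_col] for row in pheno if row[type_col] == cohort}

    rows = list(csv.reader(io.StringIO(matrix_tsv), delimiter="\t"))
    header = rows[0]
    # Column 0 holds the gene identifier, so it is always kept.
    idx = [0] + [i for i, name in enumerate(header) if i > 0 and name in keep]
    return [[row[i] for i in idx] for row in rows]


# Toy data standing in for the downloaded files (made-up values).
matrix = "gene\tS1\tS2\tS3\nTP53\t1.0\t2.0\t3.0\nPTEN\t4.0\t5.0\t6.0\n"
pheno = "sample\tcancer_type\nS1\tUCEC\nS2\tBRCA\nS3\tUCEC\n"

filtered = filter_matrix_to_cohort(matrix, pheno, "UCEC")
for row in filtered:
    print("\t".join(row))
```

For the full pan-cancer matrices, which are large, reading the file line by line or using a dataframe library may be more practical than loading everything into a list.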

If you have any questions, please let us know.

Best,

Mary

-----

Mary Goldman (she/her), Design and Outreach Engineer

UCSC Xena

UC Santa Cruz Genomics Institute

Revealing life's code


Kevin Lin

Aug 8, 2022, 4:46:36 PM

to UCSC Xena and Cancer Genomics Browser

Hi Mary,

Thanks for your detailed response. It seems I confused myself a bit there. If we focus on just the latter 2 datasets, PanCan Atlas vs. Toil recompute, do you know what caused the difference in sample numbers, specifically for Endometrial (Uterine Corpus) cancer patients? I tried looking up some references, but so far I haven't found a good explanation. If there is some documentation you can point me to, that would be great. Appreciate it!

Best,

Kevin

On Thursday, August 4, 2022 at 2:59:17 PM UTC-7, ma...@soe.ucsc.edu wrote:

Mary Goldman

Aug 9, 2022, 11:46:01 AM

to Kevin Lin, UCSC Xena and Cancer Genomics Browser

Hi Kevin,

It is likely due to differences in QC thresholds between the two datasets. You can look up the QC thresholds for the Toil data here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5546205/ and the PanCan Atlas data here: https://www.cell.com/pb-assets/consortium/pancanceratlas/pancani3/index.html
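To see exactly which samples are present in one dataset but not the other, one option is to compare the sample ID lists directly. Below is a minimal Python sketch, assuming the sample IDs have already been pulled out of each downloaded file. The helper `compare_sample_sets` is hypothetical (not part of Xena) and the IDs are made-up placeholders.

```python
def compare_sample_sets(samples_a, samples_b):
    """Return the sample IDs shared by both lists and unique to each."""
    a, b = set(samples_a), set(samples_b)
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


# Placeholder IDs for illustration only, not real TCGA samples.
pancan_samples = ["TCGA-AA-0001-01", "TCGA-AA-0002-01", "TCGA-AA-0003-01"]
toil_samples = ["TCGA-AA-0001-01", "TCGA-AA-0003-01"]

result = compare_sample_sets(pancan_samples, toil_samples)
print(len(result["shared"]), "shared samples")
print("Only in first list:", result["only_a"])
print("Only in second list:", result["only_b"])
```

The IDs that appear in only one list are the ones to check against the QC criteria described in the two references.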

Best,
