TCGA Endometrial cancer RNAseq sample difference?


Kaiwen Lin

Aug 4, 2022, 5:41:53 PM
to UCSC Xena and Cancer Genomics Browser
Hi there, 

I was looking for TCGA UCEC data in the Xena browser. When I use the 'TCGA Endometrioid Cancer (UCEC)' cohort, only around 200 of the ~600 samples have gene expression data:
image002.png

But if I use the 'GDC TCGA Endometrioid Cancer (UCEC)' cohort, most of the samples have gene expression data:
image004.png


My questions are:
1. Are these differences caused by the Toil reprocessing procedure?
2. If so, why were some samples not reprocessed? Is there a biological reason that only these were reprocessed, or is it simply logistical?
3. Is there a way to get gene expression data for most of the samples?

Thank you!

Best,
Kevin

Mary Goldman

Aug 4, 2022, 5:59:17 PM
to Kaiwen Lin, UCSC Xena and Cancer Genomics Browser
Hi Kevin,

There are 4 sets of TCGA data on Xena: the legacy TCGA data, the data from the GDC, the data from the PanCan Atlas project, and the data from the UCSC Toil recompute project. More information can be found here: https://ucsc-xena.gitbook.io/project/public-data-we-host/tcga. You mention all 4 of these sources in this email.

As a note, you can always get more information about the data behind a column by clicking on the 3 dot column menu at the top of the column and choosing 'About'.

The top screenshot of the TCGA UCEC cohort shows the dataset here: https://xenabrowser.net/datapages/?host=https%3A%2F%2Ftcga.xenahubs.net&dataset=TCGA.UCEC.sampleMap%2FHiSeqV2. This is the legacy data from the original TCGA DCC.

The bottom screenshot of the GDC TCGA UCEC cohort shows the dataset here: https://xenabrowser.net/datapages/?host=https%3A%2F%2Fgdc.xenahubs.net&dataset=TCGA-UCS.htseq_fpkm-uq.tsv. The GDC is the new home of the TCGA data and this data has been run through the GDC pipelines.



The 4 data sources overlap in which samples have which types of data, but the exact set of samples with a given data type varies from source to source.
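To see exactly which samples differ between two sources, one option is to download the sample lists and compare them offline. Below is a minimal sketch in Python, not an official Xena tool. The sample IDs are made up for illustration, and the `normalize` helper assumes that IDs may differ only by a trailing vial letter (for example `-01A` vs. `-01`), which you should verify against your own downloads.

```python
def normalize(sample_id):
    """Keep the first 15 characters, e.g. 'TCGA-AA-0001-01A' -> 'TCGA-AA-0001-01'.

    Assumption: the two sources differ only by a trailing vial letter.
    """
    return sample_id[:15]


def compare_samples(samples_a, samples_b):
    """Summarize the overlap between two collections of sample IDs."""
    a = {normalize(s) for s in samples_a}
    b = {normalize(s) for s in samples_b}
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


# Hypothetical IDs standing in for the sample columns of two downloaded datasets.
legacy_samples = ["TCGA-AA-0001-01", "TCGA-AA-0002-01"]
gdc_samples = ["TCGA-AA-0002-01A", "TCGA-AA-0003-01A", "TCGA-AA-0004-01A"]

summary = compare_samples(legacy_samples, gdc_samples)
for key, ids in summary.items():
    print(key, len(ids), ids)
```

The `only_a` and `only_b` lists are the samples worth checking against each source's documentation.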

If you ultimately decide to use the PanCan Atlas data or the Toil data, you will need to filter to just UCEC samples. More information about how to do this is here: https://ucsc-xena.gitbook.io/project/how-do-i/how-do-i-filter-to-just-one-cancer-type.
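If you work with downloaded files rather than in the browser, the same filtering can be sketched in Python. This is only an illustration under assumptions: the column names `sample` and `cancer_type` and the rows below are hypothetical, so check the actual phenotype file of the dataset you download for the real column names and values.

```python
import csv
import io


def samples_for_cancer_type(phenotype_tsv, id_column, type_column, wanted):
    """Return the sample IDs whose cancer-type column equals `wanted`."""
    reader = csv.DictReader(io.StringIO(phenotype_tsv), delimiter="\t")
    return [row[id_column] for row in reader if row[type_column] == wanted]


# Hypothetical phenotype table; a real one would be read from a downloaded file.
phenotype_tsv = (
    "sample\tcancer_type\n"
    "S1\tUCEC\n"
    "S2\tBRCA\n"
    "S3\tUCEC\n"
)

ucec_samples = samples_for_cancer_type(
    phenotype_tsv, id_column="sample", type_column="cancer_type", wanted="UCEC"
)
print(ucec_samples)
```

The resulting list can then be used to subset the columns of the expression matrix to UCEC samples only.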

If you have any questions, please let us know.

Best,
Mary
-----
Mary Goldman (she/her), Design and Outreach Engineer
Revealing life's code



Kevin Lin

Aug 8, 2022, 4:46:36 PM
to UCSC Xena and Cancer Genomics Browser
Hi Mary, 

Thanks for your detailed response. It seems I confused myself a bit there. If we focus on just the latter 2 datasets, PanCan Atlas vs. Toil recompute, do you know what caused the difference in sample numbers, specifically for Endometrial (Uterine Corpus) cancer patients? I tried looking up some references but haven't found a good explanation so far. If there's some documentation you can point me to, that would be great. I appreciate it!

Best,
Kevin


On Thursday, August 4, 2022 at 2:59:17 PM UTC-7, ma...@soe.ucsc.edu wrote:

Mary Goldman

Aug 9, 2022, 11:46:01 AM
to Kevin Lin, UCSC Xena and Cancer Genomics Browser
Hi Kevin,

It is likely due to differences in QC thresholds between the two datasets. You can look up the QC thresholds for the Toil data here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5546205/ and the PanCan Atlas data here: https://www.cell.com/pb-assets/consortium/pancanceratlas/pancani3/index.html

Best,