TCGA Data Source/Availability Questions


Ylagan, Maya

Aug 14, 2023, 12:58:30 PM
to cbiop...@googlegroups.com

To Whom It May Concern at cBioPortal,

 

Thank you for putting this tool together. It is a wonderful and helpful resource, and we use it in much of our research. While using the tool I had a few questions about the TCGA data:

 

  • I know there are multiple versions of TCGA data available. Which genome builds were used to generate the files for TCGA PanCancer Atlas 2018 and for TCGA Firehose Legacy? From the GDC documentation I assume TCGA PanCancer Atlas 2018 is GRCh38, but what about Firehose Legacy? I found the link to the source, but I can't find any more information about these files.
  • Which subset of samples are you using? For TCGA PanCancer Atlas 2018, the sample sizes from GDC don't line up with what you have in your portal, and I am unable to find any documentation of how this data differs from the GDC data.
  • For the TCGA PanCancer Atlas 2018 data on cBioPortal, are gene-level copy number values available, or only thresholded values? I know gene-level copy number values are available in Firehose Legacy, but I was curious whether you have them for the updated TCGA PanCancer Atlas 2018 versions. If not, is there any way that I could get them?

 

Thank you so much for your time and help! I look forward to hearing from you.

Best,

 

 

----

Maya Ylagan
she/her/hers

Data Analyst || Department of Oncology

Kowalski-Muegge Lab
Dell Medical School || The University of Texas at Austin

o: 512.495.5761 || dellmed.utexas.edu

 

Nikolaus Schultz

Aug 17, 2023, 6:15:41 PM
to Ylagan, Maya, cbiop...@googlegroups.com
Hi Maya,

Thank you for your praise for cBioPortal and your questions.

All TCGA genome builds used in cBioPortal are hg19. 

TCGA data sets in cBioPortal match a couple of different studies:
1. The last Broad Firehose run
2. The final sample set used by the PanCancer Atlas Project
3. In some cases, the sample set that was used in a given TCGA publication.
I am not too familiar with what is in the GDC, but I assume they used all available samples for each TCGA cancer type, while some samples might have been removed from the PanCancer Atlas studies for various reasons.

Another major difference between the GDC data and the Firehose, PanCancer Atlas, and published studies in cBioPortal is that mutation calling was redone for the GDC, which will inevitably lead to differences between the data sources.

Log2 copy-number values are available for the PanCancer Atlas cohort (if that is what you are referring to). You can display them in the Plots tab, as in this example

You can also find them in Datahub, in the data_log2_cna.txt file in each study folder, e.g.:

I hope this helps. 

Niki.
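
For anyone who downloads a data_log2_cna.txt file as described above, here is a minimal Python sketch of loading it. It assumes the file is tab-separated with one row per gene, two identifier columns named Hugo_Symbol and Entrez_Gene_Id, and one column per sample; check the header of your own file, since that layout is an assumption here. The gene names, sample IDs, and values in SAMPLE_TEXT are made up for illustration, and read_log2_cna is a hypothetical helper, not part of cBioPortal.

```python
import csv
import io

# Made-up excerpt in the ASSUMED layout of data_log2_cna.txt:
# tab-separated, one row per gene, one column per sample.
SAMPLE_TEXT = (
    "Hugo_Symbol\tEntrez_Gene_Id\tSAMPLE-A\tSAMPLE-B\n"
    "GENE1\t1\t0.012\t-0.431\n"
    "GENE2\t2\t1.250\tNA\n"
)


def read_log2_cna(handle):
    """Return {gene: {sample_id: float or None}} from a tab-separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    id_columns = {"Hugo_Symbol", "Entrez_Gene_Id"}  # assumed identifier columns
    table = {}
    for row in reader:
        values = {}
        for column, cell in row.items():
            if column in id_columns:
                continue
            # Treat empty cells and "NA" as missing values.
            values[column] = None if cell in ("", "NA") else float(cell)
        table[row["Hugo_Symbol"]] = values
    return table


table = read_log2_cna(io.StringIO(SAMPLE_TEXT))
print(table["GENE1"]["SAMPLE-B"])
```

To read a real download instead of the inline sample, pass an open file handle, for example `with open("data_log2_cna.txt", newline="") as f: table = read_log2_cna(f)`.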



张远哲

Sep 8, 2023, 10:49:52 AM
to cBioPortal for Cancer Genomics Discussion Group
Hi all,

I've been working on CNV recently and I have a question about the log2 copy-number values provided in the PanCancer Atlas cohort. I have learnt from here that the "log2 copy number" might not really be on a log scale but just (copy-ratio - ploidy). So is the "log2 copy number" in the PanCancer dataset really log2(CN)-1, or is it not actually on a log scale? Thank you.

Sincerely,
Yuanzhe Zhang
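
To make the two readings in this question concrete, here is a small Python sketch of the arithmetic only. It does not settle which scale the PanCancer Atlas file uses; that is not confirmed in this thread. Note that log2(CN) - 1 is the same as log2(CN / 2), i.e. a log2 ratio against a diploid baseline. The function names and the default ploidy of 2 are assumptions for illustration.

```python
import math


def log2_ratio_from_copy_number(copy_number, ploidy=2.0):
    """Log-scale reading: log2(CN / ploidy), equal to log2(CN) - 1 when ploidy is 2."""
    return math.log2(copy_number / ploidy)


def copy_number_from_log2_ratio(log2_ratio, ploidy=2.0):
    """Inverse of the log-scale reading: CN = ploidy * 2 ** value."""
    return ploidy * 2.0 ** log2_ratio


def linear_difference(copy_number, ploidy=2.0):
    """Alternative non-log reading: a plain difference from the baseline."""
    return copy_number - ploidy


# The two readings agree at the diploid baseline but diverge away from it.
for cn in (1.0, 2.0, 4.0):
    print(cn, log2_ratio_from_copy_number(cn), linear_difference(cn))
```

One possible sanity check on real data, under these assumptions: a single-copy loss in a diploid sample would sit near -1 under both readings, but a gain to four copies would sit near 1 on a log scale and near 2 on a linear difference scale.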
