Mutation Count Mismatch between GDC and cBioPortal


Jack Flower

Mar 8, 2023, 10:07:20 PM
to cBioPortal for Cancer Genomics Discussion Group
Hi everyone, 

I am Jack Flower, a second-year PhD student at the University of Oxford in the Buczacki research group in the Department of Oncology. I am currently analysing TCGA-COAD data and have come across a significant mismatch between a previous GDC download and what appears on cBioPortal.
I know the option to download a single MAF file for all TCGA-COAD samples is no longer available, but this Mutect file/folder was downloaded before that option was removed:
  • File ID: 03652df4-6090-4f5a-a2ff-ee28a37f9301  
  • File Name: TCGA.COAD.mutect.03652df4-6090-4f5a-a2ff-ee28a37f9301.DR-10.0.somatic.maf.gz
  • Download Date: 21st March 2022
  • Also see screenshot attached
I decided to use this file in my analysis to save time over collating the multiple MAFs that are now available. However, when I import it into R as a MAF file, the total mutation counts are very different from those on cBioPortal; see the screenshots using TCGA-AM-5820-01A and TCGA-AM-5821-01A as examples (see also the screenshot showing multiple mutations per gene, which suggests a potential error in mutation calling?). For example, TCGA-AM-5820-01A is MSS and reports 72 mutations on cBioPortal, yet in my MAF file I get 1479, with multiple mutations in the same gene. I understand that this MAF only includes Mutect calls and that cBioPortal combines several calling methods, but I can't work out the reason for such a large difference in mutation count. When I download the MAF that is currently available on GDC for TCGA-AM-5820-01A specifically, the total count is essentially what is seen in cBioPortal.
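For anyone wanting to reproduce the per-sample counts described above outside R, here is a minimal Python sketch. It assumes the MAF is tab-separated, has `#` comment lines before the header, and carries the standard `Tumor_Sample_Barcode` and `Hugo_Symbol` columns; the function names and the demo rows are made up for illustration and are not real TCGA calls. Barcodes are truncated to 16 characters so that a full aliquot barcode groups under a sample ID such as TCGA-AM-5820-01A.

```python
import csv
import gzip
from collections import Counter, defaultdict


def summarize_maf(lines, barcode_len=16):
    """Count variant rows per sample and per (sample, gene) pair."""
    per_sample = Counter()
    per_sample_gene = defaultdict(Counter)
    # Skip the '#' comment lines that may precede the column header.
    data_lines = (ln for ln in lines if not ln.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        sample = row["Tumor_Sample_Barcode"][:barcode_len]
        per_sample[sample] += 1
        per_sample_gene[sample][row["Hugo_Symbol"]] += 1
    return per_sample, per_sample_gene


def summarize_maf_file(path):
    """Same summary for a gzipped MAF on disk (hypothetical helper)."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return summarize_maf(fh)


# Tiny made-up MAF for illustration only.
DEMO_MAF = [
    "#version gdc-1.0.0",
    "Hugo_Symbol\tTumor_Sample_Barcode\tVariant_Classification",
    "APC\tTCGA-XX-0001-01A-01D-0000-10\tNonsense_Mutation",
    "APC\tTCGA-XX-0001-01A-01D-0000-10\tFrame_Shift_Del",
    "KRAS\tTCGA-XX-0001-01A-01D-0000-10\tMissense_Mutation",
    "TP53\tTCGA-XX-0002-01A-01D-0000-10\tMissense_Mutation",
]

per_sample, per_sample_gene = summarize_maf(DEMO_MAF)
# Genes hit more than once in the first demo sample.
multi_hit = {
    gene: n
    for gene, n in per_sample_gene["TCGA-XX-0001-01A"].items()
    if n > 1
}
print(per_sample)
print(multi_hit)
```

Running the same summary on the old combined MAF and on the per-sample MAF now on GDC gives two counts that can be compared directly with the number shown on cBioPortal.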
As the MAF file referenced above was available not so long ago, is there an explanation for this? Going forward, I will use the individual MAFs now available on GDC and collate them, just to be safe. However, I know of other scientists in the lab who have used this particular MAF file and now worry about the results they obtained. Furthermore, do you have any advice on where the best place to obtain updated and accurate mutational info for TCGA-COAD is as this is currently foundational for my PhD project?
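The collation step mentioned above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: each per-sample MAF is tab-separated, may be gzipped, may start with `#` comment lines, and has a header row beginning with `Hugo_Symbol`. The function name `collate_mafs` is hypothetical. The sketch refuses to merge files whose column headers differ instead of silently misaligning columns.

```python
import gzip


def collate_mafs(maf_paths, out_path):
    """Concatenate MAFs into one file, writing the column header once.

    Returns the number of variant rows written.
    """
    header = None
    n_rows = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in maf_paths:
            # Pick a reader based on the file extension.
            opener = gzip.open if str(path).endswith(".gz") else open
            with opener(path, "rt", encoding="utf-8") as fh:
                for raw in fh:
                    line = raw.rstrip("\r\n")
                    if not line or line.startswith("#"):
                        continue  # skip blank and comment lines
                    if line.startswith("Hugo_Symbol\t"):
                        if header is None:
                            header = line
                            out.write(line + "\n")
                        elif line != header:
                            raise ValueError(f"column mismatch in {path}")
                        continue
                    out.write(line + "\n")
                    n_rows += 1
    return n_rows
```

A call might look like `collate_mafs(sorted(Path("mafs").glob("*.maf.gz")), "TCGA-COAD.collated.maf")`, where the `mafs` directory and the output name are placeholders.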
I contacted Nikolaus Schultz, who kindly referred me to this group. I would much appreciate any help or advice.

Kind regards,
Jack
cBio_TCGA-AM-5820.png
original_maf_download_TCGA-AM-5820_5821.png
cBio_TCGA-AM-5821.png
original_maf_download.png
original_maf_download_TCGA-AM-5820_mutations.png

de Bruijn, Ino

Mar 10, 2023, 9:29:36 AM
to Jack Flower, cBioPortal for Cancer Genomics Discussion Group

Hi Jack,

Thanks for reaching out!

I am not very familiar with the intricacies of the different calling methods and the filtering used for that TCGA MAF file. I do recall that Mutect2 tends to call more events than other callers (or at least used to):

https://bmcmedgenomics.biomedcentral.com/articles/10.1186/s12920-019-0636-y

I imagine that might be one of the reasons, but to be more certain one would probably need a more comprehensive comparison between the MAFs, looking at allele frequencies and the differences in types of alterations. Maybe someone else on the mailing list who is more familiar with this specific file can provide a better answer. Glad to hear, though, that the current GDC MAF is more consistent with what's in cBioPortal.
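A comparison along those lines could be sketched in Python as below. This is a rough illustration, not a validated pipeline: it assumes both MAFs use the same reference genome build and the standard columns `Chromosome`, `Start_Position`, `Reference_Allele`, `Tumor_Seq_Allele2`, `Variant_Classification`, `t_depth` and `t_alt_count`. The function names and demo rows are made up. Variants are matched on sample, position and alleles, and the variants found only in the older file are summarised by allele frequency and alteration type.

```python
import csv
from collections import Counter
from statistics import median


def load_variants(lines):
    """Map a (sample, chrom, start, ref, alt) key to its MAF row."""
    data_lines = (ln for ln in lines if not ln.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    variants = {}
    for row in reader:
        key = (
            row["Tumor_Sample_Barcode"][:16],
            row["Chromosome"],
            row["Start_Position"],
            row["Reference_Allele"],
            row["Tumor_Seq_Allele2"],
        )
        variants[key] = row
    return variants


def vaf(row):
    """Variant allele frequency = alt reads / total reads in the tumor."""
    depth = int(row["t_depth"] or 0)
    return int(row["t_alt_count"] or 0) / depth if depth else 0.0


def compare_mafs(old_lines, new_lines):
    old, new = load_variants(old_lines), load_variants(new_lines)
    only_old = [old[k] for k in old.keys() - new.keys()]
    shared = [old[k] for k in old.keys() & new.keys()]
    return {
        "n_shared": len(shared),
        "n_only_old": len(only_old),
        "median_vaf_shared": median([vaf(r) for r in shared]) if shared else None,
        "median_vaf_only_old": median([vaf(r) for r in only_old]) if only_old else None,
        "classes_only_old": Counter(r["Variant_Classification"] for r in only_old),
    }


# Made-up rows for illustration only; not real TCGA calls.
COLS = ["Hugo_Symbol", "Chromosome", "Start_Position", "Variant_Classification",
        "Reference_Allele", "Tumor_Seq_Allele2", "Tumor_Sample_Barcode",
        "t_depth", "t_alt_count"]
S = "TCGA-XX-0001-01A-01D-0000-10"
OLD_DEMO = ["\t".join(COLS)] + ["\t".join(r) for r in [
    ["APC", "chr5", "112839942", "Nonsense_Mutation", "C", "T", S, "100", "40"],
    ["GENE1", "chr1", "1000", "Intron", "G", "A", S, "50", "2"],
    ["GENE2", "chr2", "2000", "Silent", "T", "C", S, "40", "2"],
]]
NEW_DEMO = OLD_DEMO[:2]  # header plus the APC row only

result = compare_mafs(OLD_DEMO, NEW_DEMO)
print(result)
```

If the extra calls in the older file turn out to be mostly low allele frequency or non-coding classes, that would point towards a filtering difference between the two files, though this sketch alone cannot confirm the cause.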

 

> Furthermore, do you have any advice on where the best place to obtain updated and accurate mutational info for TCGA-COAD is as this is currently foundational for my PhD project?

I would use the TCGA PanCancer Atlas (Pancan) data provided here:

https://gdc.cancer.gov/about-data/publications/pancanatlas

You can also use the cBioPortal mutation data, though note that we apply some harmonization (re-annotation through Genome Nexus) to make the data more suitable for comparison with other mutation data in cBioPortal:

https://github.com/cBioPortal/datahub/tree/master/public/coadread_tcga_pan_can_atlas_2018

Hope that helps!

Best wishes,
Ino
