Discrepancy in sample size and mutation frequencies V18

Yi Q

Oct 7, 2025, 11:30:14 AM
to cBioPortal for Cancer Genomics Discussion Group
Hi there - I am finding discrepancies in sample sizes and mutation frequencies between the V18.0 data downloaded from Synapse (https://www.synapse.org/Synapse:syn68719152) and what is shown on cBioPortal. For example, counting the unique samples in data_mutations_extended.txt gives 204,292 samples, but cBioPortal shows 250,018 samples.

There are also discrepancies in the mutation frequencies. For example, on cBioPortal, if I select NSCLC (total N=36,069) and then pick TP53, I see 15,738 samples at 44.3% frequency. If I calculate this from the Synapse data_mutations_extended.txt file, I get NSCLC N=29,752, of which 15,780 unique samples have a TP53 mutation.

My questions are: 1) Why do the sample counts and mutation frequencies differ between the two sources? 2) How is the 44.3% prevalence calculated, given that 15,738/36,069 = 43.6%?

Thanks in advance for your help, Yi

Ritika Kundra

Oct 7, 2025, 12:49:38 PM
to Yi Q, cBioPortal for Cancer Genomics Discussion Group
Hi Yi,

Thank you for reaching out.
For your first question, the total sequenced sample count is 250,018 samples; however, not all samples harbor a mutation. The data_mutations_extended.txt file (MAF) only includes samples that have at least one called mutation, so the number of unique samples in that file will always be lower.
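
A minimal sketch of how one might check this, using only the Python standard library. It compares the samples that appear in the MAF with the full sample list. The tiny in-memory tables are made up for illustration. The clinical file layout (a `SAMPLE_ID` column preceded by `#` metadata lines) is an assumption based on the usual cBioPortal clinical file format, so adjust the column names to match the actual download.

```python
import csv
import io

def unique_samples(lines, column):
    """Return the distinct values of `column` in a tab-delimited file,
    skipping any '#' comment/metadata lines."""
    rows = (line for line in lines if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    return {row[column] for row in reader}

# Made-up stand-in for data_mutations_extended.txt (not real GENIE data).
maf = io.StringIO(
    "Hugo_Symbol\tTumor_Sample_Barcode\tVariant_Classification\n"
    "TP53\tS1\tMissense_Mutation\n"
    "KRAS\tS1\tMissense_Mutation\n"
    "TP53\tS2\tNonsense_Mutation\n"
)
# Made-up stand-in for the clinical sample file (assumed layout).
clinical = io.StringIO(
    "#Sample Identifier\tCancer Type\n"
    "SAMPLE_ID\tCANCER_TYPE\n"
    "S1\tNSCLC\n"
    "S2\tNSCLC\n"
    "S3\tNSCLC\n"
)

mutated = unique_samples(maf, "Tumor_Sample_Barcode")
sequenced = unique_samples(clinical, "SAMPLE_ID")
no_mutation = sequenced - mutated  # sequenced, but absent from the MAF

print(len(sequenced), len(mutated), sorted(no_mutation))
```

With real files, replace the `io.StringIO` objects with `open(path, newline="")`. Samples such as `S3` here are sequenced but have no called mutation, so they never appear in the MAF.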

For your second question, the difference in sample counts is likely related to the reason above. As for the difference in variant counts: by default, cBioPortal filters out Silent, Intron, IGR, 3'UTR, 5'UTR, 3'Flank, and 5'Flank variants, except for TERT promoter mutations. So you may see more alterations in the MAF file than are displayed on the portal.
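
A rough sketch of applying the same default filter to MAF rows before counting. The `Hugo_Symbol` and `Variant_Classification` column names are standard MAF fields. Treating a TERT promoter mutation as a `5'Flank` variant in TERT is an assumption made for this illustration, and `shown_on_portal` is a hypothetical helper, not a cBioPortal function.

```python
# Variant classifications that cBioPortal filters out by default (from the reply above).
EXCLUDED = {"Silent", "Intron", "IGR", "3'UTR", "5'UTR", "3'Flank", "5'Flank"}

def shown_on_portal(row):
    """Hypothetical helper: True if a MAF row would survive the default filter."""
    # Assumption: TERT promoter mutations are annotated as 5'Flank in TERT.
    is_tert_promoter = (
        row["Hugo_Symbol"] == "TERT" and row["Variant_Classification"] == "5'Flank"
    )
    return is_tert_promoter or row["Variant_Classification"] not in EXCLUDED

# Made-up rows for illustration.
rows = [
    {"Hugo_Symbol": "TP53", "Tumor_Sample_Barcode": "S1", "Variant_Classification": "Missense_Mutation"},
    {"Hugo_Symbol": "TP53", "Tumor_Sample_Barcode": "S2", "Variant_Classification": "Silent"},
    {"Hugo_Symbol": "TP53", "Tumor_Sample_Barcode": "S3", "Variant_Classification": "Intron"},
    {"Hugo_Symbol": "TERT", "Tumor_Sample_Barcode": "S4", "Variant_Classification": "5'Flank"},
]

kept = [r for r in rows if shown_on_portal(r)]
tp53_samples = {r["Tumor_Sample_Barcode"] for r in kept if r["Hugo_Symbol"] == "TP53"}

print(len(kept), sorted(tp53_samples))
```

Here only `S1` counts as TP53-mutated after filtering, although three samples carry a TP53 row in the raw table. That is the kind of gap that can make a raw MAF count higher than the portal's.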
Also, when cBioPortal calculates mutation frequency, the denominator is the subset of samples that were actually profiled for that gene, not the total number of NSCLC samples. That is:
frequency = (# of samples with a mutation in this gene that are also profiled for mutations in it) / (# of samples profiled for mutations in this gene) = 15,738 / 35,560 = 44.3%
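
The arithmetic can be sketched as follows. The counts are the ones quoted in this thread. `mutation_frequency` is a hypothetical helper written for illustration, and which samples count as profiled for a gene depends on the panel each sample was sequenced with.

```python
def mutation_frequency(mutated_samples, profiled_samples):
    """Fraction of profiled samples that carry a mutation in the gene.

    Samples that are mutated but not profiled are ignored, matching the
    numerator described above.
    """
    mutated_and_profiled = mutated_samples & profiled_samples
    return len(mutated_and_profiled) / len(profiled_samples)

# Counts quoted in the thread for TP53 in NSCLC.
mutated_and_profiled = 15_738
profiled_for_tp53 = 35_560
all_nsclc = 36_069

portal_freq = mutated_and_profiled / profiled_for_tp53  # denominator used by the portal
naive_freq = mutated_and_profiled / all_nsclc           # denominator: all NSCLC samples

print(f"{portal_freq:.1%}")  # 44.3%
print(f"{naive_freq:.1%}")   # 43.6%

# Toy example with sets: S4 is mutated but was not profiled for the gene.
toy = mutation_frequency({"S1", "S4"}, {"S1", "S2", "S3"})
```

The two denominators differ by 36,069 - 35,560 = 509 samples, which by this calculation are NSCLC samples not profiled for TP53.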

Let me know if you have any other questions.

Thanks,
Ritika
