Hi all,
Thanks for reaching out!
It might be tricky to get the exact number of mutations to match between the two portals because of the complexities of variant annotation, but I would hope we can get to something that’s at least in the same ballpark.
The issue is that (1) the reference genome is different: GDC uses GRCh38 and cBioPortal uses GRCh37. Then (2) the version of VEP might be different. And lastly (3) although we use VEP via our annotation tool, Genome Nexus, we pick a single annotation for each variant as it applies to the canonical transcript for each gene. If I apply these filters:
It seems like it gets a bit closer to the number of mutations in, e.g., TCGA-A5-A0G2:
https://www.cbioportal.org/patient?studyId=ucec_tcga_pan_can_atlas_2018&caseId=TCGA-A5-A0G2
Looping in Angelica and Avery to see if they have a better idea of the exact variants we are importing. I noticed that our documentation mentions these allowed variant classifications: https://docs.cbioportal.org/5.1-data-loading/data-loading/file-formats#minimal-maf-format. But I’m not 100% sure whether all of them actually get imported.
Hope that helps!
Best wishes,
Ino
--
You received this message because you are subscribed to the Google Groups "cBioPortal for Cancer Genomics Discussion Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to
cbioportal+...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/cbioportal/A2C636A0-1091-4571-8C7D-06E6A10070F6%40vumc.org.
Hi Ino & Co.,
For a given TCGA sample’s “Mutation Count” in cBioPortal, we are seeking a definition of the subset of classes that are specifically included, among the superset listed in bullet #3 of your reference https://docs.cbioportal.org/5.1-data-loading/data-loading/file-formats#minimal-maf-format. This detail is needed for a manuscript in preparation that uses the data and cites cBioPortal.
The cBioPortal Mutation Count for a given sample seems closest to the GDC count when GDC is restricted to protein-coding genes only and to the mutation classes that are most likely to be functional (?):
Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation (stop_gained), Splice_Site (splice_acceptor_variant & splice_donor_variant), Translation_Start_Site (start_lost), Nonstop_Mutation (stop_lost).
-Best,
Jeff
Hi Jeff,
I think this should be the list:
Does that help?
Best wishes,
Ino
Hi Ino
Thanks for helping!
The section “Variant Classification Filter” of your reference https://docs.cbioportal.org/5.1-data-loading/data-loading/file-formats#minimal-maf-format states: “By default, cBioPortal filters out Silent, Intron, IGR, 3'UTR, 5'UTR, 3'Flank and 5'Flank, except for the promoter mutations of the TERT gene.” Might those classes have been removed from presentation?
GitHub code lines 157-183 of the link below appear to be the superset that includes rather than filters these, and also includes non-coding and miRNA gene variants. Checking all such selections on the GDC web site for a given TCGA sample yields a GDC mutation count that is considerably greater than the cBioPortal mutation count.
-Jeff
Hi Jeff,
Sure thing!
> The section “Variant Classification Filter” of your reference https://docs.cbioportal.org/5.1-data-loading/data-loading/file-formats#minimal-maf-format states: “By default, cBioPortal filters out Silent, Intron, IGR, 3'UTR, 5'UTR, 3'Flank and 5'Flank, except for the promoter mutations of the TERT gene.” Might those classes have been removed from presentation?
This is right
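As a rough illustration of the default filter quoted above, here is a minimal Python sketch that counts the records surviving it. This is not the actual cBioPortal import code: the helper names `is_kept` and `mutation_count` are hypothetical, the records are made-up examples using standard MAF column names, and treating the TERT promoter exception as "TERT plus 5'Flank" is an assumption.

```python
# Classes the documentation says are filtered out by default on import.
DEFAULT_FILTERED = {"Silent", "Intron", "IGR", "3'UTR", "5'UTR", "3'Flank", "5'Flank"}


def is_kept(record):
    """Return True if a MAF-like record would survive the default filter."""
    classification = record["Variant_Classification"]
    if classification not in DEFAULT_FILTERED:
        return True
    # Assumption: the TERT promoter exception is approximated here as
    # "gene is TERT and the classification is 5'Flank".
    return record["Hugo_Symbol"] == "TERT" and classification == "5'Flank"


def mutation_count(records):
    """Count the records that pass the default filter."""
    return sum(1 for record in records if is_kept(record))


# Made-up records for illustration only.
example = [
    {"Hugo_Symbol": "TP53", "Variant_Classification": "Missense_Mutation"},
    {"Hugo_Symbol": "TP53", "Variant_Classification": "Silent"},
    {"Hugo_Symbol": "TERT", "Variant_Classification": "5'Flank"},
    {"Hugo_Symbol": "PTEN", "Variant_Classification": "5'Flank"},
]
print(mutation_count(example))  # 2
```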
> GitHub code lines 157-183 of the link below appear to be the superset that includes rather than filters these, and also includes non-coding and miRNA gene variants. Checking all such selections on the GDC web site for a given TCGA sample yields a GDC mutation count that is considerably greater than the cBioPortal mutation count.
The way the data flow works is that Genome Nexus first collapses all those terms in lines 157-183 to a variant classification such as Splice_Site, Nonstop_Mutation or Silent. After that, cBioPortal filters out the variant classifications listed above on import. So the filter should probably be something like this:
This gets a bit closer, e.g., TCGA-A5-A0G2 has 25,820 mutations in cBioPortal and 26,792 in GDC. For a few other cases I see similar discrepancies. One thing I neglected to mention here:
> The issue is that (1) the reference genome is different. GDC is using GRCh38 and cBioPortal GRCh37. Then (2) the version of VEP might be different. And lastly (3) although we use VEP via our annotation tool called Genome Nexus, we pick a single annotation for each variant as it applies to the canonical transcript for each gene.
There is also (4): TCGA has a few different processing flavors. For instance, for uterine corpus endometrial carcinoma, cBioPortal has the original published Nature 2013 data, the firehose legacy data and the latest PanCancer Atlas data (see the FAQ for more about these differences). I believe GDC should have the same calls as the PanCancer Atlas data, but it’s possible there are some variations in which calls were included from mutect2/varscan2/muse/somaticsniper and pindel. CC’ing JJ, who might know about this aspect.
To get to the bottom of this, one analysis option could be to get the calls from both GDC and cBioPortal (https://github.com/cbioportal/datahub) and compare whether there’s enrichment for specific types of variants or genes. That might shed some light on why GDC seems to list more mutation events.
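One way to start on that comparison is sketched below in Python. It assumes both sources have been downloaded as tab-delimited MAF files with the standard `Hugo_Symbol`, `Variant_Classification` and `Tumor_Sample_Barcode` columns; the function names are hypothetical. Because the two portals use different reference genomes, the sketch compares counts per variant classification rather than matching calls by coordinate.

```python
import csv
from collections import Counter


def read_calls(lines, case_prefix):
    """Return (gene, classification) pairs for one case from MAF lines.

    `lines` is any iterable of text lines, for example an open file handle.
    """
    # MAF files may carry '#' comment lines before the header row.
    rows = (line for line in lines if not line.startswith("#"))
    calls = []
    for row in csv.DictReader(rows, delimiter="\t"):
        if row["Tumor_Sample_Barcode"].startswith(case_prefix):
            calls.append((row["Hugo_Symbol"], row["Variant_Classification"]))
    return calls


def compare_by_class(gdc_calls, cbio_calls):
    """Map each classification to (GDC count, cBioPortal count, difference)."""
    gdc = Counter(classification for _, classification in gdc_calls)
    cbio = Counter(classification for _, classification in cbio_calls)
    return {
        name: (gdc[name], cbio[name], gdc[name] - cbio[name])
        for name in sorted(set(gdc) | set(cbio))
    }
```

With real files this might be used as `read_calls(open(path), "TCGA-A5-A0G2")` for each source, where `path` is wherever the MAF was saved locally. Grouping on the gene instead of the classification would give the per-gene view.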
Hope that helps!
Best wishes,
Ino
Hi Jeff,
Thanks for that great summary – we’ll update the docs accordingly: https://github.com/cBioPortal/cbioportal/issues/9336
To your question: I believe that’s right. Looking at the code, it looks like Genome Nexus handles it the same way as vcf2maf:
That is, it gets annotated as Targeted_Region if not defined in the variantMap and it’s not a complex inframe event. If you happen to have a mutation event that has any of these annotations, I can confirm it.
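The fallback described above can be sketched roughly as follows. This is an illustration in Python, not the Genome Nexus or vcf2maf source: `VARIANT_MAP` here holds only the consequence-to-classification pairs mentioned earlier in this thread, the `classify` function and its parameters are hypothetical, and the way a complex inframe event is resolved (by variant type) is an assumption.

```python
# Partial, illustrative map: only the pairs mentioned earlier in this thread.
VARIANT_MAP = {
    "stop_gained": "Nonsense_Mutation",
    "splice_acceptor_variant": "Splice_Site",
    "splice_donor_variant": "Splice_Site",
    "start_lost": "Translation_Start_Site",
    "stop_lost": "Nonstop_Mutation",
}


def classify(consequence, variant_type=None, is_complex_inframe=False):
    """Collapse a consequence term to a variant classification."""
    if consequence in VARIANT_MAP:
        return VARIANT_MAP[consequence]
    if is_complex_inframe:
        # Assumption: complex inframe events are resolved by variant type.
        return "In_Frame_Del" if variant_type == "DEL" else "In_Frame_Ins"
    # Not in the map and not a complex inframe event: fall back.
    return "Targeted_Region"


print(classify("stop_gained"))        # Nonsense_Mutation
print(classify("some_unmapped_term")) # Targeted_Region
```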