data dictionary?

Smith, Jeffrey R

Feb 22, 2022, 5:08:15 AM

to cbiop...@googlegroups.com, Dupont, William, Parl, Fritz

Hi,

Who might we reach out to in order to clarify what data cBioPortal displays for TCGA “Mutation Count” of a given sample?

Despite our best effort, we are unable to find its definition on the cBioPortal web site, or reference to a definition elsewhere.

We have also been unable to infer it based upon a comparison of the mutation count for a given TCGA sample in cBioPortal versus in the NIH GDC. For the latter, we explored the various mutation categories but were unable to arrive at a count matching that of cBioPortal.

GDC counts are found here:

https://portal.gdc.cancer.gov/exploration?facetTab=mutations&filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22genes.biotype%22%2C%22value%22%3A%5B%22protein_coding%22%5D%7D%7D%5D%7D

-Best,

Jeff Smith, William Dupont, Fritz Parl

Vanderbilt

debr...@mskcc.org

Feb 22, 2022, 6:35:27 PM

to jeffre...@vumc.org, cbiop...@googlegroups.com, william...@vumc.org, fritz...@vanderbilt.edu, och...@mskcc.org, wan...@mskcc.org

Hi all,

Thanks for reaching out!

It might be tricky to get the exact number of mutations to match in both these portals due to the complexities of variant annotation. But I would hope we can get to something that’s at least in the same ballpark.

The issue is that (1) the reference genome is different. GDC is using GRCh38 and cBioPortal GRCh37. Then (2) the version of VEP might be different. And lastly (3) although we use VEP via our annotation tool called Genome Nexus, we pick a single annotation for each variant as it applies to the canonical transcript for each gene. If I apply these filters:

https://portal.gdc.cancer.gov/exploration?facetTab=mutations&filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22genes.biotype%22%2C%22value%22%3A%5B%22protein_coding%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22ssms.consequence.transcript.consequence_type%22%2C%22value%22%3A%5B%22frameshift_variant%22%2C%22inframe_deletion%22%2C%22inframe_insertion%22%2C%22missense_variant%22%2C%22splice_acceptor_variant%22%2C%22splice_donor_variant%22%2C%22splice_region_variant%22%2C%22stop_gained%22%5D%7D%7D%5D%7D

That seems to get a bit closer to the number of mutations cBioPortal shows for, e.g., TCGA-A5-A0G2:

https://www.cbioportal.org/patient?studyId=ucec_tcga_pan_can_atlas_2018&caseId=TCGA-A5-A0G2

Looping in Angelica and Avery to see if they have a better idea of the exact variants we are importing. I noticed in our documentation we mention these allowed variant classifications: https://docs.cbioportal.org/5.1-data-loading/data-loading/file-formats#minimal-maf-format. But I’m not 100% sure if all of them do indeed get imported.

Hope that helps!

Best wishes,

Ino

--
You received this message because you are subscribed to the Google Groups "cBioPortal for Cancer Genomics Discussion Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cbioportal+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/cbioportal/A2C636A0-1091-4571-8C7D-06E6A10070F6%40vumc.org.

Smith, Jeffrey R

Feb 22, 2022, 7:10:42 PM

to debr...@mskcc.org, cbiop...@googlegroups.com, Dupont, William, fritz...@vanderbilt.edu, och...@mskcc.org, wan...@mskcc.org

Hi Ino, & Co.

For a given TCGA sample’s “Mutation Count” in cBioPortal, we are seeking a definition of the subset of classes that are specifically included, among the superset in bullet #3 of your reference https://docs.cbioportal.org/5.1-data-loading/data-loading/file-formats#minimal-maf-format. This detail is needed for a manuscript in preparation that uses the data and cites cBioPortal.

The cBioPortal Mutation Count for a given sample seems closest to the GDC count restricted to protein-coding genes and to the mutation classes that are most likely to be functional (??):

Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation (stop_gained), Splice_Site (splice_acceptor_variant & splice_donor_variant), Translation_Start_Site (start_lost), Nonstop_Mutation (stop_lost).

-Best,

Jeff

debr...@mskcc.org

Feb 22, 2022, 7:28:11 PM

to jeffre...@vumc.org, cbiop...@googlegroups.com, william...@vumc.org, fritz...@vanderbilt.edu, och...@mskcc.org, wan...@mskcc.org

Hi Jeff,

I think this should be the list:

https://github.com/genome-nexus/genome-nexus/blob/master/component/src/main/java/org/cbioportal/genome_nexus/component/annotation/VariantClassificationResolver.java#L139-L191

Does that help?

Best wishes,

Ino

Smith, Jeffrey R

Feb 22, 2022, 8:03:41 PM

to debr...@mskcc.org, cbiop...@googlegroups.com, Dupont, William, fritz...@vanderbilt.edu, och...@mskcc.org, wan...@mskcc.org

Hi Ino

Thanks for helping!

The section “Variant Classification Filter” of your reference https://docs.cbioportal.org/5.1-data-loading/data-loading/file-formats#minimal-maf-format states: “By default, cBioPortal filters out Silent, Intron, IGR, 3'UTR, 5'UTR, 3'Flank and 5'Flank, except for the promoter mutations of the TERT gene.” Might those classes have been removed from presentation?

Github code lines 157-183 of the link below appear to be the superset that includes rather than filters these, and also includes non-coding and miRNA gene variants. Checking all such selections on the GDC web site for a given TCGA sample yields a GDC mutation count that is considerably greater than the cBioPortal mutation count.

-Jeff

debr...@mskcc.org

Feb 23, 2022, 9:21:02 AM

to jeffre...@vumc.org, cbiop...@googlegroups.com, william...@vumc.org, fritz...@vanderbilt.edu, och...@mskcc.org, wan...@mskcc.org, ga...@mskcc.org

Hi Jeff,

Sure thing!

> The section “Variant Classification Filter” of your reference https://docs.cbioportal.org/5.1-data-loading/data-loading/file-formats#minimal-maf-format states: “By default, cBioPortal filters out Silent, Intron, IGR, 3'UTR, 5'UTR, 3'Flank and 5'Flank, except for the promoter mutations of the TERT gene.” Might those classes have been removed from presentation?

This is right.

> Github code lines 157-183 of the link below appear to be the superset that includes rather than filters these, and also includes non-coding and miRNA gene variants. Checking all such selections on the GDC web site for a given TCGA sample yields a GDC mutation count that is considerably greater than the cBioPortal mutation count.

The way the data flow works is that Genome Nexus first collapses all those terms in lines 157-183 to a variant classification such as Splice_Site, Nonstop_Mutation, Silent, etc. After that, cBioPortal filters out the variant classifications listed above on import. So the filter should probably be something like this:

https://portal.gdc.cancer.gov/exploration?facetTab=mutations&filters=%7B%22op%22%3A%22and%22%2C%22content%22%3A%5B%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22genes.biotype%22%2C%22value%22%3A%5B%22protein_coding%22%5D%7D%7D%2C%7B%22op%22%3A%22in%22%2C%22content%22%3A%7B%22field%22%3A%22ssms.consequence.transcript.consequence_type%22%2C%22value%22%3A%5B%22coding_sequence_variant%22%2C%22frameshift_variant%22%2C%22inframe_deletion%22%2C%22inframe_insertion%22%2C%22missense_variant%22%2C%22protein_altering_variant%22%2C%22splice_acceptor_variant%22%2C%22splice_donor_variant%22%2C%22splice_region_variant%22%2C%22start_lost%22%2C%22stop_gained%22%2C%22stop_lost%22%5D%7D%7D%5D%7D
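For anyone who wants to adjust the consequence list without hand-editing the encoded URL: the `filters` parameter in the link above is URL-encoded JSON. Below is a minimal Python sketch that rebuilds such a link from a plain list of terms. The helper names (`in_clause`, `build_url`, `decode_filters`) are made up for illustration, and the sketch only assumes the structure visible in the links in this thread, not any documented GDC interface.

```python
import json
from urllib.parse import parse_qs, quote, urlsplit

GDC_EXPLORATION = "https://portal.gdc.cancer.gov/exploration"

# Consequence types copied from the filter link above.
CONSEQUENCE_TYPES = [
    "coding_sequence_variant", "frameshift_variant", "inframe_deletion",
    "inframe_insertion", "missense_variant", "protein_altering_variant",
    "splice_acceptor_variant", "splice_donor_variant", "splice_region_variant",
    "start_lost", "stop_gained", "stop_lost",
]


def in_clause(field, values):
    """One 'field is in values' clause, shaped like those in the thread's links."""
    return {"op": "in", "content": {"field": field, "value": list(values)}}


def build_url(clauses):
    """AND the clauses together and URL-encode them as compact JSON."""
    filters = {"op": "and", "content": list(clauses)}
    encoded = quote(json.dumps(filters, separators=(",", ":")), safe="")
    return f"{GDC_EXPLORATION}?facetTab=mutations&filters={encoded}"


def decode_filters(url):
    """Recover the filter JSON from a link, e.g. to inspect someone else's URL."""
    query = parse_qs(urlsplit(url).query)
    return json.loads(query["filters"][0])


url = build_url([
    in_clause("genes.biotype", ["protein_coding"]),
    in_clause("ssms.consequence.transcript.consequence_type", CONSEQUENCE_TYPES),
])
print(url)
```

`decode_filters` can also be pointed at any of the links in this thread to see which consequence types each one selects.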

This gets a bit closer: e.g., TCGA-A5-A0G2 has 25,820 mutations in cBioPortal and 26,792 in GDC. For a few other cases I see similar discrepancies. One thing I neglected to mention here:

> The issue is that (1) the reference genome is different. GDC is using GRCh38 and cBioPortal GRCh37. Then (2) the version of VEP might be different. And lastly (3) although we use VEP via our annotation tool called Genome Nexus, we pick a single annotation for each variant as it applies to the canonical transcript for each gene.

There is also (4): TCGA has a few different processing flavors. For instance, for uterine corpus endometrial carcinoma, cBioPortal has the original published Nature 2013 data, the Firehose Legacy data and the latest PanCancer Atlas data (see the FAQ for more about these differences). I believe GDC should have the same calls as the PanCancer Atlas data, but it’s possible there are some variations in which calls were included from mutect2/varscan2/muse/somaticsniper and pindel. CC’ing JJ, who might know about this aspect.

To get to the bottom of this, one analysis option could be to get the calls from both GDC and cBioPortal (https://github.com/cbioportal/datahub) and check whether the extra calls are enriched for specific types of variants or genes. That might shed some light on why GDC seems to list more mutation events.
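A rough sketch of that comparison in Python, assuming both sources have been exported as tab-separated MAF-style text with the standard `Tumor_Sample_Barcode` and `Variant_Classification` columns. The two tiny tables below are invented toy rows, not real calls, and the function names are hypothetical.

```python
import csv
import io
from collections import Counter


def classification_counts(maf_text, sample_prefix):
    """Count Variant_Classification values for barcodes starting with sample_prefix."""
    # MAF files may start with '#' comment lines; skip them before parsing.
    lines = (line for line in io.StringIO(maf_text) if not line.startswith("#"))
    reader = csv.DictReader(lines, delimiter="\t")
    return Counter(
        row["Variant_Classification"]
        for row in reader
        if row["Tumor_Sample_Barcode"].startswith(sample_prefix)
    )


def count_differences(counts_a, counts_b):
    """Per-class difference (a minus b), keeping only classes that differ."""
    classes = sorted(set(counts_a) | set(counts_b))
    return {c: counts_a[c] - counts_b[c] for c in classes if counts_a[c] != counts_b[c]}


HEADER = "Hugo_Symbol\tTumor_Sample_Barcode\tVariant_Classification"

# Toy stand-ins for a GDC export and a datahub file (made-up rows).
gdc_toy = "\n".join([
    "#version 2.4",
    HEADER,
    "TP53\tTCGA-A5-A0G2-01A\tMissense_Mutation",
    "PTEN\tTCGA-A5-A0G2-01A\tNonsense_Mutation",
    "POLE\tTCGA-A5-A0G2-01A\tSplice_Region",
])
cbio_toy = "\n".join([
    HEADER,
    "TP53\tTCGA-A5-A0G2-01\tMissense_Mutation",
    "PTEN\tTCGA-A5-A0G2-01\tNonsense_Mutation",
])

gdc_counts = classification_counts(gdc_toy, "TCGA-A5-A0G2")
cbio_counts = classification_counts(cbio_toy, "TCGA-A5-A0G2")
print(count_differences(gdc_counts, cbio_counts))
```

Matching on a barcode prefix is a deliberate simplification, since the two sources may not use the same barcode length. A real comparison would also need to account for the GRCh37 versus GRCh38 coordinates before matching individual variants.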

Hope that helps!

Best wishes,

Ino

Smith, Jeffrey R

Feb 23, 2022, 11:56:38 AM

to debr...@mskcc.org, cbiop...@googlegroups.com, Dupont, William, fritz...@vanderbilt.edu, och...@mskcc.org, wan...@mskcc.org, ga...@mskcc.org

Hi Ino

Thanks for clarifying cBioPortal’s “Mutation Count” header for us. We are close! A summary of this might be helpful to others as an FAQ. Our understanding is as follows...

INCLUDED:

Protein coding genes only.

Frame_Shift_Del, Frame_Shift_Ins, In_Frame_Del, In_Frame_Ins, Missense_Mutation, Nonsense_Mutation, Splice_Site, Translation_Start_Site, Nonstop_Mutation, Targeted_Region, De_novo_Start_InFrame, De_novo_Start_OutOfFrame.

Question: Is "Targeted_Region" defined by https://github.com/mskcc/vcf2maf/issues/33, to include TFBS_ablation, TFBS_amplification, regulatory_region_ablation, regulatory_region_amplification, feature_elongation and feature_truncation? This category would seem less definitively pathogenic, relative to the other categories above. But the text does state that it is included.

EXCLUDED (per https://docs.cbioportal.org/5.1-data-loading/data-loading/file-formats#minimal-maf-format):

“cBioPortal skips the following types during the import: Silent, Intron, 3'UTR, 3'Flank, 5'UTR, 5'Flank, IGR and RNA. Two extra values are allowed by cBioPortal here as well: Splice_Region, Unknown.”

The intended meaning of the latter sentence was unclear. I was able to find an example of a Splice_Region mutation (sample TCGA-A5-A0G2, GRCh37 chr3:184556537 A to G) that is included as a “Splice” mutation by cBioPortal as well as a “Splice Region” mutation in GDC (on the GRCh38 map). It is not a canonical GT splice donor or AG acceptor site, and so helps to clarify that the Splice_Region category is included rather than excluded. It implies that the “Unknown” category is also included. I suspect that the Unknown bin is not the PolyPhen “unknown” functional prediction, but something else?
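The understanding summarized above can be written down as a small Python sketch. This is not cBioPortal's import code. The skipped set is copied from the documentation sentence quoted above, and treating the TERT promoter exception as "keep `5'Flank` calls in TERT" is an assumption made purely for illustration.

```python
# Classifications the docs say are skipped during import.
SKIPPED_ON_IMPORT = frozenset({
    "Silent", "Intron", "3'UTR", "3'Flank", "5'UTR", "5'Flank", "IGR", "RNA",
})


def is_counted(hugo_symbol, variant_classification):
    """True if a call would survive the import filter under this sketch's assumptions."""
    # ASSUMPTION: the TERT promoter exception is modeled as 5'Flank calls in TERT.
    if hugo_symbol == "TERT" and variant_classification == "5'Flank":
        return True
    return variant_classification not in SKIPPED_ON_IMPORT


def mutation_count(records):
    """Count (gene, classification) pairs that pass the filter."""
    return sum(1 for gene, cls in records if is_counted(gene, cls))


# Made-up calls for one sample.
toy_calls = [
    ("TP53", "Missense_Mutation"),
    ("PTEN", "Silent"),
    ("TERT", "5'Flank"),
    ("ARID1A", "Intron"),
    ("POLE", "Splice_Region"),
]
print(mutation_count(toy_calls))
```

Under this reading, Splice_Region and Unknown are counted simply because they are not in the skipped set, which is consistent with the Splice_Region example described above.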

-Jeff

debr...@mskcc.org

Feb 24, 2022, 10:07:36 AM

to jeffre...@vumc.org, cbiop...@googlegroups.com, william...@vumc.org, fritz...@vanderbilt.edu, och...@mskcc.org, wan...@mskcc.org, ga...@mskcc.org

Hi Jeff,

Thanks for that great summary – we’ll update the docs accordingly: https://github.com/cBioPortal/cbioportal/issues/9336

To your question: I believe that’s right. Looking at the code, it looks like Genome Nexus handles it the same way as vcf2maf:

https://github.com/genome-nexus/genome-nexus/blob/74d125af810187e8a94d87f369b68e3ef5ee441f/component/src/main/java/org/cbioportal/genome_nexus/component/annotation/VariantClassificationResolver.java#L98-L133

That is, a variant gets annotated as Targeted_Region if its consequence is not defined in the variantMap and it’s not a complex inframe event. If you happen to have a mutation event that has any of these annotations, I can confirm it.
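The fallback described here can be illustrated with a short Python sketch. It is not a port of VariantClassificationResolver.java. The map below holds only the consequence-to-classification pairs mentioned earlier in this thread, so it is deliberately partial, and the complex inframe branch is left out entirely.

```python
# Partial, illustrative map: only pairs mentioned earlier in this thread.
PARTIAL_VARIANT_MAP = {
    "stop_gained": "Nonsense_Mutation",
    "splice_acceptor_variant": "Splice_Site",
    "splice_donor_variant": "Splice_Site",
    "start_lost": "Translation_Start_Site",
    "stop_lost": "Nonstop_Mutation",
}


def resolve_classification(consequence, variant_map=PARTIAL_VARIANT_MAP):
    """Look up a VEP consequence term; anything unmapped falls back to Targeted_Region.

    NOTE: because this map is partial, terms that the real resolver does map
    (for example missense_variant) would also fall through here. The sketch
    only shows the shape of the fallback, not the full behavior.
    """
    return variant_map.get(consequence, "Targeted_Region")


for term in ("stop_gained", "TFBS_ablation", "regulatory_region_amplification"):
    print(term, "->", resolve_classification(term))
```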

Smith, Jeffrey R

Feb 24, 2022, 10:22:42 AM

to debr...@mskcc.org, cbiop...@googlegroups.com, Dupont, William, <fritz.parl@vanderbilt.edu>, och...@mskcc.org, wan...@mskcc.org, ga...@mskcc.org

Ino

Got it! Thanks for all your help. And for the strong resource that your crew has created.

-Jeff
