Import of fusions/structural variants

Uhrig, Sebastian

Oct 13, 2021, 9:06:25 AM

to cbiop...@googlegroups.com

Hello,

I am currently writing a file format conversion script to import fusions/structural variants into cBioPortal. I have read the docs (https://docs.cbioportal.org/5.1-data-loading/data-loading/file-formats#structural-variant-data) and also found some helpful examples in the cBioPortal DataHub on GitHub (https://github.com/cBioPortal/datahub/tree/master/public). However, some things are unclear to me:

1. What is the difference between "Tumor_Read_Count" and "Tumor_Variant_Count"?

2. How should gene-related fields be filled if a breakpoint hits an intergenic region? Should the closest gene be annotated? If yes, is it somehow possible to convey to the user that it is a nearby hit and not a direct hit?

3. While browsing through examples on the DataHub, I noticed that the column Site2_Effect_On_Frame sometimes contains the value "NON_CODING", even though the manual specifies that only the values "FRAMESHIFT" and "IN_FRAME" should be valid values. An additional value "NON_CODING" makes sense in my opinion, because the other two values are simply not applicable in case of non-coding genes. But is it valid to use this additional value?

Many thanks in advance,
Sebastian

Benjamin Gross

Oct 13, 2021, 2:39:55 PM

to Uhrig, Sebastian, Kundra, Ritika/Sloan Kettering Institute, cbiop...@googlegroups.com, de Bruijn, Ino/Sloan Kettering Institute

Hi Sebastian,

Thanks for the email. I should start by saying our support of SV is a work-in-progress and there may be some evolution of the file specification over time. Some comments below -

> On Oct 12, 2021, at 12:50 PM, Uhrig, Sebastian <s.u...@dkfz-heidelberg.de> wrote:
>
> Hello,
>
> I am currently writing a file format conversion script to import fusions/structural variants into cBioPortal. I have read the docs (https://docs.cbioportal.org/5.1-data-loading/data-loading/file-formats#structural-variant-data) and also found some helpful examples in the cBioPortal DataHub on GitHub (https://github.com/cBioPortal/datahub/tree/master/public). However, some things are unclear to me:

Our curation team was going to start an effort like this in the near future. I wonder if you would be willing to contribute your tool to the codebase. We can help evolve it over time. I’ve added our lead curator, Ritika Kundra, to this email to follow up.

>
> 1. What is the difference between "Tumor_Read_Count" and "Tumor_Variant_Count"?

Tumor_Read_Count is the total number of reads of the tumor tissue. Tumor_Variant_Count is the number of reads of the tumor tissue that have the variant/allele which supports the structural variant.
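
To make the distinction concrete, here is a minimal sketch that derives the fraction of tumor reads supporting the SV from the two columns. The function name and the example numbers are made up for illustration and are not part of cBioPortal; the sketch only assumes the column meanings described above.

```python
def variant_read_fraction(tumor_read_count: int, tumor_variant_count: int) -> float:
    """Fraction of tumor reads that support the structural variant.

    tumor_read_count    -> Tumor_Read_Count (all reads of the tumor tissue)
    tumor_variant_count -> Tumor_Variant_Count (reads supporting the SV)
    """
    if tumor_read_count <= 0:
        raise ValueError("Tumor_Read_Count must be positive")
    if not 0 <= tumor_variant_count <= tumor_read_count:
        raise ValueError("Tumor_Variant_Count must be between 0 and Tumor_Read_Count")
    return tumor_variant_count / tumor_read_count


# Hypothetical example: 12 of 80 reads support the fusion.
print(variant_read_fraction(80, 12))  # 0.15
```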

>
> 2. How should gene-related fields be filled if a breakpoint hits an intergenic region? Should the closest gene be annotated? If yes, is it somehow possible to convey to the user that it is a nearby hit and not a direct hit?

Someone else may have a better answer. If the event is within the same gene, then gene1 and gene2 can be the same. You can convey this information in the COMMENTS field of the record.

>
> 3. While browsing through examples on the DataHub, I noticed that the column Site2_Effect_On_Frame sometimes contains the value "NON_CODING", even though the manual specifies that only the values "FRAMESHIFT" and "IN_FRAME" should be valid values. An additional value "NON_CODING" makes sense in my opinion, because the other two values are simply not applicable in case of non-coding genes. But is it valid to use this additional value?

This could be an error in our documentation. I did a quick check of our validator and it’s only checking for the basic fields (Site1_Entrez, Exon, etc). You should be able to go ahead and use NON_CODING in this field. (Cc’ing Ino)
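
Since the validator does not check this column, a conversion script may want its own sanity check. The following is a minimal sketch, not cBioPortal code: the allowed set is simply the two documented values plus NON_CODING as seen in DataHub files, and treating an empty value as acceptable is an assumption.

```python
# Values discussed in this thread: the two documented ones plus NON_CODING,
# which appears in DataHub files. This set is an assumption, not an official list.
ALLOWED_EFFECT_ON_FRAME = {"FRAMESHIFT", "IN_FRAME", "NON_CODING"}


def check_effect_on_frame(value: str) -> bool:
    """Return True if the value is empty or one of the values listed above.

    Accepting an empty value is an assumption made for this sketch.
    """
    value = value.strip()
    return value == "" or value in ALLOWED_EFFECT_ON_FRAME


for candidate in ("IN_FRAME", "NON_CODING", "in-frame"):
    print(candidate, check_effect_on_frame(candidate))
```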

Best,
Benjamin

>
> Many thanks in advance,
> Sebastian
>


Uhrig, Sebastian

Oct 14, 2021, 11:34:06 AM

to Benjamin Gross, cbiop...@googlegroups.com, de Bruijn, Ino/Sloan Kettering Institute, Kundra, Ritika/Sloan Kettering Institute

Hi Benjamin,

Thank you for your explanations.

I can contribute my conversion script if you find it to be useful to the community. It is based on fusion calls from Arriba (https://github.com/suhrig/arriba), which I developed. Arriba is part of the GDC RNA-Seq pipeline (https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/#rna-seq-alignment-workflow), so it may indeed be of use to a broader range of people.

> > 2. How should gene-related fields be filled if a breakpoint hits an intergenic region?
> > Should the closest gene be annotated? If yes, is it somehow possible to convey to the
> > user that it is a nearby hit and not a direct hit?
>
> Someone else may have a better answer. If the event is within the same gene, then
> gene1 and gene2 can be the same. You can convey this information in the COMMENTS
> field of the record.

I was talking about intERgenic breakpoints, not intRAgenic breakpoints, i.e., breakpoints between genes rather than within a single gene. I suppose it would make sense to annotate the closest gene in most cases (e.g., enhancer hijacking translocations between IGH and the vicinity of CCND1 could be annotated as IGH-CCND1 translocations). In some cases, however, this would be ambiguous. For example, a deletion of a few terminal exons of a gene, say PTEN, can lead to a fusion between PTEN and an intergenic region downstream of it. When the deletion is small enough, the gene closest to the intergenic breakpoint would be PTEN itself, resulting in a fusion that would be annotated as PTEN--PTEN, which appears to the user as a fusion of the gene with itself, although it really is a PTEN--intergenic fusion.

Regards,
Sebastian


Benjamin Gross

Oct 14, 2021, 1:20:26 PM

to Uhrig, Sebastian, cbiop...@googlegroups.com, de Bruijn, Ino/Sloan Kettering Institute, Kundra, Ritika/Sloan Kettering Institute

Hi Sebastian,

> On Oct 14, 2021, at 11:34 AM, Uhrig, Sebastian <s.u...@dkfz-heidelberg.de> wrote:
>
> Hi Benjamin,
>
> Thank you for your explanations.
>
> I can contribute my conversion script if you find it to be useful to the community. It is based on fusion calls from Arriba (https://github.com/suhrig/arriba), which I developed. Arriba is part of the GDC RNA-Seq pipeline (https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/#rna-seq-alignment-workflow), so it may indeed be of use to a broader range of people.

I believe I misinterpreted your email. I was thinking you were writing code to convert fusion data from the old format to the more general SV. I will take a closer look at your git repos.

>
>>> 2. How should gene-related fields be filled if a breakpoint hits an intergenic region?
>>> Should the closest gene be annotated? If yes, is it somehow possible to convey to the
>>> user that it is a nearby hit and not a direct hit?
>>
>> Someone else may have a better answer. If the event is within the same gene, then
>> gene1 and gene2 can be the same. You can convey this information in the COMMENTS
>> field of the record.
>
> I was talking about intERgenic breakpoints, not intRAgenic breakpoints, i.e., breakpoints between genes rather than within a single gene. I suppose it would make sense to annotate the closest gene in most cases (e.g., enhancer hijacking translocations between IGH and the vicinity of CCND1 could be annotated as IGH-CCND1 translocations). In some cases, however, this would be ambiguous. For example, a deletion of a few terminal exons of a gene, say PTEN, can lead to a fusion between PTEN and an intergenic region downstream of it. When the deletion is small enough, the gene closest to the intergenic breakpoint would be PTEN itself, resulting in a fusion that would be annotated as PTEN--PTEN, which appears to the user as a fusion of the gene with itself, although it really is a PTEN--intergenic fusion.

Not being a biologist, it’s hard for me to say, but it seems like in this case, maybe we shouldn’t require a partner gene (which our importer currently requires). I think we need to take this use case under consideration. For now, using the COMMENTS or EVENT_INFO fields is probably the best we can do.
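
As a rough illustration of that workaround, the sketch below annotates the closest gene for an intergenic breakpoint and records that fact in the comments field. It is only a sketch under assumptions: the helper names are hypothetical, the column names `Site1_Hugo_Symbol`, `Site2_Hugo_Symbol` and `Comments` are assumed to match the SV file format, and the distance value is invented.

```python
from typing import Optional


def annotate_breakpoint(hit_gene: Optional[str], closest_gene: str, distance_bp: int) -> dict:
    """Build the gene-related fields for one breakpoint.

    hit_gene     -- gene overlapping the breakpoint, or None if intergenic
    closest_gene -- nearest gene (used as a stand-in for intergenic breakpoints)
    distance_bp  -- distance from the breakpoint to closest_gene
    """
    if hit_gene is not None:
        return {"symbol": hit_gene, "note": ""}
    note = f"intergenic breakpoint, {distance_bp} bp from {closest_gene}"
    return {"symbol": closest_gene, "note": note}


def build_record(site1: dict, site2: dict) -> dict:
    """Combine both breakpoints into one SV row; notes go into Comments."""
    notes = [f"site{i}: {s['note']}" for i, s in ((1, site1), (2, site2)) if s["note"]]
    return {
        "Site1_Hugo_Symbol": site1["symbol"],  # assumed column name
        "Site2_Hugo_Symbol": site2["symbol"],  # assumed column name
        "Comments": "; ".join(notes),          # assumed column name
    }


# The PTEN example from above: the second breakpoint lies just downstream of PTEN.
row = build_record(
    annotate_breakpoint("PTEN", "PTEN", 0),
    annotate_breakpoint(None, "PTEN", 3500),  # 3500 bp is an invented distance
)
print(row["Comments"])  # site2: intergenic breakpoint, 3500 bp from PTEN
```

With this approach the record still reads PTEN--PTEN in the gene columns, but a user who inspects the comment can tell that the second breakpoint is a nearby hit rather than a direct one.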

Best,
Benjamin

Uhrig, Sebastian

Oct 19, 2021, 8:24:58 AM

to Benjamin Gross, cbiop...@googlegroups.com, de Bruijn, Ino/Sloan Kettering Institute, Kundra, Ritika/Sloan Kettering Institute

Hi Benjamin,

I have successfully imported a single structural variant as a test case into my local cBioPortal instance. There were no errors or warnings during import. On the study summary page a new molecular profile appeared, named "SV Data". The chart "Structural Variant Genes" correctly reports 1 profiled sample, but the box is empty (no genes are listed). Also, on the query page, there is no "genomic profile" for "structural variants" to select, as if SV data did not exist for the cohort. I am a bit clueless why the data is not shown in the portal. Do you have any advice? I have attached the import files.

Furthermore, I have tried to import a bigger batch of structural variants. The import process often (but not always) complains about "Invalid Site 1 or 2 Ensembl transcript ID or exon found, ignoring structural variant for SV record". Similarly, the import process crashes on some entries with an error "java.lang.RuntimeException: org.mskcc.cbio.portal.dao.DaoException: DB Error: only 180 of the 184 records were inserted in `structural_variant`. More error/warning details: Cannot add or update a child row: a foreign key constraint fails (`cbioportal`.`structural_variant`, CONSTRAINT `structural_variant_ibfk_2` FOREIGN KEY (`SITE1_ENTREZ_GENE_ID`) REFERENCES `gene` (`ENTREZ_GENE_ID`) ON DELETE CASCADE) See tmp file for more details: /tmp/structural_variant15999843332152537521.tempTable". Apparently, I supplied unrecognized Entrez gene IDs. Can you tell me where I can find all recognized transcript IDs and Entrez gene IDs? What source did cBioPortal use to populate its annotation database (ENSEMBL/GENCODE/which version)?

Many thanks for your help,
Sebastian


sv_import.zip

Benjamin Gross

Oct 19, 2021, 1:16:14 PM

to Uhrig, Sebastian, cbiop...@googlegroups.com, de Bruijn, Ino/Sloan Kettering Institute, Kundra, Ritika/Sloan Kettering Institute

Hi Sebastian,

Glad you made some progress. The files you provided look good to me. I’m not sure what is happening. I can try to import these myself, but it will take a few days to get to it. Can you provide me with the rest of the study files?

Regarding the issues during batch import: cBioPortal uses HGNC as its source of gene information (https://www.genenames.org/). There are no versions, but you can find when a snapshot was taken on the following page (it looks like the latest snapshot is from Feb 20, 2021):

https://github.com/cBioPortal/datahub/tree/master/seedDB#release-notes

One way to get an idea of what gene information is missing is by running our Dataset Validator. I would run the validator on your data files which will report any issues with the data, including (I believe) unknown genes. There is some information on that here:

https://docs.cbioportal.org/5.1-data-loading/data-loading/using-the-dataset-validator

Best,

Benjamin

sv_import.zip

Uhrig, Sebastian

Oct 20, 2021, 1:39:12 PM

to Benjamin Gross, cbiop...@googlegroups.com, de Bruijn, Ino/Sloan Kettering Institute, Kundra, Ritika/Sloan Kettering Institute

Dear Benjamin,

Thank you for the very informative response.

The seedDB snapshot seems like a very good source of information about Entrez gene IDs. This will help me identify unrecognized IDs prior to the upload. I have come to realize, though, that cBioPortal has no problem when I simply omit the Entrez IDs. I suspect it then performs the mapping by gene symbol, which is not different from what I have done. So I will probably just leave it at that.
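
For anyone who prefers to keep the Entrez IDs, a pre-upload check could look roughly like the sketch below. It assumes the recognized IDs have already been extracted from the seedDB into a Python set (how to do that is not shown), and that the SV file uses the column names `Sample_Id`, `Site1_Entrez_Gene_Id` and `Site2_Entrez_Gene_Id`; the function name is hypothetical.

```python
import csv
import io


def split_by_known_entrez(sv_tsv: str, known_ids: set):
    """Split SV rows into (importable, rejected) based on Entrez gene IDs.

    Empty IDs are kept, since omitting them was accepted by the importer
    (see above). Column names are assumptions about the SV file format.
    """
    ok, rejected = [], []
    for row in csv.DictReader(io.StringIO(sv_tsv), delimiter="\t"):
        ids = [
            (row.get("Site1_Entrez_Gene_Id") or "").strip(),
            (row.get("Site2_Entrez_Gene_Id") or "").strip(),
        ]
        if all(i == "" or i in known_ids for i in ids):
            ok.append(row)
        else:
            rejected.append(row)
    return ok, rejected


# Invented example rows; 5728 is PTEN and 595 is CCND1, 999999999 is a dummy ID.
EXAMPLE_TSV = (
    "Sample_Id\tSite1_Entrez_Gene_Id\tSite2_Entrez_Gene_Id\n"
    "S1\t5728\t595\n"
    "S2\t5728\t\n"
    "S3\t999999999\t595\n"
)
known = {"5728", "595"}
ok, rejected = split_by_known_entrez(EXAMPLE_TSV, known)
print([r["Sample_Id"] for r in ok])        # ['S1', 'S2']
print([r["Sample_Id"] for r in rejected])  # ['S3']
```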

The seedDB does not contain information about transcript IDs, however. "grep"ing for "ENST..." IDs only returned a handful of hits. I am still clueless which gene model is used by cBioPortal and hence, which transcript IDs are recognized. A big chunk of the SVs fail to be imported for this reason.

I ran the data validator. It is also helpful at identifying invalid Entrez IDs prior to the upload. But it performs no checks on the transcript IDs.

I appreciate the offer to troubleshoot my test data when you find the time. I attached a complete minimal set of files that should be able to reproduce my issue.

Regards,

Sebastian


teststudy.zip

Uhrig, Sebastian

Dec 4, 2021, 1:15:23 PM

to Benjamin Gross, cbiop...@googlegroups.com, de Bruijn, Ino/Sloan Kettering Institute, Kundra, Ritika/Sloan Kettering Institute

Hi Benjamin,

I have solved the problem of SVs failing to be imported due to having invalid transcript IDs. As it turns out, the issue was another one (invalid exon numbers) and not related to invalid transcript IDs. The transcript IDs are not validated at all. Any text is allowed.

However, this still does not explain why the successfully imported SVs do not show up in the web-portal. Have you had a chance to debug this issue using the example I sent to you earlier?

Thanks,
Sebastian


Benjamin Gross

Dec 6, 2021, 10:14:36 AM

to Uhrig, Sebastian, cbiop...@googlegroups.com, de Bruijn, Ino/Sloan Kettering Institute, Kundra, Ritika/Sloan Kettering Institute

Hi Sebastian,

I haven’t had time to debug this issue yet, but we are prioritizing the completion of proper SV support starting in Jan (I say that, because right now we have a hybrid implementation in place where fusion data comes in via the mutation table or the SV table). The lead engineer of the current implementation has moved on from the cBioPortal project, so it will take some extra time for our engineers to familiarize themselves with the current codebase. We will use the opportunity to understand the code, remove any transient code, and maybe put better validation in place (afaik, we currently do not have a convenient way to validate incoming transcript ids or exon numbers).

I will keep you posted. Let me know if you have any other questions.

Best,

Benjamin

Benjamin Gross

Jan 19, 2022, 2:43:10 PM

to Uhrig, Sebastian, cbiop...@googlegroups.com, de Bruijn, Ino/Sloan Kettering Institute, Kundra, Ritika/Sloan Kettering Institute

Hi Sebastian,

Just wanted to keep you in the loop regarding our SV data refactoring. We've been working on a PR which removes support of fusion data via the mutation channel and brings it in through a proper SV channel. I’ve added a card on our scrumboard to test the data you’ve provided and identify any inconsistencies in our code and documentation. We will also be revising SV support in the validator tool and fixing any inconsistencies.

Best,

Benjamin

Michal Inbar

Jan 30, 2022, 5:04:59 AM

to cBioPortal for Cancer Genomics Discussion Group

Great, we are also waiting for this feature. Waiting for your update :)
