I am currently writing a file format conversion script to import fusions/structural variants into cBioPortal. I have read the docs (https://docs.cbioportal.org/5.1-data-loading/data-loading/file-formats#structural-variant-data) and also found some helpful examples in the cBioPortal DataHub on GitHub (https://github.com/cBioPortal/datahub/tree/master/public). However, some things are unclear to me:
1. What is the difference between "Tumor_Read_Count" and "Tumor_Variant_Count"?
2. How should gene-related fields be filled if a breakpoint hits an intergenic region? Should the closest gene be annotated? If yes, is it somehow possible to convey to the user that it is a nearby hit and not a direct hit?
3. While browsing through examples on the DataHub, I noticed that the column Site2_Effect_On_Frame sometimes contains the value "NON_CODING", even though the manual specifies that only "FRAMESHIFT" and "IN_FRAME" are valid values. An additional value "NON_CODING" makes sense in my opinion, because the other two values are simply not applicable in the case of non-coding genes. But is it valid to use this additional value?
Many thanks in advance,
Sebastian
Thank you for your explanations.
I can contribute my conversion script if you find it to be useful to the community. It is based on fusion calls from Arriba (https://github.com/suhrig/arriba), which I developed. Arriba is part of the GDC RNA-Seq pipeline (https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/#rna-seq-alignment-workflow), so it may indeed be of use to a broader range of people.
> > 2. How should gene-related fields be filled if a breakpoint hits an intergenic region?
> > Should the closest gene be annotated? If yes, is it somehow possible to convey to the
> > user that it is a nearby hit and not a direct hit?
>
> Someone else may have a better answer. If the event is within the same gene, then
> gene1 and gene2 can be the same. You can convey this information in the COMMENTS
> field of the record.
I was talking about intERgenic breakpoints, not intRAgenic breakpoints, i.e., breakpoints between genes rather than within a single gene. I suppose it would make sense to annotate the closest gene in most cases (e.g., enhancer hijacking translocations between IGH and the vicinity of CCND1 could be annotated as IGH-CCND1 translocations). In some cases this would be ambiguous, however. For example, a deletion of a few terminal exons of a gene, say PTEN, can lead to a fusion between PTEN and an intergenic region downstream of it. When the deletion is small enough, the gene closest to the intergenic breakpoint is PTEN itself, so the fusion would be annotated as PTEN--PTEN. To the user this appears to be a fusion of the gene with itself, although it really is a PTEN--intergenic fusion.
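One way to keep the PTEN--intergenic case distinguishable is to annotate the closest gene and record the intergenic status in the comments field. The following is a minimal Python sketch of that idea. It assumes the column names Site1_Hugo_Symbol, Site2_Hugo_Symbol and Comments; the helper names annotate_breakpoint and build_record are hypothetical, and whether the portal displays the comment to the user is not confirmed in this thread.

```python
def annotate_breakpoint(gene_hit, closest_gene):
    """Return (symbol, note) for one breakpoint.

    gene_hit: symbol of the gene containing the breakpoint, or None if intergenic.
    closest_gene: symbol of the nearest gene, used only for intergenic breakpoints.
    """
    if gene_hit is not None:
        return gene_hit, ""
    return closest_gene, f"intergenic breakpoint near {closest_gene}"


def build_record(site1, site2):
    """Build the gene and comment fields from two (gene_hit, closest_gene) tuples."""
    symbol1, note1 = annotate_breakpoint(*site1)
    symbol2, note2 = annotate_breakpoint(*site2)
    notes = [f"site{i}: {note}" for i, note in ((1, note1), (2, note2)) if note]
    return {
        # Column names are assumptions, not taken from this thread.
        "Site1_Hugo_Symbol": symbol1,
        "Site2_Hugo_Symbol": symbol2,
        "Comments": "; ".join(notes),
    }


# Small PTEN deletion: site 1 is inside PTEN, site 2 is intergenic downstream of it.
record = build_record(("PTEN", "PTEN"), (None, "PTEN"))
print(record)
```

The gene fields still read PTEN/PTEN, but the comment marks the second breakpoint as a nearby hit rather than a direct hit.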
Regards,
Sebastian
________________________________________
From: Benjamin Gross <benjami...@gmail.com>
Sent: Wednesday, October 13, 2021 8:39 PM
To: Uhrig, Sebastian; Kundra, Ritika/Sloan Kettering Institute
Cc: cbiop...@googlegroups.com; de Bruijn, Ino/Sloan Kettering Institute
Subject: Re: [cbioportal] Import of fusions/structural variants
I have successfully imported a single structural variant as a test case into my local cBioPortal instance. There were no errors or warnings during import. On the study summary page a new molecular profile appeared, named "SV Data". The chart "Structural Variant Genes" correctly reports 1 profiled sample, but the box is empty (no genes are listed). Also, on the query page, there is no "genomic profile" for "structural variants" to select, as if SV data did not exist for the cohort. I am at a loss as to why the data is not shown in the portal. Do you have any advice? I have attached the import files.
Furthermore, I have tried to import a bigger batch of structural variants. The import process often (but not always) complains about "Invalid Site 1 or 2 Ensembl transcript ID or exon found, ignoring structural variant for SV record". Similarly, the import process crashes on some entries with an error "java.lang.RuntimeException: org.mskcc.cbio.portal.dao.DaoException: DB Error: only 180 of the 184 records were inserted in `structural_variant`. More error/warning details: Cannot add or update a child row: a foreign key constraint fails (`cbioportal`.`structural_variant`, CONSTRAINT `structural_variant_ibfk_2` FOREIGN KEY (`SITE1_ENTREZ_GENE_ID`) REFERENCES `gene` (`ENTREZ_GENE_ID`) ON DELETE CASCADE) See tmp file for more details: /tmp/structural_variant15999843332152537521.tempTable". Apparently, I supplied unrecognized Entrez gene IDs. Can you tell me where I can find all recognized transcript IDs and Entrez gene IDs? What source did cBioPortal use to populate its annotation database (ENSEMBL/GENCODE/which version)?
Many thanks for your help,
Sebastian
________________________________________
From: Benjamin Gross <benjami...@gmail.com>
Sent: Thursday, October 14, 2021 7:20 PM
To: Uhrig, Sebastian
Cc: cbiop...@googlegroups.com; de Bruijn, Ino/Sloan Kettering Institute; Kundra, Ritika/Sloan Kettering Institute
Dear Benjamin,
Thank you for the very informative response.
The seedDB snapshot seems like a very good source of information about Entrez gene IDs. This will help me identify unrecognized IDs prior to the upload. I have come to realize, though, that cBioPortal has no problem when I simply omit the Entrez IDs. I suspect it then performs the mapping by gene symbol, which is not different from what I have done. So I will probably just leave it at that.
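The pre-upload check described above could look roughly like the following Python sketch. It assumes that the recognized Entrez gene IDs have already been extracted from the seedDB snapshot into a set (how to do that is not shown), and that the ID columns are named Site1_Entrez_Gene_Id and Site2_Entrez_Gene_Id. The function name split_by_known_entrez and the sample rows are made up for illustration. Empty IDs are let through, since omitting them did not cause problems.

```python
import csv
import io


def split_by_known_entrez(rows, known_ids,
                          columns=("Site1_Entrez_Gene_Id", "Site2_Entrez_Gene_Id")):
    """Split rows into (accepted, rejected) depending on whether every
    non-empty Entrez gene ID is contained in known_ids."""
    accepted, rejected = [], []
    for row in rows:
        ids = [(row.get(column) or "").strip() for column in columns]
        unknown = [gene_id for gene_id in ids if gene_id and gene_id not in known_ids]
        (rejected if unknown else accepted).append(row)
    return accepted, rejected


# Made-up example data; the IDs are placeholders, not real Entrez gene IDs.
sample_tsv = (
    "Sample_Id\tSite1_Hugo_Symbol\tSite1_Entrez_Gene_Id\tSite2_Hugo_Symbol\tSite2_Entrez_Gene_Id\n"
    "S1\tGENE_A\t111\tGENE_B\t222\n"
    "S2\tGENE_A\t111\tGENE_C\t999\n"
    "S3\tGENE_A\t\tGENE_B\t\n"
)
known_ids = {"111", "222"}  # would come from the seedDB snapshot

rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))
accepted, rejected = split_by_known_entrez(rows, known_ids)
print("accepted:", [row["Sample_Id"] for row in accepted])
print("rejected:", [row["Sample_Id"] for row in rejected])
```

Rejected rows could either be dropped or written out with the Entrez ID columns left empty, so that only the gene symbol is used.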
The seedDB does not contain information about transcript IDs, however. "grep"ing for "ENST..." IDs only returned a handful of hits. I still do not know which gene model is used by cBioPortal and hence which transcript IDs are recognized. A big chunk of the SVs fail to be imported for this reason.
I ran the data validator. It is also helpful for identifying invalid Entrez IDs prior to the upload. But it performs no checks on the transcript IDs.
I appreciate the offer to troubleshoot my test data when you find the time. I attached a complete minimal set of files that should be able to reproduce my issue.
Hi Benjamin,
I have solved the problem of SVs failing to be imported with the "invalid transcript ID or exon" warning. As it turns out, the cause was invalid exon numbers and not invalid transcript IDs. The transcript IDs are not validated at all. Any text is allowed.
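A check for exon numbers before the upload could be sketched as follows in Python. The thread does not state what exactly made the exon numbers invalid, so the rule used here (a valid value is either empty or a positive integer) is an assumption, and the function name exon_is_valid is hypothetical.

```python
import re

# Assumption: a valid exon number is a positive integer without a sign or decimals.
POSITIVE_INTEGER = re.compile(r"[1-9][0-9]*")


def exon_is_valid(value):
    """Return True if the exon field is empty or a positive integer."""
    text = value.strip()
    if text == "":
        return True
    return POSITIVE_INTEGER.fullmatch(text) is not None


for candidate in ["3", "", "0", "-1", "2.5", "exon3"]:
    print(repr(candidate), exon_is_valid(candidate))
```

If the importer turns out to accept other values, only the pattern needs to be adjusted.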
However, this still does not explain why the successfully imported SVs do not show up in the web portal. Have you had a chance to debug this issue using the example I sent you earlier?
Thanks,
Sebastian