September, 2024
Welcome to the September 2024 edition of the cBioPortal Newsletter!
We had an exciting summer working with 9 contributors through the Google Summer of Code. In this issue we’ll highlight their work, as well as other recent improvements to cBioPortal.
New Data in cBioPortal
Added data consisting of 12,863 samples from 33 TCGA and 10 CPTAC studies from the Genomic Data Commons (GDC) as part of the Cancer Research Data Commons (NCI-CRDC) initiative. More information can be found on our FAQ.
Additionally, added data consisting of 2,411 samples from 10 studies including:
Pediatric European MAPPYACTS Trial (Gustave Roussy, Cancer Discov 2022): Actionable mutations from whole-exome sequencing data from the pan-cancer pediatric trial MAPPYACTS.
Hepatocellular Carcinoma (CLCA, Nature 2024): Includes whole-exome sequencing from 494 hepatocellular carcinoma samples from the Chinese Liver Cancer Atlas.
For the full list, visit our News page.
Google Summer of Code Projects
Extend Chart Types in Study View
GSoC Contributor: Olzhas Mukayev
The Study View Page in cBioPortal provides users the ability to view and generate charts for clinical, genomic and other types of data for a set of studies. Currently, the Study View Page in cBioPortal supports pie charts, bar charts, and tables. This project enhances the Study View Page by implementing new chart functionality: the ability to show a zoom preview for bar charts and the addition of line charts. These features enhance the user experience of cBioPortal, giving users more freedom and tools to customize charts on the Study View Page for visualization and research needs.
Code Link (Partially in production)
Frontend Visualization and Incorporation of Single Cell Data in cBioPortal
GSoC Contributor: Suraj Sharma
Researchers often aim to combine insights from various omics techniques, and single-cell gene expression data provides an additional layer of depth to cancer genomic analyses. By integrating single-cell data at the cell type and sample level, researchers can compare gene expression between cell types within or across groups, uncovering tumor heterogeneity and distinct gene expression profiles. In this project, a new single-cell tab has been integrated into the cBioPortal frontend for analyzing data at the cell type-patient level, along with enhanced functionality for stacked bar plots in the portal.
Automated Curation and Harmonization of cBioPortal Clinical Metadata using Sentence Transformers
GSoC Contributor: Abhilash Dhal
This project aims to develop automated tools for harmonizing and standardizing clinical metadata in cBioPortal. Abhilash established the framework for two main components of the tool: schema mapping and ontology mapping. The schema mapper harmonizes column names across different studies using frequency- and transformer-based approaches. The frequency-based methods achieved over 80% accuracy on test data. For ontology mapping, a three-stage framework was implemented, incorporating exact matching, language model (LM)-based matching, and large language model (LLM)-based matching. Abhilash tested various BERT models, with SAP-BERT showing the best performance across different categories (treatment_name, body site, and disease), achieving over 80% accuracy for the top 5 matches. These outcomes represent significant progress in automating metadata harmonization for cBioPortal, with potential for further improvements through fine-tuning models and implementing more advanced NLP techniques, ultimately enhancing the FAIRness and AI/ML-readiness of cBioPortal data across studies.
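As a rough illustration of the staged ontology mapping idea described above, here is a minimal Python sketch. It is not the project's code: the term list is made up, and the standard library's difflib string similarity stands in for the language-model matching stage (the LLM stage is omitted). It only shows the control flow of trying an exact match first and falling back to a ranked top-5 candidate list.

```python
from difflib import SequenceMatcher

# Hypothetical mini-ontology for illustration only; a real term list
# would be loaded from an ontology source.
ONTOLOGY_TERMS = ["Lung", "Liver", "Breast", "Colon", "Kidney", "Skin"]


def normalize(text):
    """Lowercase, treat underscores as spaces, and collapse whitespace."""
    return " ".join(text.lower().replace("_", " ").split())


def map_term(value, terms=ONTOLOGY_TERMS, top_k=5):
    """Return (stage, candidates) for a raw clinical metadata value."""
    norm = normalize(value)

    # Stage 1: exact match after normalization.
    for term in terms:
        if normalize(term) == norm:
            return "exact", [term]

    # Stage 2 (stand-in): rank terms by string similarity. The project
    # used language-model matching here; difflib is an assumption made
    # to keep this sketch dependency-free.
    ranked = sorted(
        terms,
        key=lambda t: SequenceMatcher(None, norm, normalize(t)).ratio(),
        reverse=True,
    )
    return "similarity", ranked[:top_k]
```

Returning a short ranked list rather than a single answer mirrors how the results above are reported, as accuracy over the top 5 matches.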
Add the Ability to Spawn Code Notebooks from cBioPortal Queries
GSoC Contributor: Gautam Sarawagi
Users may want to perform custom analysis on data queried in cBioPortal. This project introduces the ability to spawn a browser-based JupyterLite code notebook populated with data exported from the Oncoprint. A sample Python script renders a visualization in the Oncoprint. This feature could easily be extended to other export points in the portal, as well as other third-party analytics tools.
Code Link (In production)
Chatbot Trained on Documentation Site and Conversations
GSoC Contributor: Xinling Wang
This project makes it easier for users to find the information they need about the cBioPortal project through a customized chatbot based on GPT-4. The chatbot uses retrieval-augmented generation paired with a routing mechanism to match user questions to cBioPortal content, including Google Group conversations and documentation. Additionally, it can address basic questions using content returned from the cBioPortal API. There are ongoing plans to make this prototype available more broadly in a maintainable manner.
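To make the routing-plus-retrieval idea a little more concrete, here is a toy Python sketch. Everything in it is an assumption for illustration: the snippets are invented, the scoring is plain keyword overlap rather than embeddings, and no language model is called. It only shows how a question might be routed to the content source holding the best-matching passage, which is then placed into a prompt.

```python
import re

# Hypothetical content snippets; the real chatbot draws on cBioPortal
# documentation and Google Group conversations.
SOURCES = {
    "documentation": [
        "Study data is loaded into cBioPortal with the import scripts.",
        "The Oncoprint shows genomic alterations across samples.",
    ],
    "google_group": [
        "A user asked why the login page fails after an upgrade.",
    ],
}


def tokens(text):
    """Split text into a set of lowercase alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def route_and_retrieve(question, sources=SOURCES):
    """Return (source_name, passage) with the highest keyword overlap,
    or (None, None) if no passage shares a token with the question."""
    q = tokens(question)
    best_score, best_source, best_passage = 0, None, None
    for name, passages in sources.items():
        for passage in passages:
            score = len(q & tokens(passage))
            if score > best_score:
                best_score, best_source, best_passage = score, name, passage
    return best_source, best_passage


def build_prompt(question, passage):
    """Combine the retrieved passage and the question into one prompt."""
    return f"Context: {passage}\nQuestion: {question}\nAnswer:"
```

A production system would likely replace the overlap score with vector similarity and pass the prompt to the model, but the route-then-retrieve shape stays the same.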
Integration of AlphaMissense Pathogenicity Predictions into Genome Nexus and cBioPortal
GSoC Contributor: Ivy Zou
AlphaMissense is an AI model developed by Google DeepMind that predicts the pathogenicity of missense variants. This model offers highly accurate predictions by classifying these variants as either benign or pathogenic, which is crucial for understanding genetic diseases. The project aims to integrate AlphaMissense data into the Genome Nexus API to programmatically provide precise pathogenicity predictions for missense mutations, and to display them on the cBioPortal and Genome Nexus websites for on-site analysis. This enhances cBioPortal’s ability to provide actionable insights into cancer genomics.
Visualize OncoKB Annotation and Patient Report Generation
GSoC Contributor: Aishika Nandi
OncoKB provides an endpoint for programmatic annotation of genomic data; however, this data may not be easily digestible as it is in JSON format. Aishika developed a module for visualizing OncoKB annotations, which takes the data returned from OncoKB endpoints as an input. Soon, this package will be available on NPM, and users will be able to easily leverage this interface. This package will also be incorporated into the cBioPortal and OncoKB ecosystems in the near future.
Create Pipeline/Interface to Prioritize Variants for OncoKB Curation
GSoC Contributor: Yameng Ge
OncoKB’s curation team needs to analyze gigabytes of real patient genomic data, which is available in cBioPortal. The curation team has limited resources and needs to prioritize analyzing specific genomic variants that will most likely enhance our understanding of targeted therapy options in oncology. In this project, Yameng created a new webpage in OncoKB’s curation platform that shows curators statistical information about the frequencies of variants in specific tumor types. Ideally, if a specific variant occurs at high frequency in a tumor type, the chances of our curators finding something potentially important to share on OncoKB’s website are much higher.
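As a small illustration of the kind of statistic described above, the following Python sketch computes, for each tumor type, the fraction of samples carrying each variant. It is a simplified assumption of how such frequencies could be derived and not the OncoKB curation platform's code; the record layout and the function name are hypothetical.

```python
from collections import defaultdict


def variant_frequencies(records):
    """Compute per-tumor-type variant frequencies.

    records: iterable of (sample_id, tumor_type, variant) tuples, where
    variant may be None for a sample with no reported variant.
    Returns {tumor_type: {variant: fraction_of_samples}}.
    """
    samples_by_type = defaultdict(set)
    samples_by_variant = defaultdict(set)

    for sample_id, tumor_type, variant in records:
        # Every sample counts toward its tumor type's denominator.
        samples_by_type[tumor_type].add(sample_id)
        if variant is not None:
            samples_by_variant[(tumor_type, variant)].add(sample_id)

    freqs = defaultdict(dict)
    for (tumor_type, variant), samples in samples_by_variant.items():
        total = len(samples_by_type[tumor_type])
        freqs[tumor_type][variant] = len(samples) / total
    return dict(freqs)
```

Counting distinct samples rather than rows keeps a sample that reports the same variant twice from inflating the frequency.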
Improve Navigability of the HTAN Data Standards
GSoC Contributor: Ankita Sahu
This project enhances the Data Standards Page for the Human Tumor Atlas Network (HTAN), a cBioPortal-adjacent project. HTAN is a collaborative NCI-funded initiative dedicated to mapping the cellular complexities of human cancers to improve diagnosis and treatment. The Data Standards Page helps submitters and reusers of HTAN data understand the underlying data model. It empowers researchers to effectively leverage HTAN's rich public datasets, ultimately advancing our comprehension of cancer biology and facilitating the development of targeted therapies. Ankita implemented improvements that make it easier to find specific attributes across a variety of data modalities.
Code Link (Partially in production)
To learn more about cBioPortal, review our documentation, which includes FAQs, webinars, and tutorials. If you have questions, please don't hesitate to reach out to us at cbiop...@googlegroups.com.
Stay tuned for more updates, and thank you for your continued support in advancing cancer genomics research with cBioPortal.