Polo 6n Wiring Diagram


Jeanett Fite

Aug 3, 2024, 6:00:12 PM
to mobapepa

Robustly predicting outcome for cancer patients from gene expression is an important challenge on the road to better personalized treatment. Network-based outcome predictors (NOPs), which consider the cellular wiring diagram in the classification, hold much promise to improve performance, stability and interpretability of identified marker genes. Problematically, reports on the efficacy of NOPs are conflicting and, for instance, suggest that utilizing random networks performs on par with networks that describe biologically relevant interactions. In this paper we turn the prediction problem around: instead of using a given biological network in the NOP, we aim to identify the network of genes that truly improves outcome prediction. To this end, we propose SyNet, a gene network constructed ab initio from synergistic gene pairs derived from survival-labelled gene expression data. To obtain SyNet, we evaluate synergy for all 69 million pairwise combinations of genes, resulting in a network that is specific to the dataset and phenotype under study and can be used in a NOP model. We evaluated SyNet and 11 other networks on a compendium dataset of >4000 survival-labelled breast cancer samples. For this purpose, we used cross-study validation, which more closely emulates real-world application of these outcome predictors. We find that SyNet is the only network that truly improves performance, stability and interpretability in several existing NOPs. We show that SyNet overlaps significantly with existing gene networks, and can be confidently predicted (85% AUC) from graph-topological descriptions of these networks, in particular the breast tissue-specific network. Due to its data-driven nature, SyNet is not biased to well-studied genes and thus facilitates post-hoc interpretation. We find that SyNet is highly enriched for known breast cancer genes and genes related to e.g. histological grade and tamoxifen resistance, suggestive of a role in determining breast cancer outcome.

Cancer is caused by disrupted activity of several pathways. Therefore, to predict cancer patient prognosis from gene expression profiles, it may be beneficial to consider the cellular interactome (e.g. the protein interaction network). These so-called Network-based Outcome Predictors (NOPs) hold the potential to facilitate identification of dysregulated pathways and deliver improved prognosis. Nonetheless, recent studies revealed that, compared to classical models, neither performance nor consistency (in terms of identified markers across independent studies) can be improved using NOPs. In this work, we argue that NOPs can only perform well when supplied with suitable networks. The commonly used networks may miss associations, especially for under-studied genes. Additionally, these networks are often generic, with low coverage of perturbations that arise in cancer. To address this issue, we exploit 4100 samples and infer a disease-specific network called SyNet, linking synergistic gene pairs that collectively show predictivity beyond the individual performance of genes. Using a thorough cross-validation, we show that a NOP yields superior performance and that this performance gain is the result of the wiring of genes in SyNet. Due to the simplicity of our approach, this framework can be used for any phenotype of interest. Our findings confirm the value of network-based models and the crucial role of the interactome in improving outcome prediction.

Copyright: 2019 Allahyar et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The normalized and batch-effect-removed cohort can be downloaded from SyNet.deRidderLab.nl. This collection excludes METABRIC expressions. METABRIC requires access approval through synapse.org, which is implemented by the original data collectors to protect the privacy and confidentiality of participants in this study. Additional clinical variables for samples collected in this study can be downloaded from S1 Table. The full matrix of SyNet in binary format as well as top gene pairs (including their tri-score used to calculate their fitness) is available for download in tab-delimited format from SyNet.deRidderLab.nl. Moreover, all scripts used for preparation of data and figures in this manuscript are available for download from github.com/UMCUGenetics/SyNet. To ensure the complete reproducibility of our results, the indices utilized for training and testing of all models (including inner and outer cross-validations) are also available for download through the Mendeley Data repository.

Funding: Part of the computations required for this work was carried out on the Dutch national e-infrastructure (e-infra160001) with the support of the SURF Foundation. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Metastasis at distant sites (e.g. in bone, lung, liver and brain) is the major cause of death in breast cancer patients [1]. However, it is currently difficult to assess tumor progression in these patients using common clinical variables (e.g. tumor size, lymph-node status, etc.) [2]. Therefore, for 80% of these patients, chemotherapy is prescribed [3]. Meanwhile, randomized clinical trials showed that at least 40% of these patients survive without chemotherapy and thus unnecessarily suffer from the toxic side effects of this treatment [3, 4]. For this reason, substantial efforts have been made to derive molecular classifiers that can predict clinical outcome based on gene expression profiles obtained from the primary tumor at the time of diagnosis [5, 6].

Several reasons for this lack of consistency have been proposed, including small sample size [11, 13, 14], inherent measurement noise [15] and batch effects [16, 17]. Apart from these technical explanations, it is recognized that traditional models ignore the fact that genes are organized in pathways [18]. One important cancer hallmark is that perturbation of these pathways may be caused by deregulation of disparate sets of genes which in turn complicates marker gene discovery [19, 20].

In recent years, a range of improvements to the original NOP formulation has been proposed. In the prediction step, various linear and nonlinear classifiers have been evaluated [26, 27]. Problematically, the reported accuracies are often an overestimation, as many studies neglected to use a cross-study evaluation scheme, which more closely resembles the real-world application of these models [7]. Also for the aggregation step, which is responsible for forming meta-genes from gene sets, several distinct approaches have been proposed, such as clustering [23] and greedy expansion of seed genes into subnetworks [18]. Moreover, in addition to simple averaging, alternative means by which genes can be aggregated, such as linear or nonlinear embeddings, have been proposed [17, 28]. Most recent work combines these steps into a unified model [8, 29]. Recent efforts that extend these concepts to sequencing data by exploiting the concept of cancer hallmark networks have also been proposed [30].
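The simple-averaging form of the aggregation step mentioned above can be sketched as follows. This is only an illustration: the function name `meta_gene`, the dictionary layout of the expression data and the toy values are assumptions, not taken from the paper.

```python
def meta_gene(expr, gene_set):
    """Form a meta-gene by averaging, per sample, the expression of a gene set.

    expr maps a gene name to a list of expression values, one per sample
    (hypothetical layout chosen for this sketch).
    """
    rows = [expr[g] for g in gene_set]
    # zip(*rows) walks the samples; each tuple holds one value per gene.
    return [sum(values) / len(values) for values in zip(*rows)]


# Toy data: two genes measured in three samples.
toy_expr = {"geneA": [1.0, 2.0, 3.0], "geneB": [3.0, 4.0, 5.0]}
toy_meta = meta_gene(toy_expr, ["geneA", "geneB"])
```

The resulting meta-gene values would then be passed to whichever classifier is used in the prediction step.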

Despite these efforts and initial positive findings, there is still much debate over the utility of NOPs compared to classical methods, with several studies showing no performance improvement [21, 31, 32]. Perhaps even more striking is the finding that utilizing a permuted network [32] or aggregating random genes [10] performs on par with networks describing true biological relationships. Several meta-analyses attempting to establish the utility of NOPs have appeared with contradicting conclusions. Notably, Staiger et al. compared the performance of the nearest mean classifier [33] in this setting and concluded that network-derived meta-genes are not more predictive than individual genes [21, 32]. This is in contradiction to Roy et al., who achieved improvements in outcome prediction when genes were ranked according to their t-test statistics compared to their PageRank property [34] in a PPI network [28, 35]. It is thus still an open question whether NOPs truly improve outcome prediction in terms of predictive performance, cross-study robustness or interpretability of the gene signatures.

In this work, we propose to construct a network ab initio that is specifically designed to improve outcome prediction in terms of cross-study generalization and performance stability. To achieve this, we will effectively turn the problem around: instead of using a given biological network, we aim to use the labelled gene expression datasets to identify the network of genes that truly improves outcome prediction (see Fig 1 for a schematic overview).

Our approach relies on the identification of synergistic gene pairs, i.e. genes whose joint prediction power is beyond what is attainable by either gene individually [49]. To identify these pairs, we employed grid computing to evaluate all 69 million pairwise combinations of genes. The resulting network, called SyNet, is specific to the dataset and phenotype under study and can be used to infer a NOP model with improved performance.
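To make the notion of synergy concrete, here is a minimal sketch on toy data. It assumes one plausible definition (AUC of a pair model divided by the best single-gene AUC), a hypothetical pair model that simply averages the two genes, and that higher expression indicates poor outcome; the excerpt above does not specify the exact score or model used for SyNet.

```python
from itertools import combinations


def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def synergy(expr_i, expr_j, labels):
    """Pair AUC relative to the better of the two single-gene AUCs."""
    # Hypothetical pair model: the per-sample mean of both genes.
    pair = [(a + b) / 2 for a, b in zip(expr_i, expr_j)]
    best_single = max(auc(expr_i, labels), auc(expr_j, labels))
    return auc(pair, labels) / best_single


def all_pair_synergies(expr, labels):
    """Score every unordered gene pair; expr maps gene -> per-sample values."""
    return {(g, h): synergy(expr[g], expr[h], labels)
            for g, h in combinations(sorted(expr), 2)}


# Toy data: geneA and geneB carry the same signal with opposite noise,
# so their average separates the classes better than either gene alone.
toy_labels = [1, 1, 0, 0]
toy_expr = {
    "geneA": [3, -1, 2, -2],
    "geneB": [-1, 3, -2, 2],
    "geneC": [0, 0, 0, 0],
}
toy_scores = all_pair_synergies(toy_expr, toy_labels)
```

A value above 1 marks a pair that predicts better jointly than its best member. With n genes there are n(n-1)/2 such pairs, which is why the exhaustive evaluation described above needed grid computing.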

To obtain SyNet, and allow for rigorous cross-study validation, a dataset of substantial size is required. For this reason, we combined 14 publicly available datasets to form a compendium encompassing 4129 survival-labeled samples. To the best of our knowledge, the data combined in this study represent the largest breast cancer gene expression compendium to date. Further, to ensure unbiased evaluation, sample assignments in the inner as well as the outer cross-validation folds are kept equal across all assessments throughout the paper.
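One common way to set up cross-study validation is to hold out whole studies, so that no test sample comes from a study seen during training. The sketch below uses a leave-one-study-out split as an assumption; the exact fold scheme used in the paper is not described in this excerpt, and the name `leave_one_study_out` is hypothetical.

```python
def leave_one_study_out(study_ids):
    """Yield (held_out_study, train_indices, test_indices) per study.

    study_ids holds one study label per sample. Studies are visited in
    sorted order, so the folds are deterministic and can be computed once
    and reused for every model that is compared.
    """
    for study in sorted(set(study_ids)):
        train = [i for i, s in enumerate(study_ids) if s != study]
        test = [i for i, s in enumerate(study_ids) if s == study]
        yield study, train, test


# Toy compendium: six samples drawn from three studies.
toy_studies = ["A", "A", "B", "C", "C", "C"]
toy_folds = list(leave_one_study_out(toy_studies))
```

Storing the fold indices once and feeding the same indices to every predictor is a simple way to keep sample assignments equal across all assessments.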

In the remainder of this paper, we will demonstrate that integrating genes based on SyNet provides superior performance and stability of predictions when these models are tested on independent cohorts. In contrast to previous reports, where shuffled versions of networks also performed well, we show that the performance drops substantially when SyNet links are shuffled (while containing the same set of genes), suggesting that SyNet connections are truly informative. We further evaluate the content and structure of SyNet by overlaying it with known gene sets and existing networks, revealing marked enrichment for known breast cancer prognostic markers. While overlap with existing networks is highly significant, the majority of direct links in SyNet are absent from these networks, explaining the observed lack of performance when NOPs are guided by the phenotype-unaware networks. Interestingly, SyNet links can be reliably predicted from existing networks when more complex topological descriptors are employed. Taken together, our findings suggest that, compared to generic gene networks, phenotype-specific networks, which are derived directly from labeled data, can provide superior performance while at the same time revealing valuable insight into the etiology of breast cancer.
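A shuffled control network that keeps the same set of genes can be built in several ways. The sketch below permutes the gene labels, which preserves the gene set, the number of links and the degree distribution while changing which genes are wired together. This is one possible scheme and not necessarily the one used for SyNet; `shuffle_links` is a hypothetical name.

```python
import random


def shuffle_links(edges, seed=0):
    """Return a copy of the network with gene labels randomly permuted.

    edges is a list of (gene, gene) pairs. Because the relabelling is a
    one-to-one mapping, no self-loops or duplicate links are introduced.
    """
    genes = sorted({g for edge in edges for g in edge})
    permuted = genes[:]
    random.Random(seed).shuffle(permuted)  # seeded for reproducibility
    relabel = dict(zip(genes, permuted))
    return [(relabel[a], relabel[b]) for a, b in edges]


# Toy network: a hub gene linked to three others, plus one extra link.
toy_edges = [("g1", "g2"), ("g1", "g3"), ("g1", "g4"), ("g3", "g4")]
toy_shuffled = shuffle_links(toy_edges, seed=1)
```

Comparing a predictor guided by the original links with one guided by the shuffled links, on identical folds, indicates whether the wiring itself carries information.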
