Earlier detection of pancreatic ductal adenocarcinoma (PDAC) is key to improving patient outcomes, as it is mostly detected at advanced stages which are associated with poor survival. Developing non-invasive blood tests for early detection would be an important breakthrough.
The primary objective of the work presented here is to use a dataset that is prospectively collected, to quantify a set of cancer-associated proteins and construct multi-marker models with the capacity to predict PDAC years before diagnosis. The data used is part of a nested case-control study within the UK Collaborative Trial of Ovarian Cancer Screening and is comprised of 218 samples, collected from a total of 143 post-menopausal women who were diagnosed with pancreatic cancer within 70 months after sample collection, and 249 matched non-cancer controls. We develop a stacked ensemble modelling technique to achieve robustness in predictions and, therefore, improve performance in newly collected datasets.
The ensemble modelling strategy explored here considerably outperforms biomarker combinations cited in the literature. Further development in selecting classifiers that balance performance and heterogeneity should enhance the predictive capacity of the method.
Pancreatic cancers are most frequently detected at an advanced stage. This limits treatment options and contributes to the dismal survival rates currently recorded. The development of new tests that could improve detection of early-stage disease is fundamental to improving outcomes. Here, we use advanced data analysis techniques to devise an early detection test for pancreatic cancer. We use data on markers in the blood from people enrolled on a screening trial. Our test correctly identifies pancreatic cancer cases as positive 91% of the time up to 1 year prior to diagnosis, and 78% of the time up to 2 years prior to diagnosis. These results surpass previously reported tests and should encourage further evaluation of the test in different populations, to see whether it should be adopted in the clinic.
Here we apply the stacking approach due to its simplicity and computational efficiency. The use of multi-dataset and multi-platform integration in pancreatic cancer studies24 is essential for early detection and aligns at a fundamental level with the data analysis methodology applied in our work. We demonstrate that a stacked ensemble approach, which relies on a panel of 20 features including cancer-associated proteins and clinical covariates, outperforms state-of-the-art multi-biomarker combinations previously applied in pancreatic ductal adenocarcinoma early detection.
Diabetes status was collected from a UKCTOCS first follow-up questionnaire, from in-patient and out-patient Hospital Episode Statistics or from death certificates. The full dataset used in our work included only 44 individuals for whom diabetes type was determined. For the rest, only yes/no information was available with respect to diabetes. Among PDAC cases with diabetes mellitus (DM), 24 were non-insulin-dependent, 3 were classified as insulin-dependent, 3 had both an insulin-dependent and a non-insulin-dependent classification and were therefore inconclusive, 1 had a non-insulin-dependent and other specified DM, 1 had a non-insulin-dependent and unspecified DM, and 5 had only an unspecified DM classification. Regarding the controls, only 1 had insulin-dependent DM and 5 were classified as non-insulin-dependent. Due to the small size of the subset for which DM type was available and the incomplete nature of this information, we decided not to incorporate type into our analysis.
As an external validation cohort, we used the Accelerated Diagnosis of neuro Endocrine and Pancreatic TumourS (ADEPTS) study27 (UCL/UCLH Research Ethics Committee reference 06/Q0512/106), which is an early biomarker study aiming to detect pancreatic cancer in patients at a much earlier stage. The ADEPTS study, previously referred to as TRANSlational research in BILiary tract and pancreatic diseases (TRANSBIL) study, collected serum samples from adult patients who presented to University College London and the Royal Free London Hospitals between 2017 and 2019 with abdominal symptoms suggestive of hepatobiliary disorders and pancreatic cancer. For the purpose of this work, samples from patients with no underlying gastrointestinal disorders or samples from cases diagnosed with pancreatic cancer were used. The number of cases and controls selected for external validation of the PDAC signature developed in the UKCTOCS samples presented above can be seen in Table 2 (see also Supplementary Table 3), as well as other sample characteristics. In total, 17 PDAC cases and 17 controls were available for the work presented here. The controls from the ADEPTS study are the closest to the control population collected from UKCTOCS as they did not present underlying gastrointestinal pathology. The PDAC cases used here had been matched by age, gender and diabetes status. Information on hormone replacement therapy (HRT) use at randomization and oral contraceptive pill (OCP) use (ever) was not collected for the female participants. All patients gave written consent for the use of their samples for research purposes and data were anonymized. The samples were processed according to NIHR standards28 and diagnoses were confirmed by interrogating patient hospital electronic records at University College London and the Royal Free Hospitals.
The same assays, and the same Olink platform of biomarkers, were run on the subset of samples collected from the ADEPTS study as on the UKCTOCS samples. This ensured that the full biomarker signature developed in UKCTOCS samples could be validated in a different cohort.
Our main dataset is part of a nested case-control study within UKCTOCS25,26 and is comprised of 143 individuals with PDAC and 249 controls (see Table 1 and Supplementary Tables 1 and 2). Thirty-five of the PDAC-diagnosed patients provided longitudinal samples, ranging between 2 and 6 annual samples per individual collected prior to diagnosis, with an average of 1.53 samples per individual (see Table 1). Although 35 of the PDAC cases had longitudinal data, all samples were treated as independent, and no intra-individual correlation was imposed or explicitly modelled during data analysis in this instance. For the purpose of data analysis, we divided all samples, prior to any classifier development, into a training set (2/3) and a test set (1/3), stratifying by age quartile, HRT use at randomization, OCP use (ever), diabetes status (Yes/No), BMI quartile, PDAC or control status and sample single time-group, i.e., 0-1, 1-2, 2-3, 3-4 and 4+ years to diagnosis (YTD). A single time-group was attributed to each sample, determined by the time to diagnosis at sample collection (compare Table 1 with Supplementary Tables 1 and 2, see also Supplementary Table 12 for the total number of cases and controls per single time-group). The stratification outlined above enabled a clearer evaluation of PDAC classifier panel performance in samples not used in training, i.e., the test set, and ensured that the results are realistic and representative, and are not biased by data or information leakage29.
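The stratified 2/3 to 1/3 split described above can be illustrated with a minimal sketch. This is not the authors' code: the function name `stratified_split`, the record fields and the toy data are hypothetical, and only two of the stratification variables are shown. Samples are grouped by their combination of stratum values and each group is divided in the same proportion.

```python
import random
from collections import defaultdict


def stratified_split(samples, strata_keys, test_fraction=1 / 3, seed=0):
    """Split records into train/test so that every stratum keeps roughly
    the same train:test ratio. Hypothetical helper for illustration."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for sample in samples:
        # A stratum is the tuple of values taken by the stratification variables.
        groups[tuple(sample[key] for key in strata_keys)].append(sample)

    train, test = [], []
    for stratum in sorted(groups):
        members = groups[stratum]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Toy data (an assumption): 6 samples for every status / time-group combination.
samples = [
    {"id": i, "status": status, "time_group": group}
    for i, (status, group) in enumerate(
        (status, group)
        for status in ("PDAC", "control")
        for group in ("0-1", "1-2", "2-3", "3-4", "4+")
        for _ in range(6)
    )
]

train, test = stratified_split(samples, ["status", "time_group"])
print(len(train), len(test))
```

In this toy setting each of the 10 strata contributes 2 of its 6 samples to the test set. With real data, small strata produce rounding effects, so the overall split is only approximately 2/3 to 1/3.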
Receiver operating characteristic (ROC) curves were constructed for each model to assess diagnostic accuracy. The area under the curve (AUC) for the ROC curves was used as the performance metric during optimization. Models were selected based on their rank in the training set across cross-validation folds. ROC curves were generated with the pROC R package (version 1.18.0, -project.org/web/packages/pROC/index.html). 95% CIs for AUCs were determined by stratified bootstrapping. AUCs whose confidence intervals crossed 0.5 were deemed non-significant. Sensitivity, positive and negative predictive values and Matthews correlation coefficients at 90% specificity are also reported. Comparison of ROC curves was performed with a bootstrap test in pROC.
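To make the metrics above concrete, the following sketch computes an AUC, a percentile confidence interval from a stratified bootstrap (cases and controls resampled separately), and the sensitivity at 90% specificity. It is a plain-Python illustration and not the pROC implementation used in the study. The function names and the toy scores are assumptions, and pROC's exact thresholding and interval conventions may differ.

```python
import math
import random


def auc(case_scores, control_scores):
    """Mann-Whitney estimate of the AUC: the probability that a case scores
    higher than a control, with ties counted as one half."""
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def stratified_bootstrap_ci(case_scores, control_scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the AUC. Cases and controls are resampled separately,
    so every replicate keeps the original numbers of each class."""
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        cases = rng.choices(case_scores, k=len(case_scores))
        controls = rng.choices(control_scores, k=len(control_scores))
        replicates.append(auc(cases, controls))
    replicates.sort()
    lower = replicates[int((alpha / 2) * n_boot)]
    upper = replicates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


def sensitivity_at_specificity(case_scores, control_scores, specificity=0.90):
    """Use as threshold the control score at or below which at least the target
    fraction of controls falls, then report the fraction of cases above it."""
    ranked = sorted(control_scores)
    idx = math.ceil(specificity * len(ranked) - 1e-9) - 1
    threshold = ranked[idx]
    sensitivity = sum(c > threshold for c in case_scores) / len(case_scores)
    return sensitivity, threshold


# Toy predicted probabilities (assumed values, not study data).
cases = [0.42, 0.6, 0.7, 0.8, 0.9]
controls = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]

point_estimate = auc(cases, controls)
lower, upper = stratified_bootstrap_ci(cases, controls)
sensitivity, threshold = sensitivity_at_specificity(cases, controls)
print(point_estimate, (lower, upper), sensitivity, threshold)
```

Under the rule stated in the text, an AUC would be treated as non-significant whenever `lower <= 0.5 <= upper`.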
Base-learners were selected to cover a number of state-of-the-art methods and algorithmic families previously referenced in the literature19,22, from bagging and boosting to general linear models with in-built feature extraction. We favoured methods that would be able to capture different aspects of the data with efficient computational effort and that had, for the most part, typical hyperparameter ranges published in the literature22, some with applications in biology21. Due to the size of the datasets, we narrowed down the set of base-learners to 10. Further work on ensemble selection from libraries of models should help clarify whether other techniques provide additional value19 by testing performance against base-learner pool diversity21. The training of the base-learners was executed in two ways: by taking joined/combined time-group samples, i.e., collected 0-1, 0-2, 0-3, 0-4, 0-4+ YTD, or by training the set of base-learners on the samples of each single time-group, i.e., 0-1, 1-2, 2-3, 3-4, 4+ YTD. The first model forces the base-learners to learn specific and cross-time-group details together, whereas the second model creates specialized groups of base-learners per single time-group. We tested several stacking procedures, i.e., meta-learners: Bayesian Model Averaging (BMA) (version 3.18.17, -project.org/web/packages/BMA/index.html) with an underlying logistic regression model (BMA stack), averaging the probabilities attributed by each base-learner with an arithmetic mean (MEAN stack) or a geometric mean (GEOMEAN stack), or taking the maximum probability of being a case across all base-learners (MAX stack). The first class of models, trained on joined time-group samples, is referred to throughout this paper as Joined Time Group 2 Layer (JTG2L) (see Fig. 1 and Supplementary Figs. 24 and Supplementary Table 13 for the optimal hyperparameters found through a random selection of 1000 combinations for each base-learner).
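The three untrained meta-learners (MEAN, GEOMEAN and MAX stacks) reduce to simple aggregations of the base-learner probabilities for one sample. The sketch below illustrates them under assumed names and toy values. The BMA stack is not shown, because it requires fitting a model to the base-learner outputs.

```python
import math


def mean_stack(probabilities):
    """Arithmetic mean of the base-learner probabilities."""
    return sum(probabilities) / len(probabilities)


def geomean_stack(probabilities, eps=1e-12):
    """Geometric mean, computed in log space. Probabilities are clipped to
    eps (an assumption of this sketch) so that a zero does not break the log."""
    logs = [math.log(min(max(p, eps), 1.0)) for p in probabilities]
    return math.exp(sum(logs) / len(logs))


def max_stack(probabilities):
    """Largest probability of being a case across all base-learners."""
    return max(probabilities)


# Toy output of three hypothetical base-learners for a single sample.
base_probabilities = [0.2, 0.4, 0.8]

print(mean_stack(base_probabilities))
print(geomean_stack(base_probabilities))
print(max_stack(base_probabilities))
```

The geometric mean is pulled down by any low-probability vote, so it behaves more conservatively than the arithmetic mean, whereas the MAX stack flags a sample as soon as a single base-learner is confident.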
For the second model we tested a 2-layer and a 3-layer stacked model. The first, referred to as Single Time Group 2 Layer (STG2L) (Supplementary Fig. 25), took the base learners trained in each single time-group and applied the 4 stacks mentioned above, although to a larger stack input space. If, for example, we are training with samples belonging to every single time-group, i.e., 0-1, 1-2, 2-3, 3-4, 4+, the stack feature input space will have 10 times 5 dimensions; each base-learner is trained on each single time-group, giving 5 models per base-learner and a total of 50 base-models (Supplementary Fig. 25). Subsequently, the probability output from each base-learner model is concatenated and fed into the meta-learner. For the specific case of the STG2L protocol, we also tested an average neural network meta-learner model (AVNNET stack) trained on the concatenated probability matrix created from each base-learner probability output. The second, named Single Time Group 3 Layer (STG3L) (Supplementary Fig. 26), stacks twice and, therefore, has 3 layers. First it stacks the base-learners per single time-group with a BMA stack and, subsequently, stacks the result, a 5-dimension feature space of probabilities with either a BMA stack, a MEAN stack, a GEOMEAN stack or a MAX stack, if, for example, we are training with samples belonging to every single time-group (see Supplementary Fig. 26 for further details). Other combinations of time-groups were also tested, e.g., 0-1 plus 1-4, 0-2 plus 2-4, etc., but the stacked classifiers either underperformed or were not robust.
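The dimensionality of the stack inputs in the STG2L and STG3L designs can be illustrated as follows. This is a structural sketch only: the base models are hypothetical stubs rather than fitted classifiers, and an arithmetic mean stands in for the per-time-group BMA stack of the STG3L design, which is an assumption made to keep the example self-contained.

```python
TIME_GROUPS = ["0-1", "1-2", "2-3", "3-4", "4+"]
N_BASE_LEARNERS = 10


def make_stub_model(learner_idx, group_idx):
    """Hypothetical stand-in for a base-learner fitted on one single
    time-group: maps a feature vector to a probability in [0, 1]."""
    def predict(features):
        score = sum(features) / len(features)
        shift = 0.01 * learner_idx - 0.02 * group_idx
        return min(1.0, max(0.0, score + shift))
    return predict


# One model per (base-learner, single time-group) pair: 10 x 5 = 50 models.
models = {
    (learner, group): make_stub_model(learner, g_idx)
    for g_idx, group in enumerate(TIME_GROUPS)
    for learner in range(N_BASE_LEARNERS)
}


def stg2l_features(features):
    """STG2L: concatenate all 50 probabilities into one meta-learner input."""
    return [
        models[(learner, group)](features)
        for group in TIME_GROUPS
        for learner in range(N_BASE_LEARNERS)
    ]


def stg3l_features(features, inner_stack):
    """STG3L: first stack the 10 base-learners within each time-group,
    giving a 5-dimensional input for the final stack."""
    return [
        inner_stack([models[(learner, group)](features)
                     for learner in range(N_BASE_LEARNERS)])
        for group in TIME_GROUPS
    ]


def mean_of(probabilities):
    # Stand-in for the BMA stack used in the paper (assumption).
    return sum(probabilities) / len(probabilities)


sample = [0.3, 0.5, 0.7]  # toy feature vector
layer2_input = stg2l_features(sample)
layer3_input = stg3l_features(sample, mean_of)
print(len(layer2_input), len(layer3_input))
```

The STG2L meta-learner therefore sees 50 inputs, while the final STG3L stack sees 5, one per single time-group.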