Abstract:
Focus:
Near real-time analytics of disease and its spread at scale
Data-driven clinical decision tools
National health security
Individual health
MOSSAIC Project: https://www.olcf.ornl.gov/tag/mossaic/
Collaboration between ORNL and NIH
ADMIRRAL: https://datascience.cancer.gov/collaborations/nci-department-energy-collaborations/admirral
IMPROVE: https://datascience.cancer.gov/collaborations/nci-department-energy-collaborations/improve
ATOM: https://datascience.cancer.gov/collaborations/nci-department-energy-collaborations/atom
CANDLE: https://datascience.cancer.gov/collaborations/nci-department-energy-collaborations/candle
Real-world surveillance
Surveillance, Epidemiology, and End Results (SEER) Registries: https://seer.cancer.gov
850k/year cancer diagnoses collected
Many regional registries collecting data
Lots of manual extraction
2-year lag in reporting
MOSSAIC Challenge
Using AI to bring cancer surveillance to near-real time
>90% of cancers are histologically reported (pathology reports from tumor slide observations; reports run to thousands of words and require domain expertise to understand)
Can use ML models to read pathology reports and code them into tabular records:
Collect data from multiple SEER registries: 6 million pathology reports
Mostly text but starting a pilot on images
Combination of multiple models that extract different features and use different documents
Cancer categorization:
Malignancy
Phenotype
Pediatric cancer classification
System deployed at SEER sites, covering 48% of US population
Auto-extraction:
Model predicts its own confidence and presents uncertain predictions to human experts
Where confidence is very high, the model auto-codes the report. This covers 23-27% of pathology reports with >98% accuracy.
Collaboration with the Veterans Affairs (VA) registry to adapt the model to the VA's own data so that it makes out-of-sample predictions with good accuracy
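The auto-extraction step above (auto-code only when the model is very confident, otherwise hand the report to a human expert) can be sketched as a simple confidence threshold. This is a minimal illustration, not the deployed MOSSAIC system: the function name `route_reports`, the 0.95 threshold, and the example probabilities are all assumptions made for the sketch. In practice the threshold would presumably be tuned on held-out data until the auto-coded subset reaches the target accuracy.

```python
def route_reports(probabilities, threshold=0.95):
    """Split reports into auto-coded and human-review sets.

    probabilities: one list of class probabilities per report
    (hypothetical model output). threshold: hypothetical cutoff.
    """
    auto_coded = []    # (report index, predicted class) pairs
    needs_review = []  # report indices sent to human experts
    for idx, probs in enumerate(probabilities):
        # The predicted class is the one with the highest probability.
        label = max(range(len(probs)), key=probs.__getitem__)
        confidence = probs[label]
        if confidence >= threshold:
            auto_coded.append((idx, label))
        else:
            needs_review.append(idx)
    return auto_coded, needs_review


# Made-up probabilities for four reports and three classes.
probs = [
    [0.99, 0.005, 0.005],
    [0.60, 0.30, 0.10],
    [0.01, 0.985, 0.005],
    [0.40, 0.35, 0.25],
]
auto_coded, needs_review = route_reports(probs, threshold=0.95)
```

Raising the threshold shrinks the auto-coded share but raises its accuracy, which is the tradeoff behind coding only a fraction of reports automatically.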
Prevailing challenges:
Computational limitations
Hospitals produce 50 PB of data; 97% goes unused
DOE compute facilities provide ample compute power; the CITADEL secure facility allows them to work with health data: https://www.olcf.ornl.gov/tag/citadel/
Data complexity, regulatory hurdles
Data sources, types, schemas and quality are very heterogeneous
Active work on harmonized data models
Using the North American Association of Central Cancer Registries (NAACCR) data model
Others: Sentinel, PCORnet, i2b2, OMOP
Automatic Classification for Common Data Model:
BERT-based NLP model
Multimodal ensembles for identifying recurrent disease
Hard task since recurrence is not commonly tracked in registries
Integration of diverse social/environmental determinants of health
Distribution of diseases is biased in space and sub-populations
Especially true for rare diseases, where the sample size is small in any local dataset
Focus on privacy-preserving federated learning
Tradeoff between privacy and accuracy
Analysis of data within the individual registries can focus more on accuracy
Federation of results across registries must maintain privacy
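One common way to federate results without moving patient records is federated averaging: each registry trains locally and shares only model parameters, which are combined in proportion to local sample size. The notes do not say which federated algorithm is used, so the sketch below is an assumption for illustration; `federated_average` and the example numbers are hypothetical.

```python
def federated_average(updates):
    """Combine per-registry model weights into global weights.

    updates: list of (weights, n_samples) pairs, one per registry.
    Only weights and sample counts leave a registry; raw records
    stay local.
    """
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    avg = [0.0] * dim
    for weights, n in updates:
        for i, w in enumerate(weights):
            # Each registry contributes in proportion to its sample size.
            avg[i] += w * n / total
    return avg


# Two hypothetical registries with made-up weights and sample counts.
registry_updates = [
    ([1.0, 2.0], 100),
    ([3.0, 4.0], 300),
]
global_weights = federated_average(registry_updates)
```

Averaging alone does not guarantee privacy, since shared parameters can still leak information. Adding noise to the updates or using secure aggregation strengthens privacy, typically at some cost in accuracy, which is the tradeoff noted above.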
Near real-time analysis
Many risk factors for disease are socio-environmental
Need to understand these drivers
SEER Residential History Data
LexisNexis data on where individuals lived from 1995 to 2020
Can be linked to individuals' pollution exposure
Requires collaboration across many different domains of experts
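Linking residential history to pollution exposure, as described above, amounts to joining each person's residence intervals against area-level measurements by year. The sketch below assumes a hypothetical layout (residences as area, start year, end year; pollution keyed by area and year) and made-up values. It does not reflect the actual SEER or LexisNexis schemas.

```python
def average_exposure(residences, pollution):
    """Average annual pollution over the years a person lived somewhere.

    residences: list of (area, start_year, end_year), inclusive.
    pollution: dict mapping (area, year) to a measured level.
    Returns None if no year has a measurement.
    """
    total = 0.0
    years = 0
    for area, start, end in residences:
        for year in range(start, end + 1):
            value = pollution.get((area, year))
            if value is None:
                continue  # skip years with no measurement
            total += value
            years += 1
    return total / years if years else None


# Hypothetical person who moved from area "A" to area "B".
residences = [("A", 1995, 1996), ("B", 1997, 1998)]
pollution = {
    ("A", 1995): 10.0,
    ("A", 1996): 12.0,
    ("B", 1997): 8.0,
    ("B", 1998): 6.0,
}
exposure = average_exposure(residences, pollution)
```

Skipping unmeasured years is one design choice among several; imputing from neighboring years or areas would be an alternative.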
Medium-range goals
EHRLICH: surveillance for biopreparedness
Agent-based models, parameterized by live data feeds, expansion of FrESCO data harmonization/ingest infrastructure
Improved text information retrieval models via neural attention mechanisms
C-HER: centralized repository for environmental determinants of health data
Integrating 73 datasets
Integrated Health Security Surveillance Response Tools
Dual-purpose tools for precision medicine AND population health
Data management, early warning, etc.