Summary
Focus: modeling complex systems of chemical reactions
Goals:
Understand and design chemical systems
This requires accurate predictive modeling that captures
The key performance properties of chemicals
The steps needed to synthesize them
The degradation of chemicals over time (e.g. in storage)
Key piece: prediction of reactions and reaction rates
RMS: expert system for chemical modeling
Given a mixture of chemicals and conditions: predict how it will evolve over time
Attached to chemical solvers
Challenge: limited accuracy of chemical property predictions
Chemprop: Machine-learned model of chemical properties
Prediction tasks
Chemical synthesis (ASKCOS)
Rates
Products
Construct kinetic models and solve them for products and rates
Reaction Mechanism Generator (RMG)
Validated against experimental datasets
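A minimal sketch of what "construct a kinetic model and solve it" means in practice, assuming a hypothetical two-step mechanism A -> B -> C with made-up rate constants. This is not RMG output or RMG's solver; it only shows the idea of turning reactions into rate equations and integrating them in time.

```python
import math

# Hypothetical first-order rate constants (1/s); illustrative values only.
K1 = 2.0  # A -> B
K2 = 0.5  # B -> C

def rates(conc):
    """Time derivatives of [A], [B], [C] under mass-action kinetics."""
    a, b, _ = conc
    return (-K1 * a, K1 * a - K2 * b, K2 * b)

def rk4_step(conc, dt):
    """Advance the concentrations by one classical Runge-Kutta (RK4) step."""
    k1 = rates(conc)
    k2 = rates(tuple(x + 0.5 * dt * k for x, k in zip(conc, k1)))
    k3 = rates(tuple(x + 0.5 * dt * k for x, k in zip(conc, k2)))
    k4 = rates(tuple(x + dt * k for x, k in zip(conc, k3)))
    return tuple(
        x + dt * (p + 2.0 * q + 2.0 * r + s) / 6.0
        for x, p, q, r, s in zip(conc, k1, k2, k3, k4)
    )

def simulate(conc0, t_end, dt=1e-3):
    """Integrate from t = 0 to t_end with a fixed step and return the final state."""
    conc = tuple(conc0)
    for _ in range(int(round(t_end / dt))):
        conc = rk4_step(conc, dt)
    return conc

final = simulate((1.0, 0.0, 0.0), t_end=2.0)
print("final [A], [B], [C]:", final)
```

Real mechanisms are usually stiff and have hundreds or thousands of species, so production codes use implicit solvers rather than a fixed-step explicit scheme like this one.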
Chemical engineering
Lab experiments (small synthesis unit, “cheap”)
Pilot plant (close in size to full plant, expensive)
Fit simple extrapolation model to predict how a full factory will work
Predict behavior of factory, then build it
Problem: often these designs don’t actually work, so this flow is risky and expensive
Goal: skip the pilot plant and predict from lab experiment to factory
Challenge: model accuracy (numerical error, accounting for all reactions)
Rate predictions most vulnerable
Chemistry is a big problem: many molecules and reactions
Total number of possible molecules is ~10^20, number of interactions is > 10^40
Only a tiny fraction of these have been evaluated
PubChem database contains 10^8 molecules
10^7 molecules have actually been studied
10^7 reactions have been studied
Want to predict molecule properties before committing to expensive experiments
Challenge:
Data on molecule properties and reactions is sparse
Hard to access data from papers (publisher restrictions and unstructured data extraction challenge)
~10^4 molecules have detailed datasets
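To make the sparsity concrete, the order-of-magnitude counts above (read as powers of ten) can be turned into coverage fractions. This is plain arithmetic on the quoted figures; the variable and function names are just for illustration.

```python
import math

# Order-of-magnitude counts quoted in the notes.
possible_molecules = 1e20
pubchem_molecules = 1e8
studied_molecules = 1e7
detailed_datasets = 1e4

def coverage(count, total=possible_molecules):
    """Fraction of the estimated chemical space covered by `count` molecules."""
    return count / total

for label, count in [
    ("in PubChem", pubchem_molecules),
    ("actually studied", studied_molecules),
    ("with detailed datasets", detailed_datasets),
]:
    exponent = math.log10(coverage(count))
    print(f"{label}: about 10^{exponent:.0f} of possible molecules")
```

Even the largest database covers roughly one part in 10^12 of the estimated space, which is why property prediction has to extrapolate.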
Traditional approach:
Infer based on related molecules
Only works for molecule types with a lot of data
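One common form of "infer based on related molecules" is group additivity, where a property is estimated as a sum of contributions from molecular fragments. The notes do not say which scheme was meant, so this is a sketch under that assumption; the group labels follow the usual convention, but the numeric values are rounded placeholders, not fitted data.

```python
# Hypothetical group contributions to an enthalpy of formation (kcal/mol).
# Placeholder values for illustration only.
GROUP_VALUES = {
    "C-(C)(H)3": -10.0,   # methyl carbon bonded to one other carbon
    "C-(C)2(H)2": -5.0,   # methylene carbon inside a chain
}

def estimate_property(group_counts):
    """Sum group contributions; fail loudly if any group has no data."""
    missing = [g for g in group_counts if g not in GROUP_VALUES]
    if missing:
        raise KeyError(f"no data for groups: {missing}")
    return sum(GROUP_VALUES[g] * n for g, n in group_counts.items())

# Propane: two methyl groups and one methylene group.
propane = {"C-(C)(H)3": 2, "C-(C)2(H)2": 1}
print("propane estimate (kcal/mol):", estimate_property(propane))
```

The `KeyError` path is the limitation noted above: a molecule containing a fragment that was never measured cannot be estimated at all.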
ML-based alternative: learned fingerprint
Chemprop: Graph neural network
Learned to extend this to pairs of molecules (solvent and solute)
Challenging to get enough data to train a Chemprop model
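A rough sketch of how a graph neural network turns a molecule into a fingerprint. Chemprop itself uses a directed message-passing network with learned weight matrices; the version below is a simplified atom-level scheme with fixed scalar weights, and every name in it is hypothetical rather than part of Chemprop's API.

```python
# One-hot atom features for the two elements used in this toy example.
ATOM_FEATURES = {"C": [1.0, 0.0], "O": [0.0, 1.0]}

def message_passing_fingerprint(atoms, bonds, steps=2,
                                self_weight=0.5, neighbor_weight=0.5):
    """Return a fixed-length vector for a molecular graph (heavy atoms only)."""
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    hidden = [list(ATOM_FEATURES[a]) for a in atoms]
    dim = len(hidden[0])
    for _ in range(steps):
        updated = []
        for i, h in enumerate(hidden):
            # Message: sum of the neighbors' current hidden states.
            msg = [sum(hidden[j][k] for j in neighbors[i]) for k in range(dim)]
            # Update: weighted mix of own state and message, then ReLU.
            updated.append([
                max(0.0, self_weight * h[k] + neighbor_weight * msg[k])
                for k in range(dim)
            ])
        hidden = updated

    # Readout: summing over atoms makes the result independent of atom order.
    return [sum(h[k] for h in hidden) for k in range(dim)]

ethanol = message_passing_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
dimethyl_ether = message_passing_fingerprint(["C", "O", "C"], [(0, 1), (1, 2)])
print("ethanol:", ethanol)
print("dimethyl ether:", dimethyl_ether)
```

The two isomers have the same atoms but different connectivity, and the fingerprints differ. In a trained model the weights are fitted so that the fingerprint is predictive of the target property.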
Physical fundamentals-driven ML:
Use underlying physical constraints to guide modeling
E.g. Schrödinger's equation, statistical mechanics and rate theory
Challenge: very computationally intensive, so you either run out of compute or rely on numerical approximations that reduce accuracy
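As an illustration of why rate predictions are so sensitive to accuracy, the Eyring equation from transition-state theory, k = (k_B T / h) exp(-ΔG‡ / RT), puts the barrier in an exponent. The sketch below uses standard physical constants; the barrier values are arbitrary examples, not numbers from the talk.

```python
import math

K_B = 1.380649e-23      # Boltzmann constant, J/K
H = 6.62607015e-34      # Planck constant, J*s
R = 8.314462618         # gas constant, J/(mol*K)
J_PER_KCAL = 4184.0

def eyring_rate(barrier_kcal_per_mol, temperature_k):
    """First-order rate constant (1/s) from a free-energy barrier."""
    exponent = -barrier_kcal_per_mol * J_PER_KCAL / (R * temperature_k)
    return (K_B * temperature_k / H) * math.exp(exponent)

def rate_error_factor(barrier_error_kcal_per_mol, temperature_k):
    """Multiplicative error in the rate caused by an error in the barrier."""
    return math.exp(barrier_error_kcal_per_mol * J_PER_KCAL / (R * temperature_k))

for temperature in (298.15, 1000.0):
    factor = rate_error_factor(1.0, temperature)
    print(f"1 kcal/mol barrier error at {temperature} K -> rate off by x{factor:.2f}")
```

At room temperature a 1 kcal/mol error in the barrier changes the rate by roughly a factor of five, so approximations that look small in energy terms can dominate the error of a kinetic model.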
Synthesis:
Use quantum chemistry calculations (e.g. COSMO-RS) to create a training dataset for an ML model
Use model to tweak parameters for higher-level simulations (e.g. solvation)
Model error eventually falls to the error level of the experimental data; beyond that point the remaining error reflects noise in the data
Applied approach to multiple types of properties: thermochemistry, solvation energies, reaction barriers, spectra
Can use ML to accelerate conventional first-principles (Quantum Chemistry) calculations
Guess 3D geometries of different molecules to estimate their potential space
Feedback training: used successes in model to create larger dataset, which improved model predictions
Large HPC-heavy workflow
Developed heuristic estimators to detect a model that is not converging and stop it early
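The notes do not describe the heuristics themselves. As one hypothetical example of the idea, a monitor can watch a convergence residual (for instance a gradient norm) and flag a run as stalled when recent iterations have not improved on the best earlier value. The function name, thresholds, and labels below are all assumptions.

```python
def classify_run(residuals, tol=1e-6, window=5, min_relative_gain=0.1):
    """Label an iterative calculation as 'converged', 'stalled', or 'running'.

    residuals: convergence residual recorded at each iteration so far.
    A run is 'stalled' when the best residual in the last `window`
    iterations is not at least `min_relative_gain` better than the best
    residual seen before that window.
    """
    if residuals and residuals[-1] < tol:
        return "converged"
    if len(residuals) <= window:
        return "running"  # not enough history to judge yet
    best_before = min(residuals[:-window])
    best_recent = min(residuals[-window:])
    if best_recent > (1.0 - min_relative_gain) * best_before:
        return "stalled"
    return "running"

healthy = [2.0 ** -i for i in range(8)]
oscillating = [1.0, 0.8, 0.9, 0.85, 0.95, 0.8, 0.9, 0.85]
print(classify_run(healthy), classify_run(oscillating))
```

Stopping stalled jobs early frees HPC time for calculations that are likely to finish, at the cost of occasionally abandoning a slow run that would have converged.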
ML model is more accurate than Density Functional Theory models but not as accurate as models that account for spatial distribution of electrons (very computationally expensive)
Can use ML trained on quantum chemistry calculations to predict solvent effects on reaction rates; good accuracy using only simulation data, no experimental data
Workflow
RMG - Reaction Mechanism Generator (https://rmg.mit.edu/): Database of molecular properties
Use parameters from RMG as inputs for dynamic reaction simulations
Get predictions of chemical parameters, feed back into database
Validate against experiments
Example application: high-temperature pyrolysis
Convert natural gas + waste into ethylene and acetylene
RMG built a kinetic model to predict 71 pilot plant experiments on 12 different feed compositions; 664 chemical species, 8121 reactions
Pure prediction, except for the free parameter: heat loss through reactor walls
Predictions were very close to experiment but not perfect; much cheaper than building the pilot plant
Sponsor used simulations to design changes to the process
Working towards
Heavier molecules: more dynamical complexity and more chemical species, so error is higher
Multiple phases of reactants: film on surfaces of materials, liquid flowing and vapor
Challenge: quantum chemistry for large molecules with complex 3D structures
Computationally expensive
Less experimental data
3D effects (e.g. folding) are highly variable and dynamic
Huge numbers of reactions
Some progress in this area
RMG works in many cases but many failure cases too
For >10k species direct ODE solves are too complex
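The notes do not say why direct solves break down. One plausible contributor, assuming an implicit stiff solver that stores and factorizes a dense Jacobian in double precision, is the quadratic growth in memory and cubic growth in factorization cost. The estimate below is back-of-the-envelope arithmetic under that assumption.

```python
def dense_jacobian_cost(n_species, bytes_per_entry=8):
    """Memory (bytes) and approximate LU flop count for a dense n x n Jacobian."""
    memory_bytes = n_species ** 2 * bytes_per_entry
    lu_flops = (2.0 / 3.0) * n_species ** 3
    return memory_bytes, lu_flops

for n in (664, 10_000):
    memory_bytes, lu_flops = dense_jacobian_cost(n)
    print(f"{n} species: {memory_bytes / 1e6:.1f} MB Jacobian, "
          f"{lu_flops:.2e} flops per factorization")
```

Going from the 664-species pyrolysis model to 10k species multiplies the per-factorization cost by more than three thousand, and a stiff integration needs many factorizations. Sparse Jacobians and model reduction are the usual ways to soften this scaling.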