Focus: accelerating the molecule/materials design cycle
Design: what molecule to make?
Make: how to make it?
Test
Chemical data sources
Need: chemical reaction space (how to make molecules)
Published literature: extensive but not accessible
Digital lab notebooks from experiments
Simulations (highly usable but limited to the types of reactions each model can support)
Patents (valuable but contain errors)
Daniel Lowe and Roger Sayle have text-mined reactions from patents
SMILES: linear representation of molecular graphs (tree hierarchy with back-edges to form cycles)
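To make the "tree with back-edges" picture concrete, here is a minimal sketch of a parser for a tiny SMILES subset (single-letter atoms plus Cl/Br, branches in parentheses, single-digit ring closures). It is an illustration only, not a real SMILES parser: bond orders are ignored, and bracket atoms, charges and '.' are not supported. The function name parse_simple_smiles is made up for this example.

```python
def parse_simple_smiles(smiles):
    """Parse a tiny SMILES subset into (atoms, tree_edges, ring_edges)."""
    atoms = []          # atom symbols, indexed by order of appearance
    tree_edges = []     # bonds along chains and branches (the tree part)
    ring_edges = []     # bonds closed by matching ring digits (the back-edges)
    open_rings = {}     # ring digit -> index of the atom that opened it
    branch_stack = []   # atoms to return to when a branch closes
    prev = None         # index of the atom the next atom will bond to
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch in "-=#":
            pass  # bond orders are ignored in this sketch
        elif ch.isdigit():
            if ch in open_rings:
                ring_edges.append((open_rings.pop(ch), prev))
            else:
                open_rings[ch] = prev
        elif ch.isalpha():
            symbol = ch
            if smiles[i:i + 2] in ("Cl", "Br"):
                symbol = smiles[i:i + 2]
                i += 1
            atoms.append(symbol)
            idx = len(atoms) - 1
            if prev is not None:
                tree_edges.append((prev, idx))
            prev = idx
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
        i += 1
    return atoms, tree_edges, ring_edges


# Benzene: a chain of six aromatic carbons plus one back-edge closing the ring.
atoms, tree_edges, ring_edges = parse_simple_smiles("c1ccccc1")
print(ring_edges)  # [(0, 5)]
```

For a branched, acyclic molecule such as CC(C)O the ring-edge list is empty and the branch shows up as two tree edges leaving the same atom.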
ML for reaction predictions:
MolecularTransformer can be used to map reaction precursors to products
https://github.com/pschwllr/MolecularTransformer
The best current approaches use transformers
Retrosynthesis:
Target molecule
Known/available building blocks
Design sequence of reactions to produce target molecule
Typically done by specifying reaction rules and searching over the space to reach the target
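The rule-based search described above can be sketched as a depth-limited recursive search. This is a toy under stated assumptions: molecules are plain labels, each rule maps a product to a tuple of precursors, and the names plan_route, rules and building_blocks are hypothetical. Real planners apply reaction templates to molecular graphs and use much smarter search.

```python
def plan_route(target, rules, building_blocks, max_depth=5):
    """Return a list of (product, precursors) steps, or None if no route exists."""
    if target in building_blocks:
        return []                      # already available, nothing to make
    if max_depth == 0:
        return None                    # give up on overly long routes
    for precursors in rules.get(target, []):
        route = []
        for precursor in precursors:
            sub_route = plan_route(precursor, rules, building_blocks, max_depth - 1)
            if sub_route is None:
                break                  # this rule fails, try the next one
            route.extend(sub_route)
        else:
            # every precursor is reachable, so this rule gives a valid last step
            return route + [(target, precursors)]
    return None


# Hypothetical rule set: each entry lists alternative ways to make a product.
rules = {
    "target": [("X", "Y"), ("A", "M")],
    "M": [("B", "C")],
}
building_blocks = {"A", "B", "C"}

print(plan_route("target", rules, building_blocks))
# [('M', ('B', 'C')), ('target', ('A', 'M'))]
```

The steps come back in forward order, so the list can be read as a recipe: first make M from B and C, then make the target from A and M.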
ML: RoboRXN
Multi-step synthesis planning
Molecular transformer for Retro and Forward steps
Transformer predicts entire recipe with all the actions (stir, filter, etc.) that one can give to a robotic platform
Use of general-purpose LLMs for chemical tasks (the above were specialized models)
Moving from encoder-decoder to decoder-only GPT models
Many computational chemistry tools are on GitHub, but they are hard to set up and use
Aim: bridge the gap between computational and experimental chemistry
Generic LLMs are bad at chemistry; ChemCrow extends them using chemical tools
LLM uses existing specialized tools to solve chemical problems
Example: automated synthesis
Plan and execute synthesis of an insect repellent
Find the chemical to synthesize
Generic name => SMILES => molecular graphs
Run reaction planner to get recipe
Execute recipe on robot
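The name-to-robot chain above can be sketched as a sequence of tool calls. Everything here is a stand-in: the tool functions, the TOOLS registry and run_pipeline are hypothetical and are not ChemCrow's API, the step order is hard-coded where an LLM would choose it, and the planner and robot are stubs. DEET is used only as a familiar insect repellent; the notes do not name the compound.

```python
def name_to_smiles(name):
    # Stub lookup table; a real tool would query a chemical database.
    table = {"DEET": "CCN(CC)C(=O)c1cccc(C)c1"}
    return table[name]


def plan_synthesis(smiles):
    # Stub planner: returns placeholder actions, not a real recipe.
    return [f"add precursors for {smiles}", "stir", "filter"]


def run_on_robot(actions):
    # Stub robot: just reports each action as completed.
    return [f"done: {action}" for action in actions]


TOOLS = {
    "name_to_smiles": name_to_smiles,
    "plan_synthesis": plan_synthesis,
    "run_on_robot": run_on_robot,
}


def run_pipeline(query, steps, tools):
    """Feed the output of each tool into the next and keep a trace."""
    value = query
    trace = []
    for step in steps:
        value = tools[step](value)
        trace.append((step, value))
    return value, trace


result, trace = run_pipeline(
    "DEET", ["name_to_smiles", "plan_synthesis", "run_on_robot"], TOOLS
)
for step, output in trace:
    print(step, "->", output)
```

Keeping the trace makes it easy to see which tool produced which intermediate result, which matters when a human has to check the plan before anything runs on hardware.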
Example: molecular discovery
Given experimental data that describes molecule’s properties
Use ChemCrow to discover the molecule consistent with the data
Example: Safety tools
Interact with the tool to ask about the dangers of using various chemicals and the likely outcomes of usage scenarios
Automated synthesis is not yet a solved problem
Supply chain/robotics challenges
Weak synthesis planning models
Real organic molecules are much more complex than current planning tools can handle
Bayesian optimization for reactions
Working to figure out the right granularity for describing molecules: one-hot, DFT
BoChemian: LLM embeddings of the text that describes reaction procedures
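As a point of reference for the coarsest description mentioned above, here is a minimal sketch of one-hot encoding a reaction's categorical choices into a feature vector that an optimizer could consume. The field names and reagent vocabularies are made-up examples, and the helper names one_hot and encode_reaction are hypothetical.

```python
def one_hot(choice, vocabulary):
    """Return a 0/1 list with a single 1 at the position of `choice`."""
    if choice not in vocabulary:
        raise ValueError(f"{choice!r} is not in the vocabulary")
    return [1 if item == choice else 0 for item in vocabulary]


def encode_reaction(conditions, vocabularies):
    """Concatenate one-hot vectors for each field, in sorted field order."""
    vector = []
    for field in sorted(vocabularies):
        vector.extend(one_hot(conditions[field], vocabularies[field]))
    return vector


# Illustrative vocabularies, not taken from the talk.
vocabularies = {
    "solvent": ["DMF", "THF"],
    "base": ["K2CO3", "Et3N", "none"],
}
conditions = {"solvent": "THF", "base": "Et3N"}

print(encode_reaction(conditions, vocabularies))  # [0, 1, 0, 0, 1]
```

A one-hot vector says only which reagent was used and nothing about how similar two reagents are, which is the motivation for richer descriptors such as DFT-derived features or text embeddings.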
Generative De Novo Molecule design
Distribution learning (Transfer learning)
Goal-directed learning (Reinforcement Learning)
Generation using a high-fidelity oracle
Oracle: high fidelity/cost simulation
Protein design algorithm can call oracle a limited number of times
Sample efficiency is critical: learn from few observations
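The limited-call setting can be made concrete with a small wrapper that counts oracle calls and refuses to go over budget. This is a sketch under assumptions: the class name BudgetedOracle is hypothetical, the scoring function is a cheap placeholder for an expensive simulation, and repeated queries are cached so they do not consume budget.

```python
class BudgetedOracle:
    """Wrap an expensive scoring function with a hard cap on unique calls."""

    def __init__(self, score_fn, budget):
        self.score_fn = score_fn
        self.budget = budget
        self.cache = {}  # molecule -> score, so repeats are free

    @property
    def calls_used(self):
        return len(self.cache)

    def __call__(self, molecule):
        if molecule in self.cache:
            return self.cache[molecule]
        if self.calls_used >= self.budget:
            raise RuntimeError("oracle budget exhausted")
        score = self.score_fn(molecule)
        self.cache[molecule] = score
        return score


# Placeholder scoring function: string length stands in for a costly simulation.
oracle = BudgetedOracle(score_fn=len, budget=2)
print(oracle("CCO"), oracle("CCO"), oracle.calls_used)  # 3 3 1
```

With a wrapper like this, methods can be compared fairly by how good their best molecule is once the same number of oracle calls has been spent.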
Approaches:
Augmented memory: combines data augmentation with experience replay
Saturn: sample-efficient de novo design
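The idea of combining experience replay with data augmentation, as in the Augmented memory item above, can be sketched with a small buffer that keeps the best-scoring molecules seen so far and replays augmented copies of them. This is not the published implementation: the class ReplayBuffer and the augment callable are hypothetical, and the augmentation is assumed to produce alternative encodings of the same molecule (for example alternative SMILES strings), here faked with a suffix.

```python
class ReplayBuffer:
    """Keep the top-scoring molecules and replay augmented copies of them."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.best = {}  # molecule -> best reward seen for it

    def add(self, molecule, reward):
        self.best[molecule] = max(reward, self.best.get(molecule, reward))
        if len(self.best) > self.capacity:
            worst = min(self.best, key=self.best.get)
            del self.best[worst]  # evict the lowest-reward entry

    def replay(self, augment, copies=2):
        """Return (augmented_molecule, reward) pairs, highest reward first."""
        batch = []
        ranked = sorted(self.best.items(), key=lambda kv: -kv[1])
        for molecule, reward in ranked:
            for k in range(copies):
                batch.append((augment(molecule, k), reward))
        return batch


buffer = ReplayBuffer(capacity=2)
for molecule, reward in [("A", 0.1), ("B", 0.9), ("C", 0.5)]:
    buffer.add(molecule, reward)

# Fake augmentation: tag each copy instead of rewriting the SMILES string.
print(buffer.replay(lambda molecule, k: f"{molecule}#{k}"))
# [('B#0', 0.9), ('B#1', 0.9), ('C#0', 0.5), ('C#1', 0.5)]
```

Each oracle-scored molecule is thus reused for several training updates, which is where the gain in sample efficiency would come from.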
Synthesizability constrained generation
TANGO: enforcing building blocks in synthesis routes
New reward function
Tanimoto similarity
Substructure match
Accelerates search for high-value molecules
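For reference, Tanimoto similarity between two fingerprints is the size of their intersection divided by the size of their union. The sketch below computes it on sets of "on" bit indices; it shows only the similarity measure, not the TANGO reward itself, and the fingerprints are made-up integers instead of real molecular fingerprints.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two collections of on-bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


# Two illustrative fingerprints sharing two of four distinct bits.
print(tanimoto({1, 2, 3}, {2, 3, 4}))  # 0.5
```

Because the score rises smoothly from 0 to 1 as a candidate gets closer to a reference, it can give a generator partial credit where a plain substructure match would only answer yes or no.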
FSscore: Chemist’s personalized feasibility score
Different chemists find it easier to synthesize different molecules
Can fine-tune model to align with chemist preferences
Can replace human expert with a make-on-demand molecule library