Summary
Today, big data and compute hardware power AI systems that enable novel biological discoveries
Focus: structure-based de novo design of protein-based therapeutics
Structure-based reasoning:
Efficient graph-based representation of 3D protein structure
Attention/transformer models
Transfer learning
Diffusion models
Sequence-based reasoning:
Protein language models
Fable Therapeutics: application of the lab’s research work
Graphs: natural way to represent molecular structure
Encode distance and bond relationships between atoms
Do not require discretization
Alternative: 3D coordinates/voxels, which are not rotation invariant (see the sketch below)
Superior approaches have since been developed (e.g. Fable-RE)
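A minimal sketch (illustrative, not from the talk) of why graphs sidestep the rotation problem: edges carry pairwise distances, which are unchanged by rotating or translating the coordinates.

```python
# Illustrative sketch: a residue-level graph whose edge features are pairwise
# distances. Distances are invariant to rotation/translation, unlike raw 3D
# coordinates or voxel grids.
import numpy as np

def build_protein_graph(coords: np.ndarray, cutoff: float = 8.0):
    """coords: (N, 3) C-alpha coordinates, one row per residue."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    src, dst = np.nonzero((dist < cutoff) & (dist > 0.0))   # edges within cutoff
    edges = np.stack([src, dst], axis=0)                    # (2, E) edge index
    edge_attr = dist[src, dst]                              # (E,) distance features
    return edges, edge_attr

# Rotating the structure leaves the graph unchanged:
coords = np.random.randn(50, 3) * 10
R = np.linalg.qr(np.random.randn(3, 3))[0]                  # random orthogonal matrix
_, a1 = build_protein_graph(coords)
_, a2 = build_protein_graph(coords @ R.T)
assert np.allclose(a1, a2)                                  # identical edge distances
```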
PROTEINSOLVER: inverse folding using graph neural networks (https://www.sciencedirect.com/science/article/pii/S2405471220303276)
Inverse folding: find a sequence that folds into a given 3D structure
Given a structural graph whose edges encode inter-residue geometry, plus a training set of such graphs
Infers the node labels (amino acid identities) consistent with that 3D structure (sketched below)
Newer methods: ESM-IF, ProteinMPNN (https://www.science.org/doi/10.1126/science.add2187)
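A hedged sketch of the inverse-folding setup in the spirit of ProteinSolver, heavily simplified: the sequence is unknown at input, message passing propagates structural constraints along the graph edges, and a linear head predicts per-residue amino acid logits. Layer sizes and message functions are illustrative assumptions, not the paper’s architecture.

```python
import torch
import torch.nn as nn

class InverseFoldingGNN(nn.Module):
    def __init__(self, hidden=64, n_aa=20, n_layers=3):
        super().__init__()
        self.node_init = nn.Parameter(torch.zeros(hidden))   # no sequence info at input
        self.edge_mlp = nn.Sequential(nn.Linear(2 * hidden + 1, hidden), nn.ReLU())
        self.node_mlp = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU())
        self.readout = nn.Linear(hidden, n_aa)               # logits over 20 amino acids
        self.n_layers = n_layers

    def forward(self, edges, edge_attr, n_nodes):
        src, dst = edges                                      # (2, E) as in the graph sketch
        h = self.node_init.expand(n_nodes, -1)
        for _ in range(self.n_layers):
            msg = self.edge_mlp(torch.cat([h[src], h[dst], edge_attr[:, None]], dim=-1))
            agg = torch.zeros_like(h).index_add(0, dst, msg)  # sum incoming messages
            h = self.node_mlp(torch.cat([h, agg], dim=-1))
        return self.readout(h)                                # (N, 20) per-residue logits

# e.g. with the numpy graph above:
# logits = InverseFoldingGNN()(torch.as_tensor(edges),
#                              torch.as_tensor(edge_attr, dtype=torch.float32), 50)
```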
PepNN: accurate predictor of peptide binding sites (https://www.nature.com/articles/s42003-022-03445-2)
Attention modules update the protein and peptide embeddings while enforcing symmetry (see the sketch after this list)
Final embedding layers predict the peptide-binding site
Transfer learning across scarce, related peptide-protein / antibody-antigen datasets
Uses protein fragments that are likely to bind to expand the dataset
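A hedged sketch of the symmetric (reciprocal) cross-attention idea: a single peptide-protein similarity matrix is used in both directions, so the two embeddings are updated symmetrically. Projection names and dimensions are illustrative assumptions, not PepNN’s exact module.

```python
import torch
import torch.nn as nn

class ReciprocalAttention(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.proj_pep = nn.Linear(dim, dim)
        self.proj_prot = nn.Linear(dim, dim)
        self.val_pep = nn.Linear(dim, dim)
        self.val_prot = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, pep, prot):
        # pep: (L_pep, dim) peptide embeddings; prot: (L_prot, dim) protein embeddings
        sim = self.proj_pep(pep) @ self.proj_prot(prot).T * self.scale  # one shared matrix
        pep_new = sim.softmax(dim=-1) @ self.val_prot(prot)    # peptide attends to protein
        prot_new = sim.T.softmax(dim=-1) @ self.val_pep(pep)   # protein attends to peptide
        return pep + pep_new, prot + prot_new                  # symmetric residual updates
```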
De novo protein design
Design a protein that satisfies desired properties
ProteinSGM: diffusion generative protein model
Protein represented as 2D matrices (inter-residue distances and angles) that are treated as images
Almost all generated backbones are designable and lead to real proteins
Many are novel
Can create a “protein Photoshop”, where certain key regions are specified and the rest of the protein is generated (inpainting; sketched below)
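A hedged sketch of the inpainting idea behind “protein Photoshop”: the residue-pair matrices are treated as image channels, and at each reverse-diffusion step the specified region is clamped to its known values while the rest is generated. `denoiser` is a placeholder network, not ProteinSGM’s actual model.

```python
import torch

def inpaint(denoiser, known, mask, n_steps=1000):
    """known: (C, N, N) target matrices; mask: 1 where fixed, 0 where generated."""
    x = torch.randn_like(known)                        # start from pure noise
    for t in reversed(range(n_steps)):
        x = denoiser(x, t)                             # one reverse-diffusion step
        # (a full implementation would clamp to a copy of `known` noised to
        # match step t; shown un-noised here for brevity)
        x = mask * known + (1 - mask) * x              # keep the specified region
    return x
```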
HelixGAN, HelixDiff: full atom peptide generation (https://pubmed.ncbi.nlm.nih.gov/36651657/)
Antibody-SGM: diffusion model for full antibody generation (https://icml-compbio.github.io/2023/papers/WCBICML2023_paper143.pdf)
Next frontier: reflecting dynamics in ML models
Proteins are flexible, adapt and move
Need methods that learn the conformational space
Physics models map out the energy landscape: minima, maxima, transition paths, etc.
Computationally very expensive to describe the entire landscape, even given the energy function
Boltzmann generators: use normalizing flows to sample from the molecule’s Boltzmann (energy-based) distribution
A single generator model covers the full distribution over the molecule’s conformations (see the sketch below)
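A hedged toy sketch of the Boltzmann-generator objective: a normalizing flow x = f(z) is trained by reverse KL, minimizing E_z[u(f(z)) - log|det df/dz|], so that pushing Gaussian noise through the flow approximates p(x) ∝ exp(-u(x)). The one-layer affine flow and double-well energy are toy stand-ins for a real flow (e.g. RealNVP) and a molecular force field.

```python
import torch
import torch.nn as nn

class AffineFlow(nn.Module):                  # toy invertible map x = z * exp(s) + t
    def __init__(self, dim):
        super().__init__()
        self.log_s = nn.Parameter(torch.zeros(dim))
        self.t = nn.Parameter(torch.zeros(dim))

    def forward(self, z):
        x = z * self.log_s.exp() + self.t
        log_det = self.log_s.sum()            # log|det df/dz| of the affine map
        return x, log_det

def energy(x):                                # toy double-well along dimension 0
    return (x[:, 0] ** 2 - 1.0) ** 2 + 0.5 * (x[:, 1:] ** 2).sum(dim=1)

flow = AffineFlow(dim=4)
opt = torch.optim.Adam(flow.parameters(), lr=1e-2)
for step in range(500):
    z = torch.randn(256, 4)
    x, log_det = flow(z)
    loss = (energy(x) - log_det).mean()       # reverse-KL training objective
    opt.zero_grad(); loss.backward(); opt.step()
# After training, flow(torch.randn(n, 4))[0] yields low-energy conformations.
```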
PepFlow:
Structural model that learns the structural components of the molecule conditioned on its sequence
Separate sub-models for backbone, sidechain heavy atoms, protonation
Generates the space of allowable conformations for the sequence (modular pipeline sketched below)
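A hedged sketch of the modular factorization described above: three conditional sub-models are chained to sample full-atom conformations from a sequence. The interfaces are placeholder assumptions, not PepFlow’s actual API.

```python
# Placeholder interfaces, not PepFlow's real API: each sub-model is assumed
# to expose a .sample(...) method over its structural component.
def sample_ensemble(backbone_model, sidechain_model, protonation_model,
                    sequence, n=100):
    ensemble = []
    for _ in range(n):
        bb = backbone_model.sample(sequence)          # p(backbone | sequence)
        sc = sidechain_model.sample(sequence, bb)     # p(sidechains | backbone, sequence)
        h = protonation_model.sample(bb, sc)          # p(protons | heavy atoms)
        ensemble.append((bb, sc, h))
    return ensemble    # samples covering the allowable conformational space
```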
Next stage: PepFlow++
Adapted PFGM++ architecture
Equiformer layers
Epigenetic editing
Zinc Finger proteins
Small proteins
Fully human proteins (no immune system reaction)
Can bind DNA, but adjacent binding sites interact, inducing a combinatorial explosion of the design space
Approach:
Used extensive experimental data to determine which zinc-finger structures are compatible
Trained a transformer-based language model to generate compatible finger sequences (see the sketch below)
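A hedged sketch of the approach: treat fingers as tokens and train an autoregressive transformer so each new finger is proposed conditioned on the fingers already placed, letting the model absorb inter-finger compatibility from the experimental data. The vocabulary, model sizes, and sampling loop are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FingerLM(nn.Module):
    def __init__(self, vocab=1000, dim=128, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                       # tokens: (B, T) finger IDs
        T = tokens.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.encoder(self.embed(tokens), mask=causal)
        return self.head(h)                          # next-finger logits

@torch.no_grad()
def design(model, start, n_fingers=6):
    tokens = torch.tensor([[start]])
    for _ in range(n_fingers - 1):
        logits = model(tokens)[:, -1]                # condition on fingers placed so far
        nxt = torch.distributions.Categorical(logits=logits).sample()
        tokens = torch.cat([tokens, nxt[:, None]], dim=1)
    return tokens                                    # one compatible finger arrangement
```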