Mutant Year Zero Mutations Guide


Hilary Laite

Aug 4, 2024, 9:20:33 PM

to fomrworkprofcount

Mutations in amino acid sequences can provoke changes in protein function. Accurate and unsupervised prediction of mutation effects is critical in biotechnology and biomedicine, but remains a fundamental challenge. To resolve this challenge, here we present Protein Mutational Effect Predictor (ProMEP), a general and multiple sequence alignment-free method that enables zero-shot prediction of mutation effects. A multimodal deep representation learning model embedded in ProMEP was developed to comprehensively learn both sequence and structure contexts from 160 million proteins. ProMEP achieves state-of-the-art performance in mutational effect prediction and accomplishes a tremendous improvement in speed, enabling efficient and intelligent protein engineering. Specifically, ProMEP accurately forecasts mutational consequences on the gene-editing enzymes TnpB and TadA, and successfully guides the development of high-performance gene-editing tools with their engineered variants. The gene-editing efficiency of a 5-site mutant of TnpB reaches up to 74.04% (vs 24.66% for the wild type); and the base editing tool developed on the basis of a TadA 15-site mutant (in addition to the A106V/D108N double mutation that confers deoxyadenosine deaminase activity on TadA) exhibits an A-to-G conversion frequency of up to 77.27% (vs 69.80% for ABE8e, a previous TadA-based adenine base editor) with significantly reduced bystander and off-target effects compared to ABE8e. ProMEP not only showcases superior performance in predicting mutational effects on proteins but also demonstrates a great capability to guide protein engineering. Therefore, ProMEP enables efficient exploration of the gigantic protein space and facilitates practical design of proteins, thereby advancing studies in biomedicine and synthetic biology.

Growing evidence from molecular evolution suggests that mutations in protein sequences are often associated with changes in protein function, which may lead to enzyme deficiencies,1 human diseases2 and viral escape.3,4 Deciphering the effects of mutations is thus important in many fields of biological sciences, particularly for the design of protein variants with enhanced or novel functions. Recent efforts have demonstrated that learning the effects of mutations aids in protein engineering and has the potential to overcome the challenges of directed evolution and rational protein design.5,6,7 By navigating the fitness landscape of the target protein and identifying a small set of advantageous mutations, mutation effect prediction could diminish the labor-intensive procedures stemming from multi-round random variation and screening,8 as well as reduce reliance on expert knowledge about protein structure and function during the rational design of proteins.9,10

Despite the importance, accurate modeling of mutation effects is a fundamental challenge due to the intricate interactions among numerous residues and the complex nature of mutational epistasis.11,12 Recent advances in high-throughput experimental technologies, such as deep mutational scanning (DMS),13 have led to significant improvements in the parallel assessment of mutations.14,15 However, due to the considerations of scale and costs, experimentally traversing the gigantic space of all possible protein sequences (\(\sum_{i=1}^{L}\left(\left(\prod_{j=1}^{i}\left(L+1-j\right)\times 19\right)/i!\right)\) for a protein of length L, where i is the number of mutations, i! is the factorial of i) remains unfeasible. Substantial efforts have previously been made to predict mutational effects. Traditional modeling approaches aim to approximate mutational effects using one or a small subset of protein properties. For instance, variations in the physicochemical properties of amino acids may be used to estimate mutation tolerance.16 Alignment-based methods leverage evolutionary properties by identifying conserved regions or mutational patterns within multiple sequence alignments (MSAs).17,18,19 Stability predictors primarily operate on the principle of protein folding energy to assess functional changes resulting from mutations.20 Supervised learning methods learn the mapping from sequences or structures to a specific protein property using annotated datasets.3,21,22 While these methods are undoubtedly useful for predicting mutation effects, their performance is contingent on the depth of MSAs, the availability of labeled datasets, or the type of proteins.
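To give a sense of the scale behind the sequence-space count quoted above, here is a minimal sketch in Python. It assumes the factor of 19 applies to every mutated site (19 alternative amino acids per position), so that the term for i mutations is the binomial coefficient C(L, i) multiplied by 19^i. The function name count_variants and its max_mutations parameter are illustrative and do not come from the paper.

```python
from math import comb
from typing import Optional


def count_variants(length: int, max_mutations: Optional[int] = None) -> int:
    """Count distinct variants of a protein carrying 1..max_mutations substitutions.

    Assumption: each mutated site can take any of the 19 other amino acids,
    so there are C(length, i) * 19**i variants with exactly i substitutions.
    """
    if max_mutations is None:
        max_mutations = length
    return sum(comb(length, i) * 19 ** i for i in range(1, max_mutations + 1))


if __name__ == "__main__":
    # Single and double mutants of a hypothetical 100-residue protein.
    print(count_variants(100, 1))
    print(count_variants(100, 2))
    # Summing over every possible number of mutations gives 20**L - 1,
    # i.e. all sequences of length L except the wild type itself.
    print(count_variants(3) == 20 ** 3 - 1)
```

Under this reading, the full sum collapses to 20^L - 1, which is why exhaustive experimental screening becomes impractical even for short proteins.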

Propelled by the swift advancement of natural language-processing techniques, sequence-based representation learning models or protein language models (e.g., UniRep,7 ESM23 and ProtTrans24) have emerged as an unsupervised and MSA-free approach to predict mutation effects.25,26 Despite these promising developments, the accurate prediction of mutation effects remains a challenge. This is largely because these models lack detailed structure context, which is more evolutionarily conserved than sequence and includes crucial long-range contact information for protein functionality.27 The recently published AlphaMissense28 has demonstrated remarkable efficacy in predicting the pathogenicity of missense variants through the utilization of protein structure context. Nonetheless, its reliance on MSAs introduces a significant time burden in searching and processing MSAs.29

Collectively, our proposed model illuminates a multimodal approach to learn both sequence and structure context from massive protein datasets. It summarizes arbitrary protein structures into semantically rich representations approximating protein functions and achieves superior and generalizable performance across comprehensive benchmarks.

Based on our multimodal deep representation learning model, we proposed ProMEP (Fig. 1a) to predict the mutation effects on proteins in a zero-shot manner. The log-ratio heuristic, which compares the probabilities of wild-type (WT) and mutated amino acids, has proven effective in estimating mutation effects.18,25,26 While previous methods calculate this score conditioning solely on sequence context, our multimodal architecture allows ProMEP to quantify the log-likelihood of protein variants with combined sequence and structure contexts (Fig. 1b). By comparing probabilities of the WT sequence and the mutant sequence, ProMEP could accurately depict the protein fitness landscape and guide protein engineering by recognizing beneficial (multiple) mutants (Fig. 1c).
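The log-ratio heuristic described above can be illustrated with a small Python sketch. This is not ProMEP's scoring code: it assumes some model has already produced per-position log-probabilities for each amino acid, and the function name log_ratio_score and the toy probability values are hypothetical.

```python
import math


def log_ratio_score(log_probs, wild_type, mutant):
    """Sum log p(mutant residue) - log p(WT residue) over the mutated positions.

    log_probs: one dict per position, mapping amino acid -> log-probability
               (assumed to come from some pre-trained model).
    A positive score means the mutant residues are judged more likely than WT.
    """
    if len(wild_type) != len(mutant):
        raise ValueError("wild_type and mutant must have the same length")
    score = 0.0
    for pos, (wt_aa, mut_aa) in enumerate(zip(wild_type, mutant)):
        if wt_aa != mut_aa:
            score += log_probs[pos][mut_aa] - log_probs[pos][wt_aa]
    return score


if __name__ == "__main__":
    # Toy two-residue "protein" with made-up probabilities.
    toy_log_probs = [
        {"A": math.log(0.7), "G": math.log(0.3)},
        {"C": math.log(0.5), "D": math.log(0.5)},
    ]
    print(log_ratio_score(toy_log_probs, "AC", "GC"))  # single mutant A1G
    print(log_ratio_score(toy_log_probs, "AC", "GD"))  # double mutant
```

Summing over mutated positions lets the same function score single and multiple mutants, although a simple additive score like this one does not by itself capture epistasis between sites.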

Fig. 1. (a) ProMEP combines the sequence context and the structure context of a protein to accurately predict mutation effects in a zero-shot manner. It takes an arbitrary WT protein as input and uses the pre-trained multimodal deep representation learning model to calculate semantic-rich representations for each amino acid of a protein. Specifically, for arbitrary mutations, ProMEP first extracts both sequence embeddings and structure embeddings from the WT protein. These embeddings are then aligned and fed into the pre-trained transformer encoder to generate protein representations at residue resolution. With the sequence decoder, fine-grained protein representations are eventually decomposed into the conditional probabilities on each amino acid under the contexts of both sequence and structure. Effects of an arbitrary mutation can be interpreted as the difference in predicted log-likelihood between the mutated sequence and the WT sequence. A customized protein point cloud is adopted to introduce protein structure context at atomic resolution. (b) 3D translations and rotations of the input protein structure will not affect the structure context of a protein. ProMEP applies a rotation- and translation-equivariant structure embedding module to guarantee such invariance. (c) ProMEP can be used to guide protein engineering without the requirements for labeled datasets or a holistic understanding of the protein structure and molecular function. It enables the user to recognize beneficial (multiple) mutants by efficiently traversing the protein fitness landscape.
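The invariance requirement in panel b can be made concrete with a small Python sketch. It does not reproduce ProMEP's structure embedding module; it only shows, under simple assumptions, that one familiar geometric feature (the set of pairwise distances in a point cloud) is unchanged by a rotation plus translation. The helper names and the toy coordinates are hypothetical.

```python
import math


def pairwise_distances(coords):
    """Return the distances between every unordered pair of 3D points."""
    return [
        math.dist(a, b)
        for i, a in enumerate(coords)
        for b in coords[i + 1:]
    ]


def rigid_transform(coords, angle, shift):
    """Rotate points about the z axis by `angle` radians, then translate by `shift`."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in coords
    ]


if __name__ == "__main__":
    # Toy "point cloud" of three atoms (made-up coordinates).
    atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.5)]
    moved = rigid_transform(atoms, angle=0.7, shift=(3.0, -2.0, 5.0))
    before = pairwise_distances(atoms)
    after = pairwise_distances(moved)
    # The coordinates change, but the distance-based description does not.
    print(all(math.isclose(d0, d1, abs_tol=1e-9) for d0, d1 in zip(before, after)))
```

A model whose structural features depend only on such quantities, or whose layers are equivariant by construction, gives the same prediction regardless of how the input structure happens to be oriented.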

Taken together, ProMEP demonstrates the ability to accurately interpret the underlying impact of mutations. The exceptional prediction efficiency and generalization ability of ProMEP imply its strong potential in predicting mutational effects in proteins without prior knowledge.

The multimodal architecture enables ProMEP to detect both the interaction between sequentially nearby amino acids (sequence context) (Fig. 3a) and the interaction between spatially nearby amino acids (structure context) (Fig. 3b). To interpret these contexts at different scales, we made an in-depth analysis of ProMEP.

Additionally, we assessed the perception capability of ProMEP in terms of the global protein folding context by employing a multi-class fold classification benchmark. This benchmark consists of 13,265 domains that were carefully selected from the Structural Classification of Proteins-extended (SCOPe) v2.07 database (Materials and methods). In comparison to previous structure-based models, which were trained based on either contact maps or protein graphs, the representations generated by ProMEP exhibit a remarkable ability to be accurately categorized into the correct fold classes (Supplementary information, Fig. S6c). More specifically, ProMEP demonstrates a substantial improvement in the classification of fold classes that are sparse in nature, such as multi-domain proteins (class e, F1 score of 0.92 for ProMEP vs 0.71 for the next-best model, GraSR) and small proteins (class g, F1 score of 0.92 for ProMEP vs 0.62 for the next-best model, GraSR).
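For readers unfamiliar with the per-class F1 scores quoted above, the following Python sketch shows how such a score is computed for each class in a multi-class setting. The function name f1_per_class and the toy labels are illustrative only; they are not the benchmark data or the evaluation code used in the paper.

```python
def f1_per_class(y_true, y_pred):
    """Compute the F1 score of each class from true and predicted labels.

    F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall.
    A class with no true or predicted members gets a score of 0.0.
    """
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


if __name__ == "__main__":
    # Made-up labels using SCOPe-style class letters.
    true_classes = ["a", "a", "e", "e", "g"]
    predicted = ["a", "e", "e", "e", "g"]
    print(f1_per_class(true_classes, predicted))
```

Reporting F1 per class, rather than overall accuracy, keeps sparse classes such as e and g from being masked by the more populated ones.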

Finally, the ablation study of ProMEP on three datasets demonstrates that both the sequence context learned by the sequence embedding module and the structure context learned by the structure embedding module markedly contribute to the improved performance (Supplementary information, Fig. S7). Collectively, these findings demonstrate that ProMEP is capable of capturing both the sequence context and the structure context of a protein. Furthermore, the quantitative and visually interpretable multimodal context provides significant insights for biologists to understand the molecular function of both annotated and unannotated proteins.