A new approach to machine-learning-assisted directed evolution

Directed evolution mimics natural evolution in the laboratory: molecular biology methods are used to build a large library of mutated genes, and sensitive, targeted screening strategies are used to identify biomolecules, such as proteins, that do not exist in nature or that have improved characteristics. It has been widely used to modify and optimize proteins and is considered an efficient way to produce proteins with improved or entirely new properties, which matters greatly for enzyme engineering and for peptide and macromolecular drug design. In a traditional directed evolution experiment, the function of a large number of mutant sequences is screened and tested, the best sequence obtained is used as the parent for the next round of mutation and screening, and this cycle is repeated over multiple rounds to obtain functionally optimized protein sequences. However, the traditional approach is prone to becoming trapped in local optima, and the experiments cover only a limited region of mutant sequence space.
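The local-optimum problem can be illustrated with a deliberately tiny example. The sketch below is not the experimental protocol or anything from the paper; the two-letter alphabet, the sequences and the fitness values in TOY_FITNESS are all made up for illustration. It runs the "screen all single mutants, keep the best as the next parent" loop and shows that the outcome depends on the starting sequence.

```python
ALPHABET = "AB"  # toy two-letter alphabet (assumption, not real amino acids)

# Hypothetical fitness values for every 2-residue sequence.
# "BB" is the global optimum, but both of its neighbors are worse than "AA".
TOY_FITNESS = {"AA": 1.0, "AB": 0.5, "BA": 0.4, "BB": 2.0}


def single_mutants(seq):
    """Return all sequences that differ from seq at exactly one position."""
    mutants = []
    for i, old in enumerate(seq):
        for new in ALPHABET:
            if new != old:
                mutants.append(seq[:i] + new + seq[i + 1:])
    return mutants


def greedy_directed_evolution(parent, fitness, max_rounds=10):
    """Each round, screen all single mutants and keep the best one
    as the new parent, but only if it improves on the current parent."""
    trajectory = [parent]
    for _ in range(max_rounds):
        best = max(single_mutants(parent), key=fitness.get)
        if fitness[best] <= fitness[parent]:
            break  # no single mutation helps: stuck at a (local) optimum
        parent = best
        trajectory.append(parent)
    return trajectory


print(greedy_directed_evolution("AA", TOY_FITNESS))  # stays at "AA"
print(greedy_directed_evolution("AB", TOY_FITNESS))  # climbs to "BB"
```

Starting from "AA", every single mutant is worse, so the search stops even though "BB" is fitter; reaching it would require crossing a fitness valley, which greedy round-by-round screening does not do.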

In recent years, machine-learning-assisted directed evolution has attracted growing attention. By simulating the experimental screening process with a computational model, the amount of experimental screening can be significantly reduced and screening efficiency significantly improved. The key step is for the model to learn the mapping from sequence variants of the target protein to their function. This mapping is called the protein fitness landscape, where fitness is an abstract concept that quantitatively characterizes a particular biological function of a given protein sequence (such as the thermal stability of the protein, the strength of its interaction with other proteins, or its efficiency in catalyzing a specific enzymatic reaction). Because proteins have different functions, what the fitness landscape represents differs from protein to protein. In addition, mutation effect data are difficult to obtain, the experiments are time-consuming and laborious, and fitness landscapes are complex. How to learn a protein fitness landscape from limited experimental data in order to guide directed evolution experiments has therefore become a challenging problem.

The research groups of Zheng Mingyue and Liao Cangsong at the Shanghai Institute of Materia Medica, Chinese Academy of Sciences, have proposed a new deep neural network model, GVP-MSA. Drawing on existing fitness landscapes of different types of proteins, the model constructs the fitness landscape of a new target protein through transfer learning. The results were published online in Cell Systems on August 16 under the title "Learning protein fitness landscapes with deep mutational scanning data from multiple sources."

The study explored the mechanisms shared by different fitness landscapes from three angles: protein thermal stability, epistasis, and sequence conservation. A protein can function only if it folds into, and maintains, a stable three-dimensional structure. Calculations on different proteins showed that the change in fitness caused by a mutation correlates numerically with the change in thermal stability. Epistasis also reflects a mechanism shared across the fitness landscapes of different proteins. Epistasis means that residues in a protein interact, so that the effect of a multi-point mutation is not equal to the sum of the effects of the single-point mutations that make it up. The researchers found that, across different protein fitness landscapes, the two residues involved in double mutations with positive epistasis tend to lie closer together in the three-dimensional structure. In addition, the relationship between mutation effects and the distribution implied by homologous sequences is shared across proteins. These commonalities are the basis for transfer learning of fitness landscapes (Figure 1).
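As a minimal worked example of the epistasis definition above, the sketch below computes the deviation of a double mutant from the additive expectation, together with the distance between the two mutated residues. It assumes that mutation effects are expressed on an additive scale relative to the wild type (for example log-enrichment scores), and that each residue is represented by a single 3D coordinate such as its C-alpha atom. All numbers and coordinates are invented for illustration and do not come from the paper.

```python
import math


def epistasis(effect_a, effect_b, effect_ab):
    """Deviation of the double-mutant effect from the sum of the two
    single-mutant effects (all effects relative to wild type).
    Positive values mean the double mutant is fitter than expected."""
    return effect_ab - (effect_a + effect_b)


def residue_distance(coord_a, coord_b):
    """Euclidean distance between two residue coordinates
    (assumed here to be C-alpha positions, in angstroms)."""
    return math.dist(coord_a, coord_b)


# Hypothetical example: two individually harmful mutations that
# compensate for each other when combined.
eps = epistasis(effect_a=-0.5, effect_b=-0.25, effect_ab=0.25)
dist = residue_distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))

print(eps)   # 1.0
print(dist)  # 5.0
```

Under a purely additive model the double mutant would score -0.75; the observed 0.25 gives a positive epistasis of 1.0, and in this made-up case the two residues sit 5 angstroms apart.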

Figure 1. Motivation and basis for transfer learning of protein fitness landscapes. a. In deep mutational scanning experiments on different proteins, mutation-induced changes in thermal stability are related to changes in fitness. The histogram shows the Spearman correlation between the change in thermal stability, calculated with Rosetta, and the change in fitness. b. The residues involved in double mutations with positive epistasis are closer together in three-dimensional structure. The pink histogram shows the residue-to-residue distance for double mutations with positive epistasis, and the blue histogram shows the residue-to-residue distance for all double mutations.
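For readers unfamiliar with the statistic used in panel a, the Spearman correlation is the Pearson correlation computed on ranks, so it measures whether two quantities rise and fall together without assuming a linear relationship. The following is a small self-contained sketch in plain Python; the stability and fitness values are invented, and no particular sign convention for the stability change is implied.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-mutation values (not from the paper).
stability_change = [0.1, 0.5, 2.0, 3.0]
fitness_change = [1.0, 4.0, 9.0, 100.0]
print(spearman(stability_change, fitness_change))
```

Because the made-up fitness values increase whenever the stability values do, the rank correlation is 1 even though the relationship is far from linear.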

In this study, the team built a new deep neural network model, GVP-MSA. The model uses a pre-trained protein language model to process the multiple sequence alignment (MSA) of sequences homologous to the target protein, uses an E(3)-equivariant graph neural network to extract information from the protein's three-dimensional structure, and uses multi-task learning to learn from and integrate protein data of different dimensions and functions, so that it can generalize to a new target protein.

In addition, the team designed several test scenarios: random and positional extrapolation of single-point mutation effects, zero-shot prediction of mutation effects for new proteins, and prediction of multi-point mutation effects from single-point mutation effects (Figure 2). These scenarios simulate the practical needs that arise at different stages of a directed evolution experiment. GVP-MSA performed well in all three scenarios, which supports the effectiveness of transfer learning across fitness landscapes. The work offers new ideas for machine-learning-assisted directed evolution and can help explore the mutation space of protein sequences more efficiently and design protein sequences with improved or entirely new properties more quickly.
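The difference between random and positional extrapolation comes down to how the data are split. The sketch below shows one plausible way to build the two splits; it is an illustration, not the paper's evaluation code. It assumes single-point mutations are written in the common "A23G" style (wild-type residue, position, new residue), and the function names and example mutations are hypothetical.

```python
import random


def position_of(mutation):
    """Residue index of a single-point mutation written like 'A23G'."""
    return int(mutation[1:-1])


def random_split(mutations, test_fraction=0.2, seed=0):
    """Hold out a random subset of mutations; a test position
    may also appear in the training set."""
    rng = random.Random(seed)
    shuffled = list(mutations)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def positional_split(mutations, test_fraction=0.2, seed=0):
    """Hold out whole positions, so every test mutation is at a
    residue position never seen during training."""
    rng = random.Random(seed)
    positions = sorted({position_of(m) for m in mutations})
    rng.shuffle(positions)
    n_test = max(1, int(len(positions) * test_fraction))
    held_out = set(positions[:n_test])
    train = [m for m in mutations if position_of(m) not in held_out]
    test = [m for m in mutations if position_of(m) in held_out]
    return train, test


# Hypothetical mutation list: two substitutions at each of five positions.
mutations = ["A1G", "A1C", "L2P", "L2V", "K3R",
             "K3E", "D4N", "D4E", "S5T", "S5A"]

train, test = positional_split(mutations)
print("positional test set:", test)
```

In the positional split the model is asked about residue positions it has never seen mutated, which is the harder and often the more realistic setting when only part of a protein has been screened.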

Figure 2. Overview of the GVP-MSA model architecture and application scenarios. a. Model architecture of GVP-MSA. b. Application scenarios in protein directed evolution: (1) zero-shot prediction for a new protein when no fitness data are available for the target protein; (2) random and positional extrapolation when a small amount of fitness data is available for the target protein; (3) prediction of multi-point mutation effects when only single-point mutation fitness data are available.

The research was supported by the National Natural Science Foundation of China, Lingang Laboratory, the National Key R&D Program, the Youth Innovation Promotion Association of the Chinese Academy of Sciences, the Shanghai Natural Science Foundation, and the joint research project of the Shanghai Institute of Materia Medica and the TCM Innovation Team of Shanghai University of Traditional Chinese Medicine. (Source: Shanghai Institute of Materia Medica, Chinese Academy of Sciences)
