Seeds are often called the “chips” of agriculture, and innovation in breeding technology is the core driving force of agricultural development. The emerging paradigm in plant breeding is the integration of biotechnology (BT), such as genomics, gene editing, and synthetic biology, with information technology (IT), such as data science, machine learning, and artificial intelligence. The “14th Five-Year Plan” of the Ministry of Agriculture and Rural Affairs lists the “smart seed industry” first among the seven major research tasks in the field of “smart agriculture”. The task explicitly calls for building a digital breeding platform, exploring an “intelligent breeding technology system” that spans genotype to phenotype, and accelerating the shift from “empirical breeding” to “precision breeding”.
On September 21, Trends in Plant Science, a leading Cell Press review journal in plant science, published a feature review jointly written by Professor Wang Xiangfeng and Associate Professor Yan Jun of the Frontier Science Center for Molecular Design and Breeding of China Agricultural University and the National Center for Corn Improvement: “Machine learning bridges omics sciences and plant breeding”. The review defines what “precision breeding” means and divides “precision design breeding” into “knowledge-driven molecular design breeding” and “data-driven genome design breeding”. It focuses on how machine learning techniques can turn “knowledge” and “data” into drivers of breeding, and on how to bridge basic research and breeding practice so that precision design breeding in plants can be realized sooner.
In recent decades, basic research in plant biology has generated a wealth of new knowledge and data that will ultimately serve plant breeding and trait improvement. However, achieving the ultimate goal of precision design breeding in plants also requires closing the current gap between basic plant research and breeding practice. Machine learning, a branch of artificial intelligence, is widely used because of its exceptional ability to integrate complex and variable biological knowledge with omics big data.
“Knowledge”- and “data”-driven precision design breeding. (Image courtesy of Wang Xiangfeng)
Machine learning can bridge basic research and breeding practice in two main ways. The first is to use basic research in plant biology to understand gene function and regulatory mechanisms, which enables knowledge-driven molecular design breeding. Once the functions of trait-regulating genes are clear, plant varieties can be improved through molecular marker-assisted selection, pyramiding of favorable alleles across multiple genes, gene editing, and synthetic biology. The second is to apply machine learning directly to commercial breeding pipelines, building predictive models and decision-making algorithms to achieve data-driven genome design breeding.
These two approaches complement each other and both play an important role in modern commercial breeding pipelines. Which one a pipeline relies on depends on the number of genes or loci associated with the trait. For quantitative traits that are determined primarily by the genetic background, such as yield, biomass, and environmental adaptability, data-driven models are often used to infer the association between phenotype and genome-wide markers. For polygenic traits determined by the genetic foreground, such as disease resistance and quality, the molecular functions and pathways of the trait-regulating genes must be clarified first, so that favorable allelic variants of multiple genes can be combined accurately. For single-gene traits, gene editing is the most direct shortcut to creating mutations and improving the trait. In short, as long as sufficient knowledge and data have accumulated in plant biology and breeding, machine learning techniques can effectively drive the precise design of plants to meet breeding goals.
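To make the data-driven case concrete, the following is a minimal sketch of predicting a quantitative trait from genome-wide markers with ridge regression. It is an illustration under stated assumptions, not the method described in the review: the genotypes and phenotypes are synthetic, markers are coded 0/1/2, the penalty value is arbitrary, and the function names `fit_ridge_markers` and `predict_phenotype` are hypothetical.

```python
import numpy as np


def fit_ridge_markers(genotypes, phenotypes, penalty=1.0):
    """Estimate marker effects by ridge regression on centered data."""
    X = np.asarray(genotypes, dtype=float)
    y = np.asarray(phenotypes, dtype=float)
    marker_means = X.mean(axis=0)
    phenotype_mean = y.mean()
    Xc = X - marker_means
    yc = y - phenotype_mean
    # Solve (Xc^T Xc + penalty * I) effects = Xc^T yc
    n_markers = X.shape[1]
    effects = np.linalg.solve(
        Xc.T @ Xc + penalty * np.eye(n_markers), Xc.T @ yc
    )
    return marker_means, phenotype_mean, effects


def predict_phenotype(genotypes, marker_means, phenotype_mean, effects):
    """Predict phenotypes for new lines from their marker genotypes."""
    X = np.asarray(genotypes, dtype=float)
    return phenotype_mean + (X - marker_means) @ effects


# Synthetic data (assumption): 200 lines, 500 markers coded 0/1/2,
# a trait controlled by many small-effect loci plus random noise.
rng = np.random.default_rng(0)
n_lines, n_markers = 200, 500
genotypes = rng.integers(0, 3, size=(n_lines, n_markers))
true_effects = rng.normal(0.0, 0.1, size=n_markers)
phenotypes = genotypes @ true_effects + rng.normal(0.0, 1.0, size=n_lines)

# Train on the first 150 lines, predict the remaining 50.
train, test = slice(0, 150), slice(150, None)
model = fit_ridge_markers(genotypes[train], phenotypes[train], penalty=50.0)
predicted = predict_phenotype(genotypes[test], *model)

# Prediction accuracy as the correlation between predicted and observed values.
accuracy = np.corrcoef(predicted, phenotypes[test])[0, 1]
print(f"prediction accuracy (Pearson r): {accuracy:.2f}")
```

The penalty shrinks all marker effects toward zero, which is what lets the model handle more markers than lines; in practice it would be chosen by cross-validation rather than fixed in advance.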
Applications of various machine learning algorithms in plant multi-omics research. (Figure provided by Wang Xiangfeng)
The paper first introduces the main types of modern machine learning techniques (including supervised learning, semi-supervised learning, unsupervised learning, and deep learning) and their latest progress. Second, it reviews how modern machine learning algorithms are applied to dimensionality reduction of high-dimensional multi-omics data, gene regulatory network inference, multi-omics association analysis and gene mining, and prioritization of candidate genes. Third, it covers progress in applying deep learning algorithms based on a semi-supervised learning framework to plant phenomics. Finally, it covers progress in applying machine learning to genomic selection-assisted breeding, genotype-to-phenotype prediction, and modeling of genotype-environment interaction. The Conclusions and Prospects section discusses the current challenges facing machine learning and artificial intelligence in plant research, along with potential solutions.
Case: applying the NMF dimensionality reduction algorithm to improve the efficiency of gene mining. (Figure provided by Wang Xiangfeng)
In addition, the review provides a worked example of unsupervised learning: how the non-negative matrix factorization (NMF) algorithm can be used to improve the efficiency of association analysis on maize multi-omics data and the accuracy of gene mining. (Source: China Science Daily, Zhang Qingdan)
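For readers unfamiliar with NMF, the sketch below shows the basic idea on a toy matrix: a non-negative genes-by-samples matrix V is approximated as the product of two smaller non-negative matrices, W (genes by components) and H (components by samples). This is not the pipeline used in the review. The data are synthetic, the rank of 2 and the iteration count are arbitrary assumptions, and the `nmf` function is a hypothetical helper that uses the standard multiplicative update rules for the squared-error objective.

```python
import numpy as np


def nmf(V, rank, n_iter=500, eps=1e-9, seed=0):
    """Factorize non-negative V (genes x samples) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    W = rng.random((n_rows, rank)) + eps
    H = rng.random((rank, n_cols)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Synthetic data (assumption): 60 genes, 20 samples, 2 latent modules.
rng = np.random.default_rng(1)
gene_loadings = rng.random((60, 2))
sample_weights = rng.random((2, 20))
V = gene_loadings @ sample_weights

W, H = nmf(V, rank=2)

# How well the low-dimensional factors reconstruct the original matrix.
relative_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(f"relative reconstruction error: {relative_error:.4f}")

# Assign each gene to the component where it has the largest loading.
dominant_component = W.argmax(axis=1)
print("genes per component:", np.bincount(dominant_component, minlength=2))
```

Because both factors are non-negative, each component can be read as an additive group of genes with a shared pattern across samples, which is the kind of compact representation that can then feed into association analysis.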
Related paper information: https://doi.org/10.1016/j.tplants.2022.08.018