The ESM Metagenomic Atlas database contains structural predictions for 617 million proteins. Image source: ESM Metagenomic Atlas
DeepMind, Google’s artificial intelligence (AI) company, this year unveiled the predicted structures of 220 million proteins, covering nearly every protein from known organisms in DNA databases. Now, another tech giant is filling in the dark matter of the protein universe.
Researchers at Meta (formerly Facebook) used AI to predict the structures of about 600 million proteins from bacteria, viruses and other microbes that have not yet been characterized. The study was posted Nov. 1 on the preprint server bioRxiv.
“These are very mysterious proteins that offer the possibility of gaining insight into biology,” said Alexander Rives, head of research on Meta AI’s protein team.
The team generated these predictions using a “large language model,” a type of AI that serves as the basis for tools that predict text from a few letters or words.
Language models are usually trained on large volumes of text. To apply them to proteins, Rives’ team fed them the sequences of known proteins, which can be written as chains of 20 different amino acids, each represented by a letter. The model then learned to “autocomplete” proteins in which a proportion of the amino acids had been obscured.
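The "autocomplete" training task can be illustrated with a small sketch. It only shows how a training example might be built by hiding letters in a protein sequence; it is not Meta's training code. The 15 percent masking fraction, the `?` placeholder and the example sequence are assumptions made for illustration and are not taken from the study.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, one letter each
MASK = "?"  # hypothetical placeholder for a hidden residue


def mask_sequence(sequence, fraction=0.15, seed=0):
    """Hide a fraction of residues.

    Returns the masked string and a dict mapping each hidden position
    to the letter a model would be asked to recover.
    The default fraction is an assumption, not a value from the study.
    """
    if not sequence or any(aa not in AMINO_ACIDS for aa in sequence):
        raise ValueError("sequence must be non-empty and use the 20 standard letters")
    rng = random.Random(seed)
    n_hidden = max(1, round(len(sequence) * fraction))
    positions = rng.sample(range(len(sequence)), n_hidden)
    hidden = {i: sequence[i] for i in sorted(positions)}
    masked = "".join(MASK if i in hidden else aa for i, aa in enumerate(sequence))
    return masked, hidden


def restore(masked, hidden):
    """Fill the hidden letters back in (what a perfect model would output)."""
    return "".join(hidden.get(i, ch) for i, ch in enumerate(masked))


example = "MKTAYIAKQRQISFVKSHFS"  # made-up 20-residue sequence
masked, hidden = mask_sequence(example)
print(masked)
print(restore(masked, hidden) == example)
```

During training, a model would see only the masked string and be scored on how well it guesses the hidden letters. Repeating this over many sequences is what lets it pick up the statistical patterns the article describes.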
Rives says this training gives the model an intuitive understanding of protein sequences, which contain information about the shape of proteins.
The second step, inspired by AlphaFold, DeepMind’s pioneering AI algorithm for protein structure prediction, combines this insight with information about the relationships between known protein structures and sequences to generate predicted structures from protein sequences.
Earlier this summer, Rives’ team reported that its algorithm, called ESMFold, is not as accurate as AlphaFold but is about 60 times faster at predicting structures. “This means we can scale structure prediction to a much larger database,” Rives said.
As a test case, the team decided to apply the model to a large-scale database of sequenced “metagenomic” DNA from environmental sources, including soil, seawater, the human gut, skin and other microbial habitats. The vast majority of the DNA entries, which encode potential proteins, come from organisms that have never been cultured and are unknown to scientists.
In total, the Meta team predicted the structures of more than 617 million proteins, and the work took only two weeks. Rives says the predictions are freely available for anyone to use, as is the code underlying the model.
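A back-of-the-envelope calculation puts these figures in perspective. The sketch below uses only the numbers reported in this article (617 million structures, two weeks, a roughly 60-fold speed advantage). It assumes, purely for illustration, that the speed advantage applies uniformly and that the hardware stays the same; the article does not say how much hardware was used.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600


def aggregate_rate(n_structures, weeks):
    """Average structures predicted per second across all hardware combined."""
    return n_structures / (weeks * SECONDS_PER_WEEK)


def weeks_if_slower(weeks, speedup):
    """Time the same job would take with a method `speedup` times slower.

    Assumes a uniform slowdown on identical hardware (an illustrative assumption).
    """
    return weeks * speedup


rate = aggregate_rate(617_000_000, weeks=2)
slower = weeks_if_slower(2, speedup=60)
print(f"about {rate:.0f} structures per second in aggregate")
print(f"about {slower} weeks ({slower / 52:.1f} years) at one-sixtieth the speed")
```

Under those assumptions the run averaged roughly 510 structures per second overall, and the same job at one-sixtieth the speed would take about 120 weeks, or a little over two years.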
The model considers more than one-third of the 617 million predictions to be of high quality, meaning researchers can be confident that the overall shape of the protein is correct and, in some cases, can discern finer atomic-level details. Notably, millions of these structures are completely new: they resemble nothing in the databases of experimentally determined protein structures or in the AlphaFold database of predictions from known organisms.
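For readers who want to work with the predictions, selecting the high-quality subset would amount to a simple filter on the model's own confidence estimate. The sketch below is hypothetical: the record layout, the 0 to 100 confidence scale, the cutoff of 70 and the sample values are all assumptions for illustration, not the atlas's actual format or the threshold used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    protein_id: str    # hypothetical identifier
    confidence: float  # assumed 0-100 scale, higher means more reliable


def high_quality(predictions, threshold=70.0):
    """Keep predictions whose confidence meets the (assumed) cutoff."""
    return [p for p in predictions if p.confidence >= threshold]


def fraction_high_quality(predictions, threshold=70.0):
    """Share of predictions that pass the cutoff; 0.0 for an empty list."""
    if not predictions:
        return 0.0
    return len(high_quality(predictions, threshold)) / len(predictions)


# Made-up records, not real atlas entries.
sample = [
    Prediction("prot_a", 91.2),
    Prediction("prot_b", 48.5),
    Prediction("prot_c", 73.0),
]
kept = high_quality(sample)
print([p.protein_id for p in kept])
print(f"{fraction_high_quality(sample):.2f} of the sample passes the cutoff")
```

Raising the cutoff trades coverage for reliability, so the appropriate value depends on whether a study needs only the overall fold or finer atomic detail.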
A large portion of the AlphaFold database is made up of structures that are nearly identical to each other, whereas the metagenomic database should cover a large portion of the previously unseen protein universe.
Sergey Ovchinnikov, an evolutionary biologist at Harvard University, is skeptical of ESMFold’s hundreds of millions of predictions. He believes that some of the proteins may lack a defined structure, while other entries may be noncoding DNA mistaken for protein-coding material.
Burkhard Rost, a computational biologist at the Technical University of Munich in Germany, is impressed by the speed and accuracy of Meta’s model. But he questions whether it really offers an advantage over AlphaFold’s accuracy when predicting proteins from metagenomic databases. Prediction methods based on language models are better suited to quickly determining how mutations change protein structure, which AlphaFold cannot do.
According to a DeepMind representative, the company currently has no plans to include metagenomic structure predictions in its database, but does not rule out doing so in the future.
Martin Steinegger, a computational biologist at Seoul National University in South Korea, believes that the obvious next step for such tools is to study the dark matter of biology. “We’ll soon see an explosion in the analysis of these metagenomic structures,” he said. (Source: Xin Yu, China Science Daily)
Related paper information: https://doi.org/10.1101/2022.07.20.500902