AI Simulates 500 Million Years of Evolution to Design Shiny New Proteins

Insider Brief

  • Researchers used an AI model, ESM3, to simulate 500 million years of evolution and generate functional proteins.
  • The model was trained on billions of protein sequences and structures, enabling it to design new proteins with unique functions.
  • AI-generated proteins could accelerate drug discovery, synthetic biology, and industrial applications, though real-world validation is needed.
  • Image: Artist’s rendition of esmGFP, the new fluorescent protein created by ESM3. (EvolutionaryScale)

A new artificial intelligence system has simulated 500 million years of evolution, generating functional proteins that nature has never produced. Researchers say the system, called ESM3, could transform drug discovery and synthetic biology by designing new molecules with specific properties.

Scientists at a company called EvolutionaryScale trained ESM3, a language model, on a massive dataset of 3.15 billion protein sequences and structures. Working at Meta’s FAIR (Fundamental AI Research) unit in 2019, EvolutionaryScale’s founding team built ESM1, considered the first large language model (LLM) for proteins.

The model generates new proteins by predicting viable amino acid sequences, mimicking the process of natural selection but on a vastly accelerated timescale. The study, published in Science and on the preprint server bioRxiv, suggests the AI-designed proteins are both structurally stable and functionally diverse.

In this study, the team generated a fluorescent protein similar to those found in jellyfish and corals.

“ESM3 can follow complex prompts combining its modalities and is highly responsive to biological alignment. We have prompted ESM3 to generate fluorescent proteins with a chain of thought,” the researchers write. “Among the generations that we synthesized, we found a bright fluorescent protein at far distance (58% identity) from known fluorescent proteins. Similarly distant natural fluorescent proteins are separated by over five hundred million years of evolution.”

By generating new proteins, the system could help address major challenges in medicine, including designing enzymes for new drugs and engineering proteins for sustainable materials, the team writes.

Simulating Evolution with AI

ESM3 is a generative language model that predicts protein sequences based on structure and function. The researchers trained it using 771 billion tokens that represent both natural and synthetic proteins. Unlike previous models, ESM3 does not rely solely on sequence data. It integrates information about protein structures and functions, allowing it to design completely novel proteins while preserving biological feasibility.

The model was tested at three scales: 1.4 billion, 7 billion, and 98 billion parameters. The largest version, ESM3-98B, produced the most biologically realistic proteins, according to the study. The researchers evaluated its performance using standard metrics, including predicted local structural confidence (pLDDT), predicted template modeling scores (pTM), and sequence diversity measures.
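To make the role of these metrics concrete, here is a minimal sketch of how generated designs might be screened by confidence scores. The cutoff of 0.8 for both scores, the 0-to-1 scaling of pLDDT, the design names, and all score values are assumptions chosen for illustration, not figures taken from this article.

```python
from dataclasses import dataclass


@dataclass
class Design:
    name: str     # hypothetical label for a generated protein
    plddt: float  # mean pLDDT, scaled to 0-1 in this sketch
    ptm: float    # pTM, between 0 and 1


def confident_designs(designs, min_plddt=0.8, min_ptm=0.8):
    """Keep designs whose predicted confidence clears both assumed cutoffs."""
    return [d for d in designs if d.plddt >= min_plddt and d.ptm >= min_ptm]


# Made-up scores for illustration; these are not results from the study.
candidates = [
    Design("design_a", plddt=0.91, ptm=0.85),
    Design("design_b", plddt=0.72, ptm=0.88),
    Design("design_c", plddt=0.86, ptm=0.64),
]
kept = confident_designs(candidates)
kept_names = [d.name for d in kept]
```

Requiring both scores to pass is a deliberately conservative choice: a design can look confident residue by residue (high pLDDT) while its overall fold remains uncertain (low pTM), or the reverse.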

One key result was the creation of a synthetic fluorescent protein called esmGFP. The AI-generated protein exhibits fluorescence comparable to naturally occurring green fluorescent proteins (GFPs) but has only 58% sequence identity to its closest known counterpart. Evolutionary estimates suggest it would have taken nature over 500 million years to develop a protein as distinct as esmGFP from existing ones.
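For readers unfamiliar with the term, sequence identity is the share of aligned positions at which two proteins carry the same amino acid. The sketch below shows one common way to compute it, assuming the two sequences are already aligned and that gap columns are skipped. Other conventions use a different denominator, so this is an illustration rather than the study's exact procedure. The sequence fragments are made up and are not real GFP sequences.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where both sequences have a residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no aligned residue pairs to compare")
    return 100.0 * matches / compared


# Toy, made-up aligned fragments ("-" marks an alignment gap).
seq_a = "MSKGEELFTG-VVPIL"
seq_b = "MSRGEALFSGKVVP-L"
identity = percent_identity(seq_a, seq_b)
print(f"{identity:.1f}% identity")
```

Under this definition, a value of 58% means that roughly four in ten compared positions differ between esmGFP and its closest known relative.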

The Role of Prompting and Model Scaling

Unlike traditional evolutionary biology, where mutations accumulate gradually, ESM3 can be “prompted” to generate proteins with specific structural and functional characteristics. The researchers tested this capability by instructing the model to design proteins with known catalytic sites while changing their structural frameworks. The results showed that ESM3 can generalize beyond its training data, creating proteins that resemble natural ones but differ significantly in sequence and shape.
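One way to picture such a prompt is as a partially specified sequence: the residues of the catalytic site are fixed, and every other position is left open for the model to fill in. The sketch below builds that kind of template. The mask character, the function name, and the positions and residues are all hypothetical, and the code does not call ESM3 or represent its actual interface.

```python
MASK = "_"  # hypothetical placeholder meaning "let the model choose this residue"


def build_prompt(length: int, fixed: dict[int, str]) -> str:
    """Return a partially specified sequence: fixed residues kept, the rest masked."""
    for pos in fixed:
        if not 0 <= pos < length:
            raise ValueError(f"position {pos} is outside a sequence of length {length}")
    return "".join(fixed.get(i, MASK) for i in range(length))


# Made-up catalytic positions and residues, for illustration only.
site = {3: "H", 7: "D", 12: "S"}
prompt = build_prompt(16, site)
print(prompt)
```

A generative model given such a template would be free to propose any scaffold around the fixed residues, which is how a design can keep a known active site while differing in overall sequence and shape.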

The study also examined how scaling the model affected its ability to generate realistic proteins. The researchers found that increasing the model’s size led to significant improvements in protein quality. The 98-billion-parameter model produced proteins that were more stable and biologically plausible than those generated by smaller models. When the system was fine-tuned using preference optimization—where successful designs were reinforced—the largest model showed a 65.5% success rate in solving complex protein design tasks, compared to 26.8% before fine-tuning.
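The size of that improvement can be read two ways, as an absolute gain in percentage points or as a relative fold change. The short calculation below works out both from the two figures reported above; the variable names are illustrative.

```python
before_pct = 26.8  # success rate before fine-tuning, as reported above
after_pct = 65.5   # success rate after preference optimization

gain_points = after_pct - before_pct  # absolute change, in percentage points
fold_change = after_pct / before_pct  # relative change

print(f"Gain: {gain_points:.1f} percentage points")
print(f"Fold change: {fold_change:.2f}x")
```

That works out to a gain of about 38.7 percentage points, or roughly 2.4 times the original success rate.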

Implications for Drug Discovery and Synthetic Biology

The ability to generate functional proteins with AI could have wide-ranging applications. In drug development, ESM3 could help design enzymes that catalyze specific chemical reactions, leading to new treatments for diseases. In synthetic biology, the model could aid in engineering proteins for biofuels, sustainable materials, and even novel antibiotics.

One advantage of AI-generated proteins is that they can be tailored to specific needs, bypassing the constraints of natural selection. Unlike proteins found in nature, which evolved for specific biological roles, AI-designed proteins can be optimized for industrial or medical applications. The model could, for example, create proteins that are more resistant to heat or acidity, making them more useful for manufacturing and environmental applications.

“It is in this space that a language model sees proteins,” the team writes. “It sees the data of proteins as filling this space, densely in some regions, and sparsely in others, revealing the parts that are accessible to evolution. Since the next token is generated by evolution, it follows that to solve the training task of predicting the next token, a language model must predict how evolution can move through the space of possible proteins.”

Challenges and Future Directions

Despite its success, ESM3 has limitations. While it can generate structurally and functionally viable proteins, its ability to predict real-world biochemical behavior remains uncertain. Proteins do not operate in isolation; they interact with other molecules, undergo folding processes, and are influenced by cellular environments. Testing AI-generated proteins in laboratory and real-world conditions will be necessary to validate their functions.

The researchers also acknowledged that while scaling the model improves performance, it comes with increasing computational costs. Training ESM3 required massive computational resources, making widespread use of such models costly. Future work may focus on optimizing training efficiency and developing smaller models that retain the generative power of the largest versions.

Another open question is whether AI-generated proteins will have unintended consequences, such as novel immune responses or unexpected interactions with biological systems. Rigorous testing and validation will be required before these proteins can be used in medical or industrial applications.

These limitations point to directions for future research. Even so, the work could represent a major step in computational biology, demonstrating that AI can create functional proteins beyond those found in nature. If further validated, such models could change the way scientists approach protein engineering, shifting from trial-and-error experiments to targeted design using AI.

The ultimate step, according to the scientists, is to continue improving ESM3 to accelerate discoveries in fields ranging from medicine to materials science.

The material for this story is also based on an earlier version of the study available on the pre-print server, bioRxiv.

Researchers who worked on the study include: Thomas Hayes, Roshan Rao, Halil Akin, Nicholas J. Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q. Tran, Jonathan Deaton, Marius Wiggert, Rohil Badkundri, Irhum Shafkat, Jun Gong, Alexander Derry, Raul S. Molina, Neil Thomas, Yousuf A. Khan, Chetan Mishra, Carolyn Kim, Liam J. Bartie, Matthew Nemeth, Patrick D. Hsu, Tom Sercu, Salvatore Candido and Alexander Rives.
