AI Model Evo 2 Reads and Writes DNA Across 100,000 Species


Insider Brief

  • Evo 2, an AI model developed by Arc Institute and collaborators, is trained on DNA from over 100,000 species to predict mutations and design synthetic genomes.
  • The model can analyze entire genomes, detect evolutionary patterns, and achieve over 90% accuracy in identifying harmful mutations in genes like BRCA1.
  • Evo 2 is open-source, integrated with NVIDIA’s BioNeMo framework, and designed with safety measures to prevent misuse involving human-infecting pathogens.
  • Image: Evo 2 is trained on over 9.3 trillion tokens (in this case, nucleotides) from over 128 thousand genomes across the three domains of life, making it similar in scale to the most powerful generative AI large language models.

A new artificial intelligence model trained on DNA from over 100,000 species can predict disease-causing mutations and design synthetic genomes, a major step in using AI for genetic research.

Researchers at Arc Institute, working with NVIDIA and academic collaborators from Stanford, UC Berkeley, and UCSF, announced on the institute’s blog that they have developed Evo 2, an AI model trained on more than 9.3 trillion nucleotides, the building blocks of DNA and RNA, spanning the entire tree of life. Unlike its predecessor, Evo 1, which focused on single-cell organisms, Evo 2 incorporates genetic data from a broad range of species, including humans and plants. The team is making the model and its training data fully open source, providing researchers worldwide with a new tool for studying evolution, disease, and synthetic biology.
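
To make the idea of nucleotides as training tokens concrete, here is a minimal sketch of single-nucleotide tokenization. The vocabulary, the integer IDs, and the function names are assumptions made for illustration; the article does not describe the tokenizer Evo 2 actually uses.

```python
# Hypothetical vocabulary for illustration only; not Evo 2's actual tokenizer.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}  # N stands for an unknown base


def tokenize(sequence: str) -> list[int]:
    """Map each nucleotide to one integer token (one token per base)."""
    return [VOCAB.get(base, VOCAB["N"]) for base in sequence.upper()]


def detokenize(tokens: list[int]) -> str:
    """Invert tokenize(): turn integer tokens back into a DNA string."""
    inverse = {index: base for base, index in VOCAB.items()}
    return "".join(inverse[token] for token in tokens)


tokens = tokenize("ATGCGT")
print(tokens)              # [0, 3, 2, 1, 2, 3]
print(detokenize(tokens))  # ATGCGT
```

Under this scheme a sequence of six bases becomes six tokens, which is why a token count and a nucleotide count can be used interchangeably.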

“Our development of Evo 1 and Evo 2 represents a key moment in the emerging field of generative biology, as the models have enabled machines to read, write, and think in the language of nucleotides,” said Patrick Hsu, Arc Institute co-founder and Arc Core Investigator, an Assistant Professor of Bioengineering and Deb Faculty Fellow at the University of California, Berkeley, and a co-senior author on the Evo 2 preprint. “Evo 2 has a generalist understanding of the tree of life that’s useful for a multitude of tasks, from predicting disease-causing mutations to designing potential code for artificial life. We’re excited to see what the research community builds on top of these foundation models.”

Evo 2, described in a preprint posted to bioRxiv, significantly improves on its predecessor, Evo 1. It can analyze entire genomes, up to a million nucleotides at a time, and recognize evolutionary patterns that would take human researchers years to uncover.
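
A limit of a million nucleotides at a time suggests that longer genomes would have to be split into windows before a model could read them. The sketch below shows one simple way to do that. The helper name `windows`, the `overlap` parameter, and the splitting strategy are assumptions for illustration, not a description of how Arc’s tooling handles long sequences.

```python
from typing import Iterator

MAX_CONTEXT = 1_000_000  # nucleotides per window, the figure cited in the article


def windows(genome: str, size: int = MAX_CONTEXT,
            overlap: int = 0) -> Iterator[tuple[int, str]]:
    """Yield (start, chunk) pairs that cover the genome in order.

    Hypothetical helper: consecutive chunks share `overlap` nucleotides so
    that patterns spanning a boundary appear whole in at least one chunk.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    step = size - overlap
    for start in range(0, len(genome), step):
        yield start, genome[start:start + size]
        if start + size >= len(genome):
            break  # the last chunk already reaches the end of the genome


# Small sizes keep the example readable; real use would keep the default size.
for start, chunk in windows("ACGTACGTAC", size=4, overlap=1):
    print(start, chunk)
```

With the toy arguments shown, the ten-base sequence is covered by three chunks starting at positions 0, 3, and 6.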

“Just as the world has left its imprint on the language of the Internet used to train large language models, evolution has left its imprint on biological sequences,” said the preprint’s other co-senior author Brian Hie, an Assistant Professor of Chemical Engineering at Stanford University, the Dieter Schwarz Foundation Stanford Data Science Faculty Fellow, and Arc Institute Innovation Investigator in Residence, in the post. “These patterns, refined over millions of years, contain signals about how molecules work and interact.”

The AI model is already proving useful in analyzing genetic mutations. In tests involving BRCA1, a gene associated with breast cancer, Evo 2 correctly classified more than 90% of mutations as either benign or potentially harmful. This capability could accelerate medical research by reducing the need for costly and time-consuming cell and animal experiments.
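
The article does not say how Evo 2 arrives at these classifications. One common approach with sequence language models is to compare how likely the model finds the mutated sequence relative to the reference, then measure accuracy against known labels. The sketch below illustrates that idea with a toy bigram model standing in for Evo 2. Every function name here is hypothetical and none of it is Evo 2’s API.

```python
import math
from collections import Counter


def train_bigram(sequences, alphabet="ACGT"):
    """Fit a toy bigram model with add-one smoothing (a stand-in for Evo 2)."""
    pair_counts, context_counts = Counter(), Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            pair_counts[a, b] += 1
            context_counts[a] += 1
    return {
        (a, b): math.log((pair_counts[a, b] + 1) / (context_counts[a] + len(alphabet)))
        for a in alphabet
        for b in alphabet
    }


def log_likelihood(model, seq):
    """Sum of log-probabilities of each base given the previous base.

    Assumes seq contains only A, C, G, and T.
    """
    return sum(model[a, b] for a, b in zip(seq, seq[1:]))


def variant_score(model, reference, position, alt_base):
    """Log-likelihood of the mutated sequence minus that of the reference.

    A strongly negative score means the model finds the variant unusual.
    """
    mutated = reference[:position] + alt_base + reference[position + 1:]
    return log_likelihood(model, mutated) - log_likelihood(model, reference)


def accuracy(predictions, labels):
    """Fraction of predictions that match the known labels."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)


model = train_bigram(["ATGATGATGATG"])
score = variant_score(model, "ATGATGATG", 4, "C")  # T -> C at position 4
print(score < 0)  # True: the toy model has never seen "AC" or "CG"
```

In practice, scores like this would be thresholded into benign or potentially harmful calls and compared against experimentally labeled variants using something like `accuracy`; the threshold itself would have to be chosen on held-out data.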

Beyond disease research, Evo 2 could enable more precise genetic engineering. “…if you have a gene therapy that you want to turn on only in neurons to avoid side effects, or only in liver cells, you could design a genetic element that is only accessible in those specific cells,” said co-author and computational biologist Hani Goodarzi, an Arc Core Investigator and an Associate Professor of Biochemistry and Biophysics at the University of California, San Francisco. “This precise control could help develop more targeted treatments with fewer side effects.”

To develop Evo 2, the research team trained the model for months on NVIDIA’s DGX Cloud AI platform using more than 2,000 NVIDIA H100 GPUs. OpenAI’s Greg Brockman, during a sabbatical, helped optimize the model’s underlying architecture, called StripedHyena 2, which allowed Evo 2 to be trained on 30 times more data than its predecessor.

The AI model is being released with a user-friendly interface called Evo Designer, making it accessible to a wide range of researchers. The code is available on Arc’s GitHub and integrated into NVIDIA’s BioNeMo framework to facilitate broader adoption in biomedical and synthetic biology research.

“In a loose way, you can think of the model almost like an operating system kernel—you can have all of these different applications that are built on top of it,” said Arc’s Chief Technology Officer Dave Burke, a co-author on the preprint. “From predicting how single DNA mutations affect a protein’s function to designing genetic elements that behave differently in different cell types, as we continue to refine the model and researchers begin using it in creative ways, we expect to see beneficial uses for Evo 2 we haven’t even imagined yet.”

The development team also implemented safety measures to prevent misuse. Pathogens that infect humans and other complex organisms were excluded from the training data, and the model is designed to avoid generating responses related to these pathogens. Co-author Tina Hernandez-Boussard, a Stanford Professor of Medicine, and members of her lab helped the team implement responsible development and deployment of the technology.

“Evo 2 has fundamentally advanced our understanding of biological systems,” said Anthony Costa, director of digital biology at NVIDIA. “By overcoming previous limitations in the scale of biological foundation models with a unique architecture and the largest integrated dataset of its kind, Evo 2 generalizes across more known biology than any other model to date — and by releasing these capabilities broadly, the Arc Institute has given scientists around the world a new partner in solving humanity’s most pressing health and disease challenges.”

The research team expects Evo 2 to be the basis for further advancements in AI-driven biology. Scientists will now have the ability to predict how genetic mutations affect proteins, design genetic sequences with precise functionality, and potentially create synthetic life forms tailored for specific applications. As researchers begin working with Evo 2, new discoveries in medicine, biotechnology, and evolutionary science are likely to emerge.

Visit bioRxiv to access the related preprint “Genome modeling and design across all domains of life with Evo 2” and the sister machine learning paper “Systems and Algorithms for Convolutional Multi-Hybrid Language Models at Scale.”

The Arc Institute is an independent nonprofit research organization located in Palo Alto, California, that aims to accelerate scientific progress and understand the root causes of complex diseases. Arc’s institutional model gives scientists complete freedom to pursue curiosity-driven research agendas and fosters deep interdisciplinary collaboration.
