Deep Learning Model Maps the Full Spectrum of Molecular Interactions

Insider Brief

  • A new AI model called ATOMICA can predict how different types of molecules interact by learning from more than two million molecular structures.
  • The model outperforms others by learning general chemical rules across biomolecular types, improving predictions even in underrepresented categories like protein-DNA interactions.
  • ATOMICA’s interaction-based networks help identify disease-relevant proteins and uncover functions in the “dark proteome,” offering a new tool for studying biology and drug discovery.

A new artificial intelligence model trained on over two million molecular structures has achieved what other machine learning models have struggled to do: learn general rules of molecular interaction that work across all types of biomolecules—from metal ions to proteins to DNA.

The model, called ATOMICA, is detailed in a preprint posted to bioRxiv by researchers from Harvard University and the Broad Institute. It uses geometric deep learning to represent how molecules bind and interact, a process at the heart of nearly every biological function and drug mechanism. Unlike past models, which typically specialize in just one type of interaction — like protein-ligand or protein-protein — ATOMICA can learn across modalities.

“ATOMICA generalizes across molecular modalities by leveraging a pretraining dataset of 2,037,972 interaction complexes,” the researchers write, adding that the cross-domain generalizability improves representation quality in low-data modalities.

In other words, ATOMICA learns how different types of molecules stick together by studying millions of examples, which helps it make better guesses even when there’s not much data for a specific type.

The model is trained to recognize patterns in how atoms interact across a molecular interface, including noncovalent interactions such as hydrogen bonding, π-stacking, and van der Waals forces. It does this through a hierarchical graph representation that includes atomic positions, chemical groups and entire molecular interfaces.
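
As a rough illustration of what a hierarchical interface graph could look like, the Python sketch below groups atoms into chemical blocks and links nearby items at both levels. It is not ATOMICA's implementation: the class names, the distance cutoffs, and the toy coordinates are all assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import combinations
from math import dist


@dataclass
class Atom:
    element: str
    xyz: tuple  # (x, y, z) coordinates in angstroms


@dataclass
class Block:
    name: str      # chemical group, e.g. a residue or an ion
    molecule: str  # which binding partner the group belongs to
    atoms: list

    def centroid(self):
        n = len(self.atoms)
        return tuple(sum(a.xyz[k] for a in self.atoms) / n for k in range(3))


def edges_within(points, cutoff):
    """Index pairs (i, j), i < j, whose points lie within `cutoff` of each other."""
    indexed = list(enumerate(points))
    return [(i, j) for (i, p), (j, q) in combinations(indexed, 2)
            if dist(p, q) <= cutoff]


def build_hierarchy(blocks, atom_cutoff=4.0, block_cutoff=8.0):
    """Two-level graph: atom-level edges plus block-level edges (cutoffs are assumed)."""
    atoms = [a for b in blocks for a in b.atoms]
    return {
        "atoms": atoms,
        "atom_edges": edges_within([a.xyz for a in atoms], atom_cutoff),
        "blocks": blocks,
        "block_edges": edges_within([b.centroid() for b in blocks], block_cutoff),
    }


# Toy interface with made-up coordinates: a zinc ion next to part of a cysteine.
zinc = Block("ZN", "ion", [Atom("Zn", (0.0, 0.0, 0.0))])
cys = Block("CYS", "protein", [Atom("S", (2.3, 0.0, 0.0)), Atom("C", (5.0, 0.0, 0.0))])
interface = build_hierarchy([zinc, cys])
```

In this toy case the atom level links zinc to sulfur and sulfur to carbon, while the block level records a single contact between the ion and the residue.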

A Universal Language of Molecular Binding

According to the paper, this model represents a step change in how molecules are treated in machine learning. In traditional machine learning for biology, models often treat molecules in isolation or in narrowly defined tasks. ATOMICA, however, breaks that mold by learning a shared latent space where chemically similar interactions cluster together, even when those interactions span vastly different molecular types.

The study finds that ATOMICA’s representations reflect real-world biochemistry. For example, when visualized using dimensionality reduction techniques, the model’s embeddings of atomic elements arrange themselves in a pattern that mirrors the periodic table. It can also infer chemical roles: amino acids involved in actual binding are ranked highly by the model, despite ATOMICA never being explicitly told what a binding site is.

To benchmark ATOMICA’s learning, the researchers masked out parts of a molecular interface and asked the model to guess the missing pieces. Models trained on a single type of interaction performed poorly on this task when asked to generalize. ATOMICA, trained across all types of interactions, consistently outperformed them. In fact, it showed up to a 190% improvement in recovering masked elements in protein-DNA interactions.
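
The masking benchmark and the percentage figure can be made concrete with a small Python sketch. It assumes a simple setup in which a fraction of interface building blocks is hidden and a predictor is scored on how many it recovers. The function names, the mask token, the example labels, and the accuracy numbers are hypothetical, and the preprint's exact metric may differ.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token


def mask_blocks(labels, fraction, rng):
    """Hide a random fraction of block labels; return the masked list and hidden indices."""
    k = max(1, round(fraction * len(labels)))
    hidden = sorted(rng.sample(range(len(labels)), k))
    masked = [MASK if i in hidden else label for i, label in enumerate(labels)]
    return masked, hidden


def recovery_accuracy(labels, predictions, hidden):
    """Share of hidden positions where the prediction matches the true label."""
    correct = sum(predictions[i] == labels[i] for i in hidden)
    return correct / len(hidden)


def relative_improvement(baseline, new):
    """Percent improvement of `new` over `baseline`."""
    return 100.0 * (new - baseline) / baseline


# Made-up protein-DNA interface: amino acids and nucleotides at the contact surface.
labels = ["ARG", "DA", "DT", "LYS", "DG", "SER"]
masked, hidden = mask_blocks(labels, fraction=0.5, rng=random.Random(0))

# Illustrative numbers only: a 190% improvement means 2.9 times the baseline score.
gain = relative_improvement(baseline=0.10, new=0.29)
```

Read this way, an improvement of 190% means the cross-modality model recovers nearly three times as many masked elements as the baseline, not merely twice as many.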

Mapping Disease Through the ‘Interfaceome’

The model’s applications go beyond representation learning. The team built five large-scale “interfaceome” networks — collections of human proteins connected by similarity in how they bind to ions, small molecules, nucleic acids, lipids, or other proteins.

These networks, called ATOMICANETs, revealed disease-specific patterns. For instance, in the lipid-binding network, a cluster of asthma-related proteins formed a tight-knit neighborhood of similar interaction profiles. In the ion-binding network, the team highlighted a myeloid leukemia cluster that included well-known cancer-associated genes like TET2 and zinc finger proteins.
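
One simple way to build a network of this kind is to connect two proteins whenever the embeddings of their binding interfaces are similar enough. The Python sketch below uses cosine similarity with a fixed threshold. The threshold, the two-dimensional vectors, and the protein labels are assumptions for illustration, and the authors' actual construction may differ.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def similarity_network(embeddings, threshold=0.9):
    """Undirected graph (as adjacency sets) linking proteins with similar embeddings."""
    graph = {name: set() for name in embeddings}
    for a, b in combinations(embeddings, 2):
        if cosine(embeddings[a], embeddings[b]) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Hypothetical interface embeddings for three placeholder proteins.
embeddings = {
    "P1": (1.0, 0.0),
    "P2": (0.9, 0.1),
    "P3": (0.0, 1.0),
}
network = similarity_network(embeddings)
```

Here P1 and P2 end up connected because their vectors point in nearly the same direction, while P3 stays isolated.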

Using diffusion algorithms, which are methods that spread information through a network to find patterns or connections, ATOMICA also predicted novel disease-associated proteins with high accuracy. In multiple sclerosis and autoimmune peripheral neuropathy, for example, the model identified voltage-gated potassium channels like KCNA1 and KCNA2 — known players in neuron signaling — with hit rates of 100%.
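
A common example of such a diffusion method is a random walk with restart, sketched below in Python. Starting from proteins already linked to a disease (the seeds), probability mass repeatedly flows to network neighbors while a fixed share returns to the seeds, and the final scores rank the remaining proteins. The article does not say which diffusion algorithm the authors used, so the algorithm choice, the restart probability, and the toy graph are assumptions.

```python
def random_walk_with_restart(graph, seeds, restart=0.3, iterations=50):
    """Score every node by how much probability mass diffuses to it from the seeds."""
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    scores = dict(start)
    for _ in range(iterations):
        # A fixed share of the mass always jumps back to the seed nodes.
        nxt = {n: restart * start[n] for n in graph}
        for n, neighbors in graph.items():
            if not neighbors:
                # An isolated node keeps whatever mass it holds.
                nxt[n] += (1.0 - restart) * scores[n]
                continue
            share = (1.0 - restart) * scores[n] / len(neighbors)
            for m in neighbors:
                nxt[m] += share
        scores = nxt
    return scores


# Toy network: a chain A-B-C-D plus an unconnected node E (placeholder labels).
graph = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
    "E": set(),
}
scores = random_walk_with_restart(graph, seeds={"A"})
candidates = sorted((n for n in graph if n != "A"), key=scores.get, reverse=True)
```

With A as the only seed, the candidates come out ordered by network proximity (B, then C, then D), and the disconnected node E receives a score of zero.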

The results illustrate how each molecular modality captures distinct yet complementary aspects of disease biology, according to the team.

Uncovering the ‘Dark Proteome’

Perhaps the most forward-looking application of ATOMICA is its ability to assign function to the so-called dark proteome: the many proteins whose functions are unknown because they don’t resemble anything well studied.

By fine-tuning the model to predict specific ion or cofactor binding (like magnesium or heme), the team annotated over 2,600 previously uncharacterized binding sites. Some mapped to ancient and evolutionarily conserved protein families. In one example, ATOMICA identified a probable C4 zinc finger domain — a structural motif used in DNA binding — on a bacterial protein that had no previously known function.

They validated their predictions using AlphaFold3 to visualize the full protein-ligand complex, showing that ATOMICA’s annotations aligned with known biochemical motifs and structural scores.

Limitations and Outlook

While powerful, ATOMICA depends on high-quality structural data—something that isn’t always available, especially for flexible or disordered protein regions, such as those in antibodies or intrinsically disordered proteins.

The team notes this may not be a permanent limitation.

“However, recent advances in structure prediction, such as AlphaFold3 and RoseTTAFold, can generate high-confidence structures across a wide range of biomolecular modalities, including protein-ligand and protein-nucleic acid interactions,” they write.

Future work may integrate sequence-based features or even data from non-structural high-throughput assays.

Still, ATOMICA represents a major step toward what many researchers see as the holy grail: a foundation model that captures the full complexity of life’s molecular interactions.

The team adds: “As efforts toward a virtual cell intensify, approaches such as ATOMICA will be necessary to model the full spectrum of molecular interactions, including protein-ion, protein-small molecule, protein-protein, and protein-nucleic acid contacts, at atomic resolution. Extending ATOMICA to integrate sequence-derived and experimental interaction evidence will improve its applicability in capturing molecular interactions in biological systems.”

Researchers include Ada Fang, Zaixi Zhang, Andrew Zhou and Marinka Zitnik.

Preprint servers, such as bioRxiv, offer a way for scientists to gather immediate feedback from their peers. However, work posted there has not been formally peer-reviewed, and peer review is an important step in the scientific process.
