Simulating 500 million years of evolution with ESM3 — how we turn sequences into signals

EvolutionaryScale’s ESM3 can simulate evolution and generate proteins that never existed. Here’s how our annotators give those sequences meaning: connecting model tokens to 3D surface features and turning simulated biology into testable hypotheses for the dark proteome.

Shawnak Shivakumar

7/8/2024 · 1 min read

ESM3, built by EvolutionaryScale (the team behind Meta’s earlier ESM models), treats evolution like a language: it “reads” and generates protein sequences while reasoning jointly about structure and function. In their preprint, the team shows how a frontier language model can simulate evolutionary trajectories and propose sequences that fold and function, an idea that sounded like science fiction just a few years ago. (bioRxiv)

We’ve been pairing ESM-style embeddings with our annotation tasks: contributors label motifs, pockets, and surface features in 3D structures, and we then align those labels to the corresponding sequence tokens. That linkage does two things: (1) it helps us interpret what a token or attention head might be “looking at” on the 3D surface, and (2) it gives us a way to prioritize dark proteins whose sequences suggest interesting function but lack any structural annotation. Early experiments with all-atom protein language model variants (ESM All-Atom) make this mapping even more direct by connecting sequence signals to atomic neighborhoods.
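To make that linkage concrete, here’s a minimal sketch of the label-to-token alignment and the dark-protein prioritization idea. It uses the open fair-esm package with an ESM-2 checkpoint as a stand-in for ESM-style embeddings (ESM3 ships through a different SDK), and the toy sequences, the residue_labels dict, and the cosine-similarity heuristic are illustrative assumptions, not our production pipeline.

```python
import torch
import esm  # pip install fair-esm

# Load an open ESM-2 checkpoint as a stand-in for ESM-style embeddings.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

# One annotated protein and one "dark" protein (toy sequences).
annotated = ("annotated_prot", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
dark = ("dark_prot", "MADEEKLPPGWEKRMSRSSGRVYYFNHITNASQ")

# Hypothetical contributor labels from the 3D structure:
# residue index -> surface feature.
residue_labels = {3: "pocket", 4: "pocket", 11: "catalytic", 20: "interface"}

def per_residue_embeddings(record):
    """Embed one (name, sequence) record; rows align to residue indices."""
    _, _, tokens = batch_converter([record])
    with torch.no_grad():
        out = model(tokens, repr_layers=[33])
    reps = out["representations"][33][0]
    # fair-esm prepends BOS and appends EOS, so residue i is token i + 1.
    return reps[1 : len(record[1]) + 1]

ann_reps = per_residue_embeddings(annotated)
dark_reps = per_residue_embeddings(dark)

# (1) Join structural labels onto sequence-token embeddings.
labeled = {i: (lab, ann_reps[i]) for i, lab in residue_labels.items()}

# (2) Toy prioritization: score each dark-protein residue by its best
# cosine similarity to a labeled pocket residue; high scores flag
# "pocket-like" sequence signal worth routing to annotators.
pocket_reps = torch.stack([v for lab, v in labeled.values() if lab == "pocket"])
sims = torch.nn.functional.cosine_similarity(
    dark_reps.unsqueeze(1), pocket_reps.unsqueeze(0), dim=-1
)
scores = sims.max(dim=1).values
top5 = torch.topk(scores, k=5).indices.tolist()
print("most pocket-like residues in dark protein:", sorted(top5))
```

The one detail that reliably trips people up is the BOS offset: fair-esm prepends a beginning-of-sequence token, so structural residue i lines up with embedding row i + 1, which is what the slice in per_residue_embeddings accounts for.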

In short: language models can imagine plausible proteins; our community helps decide which ones matter by grounding tokens in pockets, interfaces, and catalytic geometry that real biology cares about.