Diffusion Language Models: Stefano Ermon's 10x Bet
Inception's Stefano Ermon on why diffusion language models match frontier quality at 10x the speed — and why startups ship them first.
Same model quality, 10x faster. That's the result Stefano Ermon's Stanford group put on the table in 2024, and it won the best paper award at ICML, the top machine learning conference in the world. They took a GPT-2-sized model, trained one copy the usual autoregressive way and one as a diffusion model, and showed the diffusion version matched the quality while running ten times faster.
That paper is why Ermon left his Stanford CS post to run Inception Labs. We sat down with him to walk through the research backstory, what a diffusion language model actually does differently, and why he thinks the next generation of LLMs won't be built the way Claude, GPT, and Gemini are built today.
The model that edits instead of guessing the next word
Almost every language model you've used is autoregressive: a neural network trained to predict the next token, producing the answer left to right, one token at a time. Ermon's description is blunt. "It's painfully slow, it's very sequential," he said. Sequential computation maps badly onto GPUs, which are built for parallel work, so you get low utilization and a bottleneck you can't easily engineer your way out of.
Diffusion takes the opposite path. It's the technology behind Sora and Midjourney, where an image isn't drawn one pixel at a time but emerges from noise through iterative refinement, coarse to fine, edit after edit, until the picture is crisp. Ermon's lab pushed score-based diffusion for images back in 2019, when the field was still dominated by GANs that were, in his words, "so hard to train, they're very unstable." Diffusion went on to take over image and video generation. The open question was whether the same idea could work on text.
It mostly hadn't. The math behind diffusion is continuous — gradients, differential equations — and text is discrete, so mapping the theory across was, as Ermon put it, "non-trivial." The 2024 paper was the breakthrough that made it competitive with autoregressive models for the first time.
The mechanism is the interesting part. A diffusion language model isn't trained to predict the next token. It's trained to edit, to fix mistakes. You start with a rough guess of the whole answer and refine it. Ermon reached for how he writes a research paper: not left to right, one word at a time, but structure first, the section headings, then filling in the details. "You have a more holistic view before actually working on the details one by one," he said. The host's analogy stuck: it's like sketching a blueprint, scratching it, iterating internally, then outputting the final version.
That structure buys two things autoregression can't. First, the model can look at context to both the left and the right of where it's writing, which is exactly what you need for code autocomplete and infilling, where the surrounding lines matter as much as what came before. Ermon said Mercury Coder is already deployed in several IDEs and performing well on next-edit suggestions for that reason. Second, the model has built-in error correction. An autoregressive model, once it emits a token, can never take it back. Ermon framed that as a control-theory problem: these models are trained through behavioral cloning, an open-loop process where compounding errors can become "very, very serious issues."
Pushing the speed-quality-cost frontier
Every LLM decision comes down to three things, Ermon said: speed, cost, and quality. Normally they trade off. Want the highest quality? Use the largest model, which is expensive to serve and slow because it's sequential. Want speed? Use a small model, which is fast and cheap, but the quality drops. For autoregressive models, that Pareto frontier is a wall you can't escape.
Inception's bet is that a fundamentally more parallel architecture moves the wall. On community benchmarks that measure how often the model gets the right answer and how good the code is, Ermon said Inception's models match the speed-optimized tier from the frontier labs — Anthropic's Haiku, Google's Flash, OpenAI's Nano — while running significantly faster. Mercury runs around 1,000 tokens per second per user, which he put at 5x to 10x faster than many comparable models.
Speed isn't only about latency. Higher GPU utilization means more output for the same dollar spent on hardware, so the models can be cheaper too. The sharper claim: Inception says it can match the speeds you'd normally need specialized inference chips for, the kind from Cerebras or Groq, on widely available Nvidia GPUs. (Nvidia is an investor in the company, and Ermon said they work closely to optimize for GPUs.) Asked in the rapid-fire round for the most overhyped thing in AI, his answer was those inference chips. "Eventually GPUs are going to dominate." The most underrated thing? Diffusion language models, no surprise there.
Why a startup, not OpenAI, ships this
If diffusion language models are this promising, why isn't a frontier lab building them? Ermon's read is that it's rational for the incumbents not to. When the first GPT models were trained, it wasn't obvious any generative approach would work, and the early autoregressive results were so promising that scaling them up was the right move. They have something proven. Diverting resources to a different architecture is technical risk and opportunity cost they don't need to take.
That's the opening. "As a startup, you're trying to disrupt — you'll never catch up if you follow the same recipe," he said. "You've got to take a different path, you've got to leapfrog." He pointed to the pattern where the first solution to a problem often isn't the one that wins; a second or third mover comes in with something fundamentally better. That's the bet Inception is making: that the future of LLMs is diffusion-based.
He's careful about where the frontier sits today. The next generation of Mercury models is being trained to add reasoning, which he framed as a step change in quality. But reasoning in a diffusion model can look different from the long thinking traces autoregressive models produce. Because the model can revise its own scratch pad, reasoning can happen "in place," refining an answer rather than endlessly extending a trace and re-reading the whole history each step. He's also clear-eyed on the bigger question. Asked what AGI means to him, Ermon kept it concrete and measurable: a functional replacement for what a person can do, measured in economic value. On whether text alone gets us there or we need to learn from images, video, and the physical world, "the jury is still out."
The honest version of this bet is that the wedge today is speed: real-time voice agents, customer support, coding assistants where a slow answer is a dead product. Inception ships an OpenAI-compatible API, so a developer can swap in a diffusion model by changing two lines of code. Ermon thinks the more interesting uses are still ahead — goal-directed reasoning that works backward from an outcome, controllable generation that steers toward a brand voice or a safety constraint. Whether the architecture wins broadly is unproven. But the ICML result is real, the customers are real, and the trade-off he's attacking is one every team building on LLMs feels.
Watch the full conversation with Stefano Ermon on YouTube, or browse more episodes of The QAI Podcast.