Session 5 // Jan 12: Esoteric Language Models

Diffusion LLM

Jan 6, 2026, 3:38:01 PM

to diffus...@googlegroups.com

Hello folks,

Discrete diffusion models are becoming a compelling alternative to autoregressive (AR) models, but current masked diffusion models (MDMs) still lag in perplexity and lack efficient inference features such as KV caching.

This Monday, Zhihan Yang will present Esoteric Language Models (Eso-LMs), a family of models that fuses AR and masked diffusion paradigms. This is the first work that unlocks:

- Tractable single-pass likelihood estimation for MDMs, supporting RLVR;
- Exact likelihood computation for MDMs;
- Exact KV caching for MDMs while preserving parallel generation over full sequence lengths.

The project was co-led with Subham Sahoo.

Title: Esoteric Language Models

Meeting Link: click here

Time: Jan 12 (Monday) 1pm ET / 10am PT / 7pm CET / 11:30pm IST

Paper: https://arxiv.org/abs/2506.01928

Prior knowledge:

Fundamentals of discrete diffusion (video by Sasha Rush)

Abstract: Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation. Within this family, Masked Diffusion Models (MDMs) currently perform best but still underperform AR models in perplexity and lack key inference-time efficiency features, most notably KV caching. We introduce Eso-LMs, a new family of models that fuses AR and MDM paradigms, smoothly interpolating between their perplexities while overcoming their respective limitations. Unlike prior work, which uses transformers with bidirectional attention as MDM denoisers, we exploit the connection between MDMs and Any-Order autoregressive models and adopt causal attention. This design lets us compute the exact likelihood of MDMs for the first time and, crucially, enables us to introduce KV caching for MDMs while preserving parallel generation for the first time, significantly improving inference efficiency. Combined with an optimized sampling schedule, Eso-LMs achieves a new state of the art on the speed-quality Pareto frontier for unconditional generation. On long contexts, it yields 14−65× faster inference than standard MDMs and 3−4× faster inference than prior semi-autoregressive approaches. We provide code, model checkpoints, and video tutorials on the project page: https://s-sahoo.com/Eso-LMs/.
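For anyone who wants a picture of the any-order autoregressive view mentioned in the abstract before the talk, here is a minimal sketch. It is not taken from the paper or its code release. It only shows one generic way to build a causal attention mask for an arbitrary generation order, assuming each position may attend to itself and to positions that come earlier in that order. The function name `any_order_causal_mask` and the `order` argument are hypothetical, chosen for illustration.

```python
import numpy as np


def any_order_causal_mask(order):
    """Boolean attention mask for a given generation order (illustrative only).

    order[i] is the sequence position generated at step i.
    mask[p, q] is True when position p may attend to position q, i.e. when
    q is generated no later than p in the chosen order.
    """
    order = np.asarray(order)
    # rank[p] = step at which position p is generated (inverse permutation).
    rank = np.empty_like(order)
    rank[order] = np.arange(len(order))
    return rank[:, None] >= rank[None, :]


# Left-to-right order recovers the usual lower-triangular causal mask.
print(any_order_causal_mask([0, 1, 2]).astype(int))

# A shuffled order: position 2 is generated first, then 0, then 1.
print(any_order_causal_mask([2, 0, 1]).astype(int))
```

With a mask like this, keys and values computed for already-generated positions do not depend on later ones, which is the general property that makes KV caching possible under causal attention. How Eso-LMs actually realizes this is covered in the paper and the talk.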

Yours truly,

Subham, Justin, Zhihan

Website, Twitter, Discord, YouTube

Diffusion LLM

Jan 12, 2026, 11:30:29 AM

to diffus...@googlegroups.com

Gentle reminder: see you all today at 1pm ET / 10am PT / 7pm CET / 11:30pm IST.

Meeting Link: click here

Today's paper: https://arxiv.org/abs/2506.01928

Diffusion LLM

Jan 13, 2026, 7:27:00 AM

to Diffusion-llms

Hello folks, the recording of Zhihan's talk is now available on YouTube. Make sure to check it out: https://www.youtube.com/watch?v=MsCm-aos7jE
