Session 20: Entropy-Gated Continuous Bitstream Diffusion for Language

Diffusion LLM

unread,

Jun 5, 2026, 2:31:32 PMJun 5

to diffus...@googlegroups.com

Hello folks,

Diffusion language models (DLMs) promise parallel, order-agnostic generation, but on standard benchmarks they have historically lagged behind autoregressive models in sample quality and diversity.

Recent continuous flow and diffusion approaches over token embeddings have narrowed this gap, suggesting continuous state spaces are highly effective for language. In this work, the authors further close the autoregressive gap by modeling text as a continuous diffusion process over fixed-width binary bitstreams.

Their approach represents semantic tokens as analog bit sequences and utilizes a matched-filter residual parameterization to isolate contextual learning from analytic independent-bit posteriors. Crucially, they adopt a stochastic sampler that applies Langevin-type corrections gated by the entropy-rate profile, automatically concentrating stochasticity in high-information regions while remaining nearly deterministic elsewhere.

On the One Billion Word Benchmark (LM1B), their 130M-parameter bitstream model reaches a generative perplexity (Gen. PPL) of 59.76 at matched real-data entropy (4.31) using 256 neural function evaluations (NFEs), decisively outperforming prior DLM baselines and reaching the autoregressive reference. On OpenWebText (OWT), the authors' stochastic sampler establishes a new continuous-DLM Pareto frontier, achieving Gen. PPL = 27.06 at an entropy of 5.26 using 4× fewer steps than previous 1024-NFE baselines.

As an additional architectural benefit, bitstream diffusion removes the O(V) vocabulary scaling bottleneck shared by standard DLMs. By predicting O(log V) bitwise logits via semantic bit-patching, the model yields a reduced memory footprint and higher throughput, demonstrating a scalable paradigm for language generation as vocabulary sizes grow.

This Monday, Georgios Batzolis will present his paper “Towards Closing the Autoregressive Gap in Language Modeling via Entropy-Gated Continuous Bitstream Diffusion”.

Title: Entropy-Gated Continuous Bitstream Diffusion for Language

Meeting Link: click here

Time: June 8 (Monday) 1pm ET / 10am PT / 7pm CET / 10:30pm IST

Paper: [2605.07013] Towards Closing the Autoregressive Gap in Language Modeling via Entropy-Gated Continuous Bitstream Diffusion

Prior knowledge:

Fundamentals of discrete diffusion (video by Sasha Rush)

The Diffusion Duality (video by our reading group)

Abstract:

Diffusion language models (DLMs) promise parallel, order-agnostic generation, but on standard benchmarks they have historically lagged behind autoregressive models in sample quality and diversity. Recent continuous flow and diffusion approaches over token embeddings have narrowed this gap, suggesting continuous state spaces are highly effective for language. In this work, we further close the autoregressive gap by modeling text as a continuous diffusion process over fixed-width binary bitstreams. Our approach represents semantic tokens as analog bit sequences and utilizes a matched-filter residual parameterization to isolate contextual learning from analytic independent-bit posteriors. Crucially, we adopt a stochastic sampler that applies Langevin-type corrections gated by the entropy-rate profile, automatically concentrating stochasticity in high-information regions while remaining nearly deterministic elsewhere. On the One Billion Word Benchmark (LM1B), our 130M-parameter bitstream model reaches a generative perplexity (Gen. PPL) of 59.76 at matched real-data entropy (4.31) using 256 neural function evaluations (NFEs), decisively outperforming prior DLM baselines and reaching the autoregressive reference. On OpenWebText (OWT), our stochastic sampler establishes a new continuous-DLM Pareto frontier, achieving Gen. PPL = 27.06 at an entropy of 5.26 using 4× fewer steps than previous 1024-NFE baselines. As an additional architectural benefit, bitstream diffusion removes the O(V) vocabulary scaling bottleneck shared by standard DLMs. By predicting O(log V) bitwise logits via semantic bit-patching, our model yields a reduced memory footprint and higher throughput, demonstrating a scalable paradigm for language generation as vocabulary sizes grow.

Yours truly,

Subham, Justin, Zhihan

Website, Twitter, Discord, YouTube

Diffusion LLM

unread,

Jun 8, 2026, 12:00:42 PMJun 8

to diffus...@googlegroups.com

This is happening in 1 hour!!

Gentle reminder: See you all at 1pm ET / 10am PT / 7pm CET / 10:30pm IST

Meeting Link: click here

Today's paper: [2605.07013] Towards Closing the Autoregressive Gap in Language Modeling via Entropy-Gated Continuous Bitstream Diffusion

Diffusion LLM

unread,

Jun 8, 2026, 4:34:57 PMJun 8

to Diffusion-llms

Hi folks, we just uploaded the recording of today's session, make sure to check it out: https://www.youtube.com/watch?v=x1chemjCwB8

Reply all

Reply to author

Forward