Hello folks,
Masked diffusion is currently the dominant paradigm among diffusion LLMs. But what if uniform-state discrete diffusion scales better and leads to better downstream performance?
This Monday, Dimitri von Rütte (ETH) and Zhihan Yang (Cornell) will jointly present the papers:
[1] Scaling Behavior of Discrete Diffusion Language Models
[2] Scaling Beyond Masked Diffusion Language Models
which explore the scaling behavior of uniform-state d-LLMs and of AR-MDLM hybrids (Eso-LMs).
Meeting Link: click here
Time: Feb 23 (Monday) 1pm ET / 10am PT / 7pm CET / 11:30pm IST
Papers:
[1] https://arxiv.org/abs/2512.10858
[2] https://www.arxiv.org/abs/2602.15014
Prior knowledge:
Fundamentals of discrete diffusion (video by Sasha Rush)
Uniform-state discrete diffusion (video by Yair Schiff)
The Diffusion Duality (video by our reading group)
Esoteric Language Models (video by our reading group)
Abstract of [1]: Modern LLM pre-training consumes vast amounts of compute and training data, making the scaling behavior, or scaling laws, of different models a key distinguishing factor. Discrete diffusion language models (DLMs) have been proposed as an alternative to autoregressive language models (ALMs). However, their scaling behavior has not yet been fully explored, with prior work suggesting that they require more data and compute to match the performance of ALMs. We study the scaling behavior of DLMs on different noise types by smoothly interpolating between masked and uniform diffusion while paying close attention to crucial hyperparameters such as batch size and learning rate. Our experiments reveal that the scaling behavior of DLMs strongly depends on the noise type and is considerably different from ALMs. While all noise types converge to similar loss values in compute-bound scaling, we find that uniform diffusion requires more parameters and less data for compute-efficient training compared to masked diffusion, making them a promising candidate in data-bound settings. We scale our uniform diffusion model up to 10B parameters trained for 10^22 FLOPs, confirming the predicted scaling behavior and making it the largest publicly known uniform diffusion model to date.
Abstract of [2]: Diffusion language models are a promising alternative to autoregressive models due to their potential for faster generation. Among discrete diffusion approaches, Masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks. In this work, we present the first scaling law study of uniform-state and interpolating discrete diffusion methods. We also show that Masked diffusion models can be made approximately 12% more FLOPs-efficient when trained with a simple cross-entropy objective. We find that perplexity is informative within a diffusion family but can be misleading across families, where models with worse likelihood scaling may be preferable due to faster and more practical sampling, as reflected by the speed-quality Pareto frontier. These results challenge the view that Masked diffusion is categorically the future of diffusion language modeling and that perplexity alone suffices for cross-algorithm comparison. Scaling all methods to 1.7B parameters, we show that uniform-state diffusion remains competitive on likelihood-based benchmarks and outperforms autoregressive and Masked diffusion models on GSM8K, despite worse validation perplexity. We provide the code, model checkpoints, and video tutorials on the project page: this http URL.
Yours truly,
Subham, Justin, Zhihan
Website, Twitter, Discord, YouTube