Abstract: In the current age of deep learning, more compute typically means better performance. However, alternative strategies have emerged for training smaller models more efficiently by introducing structured supervision during training. In this talk, I’ll explore how synthetic testbeds help uncover why such methods are effective and reveal the role of curriculum in accelerating learning.
I will present two recent works. The first investigates progressive distillation, where student models learn not only from a final teacher checkpoint but also from its intermediate checkpoints. Using sparse parity as a testbed, we identify an implicit curriculum that is available only through these intermediate checkpoints, yielding both empirical speedups and provable sample-complexity gains. We extend the underlying curriculum ideas to pre-training transformers on real-world datasets (Wikipedia and Books), where intermediate checkpoints are found to progressively capture longer-range context dependencies.
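To make the first setup concrete, here is a minimal PyTorch sketch of progressive distillation on sparse parity. It is illustrative only: the two-layer MLPs, hyperparameters, checkpoint schedule, and the support set S are hypothetical placeholders rather than the paper’s exact configuration; what matters is the mechanism of distilling from a sequence of saved teacher checkpoints instead of the final one alone.

import torch
import torch.nn as nn
import torch.nn.functional as F

d, k = 100, 6                                  # input dimension; size of the parity support
S = torch.arange(k)                            # hypothetical support set (first k coordinates)

def sample_batch(n):
    x = torch.randint(0, 2, (n, d)).float() * 2 - 1   # uniform inputs in {-1, +1}^d
    y = (x[:, S].prod(dim=1) > 0).long()              # label = parity of the coordinates in S
    return x, y

def mlp(width):
    return nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 2))

# Train a large teacher, saving intermediate checkpoints along the way.
teacher, ckpts = mlp(1024), []
opt = torch.optim.SGD(teacher.parameters(), lr=0.1)
for step in range(5000):
    x, y = sample_batch(256)
    loss = F.cross_entropy(teacher(x), y)
    opt.zero_grad(); loss.backward(); opt.step()
    if (step + 1) % 1000 == 0:
        ckpts.append({name: p.clone() for name, p in teacher.state_dict().items()})

# Progressive distillation: a smaller student matches each checkpoint's soft
# predictions in order, rather than only the final teacher's.
student, frozen = mlp(64), mlp(1024)
opt_s = torch.optim.SGD(student.parameters(), lr=0.1)
for ckpt in ckpts:
    frozen.load_state_dict(ckpt)
    for _ in range(1000):
        x, _ = sample_batch(256)
        with torch.no_grad():
            t_logits = frozen(x)
        loss = F.kl_div(F.log_softmax(student(x), dim=-1),
                        F.softmax(t_logits, dim=-1), reduction='batchmean')
        opt_s.zero_grad(); loss.backward(); opt_s.step()

In this picture, the implicit curriculum lives in the intermediate checkpoints’ soft predictions, which expose easier intermediate structure before the teacher has learned the full parity.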
The second part focuses on context-enhanced learning, a gradient-based analog of in-context learning (ICL) in which models are trained with extra contextual information provided in-context but removed at evaluation, with no gradient computations on this extra information. For a multi-step reasoning task, we prove that context-enhanced learning can be exponentially more sample-efficient than standard training, provided the model is ICL-capable. We also demonstrate experimentally that the learning materials used in-context during training appear hard to detect or recover from the trained model. This may have implications for data security as well as copyright.
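To sketch the second setup, here is a minimal, hypothetical PyTorch training step, assuming a HuggingFace-style causal LM whose forward pass returns .logits; context_ids, query_ids, and answer_ids are placeholder tensors of token ids. One natural way to realize "no gradient computations on the extra information" is to mask the context span out of the next-token loss, so gradient updates are driven only by the answer tokens while the context still shapes those predictions through attention.

import torch
import torch.nn.functional as F

def context_enhanced_step(model, optimizer, context_ids, query_ids, answer_ids):
    # Train with the helper context present in the input, but compute loss
    # only on the answer tokens: the context/query span is masked with -100.
    input_ids = torch.cat([context_ids, query_ids, answer_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : context_ids.size(1) + query_ids.size(1)] = -100
    logits = model(input_ids).logits                    # (batch, seq, vocab)
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           labels[:, 1:].reshape(-1), ignore_index=-100)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

@torch.no_grad()
def evaluate(model, query_ids):
    # At evaluation the extra context is removed: only the query is given.
    return model(query_ids).logits[:, -1].argmax(dim=-1)

Dropping the context at evaluation, as in evaluate above, tests whether the skill was internalized into the weights rather than merely read off the context at inference time.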
References
Progressive distillation induces an implicit curriculum. ICLR’25 (Oral). Abhishek Panigrahi*, Bingbin Liu*, Sadhika Malladi, Andrej Risteski, Surbhi Goel
On the Power of Context-Enhanced Learning in LLMs. In submission. Xingyu Zhu*, Abhishek Panigrahi*, Sanjeev Arora