Notes on the Inductive Reasoning Ability of Transformers (approximate MDL perspective)


Qian Li

Jan 20, 2026, 2:36:47 AM
to Algorithmic Information Theory

Dear all,

I would like to share a short technical note that I recently wrote, entitled “A Note on the Inductive Reasoning Ability of Transformers.”

This note is motivated by the following fundamental questions about Transformers (beyond computational efficiency considerations):

  • Q1: Why do Transformers appear to outperform other neural architectures as general-purpose predictors?
  • Q2: Should we continue scaling Transformers? If not, when should scaling stop?
  • Q3: Do we need continual learning/training, and what is the nature of "continual learning"?

We address these questions through the lens of MDL theory. Technically, we study an approximate MDL setting, in which the selected model attains only an approximately minimal description length, up to an additive slack C \geq 0. This setting explicitly captures the fact that practical model selection and training typically optimize their objectives only approximately.
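To make the setting concrete (this is my own paraphrase with assumed notation; the note's exact definitions may differ), write K(\rho) for the code length of a model \rho in a hypothesis class M, and consider the two-part description length of data x_{1:t}:

    L_t(\rho) = K(\rho) - \log \rho(x_{1:t}).

A C-approximate MDL learner is then any selection rule returning some \hat{\rho}_t with

    L_t(\hat{\rho}_t) \leq \min_{\rho \in M} L_t(\rho) + C,

which recovers exact MDL at C = 0.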

We prove an explicit finite upper bound on the cumulative squared prediction error, whose dependence on the slack C formalizes the principle that better compression leads to better prediction (a stylized sketch of this style of argument appears after the list below). Some of the other main points are:

  • An analysis of block-static/continual-learning-style prediction, showing that less frequent model updates increase the cumulative prediction error.
  • An analysis of offline/train-then-deploy-style prediction, showing that a smaller MDL approximation slack reduces the amount of data required to identify a sufficiently accurate model.
  • A discussion of the implications for Q1-Q3. We also discuss whether Transformers are expressive enough to implement Solomonoff induction (Remark 4.2).
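For readers who want the flavor of how the slack enters such bounds, here is a stylized chain of inequalities in the classical Solomonoff/Hutter style, under the notation assumed above and with a fixed selected model; this is only an illustration, not the note's actual theorem, whose constants and conditions may differ. If the true source \mu lies in M and \hat{\rho} is C-approximately MDL-selected, then by definition

    K(\hat{\rho}) - \log \hat{\rho}(x_{1:n}) \leq K(\mu) - \log \mu(x_{1:n}) + C,

so, since K(\hat{\rho}) \geq 0, the cumulative log-loss regret satisfies

    \log \frac{\mu(x_{1:n})}{\hat{\rho}(x_{1:n})} \leq K(\mu) + C.

Taking expectations under \mu turns the left-hand side into a cumulative KL divergence (by the chain rule, a sum of expected one-step KLs), and a Pinsker-type inequality bounds each one-step squared prediction error by the corresponding KL term, yielding a bound of the form

    \sum_{t=1}^{n} E[(\mu_t - \hat{\rho}_t)^2] \leq c (K(\mu) + C),

where \mu_t and \hat{\rho}_t denote one-step predictive probabilities given x_{<t} and c is an absolute constant absorbing the base of the logarithm. The additive appearance of C is exactly the "better compression, better prediction" dependence.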

I would be very happy to hear any comments and feedback.

Best regards,
Qian Li 

Note on the induction ability of Transformers(v2).pdf