Dear all,
I would like to share a short technical note that I recently wrote, entitled “A Note on the Inductive Reasoning Ability of Transformers.”
This note is motivated by the following fundamental questions about Transformers (beyond computational efficiency considerations):
We address these questions through the lens of MDL (minimum description length) theory. Technically, we study an approximate MDL setting, in which the selected model attains only an approximately minimal description length, up to an additive slack C \geq 0. This approximate setting explicitly captures the fact that practical model selection and training typically optimize their objectives only approximately.
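For concreteness, the approximate-MDL condition can be stated as a single inequality. The notation below (a hypothesis class \mathcal{H}, a description length L_n(h) of the first n observations under hypothesis h, and the selected model \hat{h}_n) is only illustrative and is not taken verbatim from the note:

    L_n(\hat{h}_n) \leq \min_{h \in \mathcal{H}} L_n(h) + C,    with C \geq 0.

Setting C = 0 recovers exact MDL model selection, while C > 0 reflects the approximate optimization carried out in practice.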
We prove an explicit finite upper bound on the cumulative squared error, whose dependence on the slack C formalizes the principle that better compression leads to better prediction. Some of the other main points are:
I would be very happy to hear any comments and feedback.
Best regards,
Qian Li