| Subject: | Re: Research Inquiry from Rich Heimann’s friend: Connecting JEPA to Algorithmic Statistics |
|---|---|
| Date: | Sun, 5 Apr 2026 12:04:13 -0700 |
| From: | Hai Huang <hih...@gmail.com> |
| To: | Alexander Shen <alexand...@lirmm.fr> |
| CC: | Steve X. Liu <xue...@gmail.com>, Rassul Magauin <rassul...@gmail.com>, sasha...@gmail.com, kozmath <koz...@proton.me>, Alexey Milovanov <alma...@gmail.com>, Nikolay Vereshchagin <nikolay.ve...@gmail.com> |
Hi everyone,
Looking forward to our chat tomorrow!
To help make the most of our time, I’ve attached the slide deck we will be using to guide the discussion. My section covers a high-level intuition of Next Token Prediction and JEPA (and how we see them connecting to KC), and then Rassul will share some of our recent progress.
Please feel free to skim through it beforehand if you'd like, but no formal preparation is expected at all.
See you all tomorrow,
Hai
Topic: AI Kolmogorov meeting
Time: Apr 6, 2026 07:30 PM Paris
Join Zoom Meeting
https://us06web.zoom.us/j/86389121312?pwd=s7Hqwm4TYBrYkQrahaMx2asaWLsxk3.1
Meeting ID: 863 8912 1312
Passcode: 670657
On 4/1/26 04:42, Hai Huang wrote:
Thanks a lot Sasha. We are really looking forward to it, and I assure you we highly value your, Alexey's, and others' perspectives.
I'm CCing @Steve X. Liu, a Professor at McGill and MBZUAI, and his student @Rassul Magauin here so we are all on the same thread. Looking forward to the meeting invite!
On Tue, Mar 31, 2026 at 1:27 PM Alexander Shen <alexand...@lirmm.fr> wrote:
Sure, why not? Do not expect much from us, but we will try to be useful...
On 3/31/26 22:25, Hai Huang wrote:
Dear Sasha,
Thank you for accommodating my schedule. Next Monday, April 6th at 19:30 European time works perfectly for me.
I would be absolutely delighted if Alexey Milovanov could join us; his expertise in algorithmic statistics will be incredibly valuable to our discussion.
On a related note, I recently started a collaboration with a CS Professor and a PhD student focusing on these areas. Would you be open to me looping them into this meeting as well?
Please let me know if that works for you, and I look forward to speaking with the group next week.
Best regards,
Hai
On Tue, Mar 31, 2026 at 7:24 AM Alexander Shen <alexand...@lirmm.fr> wrote:
Dear Hai,
taking into account your time constraints - maybe we can arrange a small group meeting after our usual seminar time - say, next Monday at 19:30 European time? I don't know what people's plans are, but at least Alexey Milovanov (who is an expert in algorithmic statistics, much more than myself) could come. If you are OK with this time, I'd ask the others...
-- Sasha
Next Monday = April 6
On 2/27/26 05:30, Hai Huang wrote:
Dear Sasha,
Sorry for the delay; I wanted to take the time to carefully reflect on your insights. After giving it much thought, I feel even more encouraged that Kolmogorov Complexity (KC) provides the precise theoretical framework I am looking for.
You are absolutely right that its incomputability and reliance on busy beaver numbers render it a purely theoretical construct. However, from my perspective in AI, I view this as a unique advantage rather than a limitation. Compared to Shannon information theory, KC is unconstrained by prior assumptions. For my theoretical work, its incomputability is actually quite freeing—when attempting to formulate proofs, I am not forced to append caveats like "assuming computability." Similarly, it liberates us from the assumption that the underlying data distribution is known, which is a pervasive bottleneck in standard machine learning. The struggle with "in-distribution" versus "out-of-distribution" generalization is so significant that it has spawned entire subfields in our domain, such as transfer learning and alignment.
For my current purposes, I believe I can primarily rely on two specific properties. The first is exactly what you mentioned: structural simplicity can always be traded for unstructured information, but not vice versa. In the context of deep learning, my interpretation is that it is always trivially easy to overfit a model (sacrificing structural simplicity), but extraordinarily difficult to achieve true generalization (discovering structural simplicity). To me, this rigorously justifies the need for architectural constraints to prevent overfitting. To some extent, my research integrating Joint Embedding Predictive Architectures (JEPA) into LLM training attempts to do exactly this.
The second property is the counting argument, which demonstrates that only a vanishingly small fraction of strings can be efficiently compressed. If we equate "intelligence" with the ability to efficiently compress, this implies that all intelligence must be highly specific, rather than universal. This provides a beautiful theoretical foundation for my belief that Artificial General Intelligence (AGI)—or even human "general" intelligence—is fundamentally a misnomer; intelligence is inherently specialized.
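(For concreteness, the counting bound I have in mind is the standard one: there are at most $2^0 + 2^1 + \dots + 2^{n-1} = 2^n - 1$ programs shorter than $n$ bits, so fewer than $2^n$ strings have $K(x) < n$; in particular, among the $2^n$ strings of length $n$, the fraction compressible by more than $k$ bits, i.e., with $K(x) < n - k$, is less than $2^{-k}$.)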
I would be incredibly honored to attend the Kolmogorov seminar. It would be a privilege to share my perspective on the current state of AI and to see if these connections hold up to the group's rigorous scrutiny. Thank you so much for the generous invitation.
Regarding the logistics, 16:30 European time is a bit early for me over here on the US West Coast. Would it be at all possible to push the start time to 17:30 or 18:00 European time?
Thank you again for your time and guidance.
Best regards,
Hai
On Mon, Feb 16, 2026 at 10:18 AM Alexander Shen <sasha...@gmail.com> wrote:
Dear Hai,
sorry for the delay. The problem with algorithmic statistics - as it is now - is that its results are very theoretical "approximations from above" that do not deal with the (probably) most important practical issue of computation time. Or, better to say, algorithmic statistics somehow deals with it, but in a regime that is purely theoretical.
The definition of Kolmogorov complexity K(x) does not restrict the time needed to reconstruct x from its compressed description (= the time needed to run the shortest program that produces x). And it does not require any algorithm for _compression_ at all, not to speak of resource bounds for compression. There is a correspondence to time bounds: the Kolmogorov structure function (or the other, equivalent curves) describes the dependence of complexity on a time bound _measured in busy beaver numbers_, i.e., we consider the number of steps needed to perform the longest terminating computation among programs of size at most n. These numbers are infinite for all practical, or even theoretically practical, purposes.
So any direct application of algorithmic statistics is hardly possible. But, of course, some ideas can be useful (who knows). For example, there are some obvious lessons - e.g., we need to consider in a balanced way the "structured" and "unstructured" information (the description of the set and the ordinal number within the set), and structural simplicity can always be traded for unstructured information (that is probably what you mentioned), etc.
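(To fix notation - these are just the standard definitions: for a finite set $S$ containing $x$, the two-part description gives $K(x) \le K(S) + \log_2 |S|$ up to logarithmic terms, and the structure function is $h_x(\alpha) = \min\{\log_2 |S| : S \ni x,\ K(S) \le \alpha\}$. The "easy" direction of the trade is the splitting trick: cut $S$ into $2^k$ parts according to the first $k$ bits of the ordinal number and keep the part containing $x$; this converts roughly $k$ unstructured bits into $k$ bits of model description. The reverse direction - recovering a genuinely simpler model of the same quality - is not possible in general.)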
Sorry that I cannot say something more specific - you overestimate (with a huge margin) my knowledge (and, in general, the knowledge in the Kolmogorov complexity community) about the miracles that are now happening in AI. I thought that maybe you would be interested in explaining what is happening in AI at our "Kolmogorov seminar" (an online seminar that usually gathers on Mondays at 16:30 European time, but we can organize a special meeting), and then maybe you can ask some questions about algorithmic statistics that we could understand better after your explanations. (I guess that just reading your paper won't work; we are not at the level where we could read research papers...)
What do you think?
All the best
-- Sasha
On 2/10/26 03:40, Hai Huang wrote:
Hi Sasha,
Rich Heimann shared your kind note regarding your book, Kolmogorov Complexity and Algorithmic Randomness. I’ve been studying it closely and would appreciate your perspective on how its principles might formalize some recent results in LLM training dynamics.
I am currently investigating the theoretical limits of Next-Token Prediction (NTP). My recent work with Yann LeCun (ICLR 2026) explores bridging Joint Embedding Predictive Architecture (JEPA) with LLM training. We’ve found that adding a JEPA-style loss—which aligns queries and answers at a conceptual level—significantly improves accuracy and scaling compared to standard NTP.
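To give a rough sense of what I mean by a JEPA-style loss, here is a minimal sketch in PyTorch; the `predictor` head, the embedding choices, and the cosine alignment are placeholders for illustration, not the implementation from our ICLR paper:
```python
# Illustrative only: a toy JEPA-style auxiliary term on top of standard NTP.
# The predictor head and the cosine-alignment objective are placeholders.
import torch
import torch.nn.functional as F

def ntp_loss(logits, targets):
    # Standard next-token prediction: token-level cross-entropy.
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

def jepa_alignment_loss(query_emb, answer_emb, predictor):
    # Predict the answer's representation from the query's representation and
    # align the two in embedding space (stop-gradient on the target branch).
    pred = F.normalize(predictor(query_emb), dim=-1)
    target = F.normalize(answer_emb.detach(), dim=-1)
    return (1.0 - (pred * target).sum(dim=-1)).mean()  # mean cosine distance

def total_loss(logits, targets, query_emb, answer_emb, predictor, lam=0.1):
    # Joint objective: usual NTP term plus a weighted JEPA-style alignment term.
    return ntp_loss(logits, targets) + lam * jepa_alignment_loss(query_emb, answer_emb, predictor)
```
Here `predictor` would be a small learned head (e.g., an MLP) mapping the query representation into the answer-representation space, and `lam` trades off the two objectives.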
I suspect Algorithmic Statistics offers the formal framework to explain this, and I had two specific questions regarding your theorems:
Learning Algorithms vs. Samples: NTP often focuses on memorizing samples $x$, whereas JEPA operates at the level of the "algorithm" $A$. Theorem 252 suggests it is easier to refine a coarse $A$ to fit samples $x$, but Theorem 253 suggests that if a model is overfit to a specific set of samples, generalizing it to a broader set is inherently difficult. Do you see this "upward" difficulty in Theorem 253 as a fundamental limit on the generalization of NTP-only models?
Capacity & "Jagged" Intelligence: Throughout the book, the proof trick of counting strings (fewer than $2^n$ strings can have complexity $< n$) demonstrates that the vast majority of binary strings are incompressible. Intuitively, this suggests that any finite program (including an LLM) can only master a vanishingly small fraction of "doable" tasks. Does this support a view of "Jagged Intelligence"—where models excel in narrow, compressible domains—rather than the path toward AGI?
I would be honored to hear your thoughts on whether these interpretations hold water within the strict framework of Kolmogorov Complexity. I’m happy to share our ICLR findings or a draft of our follow-up work if you’re interested in seeing the empirical side of these ideas.
Best regards,
Hai Huang