Relaxing Levin's Coding Theorem

Gabriel Leuenberger

Apr 3, 2024, 7:33:14 AM
to Algorithmic Information Theory
I just read the proof of Levin's coding theorem in the book 'An Introduction to Kolmogorov Complexity and Its Applications', Section 4.3.4, Theorem 4.3.3.
The proof takes almost two pages and at the end it mentions an additive constant of 4 bits: K_T ≤ -log m(x) + 4. However, somewhat disappointingly, the complexity K_T is defined with respect to a rather contrived reference machine T, which is not going to be the simplest machine, so the final additive constant will be much larger than 4. In practice, the usefulness of the theorem depends on how small this constant is. At least the proof shows that the constant will not be astronomical. But since each additional bit in the constant costs a factor of two in probability, I would still consider it important for the constant to be quite small.
-Has someone estimated how large this constant actually is?
-Is there an alternative, relaxed version of Levin's coding theorem which only applies most of the time rather than always, in exchange for a much smaller additive constant? I have a proof idea for such a version but I don't know whether this has already been done.

Gabriel Leuenberger

Apr 15, 2024, 8:32:06 PM
to Algorithmic Information Theory

Levin’s coding theorem can be written as follows:

K(x) = -log m(x) + O(1)
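
For concreteness, I use the standard setup: a fixed universal prefix machine U, with ℓ(p) denoting the length of a program p in bits, and m(x) taken as the a priori probability of x, i.e. the total probability that U outputs x when fed fair coin flips:

  K(x) = \min\{\, \ell(p) : U(p) = x \,\},
  \qquad
  m(x) = \sum_{p \,:\, U(p) = x} 2^{-\ell(p)}.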

I claim that there exists a similar theorem that would be more useful because the O(1) would be bounded by a smaller constant. In addition, the proof would be more natural and less contrived. The relaxation can be done by replacing K(x) with its expectation value E(K(x)). So the new theorem would be:

E(K(x)) = -log m(x) + O(1)

And here is my proof outline:

E(K(x)) ≥ -log m(x) holds trivially, because 2^-K(x) ≤ m(x) and hence K(x) ≥ -log m(x) for every x. Therefore, our focus is on proving the opposite inequality: E(K(x)) ≤ -log m(x) + O(1).

Vaguely speaking, it says that if a large fraction of the long programs output x, then we can expect a short program that outputs x to exist as well. Now the main proof idea is that, due to the sheer number of long programs up to some length, we should statistically expect at least one of these long programs to be highly compressible itself, so that the short program can be obtained directly from it.

In order to show this, we just need to calculate the expected minimum of K(p) over all programs p that output x. To avoid confusion, note that this is a minimum of a minimum, since K(p) is itself defined as a minimum.

The result of this calculation should then come out to -log m(x) + O(1).

Deriving this result should not be too hard if we know the probability distribution of K(p) for a given L(p), which (writing y for a generic program of length n) we can write as:

P( K(y)=k | L(y)=n ).

However, I don't know of a theorem where such a probability distribution has been derived, so some help would be appreciated here. I would expect it to be roughly of the following form:

P( K(y)=k | L(y)=n ) ≈ 2^(k-n-O(1))
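
At least the upper-bound direction of this should follow from a simple counting argument (a sketch, under the standard setup of a universal prefix machine U, with y drawn uniformly from the 2^n strings of length n). Each string y with K(y) = k has a distinct minimal program of length k, and there are at most 2^k programs of length k, so

  P\bigl( K(y)=k \mid L(y)=n \bigr)
    = \frac{\#\{\, y \in \{0,1\}^n : K(y)=k \,\}}{2^n}
    \;\le\; \frac{2^k}{2^n}
    = 2^{k-n}.

So the conjectured form is at least correct as an upper bound; what I am missing a reference for is the matching lower bound, i.e. that for k in the relevant range a constant fraction of the programs of length k really do occur as minimal programs of strings of length n.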


Sidenote: As an improvement of the proof, one could take into account that the programs p are not allowed to be prefixes of each other, which means that, given knowledge of this restriction, each program p actually carries about log(L(p)) bits less algorithmic information than K(p).

Kreinovich, Vladik

Apr 15, 2024, 8:40:47 PM
to Gabriel Leuenberger, Algorithmic Information Theory

Can you please clarify what you mean by the expected value? Over what probability measure? And if you take the mean value over x, how can it still be a function of x? The expected value means p(x_1)*K(x_1) + p(x_2)*K(x_2) + …

Gabriel Leuenberger

Apr 15, 2024, 9:09:18 PM
to Algorithmic Information Theory
Yes, m(x) is a sum over all programs p that output x. Think of a list of the lengths L(p) of all of these programs. For each of these L(p) you would have a probability distribution P( K(p)=k | L(p) ). By combining all of these PDs, you can construct the PD of the minimum of all the K(p).
So: P( min(K(p))=k | the list of lengths ).
And based on this you can get the expected value E(min(K(p))), which is also equal to E(K(x)) + O(1).
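
To make this concrete, here is a small numerical sketch in Python of exactly that computation. It assumes the conjectured geometric-type form P( K(p)=k | L(p)=n ) proportional to 2^(k-n) with a cutoff near n + log(n), and it additionally treats the complexities of the different programs as independent, which is a simplifying assumption. The function names and the toy list of program lengths are only placeholders for illustration.

import math

def k_distribution(n):
    """Conjectured distribution of K(p) for a 'random' program p of length n:
    P(K = k) proportional to 2^(k - n), supported on 1 <= k <= n + log2(n).
    This is only the rough form discussed above, normalized numerically."""
    k_max = int(n + math.log2(n)) if n > 1 else 1
    weights = [2.0 ** (k - n) for k in range(1, k_max + 1)]
    total = sum(weights)
    return {k: w / total for k, w in zip(range(1, k_max + 1), weights)}

def expected_min_K(program_lengths):
    """E[min_p K(p)] over the listed programs, treating their complexities as
    independent draws from k_distribution (an extra simplifying assumption)."""
    dists = [k_distribution(n) for n in program_lengths]
    max_k = max(max(d) for d in dists)
    # For a nonnegative integer variable: E[min] = sum_{k >= 0} P(min > k),
    # and P(min > k) = product over programs of P(K(p) > k).
    expectation = 0.0
    for k in range(0, max_k + 1):
        p_all_greater = 1.0
        for d in dists:
            p_all_greater *= sum(prob for kk, prob in d.items() if kk > k)
        expectation += p_all_greater
    return expectation

# Toy example: many long programs, all assumed to output the same x.
lengths = [100] * 1000   # hypothetical list of L(p) for the programs p with U(p) = x
print(expected_min_K(lengths))

With this toy input the expected minimum comes out roughly log2(1000) ≈ 10 bits below the cutoff n + log2(n), which is exactly the effect the argument needs: the more programs output x, the shorter the shortest one is expected to be.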

Gabriel Leuenberger

Apr 16, 2024, 7:04:17 AM
to Algorithmic Information Theory
To be clearer about the probability that I am asking for:
P( K(p)=k | L(p) ) should be an estimate of the number of minimal programs (not p itself) of length k whose output has length L(p), divided by the total number of minimal programs whose output has length L(p). Note that we are talking about a minimal program that outputs another, non-minimal program p.
Due to the large number of non-minimal programs p, we may treat the program p as a random string y sampled via random coin flips. As previously mentioned, I would expect the resulting PD to be roughly of the following form:

P( K(y)=k | L(y)=n ) ≈ 2^(k-n-O(1))

And I was hoping that someone would already know such a theorem.

And I should also mention that this formula would only be valid for k < n + log(n) + O(1), and the PD would be zero for larger k, since generally K(y) < L(y) + log(L(y)) + O(1).

So the PD increases exponentially with k until it suddenly drops to zero.
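
A small arithmetic consequence of this shape, which is what should make the minimum-of-a-minimum idea work (assuming the form above is roughly right): almost all of the probability mass sits within a few bits of the cutoff. If k_max denotes the largest k with nonzero probability, then for a geometric ratio of 2 we get

  \sum_{k \le k_{max} - j} 2^{k-n-O(1)} \;\approx\; 2^{-j} \sum_{k \le k_{max}} 2^{k-n-O(1)},

so P( K(y) ≤ k_max − j ) ≈ 2^(−j). A single random program is therefore unlikely to be compressible much below the cutoff, but among N such programs (treated as roughly independent) one expects the minimum of K to drop to about k_max − log(N), which is the mechanism that should turn "many long programs output x" into "a short program for x exists".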


john.tromp

Apr 17, 2024, 10:34:36 AM
to Algorithmic Information Theory
On Wednesday, April 3, 2024 at 1:33:14 PM UTC+2 gabriel....@gmail.com wrote:
-Has someone estimated how large this constant actually is?

That depends on the choice of universal machine. Not many people have gone to the trouble of specifying a universal machine in full detail in order to determine actual constants for theorems of AIT. The best-known example is Greg Chaitin, who proposed several LISP-based universal machines and proceeded to write constant-yielding programs, such as those found in [1].

The essence of the coding theorem appears as Theorem I4 in that book, but is not accompanied by a program. One of the larger programs that Chaitin provides in support of an AIT theorem proof is in [2], proving one side of the Symmetry of Information theorem as KP(x,y) ≤ KP(x) + KP(y|x*) + 2872. A program for Theorem I4 would be far larger than that.

I have myself also defined a universal machine in [3], based not on LISP but on the far more primitive lambda calculus. Despite lacking the AIT primitives that Chaitin added to his LISP, it manages to prove the same Symmetry of Information theorem with a much smaller constant of only 784 bits.

Studying these programs can perhaps give you a rough estimate of what the constant would be for Chaitin's LISP or my Binary Lambda Calculus.



Gabriel Leuenberger

Apr 22, 2024, 5:45:03 PM
to Algorithmic Information Theory
Thank you very much for all this useful information, John Tromp.
Meanwhile, I have obtained a result via the proof idea outlined above: I was able to derive E(C(x)) = -log m(x) + O(1), where C is the plain Kolmogorov complexity instead of the prefix complexity K. Learning from this, I then came up with a much better proof technique which allowed me to derive the original, strong version of Levin's coding theorem with a constant that, coincidentally, will be about equal to the constant of the Symmetry of Information theorem that you mentioned.

The reason that Levin's proof has a constant which, as you say, "would be far larger" is that Levin first proved the universality of 2^-K(x) and then used this to equate it to m(x). This was unnecessary, as there exists a more direct route from 2^-K(x) to m(x), which also improves the understanding of the relation between these two PDs. Furthermore, I then realized that by defining a more specialized version of the conditional complexity KP(y|x*), it would be possible to lower the constant of the coding theorem even below the constant for one side of the original Symmetry of Information theorem.