Information Distance Questions


Ravi Deedwania

Jan 30, 2024, 2:35:58 PM
to Algorithmic Information Theory
Dear AIT group,

I have two questions related to the AIT notion of information distance and am wondering if anyone in this group might be able to help.

In what follows, let U represent the function implemented by a universal prefix Turing machine with U(y'i'q) = T_i(y'q), where i' is a self-delimiting encoding of the index i of a prefix Turing machine in some effective enumeration of prefix Turing machines, y' is a self-delimiting encoding of a conditional input y, and T_i is the function implemented by the prefix Turing machine indexed by i. Also, let UR represent the function implemented by a universal reversible prefix Turing machine with UR(y'i'q) = TR_i(y'q), where all variables have the same meaning as for U except that i is the index of a reversible prefix Turing machine in some effective enumeration of reversible prefix Turing machines. 

I'm reading the fourth edition of An Introduction to Kolmogorov Complexity and Its Applications by Li and Vitanyi. Section 8.3 of the book discusses three possible ways of formalizing the notion of information distance between two strings x and y: 
  1. The length of a shortest program p such that U(x'p) = y and U(y'p) = x (the book refers to this information distance function as E_0(x,y)); 
  2. max{K(x|y), K(y|x)}, where K(x|y) is the conditional prefix complexity of x given y (the book refers to this information distance function as E_1(x,y)); and
  3. The length of a shortest program p such that UR(x'p) = y (the book refers to this information distance function as E_2(x,y)). 
Theorem 8.3.3 in the book states that E_2(x,y) = E_0(x,y) up to an additive constant. One component of the proof involves proving that E_2(x,y) >= E_0(x,y) (again up to an additive constant). The book points out that since reversible prefix Turing machines are a subset of prefix Turing machines, there is some index j such that UR(x'p) = U(x'j'p) for all x and p. The book says that this proves the inequality. 

This is where I get lost. In particular, I don't think it's necessarily the case that for a shortest program p such that UR(x'p) = y we also have UR(y'p) = x, so j'p isn't necessarily a program that converts in both directions. There is a proof of the same theorem on p. 24 of this paper (https://arxiv.org/pdf/1006.3520.pdf) that notes that if you are given a reversible program that converts x to y, the same program can be "inverted" with constant overhead to convert y to x. While this is true, I don't see how the result that E_2(x,y) >= E_0(x,y) follows, since it seems like any program that leverages a shortest reversible program to convert in both directions would need to be able to determine whether the conditional input is x or y (in order to decide whether to run the program "forwards" or "backwards"). As far as I can tell, the length of any code to distinguish between two strings will in general depend on the lengths of those strings. You could try running the reversible program p both forward and backward in a parallel/interleaved fashion, but as far as I can tell you can't rule out that the two directions output different strings (or that one direction never halts).

So my first question is: can anyone explain either why the above proof approach works or another way of proving the inequality? I suspect there is something very obvious that I'm missing, but I've spent a fair amount of time thinking about it and asked the same question on Stack Exchange (https://cstheory.stackexchange.com/questions/53516/relation-between-different-definitions-of-information-distance) to no avail. The equivalence of these two different ways of formalizing information distance seemed (to me at least) like a rather remarkable result, so I'd like to understand the proof if I can.

My second question is about the universality/minimality of E_1(x,y) relative to other distance measures, which is discussed in section 8.3.3 of Li and Vitanyi. Theorem 8.3.2 proves that E_1(x,y) <= D(x,y) (up to an additive constant) for all "admissible" distance functions D(x,y). One of the requirements for a distance function D to be admissible is that for any given y, the sum over all x of 2^-D(x,y) is less than or equal to one, and vice versa (i.e. for any fixed y (respectively x), the D(x,y) values over all x (respectively y) correspond to the codeword lengths of a binary prefix code). The book justifies this requirement by stating that "for each x and d, we want only finitely many elements at a distance d from x," and doesn't really elaborate further.
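(To spell out the connection as I understand it: if the sum over all x of 2^-D(x,y) is at most one, then there can be at most 2^d strings x with D(x,y) <= d, since each such x contributes at least 2^-d to the sum. So the requirement does guarantee that only finitely many elements lie within distance d of y.)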

I didn't find this justification fully satisfying. In particular, the book seemed to be arguing that all "reasonable" distance measures for strings either satisfy this requirement or can be easily modified to satisfy this requirement in a way that preserves the practically important aspects of their structure. It's not obvious to me that this is true (in particular, it's not obvious to me that any reasonable distance function would have only finitely many elements at a distance d from a string x, for any x and d). In fact, the beginning of section 8.3.3 gives some examples of distance measures that don't seem to satisfy this requirement (e.g. the difference in the number of occurrences of a given base in two different genomes). This requirement for admissibility seems important, since without it I think E_1(x,y) is not universal/minimal. Can anyone explain (or point to resources that explain) in more detail the intuition justifying this requirement for admissibility/reasonableness of a distance measure? 

Apologies for the long email and thank you for your time! And if Profs Li or Vitanyi read this, thank you for your awesome book!

Ravi

P.S. If these questions are too basic or otherwise not appropriate for this mailing list, please let me know. 

Aram Ebtekar

Jan 30, 2024, 10:07:01 PM
to Algorithmic Information Theory
Hi Ravi!

I don't think your questions are too basic. For the reversible Turing machines, I think you just have to instruct the machine to treat its input as if it were written on the output tape, and to invert its transition table so that it will run backwards. Maybe UR's start state has to be specially marked somehow, so that the reverse-mode run knows when to halt.
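To make the inversion idea concrete, here is a toy sketch in Python (not a real prefix machine, just a made-up step function that happens to be a bijection on a handful of integer configurations):

    # Run a deterministic machine, given as a transition table, until a
    # designated halting configuration is reached; return the whole trace.
    def run(step, config, is_final):
        trace = [config]
        while not is_final(config):
            config = step[config]
            trace.append(config)
        return trace

    # Hypothetical reversible "machine": the forward step is a bijection,
    # so the transition table can simply be inverted.
    forward_step = {0: 3, 3: 1, 1: 4, 4: 2}
    backward_step = {v: k for k, v in forward_step.items()}  # inverted table

    forward = run(forward_step, 0, lambda c: c == 2)    # 0 -> 3 -> 1 -> 4 -> 2
    backward = run(backward_step, 2, lambda c: c == 0)  # 2 -> 4 -> 1 -> 3 -> 0
    assert backward == list(reversed(forward))

Note how the backward run has to be told that configuration 0 is the start in order to know when to stop, which is the marking issue I mentioned.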

As for your second question, requirements tend to be a bit subjective. If we accept the requirement that there should only be a finite number of elements at distance <= d, then the idea is that, for any fixed x, you can sort the y's in order from closest to farthest. And then, if you care about this order but not about the absolute magnitudes, you can always adjust the distances to satisfy Kraft's inequality. Perhaps I missed something else.
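In case it helps, here is a rough sketch of the rescaling I have in mind; the particular formula 2*ceil(log2(k+1)) is just one choice that happens to work, and the example distances are made up:

    import math

    def rescale(distances):
        """Given original distances y -> D(x, y) for one fixed x, return new
        integer distances that keep the closest-to-farthest order and satisfy
        Kraft's inequality: the sum over y of 2^-D'(x, y) is at most 1."""
        ranked = sorted(distances, key=distances.get)  # y's from closest to farthest
        # The k-th closest y gets 2*ceil(log2(k+1)); the roughly 2^(j-1) y's that
        # receive the value 2j contribute at most 2^(j-1) * 2^(-2j) = 2^(-j-1) in
        # total, so the whole sum is at most 1/2.
        return {y: 2 * math.ceil(math.log2(k + 1)) for k, y in enumerate(ranked, start=1)}

    D = {"aa": 0.3, "ab": 1.7, "ba": 0.9, "bb": 5.2}  # made-up distances from some x
    Dprime = rescale(D)                               # {'aa': 2, 'ba': 4, 'ab': 4, 'bb': 6}
    assert sum(2.0 ** -d for d in Dprime.values()) <= 1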

--
Aram

Bruno Bauwens

Jan 31, 2024, 12:27:49 PM
to Aram Ebtekar, Algorithmic Information Theory
Dear Ravi,

Question 1. I agree with you.

> There is a proof of the same theorem on p. 24 of this <https://arxiv.org/pdf/1006.3520.pdf> paper that notes that if you are given a reversible program that converts x to y, the same program can be "inverted" with constant overhead to convert y to x. While this is true, I'm not getting how the result that E_2(x,y) >= E_0(x,y) follows, since it seems like any program that leverages a shortest reversible program to convert in both directions would need to be able to determine whether the conditional input is x or y (in order to determine whether to run the program "forwards" or "backwards"). As far as I can tell, the length of any code to distinguish between two strings will in general depend on the length of those strings.

Indeed, there is a lower bound that is logarithmic in min(|x|,|y|) to solve this task. In fact, Alexander Shen proposed a directional variant of the information distance, where an extra bit is given in the input to indicate whether x -> y or y -> x direction needs to be evaluated. He asked whether this directional variant equals E_0 (which is undirectional). I could neither prove nor disprove this. (See the 1st open question in section 8 in https://arxiv.org/pdf/2009.00469.pdf .)

When looking at p. 24 of the paper you mention (https://arxiv.org/pdf/1006.3520.pdf), I also see a problem in the other direction. In definition 5.1, there are 3 conditions, and I understand how to prove the 1st and 2nd, but I don't understand how the 3rd is proven.

On the other hand, if E_0 is defined with plain complexity, then the directional variant equals E_0. For the plain variants, it holds that E_0(x,y) = max{C(x|y), C(y|x)} + O(1). Since the directional variant of E_0 lies between max{C(x|y), C(y|x)} and the undirectional variant, this means that the undirectional and directional variants are the same up to O(1). (This is explained in remark 2.1 on page 4 of my paper https://arxiv.org/pdf/2009.00469.pdf .)

Also note that Paul Vitanyi switched to plain complexity in one of his latest papers (https://homepages.cwi.nl/~paulv/papers/exact.pdf). Since the plain and the prefix distances are equal up to an additive logarithmic term in the distance, such considerations imply that theorem 8.3.3 in Li & Vitanyi (4th edition) holds up to additive logarithmic terms. The plain variant does not satisfy the triangle inequality. (But in my paper it is shown that the prefix distance does satisfy E_0(x,y) = E_1(x,y) + O(1) if one excludes some strange cases where the complexities of x and y differ in an exponential way.)
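Spelled out (writing E_dir for the directional variant, my notation), the chain is

    max{C(x|y), C(y|x)} <= E_dir(x,y) + O(1) <= E_0(x,y) + O(1) <= max{C(x|y), C(y|x)} + O(1):

fixing the direction bit in a directional program gives a program for y given x (and one for x given y), an E_0 program works in either direction regardless of the bit, and the last step is the plain-complexity equality above. Hence all of these quantities coincide up to O(1).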

Conclusion: I believe that theorem 8.3.3 only holds up to O(log E_0(x,y)).

Whether it holds up to O(1) seems a difficult open question, but with plain complexity (and the right definitions) it should be easier to adapt existing proofs.



Question 2, about which distance measures D are reasonable. I think that perhaps the plain variant of this justification is more convincing:

"There exists a function f such that for all natural numbers k and all x, the size of {y : D(x,y) < k} < f(k)."

Typically, D is monotonically non-decreasing, and hence if f is always finite, then values can be rescaled to satisfy the above condition for f(k) = 2^k. Then optimality of the plain information distance can be shown among the upper semicomputable ones. (See theorem 2.4 on page 5 in https://arxiv.org/pdf/2009.00469.pdf .) For the prefix variant, there is some strange tradeoff between different sizes of the functions. (But despite this, optimality still holds.)

I agree with you that the example "the difference in the number of occurrences of a single base in a genome" is not really an intelligent distance measure for general-purpose use. It makes more sense to look at the counts of all the bases and somehow add up the differences of these counts. Then such a finite f(k) should exist.
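A toy version of the kind of measure I mean (purely an illustration, with made-up strings):

    from collections import Counter

    def base_count_distance(g1, g2):
        """Hypothetical toy distance: add up, over the four bases, the absolute
        differences of their occurrence counts in the two genomes."""
        c1, c2 = Counter(g1), Counter(g2)
        return sum(abs(c1[b] - c2[b]) for b in "ACGT")

    # If the strings contain only A, C, G and T, a distance below k forces the
    # two lengths to differ by less than k, so only finitely many strings lie
    # within distance k of any fixed genome.
    print(base_count_distance("ACGTACGT", "AAAACCGGTT"))  # prints 2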

I understand the motivation as follows.
(1) We are interested in designing general purpose distances that are able to measure complex similarities.
(2) Our objects are discrete and contain a huge amount of information (compared to the implicit constants in the theory of Kolmogorov complexity).
Question: what might be a limiting point if we keep making our distance measure more and more clever?

In many applications, one might have large upper bounds on the inputs that can reasonably be expected, and again these upper bounds should be large enough to 'beat' the implicit constants. This turns the problem into a finite one, for which a finite f(k) exists. Anyway, all of the above is just philosophy, and obtaining full agreement is not always possible. Moreover, this is just one of the (approximate) equivalences.

Note that Vitanyi is a coauthor of several applied papers, e.g. on the "Google similarity distance" and the "normalized information distance", where the theory inspired simple experiments that received a lot of attention when the papers appeared.


Kind regards, Bruno

Ravi Deedwania

Feb 1, 2024, 8:23:05 PM
to Bruno Bauwens, Aram Ebtekar, Algorithmic Information Theory
Thank you both for your responses!

Aram:

Regarding my first question, it seems like you're suggesting a way to invert a reversible program that converts x to y in order to convert y to x. My question was actually about how you can take a reversible program that converts x to y and, with constant overhead, create a program that converts in both directions. Apologies, I think the way I asked the question probably wasn't clear.

Regarding the second question, what you're saying in terms of rescaling distance measures to satisfy Kraft's inequality while preserving the ordering makes sense. I think my question was more about why, for any x and d, there should only be a finite number of y with D(x,y) <= d for any "reasonable" distance measure. Apologies again as I think the way I phrased this question was confusing. 


Bruno:

Regarding my first question, thank you, your response is extremely helpful! And thank you very much for sending your paper (somehow I missed it in my searches), it helped to clarify numerous issues for me and revealed some subtleties I had never noticed. 

>In fact, Alexander Shen proposed a directional variant of the information distance, where an extra bit is given in the input to indicate whether x -> y or y -> x direction needs to be evaluated. He asked whether this directional variant equals E_0 (which is undirectional). I could neither prove nor disprove this.

Very interesting, thank you for sharing this! I suppose the directional variant will be bounded above by E_2, since you could always make the x -> y function a universal reversible prefix Turing machine and the y -> x function its inverse?

>When looking at p. 24 of the paper you mention (https://arxiv.org/pdf/1006.3520.pdf), I also see a problem in the other direction. In definition 5.1, there are 3 conditions, and I understand how to prove the 1st and 2nd, but I don't understand how the 3rd is proven.

I had glossed over this detail, but now that you point it out I'm also confused. I think I assumed that the "prefix" in "reversible prefix partial computable function" just meant that the set of halting programs for any given conditional input x is prefix free; I didn't realize the definition also requires that, for any y, the set of halting programs that output y for any conditional input x is also prefix free (to be honest, I don't think I understand the motivation for this latter requirement).

I didn't realize that the plain complexity variant of E_0 equals the plain complexity variant of E_1 up to an additive constant; thank you for pointing this out! So I guess the plain complexity variants of information distance have the advantage that the "natural" definition of information distance (based on programs that convert both ways) is equivalent to the more practically useful definition (the max of the conditional complexities), but the disadvantage that the triangle inequality only holds up to logarithmic precision, while the opposite is true for the prefix complexity variants?

Regarding my second question about the requirements for reasonable/admissible distance functions, thank you, your response helped me develop a better intuition for this! I particularly like the point that in many applications there might be upper bounds on the size of data you could reasonably expect to apply distance measures to (and hence the requirement I was asking about would be satisfied or the distance measure could be rescaled to satisfy it), and that these upper bounds might also be large enough so that the effects of implicit constants tend to be washed out. 

>Note that Vitanyi is a coauthor of several applied papers, e.g. on the "Google similarity distance" and the "normalized information distance", where the theory inspired simple experiments that received a lot of attention when the papers appeared.

Yes, thank you for pointing this out! I have come across some of these papers and the experimental results seemed like compelling evidence of the usefulness of the information distance concept (to be honest, I was pretty stunned by some of the results). This is one of the reasons I wanted to learn about information distance and AIT more broadly.

Thank you both again for your time and help!


Bruno Bauwens

Feb 2, 2024, 9:30:37 AM
to Ravi Deedwania, Aram Ebtekar, Algorithmic Information Theory
There were a few mistakes in my answer to question 1. I have also now explicitly stated that the reversible distance equals the bidirectional variant of the information distance. See Stack Exchange: https://cstheory.stackexchange.com/questions/53516/relation-between-different-definitions-of-information-distance/53839#53839

Kind regards, Bruno

Ravi Deedwania

Feb 3, 2024, 9:42:59 AM
to Bruno Bauwens, Aram Ebtekar, Algorithmic Information Theory
Thank you!