Dear Ravi,
Question 1. I agree with you.
> There is a proof of the same theorem on p. 24 of this
> <https://arxiv.org/pdf/1006.3520.pdf> paper that notes that if you are given
> a reversible program that converts x to y, the same program can be "inverted"
> with constant overhead to convert y to x. While this is true, I'm not getting
> how the result that E_2(x,y) >= E_0(x,y) follows, since it seems like any
> program that leverages a shortest reversible program to convert in both
> directions would need to be able to determine whether the conditional input
> is x or y (in order to determine whether to run the program "forwards" or
> "backwards"). As far as I can tell, the length of any code to distinguish
> between two strings will in general depend on the length of those strings.
Indeed, there is a lower bound, logarithmic in min(|x|,|y|), for solving this task. In fact, Alexander Shen proposed a directional variant of the information distance, where an extra bit is given in the input to indicate whether the x -> y or the y -> x direction needs to be evaluated. He asked whether this directional variant equals E_0 (which is undirected). I could neither prove nor disprove this. (See the first open question in section 8 of
https://arxiv.org/pdf/2009.00469.pdf .)
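To fix notation (this is just my way of writing it down, based on the description above and on the definitions quoted at the bottom of your message; Shen's exact formulation may differ in details), with U the reference universal prefix machine and x'p denoting concatenation as in your message:

  $E_0(x,y) = \min\{\, |p| : U(x'p) = y \ \text{and}\ U(y'p) = x \,\}$
  $E_{dir}(x,y) = \min\{\, |p| : U(\langle x,0\rangle' p) = y \ \text{and}\ U(\langle y,1\rangle' p) = x \,\}$

Here $\langle z,b\rangle$ is a pairing of the string z with the direction bit b, and the question is whether $E_{dir}(x,y) = E_0(x,y) + O(1)$.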
When looking at p. 24 of the paper you mention (https://arxiv.org/pdf/1006.3520.pdf), I see a problem in the other direction too. In definition 5.1 there are 3 conditions; I understand how to prove the 1st and the 2nd, but I don't understand how the 3rd is proven.
On the other hand, if E_0 is defined with plain complexity, then the directional variant equals E_0. For the plain variants, it holds that E_0(x,y) = max{C(x|y), C(y|x)} + O(1). Since the directional variant of E_0 lies between max{C(x|y), C(y|x)} and the undirected variant, this means that the undirected and directional variants are the same up to O(1). (This is explained in remark 2.1 on page 4 of my paper
https://arxiv.org/pdf/2009.00469.pdf .)
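To spell out this sandwich argument (a sketch in my own notation, with $E_{dir}$ the directional variant and both distances taken with plain complexity):

  $\max\{C(x|y), C(y|x)\} \le E_{dir}(x,y) + O(1) \le E_0(x,y) + O(1) \le \max\{C(x|y), C(y|x)\} + O(1)$

The first inequality holds because the directional program, given the bit for one direction, computes y from x (and, given the other bit, x from y). The second holds because a shortest E_0-program also solves the directional task: a constant-size wrapper strips and ignores the bit. The third is the plain-complexity characterization of E_0 stated above. Chaining the inequalities gives $E_{dir}(x,y) = E_0(x,y) + O(1)$.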
Also note that Paul Vitanyi switched to plain complexity in one of his later papers:
https://homepages.cwi.nl/~paulv/papers/exact.pdf . Since the plain and the prefix distances are equal up to an additive term that is logarithmic in the distance, such considerations imply that theorem 8.3.3 in Li & Vitanyi (4th edition) holds up to additive logarithmic terms. The plain variant does not satisfy the triangle inequality. (But in my paper it is shown that the prefix distance does satisfy E_0(x,y) = E_1(x,y) + O(1) if one excludes some strange cases where the complexities of x and y differ in an exponential way.)
Conclusion: I believe that theorem 8.3.3 only holds up to O(log E_0(x,y)).
Whether it holds up to O(1) seems a difficult open question, but with plain complexity (and the right definitions) it should be easier to adapt existing proofs.
Question 2. About which distance measures D are reasonable: I think that perhaps the plain variant of the following justification is more convincing.
"There exists a function f such that for all natural numbers k and all x, the size of {y : D(x,y) < k} < f(k)."
Typically, D is monotonically non-decreasing, and hence if f is always finite, then the values can be rescaled so that the above condition holds with f(k) = 2^k; a sketch follows below. Then the plain information distance can be shown to be optimal among the upper semicomputable distances satisfying this condition. (See theorem 2.4 on page 5 of
https://arxiv.org/pdf/2009.00469.pdf .) For the prefix variant, there is some strange tradeoff between the different sizes of these functions. (But despite this, optimality still holds.)
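Here is the rescaling sketch (my own reconstruction; assume for simplicity that D takes values in the natural numbers, that $f \ge 1$, and that f is non-decreasing, which is no loss of generality since f(k) can be replaced by $\max_{j \le k} f(j)$). Define

  $D'(x,y) = \lceil \log_2 f(D(x,y)+1) \rceil$.

If $D'(x,y) < k$, then $f(D(x,y)+1) < 2^k$, and since f is non-decreasing, $D(x,y) < m$, where m is the largest natural number with $f(m) < 2^k$ (if no such m exists, the set below is empty). Hence

  $|\{y : D'(x,y) < k\}| \le |\{y : D(x,y) < m\}| < f(m) < 2^k$,

so D' satisfies the condition above with f(k) = 2^k.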
I agree with you that the example "the difference of frequencies of a single base in a genome" is not really an intelligent distance measure for general-purpose use. It makes more sense to look at the frequencies of all bases, and to somehow add up the differences of these frequencies. Then such a finite f(k) should exist.
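For concreteness, here is a toy sketch of such a measure in Python (my own illustration, not taken from any of the papers above; summing absolute differences of base counts is just one possible choice):

    from collections import Counter

    def base_count_distance(x: str, y: str) -> int:
        # Sum, over the four bases, of the absolute difference in counts.
        cx, cy = Counter(x), Counter(y)
        return sum(abs(cx[b] - cy[b]) for b in "ACGT")

    # The second string has one extra A and one extra T, so the distance is 2.
    print(base_count_distance("ACGT", "AACGTT"))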
I understand the motivation as follows.
(1) We are interested in designing general purpose distances that are able to measure complex similarities.
(2) Our objects are discrete and contain a huge amount of information (compared to the implicit constants in the theory of Kolmogorov complexity).
Question: What might be a limiting point if we keep making our distance measure more and more clever?
In many applications, one might have large upper bounds on the inputs that can reasonably be expected, and again these upper bounds should be large enough to 'beat' the implicit constants. This turns the problem into a finite one, for which a finite f(k) exists. Anyway, all of the above is just philosophy, and obtaining full agreement is not always possible. Moreover, this is just one of the (approximate) equivalences.
Note that Vitanyi is a coauthor of several applied papers, such as those on the "Google distance" and the "normalized information distance", in which the theory inspired simple experiments that received a lot of attention when the papers appeared.
Kind regards, Bruno
> 1. The length of a shortest program p such that U(x'p) = y and U(y'p) =
> x (the book refers to this information distance function as E_0(x,y));
> 2. max{K(x|y), K(y|x)}, where K(x|y) is the conditional prefix
> complexity of x given y (the book refers to this information distance
> function as E_1(x,y)); and
> 3. The length of a shortest program p such that UR(x'p) = y (the book
> refers to this information distance function as E_2(x,y)).
>
> Theorem 8.3.3 in the book states that E_2(x,y) = E_0(x,y) up to an additive
> constant. One component of the proof involves proving that E_2(x,y) >=
> E_0(x,y) (again up to an additive constant). The book points out that since
> reversible prefix Turing machines are a subset of prefix Turing machines,
> there is some index j such that UR(x'p) = U(x'j'p) for all x and p. The
> book says that this proves the inequality.
>
> This is where I get lost. In particular, I don't think it's necessarily the
> case that for a shortest program p such that UR(x'p) = y we also have
> UR(y'p) = x, so j'p isn't necessarily a program that converts in both
> directions. There is a proof of the same theorem on p. 24 of this
> <https://arxiv.org/pdf/1006.3520.pdf> paper that notes that if you are
> given a reversible program that converts x to y, the same program can be
> "inverted" with constant overhead to convert y to x. While this is true,
> I'm not getting how the result that E_2(x,y) >= E_0(x,y) follows, since it
> seems like any program that leverages a shortest reversible program to
> convert in both directions would need to be able to determine whether the
> conditional input is x or y (in order to determine whether to run the
> program "forwards" or "backwards"). As far as I can tell, the length of any
> code to distinguish between two strings will in general depend on the
> length of those strings. You could try running the reversible program p
> both forward and backward in a parallel/interleaved fashion, but as far as
> I can tell you can't rule out that both directions output two different
> strings (or one direction never halts).
>
> So my first question is: can anyone explain either why the above proof
> approach works or another way of proving the inequality? I suspect there is
> something very obvious that I'm missing, but I've spent a fair amount of
> time thinking about it and asked the same question on Stack Exchange
> <https://cstheory.stackexchange.com/questions/53516/relation-between-different-definitions-of-information-distance>