Abram Demski wrote:
> As I said, I'm also interested in the notion of probability. I
> disagree with Solomonoff's universal distribution
> (http://en.wikipedia.org/wiki/Ray_Solomonoff), because it assumes that
> the universe is computable. I cannot say whether the universe we
> actually live in is computable or not; however, I argue that,
> regardless, an uncomputable universe is at least conceivable, even if
> it has a low credibility. So, a universal probability distribution
> should include that possibility.
This is my position as well: there is no reason to assume that the universe
must be computable, as Solomonoff's universal distribution does. For a while
I tried to learn enough logic to see what a truly universal distribution
would look like. But then I found an argument, related to Berry's Paradox,
that seems to show that any attempt at a truly universal distribution must
fail. (From Wikipedia: The Berry paradox is a self-referential paradox
arising from the expression "the smallest possible integer not definable by
a given number of words.")
The argument is very simple. Consider an arbitrary probability distribution
P, and the smallest integer (or the lexicographically least object) x such
that P(x) < 1/3^^^3 (in Knuth's up-arrow notation). Since x has a short
description, a universal distribution shouldn't assign it such a low
probability, but P does, so P can't be a universal distribution.
(I've stated variants of this argument before at  and , but this
version seems to be more general.)
Is there a way around this? If not, how should we (or an AI) deal with
uncertainty without being able to use a universal distribution as the prior?
 "is induction unformalizable?"
That's an interesting perspective. I agree that a human being, assuming he
is computable, couldn't beat Solomonoff's universal distribution in a
prediction game. He might come to believe that the universe has uncomputable
physics, but that doesn't seem to help him gain an advantage over
But what if the human could augment himself with uncomputable physics, and
start thinking uncomputable thoughts? He could then have an uncomputable
prior, which could possibly beat Solomonoff's universal distribution. Can a
Bayesian agent do the same thing? My initial thought is "no", since a
Bayesian isn't supposed to change his prior. But a Bayesian agent can
perhaps build external calculating devices (e.g. hypercomputers) to help
himself, and maybe the combined system of agent+hypercomputer can be modeled
as an extended agent with an uncomputable prior. I'll have to think about
this some more...
Did you post your previous reply somewhere?
> Right now, my first thought is that the lexographically least object
> such that P(x) < 1/3 ^^^ 3 is ill-defined.
But that object is well-defined for any particular P that's well-defined,
> Clearly it will not have a place on the Tarski hierarchy, since it
> refers to all levels.
I might be out of my depth here, but it seems to me that the
lexicographically least object x such that P(x) < 1/3 ^^^ 3 belongs to at
most one level above P on the Tarski hierarchy. Does the trouble start if we
try to quantify over all well-defined P and conclude that a well-defined
universal distribution doesn't exist? If so, we're still left with an easy
way to shoot down any particular well-defined candidate someone might
propose for a universal distribution.
BTW, I've found some academic research into whether it's ok to quantify over
"indefinitely extensible totalities". See section 11 of
for example. I'm not sure if any of that helps here though.
Yes, I see that we had an email conversation on this topic last June. I
apologize for my poor memory.
> Does the paradox rely critically on an assumption that description
> length should be the only factor? Or can you derive the paradox just
> from the assumption that the prior should dominate other priors?
Ok, here's a version based on dominance: for any probability distribution P
definable at level L of the Tarski hierarchy, there exists a P' at level at
most L+1 which it doesn't dominate. Let P'(x) = 1/3^n if x is the
lexicographically least object such that P(x) < 1/3^^^n, otherwise P'(x) =
c*P(x), with c chosen so that P' sums to 1.
> So, what I am saying is that I do not agree with *current* solutions
> to these problems, but for those solutions, it looks like everything
> carries over just fine to your paradox. So it seems as if any *actual*
> solution will have the same property.
Yes, I'm hoping that's the case too, which is why I'm here. :)
> On the other hand, a solution might still fall prey to your paradox,
> but give a good story as to why the situation is acceptable. This
> seems fairly plausible, too.
I'm still thinking through implications of Eliezer's response, which seems
to say that the situation is acceptable. Does his response carry over to the
I have found some discussion about how human beings might be able to
recognize a halting oracle:
> "First-order logic does not necessarily assume that the domain of
> discourse is computable, but rather that the logic must be run on a
> computer." Something like that?
But what's the counterpart of Solomonoff's Universal Distribution in this
analogy? How about this instead: "Consider a language L that contains all
truths that can be proved by computable minds. We know that there are truths
that L does not contain, but we can never prove any of them so there is no
point trying to extend L any further (assuming we are computable)." Can we
define this L formally?
Damn my poor memory again. I just remembered that I already came up with an
example where a human being can beat Solomonoff's universal distribution.
It's at http://www.overcomingbias.com/2008/11/complexity-and/. But Eliezer
wasn't convinced that the game was fair.
Let me try a different example here. Consider a hyperintelligent being (i.e.
someone who has access to a hypercomputer) who wants to find a
(hypercomputable) input sequence and a computable algorithm that will
humiliate Solomonoff's universal distribution: the computable algorithm will
predict the sequence better than the UD, by an infinite margin. The game is,
in each round each agent will output a single prediction for the next
symbol. UD loses the game if the computable algorithm's number of correct
predictions minus its number of correct predictions goes to positive
infinity. Does this seem fair?
To make things simple, the HI chooses the computable algorithm to be "always
predict 0". And it chooses the input sequence by simulating the UD-based
predictor: if the predictor predicts 0 as the next symbol, then it chooses
1, otherwise it chooses 0. Now consider what happens. The computable
algorithm gains a point whenever the UD predicts a 1. And that will occur an
infinite number of times, because every time it predicts a 0 it will see a
1, and it will eventually predict a 1 after seeing enough 1s. (I can try to
prove this if it's not intuitively obvious to you.)
So, Solomonoff's universal distribution is guaranteed to beat (technically,
not lose by more than a finite margin) any computable distribution, *if the
input distribution is computable*, otherwise it can be beaten by a
computable distribution. Maybe this explains why we failed to "carry over"
Eliezer's response to the other paradoxes.
Solomonoff's original proof was over probabilities, but other proofs are
over the number of errors. From http://www.hutter1.net/ai/perrbnd.pdf:
Another natural question is to ask for relations between the total number of
errors E_ξ in Solomonoff prediction and the total number of prediction
errors Eμ in the
informed scheme. Unfortunately  does not bound E_ξ in terms of Eμ in a
way. For example it does not exclude the possibility of an infinite E_ξ
even if Eμ is finite.
Here we want to prove upper bounds to E_ξ in terms of Eμ ensuring as a
the above case cannot happen.
> So what we want to do is establish that in the limit, and
> regardless of the environment, whether the environment is computable
> or not, there is no computable algorithm whose accumulated log-score
> beats the Solomonoff distribution - an (uncomputable) weighted mixture
> of all computable mixtures. And then the proof seems relatively
> trivial, since any computable algorithm has a finite weight in this
> mixture. Indeed the difference in their scores is bounded by (the log
> of) the initial finite weight, unless I've missed something.
I think you're right, but only if the computable algorithm isn't allowed to
use noncomputable notation to represent its beliefs. Suppose a human
believes that the next symbol is 0 with probability 1/BB(100), and the
payoff for guessing 0 correctly is BB(101), then he can decide to bet on 0
without actually being able to compute BB(100). So I see no reason why such
beliefs should not be allowed. But if you do allow such beliefs, then the
Solomonoff distribution is no longer guaranteed to be the best
So, given that the Solomonoff distribution is not guaranteed to be best
either epistemically or in decision making, there seems to be no reason to
stop looking for better alternatives.
Does that clear it up?
(Of course in general he's likely to have choices C and D such that he can't
prove either EU(C) > EU(D) or EU(D) > EU(C). How to make decisions in cases
like that would depend on a theory of how to handle mathematical
uncertainty, which I don't have...)
Eliezer asks for more natural examples. That's reasonable and something I'll
So consider a game where in each round the player is asked to bet on 0 or 1.
The payoff for 0 in round i is equal to 1 minus the i-th bit of the sequence
of unary encodings of BB(1), BB(2), BB(4), ..., and the payoff for 1 is
It seems clear that for a human, except for the first few rounds where he
has a reasonable chance of guessing BB(2^n) correctly, the best strategy is
to always bet on 1. Consider the situation after he sees BB(2^n). He knows
that if he keeps betting on 0, by the time he gets a payoff of 1, he could
have gotten a total payoff of BB(2^(n+1))/2^100 by betting on 1 instead.
The uncomputable probability distribution that the human is using here is
P_i(1) = i-th bit of BB(1), BB(2), BB(4), ... (This is different from what I
suggested before.) He doesn't know how to compute this, obviously, nor the
expected utility of betting on 0 in any particular round, but he can prove
that the strategy "always bet on 1" has a higher expected payoff than
"always bet on 0" and it is in fact better than any other computable
What about the agent using Solomonoff's distribution? After seeing
BB(1),...,BB(2^n), the algorithmic complexity of BB(1),...,BB(2^n) is sunk,
so to speak. It will predict a higher expected payoff for playing 0 in any
round i where the conditional complexity K(i | BB(1),...,BB(2^n)) < 100.
This includes for example 2*BB(2^n), 2*BB(2^n)+1, BB(2^n)^2 * 3 + 4,
BB(2^n)^^^3, etc. It will bet on 0 in these rounds (erroneously, since
K(BB(2^(n+1)) | BB(2^n)) > 100 for large n), and therefore lose relative to
Is this example more convincing?
Oh, in case there's objection that the frequency of Solomonoff's
distribution losing goes to 0 too quickly in this game, we can replace the
sequence BB(2^n) by some other mathematically well-defined sequence S(n)
such that K(S(n+1) | S(1), ..., S(n)) > 100, but S(n) doesn't increase so
quickly. For example, let S(n+1) be the smallest integer such that K(S(n+1)
| S(1), ..., S(n)) > 100. This sequence increases slowly, since S(n) + a
random 101-bit number has conditional complexity more than 100, so S(n+1)
can't be bigger than S(n)+2^101.
I recalled that Jürgen Schmidhuber's Gödel Machine already contains ideas
that are very similar to the above. See specifically question 4, page 22 in
ftp://ftp.idsia.ch/pub/juergen/gm6.pdf. So I'll recap how my position
differs from his:
1. A Gödel machine starts with an initial strategy and tries to find
provable improvements to it. Since humans can handle mathematical
uncertainty, I think machines should be able to as well (even if we don't
know how yet). Therefore I suggest that a machine should look for *likely*
improvements that are not necessarily provable.
2. A Gödel machine has fixed "environment axioms", essentially a
generalization of the Bayesian prior. Schmidhuber writes "All such prior
assumptions are perfectly formalizable in an appropriate A (otherwise we
could not write scientific papers about them)." But as I pointed out earlier
in this thread, we don't know how to formalize Occam's Razor, if we
interpreted it as something like "the environment likely has a short