Hutter's decision-theoretic argument for Solomonoff induction


Aram Ebtekar

Mar 10, 2022, 6:09:34 PM
to Algorithmic Information Theory
Hi all,

Given the scarcity of AIT researchers near me, I hope this mailing list is the appropriate place to pose my question. I'm intrigued by AIT's ability to justify the use of probabilistic inductive inference by agents who, as a result of natural selection, needed to solve decision problems.

To simplify matters, I'd like to focus on the "passive learning" setting, described in Chapter 3.4 of Marcus Hutter's book, in which the agent has no influence on the environment. The loss is a known function of the (action, observation) pair at each timestep. To give my best attempt at paraphrasing Hutter's Corollary 3.49(vii) on page 88:

Assuming that the environment samples a sequence from some enumerable distribution μ, no computable agent Λ can perform substantially better than the universal agent Λ_ξ, in terms of the μ-expected loss.
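If I recall its form correctly (hedging on the exact constants, which I'd have to double-check against the book), the bound reads roughly

    0 <= L_{1:n}^{Λ_ξ} - L_{1:n}^{Λ_μ} <= D_n + sqrt(4 L_{1:n}^{Λ_μ} D_n + D_n^2),   with   D_n <= ln(w_μ^{-1}) <= K(μ) ln 2,

where the losses are μ-expectations. Since Λ_μ minimizes μ-expected loss, the same bound with L_{1:n}^{Λ_μ} replaced by L_{1:n}^{Λ} also bounds L_{1:n}^{Λ_ξ} - L_{1:n}^{Λ} for any rival agent Λ, which is what gives the "no agent can do substantially better" reading.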

Now, I'm wondering whether this statement can be strengthened as follows. Firstly, I'd like to remove the assumption that the Universe samples from a nice distribution, and instead allow it to produce an arbitrary sequence. This risks running into the No Free Lunch Theorem, but it still seems reasonable to hope that we do as well as any computable agent. And now that the sequence is arbitrary, we can do away with μ and consider the actual loss, instead of taking its μ-expectation. To paraphrase the resulting statement:

With no restrictions on the sequence chosen by the environment, no computable agent Λ can perform substantially better than the universal agent Λ_ξ, in terms of the actual loss on the same sequence.

As usual, error terms should depend on the rival agent Λ, but not on the chosen sequence. The precise statement would be the same as Hutter's, but with arbitrary sequences and actual losses instead of expectations.

Is it known whether this stronger statement is true? If so, Occam's razor would be justified even in the presence of some uncomputable structure, as long as agents are required to be computable. It would also strengthen the AIT interpretation of probability: the modified statement never mentions probabilities or expectations, except that the universal agent reasons in terms of ξ-expectations, ξ being the universal prior. Thus, probability becomes an emergent concept for decision-making agents.

If the modified statement turns out to be false, I'd like to understand why as well.
Thanks for your time!
--
Aram

Marcus Hutter

Mar 13, 2022, 2:34:11 PM
to ai...@googlegroups.com

Hi Aram,

Thank you for your question.
* What you are after is in general known under the names 'Prediction with Expert Advice' or 'Online Learning',
  which makes no assumptions on the observation sequence but restricts the class of predictors,
  e.g. to the (countable) class of computable predictors.
* I briefly mention this in my book in Sec.3.7.4 and expand on it in http://www.hutter1.net/ai/bayespea.htm
* Under log-loss you can continue to use Bayesian mixture predictors (a small illustrative sketch follows after this list),
  but for other losses you have to combine the predictors=experts somewhat differently to achieve good bounds.
  See e.g. http://arxiv.org/abs/cs.AI/0504078 and references therein.
* This can be extended to acting agents, but is ugly: http://dx.doi.org/10.1007/11564089_28
* And here's an experimental comparison: http://arxiv.org/abs/cs.LG/0508073
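For concreteness, here is a minimal sketch of the log-loss case (not taken from any of the above references): a Bayes mixture over a small, finite stand-in set of experts, with placeholder prior weights standing in for the 2^-K(.) weighting over computable predictors.

    def bayes_mixture_predict(experts, weights, bits):
        """Sequentially predict P(next bit = 1) with a Bayes mixture of experts.

        experts: functions mapping a history (list of bits) to their P(next = 1).
        weights: prior weights (stand-ins for ~2^-K(expert)); need not be normalized.
        bits:    the observed binary sequence.
        Returns one mixture prediction per time step.
        """
        w = list(weights)
        preds = []
        for t, x in enumerate(bits):
            history = bits[:t]
            p = [e(history) for e in experts]               # each expert's P(next = 1)
            mix = sum(wi * pi for wi, pi in zip(w, p)) / sum(w)
            preds.append(mix)
            # Bayes update: multiply each weight by the probability it assigned to
            # the bit that actually occurred; under log-loss this is what keeps the
            # mixture's regret bounded by the log of the best expert's prior weight.
            w = [wi * (pi if x == 1 else 1.0 - pi) for wi, pi in zip(w, p)]
        return preds

    # Toy usage: two constant experts and a Laplace-rule expert.
    experts = [lambda h: 0.9, lambda h: 0.1, lambda h: (sum(h) + 1) / (len(h) + 2)]
    print(bayes_mixture_predict(experts, [0.25, 0.25, 0.5], [1, 1, 0, 1, 1, 1]))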

Best regards,

Marcus

More information: http://www.hutter1.net/ait.htm

Aram Ebtekar

Mar 16, 2022, 11:24:30 PM
to Algorithmic Information Theory
Thank you for the excellent references!

I believe what I'm looking for is an intermediate level of generality, between the Bayesian mixture-of-environments and the Expert Advice settings described in your technical report http://www.hutter1.net/ai/bayespea.htm

If I understand correctly, the Expert Advice setting generalizes the former in two ways: (1) it makes the sequence arbitrary (rather than sampling from μ), and (2) it considers arbitrary weighted collections of experts (rather than the universal mixture of computable agents). In this setting, only randomized agents can achieve a regret of sub-linear order.

I'd like to make the generalization (1), but not (2). That is, I'm wondering whether the deterministic agent Λ_ξ achieves sub-linear regret on arbitrary sequences, where the "experts" are simply the computable agents, weighted by complexity. Oscillating sequences are less problematic here than in the general experts setting, since the universal mixture includes oscillating strategies.

As you mentioned, Λ_ξ achieves bounded regret under the log-loss, when the action space is the probability simplex over the input alphabet. It seems to me that an immediate consequence of the PEA regret bounds is that this result generalizes to any action space that is convex in the following sense: for every random mixture of actions A and B, there exists a third action C whose loss is at most the mixture's expected loss. In that case, C essentially emulates a randomized strategy. However, I haven't yet figured out what happens under general non-convex action spaces, such as discrete prediction with the 0-1 loss.
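To make the emulation explicit, here is a minimal sketch of the standard exponentially-weighted-average forecaster (generic PEA machinery, not Λ_ξ itself; the experts, weights and learning rate are placeholders): when the action set is convex and the loss is convex in the action, playing the weighted average of the experts' actions is, by Jensen's inequality, no worse than sampling an expert from those weights, so the randomized regret bound carries over to this deterministic rule.

    import math

    def ewa_play(expert_actions, cum_losses, eta):
        """Deterministic exponentially weighted average forecaster (one round).

        expert_actions: each expert's proposed action this round, as points of a
                        convex action set (here: numbers in [0, 1]).
        cum_losses:     each expert's cumulative loss so far.
        eta:            learning rate.
        By Jensen's inequality, when the loss is convex in the action, the loss of
        this weighted-average action is at most the expected loss of sampling an
        expert from the same weights, so the randomized PEA bound transfers.
        """
        w = [math.exp(-eta * loss) for loss in cum_losses]
        return sum(wi * a for wi, a in zip(w, expert_actions)) / sum(w)

    # Toy usage: three experts proposing actions in [0, 1].
    print(ewa_play([0.0, 0.5, 1.0], [3.0, 1.0, 2.0], eta=0.5))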

What fascinates me about this setting is that it places probability theory purely in the agent's mind: neither the sequence nor the actions are randomized, and yet Λ_ξ works by taking the argmax of ξ-expected reward.

Best regards,
--
Aram
Message has been deleted
Message has been deleted

Aram Ebtekar

Mar 17, 2022, 4:50:26 AM
to Algorithmic Information Theory
Oops, I just realized that my reply slightly mischaracterized the Bayesian mixture-of-environments setting, saying that it measures regret with respect to computable agents, when in fact it measures regret with respect to the optimal agent. So, I suppose (2) is really a specialization rather than a generalization. My question specializes even further, by using as experts the set of all computer programs (perhaps assigning the maximum per-timestep loss to those that fail to halt with a valid action).

So to summarize again, on arbitrary sequences, I'd like to bound the regret of some deterministic agent (I'm especially interested in Chapter 3.4's Λ_ξ, though strategies from the PEA literature may be even more relevant) with respect to these computable agents. I'm under the impression that this question has a positive answer for "convex" action spaces, since they can emulate randomized PEA strategies, but is open in the general case. I'd love to hear your thoughts.

P.S. Please pardon the deleted messages above, resulting from my edits to this one. If anyone points out additional errors or ambiguities on my part, I'll be sure to address them.

Marcus Hutter

Mar 18, 2022, 8:47:24 AM
to Aram Ebtekar, ai...@googlegroups.com

If you make the action space (e.g. {0,1}) convex (e.g. [0,1]),
you also have to specify the loss function for this expanded action space.
If you do this in the natural linear way, you can avoid randomization,
but for non-convex action space or non-convex loss function, randomization is unavoidable.
I seem not to be able to find a good reference for this.
Very roughly, if the loss function is convex and smooth at the minimum,
you can even get a regret of O(log T) instead of O(√T).
https://doi.org/10.1111/j.1751-5823.2001.tb00457.x


Tor Lattimore

Mar 18, 2022, 9:11:59 AM
to Marcus Hutter, Aram Ebtekar, ai...@googlegroups.com
Hi,

Perhaps I am missing something in the question, but it is known that no deterministic agent can enjoy sublinear regret relative to even the constant strategies with the {0,1} loss and {0,1} actions. Consider the case of binary prediction, so your agent must output a prediction in {0, 1} in every round. Since your agent is deterministic there exists a sequence on which it makes a mistake every round. And yet one of the constant (hence computable) predictors (predicting all 0s or all 1s) will make only T/2 mistakes over T rounds. Hence you have a regret of at least T/2.
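In code, the construction is just the following (illustrative sketch; any deterministic predictor can be plugged in for predict):

    def adversarial_sequence(predict, T):
        """Sequence on which a given deterministic 0/1 predictor errs every round.

        predict: deterministic function from the history (list of bits) to {0, 1}.
        Returns (sequence, predictor_mistakes, best_constant_expert_mistakes).
        """
        x = []
        for _ in range(T):
            x.append(1 - predict(x))           # always the bit the predictor did not pick
        ones = sum(x)
        return x, T, min(ones, T - ones)       # predictor: T mistakes; best constant: <= T/2

    # Toy usage with a majority-vote predictor: regret is at least T/2.
    print(adversarial_sequence(lambda h: 1 if 2 * sum(h) > len(h) else 0, T=10))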

Apologies if this simplistic clarification is missing something from your question. And a minor comment on Marcus's notes about logarithmic regret: as far as I know, this generally needs some sort of curvature, either of the loss function (i.e. strong convexity or exp-concavity; see Hazan's online learning book for basic results) or of the action space (https://arxiv.org/abs/1702.03040).
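For what it's worth, the algorithm behind the logarithmic rate in the strongly convex case is just online gradient descent with step size 1/(alpha*t); a minimal sketch along the lines of the analysis in Hazan's book (the toy loss at the end is my own placeholder):

    def ogd_strongly_convex(grad, project, x0, alpha, T):
        """Online gradient descent with step size 1/(alpha*t).

        grad:    function (t, x) -> gradient of the round-t loss at the played point x.
        project: projection back onto the convex action set.
        x0:      initial action.
        alpha:   strong-convexity parameter of the losses.
        With bounded gradients and alpha-strongly convex losses, this achieves
        O(log T) regret rather than O(sqrt(T)).
        """
        x, plays = x0, []
        for t in range(1, T + 1):
            plays.append(x)
            x = project(x - grad(t, x) / (alpha * t))
        return plays

    # Toy usage: losses l_t(x) = (x - 0.7)^2 on [0, 1], so grad = 2*(x - 0.7), alpha = 2.
    clip = lambda v: min(1.0, max(0.0, v))
    print(ogd_strongly_convex(lambda t, x: 2 * (x - 0.7), clip, 0.0, alpha=2.0, T=5))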

Best,

Tor




Aram Ebtekar

Mar 25, 2022, 1:12:58 AM
to Algorithmic Information Theory
You're both right, thanks so much!

Among all universal strategies in the PEA literature, I'm specifically interested in Λ_ξ because it does more than choose actions: it makes truth claims about the world. Of course, "truth" must be interpreted somewhat liberally, since the actual sequence may have ξ-measure zero. Tor's counter-example is especially challenging: it seems the next bit's ξ-probabilities converge to 50-50, and yet Λ_ξ chooses the wrong action 100% of the time, right? Nonetheless, even in such cases, I wonder if the ξ-probabilities have some decision-theoretically relevant interpretation:

1) Inspired by the PEA literature, maybe a softmax randomized variant of Λ_ξ can achieve low regret? That is, choose actions with probability proportional to β^(ξ-expected loss); a rough sketch of this rule follows after item 2. Or,

2) maybe ξ is a good predictor of the losses attained by any *computable* strategy Λ (so Λ_ξ itself is excluded)? Concretely, the regret would be measured with respect to computable adversaries that try to predict Λ's loss at each step, and are penalized by the absolute difference. On second thought, maybe this won't work since the absolute difference lacks curvature...
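For idea 1), the sampling rule I have in mind looks like the following sketch (just the rule itself, with a placeholder for the ξ-expected loss, which in reality is only limit-computable; I'm not claiming any regret bound for it):

    import random

    def softmax_action(actions, xi_expected_loss, beta):
        """Sample an action with probability proportional to beta ** (xi-expected loss).

        actions:          the available actions.
        xi_expected_loss: placeholder for the xi-expectation of each action's loss
                          (xi itself is not computable, so in practice this would be
                          some approximation).
        beta:             a constant in (0, 1); smaller beta concentrates more
                          sharply on the apparently best action.
        """
        w = [beta ** xi_expected_loss(a) for a in actions]
        r = random.random() * sum(w)
        for a, wi in zip(actions, w):
            r -= wi
            if r <= 0:
                return a
        return actions[-1]

    # Toy usage: binary actions with made-up loss estimates.
    print(softmax_action([0, 1], lambda a: 0.3 if a == 0 else 0.7, beta=0.5))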

I'll have to think about this more, but please let me know if answers or similar interpretations already exist. Ultimately, my motivation for attaining a firmer understanding of the philosophical foundations is that I want to turn next to practical problems in human and machine learning. Traditional statistical learning theory can tell us when a hypothesis class is sufficiently small to make a good selection within that class, but it doesn't tell us how to find the hypothesis class in the first place. For that, we usually rely on the implicit priors in our human judgment. Occam's razor, formalized by AIT, helps us complete the theory.

While exact Solomonoff induction is infeasible, we've seen that it dominates the performance of all practical algorithms, and thus should provide insight into which algorithms we should use. For example, it suggests that we should find ways to approximate AIXI. I want to try something closely related but a lot less ambitious, which is to apply the AIT lens to popular techniques that are already in use, and see if this inspires novel tweaks or extensions to those techniques.

As a simple example: why are scientific studies considered more persuasive if the hypothesis is chosen in advance? I believe the answer is, roughly speaking, that pre-commitment ensures that some function represented in the experimenter's brain can output the hypothesis, when given only old observational data as input. In other words, pre-commitment serves as proof that K(hypothesis | old data) is very low. Via the approximate identity K(data) = K(old data) + K(new data | old data), we're able to leverage the human brain as a blackbox to improve our estimate of Solomonoff's posterior. I hope that more detailed and rigorous analyses of this type would yield fresh insights on the scientific method, as well as everyday inductive reasoning and machine learning!

Are others interested in or actively working on such problems? Currently, I'm lacking a proper collaboration model for this research agenda.
Message has been deleted

David Dowe

Mar 4, 2023, 7:57:59 AM
to Algorithmic Information Theory, Aram Ebtekar
Hi, Aram et al., and I hope that I have correctly understood your questions - and that all are well.


With regard to Aram's more recent question of "What sort of x would maximize the gap between this estimate and the true count, ...",
I believe that there is some related literature.
Michael Brand and David L. Dowe (2017), "The IMP game: Learnability, approximability and adversarial learning beyond $\Sigma_1^{0}$",
Journal of Logic and Computation, vol. 27, no. 7 (October 2017): 2171-2192 https://doi.org/10.1093/logcom/exw031  
(which we'll call Brand and Dowe (2017) - with earlier version at https://arxiv.org/pdf/1602.02743.pdf, and
we'll call the earlier version Brand and Dowe (2016)) - seems to give something related.
Sec. 2, p. 6 of Brand and Dowe (2016) gives some history of what I variously called the "elusive model paradox" or
"red herring sequences", where the next bit is chosen to be
    (in one version) the negation/flip of the bit arising from the most probable (minimum message length, MML) model
    or (in another version) the least probable (rather than the most probable) bit.
Various authors who have presented this in one guise or another would appear to include M Scriven (1965),
D K Lewis and Jane Shelby Richardson (1966), A P Dawid (circa 1985), D L Dowe (2008a-b, 2011), and
R J Solomonoff (2009 and posthumous, 2010).  Another paper that mentions such work is, e.g.,
J Hernandez-Orallo et al. (2011).

As for the continuation of Aram's question "... and how large can the gap be?", I will perhaps answer partly.
As in Brand and Dowe (2016, 2017), doing such a diagonalisation (and then choosing the least probable
bit rather than the most probable bit) - as described in the paragraph above - can lead us to something of
greater complexity.  I currently see no reason why an unrestricted string x could not continue indefinitely
with the "wrong"/unexpected bit at each stage.  However, that does not seem sufficient in itself for a large gap
between the estimate and the true count, as the various probabilities might conceivably be very close to 0.5.
Put another way, if the probabilities are all hypothetically 0.499999ish and 0.500001ish, even if we are
choosing below 0.5 when we should choose above 0.5 and vice versa, that does not necessitate a large gap.

A close examination of the Solomonoff (1978) convergence results might help.


With regard to Aram's earlier question "why are scientific studies considered more persuasive if the hypothesis is chosen in advance?"
from Thursday, March 24, 2022 at 10:12:58 PM UTC-7 (below from March 2022), I would answer as follows.
Let U1 be a UTM and p_{U1} be a corresponding universal probability distribution.
Let T2 be a TM, p_{T2} be a corresponding probability distribution, and let q be recursive s.t. 0 < q <= 1.
(If T2 is a UTM then we can allow q to be 0 and in turn relax matters slightly to 0 <= q <= 1, q recursive.)
Then p = q p_{U1} + (1-q) p_{T2} should also be a universal probability distribution.
Choosing the hypothesis in advance is tantamount to choosing q < 1 and (1-q) > 0.
By increasing the prior probability (1-q) of some one or more hypotheses, we can still maintain universality.
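Writing out the standard dominance argument behind "should also be":

    p(x) = q p_{U1}(x) + (1-q) p_{T2}(x) >= q p_{U1}(x) >= q c_{\nu} \nu(x)

for every enumerable semimeasure \nu, where c_{\nu} > 0 is the constant with which the universal p_{U1} multiplicatively dominates \nu. Since q > 0 is a fixed recursive constant, p dominates every such \nu as well, and so remains universal; the weight (1-q) merely boosts the effective prior probability of the pre-specified hypothesis represented by T2.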


I hope both that I've understood correctly and that the above helps.

Keep well,  cheers  and  sincerely    from David D
 
---------- Forwarded message ---------
From: Aram Ebtekar <arame...@gmail.com>
Date: Sat, 4 Mar 2023 at 10:34
Subject: Re: [AIT] Hutter's decision-theoretic argument for Solomonoff induction
To: Algorithmic Information Theory <ai...@googlegroups.com>


Related to Tor's counterexample and my idea #2 above, now I have the following question:

Using a monotone Turing machine, fix a universal distribution ξ over binary sequences. On a given binary sequence x, consider Solomonoff's sequence of next-bit probabilities, the n'th of which is ξ(x_n = 1 | x_1, ..., x_{n-1}). For sufficiently long x, will the sum of these probabilities always closely approximate the true number of 1's in x? What sort of x would maximize the gap between this estimate and the true count, and how large can the gap be?

All my questions in this thread came from an effort to strengthen the connection between AI and AIT. I can explain the motivation further if anyone likes, and any help would be appreciated :)

jabowery

Mar 4, 2023, 10:36:04 AM
to Algorithmic Information Theory
The motivation for the title's conjecture arises with the increasing public concern, and confusion, over the definition of "bias" in large language models.

I've looked but have been unable to discover any work in the field of "algorithmic bias" that applies Algorithmic Information Theory to the identification of "bias", in the scientific sense*, let alone its meta-measurement, given a bit string of passive measurements.  

How would one go about doing a literature search for prior scholarship on this conjecture?  How would one phrase the conjecture in the language of AIT?

*The majority of the concern over "algorithmic bias" in large language models refers to unrestricted curation of text corpora resulting in those models reflecting not only a "biased" utility function in, say, an AIXI agent's use of Sequential Decision Theory, but also, and even more critically, "bias" in terms of the accuracy of the resulting algorithmic model of reality, yielding inaccurate predictions. Leaving behind the utility function notion of "bias" (SDT) and focusing on the scientific notion of "bias" (AIT), one can easily recognize how the scientific community detects bias in its measurement instruments with highly redundant cross-checks, not just between measurement instruments of the same phenomena, but also cross-discipline checks for consistency via unified theories.

An extreme but simple example would be a world in which all thermometers were manufactured by the same company that, for some perverse reason, reports 101C at sea level for the boiling point of water but for all other temperature measurements reported the normal Celsius temperature. Cross-disciplinary checks with other kinds of measurements would result in a minimum algorithmic description that reified the identity of the thermometer company latent in the data as having a particular measurement bias -- and quantify that bias to be 1C (for water) so that thermometer measurements in general could be optimally predicted.

Aram Ebtekar

Mar 6, 2023, 3:46:36 PM
to Algorithmic Information Theory
Thanks David! I don't know why my last message shows up as deleted on the Google group, maybe it's a glitch or an accident...

I haven't finished going through your references yet, but I'd like to share a few thoughts. First, the adversarial sequence (i.e., Tor's counterexample) will indeed make the predictor ξ converge to 0.5. We can see this by noting that with every additional bit, the adversarial sequence's algorithmic ξ-probability is cut to at most half. In particular, ξ(x_{1:n}) <= 2^-n. On the other hand, we can't cut it by much more than half, too many times, because the program "PRINT x" has length n + O(1), which implies ξ(x_{1:n}) > 2^(-n - O(1)).

Now, let x be an arbitrary sequence with large n = |x|. Let f be the true fraction of 1s in x. Then, instead of the "PRINT x" program, we can use arithmetic coding to ensure that ξ(x) > 2^(-n*H(f) - O(1)), where H(f) = -f*log(f) - (1-f)*log(1-f). I should probably include another low-order term for specifying f with sufficient precision. Anyway, now we want to know if the sum of ξ predictions can be made far from n*f. If ξ(x_i | x_{1:i-1}) converges to a limit, then this limit must be f, for otherwise the algorithmic probability ξ(x) would again be too small. So, if there's a way to break the estimate, it must involve considerable fluctuations in ξ. For example, maybe it correctly predicts something very close to 0 whenever the true bit is 0, but is more confused (perhaps predicting something close to f) when the true bit is 1. In this case, x retains sufficient ξ-probability to avoid contradicting our bound, while also biasing the sum of estimates to about n*f^2 instead of n*f.
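A quick numeric sanity check of that hypothetical scenario (illustrative numbers only):

    import math

    # Hypothetical scenario: a fraction f of the bits of x are 1; the predictor
    # outputs probability ~0 of a 1 whenever the true bit is 0, and probability ~f
    # whenever the true bit is 1.
    n, f, eps = 10000, 0.25, 1e-3
    ones = int(n * f)

    sum_of_predictions = ones * f + (n - ones) * eps        # ~ n*f^2 rather than n*f
    # Code length (log-loss, in bits) that this predictor assigns to x:
    code_length = ones * -math.log2(f) + (n - ones) * -math.log2(1 - eps)
    entropy_bound = n * (-f * math.log2(f) - (1 - f) * math.log2(1 - f))

    print(sum_of_predictions, n * f ** 2)   # the count estimate is biased down to ~n*f^2
    print(code_length, entropy_bound)       # yet the code length stays below n*H(f), so the
                                            # arithmetic-coding lower bound on xi(x) is respected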

But now things get really weird: if ξ~0 on the 0s and ξ~f on the 1s, then an oracle for ξ(x_i | x_{1:i-1}) would let us perfectly predict the 0s and 1s in x. Fortunately, ξ is not computable; nonetheless, by dovetailing, we can approximate it in the limit. Consider the predictor that guesses the next bit depending on whether a dovetailed estimate of ξ, after running for a very long time, rises substantially above 0. If this program is successful, then it would give very high algorithmic probability to x, in contradiction to the supposition that ξ~f on all the 1s. But for this program to be unsuccessful, it must be the case that ξ typically takes an incredibly long time, longer than any computable function, to assign substantial probability to the 1s. Can this really happen?

Regarding your answer about scientific studies, it makes sense and is a little different from the one I had in mind. I'd like to return to it later if that's ok.

Aram Ebtekar

Mar 7, 2023, 3:37:34 PM
to Algorithmic Information Theory
I'm aware of some subtle issues with non-halting programs in the monotone Turing machine model. My question didn't specify whether I meant unnormalized ξ-probabilities over finite and infinite strings, or the normalized version that only considers infinite strings. Each case brings its own complications: in the former, the next bit could be 0, 1, or non-existent if we've reached the end of a finite output. On the other hand, the latter case can't be approximated from below. Intuitively, for fixed x, I would expect that as i ranges from 1 to n, only rarely will the end-of-string probability at the i'th bit be far from zero. That is, sum_i  ξ(end | x_{1:i-1}) should grow sublinearly with n = |x|, and therefore I'm hoping it won't matter which definition we use. But we're talking about a potentially worst-case x, so I don't know how badly ξ might behave... is there more literature on worst-case behavior of ξ?

Rephrasing my main question for reference, since the original is erased:
On an arbitrary binary sequence x, consider Solomonoff's sequence of next-bit (normalized or unnormalized) probabilities p_i(x) = ξ(x_i = 1 | x_{1:i-1}). What is sup_x | sum_{i=1}^n (x_i - p_i(x)) | in terms of n?

Aram Ebtekar

Mar 8, 2023, 2:04:50 AM
to Algorithmic Information Theory
Oops, I replied on the Google thread but forgot to change back the subject line. I think Jim Bowery made the opposite mistake (changing the subject line but not the Google thread). Let me try a more suitable subject line here. Apologies for the confusion...