A neural net with hidden layers could represent A and B, but that doesn't answer the question. Suppose that when you observe A that the next bit is 1 with probability pa, and when you observe B the next bit is 1 with probability pb. You have never observed A and B together before. Otherwise you could collect statistics on it and wouldn't need to estimate it from pa and pb.
According to algorithmic information theory (AIT), formulated independently by Kolmogorov, Solomonoff, and Levin, strings have a universal a-priori probability proportional to the number of programs that output them, where each program M with length |M| bits is weighted by 2^-|M|. This is just a formal way of stating Occam's Razor: the simplest explanation is the most likely one. To use AIT to predict
bits, we consider only programs consistent with the data so far, and ask what they would output next. The answer is dominated by the shortest programs.
Unfortunately AIT is not computable because we don't know which programs will halt. Also, the choice depends on the language we use to encode M. The language dependence is a small constant because you can always translate from one language to another by appending a compiler or interpreter that doesn't depend on the input string. However, algorithmic probability still depends exponentially on this small constant so we can't neglect it.
Here is an example of how it is used. What will be the next bit in the string 00000000000000000000000000001?
We consider two groups of programs that might have generated this string.
P1: output a bunch of 0s followed by a bunch of 1s.
P2: output 0 most of the time
with an occasional 1.
The lengths of P1 and P2 (expressed in English) are nearly the same. Thus, we conclude that the next bit is 1 with probability about 1/2.
So here is how we might apply AIT to context mixing. Consider these two programs:
M1: if input is A then output 1 with probability pa
else if input is B then output 1 with probability pb.
M2: if input is B then output 1 with probability pb
else if input is A then output 1 with probability pa.
M3: if input is A and not B then output 1 with probability pa
else if input is B and not A then output 1 with probability pb
else output 1 with probability pab. (pab = P(x|A,B)).
In M3, pab does not depend on pa or pb. It's possible but unlikely because M3 is longer than M1 or M2. Also, because M1 and M2 are the same
length, it is just as likely that pab = pa or pab = pb. Of course, normal probability theory doesn't tell you any of this.
Also, this does not tell us how to mix pa and pb. For that, we need to consider other programs, for example:
M4: output 1 with probability A*pa + B*pb.
where A and B are 1 or 0. M4 is shorter than M1 or M2 and consistent with observation, although obviously wrong if pa + pb > 1. But consider programs of the form:
M5: output 1 with probability C1 + C2*A + C3*B.
where constants C1+C2 = pa, C1+C3 = pb, C1+C2+C3 = pab. AIT prefers constants C1, C2, and C3 with short binary representations. Also, it makes no preference for C2, C3 > 0 or C2, C3 < 0. Thus, averaging over all programs of this type, we can argue that pab should be between pa and pb.
Finally, note that to represent
probabilities in the open interval (0,1) requires more bits for numbers near 0 or 1 than near 1/2. The number of bits to represent p in this range is on the order |log(p/(1-p))| but again the exact length depends on the choice of language.
This is not a very complete argument, I admit. I certainly have not considered all possible programs.
-- Matt Mahoney,
matma...@yahoo.com