Ensemble Algorithms

James A. Bowery

unread,

Sep 4, 2009, 1:53:00 PM9/4/09

to Hutter Prize

An interesting article on multi-model algorithms producing superior
predictions for data mining prize competitions raises the question:
How much longer are the prediction algorithms compared to the degree
to which their predictions could have compressed the data?

--------

Collaborative Filtering with Ensembles

One of the most interesting insights from the results of the Netflix
challenge is that while the algorithms, the psychology, and good
knowledge of statistics goes a long way, it was ultimately the cross-
team collaboration that ended the contest. "The Ensemble" team,
appropriately named for the technique they used to merge their results
consists of over 30 people. Likewise, the runner up team ("BellKor")
is a collaborative effort of several distinct groups that merged their
results. It is easy to overlook this fact, except that it is not a one-
time occurrence. The leaderboard for the recent GitHub contest also
clearly shows over half of the top ten entries as ensemble techniques!

Few people would argue against the idea of consulting several experts
on any complex subject - as the saying goes: "two heads are better
than one". Having said that, the ensemble techniques which leverage
this exact strategy are rarely a hot topic in the machine learning
communities, or at least, up until now. Given their recent success
this is likely to change, but perhaps even more importantly, their
effectiveness may actually force the 'collaborative filtering' space
to become, well, a lot more collaborative....

http://www.igvita.com/2009/09/01/collaborative-filtering-with-ensembles/

Matt Mahoney

unread,

Sep 4, 2009, 4:13:33 PM9/4/09

to hutter...@googlegroups.com

PAQ uses ensemble methods on two levels. First, it combines the predictions of lots of independent models based on different contexts and methods of induction. Second, these models were developed by lots of different people. The problem with ensemble methods is that the more predictors you have, the slower it runs. Eventually you run into the contest time limits.

Compression also differs from supervised machine learning in that the objective is not to output a prediction (0 or 1) and minimize the number of wrong guesses, but to output a probability and minimize coding cost. This is different from the Bayes optimal classifier ( http://en.wikipedia.org/wiki/Machine_learning_ensemble#Bayes_Optimal_Classifier ) which does not say how probabilities should be combined. PAQ6 uses weighted averaging. However PAQ7-8 uses weighted averaging in the log odds domain (p -> log(p/(1-p))). There isn't a good theory to explain why this works better.
-- Matt Mahoney, matma...@yahoo.com

James Bowery

unread,

Sep 4, 2009, 10:29:02 PM9/4/09

to hutter...@googlegroups.com

It might be worthwhile to run a bunch of texts to see which of them do not, in fact, do better with the log odds domain.

James Bowery

unread,

Sep 5, 2009, 8:23:42 AM9/5/09

to hutter...@googlegroups.com

Has anyone looked for a theory based on the fact that log-odds, aka logit, is the negative of the derivative of the binary entropy function?

On Fri, Sep 4, 2009 at 3:13 PM, Matt Mahoney <matma...@yahoo.com> wrote:

Eugene D. Shelwien

unread,

Sep 5, 2009, 12:06:03 PM9/5/09

to hutter...@googlegroups.com

> Has anyone looked for a theory based on the fact that log-odds, aka logit, is
> the negative of the derivative of the binary entropy function?

Actually a while ago I posted a convincing (for me) explanation
of paq mixing via likelihood optimization for a model based on
fixed probability switching of static submodels.
Even with a working demo of a completely different update function
for the mixer.

http://encode.ru/forum/showthread.php?p=7924#post7924

But guess its not "a good theory" ;)

Good luck!
- Shelwien

Matt Mahoney

unread,

Sep 6, 2009, 5:17:24 PM9/6/09

to hutter...@googlegroups.com

From: Eugene D. Shelwien <shel...@gmail.com>
> http://encode.ru/forum/showthread.php?p=7924#post7924

We ran a lot of experiments but I don't think we settled the question. If x is a random event (the next bit is a 0 or 1) and A and B are contexts, then the question is, given probabilities P(x|A) and P(x|B), what is P(x|A,B)?

Probabability theory doesn't help us, even if we assume A and B are independent. So we use empirical data. In PAQ6 we use weighted averaging of P(x|A) and P(x|B) weighed by how much evidence was used to estimate these values and how recent the evidence is. In PAQ8 we convert to the logistic domain, p -> log(p/(1-p), average (disregarding evidence), and convert back. Experimentally, PAQ8 outperforms PAQ6. But we know we can still do better because in either case we can improve compression by tuning weights to favor the most accurate models, and further still by selecting the weights based on a context independent of A and B.

James Bowery wrote:
> Has anyone looked for a theory based on the fact that log-odds, aka logit, is the negative of the derivative of the binary entropy function?

That never occurred to me. I found it attractive because it simplified the weight update equation. For a neural network, the optimal update is along the gradient of the weight vector with regard to cost. If the cost is root mean squared error then the update is:

w := w + L*(y-p)*p*(1-p)

where w is the weight, L is the learning rate, y is the desired output (0 or 1) an p is the actual output, i.e. p is the prediction (probability) for the next bit and (y-p) is the output error. If the cost is coding cost instead of RMS error, then the update simplifies to

w := w + L*(y-p).

I had worked out the partial derivative in both PAQ6 and PAQ8, and was surprised by the simplicity in PAQ8. The inverse of logit(p) = stretch(p) = ln(p/(1-p)) is squash(p) = 1/(1+e^-p), which is a commonly used squashing function in continuous neural networks. It is not the only one. Others have used arctan, but I don't know of any theoretical justification for choosing one over the other.

Without a theory, I tried some examples. Suppose P(x|A) = 0.5 and P(x|B) = 0.9999. It does not seem right to say P(x|A,B) = 0.75 because clearly B tells us more about x than A does. Converting to the logistic domain gives P(x|A,B) ~ 0.99 which seems intuitively closer to the right answer.

Perhaps the answer can be derived from algorithmic information theory. What is the shortest program that could generate x consistent with P(x|A) and P(x|B), and what will that program output when both contexts are present?

-- Matt Mahoney, matma...@yahoo.com

----- Original Message ----

From: Eugene D. Shelwien <shel...@gmail.com>
To: hutter...@googlegroups.com
Sent: Saturday, September 5, 2009 12:06:03 PM
Subject: [Hutter Prize] Re: Ensemble Algorithms

> Has anyone looked for a theory based on the fact that log-odds, aka logit, is
> the negative of the derivative of the binary entropy function?

James Bowery

unread,

Sep 6, 2009, 6:09:29 PM9/6/09

to hutter...@googlegroups.com

On Sun, Sep 6, 2009 at 4:17 PM, Matt Mahoney <matma...@yahoo.com> wrote:

From: Eugene D. Shelwien <shel...@gmail.com>

> http://encode.ru/forum/showthread.php?p=7924#post7924

We ran a lot of experiments but I don't think we settled the question. If x is a random event (the next bit is a 0 or 1) and A and B are contexts, then the question is, given probabilities P(x|A) and P(x|B), what is P(x|A,B)?

Probabability theory doesn't help us, even if we assume A and B are independent.

Isn't this what the hidden layers are for in a neural net? How is the feature A&B represented in a neural net with hidden layers and what is the theoretic justification for that representation?

Matt Mahoney

unread,

Sep 6, 2009, 10:31:16 PM9/6/09

to hutter...@googlegroups.com

A neural net with hidden layers could represent A and B, but that doesn't answer the question. Suppose that when you observe A that the next bit is 1 with probability pa, and when you observe B the next bit is 1 with probability pb. You have never observed A and B together before. Otherwise you could collect statistics on it and wouldn't need to estimate it from pa and pb.

According to algorithmic information theory (AIT), formulated independently by Kolmogorov, Solomonoff, and Levin, strings have a universal a-priori probability proportional to the number of programs that output them, where each program M with length |M| bits is weighted by 2^-|M|. This is just a formal way of stating Occam's Razor: the simplest explanation is the most likely one. To use AIT to predict bits, we consider only programs consistent with the data so far, and ask what they would output next. The answer is dominated by the shortest programs.

Unfortunately AIT is not computable because we don't know which programs will halt. Also, the choice depends on the language we use to encode M. The language dependence is a small constant because you can always translate from one language to another by appending a compiler or interpreter that doesn't depend on the input string. However, algorithmic probability still depends exponentially on this small constant so we can't neglect it.

Here is an example of how it is used. What will be the next bit in the string 00000000000000000000000000001?

We consider two groups of programs that might have generated this string.

P1: output a bunch of 0s followed by a bunch of 1s.

P2: output 0 most of the time with an occasional 1.

The lengths of P1 and P2 (expressed in English) are nearly the same. Thus, we conclude that the next bit is 1 with probability about 1/2.

So here is how we might apply AIT to context mixing. Consider these two programs:

M1: if input is A then output 1 with probability pa

else if input is B then output 1 with probability pb.

M2: if input is B then output 1 with probability pb

else if input is A then output 1 with probability pa.

M3: if input is A and not B then output 1 with probability pa

else if input is B and not A then output 1 with probability pb

else output 1 with probability pab. (pab = P(x|A,B)).

In M3, pab does not depend on pa or pb. It's possible but unlikely because M3 is longer than M1 or M2. Also, because M1 and M2 are the same length, it is just as likely that pab = pa or pab = pb. Of course, normal probability theory doesn't tell you any of this.

Also, this does not tell us how to mix pa and pb. For that, we need to consider other programs, for example:

M4: output 1 with probability A*pa + B*pb.

where A and B are 1 or 0. M4 is shorter than M1 or M2 and consistent with observation, although obviously wrong if pa + pb > 1. But consider programs of the form:

M5: output 1 with probability C1 + C2*A + C3*B.

where constants C1+C2 = pa, C1+C3 = pb, C1+C2+C3 = pab. AIT prefers constants C1, C2, and C3 with short binary representations. Also, it makes no preference for C2, C3 > 0 or C2, C3 < 0. Thus, averaging over all programs of this type, we can argue that pab should be between pa and pb.

Finally, note that to represent probabilities in the open interval (0,1) requires more bits for numbers near 0 or 1 than near 1/2. The number of bits to represent p in this range is on the order |log(p/(1-p))| but again the exact length depends on the choice of language.

This is not a very complete argument, I admit. I certainly have not considered all possible programs.

-- Matt Mahoney, matma...@yahoo.com

From: James Bowery <jabo...@gmail.com>
To: hutter...@googlegroups.com
Sent: Sunday, September 6, 2009 6:09:29 PM

Subject: [Hutter Prize] Re: Ensemble Algorithms

James Bowery

unread,

Sep 14, 2009, 11:46:57 PM9/14/09

to hutter...@googlegroups.com

On Sun, Sep 6, 2009 at 9:31 PM, Matt Mahoney <matma...@yahoo.com> wrote:

Unfortunately AIT is not computable because we don't know which programs will halt. Also, the choice depends on the language we use to encode M. The language dependence is a small constant because you can always translate from one language to another by appending a compiler or interpreter that doesn't depend on the input string. However, algorithmic probability still depends exponentially on this small constant so we can't neglect it.

This sounds close to the critique of the AIT a apriori that Solomonoff himself -- and at least in part appealing to Solomonoff's authority -- Peter Turney have put forth. To quote Solomonoff:

"This subjectivity the fact that they are based on choice of which Universal machine to use is characteristic of all prediction systems based on a priori probability distributions. The choice of Universal machine and its instruction set is a necessary parameter in the system that enables us to insert a priori information into it. The dependence of the universal distribution on choice of machine is not a Bug in the System it, too, is a Necessary Feature."

We went over this around the time of the origination of the Hutter Prize and it reared its head again last year with Turney's blog article "Ockham's Razor is Dull" which caused some spirited discussion terminating when I submitted Solomonoff's above quote, criticising it on the grounds that this emulation constant is negligible, and Turney declined to approve it, asking that the issue be dropped in favor of agreeing to disagree.

There is an important philosophical point to clarify here: Just what is the difference between your exponentiation of the emulation constant and and Solomonoff's subjectivist posture?

Matt Mahoney

unread,

Sep 15, 2009, 6:17:59 PM9/15/09

to hutter...@googlegroups.com

I don't think my position differs from Solomonoff. Given any string x, I can make its algorithmic probability arbitrarily close to 1 by choosing a language M such that M("") = x and all other strings require an arbitrarily large input. By a similar argument I can make the probability arbitrarily close to 0. One could argue that the complexity of M is at least that of x and that we should use only "simple" languages. But again you need to choose a language before you can measure the complexity of languages.

This fact doesn't invalidate Occam's Razor. It still holds regardless of your choice of language. Pick any language M. Suppose that a string x has a short encoding y such that M(y) = x and |y| << |x|. Then y also outputs prefixes x' of x. If |x'| >> |y|, then programs of length |y| that output different strings with the same prefix are necessarily rare. The probability that such a program exists is 2^(|x'| - |y|) independent of M.

I disagree with Turney on Occam's Razor. He cites some papers that discredit Occam's Razor to support his position. I looked at one of the papers by Webb. http://www.cs.washington.edu/research/jair/abstracts/webb96a.html

Webb does experiments showing that small decision trees are worse predictors than larger trees that give the same results on the training data, but fill in the blank areas based on the assumption that similar objects belong to the same class. But I disagree with his conclusion. First, the similarity assumption itself is an application of Occam's Razor because it approximates a set of points in space by a partition into polygons. That partition has a shorter description than the original data. Second, a decision tree can only divide a data set into rectangles. What Webb's algorithm does is divide the set into smaller rectangles that approximates the polygons that would be produced by a learning algorithm based on similarity such as nearest neighbor or clustering. Such a tree necessarily has a longer description, but that is not the correct application of Occam's Razor because the new decision tree is not the shortest description of the hypothesis.

Tables 2 and 3 in the paper show that pruned decision trees give consistently better predictions than unpruned trees, consistent with Occam's Razor. But Webb conveniently ignores this result. Furthermore, his algorithm weakens the effect of pruning.

-- Matt Mahoney, matma...@yahoo.com

From: James Bowery <jabo...@gmail.com>
To: hutter...@googlegroups.com

Sent: Monday, September 14, 2009 11:46:57 PM

Subject: [Hutter Prize] Re: Ensemble Algorithms

Reply all

Reply to author

Forward