new ideas for exemplar selection


Moshe Looks

Jan 23, 2009, 5:48:18 PM
to plop...@googlegroups.com
Hi All,

For the past few days I've been working on the problem of exemplar
selection for moses/plop: given a set of N promising programs for
solving some task, how do we choose which one to use as the focus for
intensive optimization in the search for even better programs? I
discovered in my dissertation work on moses that, as expected, the
naive approach of always picking the program with the highest score
(i.e. fitness) doesn't work very well on many problems because of
deception in the search space - directions that seem promising end up
leading us away from superior programs (and instead lead us to many
new programs with equal or only marginally better fitness, which then
predominate in the set of promising programs).

The solution I used in my dissertation work was to use a stochastic
selection procedure that allowed less fit programs to sometimes still
get chosen. This works, but is not very efficient: it is insensitive
to the evidence we gather during optimization about which regions of
the space are worth exploring. I have now developed a new approach,
outlined at http://code.google.com/p/plop/wiki/WorkingNotes, to
computing the expected utility not of a particular program, but of
performing search in the neighborhood of some particular program. With
this approach, we can greedily select the program with the highest
expected neighborhood utility for intensive optimization. If this
doesn't work out, the expected utility measure (and, crucially, the
expected utility measures for nearby programs as well) will drop
accordingly.
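A minimal sketch of this idea (assuming the normal model of
neighborhood scores discussed later in the thread; illustrative only,
not the plop implementation): the utility of searching near a program
is the chance that one more sample from the neighborhood's score
distribution beats the best score found so far.

```python
import math

def neighborhood_utility(mean, std, best_so_far):
    """Probability that one more sample from the neighborhood's
    (assumed normal) score distribution beats the best score seen so
    far; mean/std would come from the weighted estimates."""
    if std <= 0.0:
        return 0.0  # no spread left: no chance of an improvement
    z = (best_so_far - mean) / std
    # upper tail 1 - Phi(z) of the standard normal CDF
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

Note how this rewards high estimated variance: a neighborhood whose
mean sits below the best-so-far can still win the greedy selection if
its variance is large enough.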

I am still pondering how to implement this approach (or, more likely,
an approximation) efficiently. Let me know what you think - all
feedback appreciated!

Have a great weekend,
Moshe Looks

Peter Norvig

Jan 23, 2009, 7:29:01 PM
to plop...@googlegroups.com
Seems like a good idea to try to use an exemplar as a proxy for the search space that can be explored from the exemplar.  Reminds me of http://www.cs.cmu.edu/~jab/thesis/ .

Let's meet Monday to discuss this and your other recent messages.

Nil Geisweiller

Jan 23, 2009, 9:24:22 PM
to plop...@googlegroups.com
Hi Moshe,

I like the approach. Before I start: I don't understand the formula
for your unbiased estimator (V2 is not defined...), and there is a
typo in the next-to-last formula, the definition of E, where - should
be ].

My question would be: over what class of deceptive traps would that
method (estimating P and E from past weighted scores, as you suggest)
triumph...

Let me give a counterexample... Say there are 2 exemplars A and B. A
is vastly surrounded by better sub-optimal candidates; let's call them
better_A_neigh. B has only a few better neighbors, but they are
optimal; let's call them better_B_neigh. And of course A has a better
score than B, to make things harder (and better_A_neigh are also
syntactically far away from the optimal solution)... So the only way
to solve the problem is to focus on exemplar B.

While in theory Utility(A) could well be < Utility(B), their estimates
would most likely be the opposite, that is ~Utility(A) > ~Utility(B),
since |better_A_neigh| >> |better_B_neigh|. So the method is going to
give the focus to A...

That situation is not unrealistic; it's exactly the fitness landscape
I was dealing with in some trick learning... I fixed it by weighting
operators in combo so that B scores better than A, i.e. a total hack!

What say you? Maybe I misunderstood your method... Anyway, I'm sure
there are plenty of occasions where it accelerates the search, but the
question is on what proportion of practical problems it works (which
depends on the problem class, of course, and god knows what else).

Cheers
Nil

Ben Goertzel

Jan 24, 2009, 6:35:26 AM
to Moshe Looks, plop...@googlegroups.com

Two comments:

1)
To partially address the problem Nil mentioned, I'd suggest to use the p'th-power average rather than the arithmetic mean, when averaging the score over the neighborhood.  One then has the problem of tuning p, but at least one is just tuning a single number.  As p gets larger, there is more and more tendency to account for *only* the best stuff in the neighborhood.

2)
You might want to consider looking at sectors rather than spheres.  That is, suppose you have exemplar E, and are looking at mutating it to get program A.  Then, rather than looking at the score of everything near program A (and creating some average), you might want to look at the trend of score along the short pathways between E and A.  If this trend is increasing then this makes A look more promising.  (Of course, one also has the question of what power p to use in the average here.)
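The p'th-power average from point (1) can be sketched as a generalized
mean; this assumes positive scores:

```python
def power_mean(scores, p):
    """Generalized (p-th power) mean of positive scores; as p grows,
    the result tends toward max(scores), so larger p weighs only the
    best programs in the neighborhood."""
    if p == 0:
        raise ValueError("p = 0 is the geometric-mean limit; use p != 0")
    return (sum(s ** p for s in scores) / len(scores)) ** (1.0 / p)
```

Tuning p interpolates between the arithmetic mean (p = 1) and the
max (p -> infinity).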

-- Ben
--
Ben Goertzel, PhD
CEO, Novamente LLC and Biomind LLC
Director of Research, SIAI
b...@goertzel.org

"This is no place to stop -- half way between ape and angel"  
-- Benjamin Disraeli

Moshe Looks

Jan 25, 2009, 4:37:47 PM
to plop...@googlegroups.com
Hi Nil,

Thanks for the helpful comments!

> I like the approach. Before I start: I don't understand the formula
> for your unbiased estimator (V2 is not defined...), and there is a
> typo in the next-to-last formula, the definition of E, where - should
> be ].
>

Fixed. V2 is simply the sum of the squares of the weights, so it
measures the "imbalance" of the weightings (1/N <= V2 <= 1 for N
weights), with our estimate of the variance *increasing* as the
weights become more imbalanced (this is important)...
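In code, the estimator described here reads something like the
following sketch (my reconstruction from the description in this
thread, using the standard unbiased weighted-variance correction; not
copied from the wiki page):

```python
def weighted_variance(scores, weights):
    """Unbiased weighted variance of scores. V2 = sum of squared
    normalized weights satisfies 1/N <= V2 <= 1: equal weights give
    1/N, one dominant weight gives ~1, and dividing by (1 - V2)
    inflates the estimate as the weighting grows more imbalanced."""
    total = sum(weights)
    w = [wi / total for wi in weights]
    mean = sum(wi * x for wi, x in zip(w, scores))
    v2 = sum(wi * wi for wi in w)
    if v2 >= 1.0:
        return float('inf')  # one effective sample: variance unknown
    return sum(wi * (x - mean) ** 2 for wi, x in zip(w, scores)) / (1.0 - v2)
```

With equal weights this reduces to the ordinary unbiased sample
variance; imbalanced weights inflate it, which is the property Moshe
flags as important.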

> Let me give a counterexample... Say there are 2 exemplars A and B. A
> is vastly surrounded by better sub-optimal candidates; let's call
> them better_A_neigh. B has only a few better neighbors, but they are
> optimal; let's call them better_B_neigh. And of course A has a better
> score than B, to make things harder (and better_A_neigh are also
> syntactically far away from the optimal solution)... So the only way
> to solve the problem is to focus on exemplar B.
>
> While in theory Utility(A) could well be < Utility(B), their
> estimates would most likely be the opposite, that is
> ~Utility(A) > ~Utility(B), since |better_A_neigh| >> |better_B_neigh|.
> So the method is going to give the focus to A...
>

This is *exactly* the sort of case that my utility measure is designed
to handle. The key (I hope) is that, if you look at the way utility
is computed, we decide whether to prefer searching around A vs. B based
not just on the weighted mean, which will indeed be higher for A, but
also on the weighted variance around A vs. B. My working understanding
(may be wrong) is that even if exploring around A at first has higher
expected utility, after spending some time searching there our
estimated variance will decrease (while our estimated variance around
B should actually *increase* - see how V2 is defined). And,
importantly, our estimated variance for points near A (better_A_neigh)
will also decrease. These changes *should* (if my intuitions about the
math are correct) cause A and better_A_neigh's utilities to decrease,
while B's will increase.

If this does not happen I will need to go back to the drawing board
and see if (a) I screwed up the math; or (b) the model just doesn't
capture the real shape of the space(s) properly (e.g. maybe we need to
estimate skewness as well, or fit to a non-normal distribution?)...

> That situation is not unrealistic; it's exactly the fitness landscape
> I was dealing with in some trick learning... I fixed it by weighting
> operators in combo so that B scores better than A, i.e. a total hack!
>

Right - I have encountered many similar landscapes as well (e.g.
artificial ant, boolean parity and multiplexer), and they appear to be
fairly common in program spaces...

- Moshe

Moshe Looks

Jan 25, 2009, 10:27:17 PM
to plop...@googlegroups.com
Hi Ben,

> 1)
> To partially address the problem Nil mentioned, I'd suggest to use the
> p'th-power average rather than the arithmetic mean, when averaging the score
> over the neighborhood. One then has the problem of tuning p, but at least
> one is just tuning a single number. As p gets larger, there is more and
> more tendency to account for *only* the best stuff in the neighborhood.
>

As I said in the email I just sent to Nil, the idea (maybe wrong) is
to estimate the probability of finding a solution *better* than any
found so far, not to estimate the average solution quality in the
neighborhood, or the average solution quality of the best solutions in
the neighborhood. I do this by making the assumptions that:

1) solution quality is normally distributed
2) correlation between distance and solution quality falls off
geometrically with distance

Of course (1) is a bit dubious, but it seems like a decent starting
point. So we are not only favoring neighborhoods where good solutions
have been found, but also favoring neighborhoods where there is a high
variance in solution quality (or where we have done comparatively
little exploration, which will lead our estimate of variance to be
comparatively higher, I think)... Does this make sense?
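Assumption (2) suggests weighting each observed score geometrically by
its distance from the exemplar; combined with V2 this gives sparse or
far-away evidence exactly the variance-inflating effect described
above. A sketch (the decay constant is an arbitrary illustration):

```python
def distance_weights(distances, decay=0.5):
    """Weight each observed score by decay**d, where d is its
    (e.g. syntactic edit) distance from the exemplar; correlation is
    assumed to fall off geometrically with distance."""
    return [decay ** d for d in distances]

def imbalance(weights):
    """V2 of the normalized weights: near 1/N when evidence is spread
    evenly, near 1 when it comes from a single dominant sample."""
    total = sum(weights)
    return sum((w / total) ** 2 for w in weights)
```

A neighborhood whose evidence is mostly distant ends up with highly
imbalanced weights, hence a large V2, a wide variance estimate, and
more exploration credit.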

> 2)
> You might want to consider looking at sectors rather than spheres. That is,
> suppose you have exemplar E, and are looking at mutating it to get program
> A. Then, rather than looking at the score of everything near program A (and
> creating some average), you might want to look at the trend of score along
> the short pathways between E and A. If this trend is increasing then this
> makes A look more promising. (Of course, one also has the question of what
> power p to use in the average here.)
>

This is an interesting idea, thanks. Offhand it seems like it would
probably work better in spaces with continuous parameters and the
like, where a trendline makes more sense than in discrete spaces such
as Boolean expressions and abstract list manipulation...

- Moshe

Nil Geisweiller

Jan 26, 2009, 4:25:04 AM
to plop...@googlegroups.com
Thanks for that explanation, Moshe; I got it now.

So indeed, as A is explored, V2 for B will increase (assuming
neighbors of A are syntactically far away from B), and therefore so
will its variance, as will that of pretty much anything syntactically
far away from A... I guess that is ok, because you only apply that
selection method to a limited number of promising exemplars which have
been selected according to other criteria (local optima, syntactic
diversity...), right? Otherwise a simple hack would be to weight V2
with some coefficient to tune... And also, of course (what Ben
suggested at first, if I understood correctly), to apply a "fitness
landscape modifier", like the p-th power...

Nil

Moshe Looks

Jan 26, 2009, 12:54:36 PM
to plop...@googlegroups.com
> So indeed, as A is explored, V2 for B will increase (assuming
> neighbors of A are syntactically far away from B), and therefore so
> will its variance, as will that of pretty much anything syntactically
> far away from A... I guess that is ok, because you only apply that
> selection method to a limited number of promising exemplars which
> have been selected according to other criteria (local optima,
> syntactic diversity...), right?
>
Right. I treat "preservation of promising solutions" as a separate
problem from "exemplar selection", and attempt to solve it by
computing dominated and nondominated sets, and then recursively
bisecting the node sets with Harik's restricted tournament selection
method to obtain the right number of nodes:

if |nodes| <= n
  return nodes
partition nodes into dominated and nondominated
if |nondominated| >= n
  return restricted-tournament-select(n, nondominated)
return nondominated U restricted-tournament-select(n - |nondominated|,
                                                   dominated)

The dominated vs. nondominated partition is there to preserve semantic
diversity (more important), and RTS is designed to preserve syntactic
diversity (of secondary importance)...
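The pseudocode could be fleshed out along these lines in Python. Note
that the tournament step here is a simplified RTS-flavored stand-in
(seed a window, then let each remaining candidate displace its nearest
selected neighbor when it scores better), not Harik's exact procedure;
dominated_by, distance, and score are caller-supplied.

```python
import random

def restricted_tournament_select(n, pool, distance, score):
    """Simplified RTS-style selection: a candidate only ever displaces
    the selected node most similar to itself, which is what preserves
    syntactic diversity."""
    if n <= 0:
        return []
    pool = list(pool)
    random.shuffle(pool)
    selected, rest = pool[:n], pool[n:]
    for cand in rest:
        # nearest selected neighbor competes against the candidate
        i = min(range(len(selected)),
                key=lambda j: distance(cand, selected[j]))
        if score(cand) > score(selected[i]):
            selected[i] = cand
    return selected

def select_promising(n, nodes, dominated_by, distance, score):
    """The recursive bisection from the message: keep all nondominated
    nodes when they fit, and fill the remainder from the dominated set."""
    if len(nodes) <= n:
        return list(nodes)
    nondominated = [x for x in nodes
                    if not any(dominated_by(x, y) for y in nodes if y != x)]
    dominated = [x for x in nodes if x not in nondominated]
    if len(nondominated) >= n:
        return restricted_tournament_select(n, nondominated, distance, score)
    return nondominated + restricted_tournament_select(
        n - len(nondominated), dominated, distance, score)
```

For example, with scalar "programs" where domination is just <,
select_promising(2, [1, 2, 3, 4], ...) keeps the nondominated 4 and
fills the last slot from the dominated set.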

- Moshe

Cassio Pennachin

Jan 27, 2009, 8:14:12 AM
to plop...@googlegroups.com
Hi Moshe,

> 1) solution quality is normally distributed
> 2) correlation between distance and solution quality falls off
> geometrically with distance
>
> Of course (1) is a bit dubious, but it seems like a decent starting
> point.

It seems to me that you could take skew into consideration without a
lot of extra effort, no? This would be useful, because high variance
that's skewed toward bad solutions probably indicates a non-robust
local max, and therefore a bad neighborhood to explore. The opposite
would suggest potentially very profitable neighborhoods...

Cassio

Moshe Looks

Jan 28, 2009, 12:11:51 AM
to plop...@googlegroups.com
Hi Cassio,

> It seems to me that you could take skew into consideration without a
> lot of extra effort, no? This would be useful, because high variance
> that's skewed toward bad solutions probably indicates a non-robust
> local max, and therefore a bad neighborhood to explore. The opposite
> would suggest potentially very profitable neighborhoods...
>

Yeah, skewness might be worth taking into consideration as well. A bit
of Google searching led me to
http://en.wikipedia.org/wiki/Skew_normal_distribution - is this what
you had in mind, or is there a better family of distributions that
you'd recommend? There are also cases where fitness is bounded or
half-open, where e.g. a beta distribution might be more appropriate...

Still, I think for the first go-round I will try plain old normal
distributions, unless you and/or others have a strong feeling that
this is just too simplistic. This is because (a) I am lazy, and
working out the math for normal distributions was hard enough; and (b)
it might be that our samples will be so small and noisy that normal
distributions will work just as well as anything else...

Thanks,
Moshe

Ben Goertzel

Jan 28, 2009, 12:18:21 AM
to plop...@googlegroups.com

I believe this would be the appropriate distribution, Moshe

http://en.wikipedia.org/wiki/Dirichlet_distribution

I tend to be suspicious of assumptions of normality, but maybe that's just the mathematician in me...

ben

Moshe Looks

Jan 28, 2009, 2:38:27 PM
to plop...@googlegroups.com
Ben,

> I believe this would be the appropriate distribution, Moshe
>
> http://en.wikipedia.org/wiki/Dirichlet_distribution
>
> I tend to be suspicious of assumptions of normality, but maybe that's just
> the mathematician in me...
>

I'm not sure how you intend the Dirichlet distribution to be applied
here - do you mean just using a beta distribution to model fitness
(scaling it to fall in [0,1]), or something else that I'm missing?
What would the N dimensions of the distribution correspond to?

- Moshe

Ben Goertzel

Jan 28, 2009, 6:30:02 PM
to plop...@googlegroups.com

If you have a fitness function that is fundamentally 1-dimensional, you'd use the beta distribution

If you have a fitness function that is some kind of weighted combination of multiple components, it might be better to use a Dirichlet distribution on the components, rather than a beta on the weighted combination

For instance, this could potentially be interesting if one of the components is a "raw fitness" and the other is an "Occam penalty" ... though I haven't thought this through extremely carefully...

ben
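Ben's suggestion can be sketched with nothing beyond the standard
library: a Dirichlet draw is just normalized independent Gamma draws.
The split into "raw fitness" and "Occam penalty" shares below is only
illustrative, following his example.

```python
import random

def sample_dirichlet(alphas):
    """One draw from Dirichlet(alphas) via the standard construction:
    normalize independent Gamma(alpha_i, 1) samples."""
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

# Illustrative only: model the split between a "raw fitness" component
# and an "Occam penalty" component; with 2 components the marginal is
# a Beta distribution, recovering the 1-D case Ben mentions.
raw_share, occam_share = sample_dirichlet([2.0, 5.0])
```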