For the past few days I've been working on the problem of exemplar
selection for moses/plop: given a set of N promising programs for
solving some task, how do we choose which one to use as the focus for
intensive optimization in the search for even better programs? I
discovered in my dissertation work on moses that, as expected, the
naive approach of always picking the program with the highest score
(i.e. fitness) doesn't work very well on many problems because of
deception in the search space - directions that seem promising end up
leading us away from superior programs, instead leading us to many
new programs with equal or only marginally better fitness, which then
predominate in the set of promising programs.
The solution I used in my dissertation work was a stochastic
selection procedure that allowed less fit programs to sometimes still
get chosen. This works, but is not very efficient: it is insensitive
to the information we obtain in the course of optimization, which
should let us assimilate evidence as to which regions of the space are
worth exploring. I have now developed a new approach, outlined in
http://code.google.com/p/plop/wiki/WorkingNotes , to computing the
expected utility, not of a particular program, but of performing
search in the neighborhood of some particular program. With
this approach, we can greedily select the program with the highest
expected neighborhood utility for intensive optimization. If this
doesn't work out, the expected utility measure (and, crucially, the
expected utility measures for nearby programs as well) will drop
accordingly.
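In pseudocode terms, the selection loop is then just the following (a
sketch in Python; expected_neighborhood_utility stands for the quantity
defined in the WorkingNotes page, treated here as a black box, and
optimize_around is a stand-in for the intensive local search):

exemplar = max(promising_programs, key=expected_neighborhood_utility)
optimize_around(exemplar)  # results feed back into the utility estimates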
I am still pondering how to implement this approach (or, more likely,
an approximation) efficiently. Let me know what you think - all
feedback appreciated!
Have a great weekend,
Moshe Looks
Thanks for the helpful comments!
> I like the approach. Before I start: I don't understand the formula of your
> unbiased estimator (V2 is not defined...), and there is a typo in the
> next-to-last formula, the definition of E - the '-' should be a ']'.
>
Fixed. V2 is simply the sum of the squares of the weights, so it
measures the "imbalance" of the weightings (1/N <= V2 <= 1 for N
weights), with our estimate of the variance *increasing* as the
weights become more imbalanced (this is important)...
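Concretely, this is the standard unbiased estimator for normalized
weights, which I believe matches the formula in the notes (a quick
Python sketch; the function name is mine, and the weights are assumed
to sum to 1):

def weighted_variance(xs, ws):
    # Weighted mean of the scores.
    m = sum(w * x for x, w in zip(xs, ws))
    # V2 measures weight imbalance: 1/N (all equal) <= V2 <= 1 (one dominates).
    v2 = sum(w * w for w in ws)
    # As V2 -> 1 the 1/(1 - V2) correction blows up, so the variance
    # estimate *increases* with imbalance, as described above.
    return sum(w * (x - m) ** 2 for x, w in zip(xs, ws)) / (1.0 - v2)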
> Let me give a counterexample... Let's say there are 2 exemplars, A and B. A
> is vastly surrounded by better sub-optimal candidates; let's call them
> better_A_neigh. B has only a few better neighbors, but they are optimal;
> let's call them better_B_neigh. And of course A has a better score than B, to
> make things harder (and also better_A_neigh are syntactically far away from
> the optimal solution)... So the only way to solve the problem is to focus on
> exemplar B.
>
> While in theory Utility(A) could well be < Utility(B), their estimate would
> most likely be the opposite, that is ~Utility(A)>~Utility(B), since
> |better_A_neigh|>>|better_B_neigh|. So the method is gonna give the focus to
> A...
>
This is *exactly* the sort of case that my utility measure is designed
to handle. The key (I hope) is that if you look at the way utility
is computed, we decide whether to prefer searching around A vs. B based
not just on the weighted mean, which will indeed be higher for A, but
also on the weighted variance around A vs. B. My working understanding
(may be wrong) is that even if exploring around A at first has higher
expected utility, after spending some time searching there our
estimated variance will decrease (while our estimated variance around
B should actually *increase* - see how V2 is defined). And,
importantly, our estimated variance for points near A (better_A_neigh)
will also decrease. These changes *should* (if my intuitions about the
math are correct) cause A and better_A_neigh's utilities to decrease,
while B's will increase.
If this does not happen I will need to go back to the drawing board
and see if (a) I screwed up the math; or (b) the model just doesn't
capture the real shape of the space(s) properly (e.g. maybe we need to
estimate skewness as well, or fit to a non-normal distribution?)...
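To illustrate the basic intuition with completely made-up numbers (not
from any real run): suppose the best score so far is 0.9, searching
around A has produced many tightly clustered scores near 0.8, and
around B we have only a few scores averaging 0.7 but widely spread.
Under the normality assumption:

from statistics import NormalDist, mean, stdev

best = 0.9
around_A = [0.80, 0.81, 0.79, 0.80, 0.82, 0.78, 0.80, 0.81]  # many, tight
around_B = [0.55, 0.70, 0.85]                                # few, spread out

for name, xs in (("A", around_A), ("B", around_B)):
    # Probability that a fresh sample from this neighborhood beats best.
    p = 1.0 - NormalDist(mean(xs), stdev(xs)).cdf(best)
    print(name, round(p, 4))

This gives roughly 0.09 for B and essentially zero for A, even though
A's samples score higher on average - which is exactly the preference
we want in Nil's scenario.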
> That situation is not unrealistic; that's exactly the fitness landscape I
> was dealing with in some trick learning... I fixed that by weighting
> operators in combo so that B scores better than A, i.e. a total hack!
>
Right - I have encountered many similar landscapes as well (e.g.
artificial ant, boolean parity and multiplexer), and they appear to be
fairly common in program spaces...
- Moshe
> 1)
> To partially address the problem Nil mentioned, I'd suggest to use the
> p'th-power average rather than the arithmetic mean, when averaging the score
> over the neighborhood. One then has the problem of tuning p, but at least
> one is just tuning a single number. As p gets larger, there is more and
> more tendency to account for *only* the best stuff in the neighborhood.
>
As I said in the email I just sent to Nil, the idea (maybe wrong) is
to estimate the probability of finding a solution *better* than any
found so far, not to estimate the average solution quality in the
neighborhood, or the average solution quality of the best solutions in
the neighborhood. I do this by making the assumptions that:
1) solution quality is normally distributed
2) correlation between distance and solution quality falls off
geometrically with distance
Of course (1) is a bit dubious, but it seems like a decent starting
point. So we are not only favoring neighborhoods where good solutions
have been found, but also favoring neighborhoods where there is a high
variance in solution quality (or where we have done comparatively
little exploration, which will lead our estimate of variance to be
comparatively higher, I think)... Does this make sense?
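To make the two assumptions concrete, here is a rough Python sketch of
the whole computation (the decay rate c and the function name are
placeholders of mine, not the actual formulas from the WorkingNotes
page):

from statistics import NormalDist

def neighborhood_utility(samples, best_so_far, c=0.5):
    # samples: (distance-from-exemplar, score) pairs observed so far.
    # Assumption (2): relevance falls off geometrically with distance.
    ws = [c ** d for d, _ in samples]
    total = sum(ws)
    ws = [w / total for w in ws]
    xs = [s for _, s in samples]
    m = sum(w * x for w, x in zip(ws, xs))   # weighted mean
    v2 = sum(w * w for w in ws)              # weight imbalance
    var = sum(w * (x - m) ** 2 for w, x in zip(ws, xs)) / (1.0 - v2)
    # Assumption (1): scores near the exemplar are ~ Normal(m, var).
    return 1.0 - NormalDist(m, var ** 0.5).cdf(best_so_far)

Note how sampling many nearby points drives V2 down and (if the scores
cluster) the variance estimate down with it, so a heavily explored
neighborhood loses utility unless it keeps producing spread-out or
improving scores.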
> 2)
> You might want to consider looking at sectors rather than spheres. That is,
> suppose you have exemplar E, and are looking at mutating it to get program
> A. Then, rather than looking at the score of everything near program A (and
> creating some average), you might want to look at the trend of score along
> the short pathways between E and A. If this trend is increasing then this
> makes A look more promising. (Of course, one also has the question of what
> power p to use in the average here.)
>
This is an interesting idea, thanks. Offhand it seems like it would
probably work better in spaces with continuous parameters and the
like, where a trendline makes more sense than in discrete spaces such
as Boolean expressions and abstract list manipulation...
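That said, if we wanted to try it, the simplest version I can picture
is a least-squares slope over scores sampled along a short pathway
from E to A (a sketch; the helper name is hypothetical):

def path_trend(scores):
    # scores: fitness at each step along a short E -> A pathway.
    # Returns the least-squares slope; positive means A lies in a
    # direction of improving fitness.
    n = len(scores)
    mx, my = (n - 1) / 2.0, sum(scores) / n
    num = sum((x - mx) * (y - my) for x, y in enumerate(scores))
    den = sum((x - mx) ** 2 for x in range(n))
    return num / den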
- Moshe
def select_promising(n, nodes):
    # Nothing to prune away.
    if len(nodes) <= n:
        return nodes
    # Pareto-partition into dominated and nondominated programs.
    nondominated = [x for x in nodes if not any(dominates(y, x) for y in nodes)]
    dominated = [x for x in nodes if x not in nondominated]
    if len(nondominated) >= n:
        return restricted_tournament_select(n, nondominated)
    return nondominated + restricted_tournament_select(n - len(nondominated),
                                                       dominated)
The dominated vs. nondominated partition is there to preserve semantic
diversity (more important), and restricted tournament selection (RTS)
is designed to preserve syntactic diversity (of secondary importance)...
- Moshe
> 1) solution quality is normally distributed
> 2) correlation between distance and solution quality falls off
> geometrically with distance
>
> Of course (1) is a bit dubious, but it seems like a decent starting
> point.
It seems to me that you could take skew into consideration without a
lot of extra effort, no? This would be useful, because high variance
that's skewed toward bad solutions probably indicates a non-robust
local max, and therefore a bad neighborhood to explore. The opposite
would suggest potentially very profitable neighborhoods...
Cassio
> It seems to me that you could take skew into consideration without a
> lot of extra effort, no? This would be useful, because high variance
> that's skewed toward bad solutions probably indicates a non-robust
> local max, and therefore a bad neighborhood to explore. The opposite
> would suggest potentially very profitable neighborhoods...
>
Yeah, skewness might be worth taking into consideration as well. A bit
of Googling led me to
http://en.wikipedia.org/wiki/Skew_normal_distribution - is this what
you had in mind, or is there a better family of distributions that
you'd recommend? There are also cases where fitness is bounded or
half-open, where e.g. a beta distribution might be more
appropriate...
Still, I think for the first go-round I will try it with plain old
normal distributions, unless you and/or others have a strong feeling
that this is just too simplistic. This is because (a) I am lazy, and
working out the math for normal distributions was hard enough; and (b)
it may be that our samples will be so small and noisy that normal
distributions will work just as well as anything else...
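If we do end up wanting a cheap check on assumption (1), the sample
skewness of a neighborhood's scores would at least flag where plain
normals are a bad fit (a sketch):

def skewness(xs):
    # Sample skewness g1 = m3 / m2**1.5; large |g1| means the scores
    # in this neighborhood are far from normal.
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5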
Thanks,
Moshe
> I believe this would be the appropriate distribution, Moshe
>
> http://en.wikipedia.org/wiki/Dirichlet_distribution
>
> I tend to be suspicious of assumptions of normality, but maybe that's just
> the mathematician in me...
>
I'm not sure how you intend the Dirichlet distribution to be applied
here - do you mean just using a beta distribution to model fitness
(scaling it to fall in [0,1]), or something else that I'm missing?
What would the N dimensions of the distribution correspond to?
- Moshe