I've created a new wiki page containing some speculations on sharing
probabilistic models across representations:
http://code.google.com/p/plop/wiki/ModelingAcrossRepresentations .
Over the last few weeks I have given considerable thought to this
problem - it's rather tricky, both conceptually and implementationally.
For now, in the interest of expediency, I have decided to proceed with
the plop implementation of MOSES without incorporating sharing of
probabilistic models across representations.
I remain convinced that this technique will eventually bear fruit and
allow search spaces to be explored much more effectively.
Unfortunately it is not clear to me right now exactly what form a
normative approach would take, nor what heuristics would provide the
most bang per buck. So while the idea is on the back burner for now, I
am interested in any suggestions for how to do it right when the time
comes to implement it...
Cheers,
Moshe
> 1)
> OK: so crudely, what you're saying is that if a new representation
> is formed by crossing over prior ones, then you can create a
> first-approximation probabilistic model via crossing over the
> probabilistic models from the parents, and doing some "revision" at
> the crossover join-points? The tree-alignment is then a mechanism for
> helping perform this?
>
Well, there is no crossover here. What happens is you have some
expression that you want to use as the exemplar for a new
representation. Now, this expression corresponds to one or more
addresses (set of turns to knobs) in one or more existing
representations. So the idea is to create the initial probabilistic
model by somehow combining the models for all these representations.
The tree-alignment mechanism helps perform this because the knobs
(variables) in the new representation will in general not be in
one-to-one correspondence with the knobs in any of the "parent"
representations. Without a correspondence between knobs, I don't see
any way to import knowledge from the existing probabilistic models
into the new one.
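A rough sketch of this seeding step, under illustrative assumptions (the function and knob names are hypothetical, not plop's actual API): given a knob correspondence produced by tree alignment, copy marginal probabilities over where a mapping exists and fall back to an uninformative prior elsewhere.

```python
# Hypothetical sketch: seed a new representation's probabilistic model
# from a parent model via a knob correspondence from tree alignment.

def seed_model(parent_probs, knob_map, default=0.5):
    """parent_probs: {parent_knob: probability};
    knob_map: {child_knob: parent_knob or None if no counterpart}.
    Child knobs without a counterpart get the uninformative prior."""
    child_probs = {}
    for child_knob, parent_knob in knob_map.items():
        child_probs[child_knob] = parent_probs.get(parent_knob, default)
    return child_probs

parent = {"k1": 0.9, "k2": 0.1}
mapping = {"a": "k1", "b": "k2", "c": None}  # "c" has no parent knob
print(seed_model(parent, mapping))  # {'a': 0.9, 'b': 0.1, 'c': 0.5}
```

The interesting part, of course, is hidden inside `knob_map`: building a good correspondence is exactly what the tree-alignment mechanism is for.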
> 2)
> This seems to be a case where keeping careful track of the confidence
> levels ("weight of evidence" levels, whatever) associated with each
> probability in the model makes sense. Because some of the
> extrapolated probabilities are going to be more confident than others
> ...this gets at your point about
>
> "
> When propagating frequencies across representations, do the propagated
> frequencies have equal weight to actual frequencies?
> "
> I think this is a good case for two-component (or more) probabilistic
> truth values, like imprecise or indefinite probabilities...
>
Well I am sort of assuming that we keep all the data around, which is
obviously unrealistic in the general case... At the very least we will
know how many points we sampled out of each representation, which lets
us compute confidence. What complicates things here is that some of
the mappings across representations will be more meaningful than
others. Of course, a simplifying assumption that will often be
correct is to consider importing knowledge from only *one* existing
model (e.g. the representation where the expression's address is the
shortest when measured in bits).
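Both heuristics are easy to state concretely; here is a minimal sketch, where the confidence constant `k` and the bit-string address encoding are illustrative assumptions rather than plop's actual scheme:

```python
# Hypothetical sketch of the two heuristics above: confidence from
# sample counts, and picking the single parent with the shortest address.

def confidence(n_samples, k=10.0):
    """Map a sample count to a [0, 1) confidence weight via n / (n + k)."""
    return n_samples / (n_samples + k)

def pick_parent(addresses):
    """Pick the parent representation whose address (a bit string
    encoding the knob settings) is shortest, i.e. cheapest to describe."""
    return min(addresses, key=lambda rep: len(addresses[rep]))

addrs = {"repA": "10110", "repB": "011"}
print(pick_parent(addrs))   # repB: 3 bits vs. 5
print(confidence(10))       # 0.5
```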
> 3)
> This is a somewhat whacky idea, but we're brainstorming here, so...
>
> I wonder if you could look at execution flow inside the program tree,
> rather than just at syntactic form, in determining what parts of the
> parent's probabilistic model to carry over to the child. If done
> right, this could make a big difference.
>
> Suppose you have representation T, and you mutate it into T'
>
> Then, suppose you run T and T' on the same inputs ... and keep track
> of the flow of execution inside T and T'
>
> Then, you align the execution traces (yes, there is work hidden here ;-) ....
>
> For those nodes N' within T' whose execution aligns well with
> comparable nodes N within T, **then** you port over N's probabilistic
> model and attribute it to N' [with a confidence proportional to the
> degree of alignment]
>
> This avoids being misled by syntactic parallels that are not actually
> semantic parallels, right?
>
Yeah, this makes sense. Of course, you still need to do the syntactic
alignment in order to get a variable correspondence, so you may be
misled anyway (unless you can compute a semantic similarity measure
across the subexpressions, which would be really expensive). A cheaper
and easier way to do something like this is, instead of execution
traces, to just look at "behavioral distance" (e.g. truth-table
Hamming distance for boolean expressions, differences in the sequences
of actions taken by agents, etc.). What do you think?
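For the boolean case the behavioral distance is straightforward; a small sketch (the representation of expressions as Python callables is just for illustration):

```python
# Sketch: truth-table Hamming distance between two boolean expressions,
# here represented as Python functions over boolean inputs.
from itertools import product

def truth_table(expr, n_vars):
    """Evaluate expr on every assignment of n_vars boolean inputs."""
    return [expr(*bits) for bits in product([False, True], repeat=n_vars)]

def behavioral_distance(e1, e2, n_vars):
    """Number of input assignments on which the two expressions disagree."""
    return sum(a != b for a, b in zip(truth_table(e1, n_vars),
                                      truth_table(e2, n_vars)))

f = lambda x, y: x and y
g = lambda x, y: x or y
print(behavioral_distance(f, g, 2))  # disagree on (F,T) and (T,F) -> 2
```

Since the table grows as 2^n, in practice one would sample assignments rather than enumerate them for expressions with many variables.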
My fear is that all of this will require a lot of tuning to be useful
- e.g. in many cases there might not be any good alignment at all, and
so starting with a biased probabilistic model will actually degrade
performance...
- Moshe
> 2) in principle the distribution of a population inside a deme can be
> represented by a stochastic contextual grammar. And actually the union
> of all rules for every deme is also a stochastic contextual grammar
> but of course it is probably too specific to be used well anyway so...
> So the idea could be to try to infer a compact stochastic contextual
> grammar (SCG) that seems to capture well the dependencies contained by
> all the rules accumulated generation over generation.
Basically this would be an alternative to decision tree modeling, right?
I believe Moshe and I discussed this back in 1887 or so ;-)
> I haven't found anything about stochastic contextual grammar on the
> web, but maybe that is not the right terminology.
Well, this is well known:
http://en.wikipedia.org/wiki/Stochastic_context-free_grammar
but I guess what you're referring to is a "stochastic context-sensitive
grammar" or SCSG.
I found some papers on this such as this one on image grammar
http://www.stat.ucla.edu/~sczhu/papers/Reprint_Grammar.pdf
but nothing particularly useful...
> I'm actually even wondering if we could represent the SCG in terms of
> nodes and links and use PLN to (help) learn that compact SCG... at
> some point we will certainly want to implement an optimization algo
> that relies on knowledge on the AtomTable, background knowledge
> (including models across populations and fitness functions) but that's
> another topic...
I suppose that inferring the SCSG is going to be best done by some
kind of "pattern mining" algorithm, customized to operate on the
population of Combo trees....
But the pattern mining is going to find only relatively simple SCSG
rules, and PLN may be used to generalize from these to find more
abstract SCSG rules...
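To make the pattern-mining step concrete, here is a minimal sketch of the simplest version: estimating context-free rule probabilities by counting parent-to-children productions across a population of trees (a context-sensitive version would additionally condition each count on the surrounding context). The tuple encoding of trees is an illustrative assumption, not Combo's actual representation.

```python
# Sketch: infer a stochastic (context-free) grammar from a tree population
# by relative-frequency counting of productions. Trees are nested tuples
# (label, child, child, ...); leaves are plain strings.
from collections import Counter, defaultdict

def count_productions(tree, counts):
    if isinstance(tree, tuple):
        head, children = tree[0], tree[1:]
        rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
        counts[head][rhs] += 1
        for c in children:
            count_productions(c, counts)

def infer_grammar(population):
    counts = defaultdict(Counter)
    for tree in population:
        count_productions(tree, counts)
    return {head: {rhs: n / sum(c.values()) for rhs, n in c.items()}
            for head, c in counts.items()}

pop = [("and", "x", ("or", "y", "z")), ("and", "x", "y")]
print(infer_grammar(pop))
```

Generalizing these raw counted rules into more abstract ones is exactly where the PLN-style inference mentioned above would come in.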
ben g
Yes ... but I have little intuition about the power of the tree
alignment method...
Just musing wildly: I wonder if there is some more strict way of
normalizing programs that could be useful.... ENF is good, but I
guess it still admits a lot of different equivalent programs. I
realize that the problem in general is intractable, but I have a
suspicion we could go further in terms of articulating more specific
hierarchical normalizations that are useful for the majority of
programs that are relevant in practice...
But I don't have time to go down this path in detail at this moment...
ben