sharing probabilistic models across representations


Moshe Looks

Feb 17, 2009, 1:25:02 AM
to plop...@googlegroups.com
Hi All,

I've created a new wiki page containing some speculations on sharing
probabilistic models across representations:
http://code.google.com/p/plop/wiki/ModelingAcrossRepresentations .
Over the last few weeks I have given considerable thought to this
problem - it's rather tricky both conceptually and implementationally.
I have decided, in the interest of expediency, to proceed in the plop
implementation of MOSES without incorporating sharing of probabilistic
models across representations.

I remain convinced that this technique will eventually bear fruit and
allow search spaces to be explored much more effectively.
Unfortunately it is not clear to me right now exactly what form a
normative approach would take, nor what heuristics would provide the
most bang per buck. So while the idea is on the back burner for now, I
am interested in any suggestions for how to do it right when the time
comes to implement it...

Cheers,
Moshe

Ben Goertzel

Feb 17, 2009, 7:11:47 AM
to plop...@googlegroups.com
quick question: does

"
new representation (program subspace)
"

mean

"
new deme
"

in traditional MOSES-speak?

ben
--
Ben Goertzel, PhD
CEO, Novamente LLC and Biomind LLC
Director of Research, SIAI
b...@goertzel.org

"If we all worked on the assumption that what is accepted as true is
really true, there would be little hope of advance."
-- Orville Wright

"When I die, I'm leaving my body to science fiction."
-- Steven Wright

Ben Goertzel

Feb 17, 2009, 7:23:00 AM
to plop...@googlegroups.com
A few thoughts (the third one being the most adventurous/interesting):

1)
OK: so crudely, what you're saying is that if a new representation
is formed by crossing over prior ones, then you can create a
first-approximation probabilistic model via crossing over the
probabilistic models from the parents, and doing some "revision" at
the crossover join-points? The tree-alignment is then a mechanism for
helping perform this?

2)
This seems to be a case where keeping careful track of the confidence
levels ("weight of evidence" levels, whatever) associated with each
probability in the model makes sense. Because some of the
extrapolated probabilities are going to be more confident than others
...this gets at your point about

"
When propagating frequencies across representations, do the propagated
frequencies have equal weight to actual frequencies?
"
I think this is a good case for two-component (or more) probabilistic
truth values, like imprecise or indefinite probabilities...
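
(To make this concrete, here's a toy Python sketch of the kind of
two-component truth value I mean - the class, the evidence-horizon
constant, and the revision rule are just illustrative, not the actual
PLN structures:)

class TruthValue:
    def __init__(self, strength, count):
        self.strength = strength   # estimated probability
        self.count = count         # amount of evidence behind the estimate

    def confidence(self, k=10.0):
        # confidence in [0, 1) that grows with evidence; k is an arbitrary horizon
        return self.count / (self.count + k)

    def revise(self, other):
        # merge two estimates, weighting each by the evidence behind it
        total = self.count + other.count
        s = (self.strength * self.count + other.strength * other.count) / total
        return TruthValue(s, total)

# e.g. an "actual" frequency backed by 40 samples vs. a propagated one backed by 5:
actual, propagated = TruthValue(0.7, 40), TruthValue(0.3, 5)
print(actual.revise(propagated).strength)   # ~0.66: pulled only slightly toward 0.3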

3)
This is a somewhat whacky idea, but we're brainstorming here, so...

I wonder if you could look at execution flow inside the program tree,
rather than just at syntactic form, in determining what parts of the
parent's probabilistic model to carry over to the child. If done
right, this could make a big difference.

Suppose you have representation T, and you mutate it into T'

Then, suppose you run T and T' on the same inputs ... and keep track
of the flow of execution inside T and T'

Then, you align the execution traces (yes, there is work hidden here ;-) ....

For those nodes N' within T' whose execution aligns well with
comparable nodes N within T, **then** you port over N's probabilistic
model and attribute it to N' [with a confidence proportional to the
degree of alignment]

This avoids being misled by syntactic parallels that are not actually
semantic parallels, right?
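
(To pin the idea down a bit, a rough Python sketch under heavy
simplifying assumptions: programs are nested tuples of and/or/not over
boolean variables, the "execution trace" is just the value each node
takes on each input, alignment is scored as plain agreement between
those values, and all the names and structures are made up:)

from itertools import product

def eval_node(tree, env, trace, path=()):
    # Evaluate a boolean tree like ('and', 'x', ('not', 'y')) and record each
    # node's value in `trace`, keyed by the node's path from the root.
    if isinstance(tree, str):                      # a variable leaf
        val = env[tree]
    else:                                          # ('and', ...), ('or', ...), ('not', ...)
        op, *args = tree
        vals = [eval_node(a, env, trace, path + (i,)) for i, a in enumerate(args)]
        val = {'and': all, 'or': any}.get(op, lambda v: not v[0])(vals)
    trace.setdefault(path, []).append(val)
    return val

def traces(tree, variables):
    trace = {}
    for bits in product([False, True], repeat=len(variables)):
        eval_node(tree, dict(zip(variables, bits)), trace)
    return trace

def port_models(T, T_prime, variables, models_T, threshold=0.9):
    # For each node of T', find the node of T whose trace agrees with it most
    # often; if agreement is high enough, copy that node's model over and use
    # the agreement score as the confidence of the ported model.
    tr, tr2 = traces(T, variables), traces(T_prime, variables)
    ported = {}
    for path2, out2 in tr2.items():
        best = max(tr, key=lambda p: sum(a == b for a, b in zip(tr[p], out2)))
        score = sum(a == b for a, b in zip(tr[best], out2)) / len(out2)
        if score >= threshold and best in models_T:
            ported[path2] = (models_T[best], score)
    return ported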

-- Ben G




Cassio Pennachin

Feb 17, 2009, 8:10:14 AM
to plop...@googlegroups.com
The execution flow idea seems to be valid for representation building
proper as well, not just for "representation inheritance" (or whatever
we want to call this idea), right? However, it seems a lot harder to
implement than Moshe's original ideas.

I agree that some form of weight of evidence can be used to answer
some of Moshe's questions.

Cassio
--
Cassio Pennachin
CEO, Vetta Labs
www.vettalabs.com -- +55(31)3421-6933



Ben Goertzel

Feb 17, 2009, 8:13:46 AM
to plop...@googlegroups.com
The execution flow idea is definitely WAY harder to implement than
Moshe's original ideas...

But the (obvious) hope is that if some subtree N' of T' both *looks*
and *acts* like some subtree N of T, maybe the model of N can
meaningfully be ported to N'

To use the familiar metaphor, Moshe's original method wants to assume

"If it looks like a duck, then we can model it as a duck"

My method wants to assume

"If it looks like a duck *and* quacks like a duck, then we can model
it as a duck"

-- ben g

Moshe Looks

Feb 18, 2009, 12:35:47 PM
to plop...@googlegroups.com
> quick question: does
>
> "new representation (program subspace)"
>
> mean
>
> "new deme"
>
> in traditional MOSES-speak?
>
Yep.

Moshe Looks

Feb 19, 2009, 1:55:30 AM
to plop...@googlegroups.com
Hi Ben,

> 1)
> OK: so crudely, what you're saying is that if a new representation
> is formed by crossing over prior ones, then you can create a
> first-approximation probabilistic model via crossing over the
> probabilistic models from the parents, and doing some "revision" at
> the crossover join-points? The tree-alignment is then a mechanism for
> helping perform this?
>

Well, there is no crossover here. What happens is you have some
expression that you want to use as the exemplar for a new
representation. Now, this expression corresponds to one or more
addresses (set of turns to knobs) in one or more existing
representations. So the idea is to create the initial probabilistic
model by somehow combining the models for all these representations.
The tree-alignment mechanism helps perform this because the knobs
(variables) in the new representation will in general not be in
one-to-one correspondence with the knobs in any of the "parent"
representations. Without a correspondence between knobs, I don't see
any way to import knowledge from the existing probabilistic models
into the new one.
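
(Concretely, the kind of thing I'm imagining is roughly the following
Python sketch - it assumes the alignment step has already produced a
partial map from new knobs to parent knobs and settings, and the data
structures are illustrative, not what plop actually uses:)

# A "model" here is just a dict: knob -> {setting: probability}.

def uniform_marginal(settings):
    p = 1.0 / len(settings)
    return {s: p for s in settings}

def seed_model(new_knobs, parent_model, correspondence):
    """new_knobs: dict knob -> list of possible settings
    parent_model: dict knob -> marginal distribution over its settings
    correspondence: dict new_knob -> (parent_knob, {new_setting: parent_setting})"""
    model = {}
    for knob, settings in new_knobs.items():
        if knob in correspondence:
            parent_knob, setting_map = correspondence[knob]
            parent = parent_model[parent_knob]
            # carry over mass for settings the alignment maps; give the
            # unmapped ones a small floor, then renormalize
            raw = {s: parent.get(setting_map.get(s), 0.01) for s in settings}
            total = sum(raw.values())
            model[knob] = {s: v / total for s, v in raw.items()}
        else:
            model[knob] = uniform_marginal(settings)  # no information: uniform prior
    return model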

> 2)
> This seems to be a case where keeping careful track of the confidence
> levels ("weight of evidence" levels, whatever) associated with each
> probability in the model makes sense. Because some of the
> extrapolated probabilities are going to be more confident than others
> ...this gets at your point about
>
> "
> When propagating frequencies across representations, do the propagated
> frequencies have equal weight to actual frequencies?
> "
> I think this is a good case for two-component (or more) probabilistic
> truth values, like imprecise or indefinite probabilities...
>

Well I am sort of assuming that we keep all the data around, which is
obviously unrealistic in the general case... At the very least we will
know how many points we sampled out of each representation, which lets
us compute confidence. What complicates things here is that some of
the mappings across representations will be more meaningful than
others... Of course, a simplifying assumption that will often be
correct is to only consider importing knowledge from *one* existing
model (e.g. the representation where the expression's address is the
shortest when measured in bits).
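
(For example, one crude way to measure an address's length in bits -
just a sketch, assuming the cost is naming each turned knob plus its
setting:)

import math

def address_bits(address, knob_arities):
    """address: dict knob -> setting, listing only the knobs turned away from
    their defaults; knob_arities: dict knob -> number of possible settings."""
    n = len(knob_arities)
    bits = 0.0
    for knob in address:
        bits += math.log2(n)                   # which knob was turned
        bits += math.log2(knob_arities[knob])  # which setting it was turned to
    return bits

def cheapest_parent(addresses, arities_by_rep):
    """addresses: dict representation -> the exemplar's address in it.
    Pick the representation in which the exemplar is cheapest to describe."""
    return min(addresses, key=lambda r: address_bits(addresses[r], arities_by_rep[r]))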

> 3)
> This is a somewhat whacky idea, but we're brainstorming here, so...
>
> I wonder if you could look at execution flow inside the program tree,
> rather than just at syntactic form, in determining what parts of the
> parent's probabilistic model to carry over to the child. If done
> right, this could make a big difference.
>
> Suppose you have representation T, and you mutate it into T'
>
> Then, suppose you run T and T' on the same inputs ... and keep track
> of the flow of execution inside T and T'
>
> Then, you align the execution traces (yes, there is work hidden here ;-) ....
>
> For those nodes N' within T' whose execution aligns well with
> comparable nodes N within T, **then** you port over N's probabilistic
> model and attribute it to N' [with a confidence proportional to the
> degree of alignment]
>
> This avoids being misled by syntactic parallels that are not actually
> semantic parallels, right?
>

Yeah, this makes sense. Of course, you still need to do the syntactic
alignment in order to get a variable correspondence, so you may be
misled anyways (unless you can compute a semantic similarity measure
across the subexpressions, which would be really expensive). A cheaper
and easier way to do something like this is, instead of using execution
traces, to just look at "behavioral distance" (e.g. truth-table Hamming
distance for boolean expressions, differences in sequences of actions
taken for agents, etc.). What do you think?
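
(By "behavioral distance" I mean something as simple as the following
Python sketch for the boolean case - purely illustrative:)

from itertools import product

def truth_table(expr, arity):
    """expr: a boolean function, e.g. lambda x, y: x and not y."""
    return [expr(*bits) for bits in product([False, True], repeat=arity)]

def behavioral_distance(expr_a, expr_b, arity):
    # normalized Hamming distance between the two truth tables
    ta, tb = truth_table(expr_a, arity), truth_table(expr_b, arity)
    return sum(a != b for a, b in zip(ta, tb)) / len(ta)

# e.g. "or" and "xor" disagree on exactly one of the four input rows:
print(behavioral_distance(lambda x, y: x or y, lambda x, y: x != y, 2))  # 0.25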

My fear is that all of this will require a lot of tuning to be useful
- e.g. in many cases there might not be any good alignment at all, and
so starting with a biased probabilistic model will actually degrade
performance...

- Moshe

Moshe Looks

Feb 19, 2009, 1:57:51 AM
to plop...@googlegroups.com
I just realized that my comment on #3 didn't make that much sense ...
feel free to ignore it!

Ben Goertzel

Feb 19, 2009, 2:00:03 AM
to plop...@googlegroups.com



> A cheaper and easier way to do something like this is, instead of using
> execution traces, to just look at "behavioral distance" (e.g. truth-table
> Hamming distance for boolean expressions, differences in sequences of
> actions taken for agents, etc.). What do you think?

I'm afraid that if you look at the behavioral distance between two different subtrees (S within T) and (S' within T'), you'll get a bad estimate unless you calculate the behavioral distance using a **realistic probability distribution** over inputs to the subtrees ...

and that kind of amounts to using execution traces over real inputs to the whole tree, right?

I mean, a subtree S within T is not gonna get uniformly distributed inputs, because the nodes feeding into it will have particular biases...

ben

Moshe Looks

Feb 19, 2009, 2:37:43 AM
to plop...@googlegroups.com
> I'm afraid that if you look at the behavioral distance between two different
> subtrees (S within T) and (S' within T'), you'll get a bad estimate unless
> you calculate the behavioral distance using a **realistic probability
> distribution** over inputs to the subtrees ...
>
> and that kind of amounts to using execution traces over real inputs to the
> whole tree, right?
>
> I mean, a subtree S within T is not gonna get uniformly distributed inputs,
> because the nodes feeding into it will have particular biases...
>
Yeah, agreed (unfortunately). Behavioral distance in general only
really makes sense to compute between the overall expressions... Of
course, there is the common special case of subtrees whose values only
depend on the inputs to the overall function...

- Moshe

Nil Geisweiller

Jul 30, 2009, 4:28:41 AM
to plop...@googlegroups.com, OpenCog
Hi Guys,

I'm reopening that old thread:
http://groups.google.com/group/plop-dev/browse_thread/thread/a478b24356c5de1f

1) due to reduction to normal form, tree alignment may be tricky some
(many?) times. Well in principle one could carry out the tree
alignment before reducing the new exemplar and then keep track of the
alignment while reducing, but in practice that will require a fair
bit of ComboReduct hacking.

2) in principle the distribution of a population inside a deme can be
represented by a stochastic contextual grammar. And actually the union
of all rules for every deme is also a stochastic contextual grammar,
but of course it is probably too specific to be used well anyway. So
the idea could be to try to infer a compact stochastic contextual
grammar (SCG) that captures well the dependencies contained in all
the rules accumulated generation after generation.

I haven't found anything about stochastic contextual grammar on the
web, but maybe that is not the right terminology. What I mean by that
is:

Say we have the exemplar after the knob-building step:

and(N1 N2 N3)

where Ni are non-terminal symbols for knob_i.

For instance, if N1 has no dependency, its marginal probability can be
represented by context-free rules like:

N1 --0.4-> a
N1 --0.6-> b

(or replace the probability by an indefinite TV to take confidence into account)

But if, say, N2 and N3 are dependent, then we need contextual rules:

and(N1 c N3) --0.3-> and(N1 c e)
and(N1 c N3) --0.7-> and(N1 c f)
and(N1 d N3) --0.9-> and(N1 d e)
...

Once we have collected a bunch of rules we can start trying to refine
the overall SCG; here the tree alignment technique would correspond to
renaming different knobs (corresponding to different populations) so
they match. But there are plenty of other things the system could
discover, like for instance generalizing the above rules into:

and(S c N3) --0.3-> and(S c e)

where S is a non-terminal symbol that can be rewritten into several
different sub-trees.
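
(As a toy illustration of what such rules might look like in code, here
is a Python sketch with a context-free marginal for N1 and a rule for
N3 conditioned on N2 - the names and structures are made up:)

import random

# Context-free marginal for N1 (no dependencies):
N1 = {'a': 0.4, 'b': 0.6}

# Contextual rules: the expansion of N3 depends on what N2 was rewritten to.
N3_given_N2 = {'c': {'e': 0.3, 'f': 0.7},
               'd': {'e': 0.9, 'f': 0.1}}

def sample(dist):
    r, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r < acc:
            return value
    return value  # guard against floating-point round-off

def sample_expression(n2_value):
    # expand and(N1 N2 N3) with N2 already fixed to n2_value
    return ('and', sample(N1), n2_value, sample(N3_given_N2[n2_value]))

print(sample_expression('c'))  # e.g. ('and', 'b', 'c', 'f')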

Or alternatively it could keep track of the entire trace across
populations and induce a compact grammar from scratch... Either way
I'm aware that it is a hard learning problem (contextual grammar ~=
Turing machine so...).

I'm actually even wondering if we could represent the SCG in terms of
nodes and links and use PLN to (help) learn that compact SCG... At
some point we will certainly want to implement an optimization algo
that relies on knowledge in the AtomTable and background knowledge
(including models across populations and fitness functions), but
that's another topic...

Nil

Ben Goertzel

Jul 30, 2009, 9:23:06 AM
to plop...@googlegroups.com, OpenCog
Hi,

> 2) in principle the distribution of a population inside a deme can be
> represented by a stochastic contextual grammar. And actually the union
> of all rules for every deme is also a stochastic contextual grammar,
> but of course it is probably too specific to be used well anyway. So
> the idea could be to try to infer a compact stochastic contextual
> grammar (SCG) that captures well the dependencies contained in all
> the rules accumulated generation after generation.

Basically this would be an alternative to decision tree modeling, right?

I believe Moshe and I discussed this back in 1887 or so ;-)

> I haven't found anything about stochastic contextual grammar on the
> web, but maybe that is not the right terminology.

Well this is well known

http://en.wikipedia.org/wiki/Stochastic_context-free_grammar

but I guess what you're referring to is a "stochastic context
sensitive grammar" or SCSG

I found some papers on this such as this one on image grammar

http://www.stat.ucla.edu/~sczhu/papers/Reprint_Grammar.pdf

but nothing particularly useful...

> I'm actually even wondering if we could represent the SCG in terms of
> nodes and links and use PLN to (help) learn that compact SCG... At
> some point we will certainly want to implement an optimization algo
> that relies on knowledge in the AtomTable and background knowledge
> (including models across populations and fitness functions), but
> that's another topic...

I suppose that inferring the SCSG is going to be best done by some
kind of "pattern mining" algorithm, customized to operate on the
population of Combo trees....

But the pattern mining is going to find only relatively simple SCSG
rules, and PLN may be used to generalize from these to find more
abstract SCSG rules...
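
(Just to gesture at what I mean, the simplest version of that mining
step might just count how often each subtree recurs across the
population - a toy Python sketch, with the tree format invented for
illustration rather than real Combo trees:)

from collections import Counter

def subtrees(tree):
    """Yield every subtree of a nested-tuple tree like ('and', 'x', ('or', 'y', 'z'))."""
    yield tree
    if isinstance(tree, tuple):
        for child in tree[1:]:
            yield from subtrees(child)

def mine_patterns(population, min_count=2):
    counts = Counter(s for tree in population for s in subtrees(tree))
    return {pattern: n for pattern, n in counts.items() if n >= min_count}

population = [('and', 'x', ('or', 'y', 'z')),
              ('and', 'x', ('or', 'y', 'w'))]
print(mine_patterns(population))  # here only the leaves 'x' and 'y' repeat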

ben g

Nil Geisweiller

Jul 30, 2009, 9:45:28 AM
to plop...@googlegroups.com, OpenCog
On Thu, Jul 30, 2009 at 9:23 AM, Ben Goertzel<b...@goertzel.org> wrote:
>
> Hi,
>
>> 2) in principle the distribution of a population inside a deme can be
>> represented by a stochastic contextual grammar. And actually the union
>> of all rules for every deme is also a stochastic contextual grammar,
>> but of course it is probably too specific to be used well anyway. So
>> the idea could be to try to infer a compact stochastic contextual
>> grammar (SCG) that captures well the dependencies contained in all
>> the rules accumulated generation after generation.
>
> Basically this would be an alternative to decision tree modeling, right?

well my point is that decision tree modeling for a particular
representation (set of knobs) can be seen as modeling a certain SCSG,
and a set of models can also be seen as a certain SCSG (the union of
all rules). But it could be seen as an alternative too, if the
algorithm to learn the SCSG of a particular deme differs substantially
from Bayesian learning with decision trees.

>
> I believe Moshe and I discussed this back in 1887 or so ;-)
>
>> I haven't found anything about stochastic contextual grammar on the
>> web, but maybe that is not the right terminology.
>
> Well this is well known
>
> http://en.wikipedia.org/wiki/Stochastic_context-free_grammar
>
> but I guess what you're referring to is a "stochastic context
> sensitive grammar" or SCSG

ahh thanks

>
> I found some papers on this such as this one on image grammar
>
> http://www.stat.ucla.edu/~sczhu/papers/Reprint_Grammar.pdf
>
> but nothing particularly useful...
>
>
> I suppose that inferring the SCSG is going to be best done by some
> kind of "pattern mining" algorithm, customized to operate on the
> population of Combo trees....
>
> But the pattern mining is going to find only relatively simple SCSG
> rules, and PLN may be used to generalize from these to find more
> abstract SCSG rules...

yes ultimately that would be the idea. But even before that, we can go
for Moshe's idea of tree alignment as a starting heuristic to find
interesting SCSGs at the meta-population level, and add some more
useful pattern mining algos on top of it.

Nil


Ben Goertzel

Jul 30, 2009, 10:30:39 AM
to plop...@googlegroups.com, OpenCog
> yes ultimately that would be the idea. But even before that, we can go
> for Moshe's idea of tree alignment as a starting heuristic to find
> interesting SCSGs at the meta-population level, and add some more
> useful pattern mining algos on top of it.

Yes ... but I have little intuition about the power of the tree
alignment method...

Just musing wildly: I wonder if there is some more strict way of
normalizing programs that could be useful.... ENF is good, but I
guess it still admits a lot of different equivalent programs. I
realize that the problem in general is intractable, but I have a
suspicion we could go further in terms of articulating more specific
hierarchical normalizations that are useful for the majority of
programs that are relevant in practice...

But I don't have time to go down this path in detail at this moment...

ben

Moshe Looks

Jul 30, 2009, 4:05:46 PM
to ope...@googlegroups.com, plop...@googlegroups.com
Hi Nil,

Let me add to what Ben's written that there has been a fair bit of
work on PMBGP (probabilistic model-building genetic programming) based
on various grammar formalisms, as well as GP systems based on
grammars:
http://sc.snu.ac.kr/PAPERS/PMBSurv.pdf
has a bunch of pointers to this literature...

The problem with keeping track of tree structure while doing reduct is
not just that it's tricky on a code level, but that it's in fact a
poorly posed problem - in some cases there are clear-cut alignments to
be made, but in many cases there aren't.

A less ambitious form of learning across demes that I've tried in plop
is sharing the probabilities for sampling various sorts of knobs
across demes. This can lead to modest speedups in some cases but also
increases the risk of premature convergence, so it's turned off at the
moment. There are of course various hacks one could try to prevent the
distribution from degenerating while still deriving benefit from this
knowledge, but I haven't gotten to this yet.
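
(For concreteness, one such hack might look something like the
following Python sketch - the names and the exact mixing scheme are
made up, not what plop currently does:)

def update_shared_probs(shared, deme_counts, rate=0.1, floor=0.05):
    """shared: dict knob_type -> current sampling probability (sums to 1).
    deme_counts: dict knob_type -> how often that knob type appeared among
    the deme's selected (good) candidates."""
    total = sum(deme_counts.values()) or 1
    observed = {k: deme_counts.get(k, 0) / total for k in shared}
    # move toward the frequencies observed in this deme...
    mixed = {k: (1 - rate) * shared[k] + rate * observed[k] for k in shared}
    # ...then mix in a uniform floor so no knob type's probability collapses to zero
    uniform = 1.0 / len(shared)
    floored = {k: (1 - floor) * p + floor * uniform for k, p in mixed.items()}
    norm = sum(floored.values())
    return {k: p / norm for k, p in floored.items()}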

Once it is working, however, one reasonable next step would be doing
linkage-learning amongst these parameters (which are maintained across
demes)...

- Moshe