Inference and parameter estimation with interventional data

Jeff

Sep 30, 2010, 2:47:37 PM

to pebl-project

Piggybacking off of the discussion in January (on tying into a
parameter estimation / inference package)... a great thing about PeBL
is its ability to learn from interventional (experimental) data...
but my struggle is that this seems pretty worthless if I then throw
the network at a package that doesn't support interventional data to
learn parameters.

So...

1) Do any packages support learning parameters (and inference if
possible) on interventional data? I'm less concerned about free than
scalability and stability.
(I'm aware that BNT supposedly does do this but I've had no luck all
week getting it to do anything without an error. I'm also told that
its heavy use of cell arrays makes it inefficient and its use of
classes doesn't let it scale with MDCS easily.)

2) What happens to the CPTs in PeBL? There was some talk of them never
being explicitly calculated, then some of them being updated on every
iteration. If the latter is true, wouldn't the CPDs for the final
network be available? But I'm not super familiar with scoring
mechanisms and how BDeu works.

Thanks,
Jeff Klann
NLM Medical Informatics Fellow, Regenstrief Institute
PhD Candidate, Indiana University

Abhik Shah

Sep 30, 2010, 5:36:19 PM

to pebl-p...@googlegroups.com

On Thu, Sep 30, 2010 at 2:47 PM, Jeff <jkl...@gmail.com> wrote:
> Piggybacking off of the discussion in January (on tying into a
> parameter estimation / inference package)... a great thing about PeBL
> is its ability to learn from interventional (experimental) data...
> but my struggle is that this seems pretty worthless if I then throw
> the network at a package that doesn't support interventional data to
> learn parameters.
>
> So...
>
> 1) Do any packages support learning parameters (and inference if
> possible) on interventional data? I'm less concerned about free than
> scalability and stability.
> (I'm aware that BNT supposedly does do this but I've had no luck all
> week getting it to do anything without an error. I'm also told that
> its heavy use of cell arrays makes it inefficient and its use of
> classes doesn't let it scale with MDCS easily.)

Last time I searched for BN software was several years ago. I didn't
find any that supported interventional data.

>
> 2) What happens to the CPTs in PeBL? There was some talk of them never
> being explicitly calculated, then some of them being updated on every
> iteration. If the latter is true, wouldn't the CPDs for the final
> network be available? But I'm not super familiar with scoring
> mechanisms and how BDeu works.
>

When scoring a network, pebl creates a CPT, uses it to calculate the
BDe score and then destroys it. So, given a learner or a network,
there's currently no way to get the CPTs used. But, with data and a
network, you can re-create the CPT for any node. Here's some
*untested* code:

You have network 'net' and want CPT for the variable with index 3 (so,
the fourth variable).

from pebl import data, cpd

dat = data.fromfile(filename)
node = 3                           # index of the child variable
parents = net.edges.parents(node)  # indices of its parents in 'net'
subset = dat.subset([node] + parents)  # child first, then parents
c = cpd.MultinomialCPD_Py(subset)

-------
subset is a data.py:Dataset object with child node in first column and
parents in rest

c.counts contains the CPT. It's a 2-D numpy array:

1) Rows correspond to parent state configurations (specific values for
parent nodes).

If you have three parents, each binary, then rows correspond to:

{0,0,0}
{1,0,0}
{0,1,0}
{1,1,0}
...
{1,1,1}

Note that the parents to the left cycle faster than ones to the right.

2) Columns contain the counts. Each row corresponds to a parent state
configuration. The first column counts observations with 0 for the
child value and given parent states, second column counts observations
with 1 for child value, and so on. The number of columns is one
greater than the arity of the child node. The last column counts the
number of times the parent state has been observed.

Note that the table contains counts, so you'll have to divide by the
count in the last column to get CPT probabilities.
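A minimal sketch of that normalization and of the row ordering described above. It uses plain numpy on a made-up counts table rather than real pebl output; the helper names counts_to_probs and row_index and the example numbers are hypothetical, and rows for parent configurations that were never observed are left as NaN (an assumption, since the thread does not say how to treat them):

```python
import numpy as np

def counts_to_probs(counts):
    """Convert a counts table (last column = row total) to P(child | parents)."""
    counts = np.asarray(counts, dtype=float)
    child_counts, totals = counts[:, :-1], counts[:, -1:]
    # Parent configurations with a total of 0 stay NaN instead of dividing by zero.
    probs = np.full_like(child_counts, np.nan)
    np.divide(child_counts, totals, out=probs, where=totals > 0)
    return probs

def row_index(parent_values, parent_arities):
    """Row number of a parent configuration, leftmost parent cycling fastest."""
    index, stride = 0, 1
    for value, arity in zip(parent_values, parent_arities):
        index += value * stride
        stride *= arity
    return index

# Hypothetical table: binary child, two binary parents -> 4 rows, 3 columns.
counts = [[3, 1, 4],
          [0, 2, 2],
          [5, 5, 10],
          [0, 0, 0]]
probs = counts_to_probs(counts)
```

With real data you would pass c.counts in place of the made-up table; row_index((1, 0), (2, 2)) then gives the row for "first parent = 1, second parent = 0".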

test/test_cpd.py contains unit tests that check that the table is
being built correctly. It should be helpful in unit-testing your
understanding as well.

Thanks,
Abhik.

Jeff Klann

Oct 6, 2010, 12:25:55 PM

to pebl-project

Thanks so much for the info. It appears that your code

subset = dat.subset([node] + parents)
c = cpd.MultinomialCPD_Py(subset)

does not support interventional data. Flipping through the code, I found this seems to work:

c=cpd.MultinomialCPD_Py(dat._subset_ni_fast([node,parent]))

... where node is just the node number, and of course parent can be more than one item.

Not quite sure why I need to use a "hidden" method but it correctly adjusts for interventional data.

Thanks!

Jeff Klann

Abhik Shah

Oct 6, 2010, 1:08:50 PM

to pebl-p...@googlegroups.com

Ah, you're right.

Dataset.subset() creates a subset of the data and all metadata
(variable metadata, sample metadata, interventions and missing data
flags). It's a general purpose method to create a subset for any
purpose.

Dataset._subset_ni_fast(), which stands for "non-interventions fast",
returns a _FastDataset instance, which has no metadata. It returns just
the observations minus the ones that are flagged as interventional. It's
only called by the evaluator module. It's a private method because it
was a hack to minimize the time to create subsets when they're only
used for scoring. Actually, the method should have been defined in the
evaluator or cpd module instead of the data module, but I'll leave it
where it is for backwards compatibility.
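To illustrate the idea of dropping interventional samples, here is a rough numpy sketch. It is not pebl's code: the function name subset_non_intervened, the boolean interventions mask, and the rule that a sample is dropped only when the first listed variable (the child) was intervened on are all assumptions made for the example:

```python
import numpy as np

def subset_non_intervened(observations, interventions, variables):
    """Return the columns in `variables`, keeping only samples in which
    variables[0] (assumed to be the child) was not set by intervention."""
    observations = np.asarray(observations)
    interventions = np.asarray(interventions, dtype=bool)
    child = variables[0]
    keep = ~interventions[:, child]      # True where the child was only observed
    return observations[keep][:, variables]

# Hypothetical data: 4 samples, 3 binary variables.
obs = np.array([[0, 1, 1],
                [1, 0, 1],
                [1, 1, 0],
                [0, 0, 0]])
# True marks a value that was set experimentally rather than observed.
intv = np.array([[False, False, False],
                 [True,  False, False],
                 [False, True,  False],
                 [False, False, False]])
sub = subset_non_intervened(obs, intv, [0, 1])
```

In this toy example the second sample is dropped because variable 0 was intervened on, while the third is kept even though its parent (variable 1) was set experimentally.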

Thanks,
Abhik
