The last time I searched for BN software was several years ago, and I
didn't find any that supported interventional data.
>
> 2) What happens to the CPTs in PeBL? There was some talk of them never
> being explicitly calculated, then some of them being updated on every
> iteration. If the latter is true, wouldn't the CPDs for the final
> network be available? But I'm not super familiar with scoring
> mechanisms and how BDeu works.
>
When scoring a network, pebl creates a CPT, uses it to calculate the
BDe score and then destroys it. So, given a learner or a network,
there's currently no way to get the CPTs used. But, with data and a
network, you can re-create the CPT for any node. Here's some
*untested* code:
Say you have a network 'net' and want the CPT for the variable with
index 3 (so, the fourth variable):

from pebl import data, cpd
dat = data.fromfile(filename)
node = 3                           # index of the child variable
parents = net.edges.parents(node)  # indices of its parents
# subset() takes variable indices: child first, then parents
subset = dat.subset([node] + parents)
c = cpd.MultinomialCPD_Py(subset)
-------
'subset' is a Dataset object (defined in data.py) with the child node
in the first column and the parents in the remaining columns.
c.counts contains the CPT. It's a 2-D numpy array:
1) Rows correspond to parent state configurations (specific values for
parent nodes).
If you have three parents, each binary, then rows correspond to:
{0,0,0}
{1,0,0}
{0,1,0}
{1,1,0}
...
{1,1,1}
Note that the parents to the left cycle faster than ones to the right.
2) Columns contain the counts for the parent state configuration of
that row. The first column counts observations with 0 for the child
value and the given parent states, the second column counts
observations with 1 for the child value, and so on. The number of
columns is one greater than the arity of the child node, because the
last column counts the number of times the parent state configuration
has been observed.
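To make the row ordering concrete, here is a small sketch that maps a parent state configuration to its row number and back. It is not pebl code: the function names are made up for illustration, and it assumes the ordering described above (leftmost parent cycles fastest) and that states are integers starting at 0.

```python
def parent_config_row(parent_values, parent_arities):
    """Row index for a parent configuration; leftmost parent cycles fastest."""
    row = 0
    multiplier = 1
    for value, arity in zip(parent_values, parent_arities):
        if not 0 <= value < arity:
            raise ValueError("state %r out of range for arity %r" % (value, arity))
        row += value * multiplier
        multiplier *= arity  # the next parent to the right cycles more slowly
    return row


def row_parent_config(row, parent_arities):
    """Inverse mapping: recover the parent states from a row index."""
    values = []
    for arity in parent_arities:
        values.append(row % arity)
        row //= arity
    return tuple(values)


# Three binary parents, as in the listing above.
arities = (2, 2, 2)
print(parent_config_row((1, 1, 0), arities))  # 3
print(row_parent_config(7, arities))          # (1, 1, 1)
```

With mixed arities the same arithmetic applies; only the multipliers change.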
Note that the table contains counts, so you'll have to divide by the
count in the last column to get CPT probabilities.
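The division can be sketched with plain numpy. This is only an illustration on a made-up counts table with the layout described above (child-value counts, then a final totals column); counts_to_probabilities is a hypothetical helper, not part of pebl. Rows whose parent configuration was never observed have a total of 0, so the sketch leaves them as NaN rather than dividing by zero.

```python
import numpy as np


def counts_to_probabilities(counts):
    """Turn a counts table (last column = row totals) into P(child | parents)."""
    counts = np.asarray(counts, dtype=float)
    child_counts = counts[:, :-1]  # one column per child value
    totals = counts[:, -1:]        # kept 2-D so it broadcasts across columns
    probs = np.full(child_counts.shape, np.nan)
    # Divide only where the parent configuration was observed at least once.
    np.divide(child_counts, totals, out=probs, where=totals > 0)
    return probs


# Made-up table: binary child, three parent configurations.
example = [[3, 1, 4],
           [0, 0, 0],
           [2, 6, 8]]
print(counts_to_probabilities(example))
```

Note that these are raw maximum-likelihood estimates from the counts; they do not include any prior (pseudo-count) terms that a BDe-style score would use.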
test/test_cpd.py contains unit tests that check that the table is
being built correctly. It should be helpful in unit-testing your
understanding as well.
Thanks,
Abhik.
Dataset.subset() creates a subset of the data and all metadata
(variable metadata, sample metadata, interventions and missing data
flags). It's a general purpose method to create a subset for any
purpose.
Dataset._subset_ni_fast(), which stands for "non-interventions fast",
returns a _FastDataset instance, which has no metadata. It returns just
the observations, minus the ones that are flagged as interventional.
It's only called by the evaluator module. It's a private method because
it was a hack to minimize the time to create subsets when they're only
used for scoring. Actually, the method should have been defined in the
evaluator or cpd module instead of the data module, but I'll leave it
here for backwards compatibility.
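For readers who want a picture of what such a "fast" subset amounts to, here is a rough numpy sketch. It is not pebl's implementation: subset_non_interventional is a hypothetical name, and it assumes the observations are a 2-D array (rows are samples, columns are variables) with a boolean interventions mask of the same shape. It also assumes that a sample is dropped when the child (the first requested variable) was intervened on, which is the usual treatment when scoring a node; the description above only says that interventional observations are removed.

```python
import numpy as np


def subset_non_interventional(observations, interventions, variables):
    """Bare observation columns for `variables`, skipping intervened samples.

    variables[0] is taken to be the child node (an assumption of this sketch).
    """
    observations = np.asarray(observations)
    interventions = np.asarray(interventions, dtype=bool)
    child = variables[0]
    keep = ~interventions[:, child]  # samples where the child was only observed
    return observations[keep][:, variables]


obs = [[0, 1, 1],
       [1, 0, 1],
       [1, 1, 0]]
# Variable 1 was set by intervention in the second sample.
mask = [[False, False, False],
        [False, True, False],
        [False, False, False]]
print(subset_non_interventional(obs, mask, [1, 0]))
```

No metadata is copied here, which is the point of the shortcut: only the raw array needed for counting is built.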
Thanks,
Abhik