A contribution and a request

Jeff Klann

unread,

Nov 3, 2010, 10:56:59 AM11/3/10

to pebl-project

Hello,

Strange that this list is so quiet. Pebl is a really fast toolkit. I suppose it doesn't have that many features yet but where are all the open-source developers? Everything else I've tried is either miserable on large datasets or closed source (with the exception of bnlearn which I haven't tried). What's everyone else using? Anyway, two less rhetorical things:

1) A contribution: I've written an exporter for pebl networks into .dnet files (http://www.norsys.com/dl/DNET_File_Format.txt). It also learns the parameters (CPDs) per Abhik's earlier example with the addition of using Dirichlet priors with pseudocounts of 1. DNET is Netica's format but is also supported by GEnie/SMILE (from UPitt) and probably others. I don't have Netica, but I got correct results on a network of 70 binary variables exporting into GEnie.

The code doesn't export any layout information. Also it's a bit messy and needs more testing, but I thought it'd be useful to someone and I'd like to contribute it to the project. I doubt it's ready to be wrapped into the main trunk unless there's someone who wants to massage it, but let me know what the options are.

2) A request: I'm realizing the learning algorithms in pebl are pretty painfully trivial, though like the docs say the framework should make it quite easy to write more. Has anyone done this? Even Tabu search would help the learner make better random decisions. The GES algorithm, though it requires search on equivalence classes, finds an optimal equivalence class (though my experience is it's pretty slow) - I'm sure it's possible to adapt the principles to DAGs. What I would love is a hybrid search/score implementation like Sparse Candidate or MMHC, which I might try my hand at if no one else is working on things.

Also, Abhik.... why do you keep all of the "bad" networks in the posterior? What's the point of a consensus network that includes those? It seems to me it'd make more sense to only keep the top network before each restart, and then run a bunch of learners in parallel and merge those top networks and compute the consensus network. Your tutorial basically does that, but all the "bad" network choices you keep must mess things up?

Abhik Shah

unread,

Nov 3, 2010, 4:18:34 PM11/3/10

to pebl-p...@googlegroups.com

1) Could you send another email with your code as attachment or send a
URL to it. I can add it to my list of changes I want to make to
pebl... BUT without a guarantee on when it'll be done :)

2) I wrote pebl before I knew exactly how I would be using it for my
thesis work.. and since my projects haven't used the built in
learners, I haven't spent much time improving them. Greedy and
Simulated Annealing, for example, depend crucially on the config
parameters used but I haven't made any effort to guess ideal
parameters given a dataset. Other learners would be great but I'm not
sure if I'll be able to devote much time on them. I'm finishing up my
dissertation and I'm not sure if my future work will use pebl.

3) Generating the "true" consensus network requires the full
posterior. Because we can't do that, we approximate by sampling the
posterior which is what the learners do. The "bad" networks don't
really mess things up because the probability for each edge in the
consensus network is a sum of the normalized probabilities of networks
containing the edge. The bad networks have low probability and have
little effect on the final edge probabilities although keeping lots of
low-probability networks does eat up memory. Each learner keeps the
top N networks found during the search. N can be changed via the
'result.size' config param. I think the ideal search for building
consensus network would be MCMC. Pebl doesn't provide that but
simulated annealing with a very high starting temperature should
approximate it well.

Thanks,
Abhik

Charles R. Twardy

unread,

Nov 7, 2010, 1:34:07 PM11/7/10

to pebl-p...@googlegroups.com, Jeff Klann

Jeff,

Thank you for the Netica exporter. This is something I will use, and
could potentially assist with. Alas, I have a hard time making time.

What are the key references for Sparse candidate or MMHC? Hybrid search
is an interesting idea, esp. if they maintain the convergence properties
of the samplers.

Following up on what Abhik said, keeping the "bad" networks should
improve predictions. Per Solomonoff, the best predictor would keep *all*
networks, and predict by posterior-weighted average.

-crt

--
Charles R. Twardy
Research Assistant Professor
George Mason University C4I Center
703 993 1846 voice
571 212 0674 mobile
ctw...@gmu.edu email

Jeff Klann

unread,

Nov 18, 2010, 11:59:44 AM11/18/10

to ctwardy, pebl-project

Hi Charles & Abhik,

Sorry for the delay, I've been away at conferences for a bit. Glad to see some interest in PeBL!

1) The DNET exporter is on my website now at http://www.jeffklann.com/home/demo/jkpebltools/. Assuming I keep using PeBL, I am happy to put together and maintain and standalone package of my add-ons - I see people doing this with BNT. If you want to wrap it into PeBL that'd be awesome, though like I said it needs more testing. Also I'm new to contributing to open source - I'd like to have some acknowledgment retained in the code/docs. Is that pretty standard?

2) Abhik, maybe I'm not understanding something but your Greedy Search seems to not match what's described in the tutorial. I believe a standard greedy search tries all O(n^2) local additions on the network and keeps the best one. Yours just chooses a random operation (add, delete, reverse) and continues until score stops increasing. This seems more like a best-first approach with reversals and deletions. Is there any literature on how that compares to regular greedy search or a reason you did it this way? It seems to me the results would be more unpredictable but maybe if you take the consensus network of 100s of searches on a distributed cluster it'd work well. I'm curious. Also, if you don't use PeBL for structure search, what do you use it for? That seems to be it's primary purpose! :)

3) Charles, Nir Friedman wrote a paper called "Learning Bayesian Network Structure From Massive Datasets" in Proc UAI - which is what I'm drawing from. MMHC is described in "The max-min hill-climbing Bayesian network structure learning algorithm" in AI. Also you might look for "Causal explorer: A causal probabilistic network learning toolkit for biomedical discovery" which describes Vanderbilt's very comprehensive toolkit and has references to several hybrid algorithms. I'm probably insane to think about reimplementing something when Vanderbilt's toolkit does a great job, but it's not open source and it's in Matlab, which are two hindrances.

One thing you could help with on #3, if you're interested, is answering a few questions I have in trying to understand Friedman's paper. In particular, I don't understand how iteratively restricting and maximizing produces a better network, especially if you keep the network's old parents at each iteration (which I'm not sure the paper is suggesting) - I'd think that would leave the greedy search at a plateau. Anyway, if you'd like to help in that way we can take it offline from the whole pebl group.

Cheers,
Jeff Klann
NLM Medical Informatics Fellow, Regenstrief Institute
PhD Candidate, Indiana University

Abhik Shah

unread,

Nov 20, 2010, 1:34:32 PM11/20/10

to pebl-p...@googlegroups.com, ctwardy

On Thu, Nov 18, 2010 at 11:59 AM, Jeff Klann <jkl...@gmail.com> wrote:
> Hi Charles & Abhik,
>
> Sorry for the delay, I've been away at conferences for a bit. Glad to see
> some interest in PeBL!
>
> 1) The DNET exporter is on my website now at
> http://www.jeffklann.com/home/demo/jkpebltools/. Assuming I keep using PeBL,
> I am happy to put together and maintain and standalone package of my add-ons
> - I see people doing this with BNT. If you want to wrap it into PeBL that'd
> be awesome, though like I said it needs more testing. Also I'm new to
> contributing to open source - I'd like to have some acknowledgment retained
> in the code/docs. Is that pretty standard?

I'll hold off on adding to pebl for now.. Regarding attribution, I
would be more than happy to acknowledge your contribution in both code
and docs.

>
> 2) Abhik, maybe I'm not understanding something but your Greedy Search seems
> to not match what's described in the tutorial. I believe a standard greedy
> search tries all O(n^2) local additions on the network and keeps the best
> one. Yours just chooses a random operation (add, delete, reverse) and
> continues until score stops increasing. This seems more like a best-first
> approach with reversals and deletions. Is there any literature on how that
> compares to regular greedy search or a reason you did it this way? It seems
> to me the results would be more unpredictable but maybe if you take the
> consensus network of 100s of searches on a distributed cluster it'd work
> well. I'm curious. Also, if you don't use PeBL for structure search, what do
> you use it for? That seems to be it's primary purpose! :)
>

I believe that checking one or all neighboring networks are both
considered greedy searches. Checking all neighbors makes the search
deterministic (given a known starting seed) whereas choosing a random
change is more akin to simulated annealing. There actually is code in
pebl/learners/base.py to check all changes but it's not currently
being used. To be honest, the learners in pebl could benefit from more
attention. In my work, rather than learning structure, I've been
scoring structures suggested by other sources of knowledge/data. This
way I have a combinatorial space of models that is far smaller than
the full space and I just write a custom learner that enumerates all
models.

Hope that helps,
Abhik

Jeff Klann

unread,

Nov 25, 2010, 11:43:48 AM11/25/10

to pebl-project, ctwardy

Abhik, great, thanks for the info. I'll have a look at base.py.

To Abhik or any other great Python developers - I'm working on an implementation of MMHC and my remaining limiting factor is the speed of the X^2 test. It's pretty slow. I calculate a pebl Multinomial CPD, do calculations on the counts, and return the X^2 and a significance value from scipy.stats. It's pretty slow.

Would anyone care to take a look? http://www.jeffklann.com/home/demo/jkpebltools (file is chi2.py) Limited testing convinces me it's correct but very slow. Higher up in MMHC I'm caching the X^2 results as appropriate so I'm not recomputing unnecessarily, but still a lot of X^2 tests are necessary. Maybe there's a way to do it without computing a full cpd every time? Alternately, any other test of conditional independence that's also able to determine a level of association between two variables would work fine, if anyone knows of a faster one.

Happy Thanksgiving,

Jeff Klann
NLM Medical Informatics Fellow, Regenstrief Institute
PhD Candidate, Indiana University

Reply all

Reply to author

Forward