Probabilistic programming

54 views
Skip to first unread message

Peteris Erins

unread,
Jul 28, 2012, 12:00:24 PM7/28/12
to ll-...@googlegroups.com
In case you missed it, there was a recent article about probabilistic programming, implementing some of the most famous ML algorithms as generative models at http://zinkov.com/posts/2012-06-27-why-prob-programming-matters/.

It turns out probabilistic programming languages are pretty lightweight, see this paper for an explanation of how to write one based on ANY language: http://www.stanford.edu/~ngoodman/papers/WSG-AIStats11.pdf. There's a video too at http://videolectures.net/aistats2011_wingate_lightweight/. You might want to brush up on MCMC (Markov Chain Monte Carlo) before you dive in.

I'm sure some have played around with this already here so I have a pressing question. Using the "minimum choice point change" that the author advocates, i.e., taking one random choice point, modifying it according to a proposal, and then preserving as many possible values from the previous trace. Are we happy to have programs like this fail to get an accurate sample:

Let binomial() be flip() ? 1 : 0

a <- binomial()
b <- binomial()

Sample (a, b) conditional on a + b = 1

Here if I'm not mistaken, we are either going to get loads of samples of (0, 1) or loads of samples of (1, 0) based on which one the rejection sampler catches first. I've plugged this model into a Church interpreter and those are indeed the results I'm getting.

So what is the consensus on this? Is there some implicit requirement of an "axially" connected conditioning space or is there a way around this without more initializations necessarily?

Thanks
-- 
Peteris Erins

David Nolen

unread,
Jul 31, 2012, 12:20:08 PM7/31/12
to ll-...@googlegroups.com
Haha this is a lot of stuff, at least for me :) Thanks for bringing this up. I'd never heard of Church before, http://projects.csail.mit.edu/church/wiki/Church. This also seems like a good place to start: http://danroy.org/papers/church_GooManRoyBonTen-UAI-2008.pdf

Other suggestions Peteris?

David 

Peteris Erins

unread,
Jul 31, 2012, 11:00:35 PM7/31/12
to ll-...@googlegroups.com
The church wiki has a list of relevant papers http://projects.csail.mit.edu/church/wiki/Papers.

There's an interesting preprint "On the Computability of Conditional Probability" http://danroy.org/papers/AckFreRoy-CompCondProb-preprint.pdf, which shows that not all programs that can be written can be conditioned upon even inefficiently.

Another cool paper is about an ML DSL http://www.cs.rutgers.edu/~ccshan/rational/dsl-paper.pdf. It's a heavier read (I haven't tackled it yet), but they leverage monads and CPS.

These are the ones I have found to be most interesting to me so far.

Regards
-- 
Peteris Erins

ejackson

unread,
Aug 1, 2012, 4:14:06 AM8/1/12
to ll-...@googlegroups.com
Coming from the other side of the problem, the BUGS language is popular with
inference people.  The language is described here:
The DSL is expressive and easily understood by stats folk.

Although the original BUGS implementation is non-portable, there is a recent
(open source) implementation in C++ as JAGS here
Usefully its easily callable from R which is a nice environment for
playing and learning.

The underlying machinery may be of interest.

  Edmund

David Wingate

unread,
Aug 1, 2012, 7:41:26 AM8/1/12
to ll-...@googlegroups.com
Hi Peteris --

[A bit of background: the whole point of "minimal changes to execution traces" is to increase acceptance probabilities of MCMC moves.  There's nothing magical about this idea; it's just the observation that for most programs, it's going to be hard to craft complex proposals that change many variables at once and still maintain high acceptance rates.  Reusing as much randomness as possible is an easy way to create a proposal that's a "small" change, which is likely to be accepted.]

A couple of thoughts on your example.  First of all, there's nothing wrong with your reasoning -- but I think it would be useful to separate a couple of concepts here.  As written, your program definitely won't mix with a single-site MCMC sampler, as you've already noted.  There are a couple of ways around that:  1) do something more complex than single-site MCMC [blocked Gibbs, delayed rejection, constraint propagation, some other complex proposal mechanism, etc.], or 2) soften the constraint that A+B=1, possibly by adding noise (ie, condition on A+B+noise=1).

Solution #1 is a statement about an inference algorithm, not about how you manage traces in a probabilistic program.  Let's take the example of block Gibbs, and suppose that what you've written below is just a small part of a larger program.  The idea of "minimal changes" still applies in the block setting -- you make proposals to A and B simultaneously, keep all of the other variables in the program the same, and then accept/reject.  So in that sense, the complexity of the proposal mechanism is independent from the idea of re-running a trace and trying to keep as much of the trace similar as possible.

Solution #2 is a statement about the conditioner.  In general, hard conditioning statements like the one you've written are what make inference difficult.  You've basically written a simple SAT problem (and in fact, it's very easy to write SAT as a probabilistic program).  Sometimes, it's possible to take the constraints implied by the conditioner and propagate them backwards through a program (we call this "constraint propagation.").  It's hard to do in general languages, and even in languages like Lisp it isn't trivial.  In your example program, once you've resampled A, the constraint A+B=1 implies a value for B -- so you could just set B to be the correct value.  You've reused "as much randomness as possible", which is none -- but notice that you didn't *sample* a new value for B.  It was deterministically implied.

Constraint propagation is hard.  Automatically identifying good candidate block moves is hard.  The "lightweight implementations" paper deals with neither issue; those are largely language specific, and not the real point of our work.  But they're essential for real implementations.

By the way: several probabilistic programming language developers feel that constraint propagation won't ever really work for general programs, and that the only reliable solution is to "noise up" the model by ensuring that you never create hard conditioners.  You can almost always accomplish this by adding some sort of fudge factor at the "bottom" of your program; this essentially has the effect of creating the "axially connected" regions of probability mass you alluded to.

Oh, and it's an interesting (and current) research question to figure out how to automatically self-relax programs -- ie, take a program with a hard conditioner and automatically inject noise to improve mixability...

HTH,

David
Reply all
Reply to author
Forward
0 new messages