stochastic linear regression based VB


Bob Carpenter

Dec 26, 2013, 3:52:39 PM
to stan...@googlegroups.com
There was a tantalizing paper in the most recent Bayesian Analysis
that it would be nice to hear from Matt or Michael (or Dave Blei, bcc-ed)
about:

Fixed-Form Variational Posterior Approximation
through Stochastic Linear Regression

Tim Salimans and David A. Knowles

http://ba.stat.cmu.edu/journal/2013/vol08/issue04/salimans.pdf

- Bob

Andrew Gelman

Dec 26, 2013, 5:16:34 PM
to stan...@googlegroups.com, Aki Vehtari, David Blei
I noticed this:

"The generic nature of our methodology would lend itself naturally to a software package for Bayesian inference along the lines of Infer.NET (Minka et al. 2010) or WinBUGS (Gilks et al. 1994), and would allow inference in a considerably wider range of models. Incorpo- rating automatic differentiation in such a package could clearly be beneficial…"

I was surprised to see this. I thought everybody who did this sort of thing knew about Stan. If the algorithm really works, Stan would seem like a natural fit.


Matt Hoffman

Dec 27, 2013, 1:03:14 AM
to stan-dev, Aki Vehtari, David Blei
I talked to David Knowles about Stan over the summer. David's
currently a postdoc with Daphne Koller at Stanford, but spent some
time at MSR Cambridge with the Infer.NET people, which partly explains why
that's the software on the tip of his tongue.

I haven't read the paper in detail yet (it's been on my reading list
since the arXiv version came out last summer), but what I've seen
seems very cool.

Matt

Michael Betancourt

Dec 28, 2013, 3:31:29 PM
to stan...@googlegroups.com
Unless I misunderstood something, all they're doing is using a single-sample Monte Carlo estimate to compute the variational expectations needed for updating the variational parameters. For variational distributions in the exponential family, they show that this Monte Carlo approximation is equivalent to a kind of stochastic VB.
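
Boiled down to a toy example, the update looks something like this (a Python sketch of my reading, with my own toy target, initialization, and step sizes -- not their exact Algorithm 1): draw one theta from q per iteration, regress log p(y, theta) on the sufficient statistics of q, and read the new variational parameters off the regression coefficients.

import numpy as np

rng = np.random.default_rng(0)

# Toy target: theta ~ N(0, 1) prior, one observation y ~ N(theta, 0.5^2).
# The exact posterior is N(1.2, 0.2), which the fixed point should recover.
y, s_lik = 1.5, 0.5
def log_p(theta):
    return -0.5 * theta ** 2 - 0.5 * ((y - theta) / s_lik) ** 2

# Fixed-form q = Gaussian, sufficient statistics T = (1, theta, theta^2);
# the leading 1 acts as the regression intercept.
C = np.eye(3)      # running single-sample estimate of E_q[T T']
g = np.zeros(3)    # running single-sample estimate of E_q[T * log p]
mu, s2 = 0.0, 1.0
for t in range(5000):
    w = 1.0 / (t + 10.0)                              # decaying weight
    theta = mu + np.sqrt(s2) * rng.standard_normal()  # ONE draw from q
    T = np.array([1.0, theta, theta ** 2])
    C = (1.0 - w) * C + w * np.outer(T, T)
    g = (1.0 - w) * g + w * T * log_p(theta)
    a = np.linalg.solve(C, g)   # the "linear regression" step
    if a[2] < 0.0:              # a[1], a[2] are the new natural parameters;
        s2 = -0.5 / a[2]        # skip the update if q wouldn't stay proper
        mu = a[1] * s2

print(mu, s2)   # should settle near (1.2, 0.2)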

Personally I find the appeal of stochastic VB to be the ability to stream data and, although they note the possibility of extending their algorithm to admit such a feature, it is not a fundamental part of their approach. I dunno, at some level it seems like a Cache-22 - yes, you can get better variational approximations by using richer variational distributions, but then you have to model twice: once for the actual model and once for its approximation.

Andrew Gelman

Dec 28, 2013, 4:09:49 PM
to stan...@googlegroups.com
M

I agree that settings with streaming data are a natural for stochastic optimization algorithms of all kinds, but I thought there was another important application area of stochastic VB, even in non-streaming, fixed data examples.  I'm thinking, for example, of hierarchical models with, say, 10^5 data points and 10^3 groups, maybe some varying intercepts and slopes, so that straight HMC (or, worse, Metropolis) could be fairly slow, taking 20 minutes or so.  Instead it would be great to have a black-box VB algorithm, and I could imagine stochastic approximation as being part of that story.

I guess what I'm saying is: right now we do not have any black-box distributional approximation that runs in Stan. As it stands, all we have is HMC (or joint optimization, which is essentially useless for the sorts of problems I'm thinking about). I think if we had any black-box implementation of a distributional approximation, then we could talk about strengths and weaknesses of alternatives. But, given that we have nuthin, anything looks good to me!

A

Michael Betancourt

Dec 28, 2013, 4:40:47 PM
to stan...@googlegroups.com
Sure, I just meant that I don't think this offers anything more than what Dave, his students, and I are working on.  Incidentally, the hyperparameters in hierarchical models have marginal distributions that are sufficiently wacky to render most variational distributions poor approximations.  I think this is one of the reasons I've been having such a hard time with the current algorithm.

Bob Carpenter

Dec 28, 2013, 4:43:10 PM
to stan...@googlegroups.com
My understanding (very rough given my lack of depth in VB, KL, and
exponential families) from the paper was that they were suggesting
an approach to non-conjugate models based on an exponential family
approximation. This approximation would be exact in some conjugate
exponential-family models and provide a generalized Laplace-type
approximation elsewhere. I didn't see any automatic way to exploit
the extra generality --- I think this is what Michael was saying when
he said you'd have to model twice.

They also have a bunch of different approaches to implementation
from batch to mini-batch (the latter of which does seem stochastic);
I didn't follow the distinctions very closely and I don't know the
history of all these algorithms for VB.

I agree that it would be nice to have something that's general, even if
it's not stochastic. If it were stochastic, that'd be awesome. I'll
be happy to let you all sort out the academic provenance of the ideas.

- Bob

P.S. I like the idea of a "Cache 22". It would only let you
read a value if it's already in the cache.

Bob Carpenter

Dec 28, 2013, 4:46:04 PM
to stan...@googlegroups.com


On 12/28/13, 4:40 PM, Michael Betancourt wrote:
> Sure, I just meant that I don't think this offers anything more than what Dave, his students, and I are working on.
> Incidentally, the hyperparameters in hierarchical models have marginal distributions that are sufficiently wacky to
> render most variational distributions poor approximations. I think this is one of the reasons I've been having such a
> hard time with the current algorithm.

Is it the general form of the approximations, the variable-at-a-time
nature of the approximations, or both?

Seems like we'll have to go down the rabbit hole of double modeling,
be it automatic or manual.

- Bob

Michael Betancourt

Dec 28, 2013, 4:51:15 PM
to stan...@googlegroups.com
Yeah, Bob definitely echoed my sentiments.

The problem with VB is that it requires a second step in defining the variational distributions. This is why there are so many different variations (hah!) of the algorithm out there.

I think what Andrew wants is a choice of variational distribution that works okay for all models - I think the Laplace or delta algorithms are the best choice here (once we have efficient higher-order autodiff these will be easy to implement).
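
For concreteness, Laplace is only a few lines once you can find a mode and get a Hessian. A sketch (toy target of my own; a finite-difference Hessian standing in for the higher-order autodiff we don't have yet):

import numpy as np
from scipy.optimize import minimize

def neg_log_post(theta):
    # Toy unnormalized negative log posterior: a correlated 2-D Gaussian.
    x, z = theta
    return 0.5 * x ** 2 + (z - x) ** 2

def hessian_fd(f, x, h=1e-5):
    # Central finite-difference Hessian (autodiff stand-in).
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.zeros(d); ei[i] = h
            ej = np.zeros(d); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4.0 * h * h)
    return H

mode = minimize(neg_log_post, np.zeros(2)).x
cov = np.linalg.inv(hessian_fd(neg_log_post, mode))
# Laplace approximation: q(theta) = N(mode, cov).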

VB, however, also lets us enter the world of big data (yes, I hate myself for saying that). This requires not just generic VB but also the ability to stream batches, which is what the original stochastic VB buys you.
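
The streaming update itself is cheap to sketch for a conjugate toy -- here a normal mean with known unit variance, written in the Hoffman et al. natural-parameter style (my own step-size constants, and not the Salimans and Knowles scheme):

import numpy as np

rng = np.random.default_rng(1)
N, S = 100_000, 100               # data size, minibatch size
y = 0.7 + rng.standard_normal(N)  # y_i ~ N(theta, 1), prior theta ~ N(0, 1)

# q(theta) = N(m, v), tracked as natural parameters (m / v, -1 / (2 v)).
prior = np.array([0.0, -0.5])
lam = prior.copy()
for t in range(2000):
    batch = y[rng.integers(0, N, size=S)]
    rho = (t + 10.0) ** -0.7      # Robbins-Monro step size
    # Intermediate estimate: pretend the minibatch, scaled up by N / S,
    # is the whole data set (each unit-variance point adds (y_i, -1/2)).
    lam_hat = prior + (N / S) * np.array([batch.sum(), -S / 2.0])
    lam = (1.0 - rho) * lam + rho * lam_hat

v = -0.5 / lam[1]
m = lam[0] * v
print(m, v)   # approaches y.mean() * N / (N + 1) and 1 / (N + 1)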

Michael Betancourt

Dec 28, 2013, 4:54:55 PM
to stan...@googlegroups.com
The issue here is that the hyperparameter marginals are highly skewed and often heavy-tailed. This pushes typical variational distributions (like the Gaussian) beyond what would be considered a reasonable approximation. What would be really great would be Student variational distributions, but that makes the math explode.

There will definitely be two modeling steps here (this is the huge benefit of MCMC, which works for any model). The question is whether we do the second step ourselves or force the user to do it.

Andrew Gelman

Dec 29, 2013, 1:23:42 PM
to stan...@googlegroups.com
Once we have something that runs, I'll be happy--because, once we have something that runs, it will be a benchmark for people (including us) to try to beat!

Andrew Gelman

Dec 29, 2013, 1:28:42 PM
to stan...@googlegroups.com
Streaming would be great, but first I'd like some black-box algorithm that works! If we have something that works on my basic hierarchical models (the ones we use for MRP) in Stan and runs fast with reasonable results, I'll be happy! Right now we have a big "hole": we can do simple GLMs and we can do full Bayes in great generality, but we can't do quick approximate Bayes for moderate-sized or large problems. I really would like something that people (including me) can run, that beats lmer/glmer in speed for moderate-sized datasets. If we are close to having that already, that's great.
A

Jared Murray

Dec 30, 2013, 9:33:52 AM
to stan...@googlegroups.com
On Saturday, December 28, 2013 4:54:55 PM UTC-5, Michael Betancourt wrote:
> The issue here is that the hyperparameter marginals are highly skewed and often heavy-tailed. This pushes typical variational distributions (like the Gaussian) beyond what would be considered a reasonable approximation. What would be really great would be Student variational distributions, but that makes the math explode.

Not necessarily -- their method extends to mixtures of exponential-family distributions. Have a look at their final example (the beta-binomial). They use a discrete mixture, but a continuous mixture like the t would be a straightforward extension.

You start with a model p(y, theta) that you would like to approximate with a family q(theta), where q(theta) = \int q(theta, u) du. You augment the model to p(y, theta, u) = p(y, theta)q(u|theta) and proceed with the variational distribution q(theta, u), which is now tractable. This problem has the same optima as the original problem with p(y, theta) and q(theta), I believe. The preceding section, where they describe using triangular q's (i.e. q(theta_1)q(theta_2 | theta_1)), is also interesting.
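
To spell out the "same optima" claim (a quick check of my own, not from the paper): if the variational conditional matches the q(u|theta) used in the augmentation, so that q(theta, u) = q(theta)q(u|theta), then the augmented ELBO is

E_q(theta,u)[ log p(y, theta) + log q(u|theta) - log q(theta, u) ]
  = E_q(theta)[ log p(y, theta) - log q(theta) ],

i.e. the conditional terms cancel and you're back to the original objective. For a concrete continuous mixture (my example, not theirs): taking q(theta | u) = N(mu, u*sigma^2) with q(u) = Inv-Gamma(nu/2, nu/2) makes q(theta) = \int q(theta, u) du a Student-t with nu degrees of freedom.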

I don't know whether these tricks are standard in the VB literature but they seem very important to me. Fixed form VB has at least a chance of capturing complex dependence, particularly when you can go beyond the exponential family. Mean field methods have been effective but a factorized approximate posterior just isn't going to work for a lot of applications, including most of mine.

I've got no connection to the Salimans & Knowles paper, but I hope stan-dev will take a hard look. Stan has the modeling language to express both the true model and perhaps the approximate posterior *and* Stan has the AD (their methods can use gradients and Hessians where they're available).

Andrew Gelman

Dec 30, 2013, 3:10:15 PM
to stan...@googlegroups.com

I'd be happy to see a successful implementation of this or any other black-box distributional approximation method in Stan.

