Re: how to tune ADVI


Andrew Gelman

Jul 16, 2015, 10:36:45 PM
to Dustin Tran, stan...@googlegroups.com, Jalaj Bhandari
Hi, let me just say that I have no ideas at all on this but now I’m thinking this is important so I’m cc-ing stan-dev (and continuing to cc Jalaj, as this could be an excellent project for him to work on).

I’ll just emphasize that, right now, users are tuning ADVI by hand, so this suggests we should be doing something.  If we don’t have any auto-tune, we should at least have an example with a workflow for manual tuning.

But the most Stan-ish thing would be some autotune.

A

On Jul 16, 2015, at 4:50 PM, Dustin Tran <dt...@g.harvard.edu> wrote:

It's actually RMSprop that's being run under the hood in Stan, with a constant decay factor of 0.9 for the moving average.
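For reference, the generic RMSprop update with that 0.9 decay looks something like the sketch below (eta and eps are illustrative values, not Stan's internals):

```python
import numpy as np

def rmsprop_step(theta, grad, s, eta=0.1, decay=0.9, eps=1e-8):
    # Exponential moving average of squared gradients with decay 0.9,
    # then a gradient step scaled by the root of that average.
    s = decay * s + (1.0 - decay) * grad ** 2
    theta = theta - eta * grad / (np.sqrt(s) + eps)
    return theta, s
```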

I've played around with Adam and haven't seen much noticeable improvement. I view it as a generalization of both AdaGrad and RMSprop, one which includes additional hyperparameters to tweak. Adam still gives a biased estimate of the inverse Fisher information, if I recall correctly. There's also recent work by Roger Grosse, James Martens, and Rus on sparsely factorizing the inverse Fisher information, but it's problem dependent.

The whole line of work on learning rates is starting to run dry, as there are only so many ways to approximate a matrix. The real innovation I'd ultimately like to see in Stan is a proximal method that is robust and works for all (or at least many) classes of models, kind of like your trust region paper, Matt. That would make the optimization less sensitive to things like learning rate schedules and conditioning; cf. Panos Toulis and Edo Airoldi's work on implicit SGD, which derives its statistical properties.
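For concreteness, a rough sketch of the implicit update for a toy squared-error loss (illustrative only; nothing like this is implemented in Stan):

```python
import numpy as np

def implicit_sgd_step(theta, x_i, y_i, eta):
    # Implicit update theta_new = theta - eta * grad f_i(theta_new)
    # for f_i(theta) = 0.5 * (y_i - x_i @ theta)**2. Solving for theta_new
    # gives a step damped by 1 / (1 + eta * ||x_i||^2), which is what
    # makes the method robust to an overly large eta.
    resid = y_i - x_i @ theta
    return theta + (eta / (1.0 + eta * (x_i @ x_i))) * resid * x_i
```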

Dustin

On Jul 16, 2015, at 7:32 PM, Matt Hoffman <mdho...@cs.princeton.edu> wrote:

Adagrad can work really well, although it does involve a little extra tuning compared to SGD with no preconditioning.

Another possibility to consider is RMSprop, which is a very similar idea. Durk Kingma has a "corrected" version which he calls Adam.
I don't have any experience with it myself, but IIRC the story is that RMSprop uses a biased estimate of the diagonal of the gradient's covariance matrix, and Adam corrects that bias.
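For reference, the textbook Adam update looks roughly like the sketch below (generic form, not Stan code); the m_hat/v_hat lines are the bias corrections that distinguish it from RMSprop:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # Moving averages of the gradient and squared gradient...
    m = b1 * m + (1.0 - b1) * grad
    v = b2 * v + (1.0 - b2) * grad ** 2
    # ...then the bias corrections (t is the 1-based iteration count).
    m_hat = m / (1.0 - b1 ** t)
    v_hat = v / (1.0 - b2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```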

One editorial comment:
I suspect there are a lot of cases where ADVI will be a pretty good substitute for MCMC in MML estimation, since in many of those cases the expectation w.r.t. the posterior of the gradient of the log joint is close to the gradient of the log joint evaluated at the expected value of the nuisance parameter, i.e.,
E_{p(z|y; θ)}[∂ log p(y, z; θ)/∂θ] ≈ ∂ log p(y, E_{p(z|y; θ)}[z]; θ)/∂θ
and I expect ADVI to often give not-too-terrible estimates of the expected value of the nuisance parameter.

Matt

PS: Andrew, I hear you might be at NIPS this year?

On Wed, Jul 15, 2015 at 11:42 PM Dustin Tran <dt...@g.harvard.edu> wrote:
Great results! I expected ADVI to outspeed glmer, but not by that order of magnitude! It looks like you should try increasing the hyperparameter. In general with AdaGrad, the hyperparameter eta determines the constant step size at which one proceeds, before dividing by the gradient history, which can be seen as an approximation to the inverse Fisher information and thus the natural gradient. The eta matters for convergence at finite sample sizes, whereas the gradient history appeals to asymptotics.
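In pseudocode, the AdaGrad update is roughly the following (the eta and eps values are illustrative placeholders, not what Stan uses):

```python
import numpy as np

def adagrad_step(theta, grad, hist, eta=1.0, eps=1e-8):
    # hist accumulates squared gradients (the "gradient history") and acts
    # as a crude diagonal preconditioner; eta is the constant step size.
    hist = hist + grad ** 2
    theta = theta - eta * grad / (np.sqrt(hist) + eps)
    return theta, hist
```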

As Andrew mentions, we have eta set to a single value right now, although it's certainly problem dependent. What the optimization community often does is run the optimization over a subset of the data and do a grid search to see which value does best on that subset; cf. Leon Bottou's SGD code. In essence, they do a mini bootstrap/cross-validation at the beginning. I'll add an issue on GitHub to remind us to implement this too.
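A rough sketch of that tuning loop (run_advi, its signature, and the candidate grid here are hypothetical placeholders):

```python
import numpy as np

def tune_eta(run_advi, data, etas=(0.01, 0.1, 1.0, 10.0), frac=0.1, seed=0):
    # Run the fitting routine on a random subset of the data for each
    # candidate eta and keep the one with the best objective (e.g. ELBO).
    # run_advi(subset, eta) -> float (higher is better) is assumed.
    rng = np.random.default_rng(seed)
    n = max(1, int(frac * len(data)))
    idx = rng.choice(len(data), size=n, replace=False)
    subset = [data[i] for i in idx]
    scores = {eta: run_advi(subset, eta) for eta in etas}
    return max(scores, key=scores.get)
```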

Dustin

On Jul 16, 2015, at 5:23 AM, Andrew Gelman <gel...@stat.columbia.edu> wrote:

Alp and Dustin: What do you think?  Also cc-ing Matt as he is an expert on VI and tuning.

On Jul 15, 2015, at 11:21 PM, Jalaj Bhandari <jb3...@columbia.edu> wrote:

Absolutely, that was my thought when I ran it for different parameter values. I don't know how AdaGrad works; maybe I should look into it?

Would it be possible to dynamically adjust the step size parameter?



On Wed, Jul 15, 2015 at 11:16 PM, Andrew Gelman <gel...@stat.columbia.edu> wrote:
J
Cool!  I’m cc-ing Alp, Dustin, Rob, and Jonah.
It sounds like autotuning ADVI would be a useful step!?!
A


Begin forwarded message:

From: Jalaj Bhandari <jb3...@columbia.edu>
Subject: Re: an example where glmer is much faster than Stan
Date: July 15, 2015 at 10:22:19 PM EDT
To: Andrew Gelman <gel...@stat.columbia.edu>

Hi Professor,

I ran this with ADVI; please find the results attached. The parameter estimates are not as lousy as glmer's, and the computations are fast.

Thanks

On Tue, Jul 14, 2015 at 11:14 PM, Andrew Gelman <gel...@stat.columbia.edu> wrote:
Hi all. I came across this example in my applied work. I've scrambled the data and replaced the outcomes with fake ones so as not to violate any privacy in the dataset. Attached to this message are a zipfile with the data, the Stan code, and the R code.

Here’s the story:  I have a relatively simple hierarchical model (nonnested, with 3 batches of varying intercepts of dimension 51, 15, and 51*15) and 150,000 data points.  Running Stan for 4 chains in parallel, 100 iterations each, takes about 3 times as long as glmer.  But really most people would run Stan for at least 1000 iterations, so Stan is much slower than glmer here.

This is a big motivation for us to get MML working in Stan, also a great example for trying out the new ADVI on a real live problem!

In the attached code I’m fitting the model to a random subset of 20,000 data points, just so it will run faster and you can see what’s going on.

For what it's worth, glmer and Stan give similar estimates for the regression coefficients, but they differ on the estimated variance parameters, where Stan seems to be doing much better. That's no surprise, because glmer has no regularization on these variance parameters. Our MML and ADVI will allow such regularization, so they should outperform glmer.

So our goal is to have an approximate algorithm that runs fast; this example is a motivator.

P.S.  Stan runs faster on these simulated data than on the real data.  Folk theorem in action, I guess.









Dustin Tran

Jul 17, 2015, 2:37:56 AM
to stan...@googlegroups.com
To summarize, the lowest hanging fruit for autotuning is to run ADVI over a subset of the data and check among a coarse grid of values which led to fastest convergence.

