ADVI for beginner questions


Sebastian Weber

Jul 17, 2015, 2:52:35 AM7/17/15
to stan-...@googlegroups.com
Hi!

I have started to play with ADVI and I must say I am impressed by the speed, but I still need to check if I get garbage (at first glance everything looks OK). Now, a couple of questions (and if you answer with RT(F)M that's fine - just let me know which document to read):

- A run with the default settings took 3h; a run with eta_agrad=0.05 (half of the default) took only 20 minutes. Is this normal? On what scale should I vary eta_agrad, i.e., should I try multiples or offsets (so 1/2^n, or eta_agrad = 0.1, 0.09, 0.08, ...)?

- ADVI complained about a diverging ELBO during the first 1000 iterations, and delta_ELBO_mean was in the thousands whereas delta_ELBO_median was around or below 1. Do I need to worry?

- For which types of problems does ADVI work well, and for which does it not?

- Is the resulting CSV a dump of MCMC-like samples? So can I use it in the usual way?

- Does Rhat mean anything? What else do I need to check for non-convergence?

- Can read_stan_csv from rstan read these CSV files, or should I read them in with R and skip over the first 1000 warmup iterations?

I am still evaluating, but let me just say that ADVI crunched an ODE problem which takes almost 3 days per chain in just 20-30 minutes!!! If the result is at all usable, then this is just awesome!

Best,
Sebastian

Bob Carpenter

Jul 17, 2015, 3:08:23 AM7/17/15
to stan-...@googlegroups.com
That's fantastic. Did you see if the results were similar
to those you got with MCMC?

Some answers inline.

On Jul 16, 2015, at 11:52 PM, Sebastian Weber <sdw....@gmail.com> wrote:
>
> Hi!
>
> I have started to play with ADVI and I must say I am impressed by the speed, but I still need to check if I get garbage (at first glance everything looks OK). Now, a couple of questions (and if you answer with RT(F)M that's fine - just let me know which document to read):

Other than the doc in the Stan manual and the CmdStan manual (2.7.0 versions
of both are now available), you'll want to read Alp's
arXiv paper for a deeper description of how it works:

http://arxiv.org/abs/1506.03431

> - A run with the default settings took 3h; a run with eta_agrad=0.05 (half of the default) took only 20 minutes. Is this normal? On what scale should I vary eta_agrad, i.e., should I try multiples or offsets (so 1/2^n, or eta_agrad = 0.1, 0.09, 0.08, ...)?
>
> - ADVI complained about a diverging ELBO during the first 1000 iterations, and delta_ELBO_mean was in the thousands whereas delta_ELBO_median was around or below 1. Do I need to worry?

These first two, Dustin or Alp are going to have to tackle.

> - For which types of problems does ADVI work well, and for which does it not?

It's labeled "experimental" exactly because we don't know
under what conditions it'll work well or how best to tune it.
So all these data points are super helpful.

> - Is the resulting CSV a dump of MCMC-like samples? So can I use it in the usual way?

As to the draws, those are from the variational approximation,
as explained in the paper. But they're meant to be usable in
the same way as the MCMC output. So if everything's done right,
you should be able to read them in just like MCMC output.

I think the plan is to perhaps do some importance weighting
in the future to make expectation calculations closer to the
true posterior, but I'm not 100% sure.

> - Does Rhat mean anything? What else do I need to check for non-convergence?

The Rhat won't mean anything --- n_eff should come out close to the
number of iterations because the draws are pure Monte Carlo (not Markov
chain) draws.

> - Can read_stan_csv from rstan read these CSV files, or should I read them in with R and skip over the first 1000 warmup iterations?

If everything's coded correctly, it should be able to. I haven't tried it.

> I am still evaluating, but let me just say that ADVI crunched an ODE problem which takes almost 3 days per chain in just 20-30 minutes!!! If the result is at all usable, then this is just awesome!

This is really exciting news. Thanks again for trying this out. I
can see a whole stream of PK/PD publications coming out of this if
it really is that great. Do you have an estimate of how long NONMEM
would take to fit a similar problem?

- Bob

Dustin Tran

Jul 17, 2015, 3:26:27 AM7/17/15
to stan-...@googlegroups.com
To answer what Bob hasn’t: 

> - A run with the default settings took 3h; a run with eta_agrad=0.05 (half of the default) took only 20 minutes. Is this normal? On what scale should I vary eta_agrad, i.e., should I try multiples or offsets (so 1/2^n, or eta_agrad = 0.1, 0.09, 0.08, ...)?

We’re looking into autotuning the hyperparameter for the learning rate. At the moment, the general idea is to increase the hyperparameter if convergence looks slow at the beginning, and to decrease it if the ELBO improves quickly at the beginning but progress stalls during the middle/end.
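To make eta's role concrete, here is a sketch of a plain AdaGrad update on one scalar parameter (the exact variant ADVI uses may add an offset or window the history, so treat this only as intuition): each step is the gradient scaled by eta over the root of the accumulated squared gradients, so eta rescales every step uniformly.

```python
import math

def adagrad_steps(grads, eta):
    """Plain AdaGrad on a single scalar parameter (a sketch, not
    necessarily Stan's exact variant): step_t = eta * g_t / sqrt(sum g^2)."""
    hist = 0.0
    steps = []
    for g in grads:
        hist += g * g  # accumulate squared gradients
        steps.append(eta * g / math.sqrt(hist))
    return steps

# Halving eta halves every step: it changes the optimization path
# (and hence how long convergence takes), not the optimum itself.
big = adagrad_steps([4.0, 2.0, 1.0], eta=0.1)
small = adagrad_steps([4.0, 2.0, 1.0], eta=0.05)
assert all(abs(b - 2 * s) < 1e-12 for b, s in zip(big, small))
```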

> - ADVI complained about a diverging ELBO during the first 1000 iterations, and delta_ELBO_mean was in the thousands whereas delta_ELBO_median was around or below 1. Do I need to worry?

We've seen it complain about a diverging ELBO at the beginning on many models. In general, the convergence diagnostics are still a rough work in progress; it's not uncommon to see what you're experiencing, so I wouldn't worry about it.

> - For which types of problems does ADVI work well, and for which does it not?

For now, it will work well for any posterior that can be reasonably approximated by independent normal distributions (on the unconstrained parameter space). We're looking into using better variational families in a fully generic way; cf. http://arxiv.org/abs/1506.03159
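As a concrete picture of that family, the sketch below (hypothetical parameter names, not tied to any real model) draws from one independent normal per unconstrained parameter and transforms a positive-constrained parameter back through exp, which is roughly what the mean-field draws in the output file represent.

```python
import math
import random

# Hypothetical two-parameter model: "theta" is unconstrained, "tau" is
# positive, so a mean-field approximation puts a normal on log(tau).
mu = {"theta": 0.3, "log_tau": -1.0}
sigma = {"theta": 0.2, "log_tau": 0.5}

random.seed(42)
draws = []
for _ in range(2000):
    # One independent normal per unconstrained coordinate.
    z = {k: random.gauss(mu[k], sigma[k]) for k in mu}
    # Transform back to the constrained space before reporting.
    draws.append({"theta": z["theta"], "tau": math.exp(z["log_tau"])})

# Every tau draw respects its positivity constraint, and the theta
# draws center on the variational mean.
assert all(d["tau"] > 0 for d in draws)
mean_theta = sum(d["theta"] for d in draws) / len(draws)
assert abs(mean_theta - 0.3) < 0.05
```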

Thank you for trying this out! Your results are very promising, and hopefully we can motivate more people to consider the feasibility of approximate Bayesian inference. :-)

Dustin



Sebastian Weber

Jul 17, 2015, 3:55:11 AM7/17/15
to stan-...@googlegroups.com
I have to warn you: I have only had a quick look at the mean estimates, and they looked similar to what I am used to seeing from MCMC results. However, I have to check more closely. I am happy to report back success or failure. Right now I need to understand what to look at to make sure ADVI works for my case.

@Dustin: I am not sure I understood your comment about changing eta_agrad. I just followed the advice in the manual, which says "if you see the ELBO diverging, then decrease eta_agrad". Other than that I have no clue, and I was surprised to see that changing from 0.1 to 0.05 made such a difference in computing time. If I remember right, the result for the larger eta_agrad=0.1 (the default) was indeed garbage. So I am a bit worried at the moment - I mean, if I try this on a problem where I don't know the MCMC result, how do I know that ADVI worked? Given the speed, I would just try out different eta_agrad values... so some guidance here would be very helpful. I am happy to throw this thing on our cluster - trying out these eta_agrad values in parallel is not a problem here (500 cores are waiting to be fed).

@Bob: No idea how fast NONMEM is on this problem; I haven't run it with this parametrization in ages. However, for this problem size anything below 30 minutes for a fully Bayesian solution is just amazing. And yes, if this turns out to work so well, we will have to write a lot of papers (but before the mainstream can use this we have to get the generic dosing thing into Stan, really).

Best,
Sebastian

Dustin Tran

Jul 17, 2015, 4:37:28 AM7/17/15
to stan-...@googlegroups.com
The hyperparameter eta in AdaGrad only affects the convergence rate and, theoretically, should not affect the results you get. If you did get different results, it likely comes from one of two possibilities: the algorithm hasn’t fully converged, or the algorithm converged to a different local optimum. Checking for convergence is easy by looking at the differences in the ELBO across iterations, and I imagine the first is most likely, because we cap the maximum number of iterations ADVI can run for.

At the moment, the best thing is to run ADVI on a subset of the data for different values of the hyperparameter, e.g., eta = {0.01, 0.05, 0.1, 0.5, 1}; then run ADVI on the full data using the hyperparameter which performed best. This is what we aim to implement within Stan itself.
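A toy version of that subset-based search (the objective below is a stand-in quadratic, not a real ELBO, and the grid is just the one from this message): give each eta the same iteration budget and keep the one that reaches the highest objective value.

```python
def objective_after(eta, iters=50, x0=5.0):
    """Gradient ascent on the toy concave objective -0.5 * x**2
    (a stand-in for the ELBO on a data subset) for a fixed budget."""
    x = x0
    for _ in range(iters):
        x += eta * (-x)  # gradient of -0.5 * x**2 is -x
    return -0.5 * x * x

etas = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]
best_eta = max(etas, key=objective_after)
# On this toy problem the largest step wins outright; on a real model an
# overly large eta can instead make the ELBO diverge, so the grid search
# is exactly about finding where that trade-off sits.
assert objective_after(best_eta) >= objective_after(0.001)
```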

Dustin

Sebastian Weber

Jul 17, 2015, 4:49:05 AM7/17/15
to stan-...@googlegroups.com
Ah, this helps already. Three more questions:

- In what sense is "best" meant when I run on data subsets? Speed?

- Concerning differences in the ELBO: I assume you refer to the deltas. These are given as mean-deltas and median-deltas. Should ADVI converge on both metrics, or is one of them sufficient? L2 metrics will always be larger than L1... medians are just so much more robust.
- Also, I saw that after 1000 iterations ADVI somehow changed, i.e. the delta-ELBO means became small (the delta-ELBO medians were already small). At the end I got a "converged" message, which made me happy, but given that I got it at the very end - are all samples then deemed OK, or only those after this message?

Thanks!

Sebastian

Dustin Tran

Jul 17, 2015, 4:56:41 AM7/17/15
to stan-...@googlegroups.com
Yup, “best” here means speed. To be safe, ADVI should converge on both; it’s easy to construct (practical) scenarios where the mean and the median disagree, which is why checking each makes sense.
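A sketch of that mean-vs-median check (my reading of the heuristic, not Stan's exact code): keep a window of relative ELBO changes and declare convergence only when both statistics are small. A single optimizer hiccup inflates the mean while leaving the median alone, which is exactly the kind of disagreement reported above.

```python
import statistics

def converged(elbos, window=10, tol=0.01):
    """Sliding-window convergence check on relative ELBO changes
    (a sketch of the heuristic, not Stan's exact implementation)."""
    deltas = [abs((b - a) / a) for a, b in zip(elbos, elbos[1:])]
    recent = deltas[-window:]
    # Require BOTH the mean and the median of recent changes to be small.
    return (statistics.mean(recent) < tol
            and statistics.median(recent) < tol)

smooth = [-100.0 + 0.001 * i for i in range(30)]  # steadily improving ELBO
spiky = list(smooth)
spiky[-5] = -50.0  # one late hiccup in the trace
assert converged(smooth)      # mean and median both tiny
assert not converged(spiky)   # the spike blows up the mean only
```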

I'm not sure what you mean by the second question. Samples in the output file are drawn from the variational distribution parameterized by the optimized variational parameters. That is, no samples are given to the user before convergence.

Dustin

Sebastian Weber

Jul 17, 2015, 5:16:57 AM7/17/15
to stan-...@googlegroups.com
OK, got it. How robust is this parameter with respect to small model changes? I mean, should I always profile it, or are small model changes likely OK and unlikely to change things too much?

Can you recommend any settings for the profiling step? I guess I don't need a long sampling phase, but how long should warmup be? I am asking for shortcuts here.

My second question is answered by the very reassuring "no samples are given to the user before convergence".

Thanks & sorry for bombarding you with questions...

Janne Sinkkonen

Jul 17, 2015, 10:40:32 PM7/17/15
to stan-...@googlegroups.com
My experience with a big hierarchical model (not the same one that had problems initializing):

- Mean-field ADVI was very fast, something like 10–30 minutes compared to 4–7 days for HMC. Means were in the right ballpark; variances not always so, and generalisation over the hierarchy was not so good (I got a geographic map out of it).

- Full-rank ADVI seems to take about the same CPU time as a single HMC chain. I did not run it to the end, so I can't be sure. ADVI can't really be blamed here, for the model has over 7000 parameters.

With a smaller regression/AR/t-residuals model, both converged fast.

My current impression is that it is often necessary to use a smaller eta than the default 0.01, and more ADVI samples. Otherwise the final convergence does not happen.

Dustin Tran

Jul 18, 2015, 3:12:52 AM7/18/15
to stan-...@googlegroups.com
@Sebastian: it's hard to say definitively, as theory for variational inference on arbitrary models is nigh impossible. In practice, I haven't noticed much change in the best hyperparameter setting when adding additional priors/hierarchies to the model. But "small model changes" can mean lots of things; changing distributions within the graphical model can easily require changes, for example, so in general I recommend that you continue to profile for now. Hopefully when I or Alp have time, we can implement the autotuning I mentioned earlier.

For general recommendations, and what I would first do when implementing this in Stan: run ADVI on, say, 10% of the data, doing a bisection-type search on eta = {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1}.

Bob Carpenter

Jul 18, 2015, 12:24:30 PM7/18/15
to stan-...@googlegroups.com

> On Jul 18, 2015, at 12:12 AM, Dustin Tran <dustinv...@gmail.com> wrote:
>
> ...
> For general recommendations, and what I would first do when implementing this in Stan: run ADVI on, say, 10% of the data, doing a bisection-type search on eta = {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1}.

Being a computer scientist, I'd have used 1/2^n (starting from n = 0,
naturally):

1, 1/2, 1/4, 1/8, 1/16, 1/32, ..., 1/1024.

Dustin took roughly 7 of those 11 points, so maybe 1/4^n:

1, 1/4, 1/16, 1/64, 1/256, 1/1024

to use 6 points evenly spaced on the log scale? Or
is there a reason to believe that the answer's
more likely to be in the 0.1 to 1 range and not be
greater than 1?

- Bob
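The two candidate grids above are both evenly spaced on the log scale and meet at the endpoints; a quick sketch of the counting:

```python
# Bob's two candidate grids, both evenly spaced on the log scale.
halving = [2.0 ** -n for n in range(11)]    # 1, 1/2, 1/4, ..., 1/1024
quartering = [4.0 ** -n for n in range(6)]  # 1, 1/4, 1/16, ..., 1/1024

# Same endpoints: both start at 1 and end at 1/1024 (exact, since
# powers of two are represented exactly in floating point).
assert halving[0] == quartering[0] == 1.0
assert halving[-1] == quartering[-1] == 1.0 / 1024
# The 1/4^n grid is every other point of the 1/2^n grid: 6 of 11 points.
assert quartering == halving[::2]
```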