ADVI - current statistical thinking

Peadar Coyle

Sep 28, 2016, 8:14:57 AM
to Stan users mailing list
Hi all,
I'm one of the PyMC3 contributors - I ran into Michael Betancourt yesterday at an event in London. 

He told me 'ADVI doesn't really work'. 

Does anyone have benchmarks or current statistical thinking on this? I'm interested.

Bob Carpenter

Sep 28, 2016, 12:30:44 PM
to stan-...@googlegroups.com

> On Sep 28, 2016, at 8:14 AM, 'Peadar Coyle' via Stan users mailing list <stan-...@googlegroups.com> wrote:
>
> Hi all,
> I'm one of the PyMC3 contributors - I ran into Michael Betancourt yesterday at an event in London.
>
> He told me 'ADVI doesn't really work'.

It's well known, because of the direction of the KL divergence
being optimized, that

* variational inference underestimates posterior variance.

But we don't really know how much (or at least I don't).
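The variance underestimation can be seen in closed form for a toy case. The sketch below (my own illustration, not Stan code) uses the standard result that minimizing KL(q || p) over factorized Gaussians q, when p is multivariate normal with precision matrix Lambda, gives each q_i the variance 1/Lambda_ii:

```python
# Toy illustration of mean-field reverse-KL variance underestimation.
# For a bivariate normal p with unit marginal variances and correlation
# rho, the precision matrix has diagonal entries 1 / (1 - rho^2), so the
# optimal factorized normal q has per-coordinate variance 1 - rho^2 --
# strictly smaller than the true marginal variance of 1 whenever rho != 0.

def meanfield_variance(rho):
    """Variance of the reverse-KL-optimal mean-field normal approximation
    to a bivariate normal with unit variances and correlation rho."""
    return 1.0 - rho ** 2

for rho in (0.0, 0.5, 0.9):
    # True marginal variance is 1.0 in every case.
    print(rho, meanfield_variance(rho))
```

The stronger the posterior correlation, the worse the shrinkage: at rho = 0.9 the mean-field variance is only 0.19 against a true marginal variance of 1.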

Rayleigh and Andrew wrote an evaluation package for ADVI using a bunch
of models we have prebuilt for Stan, for which we have the right answers
from Stan's (and often JAGS's) MCMC. I don't know if
they made the R repo for this public.

They're evaluating how often the ADVI posterior mean for
a parameter falls within one posterior standard deviation
of the MCMC posterior mean. I don't know if it evaluates posterior
variance estimates.

The tests are for ADVI, not just for normal variational approximations.
Specifically, ADVI can fail for two reasons: it fails to find the
correct parameters to minimize the KL divergence (there are adaptation
and tuning parameters for the optimization) or the normal approximation
is inappropriate. The former is likely to improve over time as
the ADVI defaults improve.

I'm also not sure of the proportion of models for which ADVI gets
every parameter to within one posterior s.d., fails to get
at least one parameter within one posterior s.d., or just
fails to converge. And it varies between the mean
field (independent normal approximation per parameter)
and full rank (multivariate normal approximation).
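Both variants are selectable from CmdStan's variational interface. The invocation below is illustrative: it assumes a compiled model binary (./bernoulli) and data file that aren't part of this thread, but the method and algorithm arguments are CmdStan's own:

```shell
# Mean-field ADVI: independent normal approximation per parameter.
./bernoulli variational algorithm=meanfield \
    data file=bernoulli.data.json output file=meanfield.csv

# Full-rank ADVI: a single multivariate normal approximation.
./bernoulli variational algorithm=fullrank \
    data file=bernoulli.data.json output file=fullrank.csv
```

Full rank can capture posterior correlations that mean field ignores, at the cost of optimizing many more parameters.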

Word on the street is that

* it works better with more data so that the posterior
is more approximately normal (independent or multivariate)

* the prior parameters in a hierarchical model may
be off and the lower-level coefficients still roughly
correct

* it can often perform very well predictively even when
some params are off by more than one posterior s.d. (the
previous hierarchical model case would be an example)

I hope the ADVI devs can clarify!

Until we have some better understanding of the nature of
the multivariate normal approximation, it's hard to know when
we can trust the parameter estimates from ADVI. On the other
hand, if we focus on predictions or other event probabilities,
and we have an independent means of calibration (like held out data),
then we can evaluate those independently and ADVI is going to be able
to fit models at scales we won't even be able to test with MCMC.
This latter property is why we're excited about making it work and
understanding where it will work.

- Bob
