Simple latent Poisson time series model: Sampling versus Optimizing

Nicolas Chapados

Jun 4, 2014, 12:36:06 AM6/4/14
to stan-...@googlegroups.com
Dear Stan friends,

I've been experiencing a puzzling issue trying to use Stan to fit a very simple state-space time series model with Poisson observations: whereas "sampling" with Stan works perfectly, the "optimizing" call drives the latent innovations variance (called tau in the model) to zero, despite having a rather strong prior that should keep this variance away from zero.
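
For readers without the attachments, here is a minimal sketch of the kind of model being described, written in current Stan syntax. This is not the attached poisson_flatter.stan: the prior on tau and the use of tau as an innovation standard deviation (rather than a variance) are illustrative assumptions only.

  data {
    int<lower=1> T;
    array[T] int<lower=0> y;            // observed counts
  }
  parameters {
    real<lower=0> tau;                  // innovation scale of the latent random walk
    vector[T] eta;                      // latent log-intensity
  }
  model {
    tau ~ gamma(2, 0.1);                // illustrative boundary-avoiding prior (zero density at tau = 0)
    eta[1] ~ normal(0, 5);              // diffuse prior on the initial state
    for (t in 2:T)
      eta[t] ~ normal(eta[t - 1], tau); // random-walk innovations
    y ~ poisson_log(eta);               // Poisson observations with a log link
  }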

The attached R and Stan code reproduce the problem. Simply sourcing the "stan_sampling_vs_optim_flat.R" code in a fresh R environment will call Stan with some test data, both in sampling mode and optimizing mode, and produce a plot illustrating the issue. Note that to make the optimization somewhat "easier", I use as initial values the mean of the posterior distribution obtained after sampling, although the problem reported here is quite insensitive to initial values.

The first thing to notice is the end of the optimization trace:

initial log joint probability = 524.845
    Iter      log prob        ||dx||      ||grad||       alpha  # evals  Notes 
       0       527.382     0.0607042       27.8307       0.001        2   
      19       704.788      0.234647       465.337      0.2088       29   
      39       879.473      0.100888       7215.49    0.000664       78   
[...]
     459       1863.94     0.0145273   1.18276e+10     0.04406     1679   
     479       1867.73    0.00161514   1.11964e+10      0.1327     1699   
     499       1869.57    0.00274887   1.01092e+10      0.2288     1720   
     519       1873.55    0.00318458   7.37747e+09      0.2044     1742   
     539       1877.07    0.00691263   6.05137e+09      0.1593     1768   
     559       1878.34   0.000999365   4.91358e+09           1     1789   
     579       1878.85   3.32939e-05   4.07479e+09           1     1810   
     595       1878.85   3.48345e-08   4.07269e+09           1     1826   
Optimization terminated normally: Convergence detected: change in objective function was below tolerance

Here are the strange things:
  • The initial log joint probability is at a value that's typical of where MCMC finds its samples (see "lp__" in the attached density-plot and trace-plot of the MCMC sampling process).
  • As optimization proceeds, the log-prob grows very large, falling completely outside the range sampled by MCMC (roughly 450-510).
  • The norm of the gradient also becomes huge (some parameter-range and gradient errors are reported along the way, which I have not transcribed).
At the end of the optimization, the innovation variance tau at the reported mode is essentially zero:

> result.optimizing$par["tau"]
         tau 
9.183124e-20 

whereas the posterior distribution under MCMC is quite far from zero (again, see "tau" in the attached density-plot and trace-plot). 

If all goes well, after sourcing the R file you should see the plot attached as "stan_sampling_vs_optim_output.png". It shows two panels: the upper panel is the raw data used to fit the model, and the lower panel is the posterior expectation of the Poisson process log-intensity, both under MCMC (blue curve) and the modal approximation (red curve). The latter is essentially constant and does not fit the data at all.

I have the feeling that this might be a somewhat well-known pathology, although I could not find direct references to it. Why does the log-posterior probability increase so much during optimization as the innovation variance is driven to zero? Where does the model pick up such a strong preference for a tiny innovation variance, at the expense of the likelihood (i.e., a good fit to the data)? I am aware of the boundary-avoiding priors described in BDA3, but I believe that my prior on tau should be sufficiently boundary-avoiding.

Any insights?

Many thanks,
+ Nicolas



stan_sampling_vs_optim_flat.R
poisson_flatter.stan
stan_sampling_vs_optim_output.png
stan_sampling_vs_optim_traplot.png
stan_sampling_vs_optim_denplot.png

Michael Betancourt

Jun 4, 2014, 4:51:56 AM6/4/14
to stan-...@googlegroups.com

> I have the feeling that this might be a somewhat well-known pathology, although I could not find direct references to it. Why does the log-posterior probability increase so much during optimization as the innovation variance is driven to zero? Where does the model pick up such a strong preference for a tiny innovation variance, at the expense of the likelihood (i.e., a good fit to the data)? I am aware of the boundary-avoiding priors described in BDA3, but I believe that my prior on tau should be sufficiently boundary-avoiding.

Indeed, this is a well-known pathology (but perhaps not as prevalent in the literature as it should be).

Ultimately there are two issues at play here: why do sampling and optimization behave differently,
and why is optimization falling towards the boundary at zero?  Let’s address these one at a time.

You note that you start the optimization near where the samples concentrate and then it diverges —
that is not at all unexpected.  Samples are drawn from the probability mass, which takes into
account not just the density but also the differential volume, whereas optimization considers only
the density.  Consequently the neighborhood where samples concentrate (what we call the typical
set) will not, in general, be close to the posterior maximum.  In fact the posterior maximum doesn’t
even have a strong mathematical motivation [1].

If you plot the posterior density for the innovation variance versus the latent parameters that depend on it
you should see a funnel-like structure, with the high density towards zero variance concentrating
in a very narrow volume.  This narrow volume disfavors the zero-variance region when sampling,
which is why the samples appear so much better behaved.

Okay, so maybe the samples and the optimum shouldn’t be all that similar, but why is the optimum
diverging?  The problem is that although you’re adding regularizing prior information it is nowhere
near strong enough.  In models like this the likelihood dominates all but the strongest, unrealistic
prior information.  I don’t think I’ve ever seen a hierarchical model where the hyper variances didn’t
have maxima at the boundary, but that’s okay because I never try to optimize.  :-)
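
To make this concrete, here is a back-of-the-envelope sketch, assuming a Gaussian random walk on the latent log-intensity with innovation scale tau (a generic argument about this class of models, not a line-by-line reading of the attached code). The innovation terms contribute

  sum_{t=2..T} log Normal(eta[t] | eta[t-1], tau)
      = -(T - 1) * log(tau) - sum_{t=2..T} (eta[t] - eta[t-1])^2 / (2 * tau^2) + const,

so along a path where the latent trajectory flattens out (eta[t] -> eta[t-1]) the quadratic penalty vanishes and the -(T - 1) * log(tau) term grows without bound as tau -> 0. A prior can only prevent this if its density vanishes at least as fast as tau^(T-1) near zero, which for a long series is far stronger than the usual boundary-avoiding priors; this matches both the huge log-prob values in the optimization trace and the essentially constant red curve in the plot.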

One thing you can do is reparameterize: if you optimize over the log variance you should see a maximum
displaced from the boundary [2].
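
For concreteness, here is one way that reparameterization could look in Stan. This is only a sketch, with tau treated as an innovation scale and the prior now stated directly on log_tau (a normal(0, 1) prior is assumed purely for illustration); if you instead want to keep a prior stated on tau itself, add the change-of-variables term target += log_tau.

  data {
    int<lower=1> T;
    array[T] int<lower=0> y;
  }
  parameters {
    real log_tau;                         // unconstrained log innovation scale
    vector[T] eta;                        // latent log-intensity
  }
  transformed parameters {
    real<lower=0> tau = exp(log_tau);
  }
  model {
    log_tau ~ normal(0, 1);               // illustrative prior placed directly on the log scale
    eta[1] ~ normal(0, 5);
    for (t in 2:T)
      eta[t] ~ normal(eta[t - 1], tau);
    y ~ poisson_log(eta);
  }

A maximum now exists away from the boundary because the normal log-prior penalizes log_tau quadratically, which eventually dominates the (T - 1) * |log(tau)| growth of the innovation terms as tau -> 0.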

[1] The terse answer I would like to give is “don’t EVER optimize a density, because it doesn’t
mean anything mathematically”.  These kinds of pathologies are just manifestations of this
underlying principle. 

[2] The fact that the maximum depends on the (arbitrary) parameterization is a good sign that
something is wrong with the maximum.

Nicolas Chapados

Jun 4, 2014, 3:21:43 PM6/4/14
to stan-...@googlegroups.com
Michael,

Many thanks for your reply and suggestions. I agree that one should avoid blindly optimizing if at all possible, but sometimes it is necessary... Your reparametrization idea indeed solves the problem: modeling the log-variance produces a well-behaved optimization. 

My original problem (not posted) arose from a hierarchical model with a similar time-series structure — I will see if your insights help me avoid the same variance collapse problem.

A more philosophical question: your point [1] says "don't EVER optimize a density, because it doesn't mean anything mathematically". But isn't that where maximum likelihood comes from in the first place? [Not that I would find it unusual that a card-carrying Bayesian should find MLE meaningless :-) ]

    Many thanks and warm regards,
    + Nicolas

Michael Betancourt

Jun 5, 2014, 7:07:15 AM6/5/14
to stan-...@googlegroups.com
>
> Many thanks for your reply and suggestions. I agree that one should avoid blindly optimizing if at all possible, but sometimes it is necessary... Your reparametrization idea indeed solves the problem: modeling the log-variance produces a well-behaved optimization.
>
> My original problem (not posted) arose from a hierarchical model with a similar time-series structure — I will see if your insights help me avoid the same variance collapse problem.
>
> A more philosophical question: your point [1] says "don't EVER optimize a density, because it doesn't mean anything mathematically". But isn't that where maximum likelihood comes from in the first place? [Not that I would find it unusual that a card-carrying Bayesian should find MLE meaningless :-) ]

This is necessarily going to get mathematical (and also super Bayesian), so apologies
in advance for the technical nature of the response.

Firstly, what are we really manipulating when we talk about probabilities? Probability
densities aren’t really all that fundamental: the fundamental object is something called
a probability measure that assigns probabilities to well-behaved collections of events.
Probability measures exist to be integrated; in other words, they provide a way of
computing expectations like means, variances, and quantiles. Concepts like "most
probable event" are not well defined.

Now probability measures are abstract objects that live in abstract spaces, but with a
choice of parameterization they can be mapped into the real numbers, and that mapping
provides the probability densities with which most people are familiar. These densities,
however, are only a convenience meant to ease the computation of expectations with
respect to the underlying measure. Other properties like optima have no real meaning [1].

One reason optima are ill-defined is that they depend on the arbitrary parameterization
used in defining the densities. When you report a MAP estimator you are adding additional
(often implicit) assumptions! Note that expectations computed with the densities, however,
do not depend on the parameterization as the integral of a density is an invariant object.
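
A one-line illustration of this parameterization dependence (a standard textbook fact, not specific to the model in this thread): if x has a lognormal distribution with parameters mu and sigma, the density of x peaks at exp(mu - sigma^2), while the density of log x peaks at mu, so the "mode" names a different point depending on which coordinates you work in; yet an expectation such as E[x] = exp(mu + sigma^2 / 2) is the same number whichever parameterization you integrate in.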

What does this imply for statistics? That depends on how you interpret uncertainty.

From a Bayesian perspective everything is a random variable, and we can apply Bayes’
Theorem to construct a posterior measure. In practice we define a parameterization of
the model space which gives us a posterior density with which it is easier to compute.
But we have to heed the warning above and not try to do something foolish like
optimize in complicated models.

The Frequentist perspective is completely different. Here only the data is random
and we don’t have a single posterior measure over the model space but rather a family
of measures defined over the data space. This likelihood function defines a measure
over the data for every choice of model (value of the parameters) and one builds
expectations over the data that attempt to estimate the underlying model. Taking
expectations over the data space gives functions of the parameters, not densities, such as
E[x](theta), which can readily be optimized to give Frequentist estimators, estimators whose
utility strongly depends on how much you believe in the Frequentist perspective.

In simple cases [1] Bayesian expectations and Frequentist estimators like the MLE
appear to be similar, and this lulls people into making claims like "Bayes is just
Frequentist estimators with regularization”. But this is a very naive perspective
and doesn't generalize to the complex systems common in modern applied statistics!


[1] There can be coincidences, however. For some measures in some parameterizations,
such as the standard Normal distribution, the maximum of the density coincides with
the mean and the maximum appears to have some computational use. Indeed if you can
show that an optimum in a given parameterization approximates a mean then it can
provide a useful approximation. In practice, however, this is limited to simple models
like the exponential family in few dimensions, which are rare in real applications.

Avraham Adler

Jun 5, 2014, 10:49:50 PM6/5/14
to stan-...@googlegroups.com
I just wanted to thank you, Michael, for these very clear explanations - both the practical and the fundamental.

Avi

