>
> Many thanks for your reply and suggestions. I agree that one should avoid blindly optimizing if at all possible, but sometimes it is necessary... Your reparametrization idea indeed solves the problem: modeling the log-variance produces a well-behaved optimization.
>
> My original problem (not posted) arose from a hierarchical model with a similar time-series structure — I will see if your insights help me avoid the same variance collapse problem.
>
> A more philosophical question: your point [1] says “don’t EVER optimize a density, because it doesn’t mean anything mathematically”. But isn’t that where maximum likelihood comes from in the first place? [Not that I would find it unusual that a card-carrying Bayesian should find MLE meaningless :-) ]
This is necessarily going to get mathematical (and also super Bayesian), so apologies
in advance for the technical nature of the response.
Firstly, what are we really manipulating when we talk about probabilities? Probability
densities aren’t really all that fundamental: the fundamental object is something called
a probability measure that assigns probabilities to well-behaved collections of events.
Probability measures exist to be integrated; in other words, they provide a way of
computing expectations like means, variances, and quantiles. Concepts like the “most
probable event” are not well defined.
Now probability measures are abstract objects that live in abstract spaces, but with a
choice of parameterization they can be mapped into the real numbers, and that mapping
provides the probability densities with which most people are familiar. These densities,
however, are only a convenience meant to ease the computation of expectations with
respect to the underlying measure. Other properties like optima have no real meaning [1].
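In symbols, for a generic parameterization \theta with induced density \pi(\theta), the
measure assigns probabilities and expectations through integrals,

    P[A] = \int_A \pi(\theta) \, d\theta,
    E[f] = \int f(\theta) \, \pi(\theta) \, d\theta,

and the density \pi(\theta) only ever shows up as an integrand.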
One reason optima are ill-defined is that they depend on the arbitrary parameterization
used in defining the densities. When you report a MAP estimator you are introducing
additional (often implicit) assumptions! Expectations computed with the densities, on the
other hand, do not depend on the parameterization, because the integral of a density is an
invariant object.
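The usual one-dimensional change-of-variables formula makes this explicit: under a
reparameterization \phi = g(\theta) the density picks up a Jacobian factor,

    \pi_\phi(\phi) = \pi_\theta( g^{-1}(\phi) ) \, | d g^{-1} / d\phi |,

so the location of its maximum gets dragged around by the Jacobian, while any expectation
is left unchanged,

    \int f(\theta) \, \pi_\theta(\theta) \, d\theta = \int f( g^{-1}(\phi) ) \, \pi_\phi(\phi) \, d\phi.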
What does this imply for statistics? That depends on how you interpret uncertainty.
From a Bayesian perspective everything is a random variable, and we can apply Bayes’
Theorem to construct a posterior measure. In practice we define a parameterization of
the model space, which gives us a posterior density that is easier to compute with.
But we need to heed the warning above and not try to do something foolish like
optimizing in complicated models.
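As a minimal numerical sketch of why (using made-up data and a generic normal hierarchical
model, not the model from the quoted message): the joint density grows without bound as the
group-level variance collapses, so there is simply no MAP for an optimizer to find.

    import numpy as np
    from scipy import stats

    # Toy hierarchical model: y_i ~ Normal(theta_i, sigma), theta_i ~ Normal(mu, tau),
    # with a flat prior on mu and a Half-Normal(5) prior on tau.
    rng = np.random.default_rng(0)
    y = rng.normal(loc=1.0, scale=1.0, size=8)
    sigma = 1.0

    def log_joint(theta, mu, tau):
        lp = stats.halfnorm.logpdf(tau, scale=5.0)                # prior on tau
        lp += stats.norm.logpdf(theta, loc=mu, scale=tau).sum()   # group means
        lp += stats.norm.logpdf(y, loc=theta, scale=sigma).sum()  # likelihood
        return lp

    # Follow a path where the group means collapse onto mu and tau -> 0:
    mu = y.mean()
    for tau in [1.0, 0.1, 0.01, 0.001]:
        print(tau, log_joint(np.full_like(y, mu), mu, tau))
    # The log density increases without bound along this path, so an optimizer
    # just chases tau toward zero instead of returning anything meaningful.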
The Frequentist perspective is completely different. Here only the data is random
and we don’t have a single posterior measure over the model space but rather a family
of measures defined over the data space. This family, the likelihood, defines a measure
over the data for every choice of model (i.e., value of the parameters), and one builds
expectations over the data that attempt to estimate the underlying model. Taking
expectations over the data space gives functions of the parameters, not densities, such as
E[x](\theta), which can readily be optimized to give Frequentist estimators, whose
utility strongly depends on how much you believe in the Frequentist perspective.
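To sketch what that can look like (a generic moment-matching example of my own, not
something from the original post): for data y_1, ..., y_N drawn from p(y | \theta), an
expectation over the data space such as

    m(\theta) = E_{y ~ p(y | \theta)}[ y ]

is an ordinary function of \theta, and an estimator can be defined by optimizing the match
to the observed data, say choosing \hat{\theta} to minimize ( m(\theta) - \bar{y} )^2. For
a Normal(\theta, \sigma) model this yields \hat{\theta} = \bar{y}, which happens to
coincide with the MLE.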
In simple cases [1] Bayesian expectations and Frequentist estimators like the MLE
appear to be similar, and this lulls people into making claims like “Bayes is just
Frequentist estimators with regularization”. But this is a very naive perspective
and doesn't generalize to the complex systems common in modern applied statistics!
[1] There can be coincidences, however. For some measures in some parameterizations,
such as the standard Normal distribution, the maximum of the density coincides with
the mean, and the maximum appears to have some computational use. Indeed, if you can
show that an optimum in a given parameterization approximates a mean, then it can
provide a useful approximation. In practice, however, this is limited to simple models
like the exponential family in few dimensions, which are rare in real applications.
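A quick numerical check of that footnote (a toy example of my own, not from the thread):
take \phi distributed as a standard Normal, where the maximum of the density happens to sit
at the mean, and reparameterize to \theta = exp(\phi). The expectation of \theta comes out
the same in either parameterization, but the “most probable value” does not.

    import numpy as np
    from scipy import stats, optimize, integrate

    # theta = exp(phi) with phi ~ Normal(0, 1), i.e. theta ~ LogNormal(0, 1).
    post = stats.lognorm(s=1.0, scale=np.exp(0.0))

    # E[theta] computed in theta, and in phi = log(theta) where the density
    # picks up a Jacobian factor exp(phi): both give exp(0.5) ~ 1.6487.
    mean_in_theta = integrate.quad(lambda t: t * post.pdf(t), 0.0, np.inf)[0]
    mean_in_phi = integrate.quad(
        lambda p: np.exp(p) * post.pdf(np.exp(p)) * np.exp(p), -20.0, 20.0)[0]
    print(mean_in_theta, mean_in_phi)

    # The maxima do not correspond: the mode in theta is exp(-1) ~ 0.37, while
    # the mode in phi is 0, which maps back to exp(0) = 1.
    mode_in_theta = optimize.minimize_scalar(
        lambda t: -post.pdf(t), bounds=(1e-6, 10.0), method="bounded").x
    mode_in_phi = optimize.minimize_scalar(
        lambda p: -post.pdf(np.exp(p)) * np.exp(p),
        bounds=(-10.0, 10.0), method="bounded").x
    print(mode_in_theta, np.exp(mode_in_phi))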