definition of identifiability?


Bob Carpenter

Feb 5, 2014, 6:28:56 PM
to stan...@googlegroups.com
It doesn't look to me like the usual notion of identifiability is right,
at least not in a Bayesian setting. Both Greene's book, where I just read
this, and the Wikipedia define identifiability as

theta != theta' if and only if p(theta) != p(theta')

That is, the probability function is one-to-one.

But that can't be right. If we have parameters mu and sigma for a normal
model with some data y, then let mu* and sigma* be the MLE.

Then if mu != mu*, there must be some sigma != sigma* such that

p(mu,sigma*) = p(mu*,sigma)

We know this because of continuity and that p(mu,sigma) is bounded
between 0 and p(mu*,sigma*).
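For concreteness, here's the model in question as a minimal Stan
sketch (flat priors implied; nothing Stan-specific to the argument):

    data {
      int<lower=0> N;
      vector[N] y;
    }
    parameters {
      real mu;
      real<lower=0> sigma;
    }
    model {
      y ~ normal(mu, sigma);   // likelihood p(y | mu, sigma)
    }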

So how do I really define identifiability?

Also, is it kosher to use the term "weakly identifiable" for one
of those normal(0,gazillion) priors?

- Bob

Michael Betancourt

Feb 5, 2014, 6:46:49 PM
to stan...@googlegroups.com
I think you’re misreading the definitions. Remember, classical identifiability
is in the regime of classical statistics, which means likelihoods!

Identifiability on Wikipedia is defined such that if two parameters are distinct
then the likelihood _distributions_ they label (i.e. before we’ve measured
the data) are distinct. This means that, provided you measure
enough data, you will be able to discriminate between the two
likelihood distributions (and hence the parameters that label them).

This is not the same as the likelihood _values_ being equal for a given
data set. Remember, “ensembles of possible data sets” not
“the actual data set that I fricken measured”.

A more intuitive (and I believe equivalent) definition is that in the limit
of infinite data the MLE will converge to a single value equal to the
true parameters. Multimodality? Unidentified. The y ~ N(mu1 + mu2, sigma)
example? Unidentified both with and without a proper prior.

The big difference in Bayesian modeling is that we’re conditioning
on only the measured data so the infinite data limit isn’t as meaningful.
Uncertainty is fine so long as it’s finite. So I tend to think of a parameter
as being identified if its marginal variance is finite, even if it doesn’t
vanish as the data expands towards infinity.

From this perspective we can think of identifiability as a spectrum,
from weak (large marginal variance as the data grows) to strong
(small, possibly vanishing marginal variance as the data grows).
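A minimal Stan sketch of that unidentified example (mu1 and mu2 are
just my names for the two location parameters):

    data {
      int<lower=0> N;
      vector[N] y;
    }
    parameters {
      real mu1;
      real mu2;              // only the sum mu1 + mu2 enters the likelihood
      real<lower=0> sigma;
    }
    model {
      // The likelihood is constant along lines mu1 + mu2 = c, so with flat
      // priors the posterior is improper; proper priors on mu1 and mu2 make
      // it proper, but the data still only inform the sum.
      y ~ normal(mu1 + mu2, sigma);
    }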

Bob Carpenter

Feb 5, 2014, 6:53:16 PM
to stan...@googlegroups.com


On 2/6/14, 12:46 AM, Michael Betancourt wrote:
> I think you're misreading the definitions. Remember, classical identifiability
> is in the regime of classical statistics, which means likelihoods!
>
> Identifiability on Wikipedia is defined such that if two parameters are distinct
> then the likelihood _distributions_ they label (i.e. before we've measured
> the data) are distinct. This means that, provided you measure
> enough data, you will be able to discriminate between the two
> likelihood distributions (and hence the parameters that label them).

I see. There's a reason I didn't get into Caltech :-) [It was the only
place out of state I applied.]

So that means if we have parameters theta, and define f_theta(y) = p(y|theta),
then theta != theta' iff f_theta != f_theta'. Got it.

I thought we'd have to get all measure-theoretic.

> This is not the same as the likelihood _values_ being equal for a given
> data set. Remember, "ensembles of possible data sets" not
> "the actual data set that I fricken measured".
>
> A more intuitive (and I believe equivalent) definition is that in the limit
> of infinite data the MLE will converge to a single value equal to the
> true parameters. Multimodality? Unidentified. The y ~ N(mu1 + mu2, sigma)
> example? Unidentified both with and without a proper prior.

That was how I was used to thinking about it. But I didn't want to
start talking about MLEs.

> The big difference in Bayesian modeling is that we're conditioning
> on only the measured data so the infinite data limit isn't as meaningful.
> Uncertainty is fine so long as it's finite. So I tend to think of a parameter
> as being identified if its marginal variance is finite, even if it doesn't
> vanish as the data expands towards infinity.
>
> From this perspective we can think of identifiability as a spectrum,
> from weak (large marginal variance as the data grows) to strong
> (small, possibly vanishing marginal variance as the data grows).

Cool!

And thanks.

I feel like I should be paying tuition on this list!

- Bob

Michael Betancourt

Feb 6, 2014, 4:12:57 AM
to stan...@googlegroups.com
> So that means if we have parameters theta, and define f_theta(y) = p(y|theta),
> then theta != theta' iff f_theta != f_theta'. Got it.
>
> I thought we'd have to get all measure-theoretic.

It actually is a measure-theoretic approach, just obscured by the use of densities.
I can make it sound more complicated if you’d like. ;-)

>> A more intuitive (and I believe equivalent) definition is that in the limit
>> of infinite data the MLE will converge to a single value equal to the
>> true parameters. Multimodality? Unidentified. The y ~ N(mu1 + mu2, sigma)
>> example? Unidentified both with and without a proper prior.
>
> That was how I was used to thinking about it. But I didn't want to
> start talking about MLEs.

There should be some constraints on the kinds of estimators you can use
beyond MLE, but that goes beyond my familiarity.

Andrew Gelman

Feb 6, 2014, 10:48:48 AM
to stan...@googlegroups.com
Identification is actually a tricky concept and is not so clearly defined. In the broadest sense, a Bayesian model is identified if the posterior distribution is proper. Then one can do Bayesian inference and that's that. No need to require a finite variance or even a finite mean; all that's needed is a finite integral of the probability distribution.

That said, there are some reasons why a stronger definition can be useful:

1. Weak identification. Suppose that, with reasonable data, you'd have a posterior with a sd of 1 (or that order of magnitude). But you have sparse data or collinearity or whatever, and so you have some dimension in your posterior that's really flat, some "ridge" with a sd of 1000. Then it makes sense to say that this parameter or linear combination of parameters is only weakly identified. Or one can say that it's identified from the prior but not the likelihood.

So, yes, Bob, as chief rabbi I say it is kosher to refer to weak identifiability. If we wanted to make the concept more formal, we'd stipulate that the model is expressed in terms of some hyperparameter A which is set to a large value, and that weak identifiability corresponds to nonidentifiability when A -> infinity.
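As a sketch of that formalization in a toy regression (the setup and
names here are hypothetical; A comes in as data):

    data {
      int<lower=0> N;
      vector[N] x;           // sparse or collinear predictor
      vector[N] y;
      real<lower=0> A;       // prior scale, set to a large value
    }
    parameters {
      real beta;
      real<lower=0> sigma;
    }
    model {
      // For any finite A the posterior for beta is proper (identified by
      // the prior if not the likelihood); it degenerates as A -> infinity.
      beta ~ normal(0, A);
      y ~ normal(beta * x, sigma);
    }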

Even there, though, some tricky cases arise. For example, suppose your model includes a parameter p that is defined on [0,1] and is given a flat prior, and suppose the data don't tell us anything about p, so that our posterior is also U(0,1). That sounds nonidentified to me, but it does have a finite integral.

2. Aliasing. Consider an item response model or mixture model where the direction or labeling is unspecified. Then you can have 2 or 4 or K! different reflections of the posterior. Even if all priors are proper, so the full posterior is proper, it contains all these copies so this labeling is not identified in any real sense.
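A sketch of the mixture case (two-component normal mixture; the
particular priors below are just for illustration):

    data {
      int<lower=0> N;
      vector[N] y;
    }
    parameters {
      vector[2] mu;                   // unordered: the labels can swap
      real<lower=0> sigma;
      real<lower=0,upper=1> lambda;   // mixing proportion
    }
    model {
      mu ~ normal(0, 10);             // proper priors, proper posterior,
                                      // but the labeling is still aliased
      for (n in 1:N)
        increment_log_prob(log_sum_exp(
          log(lambda) + normal_log(y[n], mu[1], sigma),
          log1m(lambda) + normal_log(y[n], mu[2], sigma)));
    }

Declaring mu as ordered[2] instead restricts the posterior to a single
labeling.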

Here, and in general, identification depends not just on the model but also on the data. So, strictly speaking, one should not talk about an "identifiable model" but rather an "identifiable fitted model" or "identifiable parameters" within a fitted model.

Ben Goodrich

Feb 7, 2014, 7:52:19 PM
to stan...@googlegroups.com, gel...@stat.columbia.edu
On Thursday, February 6, 2014 10:48:48 AM UTC-5, Andrew Gelman wrote:
Identification is actually a tricky concept and is not so clearly defined.  

I agree that a lot of people use the word identification without defining what they mean, but there is no shortage of definitions out there. However, I'm not sure that identification is that helpful a concept for the practical problems we are trying to solve here when providing recommendations on how users should write .stan files.

I think many if not most people that think about identification rigorously have in mind a concept that is pre-statistical. So, for them it is going to sound weird to associate "identification" with problems that arise with a particular sample or a particular computational approach. In economics, the idea of identification of a parameter goes back at least to the Cowles Commission guys, such as in the first couple of papers here

http://scholar.google.com/scholar?hl=en&q=author%3ATC-Koopmans+identification&btnG=&as_sdt=1%2C33

In causal inference, the idea of identification of an average causal effect is a property of a DAG in Pearl's stuff

http://bayes.cs.ucla.edu/jp_home.html
 
In the broadest sense, a Bayesian model is identified if the posterior distribution is proper.  Then one can do Bayesian inference and that's that.  No need to require a finite variance or even a finite mean, all that's needed is a finite integral of the probability distribution.

I don't disagree, but what good is the word "identified"? If the posterior distribution is improper, then there is no Bayesian inference.

That said, there are some reasons why a stronger definition can be useful:

1.  Weak identification.  Suppose that, with reasonable data, you'd have a posterior with a sd of 1 (or that order of magnitude).  But you have sparse data or collinearity or whatever, and so you have some dimension in your posterior that's really flat, some "ridge" with a sd of 1000.  Then it makes sense to say that this parameter or linear combination of parameters is only weakly identified.  Or one can say that it's identified from the prior but not the likelihood.

Here again we are running into the problem of other people associating the phrase "weak identification" with a different thing (usually instrumental variable models where the instruments are weak predictors of the variable they are instrumenting for). This paper

http://cowles.econ.yale.edu/~dwka/pub/p1370.pdf

basically is interested in situations where some parameter is not identified iff another parameter is zero. And then they drift the population toward that zero.
 
So, yes, Bob, as chief rabbi I say it is kosher to refer to weak identifiability.  If we wanted to make the concept more formal, we'd stipulate that the model is expressed in terms of some hyperparameter A which is set to a large value, and that weak identifiability corresponds to nonidentifiability when A -> infinity.

If we were going to go this route, then it is good to be specific with the definitions but I still don't know if you want to call this situation "weak identifiability".

Even there, though, some tricky cases arise.  For example, suppose your model includes a parameter p that is defined on [0,1] and is given a flat prior, and suppose the data don't tell us anything about p, so that our posterior is also U(0,1).  That sounds nonidentified to me, but it does have a finite integral.

Do you mean that a particular sample doesn't tell us anything about p or that data are incapable of telling us anything about p? In addition, I think it is helpful to distinguish between situations where
  1. There is a unique maximum likelihood estimator (perhaps with probability 1)
  2. There is not a unique maximum likelihood estimator but the likelihood is not flat everywhere with respect to a parameter proposal
  3. The likelihood is flat everywhere with respect to a parameter proposal
What bothers me about some notion of "computational identifiability" is that a Stan user may be in situation 1 but through some combination of weird priors, bad starting values, too few iterations, finite-precision arithmetic, particular choice of metric, maladaptation, and/or bad luck can't get one or more chains to converge to the stationary distribution of the parameters. That's a practical problem that Stan users face, but I don't think many people would consider it to be an identification problem.

Maybe something that is somewhat unique to Stan is the idea of identified in the constrained parameter space but not identified in the unconstrained parameter space like we have with uniform sampling on the unit sphere.
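For the unit-sphere case, something like this sketch (assuming I have
the constrained/unconstrained bookkeeping right):

    parameters {
      unit_vector[3] u;   // a well-defined point on the sphere
    }
    model {
      // Empty model: uniform over the sphere. u is identified in the
      // constrained space, but under the hood Stan samples an unconstrained
      // y in R^3 with u = y / |y|, and the length of y is not pinned down
      // by the target density.
    }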
 
2.  Aliasing.  Consider an item response model or mixture model where the direction or labeling is unspecified.  Then you can have 2 or 4 or K! different reflections of the posterior.  Even if all priors are proper, so the full posterior is proper, it contains all these copies so this labeling is not identified in any real sense.

This one doesn't bother me as much because you can at least conceptually restrict the parameter space to preclude all but one of the reflections, although whether it is easy to do that in Stan is another matter.
 
Here, and in general, identification depends not just on the model but also on the data.  So, strictly speaking, one should not talk about an "identifiable model" but rather an 'identifiable fitted model" or "identifiable parameters" within a fitted model.

Certainly, whether you have computational problems depends on the data, among other things. But to say that identification depends on the data goes against the conventional usage where identification is pre-statistical so we need to think about whether it would be more effective to try to redefine identification or to use other phrases to describe the problems we are trying to overcome.

Ben
 

Bob Carpenter

Feb 7, 2014, 8:36:13 PM
to stan...@googlegroups.com
I just took a whack at writing up a chapter on identifiability
for the manual, but realize I may have gotten in over my head
philosophically. I've attached the draft chapter (for some
reason, printing to pdf in Mac OS preview changed the page size).

I bcc-ed Richard McElreath, who originally inspired this endeavour
with his question to stan-users (Richard, you can reply to me at
ca...@alias-i.com or respond on the users list).

The issue comes up fairly often on the users list, so it seems worth
saying something. It brings up lots of good issues with how
sampling works.

Feel free to grab the branch

feature/issue-485-next-manual

and update or send me comments (as long as they're not too complex
for me to understand) and I'll update.

Now we just need to get Andrew's friend to turn it all into
a blog post and then a paper :-)

- Bob
> 1. There is a unique maximum likelihood estimator (perhaps with probability 1)
> 2. There is not a unique maximum likelihood estimator but the likelihood is not flat everywhere with respect to a
> parameter proposal
> 3. The likelihood is flat everywhere with respect to a parameter proposal
[Attachment: identifiability-chap-stan-manual.pdf]

Andrew Gelman

Feb 8, 2014, 5:06:46 PM
to stan...@googlegroups.com
Hi all. I'll probably just blog my remarks on identifiability along with Ben's updates.

Bob: regarding your chapter: I think it has some good things in it but I also think it's too theoretical for the Stan manual and also has excursions into maximum likelihood that will be more confusing than helpful to readers. I'd just get rid of section 19.1 and keep the rest.

There's also a larger question of how much of this sort of thing belongs in the manual at all. It seems to be a better fit for our future book! I guess it's ok to park it in the manual for now, but once we do the book it might make sense to take this material out of the manual.

A


Bob Carpenter

Feb 8, 2014, 6:36:50 PM
to stan...@googlegroups.com


On 2/8/14, 11:06 PM, Andrew Gelman wrote:
> Hi all. I'll probably just blog my remarks on identifiability along with Ben's updates.

I figured it'd be too tempting to resist.

> Bob: regarding your chapter: I think it has some good things in it but I also think it's too theoretical for the Stan manual and also has excursions into maximum likelihood that will be more confusing than helpful to readers. I'd just get rid of section 19.1 and keep the rest.

The Stan manual bounces around a lot in its level of theory (which
is not really a good thing, because it's not layered in any natural
way). I think that's partly because I've been using it to
clarify my own understanding of these issues.

Stan already provides MLE, so it's not really stepping outside of
what Stan does! I could make that connection clearer.

I find without the MLE definitions, the whole concept is too vague.

How strongly do you feel about getting rid of 19.1? If you feel
strongly about it, I'll drop it in favor of a vague allusion to
non-Bayesian concepts of identifiability.

I do feel that when you talked about a U(0,1) not being identified,
it was this MLE notion speaking.

> There's also a larger question of how much of this sort of thing belongs in
> the manual at all. It seems to be a better fit for our future book! I guess it's
> ok to park it in the manual for now, but once we do the book it might make sense
> to take this material out of the manual.

I also write in the manual rather than doing hard work, like debugging.
You can think of it like a blog that just keeps growing :-)

Seriously, though, the identifiability issue comes up again and again, and
we need to give our users at least some concrete advice. I also think
it helps to give them some feeling for what's going on with HMC, which is
unfamiliar to a lot of them who are well versed in Gibbs.

I think of the manual as coming in two parts. The first part is the
user's programming guide and the second part is the actual language spec.
I put this in the first half, because it's not part of the language or
implementation itself.

Then there's all the stuff that doesn't really fit anywhere like the
intro to MCMC.

- Bob

Michael Betancourt

Feb 8, 2014, 7:03:39 PM
to stan...@googlegroups.com
>> I think it has some good things in it but I also think it's too theoretical for the Stan manual and also has excursions into maximum likelihood that will be more confusing than helpful to readers.

—cough cough— then why spend a bunch of time implementing MLE inference algorithms? —cough cough—
I’m not going away anytime soon. ;-)

> The Stan manual bounces around a lot in its level of theory (which
> is not really a good thing, because it's not layered in any natural
> way). I think that's partly because I've been using it partly to
> clarify my own understanding of these issues.
>
> Stan already provides MLE, so it's not really stepping outside of
> what Stan does! I could make that connection clearer.
>
> I find without the MLE definitions, the whole concept is too vague.
>
> How strongly do you feel about getting rid of 19.1? If you feel
> strongly about it, I'll drop it in favor of a vague allusion to
> non-Bayesian concepts of identifiability.

If we call it something other than identifiability then I think we can get away with imposing a definition
and discussing just the Bayesian perspective. Concentration, maybe?

But if we use the term identifiability then we have to discuss the MLE stuff or it will cause conflicts
for anyone with any statistical training.

> I think of the manual as coming in two parts. The first part is the
> user's programming guide and the second part is the actual language spec.
> I put this in the first half, because it's not part of the language or
> implementation itself.

And that first half is basically what will become the book, right?

Marco Inacio

Feb 8, 2014, 7:38:08 PM
to stan...@googlegroups.com
This part seems to have a logical problem:

"Model identifiability is a necessary (but not sufficient) condition for the existence of maximum likelihood estimates (MLE). Without identifiability, the maximum likelihood estimate might not be unique."

If it's a necessary condition, how is it that the MLE only *might* not be unique? (If so, it should never exist/be unique.)

Btw, nice chapter and nice topic, I'm really learning a lot.

Bob Carpenter

Feb 9, 2014, 7:13:44 AM
to stan...@googlegroups.com


On 2/9/14, 1:38 AM, Marco Inacio wrote:
> This part seems to have a logical problem:
>
> "Model identifiability is a necessary (but not sufficient) condition for the existence of maximum likelihood estimates
> (MLE). Without identifiability, the maximum likelihood estimate might not be unique."
>
> If it's a necessary condition, how is it that the MLE only *might* not be unique? (If so, it should never exist/be unique.)

Good point. I'll rephrase. What I meant to say was:

Without identifiability there may be more than one value of the parameters
that maximizes the likelihood function; we provide an example in the next
section.

I'll add

Even with identifiability, the likelihood function may grow without bound
as parameter values approach a point and therefore not have a point in the
support at which the likelihood function is maximized; we provide examples
later in the chapter.

I forgot to discuss separability for logistic regression, which is a
better example than the beta posterior.
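A sketch of the separable case (hypothetical data in which y[n] == 1
exactly when x[n] > 0):

    data {
      int<lower=0> N;
      vector[N] x;
      int<lower=0,upper=1> y[N];   // separable by assumption
    }
    parameters {
      real beta;
    }
    model {
      // With a flat prior, the likelihood increases monotonically in beta,
      // so there is no finite MLE and the posterior is improper. A proper
      // prior, e.g. beta ~ normal(0, 5), restores propriety.
      y ~ bernoulli_logit(beta * x);
    }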

> Btw, nice chapter and nice topic, I'm really learning a lot.

Me, too :-)

- Bob

Andrew Gelman

Feb 9, 2014, 1:24:08 PM
to stan...@googlegroups.com
B
I definitely support the idea of putting identifiability into the manual (and maybe later it will migrate to the data analysis book).  And I don't mind the discussion of mle.  The part I'm not thrilled with is the introduction of new notation f_theta.  Also I think the use of "x" to indicate multiplication (on top of p.162) looks weird; I keep thinking it's some sort of Kronecker product or something.  But maybe "*" doesn't look right to you?
Anyway, maybe I'd be ok with 19.1 if you got rid of f_theta.  I'd prefer to just start with 19.2 but I'll trust your judgment on this.  On this particular sort of topic, you're closer to "user" status than I am!
A


On Feb 9, 2014, at 12:36 AM, Bob Carpenter wrote:


Stan already provides MLE, so it's not really stepping outside of
what Stan does!  I could make that connection clearer.

I find without the MLE definitions, the whole concept is too vague.

How strongly do you feel about getting rid of 19.1?  If you feel
strongly about it, I'll drop it in favor of a vague allusion to
non-Bayesian concepts of identifiability.

I do feel that when you talked about a U(0,1) not being identified,
it was this MLE notion speaking.

Ben Goodrich

Feb 9, 2014, 5:35:20 PM
to stan...@googlegroups.com, gel...@stat.columbia.edu
I think you should start with 19.4 and avoid the word identifiability. The basic question you are trying to address is "What are the situations where the posterior is proper, but Stan nevertheless has trouble sampling from that posterior?" There is not much to say about improper posteriors, except that you basically can't do Bayesian inference. Although Stan can optimize a log-likelihood function, everybody doing so should know that you can't do maximum likelihood inference without a unique maximum. Then, there are a few things that are problematic such as long ridges, multiple modes (even if they are not exactly the same height), label switches and reflections, densities that approach infinity at some point(s), densities that are not differentiable, discontinuities, integerizing a continuous variable, good in the constrained space vs. bad in the unconstrained space, etc. And then we can suggest what to do about each of these specific things without trying to squeeze them under the umbrella of identifiability.
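To pick just one item off that list, a sketch of a density that approaches
infinity at some points:

    parameters {
      real<lower=0,upper=1> theta;
    }
    model {
      // The beta(0.5, 0.5) density diverges at theta = 0 and theta = 1 even
      // though it integrates to 1, so the posterior is proper but the density
      // is unbounded at the boundaries of the constrained space.
      theta ~ beta(0.5, 0.5);
    }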

Ben

Andrew Gelman

Feb 9, 2014, 5:38:38 PM
to stan...@googlegroups.com
I agree.  Although it could be useful to mention the term "identifiability" just to say that this is a related concept.

Bob Carpenter

Feb 9, 2014, 7:40:21 PM
to stan...@googlegroups.com
I think we should step back and ask ourselves who we think
our target audience is and what we're trying to help them do.

As is, the manual's trying to serve (at least) the following
four different purposes.

I'd be happy to take responsibility for the first two of these
and leave the third and fourth to others. I think a "Stan Book"
would most naturally serve the third and maybe the fourth purpose.


Stan Reference
--------------

Part IV Modeling Language Reference
Part V Built-in Functions
Part VI Discrete Distributions
Part VII Continuous Distributions

Chapter 51 Transformations of Variables

Appendix D. Stan Program Style Guide


CmdStan Reference
-----------------

Part II Commands and Data Formats

Appendix A Licensing

Appendix B Installation and Compatibility


Intro to Stan Modeling
----------------------

Chapter 2 Getting Started

Part III Programming Techniques

Appendix C. Stan for Users of BUGS


Intro to (Computational) (Bayesian) Stats
---------------------------------------

Chapter 48. (Penalized) MLE and Posterior Modes (I was just writing this)

Chapter 49. Bayesian Data Analysis

Chapter 50. MCMC

Chapter 19. Identifiability


- Bob

Bob Carpenter

Feb 9, 2014, 8:28:58 PM
to stan...@googlegroups.com

> On Feb 9, 2014, at 11:35 PM, Ben Goodrich wrote:
>
>> ... The basic question you are trying to address is
>> "What are the situations where the posterior is proper, but Stan nevertheless has trouble sampling from that
>> posterior?" There is not much to say about improper posteriors, except that you basically can't do Bayesian inference.
>> Although Stan can optimize a log-likelihood function, everybody doing so should know that you can't do maximum
>> likelihood inference without a unique maximum. Then, there are a few things that are problematic such as long ridges,
>> multiple modes (even if they are not exactly the same height), label switches and reflections, densities that approach
>> infinity at some point(s), densities that are not differentiable, discontinuities, integerizing a continuous variable,
>> good in the constrained space vs. bad in the unconstrained space, etc. And then we can suggest what to do about each of
>> these specific things without trying to squeeze them under the umbrella of identifiability.

Agreed. How about I just drop the concept of identifiability for now
and retitle the chapter "Problematic Posteriors"?

I can use the informal notion of identifiability from BDA/ARM and
just avoid calling it that. I can also redact all mentions of the
term "identifiability" from the rest of the manual, which should make
everyone happy.

I do want to include the problematic posterior arising from non-identifiable
models as an example, though, for reasons alluded to in the previous message.

- Bob

Bob Carpenter

Feb 9, 2014, 8:33:43 PM
to stan...@googlegroups.com
On 2/9/14, 11:38 PM, Andrew Gelman wrote:
> I agree. Although it could be useful to mention the term "identifiability" just to say that this is a related concept.

Andrew followed his own advice in BDA, but it never gets defined,
which is why I think I was so confused about what it meant: the usage
in BDA is not the usual notion from non-Bayesian stats.

In ARM, Gelman and Hill (p. 220) are more precise, saying

Identifiability refers to whether the data contain sufficient information
for unique estimation of a given parameter or set of parameters in a particular
model.

The cases mentioned in BDA and ARM are (a) collinearity in regressions, (b)
separability in logistic regressions, (c) additive and multiplicative nonidentifiability
in IRT-like models, and (d) mixture model "label switching".

This is obviously not the usual definition of identifiability for MLE
(as defined in Greene and on the Wikipedia) and presumably why Andrew
didn't like my first section --- it doesn't match his usage of the term.

So I propose I just drop this notion of "identifiability" altogether and
concentrate on "problematic posteriors".

More below on what I think that should entail.

> On Feb 9, 2014, at 11:35 PM, Ben Goodrich wrote:

...

>> There is not much to say about improper posteriors, except that you basically can't do Bayesian inference.

I disagree. The problem stems from advice from Andrew and others
in the context of Gibbs. Advice our users are trying to apply to Stan,
with no luck, leading them to conclude Stan is slow.

Check out ARM's section 19.4, "Redundant parameters and intentionally nonidentifiable
models". Here Andrew and Jennifer suggest for computational efficiency to use
parameterizations that are not identifiable and then post-process to something that is.

This works fine in Gibbs for reasons I tried to explain in the section Ben wants
me to cut, but won't work in Stan for reasons I try to explain in the same section.

>> Although Stan can optimize a log-likelihood function, everybody doing so should
>> know that you can't do maximum
>> likelihood inference without a unique maximum.

"Should know" and "do know" can be miles apart.

But I think the bigger point is that people try to do inference all the
time when parameters aren't identified in the sense of Gelman and Hill.
Sometimes the models aren't identified in the sense of Greene and the
Wikipedia and sometimes they are.

For example, in the two-location parameter example I was going to include,
you get perfectly reasonable predictions for new data with any of the
maximal (not maximum) likelihood estimates --- just as good
as with the single parameter model. You also get reasonable inferences in
Gibbs and even in Stan. It's just that your parameters aren't
identified in the Gelman/Hill sense.

Same thing with collinear predictors under L1 priors --- the
model plus data don't give a unique MLE, but the inferences are just fine in
most cases because most software takes finitely many finite steps.
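A sketch of that collinear case (the predictor is duplicated on purpose;
the L1 priors are double_exponential in Stan):

    data {
      int<lower=0> N;
      vector[N] x;
      vector[N] y;
    }
    parameters {
      real beta1;
      real beta2;                         // x enters twice: perfectly collinear
      real<lower=0> sigma;
    }
    model {
      beta1 ~ double_exponential(0, 1);   // L1 (Laplace) priors
      beta2 ~ double_exponential(0, 1);
      // The likelihood depends only on beta1 + beta2, so the MLE is not
      // unique, but the posterior is proper thanks to the proper priors.
      y ~ normal((beta1 + beta2) * x, sigma);
    }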

- Bob

Ben Goodrich

Feb 10, 2014, 11:36:04 AM
to stan...@googlegroups.com
On Sunday, February 9, 2014 8:33:43 PM UTC-5, Bob Carpenter wrote:
So I propose I just drop this notion of "identifiability" altogether and
concentrate on "problematic posteriors".

I think that is a good plan, especially if we distinguish which posteriors are problematic for which algorithms. In your other email, you sketched out this proposal further for the Stan manual. I think it would be good to start that chapter with a subsection on Problematic Posteriors for Gibbs Samplers, which (aside from the problem of possibly not having a customized algorithm to draw from some conditional distribution) are basically situations where the parameters are highly correlated. And the possible solutions are batching, reparameterizing the model so that the parameters are less correlated and possibly post-processing the output to dereparameterize, maybe some other things, or don't use a Gibbs sampler. Then there could be a subsection on Problematic Posteriors for HMC that talks about the things I mentioned yesterday (in addition to non-zero posterior mass on the boundaries, which I forgot). Later there could be a section on Problematic Posteriors for RMHMC, which hopefully only consists of high-dimensional posteriors.
 
> On Feb 9, 2014, at 11:35 PM, Ben Goodrich wrote:
>> There is not much to say about improper posteriors, except that you basically can't do Bayesian inference.

I disagree.  The problem stems from advice from Andrew and others
in the context of Gibbs.  Advice our users are trying to apply to Stan,
with no luck, leading them to conclude Stan is slow.

I see what you mean, but I disagree that you disagree. With post-processing, you have post-processed draws from a proper posterior; it is just a more complicated algorithm to draw from that proper posterior.

Ben

Bob Carpenter

unread,
Feb 10, 2014, 12:15:33 PM2/10/14
to stan...@googlegroups.com


On 2/10/14, 5:36 PM, Ben Goodrich wrote:
> On Sunday, February 9, 2014 8:33:43 PM UTC-5, Bob Carpenter wrote:
>
> So I propose I just drop this notion of "identifiability" altogether and
> concentrate on "problematic posteriors".
>
>
> I think that is a good plan, especially if we distinguish which posteriors are problematic for which algorithms. In your
> other email, you sketched out this proposal further for the Stan manual. I think it would be good to start that chapter
> with a subsection on Problematic Posteriors for Gibbs Samplers, which (aside from the problem of possibly not having a
> customized algorithm to draw from some conditional distribution) are basically situations where the parameters are
> highly correlated. And the possible solutions are batching, reparameterizing the model so that the parameters are less
> correlated and possibly post-processing the output to dereparameterize, maybe some other things, or don't use a Gibbs
> sampler. Then there could be a subsection on Problematic Posteriors for HMC that talks about the things I mentioned
> yesterday (in addition to non-zero posterior mass on the boundaries, which I forgot). Later there could be a section on
> Problematic Posteriors for RMHMC, which hopefully only consists of high-dimensional posteriors.

I think I'll just stick to one chapter now. We can write
a more general book later!

Michael was going to add some more info on samplers in general.

As Andrew pointed out to me today, once we transform, there's not
going to be any posterior mass on the boundary in the space we
actually sample from. This is complicated!

> > On Feb 9, 2014, at 11:35 PM, Ben Goodrich wrote:
> >> There is not much to say about improper posteriors, except that you basically can't do Bayesian inference.
>
> I disagree. The problem stems from advice from Andrew and others
> in the context of Gibbs. Advice our users are trying to apply to Stan,
> with no luck, leading them to conclude Stan is slow.
>
>
> I see what you mean, but I disagree that you disagree. With post-processing, you have post-processed draws from a proper
> posterior; it is just a more complicated algorithm to draw from that proper posterior.

I see what you mean. My point was that they're problematic in Stan
because the sampling itself happens over the improper posterior.
They're OK in Gibbs because the correlation means that the sampler
doesn't wander too far and the conjugacy makes it fast.

- Bob

Michael Betancourt

Feb 10, 2014, 12:25:48 PM
to stan...@googlegroups.com
Woah, woah.  They’re not okay, they’re just hiding the serious
statistical problems from the user.  Big difference.

Bob Carpenter

Feb 10, 2014, 12:38:47 PM
to stan...@googlegroups.com


On 2/10/14, 6:25 PM, Michael Betancourt wrote:
> Woah, woah. They’re not okay, they’re just hiding the serious
> statistical problems from the user. Big difference.

Take it up with Andrew and Jennifer :-)

Seriously, though, what I meant was that if you sample
something like:

y ~ normal(lambda1 + lambda2, sigma);

then mu = (lambda1 + lambda2) is going to be well behaved
in that it'll take on the same value as mu would if you'd
just written

y ~ normal(mu, sigma);

Each update of lambda1 conditioned on lambda2 and sigma is
very constrained.

Sure, if you ran forever you might run into floating point
problems.

Sure, your convergence diagnostics are going to be messed
up for lambda1 and lambda2 individually.

I am NOT suggesting that users do this for Stan, just that
the reason they think it's working in Gibbs is because of
the correlation between lambda1 and lambda2 in the posterior
and Gibbs's slow exploration of it.
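To be concrete, here's a sketch of the identified reformulation I'd
recommend for Stan instead (post-process if you need a decomposition):

    data {
      int<lower=0> N;
      vector[N] y;
    }
    parameters {
      real mu;               // plays the role of lambda1 + lambda2
      real<lower=0> sigma;
    }
    model {
      y ~ normal(mu, sigma);
    }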

- Bob



Michael Betancourt

Feb 10, 2014, 12:45:08 PM
to stan...@googlegroups.com
Yeah, I agree that the chains will look okay but as we’ve discussed
they’re not sampling from the correct distribution. This is a classic
example of our not having any sufficient conditions for convergence.
I just want to make sure that we emphasize that such Gibbs behavior
is in fact incorrect.

Bob Carpenter

Feb 10, 2014, 12:55:30 PM
to stan...@googlegroups.com


On 2/10/14, 6:45 PM, Michael Betancourt wrote:
> Yeah, I agree that the chains will look okay but as we've discussed
> they're not sampling from the correct distribution. This is a classic
> example of our not having any sufficient conditions for convergence.
> I just want to make sure that we emphasize that such Gibbs behavior
> is in fact incorrect.

I did, and will make sure it's extra clear. I described Gibbs's
poor mixing as masking the problematic behavior of not really
sampling from the posterior for (lambda1, lambda2).

- bob