Explicitly Correlated Parameters


John Hall

unread,
Mar 21, 2017, 3:58:14 PM3/21/17
to Stan users mailing list
There are already a few threads on the subject, but I'm not interested in a particular model. Rather, I'm just observing that models with explicitly correlated parameters are a bit quirky in Stan. I'm curious if people out there have a lot of success with explicitly correlated parameters versus the implicit case. I'm also curious about best practices on when to use them and when to not use. My thoughts are below.

By explicitly correlated parameters, I mean a model that contains something like what is in Section 26.6 of the documentation under Multivariate Reparameterizations, rather than the previous section that handles the univariate case.
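To make the "explicit" case concrete, here is a NumPy sketch (my own toy numbers, not from the manual): each group's (intercept, slope) pair is drawn from a multivariate normal built from scales and a correlation matrix via its Cholesky factor, the non-centered form the manual recommends.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hyperparameters: overall mean, scales, and correlation
# of (intercept, slope) across groups.
mu = np.array([0.0, 1.0])
tau = np.array([0.5, 0.2])
rho = -0.6
Omega = np.array([[1.0, rho], [rho, 1.0]])     # correlation matrix
Sigma = np.diag(tau) @ Omega @ np.diag(tau)    # implied covariance

# Non-centered draw: beta = mu + diag(tau) @ L @ z, with L the Cholesky
# factor of the correlation matrix and z ~ N(0, I).
L = np.linalg.cholesky(Omega)
z = rng.standard_normal((10_000, 2))
beta = mu + z @ (np.diag(tau) @ L).T

# The empirical covariance of the group-level draws recovers Sigma.
emp_cov = np.cov(beta.T)
```

The "implicit" case would instead give each column of beta its own independent univariate normal prior and let any correlation emerge in the posterior.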

With Gibbs sampling and Metropolis-Hastings, you have to include a multivariate distribution, or the posterior is almost forced to have uncorrelated parameters. I believe the situation was similar with BUGS/JAGS. I suspect that emphasis in those techniques is why the Stan documentation devotes a decent amount of attention to models with correlated parameters.

My experience with correlated parameters in Stan has been rather frustrating. I almost always get worse results (divergences, low N_eff, longer estimation time, etc.) compared to univariate models. With a univariate prior on sets of parameters, the resulting posterior might still show some correlation when you look at the parameters together, or it might not. The only downside is that if you have some new observation, you might need to make some assumptions in order to make predictions about it.

I suspect that this is driven by the restrictions imposed by the multivariate normal prior. For instance, one model I'm estimating is a linear regression on some panel data, with each series regressed on one common time series variable and with the intercepts, slopes, and standard deviations varying cross-sectionally (hierarchical priors on each of these three). If you leave the priors on the varying parameters univariate, it fits pretty well, pretty quickly, all things considered. By contrast, if you explicitly impose any kind of correlated parameters (centered or non-centered on the slopes and intercepts), then you get divergences or low N_eff and it takes far longer.

To understand why this is, I focused on the posterior distribution of the parameters in the univariate case. Upon inspection, the slope looks more log-normal than normal, and the relationship of the intercept to the slope (or the slope to the standard deviations) looks non-linear. Thus, I suspect that imposing the linear correlation of the multivariate normal prevents the Stan sampler from finding the non-linear result.

Another potential explanation is sort of brought up in the wiki section on priors
https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations
under "Boundary-avoiding priors for modal estimation". I'm not sure it entirely applies, but it discusses the varying slope and intercept model and the potential for a correlation of 1. Well, my example above often had a correlation near -1 between the intercepts and slopes. But I'm not sure whether this is a product of the divergences.

Bob Carpenter

unread,
Mar 21, 2017, 4:26:21 PM3/21/17
to stan-...@googlegroups.com
It's not so much a Stan problem as a general problem.
There's no way to tweak multivariate proposals in Gibbs unless
you block them, and they tend not to work in Metropolis unless
you know the curvature (second derivative matrix) explicitly and
it doesn't vary around the posterior. What you may be finding
is just that Stan's much better at exploring the boundaries of
parameter space and reporting when it gets into trouble.

To see this for yourself, we recommend simulating data from a model
that looks like what you want, fitting it in various systems, and
looking at the boundaries of the 80% or 90% intervals you get
(95% or higher will be too noisy without a high effective sample size).
But you need to use Stan's posterior summary tools, which are also
more conservative than Coda or other tools often applied to BUGS
or JAGS models.
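Bob's caution about 95%+ intervals can be illustrated numerically: tail quantiles are estimated much less precisely than central ones from the same number of draws. A toy NumPy check (my own, not a Stan tool):

```python
import numpy as np

rng = np.random.default_rng(1)

# For many replications of a 1000-draw "posterior sample", estimate the
# 80% and 97.5% quantiles and see how much each estimate varies.
reps = 2000
draws = rng.standard_normal((reps, 1000))
q80 = np.quantile(draws, 0.80, axis=1)
q975 = np.quantile(draws, 0.975, axis=1)

# The extreme quantile is estimated far less precisely: its Monte Carlo
# standard error is roughly double that of the 80% quantile here.
sd_q80 = q80.std()
sd_q975 = q975.std()
```

This is why the thread's advice is to compare 80% or 90% interval endpoints across systems rather than 95%+ ones.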

It's not the correlation that causes so much of a problem as
when the curvature (second derivative matrix aka Hessian) varies
around the posterior, as in the funnel example.
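The funnel referred to here is presumably Neal's funnel, where the conditional curvature in x depends exponentially on y; a small numeric check of that varying-curvature claim:

```python
import math

def d2_logp_dx2(y):
    """Second derivative in x of Neal's funnel log density,
    log p(x, y) = -y**2/18 - x**2/(2*exp(y)) - y/2 + const,
    i.e. y ~ N(0, 3) and x | y ~ N(0, exp(y/2))."""
    return -math.exp(-y)

# Curvature in x changes by orders of magnitude across typical y values,
# so no single step size (or mass matrix) suits the whole posterior.
curv_neck = abs(d2_logp_dx2(-3.0))   # narrow neck: high curvature
curv_mouth = abs(d2_logp_dx2(3.0))   # wide mouth: low curvature
ratio = curv_neck / curv_mouth       # exp(6), roughly 400x
```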

For arithmetic stability, we recommend using Cholesky factors
of correlation matrices with LKJ priors and then scaling (a
less ad hoc version of what Gelman and Hill recommend in
their regression book).

And yes, correlations near -1 or 1 are going to be problematic
no matter what sampler you use. This is largely a fact of the
way floating point arithmetic works. You can represent numbers
like 1e-300 --- in other words, get very close to zero, but the
closest you can get to 1 or -1 in floating point is around 1 - 1e-15.
If there's nothing stopping you from wanting to get closer, you're
going to get divergences because you just can't even represent a number
that's closer to 1 or -1 without rounding error to 1 or -1, which in
turn makes the covariance matrix singular.
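The floating-point limits described here are easy to verify (Python below, but these are the same IEEE doubles Stan uses):

```python
import numpy as np

# Near zero, doubles are dense: 1e-300 is representable.
tiny = 1e-300

# Near 1, the closest representable double below 1.0 differs by ~1.1e-16.
below_one = np.nextafter(1.0, 0.0)
gap = 1.0 - below_one

# Subtracting anything much smaller than that gap rounds back to 1.0,
# and a correlation of exactly 1 makes the matrix numerically singular.
rho = 1.0 - 1e-17          # rounds to exactly 1.0 in double precision
Omega = np.array([[1.0, rho], [rho, 1.0]])
det = np.linalg.det(Omega)  # ~0: the matrix is singular
```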

But this is just what Andrew calls the "folk theorem". If your
inferences want a correlation near 1 or -1, you have two linearly
redundant parameters in your regression and you probably just want
to get rid of one. Otherwise, they're only going to be identified
by the prior (see the manual section on problematic priors for a simple
example using two intercepts, which is just an extreme case of
the same problem).

- Bob

Ben Goodrich

unread,
Mar 21, 2017, 6:23:34 PM3/21/17
to Stan users mailing list
On Tuesday, March 21, 2017 at 3:58:14 PM UTC-4, John Hall wrote:
There are already a few threads on the subject, but I'm not interested in a particular model. Rather, I'm just observing that models with explicitly correlated parameters are a bit quirky in Stan. I'm curious if people out there have a lot of success with explicitly correlated parameters versus the implicit case. I'm also curious about best practices on when to use them and when to not use. My thoughts are below.

The stan_{g}lmer() models in the rstanarm package fall in this category once you go beyond the simple case where only the intercepts deviate by group, and they tend to work well.

Ben

John Hall

unread,
Mar 23, 2017, 6:28:11 PM3/23/17
to stan-...@googlegroups.com
Some good points. I wanted to take some time to digest. A few comments:

"To see this for yourself, we recommend simulating data from a model
that looks like what you want"

The simulations I have used in the past often simulate the parameters as effectively multivariate normal. There are relatively fewer problems with the model in these cases because the model assumes multivariate normal and that's how the parameters are distributed when simulated. I get more problems when I switch to the actual data set and the parameters are no longer multivariate normal.

"For arithmetic stability, we recommend using Cholesky factors
of correlation matrices with LKJ priors and then scaling (a
less ad hoc version of what Gelman and Hill recommend in
their regression book."

I read and re-read the Stan documentation more regularly than I should admit...I followed a lot of the best practices, like LKJ priors.

"And yes, correlations near -1 or 1 are going to be problematic
no matter what sampler you use.  This is largely a fact of the
way floating point arithmetic works.  You can represent numbers
like 1e-300 --- in other words, get very close to zero, but the
closest you can get to 1 or -1 in floating point is around 1 - 1e-15.
If there's nothing stopping you from wanting to get closer, you're
going to get divergences because you just can't even represent a number
that's closer to 1 or -1 without rounding error to 1 or -1, which in
turns makes the covariance matrix singular."

I've been fitting the model with OLS and no pooling and comparing the results to Stan's output. So, for instance, I can plot the distribution of intercepts vs. slopes in OLS and compare that to the mean intercepts and slopes from Stan. The correlation of the intercepts to the slopes from the OLS output is probably closer to -0.6. So it's modest, not extreme. I think the issue is more that Stan will move towards +1 or -1 and then get stuck there. Also, the Stan plots look much more like a multivariate normal than the OLS plots do. I suspect this is because of the divergences.

"But this is just what Andrew calls the "folk theorem".  If your
inferences want a correlation near 1 or -1, you have two linearly
redundant parameters in your regression and you probably just want
to get rid of one.  Otherwise, they're only going to be identified
by the prior (see the manual section on problematic priors for a simple
example using two intercepts, which is just an extreme case of
the same problem)."

I've made that mistake of effectively putting in two intercepts before... One thing that would make life a little easier is if divergences could be printed at the end of each chain, or perhaps an option to exit early if there is a divergence (for instance, equivalent to a field that indicates success or failure in an optimization). Beyond that, I don't know if it would be possible for someone to write a program that parses a Stan model and warns you if there are problems, at least for simple things like identification.

Anyway, this part of your post caused me to go back and run a few more models. After double-checking, the divergences problem was showing up with explicitly multivariate normal intercepts/slopes, regardless of whether I used a constant standard deviation or varying standard deviations by group. However, I also realized that under some conditions, I would also get divergences when I did the model with varying intercepts, slopes, and standard deviations (which I don't think I had realized before double-checking). I could avoid these with some simple changes, like having 2 of 3 vary or changing the priors (for my data, varying standard deviations was more important than varying intercepts).

There are two separate points I want to address with respect to the model being wrong:

1) The multivariate case. The idea that the model is wrong makes some sense since the empirical distribution of the OLS parameters I discussed above aren't REALLY multivariate normal. So there may be computational problems because I'm forcing a distribution to be multivariate normal, when it is really more complicated than that. In my case, I had wanted the intercepts and slopes to vary by individual. I knew from looking at the OLS results that individuals with high slopes tended to have low intercepts, so I assumed it was important to explicitly model the correlation with a multivariate normal. But I almost always had worse results when I did that. That's what I find intuitive. The model isn't wrong with separate univariate normal priors, but it becomes wrong with a multivariate normal one.

2) Varying intercepts/slopes/standard deviations. Stan can fit a univariate model with an intercept, slope, and standard deviation. And it can fit a no pooling version of that with many separate individuals each having their own intercepts, slopes, and standard deviations. But give all those parameters hierarchical priors and you get divergences. I don't find this intuitive. Why has the model become wrong?




Andrew Gelman

unread,
Mar 23, 2017, 6:30:30 PM3/23/17
to stan-...@googlegroups.com
Let me just add to this discussion that there's nothing wrong at all about identifying parameters with the prior!  Suppose you have a regression with two collinear predictors and you have some prior information on the coefficients.  Then include the prior information!  No need to throw predictors out of the model just because of some scruple that prior info is cheating.
A


John Hall

unread,
Mar 23, 2017, 6:31:26 PM3/23/17
to stan-...@googlegroups.com
"The stan_{g}lmer() models in the rstanarm package fall in this category once you go beyond the simple case where only the intercepts deviate by group, and they tend to work well."

I tried the code below yesterday. My data was 120X565 originally, so to get it to work with stan_glmer it becomes like 70,000X3. It never got past 1 of 500 on the first chain after leaving it for an hour. At least when I write it myself in Stan I have some idea what's going on and I can exploit some knowledge of the data set to speed things up.

fit_mkt_rstanarm <- stan_glmer(y ~ (1 + x | stock), data = rstanarm_data_mkt,
                               family = gaussian(link = "identity"),
                               prior=hs(df=7),
                               prior_intercept=normal(0, 1),
                               prior_covariance=decov(),
                               iter = N_iter, chains = N_chain)

John Hall

unread,
Mar 23, 2017, 6:37:36 PM3/23/17
to stan-...@googlegroups.com
"Let me just add to this discussion that there's nothing wrong at all about identifying parameters with the prior!  Suppose you have a regression with two collinear predictors and you have some prior information on the coefficients.  Then include the prior information!  No need to throw predictors out of the model just because of some scruple that prior info is cheating."

I try to use reasonable priors on means and standard deviations, not overly tight but not too wide.

The correlations are more tricky. Academic theory in my case has historically been that there's no correlation between the intercepts and slopes. However, recent studies have disputed the theory. Do I base the prior on the theory or the recent studies?

Andrew Gelman

unread,
Mar 23, 2017, 6:40:02 PM3/23/17
to stan-...@googlegroups.com
You want the data to inform you about the corrs, but if the predictors are roughly zero-centered, then maybe you don't expect corrs to be fully degenerate. In a model with one correlation, you can put a Beta(2,2) prior on (rho + 1)/2. In a model with multiple correlations, you can use a Wishart (not inverse-Wishart) prior, as discussed here:
A
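A quick SciPy check of the Beta(2,2) suggestion: the implied density on rho is symmetric around 0 and vanishes at ±1, so it gently discourages degenerate correlations without ruling out large ones.

```python
from scipy.stats import beta

# If (rho + 1)/2 ~ Beta(2, 2), the implied density on rho in (-1, 1)
# is p(rho) = beta.pdf((rho + 1)/2, 2, 2) / 2 (Jacobian of the shift/scale).
def rho_pdf(rho):
    return beta.pdf((rho + 1.0) / 2.0, 2, 2) / 2.0

at_zero = rho_pdf(0.0)      # density is maximal at rho = 0
at_one = rho_pdf(1.0)       # density is exactly 0 at the boundary
at_neg_one = rho_pdf(-1.0)  # likewise at rho = -1
```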


Jonah Gabry

unread,
Mar 23, 2017, 7:59:47 PM3/23/17
to Stan users mailing list
Hi John,

It's actually not too bad to write stan_glmer's decov prior yourself if you want to do it in your Stan program. The prior is explained here

https://cran.r-project.org/web/packages/rstanarm/vignettes/glmer.html#priors-on-covariance-matrices

and it involves not only the LKJ for the correlation matrix (or preferably its Cholesky decomposition), but also a reparameterization of the scale parameters.

Jonah

John Hall

unread,
Mar 24, 2017, 4:23:12 PM3/24/17
to stan-...@googlegroups.com
This simpler model has one correlation, so I tried the Beta(2,2) prior that you discuss. I had the same issue.

One thing I noticed is that the standard deviation on the hierarchical intercept tends to get driven to zero, so I was curious if standardizing the data made any difference. But it didn't seem to make much of an improvement in one set of models. I'm currently running a Stan model with standardized variables* and correlation as you describe. Given how slow it's going, I'm expecting divergences.

I suspect it's some kind of identification issue, but I don't understand why, if there's no identification problem with a univariate model, there is one with a hierarchical model.

*With hierarchical models, it's correct to standardize with the overall mean and standard deviation, not the individual, correct?

John Hall

unread,
Mar 24, 2017, 4:24:28 PM3/24/17
to stan-...@googlegroups.com
Thanks Jonah. There are similar recommendations in the Stan manual. I have tried them. Sometimes they work for me, but not always.

Have you had much success with them? I would be interested in your experiences.

Bob Carpenter

unread,
Mar 24, 2017, 6:21:22 PM3/24/17
to stan-...@googlegroups.com
The hierarchical model can often collapse because
normal_lpdf(y | mu, sigma) is unbounded as sigma -> 0 and y -> mu.
That's why you can't do ordinary optimization or MLE with them.
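The collapse argument is easy to check numerically: at y = mu, the normal log density is -log(sigma) - log(2*pi)/2, which grows without bound as sigma shrinks. A quick SciPy check:

```python
from scipy.stats import norm

# Normal log density evaluated at its own mean: it diverges to +infinity
# as the scale goes to zero, so the joint density has no finite mode.
lp_wide = norm.logpdf(0.0, loc=0.0, scale=1.0)
lp_narrow = norm.logpdf(0.0, loc=0.0, scale=1e-6)
lp_tiny = norm.logpdf(0.0, loc=0.0, scale=1e-12)
```

This is why MLE and ordinary optimization fail on hierarchical models even though full Bayes (which integrates over sigma) is fine.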

You can get identifiability problems if you add in intercepts
and intercepts for the priors, because you essentially get two
intercepts. This is the kind of thing that often gets identified
with priors.

If you're finding the real data a lot harder to fit than simulated
data, you might want to try to diagnose where the actual data's
violating the modeling assumptions.

- Bob

Jonah Sol Gabry

unread,
Mar 24, 2017, 6:40:34 PM3/24/17
to stan-...@googlegroups.com
On Fri, Mar 24, 2017 at 4:24 PM John Hall <john.mic...@gmail.com> wrote:
Thanks Jonah. There are similar recommendations in the Stan model. I have tried them. Sometimes they work for me, but not always.

Have you had much success with them? I would be interested in your experiences.

The stuff about using (Cholesky factors of) correlation matrices is in the Stan manual, but stan_glmer also does something different with the vector of variances that isn't in the general Stan manual. It involves a Dirichlet prior on the proportion of the total variance (trace of the covariance matrix) corresponding to each variable (column). Did you see that part of the vignette? You have to scroll down to the end of the "Details" section.
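Roughly, the idea (my reading of the vignette, not rstanarm's actual code) can be sketched in NumPy: a Dirichlet draw allocates the total variance, i.e. the trace of the covariance matrix, across the varying coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)

K = 3                                   # number of varying coefficients
total_sd = 1.5                          # overall scale (has its own prior)
trace = K * total_sd**2                 # total variance to allocate

# A Dirichlet draw gives the fraction of total variance each coefficient
# gets; concentration 1 is uniform on the simplex.
pi = rng.dirichlet(np.ones(K))
variances = trace * pi
sds = np.sqrt(variances)

# Combined with a correlation matrix Omega, the implied covariance is
# diag(sds) @ Omega @ diag(sds), whose trace equals `trace` by design.
Omega = np.eye(K)                       # placeholder correlation matrix
Sigma = np.diag(sds) @ Omega @ np.diag(sds)
```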

Or is that what you're saying you tried? I have rarely had the kinds of problems you're seeing when using that prior. 


Jonah



John Hall

unread,
Mar 27, 2017, 11:41:59 AM3/27/17
to stan-...@googlegroups.com
@Jonah, Sorry, I must have skipped over that detail. No, that's not the approach I took. I used an LKJ prior on the correlation with half normal priors on the standard deviations.

Have you found more success with your approach?

Jonah Sol Gabry

unread,
Mar 27, 2017, 12:56:32 PM3/27/17
to stan-...@googlegroups.com
No problem. And yeah it has been a very robust default in rstanarm, so it could be worth trying.

On Mon, Mar 27, 2017 at 11:42 AM John Hall <john.mic...@gmail.com> wrote:
@Jonah, Sorry, I must have skipped over that detail. No, that's not the approach I took. I used an LKJ prior on the correlation with half normal priors on the standard deviations.

Have you found more success with your approach?


John Hall

unread,
Mar 29, 2017, 2:59:51 PM3/29/17
to stan-...@googlegroups.com
Comments in line.

On Fri, Mar 24, 2017 at 6:20 PM, Bob Carpenter <ca...@alias-i.com> wrote:
The hierarchical model can often collapse because
normal_lpdf(y | mu, sigma) is unbounded as sigma -> 0 and y -> mu.
That's why you can't do ordinary optimization or MLE with them.

You can get identifiability problems if you add in intercepts
and intercepts for the priors, because you essentially get two
intercepts.  This is the kind of thing that often gets identified
with priors.

Just to confirm: you're saying that if you have intercepts, then you can't have intercepts in the priors. But, you can have intercepts in the priors if there is no intercept in the problem?

I'm prefacing the following with the reminder that the model I am discussing is a panel, with each individual regressed on one time series and different intercepts, slopes, and standard deviations for each.

With explicitly correlated parameters, I had originally been doing something like
        B_mu ~ normal(B_prior_mu, B_prior_sd);
        for (i in 1:N_Y)
            B_n[i] ~ multi_normal_prec(B_mu, inv_sigma);

B_prior_mu would have been something like {0, 1}, since I had a fairly good idea that the average slope would be closer to 1.

And now I'm doing something like
        for (i in 1:N_Y)
            B_n[i] ~ multi_normal_prec(B_prior_mu, inv_sigma);

As a result the divergences seem to have disappeared. However, I'm not sure this is exactly what you are suggesting.

What you're saying seems closer to something like

        B_mu ~ normal(0, B_prior_sd);
        for (i in 1:N_Y)
            B_n[i] ~ multi_normal_prec(B_prior_mu, inv_sigma);

which is similar to what is in the Stan manual in Section 8.13. However, the manual says that more information can be incorporated into u_{k} (which might be a typo, since there is no u_{k}, only u_{j}). If it only makes sense to adjust the standard deviation of the prior, then that could be made clearer in the manual.

Similarly, if I were doing this without explicit correlation, I should do something like (excluding the other stuff for brevity)

    beta ~ normal(0, B_prior_sd[2]);
    beta_n_sd ~ normal(0, B_n_prior_sd[2]);
    beta_n ~ normal(beta, beta_n_sd);

instead of

    beta ~ normal(1, B_prior_sd[2]);
    beta_n_sd ~ normal(0, B_n_prior_sd[2]);
    beta_n ~ normal(beta, beta_n_sd);

even if I have a good idea that beta should have a mean closer to 1?



 

If you're finding the real data a lot harder to fit than simulated
data, you might want to try to diagnose where the actual data's
violating the modeling assumptions.

This was a good suggestion, but it has been a rather time-consuming project. It turned out to be harder than I thought to figure out exactly what part of the data was violating the assumptions. I kept making it more and more realistic without finding where it was going wrong. Eventually I realized that the problem was that I had been filling a matrix with normal random variables and that I needed to center them before proceeding. Not centering meant that there was some extra variation in the intercept (i.e. I had made an inadvertent additional change in the assumptions). So I'm starting the search over now.
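The centering slip is easy to reproduce: columns of a simulated normal predictor matrix have sample means near, but not exactly, zero, and that residual shift leaks into the intercepts unless you center.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated predictor matrix: column sample means are close to, but not
# exactly, zero, which adds spurious variation to fitted intercepts.
X = rng.standard_normal((200, 5))
raw_means = X.mean(axis=0)

# Centering each column removes that inadvertent shift, matching the
# assumptions the simulation was supposed to encode.
X_centered = X - raw_means
centered_means = X_centered.mean(axis=0)
```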

Bob Carpenter

unread,
Mar 31, 2017, 2:56:51 PM3/31/17
to stan-...@googlegroups.com

> On Mar 29, 2017, at 2:59 PM, John Hall <john.mic...@gmail.com> wrote:
>
> Comments in line.
>
> On Fri, Mar 24, 2017 at 6:20 PM, Bob Carpenter <ca...@alias-i.com> wrote:
> The hierarchical model can often collapse because
> normal_lpdf(y | mu, sigma) is unbounded as sigma -> 0 and y -> mu.
> That's why you can't do ordinary optimization or MLE with them.
>
> You can get identifiability problems if you add in intercepts
> and intercepts for the priors, because you essentially get two
> intercepts. This is the kind of thing that often gets identified
> with priors.
>
> Just to confirm: you're saying that if you have intercepts, then you can't have intercepts in the priors. But, you can have intercepts in the priors if there is no intercept in the problem?

What I'm saying is that if you have multiple intercepts (say one
global one and then group-level intercepts), then you only get
identification through the prior, because you can add to the global
intercept and subtract from the group intercepts and get the same
value. Usually this is mitigated by using n-1 intercepts if
there are n groups and pinning one to zero.
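The add-and-subtract point can be verified directly in a toy setup (hypothetical names and numbers): shifting the global intercept by any constant and the group intercepts by its negative leaves every fitted value, and hence the likelihood, unchanged.

```python
import numpy as np

rng = np.random.default_rng(4)

n_groups, n_obs = 5, 100
group = rng.integers(0, n_groups, size=n_obs)
x = rng.standard_normal(n_obs)

alpha_global = 0.7
alpha_group = rng.standard_normal(n_groups)
beta = 1.2

def predict(a_global, a_group):
    return a_global + a_group[group] + beta * x

# Shift by an arbitrary constant c: predictions are identical, so the
# data alone cannot pin down the split between the two intercept levels.
c = 42.0
mu1 = predict(alpha_global, alpha_group)
mu2 = predict(alpha_global + c, alpha_group - c)
```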

>
> I'm prefacing the following with the reminder that the model I am discussing is a panel with each individual regressed one time series, and different intercepts, slopes, and standard deviations for each.
>
> With explicitly correlated parameters, I had originally been doing something like
> B_mu ~ normal(B_prior_mu, B_prior_sd);
> for (i in 1:N_Y)
> B_n[i] ~ multi_normal_prec(B_mu , inv_sigma);
>
> B_prior_mu would have been something like {0, 1}, since I had a fairly good idea that the average slope would be closer to 1.
>
> And now I'm doing something like
> for (i in 1:N_Y)
> B_n[i] ~ multi_normal_prec(B_prior_mu, inv_sigma);
>
> As a result the divergences seem to have disappeared. However, I'm not sure this is exactly what you are suggesting.

You're making a normal-normal model of B_n in the first case and a normal
model in the second case. I don't think you want to do the former unless
you have some kind of explicit measurement-error model, and even then, it's
usually on the data side.

> What you're saying seems closer to something like
>
> B_mu ~ normal(0, B_prior_sd);
> for (i in 1:N_Y)
> B_n[i] ~ multi_normal_prec(B_prior_mu, inv_sigma);

I think B_mu and B_prior_mu were meant to be the same variable here,
so I'll assume that should be B_mu in the third line.

> which is similar to what is in the Stan manual in Section 8.13. However, the manual says that more information can be incorporated into u_{k} (might be a typo, since there is no u_{k}, only u_{j}). If it only makes sense to adjust the standard deviation of the prior, then it might be made more clear in the manual.

It's OK if the prior for B_mu here is any kind of constant --- it
doesn't have to be zero. If it's a variable, then you just wind
up with a compound distribution for B_mu (sort of like beta-binomial
or gamma-Poisson [aka negative binomial]).
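The compound-distribution remark can be checked by simulation: mixing a Poisson rate over a Gamma yields negative-binomial moments (mean m, variance m + m^2/r).

```python
import numpy as np

rng = np.random.default_rng(5)

r, m = 4.0, 10.0               # NB shape r and mean m
n = 200_000

# Gamma-Poisson mixture: lam ~ Gamma(shape=r, scale=m/r), y ~ Poisson(lam).
lam = rng.gamma(shape=r, scale=m / r, size=n)
y = rng.poisson(lam)

emp_mean = y.mean()            # should be near m = 10
emp_var = y.var()              # should be near m + m**2 / r = 35
```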

>
> Similarly, if I were doing this without explicit correlation, I should do something like (excluding the other stuff for brevity)
>
> beta ~ normal(0, B_prior_sd[2]);
> beta_n_sd ~ normal(0, B_n_prior_sd[2]);
> beta_n ~ normal(beta, beta_n_sd);
>
> instead of
>
> beta ~ normal(1, B_prior_sd[2]);
> beta_n_sd ~ normal(0, B_n_prior_sd[2]);
> beta_n ~ normal(beta, beta_n_sd);
>
> even if I have a good idea that beta should have a mean closer to 1?

No, if you think the mean's closer to 1, by all means use that second
case. Zero isn't the relevant thing here, it's whether it's a parameter
or not.

> If you're finding the real data a lot harder to fit than simulated
> data, you might want to try to diagnose where the actual data's
> violating the modeling assumptions.
>
> This was a good suggestion, but it has been a rather time-consuming project. It turned out to be harder than I thought to figure out exactly what part of the data was violating the assumptions. I kept making it more and more realistic without finding where it was going wrong. Eventually I realized that the problem was that I had been filling a matrix with normal random variables and that I needed to center them before proceeding. Not centering meant that there was some extra variation in the intercept (i.e. I had made an inadvertent additional change in the assumptions). So I'm starting the search over now.

Yup, challenging, but usually worth the effort in model understanding and
peace of mind that your code works.

- Bob

John Hall

unread,
Apr 3, 2017, 1:54:18 PM4/3/17
to stan-...@googlegroups.com
I appreciate the reply Bob.


On Fri, Mar 31, 2017 at 2:55 PM, Bob Carpenter <ca...@alias-i.com> wrote:
What I'm saying is that if you have multiple intercepts (say one
global one and then group-level intercepts), then you only get
identification through the prior because you can add to the global
intercept and subtract from the group intercepts and get the same
value.  Usually this is mitigated with using n-1 intercepts if
there are n groups and pinning one to zero.

In my case, I didn't have a global intercept. Just group-level intercepts. I've made the above mistake before.

You're making a normal-normal model of B_n in the first case and a normal
model in the second case.  I don't think you want to do the former unless
you have some kind of explicit measurement-error model, and even then, it's
usually on the data side.

I was a little confused by this. The B_n's are the varying coefficients, not the actual data. So in the former, I have correlated coefficients B_n, and then I gave the mean of the B_n a prior. You seem to be saying in other places that this is the correct approach.

Yup, challenging, but usually worth the effort in model understanding and
peace of mind that your code works.

One thing I had realized from this was that when I used the actual OLS intercepts/slopes/standard errors and simulated the errors as normal, it almost always resulted in very low N_eff. But when I replaced the empirical intercept distribution with some simpler distribution, the N_eff were almost always more reasonable. I tried a lot of options, but the least disruptive thing I did was take the percent ranks of the intercepts and convert them to a normal distribution. So it was very similar, except without the fat tails. This got something like 250/1000 N_eff, which was substantially better than with the actual intercepts.

However, when I tried to adjust the Stan code (a version not including the correlated parameters, but similar) to include a t distribution for the intercepts, the posterior distribution of the degrees of freedom parameter didn't really change much from the prior and the final results weren't really any different. I was a little disappointed after that.

This approach was also interesting because it made clear how non-normal the joint distribution of the intercepts and slopes was. The intercepts were t distributed and the slopes (and standard errors) were log normally distributed. There probably also was some sort of non-normal dependence between the two. Even when I tried regressing the intercepts against the slopes and errors and including t errors in them, the simulated plots of the intercepts and slopes looked different from the empirical ones.

Suffice it to say, I think the non-normality of the parameters is what is driving my problem. At this point, I don't see how to resolve it other than removing the varying intercepts (I was actually able to use intercepts that vary by larger groups, just not the lowest level ones).

John Hall

unread,
Apr 3, 2017, 5:36:59 PM4/3/17
to stan-...@googlegroups.com
Right now I'm playing around with the distribution on the standard deviation on the intercept as that seems like a key issue. It seems biased too low. So instead of half normal, I'm trying a normal parameter (chosen carefully!) that's transformed to log normal. I'm also curious to see if switching the normal for cauchy helps.

One idea I just had that I hadn't considered before is that I didn't make the problem simple enough when I was trying to identify the problem. I could remove the independent variable and slopes completely and re-estimate.

Bob Carpenter

unread,
Apr 4, 2017, 1:48:31 PM4/4/17
to stan-...@googlegroups.com

> On Apr 3, 2017, at 5:36 PM, John Hall <john.mic...@gmail.com> wrote:
>
> Right now I'm playing around with the distribution on the standard deviation on the intercept as that seems like a key issue. It seems biased too low. So instead of half normal, I'm trying a normal parameter (chosen carefully!) that's transformed to log normal. I'm also curious to see if switching the normal for cauchy helps.

How are you assessing bias? Are you talking about the posterior
mean as an estimate of the parameter? I'd recommend looking at posterior
interval coverage for simulated data --- that'll let you assess whether
your model's coded properly. Then it's a different matter to assess whether
it matches some real data set.

> One idea I just had that I hadn't considered before is that I didn't make the problem simple enough when I was trying to identify the problem. I could remove the independent variable and slopes completely and re-estimate.

This is a key step in debugging. Isolate the problem in
a reproducible example that is as simple as possible. Often
by the time you've done that, the debugging job is done.

This is one of the main reasons we suggest building models
up from simpler models wherever possible. It's tedious to
build up one feature at a time, but at least you know exactly
where the problem comes from when it arises.

- Bob

John Hall

unread,
Apr 5, 2017, 5:10:47 PM4/5/17
to stan-...@googlegroups.com

On Tue, Apr 4, 2017 at 1:47 PM, Bob Carpenter <ca...@alias-i.com> wrote:
How are you assessing bias?  Are you talking about the posterior
mean as an estimate of the parameter?  I'd recommend looking at posterior
interval coverage for simulated data --- that'll let you assess whether
your model's coded properly.  Then it's a different matter to assess whether
it matches some real data set.

The actual value (of the standard deviation on the hierarchically varying intercept) was closer to 0.01 and I was getting like 0.001. The intervals were way different. This was also what was driving the divergences. Basically, this standard deviation was getting too low. I played around a little with the specification and was getting better results.

Also, I started using some of the diagnostics from Michael Betancourt's case study on the subject. They are definitely valuable.
 

This is a key step in debugging.  Isolate the problem in
a reproducible example that is as simple as possible.  Often
by the time you've done that, the debugging job is done.

This is one of the main reasons we suggest building models
up from simpler models wherever possible.  It's tedious to
build up one feature at a time, but at least you know exactly
where the problem comes from when it arises.

Of course, but sometimes until you fit the actual data you don't know there are going to be problems. I began with simpler models. I just needed to work with different simple models...

I actually was able to isolate things and confirm that it really isn't the correlation between the intercepts and slopes that is causing the problem. I had previously mentioned that the data also has varying standard deviations that I estimate. Focusing on just the varying intercepts and standard deviations caused divergences with the actual data, but not with some simpler adjusted versions of it.

The problem is that the more the intercept is below zero, the higher the standard deviation tends to be. However, the standard deviation also gets larger when the intercept is above zero, just not by as much. So, it's not a linear relationship between the two. I did a few plots trying to make it clear, but it's most obvious just looking at the square of the intercepts above and below a certain bound. For instance:

> cor(alpha_new[(alpha_new ^ 2)>0.000025] ^ 2, log(sigma_new[(alpha_new ^ 2)>0.000025]))
[1] 0.5878579
> cor(alpha_new[(alpha_new ^ 2)<0.000025] ^ 2, log(sigma_new[(alpha_new ^ 2)<0.000025]))
[1] -0.09757556
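
The same above/below-bound diagnostic can be written in Python (a hypothetical `split_correlation` helper mirroring the R lines above):

```python
import numpy as np

def split_correlation(alpha, sigma, bound=0.000025):
    # Correlate alpha^2 with log(sigma) separately for points whose
    # squared intercept is above vs below the bound -- a crude check
    # for the asymmetric, non-linear dependence described above.
    a2 = alpha ** 2
    hi = a2 > bound
    lo = a2 < bound
    c_hi = np.corrcoef(a2[hi], np.log(sigma[hi]))[0, 1]
    c_lo = np.corrcoef(a2[lo], np.log(sigma[lo]))[0, 1]
    return c_hi, c_lo
```

A large gap between the two correlations is the signature of the kink: the dependence between intercept and scale is strong far from zero but weak near it.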

This also makes it clear why when I had tried to simulate the parameters from a multivariate normal distribution I didn't have any problems. The problem isn't there when the distribution is multivariate normal.

Anyway, I tried to incorporate this into the simple model without much luck.

Bob Carpenter

unread,
Apr 5, 2017, 6:12:32 PM4/5/17
to stan-...@googlegroups.com
Thanks for the report.

You'll find Stan has a lot easier time adapting if the parameters
are on the scale of 1 rather than 1e-3. If you give Stan long
enough to warmup, it should adapt to the right scales, but it might
take a while.

- Bob

John Hall

unread,
Apr 6, 2017, 6:43:03 PM4/6/17
to stan-...@googlegroups.com
I'm hoping to hang my hat on the non-linear relationship between intercepts and standard deviations.

I had tried normalizing the variables in the past and it didn't make nearly as much of a difference as when I did simulations where the relationship was more linear or uncorrelated. I also multiplied the number of iterations by 10 (to 10,000 total post-warmup across 4 chains) and that didn't make a noticeable impact.

When normalizing, I had subtracted each series by the overall mean and divided by the overall standard deviation. So in this case, the cross-sectional variation in the intercepts and standard deviations increased by about a factor of 10, but not up to 1. The standard deviation of the intercepts becomes something like 0.1 empirically, but most of the time when I run Stan it still gets shoved down to 0.01 in some of the simpler variations, with divergences (in a simple simulation, but one preserving the non-linearity).

It doesn't make sense to me to subtract each series' mean and divide by each series' standard deviation. If I fit a hierarchical model, all the means should be 0 and the standard deviations 1, which kind of defeats the purpose of doing a hierarchical regression. This actually makes the standard deviation of the cross-sectional means even smaller (and produces more divergences).
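
The two normalizations differ in exactly what cross-sectional information they keep; a quick sketch (NumPy, with a hypothetical T x N panel `Y`; the helper names are made up):

```python
import numpy as np

def normalize_overall(Y):
    # Pooled scaling: one mean and sd for the whole panel, so
    # cross-sectional differences between series survive.
    return (Y - Y.mean()) / Y.std(ddof=1)

def normalize_per_series(Y):
    # Series-wise scaling: every column ends up with mean 0 and sd 1,
    # which removes exactly the variation a hierarchical model targets.
    return (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
```

With pooled scaling the column means still differ after the transform; with series-wise scaling they are all forced to zero.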


Bob Carpenter

unread,
Apr 7, 2017, 1:56:31 PM4/7/17
to stan-...@googlegroups.com
I just mean rather than having

alpha ~ normal(0, 1e-3);

define

alpha_raw ~ normal(0, 1);

alpha = 1e-3 * alpha_raw;

Then the parameters over which Stan is actually sampling, alpha_raw,
will be unit scale, but you'll still have your original value alpha.
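
A quick simulation of why the rescaling is distributionally equivalent (a Python stand-in for the Stan snippet above, not Stan code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample on unit scale, then rescale: alpha = 1e-3 * alpha_raw has the
# normal(0, 1e-3) distribution, but the "sampler" only ever works with
# the unit-scale alpha_raw.
alpha_raw = rng.normal(0.0, 1.0, size=200_000)
alpha = 1e-3 * alpha_raw
```

The draws of alpha match the original normal(0, 1e-3) prior, while adaptation happens entirely on the well-scaled alpha_raw.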

- Bob

John Hall

unread,
Apr 10, 2017, 5:58:09 PM4/10/17
to stan-...@googlegroups.com
Ah. Well, my sense after playing around with this a bit is that what you're saying is similar to a non-centered reparameterization. In my case, I have enough data in the cross-section (>500) that I usually get worse results with a non-centered reparameterization. I think that's why I see worse results with the approach you're talking about.

Something I just tried helps illustrate why I think the problem is the non-linear relationship. I calculated the Mahalanobis distance of the empirical intercepts and log standard deviations. I then removed the top 5% or so of the results, simulated data, and ran the relevant Stan models. No divergences and about 400/1000 N_eff on lp__ and all the results look sensible.
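
The outlier-trimming step can be sketched like this (NumPy; the `mahalanobis_trim` helper name is made up):

```python
import numpy as np

def mahalanobis_trim(X, keep=0.95):
    # X: n x 2 array of (intercept, log sd) pairs.  Drop the rows with
    # the largest squared Mahalanobis distance from the sample mean,
    # keeping the given fraction of points.
    mu = X.mean(axis=0)
    prec = np.linalg.inv(np.cov(X, rowvar=False))
    d = X - mu
    d2 = np.einsum('ij,jk,ik->i', d, prec, d)    # squared distances
    return X[d2 <= np.quantile(d2, keep)]
```

Trimming on Mahalanobis distance removes the joint outliers (the points driving the non-linear dependence), not just marginal extremes.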

I'm still trying a few things. In particular, I had the idea last week to do a mixture distribution for the intercepts and log standard deviations. My first attempt didn't go so well (I had tried to do a mixture letting only the correlation change, but it's kind of tricky because the only way to constrain an ordered vector is to make it positive), but I'm still at the early stages of trying stuff.


John Hall

unread,
Apr 10, 2017, 6:23:29 PM4/10/17
to stan-...@googlegroups.com
Hmmm, I noticed a bug in the code I thought illustrated the problem well. I had accidentally removed the hierarchical prior from the intercepts, so that it just included a separate intercept for each group. Adding back the hierarchical prior results in divergences... even with the outliers removed...

John Hall

unread,
Apr 11, 2017, 6:16:01 PM4/11/17
to stan-...@googlegroups.com
It seems like the problem is more evident the less data there is. I simulated 10, 5, 2.5, 2, 1.5, 1, and 0.5 times the amount of data in the original data set (10 years of monthly data); the results looked great at 5 and 10 times and then steadily got worse as less data was available.

In the past, I have used a formula from Gelman and Hill (21.14) to determine the amount of pooling in models: near 0 there is little pooling; near 1 there is a lot of pooling. Sure enough, the pooling factor for the varying standard deviations was near 0 (little to no pooling) across all the simulations, but for the varying intercepts, the more data was available, the closer it was to 0 (and estimated well), and the less data available, the closer it was to 1.

So my sense is that I shouldn't worry too much that it isn't estimated all that well...

John Hall

unread,
Apr 17, 2017, 10:14:24 AM4/17/17
to stan-...@googlegroups.com
This should be my final word on the subject. I think I finally have a decent understanding of why I was getting the results.

I did the above a little more thoroughly, plotting a wider range. Basically, I simulated data as above, such that T/N for the data ranged from around 0.05 to 10. I then calculated the lambda formula I mentioned above, which looks something like
lambda = 1 - var(apply(e, 2, mean)) / mean(apply(e, 1, var))
where e is the de-meaned parameters (the varying intercepts and standard deviations). I had saved both the numerator and denominator to get a better sense of everything.
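
The pooling factor can be computed directly (NumPy port of the R line above; `e` is assumed to be an S x J matrix of posterior draws of the de-meaned group errors):

```python
import numpy as np

def pooling_factor(e):
    # e: S x J matrix -- S posterior draws of J de-meaned group-level
    # errors.  Gelman & Hill eq. 21.14: lambda near 0 means little
    # pooling; near 1 means nearly complete pooling.
    num = np.var(e.mean(axis=0), ddof=1)     # var across groups of posterior means
    den = np.mean(e.var(axis=1, ddof=1))     # mean across draws of cross-group var
    return 1.0 - num / den
```

When groups have clearly distinct means, the numerator and denominator nearly cancel (lambda near 0); when the group means are indistinguishable from noise, the numerator collapses and lambda approaches 1.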

In both cases, plotting lambda as a function of T/N looks like a 1/x graph. However, for small T/N, the varying standard deviation lambda is relatively small, whereas the varying intercept lambda is close to 1. By the time T/N gets to 2 or so, the lambdas get relatively flat at low levels (the standard deviation lambda is still lower, though).

It was also interesting to plot the numerator and the denominator of lambda against each other. For the varying standard deviation, it looked pretty linear, but the points eventually bunch up at some long-term values. However, the values for the varying intercept look a little different. For larger T/N, it looks linear, like the varying standard deviations. However, for smaller T/N, the numerator goes to zero faster than the denominator. So the bottom looks a bit like a square root plot, but the top is more linear.

So the way I read this is that when T/N is small, the intercepts all cluster around the pooled mean, but as T/N increases, then it increasingly becomes possible to differentiate between them.

The issue became a little more obvious to me when I wrote out what a multivariate distribution of all the data would look like. In particular, for the covariance, there would be a block-diagonal matrix representing the covariance of the varying intercepts, plus a diagonal matrix with a different value for each N representing the different variances. Compared to a simpler model with only varying intercepts, the second part would just be a constant diagonal matrix. This would allow the varying intercepts to pick up all the variance that differs between the Ns. However, when the variance is able to vary (and it varies much more than the intercept in my case), the intercept is just left to pick up the covariance within the Ns. When T/N is small, it is much harder to pick up this relationship, since the standard deviation of the alphas is relatively smaller.
