Varying intercepts, varying slopes model


Shravan Vasishth

Dec 15, 2013, 5:47:41 PM
to stan-...@googlegroups.com
I just fit a varying intercepts, varying slopes model analogous to this model in lmer:

lmer(y ~ x + (1+x|fac1) + (1|fac2),data)

My question is: do I compute the estimate for the correlation parameter (the correlation between the varying intercepts and varying slopes) by looking at the variance-covariance matrix estimated for the random effects, and computing the estimate of rho from there? In JAGS, I could have defined a prior on rho and explicitly specified that it has to be estimated.

Here's the Stan code, and below that is the JAGS code:

## this is what my data's specifications look like:
headnoun.dat <- list(mu_prior = c(0, 0),
                     subj = sort(as.integer(factor(headnoun$subj))),
                     item = sort(as.integer(factor(headnoun$item))),
                     rrt = headnoun$rrt,
                     so = headnoun$so,
                     N = nrow(headnoun),
                     I = length(unique(headnoun$subj)),
                     K = length(unique(headnoun$item)))


varyingintslopes_code <- 'data {
  int<lower=1> N;
  real rrt[N];                    // outcome
  real so[N];                     // predictor
  int<lower=1> I;                 // number of subjects
  int<lower=1> K;                 // number of items
  int<lower=1, upper=I> subj[N];  // subject id
  int<lower=1, upper=K> item[N];  // item id
  vector[2] mu_prior;             // vector of zeros passed in from R
}
parameters {
  vector[2] beta;              // intercept and slope
  vector[2] u[I];              // random intercept and slope by subject
  real w[K];                   // random intercept by item
  real<lower=0> sigma_e;       // residual sd
  vector<lower=0>[2] sigma_u;  // subj sd
  real<lower=0> sigma_w;       // item sd
  corr_matrix[2] Omega;        // correlation matrix for random intercepts and slopes
}
transformed parameters {
  cov_matrix[2] Sigma;  // variance-covariance matrix for subject random effects
  for (r in 1:2) {
    for (c in 1:2) {
      Sigma[r,c] <- sigma_u[r] * sigma_u[c] * Omega[r,c];
    }
  }
}
model {
  real mu[N];  // mu for likelihood
  for (i in 1:I)
    u[i] ~ multi_normal(mu_prior, Sigma);  // subject random effects
  for (k in 1:K)
    w[k] ~ normal(0, sigma_w);             // item random effects
  for (n in 1:N) {
    // includes the item random effect w[item[n]], as in the JAGS model below
    mu[n] <- beta[1] + beta[2] * so[n]
             + u[subj[n], 1] + u[subj[n], 2] * so[n] + w[item[n]];
  }
  rrt ~ normal(mu, sigma_e);  // likelihood
  beta ~ normal(0, 5);
  sigma_e ~ cauchy(0, 2);
  sigma_u ~ cauchy(0, 2);
  sigma_w ~ cauchy(0, 2);
  Omega ~ lkj_corr(2.0);
}
'

##JAGS:

cat("
# Fixing data to be used in model definition
data
{
    zero[1] <- 0
    zero[2] <- 0
}
    # Then define model
    model
{
    # Intercept and slope for each person, including random effects
    for( i in 1:I )
{
    u[i,1:2] ~ dmnorm(zero,Omega.u)
}
    
    # Random effects for each item
    for( k in 1:K )
{
    w[k] ~ dnorm(0,tau.w)
}
    
    # Define model for each observational unit
    for( j in 1:N )
{
    mu[j] <- ( beta[1] + u[subj[j],1] ) +
    ( beta[2] + u[subj[j],2] ) * ( so[j] ) + w[item[j]]
    rrt[j] ~ dnorm( mu[j], tau.e )
}
    
    #------------------------------------------------------------
    # Priors:
    
    # Fixed intercept and slope (uninformative)
    beta[1] ~ dnorm(0.0,1.0E-5)
    beta[2] ~ dnorm(0.0,1.0E-5)
    
    # Residual variance
    tau.e <- pow(sigma.e,-2)
    sigma.e  ~ dunif(0,100)
    
    # Define prior for the variance-covariance matrix of the random effects for subjects
    ## precision:
    Omega.u  ~ dwish( R, 2 )
    ## R matrix:
    R[1,1] <- pow(sigma.a,2)
    R[2,2] <- pow(sigma.b,2)
    R[1,2] <- rho*sigma.a*sigma.b
    R[2,1] <- R[1,2]
    ## Vcov matrix:
    Sigma.u <- inverse(Omega.u)
    ## priors for var int. var slopes
    sigma.a ~ dunif(0,10)
    sigma.b ~ dunif(0,10)
    
    ## prior for correlation:
    rho ~ dunif(-1,1)

    # Between-item variation
    tau.w <- pow(sigma.w,-2)
    sigma.w  ~ dunif(0,10)   
        
}",
     file="intslopeuninfcorr.jag" )


Sergio Polini

Dec 16, 2013, 1:45:30 AM
to stan-...@googlegroups.com
I'm very interested in your model.
Could you please attach a dataset?
Thanks
Sergio

Bob Carpenter

Dec 16, 2013, 2:27:13 AM
to stan-...@googlegroups.com


On 12/15/13, 5:47 PM, Shravan Vasishth wrote:
> I just fit a varying intercepts, varying slopes model analogous to this model in lmer:
>
> lmer(y ~ x + (1+x|fac1) + (1|fac2),data)
>
> My question is: do I compute the estimate for the correlation parameter (the correlation between the varying intercepts
> and varying slopes) by looking at the variance-covariance matrix estimated for the random effects, and computing the
> estimate of rho from there? In JAGS, I could have defined a prior on rho and explicitly specified that it has to be
> estimated.

These days we like to do what you did and use a scaled correlation matrix
as a hierarchical prior. With the LKJ(2) prior you're using, there's a slight
regularization toward the identity correlation matrix.

You can then just read the correlation out of
the entries in the correlation matrix in the posterior. The usual
estimate would be the posterior mean (which is itself a correlation
matrix because the class of correlation matrices is closed under averaging).
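
For example, a minimal sketch in R (assuming the model was fit with
rstan into an object called fit, and using the parameter names from
your program):

library(rstan)
# extract() pools the post-warmup draws from all chains
Omega_draws <- extract(fit, pars = "Omega")$Omega  # iterations x 2 x 2 array
rho_draws <- Omega_draws[, 1, 2]         # posterior draws of the correlation
mean(rho_draws)                          # posterior mean estimate of rho
quantile(rho_draws, c(0.025, 0.975))     # 95% posterior interval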

You could also have translated the JAGS model directly including
the Wishart priors and read out rho in the same way as you did there.

If you have some other kind of covariance prior, you can just calculate
correlation in the usual way (divide by scales of the parameters).

The best overview I know of the available priors in Stan is this paper
from Goodrich et al. which we've discussed here before:

http://www.stat.columbia.edu/~gelman/research/unpublished/Visualization.pdf

Ben's also created some code to convert the R/lmer hierarchical model notation
into JAGS models --- I don't know how general a model it can handle.

- Bob

P.S. Models are a lot easier to read with indentation for blocks.

Shravan Vasishth

Dec 16, 2013, 2:28:56 AM
to stan-...@googlegroups.com
You can download all data and code from a course on data analysis that I'm teaching right now:

http://www.ling.uni-potsdam.de/~vasishth/advanceddataanalysis.html

Here are the data: http://www.ling.uni-potsdam.de/~vasishth/data/gibsonwu2012data.txt

And here is all the relevant code: http://www.ling.uni-potsdam.de/~vasishth/lmmexamplecode.txt

My question really is: if I want to look at the posterior distribution of the correlation parameter in Stan, should I just construct the variance covariance matrix cell-wise as I did in the corresponding JAGS model? And related to that, when I do not specify a prior for the correlation, is it legitimate to compute correlation from the estimated variance components for varying intercept and slope? If I know these two quantities and the value in the off-diagonal, I can compute the correlation even though it is never mentioned in the model. I hope my question is making sense.

Bob Carpenter

Dec 16, 2013, 3:04:50 AM
to stan-...@googlegroups.com


On 12/16/13, 2:28 AM, Shravan Vasishth wrote:
...

> My question really is: if I want to look at the posterior distribution of the correlation parameter in Stan, should I
> just construct the variance covariance matrix cell-wise as I did in the corresponding JAGS model?

It would be easier to see how these models align if you
used the same variable names. The rho in your JAGS model
is used as follows:


> sigma.a ~ dunif(0,10)
> sigma.b ~ dunif(0,10)

> rho ~ dunif(-1,1)

> R[1,1] <- pow(sigma.a,2)
> R[2,2] <- pow(sigma.b,2)
> R[1,2] <- rho*sigma.a*sigma.b
> R[2,1] <- R[1,2]

> Omega.u ~ dwish( R, 2 )

> Sigma.u <- inverse(Omega.u)

> # Intercept and slope for each person, including random effects
> for( i in 1:I )
> {
> u[i,1:2] ~ dmnorm(zero,Omega.u)
> }

I don't understand why you have so many levels. I would've
expected R to be a constant because there's only a single Omega.u.

The usual correlation to expect would be for the hierarchical
prior for u:

Sigma.u[1,2] / ( sqrt(Sigma.u[1,1]) * sqrt(Sigma.u[2,2]) )
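
In Stan you could compute the same quantity in a generated quantities
block, e.g. (a sketch against the Sigma in your Stan program; with the
scaled-correlation parameterization you can also just monitor
Omega[1,2] directly):

generated quantities {
  real rho;
  rho <- Sigma[1,2] / (sqrt(Sigma[1,1]) * sqrt(Sigma[2,2]));
}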



Stan:

> parameters {
> ...
> corr_matrix[2] Omega; // correlation matrix for random intercepts and slopes

> transformed parameters {
> cov_matrix[2] Sigma; // constructing the variance/cov matrix for random effects
> for (r in 1:2) {
> for (c in 1:2) {
> Sigma[r,c] <- sigma_u[r] * sigma_u[c] * Omega[r,c];
> }
> }
> model {
> for (i in 1:I)
> u[i] ~ multi_normal(mu_prior, Sigma); // loop for subj random effects


And the Sigma.u in JAGS is (I'm pretty sure) the same Sigma
as in your Stan model (though with a different prior).


> And related to that,
> when I do not specify a prior for the correlation,

If you have a correlation parameter, you need to constrain it to (-1,1),
which provides a proper uniform prior. A correlation matrix without
a prior also gets an implicit proper uniform distribution over all correlation
matrices.
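
For the scalar case, that's just a bounded declaration in the
parameters block:

real<lower=-1, upper=1> rho;

which gets the implicit proper uniform prior over (-1,1).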

> is it legitimate to compute correlation from the estimated variance
> components for varying intercept and slope? If I know these two quantities and the value in the off-diagonal, I can
> compute the correlation even though it is never mentioned in the model. I hope my question is making sense.

I'm not sure what you mean here.

- Bob

Shravan Vasishth

Dec 16, 2013, 5:51:01 AM
to stan-...@googlegroups.com
I'll make my code consistent across JAGS and Stan and re-post my questions more clearly, with examples.

--
Shravan Vasishth
Professor für Psycholinguistik und Neurolinguistik
Department Linguistik
Universität Potsdam, D-14476 Potsdam
http://www.ling.uni-potsdam.de/~vasishth





Sergio Polini

Dec 16, 2013, 9:13:26 AM
to stan-...@googlegroups.com
On 16/12/2013 08:28, Shravan Vasishth wrote:
> You can download all data and code from a course on data analysis that
> I'm teaching right now:
>
> http://www.ling.uni-potsdam.de/~vasishth/advanceddataanalysis.html
>
> Here are the
> data: http://www.ling.uni-potsdam.de/~vasishth/data/gibsonwu2012data.txt
>
> And here is all the relevant
> code: http://www.ling.uni-potsdam.de/~vasishth/lmmexamplecode.txt

Thanks!

> My question really is: if I want to look at the posterior distribution
> of the correlation parameter in Stan, should I just construct the
> variance covariance matrix cell-wise as I did in the corresponding JAGS
> model?

If you define priors for the std devs, you can:
a) construct a covariance matrix, then compute the correlation(s);
b) construct a diagonal matrix, D = diag(sigma), and define a correlation
matrix.
In the first case, you'll use:

for (i in 1:I) u[i] ~ multi_normal(mu_prior, Sigma);

In the second one, since
Sigma = D * Omega * D = D(LL')D = (DL)(DL)'
where L is the Cholesky factor, you can use:

matrix[2,2] D;
matrix[2,2] L;
matrix[2,2] DL;
D <- diag_matrix(sigma_u);
L <- cholesky_decompose(Omega);
DL <- D * L;
for (i in 1:I) u[i] ~ multi_normal_cholesky(mu_prior, DL);

which is more efficient (example attached: vasishth2.stan).
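
If you also want Sigma and rho in the output with this
parameterization, you can recover them in generated quantities. A
sketch (assuming DL is declared as a transformed parameter rather than
a local variable, so that it is in scope;
multiply_lower_tri_self_transpose(DL) computes DL * DL'):

generated quantities {
  cov_matrix[2] Sigma;
  real rho;
  Sigma <- multiply_lower_tri_self_transpose(DL);
  rho <- Omega[1,2];
}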

> And related to that, when I do not specify a prior for the
> correlation, is it legitimate to compute correlation from the estimated
> variance components for varying intercept and slope?

Omega without a prior is just implicitly Omega ~ lkj_corr(1), i.e. a
uniform (joint) prior. The correlations are right there in Omega, so
you don't need to compute them.

> If I know these two
> quantities and the value in the off-diagonal, I can compute the
> correlation even though it is never mentioned in the model.

Of course (toy example attached: corr.R).

> I hope my
> question is making sense.

I hope my answers make sense!

Sergio (just a learner)

vasishth2.stan
corr.R

Shravan Vasishth

Dec 16, 2013, 9:27:57 AM
to stan-...@googlegroups.com
This is fantastic, Sergio. Thanks very much. I think this addresses all my questions. Thanks also for the (implicit) guidance on cleaner code presentation.

I will post all the different versions on my course page once I am done setting up and testing the models. 

best,


--
Shravan Vasishth
Professor für Psycholinguistik und Neurolinguistik
Department Linguistik
Universität Potsdam, D-14476 Potsdam
http://www.ling.uni-potsdam.de/~vasishth


Bob Carpenter

Dec 16, 2013, 11:29:44 AM
to stan-...@googlegroups.com


On 12/16/13, 9:13 AM, Sergio Polini wrote:
> On 16/12/2013 08:28, Shravan Vasishth wrote:
>> You can download all data and code from a course on data analysis that
>> I'm teaching right now:
...

> I hope my answers make sense!

Absolutely.

Thanks for chiming in with great answers. We've been hoping to
get more users contributing on the lists.

> Sergio (just a learner)

We're all just Stan learners! Even though we wrote the language,
we still have a lot to learn in using it.

Applied modeling is too broad and domain-specific for anyone to
master it all --- though Andrew and Ben can answer most questions
on this front.

- Bob

Bob Carpenter

Dec 16, 2013, 11:33:09 AM
to stan-...@googlegroups.com


On 12/16/13, 5:51 AM, Shravan Vasishth wrote:
...
> On Mon, Dec 16, 2013 at 9:04 AM, Bob Carpenter <ca...@alias-i.com <mailto:ca...@alias-i.com>> wrote:
...
> > And related to that,
>
> when I do not specify a prior for the correlation,
>
>
> If you have a correlation parameter, you need to constrain it to (-1,1),
> which provides a proper uniform prior. A correlation matrix without
> a prior also gets an implicit proper uniform distribution over all correlation
> matrices.

Which, as Sergio points out, is the same as an lkj_correlation(1) prior,
because in lkj_correlation(nu), nu - 1 is a power, so when nu = 1, the
density is constant. Just like the beta or Dirichlet in that regard.
When nu > 1, the probability mass concentrates around the identity
correlation matrix. You can work out explicitly what it does for a 2 x 2
correlation matrix by evaluating the determinant, which involves only one
free variable (the matrix is symmetric with diagonal constrained to 1):
the determinant is 1 - rho^2, so the density is proportional to
(1 - rho^2)^(nu - 1).

- Bob

Shravan Vasishth

Dec 16, 2013, 12:14:58 PM
to stan-...@googlegroups.com
Thanks, Bob. I need a from-first-principles introduction to lkj_correlation. I have no idea what it is, and the reference manual assumes one knows. I searched for it a bit, but I just get obscure Stan discussions that don't help.

thanks,


--
Shravan Vasishth
Professor für Psycholinguistik und Neurolinguistik
Department Linguistik
Universität Potsdam, D-14476 Potsdam
http://www.ling.uni-potsdam.de/~vasishth





Andrew Gelman

Dec 16, 2013, 12:38:46 PM
to stan-...@googlegroups.com
It's in appendix A of BDA, 3rd edition.


Christian Roy

Dec 16, 2013, 12:49:44 PM
to stan-...@googlegroups.com
I believe that lkj_correlation refers to the Lewandowski et al. paper in the Journal of Multivariate Analysis:

Generating random correlation matrices based on vines and extended onion method

http://www.sciencedirect.com/science/article/pii/S0047259X09000876

Good luck

Chris
----------------

Shravan Vasishth

Dec 16, 2013, 1:53:42 PM
to stan-...@googlegroups.com
Thanks Chris, I'll try to figure out what's in that article. Andrew, I did look at the appendix in BDA. I think that a little bit more explanation would help people like me who don't know much.

--
Shravan Vasishth
Professor für Psycholinguistik und Neurolinguistik
Department Linguistik
Universität Potsdam, D-14476 Potsdam
http://www.ling.uni-potsdam.de/~vasishth



Bob Carpenter

Dec 16, 2013, 5:04:12 PM
to stan-...@googlegroups.com


On 12/16/13, 12:49 PM, Christian Roy wrote:
> I believe that the lkj_correlation refer to the Lewandowski et al. paper in the Journal of Multivariate Analysis
>
> Generating random correlation matrices based on vines and extended onion method
>
> http://www.sciencedirect.com/science/article/pii/S0047259X09000876
>
> Good luck

That paper's pretty difficult. I think Ben's the only
one who fully understands it other than the authors!

The density is quite simple, though:

lkj_corr(Sigma|nu) propto det(Sigma)^(nu - 1)

where det(Sigma) is the determinant of Sigma.

It works like the Beta distribution in that if nu > 1, there's
a mode at the identity matrix, but if nu < 1, there's a trough
at the identity, whereas at nu = 1, it's flat.

But that's just repeating what our manual says.

The way to get intuitions about how it works is to draw
random samples and visualize them as in Ben's paper.
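
For a 2 x 2 matrix you can even do that in plain R, since the single
correlation under LKJ(nu) is marginally a Beta(nu, nu) variate rescaled
from (0,1) to (-1,1) (a quick sketch):

nu <- 2
rho <- 2 * rbeta(1e4, nu, nu) - 1  # draws of the implied correlation
hist(rho)  # mildly peaked at 0 for nu = 2; flat for nu = 1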

- Bob

Sergio Polini

Dec 16, 2013, 5:54:25 PM
to stan-...@googlegroups.com
On 16/12/2013 19:53, Shravan Vasishth wrote:
> Thanks Chris, I'll try to figure out what's in that article. Andrew, I
> did look at the appendix in BDA. I think that a little bit more
> explanation would help people like me who don't know much.

Well, let me try. "Someone" (!) will surely correct my mistakes.

Defining priors for variances looks easy (it's not that easy, but let's
go on).
Defining priors for 2x2 covariance matrices is easy too (see above): two
variances and one correlation, just a number in [-1,1].
But defining priors for KxK, K>2, covariance matrices is difficult
because of constraints among the correlation parameters. So you have to
define a prior for a large (K>2) covariance matrix "en bloc".

One often uses the inverse-Wishart distribution, but its parameters are
difficult to interpret, and a single parameter controls the precision of
all elements of the covariance matrix [1].
BTW: reading
http://andrewgelman.com/2012/08/29/more-on-scaled-inverse-wishart-and-prior-independence/
is mandatory!

This is why Barnard, McCulloch and Meng [2] recommend a separation
strategy, i.e.:

Sigma = D R D

where D is a diagonal matrix of std devs and R a correlation matrix.
This is pretty useful if you are willing to express prior beliefs about
the standard deviations, but less willing to do so about the correlation
matrix (as in your models, I'd say).

But modeling a correlation matrix is a tough task. So O'Malley and
Zaslavsky [1] have suggested a scaled inverse-Wishart:

Sigma = diag(xi) Q diag(xi), Q ~ Inv-Wishart(...)

For example, you could set the degrees of freedom to K+1 and get a
uniform marginal distribution for the correlations, then "correct" the
constrained variances by a vector xi of scale parameters.
See Gelman & Hill, pp. 286-287.
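
In Stan syntax, that scaled inverse-Wishart looks roughly like this
(a sketch for the 2x2 case; the names and the identity scale matrix
ident, which would be passed in as data, are mine):

parameters {
  cov_matrix[2] Q;
  vector<lower=0>[2] xi;   // scale corrections
}
transformed parameters {
  cov_matrix[2] Sigma_u;   // Sigma = diag(xi) Q diag(xi)
  for (r in 1:2)
    for (c in 1:2)
      Sigma_u[r,c] <- xi[r] * xi[c] * Q[r,c];
}
model {
  Q ~ inv_wishart(3, ident);  // df = K + 1 = 3: uniform marginal correlations
  xi ~ cauchy(0, 2);          // half-Cauchy, given the lower=0 constraint
}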

A few years ago, Lewandowski, Kurowicka and Joe [3] found efficient
algorithms to generate random correlation matrices whose distribution
depends on a single "eta" parameter; if eta = 1, their distribution
is jointly uniform.
So the original separation strategy is now feasible and very efficient
in Stan.
Note that the joint uniformity results in marginal priors for the
individual correlations that are not uniform: they favor values close
to zero over values close to +/-1, and increasingly so as K grows. That
makes sense, though, because the positive-definiteness constraint
becomes more restrictive as the correlations move away from zero [2].

HTH

Sergio

--------------------------
[1] A. James O'Malley & Alan M. Zaslavsky, "Domain-Level Covariance
Analysis for Multilevel Survey Data With Structured Nonresponse",
Journal of the American Statistical Association, 103:484, 1405-1418,
http://dx.doi.org/10.1198/016214508000000724
[2] John Barnard, Robert McCulloch and Xiao-Li Meng, "Modeling
Covariance Matrices in Terms of Standard Deviations and Correlations,
with Application to Shrinkage", Statistica Sinica 10 (2000), 1281-1311,
http://www3.stat.sinica.edu.tw/statistica/oldpdf/A10n416.pdf
[3] You know:
http://www.sciencedirect.com/science/article/pii/S0047259X09000876

Shravan Vasishth

Dec 17, 2013, 2:04:03 AM
to stan-...@googlegroups.com
Dear Sergio,

Thanks for that. This explanation is excellent (you have a rare skill).  


--
Shravan Vasishth
Professor für Psycholinguistik und Neurolinguistik
Department Linguistik
Universität Potsdam, D-14476 Potsdam
http://www.ling.uni-potsdam.de/~vasishth

