Optimising a phylogenetic mixed model in rstan


Diego Barneche

Jan 24, 2017, 1:42:38 AM
to stan-...@googlegroups.com
Hello all,

I have been trying to implement a Gaussian phylogenetic mixed model, similar to what is presented in Hadfield and Nakagawa 2010 [link here; also check the equivalent JAGS code in the appendix here].

My model has 5 fixed effects and 2 random effects on the intercept. My questions concern optimising the parameter specification with regard to the variance components for the random effects. Species (tips in a phylogeny) are associated with two independently distributed variance components:

(1) one variance component that is attributable to the shared ancestry among species, which is obtained from the correlation matrix generated from the phylogeny (it includes both nodes and tips, coefficients named gammaPhy, standard deviation sigmaPhy, variance-covariance matrix is SigmaPhy). Here I estimate a single standard deviation (sigmaPhy in the model) which weights the phylogeny-derived correlation matrix (S provided in the model) in order to generate a prior variance-covariance matrix (SigmaPhy) for a multivariate normal distribution.

(2) one standard deviation component (sigmaSpp) that contains the unique among-species deviations on the intercept (gammaSpp) once I account for (1).

ZI and ZII are the provided model matrices for the random effects with species (or species + nodes in the case of ZII) as columns and observations as rows. They are structurally similar, but the difference is that SigmaPhy is weighted by the correlation matrix and thus affects the estimates of gammaPhy while the other (sigmaSpp) is not weighted, i.e. it is supposed to estimate the variance that is attributable to each species independently after the residual variance attributable to their shared ancestry is estimated. That is also why I inserted two separate model matrices for clarification purposes.

The model below compiles and runs, but EXTREMELY SLOWLY. For example, a similar model using 1.5e6 iterations in MCMCglmm runs in 20 min, whereas in Stan it hasn't even reached 10% after 2 hours with only 1e4 iterations. I've read posts in this group and the manual many times, and it seems that I could use a Cholesky decomposition for my correlation matrix and estimate everything accordingly. Unfortunately I cannot seem to wrap my head around the whole Cholesky-decomposition concept, so I would really appreciate any help you can give me to optimise the code below.

Thanks very much for your time and help.
With best wishes,
Diego
P.S. Please forgive me if my language is not precise; I am not a statistician.


data {
    int<lower=1> N;               // Number of observations
    int<lower=0> P;               // Number of fixed-effects predictors
    int<lower=0> J;               // Number of groups (species)
    int<lower=0> K;               // Number of phylogeny-derived correlation matrix columns; includes both nodes and tips
    real Y[N];                    // Response variable
    matrix[N,P] X;                // Fixed-effect predictor matrix
    matrix[N,J] ZI;               // Random-effect predictor matrix for across-species unique effects
    matrix[N,K] ZII;              // Random-effect predictor matrix for heritable between-species effects (i.e. phylogeny-derived)
    vector[K] zeroes;             // Dummy vector of zeroes for the MVN mean
    corr_matrix[K] S;             // Phylogenetic correlation matrix (tips + nodes)
}

parameters {
    vector[P] beta;               // Vector of fixed-effect estimates
    vector[J] gammaSpp;           // Among-species random-effect coefficients
    real<lower=0> sigmaSpp;       // Among-species random-effect standard deviation
    vector[K] gammaPhy;           // Phylogeny-derived random-effect coefficients
    real<lower=0> sigmaPhy;       // Standard deviation attributable to heritability among species (i.e. phylogeny-derived)
    real<lower=0> sigmaY;         // Error scale
}

transformed parameters {
    vector[N] mu;                 // Linear mean predictor
    cov_matrix[K] SigmaPhy;       // Variance-covariance matrix for the multivariate normal distribution
    SigmaPhy = (sigmaPhy ^ 2) * S;  // Eqn 22 in Hadfield and Nakagawa 2010; sigmaPhy is a standard deviation, so it needs to be squared
    mu = X * beta + ZI * gammaSpp + ZII * gammaPhy;
}

model {
    // Priors
    gammaSpp ~ normal(0, sigmaSpp);
    sigmaSpp ~ cauchy(0, 5);
    gammaPhy ~ multi_normal(zeroes, SigmaPhy);
    sigmaPhy ~ cauchy(0, 5);
    beta     ~ normal(0, 5);
    sigmaY   ~ cauchy(0, 5);

    // Likelihood
    Y ~ normal(mu, sigmaY);
}





Matt Espe

Jan 25, 2017, 12:22:57 PM
to Stan users mailing list
Hi,

The manual has a really excellent walk-through on how to convert from ~multi_normal to ~multi_normal_cholesky. You should be able to just follow along.
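Roughly, with your variable names, the converted pieces would look like this (just a sketch, untested):

```stan
transformed data {
    matrix[K,K] LPhy;
    LPhy = cholesky_decompose(S);   // done once, since S is data
}
...
model {
    // equivalent to gammaPhy ~ multi_normal(zeroes, (sigmaPhy ^ 2) * S)
    gammaPhy ~ multi_normal_cholesky(zeroes, sigmaPhy * LPhy);
}
```

That way nothing has to factorize SigmaPhy at every iteration.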

There are two other tweaks that might help. Depending on how many random effects you have, changing from matrix multiplication to indexing can help a ton.

For example, this:
data {
    ...
    matrix[N,J] ZI;               // Random-effect predictor matrix for across-species unique effects
    matrix[N,K] ZII;              // Random-effect predictor matrix for heritable between-species effects
}

parameters {
    ...
}

transformed parameters {
    ...
    mu = X * beta + ZI * gammaSpp + ZII * gammaPhy;
}


To this:

data {
    ...
    int<lower=1, upper=J> ZI[N];  // Random-effect index of group membership for across-species unique effects
    int<lower=1, upper=K> ZII[N]; // Random-effect index of group membership for heritable between-species effects
}

parameters {
    ...
}

transformed parameters {
    ...
    mu = X * beta + gammaSpp[ZI] + gammaPhy[ZII];
}

Can be much faster for large numbers of random effects.

Second, you can specify the random effects using the non-centered parameterization. Check out the manual, as it has a really good section on this as well.
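For your species intercepts that would look roughly like this (a sketch using your names; the phylogenetic term gets the same treatment through the Cholesky factor):

```stan
parameters {
    vector[J] alphaSpp;     // unscaled ('raw') coefficients
    real<lower=0> sigmaSpp;
}
transformed parameters {
    vector[J] gammaSpp;
    gammaSpp = sigmaSpp * alphaSpp;   // implies gammaSpp ~ normal(0, sigmaSpp)
}
model {
    alphaSpp ~ normal(0, 1);          // standard normal, independent of sigmaSpp
}
```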

Hope that helps,

Matt

Paul Buerkner

Jan 25, 2017, 6:37:51 PM
to stan-...@googlegroups.com
You might be able to fit this model with the R package brms. It calls Stan under the hood so you don't have to worry about writing the Stan code yourself.
After installing brms, type

vignette("brms_phylogenetics")

for explanations on how to fit phylogenetic models with this package.

Diego Barneche

Jan 25, 2017, 6:58:30 PM
to Stan users mailing list
Dear Matt,

Thanks so much for your helpful reply! I followed all your suggestions, and I have been able to run my model in under 2 hours! Before it didn't even progress to 10% after 2 hours, so huge improvement! It is still much slower than what MCMCglmm is doing though, but that's probably because I do not fully understand how MCMCglmm is parametrising things under the hood.

Here's my current working code:

data {
    int<lower=1> N;               // Number of observations
    int<lower=0> P;               // Number of fixed-effects predictors
    int<lower=0> J;               // Number of groups (species)
    int<lower=0> K;               // Number of phylogeny-derived correlation matrix columns; tips only
    real Y[N];                    // Response variable
    matrix[N,P] X;                // Fixed-effect predictor matrix
    int<lower=1,upper=J> ZI[N];   // Random-effect index of group membership for across-species unique effects
    int<lower=1,upper=K> ZII[N];  // Random-effect index of group membership for heritable between-species effects (i.e. phylogeny-derived)
    corr_matrix[K] S;             // Phylogenetic correlation matrix (tips only)
}

transformed data {
    matrix[K,K] LPhy;
    LPhy = cholesky_decompose(S);
}

parameters {
    vector[P] beta;               // Vector of fixed-effect estimates
    vector[J] gammaSpp;           // Among-species random-effect coefficients
    real<lower=0> sigmaSpp;       // Among-species random-effect standard deviation
    vector[K] alphaPhy;           // Phylogeny-derived random-effect coefficients (re-parametrised for MVN)
    vector<lower=0>[K] sigmaPhy;  // Standard deviation attributable to heritability among species (i.e. phylogeny-derived)
    real<lower=0> sigmaY;         // Error scale
}

transformed parameters {
    vector[N] mu;                 // Linear mean predictor
    vector[K] gammaPhy;           // Phylogeny-derived random-effect coefficients
    gammaPhy = sigmaPhy .* (LPhy * alphaPhy);
    mu = X * beta + gammaSpp[ZI] + gammaPhy[ZII];
}

model {
    // Priors
    gammaSpp ~ normal(0, sigmaSpp);
    sigmaSpp ~ cauchy(0, 5);
    alphaPhy ~ normal(0, 5); // with alphaPhy ~ normal(0, 1) this would imply gammaPhy ~ multi_normal(zeroes, diag_matrix(sigmaPhy) * LPhy * LPhy' * diag_matrix(sigmaPhy))
    sigmaPhy ~ cauchy(0, 5);
    beta     ~ normal(0, 5);
    sigmaY   ~ cauchy(0, 5);

    // Likelihood
    Y ~ normal(mu, sigmaY);
}



Now, a quick question. I have seen an alternative code in this older post here with a similar model.
Is there any advantage in using the transpose of the Cholesky decomposition of the inverted correlation matrix rather than the Cholesky decomposition of the original matrix? I think the alternative code block would instead look like the following (not sure whether that would also require corresponding tweaks further down in the other code blocks?):

transformed data {
    matrix[K,K] LPhy;
    LPhy = transpose(cholesky_decompose(inverse(S)));
}



Thanks very much again!
Best wishes,
Diego

Diego Barneche

Jan 25, 2017, 7:00:10 PM
to Stan users mailing list
Dear Paul,

Thanks for your reply as well! It seems like one can also extract the log-likelihood from these models and use LOO model comparison!
Very useful. Once I give this a try I'll post the benchmarking performance results.

Best wishes,
Diego

Andrew Gelman

Jan 25, 2017, 7:03:36 PM
to stan-...@googlegroups.com
This is worth looking into.  It's very hard for me to imagine that the equivalent model programmed in Stan can be slower (in effective sample size per second) than MCMCglmm.

Does anyone know enough about this sort of model to look into this?
A



Matt Espe

Jan 25, 2017, 7:53:51 PM
to Stan users mailing list, gel...@stat.columbia.edu
One last thing you might think about is your priors. Depending on the scale of your data, cauchy(0,5) might be so wide that it provides essentially no regularization of the estimates. Tightening the priors to be weakly informative can also make a big difference when fitting models.
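For instance, if your response is roughly unit scale, something like this might be more reasonable than cauchy(0, 5) (just a sketch; the right scale depends entirely on your data):

```stan
sigmaSpp ~ normal(0, 1);   // half-normal, given the <lower=0> constraint
sigmaPhy ~ normal(0, 1);
sigmaY   ~ normal(0, 1);
```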

Matt

Paul Buerkner

Jan 25, 2017, 8:52:33 PM
to stan-...@googlegroups.com, gel...@stat.columbia.edu
Andrew,

while working on the first paper for brms a year ago, I noticed that MCMCglmm might well be faster than Stan in terms of effective sample size. Not necessarily for phylogenetic models (I didn't test it for these models, but I could), but for multilevel models with only one or two varying intercepts this might happen. The models are not fully equivalent, since MCMCglmm doesn't have the LKJ prior. For more complex models, however (e.g., when modelling varying slopes or multiple varying intercepts across different hierarchical levels), the performance of MCMCglmm drops fast, sometimes becoming more or less unusable unless one sets very strong priors. Stan, on the other hand, is very robust when scaling up model complexity; that's why I love it.

Paul

Andrew Gelman

Jan 25, 2017, 9:00:23 PM
to stan-...@googlegroups.com
Hi, Paul, thanks for the feedback. I'd be interested in seeing an example where MCMCglmm is much faster than Stan.  Even a simple example.  This sort of thing is useful in helping us make Stan work better!
A


Diego Barneche

Jan 26, 2017, 6:29:19 AM
to stan-...@googlegroups.com, gel...@stat.columbia.edu
Dear Matt, Paul and Andrew,

Thanks for your feedback. I have now played a little bit with the brms package, which allowed me to improve my model code even further. I did two things: (1) tightened up the prior distributions, and (2) used rescaled parameters for both random effects (i.e. across-species and phylogeny-derived).

Below I am pasting the R code and the timing results comparing similar models between Stan and MCMCglmm. Again, I am not sure if they are directly comparable, as MCMCglmm uses a different parametrisation.

R code:
library(MCMCglmm)
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

load('./dataAndTree.RData')

# fixed-effect model matrix
X  <-  model.matrix(~ 1 + Pred1 + Pred2 + Pred3 + Pred4, data = data) # 5 columns in total, includes intercept (1)
P  <-  as.integer(ncol(X)) # 5
# among-species random-effect vector
ZI <-  as.integer(as.factor(data$species)) # 284 species in total
J  <-  as.integer(length(unique(ZI))) # 284
# Cholesky factor of phylogenetic vcv matrix
A    <-  ape::vcv(tree, corr = FALSE) # 284 x 284 covariance matrix
LPhy <-  t(chol(A[sort(rownames(A)), sort(rownames(A))]))
# phylogeny-derived relatedness random-effect vector
ZII  <-  as.integer(data$animal) # also 284
# number of tips (i.e. species) only
K    <-  as.integer(length(unique(ZII))) # also 284

stanData  <-  list(N    =  as.integer(nrow(data)),
                   P    =  P,
                   J    =  J,
                   K    =  K,
                   MSpp =  1L,
                   MPhy =  1L,
                   Y    =  data$Y,
                   X    =  X,
                   ZI   =  ZI,
                   ZII  =  ZII,
                   LPhy =  LPhy)

nIter   <-  1e4
nWarm   <-  1e3
nThin   <-  5
nChains <-  1
inits   <-  list(beta = c(0.1, -0.15, -0.8, -0.2, -0.2), sigmaY = 1)

timingRawStan  <-  system.time({
  stanFit  <-  stan(file = 'phyloLMM.stan', data = stanData, iter = nIter, thin = nThin, chains = nChains, warmup = nWarm, init = list(inits))
})

# compare with original MCMCglmm model
set.seed(10)
priorInverseWishart  <-  list(G = list(G1 = list(V = diag(1), n = 0.002), G2 = list(V = diag(1), n = 0.002)), R = list(V = diag(1), n = 0.002))
timingRawMCMCglmm  <-  system.time({
  MCMCglmmFit  <-  MCMCglmm(Y ~ Pred1 + Pred2 + Pred3 + Pred4, random = ~animal + species, pedigree = tree, data = data, family = 'gaussian', nitt = nIter, thin = nThin, burnin = nWarm, prior = priorInverseWishart, pr = TRUE, verbose = TRUE)
})


And here are the total time results:
> timingRawStan
   user  system elapsed
420.654   2.550 423.960
> timingRawMCMCglmm
   user  system elapsed
  5.093   0.031   5.145

I do not know how to calculate the effective sample size, though I'd be more than happy to do it if someone provides me some sample code.

The variance estimates (as well as the main coefficients) are pretty similar, which is very encouraging:
# from Stan
> res  <-  summary(stanFit)[[1]]
> res[c('sigmaPhy[1]', 'sigmaSpp[1]', 'sigmaY'), 'mean'] ^ 2
sigmaPhy[1] sigmaSpp[1]      sigmaY
 0.47419640  0.05777246  0.12036096
# from MCMCglmm
> colMeans(MCMCglmmFit$VCV)
    animal otlSpecies      units
0.45447357 0.05864063 0.12024812



Now the final phylogenetic mixed model, which I am calling phyloLMM.stan above, looks like this:
data {
    int<lower=1> N;               // Number of observations
    int<lower=0> P;               // Number of fixed-effects predictors
    int<lower=0> J;               // Number of groups (species)
    int<lower=0> K;               // Number of phylogeny-derived correlation matrix columns; tips only
    int<lower=0> MSpp;            // 1 (integer for random-effect row vector definition)
    int<lower=0> MPhy;            // 1 (integer for random-effect row vector definition)
    real Y[N];                    // Response variable
    matrix[N,P] X;                // Fixed-effect predictor matrix
    int<lower=1,upper=J> ZI[N];   // Random-effect index of group membership for across-species unique effects
    int<lower=1,upper=K> ZII[N];  // Random-effect index of group membership for heritable between-species effects (i.e. phylogeny-derived)
    matrix[K,K] LPhy;             // Cholesky factor of phylogenetic covariance matrix
}

parameters {
    vector[P] beta;                  // Vector of fixed-effect estimates
    vector[J] alphaSpp[MSpp];        // Unscaled among-species random-effect coefficients
    vector<lower=0>[MSpp] sigmaSpp;  // Among-species random-effect standard deviation
    vector[K] alphaPhy[MPhy];        // Unscaled phylogeny-derived random-effect coefficients
    vector<lower=0>[MPhy] sigmaPhy;  // Standard deviation attributable to heritability among species (i.e. phylogeny-derived)
    real<lower=0> sigmaY;            // Residual sd
}

transformed parameters {
    vector[N] mu;                 // Linear mean predictor
    vector[J] gammaSpp;           // Among-species random-effect coefficients
    vector[K] gammaPhy;           // Phylogeny-derived random-effect coefficients
    gammaSpp = sigmaSpp[1] * alphaSpp[1];
    gammaPhy = sigmaPhy[1] * (LPhy * alphaPhy[1]);
    mu = X * beta + gammaSpp[ZI] + gammaPhy[ZII];
}

model {
    // Priors
    alphaSpp[1] ~ normal(0, 1);
    sigmaSpp    ~ cauchy(0, 2.5);
    alphaPhy[1] ~ normal(0, 1);
    sigmaPhy    ~ cauchy(0, 2.5);
    beta        ~ normal(0, 1);
    sigmaY      ~ cauchy(0, 2.5);

    // Likelihood
    Y ~ normal(mu, sigmaY);
}

Thanks once again.
Best wishes,
D

Michael Betancourt

Jan 26, 2017, 12:37:05 PM
to stan-...@googlegroups.com
I would be suspicious of the MCMCglmm results given the disparities in speed. Before doing anything you really should do fake-data simulation, i.e. simulate data from a known truth and see if you can recover that truth, to get a handle on whether or not MCMCglmm is giving you accurate answers. If it's strongly biased then the speed is meaningless.

Peter Smits

Jan 26, 2017, 12:46:58 PM
to Stan users mailing list
Hi Diego,

I think I might be able to help. The phylogenetic mixed model is the same mathematically as the animal model (the Hadfield and Nakagawa article discusses this), both of which are just covariance matrices known up to a constant. A few years ago someone posted to stan-users asking for help implementing this (https://groups.google.com/forum/#!topic/stan-users/HcvYaDu71_Y). A few solutions were given but the general idea is you have to calculate the multivariate normal part on your own/by hand.

At the time one of the solutions Ben Goodrich gave was the following (FYI uses old Stan syntax):
lp__ <- lp__ - 0.5 * N * log(sigmasq_u); // non-constant part of log(det(sigmasq_u * A)^-0.5)
lp__ <- lp__ - (u' * A_inv * u) / ( 2 * sigmasq_u); // log of kernel of multivariate normal

where sigmasq_u is the Brownian rate parameter; your sigmaPhy^2.
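In current Stan syntax those two increments would be written with target += rather than assigning to lp__ (a direct translation, untested):

```stan
target += -0.5 * N * log(sigmasq_u);               // non-constant part of log(det(sigmasq_u * A)^-0.5)
target += -quad_form(A_inv, u) / (2 * sigmasq_u);  // log of the kernel of the multivariate normal
```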

If you go through the thread there are more ideas on how to speed up this calculation. I suffer from this a lot myself, as I'll have a phylogenetic mixed model with 1000-2000+ tips and it gets very slow.

Hope that helps!

Cheers,

Peter Smits

Bob Carpenter

Jan 26, 2017, 12:57:52 PM
to stan-...@googlegroups.com

> On Jan 25, 2017, at 6:58 PM, Diego Barneche <barne...@gmail.com> wrote:
>
> Dear Matt,
>
> Thanks so much for your helpful reply! I followed all your suggestions, and I have been able to run my model in under 2 hours! Before it didn't even progress to 10% after 2 hours, so huge improvement! It is still much slower than what MCMCglmm is doing though, but that's probably because I do not fully understand how MCMCglmm is parametrising things under the hood.

Have you measured convergence from multiple chains and estimated
n_eff using Stan's tools? For speed, all that matters for MCMC
(assuming everything fits) is the ratio of n_eff to time. We often
see other algorithms that run much faster than Stan in terms of
iterations per second, but then they have much lower effective sample
sizes per time.

- Bob

You can also make these local variables in the model block; then they won't all be saved, which should be faster and smaller.

>
>
> model {
> // Priors
> gammaSpp ~ normal(0, sigmaSpp);

You probably want a non-centered parameterization here, with

gammaSppRaw ~ normal(0, 1);
gammaSpp = sigmaSpp * gammaSppRaw;

That separates gammaSpp and sigmaSpp in the prior.

> sigmaSpp ~ cauchy(0, 5);
> alphaPhy ~ normal(0, 5); // implies: gammaPhy ~ multi_normal_cholesky(zeroes, diag_matrix(sigmaPhy) * LPhy * LPhy' * diag_matrix(sigmaPhy)))
> sigmaPhy ~ cauchy(0, 5);
> beta ~ normal(0, 5);
> sigmaY ~ cauchy(0, 5);

These are very fat priors; tightening them up, for instance by not using a fat-tailed distribution, might make things faster.

> // Likelihood
> Y ~ normal(mu, sigmaY);
> }
>
>
>
> Now, a quick question. I have seen an alternative code in this older post here with a similar model.
> Is there any advantage in using the transpose of the Cholesky decomposition (on the inverted correlation matrix) rather than the Cholesky decomposition on the original matrix? I think the alternative code block would instead look like (not sure if that would also require escalating tweaks down on the other code blocks?):
>
> transformed data {
> matrix[K,K] LPhy;
> LPhy = transpose(cholesky_decompose(inverse(S)));
> }

Any work you can do in the transformed data block should be done there.
So if you need the transpose, do it ahead of time.

- Bob

Bob Carpenter

Jan 26, 2017, 1:35:52 PM
to stan-...@googlegroups.com, gel...@stat.columbia.edu
See below.

> On Jan 26, 2017, at 6:29 AM, Diego Barneche <barne...@gmail.com> wrote:
>
> Dear Matt, Paul and Andrew,
>
> ...

> And here are the total time results:
> > timingRawStan
> user system elapsed
> 420.654 2.550 423.960

Were you running those Stan chains in parallel or one at a time?
That'd still only be a factor of four speedup, and still 20 times
slower than what MCMCglmm is doing:

> > timingRawMCMCglmm
> user system elapsed
> 5.093 0.031 5.145
>
> I do not know how to calculate the effective sample size, though I'd be more than happy to do it if someone provides me a sample code.

Can you get a Coda-like object out of MCMCglmm? If so,
you can plug it into Stan's posterior analysis tools directly.


> The variance estimates (as well as the main coefficients) are pretty similar, which is very encouraging:
> # from stan
> > res <- summary(stanFit)[[1]]
> > res[c('sigmaPhy[1]','sigmaSpp[1]','sigmaY'), 'mean'] ^ 2
> sigmaPhy[1] sigmaSpp[1] sigmaY
> 0.47419640 0.05777246 0.12036096
> # from MCMCglmm
> > colMeans(MCMCglmmFit$VCV)
> animal otlSpecies units
> 0.45447357 0.05864063 0.12024812

You want to make sure that the answers are within a few
MCMC standard errors of each other.
You only use sigmaPhy[1]. So if MPhy > 1, you're going to have problems.
The prior on sigmaPhy[2:MPhy] is Cauchy so the prior will be the same and
Stan can't sample effectively into the fat tails of a Cauchy. So this'll
be a huge drag on sampling.

Same problem for all the other containers where only the first element
appears in the likelihood. alphaPhy[2:MPhy] doesn't
even get a prior so will be even more poorly behaved.

> }
>
>
> model {
> // Priors
> alphaSpp[1] ~ normal(0, 1);
> sigmaSpp ~ cauchy(0, 2.5);
> alphaPhy[1] ~ normal(0, 1);
> sigmaPhy ~ cauchy(0, 2.5);
> beta ~ normal(0, 1);
> sigmaY ~ cauchy(0, 2.5);
> // Likelihood
> Y ~ normal(mu, sigmaY);
> }

- Bob

Andrew Gelman

Jan 26, 2017, 10:41:31 PM
to stan-...@googlegroups.com
Hi, Diego.  Effective sample size is output if you print the fitted Stan object.   This is a big deal because Stan can be much more efficient than something like MCMCglmm.  So you have to compare effective sample size, not total sample size.
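For example, in rstan (a quick sketch):

```r
print(stanFit)   # the per-parameter table includes n_eff and Rhat
nEff  <-  summary(stanFit)$summary[, 'n_eff']
```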

Also when running Stan you should always run multiple chains. My laptop has 4 processors, so Stan can run 4 chains in the same time it takes to run 1, and then you get 4 times as many simulation draws. Multiple chains also allow assessment of mixing. I also recommend using the default settings: the number of warmup iterations should be 1/2 the total. You seem to have 1,000 warmup and 10,000 total iterations. That won't be the most efficient way to go; you're not giving Stan long enough to warm up. Or, conversely, if only 1,000 warmup iterations are needed, you might only need 2,000 total iterations per chain.
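Something like this, with warmup at half the total (sketch):

```r
stanFit  <-  stan(file = 'phyloLMM.stan', data = stanData, chains = 4, iter = 2000, warmup = 1000)
```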

There might also be ways to make the Stan model more efficient; I don't know about this because I'm not familiar with this particular application area.
A


 

Diego Barneche

Jan 27, 2017, 9:58:41 PM
to stan-...@googlegroups.com, gel...@stat.columbia.edu
Hi everyone, 

Thanks very much for all your feedback, and apologies for taking so long to reply. I did some digging into the comments Peter, Bob, and Andrew provided, and below is what I got (very sorry Michael, I haven't had time yet to simulate data with known behaviour, but I'll definitely post it here when I get the chance).

I dropped the 'row_vector' declarations for the random-effect coefficients and used 'real' declarations for the standard deviations. I also played a bit more with the priors for the standard deviations, and that seems to have helped. I also kept the default settings as suggested by Andrew (nWarm = nIter / 2) and used 3 chains (nIter = 5e4), each allocated to a different core.

The reason why I used 1 chain in the previous example was that MCMCglmm does not allow multiple chains by default. I learned (from here) that one needs to use the parallel package in order to achieve that (see below).

So here are my current working codes:

R code:

library(MCMCglmm)
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

rm(list = ls())

load('./dataAndTree.RData')

# fixed-effect model matrix
X  <-  model.matrix(~ 1 + Pred2 + Pred3 + Pred4 + Pred5, data = data) # 874 obs
P  <-  as.integer(ncol(X)) # 5 predictors
# among-species random-effect vector
ZI <-  as.integer(as.factor(data$species)) # 1:284
J  <-  as.integer(length(unique(ZI))) # number of species, 284
# phylogenetic vcv matrix
A    <-  ape::vcv(tree, corr = FALSE)
LPhy <-  t(chol(A[sort(rownames(A)), sort(rownames(A))])) # Cholesky factor
# phylogeny-derived relatedness random-effect vector
ZII  <-  as.integer(data$animal) # 1:284
# number of tips (i.e. species)
K    <-  as.integer(length(unique(ZII))) # number of tips, 284

stanData  <-  list(N    =  as.integer(nrow(data)),
                   P    =  P,
                   J    =  J,
                   K    =  K,
                   Y    =  data$Y,
                   X    =  X,
                   ZI   =  ZI,
                   ZII  =  ZII,
                   LPhy =  LPhy)

nIter   <-  5e4
nWarm   <-  nIter / 2
nThin   <-  round(nWarm / 6e3) # keep 6e3 samples
nChains <-  3
inits1  <-  list(beta = c(0.1, -0.15, -0.8, -0.2, -0.2), sigmaY = 1)
inits2  <-  list(beta = c(0.2, -0.3,  -0.5, -0.1, -0.5), sigmaY = 1)
inits3  <-  list(beta = c(0,   0.15,  -1,   -0.5, -0.3), sigmaY = 1)

timingRawStan  <-  system.time({
  stanFit  <-  stan(file = 'phyloLMM.stan', data = stanData, iter = nIter, thin = nThin, chains = nChains, warmup = nWarm, init = list(inits1, inits2, inits3))
})

res  <-  summary(stanFit)[[1]]
res[which(rownames(res) %in% c('sigmaPhy', 'sigmaSpp', 'sigmaY')), 1] ^ 2
nEff  <-  res[, 'n_eff']
summary(nEff)

# compare with original MCMCglmm model
library(parallel)
set.seed(10)
priorInverseWishart  <-  list(G = list(G1 = list(V = diag(1), n = 0.002), G2 = list(V = diag(1), n = 0.002)), R = list(V = diag(1), n = 0.002))
timingRawMCMCglmm  <-  system.time({
  MCMCglmmFit  <-  mclapply(1:3, function(i) {
    MCMCglmm(Y ~ Pred2 + Pred3 + Pred4 + Pred5, random = ~animal + species, pedigree = tree, data = data, family = 'gaussian', nitt = nIter, thin = nThin, burnin = nWarm, prior = priorInverseWishart, pr = TRUE, verbose = TRUE)
  }, mc.cores = 3)
})

summary(as.numeric(sapply(MCMCglmmFit, function(m) summary(m)$solutions[, 'eff.samp'])))

And the phyloLMM.stan code:

data {
    int<lower=1> N;               // Number of observations
    int<lower=0> P;               // Number of fixed-effects predictors
    int<lower=0> J;               // Number of groups (species)
    int<lower=0> K;               // Number of phylogeny-derived correlation matrix columns; tips only
    real Y[N];                    // Response variable
    matrix[N,P] X;                // Fixed-effect predictor matrix
    int<lower=1,upper=J> ZI[N];   // Random-effect index of group membership for across-species unique effects
    int<lower=1,upper=K> ZII[N];  // Random-effect index of group membership for heritable between-species effects (i.e. phylogeny-derived)
    matrix[K,K] LPhy;             // Cholesky factor of phylogenetic covariance matrix
}

parameters {
    vector[P] beta;           // Vector of fixed-effect estimates
    vector[J] alphaSpp;       // Unscaled among-species random-effect coefficients
    vector[K] alphaPhy;       // Unscaled phylogeny-derived random-effect coefficients
    real<lower=0> sigmaSpp;   // Among-species random-effect standard deviation
    real<lower=0> sigmaPhy;   // Standard deviation attributable to heritability among species (i.e. phylogeny-derived)
    real<lower=0> sigmaY;     // Model residual standard deviation
}

transformed parameters {
    vector[N] mu;             // Linear mean predictor
    vector[J] gammaSpp;       // Among-species random-effect coefficients
    vector[K] gammaPhy;       // Phylogeny-derived random-effect coefficients
    gammaPhy = sigmaPhy * (LPhy * alphaPhy);
    gammaSpp = sigmaSpp * alphaSpp;
    mu = X * beta + gammaSpp[ZI] + gammaPhy[ZII];
}

model {
    // Priors
    alphaSpp ~ normal(0, 1);
    sigmaSpp ~ cauchy(0, 0.5);
    alphaPhy ~ normal(0, 1);
    sigmaPhy ~ cauchy(0.5, 1);
    beta     ~ normal(0, 1);
    sigmaY   ~ cauchy(0.1, 0.5);

    // Likelihood
    Y ~ normal(mu, sigmaY);
}


So the variance component estimates are all very similar to before:
> res  <-  summary(stanFit)[[1]]
> res[which(rownames(res) %in% c('sigmaPhy','sigmaSpp','sigmaY')), 1] ^ 2
  sigmaSpp   sigmaPhy     sigmaY
0.05758587 0.47334377 0.12012333

And the total timing:
> timingRawStan
    user   system  elapsed
4928.952   33.740 2507.232

And here's the effective sample size:
> nEff  <-  res[, 'n_eff']
> summary(nEff)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  12680   17200   17670   17660   18070   18750

And some other general stats (not sure if those are helpful):
Chain 1, Iteration: 50000 / 50000 [100%]  (Sampling)
 Elapsed Time: 1062.82 seconds (Warm-up)
               1432.62 seconds (Sampling)
               2495.44 seconds (Total)
Chain 2, Iteration: 50000 / 50000 [100%]  (Sampling)
 Elapsed Time: 1022.18 seconds (Warm-up)
               1443.64 seconds (Sampling)
               2465.83 seconds (Total)
Chain 3, Iteration: 50000 / 50000 [100%]  (Sampling)
 Elapsed Time: 1053.64 seconds (Warm-up)
               1434.36 seconds (Sampling)
               2488 seconds (Total)

Inference for Stan model: phyloLMM.
3 chains, each with iter=50000; warmup=25000; thin=4;
post-warmup draws per chain=6250, total post-warmup draws=18750.
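The draw counts reported by rstan follow directly from these sampler settings; a quick check of the arithmetic:

```python
# Sampler settings from the rstan summary above.
iters, warmup, thin, chains = 50000, 25000, 4, 3

per_chain = (iters - warmup) // thin  # post-warmup draws kept per chain
total = per_chain * chains            # total post-warmup draws
print(per_chain, total)               # 6250 18750
```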


Now this is the timing for 3 parallel chains in MCMCglmm using the same specs:
> timingRawMCMCglmm
    user   system  elapsed
2508.775   17.754   29.163

And the effective sample size is (not sure if that's how it should be calculated though; this table only provides ESS for the fixed effects):
> summary(as.numeric(sapply(MCMCglmmFit, function(m) summary(m)$solutions[,'eff.samp'])))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   3555    6092    6250    5965    6250    6666

So is this comparison/interpretation correct and fair?
Stan was 4928.952/2508.775 = 1.964685 times slower than MCMCglmm, but at the same time produced 17670/6250 = 2.8272 times as many effective samples as MCMCglmm?
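One way to make that trade-off explicit is effective samples per second, using the user times and the median n_eff (Stan) and median eff.samp (MCMCglmm) reported above:

```python
# Timings (user, seconds) and median effective sample sizes from above.
stan_time, stan_ess = 4928.952, 17670
mcmcglmm_time, mcmcglmm_ess = 2508.775, 6250

stan_rate = stan_ess / stan_time              # effective samples per second
mcmcglmm_rate = mcmcglmm_ess / mcmcglmm_time
print(round(stan_rate / mcmcglmm_rate, 2))    # 1.44: Stan ~1.4x faster per effective sample
```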

Peter's alternative coding suggestion:
Finally, I tried to modify my Stan code even further following Peter's suggestion. Unfortunately, however, I do not get similar estimates for the variance components, and the code runs much slower. So, two questions about this one: (1) what did I code wrong? (2) Is it much slower because I coded it incorrectly, or because my working code above (which builds on brms auto-generated Stan code) is already faster?

Here's the modified version (it currently compiles, but gives wrong estimates):

data {
    int<lower=1> N;               // Number of observations
    int<lower=0> P;               // Number of fixed-effects predictors
    int<lower=0> J;               // Number of groups (Species)
    int<lower=0> K;               // Number of phylogeny-derived correlation matrix columns; it includes tips only
    real Y[N];                    // Response variable
    matrix[N,P] X;                // Fixed-effect predictor matrix
    int<lower=1,upper=J> ZI[N];   // Random-effect index of group membership for across-species unique effects
    int<lower=1,upper=K> ZII[N];  // Random-effect index of group membership for heritable between-species effects (i.e. phylogeny-derived)
    cov_matrix[K] A;              // Phylogenetic vcv matrix (tips only)
}


transformed data {
    matrix[K,K] t_LAinv;
    t_LAinv = transpose(cholesky_decompose(inverse(A)));
}
 
parameters {
    vector[P] beta;             // Vector of fixed-effect estimates
    vector[J] alphaSpp;         // Unscaled among-species random-effect coefficients
    vector[K] gammaPhy;         // Phylogeny-derived random-effect coefficients
    real<lower=0> SigmaSpp;     // Among-species random-effect standard deviation
    real<lower=0> SigmaPhy;     // Variance attributable to heritability among species (i.e. phylogeny-derived)
    real<lower=0> SigmaY;       // Error scale
}


transformed parameters {
    vector[N] mu;               // Linear mean predictor
    vector[J] gammaSpp;         // Among-species random-effect coefficients

    gammaSpp = sqrt(SigmaSpp) * alphaSpp;

    mu = X * beta + gammaSpp[ZI] + gammaPhy[ZII];
}


model {
    vector[K] v;
    real sumOfSquares;

    v = t_LAinv * gammaPhy;
    sumOfSquares = dot_product(v, v);

    target += -0.5 * N * log(SigmaPhy);         // non-constant part of log(det(SigmaPhy * A)^-0.5)
    target += -sumOfSquares / (2 * SigmaPhy);   // log of kernel of multivariate normal

    // Priors
    alphaSpp ~ normal(0, 1);
    SigmaSpp ~ scaled_inv_chi_square(2, 0.25);
    SigmaPhy ~ scaled_inv_chi_square(2, 0.25);
    beta     ~ normal(0, 1);
    SigmaY   ~ scaled_inv_chi_square(2, 0.25);

    // Likelihood
    Y ~ normal(mu, sqrt(SigmaY));
}
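For what it's worth, the hand-rolled multivariate-normal increment can be cross-checked numerically. Note that for a K x K matrix, log det(SigmaPhy * A) = K * log(SigmaPhy) + log det(A), so the non-constant determinant term scales with K (the dimension of A). The numpy sketch below, with a small made-up A (the real matrix isn't shown), verifies that the t_LAinv construction yields gamma' A^{-1} gamma and that a kernel using K differs from the full MVN log density only by terms constant in SigmaPhy and gammaPhy:

```python
import numpy as np

# Made-up 3x3 phylogenetic covariance matrix A (illustrative only).
A = np.array([[1.0, 0.5, 0.1],
              [0.5, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
K = A.shape[0]
sigma2 = 0.6                        # plays the role of SigmaPhy (a variance)
gamma = np.array([0.3, -0.2, 0.5])  # plays the role of gammaPhy

# transformed data: t_LAinv = transpose(cholesky_decompose(inverse(A)))
t_LAinv = np.linalg.cholesky(np.linalg.inv(A)).T
v = t_LAinv @ gamma
ss = v @ v  # sumOfSquares; equals gamma' A^{-1} gamma
print(np.isclose(ss, gamma @ np.linalg.solve(A, gamma)))  # True

# Kernel with K (not N) in the determinant term:
kernel = -0.5 * K * np.log(sigma2) - ss / (2 * sigma2)

# Full log density of MVN(0, sigma2 * A):
cov = sigma2 * A
full = (-0.5 * K * np.log(2 * np.pi)
        - 0.5 * np.log(np.linalg.det(cov))
        - 0.5 * gamma @ np.linalg.solve(cov, gamma))

# The difference depends only on A and K, not on sigma2 or gamma:
print(np.isclose(full - kernel,
                 -0.5 * K * np.log(2 * np.pi)
                 - 0.5 * np.log(np.linalg.det(A))))  # True
```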

Total time (10x slower compared to the working Stan code above):
> timingRawStan
     user    system   elapsed
11057.179    54.732  4017.181

Here are the estimates for the variance components:
> res  <-  summary(stanFit)[[1]]
> res[which(rownames(res) %in% c('SigmaPhy','SigmaSpp','SigmaY')), 1]
    SigmaSpp     SigmaPhy       SigmaY
3.416017e-02 1.000567e-13 2.957514e-01

Thanks very much!!
Diego

Andrew Gelman

unread,
Jan 27, 2017, 10:15:53 PM1/27/17
to Diego Barneche, Stan users mailing list
Hi, so if Stan takes twice as long as MCMCglmm but the effective sample size is multiplied by 2.8, then Stan is 1.4 times faster per effective sample size, which I could believe for this simple model.
You can do print(stanFit) and see the summaries for all the parameters; I'm guessing that 50,000 iterations is way more than you need.  Maybe 2000 iterations would be enough?  Or 5000?  Also no need to thin.  Finally, I recommend avoiding the inverse-Wishart prior; it can give you trouble.
A


Bob Carpenter

unread,
Jan 28, 2017, 11:20:42 AM1/28/17
to stan-...@googlegroups.com, gel...@stat.columbia.edu

> On Jan 27, 2017, at 9:58 PM, Diego Barneche <barne...@gmail.com> wrote:
>
> Hi everyone,
> ...

> Inference for Stan model: workingExample-MCMCglmm-MatrixPhylo3.
> 3 chains, each with iter=50000; warmup=25000; thin=4;
> post-warmup draws per chain=6250, total post-warmup draws=18750.

> Now this is the effective sample size for 3 parallel chains in MCMCglmm using the same specs:
> > timingRawMCMCglmm
> user system elapsed
> 2508.775 17.754 29.163
>
> And the effective sample size is (not sure if that's how it should be calculated though):
> > summary(as.numeric(sapply(MCMCglmmFit, function(m)summary(m)$solutions[,'eff.samp'])))
> Min. 1st Qu. Median Mean 3rd Qu. Max.
> 3555 6092 6250 5965 6250 6666
>
> So is this comparison correct and fair?
> Stan was 4928.952/2508.775 = 1.964685 times slower than MCMCglmm, but at the same time had 17670/6250 = 2.8272 as many efficient samples if compared to MCMCglmm?

That's more digits than you need, but net, you're reporting that

Stan's actually 50% *faster* than MCMCglmm

We don't care about speed per iteration, only effective sample
size per second. Think of speed per iteration like RPMs in
a car's engine and effective sample size per second
like km/s (or MPH if you're a colonial motorist).

It's really really hard to compare systems. Here, a problem
you face in evaluation is the time to converge/adapt and the
time to draw a sample with an effective sample size target.

An effective sample size of 10K is almost certainly more than you need
for inference. If you really do need an effective sample size that
large for high precision estimates, then you almost certainly don't need to
spend half the iterations doing warmup. Stan usually warms up well in the
range of 100 to 1000 iterations. Just changing that would result in Stan
running almost twice as fast end to end.

I don't know about MCMglmm, but they probably have similar tuning parameters.
So it's always dangerous to compare systems if you only know how to tune
one of them!

Your model saying that SigmaY is the scale is misleading. It's
the squared scale.

With the new declare/define syntax you can write

vector[K] v;
real sumOfSquares;
v = t_LAinv * gammaPhy;
sumOfSquares = dot_product(v,v);

as

vector[K] v = t_LAinv * gammaPhy;
real sumOfSquares = dot_self(v);

Same thing can clean up the transformed parameter block.

Note that I replaced your dot_product function with a more efficient
version. One more efficiency issue:

- sumOfSquares / (2 * SigmaPhy);

should be

-0.5 * sumOfSquares / SigmaPhy;

just because -0.5 will be compiled to a constant and that reduces
the expression size a little bit at run time. You won't notice it,
so feel free to leave the 2 * in the denominator. I don't know the
model, so if it's not giving you the right result, I don't know why.

- Bob

Diego Barneche

unread,
Jan 28, 2017, 9:26:35 PM1/28/17
to Stan users mailing list, gel...@stat.columbia.edu
Dear Bob and Andrew,

Thanks so much for all the feedback. I got the previously working code to run properly with 2e3 iterations (warm-up = 1e3) as suggested by Andrew. I also used the direct declare/define syntax as suggested by Bob. If in the future I discover what I coded wrong in the alternative model, I'll update this post.

Best wishes,
Diego