Multilevel Multivariate Probit Regression


Roy Martin

Feb 24, 2016, 5:52:29 PM
to Stan users mailing list
Hello group,

I am working on building a multilevel version of the multivariate probit regression that is provided as an example in the rstan manual. As a starting point, I actually swiped the code from a posting on this list, which I believe was by Bob Carpenter. In fact, I should start by conveying my utmost gratitude for that code, as it pulled me out of a very dark place where I had spent an unspeakable amount of time trying to get this model to converge in JAGS using the Wishart prior! My dataset comprises 24 fish species as the multivariate response, and this Stan code resulted in (what felt like) a "beautiful" convergence. All of the parameters also had very logical interpretations, ecologically speaking.

However, my ultimate goal for this dataset is to evaluate a group-level predictor in addition to site-level predictors. Specifically, I have data from streams (site level) situated across watersheds (group level) with varying degrees of mining degradation: a continuous mining index estimated previously at the watershed scale. I also have data on chemistry at the site scale, which reflects local mining activity.

So, I have tweaked the original code into what I believe is a reasonable multilevel model for this problem. However, I cannot get this code to compile. Being only two weeks or so into Stan, I suspect it will turn out that I am making some glaring coding error that will be obvious to the more seasoned; the error(s) returned were not obvious to me, though. Also, not being a real statistician, there is a good chance that I am just a bonehead and this coding/parameterization reflects some general ignorance on my part. If so, I apologize in advance for wasting anyone's time.

In any case, I have attached the code below and would be most grateful if anyone could point me in the right direction.

High Regards,

-Roy

MVprobitreString = " #open quote for model string
functions {
  int sum(int[,] a) {
    int s;
    s <- 0;
    for (i in 1:size(a))
      for (j in 1:size(a[i]))
        s <- s + a[i,j];
    return s;
  }
}
data {
  int<lower=0> N; // number samples (stream sites)
  int<lower=1> K; // number level 1 (site) predictors (beta coefs)
  int<lower=0> J; // number groups (HUC watersheds)
  int<lower=1> L; // number level 2 (group/HUC) predictors (gamma coefs)
  int<lower=1> D; // number traits (species)
  int<lower=1,upper=J> HUC[N]; //  Level 2 (group/HUC) index at level 1
  int<lower=0,upper=1> y[N,D]; // outcomes (binary P/A) for each species (site level)
  vector[K] x[N]; // level 1 (site) predictor values
  vector[L] g[J]; // level 2 (group/HUC) predictor values
}
transformed data {
  int<lower=0> N_pos;
  int<lower=1,upper=N> n_pos[sum(y)];
  int<lower=1,upper=D> d_pos[size(n_pos)];
  int<lower=0> N_neg;
  int<lower=1,upper=N> n_neg[(N * D) - size(n_pos)];
  int<lower=1,upper=D> d_neg[size(n_neg)];

  N_pos <- size(n_pos);
  N_neg <- size(n_neg);
  {
    int i;
    int j;
    i <- 1;
    j <- 1;
    for (n in 1:N) {
      for (d in 1:D) {
        if (y[n,d] == 1) {
          n_pos[i] <- n;
          d_pos[i] <- d;
          i <- i + 1;
        } else {
          n_neg[j] <- n;
          d_neg[j] <- d;
          j <- j + 1;
        }
      }
    }
  }
}
parameters {
  matrix[D,K] beta;
  matrix[D,L] gamma;
  vector[D] alpha[J];
  cholesky_factor_corr[D] N_Omega; //cor for units (sites)
  cholesky_factor_corr[D] J_Omega; //cor for groups (HUC watershed)
  vector<lower=0>[N_pos] z_pos;
  vector<upper=0>[N_neg] z_neg;
}
transformed parameters {
  vector[D] z[N];
  for (n in 1:N_pos)
    z[n_pos[n], d_pos[n]] <- z_pos[n];
  for (n in 1:N_neg)
    z[n_neg[n], d_neg[n]] <- z_neg[n];
}
model {
  N_Omega ~ lkj_corr_cholesky(4);
  J_Omega ~ lkj_corr_cholesky(4);
  to_vector(beta) ~ normal(0, 5);
  to_vector(gamma) ~ normal(0, 5);
  {
    vector[D] beta_x[N];
    vector[D] gamma_g[J];
    vector[D] mu_a[N];
    for (n in 1:N)
      beta_x[n] <- beta * x[n];
    for (j in 1:J)
      gamma_g[j] <- gamma * g[j];
    alpha ~ multi_normal_cholesky(gamma_g, J_Omega);
    for (n in 1:N)
      mu_a[n] <- alpha[HUC[n]] + beta_x[n];
    z ~ multi_normal_cholesky(mu_a, N_Omega);
  }
}
generated quantities {
  matrix[D,D] Omega;
  matrix[D,D] OmegaHUC;
  Omega <- multiply_lower_tri_self_transpose(N_Omega);
  OmegaHUC <- multiply_lower_tri_self_transpose(J_Omega);
}
" #Close model string

writeLines(MVprobitreString, con="MVprobitre.stan") #write to file


MVPrefitDSO <- stan_model(model_code = MVprobitreString)

MVprobitre.stan

Bob Carpenter

Feb 24, 2016, 8:45:19 PM
to stan-...@googlegroups.com
I'd suggest making the model standalone and trying to line
up the errors you get with the line numbers. Or let us know what
error you got. It's much easier for us than guessing.

- Bob

Roy Martin

Feb 24, 2016, 10:29:59 PM
to Stan users mailing list
Thanks, Bob. I am sorry for the vagueness regarding the errors; it was silly to have left that part out.

However, I just came home from the office, noticed your message, and started working through your advice and, lo and behold, the model compiled without complaint on my home machine. It also ran, converged, and made good sense. So, all in all, I am sorry to have probably wasted your time.

If I recall, the compilation error on the work machine was related to Makevars, which made little sense to me at the time. In hindsight, I suppose I may have an issue with the rstan/Rtools install on that machine. Nevertheless, if it is of any use, I'll post the actual error from that machine to this thread tomorrow. I also noticed the FAQ section after the fact, so I'll be sure to consult that as well. The error messages confused me because all of the other Stan models I have worked with on that machine compiled and ran without trouble, including the original code you proposed for this MVP problem. So, being unfamiliar with Stan, I just assumed I had borked something in the actual model script that was keeping it from compiling.

Thanks again,

-Roy

Roy Martin

Feb 25, 2016, 10:41:01 AM
to stan-...@googlegroups.com
Below is the actual error message. I edited out part of the path related to the "Net MyDocuments" folder just to be sure I wasn't committing any agency policy violation I might not be aware of; it is just a path to a synced server folder.

This was actually the issue. The default install path for R packages on our agency PCs goes to this server folder, which ultimately causes a lot of complications for some R packages. I knew that already and had created a separate library folder on the local drive for installing packages. As a result, I have to set the library path each time I run R (or put a .Rprofile in the working directory). I must not have done that in this case, because after reloading rstan from the correct (local) library directory, everything compiled fine. Sigh.

TLDR: I am, in fact, a bonehead and was running rstan from a network directory.

Error in compileCode(f, code, language = language, verbose = verbose) :
  Compilation ERROR, function(s)/method(s) not created! In file included from C:/Program Files/R/R-3.2.2/library/rstan/include/rstan/rstaninc.hpp:3:0,
                 from file157016044bd2.cpp:953:
C:/Program Files/R/R-3.2.2/library/rstan/include/rstan/stan_fit.hpp:71:55: fatal error: stan/services/optimize/do_bfgs_optimize.hpp: No such file or directory
compilation terminated.
make: *** [file157016044bd2.o] Error 1
Warning message:
running command 'make -f "C:/PROGRA~1/R/R-32~1.2/etc/x64/Makeconf" -f "C:/PROGRA~1/R/R-32~1.2/share/make/winshlib.mk" -f "//..edit../rmartin/Net MyDocuments/.R/Makevars" SHLIB_LDFLAGS='$(SHLIB_CXXLDFLAGS)' SHLIB_LD='$(SHLIB_CXXLD)' SHLIB="file157016044bd2.dll" WIN=64 TCLBIN=64 OBJECTS="file157016044bd2.o"' had status 2 
In addition: Warning message:
running command 'C:/PROGRA~1/R/R-32~1.2/bin/x64/R CMD SHLIB file157016044bd2.cpp 2> file157016044bd2.cpp.err.txt' had status 1
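
For anyone who hits the same thing, the fix looks roughly like this (a sketch; "C:/Rlibs" is a placeholder for wherever your local package library actually lives):

## hypothetical .Rprofile: point R at the local package library instead of
## the network-synced default before loading rstan
.libPaths(c("C:/Rlibs", .libPaths()))
library(rstan)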

Bob Carpenter

Feb 25, 2016, 10:51:23 AM
to stan-...@googlegroups.com
Thanks for reporting back.

- Bob

Roy Martin

Mar 8, 2016, 11:19:54 AM
to Stan users mailing list
Hello again,

As it turns out, although this model appeared to converge well with my empirical dataset, such that all Rhat < 1.1, after doing some more reading on convergence diagnostics in Stan I decided to download shinystan (amazing tool!) to look at some other measures. It turns out that, after runs with warmups of 2k-10k, there were still quite a few parameters where n_eff was less than 10% of the total samples (1000 x 4 chains). Also, the Monte Carlo standard error was greater than 10% of the posterior standard deviation for some parameters.
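
For reference, this is roughly how I have been flagging those parameters outside of shinystan (a sketch; MVPrefit is the fitted stanfit object, and the draw count assumes 4 chains x 1000 post-warmup iterations):

library(rstan)
s <- summary(MVPrefit)$summary                 # one row per parameter
n_draws <- 4 * 1000                            # 4 chains x 1000 post-warmup draws
rownames(s)[s[, "n_eff"] < 0.1 * n_draws]      # n_eff below 10% of draws
rownames(s)[s[, "se_mean"] > 0.1 * s[, "sd"]]  # MC std err above 10% of posterior sd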

So, I decided to simulate a similar dataset to see whether similar convergence issues were apparent and what parameter recovery looked like, and I ran into the same issues with the simulated data. Although the runs generally recovered the individual- and group-level predictors (beta, gamma) quite well, the species (trait) correlations at both the unit (site) and group (HUC) levels left a good bit to be desired. In particular, the large correlations usually were not covered by the 95% credible intervals. I thought, probably naively, that this might be because lkj_corr_cholesky(eta = 4) was too strong a constraint, given that several of the simulated correlations were quite large. So I lowered eta to 2, which may have made some slight difference, but recovery was still not great. I attached coefplots from a run (4000 warmup / 5000 iter, eta = 2) for illustration. Of course, lowering eta also occasionally led to some divergent transitions, and it certainly didn't help get n_eff above 10% of the samples. There was also often concerning autocorrelation in the chains. All of these diagnostics varied a little from run to run when adjusting things like adapt_delta, but I've yet to find the combination of adapt_delta, max_treedepth, warmup, etc. that results in a model passing all of the QA/QC checks, particularly n_eff above 10% of samples.

In the end, I'd guess that this model has high parameter correlation in the posterior (?) and needs some optimization. I have been reading the manual and trying to understand the "Matt trick" and non-centering approaches in this context; I think I get the gist and am trying to figure out how to implement them now. However, I wanted to make this post in hopes that someone might be able to give me a nudge. These operations are a little confusing to me because (1) matrix algebra is not yet a native language to me (an ecologist), and (2) the multivariate scale is fixed at 1 in this latent probit model, and I am having trouble translating the manual's examples to that case.
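
Just to convince myself of the basic identity behind non-centering, here is a little R sketch (not Stan) showing that shifting and scaling a standard normal vector by a Cholesky factor gives the same distribution as sampling the multivariate normal directly:

library(MASS) # for mvrnorm()
Sigma <- matrix(c(1.0, 0.8, 0.3,
                  0.8, 1.0, 0.5,
                  0.3, 0.5, 1.0), 3, 3)  # an arbitrary correlation matrix
L <- t(chol(Sigma))                      # lower-triangular Cholesky factor
mu <- c(0.5, -1, 2)
centered    <- t(replicate(1e4, mvrnorm(1, mu, Sigma)))           # direct draws
noncentered <- t(replicate(1e4, mu + as.vector(L %*% rnorm(3))))  # mu + L * z, z ~ N(0, I)
round(cov(centered) - cov(noncentered), 2)                        # ~0: same distribution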

On another front (and sorry if this is documented somewhere I haven't looked), I am not sure what a reasonable number of warmup iterations would be for a model of this complexity, or at what point I should give up on increasing warmup to improve the diagnostics. I did feel that, in this case, increasing warmup led to slight improvements. Yet even after 10k warmup iterations (1-2 hour runs), there were still quite a few parameters with n_eff below 10% of samples, and parameter recovery for the correlations was similarly poor (but maybe slightly better?). As such, I would not bet that running 100k iterations would make the difference, and from what I've read, even 10k seems large for Stan. So I wanted to check here before potentially spending further hours upon days of processing time trying to tune warmup and control options, even after perhaps optimizing the model parameterization.

Many thanks again, in advance. And sorry for the long-winded post. Below are the coefficient plots, the Stan model, and dataset sim in R.

-Roy


 
functions {
 int sum(int[,] a) {
  int s;
  s <- 0;
  for (i in 1:size(a))
   for (j in 1:size(a[i]))
    s <- s + a[i,j];
  return s;
 }
}
data {
 int<lower=0> N; // number samples (stream sites)
 int<lower=1> J; // number groups (HUC watersheds)
 int<lower=1> D; // number traits (species)
 int<lower=1> K; // number level 1 (site) predictors
 int<lower=1> L; // number level 2 (group/HUC) predictors
 int<lower=1,upper=J> HUC[N]; // level 2 (group/HUC) index at level 1
 int<lower=0,upper=1> y[N,D]; // outcomes (binary P/A) for each species
 vector[K] x[N]; // level 1 (site) predictor values
 vector[L] g[J]; // level 2 (group/HUC) predictor values
}
transformed data {
 int<lower=0> N_pos;
 int<lower=1,upper=N> n_pos[sum(y)];
 int<lower=1,upper=D> d_pos[size(n_pos)];
 int<lower=0> N_neg;
 int<lower=1,upper=N> n_neg[(N * D) - size(n_pos)];
 int<lower=1,upper=D> d_neg[size(n_neg)];
 N_pos <- size(n_pos);
 N_neg <- size(n_neg);
 {
  int i;
  int j;
  i <- 1;
  j <- 1;
  for (n in 1:N) {
   for (d in 1:D) {
    if (y[n,d] == 1) {
     n_pos[i] <- n;
     d_pos[i] <- d;
     i <- i + 1;
    } else {
     n_neg[j] <- n;
     d_neg[j] <- d;
     j <- j + 1;
    }
   }
  }
 }
}
parameters {
 matrix[D,K] beta;
 matrix[D,L] gamma;
 vector[D] alpha[J];
 cholesky_factor_corr[D] N_Omega; // cor for units (sites)
 cholesky_factor_corr[D] J_Omega; // cor for groups (HUC watersheds)
 vector<lower=0>[N_pos] z_pos;
 vector<upper=0>[N_neg] z_neg;
}
transformed parameters {
 vector[D] z[N];
 for (n in 1:N_pos)
  z[n_pos[n], d_pos[n]] <- z_pos[n];
 for (n in 1:N_neg)
  z[n_neg[n], d_neg[n]] <- z_neg[n];
}
model {
 N_Omega ~ lkj_corr_cholesky(2);
 J_Omega ~ lkj_corr_cholesky(2);
 to_vector(beta) ~ normal(0, 5);
 to_vector(gamma) ~ normal(0, 5);
 {
  vector[D] beta_x[N];
  vector[D] gamma_g[J];
  vector[D] mu_a[N];
  for (n in 1:N)
   beta_x[n] <- beta * x[n];
  for (j in 1:J)
   gamma_g[j] <- gamma * g[j];
  alpha ~ multi_normal_cholesky(gamma_g, J_Omega);
  for (n in 1:N)
   mu_a[n] <- alpha[HUC[n]] + beta_x[n];
  z ~ multi_normal_cholesky(mu_a, N_Omega);
 }
}
generated quantities {
 matrix[D,D] Omega;
 matrix[D,D] OmegaHUC;
 Omega <- multiply_lower_tri_self_transpose(N_Omega);
 OmegaHUC <- multiply_lower_tri_self_transpose(J_Omega);
}

###Sim multilevel dataset####

library(MASS) # for mvrnorm()

K <- 5; #number of betas (site predictors)
L <- 1; #number of gamma (group predictors)
D <- 10; #number of responses (species)
n <- 10; #number of samples per group
J <- 30; #number of groups
N <- n*J; #total sample size per D
HUC <- as.numeric(as.factor(rep(1:J,each=n))) #group indicator

# stick-breaking construction to generate random correlation matrices
N_Omega <- matrix(0,D,D);
N_Omega[1,1] <- 1;
for (i in 2:D) {
  bound <- 1;
  for (j in 1:(i-1)) {
    N_Omega[i,j] <- runif(1, -sqrt(bound), sqrt(bound));
    bound <- bound - N_Omega[i,j]^2;
  }
  N_Omega[i,i] <- sqrt(bound);
}

J_Omega <- matrix(0,D,D);
J_Omega[1,1] <- 1;
for (i in 2:D) {
  bound <- 1;
  for (j in 1:(i-1)) {
    J_Omega[i,j] <- runif(1, -sqrt(bound), sqrt(bound));
    bound <- bound - J_Omega[i,j]^2;
  }
  J_Omega[i,i] <- sqrt(bound);
}

OmegaN <- N_Omega %*% t(N_Omega);
OmegaJ <- J_Omega %*% t(J_Omega); 

x <- matrix(rnorm(N * K, 0, 1), N, K);
g <- cbind(rep(1, J), matrix(rnorm(J * L, 0, 1), J, L)); # intercept column included, so pass ncol(g) = L + 1 as L to Stan
beta <- matrix(rnorm(D * K, 0, 1), D, K);
gamma <- matrix(rnorm(D * (L + 1), 0, 1), D, L + 1);

alpha <- matrix(NA, J, D);
a_k <- matrix(NA, N, D);
z <- matrix(NA, N, D);

for (j in 1:J){
  alpha[j,] <- mvrnorm(1,g[j,] %*% t(gamma),OmegaJ)
}

for (d in 1:D){
  a_k[,d] <- kronecker(alpha[,d],rep(1,n))
}

for (i in 1:N) {
  z[i,] <- mvrnorm(1, a_k[i,]+(x[i,] %*% t(beta)), OmegaN);
}

y <- matrix(0, N, D);
for (i in 1:N) {
  for (d in 1:D) {
    y[i,d] <- ifelse((z[i,d] > 0), 1, 0);
  }
}

##################
####check sims####
##################
library(coda)      # for as.mcmc()
library(coefplot2) # for coefplot2()
laRE <- extract(MVPrefit, inc_warmup = FALSE, permuted = TRUE)

coefplot2(as.mcmc(laRE$beta[,,5]),intercept=T,
          varnames=colnames(y),main="beta5")
for (i in 1:dim(laRE$beta)[2]){
  points(x=beta[i,5],y=i,col='red')
}

coefplot2(as.mcmc(laRE$gamma[,,2]),intercept=T,
          varnames=colnames(y),main='gamma2')
for (i in 1:dim(laRE$gamma)[2]){
  points(x=gamma[i,2],y=i,col='red')
}

coefplot2(as.mcmc(laRE$Omega[,,1]),intercept=T,
          varnames=colnames(y),xlim=c(-1,1),main="OmegaN10")
for (i in 1:dim(laRE$Omega)[2]){
  points(x=OmegaN[i,1],y=i,col='red')
}

coefplot2(as.mcmc(laRE$OmegaHUC[,,1]),intercept=T,
          varnames=colnames(y),xlim=c(-1,1),main="OmegaJ10")
for (i in 1:dim(laRE$OmegaHUC)[2]){
  points(x=OmegaJ[i,1],y=i,col='red')
}


MVprobitre.stan

Andrew Gelman

Mar 8, 2016, 4:12:52 PM
to stan-...@googlegroups.com
Hi, if n_eff is less than 10% of the number of draws then, yes, that means the chains are moving slowly. But if you’ve run 1000 iterations for each of 4 chains, that still gives you an n_eff of 400 or whatever, which should be fine for most practical purposes.
A

Bob Carpenter

Mar 8, 2016, 4:17:41 PM
to stan-...@googlegroups.com

> On Mar 8, 2016, at 11:19 AM, Roy Martin <royw...@gmail.com> wrote:
>
> ...

> front, and sorry if this is documented somewhere I haven't looked, but I am not sure what a reasonable number of warmup iter should be for a model of this complexity; or at what point I should give up on increasing warmup to improve the diagnostics.

You can see if the warmup iterations have converged in their
adaptation parameters (printed out in the CSV from CmdStan; I'm
sure you can get them somehow in RStan) and if you have found
the high mass volume of the posterior. And you can test with
the same seed running 200, 400, 800, 1600... iterations until
you get stability in the adaptation parameters. At that point,
you've run enough.

Having said all that, we usually just recommend half the iterations
for warmup because it's at most a factor of 2 off of optimal.
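
In RStan, a sketch of that doubling check might look like this (stan_data
and the seed are placeholders):

library(rstan)
# refit with the same seed while doubling warmup; when the adapted step size
# and mass matrix stop changing, warmup has run long enough
for (w in c(200, 400, 800, 1600)) {
  fit <- stan("MVprobitre.stan", data = stan_data, seed = 1234,
              chains = 1, warmup = w, iter = w + 100)
  cat("warmup =", w, "\n", get_adaptation_info(fit)[1], "\n")
}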

- Bob

Michael Betancourt

Mar 8, 2016, 4:28:37 PM
to stan-...@googlegroups.com
Yes, 400 effective samples is sufficient for most applications, but
if n_eff < 0.1 * N then I’m arguing that the n_eff estimator itself is dubious.

Andrew Gelman

Mar 8, 2016, 4:31:38 PM
to stan-...@googlegroups.com
Sure, it’s not perfect, but if R-hat is less than 1.1, I’m guessing that the chains have mixed OK. At least, back in the bad old days before NUTS, I’d see this sort of thing all the time, and the simulations were inefficient but eventually got there.

Michael Betancourt

Mar 8, 2016, 4:46:33 PM
to stan-...@googlegroups.com
In every controlled experiment I’ve done, n_eff < 0.1 * N has led
to atrocious behavior of the autocorrelation/n_eff estimators, even
in Stan and even with multiple chains. The estimators are just way
too noisy with such strong dependencies.
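
A toy version of that kind of experiment, using coda’s estimator on a
strongly autocorrelated AR(1) series (a sketch):

library(coda)
rho <- 0.99; N <- 4000                 # dependence strong enough that true n_eff < 0.1 * N
(1 - rho) / (1 + rho) * N              # true effective size for an AR(1) chain: ~20
neff_hat <- replicate(200, effectiveSize(as.numeric(arima.sim(list(ar = rho), n = N))))
quantile(neff_hat, c(0.1, 0.5, 0.9))   # the estimates scatter widely around the true value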

Bob Carpenter

Mar 8, 2016, 5:02:41 PM
to stan-...@googlegroups.com
It'd be really great if some of these stories could get
converted into reproducible case studies. Then we can
share them with skeptical users like Andrew.

- Bob


Roy Martin

Mar 8, 2016, 5:42:55 PM
to Stan users mailing list
Hi Bob, 

This certainly seems to be a critical piece of the puzzle I overlooked. I was following the "double the iterations" advice, but was unaware that I could actually check the warmup adaptation in that manner. It seems kind of obvious in hindsight. Thanks so much, again. I'll give that a try.

To Andrew and Michael, thank you as well for your responses. I was thinking that ~400 should be plenty of effective samples, but my concern was that, as I increased the number of samples with my toy data (even though Rhat was always < 1.1), n_eff seemed to improve a little here and there, as did the coverage of the credible intervals over the simulated parameters. I began thinking maybe warmup had not converged or been optimized, but I am just a couple of weeks into HMC/Stan and some details are still a little like sorcery to me, so I wasn't sure. As The Dude said, "This is a complicated case... lotta ins, lotta outs, lotta what-have-yous. And, uh, a lotta strands to keep in my head. Lotta strands in old Duder's head." In any case, I'm catching on, I think. I will follow Bob's advice to try to get a better handle on these behaviors and will report back.

Best,

-Roy

Bob Carpenter

Mar 8, 2016, 6:24:37 PM
to stan-...@googlegroups.com
It's largely our fault for not writing all this methodology up in
a single place. As usual in open source projects, we have
far more volunteers (and funding) for new features than we do for
maintenance, testing, or doc.

We're prioritizing writing up some of this basic methodology
(my first pass was in the latest case study), but we need to
get lower down into the guts of Stan and how to use it.

- Bob

Roy Martin

Mar 29, 2016, 7:03:48 PM
to Stan users mailing list
Hi again all,

I decided to go back to the single-level version of the MVP that Bob posted to the group earlier, to tinker and try to get a better understanding of how to optimize the multilevel version using that simpler model. In this case, I added a hyperprior to the beta term in Bob's model, such that beta is drawn from multi_normal_cholesky(beta_mu_K, diag_pre_multiply(beta_sigma_K, L_beta)), which follows loosely from a paper by Pollock et al. (2014; reference below if interested).

I found that when placing beta in the transformed parameters block (code for both versions below to compare), the results were generally similar to the naive or "un-optimized" version in terms of recovery of the simulated parameters and convergence diagnostics (e.g., Rhat, n_eff > 0.1 * N, etc.), but the run took much longer. The optimized version took ~3536 seconds for 4 chains x 1000 iterations, with adapt_delta = 0.95 and max_treedepth = 15 (the first run had divergent transitions and exceeded max treedepth a number of times). However, there were still 3 divergent transitions, and beta[2,2] had Rhat > 1.1 even after setting those options. On the other hand, the naive model took ~606 seconds for 4 x 1000 in the first run. That run had no issues with treedepth or divergent transitions, but left lp__ with Rhat > 1.1. So I ran it again at 4 x 2000, which took ~1050 seconds and converged in terms of Rhat. So, I'm a little confused as to whether I'm doing this optimization correctly. Could someone help me see the light here?
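
For reference, the runs above were launched along these lines (a sketch; the file and data object names are placeholders):

library(rstan)
fit <- stan("MVP_noncentered.stan", data = sim_data, chains = 4, iter = 1000,
            control = list(adapt_delta = 0.95, max_treedepth = 15))
print(fit, pars = "beta") # check Rhat and n_eff per parameter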

In any case, thanks again in advance. And Bob, regarding your last message (re: methodology documentation being scattered about), I have a hard time finding any serious complaints about this project's documentation at this point. And the specificity of the warnings/error messages the program generates is a breath of fresh air, especially for someone like me whose linear algebra can be clunky at times.

Relevant code for the "naive" parameterization, i.e., beta ~ multi_normal_cholesky(beta_mu_K, diag_pre_multiply(beta_sigma_K, L_beta)):

parameters {
 matrix[D,K] beta;
 vector[K] beta_mu_K;
 vector<lower=0>[K] beta_sigma_K;
 cholesky_factor_corr[D] L_Omega;
 cholesky_factor_corr[K] L_beta;
 vector<lower=0>[N_pos] z_pos;
 vector<upper=0>[N_neg] z_neg;
 }
transformed parameters {
 vector[D] z[N];
 for (n in 1:N_pos)
  z[n_pos[n], d_pos[n]] <- z_pos[n];
 for (n in 1:N_neg)
  z[n_neg[n], d_neg[n]] <- z_neg[n];
 }
model {
 L_Omega ~ lkj_corr_cholesky(4);
 L_beta ~ lkj_corr_cholesky(4);
 beta_mu_K ~ normal(0, 5);
 beta_sigma_K ~ cauchy(0, 2.5);
 { 
  vector[D] beta_x[N];
  for(d in 1:D)
   beta[d,] ~ multi_normal_cholesky(beta_mu_K, diag_pre_multiply(beta_sigma_K, L_beta));
  for (n in 1:N) 
   beta_x[n] <- beta * x[n];
  z ~ multi_normal_cholesky(beta_x, L_Omega);
  }
 }

(Not-so-) Optimized version:

parameters {
 matrix[K,D] beta_z;
 row_vector[K] beta_mu_K;
 vector<lower=0>[K] beta_sigma_K;
 cholesky_factor_corr[D] L_Omega;
 cholesky_factor_corr[K] L_beta;
 vector<lower=0>[N_pos] z_pos;
 vector<upper=0>[N_neg] z_neg;
 }
transformed parameters {
 vector[D] z[N];
 matrix[D,K] beta;
 for (n in 1:N_pos)
  z[n_pos[n], d_pos[n]] <- z_pos[n];
 for (n in 1:N_neg)
  z[n_neg[n], d_neg[n]] <- z_neg[n];
 beta <- rep_matrix(beta_mu_K, D) + (diag_pre_multiply(beta_sigma_K, L_beta) * beta_z)';
 }
model {
 L_Omega ~ lkj_corr_cholesky(4);
 L_beta ~ lkj_corr_cholesky(4);
 beta_mu_K ~ normal(0, 5);
 beta_sigma_K ~ cauchy(0, 2.5);
 to_vector(beta_z) ~ normal(0, 1); //implies beta ~ multi_normal_cholesky(beta_mu_K, diag_pre_multiply(beta_sigma_K, L_beta))
 { 
  vector[D] beta_x[N];
  for (n in 1:N) 
   beta_x[n] <- beta * x[n];
  z ~ multi_normal_cholesky(beta_x, L_Omega);
  }
 }


Pollock, L. J., et al. 2014. Understanding co-occurrence by modelling species simultaneously with a joint species distribution model (JSDM). Methods in Ecology and Evolution 5: 397-406.