Warm-up slow, or halted?


Stephen Martin

unread,
Mar 2, 2016, 2:48:19 PM3/2/16
to Stan users mailing list
I'm attempting to fit a fairly complicated model.

It's a way I came up with to assess the probability of differential item functioning (DIF) in IRT (and other) models in a Bayesian framework, with known OR unknown groups.
I initially built the model in JAGS, and it worked, but I want to convert most of my models to Stan.

I'm mostly interested in assessing the probability of each item's DIF in the unknown or latent group scenario.

In BUGS syntax, it's something like:

for(i in 1:N){
  for(j in 1:J){
    y[i,j] ~ dbern(p[i,j,d[j]])
    logit(p[i,j,1]) <- alpha[j,1] * (theta[i,k[i]] - diff[j,1])
    logit(p[i,j,2]) <- alpha[j,k[i]] * (theta[i,k[i]] - diff[j,k[i]])
  }
  k[i] ~ dcat(groupProb)
}
for(j in 1:J){
  d[j] ~ dcat(overallDIF)
}
overallDIF ~ ddirch(1,1)
groupProb ~ ddirch(1,1)  # cluster mixing weights (the original sketch reused 'alpha' here, but that name is already taken by the discriminations)

Basically, the model simultaneously estimates the IRT parameters, group memberships, and the probability that each item exhibits DIF.
More simply, it's finding a combination of items that best splits the data into two clusters.

I translated this model into Stan in the best way I know how, and that's at the bottom.

The problem is that warmup doesn't seem to actually *occur*, or it's taking WAY too long.
That is, I start 4 chains on the model, one or two of the chains might make progress in warmup and will finish sampling thereafter.
The other chains will never progress past 0%, or at least after an hour they made no progress.

This seems random: sometimes one chain will work, sometimes two.
*Once* all four worked, and the inference was correct. The issue seems to arise in the very initial warmup stage.

What does this suggest about the model? How could I improve sampling?
Things I've tried:
1) Parameterizing p2 as a real<lower=0,upper=1> with a beta prior, instead of using a simplex with a Dirichlet prior. This made no difference.
2) Using alpha and diff estimates from the ltm IRT package as initial values. This made no difference.

The code:

data {
  int N;         //Sample size
  int J;         //Number of items
  int y[N,J];    //Dichotomous responses
  int jOrder;    //Which item difficulty to order-constrain, for ident
}

parameters {
  simplex[2] delta[J];     //DIF or not
  simplex[2] p2[N];        //Mixture component

  vector[J] alpha_s[2,2];  //[Delta, Kappa, Item]
  vector[J] diff_s[2,2];   //[Delta, Kappa, Item]
  ordered[2] diff_s1;      //Ordered difficulty constraint, for mixture ident
  vector[N] theta[2];      //Theta varies across mixture components

  simplex[2] Delta;        //Overall DIF
}

transformed parameters {
  vector[J] alpha[2,2];
  vector[J] diff[2,2];
  alpha <- alpha_s;
  alpha[1,1] <- alpha[1,2];       //Constrain non-DIF alphas to be equal across components

  diff <- diff_s;
  diff[1,1] <- diff[1,2];         //Constrain non-DIF difficulties to be equal across components
  diff[2,1,jOrder] <- diff_s1[1]; //Replace jOrderth item with ordered difficulty
  diff[2,2,jOrder] <- diff_s1[2];
}

model {
  //Declarations
  real yhat[J,N,2,2];    //Estimate for item j, person n, on the non-DIF or DIF version, in cluster 1 or 2
  real lp_jndk[J,N,2,2]; //Auxiliary variables for holding likelihood parts
  real lp_jnd[J,N,2];
  real lp_jn[J,N];
  real lp_j[J];

  //Priors
  ////IRT params
  alpha[1,1] ~ normal(0,1);
  alpha[2,1] ~ normal(0,1);
  alpha[2,2] ~ normal(0,1);

  diff[1,1] ~ normal(0,1);
  diff[2,1] ~ normal(0,1);
  diff[2,2] ~ normal(0,1);

  ////Mixture probabilities
  for(n in 1:N){
    p2[n] ~ dirichlet(rep_vector(.5,2));
  }

  ////Latent abilities, per cluster
  for(k in 1:2){
    theta[k] ~ normal(0,1);
  }

  ////Deltas
  for(j in 1:J){
    //delta[j] ~ dirichlet(rep_vector(.5,2));
    delta[j] ~ dirichlet(Delta);
  }

  Delta ~ dirichlet(rep_vector(1.0,2)); //May be unnecessary

  //Likelihoods
  for(j in 1:J){
    for(n in 1:N){
      for(d in 1:2){
        for(k in 1:2){
          yhat[j,n,d,k] <- alpha[d,k,j] * (theta[k,n] - diff[d,k,j]);
        }
      }
    }
  }

  for(j in 1:J){
    for(n in 1:N){
      for(d in 1:2){
        for(k in 1:2){
          //Likelihood for item j, person n, on the non-DIF or DIF version of the item, weighted by cluster membership
          lp_jndk[j,n,d,k] <- log(p2[n,k]) + bernoulli_logit_log(y[n,j], yhat[j,n,d,k]);
        }
        //Likelihood for person n, marginalized over cluster membership, weighted by DIF status d
        lp_jnd[j,n,d] <- log(delta[j,d]) + log_sum_exp(lp_jndk[j,n,d]);
      }
      //Likelihood for person n on item j, marginalized over DIF status
      lp_jn[j,n] <- log_sum_exp(lp_jnd[j,n]);
    }
    //Sum of the N per-person log likelihoods for item j
    lp_j[j] <- sum(lp_jn[j]);
  }
  //Increment
  increment_log_prob(sum(lp_j));
}

R Code:

#Libraries#
library(sirt)
library(rstan)
library(ltm)
options(mc.cores=parallel::detectCores())
rstan_options(auto_write=TRUE)

#Simulate heterogenous irt data
N <- 200
J <- 40

thetas.hetero <- list(rnorm(N/2),rnorm(N/2))
bs.hetero <- list(rnorm(J),rnorm(J))
as.hetero <- list(runif(J,.5,1.2),runif(J,.5,1.2))
ds.hetero <- sim.raschtype(thetas.hetero[[1]],b=bs.hetero[[1]],fixed.a=as.hetero[[1]])
ds.hetero <- rbind(ds.hetero,sim.raschtype(thetas.hetero[[2]],b=bs.hetero[[2]],fixed.a=as.hetero[[2]]))
colnames(ds.hetero) <- paste0('item',1:J)

#Start stan with initial values of alpha/difficulty. This doesn't make a difference.
#Find initial values for some parameters -- in this case, discrimination/alpha
ltmCoef <- coef(ltm(ds.hetero ~ z1)) ## Obtain ltm's irt estimates.
alphaInit <- array(dim = c(2,2,J))
alphaInit[1,1,1:J] <- ltmCoef[,1]
alphaInit[1,2,1:J] <- ltmCoef[,1]
alphaInit[2,1,1:J] <- ltmCoef[,1]
alphaInit[2,2,1:J] <- ltmCoef[,1]
diffInit <- array(dim = c(2,2,J))
diffInit[1,1,1:J] <- ltmCoef[,2]
diffInit[1,2,1:J] <- ltmCoef[,2]
diffInit[2,1,1:J] <- ltmCoef[,2]
diffInit[2,2,1:J] <- ltmCoef[,2]

dif.heteroOut <- stan(model_code=irt.2pl.DIF, data = list(y=ds.hetero,N=N,J=J,jOrder=1),
                      pars=c('theta','alpha','diff','delta','Delta'),
                      init=function(){
                          list(alpha_s=alphaInit,diff_s=diffInit)
                      }, verbose=TRUE)


Stephen Martin

unread,
Mar 2, 2016, 4:57:57 PM3/2/16
to Stan users mailing list
As an update to this:
1) I realized I didn't constrain alpha to be positive, so now it's <lower=0>. This hasn't affected warmup, but it did give proper inference about the latent abilities.
2) I added refresh=1, and indeed, warmup *is* going, but excruciatingly slowly. Some chains take off and do fine, but others will be at iteration 20 when a working chain is at 1600.
...

Krzysztof Sakrejda

unread,
Mar 3, 2016, 7:27:46 AM3/3/16
to Stan users mailing list
It means that HMC is happily simulating a path in some direction in parameter space and not making a U-turn (so it never meets the stopping criterion for an iteration). This can happen either because the posterior is genuinely unconstrained or very heavy-tailed in one direction in parameter space, or because the initially chosen step size, which is adjusted multiple times during warmup, is temporarily adapted to a narrow part of the posterior and then sampling escapes to a much larger region. However you think of it, it's diagnosing a model problem for you. Do the models that complete have warnings about hitting max tree depth?

Krzysztof

Michael Betancourt

unread,
Mar 3, 2016, 7:36:59 AM3/3/16
to stan-...@googlegroups.com
Or the posterior probability concentrates around surfaces
that twist around the parameter space, and you need both
small step sizes and many, many steps to explore those
neighborhoods sufficiently well. In either case you should
see the step size getting small and warnings about the
tree depth limit. The only difference will be that in complex
models the step size will bottom out at some value while
in bad warmup or pathological models the step size will
converge all the way to zero.
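
As a rough sketch of how to check which case you are in (assuming `fit` is the stanfit object and warmup draws were kept), plotting the adapted step size per chain across warmup is usually enough:

library(rstan)
# One matrix of sampler diagnostics per chain, including warmup iterations.
sp <- get_sampler_params(fit, inc_warmup = TRUE)
# iterations x chains matrix of step sizes; a curve that flattens at a small
# nonzero value suggests genuinely hard geometry, while one that keeps shrinking
# toward zero suggests a pathological region or failed warmup.
stepsizes <- sapply(sp, function(ch) ch[, "stepsize__"])
matplot(stepsizes, type = "l", log = "y",
        xlab = "iteration", ylab = "step size (log scale)")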


Stephen Martin

unread,
Mar 3, 2016, 10:23:17 AM3/3/16
to stan-...@googlegroups.com
The models that completed did not have any warnings. They also only take ~ 5 minutes.

On the other hand, the models that get slowed during warmup have never finished, because after about an hour, they were only at iteration 80.
The chains that get slowed do seem to 'start' quickly. They'll blow through iterations 1-15 quickly, then take a very long time per iteration after that. I've never let one of those finish, because I think it would literally take at least a day for only 2000 iterations and N=200, and that's on a fairly good machine (Skylake i5, 8 GB RAM).

Is there some way of prematurely ending the sampling (but still saving the iterations up to that point), so I can look at the chains that are taking an eternity and examine their samples?

I started another 4-chain -- 2 are working fine. 2 are sampling so very slowly. The two slow chains are currently at 12, while the 2 working chains are at 1000. I don't think I can feasibly wait for the slow chains to actually finish.

Michael Betancourt

unread,
Mar 3, 2016, 10:30:50 AM3/3/16
to stan-...@googlegroups.com
Rstan swallows warnings emitted during warmup.  You’d have to
use the “get_sampler_params” function to get at n_divergent,
step size, and tree_depth by hand.
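
For example, a minimal sketch (assuming `fit` is the stanfit object; older rstan versions name the divergence column "n_divergent__" rather than "divergent__", hence the grep):

library(rstan)
sp <- get_sampler_params(fit, inc_warmup = FALSE)              # one matrix per chain, post-warmup
div_col <- grep("divergent", colnames(sp[[1]]), value = TRUE)  # matches either column name
sapply(sp, function(ch) sum(ch[, div_col]))                    # divergent transitions per chain
sapply(sp, function(ch) max(ch[, "treedepth__"]))              # deepest trees per chain
sapply(sp, function(ch) mean(ch[, "stepsize__"]))              # adapted step size per chain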

Also, I hope you are actually referring to different models here and
not different chains targeting the same model.  Interchain variance
of performance is indicative of bad model fits, even for the seemingly
“fast” chains.


Stephen Martin

unread,
Mar 3, 2016, 10:37:41 AM3/3/16
to Stan users mailing list
No, same model, different chains.

The issue is that *when* the chains all perform well (and thus, the model finishes), the inference is correct.
There is still some between-chain variance on some parameters, but for the parameters that matter, the inference is correct (and mimics that of jags).

I'm not sure what I can do to ease along the slow chains.
It doesn't seem to depend much on the data. I can generate DIF or non-DIF data randomly, and there will be at least 1 chain that is extremely slow.

Michael Betancourt

unread,
Mar 3, 2016, 10:52:04 AM3/3/16
to stan-...@googlegroups.com
That is a VERY strong indication that there are pathological
neighborhoods in your parameter space: the other
chains aren't getting the right answer, they're just
avoiding this neighborhood and giving biased answers.
This is one of the very important reasons why running
multiple chains matters.

You'll probably have to reparameterize or otherwise
tweak your model to avoid such pathologies, assuming
that you can track down the source of the pathology,
which in and of itself can be difficult.

Stephen Martin

unread,
Mar 3, 2016, 11:03:15 AM3/3/16
to Stan users mailing list
Well, "right" answer in the sense that I generated the data and know what the actual parameters are, and the recovered parameters correlate highly with the actual parameters in the chains that perform quickly.

So there is no way of diagnosing chains until they fully finish, correct? I can't simply kill the stan process for the remaining chains and examine the iterations up to the point of killing it?

Could I run, say, 200 iterations in only warmup mode and look at pairs() for the warmup to see where the pathology might be, or do I really need to wait ~ 16 hours for 2000 iterations?

Krzysztof Sakrejda

unread,
Mar 3, 2016, 11:14:29 AM3/3/16
to Stan users mailing list


On Thursday, March 3, 2016 at 11:03:15 AM UTC-5, Stephen Martin wrote:
Well, "right" answer in the sense that I generated the data and know what the actual parameters are, and the recovered parameters correlate highly with the actual parameters in the chains that perform quickly.

Un/fortunately, with MCMC the algorithm is exploring a space, so this is telling you that a corner of that space is hard for the algorithm to explore.  Maybe it's not part of the main mass of the distribution, but it may well be relevant for calculating credible intervals (e.g., if you throw out chains that fail because they run into that corner of parameter space, you are biasing your credible intervals to be narrower, and you'll end up publishing on the dangers of Herricanes).
 

So there is no way of diagnosing chains until they fully finish, correct? I can't simply kill the stan process for the remaining chains and examine the iterations up to the point of killing it?

There's an option in rstan to write to csv and maybe that would do a streaming write(?) Not sure, Ben would know.  If you're willing to run it in CmdStan that writes the iterations as it goes and you can read in the partial .csv to see where the slow chains are going.
 

Could I run, say, 200 iterations in only warmup mode and look at pairs() for the warmup to see where the pathology might be, or do I really need to wait ~ 16 hours for 2000 iterations?

You could run a dozen chains for 200 iterations (100 warmup) and see if that lets you visualize the problem as is in rstan.  
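
A sketch of that suggestion, with the data object name standing in for the poster's actual list; sample_file makes rstan stream each chain's draws to disk, so a stalled chain's partial output can still be inspected:

library(rstan)
fit_short <- stan(model_code = irt.2pl.DIF,   # the poster's model string
                  data = stan_data,           # placeholder for the data list used above
                  chains = 12, iter = 200, warmup = 100,
                  refresh = 1,
                  sample_file = "short_chains.csv")
pairs(fit_short, pars = c("Delta", "diff_s1", "lp__"))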

Michael Betancourt

unread,
Mar 3, 2016, 11:32:02 AM3/3/16
to stan-...@googlegroups.com
The only problem with running short chains is that the adaptation
itself adapts to the number of warmup samples, so adaptation for
200 samples is not the same as adaptation for 1000 samples.  I’d
recommend using CmdStan and examining the samples as they
are streamed to the CSV — it’s one of the first things I try when
tackling a particularly hard model.

Stephen Martin

unread,
Mar 3, 2016, 11:32:43 AM3/3/16
to Stan users mailing list
I'll try that out today at some point. I'll look into the CSV option, then CMDstan (not packaged for arch currently, but I could always become the maintainer of that package), then finally try the iteration approach.

Actually, first I'll try reparameterizing. I *think* using the IRT formulation of alpha*(theta - diff) could lead to a funnel, since difficulty itself is actually defined as a function of alpha. I'll try using the intercept + alpha*theta formulation and see whether the pathology goes away.

Michael Betancourt

unread,
Mar 3, 2016, 11:45:04 AM3/3/16
to stan-...@googlegroups.com
Yup — naive IRT implementations can suffer from
all kinds of fitting problems.  Non-centering and
otherwise careful use of priors is often necessary.

Stephen Martin

unread,
Mar 3, 2016, 12:39:25 PM3/3/16
to Stan users mailing list
Well, I was hopeful. I just changed the parameterization to use yhat <- b0 + alpha*theta instead of yhat <- alpha*(theta - difficulty), thinking that because difficulty itself has alpha in it there may be a funneling problem.
I ran the model and all 4 chains took off, no warmup problem (very high Rhats, but that's another issue I think).

Then I ran it again, with the same data, and hit the same problem -- Slow warmup on all chains.

I'll try CMDstan soon. I have to get it compiled and packaged on arch linux first.

Stephen Martin

unread,
Mar 3, 2016, 3:07:47 PM3/3/16
to Stan users mailing list
Alright. Using rstan and writing the output to a file worked as well as cmdstan would have, I think.

I ran 4 chains, 3 halted and all seemed to have halted around the same time.
I'm uploading two csvs. csv1 is a halted chain and csv2 is a chain that worked. I truncated all csvs to ~63 samples (the smallest achieved for the halted chains after around 30 minutes or so) so they could all be imported as one stan object.
I then modified the CSVs to reflect the number of iterations and changed warmup to 0, so the plot function would work with it.
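
For reference, a sketch of pulling streamed files like these back into rstan (assuming the truncated CSVs keep their header rows; the file names are the attachments below):

library(rstan)
fit_partial <- read_stan_csv(c("samples.csv1", "samples.csv2"))
traceplot(fit_partial, pars = "lp__", inc_warmup = TRUE)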

Indeed -- The halted chains have stepsize ever decreasing toward 0 with maximum leapfrog steps.

I have no idea how I could reparameterize the model at this point.

I have tried looking at pairs() for several variables (alpha/diff pairings, deltas), with chains 1,3, and 4 (the halted chains) on the lower diagonal and 2 (the working chain) on the upper.
Although their sampling is in different areas, nothing apparent appears to me.

Looking at the trace plot, you can certainly see that the working chain is exploring the space, while the other chains have basically a straight line right around the point the step size drops.

What other diagnostics should I examine?
samples.csv1
samples.csv2

Bob Carpenter

unread,
Mar 3, 2016, 3:18:05 PM3/3/16
to stan-...@googlegroups.com

> On Mar 3, 2016, at 12:39 PM, Stephen Martin <hwki...@gmail.com> wrote:
> I'll try CMDstan soon. I have to get it compiled and packaged on arch linux first.

There's no packaging, per se. The makefile builds a couple
executables, then builds an executable for each Stan program
you compile. The only thing you need is make and a not-antiquated
C++ compiler. Basically, if you can run RStan, you probably already
have all the tools you need installed.

- Bob

Stephen Martin

unread,
Mar 3, 2016, 3:31:35 PM3/3/16
to Stan users mailing list
I realized that early on into my packaging attempt (that /said/, I could probably still make a package with some shell scripts that would let you compile a model without having to compile it within the source tree).

The point is moot anyway, since I instead used rstan to save csvs of the output, as attached above.

Krzysztof Sakrejda

unread,
Mar 3, 2016, 3:42:57 PM3/3/16
to Stan users mailing list


On Thursday, March 3, 2016 at 3:07:47 PM UTC-5, Stephen Martin wrote:
> I have tried looking at pairs() for several variables (alpha/diff pairings, deltas), with chains 1,3, and 4 (the halted chains) on the lower diagonal and 2 (the working chain) on the upper.
> Although their sampling is in different areas, nothing apparent appears to me.


I just did this for a loop through all the variables (in groups of 5), in addition to step size and log density.  It's clear that in a certain region of parameter space, the log density skyrockets (-5000 -> -1000 or so) and the step size plummets.  As a next step it would be useful to know if: 1) that region of parameter space is ever explored by the other chain; 2) if so, are the hyper-parameters different between good and bad chains; 3) can hyper parameter priors be tightened up to make that region go away... I'm not familiar with these models so I'm not going to guess as to what the problem might be.  

Krzysztof

Stephen Martin

unread,
Mar 3, 2016, 4:03:55 PM3/3/16
to Stan users mailing list
Is there any way of assessing 1)?
That is, is there a way of finding the region of parameter space that the halting chains are hitting?
In the trace plots, there are several variables where the halting chains are similar, so is there a way to automatically find the parameters where the halting chains are fairly similar?

Krzysztof Sakrejda

unread,
Mar 3, 2016, 4:18:49 PM3/3/16
to Stan users mailing list

I usually go for making a lot of plots first, starting with hyper-parameters.  It's not hard to automate the plot generation.  Just plot lp and step size against a few other things for each iteration of a loop.  The reason to focus on lp/step size vs. hyper parameters is that it gives you an easy knob to turn should you find that they are a problem.  An alternative would be just to turn the knob and see if it makes the model behave.  Once you make the model behave you can see how far you can relax the hyper priors before seeing problems again.
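
A rough sketch of such a loop (assuming `fit` is the stanfit object; every parameter gets plotted against lp__ and the adapted step size, chain by chain):

library(rstan)
draws <- extract(fit, permuted = FALSE, inc_warmup = FALSE)  # iterations x chains x parameters
sp    <- get_sampler_params(fit, inc_warmup = FALSE)         # one diagnostic matrix per chain
for (ch in seq_len(dim(draws)[2])) {
  for (par in dimnames(draws)$parameters) {
    plot(draws[, ch, par], draws[, ch, "lp__"],
         xlab = par, ylab = "lp__", main = paste("chain", ch))
    plot(draws[, ch, par], sp[[ch]][, "stepsize__"],
         xlab = par, ylab = "step size", main = paste("chain", ch))
  }
}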

That said it sounds (from Michael) like IRT models have these problems and somebody else here should be able to suggest a more reasonable parameterization (?)

Krzysztof

Stephen Martin

unread,
Mar 3, 2016, 4:22:35 PM3/3/16
to Stan users mailing list
I'll try plotting a bit more soon.  It seems like P2 and Delta are two influential hyperparameters. When I set those in the init, more of the chains seem to hit the non-problematic area of the lp. So, I'll try focusing more on those.

The issue here isn't the IRT model, per se.
This is a really weird, even if valid, model.

It *is* an IRT model, but it's a mixture irt model that allows some of the items to differentiate between the clusters and other items to not.
The goal is to find the combination of items that 'best split apart' individuals into clusters.
Like I said, it works in jags, and by the nature of conditional sampling, the chains aren't hitting these weird warmup issues or problematic spots.

Michael Betancourt

unread,
Mar 3, 2016, 4:27:23 PM3/3/16
to stan-...@googlegroups.com
I am highly confident that JAGS is missing these pathological regions
entirely and actually giving biased estimates.  It’s something we see
over and over again.

Bob Carpenter

unread,
Mar 3, 2016, 4:34:07 PM3/3/16
to stan-...@googlegroups.com
It would be great if we had a concrete example of this.
If you can give me the models, I can make a knitr example
doc and add it to our case studies.

Also, if you do the pairs plots in RStan, the
divergent transitions are highlighted.
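
For instance, something like this (a sketch; `fit` is assumed to be the stanfit object, and the parameters are a few from the model above):

library(rstan)
# rstan's pairs() method highlights iterations with divergent transitions,
# which helps locate the region of parameter space that triggers them.
pairs(fit, pars = c("Delta", "diff_s1", "lp__"))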

The biggest issue with the IRT 2PL models is identifiability.
You need to constrain the discrimination to be positive and
then constrain the abilities to be unit normal --- that will
give you the right degrees of freedom for the scale. But
if you do

delta[j] * (alpha[i] - beta[j])

you run into the nasty delta[j] * beta[j] product term that
tends to produce banana-like posteriors. So it indeed helps to
rewrite as

delta[j] * alpha[i] - beta_delta[j]

where beta_delta[j] is a new parameter.
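
A minimal sketch of that parameterization in a plain (non-mixture) 2PL, using Bob's variable names; the long-form data layout with ii/jj index arrays and the particular priors are illustrative assumptions, not the poster's model:

twopl_code <- "
data {
  int<lower=1> I;                  // items
  int<lower=1> J;                  // persons
  int<lower=1> N;                  // observations
  int<lower=1,upper=I> ii[N];      // item for observation n
  int<lower=1,upper=J> jj[N];      // person for observation n
  int<lower=0,upper=1> y[N];
}
parameters {
  vector<lower=0>[I] delta;        // discrimination, constrained positive
  vector[I] beta_delta;            // discrimination * difficulty as a single parameter
  vector[J] alpha;                 // abilities
}
model {
  alpha ~ normal(0, 1);            // unit-normal abilities fix the scale
  delta ~ lognormal(0, 1);
  beta_delta ~ normal(0, 5);
  y ~ bernoulli_logit(delta[ii] .* alpha[jj] - beta_delta[ii]);
}
"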

Then, if the data's separable (as in there are questions every
student got right or wrong, or students who got every question right
or wrong), the prior's the only thing keeping the alpha or beta
from marching off to infinity.

Finally, if there's hierarchical structure and small numbers of
counts per group, you want to use the non-centered parameterizations.

- Bob

Stephen Martin

unread,
Mar 3, 2016, 4:36:38 PM3/3/16
to Stan users mailing list
I'm sure you're correct --- It's part of the reason I'm using Stan.
Even if jags gives more consistent results for mixture models in particular, it's largely because it fails to adequately explore the parameter space.

That said -- The model wasn't inferentially incorrect. The parameters recovered were generally correct (given that I know the true parameters).
The issue with this model in Stan isn't that it's finding multiple solutions (that, I'm ok with. Multimodality is expected when several adequate solutions exist). My issue with it is that it's hitting areas that are difficult *to sample* and is causing the sampler to run for 16 hours until completion.  To some degree, if I know the non-difficult areas yield valid inferences, I'd rather just skip the problematic areas in some way.
That is, the pathology of the model seems to affect the sampling performance moreso than the /inference/, which is all I really care about.

Michael Betancourt

unread,
Mar 3, 2016, 4:55:41 PM3/3/16
to stan-...@googlegroups.com
If the pathological neighborhoods didn’t affect inferences then
the Markov chain wouldn’t be exploring them!  You may be
recovering the means okay but the poor fits are most likely
leading to variances that are biased low.

Bob Carpenter

unread,
Mar 3, 2016, 5:00:05 PM3/3/16
to stan-...@googlegroups.com
Do you have an example where JAGS works better than
Stan for mixture models?

- Bob

Krzysztof Sakrejda

unread,
Mar 3, 2016, 5:02:25 PM3/3/16
to Stan users mailing list


On Thursday, March 3, 2016 at 4:36:38 PM UTC-5, Stephen Martin wrote:
> My issue with it is that it's hitting areas that are difficult *to sample* and is

This:

> causing the sampler to run for 16 hours until completion.  To some degree, if I know the non-difficult areas yield valid inferences, I'd rather just skip the problematic areas in some way.

Priors (hopefully weak ones on the hyperparameters, or at least ones you have good prior knowledge of) are how you decide, in a mathematically coherent way, to skip problematic areas.  HMC really rewards (more) fully Bayesian modeling.

Stephen Martin

unread,
Mar 3, 2016, 5:09:19 PM3/3/16
to Stan users mailing list
I understand that, and I've been tweaking the model ever so much to try and skip such areas. I'll try some other priors and see whether anything can be fixed.

Stephen Martin

unread,
Mar 3, 2016, 5:16:56 PM3/3/16
to Stan users mailing list
I can probably dig some up. I have a ton of jags (and now stan) files where I've played with mixture modeling.

I typically generate the data myself (so I know the parameter values), fit the model, then determine how well the model assigned individuals to clusters (and cluster parameters).

I'm not sure I could say jags "works better", it just works "more consistently", in that the chains typically find the same spot of parameter space and it converges quickly.
Stan is better at finding various solutions, and so it yields multimodal posteriors. This isn't a bad thing, it's just less consistent. I'm not opposed to this, and usually the inferences aren't really affected by it very much.
Both of them generally estimate cluster memberships well (after accounting for any between-chain label switching, which still occasionally happens despite some order constraints).

The meta-mixture model I posted on earlier just flops in both of them. Jags never leaves the starting model, and Stan basically finds a unique solution per-chain, so it's not even 'multi-modal', so much as there are as many solutions as there are chains; sometimes the posterior is practically flat as a result.

I haven't quantitatively compared Stan and JAGS in their inferential accuracy for mixture models. In time, I probably will, and I can determine which sampler works better in which conditions. My guess is both will perform well, but Stan will continue to find annoying points in the parameter space that don't change inference much but do take 16 hours (as per today).

Bob Carpenter

unread,
Mar 3, 2016, 5:23:04 PM3/3/16
to stan-...@googlegroups.com
Can you attach the whole Stan program you're using?
If you marginalize out the mixture responsibilities, Stan
should mix much better than JAGS. It has in every problem
I've looked at in the past.

I'd also suggest using a simple beta prior rather than a
Dirichlet if there are only two outcomes.

- Bob


Stephen Martin

unread,
Mar 3, 2016, 6:25:24 PM3/3/16
to Stan users mailing list
I just converted the model to using real<lower=0,upper=1> probabilities with beta priors, and I actually think that fixed the issue.
I just ran it 4 times in a row, and no chains had the same issue as before.  I have a feeling that, although I did try using real probs before, I may not have changed the priors to actually use beta instead of dirichlet.

I can still attach the full stan model. It's currently a mess of comments, and the likelihood is *ugly* (but correct).

The data were generated using sirt, using:

library(sirt)

N <- 200
J <- 40

# Simulate heterogenous data
thetas.hetero <- list(rnorm(N/2),rnorm(N/2))
bs.hetero <- list(rnorm(J),rnorm(J))
as.hetero <- list(runif(J,.5,1.2),runif(J,.5,1.2))
ds.hetero <- sim.raschtype(thetas.hetero[[1]],b=bs.hetero[[1]],fixed.a=as.hetero[[1]])
ds.hetero <- rbind(ds.hetero,sim.raschtype(thetas.hetero[[2]],b=bs.hetero[[2]],fixed.a=as.hetero[[2]]))
colnames(ds.hetero) <- paste0('item',1:J)

dif.heteroOut <- stan(
    model_code = irt.2pl.DIF,
    data = list(
        y = ds.hetero,
        N = N,
        J = J,
        jOrder = 1,
        item_indep = 0
    ),
    verbose = TRUE,
    chains = 4,
    refresh = 1,
    sample_file = 'samples.csv',
    control = list(max_treedepth = 12)
)

So basically, I generated two subgroups whose irt parameters are different, and I'm estimating, simultaneously, the probability of each individual belonging to each cluster and the probability that each item's parameters should differ between such clusters.
It'll generally find items that best wedge the individuals into two groups.

I need to do further testing (different datasets, real world dataset, less extreme heterogeneity), but stan is giving roughly the same output as jags now.
No sampling issues in my latest runs. The issue must have been the dirichlet prior, because the beta prior fixed it.
irt.2pl.dif.stan

Bob Carpenter

unread,
Mar 3, 2016, 6:35:04 PM3/3/16
to stan-...@googlegroups.com
It should be doing the same thing, with

(theta1, theta2) ~ Dirichlet((alpha, beta))

[where theta2 = 1 - theta1 by definition in a simplex]

having the same density as

theta1 ~ Beta(alpha, beta)

Even the simplex transform is the same as the <lower=0,upper=1>
transform for a 2-simplex.
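
A quick way to check that equivalence empirically, as a sketch (toy models with no data, so both posteriors are just the priors; theta[1] should match theta1):

library(rstan)
m_simplex <- "parameters { simplex[2] theta; } model { theta ~ dirichlet(rep_vector(0.5, 2)); }"
m_beta    <- "parameters { real<lower=0,upper=1> theta1; } model { theta1 ~ beta(0.5, 0.5); }"
print(stan(model_code = m_simplex), pars = "theta")
print(stan(model_code = m_beta), pars = "theta1")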

- Bob

Stephen Martin

unread,
Mar 3, 2016, 7:09:57 PM3/3/16
to Stan users mailing list
Indeed, however I have now run it 10 times with varying data each time, 4 chains each, and under the beta parameterization, the chains are not getting stuck.

Under the dirichlet parameterization, chains were stuck more often than not.

I understand that in my case, the two *should* be the same. So why would changing it to beta allow it to work better?
Does it have anything to do with the use of a simplex as opposed to a single parameter?

Bob Carpenter

unread,
Mar 3, 2016, 7:18:04 PM3/3/16
to stan-...@googlegroups.com
That's what I was wondering, but it really shouldn't matter unless
somehow the simplex was failing a constraint somewhere. But if that
was happening, you should be getting lots of warning messages.

- Bob

Bob Carpenter

unread,
Mar 3, 2016, 7:29:03 PM3/3/16
to stan-...@googlegroups.com
A simple example would be good.

Are you using diffuse random inits in JAGS? As far as I know,
you need to do that yourself (or I may be confusing it with BUGS).

The diffuse initialization will help diagnose any underlying
multimodality in your model.

The fact that Stan actually explores your posterior should
be considered a good thing! You can't do proper Bayesian inference
on only a subset of the posterior (though you can do approximate
Bayes that way and maybe get a good enough answer to an applied
problem).

- Bob

Stephen Martin

unread,
Mar 4, 2016, 11:25:19 AM3/4/16
to Stan users mailing list
I never really messed with the inits of jags; from what I could tell, jags randomly generates inits, but I don't really know how they do it.

And right -- I'm glad stan better explores the parameter space, even if it does result in multimodal posteriors.
That sort of thing is expected, I think, with highly complex models. There can be several high probability solutions to a complex model.

I do have a question though. Most of the time when I run a mixture model in stan, I see fairly severe n_divergent warnings at the end.
I know that those are undesirable, and I know that to 'get rid of them' one should set adapt_delta to a higher value, but what exactly *is* a divergent transition? I couldn't find it in the manual, and google didn't help much either.
Again, despite the divergences, the inference seems to be correct (in my weird mixture-DIF with items of varying DIF probability model, it recovers the DIF items well and the recovered ability scores correlate ~ .8 with the true ability scores), but I have a TON of divergences.

Bob Carpenter

unread,
Mar 4, 2016, 6:27:05 PM3/4/16
to stan-...@googlegroups.com
Nope. Here's from page 8 of the JAGS manual (v3.3.0):

If initial values are not supplied by the user, then
each parameter chooses its own initial value based on
the values of its parents. The initial value is chosen
to be a “typical value” from the prior distribution.
The exact meaning of “typical value” depends on the
distribution of the stochastic node, but is usually the
mean, median, or mode.

If you rely on automatic initial value generation and
are running multiple parallel chains, then the initial
values will be the same in all chains. You may not want
this behaviour, especially if you are using the Gelman
and Rubin convergence diagnostic, which assumes that
the initial values are over-dispersed with respect
to the posterior distribution. In this case, you are
advised to set the starting values manually using the
”parameters in” statement.

So the "stability" you're seeing is probably a combination of
using the same inits in each chain and Gibbs poor mixing,
especially with discrete parameters.

The divergent transitions arise when there's a numerical
error (underflow, overflow, divide by zero, etc.), and it
almost always indicates areas of the posterior with extreme
curvature.

What are the messages coming back from the divergences? It
indicates another situation where you're probably getting bias
from missing part of the parameter space.

This can happen if your data's consistent with parameters at
constraint boundaries, like scales (std devs) of zero.

And it's not the high-density modes that we want to explore,
it's the typical set (high probability mass volume) of the posterior,
which can be far away from the modes (as it is in a high-dimensional
multivariate normal --- almost none of the mass is close to the mode above
50 or 100 dimensions).

- Bob

Stephen Martin

unread,
Mar 4, 2016, 6:54:51 PM3/4/16
to Stan users mailing list
Noted. Eek, glad I made the transition then.

As for divergences, most of the mixtures I've been fitting have a *huge* number of divergent transitions, sometimes equal to the number of post-warmup iterations.
If it's occurring near boundaries, I'm guessing the issue is that some people load *very* highly on one model versus the other, and the prior (with parameters below 1) already favors that, so the likelihood and the prior are both pushing the mixture weights toward the boundaries (0 or 1).

I'll tweak the priors a bit to not favor such assignments and see whether it fixes the issue. In Jags, I typically used mixture weights of, e.g., dbeta(.5,.5) to favor assigning individuals to one or the other. Stan may not need that as much, because it doesn't have to swap between entire models.

Stephen Martin

unread,
Mar 4, 2016, 6:57:35 PM3/4/16
to Stan users mailing list
Oh! I meant to mention this earlier.

When I used the simplex and dirichlet parameterization, optimizing() would spew hundreds of errors about a non-finite log gradient.
Since converting the model to use real<lower=0,upper=1> probabilities with beta priors, I do not get those errors.
I'm not sure if that helps you all any, but I found that interesting, and further evidence that something is wonky about dirichlet priors.
...

Bob Carpenter

unread,
Mar 4, 2016, 7:18:03 PM3/4/16
to stan-...@googlegroups.com
Until we get a reproducible error, there's not much we can
do. We haven't had problems with them elsewhere.

- Bob

Stephen Martin

unread,
Mar 4, 2016, 7:20:08 PM3/4/16
to Stan users mailing list

This is all I see in rstan:
Warning messages:
1: There were 4000 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help.
2: Examine the pairs() plot to diagnose sampling problems


 
And I ran 4 chains with 1000 post-warmup iterations, so every post-warmup iteration was divergent. I've tried changing adapt_delta and it didn't change anything; still that many divergences.
Again, and interestingly -- the inference is very good; the mean ability estimates, conditioned on their assigned clusters, correlated with the actual latent abilities at .85.

The trace plots and densities look like the sampler is seriously struggling though. An example was attached to this post. Those straight lines and insanely high Rhats are very questionable, but also are right at the boundary for the parameter.

Bob Carpenter

unread,
Mar 4, 2016, 7:24:04 PM3/4/16
to stan-...@googlegroups.com
That's certainly not mixing. You should get a print out of what
they are. Ben changed the RStan behavior so that they get cached
and counted and printed at the end.

Are you sure the model has support over all the parameter values
matching the constraints?

- Bob

Stephen Martin

unread,
Mar 4, 2016, 7:38:41 PM3/4/16
to Stan users mailing list
I don't see any print-out about individual iterations. I pasted precisely what it said, and nothing less.

And I'm fairly sure it does, yes.
Aside from constraining the alpha to be positive half-normal and probabilities to be real<0,1>, the only other constraint is an ordering constraint, but that is accomplished through ordered, and it's for managing within-chain label switching.

It's *possible* that the positive-half-normal priors could be causing an issue, if any alphas want to be near 0; I could instead try a lognormal distribution of some sort to keep the alphas near 0-1 without a direct constraint?

Bob Carpenter

unread,
Mar 4, 2016, 8:00:03 PM3/4/16
to stan-...@googlegroups.com
The beta priors aren't particularly strong, though. What's
the data size that's going with those priors?

- Bob

Stephen Martin

unread,
Mar 4, 2016, 8:07:23 PM3/4/16
to Stan users mailing list
Currently, N=200, with N_1 = 100, N_2 = 100.

I've tried a few beta priors, all fairly uninformative:
(.2, .2)
(.5, .5)
(1, 1)
(2, 2)

Doesn't seem to make a difference.

Stephen Martin

unread,
Mar 4, 2016, 8:29:04 PM3/4/16
to stan-...@googlegroups.com
To clarify things, I'm attaching the likelihood function I'm using as a pdf (cropped latex pdf).

The idea is that there are two latent groups, and some items may perform better if they estimate different item parameters per group.
The goal is to determine 1) latent group memberships, marginalized over the probability that each item distinguishes between groups and 2) The probability that each item distinguishes between groups.

The priors on the Deltas can vary --- I could assume that the existence of one "DIF" item (high Delta_2) should suggest the existence of other such items (thus, all item Deltas are placed on the same prior distribution), or leave them independent (independent prior distributions).
likelihood.pdf

Bob Carpenter

unread,
Mar 5, 2016, 3:01:05 PM3/5/16
to stan-...@googlegroups.com
Yup. Theoretically the ones less than 1 are very different, but
in practice as long as you see both positive and negative instances,
it's not going to make a huge difference. It will make a difference
though if the posterior has parameters < 1.

And yes, a log normal can help keep things away from zero, but
it can also cause computational issues if the data really wants
to push things out into the tail.

- Bob

Stephen Martin

unread,
Mar 5, 2016, 6:57:15 PM3/5/16
to Stan users mailing list
The reason I was using the <1 priors was to favor categorization (p2 = 0 or 1 being more probable than .4-.6), which is a trick I used in jags to guide the sampler a bit.

Could you elaborate on your point that it'll make a difference if the posterior has parameters < 1?

Bob Carpenter

unread,
Mar 7, 2016, 12:45:06 AM3/7/16
to stan-...@googlegroups.com
Suppose I have a normal(2, 1) distribution. If I don't
truncate, the mean is 2. If I truncate setting lower=1,
the mean is 2.3.

> library(rstan)
> prog <- "parameters { real<lower=1> y; } model { y ~ normal(2, 1); }"
> fit <- stan(model_code=prog)
> print(fit)
...
mean ...
y 2.29 ...


So the original posterior mean is 2, but by truncating at lower=1,
it moves the posterior mean up to 2.3.

- Bob

Stephen Martin

unread,
Mar 7, 2016, 12:54:08 PM3/7/16
to Stan users mailing list
Your post made me realize that my priors were misspecified as well.

I wanted to shrink mixture assignments toward an overall distribution of probabilities (beta), but the prior to that beta was directly another beta, which is incorrect.

That is, the hyper prior distribution was a beta(.5,.5) or something, and a realization of that distribution was entered as the beta prior parameter for the mixture assignment.
This is totally incorrect --- The prior distribution was getting <lower=0,upper=1>, when it should be getting a scaled version of that.  The mean of the prior beta was correct, but the width was totally wrong.

Before:

p2 ~ beta(P2,1-P2);
P2 ~ beta(.5,.5)

This is totally incorrect.
Now it is:
p2 ~ beta(P2*multi,(1 - P2)*multi); (conceptually speaking)
P2 ~ beta(.5,.5);

The multi (multiplier) changes the scale of the p2 distribution.  It didn't help with mixing, but it did improve inference a bit.
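
A self-contained sketch of that mean-and-concentration formulation (the names follow Stephen's P2/multi; the pareto prior on the concentration is one illustrative way to keep it finite, not necessarily what he used):

library(rstan)
beta_prec_code <- "
data {
  int<lower=1> N;
}
parameters {
  real<lower=0,upper=1> P2;        // overall mixture probability (the mean)
  real<lower=1> multi;             // concentration: larger values shrink p2 harder toward P2
  vector<lower=0,upper=1>[N] p2;   // per-person mixture probabilities
}
model {
  P2 ~ beta(0.5, 0.5);
  multi ~ pareto(1, 1.5);          // heavy-tailed prior; one way to keep multi from drifting off
  p2 ~ beta(P2 * multi, (1 - P2) * multi);
}
"
fit_prior <- stan(model_code = beta_prec_code, data = list(N = 200))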

Bob Carpenter

unread,
Mar 7, 2016, 3:40:09 PM3/7/16
to stan-...@googlegroups.com
I'd be inclined to take a less extreme hyperprior than beta(0.5, 0.5),
but it probably isn't too sensitive (you could check).

This specification of a beta prior in terms of mean and a kind
of precision is discussed in the manual and I also talk about it
in the case study I just added on repeated binary trials (which in
turn is based on the case study in chapter 5 of Gelman et al.'s
Bayesian Data Analysis book). In this model, it's hard to control
the tendency of multi to drift off to infinity if the data is consistent
with complete pooling (all the p2 are the same).

So what I did to change the model in the case study is switch to
a logit-scale rather than trying to fiddle with beta distributions.

http://mc-stan.org/documentation/case-studies.html

I've been meaning to announce the case studies, but so far there
are only three of mine. I still need to get some more examples out
there (I believe there's one in the mail queue).

- Bob

Stephen Martin

unread,
Mar 7, 2016, 3:53:39 PM3/7/16
to Stan users mailing list
Hm, interesting. I ran into the "multiplier -> infinity" issue and constrained the multiplier from 1 to N (or 1 to J, depending on the distribution).

So, instead of declaring p2[N] as real<lower=0,upper=1> with p2 ~ beta(P2*P2_multi, (1 - P2)*P2_multi) and P2_multi<lower=1,upper=N>,
I could use a vector of logits, logit2[N].

My question then:
beta:probability :: _____:logit

Stephen Martin

unread,
Mar 7, 2016, 4:23:35 PM3/7/16
to Stan users mailing list
Alright, I like your idea.

I'll try using logit units rather than probabilities.

It seems like the logistic distribution is what I'll want to use.

So, for shrinking logits, I'll want to estimate a logistic distribution of some mean and scale.
An 'uninformed' logistic distribution seems to be logistic(0,1) --- By that, I mean uniformly distributed probabilities, converted to logit, is approximately distributed logistic(0,1).

Would a good prior for the logistic distribution mean itself be a logistic distribution?
I'm not sure what the logistic scale prior should be; The larger the scale, the more it is like beta(<1,<1). The smaller, the more it is like beta(>1,>1).

Bob Carpenter

unread,
Mar 7, 2016, 4:35:06 PM3/7/16
to stan-...@googlegroups.com
Usually normal; Stan doesn't require (or even use) conjugacy.
See the example in the case study.

The typical motivation for making things a logit is either
(a) to do maximum likelihood in an unconstrained space, or
(b) to add other predictors or to generalize to a hierarchical
or multilevel model.
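
A sketch of what the logit-scale version might look like, in the spirit of the case study (a non-centered normal hierarchy on per-person logits; all names here are illustrative, not the poster's model):

library(rstan)
logit_code <- "
data {
  int<lower=1> N;
}
parameters {
  real mu;               // population mean of the logits
  real<lower=0> sigma;   // population scale of the logits
  vector[N] eta;         // non-centered person-level effects
}
transformed parameters {
  vector[N] p2;          // per-person mixture probabilities
  p2 <- inv_logit(mu + sigma * eta);
}
model {
  mu ~ normal(0, 1);
  sigma ~ normal(0, 1);  // half-normal via the <lower=0> constraint
  eta ~ normal(0, 1);
}
"
fit_logit <- stan(model_code = logit_code, data = list(N = 200))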

- Bob