IRT models PL1, PL2 + multilevel versions


Bob Carpenter

Sep 10, 2012, 12:05:13 AM
to stan-...@googlegroups.com
Item-response theory (IRT) models are very much the kind
of thing we had in mind when designing Stan -- it's essentially
a logistic regression with high correlation among
the parameters in the posterior.

I just wrote four Stan models for the basic IRT
models PL1 (aka Rasch) with no question discrimination
parameter and PL2 with the discrimination parameter, along
with multilevel variants of the kind described in Gelman
and Hill's regression book starting on p. 314.
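For readers new to these models, the response probabilities can be sketched in a few lines (a minimal illustration in Python, not the Stan code itself; the helper names are my own):

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

# 1PL (Rasch): probability of a correct answer depends only on the
# difference between student ability alpha and question difficulty beta.
def p_1pl(alpha, beta):
    return inv_logit(alpha - beta)

# 2PL adds a per-question discrimination gamma that scales that difference.
def p_2pl(alpha, beta, gamma):
    return inv_logit(gamma * (alpha - beta))

# A student whose ability matches the question's difficulty answers
# correctly with probability 0.5 under both models.
print(p_1pl(0.0, 0.0), p_2pl(0.0, 0.0, 2.0))  # 0.5 0.5
```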

You can browse the models online in our source repository:

https://code.google.com/p/stan/source/browse/#git%2Fsrc%2Fmodels%2Fmisc%2Firt

I tested them by generating data for 400 students
on 100 questions, with 25% of the data missing at
random. This results in around 30,000 data points
and 500 (PL1) or 600 (PL2) parameters.
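A data-generating setup along these lines can be sketched as follows (a hypothetical simulation assuming standard-normal abilities and difficulties; the actual test scripts may have used different values):

```python
import math
import random

random.seed(0)
J, K = 400, 100  # students, questions

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical generating values: standard-normal abilities and difficulties.
alpha = [random.gauss(0, 1) for _ in range(J)]
beta = [random.gauss(0, 1) for _ in range(K)]

# Observe each (student, question) pair with probability 0.75,
# i.e. 25% of responses missing at random.
data = []
for j in range(J):
    for k in range(K):
        if random.random() < 0.75:
            y = 1 if random.random() < inv_logit(alpha[j] - beta[k]) else 0
            data.append((j, k, y))

print(len(data))  # roughly 30,000 observed responses
```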

The models take about 5 seconds to compile in g++ at -O3
optimization.

Here's how long Stan took to warm up for 500 iterations
and then sample for 500 iterations. All models converged
to appropriate answers across multiple chains.

PL1: 80 seconds (1m 20s)

PL2: 180 seconds (3m 00s)

The hierarchical models took the same amount of time
as their non-hierarchical counterparts.

The PL2 models adapt to a smaller step size and more
steps, which is why they take longer (that and having
20% more parameters and their attendant differentiation
and updating).

Peak memory use is only 14MB, so there's no
problem running multiple processes concurrently.

I also ran on much more data. Basic IRT for 1000 students
with 200 questions each (150K data points) ran
1000 warmup and 1000 sampling iterations (twice as many
as reported above) in 4 minutes using 70MB of memory, and converged to
the right answer. I even ran basic IRT for 10,000 students with 200
questions each (1.5M data points), and though I lost
patience testing, I extrapolated about 2 hours for
1000 warmup and 1000 sampling iterations. (I cut down
the later tests when I realized it was converging very
quickly from random inits.)

There's a bit of variation in the runs mainly due to
waiting for the step sizes to converge during warmup;
if the random initialization is far from a mode, especially
for the hierarchical parameters or intercept terms, then
it can take a bit longer to move to the mode because of
the need for reduced step sizes to keep the simulation
accurate (I never saw more than about 50% more time needed
than reported above). With good inits it can run a bit faster than
reported above.

The notebook I tested this on is reasonably beefy:

Mac OS X, Mountain Lion,
2.3 GHz i7 CPU
1600MHz DDR3 memory

Although I'm using the latest g++ from Xcode, version 4.2.1,
the latest from MacPorts is faster (as much as 20% in
some of our earlier tests of a 4.6 version, and newer
versions continue to get faster).

- Bob




Emmanuel Charpentier

Sep 10, 2012, 4:12:22 AM
to stan-...@googlegroups.com
Dear Bob,

Thank you for this effort. I have skimmed the Stan files and noticed a couple of things:

1) You use one more parameter, the mean ability, which, from a "sampling efficiency" point of view, should play the same role as the mean of the alpha distribution in Gelman & Hill's textbook.

2) You give this mean an informative prior. This would be hard to defend against a reviewer.

3) Similarly, you give abilities, discriminations and difficulties somewhat informative priors. The same remark applies: whereas imposing a standard deviation on abilities is just a choice of scale, imposing a small standard deviation on discriminations and difficulties *is* informative. Hard to do when you are exploring a situation where no prior knowledge has been formalized.

4) However, in the IRT2 hierarchical model, you do not state a prior for the abilities and difficulties. This, as far as I know, is equivalent to giving them an improper prior on (0, \infty), thus making the posterior improper. So much for model comparison...

In short, where Gelman & Hill's solution introduced both an additive and a multiplicative redundancy (by means of centering and scaling "raw" parameters), you keep the additive redundancy and replace the multiplicative redundancy by imposing more informative distributions.

I'll try to assess the impact of these choices by trying various values of the prior parameters.

Again, thank you !

                                                         Emmanuel Charpentier

Bob Carpenter

Sep 10, 2012, 11:58:12 AM
to stan-...@googlegroups.com
You can put whatever priors you want on the
variables. It doesn't affect the computation much.
It does affect the concentration of the posterior,
which is very sensitive to the choice of prior in
these models.

I also didn't provide a prior on the deviation
parameters, which I should probably do.

On 9/10/12 4:12 AM, Emmanuel Charpentier wrote:
> Dear Bob,
>
> Thank you for this effort. I have skimmed the stan files, and notice a couple of things :
>
> 1) You use one more parameter, the mean ability, which, from a "sampling efficiency" point of view, should play the same
> role as the mean of the alpha distribution in Gelman & Hill's textbook.

Exactly. You can parameterize it the other
way, but it's then adding a dependency to all the
parameters on the mean, which is less flexible than
just having the mean float.

> 2) You give to this mean an informative prior. This would be hard to defend against a reviewer.

Presumably in the same sense that any informative prior might
be hard to defend against a reviewer!

I could've added the priors as data as in the S. McKay Curtis
JSS paper on BUGS item-response models, so that they'd be
easier for users to play with.

I found when the priors were fatter than the values used
to simulate the data, the mean posterior estimates drifted
from their simulated values (in predictable, repeatable ways).

I'd recommend the hierarchical models, which don't have this
problem, but can also be difficult to defend against reviewers
in my experience.

> 3) Similarly, you give abilities, discriminations and difficulties somewhat informative priors. The same remark applies
> : whereas imposing a standard deviation on abilities is just a choice of scale; imposing a small standard deviation to
> discriminations and difficulties *is* informative. Hard to do when you are exploring a situation where no prior
> knowledge has been formalized.

Indeed. As I said above, feel free to change the priors!

> 4) However, in the IRT2 hierarchical model, you do not state a prior for the abilities and difficulties. This, as far as
> I know, is equivalent to giving them an improper prior on (0, \infty), thus making the posterior improper. So
> much for model comparison...

Exactly -- the scale parameters have improper priors.
The posterior doesn't behave improperly, but I haven't done
a formal derivation.

Given how the model behaves without a prior (recovering
the true simulated values), and given the quantity of
data for the priors (# of questions, # of students), a
very weak prior shouldn't have much effect.

It's easy enough to add a prior. Andrew likes using
weakly informative priors in these situations, but
I realize other researchers have different approaches.

> In short, where Gelman & Hill's solution introduced both an additive and a multiplicative redundancy (by means of
> centering and scaling "raw" parameters), you keep the additive redundancy and replace the multiplicative redundancy by
> imposing more informative distributions.

Gelman and Hill didn't propose a single solution. They discuss
several methods (setting one value based on prior knowledge,
setting the mean to 0, then their actual model also sets the
scale to 1 by using a z-score type normalization).
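The z-score style normalization mentioned here pins down both redundancies at once: centering removes the additive freedom, and dividing by the standard deviation removes the multiplicative one. A minimal sketch:

```python
import math

# Center and scale a batch of values to mean 0, standard deviation 1
# (population sd), as in a z-score normalization.
def zscore(xs):
    m = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    return [(x - m) / sd for x in xs]

a = zscore([0.3, -1.2, 2.5, 0.0])
mean = sum(a) / len(a)
var = sum(x * x for x in a) / len(a)
print(abs(mean) < 1e-9, abs(var - 1.0) < 1e-9)  # True True
```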

I was just trying to use the simplest thing possible.

> I'll try to assess the impact of these choices by trying various values of the prior parameters.

Let us know what happens. I don't have that much time to play
around with IRT parameterizations.

- Bob

Emmanuel Charpentier

Sep 10, 2012, 12:37:52 PM
to stan-...@googlegroups.com
A couple of comments below:
This is a bit of a problem, don't you think? You are analyzing these data to get information about unknown parameters from your data. What you state means that your process will "drift" (i.e. lose precision) if you do not give them a bit of prior information (obtained by sheer divination in this case?). A bit too circular for some reviewers...

I'd recommend the hierarchical models, which don't have this
problem, but can also be difficult to defend against reviewers
in my experience.

I went for the hierarchical solution because of this, but your hierarchical solution also incorporates information in the form of informative hyperpriors.

> 3) Similarly, you give abilities, discriminations and difficulties somewhat informative priors. The same remark applies
> : whereas imposing a standard deviation on abilities is just a choice of scale; imposing a small standard deviation to
> discriminations and difficulties *is* informative. Hard to do when you are exploring a situation where no prior
> knowledge has been formalized.

Indeed.  As I said above, feel free to change the priors!

> 4) However, in the IRT2 hierarchical model, you do not state a prior for the abilities and difficulties. This, as far as
> I know, is equivalent to giving them an improper prior on (0, \infty), thus making the posterior improper. So
> much for model comparison...

Exactly -- the scale parameters have improper priors.
The posterior doesn't behave improperly, but I haven't done
a formal derivation.

Since one of my goals is model comparison, I'd be wary of anything leading to improper posteriors...

Given how the model behaves without a prior (recovering
the true simulated values), and given the quantity of
data for the priors (# of questions, # of students), a
very weak prior shouldn't have much effect.

I'd be in serious trouble if I had to *prove* that the resulting simulations do not come from an improper posterior. I tried this derivation and found it a bit too hard for my limited algebraic/analytic abilities (I'm definitely better at surgery...). A bit of bibliography led me to Jean-Paul Fox's recent monograph, which did not leave me with a usable way of simulating from this difficult posterior.

It's easy enough to add a prior.  Andrew likes using
weakly informative priors in these situations,

So do I, because 1) your priors being weak, you can successfully argue that they do not carry much information into the posterior, and 2) being proper, you can (try to) use the resulting simulations in marginal likelihood estimation / Bayes factors / whatever model comparison method du jour you wish to use.
 
but
I realize other researchers have different approaches.

> In short, where Gelman & Hill's solution introduced both an additive and a multiplicative redundancy (by means of
> centering and scaling "raw" parameters), you keep the additive redundancy and replace the multiplicative redundancy by
> imposing more informative distributions.

Gelman and Hill didn't propose a single solution.  They discuss
several methods (setting one value based on prior knowledge,

That's for another problem, i.e. resolving the sign indeterminacy of the general model, which does not arise in my case, since I can use subject-matter knowledge to postulate that all my items are correlated the same way. Hence the lognormal distribution of the gammas.
 
setting the mean to 0, then their actual model also sets the
scale to 1 by using a z-score type normalization).

I was just trying to use the simplest thing possible.

> I'll try to assess the impact of these choices by trying various values of the prior parameters.

Let us know what happens.  I don't have that much time to play
around with IRT parameterizations.

Neither do I, alas, but I just have to: "When a researcher gotta, he's gonna...". I was hoping that Stan's superior sampler could help me with a better simulation without having to wait hours (literally...) for convergence. But the lack of sampling from discrete distributions (if only for the missing data) makes it (currently) unfit for my purposes.

I will definitely try to compare JAGS and Stan (and possibly R Metropolis adaptive samplers such as the one exemplified in the adaptMCMC package, which performs surprisingly well with very little tuning) for the hierarchical models in the complete-cases case. But not now...

Again, thank you for your attention and your time.

                                                                Emmanuel Charpentier

Bob Carpenter

Sep 10, 2012, 2:12:42 PM
to stan-...@googlegroups.com


On 9/10/12 12:37 PM, Emmanuel Charpentier wrote:
>
> On Monday, September 10, 2012, 5:58:14 PM UTC+2, Bob Carpenter wrote:

> > 2) You give to this mean an informative prior. This would be hard to defend against a reviewer.
...
> I found when the priors were fatter than the values used
> to simulate the data, the mean posterior estimates drifted
> from their simulated values (in predictable, repeatable ways).
>
> This is a bit of a problem, don't you think ?

No (see below).

> You are analyzing these data to get information about unknown parameters

Not in this case (see below).

> from your data. What you state means that your process will "drift" (i.e. lose precision) if you do not give them a
> bit of prior information (obtained by sheer divination in this case?). A bit too circular for some reviewers...

I understand your concern, but I want to stress two things:

1. This is not a bug in Stan. Stan is sampling from the posterior.
It's an issue with the models themselves!

2. You can use whatever priors you want in your own models. These
were just set up with exact model specifications for TESTING.

> I'd recommend the hierarchical models, which don't have this
> problem, but can also be difficult to defend against reviewers
> in my experience.
>
>
> I went for the hierarchical solution because of this. but your hierarchical solution also incorporates information in
> the form of informative hyperpriors.

See points 1. and 2. above. :-)

And please feel free to post results back to this list
from different choices of priors.

- Bob

Andrew Gelman

Sep 10, 2012, 4:16:46 PM
to stan-...@googlegroups.com
Just to reply briefly to this discussion:

I recommend never using improper priors. Yes, we have an improper prior for the 8 schools but that's a well studied problem where we know the answer. In general, it should always be possible to put a proper, weakly informative prior that constrains the parameters in a reasonable way without overwhelming your inference with prior information.

Josh Rosenau

Oct 12, 2012, 2:55:24 PM
to stan-...@googlegroups.com
Hi all,

I'm just starting to play with Stan and building up to a multi-level IRT model.  My data are all Likert-scale, so I'm starting by extending these examples to work like a GRM rather than an IRT.  From my reading of the documentation, it looks like making these models work with ordinal data is just a matter of switching bernoulli_logit to ordered_logistic, and adding a vector for cutpoints to the parameters (and to the ordered_logistic call).  Am I missing anything there?
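For what it's worth, the ordered-logistic likelihood in question turns one linear predictor plus a vector of ordered cutpoints into category probabilities. A small sketch of that computation (my own helper names, not Stan's implementation):

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def ordered_logistic_probs(eta, cutpoints):
    # P(y > k) = inv_logit(eta - c[k]); adjacent differences of the
    # cumulative probabilities give the per-category probabilities.
    cum = [1.0] + [inv_logit(eta - c) for c in cutpoints] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]

# 3 ordered cutpoints -> 4 Likert categories; probabilities sum to 1.
probs = ordered_logistic_probs(0.3, [-1.0, 0.5, 2.0])
print(len(probs), abs(sum(probs) - 1.0) < 1e-9)  # 4 True
```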

Second, I've been reading the discussion of optimizing for speed, and wonder whether the sample code here can be vectorized more.  In the sample, you have 

for (n in 1:N)
  y[n] ~ ordered_logistic( exp(log_gamma[q[n]]) * (alpha[jj[n]] - beta[kk[n]] + delta) , c);

One would like to eliminate the loop and have only: 

y ~ ordered_logistic( foo , c)

Could you, then, write:

for (n in 1:N)
  foo[n] <- exp(log_gamma[q[n]]) * (alpha[jj[n]] - beta[kk[n]] + delta);
y ~ ordered_logistic( foo , c);

And would this have any measurable effect on speed?

Or would it be better to create 3 local vectors (unrolling log_gamma, alpha, and beta), and then have:

y ~ ordered_logistic( exp(log_gamma_local) .* (alpha_local - beta_local) , c)

I guess I could just try it and see, but if this has already been hashed out, I'm happy to stand on the shoulders of giants.

Joshua Rosenau

Oct 12, 2012, 4:08:55 PM
to stan-...@googlegroups.com
Huh, it turns out that just substituting ordered_logistic (with Likert data) for bernoulli_logit (for properly bernoullified data) in the IRT sample makes it run slightly faster.  

Looks like ordered_logistic doesn't have a vectorized form, so I couldn't test the vectorized alternatives with that, but I did try it with the bernoulli_logit function.

Just moving bernoulli_logit out of the loop using foo to hold the calculated vector did give a speed boost, but not nearly the boost that came from this (all other parts of the original file are unchanged):

model {
  row_vector[N] log_gamma_local;
  row_vector[N] alpha_local;
  row_vector[N] beta_local;

  alpha ~ normal(0,sigma_alpha);
  beta ~ normal(0,sigma_beta);
  log_gamma ~ normal(0,sigma_gamma);
  delta ~ cauchy(0,5);
  sigma_alpha ~ cauchy(0,5);
  sigma_beta ~ cauchy(0,5);
  sigma_gamma ~ cauchy(0,5);
  for (n in 1:N) {
    log_gamma_local[n] <- log_gamma[kk[n]];
    alpha_local[n] <- alpha[jj[n]];
    beta_local[n] <- beta[kk[n]];
  }
  y ~ bernoulli_logit(exp(log_gamma_local) .* (alpha_local - beta_local + delta)); // elementwise product, not dot_product
}

This version seems to run about 3 times faster.

--
 
 

Ben Goodrich

Oct 12, 2012, 5:26:32 PM
to stan-...@googlegroups.com
On Friday, October 12, 2012 4:09:08 PM UTC-4, Josh Rosenau wrote:
Looks like ordered_logistic doesn't have a vectorized form, so I couldn't test the vectorized alternatives with that, but I did try it with the bernoulli_logit function.

Just moving bernoulli_logit out of the loop using foo to hold the calculated vector did give a speed boost

Yes, loops involving only assignment statements ( <- ) are cheap. Loops involving sampling statements ( ~ or lp__ <- lp__ + whatever; ) are expensive because they change the log-posterior. So, if you are able to use a vectorized distribution, it is often worthwhile to take the sampling statement out of the loop, even if you have to add a loop of assignment statements to line up the parameters.

Although Stan currently does not have an ordered_probit function, you might be able to recast your model like you would under a Gibbs sampler with data augmentation of a latent variable governed by the cutpoints. That would be a lot of effort to code (and might not work well), but in the end you would be able to vectorize the y_star ~ normal(mu, 1).

Ben

Devin Caughey

Sep 4, 2013, 9:07:07 PM
to stan-...@googlegroups.com
Even though it didn't turn out to be a problem in this particular application, I'm curious whether anyone has thoughts on the best/simplest way to resolve the sign indeterminacy (i.e., reflection invariance) of item-response models in Stan. For example, say you want to do so by constraining the discrimination parameter of one item to be positive. Assuming that the discriminations beta have to be declared as a vector (as in "vector<lower=0>[K] beta;"), would it make sense to then transform the relevant parameter (indexed by pos_indx) with something like the following?:

beta[pos_indx] <- abs(beta[pos_indx])

Or should the sign restriction be declared when the parameter is defined?

Thanks for your help!
~Devin

Bob Carpenter

Sep 4, 2013, 9:36:46 PM
to stan-...@googlegroups.com


On 9/4/13 9:07 PM, Devin Caughey wrote:
> Even though it didn't turn out to be a problem in this particular application, I'm curious whether anyone has thoughts
> on the best/simplest way to resolve the sign indeterminacy (i.e., reflection invariance) of item-response models in
> Stan.

Using priors or fixing values is going to work best in Stan.

For a proper IRT model, there's no item coding uncertainty
like there is in ideal point models, so it makes the most sense
to me to constrain the discrimination parameter to be positive.

> For example, say you want to do so by constraining the discrimination parameter of one item to be positive.
> Assuming that the discriminations beta have to be declared as a vector (as in "vector<lower=0>[K] beta;"), would it make
> sense to then transform the relevant parameter (indexed by pos_indx) with something like the following?:
>
> beta[pos_indx] <- abs(beta[pos_indx])

This is not a good idea because the indeterminacy is
still there in the underlying sampler.

Also, you can't technically reassign a parameter like this.
You'd have something like:

beta[k] <- abs(beta_raw[k]);

The sampling is then actually done over beta_raw, which
has the non-identifiability. It may not sound like this would
be a problem, but we're trying to do things like identify posterior
scales, which this messes with because the high mass area of
the posterior isn't contiguous.

In general, we prefer to make the underlying distribution
in the unconstrained parameters identified. For the same
reason, I wouldn't recommend adjusting unidentified raw parameters
in some other way. It might work, but it's making life hard
on sampler adaptation.

> Or should the sign restriction be declared when the parameter is defined?

Yes. What really happens here is this:

beta[k] <- exp(beta_raw[k]);

And in order to allow sampling to happen on beta_raw when the
model will be defined in terms of beta, we apply a Jacobian
adjustment (here just lp__ <- lp__ + beta_raw[k]).
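To spell out that adjustment: with beta = exp(beta_raw), the absolute derivative is exp(beta_raw), so the log-Jacobian term added to lp__ is just beta_raw. A quick numerical sanity check (illustrative only):

```python
import math

# Change of variables: if beta = exp(beta_raw), a density on beta picks up
# the Jacobian |d beta / d beta_raw| = exp(beta_raw) on the unconstrained
# scale, whose log is simply beta_raw -- the term added to lp__ above.
def log_jacobian_exp(beta_raw):
    return beta_raw

# Check against a finite-difference derivative of the transform.
x, h = 0.7, 1e-6
deriv = (math.exp(x + h) - math.exp(x - h)) / (2 * h)
print(abs(math.log(deriv) - log_jacobian_exp(x)) < 1e-8)  # True
```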

> Thanks for your help!

But wait, there's more. You also have scale and additive-offset
non-identifiability. So if you have student abilities
alpha[j], item difficulties beta[i], item discrimination
delta[i], then

y[i,j] ~ bernoulli_logit( delta[i] * ( alpha[j] - beta[i] ) );

has three kinds of non-identifiability:

1. add to alpha and beta

2. multiply alpha and beta, divide delta

3. flip all signs
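These invariances are easy to verify numerically: each transformation leaves the likelihood unchanged. A sketch with made-up parameter values:

```python
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_lik(y, alpha, beta, delta):
    # Log-likelihood of y[i][j] ~ bernoulli_logit(delta[i] * (alpha[j] - beta[i])).
    ll = 0.0
    for i in range(len(beta)):
        for j in range(len(alpha)):
            p = inv_logit(delta[i] * (alpha[j] - beta[i]))
            ll += math.log(p if y[i][j] == 1 else 1.0 - p)
    return ll

alpha = [0.5, -1.0]  # abilities (made-up values)
beta = [0.2, 1.3]    # difficulties
delta = [0.8, 1.7]   # discriminations
y = [[1, 0], [0, 1]]

base = log_lik(y, alpha, beta, delta)
c, s = 2.0, 3.0
# 1. add a constant to alpha and beta
shift = log_lik(y, [a + c for a in alpha], [b + c for b in beta], delta)
# 2. multiply alpha and beta, divide delta
rescale = log_lik(y, [a * s for a in alpha], [b * s for b in beta],
                  [d / s for d in delta])
# 3. flip all signs
flip = log_lik(y, [-a for a in alpha], [-b for b in beta],
               [-d for d in delta])
print(abs(shift - base) < 1e-9,
      abs(rescale - base) < 1e-9,
      abs(flip - base) < 1e-9)  # True True True
```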

What I did in an IRT model I just fit was to take

alpha ~ normal(0,1); // identifying prior on abilities

beta ~ normal(0,10); // vague prior on difficulties

delta ~ lognormal(0,1); // weakly informative prior on discrimination

Constraining delta to be positive solves identifiability issue 3.

The other priors solve 1 and 2.

The relatively strong prior on abilities alpha coupled with
less data and the freedom to move the offset into beta,
drives the posterior of alpha to be standardized. The relatively
weak prior on beta lets it float and absorb what would otherwise be
an intercept term (you can also fit the intercept hierarchically,
though this introduces yet another additive identifiability issue in
the hierarchical prior).

I prefer to make a full hierarchical model, but the above reproduces
a kind of baseline IRT that should fit close to what standard
maximum likelihood estimates fit.

- Bob

Devin Caughey

Sep 4, 2013, 11:56:54 PM
to stan-...@googlegroups.com
Hi Bob,

Thanks for your rapid and helpful response. To clarify, the IRT model I am running is more like an ideal-point model in that the responses are not coded as correct/incorrect but rather as yes/no. Given that, I'm still not entirely clear on how best to identify the sign (the location and scale are identified via transformations on the difficulty and discrimination parameters). Would it work if I define the prior like this:

    beta_raw ~ normal(0, 100);

then fixed the sign of one item k like this:

    beta[k] <- exp(beta_raw[k]);

and finally applied the Jacobian adjustment you mentioned?:

    lp__ <- lp__ + beta_raw[k];

Forgive me for being slow to pick up what you're saying---I do appreciate the help!
~Devin

Bob Carpenter

Sep 5, 2013, 1:00:23 AM
to stan-...@googlegroups.com


On 9/4/13 11:56 PM, Devin Caughey wrote:
> Hi Bob,
>
> Thanks for your rapid and helpful response. To clarify, the IRT model I am running is more like an ideal-point model in
> that the responses are not coded as correct/incorrect but rather as yes/no.

And presumably the yeses and nos aren't all the same kind.

For instance, yea and nay votes don't line up with right or
left, because the bills may be right or left, so that
a nay vote may in one case be left and in others be right.
(This is assuming one dimensional models where the main
dimension you'll fit is party affiliation, at least in American congressional
politics.)

> Given that, I'm still not entirely clear on
> how best to identify the sign (the location and scale are identified via transformations on the difficulty and
> discrimination parameters).

These transforms may lead to their own issues if you're
not careful.

You can usually see the poorly identified parameters producing
low sample sizes --- at least I did in the IRT models I was
working on where the sign wasn't an issue.

> Would it work if I define the prior like this:
>
> beta_raw ~ normal(0, 100);
>
> then fixed the sign of one item k like this:
>
> beta[k] <- exp(beta_raw[k]);
>
> and finally applied the Jacobian adjustment you mentioned?:
>
> lp__ <- lp__ + beta_raw[k];
>
> Forgive me for being slow to pick up what your saying---I do appreciate the help!

Yes, this corresponds to what Andrew and Jennifer suggest
in their book. The easiest way to do it would be something like
this:

parameters {
  vector[K] beta_raw;
  ...
}
transformed parameters {
  vector[K] beta;
  ...
  beta <- beta_raw;
  beta[1] <- exp(beta[1]);
  lp__ <- lp__ + beta_raw[1];
  ...
}
model {
  beta ~ normal(0,10);
  ...
}

or whatever kind of prior you want to put on beta.

This has a somewhat odd interpretation in that you implicitly
are providing a half-normal prior on beta[1] (truncated to
0 to infinity) and a full normal prior on beta[2]...beta[K].
You don't need to divide by two in Stan, because you only need
the probability up to a proportion.
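To unpack that remark: truncating normal(0, sigma) to the positive half-line multiplies the density by the constant 2, and constants drop out of MCMC, which only needs the density up to a proportion. A rough numerical check (crude Riemann sum, illustrative only):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Restricting normal(0, sigma) to x > 0 leaves half the mass, so the
# properly normalized half-normal density is 2 * normal_pdf. The factor
# of 2 is a constant, so the sampling statement can ignore it.
sigma = 10.0
step = 0.01
grid = [i * step for i in range(int(8 * sigma / step))]
total = sum(2 * normal_pdf(x, 0.0, sigma) * step for x in grid)
print(abs(total - 1.0) < 0.01)  # True
```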

This only defines the sign, and if there's not enough data
for beta[1], I'd suspect it to not be very good at identifying.

In some cases, to determine location, you can pin down an actual
value. Like fixing the ability or difficulty or discrimination
parameter of one student or item. You can do this the same way by
setting beta[1] (or whatever) to a constant, and there's no need to
do any adjustment for that.

We should put more info about this in the manual. It's just
that there are so many different kinds of models, we can't write
long treatises on each one, at least not in our doc.

- Bob

louis...@gmail.com

May 9, 2014, 10:32:17 AM
to stan-...@googlegroups.com
Dear Bob,

It seems that the link is broken. I get: "Your client does not have permission to get URL /p/stan/source/browse/ from this server. That's all we know."

Is there another place where I can find the four models you discussed?

Kind regards,
Louis

Bob Carpenter

May 9, 2014, 12:18:16 PM
to stan-...@googlegroups.com
If you let us know where the broken link is, we can fix
it.

You can find some sample IRT models here:

https://github.com/stan-dev/stan/tree/develop/src/models/misc/irt

and the ones from Gelman and Hill's regression book here:

https://github.com/stan-dev/stan/wiki/Example-Stan-Models

- Bob

louis...@gmail.com

May 12, 2014, 4:09:05 AM
to stan-...@googlegroups.com
Dear Bob,
I was referring to your initial post but the links you provided in your answer work handsomely.
Thanks,
Louis