Differences between optimizing() and sampling() for IRT model


qlathrop

Jan 4, 2014, 12:12:05 AM
to stan-...@googlegroups.com
Hello and thanks in advance for helping,

I'm trying to understand differences in results from the same model and data between sampling (NUTS) and optimizing (BFGS). Any additional comments or suggestions regarding the attached model, or anything else, are more than welcome.

irt_test.stan is a simple IRT model set up for long data. The model is of the form resp ~ alpha * (theta_p - beta_i) where we have person p's ability theta_p, item i's difficulty beta_i, and a global discrimination parameter (equal across all items) alpha.

test_data.Rdat is some simulated data: theta ~ N(0,1), beta ~ N(0,1), alpha = 1.3; complete data from 500 persons and 25 items.

Then in R, compile the model, then sample from the posterior and optimize:

load("test_data.Rdat")  # list of named objects, called testData

testModel <- stan(file = "irt_test.stan", iter = 1, chains = 1, data = testData)

outSample <- stan(fit = testModel, data = testData, pars = c("alpha", "beta", "theta"),
                  iter = 800, warmup = 400, chains = 2)

outOptim <- optimizing(testModel@stanmodel, data = testData, hessian = TRUE)

I understand that sampling and finding the posterior mode are two different things; I just didn't expect them to be this different. The indeterminacy in IRT can be solved with priors (as discussed by Bob Carpenter here: https://groups.google.com/d/msg/stan-users/U_F1p7YSiro/yl84TruKD7UJ). With Bob's suggested priors, from sampling I get

lp__ = -5994.2
posterior of alpha: mean = 1.22, sd = 0.0475
sd across posterior means of theta = 0.947

which is close to the simulated scale (although that was a population and this is a sample), and expected by the theta ~ normal(0,1) prior. But with optimizing(), the scale is vastly different:

lp__ = -5461.13
alpha = 18.60, se = 0.3439 (note: the se here is from sqrt(diag(solve(-outOptim$hessian))), comments?)
sd across posterior means of theta = 0.089

There is a balancing act going on here: as the population distribution of theta tightens, the discrimination increases. Either method produces similar predicted probabilities of a correct response. As an example, consider the interaction of the last person and the first item (they are next to each other in the output):

>  1.2230185 * (0.5044867 - 1.649123) #posterior means from stan()
[1] -1.399911 #which is 19.78%

> 18.6019238 * (0.0357181 -  0.112928) #posterior modes from optimizing()
[1] -1.436253 #which is 19.21%
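As a quick sanity check (done here in Python rather than R), the two linear predictors above do map to nearly the same response probability under the inverse logit:

```python
import math

def inv_logit(x):
    # inverse logit: maps a linear predictor to a probability
    return 1.0 / (1.0 + math.exp(-x))

# linear predictors from the two fits above
p_sample = inv_logit(1.2230185 * (0.5044867 - 1.649123))   # sampling fit
p_optim  = inv_logit(18.6019238 * (0.0357181 - 0.112928))  # optimizing fit

print(round(p_sample, 4))  # 0.1978
print(round(p_optim, 4))   # 0.1921
```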

Why is optimizing() finding a scale so far from the identifying theta ~ normal(0,1) prior? Why is lp__ different? If the posterior mode really is where optimizing() says it is, why isn't stan() sampling anywhere near it? What is going on here?

Thank you,
-Quinn

Using: R 3.0.2 and rstan 2.1.0


test_data.Rdat
irt_test.stan

Bob Carpenter

Jan 4, 2014, 7:02:10 AM
to stan-...@googlegroups.com
1. In general, there's no reason to expect the
mode to be near the high-mass probability region
unless there is a large amount of data.

2. The log prob calculation for the sampling
model includes the Jacobian term exp(alpha),
or in the more standard model, SUM_i exp(alpha[i]).
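A minimal Python sketch of this point, using only the lognormal(0, 1) prior on alpha from the model below (the likelihood is ignored, so these numbers are purely illustrative): the mode on the constrained scale differs from the mode of the Jacobian-adjusted density that sampling targets.

```python
import math

def lognormal_logpdf(a, mu=0.0, sigma=1.0):
    # log density of alpha ~ lognormal(mu, sigma) on the constrained scale
    return -math.log(a * sigma * math.sqrt(2 * math.pi)) \
           - (math.log(a) - mu) ** 2 / (2 * sigma ** 2)

# grid search for the mode, with and without the log-Jacobian term
# for the transform u = log(alpha) that Stan uses for alpha > 0
grid = [i / 1000.0 for i in range(1, 5000)]
mode_constrained = max(grid, key=lognormal_logpdf)
mode_with_jacobian = max(grid, key=lambda a: lognormal_logpdf(a) + math.log(a))

print(round(mode_constrained, 3))    # 0.368, i.e. exp(-1)
print(round(mode_with_jacobian, 3))  # 1.0
```

Optimizing targets the first quantity; the Jacobian-adjusted density, whose mode sits elsewhere, is what lp__ reports during sampling, which is one reason the two lp__ values are not comparable.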

3. Your IRT model is non-standard in that alpha is
typically an item-level parameter, not a single
variable. So where you have:

parameters {
  real theta[N];
  real beta[I];
  real<lower=0> alpha;
}

model {
  row_vector[NN] diff_local;

  theta ~ normal(0, 1);
  beta ~ normal(0, 10);
  alpha ~ lognormal(0, 1);

  for (n in 1:NN) {
    diff_local[n] <- theta[id[n]] - beta[item[n]];
  }

  resp ~ bernoulli_logit(alpha * diff_local);
}

I would've expected to see the declaration as

real<lower=0> alpha[I];

and then this line as:

diff_local[n] <- alpha[item[n]] * (theta[id[n]] - beta[item[n]]);

As is, it's providing an overall scale, which I'm not
sure is going to be well identified and may be the source
of the issue you're seeing.

- Bob


qlathrop

Jan 4, 2014, 12:20:33 PM
to stan-...@googlegroups.com
Thanks for the quick response. Regarding (3), I see the same type of behavior with the usual two-parameter model (model attached, using the same data as before):

For sampling(), the posterior means of alpha_i range from 0.9 to 1.6; sd across posterior means of theta = 0.95; lp__ = -5985.4.
For optimizing(), the point estimates of alpha_i range from 2.9 to 5.6; sd across posterior means of theta = 0.35; lp__ = -5523.4.

-Quinn
irt_test2.stan

Bob Carpenter

Jan 4, 2014, 12:53:07 PM
to stan-...@googlegroups.com
Might this be an identifiability issue?

What do the posterior intervals look like in the sampled
results? Are the optimized results in the 95% intervals?

Are the differences for the individual items similar?
I'd look at a scatterplot of (theta[id[n]] - beta[item[n]])
in both models and see if they're similar.

The scales chosen are going to affect the optimization
results, so again, I want to stress that there's really no
reason to expect them to be the same. The sampling results
are the ones that make sense from a Bayesian perspective.
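The identifiability point can be seen directly: absent the priors, the 2PL likelihood is invariant to jointly shrinking (theta, beta) by a factor c while inflating alpha by c. A minimal Python sketch (the numbers are toy values, not from the actual fits):

```python
# Likelihood-level non-identifiability: rescaling (theta, beta) by 1/c and
# alpha by c leaves every linear predictor, and hence every response
# probability, unchanged. Only the priors break this symmetry.
c = 15.0
alpha, theta, beta = 1.3, 0.50, -0.25

eta_original = alpha * (theta - beta)
eta_rescaled = (alpha * c) * (theta / c - beta / c)

print(abs(eta_original - eta_rescaled) < 1e-12)  # True
```

This is consistent with the pattern reported above: large alpha, tightly clustered theta, and nearly identical predicted probabilities.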

- Bob

Marcus Brubaker

Jan 4, 2014, 5:09:43 PM
to stan-...@googlegroups.com
Also worth noting is that the optimizer may only be finding a local optimum, whereas the sampler is more likely to find the global optimum.

Cheers,
Marcus


Jiqiang Guo

Jan 4, 2014, 5:25:19 PM
to stan-...@googlegroups.com

On Sat, Jan 4, 2014 at 12:12 AM, qlathrop <quinn....@pearson.com> wrote:
> note: the se here is from sqrt(diag(solve(-outOptim$hessian))), comments?
> sd across posterior means of theta = 0.089

The Hessian is on the unconstrained space, so if there is a transformation (i.e., the parameter is not defined on the whole real line), it cannot be used as is. I am not sure the delta method can be applied here, and even if it can, you need to know what transformation is used.
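For a <lower=0> parameter, the transformation Stan uses is u = log(alpha). If the Hessian really is in u, a delta-method conversion back to the alpha scale would look like the following Python sketch. The se_u value here is made up for illustration, and whether this conversion applies depends on which scale rstan actually reports the Hessian on, which is exactly the open question above.

```python
import math

# Hypothetical: mode and Hessian-based SE on the unconstrained scale
u_hat = math.log(18.60)  # unconstrained mode u = log(alpha)
se_u = 0.0185            # made-up SE of u from sqrt(diag(solve(-H)))

# Delta method: alpha = exp(u), so d(alpha)/d(u) = alpha, giving
# se(alpha) ~= alpha * se(u)
alpha_hat = math.exp(u_hat)
se_alpha = alpha_hat * se_u

print(round(alpha_hat, 2))  # 18.6
print(round(se_alpha, 4))   # 0.3441
```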

--
Jiqiang 