Graded Response Model (Samejima) & M2 statistics

ibaldauf

Mar 21, 2015, 5:58:03 AM
to mirt-p...@googlegroups.com

Dear Phil,

I have a question concerning the simulation of data according to the graded response model and the estimation of the M2 statistic for testing model fit.

I performed 10,000 repetitions (looping with a for loop; a sketch of such a loop appears after the code block below) of the following calculation:

#-------------------------------------------------------------------------------------------------------------------------------------------------------

# 1. Simulate data according to the graded response model

library(mirt)

# item slopes: 15 lognormal discrimination parameters
ol <- matrix(rlnorm(15, .2, .3))

# item intercepts: 3 per item, sorted in decreasing order as the GRM requires
ci <- matrix(rnorm(15*3, 0, 2), 15)
ci <- t(apply(ci, 1, sort, decreasing=TRUE))

# N = 200 respondents, 4 response categories per item
poly.grm <- simdata(ol, ci, 200, itemtype = 'graded')

 

# 2. dichotomization of the data [1 & 2 = 0] & [3 & 4 = 1]
# (note: simdata() returns categories coded 0-3 here, so `poly.grm < 3` splits
#  {0,1,2} vs {3}; the split described above would be `poly.grm < 2`)

dich.grm <- ifelse(poly.grm < 3, 0, 1)

 

# 3. estimation of the graded response model for the polytomous and dichotomized data

est.grm.poly <- mirt(poly.grm, model=1, itemtype="graded")

est.grm.dich <- mirt(dich.grm, model=1, itemtype="graded")

 

# 4. calculation of the M2 statistic for testing the model fit

# p-values which are smaller than alpha are coded as 1, else 0

alpha <- 0.05

m.grm.poly <- M2(est.grm.poly)
p.grm.poly <- m.grm.poly$p   # the p column (third column) of the M2 output
alpha.grm.poly <- ifelse(p.grm.poly < alpha, 1, 0)

m.grm.dich <- M2(est.grm.dich)
p.grm.dich <- m.grm.dich$p
alpha.grm.dich <- ifelse(p.grm.dich < alpha, 1, 0)

#-------------------------------------------------------------------------------------------------------------------------------------------------------
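A minimal sketch of how the repetition loop might be wrapped around the steps above (one.rep is an illustrative name, the replication count is reduced here, and errors are recorded as NA rather than crashing the loop):

library(mirt)

# one replication of steps 1-4 above, returning the two M2 p-values
one.rep <- function() {
    ol <- matrix(rlnorm(15, .2, .3))
    ci <- t(apply(matrix(rnorm(15*3, 0, 2), 15), 1, sort, decreasing = TRUE))
    poly.grm <- simdata(ol, ci, 200, itemtype = 'graded')
    dich.grm <- ifelse(poly.grm < 3, 0, 1)
    est.poly <- mirt(poly.grm, 1, itemtype = 'graded', verbose = FALSE)
    est.dich <- mirt(dich.grm, 1, itemtype = 'graded', verbose = FALSE)
    c(poly = M2(est.poly)$p, dich = M2(est.dich)$p)
}

# 100 replications for illustration; increase to 10000 for the full simulation
pvals <- t(replicate(100, tryCatch(one.rep(),
                                   error = function(e) c(poly = NA, dich = NA))))
colMeans(pvals < 0.05, na.rm = TRUE)  # empirical rejection rates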

The results were surprising. The M2 statistic (for testing model fit) had significant p-values in 2,129 of 10,000 cases (0.2129) for the polytomous model, but only 570 of 10,000 (0.057) for the dichotomous model.

How can that be? Have I made a mistake in the calculation?

Thank you very much for your help.

Best regards, Isabell

Phil Chalmers

Mar 21, 2015, 10:52:56 AM
to ibaldauf, mirt-package
Hi Isabell,

There are a couple of things going on here, and I assume some information about the simulation is missing (like how often the models failed to converge, which is very important to consider).

In addition to the failure-to-converge warnings, if you were using a later version of mirt (1.7+) then you should have received this error quite often:

Error in FUN(newX[, i], ...) :
  Items contain category scoring spaces greater than 1.
  Use apply(data, 2, table) to inspect and fix


This is because with N = 200, and the way the intercepts were defined in the GRM (see here for a better method: https://groups.google.com/forum/#!searchin/mirt-package/simdata/mirt-package/nCvEbML0OaA/jjNw-qtQx8EJ), each item likely dropped response categories or kept so few observations as to cause estimation issues. Models that fail to converge should not be tested for goodness of fit.
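Both checks can be scripted; a minimal sketch, assuming a mirt version that supports extract.mirt():

library(mirt)
ol <- matrix(rlnorm(15, .2, .3))
ci <- t(apply(matrix(rnorm(15*3, 0, 2), 15), 1, sort, decreasing = TRUE))
dat <- simdata(ol, ci, 200, itemtype = 'graded')

# inspect observed category counts per item; with N = 200 and widely spread
# intercepts, some categories are often empty or nearly so
apply(dat, 2, table)

# catch the category-spacing error instead of letting a loop crash
mod <- tryCatch(mirt(dat, 1, itemtype = 'graded', verbose = FALSE),
                error = function(e) NULL)
if (!is.null(mod)) extract.mirt(mod, 'converged')  # FALSE -> exclude from the M2 tally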

Phil 


ibaldauf

Mar 21, 2015, 1:09:11 PM
to mirt-p...@googlegroups.com, isabell...@gmail.com

Many thanks for the prompt reply.

You are right, I suppressed the warnings in my simulation, so I didn't get the 'EM cycles terminated after 500 iterations' warning (thanks for the hint).

Would it be better to simulate the data as follows:

ol <- matrix(rlnorm(15, .2, .3))

# ordered thresholds from cumulative uniform spacings, centred at zero
# (15 items x 4 thresholds, matching the 15 slopes above)
diffs <- t(apply(matrix(runif(15*4, .3, 1), 15), 1, cumsum))
diffs <- -(diffs - rowMeans(diffs))
d <- diffs + rnorm(15)

poly.grm <- simdata(ol, d, 400, itemtype = 'graded')

and perhaps also to increase the sample size to 400 or maybe even 800?

One final question I would like to ask: are there any other functions for testing the model fit of the graded response model?

Best regards I.B


PS Your package is great.



Phil Chalmers

Mar 22, 2015, 11:19:18 AM
to ibaldauf, mirt-package
On Sat, Mar 21, 2015 at 1:09 PM, ibaldauf <isabell...@gmail.com> wrote:

Would it be better to simulate the data as follows [see the code above], and perhaps also to increase the sample size to 400 or maybe even 800?


Yes, the sample size will also help. It's somewhat of an unwritten rule that you should have N of at least 5 * the number of estimated parameters when fitting your model, but of course it depends on a lot of other factors in the model to make that definitive. Your last attempt was a little on the low side: with 15 * 5 = 75 parameters the rule would suggest N of at least 375, against your N = 200.
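As an aside, a quick way to check this rule on a fitted model, sketched with the objects from the first post and assuming extract.mirt() is available (its 'nest' extraction returns the number of freely estimated parameters):

n.par <- extract.mirt(est.grm.poly, 'nest')  # number of estimated parameters
5 * n.par                                    # rule-of-thumb minimum N
nrow(poly.grm)                               # actual sample size to compare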
 

One final question I would like to ask: are there any other functions for testing the model fit of the graded response model?


There is also one based on collapsing factor scores to sum scores, which you can get with fscores(mod, method = 'EAPsum'). These are just the usual X2 and G2 statistics, which can be calculated from the observed and expected scores returned (I don't return them in the function, but probably should for simulation reasons). Cheers.

Phil
 




Phil Chalmers

Mar 22, 2015, 11:23:34 AM
to ibaldauf, mirt-package
Scratch that: they are returned as an attribute. Use attr(scores, 'fit') to pull out the data frame of fit statistics after calling scores <- fscores(mod, method = 'EAPsum').
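For example, a minimal sketch assuming the attribute is exposed this way in your mirt version:

scores <- fscores(est.grm.poly, method = 'EAPsum')
fit <- attr(scores, 'fit')  # data frame with the sum-score based fit statistics
fit                         # contains the X2 and G2 values discussed above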

ibaldauf

Mar 22, 2015, 11:50:11 AM
to mirt-p...@googlegroups.com, isabell...@gmail.com
Thank you very much for your advice.
I will try to do so.


ibaldauf

Mar 27, 2015, 12:14:18 PM
to mirt-p...@googlegroups.com, isabell...@gmail.com
Dear Phil,

I ran another simulation. This time I simulated the data as you suggested, dichotomized it, and calculated the model fit statistics M2, X2 and G2.
I used three different splitting criteria for the dichotomization, [1=0 & 2,3,4=1], [1,2=0 & 3,4=1] and [1,2,3=0 & 4=1], so I obtained a total of 4 matrices.
I named the dichotomized models grm_u for the splitting criterion [1=0 & 2,3,4=1], grm_m for [1,2=0 & 3,4=1], and grm_o for [1,2,3=0 & 4=1].
GRM denotes the polytomous model.

None of the models failed to converge, neither the polytomous nor the dichotomous ones.
I obtained the following results (relative frequencies of significant p-values):
      GRM     GRM_u   GRM_m   GRM_o
M2    0.055   0.058   0.051   0.061
X2    0.078   0.331   0.029   0.027
G2    0.091   0.327   0.031   0.032
 
The relative frequencies are quite similar overall, yet for the dichotomization with splitting criterion [1,2=0 & 3,4=1] (GRM_m) all model fit tests (M2, X2 and G2) yield a smaller number of significant p-values.
Furthermore, it should be noted that the fit tests did not always flag the same models as misfitting.
Am I right in assuming that X2 is the usual Pearson goodness-of-fit statistic and G2 is the likelihood-ratio test statistic?
I am not quite sure how to interpret these results.
Could it be that a relative frequency of about 0.05 is normal for simulated data and simply reflects the Type I error probability?
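For reference, the size of the Monte Carlo margin around the nominal rate can be computed directly; with 10,000 replications a well-calibrated test should land within roughly two standard errors of alpha:

alpha <- 0.05
n.rep <- 10000
se <- sqrt(alpha * (1 - alpha) / n.rep)  # Monte Carlo standard error, ~0.0022
alpha + c(-2, 2) * se                    # roughly [0.046, 0.054]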

Thanks for your help & best regards Isabell
