I made quite a bit of progress, but I have run into some issues with fit statistics. I'm working with 3 datasets with 3, 7, and 7 repeated measures and 1000, 1700, and 900 participants, respectively. There's no missing data after cleaning. Following one of the papers on using the item bifactor model for vertical scaling, I estimated similar models to account for the repeated measures and then arrived at the most constrained model that fit "well" after invariance testing. The models produced results similar to fully repeated-measures models estimated in a categorical CFA (CCFA) framework.
As you'll see, the fit statistics for the model for the first dataset are reasonable, but the fit statistics for the second dataset seem poor to me. I have no reference point for that, though, because the corresponding model with "configural invariance" fits better only in some respects (see below). Within the CCFA framework, similar models have better fit, but I know that there isn't a direct correspondence between IRT and CCFA fit statistics. So, my question is: what set of statistics should I check to ensure "good" model fit? My sense is that the M2 statistic is not reasonable because it's a test of exact fit. The CFI seems to be inconsistent, but the RMSEA seems to function well. The AIC, BIC, SABIC, etc. seem to function as in other models and have been useful. Any suggestions would be appreciated. This is a substantive paper, so I need to be able to defend my choices in the analyses.
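For concreteness, the global statistics above can be pulled in mirt along these lines (a minimal sketch; fit is a placeholder for any of the fitted models below, and using the C2 variant is an assumption on my part given the polytomous items):

library(mirt)

# Limited-information global fit; C2 collapses the M2 table to first- and
# second-order margins of the item category totals, which helps with
# polytomous items when M2* leaves too few degrees of freedom
M2(fit, type = "C2")  # M2/C2, df, p, RMSEA + 90% CI, SRMSR, TLI, CFI

# Information criteria for relative comparisons
extract.mirt(fit, "AIC")
extract.mirt(fit, "BIC")
extract.mirt(fit, "SABIC")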
Regarding item fit statistics, I've noticed some peculiar behavior (I won't say odd). The fit statistics are much larger for the NRM, smaller for the GPCM, and smallest for the GRM in datasets 1 and 2. In dataset 3, they are much larger for the GPCM, smaller for the NRM, and smallest for the GRM. However, the magnitude of all of the fit statistics is much larger in datasets 2 and 3 than in dataset 1. Is it enough to select the GRM simply based on this ordering, which indicates better fit? And I wonder whether the magnitude of these fit statistics grows as the model becomes more complex, and whether that complexity should be taken into account when interpreting IRT model fit statistics.
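For reference, the item-level numbers come from something along these lines (a sketch; fit again stands in for a fitted model, and S-X2 as the statistic is an assumption in this snippet):

library(mirt)

# S-X2 item fit (Orlando & Thissen); RMSEA.S_X2 serves as a per-item
# effect size that is less sensitive to sample size than the p-value
itemfit(fit, fit_stats = "S_X2")

# Adjust p-values for the number of items tested
itemfit(fit, fit_stats = "S_X2", p.adjust = "fdr")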
Dataset 1, with a general factor for the construct that all items measure and 3 specific factors, one per administration of the 3-item scale (constraints are from the invariance testing):
upes.s1_part4 <- ' G = 1-9
CONSTRAIN = (1, 4, 7, a1), (2, 5, 8, a1), (3, 6, 9, a1),
(4, 7, a3, a4), (5, 8, a3, a4), (6, 9, a3, a4),
(1, 4, 7, d1), (4, 7, d2), (4, 7, d3), (4, 7, d4),
(2, 5, 8, d1), (5, 8, d2), (5, 8, d3), (5, 8, d4),
(3, 6, 9, d1), (6, 9, d2), (6, 9, d3), (6, 9, d4) '
upes.s1_spec <- rep(1:3, each = 3)
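(The fitting call is along these lines; a sketch, where dat1 stands in for the cleaned dataset-1 responses and itemtype = "graded" reflects the GRM, which is an assumption here. Dataset 2 below follows the same pattern.)

library(mirt)

# Bifactor model: upes.s1_spec assigns each item to its administration-
# specific factor; model2 adds the general factor and the constraints
fit.s1 <- bfactor(dat1, model = upes.s1_spec, model2 = upes.s1_part4,
                  itemtype = "graded")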
        M2  df         p   RMSEA  RMSEA_5  RMSEA_95  SRMSR    TLI    CFI
stats 94.5  15  1.41e-13  0.0728   0.0591    0.0872  0.043  0.878  0.796
Dataset 2, with a general factor for the construct that all items measure and 7 specific factors, one per administration of the 3-item scale (constraints are from the invariance testing):
upes.s2_part4 <- ' G = 1-21
CONSTRAIN = (1, 4, 7, 10, 13, 16, 19, a1), (2, 5, 8, 11, 14, 17, 20, a1), (3, 6, 9, 12, 15, 18, 21, a1),
(1, 4, 7, 13, 16, 19, a2, a3, a4, a6, a7, a8),
(2, 5, 8, 14, 17, 20, a2, a3, a4, a6, a7, a8),
(3, 6, 9, 15, 18, 21, a2, a3, a4, a6, a7, a8),
(1, 4, 7, 10, 13, 16, 19, d1),
(1, 4, 7, 13, 16, 19, d2),
(1, 4, 7, 13, 16, 19, d3),
(1, 4, 7, 13, 16, 19, d4),
(2, 5, 8, 11, 14, 17, 20, d1),
(2, 5, 8, 14, 17, 20, d2),
(2, 5, 8, 14, 17, 20, d3),
(2, 5, 8, 14, 17, 20, d4),
(3, 6, 9, 12, 15, 18, 21, d1),
(3, 6, 9, 15, 18, 21, d2),
(3, 6, 9, 15, 18, 21, d3),
(3, 6, 9, 15, 18, 21, d4)'
upes.s2_spec <- rep(1:7, each = 3)
        M2   df  p  RMSEA  RMSEA_5  RMSEA_95  SRMSR    TLI    CFI
stats 6989  201  0   0.14    0.137     0.143  0.134  0.532  0.361
Dataset 2, configural invariance:
        M2   df  p  RMSEA  RMSEA_5  RMSEA_95  SRMSR    TLI    CFI
stats 4410  105  0  0.155    0.151     0.158  0.132  0.432  0.594
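(For completeness, the constrained and configural models can also be compared directly, since they are nested; a sketch, with fit.s2 and fit.s2.config as placeholders for the two fitted models:)

# Likelihood-ratio comparison of nested models, with AIC/BIC/SABIC deltas
anova(fit.s2.config, fit.s2)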