Missing data handling in longitudinal studies with mixed model repeated measures.

jamescha...@healthmaps.com.au

Sep 21, 2014, 7:52:21 PM

to missin...@googlegroups.com

Hello,

In some studies that I have been asked to analyse, with data collected at a series of time points, there are missing data despite the best efforts of clinicians to collect all the data. The missing data patterns in the responses are roughly, but not strictly, monotone. The response variables are scored psychological scales. Some of the studies are RCTs, and others examine change without a control group.

I am using mixed model repeated measures (MMRM), which implicitly imputes data via a random intercept (at the participant level) under a missing at random (MAR) assumption based on the variables in the mixed model analysis. For several studies I have additional variables that are not used in the MMRM analysis and may give information on the missing data (predictors of missingness, variables correlated with a variable that has some missing data, some categorical variables, some pseudo-continuous variables which are scored scales).

I'm wondering about the best approach to extend the analysis to address the missing data issues explicitly. One approach I am looking at is to keep the initial MMRM model as the primary analysis, and add a secondary analysis which deals more thoroughly with the missing data using multiple imputation. To apply multiple imputation, I can reshape the data into wide format and screen the variables I have for whether they predict missingness and whether they are correlated with variables with missing data, and are therefore useful for imputing missing values. After building an imputation model involving the variables in the primary analysis as well as additional auxiliary variables, I can apply multiple imputation and run a secondary mixed model analysis on the imputed data. If the primary and secondary analyses agree, we would be more confident of the results. If they are substantially different, then either the MAR assumption of the primary analysis is not valid, or the imputation model is not good, or both.
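To make the "reshape to wide format" step and the "roughly monotone" description concrete, here is a minimal, language-neutral sketch in Python (not the Stata code used for these studies). The record layout `(participant_id, time, score)`, the function names, and the data are all made up for illustration.

```python
from collections import defaultdict


def to_wide(long_rows, time_points):
    """Reshape (participant_id, time, score) records into one row per participant.

    A time point with no record for a participant is stored as None (missing).
    """
    wide = defaultdict(lambda: {t: None for t in time_points})
    for pid, time, score in long_rows:
        wide[pid][time] = score
    return {pid: [scores[t] for t in time_points] for pid, scores in wide.items()}


def is_monotone(scores):
    """True if, once a value is missing, every later value is missing too."""
    seen_missing = False
    for s in scores:
        if s is None:
            seen_missing = True
        elif seen_missing:
            return False  # observed again after a gap: intermittent missingness
    return True


# Hypothetical data: participant 2 drops out after time 1 (monotone),
# participant 3 misses time 1 but returns at time 2 (not monotone).
long_rows = [
    (1, 0, 21.0), (1, 1, 18.0), (1, 2, 15.0),
    (2, 0, 30.0), (2, 1, 27.0),
    (3, 0, 25.0), (3, 2, 19.0),
]
wide = to_wide(long_rows, time_points=[0, 1, 2])
non_monotone = [pid for pid, scores in wide.items() if not is_monotone(scores)]
print(non_monotone)
```

Listing the participants with intermittent gaps shows how far the pattern departs from pure dropout, which matters when choosing between monotone and chained-equation imputation methods.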

I am using Stata 13.1 and have done some work on one study using -mi impute chained- with 21 imputation equations: 19 of these use -pmm, knn(5)- and 2 use -logit-, with 3 to 12 predictor variables in each equation.
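For readers less familiar with what happens after the imputation step: the analysis model is fitted to each imputed dataset and the results are combined with Rubin's rules (this is what Stata's -mi estimate- does). The Python sketch below only illustrates the arithmetic for a single coefficient; the function name and the numbers are hypothetical and not taken from any of the studies.

```python
import math


def pool_rubin(estimates, variances):
    """Combine one coefficient across m imputed datasets using Rubin's rules.

    estimates: point estimate from each imputed dataset
    variances: squared standard error from each imputed dataset
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two imputations, with one variance each")
    q_bar = sum(estimates) / m                        # pooled point estimate
    within = sum(variances) / m                       # average within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between            # total variance
    return q_bar, math.sqrt(total)


# Hypothetical treatment-by-time coefficient from 5 imputed datasets.
estimates = [-2.0, -3.0, -2.5, -3.5, -1.5]
variances = [1.0, 1.0, 1.0, 1.0, 1.0]
pooled_estimate, pooled_se = pool_rubin(estimates, variances)
print(pooled_estimate, pooled_se)
```

The between-imputation term is what makes the pooled standard error reflect uncertainty about the missing values; with little missing information it shrinks towards zero and the pooled result approaches a single complete-data analysis.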

My questions are as follows:

(1) I would like to confirm that I have understood the literature correctly and the approach above with primary and secondary analyses and their interpretation is a reasonable way to proceed.

(2) I am wondering if you could argue that the multiple imputation analysis relies on a weaker MAR assumption, so it should be the primary analysis and the analysis without multiple imputation should be the secondary analysis?

(3) The screening process to choose variables for the imputation model may pick up random noise in the data rather than showing true effects relating to the missing data.

One way to address this would be to screen with a more stringent correlation threshold (in the study mentioned above, each predictor has a correlation of at least 0.25 with the variable to be imputed), or to reduce the number of variables used in each equation for imputing a variable. Any additional ideas on how to select variables for the imputation model would be welcome.
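As an illustration of the screening rule described above, here is a minimal Python sketch (not the Stata code used in the study). It keeps a candidate auxiliary variable if its correlation with the observed values of the target, or with the target's missingness indicator, reaches the 0.25 threshold. The variable names `baseline_score`, `distance_km` and `noise`, and all data values, are invented for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation using only pairs where both values are observed."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 3:
        return 0.0
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def screen_auxiliaries(target, candidates, threshold=0.25):
    """Keep candidates related to the target's values or to its missingness."""
    missing = [1.0 if v is None else 0.0 for v in target]
    selected = {}
    for name, values in candidates.items():
        r_value = pearson(values, target)     # correlation with observed target values
        r_missing = pearson(values, missing)  # correlation with the missingness indicator
        if abs(r_value) >= threshold or abs(r_missing) >= threshold:
            selected[name] = (r_value, r_missing)
    return selected


# Hypothetical follow-up score with three missing values, plus three candidates.
target = [10.0, 12.0, None, 15.0, None, 18.0, 20.0, None]
candidates = {
    "baseline_score": [9.0, 13.0, 14.0, 14.0, 16.0, 19.0, 21.0, 22.0],
    "distance_km": [5.0, 6.0, 40.0, 8.0, 35.0, 7.0, 4.0, 50.0],
    "noise": [3.0, 1.0, 2.0, 5.0, 3.0, 1.0, 3.0, 3.0],
}
selected = screen_auxiliaries(target, candidates)
print(sorted(selected))
```

In this toy example `baseline_score` is kept because it tracks the observed scores, `distance_km` is kept only because it predicts missingness, and `noise` is dropped. With small samples, correlations near the threshold are unstable, which is exactly the random-noise concern raised in this question.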

(3) On reporting the results: I wrote up some results for one study where I concentrated on the multiple imputation analysis, which deals more thoroughly with the missing data (clinicians wanted just one analysis method), rather than reporting details of the initial MMRM with limited implicit imputation. However, I am thinking that I need to report both sets of results (primary and secondary in the scheme above) and discuss the implications. Is it best to report both the multiply imputed results and the results without multiple imputation?

(4) Is there an advantage to using Realcom-Impute, e.g. -smcfcs- or -realcomImpute-, rather than the official Stata -mi impute chained-?

Thanks

Jamie

Jonathan Bartlett

Sep 21, 2014, 8:16:16 PM

to missin...@googlegroups.com

Hi Jamie

1) It sounds to me like you have a good understanding, and your interpretation sounds reasonable!

2) I think it depends. In the trial context the primary analysis must usually be prespecified, and it is arguably easier to unambiguously prespecify an MMRM analysis than an imputation analysis. I think if it is a priori plausible for the MAR assumption of your MMRM analysis to be valid, then putting this as the primary analysis is reasonable. Conversely, if you were fairly confident that this MAR assumption will be violated, and that this can be mitigated by conditioning on auxiliary variables in an imputation model, then the multiple imputation analysis would probably be better as the primary analysis.

3) Unless you have a large number of variables relative to subjects/individuals, the general advice is to err on the side of being inclusive / less stringent, since you just lose a bit of efficiency if you include variables that in truth don't help. A recent paper (http://www.biomedcentral.com/1471-2288/12/184) investigated the potential problems that can occur when you try to include too many variables in the imputation model with a small dataset, and gave some recommendations.

4) If everyone is measured at a relatively small number of time points, such that you can organise the data into wide form (i.e. different variables for each time point's measurements), then no, there is probably no advantage to using Realcom Impute. smcfcs is useful if your analysis model contains non-linearities or interactions, but at the moment it only supports imputation compatible with a substantive model for a single outcome, and so I don't think you could really use it if your analysis model is an MMRM.

Best wishes

Jonathan

jamescha...@healthmaps.com.au

Sep 21, 2014, 11:33:37 PM

to missin...@googlegroups.com

Hi Jonathan

Thanks very much for the prompt and informative response! I've downloaded the article on auxiliary variables and it is very relevant to what I am doing.

Jamie
