Hello,
In some studies that I have been asked to analyse, data were collected at a series of time points, and there are missing data despite the best efforts of clinicians to collect everything. The missing data patterns in the responses are roughly, but not strictly, monotone. The response variables are scored psychological scales. Some of the studies are RCTs, and others examine change without a control group.
I am using mixed model repeated measures (MMRM), which implicitly imputes data via a random intercept (at the participant level) under a missing at random (MAR) assumption conditional on the variables in the mixed model. For several studies I have additional variables that are not used in the MMRM analysis and may carry information about the missing data: predictors of missingness and variables correlated with a variable that has some missing data. Some of these are categorical and some are pseudo-continuous (scored scales). I'm wondering about the best approach to extend the analysis to address the missing data issues explicitly.
One approach I am looking at is to keep the initial MMRM as the primary analysis, and to add a secondary analysis that deals more thoroughly with the missing data using multiple imputation. To apply multiple imputation, I can reshape the data into wide format and screen the available variables for whether they predict missingness and whether they are correlated with variables that have missing data, and are therefore useful for imputing missing values. After building an imputation model that includes the variables in the primary analysis as well as the additional auxiliary variables, I can fit the mixed model to the multiply imputed data as the secondary analysis.
If the primary and secondary analyses agree, we would be more confident in the results. If they differ substantially, then the MAR assumption of the primary analysis is not valid, the imputation model is poor, or both.
I am using Stata 13.1 and have done some work on one study using -mi impute chained- with 21 imputation equations: 19 use -pmm, knn(5)- and 2 use -logit-, with 3 to 12 predictor variables in each equation.
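For concreteness, a minimal sketch of this kind of workflow is shown below. It is not the actual study setup: every variable name (id, group, age, sex, base, score1-score3, aux1, employed), the number of imputations and the seed are hypothetical, and it assumes the data start with one row per participant and that base, group, age and sex are fully observed.

```stata
* Sketch only: all variable names and settings are hypothetical.
* Data layout: one row per participant; score1-score3 are the scale
* scores at follow-up visits 1-3, base is the baseline score.

mi set mlong

* Variables with missing values to be imputed
mi register imputed score1 score2 score3 aux1 employed
* Fully observed variables used as predictors (assumed complete)
mi register regular base group age sex

* pmm for the scored scales and a continuous auxiliary variable,
* logit for a binary auxiliary variable
mi impute chained                          ///
    (pmm, knn(5)) score1 score2 score3 aux1 ///
    (logit) employed                        ///
    = base i.group age i.sex, add(20) rseed(12345)

* Reshape to long format for the repeated measures model
mi reshape long score, i(id) j(visit)

* Secondary analysis: the same mixed model, combined across imputations
mi estimate: mixed score i.group##i.visit base || id:
```

By default each imputation equation here uses all other imputation variables plus the right-hand-side variables as predictors. Restricting an equation to a screened subset would use the -include()-, -omit()- and -noimputed- suboptions of -mi impute chained-.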
My questions are as follows:
(1) I would like to confirm that I have understood the literature correctly and the approach above with primary and secondary analyses and their interpretation is a reasonable way to proceed.
(2) Could one argue that the multiple imputation analysis rests on a weaker MAR assumption, so that it should be the primary analysis and the analysis without multiple imputation should be the secondary analysis?
(3) The screening process to choose variables for the imputation model may pick up random noise in the data rather than showing true effects relating to the missing data.
One way to address this would be to screen at a more stringent correlation threshold (in the study mentioned above, each predictor has a correlation of at least 0.25 with the variable to be imputed), or to reduce the number of variables used in each imputation equation. Any additional ideas on how to select variables for the imputation model would be welcome.
(4) On reporting the results: for one study I wrote up results that concentrated on the multiple imputation analysis, which deals more thoroughly with the missing data (the clinicians wanted just one analysis method), rather than reporting details of the initial MMRM with its limited implicit imputation. I am now thinking that I need to report both sets of results (primary and secondary in the scheme above) and discuss the implications. Is it best to report the results both with and without multiple imputation?
(5) Is there an advantage to using a user-written command such as -smcfcs- or -realcomImpute- (REALCOM-Impute) rather than the official Stata -mi impute chained-?
Thanks
Jamie