Joint model for Missing data in stan

smockin morry

unread,

Mar 27, 2017, 11:34:58 AM3/27/17

to Stan users mailing list

Hey Stan Community

Am attempting to build a joint model that takes care of missing data in a model with multiple predictors as recommended by Bob here and also in the manual. Using one of the R datasets mtcars, i have created a temporal missing values in three variables (disp, cyl,gear). The model runs fine and converges but the resultant coefficients are quite different when compared with data without missing values. Am i missing something in my model specification?. I have attached data and stan model below. Thanks in advance for your response.

Smockin.

Stan Code


data {
  int<lower=0>  N;     // total cases
  int<lower=0>  p;        //predictors
  int<lower=0> N_obs;   //observed cases both dependent & independent variables
  int<lower=0> N_miss;  //missed data in the independent variables but has data on dependent variable
  real y_obs[N_obs];      //dependent var with complete cases
  real y_miss[N_miss];    //dependent var with incomplete cases
  matrix[N_obs,p] x_obs;   //the model matrix of complete cases
}
parameters {
  real<lower=0> sigma;   //population standard dev
   vector[p] beta;     //the regression parameters & intercept
  real x_mu;                  // mean of dist of x values
  real<lower=0> x_sigma;      // sd of dist of x values
  matrix[N_miss,p] x_miss;   //imputed matrix
}

model {
    beta~ cauchy( 0 , 2.5 );
  sigma ~ uniform( 0 , 100 );
  x_mu ~ cauchy( 0 , 2.5 );
  x_sigma ~ uniform( 0 , 100 );

  //observed
  for (n in 1:N_obs) {
    for (k in 1:p)  {
      x_obs[n,k] ~ normal( x_mu , x_sigma );
      y_obs[n] ~ normal(beta[k]*x_obs[n,k], sigma);
      }
  }

  //missing; impute
  for (n in 1:N_miss) {
    for (k in 1:p)  {
      x_miss[n,k] ~ normal( x_mu , x_sigma );
      y_miss[n] ~ normal(beta[k]*x_miss[n,k], sigma);
    }
  }
}

Data


data("mtcars")
testData= mtcars

#Factors
testData$cyl<- factor(testData$cyl)
testData$gear<- factor(testData$gear)

#Dependent variable
y=testData$mpg

#True coefficients
trueCoef=coef(lm(mpg~cyl+disp+gear, data=testData))

# indices with mssing values
set.seed(1)
cyl.miss=sample(1:nrow(testData), 7,replace = F)
set.seed(2)
disp.miss=sample(1:nrow(testData), 7,replace = F)
set.seed(3)
gear.miss=sample(1:nrow(testData), 7,replace = F)

#Create missing values in data
testData$cyl[cyl.miss]<- NA_integer_
testData$disp[disp.miss]<- NA_real_
testData$gear[gear.miss]<-NA_integer_



# dataset with missing values (Incomplete)
incompleteData=testData[!complete.cases(testData),]

# complete datasets
complecasesData=data.frame(na.omit(testData))

#model matrix of predictors
X=model.matrix(~cyl+disp+gear, data=complecasesData)

# data list
data_list <- list(
  N=length(y)
  ,y=y
  ,N_obs=nrow(complecasesData)
  ,N_miss=nrow(incompleteData)
  ,y_obs=complecasesData$mpg
  ,y_miss=incompleteData$mpg
  ,x_obs=X
  ,p=dim(X)[2]
)

rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
StanCoef<- stan( file = "impute missing.stan", data = data_list)

Results

print(StanCoef, "beta")
Inference for Stan model: impute missing.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

        mean se_mean   sd 2.5%  25%  50%  75% 97.5% n_eff Rhat
beta[1] 0.16       0 0.06 0.06 0.12 0.16 0.20  0.29  1511    1
beta[2] 0.16       0 0.06 0.06 0.12 0.15 0.19  0.28  1944    1
beta[3] 0.16       0 0.06 0.05 0.12 0.15 0.19  0.29  1645    1
beta[4] 0.07       0 0.02 0.03 0.05 0.07 0.08  0.10  3394    1
beta[5] 0.16       0 0.06 0.06 0.12 0.16 0.20  0.29  1772    1
beta[6] 0.16       0 0.06 0.06 0.12 0.15 0.19  0.28  1813    1

Samples were drawn using NUTS(diag_e) at Mon Mar 27 17:49:32 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
> trueCoef
(Intercept)        cyl6        cyl8        disp       gear4       gear5 
29.41094467 -4.79980682 -4.85587246 -0.02690655  0.03227781  0.31940301

Bob Carpenter

unread,

Mar 29, 2017, 12:49:23 AM3/29/17

to stan-...@googlegroups.com

Usually only some values are missing and you want to impute
them using some kind of joint model that depends on other variables.
And then put them back together with the observed components before
doing the regression.

Here it looks like you're just doing a global imputation marginally on
each x value. And if one covariate is missing, they're all missing?

You shouldn't combine

real<lower=0> x_sigma; // sd of dist of x values

and

x_sigma ~ uniform( 0 , 100 );

Stan requires positive density for all variables meeting constraints,
and x_sigma = 101 meets the constraints on x_sigma but has zero
density.

- Bob

> --
> You received this message because you are subscribed to the Google Groups "Stan users mailing list" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to stan-users+...@googlegroups.com.
> To post to this group, send email to stan-...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

smockin morry

unread,

Mar 29, 2017, 1:41:43 AM3/29/17

to Stan users mailing list

Thanks Bob for the response. Does it mean the problem is in the prior specification of standard deviation? I have commented out the sections you highlighted but the results remain as they were. I have made references to the manual severally but its not helpful in this regard where the model matrix of multiple variables has missing data. Any further suggestions?

Bob Carpenter

unread,

Mar 30, 2017, 9:53:49 AM3/30/17

to stan-...@googlegroups.com

Not clear on what you commented out here.

I think the problem is that the imputation model you're
using for the predictors is bad. You're modeling each
predictor marginally, where what you want to do is model
them jointly.

I'm guessing that if you simulate data from the model (where
each predictor is independent in each instance), you'd get
much better fits.

- Bob

Reply all

Reply to author

Forward