Multiple imputation and Stan


Guido Biele

Mar 28, 2014, 5:29:54 AM
to stan-...@googlegroups.com

Multiple imputation
Hello,
I need to run regressions with imputed data and would like to hear your opinions on which method I should use.
Background: I am working with a relatively large dataset (N up to 50,000) on which I run regressions with typically 5-15 variables of interest. The data set has many more variables, which I can use in the imputation process. Variables can be continuous, ordered factors, or binary. Data are certainly not missing completely at random*. Only around 25% of the cases have missing data.
My first idea was to do the imputation and the estimation of the regression model of interest in the same Stan model. I did a simple test with simulated data, which worked fine (see code below), but while doing the test I realized that scaling this to my own dataset would be extremely time-consuming.

The alternative then is to do the imputation first (e.g., with the mi package) and to fit the final regression model in Stan. Here, Ben proposed in an earlier post to
 (a) run the stan model repeatedly with the different versions of the imputed data sets and
 (b) merge the chains from these models
I’d like to hear your opinion about an alternative way to use the different versions of the imputed data sets: 
 (a) submit the originally complete part of the dataset and the different versions of the imputed data to the same model and
 (b) specify the model such that the likelihood is computed for the originally complete data plus a randomly chosen imputed data set
Specifically, when 
Ycomp      =        criterion values for complete data
Yimp       =        criterion values for imputed data
Xcomp      =        design matrix for complete data
Xsimp      =        array with design matrices for imputed data
alpha      =        intercept parameter
beta       =        vector of regression weights
sigma      =        regression error

the crucial part of the model specification would be: 
Ycomp ~ normal(Xcomp * beta, sigma)
Yimp ~ normal(Xsimp[k] * beta, sigma)  // where k is a random number

It seems to me this would be more efficient than running multiple Stan models on different imputed data sets, given that only (in my case) ca. 25% of the cases differ between the imputed data sets. Does this make sense?
If yes, the problem to solve would be how to get the random number k. In Stan, random numbers can only be generated in the generated quantities block. Hence, would this admittedly hacky trick work:
First I specify a parameter real_k with an (implicit) uniform prior in the parameters block:
real<lower=0,upper=impN> real_k;  // impN = number of versions of imputed data
Then I convert it to an integer in the model block (following a proposal from Bob in another post):
int k;
    k <- 1;
    while (k <= floor(real_k))  // ends with k = floor(real_k) + 1, i.e., uniform on 1..impN
      k <- k + 1;

I have a bad feeling about specifying a parameter that will be updated even though it has no influence on the likelihood (I don’t know enough about the sampler to know whether this is problematic), but I don’t have a better solution at the moment.

Sorry for the long post. 
Looking forward to your feedback,
Guido



*The data set includes, for example, children’s birth weight and their mothers’ responses to questions about smoking. Not surprisingly, mothers who did not provide smoking information have, on average, lighter children.

#############################################################################
####################### test simple imputation model in R #########################
#############################################################################

library(rstan)

####################### make data set #########################
nobs = 2500

M = matrix(c(1, 0.4, 0.6, 0.6,
             0.4, 1, 0.3, 0.94,
             0.6, 0.3, 1, 0.2,
             0.6, 0.94, 0.2, 1), nrow=4, ncol=4)

# Cholesky decomposition             
L = chol(M)
nvars = dim(L)[1]

# draw random variables whose correlation matrix is M
r = t(L) %*% matrix(rnorm(nvars*nobs,sd = .5), nrow=nvars, ncol=nobs)
r = t(r)

rdata = data.frame(r)
names(rdata) = c('r', 'p1', 'p2', 'p3')
Y = r[,1]
X = r[,2:4]

nvars = 3                                 # number of predictors (column 1 of r is the outcome)
nmissings = round(nobs*.25/nvars)*nvars
missings = matrix(sample(nobs,nmissings),ncol = nvars)  # distinct rows with one missing predictor each
complete = seq(nobs)[-missings]           # rows with no missing values

####################### put data together for stan #########################
standata = list(Y = Y,
                X = X,
                N = nobs,
                Ncomp = length(complete),
                compIDX = complete,
                Nmis1 = nrow(missings),
                Nmis2 = nrow(missings),
                Nmis3 = nrow(missings),
                mis1IDX = missings[,1],
                mis2IDX = missings[,2],
                mis3IDX = missings[,3]
  )


############################  stan model ##############################
modelcode = "
data {
  int<lower=0> N;                   // total number of observations
  vector[N] Y;                      // vector with outcome
  matrix[N,3] X;                    // design matrix (predictors)
  int<lower=0> Ncomp;               // number of complete cases
  int<lower=0> compIDX[Ncomp];      // indices of complete cases
  int<lower=0> Nmis1;               // numbers of cases with missing predictors 1-3
  int<lower=0> Nmis2;
  int<lower=0> Nmis3;
  int<lower=0> mis1IDX[Nmis1];      // indices of cases with missing predictors 1-3
  int<lower=0> mis2IDX[Nmis2];
  int<lower=0> mis3IDX[Nmis3];
}

parameters {
  vector[3] beta;                   // regression weights for the ultimate predictors
  real<lower=0> sigma_Y;            // error term for ultimate regression
  vector[2] beta_imp1;              // regression weights to impute missing variables 1-3
  vector[2] beta_imp2;
  vector[2] beta_imp3;                      
  real<lower=0> sigma_imp1;         // error terms for imputation of missing variables 1-3
  real<lower=0> sigma_imp2;
  real<lower=0> sigma_imp3;
}

model {
  matrix[N,3] compX; 
  int idx;
  
  compX <- X;                       // copy the design matrix; missing entries are overwritten with imputed values below

  beta ~ cauchy(0,2.5);             // using cauchy priors following Gelman et al 2008
  beta_imp1 ~ cauchy(0,2.5);
  beta_imp2 ~ cauchy(0,2.5);
  beta_imp3 ~ cauchy(0,2.5);
  sigma_Y ~ cauchy(0,2.5);
  sigma_imp1 ~ cauchy(0,2.5);
  sigma_imp2 ~ cauchy(0,2.5);
  sigma_imp3 ~ cauchy(0,2.5);

  
  for (n in 1:Ncomp){               // fit the imputation regressions on the complete cases
    idx <-  compIDX[n];
    X[idx,1] ~ normal(beta_imp1[1] * X[idx,2] + beta_imp1[2] * X[idx,3], sigma_imp1);
    X[idx,2] ~ normal(beta_imp2[1] * X[idx,1] + beta_imp2[2] * X[idx,3], sigma_imp2);
    X[idx,3] ~ normal(beta_imp3[1] * X[idx,1] + beta_imp3[2] * X[idx,2], sigma_imp3);
  }
  
  
  for (k in 1:Nmis1){               // fill in missing predictors with their conditional means
    idx <- mis1IDX[k];
    compX[idx,1] <- beta_imp1[1] * X[idx,2] + beta_imp1[2] * X[idx,3];}
  for (l in 1:Nmis2){
    idx <- mis2IDX[l];
    compX[idx,2] <- beta_imp2[1] * X[idx,1] + beta_imp2[2] * X[idx,3];}
  for (m in 1:Nmis3){
    idx <- mis3IDX[m];
    compX[idx,3] <- beta_imp3[1] * X[idx,1] + beta_imp3[2] * X[idx,2];}

  Y  ~  normal(compX  * beta, sigma_Y);
}

"
############################  run stan model ##############################
stan_imp = stan(model_name = "test_imp", model_code = modelcode, data = standata, iter = 5000, chains = 4, seed = 12345)

Bob Carpenter

Mar 28, 2014, 6:52:34 AM
to stan-...@googlegroups.com
I'm not sure about the best way to do multiple imputation, but
we're interested in doing this in Stan.

The approach you're suggesting (use a random subset of the imputed data
in each iteration) is a stochastic-data approach related to issues,
such as "cut" in BUGS, that have been brought up before. I'm not sure
it's ideal computationally because of the interaction with adaptation.

I'm not even sure the approach you suggest produces the same posterior
as what Ben was suggesting (fit a bunch of random subsets, then pool
chains).

Adding auxiliary variables that don't affect the likelihood is
not a problem --- they just marginalize out as you'd expect in
the inferences. HMC itself takes the parameter vector theta and
adds an auxiliary variable of the same size consisting of
momentum/kinetic energy terms. The momentum terms are then marginalized
out for inference (Stan users don't even see them).

- Bob

Guido Biele

Mar 28, 2014, 10:12:58 AM
to stan-...@googlegroups.com

Thanks for your quick response, Bob.


I want to clarify one thing, because I did not mean to propose fitting a random subset of the imputed data, at least not if "subset" means using only a subset of the cases for which imputation was done.


Let’s say we have a dataset with 900 complete cases and 100 incomplete cases with missing data. Then we perform multiple imputation to get 5 imputed data sets, each consisting of the 900 originally complete cases plus the 100 cases with imputed missing data (with some variation in the imputed values across the 5 data sets).


Ben’s suggestion was to fit the 5 data sets (1000 cases each) in independent Stan calls and then to merge the chains.


My proposal was to make one Stan call to which I submit the 900 complete cases plus the 5 versions of the 100 cases with imputations. For the likelihood calculation I would then, in each iteration, use all 900 originally complete cases plus one randomly chosen set of the 5 sets of 100 cases with imputed data.
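
In Stan terms, the data layout would be something like this (just a sketch; all the names are placeholders):

data {
  int<lower=1> K;            // number of predictors
  int<lower=0> Ncomp;        // originally complete cases (here 900)
  int<lower=0> Nimp;         // cases that needed imputation (here 100)
  int<lower=1> M;            // number of imputed versions (here 5)
  vector[Ncomp] Ycomp;       // outcomes for the complete cases
  matrix[Ncomp,K] Xcomp;     // design matrix for the complete cases
  vector[Nimp] Yimp;         // outcomes for the cases with imputations
  matrix[Nimp,K] Ximp[M];    // M versions of the imputed design matrix
}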


I understand the potential problem during adaptation (of the sampler's tuning parameters, I assume). Still, I think this is not necessarily a problem, because the imputed values should be similar across imputation versions.


I guess I’ll give it a try and report back.


Cheers - Guido





Ben Goodrich

Mar 28, 2014, 11:46:08 AM
to stan-...@googlegroups.com
On Friday, March 28, 2014 5:29:54 AM UTC-4, Guido Biele wrote:
My first idea was to do the imputation and the estimation of the regression model of interest in the same Stan model. I did a simple test with simulated data, which worked fine (see code below), but while doing the test I realized that scaling this to my own dataset would be extremely time-consuming.

That is the best way, but it wouldn't work in Stan if there is missingness in the discrete variables. So, you will probably have to impute first.
 
The alternative then is to do the imputation first (e.g., with the mi package) and to fit the final regression model in Stan. Here, Ben proposed in an earlier post to
 (a) run the stan model repeatedly with the different versions of the imputed data sets and
 (b) merge the chains from these models
I’d like to hear your opinion about an alternative way to use the different versions of the imputed data sets: 
 (a) submit the originally complete part of the dataset and the different versions of the imputed data to the same model and
 (b) specify the model such that the likelihood is computed for the originally complete data plus a randomly chosen imputed data set

I think if you were to go this route, it would be better to (in the same model) do the complete part and all of the imputed parts, giving each imputed part a fixed weight of 1/m.
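
In the model block that could look something like this (just a sketch, reusing the placeholder names from the data-block sketch above, with Ximp an array of M imputed design matrices):

  Ycomp ~ normal(Xcomp * beta, sigma);
  for (m in 1:M)   // each imputed part contributes its log-likelihood with a fixed weight of 1/M
    increment_log_prob(normal_log(Yimp, Ximp[m] * beta, sigma) / M);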
 
Specifically, when 
Ycomp      =        criterion values for complete data
Yimp       =        criterion values for imputed data
Xcomp      =        design matrix for complete data
Xsimp      =        array with design matrices for imputed data
alpha      =        intercept parameter
beta       =        vector of regression weights
sigma      =        regression error

the crucial part of the model specification would be: 
Ycomp ~ normal(Xcomp * beta, sigma)
Yimp ~ normal(Xsimp[k] * beta, sigma)  // where k is a random number

It seems to me this would be more efficient than running multiple Stan models on different imputed data sets, given that only (in my case) ca. 25% of the cases differ between the imputed data sets. Does this make sense?
If yes, the problem to solve would be how to get the random number k. In Stan, random numbers can only be generated in the generated quantities block. Hence, would this admittedly hacky trick work:
First I specify a parameter real_k with an (implicit) uniform prior in the parameters block:
real<lower=0,upper=impN> real_k;  // impN = number of versions of imputed data
Then I convert it to an integer in the model block (following a proposal from Bob in another post):
int k;
    k <- 1;
    while (k <= floor(real_k))  // ends with k = floor(real_k) + 1, i.e., uniform on 1..impN
      k <- k + 1;

I have a bad feeling about specifying a parameter that will be updated even though it has no influence on the likelihood (I don’t know enough about the sampler to know whether this is problematic), but I don’t have a better solution at the moment.

We have never been able to be too optimistic about these ideas that discretize a continuous parameter. Most likely, the chain would get stuck on the same imputed part for long stretches at a time. I think explicit averaging over the imputed parts would work better than implicit averaging by choosing each imputed part with probability 1/m.

Ben

Guido Biele

Mar 28, 2014, 12:09:17 PM
to stan-...@googlegroups.com

Thanks for your input, Ben!

 

I also thought about weighting the imputed parts, but got stuck on the implementation of the weighting
(I couldn't find a solution in the forum or in the Stan reference manual).

Can you point me to an example or documentation where this is described?

 

Thanks - Guido

Ben Goodrich

Mar 28, 2014, 1:38:02 PM
to stan-...@googlegroups.com
I don't think it is explicitly in the manual, but essentially you would be marginalizing over the components of a finite-mixture model where the weights are known to be 1/m for all components rather than estimated.
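
As a sketch (again with the placeholder names from above), that marginalization in the model block could look like:

  vector[M] lp;
  Ycomp ~ normal(Xcomp * beta, sigma);
  for (m in 1:M)
    lp[m] <- normal_log(Yimp, Ximp[m] * beta, sigma);  // log-likelihood under imputed data set m
  increment_log_prob(log_sum_exp(lp) - log(M));        // average the M likelihoods with known weights 1/M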

Ben

Bob Carpenter

Mar 29, 2014, 3:49:47 PM
to stan-...@googlegroups.com
Thanks for the clarification --- that's what I supposed you were
planning to do. By "subset" I meant that each Stan iteration uses a
subset of the imputations --- you do 5 imputed data sets and each
Stan iteration uses one of them.

But as Ben says, we're not too optimistic about hacking in discrete
sampling. If it's really just random number generation you need, a
simple uniform(0,1) parameter divided up into regions will work just
fine; there's just no good way to map that to integers in Stan.

- Bob