missing data imputations and/or the posterior predictive distribution from stan

269 views
Skip to first unread message

Roy Levy

unread,
Dec 4, 2020, 3:01:01 PM12/4/20
to blavaan
Hi all,

I would like to fit a regression model with one predictor that has missing values on the predictor and missing values on the outcome, and also obtain the posterior distributions (i.e., imputations) for those missing values. 

When using the default target="stan", I am able to fit the model, but I don't believe I can monitor and then get results for the missing values. Is there a way to do that? 

Alternatively, I would think that using draws for these values from the posterior predictive distribution would work, as that is essentially the same thing. I know that I can conduct posterior predictive checks through the ppmc function, but I don't see how to get the posterior predicted datasets themselves. Is there a way to do that? 

I have also tried to run the model in JAGS through target="jags" and was able to monitor the variables and get posteriors for them. However, I noticed some unexpected behavior for the regression parameters the in chains when running my model. I'm working on that now and see it occurring even using complete data, so I will likely post a separate thread about that if I can't figure it out on my own.

Thanks,
Roy

Ed Merkle

unread,
Dec 4, 2020, 5:25:19 PM12/4/20
to Roy Levy, blavaan
Hi Roy,

On Fri, 2020-12-04 at 12:01 -0800, Roy Levy wrote:
When using the default target="stan", I am able to fit the model, but I don't believe I can monitor and then get results for the missing values. Is there a way to do that? 

There is not currently a way to do it. target="stan" uses a "full information" likelihood, as opposed to imputing the missing values. I think most of the necessary tools are there to sample missing data as generated quantities, though, so let me look into it.


Alternatively, I would think that using draws for these values from the posterior predictive distribution would work, as that is essentially the same thing. I know that I can conduct posterior predictive checks through the ppmc function, but I don't see how to get the posterior predicted datasets themselves. Is there a way to do that? 

There is an unexported function called postdata() that generates data for computing ppp-values. But I'm not sure that this is what you want: I think you want imputed missing values conditioned on a case's observed data, but postdata() just generates completely new cases from the posterior.


I have also tried to run the model in JAGS through target="jags" and was able to monitor the variables and get posteriors for them. However, I noticed some unexpected behavior for the regression parameters the in chains when running my model. I'm working on that now and see it occurring even using complete data, so I will likely post a separate thread about that if I can't figure it out on my own.

Sure, let me know.

Ed

Roy Levy

unread,
Dec 4, 2020, 7:28:34 PM12/4/20
to blavaan
Hi Ed,

Thanks very much for all the info. 

You’re right that in this case I want the posterior predictive distribution for a case’s missing data, conditional on the case’s observed data, and I see what you’re saying about the postdata() function generating new cases. 

I’ll segue away from my missing data situation for a moment to use this opportunity to ask about getting the posterior predictive data anyway (either as output from the ppmc() function or through the postdata() function). I would value having access to it for two reasons. First, when I do posterior predictive checks I often like to look at the posterior predictive data themselves. Second, I’m looking into using prior predictive data to inspect the implications of the priors. The ‘prisamp = TRUE’ command let’s me sample parameter values from the prior, but I’d also like to sample values for the data. (This would answer questions like "what sort of data would we expect under these prior specifications?") I imagine that a function that takes features of the model definition and then, for each draw from the MCMC, generates out a new dataset would serve this purpose. If the draws were from the prior (via ‘prisamp = TRUE’) it would be prior predictive data, and if the draws were from the posterior (via ‘prisamp = FALSE) it would be posterior predictive data. Perhaps this is what the postdata() function does. 

I’m continuing to look into getting draws for missing data through JAGS (which may also be a route to getting prior predictive data by fitting a model with all missing values). 

I’ll close by expressing my thanks. blavaan is a terrific package, and I appreciate the effort it takes to develop and support it when people like me bug you with requests.

Best,
Roy

Mauricio Garnier-Villarreal

unread,
Dec 5, 2020, 7:31:22 PM12/5/20
to blavaan
Roy

Not sure about the postdata() function, but you can do this with ppmc(). For example like this we can use the lavPredict() function from lavaan, that predicts data sets based on the model parameters. So you will get a list of predicted data sets based on the model parameters at each saved itration. This can be applied to the model with posterior samples and with prior samples

Hope this is useful

library(blavaan)

fitb <- bcfa(HS.model,
            data = HolzingerSwineford1939, std.lv=T)
summary(fitb)

pred_dat <- function(MOD){lavPredict(MOD, type="ov")}

out1g <- ppmc(fitb, discFUN = pred_dat)
summary(out1g)

## access all posterior predictive data sets
## list with length equal to saved samples
out1g@obsDist$discFUN1

Roy Levy

unread,
Dec 6, 2020, 4:26:46 PM12/6/20
to blavaan
Thanks for the pointer to the use of the lavPredict(MOD, type="ov") as the function for the ppmc(). That is very helpful, and I will start to play around with it.

Thanks,
Roy

Terrence Jorgensen

unread,
Dec 6, 2020, 4:28:08 PM12/6/20
to blavaan
we can use the lavPredict() function from lavaan

I think we would need to use lavInspect(), instead, to get actual data rather than expected values of indicators given the factor scores (see ?lavPredict help page for description of type= argument; it is only in models without latent variables that the original data are returned).  For example:

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '
fit <- cfa(HS.model, data = HolzingerSwineford1939)
## these differ:
head(lavPredict(fit, type = "ov", method = "regression"))
head(lavPredict(fit, type = "ov", method = "bartlett"))
## neither matches actual data:
head(lavInspect(fit, "data")) 

So to update Mauricio's suggested user-defined discFUN= function:

pred_dat <- function(MOD) lavaan::lavInspect(MOD, "data") 

which is just as simple.  FYI, for multigroup models, you could do.call(rbind, ...) the returned list and add the grouping variable as another column.  See the ?lavPredict help-page example to see a pretty general solution. 

Terrence D. Jorgensen
Assistant Professor, Methods and Statistics
Research Institute for Child Development and Education, the University of Amsterdam

Mauricio Garnier-Villarreal

unread,
Dec 6, 2020, 6:32:31 PM12/6/20
to blavaan
the problem there is that this (pred_dat <- function(MOD) lavaan::lavInspect(MOD, "data") ) extracts the same observed data at each iteration, so the SD of the realized values is 0.

With lavPredict() it extracts the model predicted data sets based on each iteration modal estimates. Which I think its closer to what Roy is looking for to represent the predictive variability of the observed data across the posterior estimates

Ed Merkle

unread,
Dec 6, 2020, 10:14:21 PM12/6/20
to Mauricio Garnier-Villarreal, blavaan
All,

I think that neither lavPredict nor lavInspect provide exactly what is desired, which are draws from the posterior distribution of the missing data given the observed data and model parameters. I guess that the lavPredict idea gets close, but it provides something like a mean prediction of the missing data, so the values will not exhibit as much uncertainty as they should.

I think I am close to having an implementation (can reuse lots of existing stuff), will let you know soon.

Ed
--
You received this message because you are subscribed to the Google Groups "blavaan" group.
To unsubscribe from this group and stop receiving emails from it, send an email to blavaan+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/blavaan/9b9cb06a-6f3c-4031-bfe4-03ed87db31f5n%40googlegroups.com.

Roy Levy

unread,
Dec 7, 2020, 12:03:22 AM12/7/20
to Ed Merkle, Mauricio Garnier-Villarreal, blavaan

Hi everyone,

 

There are several situations that have been mentioned here that we can distinguish. (Sorry for having conflated a few things and shifted the focus a few times.) I figured I’d lay them out as I see them.

 

1. Imputations for missing data under a fitted model. This was what I asked about initially in the situation of regression with missingness on a predictor as well as missingness on the outcome. What we’d want here is, for each case with missingness, the posterior predictive distribution for the missing values for that case, conditional on that case’s observed data. I believe I can get this using blavaan, by using blavaan to run the model in JAGS.

 

2. The posterior predictive distribution for the data, as typically constructed in posterior predictive model checking. (Set aside any issue of missing data and suppose we have complete data.) My understanding of Mauricio’s comment was that this can be accomplished using the lavPredict(MOD, type="ov") approach in the ppmc() function. My understanding from ?lavPredict is that this function would be useful for latent variable models, but in models with no latent variables it would not work. So in a measured variable regression model (with complete data), it wouldn’t return posterior predictive data for the predictor and outcome. Rather it would return the observed values for the predictor and outcome.

 

3. The prior predictive distribution. It would seem that any functions that would work for the posterior predictive distribution just mentioned in (2) would work here with minimal adjustment. The idea would be use draws from the prior distribution for the parameters (e.g., through prisamp = TRUE) rather than draws from the posterior distribution.

 

4. The posterior predictive distribution for some of the variables, given others. Again, let’s suppose we have complete data. We declare some variables to be of predictive interest, and the rest are declared to those used to make such predictions. Let’s stick with the typical regression setup, where the variable of predictive interest is the outcome, and the rest are the predictors. What we’d want here is, for each case, the posterior predictive distribution for the outcome, conditional on that case’s values for the predictors.

 

I have used JAGS code to obtain each of these, but would certainly like to use blavaan to obtain them.

 

Thanks,
Roy


Mauricio Garnier-Villarreal

unread,
Dec 7, 2020, 7:39:44 PM12/7/20
to blavaan
Roy

Yeah, lavPredict(MOD, type="ov") would give you the predicted indicator scores, when there are latent variables, but it does not applied for latent variables, or regressions

For a regression between observed indicators I would run it and export the posterior draws and estimate the regression across all saved iterations.
For example, this example does that with observed variables, and saves the outcome predicted scores as columns in a matrix
(I do loops instead of apply function, I am sure could be optimize)

This approach should work for both prisamp=T and prisamp=F

Now, If you were to have a regression between factors, would need an extra step. To save the factor scores posterior draws, and use the factor score at each iteration with the respective parameters. This way accounting for factor score uncertainty

This approach is not addressing the imputation predictions

###
library(blavaan)

HolzingerSwineford1939$Visual <- rowMeans(HolzingerSwineford1939[,paste0("x",1:3)], na.rm=T)
HolzingerSwineford1939$Textual <- rowMeans(HolzingerSwineford1939[,paste0("x",4:6)], na.rm=T)
HolzingerSwineford1939$Speed <- rowMeans(HolzingerSwineford1939[,paste0("x",7:9)], na.rm=T)

HS.model <- '
Visual ~ Textual + Speed
Textual ~~ Speed
'

fit <- bcfa(HS.model, data=HolzingerSwineford1939)
summary(fit)
partable(fit)

## extract parameter posterior draws
posts <- as.matrix(blavInspect(fit, "mcmc"))
head(posts)
dim(posts)

## rows = subjects
## columns = number of saved iterations
pred_vis <- NULL
for(j in 1:nrow(posts)){
  temp <- posts[j,"Alpha_free[1]"] +
    posts[j,"bet_sign[1]"]*HolzingerSwineford1939$Textual +
    posts[j,"bet_sign[2]"]*HolzingerSwineford1939$Speed
 
  pred_vis <- do.call(cbind, list(pred_vis, temp))
}
dim(pred_vis)


Roy Levy

unread,
Dec 8, 2020, 11:04:00 PM12/8/20
to blavaan

Thanks Mauricio. Based on your comment and looking into lavPredict a bit more, I now believe I was mistaken in saying that the use lavPredict(MOD, type="ov") inside the ppmc() function would give me the posterior predictive distribution as is typically constructed in posterior predictive model checking (what I laid out in scenario 2, with implications for the prior predictive distribution in scenario 3).

 

I understand that this would yield *predicted* values for the indicators. I haven’t looked into it enough to know if these are conditional on the values of the latent variables for each case or marginalizing over them, but either way that’s not what I am looking to obtain. I’m looking to obtain draws from the predictive distribution.

 

To make this concrete, suppose we have two draws from MCMC for which the values for all the unknown parameters are exactly same. In that case, I believe using lavPredict(MOD, type="ov") would yield the same predicted values for the observables. What I would like is a draw from the predictive distribution, which would vary even when the values for all the unknown parameters are exactly same.

 

Per Mauricio’s comments, I know I can do all of these things in R using output from blavaan, in addition to doing this in JAGS directly. I was just asking if there was a way to request these sorts of things from the blavaan object.

 

Thanks,

Roy

Ed Merkle

unread,
Dec 8, 2020, 11:23:34 PM12/8/20
to Roy Levy, blavaan
I agree with Roy's assessment of lavPredict(, type="ov"). Maybe it is possible to salvage the ppmc approach by manually adding noise (from the observed residual variances) to lavPredict(, type="ov").

But I am thinking about a blavPredict() function, which would be similar to lavPredict() but with some extra mcmc-specific options for the "type" argument. These could include the posterior distributions of missing data, as well as this other posterior predictive stuff.

Ed

Mauricio Garnier-Villarreal

unread,
Dec 9, 2020, 7:27:51 AM12/9/20
to blavaan
To clarify on the funtionality of lavPredict(,type="ov"), for this type it is dependent of factor scores, with the ETA argument, functioanly estimates the factor scores first and use them for the ov predicted values.

pred_fs <- lavPredict(MOD, type="lv")
pred_dat <- lavPredict(MOD, type="ov", ETA= pred_fs )

Could add variability by using the MOD and varying the ETAs, or have the ETAs constant and varying the MOD parameters. Depending on how you are looking to define it

For the ppmc() example, it estimates different set of ETA and model parameters at each iteration, and we could add the noise as Ed suggested

If you use lavPredict(,type="ov") just once with the blavaan object will give the estimates based on the posterior mean parameters and factor scores

Mauricio Garnier-Villarreal

unread,
Dec 9, 2020, 1:04:12 PM12/9/20
to blavaan
Roy

I am gonna give it one more try on code that would be simpler than working with the exported MCMC draws.
ppmc() takes any function that manipulates a lavaan opbject, so we can write a function to export the required values from each draw lavaa model and build the regression (with some noise). And ppmc() will take it.
Coding wise, this might be similar as doing it with the MCMC draws, but we can get the nice summary/plots from ppmc()


HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9

visual ~ textual + speed'

fit <- bcfa(HS.model, std.lv=T,
            data=HolzingerSwineford1939)
summary(fit, standardized=T)
parameterestimates(fit, standardized=T)

pred_fun <- function(MOD){
  dat <- lavInspect(fit, "data")
  pars <- parameterestimates(MOD, standardized=T)
  b1 <- pars[10,"est"]
  b2 <- pars[11,"est"]
  res_v <- pars[21,"std.all"]

  fs <- lavPredict(MOD, type="lv")
 
  pred <- b1*fs[,"textual"] + b2*fs[,"speed"] +
    rnorm(nrow(dat), mean=0, sd=sqrt(res_v))
 
  return(pred)
}


out <- ppmc(fit, discFUN = pred_fun)
summary(out)
plot(out)
hist(out)

Roy Levy

unread,
Dec 9, 2020, 9:53:17 PM12/9/20
to blavaan
Thanks, I really do appreciate the effort it takes to support this!

Ed Merkle

unread,
Dec 11, 2020, 2:58:09 PM12/11/20
to Roy Levy, blavaan
I just pushed a new blavPredict function that should eventually address many of these issues. It still needs more testing, it is not yet fully working, and it is not yet exported. But it should currently produce posterior distributions of missing data. It would be helpful if anybody on the thread tried it out (using target="stan" only). You can do something like this:

remotes::install_github("ecmerkle/blavaan", INSTALL_opts = "--no-multiarch")

## (fit your model with missing data, say the object is called fit)

yimp <- blavaan:::blavPredict(fit, type="ymis")


Ed

Terrence Jorgensen

unread,
Dec 13, 2020, 8:30:06 AM12/13/20
to blavaan
Sorry to miss out on the thread for a while, I was traveling and getting over jet lag.

I understand that this would yield *predicted* values for the indicators. I haven’t looked into it enough to know if these are conditional on the values of the latent variables for each case or marginalizing over them, but either way that’s not what I am looking to obtain. I’m looking to obtain draws from the predictive distribution.

Then I will again insist that lavInspect(fit, "data") is in fact what you want to use.  It of course returns the same observed data for all iterations of each Markov chain (as Mauricio said) because they are the observed data, but the simulated data is what you want, and that does get saved in the updated lavaan object at each iteration.  Behold:

HS.model <- ' visual  =~ x1 + x2 + x3
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '
## fit single-group model
fit <- bcfa(HS.model, data = HolzingerSwineford1939, 
            n.chains = 2, burnin = 1000, sample = 500)
ppmcData <- ppmc(fit, thin = 100, discFUN = function(x) lavaan::lavInspect(x, "data"))
lapply(ppmcData@obsDist$discFUN1, head) # all the same: observed data
lapply(ppmcData@simDist$discFUN1, head) # different simulated data per draw

Of course, this is not an actual discrepancy function, so it doesn't matter that the SD of the realized values is 0.  If I understand correctly, this only needs to be a hack for Roy to obtain the posterior predictive data.  Correct?

Roy Levy

unread,
Dec 14, 2020, 4:23:02 PM12/14/20
to Terrence Jorgensen, blavaan
Thanks again to folks spending their time on this.

Regarding the blavPredict function, I have run some tests. The testbed data I used is a small dataset from Enders (2010) book on missing data. There are scores on three variables, IQ, Well-Being, and Job Performance. R code to construct the dataset follows:

IQ=c(78, 84, 84, 85, 87, 91, 92, 94, 94, 96, 99, 105, 105, 106, 108, 112, 113, 115, 118, 134)
WB=c(13, 9, 10, 10, NA, 3, 12, 3, 13, NA, 6, 12, 14, 10, NA, 10, 14, 14, 12, 11)
JP=c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 7, 10, 11, 15, 10, 10, 12, 14, 16, 12)
raw.data <- cbind(IQ, WB, JP)
colnames(raw.data) <- c("IQ", "Well-Being", "Performance")


I am first considering a model using IQ as a predictor of JP. This is a situation where the predictor has complete data, and the outcome has missing data, with the following priors (writing normal distributions in the standard deviation metric):

Intercept ~ N(0, 30)
Slope ~ N(0, 30)
Error precision ~ Gamma(1, 20)

My blavaan code for this is as follows:

blavaan.model.expressions <- '
 
  # Regression model
  Performance ~ prior("normal(0,30)")*1 + prior("normal(0,30)")*IQ

  # Error variance for dependent variable
  Performance ~~ prior("gamma(1,20)[prec]")*Performance

'


I’ve run the model with blavaan calling stan, as well as separately in JAGS. The results give comparable results for the regression parameters, as summarized below:

Intercept: posterior mean about -1.8 and sd about 11.7
Slope: posterior mean about .12 and sd about .1
Error variance: posterior mean about 11.5 and sd about 6.5

So on this JAGS and stan via blavaan agree. However they don’t yield completely comparable results for the imputations for the missing outcome values (i.e., for the first 10 cases). As a summary, they do yield comparable results for the posterior means. But the posterior standard deviations using the stan via blavaan and the blavPredict function are between 13.3-13.8 and the posterior standard deviations using JAGS are in between 4-5. This suggests the variability of the imputations from  blavPredict is a bit larger than that from JAGS. I hope this might suggest a simple fix.


Regarding Terrance’s suggestion to use lavInspect(fit, "data") and in particular to use


ppmcData <- ppmc(fit, thin = 100, discFUN = function(x) lavaan::lavInspect(x, "data"))
lapply(ppmcData@simDist$discFUN1, head) # different simulated data per draw


again, I say thanks! If I’m following this correctly, the ppmc() function generates the posterior predictive distribution of interest (which is what I would like to extract). The function:

function(x) lavaan::lavInspect(x, "data")

is a function to essentially just return data. So when that function is applied to the posterior predicted data, it just returns that posterior predicted data. That's how I'm seeing it, but please correct me if I'm wrong. I will start playing with this, but I would also of course welcome others who have given this some thought to weigh in.

Thanks,
Roy 

--
You received this message because you are subscribed to the Google Groups "blavaan" group.
To unsubscribe from this group and stop receiving emails from it, send an email to blavaan+u...@googlegroups.com.

Mauricio Garnier-Villarreal

unread,
Dec 14, 2020, 7:50:21 PM12/14/20
to blavaan
Roy

About the ppmc function and Terrence suggestion.
ppmc has 3 major functionalities:
- estimates the discFUN given the estimated parameters at each iteration : ppmcData@obsDist$discFUN1
- simulate data based on the model, and estimate the discFUN for the simulated data : ppmcData@simDist$discFUN1
- compare the observed > simulated values : PPP_sim>obs and PPP_sim<obs, builds ppp for the desired functions

This way it follows the principle of the ppp calculation, for whatever dicFUN is. For example, if you do this  ppmc(fit, thin = 2, fit.measures = c("chisq")), the PPP_sim>obs will be equivalent to the ppp reported by blavaan.

So, I now understand Terrence point, that by using function(x) lavaan::lavInspect(x, "data"), you can focus on the simulated part of ppmc instead of the observed values. This way the ppmcData@simDist$discFUN1 presents simulated data set at each iteration

Hope this helps

Ed Merkle

unread,
Dec 15, 2020, 3:09:54 PM12/15/20
to blavaan
Hi Roy,

Thanks for trying it:

On Monday, December 14, 2020 at 3:23:02 PM UTC-6 Roy Levy wrote:

So on this JAGS and stan via blavaan agree. However they don’t yield completely comparable results for the imputations for the missing outcome values (i.e., for the first 10 cases). As a summary, they do yield comparable results for the posterior means. But the posterior standard deviations using the stan via blavaan and the blavPredict function are between 13.3-13.8 and the posterior standard deviations using JAGS are in between 4-5. This suggests the variability of the imputations from  blavPredict is a bit larger than that from JAGS. I hope this might suggest a simple fix.

I failed to convert a covariance matrix to its Cholesky factor. It should be fixed on github now.

I will try to add some more options to blavPredict() for different types of posterior predictions. I believe that ppmc() cannot currently access the latent variable samples, so it may not be able to do certain predictions (but Terrence and Mauricio are the ppmc pros). Based on this thread, I think it will be helpful if the blavPredict() documentation includes the conditional distribution corresponding to each option. I am getting confused by the verbal descriptions here!

Ed

Roy Levy

unread,
Jan 19, 2021, 12:00:20 PM1/19/21
to blavaan
Hi everyone,

I'm circling around on this (sorry to be offline from it for a while). I downloaded the updated version from github, and tested out the blavPredict() function on the situation I mentioned above, where there is regression with missingness on the outcome. I also tested it out on a regression model with missingness on the outcome and on the predictor. In both cases it seems to be giving me results for the distribution of the missing values that match those from JAGS. Thanks for addressing this!

I hope to return to the possibilities for getting posterior predicted data shortly.

Thanks,
Roy

Reply all
Reply to author
Forward
0 new messages