Trouble using DP mixture model for imputation


Victor Applebaum

Apr 3, 2026, 2:22:20 PM
to nimble-users
Hi,

I've been trying to get a Dirichlet process mixture model working for imputing missing data in nimble. However, no matter which variation I try, I always get an error at compile time.

For example, stealing some simple code from https://r-nimble.org/blog/bayesian-nonparametric-models-in-nimble-part-1-density-estimation.html, I get the following:
###########################################
library(nimble)
library(simDAG)
library(mvtnorm)
setwd("C:/Users/victo/Downloads")


set.seed(1)
x1 <- rmvnorm(800,c(0,0),sigma = matrix(c(1,0.5,0.5,1),2,2))
x2 <- rmvnorm(400,c(-2,2),sigma = diag(0.5,2))
x3 <-rmvnorm(300,c(1,1.5),sigma = diag(1,2))
plot(x1, xlim=c(-5,5),ylim=c(-5,5))
points(x2,col='red')
points(x3,col='blue')


x_full <- rbind(x1,x2,x3)
indicator <- array(c(0,1,1,0,1,1,0,0),dim(x_full))#remove some points to simulate missingness

x_full[indicator == 0] <- NA

constants <- list(
  n  =nrow(xC_full)
)

data <- list(
  y = as.matrix(x_full)[,1]#use only the first column for a minimal example
)


code <- nimbleCode({
  for(i in 1:n) {
    y[i] ~ dnorm(mu[i], var = s2[i])
    mu[i] <- muTilde[xi[i]]
    s2[i] <- s2Tilde[xi[i]]
  }
  xi[1:n] ~ dCRP(alpha, size = n)
  for(i in 1:n) {
    muTilde[i] ~ dnorm(0, var = s2Tilde[i])
    s2Tilde[i] ~ dinvgamma(2, 1)
  }
  alpha ~ dgamma(1, 1)
})

Rmodel <- nimbleModel(code,
                      constants = constants,
                      data = data#,
                      #inits = inits()
)

Cmodel <- compileNimble(Rmodel)
###########################################
On running the final line, I get the error: "Error in as.vector(y) : no method for coercing this S4 class to a vector". There seems to be very little information out there about this error in relation to nimble specifically.

The error doesn't arise if I skip the line x_full[indicator == 0] <- NA, i.e. if there are no missing values, which suggests the missingness is what triggers the problem.

However, ordinarily nimble would impute these without any issues. For example, running the following simpler model with the missing data as above works perfectly well: 
#############################
code <- nimbleCode({
  for(i in 1:n) {
    y[i] ~ dnorm(mu, var = s2)
  }
  mu ~ dnorm(0,1)
  s2 ~ dexp(1)
})
##############################

There is clearly some subtlety about how nimble works that is going over my head, and I was hoping someone could point me in the right direction.

The closest information I could find was this paper https://link.springer.com/article/10.1186/s12911-023-02400-3, which says that you can't currently update the dimensions of an individual node. But I don't think I'm doing that at the moment, and even if I were, I haven't reached the configuration step where I could apply their fix. So I am not sure this is the problem.

I'd be very grateful for any help. Thanks.

Daniel Turek

Apr 3, 2026, 3:08:38 PM
to Victor Applebaum, nimble-users
Victor, can you confirm what version of nimble you're using?

First of all, in your script I changed the code for the constants list, which I think had a typo:

constants <- list(
    ##n  =nrow(xC_full)    ## removed this line
    n  =nrow(x_full)       ## added this line
)


Having made that change, your script executed for me, through to and including the call to compileNimble(Rmodel).  That was *including* the line that seemed problematic to you (x_full[indicator == 0] <- NA).  I'm using version 1.4.2 of nimble.

One additional comment: when building the Rmodel object (using nimbleModel), your code emits a large number of warnings.  This is a result of not providing an initial value for the latent group indicator variable xi[].  You can avoid these warnings by providing initial values for xi:

inits <- list(
    xi = sample(1:nrow(x_full), replace = TRUE)
)


and then providing this "inits" list as an argument to nimbleModel.

I hope this helps getting you started.

Daniel



--
You received this message because you are subscribed to the Google Groups "nimble-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nimble-users...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/nimble-users/0c65ef0e-fab4-451d-b159-77e08e51c096n%40googlegroups.com.

Victor Applebaum

Apr 4, 2026, 9:25:56 AM
to nimble-users
Hi Daniel,

Thanks for the very quick response. Using xC_full rather than x_full was a typo that came from my attempt to create a minimal example. In my session these were the same (i.e. at some point I did xC_full <- x_full), so it wasn't the cause of the problem.

However, by giving up, typing this up here, and then coming back to it, I found a solution. When I quit the R session, either by closing RStudio or by restarting the R session within RStudio, I stop getting the error. Eventually it seems to come back (until I restart the session again), which suggests something in my workflow is jumbling up the session state. I haven't been able to isolate what it is, but it is at least no longer a hard obstacle. So I guess this is mostly resolved now; in case it is relevant to future investigations, or if people hit similar problems in future, the nimble version was 1.4.1.

Thanks also for the other tip on removing the warnings, I have implemented proper initial values now.

Thanks,
Victor

Daniel Turek

Apr 5, 2026, 4:19:39 PM
to Victor Applebaum, nimble-users
I'm glad to hear it's resolved, Victor.  Please keep in touch if you have any questions down the road.

Daniel


Victor Applebaum

Apr 11, 2026, 9:01:06 AM
to nimble-users
Hi Daniel,

Apologies for coming back - I am still struggling a little to understand some of the subtleties of how the NA filling works. I have a more advanced version, shown below:
###########################################
library(nimble)
library(simDAG)
library(mvtnorm)
library(coda)
library(parallel)

PREV_INITS <- FALSE
SAVE_RESULTS <- FALSE

setwd("C:/Users/victo/Downloads")

###################Generating Example Data################
#create continuous data

set.seed(1)
x1 <- rmvnorm(800,c(0,0),sigma = matrix(c(1,0.5,0.5,1),2,2))
x2 <- rmvnorm(400,c(-2,2),sigma = diag(0.5,2))
x3 <-rmvnorm(300,c(3,3),sigma = diag(1,2))
#x4 <-rmvnorm(400,c(4,4),sigma = diag(1,2))
plot(x1, xlim=c(-6,6),ylim=c(-6,6))

points(x2,col='red')
points(x3,col='blue')
#points(x4,col='green')
xC_full <- rbind(x1,x2,x3)


#create discrete data
# build cutpoints automatically for K categories
make_cuts <- function(K) {
  if (K < 2) stop("Each dimension must have at least 2 categories")
  qnorm(seq(0, 1, length.out = K + 1))[-c(1, K + 1)]
}

# convert MVN -> ordinal with varying categories per dimension
to_ordinal_general <- function(x, K_vec) {
  D <- ncol(x)
 
  if (length(K_vec) != D) {
    stop("Length of K_vec must match number of dimensions")
  }
 
  out <- matrix(NA, nrow(x), D)
 
  for (j in 1:D) {
    cuts <- make_cuts(K_vec[j])
   
    out[, j] <- as.integer(cut(
      x[, j],
      breaks = c(-Inf, cuts, Inf),
      labels = 1:K_vec[j]
    ))
  }
 
  return(out)
}

K_vec <- c(2, 3, 4)  # 3 dimensions: binary, 3-level, 4-level
D <- length(K_vec)

# cluster 1
xD1_lat <- rmvnorm(800, rep(0, D),
                   sigma = diag(D))
xD1 <- to_ordinal_general(xD1_lat, K_vec)

# cluster 2
xD2_lat <- rmvnorm(400, rep(1, D),
                   sigma = diag(0.5, D))
xD2 <- to_ordinal_general(xD2_lat, K_vec)

# cluster 3
xD3_lat <- rmvnorm(300, rep(-1, D),
                   sigma = diag(1, D))
xD3 <- to_ordinal_general(xD3_lat, K_vec)

# combine
xD_full <- rbind(xD1, xD2, xD3)


# apply missingness
indicatorC <- array(c(0,1,1,0,0,1,0),dim(xC_full))
indicatorD <- array(c(0,1,1),dim(xD_full))

xC_full[indicatorC == 0] <- NA
xD_full[indicatorD == 0] <- NA

# rows with at least one NA
impute_idx <- which(apply(xC_full, 1, function(row) any(is.na(row))))
# rows with no NA
complete_idx <- which(apply(xC_full, 1, function(row) all(!is.na(row))))


constants <- list(
  I  = nrow(xC_full), # no. individuals/points in the data set
  JC = ncol(xC_full), # no. continuous variables
  K  = 10,            # cap on no. Dirichlet groupings
  L  = K_vec,         # no. categories in each discrete variable
  D  = ncol(xC_full)  # no. discrete variables
)

data <- list(
  xC = as.matrix(xC_full),#continuous variables IxJC
  xD = as.matrix(xD_full)#discrete variables IxD
)

mons <- c('alpha1','alpha2','xC','xD','theta','mu','sigma','w','v','xD_new','z','xC_new','U')


  inits <- list(
    v = rep(0.6,constants$K-1),
    z = rep(1,constants$I),
    mu = array(0,c(constants$K,constants$J)),
    sigma = array(1,c(constants$K,constants$J)),
    xC <- array(0,dim=dim(data$xC)),
    xD <- array(1,dim=dim(data$xD))
  )



uppertri_mult_diag <- nimbleFunction(
  run = function(mat = double(2), vec = double(1)) {
    returnType(double(2))
    p <- length(vec)
    out <- matrix(nrow = p, ncol = p, init = FALSE)
    for(i in 1:p)
      out[ , i] <- mat[ , i] * vec[i]
    return(out)
  })


code <- nimbleCode({
  alpha1 ~ T(dnorm(0, sd=100),0,)
  alpha2 ~ T(dnorm(0, sd=100),0,)
 
 
  for(i in 1:(K-1)) { # stick-breaking variables
    v[i] ~ dbeta(alpha1, alpha2)
  }
  w[1:K] <- stick_breaking(v[1:(K-1)]) # stick-breaking weights
 
  #dirichlet prior for discrete variables
  for (k in 1:D) {
    for (l in 1:L[k]){
      alpha_theta[k,l] <- 1 / L[k]
    }
  }
 
  # Priors for each group
  for(k in 1:K) {
   
    #dirichlet prior for discrete
    for (j in 1:D){
      theta[k,j,1:L[j]] ~ ddirch(alpha_theta[j,1:L[j]])
    }
    for (j in 1:JC){
      mu[k,j] ~ dnorm(0, var = 1)
      sigma[k,j] ~ dinvgamma(2, 1)
    }
    Ustar[k,1:JC,1:JC] ~ dlkj_corr_cholesky(1, JC)
    U[k,1:JC,1:JC] <- uppertri_mult_diag(Ustar[k,1:JC, 1:JC], sigma[k,1:JC])
  }
 
 
 
  # Likelihood
  for(i in 1:I) {
    z[i] ~ dcat(w[1:K])
    #continuous
    muTilde[i,1:JC] <- mu[z[i],1:JC]
    UTilde[i,1:JC, 1:JC] <- U[z[i],1:JC, 1:JC]
    xC[i, 1:JC] ~ dmnorm(muTilde[i,1:JC], cholesky = UTilde[i,1:JC, 1:JC], prec_param = 0)
    #discrete
    for (j in 1:D){
      xD[i,j] ~ dcat( theta[z[i],j,1:L[j]])
    }
  }
 
  #simulate a new individual
  z_new ~ dcat(w[1:K])
  #continuous
  xC_new[1:JC] ~ dmnorm(mu[z_new,1:JC], cholesky = U[z_new,1:JC, 1:JC], prec_param = 0)
  #discrete
  for (j in 1:D){
    xD_new[j] ~ dcat(theta[z_new,j,1:L[j]])
  }
 
})

myModel <- nimbleModel(code = code,
                       data = data,
                       constants = constants,
                       inits =inits)

CmyModel <- compileNimble(myModel)
config <- configureMCMC(CmyModel,
                        monitors = mons)

built <- buildMCMC(config)
# compile model
CmyMCMC <- compileNimble(built)

samps <- runMCMC(CmyMCMC,
                 samplesAsCodaMCMC = TRUE,
                 niter = 1500,
                 nburnin = 1000,
                 thin = 1,
                 nchains = 3)
###########################################

When I run this I get the error: 
Error in mcmc$run(niter, nburnin = nburnin, thin = thinToUseVec[1], thin2 = thinToUseVec[2], : in categorical sampler, all log probability density values are negative infinity and sampling cannot proceed
and a bunch of warnings like:
warning: value of data node xC[1488, 1:2]: value is NA or NaN.
warning: logProb of data node xC[1488, 1:2]: logProb is NA or NaN.
warning: value of data node xC[1490, 1:2]: value is NA or NaN.
warning: logProb of data node xC[1490, 1:2]: logProb is NA or NaN.
warning: value of data node xC[1491, 1:2]: value is NA or NaN.
warning: logProb of data node xC[1491, 1:2]: logProb is NA or NaN.
These nodes correspond to rows which are partially observed (as opposed to fully NA or fully observed), suggesting that the partially observed nodes are causing the issue. Changing the indicatorC line, which creates the missing points, to indicatorC <- array(c(0,1,1),dim(xC_full)) allows the model to run successfully, which as far as I can tell confirms this.

However, I understand that in the version I am using (1.4.1), a special sampler specifically for partially observed multivariate normals was introduced, and the configureMCMC printout would appear to confirm it is being used:
===== Monitors =====
thin = 1: alpha1, alpha2, mu, sigma, theta, U, v, w, xC, xC_new, xD, xD_new, z
===== Samplers =====
RW_lkj_corr_cholesky sampler (10)
  - Ustar[]  (10 multivariate elements)
RW sampler (42)
  - alpha1
  - alpha2
  - mu[]  (20 elements)
  - sigma[]  (20 elements)
posterior_predictive sampler (1073)
  - z[]  (71 elements)
  - z_new
  - xD[]  (858 elements)
  - xC[]  (143 multivariate elements)
partial_mvn sampler (1286)
  - xC[]  (1286 multivariate elements)
conjugate sampler (29)
  - v[]  (9 elements)
  - theta[]  (20 multivariate elements)
categorical sampler (1429)
  - z[]  (1429 elements)
===== Comments =====

So I was wondering if you know what is going on here? Thanks very much for any help.

Daniel Turek

Apr 13, 2026, 12:52:30 PM
to Victor Applebaum, nimble-users
Victor, thanks for following up.  I took a quick look at this, and note that at the time of model building (the call to nimbleModel), the following output appears:

Defining model
  [Warning] Inconsistent dimensions between `inits` and `dimensions`
            Ignoring dimensions in `dimensions`.
  [Warning] dimensions specified are larger than model specification
            for variable `xD`.
Building model
Setting data and initial values
Running calculate on model
  [Note] Any error reports that follow may simply reflect missing values in model variables.
  [Warning] Dynamic index out of bounds: dmnorm(mean = mu[z_new[1], 1:2], cholesky = U[z_new[1], 1:2, 1:2], prec_param = 0)
  [Warning] Dynamic index out of bounds: dcat(prob = theta[z_new[1], j, 1:L_oBj_cB])
  [Warning] Dynamic index out of bounds: dcat(prob = theta[z_new[1], j, 1:L_oBj_cB])
Checking model sizes and dimensions
  [Note] This model is not fully initialized. This is not an error.
         To see which variables are not initialized, use model$initializeInfo().
         For more information on model initialization, see help(modelInitialization).
Warning message:
In md$newModel(data = data, inits = inits, where = where, check = check,  :
  One or more unnamed elements found in inits.

The warnings and notes being produced here are concerning and generally should not be ignored.  For example:

Messages suggest there are inconsistent dimensions between variables in some of the input lists.  For example, I can spot that the data input for xD is 1500x3, but in the model loop, with constants I = 1500 and D = 2 (since D is set from ncol(xC_full)), xD is declared with different dimensions (1500x2).
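A possible correction, sketched below with stand-in matrices of the same shapes as in the script above (the stand-ins are assumptions, only the constants list is the point), would be to take D from the discrete matrix rather than the continuous one:

```r
## Stand-ins with the same shapes as in the script above (assumption):
set.seed(1)
xC_full <- matrix(rnorm(1500 * 2), 1500, 2)                        # 1500 x 2 continuous
xD_full <- matrix(sample(1:2, 1500 * 3, replace = TRUE), 1500, 3)  # 1500 x 3 discrete
K_vec   <- c(2, 3, 4)

constants <- list(
  I  = nrow(xC_full),  # no. individuals
  JC = ncol(xC_full),  # no. continuous variables (2)
  K  = 10,             # cap on no. mixture components
  L  = K_vec,          # categories per discrete variable
  D  = ncol(xD_full)   # no. discrete variables (3); was ncol(xC_full)
)
```

With this change the declared dimensions of xD (I x D = 1500 x 3) match the data supplied for it.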

Also, some warnings are almost certainly being caused by your inits list, incorrectly using the R assignment operator (<-) for the named list elements.  This does not generate "names" for the inits list, but rather attempts to create variables in the R environment.  This should be fixed in both places:

  inits <- list(
    v = rep(0.6, constants$K - 1),
    z = rep(1, constants$I),
    mu = array(0, c(constants$K, constants$J)),
    sigma = array(1, c(constants$K, constants$J)),
    xC = array(0, dim = dim(data$xC)),   ## was: xC <- array(...)
    xD = array(1, dim = dim(data$xD))    ## was: xD <- array(...)
  )
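For anyone who hits this, a minimal illustration of the difference between = and <- inside list():

```r
## `=` supplies a *named* list element; `<-` performs an assignment in the
## calling environment and passes the value through as an *unnamed* element.
l1 <- list(a = 1, b = 2)
l2 <- list(a <- 1, b <- 2)

names(l1)   # "a" "b"  -- nimble can match these to model variables
names(l2)   # NULL     -- no names, so these inits cannot be matched
```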

Also, a lot of the warnings are most likely coming from incomplete model initialization.  The warnings related to "dynamic indexing out of bounds" are most likely from failing to provide initial values for z[], which are used as dynamic (latent) indices into theta[].  I would suggest providing valid initial values for all non-data stochastic nodes in the model.
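As a sketch of what initial values for all the non-data stochastic nodes could look like for this model (the names follow the model code above; the particular values are assumptions, not the only valid choice):

```r
K    <- 10    # must match the constants passed to the model
I    <- 1500
JC   <- 2
D    <- 3
L_vec <- c(2, 3, 4)     # categories per discrete variable
Lmax  <- max(L_vec)

## identity upper-triangular Cholesky factors for each component
Ustar_init <- array(0, c(K, JC, JC))
for (k in 1:K) Ustar_init[k, , ] <- diag(JC)

## uniform category probabilities; unused slots stay NA
theta_init <- array(NA_real_, c(K, D, Lmax))
for (j in 1:D) theta_init[, j, 1:L_vec[j]] <- 1 / L_vec[j]

inits <- list(
  alpha1 = 1,
  alpha2 = 1,
  v      = rep(0.5, K - 1),
  z      = sample(1:K, I, replace = TRUE),  # latent cluster labels
  z_new  = 1L,
  mu     = matrix(0, K, JC),
  sigma  = matrix(1, K, JC),
  Ustar  = Ustar_init,
  theta  = theta_init
)
```

Every entry is a valid draw from (or at least in the support of) the corresponding prior, which is what lets model$calculate() produce finite log probabilities at the start of sampling.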

I'm happy to see the partial_mvn sampler getting used.

After fixing a few of these things, it looks like the MCMC runs much smoother.  Please let me know if you need further help.

Cheers,
Daniel

