problem imputing values from truncated normal distributions

Marwan Naciri

unread,

May 27, 2024, 12:51:02 PMMay 27

to nimble-users

Hi all,

I’m trying to run a path analysis (time -> X -> S) in which the variable X is not known for all datapoints. There are three types of X datapoints:

- - X is known exactly

- - X is entirely unknown

- - We know of an interval comprising X

The values of time and S are known for all data points.

To accommodate these properties of X, I have been trying to run a model in which all values of X are drawn from truncated normal distributions sharing the same mean and the same standard deviation, but not the same boundaries. Values of X that are either known exactly or entirely unknown are drawn from a normal distribution truncated very far from the mean (so essentially non-truncated), while values of X for which we know an interval are drawn in a normal distribution truncated at the boundaries of the corresponding interval.

The compilation returns no error, and the MCMC algorithm seems to work, but for the coefficients of X (the mean and the slope against A) and the standard deviation of X the chains remain at the initial value and do not sample the parameter space. The Xs are not sampled either (which makes sense for the Xs that are known, but not for the others).

When I discard the info on intervals and consider only values known exactly (and values entirely unknown), the MCMC samples the parameter space of all parameters, and the values used to simulate the data are well recovered (in simulated datasets). However, this requires discarding part of the available data, which I really want to avoid as the data is quite scarce.

Am I missing something?

The code I used to simulate and analyze the data is below.

Thanks in advance,

Marwan

library(tidyverse)

library(nimble)

Sys.setenv(LANG = "en")

# The purpose of this script is to simulate data corresponding to the following

# directed acyclic graph:

# Time ---> X ----> S

# The specificity here is that not all values of X are known. Some of these

# values are entirely unknown, while for some, we know of an interval in which the

# true value is located.

# To handle this data, the idea is to draw all values of X in truncated normal

# distributions sharing the same mean and standard deviation, but not the same

# boundaries:

# - known values and entirely unknown values are drawn in a virutally non-truncated

# normal distribution (i.e. boundaries are very far from the mean so as to

# truncate the distribution where the density is extremely low anyway)

# - values for which we know an interval are drawn in a normal distribution truncated

# at this interval.

# A. Simulate the data =========================================================

N <- 250

time <- sample(x = 1988:2024, size = N, replace = T)

time_s <- as.vector(scale(time))

# ~ 1. SImulate A -------------------------------------------------------

# Simulate Time --------> X

mean_X <- 0

sd_X <- 1

slope_time_X <- -0.6

slope_A_on_X <- 0.2

set.seed(2)

X <- rnorm(n = N,

mean = mean_X + slope_time_X * time_s,

sd = sd_X)

ggplot(data.frame(x = time, y = X),

aes(x = x, y = y)) +

geom_jitter() +

theme_bw() +

geom_smooth(method = "lm")

# ~ 2. Simulate S --------------------------------------------------------------

# Simulate X ---> S

p <- plogis(-1.25*X)

s <- NULL

set.seed(2)

for (k in 1:N) {

s[k] <- rbinom(n = 1, size = 1, prob = p[k])

}

ggplot(data.frame(x = X, y = s), aes(x = x, y = y)) +

geom_jitter(alpha = 0.5, height = 0.2) +

geom_smooth(method = "glm",

method.args = list(family = "binomial")) +

theme_bw()

# ~ 3. Missing values and intervals -------------------------------------------

# Now let's degrade the quality of the dataset by entirely removing some values

# of X, and by replacing some values with intervals

# Probability that at least some info is available (i.e. 1 - probability that the

# value is entirely unknown):

p_some_info <- 0.7

# Probability that a value that is not entirely unknown is know exactly. This

# probability is time-dependent i.e. in early years exact values of A and X were

# never known (only intervals) while in recent years, values are known exactly

# (or not known at all)

p_exact_value <- plogis(-1.5 + 3*time_s)

p <- plogis(-1.5 + 3*time_s)

ggplot(data.frame(x = time, p = p),

aes(x = x, y = p)) +

geom_line() +

theme_bw()

# status of X:

status_X <- NULL

set.seed(2)

for (k in 1:N) {

some_info <- rbinom(n = 1, size = 1, prob = p_some_info)

if (some_info == 1) {

exact_value <- rbinom(n = 1, size = 1, prob = p_exact_value[k])

if (exact_value == 1) {

status_X[k] <- "exact"

} else {

status_X[k] <- "interval"

}

} else {

status_X[k] <- "unknown"

}

# values to be added or substracted to generate the intervals

uncertainty <- c(0.1, 0.2, 0.5, 1)

prob_uncert <- c(0.3, 0.3, 0.3, 0.1)

X_lower_s <- X_upper_s <- rep(NA, times = length(time))

data_X <- X

# Remove missing values and replace true values by intervals where needed

set.seed(2)

for (k in 1:N) {

if (status_X[k] %in% c("interval", "unknown")) {

data_X[k] <- NA

if (status_X[k] == "interval") {

X_lower_s[k] <- X[k] - sample(x = uncertainty, size = 1, prob = prob_uncert)

X_upper_s[k] <- X[k] + sample(x = uncertainty, size = 1, prob = prob_uncert)

}

# When no information is available or that the true value is known, the value is drawn

# from a normal distribution truncated very far from the mean (i.e. effectively no

# truncation)

if (status_X[k] %in% c("exact", "unknown")) {

X_lower_s[k] <- -5

X_upper_s[k] <- 5

}

df_plot_X <- data.frame(x = time,

y = data_X,

y_lower = X_lower_s,

y_upper = X_upper_s,

status_X = status_X)

df_plot_X_2 <- df_plot_X %>%

mutate(ID = 1:nrow(df_plot_X),

offset = rnorm(n = nrow(df_plot_X), mean = 0, sd = 0.2),

x = x + offset) %>%

pivot_longer(cols = c(y_lower, y_upper))

ggplot(data = df_plot_X_2[which(df_plot_X_2$status_X == "interval"), ]) +

geom_point(aes(x = x, y = value)) +

geom_line(aes(x = x, y = value, group = ID)) +

theme_bw() +

labs(x = "time", y = "X")+

ylim(-3, 4.25)

ggplot() +

geom_jitter(data = df_plot_out, aes(x = x, y = y)) +

theme_bw() +

labs(x = "time", y = "X") +

ylim(-3, 4.25)

# B. Run the model with intervals ==============================================

# ~ 1. Build model -------------------------------------------------------------

model_sim_trunc <- nimbleCode ({

for (k in 1:N) {

# X

mu_X[k] <- kappa[1] +

kappa[2] * time_s[k]

X[k] ~ T(dnorm(mu_X[k], sd = sigma_X),

X_LOWER[k],

X_UPPER[k])

# S

logit(p[k]) <- beta[1] +

beta[2] * X[k]

s[k] ~ dbern(p[k])

}

# ++++++++++++++++++++++++++++++++ priors ++++++++++++++++++++++++++++++++++++

for (k in 1:2) {

kappa[k] ~ dnorm(0, sd = 3) # coefs for X

}

sigma_X ~ dunif(0, 5)

for (k in 1:2) {

beta[k] ~ dnorm(0, sd = 3) # coefs for p

}

})

# ~ 2. Bundle data -------------------------------------------------------------

inits_X <- X

inits_X[status_X == "exact"] <- NA

dat <- list(X = data_X,

s = s)

my.constants <- list(N = N,

time_s = time_s,

X_LOWER = X_lower_s,

X_UPPER = X_upper_s)

inits <- function() list(beta = c(0.5, 0.5),

kappa = c(0.1, 0.2),

sigma_X = 0.75,

X = inits_X)

inits_values <- list(inits(), inits())

# Parameters monitored

params <- c("beta",

"kappa", "sigma_X")

params2 <- c("X")

# ~ 3. Run the model -----------------------------------------------------------

start <- Sys.time() ; start

# Create R model

inits_values <- inits()

sim_trunc_R <- nimbleModel(code = model_sim_trunc,

constants = my.constants,

data = dat,

inits = inits_values,

calculate = T)

end <- Sys.time() ; end ; end - start

# Compile model (in C++)

sim_trunc_C <- compileNimble(sim_trunc_R,

showCompilerOutput = FALSE)

end <- Sys.time() ; end ; end - start

# Configure of MCMC

MCMC_conf <- configureMCMC(sim_trunc_R,

monitors = params,

monitors2 = params2)

# Compile MCMC

MCMC_R <- buildMCMC(MCMC_conf)

MCMC_C <- compileNimble(MCMC_R, project = sim_trunc_R)

fit_sim_trunc.2 <- runMCMC(mcmc = MCMC_C,

nchains = 1,

niter = 40000,

nburnin = 10000,

thin = 5,

thin2 = 150,

inits = inits())

end <- Sys.time() ; end - start

# C. Run the model with no intervals ===========================================

# Here, I don't take into account the intervals

# ~ 1. Build model -------------------------------------------------------------

model_sim_no_trunc <- nimbleCode ({

for (k in 1:N) {

# X

mu_X[k] <- kappa[1] +

kappa[2] * time_s[k]

X[k] ~ dnorm(mu_X[k], sd = sigma_X)

# S

logit(p[k]) <- beta[1] +

beta[2] * X[k]

s[k] ~ dbern(p[k])

}

# ++++++++++++++++++++++++++++++++ priors ++++++++++++++++++++++++++++++++++++

for (k in 1:2) {

kappa[k] ~ dnorm(0, sd = 3) # coefs for X

}

sigma_X ~ dunif(0, 5)

for (k in 1:2) {

beta[k] ~ dnorm(0, sd = 3) # coefs for p

}

})

# ~ 2. Bundle data -------------------------------------------------------------

inits_X <- X

inits_X[status_X == "exact"] <- NA

dat <- list(X = data_X,

s = s)

my.constants <- list(N = N,

time_s = time_s)

inits <- function() list(beta = c(0.5, 0.5),

kappa = c(0.1, 0.2),

sigma_X = 0.75,

X = inits_X)

inits_values <- list(inits(), inits())

# Parameters monitored

params <- c("beta",

"kappa", "sigma_X")

params2 <- c("X")

# ~ 3. Run the model -----------------------------------------------------------

start <- Sys.time() ; start

# Create R model

inits_values <- inits()

sim_no_trunc_R <- nimbleModel(code = model_sim_no_trunc,

constants = my.constants,

data = dat,

inits = inits_values,

calculate = T)

end <- Sys.time() ; end ; end - start

# Compile model (in C++)

sim_no_trunc_C <- compileNimble(sim_no_trunc_R,

showCompilerOutput = FALSE)

end <- Sys.time() ; end ; end - start

# Configure of MCMC

MCMC_conf <- configureMCMC(sim_no_trunc_R,

monitors = params,

monitors2 = params2)

# Compile MCMC

MCMC_R <- buildMCMC(MCMC_conf)

MCMC_C <- compileNimble(MCMC_R, project = sim_no_trunc_R)

fit_sim_no_trunc <- runMCMC(mcmc = MCMC_C,

nchains = 1,

niter = 40000,

nburnin = 10000,

thin = 5,

thin2 = 150,

inits = inits())

end <- Sys.time() ; end - start

Perry de Valpine

unread,

May 29, 2024, 11:56:51 AMMay 29

to Marwan Naciri, nimble-users

Hi Marwan,

Thanks for the question. Let me say what I see happening and then ask about what you are aiming for.

After I run the MCMC and then look at some of the log probabilities, via sim_trunc_C$calculate() and then, looking into details, sim_trunc_C$logProb_X, I see that many of these are Inf. These seem to be for the truncated values. When I look at the values involved (e.g. sim_trunc_C$X[7]), I see that there are situations such as (for X[7], and rounding)

0.65 ~ T(dnorm(-0.36, 0.058), 0.39, 0.69)

The truncated probability will be calculated as dnorm(0.65, -0.36, 0.058)/ (pnorm(0.69, -0.36, 0.058) - pnorm(0.39, -0.36, 0.058)) (although the actual calculations are done on a log scale).

Note that the truncation points are so far out in the right tail of the distribution that both of the pnorm's evaluate to numerical 1s (i.e. the difference between 1 and the actual answer is too small for the computer, so it returns an actual 1), resulting in a zero in the denominator, resulting in probability of Inf. In other words, the values involved in this problem look somewhat unrealistic.

I have two suggestions for you. First, I wonder if you might actually be looking at a case of censoring instead of truncation? In that case it would be theoretically possible that X[t] could take any value, yet all that is observed is that X[t] occurred between a lower and upper value. Then you would not use T() but you could add a dinterval to enforce the censoring. Please see our User Manual section 5.2.7 if that sounds relevant.

If you really have a case of truncation (the X[t] values could, a priori, theoretically only occur between lower and upper), then I'm not sure what to suggest. Are there further modeling considerations that might not lead to the unrealistic values you're seeing?

HTH

Perry

--
You received this message because you are subscribed to the Google Groups "nimble-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nimble-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nimble-users/663f812b-8e9c-4b9d-9edd-f4fc2827efe4n%40googlegroups.com.

Perry de Valpine

unread,

May 29, 2024, 4:23:13 PMMay 29

to Marwan Naciri, nimble-users

I think you're close.

I'd suggest renaming "X[k]" here to "censored[k]" or something similar. To enforce that t=X_CENSORED[k] falls between the two values in c=X_BOUNDARY[k,], you need to set censored[k] to 1 and provide it as data. You then need to be sure X_CENSORED[k] has an initial value that does in fact fall between the two values in X_BOUNDARY[k,].

As a side note, in your previous code you had used very wide lower and upper choices to avoid censoring or truncation for some X values, but it might be cleaner to simply not include them in any censoring declarations.

On Wed, May 29, 2024 at 1:07 PM Marwan Naciri <marwan...@gmail.com> wrote:

Hi Perry,

Thanks a lot for your insight!

Indeed, I think I'm in a situation where censoring is more relevant: X[t] can take a priori any value, but the available information tells us that it falls between a lower and upper value.
I'm now trying to use dinterval, having read section 5.2.7 of the user guide, but I'm having trouble understanding how to use dinterval. I have tried

X[k] ~ dinterval(t = X_CENSORED[k], c = X_BOUNDARY[k, ])
X_CENSORED[k] ~ dnorm(mu_X[k], sd = sigma_X)

with X_BOUNDARY[k, ] being a two-element vector containing the upper and lower value of the interval, but this produces warnings:

warning: logProb of data node X[1]: logProb is -Inf.
warning: problem initializing stochastic node X[2]: logProb is -Inf.
...

So I must have misunderstood the logic of this function. Could you please point me to the correct formulation?

Many thanks for your help
Marwan

Marwan Naciri

unread,

May 29, 2024, 8:12:57 PMMay 29

to Perry de Valpine, nimble-users

Hi Perry,

Thanks a lot for your insight!

Indeed, I think I'm in a situation where censoring is more relevant: X[t] can take a priori any value, but the available information tells us that it falls between a lower and upper value.

I'm now trying to use dinterval, having read section 5.2.7 of the user guide, but I'm having trouble understanding how to use dinterval. I have tried

X[k] ~ dinterval(t = X_CENSORED[k], c = X_BOUNDARY[k, ])
X_CENSORED[k] ~ dnorm(mu_X[k], sd = sigma_X)

with X_BOUNDARY[k, ] being a two-element vector containing the upper and lower value of the interval, but this produces warnings:

warning: logProb of data node X[1]: logProb is -Inf.
warning: problem initializing stochastic node X[2]: logProb is -Inf.

...

So I must have misunderstood the logic of this function. Could you please point me to the correct formulation?

Many thanks for your help

Marwan

On Wed, May 29, 2024 at 5:56 PM Perry de Valpine <pdeva...@berkeley.edu> wrote:

Marwan Naciri

unread,

May 31, 2024, 4:48:10 PMMay 31

to Perry de Valpine, nimble-users

Hi Perry,

Thank you for your very quick follow-up answer. Now I understand the logic behind dinterval, and the model works well.