Hi all,
I’m trying to run a path analysis (time -> X -> S) in which the variable X is not known for all datapoints. There are three types of X datapoints:
- - X is known exactly
- - X is entirely unknown
- - We know of an interval comprising X
The values of time and S are known for all data points.
To accommodate these properties of X, I have been trying to run a model in which all values of X are drawn from truncated normal distributions sharing the same mean and the same standard deviation, but not the same boundaries. Values of X that are either known exactly or entirely unknown are drawn from a normal distribution truncated very far from the mean (so essentially non-truncated), while values of X for which we know an interval are drawn in a normal distribution truncated at the boundaries of the corresponding interval.
The compilation returns no error, and the MCMC algorithm seems to work, but for the coefficients of X (the mean and the slope against A) and the standard deviation of X the chains remain at the initial value and do not sample the parameter space. The Xs are not sampled either (which makes sense for the Xs that are known, but not for the others).
When I discard the info on intervals and consider only values known exactly (and values entirely unknown), the MCMC samples the parameter space of all parameters, and the values used to simulate the data are well recovered (in simulated datasets). However, this requires discarding part of the available data, which I really want to avoid as the data is quite scarce.
Am I missing something?
The code I used to simulate and analyze the data is below.
Thanks in advance,
Marwan
library(tidyverse)
library(nimble)
Sys.setenv(LANG = "en")
# The purpose of this script is to simulate data corresponding to the following
# directed acyclic graph:
# Time ---> X ----> S
# The specificity here is that not all values of X are known. Some of these
# values are entirely unknown, while for some, we know of an interval in which the
# true value is located.
# To handle this data, the idea is to draw all values of X in truncated normal
# distributions sharing the same mean and standard deviation, but not the same
# boundaries:
# - known values and entirely unknown values are drawn in a virutally non-truncated
# normal distribution (i.e. boundaries are very far from the mean so as to
# truncate the distribution where the density is extremely low anyway)
# - values for which we know an interval are drawn in a normal distribution truncated
# at this interval.
# A. Simulate the data =========================================================
N <- 250
time <- sample(x = 1988:2024, size = N, replace = T)
time_s <- as.vector(scale(time))
# ~ 1. SImulate A -------------------------------------------------------
# Simulate Time --------> X
mean_X <- 0
sd_X <- 1
slope_time_X <- -0.6
slope_A_on_X <- 0.2
set.seed(2)
X <- rnorm(n = N,
mean = mean_X + slope_time_X * time_s,
sd = sd_X)
ggplot(data.frame(x = time, y = X),
aes(x = x, y = y)) +
geom_jitter() +
theme_bw() +
geom_smooth(method = "lm")
# ~ 2. Simulate S --------------------------------------------------------------
# Simulate X ---> S
p <- plogis(-1.25*X)
s <- NULL
set.seed(2)
for (k in 1:N) {
s[k] <- rbinom(n = 1, size = 1, prob = p[k])
}
ggplot(data.frame(x = X, y = s), aes(x = x, y = y)) +
geom_jitter(alpha = 0.5, height = 0.2) +
geom_smooth(method = "glm",
method.args = list(family = "binomial")) +
theme_bw()
# ~ 3. Missing values and intervals -------------------------------------------
# Now let's degrade the quality of the dataset by entirely removing some values
# of X, and by replacing some values with intervals
# Probability that at least some info is available (i.e. 1 - probability that the
# value is entirely unknown):
p_some_info <- 0.7
# Probability that a value that is not entirely unknown is know exactly. This
# probability is time-dependent i.e. in early years exact values of A and X were
# never known (only intervals) while in recent years, values are known exactly
# (or not known at all)
p_exact_value <- plogis(-1.5 + 3*time_s)
p <- plogis(-1.5 + 3*time_s)
ggplot(data.frame(x = time, p = p),
aes(x = x, y = p)) +
geom_line() +
theme_bw()
# status of X:
status_X <- NULL
set.seed(2)
for (k in 1:N) {
some_info <- rbinom(n = 1, size = 1, prob = p_some_info)
if (some_info == 1) {
exact_value <- rbinom(n = 1, size = 1, prob = p_exact_value[k])
if (exact_value == 1) {
status_X[k] <- "exact"
} else {
status_X[k] <- "interval"
}
} else {
status_X[k] <- "unknown"
}
}
# values to be added or substracted to generate the intervals
uncertainty <- c(0.1, 0.2, 0.5, 1)
prob_uncert <- c(0.3, 0.3, 0.3, 0.1)
X_lower_s <- X_upper_s <- rep(NA, times = length(time))
data_X <- X
# Remove missing values and replace true values by intervals where needed
set.seed(2)
for (k in 1:N) {
if (status_X[k] %in% c("interval", "unknown")) {
data_X[k] <- NA
if (status_X[k] == "interval") {
X_lower_s[k] <- X[k] - sample(x = uncertainty, size = 1, prob = prob_uncert)
X_upper_s[k] <- X[k] + sample(x = uncertainty, size = 1, prob = prob_uncert)
}
}
# When no information is available or that the true value is known, the value is drawn
# from a normal distribution truncated very far from the mean (i.e. effectively no
# truncation)
if (status_X[k] %in% c("exact", "unknown")) {
X_lower_s[k] <- -5
X_upper_s[k] <- 5
}
}
df_plot_X <- data.frame(x = time,
y = data_X,
y_lower = X_lower_s,
y_upper = X_upper_s,
status_X = status_X)
df_plot_X_2 <- df_plot_X %>%
mutate(ID = 1:nrow(df_plot_X),
offset = rnorm(n = nrow(df_plot_X), mean = 0, sd = 0.2),
x = x + offset) %>%
pivot_longer(cols = c(y_lower, y_upper))
ggplot(data = df_plot_X_2[which(df_plot_X_2$status_X == "interval"), ]) +
geom_point(aes(x = x, y = value)) +
geom_line(aes(x = x, y = value, group = ID)) +
theme_bw() +
labs(x = "time", y = "X")+
ylim(-3, 4.25)
ggplot() +
geom_jitter(data = df_plot_out, aes(x = x, y = y)) +
theme_bw() +
labs(x = "time", y = "X") +
ylim(-3, 4.25)
# B. Run the model with intervals ==============================================
# ~ 1. Build model -------------------------------------------------------------
model_sim_trunc <- nimbleCode ({
for (k in 1:N) {
# X
mu_X[k] <- kappa[1] +
kappa[2] * time_s[k]
X[k] ~ T(dnorm(mu_X[k], sd = sigma_X),
X_LOWER[k],
X_UPPER[k])
# S
logit(p[k]) <- beta[1] +
beta[2] * X[k]
s[k] ~ dbern(p[k])
}
# ++++++++++++++++++++++++++++++++ priors ++++++++++++++++++++++++++++++++++++
for (k in 1:2) {
kappa[k] ~ dnorm(0, sd = 3) # coefs for X
}
sigma_X ~ dunif(0, 5)
for (k in 1:2) {
beta[k] ~ dnorm(0, sd = 3) # coefs for p
}
})
# ~ 2. Bundle data -------------------------------------------------------------
inits_X <- X
inits_X[status_X == "exact"] <- NA
dat <- list(X = data_X,
s = s)
my.constants <- list(N = N,
time_s = time_s,
X_LOWER = X_lower_s,
X_UPPER = X_upper_s)
inits <- function() list(beta = c(0.5, 0.5),
kappa = c(0.1, 0.2),
sigma_X = 0.75,
X = inits_X)
inits_values <- list(inits(), inits())
# Parameters monitored
params <- c("beta",
"kappa", "sigma_X")
params2 <- c("X")
# ~ 3. Run the model -----------------------------------------------------------
start <- Sys.time() ; start
# Create R model
inits_values <- inits()
sim_trunc_R <- nimbleModel(code = model_sim_trunc,
constants = my.constants,
data = dat,
inits = inits_values,
calculate = T)
end <- Sys.time() ; end ; end - start
# Compile model (in C++)
sim_trunc_C <- compileNimble(sim_trunc_R,
showCompilerOutput = FALSE)
end <- Sys.time() ; end ; end - start
# Configure of MCMC
MCMC_conf <- configureMCMC(sim_trunc_R,
monitors = params,
monitors2 = params2)
# Compile MCMC
MCMC_R <- buildMCMC(MCMC_conf)
MCMC_C <- compileNimble(MCMC_R, project = sim_trunc_R)
fit_sim_trunc.2 <- runMCMC(mcmc = MCMC_C,
nchains = 1,
niter = 40000,
nburnin = 10000,
thin = 5,
thin2 = 150,
inits = inits())
end <- Sys.time() ; end - start
# C. Run the model with no intervals ===========================================
# Here, I don't take into account the intervals
# ~ 1. Build model -------------------------------------------------------------
model_sim_no_trunc <- nimbleCode ({
for (k in 1:N) {
# X
mu_X[k] <- kappa[1] +
kappa[2] * time_s[k]
X[k] ~ dnorm(mu_X[k], sd = sigma_X)
# S
logit(p[k]) <- beta[1] +
beta[2] * X[k]
s[k] ~ dbern(p[k])
}
# ++++++++++++++++++++++++++++++++ priors ++++++++++++++++++++++++++++++++++++
for (k in 1:2) {
kappa[k] ~ dnorm(0, sd = 3) # coefs for X
}
sigma_X ~ dunif(0, 5)
for (k in 1:2) {
beta[k] ~ dnorm(0, sd = 3) # coefs for p
}
})
# ~ 2. Bundle data -------------------------------------------------------------
inits_X <- X
inits_X[status_X == "exact"] <- NA
dat <- list(X = data_X,
s = s)
my.constants <- list(N = N,
time_s = time_s)
inits <- function() list(beta = c(0.5, 0.5),
kappa = c(0.1, 0.2),
sigma_X = 0.75,
X = inits_X)
inits_values <- list(inits(), inits())
# Parameters monitored
params <- c("beta",
"kappa", "sigma_X")
params2 <- c("X")
# ~ 3. Run the model -----------------------------------------------------------
start <- Sys.time() ; start
# Create R model
inits_values <- inits()
sim_no_trunc_R <- nimbleModel(code = model_sim_no_trunc,
constants = my.constants,
data = dat,
inits = inits_values,
calculate = T)
end <- Sys.time() ; end ; end - start
# Compile model (in C++)
sim_no_trunc_C <- compileNimble(sim_no_trunc_R,
showCompilerOutput = FALSE)
end <- Sys.time() ; end ; end - start
# Configure of MCMC
MCMC_conf <- configureMCMC(sim_no_trunc_R,
monitors = params,
monitors2 = params2)
# Compile MCMC
MCMC_R <- buildMCMC(MCMC_conf)
MCMC_C <- compileNimble(MCMC_R, project = sim_no_trunc_R)
fit_sim_no_trunc <- runMCMC(mcmc = MCMC_C,
nchains = 1,
niter = 40000,
nburnin = 10000,
thin = 5,
thin2 = 150,
inits = inits())
end <- Sys.time() ; end - start
--
You received this message because you are subscribed to the Google Groups "nimble-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to nimble-users...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/nimble-users/663f812b-8e9c-4b9d-9edd-f4fc2827efe4n%40googlegroups.com.
Hi Perry,Thanks a lot for your insight!Indeed, I think I'm in a situation where censoring is more relevant: X[t] can take a priori any value, but the available information tells us that it falls between a lower and upper value.I'm now trying to use dinterval, having read section 5.2.7 of the user guide, but I'm having trouble understanding how to use dinterval. I have triedX[k] ~ dinterval(t = X_CENSORED[k], c = X_BOUNDARY[k, ])
X_CENSORED[k] ~ dnorm(mu_X[k], sd = sigma_X)with X_BOUNDARY[k, ] being a two-element vector containing the upper and lower value of the interval, but this produces warnings:warning: logProb of data node X[1]: logProb is -Inf.
warning: problem initializing stochastic node X[2]: logProb is -Inf....So I must have misunderstood the logic of this function. Could you please point me to the correct formulation?Many thanks for your helpMarwan