Hi all,
I've been using inlabru for Bayesian spatial population modelling and have run into a question about how prediction intervals should be constructed. I'd appreciate any guidance on whether my understanding is correct and what the recommended approach is.
When I use generate() or predict() to produce intervals at held-out locations for cross-validation, the resulting 95% intervals achieve substantially less than 95% coverage against observed data. After some investigation, I believe this is because the samples returned by generate() represent the posterior of the linear predictor- they are samples of E[y|θ], the conditional mean, and do not include the observation-level noise from the likelihood.
For a Gaussian example this is straightforward to verify: the 95% quantiles from generate() match those from summary.fitted.values, and both undercover by a wide margin. I've attached a minimal reproducible example below that demonstrates this.
My current workaround is to include the observation noise directly inside the generate() formula, which seems to work well and has the nice property of jointly sampling the precision hyperparameter:
The first gives ~17% coverage; the second gives ~95%.
My questions:
I realise this is documented in the INLA literature (Martino & Riebler 2019 note that "adding the observational noise to the fitted values leads to the predictive distributions"), but in practice it's easy to miss, especially since the standard cross-validation metrics (bias, RMSE, correlation) all depend on the posterior mean and are identical regardless of which interval type is used, only coverage reveals the problem. I also haven't seen this mentioned much in the tutorials and teaching literature around INLA.
Minimal reproducible example below. Thanks for any guidance.
Peter
###############################################################################
library(inlabru)
library(matrixStats)
set.seed(2026)
# ── 1. Simulate data ────────────────────────────────────────────────────────
n <- 1000
n_groups <- 25
beta_0 <- 1.0
beta_1 <- 0.5
beta_2 <- -0.5
sigma_re <- 0.1
sigma_obs <- 1.0 # observation-level SD (true precision = 1)
group <- sample(1:n_groups, n, replace = TRUE)
x1 <- rnorm(n)
x2 <- rnorm(n)
re <- rnorm(n_groups, 0, sigma_re)
eta <- beta_0 + beta_1 * x1 + beta_2 * x2 + re[group]
y <- rnorm(n, mean = eta, sd = sigma_obs)
# ── 2. Hold out 20% ────────────────────────────────────────────────────────
holdout <- sample(1:n, round(0.2 * n))
y_true <- y[holdout]
y_train <- y
y_train[holdout] <- NA
df <- data.frame(y = y_train, x1 = x1, x2 = x2, group = group)
# ── 3. Fit model ───────────────────────────────────────────────────────────
cmp <- y ~ Intercept(1) +
x1(x1, model = "linear") +
x2(x2, model = "linear") +
group_re(group, model = "iid",
hyper = list(prec = list(prior = "pc.prec", param = c(0.5, 0.01))))
fit <- bru(components = cmp, data = df, family = "gaussian",
options = list(control.compute = list(config = TRUE),
verbose = FALSE))
# ── 4. Approach A: generate() without observation noise ─────────────────────
samples_A <- generate(fit, newdata = df[holdout, ],
formula = ~ Intercept + x1 + x2 + group_re,
n.samples = 2000)
q_A <- rowQuantiles(samples_A, probs = c(0.025, 0.975))
cov_A <- mean(y_true >= q_A[, 1] & y_true <= q_A[, 2])
cov_A
# predict() also produces the same intervals, since it samples from
# the same latent posterior as generate().
fitted_holdout <- fit$summary.fitted.values[holdout, ]
cov_fitted <- mean(y_true >= fitted_holdout$`0.025quant` &
y_true <= fitted_holdout$`0.975quant`)
## Results: the "generate() without noise" and "fitted values" approaches give
cov_A
cov_fitted
## These are approximately equal
# ── 5. Approach B: generate() WITH observation noise ────────────────────────
samples_B <- generate(fit, newdata = df[holdout, ],
formula = ~ rnorm(
n = 2000,
mean = Intercept + x1 + x2 + group_re,
sd = 1 / sqrt(Precision_for_the_Gaussian_observations)
),
n.samples = 2000)
q_B <- rowQuantiles(samples_B, probs = c(0.025, 0.975))
cov_B <- mean(y_true >= q_B[, 1] & y_true <= q_B[, 2])
cov_B
# Nominal Target 95%
--
You received this message because you are subscribed to the Google Groups "R-inla discussion group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to r-inla-discussion...@googlegroups.com.
To view this discussion, visit https://groups.google.com/d/msgid/r-inla-discussion-group/144a78ab-79d6-486c-b43b-ef53cec616a1n%40googlegroups.com.