Calibrating overfitting SPDE model


Joris Wiethase

May 26, 2021, 9:03:47 AM
to R-inla discussion group
Hi all,

Some background:
I will preface this by saying that I am still fairly new to INLA. 
I have been working on specifying an SPDE model to analyse factors explaining per-pixel rates of a degradation index in a pre-defined study region (approx. 350 km x 170 km). Based on yearly degradation maps, I calculated the overall pixel rate and built a model with a Gaussian likelihood, specifying the SPDE priors with 'inla.spde2.pcmatern' (see code below).

The problem:
Looking at plots of the random field compared to the model predictions, the two are essentially indistinguishable, which suggests the random effect is taking over (I think).
I did some posterior predictive model checking, and the model seemed to be overfitting quite drastically:
[Attached: BG_posterior_probability_OLD.png]
Interestingly, when I checked the model using cross-validation based on CPO and PIT, the histogram appears to show a rather uniform distribution:
[Attached: BG_pit_hist_OLD.png]
Tried so far:
Following advice in this thread, I added a local dispersion effect as an iid effect in the linear predictor. The result seems to be that the histogram of the posterior predictive p-values is essentially flipped:
[Attached: BG_posterior_probability_NEW.png]

The cross validation histogram looks very similar to the previous one:
[Attached: BG_pit_hist_NEW.png]
I tried setting the alpha parameter to 3/2, as suggested in the same thread, but to no avail.

Questions:
  1. Was the model specified correctly, to begin with?
  2. If the model is correct, what could I attempt next to calibrate the model for a better fit? 
Any help is greatly appreciated!

Joris

The code (my apologies, not sure how to format this as code):
-----------------------------------------------------------------------
deg500    <- ...   # a raster stack containing 20 rasters at 500 m pixel resolution
load("BG_raw_spdf.RData")    # loads 'raw_spdf', a SpatialPointsDataFrame containing the covariates
annRain   <- ...   # yearly rainfall for the time period
studyArea <- ...   # shapefile for the study area

### ----------------------------------------------------------------------------
time <- 1:nlayers(deg500)
# Compute the overall slope of degradation, accounting for rainfall in a given year
slowfun <- function(y) {
      if(all(is.na(y))){
            NA
      } else {
            m = lm(y ~ time + annRain$mean); summary(m)$coefficients[2]   # get time
      }
}

deg500_rate <- calc(deg500, slowfun)

### Basic INLA model predicting rate from various covariates:
### ----------------------------------------------------------------------------

## Add the rate to the raw_spdf:
deg500_rate_spdf <- as(deg500_rate, "SpatialPointsDataFrame")
deg500_rate_spdf <- deg500_rate_spdf[!is.na(deg500_rate_spdf@data$layer), ]

xy_raw  <- apply(round(coordinates(raw_spdf), 6), 1, paste, collapse = "_")
xy_rate <- apply(round(coordinates(deg500_rate_spdf), 6), 1, paste, collapse = "_")

raw_spdf <- raw_spdf[order(xy_raw), ]
deg500_rate_spdf <- deg500_rate_spdf[xy_rate %in% xy_raw, ]
deg500_rate_spdf <- deg500_rate_spdf[order(xy_rate[xy_rate %in% xy_raw]), ]
raw_spdf$deg_rate <- deg500_rate_spdf$layer * 10  # Rescale for readability; the index units are arbitrary anyway

## Transform and scale covariates:
raw_spdf$lpop <- log(raw_spdf$pop + 1)  # log(x+1) for right skewed data containing zeros
raw_spdf$ltlu <- log(raw_spdf$tlu + 1)  
raw_spdf$pop_s <- c(scale(raw_spdf$lpop))
raw_spdf$rain_s <- c(scale(raw_spdf$rain))
raw_spdf$tlu_s <- c(scale(raw_spdf$ltlu))

## Make a mesh:
max.edge = diff(range(raw_spdf@coords[, 2]))/15

bound.outer = diff(range(raw_spdf@coords[, 2]))/3
mesh <- inla.mesh.2d(
      boundary = studyArea,
      loc = coordinates(raw_spdf),
      max.edge = c(1,5)*max.edge,  # For filling in area
      cutoff = max.edge/5,
      offset = c(max.edge, bound.outer))    # Expanding outside actual locations; inner and outer offsets
# and the spde:
A = inla.spde.make.A(mesh=mesh, loc=data.matrix(coordinates(raw_spdf)))  # Combine actual points with mesh

# covariates:
Xcov = data.frame(intercept=1, 
                  pop_s = raw_spdf$pop_s,
                  rain_s = raw_spdf$rain_s,
                  tlu_s = raw_spdf$tlu_s,
                  CA1 = (raw_spdf$CA == 1) * 1,  # CCRO
                  CA2 = (raw_spdf$CA == 2) * 1,  # NP
                  CA3 = (raw_spdf$CA == 3) * 1)  # WMA

Xcov = as.matrix(Xcov)
colnames(Xcov)
stck <- inla.stack(tag='est',     # stack is INLA data structure
                   data=list(deg_rate=raw_spdf$deg_rate),
                   effects=list(
                         s=1:mesh$n,     # spatial random effect
                         iidx=1:nrow(raw_spdf),  # iid random effect
                         Xcov=Xcov),
                   A = list(A, 1, 1))   # one projector per effects entry: A for the SPDE field, 1 for iid and covariates

## define the model:
# - sd(raw_spdf@coords[, 2])/10
# - 0.5*diff(range(raw_spdf@coords[, 2]))

spde = inla.spde2.pcmatern(mesh, 
                           alpha = 3/2,                 # smoothness parameter of the Matern field
                           prior.range = c(1.6, 0.5),   # PC prior: P(range < 1.6) = 0.5
                           prior.sigma = c(0.08, 0.1))  # PC prior: P(sigma > 0.08) = 0.1

hyper.iid = list(prec = list(prior = 'pc.prec', param = c(0.08, 0.1)))  # PC prior: P(sd > 0.08) = 0.1

formula = deg_rate ~ -1 + Xcov +    
      f(s, model=spde)  +               # spatial random effect
      f(iidx, model="iid", hyper=hyper.iid)   # iid random effect

## make some linear combinations, for effect plots:
pop_lc <- inla.make.lincombs(Xcov1 = rep(1, 100),
                             Xcov2 = seq(min(raw_spdf$pop_s), max(raw_spdf$pop_s), len = 100))
names(pop_lc) <- paste0("pop", 1:100)
rain_lc <- inla.make.lincombs(Xcov1 = rep(1, 100),
                              Xcov3 = seq(min(raw_spdf$rain_s), max(raw_spdf$rain_s), len = 100))
names(rain_lc) <- paste0("rain", 1:100)
tlu_lc <- inla.make.lincombs(Xcov1 = rep(1, 100),
                             Xcov4 = seq(min(raw_spdf$tlu_s), max(raw_spdf$tlu_s), len = 100))
names(tlu_lc) <- paste0("tlu", 1:100)
CA_lc <- inla.make.lincombs(Xcov1 = rep(1, 4),
                            Xcov5 = c(0,1,0,0),
                            Xcov6 = c(0,0,1,0),
                            Xcov7 = c(0,0,0,1))
names(CA_lc) <- paste0("CA", 1:4)
all_lc <- c(pop_lc, rain_lc, tlu_lc, CA_lc)

res <- inla(formula, data=inla.stack.data(stck),
            control.predictor=list(A = inla.stack.A(stck), compute=TRUE),
            family = "gaussian",   # Gaussian likelihood, as described above
            lincomb = all_lc,
            control.compute = list(cpo = TRUE),
            control.inla = list(int.strategy='eb'),    # empirical bayes (cheap computation for testing)
            verbose = TRUE)

summary(res)

# ### ----------------------------------------------------------------------------
# ### MODEL DIAGNOSTICS
# ### ----------------------------------------------------------------------------

# # Make plot of predictions ---------------------------------------------
index.val <- inla.stack.index(stck,"est")$data
post.mean.val <- res$summary.linear.predictor[index.val,"mean"]
post.sd.val <- res$summary.linear.predictor[index.val,"sd"]

raw_spdf$predicted <- post.mean.val
raw_spdf$pred_sd <- post.sd.val

mapdata <- data.frame(raw_spdf)

pred_mean <- ggplot() +
      geom_point(data = mapdata, aes(x=x, y=y, col = predicted)) +
      coord_equal() +
      theme_minimal()

pred_sd <- ggplot() +
      geom_point(data = mapdata, aes(x=x, y=y, col = pred_sd), cex = 0.5) +
      coord_equal() +
      theme_minimal()
pred_comb <- gridExtra::grid.arrange(pred_mean, pred_sd, nrow = 1)

# # Visualize the random field, the unexplained spatial error -----------------------------
projector <- inla.mesh.projector(mesh)
projection <- inla.mesh.project(projector,
                                res[["summary.random"]]$s$mean)
INLAutils::ggplot_projection_shapefile(projection, projector) +
      geom_polygon(data = studyArea, aes(long, lat, group = group), fill = NA, col = 'black') +
      coord_equal()

# Posterior predictive check  ----------------------------------------------------------
INLAutils::ggplot_inla_residuals(res, raw_spdf$deg_rate, binwidth = 0.1)

# Cross-validation model checking  ----------------------------------------------------------
sum(res$cpo$failure)  # Check if numerical problems occurred when calculating CPO

hist(res$cpo$pit, main="", breaks = 10, xlab = "PIT")

qqplot(qunif(ppoints(length(res$cpo$pit))), res$cpo$pit, main = "Q-Q plot for Unif(0,1)",
       xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
qqline(res$cpo$pit, distribution = function(p) qunif(p), prob = c(0.1, 0.9), col = "red")


Here is the model summary for the overfitted model:
[Attached: FIXED_OLD.png, RE_OLD.png]


And for the adjusted model:
[Attached: FIXED_NEW.png, RE_NEW.png]

Finn Lindgren

May 26, 2021, 9:57:50 AM
to Joris Wiethase, R-inla discussion group
Hi Joris,

the "add an idd effect" from the other thread really only makes sense when the observation likelihood isn't already an iid Gaussian additive effect. E.g., if the observation model is Poisson, then adding an iid effect to the predictor acts as an overdispersion effect.

But in your case with Gaussian observations, we can see that the estimation simply redistributes the observation variance of ~1/20 into two independent ~1/40 variances, which sum to the same total (though the clean way in which it does that _is_ a surprise to me!). This leads to effectively the same model; the only thing that changes is where the observation noise enters the model.

This leads me to believe that the problem lies in the INLAutils::ggplot_inla_residuals() function itself, and that what it computes isn't as well-defined a quantity as it should be; you may need to dig into the code for that package to see what it actually computes, and what mathematical object that is. The documentation doesn't say, so I would be inclined not to use that function until I had figured out whether it actually computes/plots something sensible.

It appears to me that it's plotting the "linear predictor" posterior probabilities computed at the observed values, but that doesn't include the observation model itself, so these aren't the correct posterior probabilities of the predictive distributions, since those need to involve the observation model as well. This is done correctly in the CPO calculations, which is likely why those consistently look ok!
Pet peeve: this misconception seems common in frequentist-to-Bayesian texts that transpose point-estimation methods as if they could be applied without modification to posterior distributions, which is incorrect!

Finn




Finn Lindgren

May 26, 2021, 10:00:30 AM
to Joris Wiethase, R-inla discussion group
TL/DR: it seems ggplot_inla_residuals uses the posterior predictive distribution for the linear predictor instead of the posterior predictive distribution for the observations.

Except for the CPO/PIT calculations from inla() itself, the latter typically needs inla.posterior.sample() to compute/estimate properly.
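Something along these lines would do it (an untested sketch reusing the objects from your code; it requires refitting with control.compute = list(config = TRUE, cpo = TRUE) so that inla.posterior.sample() works, and the row/hyperparameter names are the usual INLA ones but should be checked against your own result object):

## Observation-level posterior predictive p-values via inla.posterior.sample()
nsamp <- 1000
samp  <- inla.posterior.sample(nsamp, res)             # needs config = TRUE in the fit

idx        <- inla.stack.index(stck, "est")$data       # data rows of the stack
lat_names  <- rownames(samp[[1]]$latent)
apred_rows <- grep("^APredictor", lat_names)           # linear predictor at the data rows

eta <- sapply(samp, function(s) s$latent[apred_rows[idx], 1])   # n_obs x nsamp matrix
tau <- sapply(samp, function(s)
      s$hyperpar[["Precision for the Gaussian observations"]])

## simulate replicate observations by adding Gaussian observation noise,
## then compare the replicates with the observed rates
y_rep <- eta + sweep(matrix(rnorm(length(eta)), nrow = nrow(eta)), 2, sqrt(1 / tau), "*")
ppp   <- rowMeans(y_rep <= raw_spdf$deg_rate)
hist(ppp, main = "", xlab = "Posterior predictive p-value (observation level)")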

Finn

Joris Wiethase

May 26, 2021, 10:44:24 AM
to R-inla discussion group
Hi Finn,

thank you for your input, it is very much appreciated! Good to know that including the iid effect doesn't make sense here (shows how little I know about the subject matter!).
In the Wang et al. (2018) book ("Bayesian Regression Modeling with INLA"), I have come across some code that produces the same plot; I suspect that is the code underlying the INLAutils::ggplot_inla_residuals() function:

n <- nrow(raw_spdf)
predicted.p.value <- numeric(n)
for (i in 1:n) {
  predicted.p.value[i] <- inla.pmarginal(q = raw_spdf$deg_rate[i],
                                         marginal = res$marginals.fitted.values[[i]])
}
hist(predicted.p.value, main = "", xlab = "Posterior predictive p-value")



Considering the CPO plot is the one to trust, can I conclude that the model is specified and working correctly? Or are there other checks to be done first?

Finn Lindgren

May 26, 2021, 10:54:07 AM
to Joris Wiethase, R-inla discussion group
Yes, the only difference between linear.predictor and fitted.values is that fitted.values applies the (inverse of the) link function (which is the identity in your case), so it doesn't include the observation distribution part of posterior predictions.
CPO does include all the needed parts, so it is the appropriate quantity to look at for this type of posterior check. But the observed vs point-prediction scatter plot can also be useful; in your case it illustrates the slight shrinkage of the estimated model towards a flat field (since the scatter isn't centred around the 1-1 correspondence), which can either be ok, or an indication of oversmoothing (i.e. the opposite of overfitting!!!).
Further calculations of this type are only needed for fully out-of-sample predictions (in some models, the leave-one-out cross-validation that CPO involves isn't the appropriate tool to use).

I would start by plotting the results spatially. You can also do a spatial plot of the observed minus "point prediction" values to check for spatial patterns in that, etc.; see the sketch below.
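For example (a rough, untested sketch reusing mapdata from your code, which already contains deg_rate and predicted):

library(ggplot2)

## observed vs point prediction, with the 1-1 line for reference
ggplot(mapdata, aes(x = predicted, y = deg_rate)) +
      geom_point(size = 0.5, alpha = 0.3) +
      geom_abline(intercept = 0, slope = 1, colour = "red") +
      theme_minimal()

## map of observed minus point prediction, to look for leftover spatial structure
ggplot(mapdata, aes(x = x, y = y, col = deg_rate - predicted)) +
      geom_point(size = 0.5) +
      scale_colour_gradient2() +      # diverging scale centred at zero
      coord_equal() +
      theme_minimal()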

Finn

Joris Wiethase

May 26, 2021, 11:07:51 AM
to R-inla discussion group
Excellent, thank you once again for your detailed explanation and input! 