Prediction with covariate-- spatio-temporal model

Alok Bohara

unread,

Jul 10, 2022, 12:30:36 PM7/10/22

to R-inla discussion group

Hi all:

I am a bit uncertain about how the prediction and mapping inla performs across the entire gridded map (of Spain, say e.g.) when there are no data available for covariates for non-sampled grids. I am using the terminology grid very loosely to mean the entire map of mainland Spain. I went through the Spain's pollution PM25 example and was able to produce the maps for three years (2015/16/17). This was for a constant-only model. https://www.paulamoraga.com/book-geospatial/sec-geostatisticaldataexamplest.html

I then added a Time trend covariate and reran the model which gave me the maps again. I was not quite sure how it did the prediction without the Time trend available for the entire region.

Sample of codes

###Squared grid
bb <- st_bbox(m) $ m is border of mainland Spain
x <- seq(bb$xmin - 1, bb$xmax + 1, length.out = 50)
y <- seq(bb$ymin - 1, bb$ymax + 1, length.out = 50)
dp <- as.matrix(expand.grid(x, y)) # plot(dp); this gridded Spain will be repeated and stacked three times later
plot(dp, asp = 1)

plot(m) # prints three maps for three years.

##Select grids within Spain only
#Then, we keep only the locations that lie within the map of Spain (Figure 10.10).
p <- st_as_sf(data.frame(x = dp[, 1], y = dp[, 2]),
coords = c("x", "y")
)
st_crs(p) <- st_crs(25830)
ind <- st_intersects(m, p)
dp <- dp[ind[[1]], ]
plot(dp, asp = 1)

####

stk.e <- inla.stack(
tag = "est",
data = list(y = d$value),
A = list(1, A),
effects = list(data.frame(b0 = rep(1, nrow(d)), Time = d$Time), s = indexs)
)

stk.p <- inla.stack(
tag = "pred",
data = list(y = NA), # poll data for other non-station places are not available...
A = list(1, Ap),
effects = list(data.frame(b0 = rep(1, nrow(dp),Time=d$Time)), s = indexs)
)

stk.full <- inla.stack(stk.e, stk.p)

rprior <- list(theta = list(prior = "pccor1", param = c(0, 0.9)))

#The formula is defined as

formula <- y ~ 0 + b0 + Time+ f(s,
model = spde, group = s.group,
control.group = list(model = "ar1", hyper = rprior)

)

#I then went through the rest of the plotting codes to get the three predicted maps for three years just like the one for the constant-only model.

index <- inla.stack.index(stack = stk.full, tag = "pred")$data

res$summary.fitted.values

dp <- data.frame(dp)
names(dp) <- c("x", "y", "time")

dp$pred_mean <- res$summary.fitted.values[index, "mean"]
dp$pred_ll <- res$summary.fitted.values[index, "0.025quant"]
dp$pred_ul <- res$summary.fitted.values[index, "0.975quant"]

library(reshape2)
dpm <- melt(dp,
id.vars = c("x", "y", "time"),
measure.vars = c("pred_mean", "pred_ll", "pred_ul")
)
head(dpm)

ggplot(m) + geom_sf() + coord_sf(datum = NA) +
geom_tile(data = dpm, aes(x = x, y = y, fill = value)) +
labs(x = "", y = "") +
facet_wrap(variable ~ time) +
scale_fill_viridis("PM2.5") +
theme_bw()

########################################################

Q: I was not quite sure how it knows the value of the time trend for those non-sampled stations. For example, a more realistic model may look like this:

PM2.5 (it)= a + b * TimeTrend(t) + c* NoOfFactories(it) + f (etc...)

where i = stations, t = time.

In this case, I have NoOfFactories around the station.

Could I still predict those other areas of Spain without the covariates available to me? Or should I assume that most examples I see in INla books for constant-only model may not be applicable for the model with covariates, especially to predict the entre region. I have seen example of elevation raster image being used as a covariate which makes sense as these elevations can be extracted for the entire region. I am interested in the model like above.

I would appreciate any tips you may provide.

Message has been deleted

Tim Meehan

unread,

Jul 11, 2022, 2:24:34 PM7/11/22

to R-inla discussion group

Hello.

"Q: I was not quite sure how it knows the value of the time trend for those non-sampled stations."

If I understand correctly, you have time in there twice, once as a global 'fixed' effect and once as a spatially and temporally (ST) structured random effect.

# formula <- y ~ 0 + b0 + Time+ f(s, model = spde, group = s.group, control.group = list(model = "ar1", hyper = rprior)).

INLA knows about the time dimension in the ST effect because time and s.group are the same thing. But I'm not sure how you are even able to get this model to run because the stk.p object looks funky.

# effects = list(data.frame(b0 = rep(1, nrow(dp),Time=d$Time)), s = indexs)

Your rep() function looks like it needs a closing parenthesis, and that Time is not making it into the prediction stack. Maybe your predictions only contain the ST time component and not the Time fixed effect? It is definitely possible to predict from the model you described. You can predict it for the observation stations with one prediction stack that has the station locations and covariate values. You can get it for the rest of Spain if you have data on the factory locations across Spain and can feed the number of factories per prediction grid location. If not, the only way you could do it would be to assume an average number of factories or zero factories for prediction grid cells. But that might yield a useless map? Or maybe you could build another ST model for number of factories based on population size or river or road networks or some other available measure.

I can show you how to set up the stacks if you want.

Best,

Tim

Alok Bohara

unread,

Jul 11, 2022, 10:32:32 PM7/11/22

to R-inla discussion group

Thanks Tim for your offer of help in setting up the stacks. I have a real example of water quality data where I have collected data on ecoli found in their houses for about 65% of the total sampled households (to minimize cost). For the remaining 35 % of the sample, I do have covariates. I want to predict the prevalence of ecoli across the city. I think I was able to predict the point prediction for the 35% of the HH. But, I was not sure if I could predict the entire mapping across the city. Ideally, I should not be able to do so as you said. But this example with Time fixed effect threw me off.

First, I think I fixed closing parenthesis error right after posting the message...:

effects = list(data.frame(b0 = rep(1, nrow(d)), Time = d$Time), s = indexs)

)

Result:

mean sd 0.025quant 0.5quant 0.975quant mode kld
b0 8.158 0.374 7.423 8.157 8.895 8.156 0
Time 0.199 0.116 -0.028 0.199 0.428 0.199 0

And the predicted map is attached.. Again, I just don't know how the model predicts across the grid (as in the attached map) when the time fixed effect is known only for the observed stations (24 stations, I believe).

Sample of codes showing how I included fixed Time in prediction is given below:

list_marginals <- list(
"b0" = res$marginals.fixed$b0,
"Time" = res$marginals.fixed$Time,
"precision Gaussian obs" =
res$marginals.hyperpar$"Precision for the Gaussian observations",
"range" = res$marginals.hyperpar$"Range for s",
"stdev" = res$marginals.hyperpar$"Stdev for s",
"rho" = res$marginals.hyperpar$"GroupRho for s"
)

marginals <- data.frame(do.call(rbind, list_marginals))
marginals$parameter <- rep(names(list_marginals),
times = sapply(list_marginals, nrow)
)

etc...

The graphing codes:

library(reshape2)
dpm <- melt(dp,
id.vars = c("x", "y", "time"),
measure.vars = c("pred_mean", "pred_ll", "pred_ul")
)
head(dpm)

etc..

ggplot(m) + geom_sf() + coord_sf(datum = NA) +
geom_tile(data = dpm, aes(x = x, y = y, fill = value)) +
labs(x = "", y = "") +
facet_wrap(variable ~ time) +
scale_fill_viridis("PM2.5") +
theme_bw()

Alok

SpainPollPredictionMap.png

Tim Meehan

unread,

Jul 12, 2022, 6:50:05 PM7/12/22

to R-inla discussion group

Hi Alok,

'I don't know how the model predicts across the grid (as in the attached map) when the time fixed effect is known only for the observed stations'
I don't know either. Maybe by putting intercept and Time in the same data frame, R did some unseen recycling? Anyway one sure way to add the linear global effect to the stacks is like this:

### first put Time on the same scale as s.group for good measure

d$Time <- as.integer(factor(d$year))

### build the stacks

stk.e <- inla.stack(
tag = "est",
data = list(y = d$value),
A = list(1,

1, ### add an extra 1 for the linear global effect which is then a separate list item in effects
A),
effects = list(b0 = rep(1, nrow(d)),
Time=d$Time, ### separate the Time effect from the intercept

s = indexs)
)
stk.p <- inla.stack(
tag = "pred",
data = list(y = NA),

A = list(1,
1, ### same as above
Ap),
effects = list(b0 = rep(1, nrow(dp)),
Time=dp[,3], ### here, using the Time vector with the proper length

s = indexs)
)

stk.full <- inla.stack(stk.e, stk.p)

# form
formula <- y ~ 0 + b0 + Time + f(s,

model = spde, group = s.group,
control.group = list(model = "ar1", hyper = rprior)
)

Alok Bohara

unread,

Jul 14, 2022, 4:05:49 PM7/14/22

to R-inla discussion group

Hi Tim. Thanks for the response. I did separate the variables Time and Constant as per your suggestion, but the prediction was still made over the entire map. Then, I looked at the variable Time=dp[,3] in the prediction stack portion and suspect what the problem may be. The gridded map has been repeated three times (for three time periods) with time cohorts 1, 2, 3, entering in dp as the third column.

dp <- rbind(cbind(dp, 1), cbind(dp, 2), cbind(dp, 3))

Then the use of this code in the prediction part of the stack function, assumes the Time variable is available for the entire map.

effects = list(b0 = rep(1, nrow(dp)),
Time=dp[,3], ### here, using the Time vector with the proper length

I suspect that the Inla is using the time coefficient and applying it to the entire map as 1,2,3 are available everywhere (dp). I don't know how useful is this prediction based on the time trend alone.

What if I had the number of factories in some of the other cities with missing PM10? I am assuming that I have to add the factory variables in the dp. How ???

###Squared grid
bb <- st_bbox(m) #$ m is border of mainland Spain

x <- seq(bb$xmin - 1, bb$xmax + 1, length.out = 50)
y <- seq(bb$ymin - 1, bb$ymax + 1, length.out = 50)
dp <- as.matrix(expand.grid(x, y)) # plot(dp); this gridded Spain will be repeated and stacked three times later

p <- st_as_sf(data.frame(x = dp[, 1], y = dp[, 2]),
coords = c("x", "y")
)
st_crs(p) <- st_crs(25830)
ind <- st_intersects(m, p)
dp <- dp[ind[[1]], ]

dp <- rbind(cbind(dp, 1), cbind(dp, 2), cbind(dp, 3))

Thanks

Alok

Tim Meehan

unread,

Jul 14, 2022, 6:55:01 PM7/14/22

to R-inla discussion group

Hello Alok,

Depending on the specifics of the data, the models y ~ b0 + Time + f(space, model=spde, group=time.group, ...) and y ~ b0 + f(space, model=spde, group=time.group, ...) will often give you very

similar predictions, right? The difference is whether you estimate a global linear time effect along with the spatiotemporal field that captures the spatiotemporal residuals (model 1) or whether the spatiotemporal field captures the full spatiotemporal signal (model 2). So comparing predictions from these two models might not teach us much, especially given that when Time is entered into the model as a global linear effect (model 1) the effect is close to and not different from zero.

So I propose we do one of two things. We can expand the Spain PM2.5 example to add some other meaningful and easily attainable covariate (e.g., population density). Once you attain the data and extract the variable, I can help you work it into the model and stacks. Otherwise, we can touch base when you have your factory analysis ready to go, with data and model in place. What do you think?