Incorporating sample size in spatial INLA-SPDE model with point data.

Mikin

unread,

Jul 14, 2022, 2:41:39 AM7/14/22

to R-inla discussion group

Hello,

I am fitting a model to observe & predict the spatial effect on a variable of interest computed as longitude & latitude point data. Each point has a varied sample size ranging from 100s to 1000s.

I initially thought I could incorporate sample size as a covariate in the model,

[formula = y ~ -1 +Intercept + samplesize + f(spatial.field, model = SPDE)]

but I soon realised that this is wrong as any covariates incorporated in the model must have values at the unsampled coordinates to enable predictions to be made.

I have also considered duplicating every data entry by its respective sample size and then jittering the overlapped points, but given the sample size numbers this would be too computationally intensive.

Is there an alternative way to incorporate sample size into my analysis so that it controls the weight/intensity of each observation point?

Any help is highly appreciated.

Reproducible sample code of model without sample size:
# Sample data
library(data.table)
dat <- data.table(ADMIN = c("China", "China", "China", "China", "China"),
entry_id = c("1", "2", "3", "4", "5"),
y = c("1", "10", "20", "70", "90"),
sample_size = c("20", "100", "1000", "500", "250"),
lon = c("118.1568", "118.066", "118.0356", "113.3613", "112.9701"),
lat = c("26.53895", "22.89347", "23.13714", "23.60236", "23.62635"))

dat$y <- as.numeric(dat$y)

# Obtain country polygon
library(rnaturalearth)
library(rnaturalearthdata)
library(rnaturalearthhires)
library(sf)
countries <- countries10
countries <- st_as_sf(countries)
countries <- countries[c("ADMIN", "geometry")]

dat <- merge(dat, countries, by = c("ADMIN"))

# Extract coordinates of sampled locations
Sampled <- data.frame(lon = dat$lon, lat = dat$lat)

# Create Raster of all polygon locations
library(raster)
library(stars)
Raster <-st_rasterize(st_as_sf(dat$geometry))
write_stars(Raster, "Raster.tif")
Raster <- raster('Raster.tif')

# Extract coordinates of unsampled locations
Unsampled <- rasterToPoints(Raster, spatial = TRUE)
Unsampled <- data.frame(lon = Unsampled$x, lat = Unsampled$y)

# Extract coordinates of country boundary
library(INLA)
Boundary <- as(dat$geometry[[1]], "Spatial") %>% inla.sp2segment()

# Mesh construction
max.edge <- diff(range(as.numeric(Sampled[,1])))/(3*5)
bound.outer <- diff(range(as.numeric(Sampled[,1])))/3
Mesh <- inla.mesh.2d(boundary = Boundary,

loc = Sampled,
max.edge = c(1,3)*max.edge,
offset = c(max.edge, bound.outer),
cutoff = 0.65)

# Plot mesh
plot(Mesh,asp = 1, main = "")

# Define Projector Matrices
Aest <- inla.spde.make.A(mesh = Mesh, loc = cbind(Sampled$lon, Sampled$lat))
Apre <- inla.spde.make.A(mesh = Mesh, loc = cbind(Unsampled$lon, Unsampled$lat))

# Define SPDE
SPDE <- inla.spde2.matern(mesh = Mesh, alpha = 2)

# Define Spatial Index
SIndex <- inla.spde.make.index(name = "spatial.field",

n.spde = SPDE$n.spde,

n.group = 1,

n.repl = 1)

# Define stack
Stackest1 <- inla.stack(data = list(y = dat$y),
A = list(Aest, 1),
effects = list(spatial.field = SIndex, Intercept = rep(1, nrow(Sampled))),
tag = "est")

Stackpre1 <- inla.stack(data = list(y = NA),
A = list(Apre, 1),
effects = list(spatial.field = SIndex, Intercept = rep(1, nrow(Unsampled))),
tag = "pre")

Stack1 <- inla.stack(Stackest1, Stackpre1)

# Formula
Formula1 <- (y ~ -1 + Intercept + f(spatial.field, model = SPDE))

# Fit model
Model1 <- inla(Formula1,
family = "gaussian",
data = inla.stack.data(Stack1),
control.predictor = list(A = inla.stack.A(Stack1), compute = TRUE),
control.compute = list(cpo = TRUE, dic = TRUE, waic = TRUE,

return.marginals.predictor = TRUE, mlik = TRUE, config = TRUE))

# Define index to easily extract predictions
Index <- inla.stack.index(stack = Stack1, tag = "pre")$data

Mean_Frequency <- Model1$summary.fitted.values[Index, "mean"]

# Raster predictions
MeanRaster1 <- rasterize(x = Unsampled,
y = Raster,
field = Mean_Frequency,
fun = mean,
crs = CRS("+init=epsg:4326 +proj=longlat +ellps=WGS84

+datum=WGS84 +no_defs +towgs84=0,0,0"))

# Create color palette
library(viridis)
library(leaflet)
Pal <- colorNumeric("viridis", domain = c(0, 100),

na.color = "transparent", reverse = TRUE)

# Plot predicted mean
leaflet() %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addRasterImage(MeanRaster1, colors = Pal, opacity = 0.5) %>%
addLegend("bottomright", pal = Pal, values = values(MeanRaster1), title = "Mean") %>%
addScaleBar(position = c("bottomleft"))

Mikin

unread,

Jul 14, 2022, 7:32:49 AM7/14/22

to R-inla discussion group

I think I figured it out but just to confirm. Do I just have to modify the code with the inclusion of Ntrials as demonstrated below in blue?

# Define stack
Stackest1 <- inla.stack(data = list(y = dat$y, Ntrials = dat$sample_size),

A = list(Aest, 1),
effects = list(spatial.field = SIndex, Intercept = rep(1, nrow(Sampled))),
tag = "est")

# Fit model
Model1 <- inla(Formula1,

data = inla.stack.data(Stack1),

Ntrials = (inla.stack.data(Stack1))$Ntrials,

control.predictor = list(A = inla.stack.A(Stack1), compute = TRUE),
control.compute = list(cpo = TRUE, dic = TRUE, waic = TRUE,

return.marginals.predictor = TRUE, mlik = TRUE, config = TRUE))

Finn Lindgren

unread,

Jul 14, 2022, 8:54:16 AM7/14/22

to Mikin, R-inla discussion group

Yes, if you're modelling Binomial counts, with family="binomial", then
you need to construct Ntrials in that way.
Finn

> --
> You received this message because you are subscribed to the Google Groups "R-inla discussion group" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to r-inla-discussion...@googlegroups.com.
> To view this discussion on the web, visit https://groups.google.com/d/msgid/r-inla-discussion-group/cad6fe08-2b22-400d-97ad-f6b8f82997b7n%40googlegroups.com.

--
Finn Lindgren
email: finn.l...@gmail.com

Finn Lindgren

unread,

Jul 14, 2022, 8:56:26 AM7/14/22

to Mikin, R-inla discussion group

Your initial code used family="gaussian", but since you explicitly
have count data with known sample size, it seems a Binomial model will
likely make more sense.

Finn

Mikin

unread,

Jul 14, 2022, 9:14:53 AM7/14/22

to R-inla discussion group

Hi Finn, thank you for your reply.
The variable of interest (y) being represented as points are allele frequencies (value range of 0 to 100%) not counts so I am not sure if the incorporation of the Ntrials is correct, but I can't think of a different way to incorporate the sample size.

Is it still correct?

Finn Lindgren

unread,

Jul 14, 2022, 1:26:25 PM7/14/22

to Mikin, R-inla discussion group

Aha, no then Ntrials doesn’t help, since that’s specifically for binomial data. However, if you have the proportions as well as the N values, you could convert it back to actual counts, and then apply the binomial model?

Finn

On 14 Jul 2022, at 14:14, Mikin <mika...@gmail.com> wrote:

Hi Finn, thank you for your reply.

To view this discussion on the web, visit https://groups.google.com/d/msgid/r-inla-discussion-group/8763c177-018e-4339-9dff-a77bde01102fn%40googlegroups.com.

Message has been deleted

Tim Meehan

unread,

Jul 14, 2022, 7:08:27 PM7/14/22

to R-inla discussion group

This might be way off, but another way to think about sample size is as a nuisance variable, like sampling effort. Maybe you could add sample size as a variable in model estimation but use a standard sample size of X for predictions? If the sample size effect was modeled as asymptotic, then you could pick a big number out on the asymptote?

Mikin

unread,

Jul 15, 2022, 3:15:57 AM7/15/22

to R-inla discussion group

Thank you both for your suggestions.

I'm not sure transforming allele frequency to counts is feasible as the population number each was derived from is not available for all, but I will keep it in mind as a back up option if all else fails.

I have attempted utilizing sample size as a covariate and computed a fixed value for all unsampled locations, but as you'll see from the figures below it does not yield a good outcome.

Again thank you both, I will keep reading to see if an alternative solution is possible.

[Top image: No sample size incorporated; Bottom image: Fixed sample size value included at unsampled locations]

No sample size.png

Helpdesk

unread,

Jul 27, 2022, 3:46:13 PM7/27/22

to Mikin, R-inla discussion group

there is argument 'scale' (see inla.doc("gaussian")), or maybe
inla.doc("agaussian")

> --
> You received this message because you are subscribed to the Google
> Groups "R-inla discussion group" group.
> To unsubscribe from this group and stop receiving emails from it, send
> an email to r-inla-discussion...@googlegroups.com.
> To view this discussion on the web, visit

> https://groups.google.com/d/msgid/r-inla-discussion-group/7cd1da3a-df20-4a1b-83f3-d692cd71e947n%40googlegroups.com
> .

--
Håvard Rue
he...@r-inla.org

Mikin

unread,

Jul 28, 2022, 1:13:23 AM7/28/22

to R-inla discussion group

Thank you Håvard, I'll check it out!!!

Reply all

Reply to author

Forward