SPDE and Prediction in Zero-Inflated Poisson

Mark

Apr 20, 2020, 10:36:18 AM4/20/20
to R-inla discussion group
Hello,
There are some very helpful threads floating around here on zero-inflation prediction, but I have a few items I could not resolve by reviewing those. I’ve attempted to construct a zero-inflated Poisson model, and I’ve created a fully reproducible example using a dummy dataset. The R file should be "click and run".

Here are some questions I’m looking for advice on:
1. The data is originally in decimal-degree lat/lon. The area of interest spans several UTM zones, which rules out UTM as a projection option. I understand UTM to be one of the better projections for measuring distance, and I read in the SPDE gitbook that lat/lon is inappropriate for spatial models (with Gaussian random fields). Additionally, following your “Bayesian Spatial Modelling with R-INLA” example paper, I decided to work directly on the spherical mesh. My question is this: is this the most accurate way to calculate the Gaussian spatial random field, and is the general process I used to accomplish it correct?

2. In my code, there are 882 mesh nodes, 1,000 training data points, and 1,000 test data points, yet there are 4,000 fitted values (see: length(INLA_model$summary.fitted.values$mean)), and the number seems to vary (the original data has around 4,442 fitted values despite essentially the same setup). Notably, the predictions for some of these mystery points are quite large. Perhaps they are for mesh nodes with no covariates, confusing the model. The challenge is that this makes it difficult to specify the “link” argument, since I don’t know beforehand how long the prediction vector is. How do I work out where these extra fitted values are coming from? (See the short check after this list.)

3. Most importantly, I seem to be struggling to get fitted values from the model, or at least ones that make sense. I’ve included tables/plots demonstrating the issue in the final two code chunks. Please note that I’m using the “naïve” approach for now to get approximate results (I hope); I will follow examples on this forum to get more theoretically valid predictions.
a. In other questions I’ve read on this forum, people add link=1 for the test set because the default is the identity link when the response is missing. But my responses appear to be in terms of counts and do not change when I add link=c(rep(1,1000),rep(NA,1000)). How is this possible?
b. It would seem that 1’s are “inflated”. Perhaps it is a bad model fit, but my guess is that I have made an error. Any idea why this might be?
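
For reference, here is the short check mentioned in question 2, run against the objects defined in the attached script (nothing here changes the model, it only inspects the output):

train_index <- inla.stack.index(INLA_stacks, "train")$data
length(INLA_model$summary.fitted.values$mean)    #total number of fitted values
length(train_index)                              #how many of them map to the training rows
head(rownames(INLA_model$summary.fitted.values)) #the row labels hint at what the remainder are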

Note: the following are items I am planning to address, so there is no need to address them now, unless of course you think they are related to my issues above:
- Lack of prior specification, especially for the Matérn function
- The mesh needs refinement (its extent is needlessly large)

INLA is an amazing package; I had been unsuccessfully attempting to fit spatial models to this large dataset (this is only a sample) for several months using other packages. It’s wonderfully fast and stable! If there is anything I can do to make these questions easier to answer, I’m happy to oblige.

Regards,
Mark

library(tidyverse)
library(INLA)
library(rnaturalearth)
library(inlabru)
library(gridExtra)
library(sf)

#Read in data
INLA_train_masked<-read_csv("https://raw.githubusercontent.com/millermc38/Research/master/INLA_train_masked.csv")%>%
  select(-X1)%>%
  mutate(x15=round(x15))
#You should be able to ignore the warnings safely
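
#INLA_test_masked is used further below but is never created in this script as posted.
#It is assumed to be read and cleaned the same way as the training set; the file path
#below is a placeholder, so adjust it before running:
#INLA_test_masked<-read_csv("INLA_test_masked.csv")%>%select(-X1)%>%mutate(x15=round(x15))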

#Get the Wisconsin sf object, extract its coordinates, and map them directly onto the globe
wisconsin_sf<-st_as_sf(ne_download(scale = 50, type = "states"))%>%
  filter(code_hasc=="US.WI")
coords<-st_coordinates(wisconsin_sf)
coords_cartesian = inla.mesh.map(coords, projection = "longlat")

# Construct mesh for the state
boundary.loc <- SpatialPoints(as.matrix(coords_cartesian))
boundary <- list(
  inla.nonconvex.hull(coordinates(boundary.loc), 0.012),
  inla.nonconvex.hull(coordinates(boundary.loc), 0.0265))
mesh <- inla.mesh.2d(boundary=boundary,
                     max.edge=c(0.0044, 0.0266),
                     min.angle=c(20, 18),
                     max.n=c(48000, 16000), ## Safeguard against large meshes.
                     max.n.strict=c(128000, 128000), ## Don't build a huge mesh!
                     cutoff=0.001, ## Filter away adjacent points.
                     offset=c(0.012, 0.0265)) ## Offset for extra boundaries, if needed.
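
#Quick sanity check on the mesh size (this is where the 882 nodes mentioned in question 2 come from)
mesh$n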

#Review the plot. It's larger than it needs to be--I'll fix that  later.
ggplot()+
  gg(mesh)+#This uses inlabru package
  geom_point(aes(x=coords_cartesian[,1],y=coords_cartesian[,2]))+
  labs(title="Constrained refined Delaunay triangulation",
       x="3D Cartesian X",y="3D Cartesian Y")+
  theme_bw()

#Let's make a mesh and an index
spde <- inla.spde2.matern(mesh=mesh, alpha=2)
s_index = inla.spde.make.index(name = "spatial.field", n.spde = spde$n.spde)
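
#A possible route for the "prior specification" to-do noted above: the PC-prior
#Matern parameterisation. The range/sigma values below are placeholders (distances
#on the unit sphere are in radians), so this alternative is left commented out:
#spde <- inla.spde2.pcmatern(mesh=mesh, alpha=2,
#                            prior.range=c(0.05, 0.5), #P(range < 0.05) = 0.5
#                            prior.sigma=c(1, 0.01))   #P(sigma > 1) = 0.01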

#Identify lat/lon in the training set, convert them to cartesian, make A Matrix, make train stack
coords_train <- as.matrix(INLA_train_masked%>%select(x16,x17))
coords_cartesian_train = inla.mesh.map(coords_train, projection = "longlat")
A_matrix_train = inla.spde.make.A(mesh=mesh, loc=coords_cartesian_train)
train_stack <- inla.stack(data = list(x15 = INLA_train_masked$x15), 
                          A = list(A_matrix_train, 1),
                          effects = list(c(s_index, list(intercept = 1)), 
                                         list(cov = INLA_train_masked%>%select(-c(x15:x17)))), #Remove lat, lon, response
                          tag = "train")

#Same as above, but now for the test set
coords_test <- as.matrix(INLA_test_masked%>%select(x16,x17))
coords_cartesian_test = inla.mesh.map(coords_test, projection = "longlat")
A_matrix_test = inla.spde.make.A(mesh=mesh, loc=coords_cartesion_test)
test_stack <- inla.stack(data = list(x15 = NA), 
                         A = list(A_matrix_test, 1),
                         effects = list(c(s_index, list(intercept = 1)), 
                                        list(cov = INLA_test_masked%>%select(-c(x15:x17)))), #Remove lat, lon, response
                         tag = "test")

#Combine train/test stacks
INLA_stacks <- inla.stack(train_stack, test_stack)

#Model formula. Basically: x15 ~ x1+...+x14+spatial.field (x16 is lat, x17 is lon, so leave those off)
inla_formula<-str_c("-1","intercept",
                    paste0(names(INLA_train_masked)[-c(15:17)],collapse = "+"),
                    "f(spatial.field, model=spde)",
                    sep = "+")%>%
  str_c("x15~",.)%>%
  as.formula()

#Model. I have four logical processors, and it usually takes around 45 seconds to run this block
INLA_model <- inla(inla_formula, 
                   family = "zeroinflatedpoisson1",
                   data = inla.stack.data(INLA_stacks, spde=spde),
                   control.predictor=list(A=inla.stack.A(INLA_stacks),
                                          compute=TRUE),
                   control.compute=list(config = TRUE)) #to try inlabru predict function
#link=c(rep(1,1000),rep(NA,1000)) does not seem to change the outcome
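
#One way to build the link vector without hard-coding its length: take the response
#from the combined stack and request link 1 (the family's log link) wherever it is NA.
#(A sketch; pass link_vec through control.predictor in the inla() call above.)
stack_response<-inla.stack.data(INLA_stacks)$x15
link_vec<-rep(NA,length(stack_response))
link_vec[is.na(stack_response)]<-1
#e.g. control.predictor=list(A=inla.stack.A(INLA_stacks), compute=TRUE, link=link_vec)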

#Store theta
zero_prob<-summary(INLA_model)[["hyperpar"]][1,1]
one_m_zero_prob<-(1-zero_prob)

#Extract "naive" fitted values for training set, create various summaries
train_index = inla.stack.index(INLA_stacks, "train")$data
train_fit<-(INLA_model$summary.fitted.values$mean[train_index])*one_m_zero_prob
cor(INLA_train_masked$x15,
    train_fit)
train_fit%>%summary
table(round(train_fit))
grid.arrange(ggplot()+geom_histogram(aes(x=train_fit),binwidth = 1),
             ggplot()+geom_histogram(data=INLA_train_masked,aes(x=x15),binwidth = 1))

#Extract "naive" fitted values for testing set, create various summaries
test_index = inla.stack.index(INLA_stacks, "test")$data
test_fit<-(INLA_model$summary.fitted.values$mean[test_index])*one_m_zero_prob
cor(INLA_test_masked$x15,
    test_fit)
test_fit%>%summary
table(round(test_fit))
grid.arrange(ggplot()+geom_histogram(aes(x=test_fit),binwidth = 1),
             ggplot()+geom_histogram(data=INLA_test_masked,aes(x=x15),binwidth = 1))
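
#A possible starting point for the "more theoretically valid predictions" mentioned in
#question 3, using the config=TRUE already set above: draw posterior samples and inspect
#how the stacked linear predictor is labelled before extracting anything from them.
post_samples<-inla.posterior.sample(200, INLA_model)
head(rownames(post_samples[[1]]$latent)) #see how the linear predictor rows are named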



ZIP INLA Example.R

Finn Lindgren

Apr 20, 2020, 10:51:01 AM4/20/20
to Mark, R-inla discussion group
Hi Mark,
I can only answer some of this at this time:

1) If a stationary field on (part of) the globe is a reasonable model, then yes, building it directly on the sphere is a good idea. You can/should take advantage of the fact that the mesh routines know about CRS information, so instead of transforming coordinates yourself you can supply them in whatever format you have (although with several UTM inputs, transforming to a common coordinate system first is practical, which it seems you have done, to longlat?).

You then also need to tell inla.mesh.2d what CRS to build the mesh in, and there is a shortcut defined for that: crs=inla.CRS("sphere"). That makes it build a sphere with unit radius, so all distances will be in radians. You have to take that into account when specifying max.edge values; divide actual metres by the Earth radius (around 6.4e6 metres) to get the corresponding radian distance.

Just as you do now, you can supply nonconvex boundary information; just make sure to set a CRS when you build those, so it can transform them onto the sphere.
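
A minimal sketch of that workflow (the CRS string, edge lengths, and the lonlat_matrix object are placeholders, not tested against your data):

library(INLA)
library(sp)
R_earth <- 6.371e6                                   # metres; used to convert metres to radians
loc_ll <- SpatialPoints(lonlat_matrix,               # two-column (lon, lat) matrix you already have
                        proj4string = CRS("+proj=longlat +datum=WGS84"))
mesh_sphere <- inla.mesh.2d(loc = loc_ll,
                            max.edge = c(50e3, 200e3) / R_earth,  # metres expressed as radians
                            cutoff = 10e3 / R_earth,
                            crs = inla.CRS("sphere"))             # build on the unit-radius sphere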

Finn



--
Finn Lindgren
email: finn.l...@gmail.com

Mark

Apr 21, 2020, 4:43:50 PM4/21/20
to R-inla discussion group
Finn,

Please hold off on answering my other questions. I realized that my predicted/fitted values, at least for the training set, are correct (if I am reading the documentation right) and don't need to be multiplied by 1 - theta (I was following another thread that I thought asked the same question as mine, but it did not). Really, all that remains for me to figure out is why the model fits so many ones, so I'm trying a hurdle model to see whether it is a model specification problem.

Regards,

Mark

Rubén Esteban García Gómez

Jul 3, 2022, 3:57:52 PM7/3/22
to R-inla discussion group
I have the same question, since I have the same problem with many more fitted values than mesh points. I have also fitted a hurdle model. What could the problem be?
