Integrate over missing values


Robert Verity

Aug 31, 2021, 2:53:08 PM
to R-inla discussion group

Hi,

I'm new to R-INLA and finding it great so far! I have a specific problem that involves missing data, and I'm not sure how to implement the model; hoping someone can help.

I'm trying to fit a spatial geostatistical model of prevalence based on YES/NO observations at particular locations, all measured at the same time. I'm using SPDE2 combined with a logit link function, and then using a Binomial distribution with N=1. This all works great.

The problem is that for about 10% of the observations I don't have a YES/NO observation; instead I have a measure of confidence of a YES result that lies in [0,1]. When thinking about the model I'm therefore imagining that the true observation is missing, and could be either YES or NO with a probability given by this (known) confidence. If we call the confidence q, then the (unobserved) data is YES with probability q and NO with probability (1-q). So I want to model these particular datapoints as Bernoulli(p*q), where p is the fitted prevalence surface.
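A minimal sketch of this data-generating process in R (all names are illustrative; `eta` just stands in for whatever spatial linear predictor the model would produce):

```r
# Sketch of the setup described above: prevalence p at each location, and
# for the "special" points the recorded value is YES with probability p*q,
# where q is the known confidence.
set.seed(1)
n   <- 1000
eta <- rnorm(n)                  # stand-in for the spatial linear predictor
p   <- plogis(eta)               # true prevalence at each location
q   <- runif(n, 0.5, 1)          # known confidence of a YES, in [0, 1]
y   <- rbinom(n, size = 1, prob = p * q)   # observed YES/NO
```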

I have no idea how to implement this. Any help greatly appreciated.

Bob

Helpdesk

Aug 31, 2021, 3:12:21 PM
to Robert Verity, R-inla discussion group

I do not think you can do this with the current version of R-INLA. I do not quite understand what you write: first that the data is YES with probability q, and later that you want to 'model these particular datapoints as Bernoulli(p*q)'.

If you can write down the math in more detail, I would be able to tell for sure if this is doable, or if we can make it doable.

H

--
Håvard Rue
he...@r-inla.org

Finn Lindgren

Aug 31, 2021, 3:22:04 PM
to Helpdesk, Robert Verity, R-inla discussion group
Hi both,
When q is known and between 0 and 1, this would just require a scaled logit-link function, so that instead of the ordinary
p=plogis(eta)
used for the certain observations, one would need
p=q*plogis(eta)
for the “special” observations. (H: I think such a fixed-scaling version is already implemented? Or maybe that’s just for special models.)

When q is _unknown_, the problem is a Bernoulli version of thinned Poisson models, and the resulting nonlinear eta-predictor can be implemented with inlabru. But the Bernoulli version of this is closer to an open research problem, so if the version above is sufficient, that should be a lot easier to make happen!
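For intuition, the two inverse links can be sketched in plain R (a toy function, not the INLA internals):

```r
# Ordinary vs scaled inverse logit link; q = 1 recovers the ordinary one,
# so the "certain" observations are just the special case q = 1.
inv_link <- function(eta, q = 1) q * plogis(eta)

inv_link(0)           # ordinary: 0.5
inv_link(0, q = 0.8)  # scaled:   0.4
```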

Finn


Helpdesk

Aug 31, 2021, 3:24:31 PM
to Finn Lindgren, Robert Verity, R-inla discussion group
Yes, this is what I hoped would be the case, but then I got a little confused over some of the details in the writing. A scaled version, as below, is not implemented, but can easily be added.

On Tue, 2021-08-31 at 20:21 +0100, Finn Lindgren wrote:
> p=q*plogis(eta)
> for the “special” observations. (H: I think such a fixed-scaling
> version is already implemented? Or maybe that’s just for special
> models.)


Robert Verity

Aug 31, 2021, 3:58:57 PM
to R-inla discussion group
Thanks all.

I can confirm that q is known for all points (although not necessarily the same value everywhere), so it should be the scaled logit-link function that I need, as described by Finn. Then I guess I treat all observations the same and just set q = 1 for the non-special observations. Would you mind helping me with how to implement p = q*plogis(eta)?

This is what I currently have in case it helps:

res <- inla(formula = n_pos ~ 0 + b0 + f(s, model = spde),
            family = "binomial",
            data = inla.stack.data(stk_full),
            Ntrials = 1,
            control.predictor = list(
              link = 1,
              compute = TRUE,
              A = inla.stack.A(stk_full)
            ),
            control.family = list(link = "logit"),
            control.compute = list(config = TRUE)
)

Thanks!
Bob

Helpdesk

Aug 31, 2021, 4:01:26 PM
to Robert Verity, R-inla discussion group
It has to be added in the code (by us).

Finn Lindgren

Aug 31, 2021, 4:45:38 PM
to Robert Verity, R-inla discussion group
Hi,

I expect having q be a different value for each observation may complicate it, as link functions are usually the same across all observations of a certain type, but I'll let Haavard answer that since he's the one who'd need to implement it in the internal inla code.

Finn


Finn Lindgren

Aug 31, 2021, 4:59:28 PM
to Robert Verity, R-inla discussion group
The inlabru alternative I mentioned would be to use an ordinary logit transformation on the product pq itself:
pq = p*q = q * plogis(eta)
qlogis(pq) = qlogis(q * plogis(eta))

The right hand side is nonlinear in eta, so this can’t be done in regular inla, but inlabru does iterated linearisation that might be able to handle the nonlinearity.
In inlabru, you define the components separately from the predictor, and the predictor expression itself is just an ordinary R expression, so the “formula” would look something like

obs ~ qlogis(q * plogis(Intercept + other + components))

See https://inlabru-org.github.io/inlabru/articles/web/2d_lgcp_distancesampling.html for a distance sampling example that also uses a nonlinear predictor in inlabru. That type of model is a thinned Poisson point process, where the thinning probability is also estimated, leading to a similar nonlinear construction to what you need.

There are some numerical pitfalls with the plain formula/predictor expression above, so if you want to try this method, let me know if you run into numerical issues (I have some partial solutions to those).
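To see the kind of numerical behaviour to watch for, the nonlinear predictor can be evaluated in plain R (a toy function, not the inlabru internals):

```r
# The transformed predictor used in the formula above.
pred <- function(eta, q) qlogis(q * plogis(eta))

pred(0, q = 0.5)             # qlogis(0.25), about -1.099
# As eta grows, p*q approaches q (not 1), so the transformed predictor
# saturates at qlogis(q) instead of growing with eta -- a flat region
# that linearisation-based fitting has to cope with:
pred(c(5, 10, 20), q = 0.5)  # all close to qlogis(0.5) = 0
```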

Finn


Jonathan Judge

Sep 1, 2021, 1:07:34 PM
to Finn Lindgren, Robert Verity, R-inla discussion group
This is quite interesting, as the model sounds like the "conditional binomial" model described on Wikipedia, which I haven't really seen implemented in regression front-ends: https://en.wikipedia.org/wiki/Binomial_distribution#Conditional_binomials

The framework is important, I think, because despite the caricature of Bernoulli trials as mere coin flips or a super-controlled laboratory experiment, in reality there is often an initial sorting process that is Bernoulli, followed by the actual Bernoulli process of interest.

Per Finn's comment, I do see that there are experimental latent effects for sigmoid activation via a predictor, and what appears to be the inverse. I was wondering what the expected use case or cases might be for those, and I'm not sure whether that would fit the bill here or not.

Jonathan

Finn Lindgren

Sep 1, 2021, 3:09:33 PM
to Jonathan Judge, Robert Verity, R-inla discussion group
Hi Jonathan,

Yes, that's correct: the thinned Bernoulli construction is just the special case of the two-step binomial construction on the Wikipedia page, and it's the same kind of thinning mechanism as in thinned Poisson models.

From a practical estimation point of view, the problem with the Binomial/Bernoulli case is that the logit of the product of p and q is not the sum of the logits of p and q separately, whereas in the Poisson case the log-link is nice, since the log of a product is the sum of the logs; that makes it easy to write such models in the standard generalised linear model framework. For known (but varying) values of q, one just needs more flexibility in what the link function is allowed to be: the q-logit link needed is log(pq) - log(1-pq), but normally all observations use the same link function, which makes implementation easier. The theory doesn't care quite as much about that (though non-iid observations always make asymptotics more difficult, if one cares about that).

The approach in inlabru is to assume that the non-standard nonlinear predictor expression for the observation probability is "linear enough" (near the posterior mode, where "near" depends on the posterior variance) that linearising at the posterior mode gives an adequate approximation to the full posterior distribution, which allows the use of inla to compute that approximation.

One of my main practical concerns is when the data contradict the given q-values, by having "too many" 1-observations compared with what q indicates the maximum possible large-sample proportion should be. In a maximum likelihood setting that would send p to 1, and the linear predictor to +infinity. The Bayesian priors would normally prevent that, but in practice it might still be an issue.
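A tiny numeric illustration of that boundary problem (maximum likelihood only, no priors; the names here are illustrative):

```r
# With known q = 0.5, the expected proportion of 1s can be at most q.
# Data with more 1s than that push the ML estimate of p to the boundary.
q <- 0.5
y <- c(rep(1, 7), rep(0, 3))   # 70% ones, but p*q <= 0.5 for any p in [0, 1]
negll <- function(p) -sum(dbinom(y, size = 1, prob = p * q, log = TRUE))
optimize(negll, interval = c(1e-6, 1 - 1e-6))$minimum  # essentially 1
```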

Finn
--
Finn Lindgren
email: finn.l...@gmail.com

Helpdesk

Sep 1, 2021, 4:05:14 PM
to Finn Lindgren, Jonathan Judge, Robert Verity, R-inla discussion group

The new testing version (under R-4.1) adds this 'q' scaling with family = "xbinomial"; see inla.doc("binomial") and the simulated example therein.

Robert Verity

Sep 1, 2021, 5:12:24 PM
to R-inla discussion group
Thanks all - particularly Haavard for making those changes to the code, I really appreciate it. I've just tried it out with xbinomial and it works great :)

My use-case is that I have vaccination status (a binary variable) for a spatially referenced population, and I am interested in identifying "cold spots" of vaccination. Some individuals don't have a vaccination card, but we do have a separate measure of "confidence" (i.e. probability) that they are vaccinated, hence the need for q-values. A more common use-case, I would have thought, would be prevalence estimation with imperfect test sensitivity, where observed values are Binomial/Bernoulli with probability p*sensitivity, and p is the true prevalence. If different tests are used in different groups then q would have to vary (although probably limited to a small number of levels).

For what it's worth, I think I also got the same approach working using inlabru following the guidance above. I've pasted my code below, cribbing off the examples in the inlabru tutorials. I'd be interested to know if this is/isn't correct for future reference.

Best wishes,
Bob


# setup
library(INLA)
library(inlabru)
library(tidyverse)
library(magrittr)
library(ggplot2)

# ------------------------------------------------------------------
# GENERATE TRUE PREVALENCE SURFACE

# define boundaries and create fine mesh
bnd <- inla.mesh.segment(cbind(c(0, 10, 10, 0),
                               c(0, 0, 10, 10)))
mesh_fine <- inla.mesh.2d(boundary = bnd, max.edge = 0.2)

# specify true SPDE model
spde_fine <- inla.spde2.pcmatern(mesh_fine,
                                 prior.sigma = c(1, 0.01),
                                 prior.range = c(1, 0.01))

# get true precision matrix from model
true_range <- 4
true_sigma <- 1
true_Q <- inla.spde.precision(spde_fine,
                              theta = log(c(true_range, true_sigma)))

# convert to true SD
true_sd <- diag(inla.qinv(true_Q))^0.5

# sample the true field
true_field <- inla.qsample(1, true_Q)[,1]

# get true field into dataframe and apply logistic transform to get true
# prevalence
truth <- expand.grid(lon = seq(0, 10, length = 100),
                     lat = seq(0, 10, length = 100)) %>%
  dplyr::mutate(true_field = inla.mesh.project(mesh_fine,
                                               loc = cbind(lon, lat),
                                               field = true_field),
                true_prev = plogis(true_field))

# change format
coordinates(truth) <- c("lon", "lat")
truth <- as(truth, "SpatialPixelsDataFrame")

# plot true prevalence surface
pl_truth <- ggplot() +
  gg(truth, mapping = aes_string("lon", "lat", fill = "true_prev")) +
  coord_equal() +
  scale_fill_distiller(palette = "RdBu") +
  ggtitle("True prevalence")
pl_truth

# ------------------------------------------------------------------
# DRAW OBSERVED DATA

# choose sampling locations uniformly within domain and get true prevalence at
# these locations
n <- 200
mydata <- data.frame(lon = runif(n, 0, 10),
                     lat = runif(n, 0, 10)) %>%
  dplyr::mutate(true_field = inla.mesh.project(mesh_fine,
                                               loc = cbind(lon, lat),
                                               field = true_field),
                true_prev = plogis(true_field))

# contrived example of test sensitivity increasing from 0 to 1 as we move from
# left to right of domain. Draw binary observations from this
# prevalence*sensitivity combo
mydata <- mydata %>%
  dplyr::mutate(sens = lon / 10,
                pos = rbinom(n, 1, prob = true_prev*sens))
coordinates(mydata) <- c("lon", "lat")

# plot observations against true prevalence surface
pl_truth +
  geom_point(aes(x = lon, y = lat, col = as.factor(pos)),
             data = as.data.frame(mydata))

# ------------------------------------------------------------------
# FIT MODEL AND VISUALISE RESULTS

# create coarse mesh
mesh <- inla.mesh.2d(boundary = bnd, max.edge = 0.5)
#plot(mesh)

# specify SPDE model on mesh
spde <- inla.spde2.pcmatern(mesh,
                            prior.sigma = c(10, 0.01),
                            prior.range = c(1, 0.01))

# specify model components
cmp <- ~ mySPDE(main = coordinates, model = spde) + Intercept(1)

# specify model formula
#form <- pos ~ mySPDE + Intercept
form <- pos ~ qlogis(sens * plogis(mySPDE + Intercept))

# fit model
t0 <- Sys.time()
fit <- bru(components = cmp,
           data = mydata,
           family = "binomial",
           Ntrials = 1,
           formula = form)
Sys.time() - t0

# predict on grid
t0 <- Sys.time()
pix <- pixels(mesh, nx = 200, ny = 200)
pred <- predict(fit, pix,
                ~ plogis(mySPDE + Intercept))
Sys.time() - t0

# plot fitted surface against data
pl_est <- ggplot() +
  gg(pred, mapping = aes_string("x", "y", fill = "mean")) +
  geom_point(aes(x = lon, y = lat, col = as.factor(pos)),
             data = as.data.frame(mydata)) +
  coord_equal() +
  scale_fill_distiller(palette = "RdBu") +
  ggtitle("Estimated prevalence")
pl_est

# plot SD of posterior. Should be a general trend of higher uncertainty towards
# left of domain
pred$sd <- sqrt(pred$var)
pl_sd <- ggplot() +
  gg(pred, mapping = aes_string("x", "y", fill = "sd")) +
  geom_point(aes(x = lon, y = lat, col = as.factor(pos)),
             data = as.data.frame(mydata)) +
  coord_equal() +
  scale_fill_distiller(palette = "RdBu") +
  ggtitle("SD of estimated prevalence")
pl_sd

Finn Lindgren

Sep 1, 2021, 5:17:04 PM
to Robert Verity, R-inla discussion group
From a quick glance, your inlabru version looks OK, but I'll need to revisit this a bit later to compare more. Great to have both versions to compare!
Finn
