How to handle IPS and sampler polygons when presence-only and presence-absence data have different spatial extents?

Moritz Klaassen

Apr 10, 2025, 5:20:16 AM

to R-inla discussion group

Dear INLA/bru community,

I’m working on an integrated marine model that combines:

Presence-only (PO) data (citizen science data that covers a narrow region with many presences).
Presence–absence (PA) data (collected in a broader survey across a larger region with few presences).

I’m using inlabru (via the wrapper package PointedSDMs) and am confused about how best to define integration points (IPS) or samplers when these two data types have different spatial extents. In particular:

My PO data only exist in a small corridor, so I’d like the IPS there to capture the small spatial sampling effort.
My PA data spans a larger region and is modeled as a binomial likelihood.

In normal usage, I can pass one polygon sampler to inlabru for presence-only to define the integration domain. But if the PA data covers a bigger area, how should I incorporate that extra region without artificially generating IPS for the PO dataset outside its corridor?

Should I perhaps use one large domain for the IPS that covers both datasets spatially, and then rely on a “bias term” or offset to handle the narrower PO coverage? Is that approach recommended, or is it better to give each dataset its own sampler polygon, so that the PO data are integrated only where they truly exist and the PA data “knows” about its broader region?

Finally, if I want to create a finer integration scheme (higher density of IPS) for the PO region but keep a coarser one for the rest, is there a standard inlabru approach for multi-resolution integration, or do I need to manually piece that together?

Any advice or examples on specifying IPS/samplers for integrated models with different spatial coverage would be greatly appreciated!

Thanks so much,
Moritz

Finn Lindgren

Apr 10, 2025, 11:57:11 AM

to Moritz Klaassen, R-inla discussion group

Hi Moritz,

First of all, don't try to mix the two types of data into a single observation model; keep them separate, with one bru_obs() call for each type of data.

That alone should make it much clearer what you need to do, and you can focus on one data type at a time; they can have completely unrelated integration schemes, etc., even though the underlying latent model is the same, or at least has several components in common.

The PA observation model should be straightforward, as it doesn't involve spatial integration _at all_; these are georeferenced 0/1 observations (assuming these are treated as "presence/absence at specific locations"), and _not_ a point pattern model.
The details depend a bit on whether you can treat it as point-referenced data or as region-aggregated information.

Then you can figure out what the search effort model for the PO data should be; unfortunately, "presence-only" isn't a fully defined term. It is sometimes taken to mean "we looked in this region and here is where we found things", and sometimes "we got these reports of found things". These are two very different situations, and only the first one has a clear, well-defined basic model choice (an inhomogeneous point process model).

For the second one people do all sorts of ad hoc things, including "pseudo-absences" that may or may not be meaningful or well founded; these are usually constructed so that they mimic an approximation of a point process model when only some fixed effect covariates are involved. When a full spatial random field is used, one has to use a proper point process model and integration scheme.
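
To make the contrast between the two data types concrete, here is a sketch of the two log-likelihoods. The notation is not from the thread and is assumed only for illustration: p(s) is the presence probability at location s, lambda(s) is the point process intensity, Omega is the region with PO search effort, and (s_j, w_j) are integration points with their weights.

```latex
% PA data: Bernoulli observations y_k at sites s_k; no spatial integral appears.
\log L_{\mathrm{PA}} = \sum_{k} \bigl[ y_k \log p(s_k) + (1 - y_k) \log\bigl(1 - p(s_k)\bigr) \bigr]

% PO data as an inhomogeneous Poisson process on \Omega (up to an additive constant).
% The integral is approximated numerically with integration points and weights.
\log L_{\mathrm{PO}} = -\int_{\Omega} \lambda(s)\,\mathrm{d}s + \sum_{i} \log \lambda(s_i)
  \approx -\sum_{j} w_j\,\lambda(s_j) + \sum_{i} \log \lambda(s_i)
```

Only the second expression needs an integration scheme, which is why samplers and integration points belong to the PO observation model alone.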

It sounds like your PO data is somewhere between the two situations, so if you can define a sensible "this is the polygon or collection of polygons where there was some presence-only search effort", then use that to define a point process model on that polygon/region, and feed that to a separate bru_obs(), so you don't improperly mix it with the PA model, which is not the same type of data.
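
A minimal sketch of that structure, using simulated toy data rather than anything from the thread. All object names (`region`, `corridor`, `pa_sf`, `po_sf`, `present`), the mesh settings, and the prior values are assumptions made for illustration; covariates, the monthly grouping, and any bias field are left out to keep the example short.

```r
library(sf)
library(INLA)
library(inlabru)
library(fmesher)

set.seed(1)

# Toy geometry in arbitrary planar units (no CRS): a broad survey region for
# the PA data and a narrow corridor where the PO search effort took place.
sq <- rbind(c(0, 0), c(10, 0), c(10, 10), c(0, 10), c(0, 0))
strip <- rbind(c(0, 4), c(10, 4), c(10, 6), c(0, 6), c(0, 4))
region <- st_sf(geometry = st_sfc(st_polygon(list(sq))))
corridor <- st_sf(geometry = st_sfc(st_polygon(list(strip))))

# Simulated stand-ins for the real data sets (hypothetical names and columns).
pa_sf <- st_sf(geometry = st_sample(region, 100))
pa_sf$present <- rbinom(nrow(pa_sf), size = 1, prob = 0.2)
po_sf <- st_sf(geometry = st_sample(corridor, 60))

# One mesh and one shared latent field covering the whole region.
mesh <- fm_mesh_2d(loc.domain = sq, max.edge = c(1, 3), offset = c(1, 3))
matern <- inla.spde2.pcmatern(
  mesh,
  prior.range = c(2, 0.5), # assumed: P(range < 2) = 0.5
  prior.sigma = c(1, 0.05) # assumed: P(sigma > 1) = 0.05
)

cmp <- ~ -1 +
  shared_spatial(geometry, model = matern) +
  pa_intercept(1) +
  po_intercept(1)

# PA data: point-referenced 0/1 observations, no integration scheme.
lik_pa <- bru_obs(
  formula = present ~ pa_intercept + shared_spatial,
  family = "binomial",
  data = pa_sf
)

# PO data: point process whose integration points are restricted to the
# corridor through `samplers`; the mesh only defines the integration domain.
lik_po <- bru_obs(
  formula = geometry ~ po_intercept + shared_spatial,
  family = "cp",
  data = po_sf,
  domain = list(geometry = mesh),
  samplers = corridor
)

fit <- bru(cmp, lik_pa, lik_po)
fit$summary.fixed
```

The two observation models share `shared_spatial` but have separate intercepts, and only the `"cp"` model receives `samplers` and `domain`.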

Finn

Moritz Klaassen

Apr 11, 2025, 5:50:02 AM

to R-inla discussion group

Hi Finn,

Thank you so much for your detailed explanation and suggestions! I only started exploring INLA and inlabru recently, and the advice from you and others on the forum has been invaluable in getting my integrated model running.

I ended up following the approach you recommended: keeping the PO and PA observations as separate likelihood contributions and providing an integration domain only for the presence-only data within its known sampling corridor.
At the bottom of this message, there is a summary of my current model setup and output. I would really appreciate it if you could take a quick look and let me know if everything seems sensible from a first look.

Some notes:

I am modelling a shared spatial field with AR1 temporal grouping across 12 months.
The PO dataset (monthly resolution) uses a separate bias field component (po_sf_biasField) that also has 12 monthly groups with an AR1 correlation structure.
The PA dataset covers only 1 of those months.
The environmental covariates (temperature, bathymetry, slope) each have their own 1D SPDE model component, using PC priors. Temperature changes across the months, and I managed to dynamically link the monthly values to the PO data (presences and IPS).

I guess I have two major questions remaining:

Shared Spatial Random Field + Bias Field

Does my use of a single shared_spatial field, grouped by month and modeled with an AR1 correlation, seem appropriate when the PO data spans 12 months but the PA data covers only a single month? Currently, the PA data is treated as one “slice” (month = X) of that same AR1 structure. Is this standard practice?
I’ve also introduced a separate po_sf_biasField component to capture sampling bias in the PO data. This bias field has the same monthly grouping/AR1 structure as the PO data. Is that a sensible approach to account for possible differences in search effort over space and time in a presence-only survey, or would you suggest a different strategy?

Question on priors

Currently, I’m using PC priors for the SPDE components (2D SPDE for the SRF and 1D for my covariates). For the SRF, these are roughly guided by domain knowledge about species home ranges (for example, a prior expectation that the spatial range might be on the order of tens of kilometers). However, I’m still unsure how best to refine these priors iteratively. Do you typically lean more on prior domain knowledge (“the species core habitat area is about X km, so the prior range scale should be around X”), or do you adjust iteratively based on posterior checks?

Model components

~ -1 +
  shared_spatial(main = geometry, model = shared_field, group = month,
                 ngroup = 12, control.group = list(model = "ar1")) +
  po_sf_intercept(1) +
  pa_sf_intercept(1) +
  temperature(main = temperature, model = INLA::inla.spde2.pcmatern(
    mesh = fmesher::fm_mesh_1d(
      loc = seq(-1.33369767665863, 2.07408857345581, length.out = 20),
      boundary = "free"
    ),
    alpha = 2,
    prior.range = c(0.2, 0.05),
    prior.sigma = c(0.5, 0.05),
    constr = TRUE)
  ) +
  bathymetry(main = bathymetry, model = INLA::inla.spde2.pcmatern(
    mesh = fmesher::fm_mesh_1d(
      loc = seq(-1.14557325839996, 2.52323031425476, length.out = 20),
      boundary = "free"
    ),
    alpha = 2,
    prior.range = c(0.2, 0.05),
    prior.sigma = c(0.5, 0.05),
    constr = TRUE)
  ) +
  po_sf_biasField(main = geometry, model = po_sf_bias_field, group = month,
                  ngroup = 12, control.group = list(model = "ar1")) +
  slope(main = slope, model = INLA::inla.spde2.pcmatern(
    mesh = fmesher::fm_mesh_1d(
      loc = seq(-0.944889545440674, 9.53582763671875, length.out = 20),
      boundary = "free"
    ),
    alpha = 2,
    prior.range = c(0.2, 0.05),
    prior.sigma = c(0.5, 0.05),
    constr = TRUE)
  )
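
For reference when reading the covariate components, the two-element prior.range and prior.sigma arguments of inla.spde2.pcmatern() encode tail probabilities. The symbols below are assumed notation: rho is the range and sigma the marginal standard deviation of a covariate effect, both on the scale of that covariate (the loc ranges suggest the covariates are standardised, but that is a guess).

```latex
% prior.range = c(0.2, 0.05) and prior.sigma = c(0.5, 0.05) correspond to
\mathrm{P}(\rho < 0.2) = 0.05, \qquad \mathrm{P}(\sigma > 0.5) = 0.05
```

In words, the prior places 95% of its mass on ranges above 0.2 covariate units and on standard deviations below 0.5.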

Model output
Summary of 'modISDM' object:

inlabru version: 2.12.0
INLA version: 24.12.11

Types of data modelled:
po_sf   Present only
pa_sf   Present absence

Fixed effects:
                   mean     sd  0.025quant  0.5quant  0.975quant    mode
po_sf_intercept  -6.664  1.230      -9.075    -6.664      -4.254  -6.664
pa_sf_intercept  -4.040  0.392      -4.809    -4.040      -3.271  -4.040

Random effects:
Name             Model
shared_spatial   SPDE2 model
temperature      SPDE2 model
po_sf_biasField  SPDE2 model
bathymetry       SPDE2 model
slope            SPDE2 model

Model hyperparameters:
                                mean      sd  0.025quant  0.5quant  0.975quant    mode
Range for shared_spatial      47.111  18.682      20.222    43.917      92.523  38.109
Stdev for shared_spatial       0.476   0.161       0.236     0.450       0.861   0.404
GroupRho for shared_spatial    0.847   0.090       0.615     0.867       0.960   0.905
Range for temperature          2.968   1.197       1.331     2.739       5.957   2.333
Stdev for temperature          0.514   0.161       0.275     0.488       0.903   0.439
Theta1 for po_sf_biasField    -1.289   0.627      -2.556    -1.278      -0.088  -1.230
Theta2 for po_sf_biasField    -1.969   0.445      -2.827    -1.975      -1.076  -2.001
GroupRho for po_sf_biasField   0.999   0.001       0.998     0.999       1.000   1.000
Range for bathymetry           1.277   0.602       0.483     1.154       2.797   0.943
Stdev for bathymetry           0.766   0.232       0.407     0.734       1.313   0.674
Range for slope                9.490   9.607       1.404     6.671      34.877   3.471
Stdev for slope                0.319   0.159       0.113     0.286       0.722   0.229

DIC: -41289.11, WAIC: 5992.56, Marg. log-likelihood: -26784.69

Everything seems to run smoothly, and the estimates appear reasonable. I would love to hear if you see any red flags or have any additional suggestions regarding my setup.

Many thanks again!
Moritz
