Poor model fit (ppcOcc) for large stacked dataset

Tim

unread,

Jun 6, 2025, 12:54:13 AMJun 6

to spOccupancy and spAbundance users

Hi Jeff

I am using PGOcc/spPGOcc to assess mammal responses to fire using a dataset comprising 2410 sites-surveys (rows), which we have corresponded about over email previously. The 2410 rows are made up of 862 unique sites that each have 1-12 surveys in different seasons or years. I am using a stacked approach because the data come from different projects over many years (site and year are used as random effects).

My models almost always fail GOF tests (Bayesian p-value typically 0 or close to it). I have tried many possible fixes, including different random effects structures, different occupancy and detection covariates, aggregating survey days, increasing the number of iterations, and spatial and non-spatial models across multiple species.

The only thing I have found to produce acceptable p-values (0.1-0.9) across multiple species is to thin the dataset to one survey per site (862 rows). I also tried thinning the data to a maximum of two or three surveys per site (1316 and 1544 rows, respectively), but the GOFs fail in those cases.

I also noticed the following behaviours:

1) if I take blocks of surveys (e.g., rows 1:100, 301:400, …, summing to 862 rows) the GOFs fail. In that instance, the dataset still contains multiple surveys per site (1-12), but not all sites are analysed. This may suggest the GOFs are failing due to repeat surveys of sites.

2) if I independently randomise every covariate in the blocked dataset such that the covariate dependency within sites (i.e. repeat values across surveys) is removed, the GOFs still fail.

3) for the dataset that was thinned to one survey per site and the GOFs were good, if I remove the site random effect from the models, the GOFs fail for group = 1. This is surprising to me since there is only one row of data per site.

These are my model formulae:

occ.formula <- ~ scale(YSF) + I(scale(YSF)^2) + scale(TWI) + scale(rain) + scale(freehold) + (1|LocationName_numeric) + (1|Year)

det.formula <- ~ Track + Season + (1|LocationName_numeric) + (1|Year)

*LocationName_numeric is the ‘site’ variable

I have plotted the discrepancy values and inspected the underlying data. The rows with high discrepancy values appear to span a wide range of covariate values and include both sites that were surveyed once only and sites surveyed multiple times. When I looked at the offending rows for two different species, there was some overlap in the problematic rows, but also many rows with high discrepancy values for one species but not the other.

Any advice you have about diagnosing/fixing the problem is greatly appreciated!

Thanks for your help.

Tim

Jeffrey Doser

unread,

Jun 12, 2025, 5:38:03 AMJun 12

to Tim, spOccupancy and spAbundance users

Hi Tim,

Thanks for the note, and sorry for the delay. Seems like you've done some pretty extensive testing of how to fix the problem at hand. You might try looking at the residuals() function. This will calculate residuals following the approach of Wright et al. (2019) which might allow you to more so fine tune what is going on here, beyond what the PPC can give you. Based on your tests, it does appear the problem is with explaining inter-year variation in detection probability. Can you hypothesize any variables that may be related to such variation in detection probability within a season that could help soak up some of that variability? Also, are the species you're working with very rare? If there's only a few detections of the species and a large number of zeros, you may struggle to get an adequately fitting model.

As a slight aside, given the multi-year data set, I'd eventually suggest switching to using the functions tPGOcc and stPGOcc, particularly if you go to use the spatially explicit functions. I didn't develop PGOcc/spPGOcc with the concept of fitting stacked models, and although it's possible, it is a bit confusing for the spatial model in particular. tPGOcc and stPGOcc are designed exactly for the type of data you have. However, the residuals() function will only work for PGOcc/spPGOcc, so perhaps best to hold off on that switch until you dig into those.

Jeff

--
You received this message because you are subscribed to the Google Groups "spOccupancy and spAbundance users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to spocc-spabund-u...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/spocc-spabund-users/a9446f20-b10c-4cb7-a148-ddedb5f464e1n%40googlegroups.com.

--

Jeffrey W. Doser, Ph.D.

Assistant Professor

Department of Forestry and Environmental Resources

North Carolina State University

Statistical Ecology and Forest Science Lab

Pronouns: he/him/his

Marc Kéry

unread,

Jun 12, 2025, 6:03:02 AMJun 12

to spOccupancy and spAbundance users

Dear Tim, Jeff,

I wonder whether you should investigate the possibility of a 'trap response' in the data ? In capture-recapture, trap response has traditionally been a strong focus in tests of goodness of fit, but in occupancy models we seem to consider this complication far less, if at all. Trap response denotes a specific sort of temporal dependence in detection probability, on whether previous detection has taken place or not. One may distinguish between immediate trap response, permanent trap response, and various intermediates. To model immediate trap response you could make detection probability at site i, occasion t and species k, p[i,t,k], depend on the observed datum y[i, t-1, k]. That is, you supply a version of the observed detection/nondetection data as an observational covariate and then fit that as a factor (not sure here though whether spOccupancy would allow observational covariates that differ by species ?). What that will do then is estimate a different p during an occasion that follows an occasion with a detection than during an occasion that follows an occasion without a detection.

Best regards --- Marc

To view this discussion visit https://groups.google.com/d/msgid/spocc-spabund-users/CAGz0RZCTfJ7XVKuLY9eC1CQmH4oQO0OQpWDRSVi4DN6arxbujQ%40mail.gmail.com.

--

______________________________________________________________

Marc Kéry
Tel. ++41 41 462 97 93
marc...@vogelwarte.ch
www.vogelwarte.ch

Swiss Ornithological Institute | Seerose 1 | CH-6204 Sempach | Switzerland
______________________________________________________________

*** Hierarchical modeling in ecology ***

https://www.mbr-pwrc.usgs.gov/pubanalysis/roylebook/

Tim

unread,

Jun 17, 2025, 4:41:31 AMJun 17

to spOccupancy and spAbundance users

Thanks Jeff and Marc for the advice and ideas. I will investigate the residuals function and do some reading about trap response models.

With regards to tPGOcc and stPGOcc, I was hesitant to go down that route because the dataset is highly unbalanced when it comes to repeat surveys over seasons and years. Only about 50% of the 800ish sites have more than one primary survey and all surveys occurred in nine years between 2012 and 2024 inclusive (i.e. there are no surveys in some years). I can see in your tutorial though that the functions can handle missing data, but I assumed that the models wouldn't cope with a heavily unbalanced dataset with large amounts of missing surveys (untested as of yet).

Regards

Tim

Jeffrey Doser

unread,

Jun 17, 2025, 11:26:36 AMJun 17

to spOccupancy and spAbundance users

Hi Tim and Marc,

That's a good idea, Marc. The multi-species functions in spOcc/spAbund don't let you fit covariates that differ by species, but of course one could supply a different covariate for each individual species when fitting individual species models.

Tim, the tPGOcc and stPGOcc essentially will fit the same type of model as the "stacked" version of the model under certain function settings (i.e., if you set ar1 = FALSE), so in terms of dealing with the unbalancedness there really isn't any difference (besides how the data are formatted). If you're not planning to go down the spatial route, then the stacked approach you have with PGOcc will work just fine. The formatting to get the spatial model with spPGOcc would be a bit trickier though, so I'd probably suggest going with that.