Full dynamic occupancy model using JAGS with NA in observation matrix ans observation covariates

Thomas Duchesne

unread,

Jan 31, 2025, 11:24:12 AM1/31/25

to hmecology: Hierarchical Modeling in Ecology

Hi,

I try to perform a full dynamic model on JAG for a data set extracted from a citizen science program.

I have ~600 sites surveyed for 11 years with maximum 10 secondary sessions (surveys per year). Unfortunately, I do not have 10 survey for each site every year (~70% NA).

I want to perform a full-time dependent model with annual estimation of extinction, colonisation and detection parameters. In addition, I want to add covariates of Julian date (continuous variable) and type of list (categorical variable) on detection model.

It seems JAG don't like NA values in Y array (detection/non-detection) as well as in Observation covariates array (same dimension as Y array). I am a little bit lost. How can I deal with it ?

Here is my code :

data_detect_m<-as.matrix(data_detec)
data_detect_m <- array(data_detect_m, dim=c(nrow(data_detect_m), 10, length(2014:2024)))
head(data_detect_m)

data_type_m<-as.matrix(data_type)
data_type_m <- array(data_type_m, dim=c(nrow(data_detect_m), 10, length(2014:2024)))
head(data_type_m)

data_date_m<-as.matrix(data_date)
data_date_m <- array(data_date_m, dim=c(nrow(data_detect_m), 10, length(2014:2024)))
head(data_date_m)

Y <- as.array(data_detect_m) # nSites x nSurvey x nYears

# Convert categorical variable `list_type` into numeric indices (1-4)
ObsCov1_index <- as.numeric(factor(data_type_m, levels = c("Comprehensive_daily_list",
"Medium_daily_list",
"Short_daily_list",
"Single_records")))

ObsCov1_array <- array(ObsCov1_index, dim = dim(data_type_m))
ObsCov2 <- as.array(data_date_m) # nSites x nSurvey x nYears
dim(ObsCov2)

# Get dimensions
nsites <- dim(Y)[1] # Number of sites
nsurvey <- dim(Y)[2] # Number of surveys per year
nyears <- dim(Y)[3] # Number of years

# Structure the data for JAGS
bdata <- list(
y = Y, # Detection matrix (nsites x nsurvey x nyears)
list_type = ObsCov1_array, # List type as numeric indices
julian_date = ObsCov2, # Julian date (continuous)
nSites = nsites,
nSurvey = nsurvey,
nYears = nyears,
nListTypes = 4 # Number of levels in list_type
)

cat(file = "dynocc_model.jags", "
model {
for (i in 1:nSites) {
# Initial occupancy probability
z[i,1] ~ dbern(psi1)

# Occupancy dynamics (Colonization and Extinction)
for (t in 2:nYears) {
logit(psi[i, t]) <- logit(phi[t-1] * z[i, t-1] + (1 - z[i, t-1]) * gamma[t-1])
z[i, t] ~ dbern(psi[i, t])
}

# Detection process
for (t in 1:nYears) {
for (j in 1:nSurvey) {
# Quadratic effect of Julian date
logit(p[i, j, t]) <- beta0 +
beta1 * julian_date[i, j, t] +
beta2 * pow(julian_date[i, j, t], 2) +
alpha[list_type[i, j, t]] # Categorical effect

y[i, j, t] ~ dbern(p[i, j, t] * z[i, t])
}
}
}

# Priors
psi1 ~ dbeta(1,1) # Prior for initial occupancy
for (t in 1:(nYears-1)) {
phi[t] ~ dbeta(1,1) # Extinction probability (yearly)
gamma[t] ~ dbeta(1,1) # Colonization probability (yearly)
}

beta0 ~ dnorm(0, 0.1) # Detection intercept
beta1 ~ dnorm(0, 0.1) # Linear effect of Julian date
beta2 ~ dnorm(0, 0.1) # Quadratic effect of Julian date

for (k in 1:nListTypes) {
alpha[k] ~ dnorm(0, 0.1) # Effect of list type categories
}

# Derived quantities
for (t in 1:nYears) {
N[t] <- sum(z[, t]) # Total occupied sites per year
}
}
")

# Initialize z using max detection per site-year
zst <- apply(Y, c(1, 3), max, na.rm = TRUE)

# Replace NA values (if any)
zst[is.na(zst)] <- 0

# Define initialization function
inits <- function() {
list(z = zst)
}

# Parameters to monitor
wanted <- c("psi1", "phi", "gamma", "beta0", "beta1", "beta2", "alpha", "p", "N")

# Run JAGS Model
outY <- jags(
data = bdata,
inits = inits,
parameters.to.save = wanted,
model.file = "dynocc_model.jags",
n.adapt=1000,
n.chains = 3,
n.iter = 2500,
n.burnin = 500,
n.thin = 2,
parallel = TRUE
)

I got some error such as :

Unable to resolve the following parameters: julian_date[1,1,2] (line 18, 19) julian_date[1,1,3] (line 18, 19) julian_date[1,2,1] (line 18, 19) julian_date[1,2,2] (line 18, 19)

I get it is due to NA values. I am quite new in the Bayesian world as I am much more used to frequentist approach. Can someone help me ?

I initially wanted to add annual heterogeneity in site-level extra-dispersion in p as suggested for citizen science data in AHM2 p 268 but I'll be very happy to run simpler models.

Many thanks.

Marc Kery

unread,

Jan 31, 2025, 1:27:04 PM1/31/25

to hmecology: Hierarchical Modeling in Ecology

Dear Thomas,

nearly all ecological data are imbalanced, i.e., don't have the same replication "everywhere". This is especially true for citizen-science data. For analyses with JAGS, it is then often useful to restore balance by filling in missing values until everything has the same degree of replication. Balanced data can be stored in regular arrays and this makes the BUGS code much neater and therefore easier to read and understand.

However, then there's then these missing values and we need to do something about some of them. The short formula about how to deal with missing values in JAGS is this: "NAs in the response are not a problem, but NAs in the covariates need to be dealt with". What I mean by that is that missing responses will simply be estimated as part of the model fitting. They are not a problem, but just make the model fitting slower. With your moderately-sized data set I think this should not be a problem. However, missing covariates must be dealt with. The simplest in your case would simply be to fill the missing covariates with just any value (e.g., 0), and then JAGS will be happy. This sort of missing-value-imputation is innocuous in your case, since all of these invented covariate data will correspond to cases of a missing response, and hence, the former will not contribute anything to the estimation.

In general, however, I note that for more "respectable" sample sizes (which are very easy to happen with mass-participation citizen-science programs), estimating all the missing values in an NA-filled-up response array may easily become computationally intractable. In that case, the best solution is to vectorize all response and covariate data, add covariates that index sites, years, replicates, etc, and then rewrite the model for this format of the data. This is not all that hard, but it does make the code harder to understand, especially for a novice. In the AHM2 book, we have an example of such a vectorized analysis of a dynocc model in Chapter 4 (section "4.10 ANALYSIS OF CITIZEN SCIENCE DATA USING OCCUPANCY MODELS").

Good luck and best regards — Marc

From: hmec...@googlegroups.com <hmec...@googlegroups.com> on behalf of Thomas Duchesne <thomas....@natagora.be>
Sent: Friday, January 31, 2025 16:20
To: hmecology: Hierarchical Modeling in Ecology <hmec...@googlegroups.com>
Subject: Full dynamic occupancy model using JAGS with NA in observation matrix ans observation covariates

--
*** Three hierarchical modeling email lists ***
(1) unmarked: for questions specific to the R package unmarked
(2) SCR: for design and Bayesian or non-bayesian analysis of spatial capture-recapture
(3) HMecology (this list): for everything else, especially material covered in the books by Royle & Dorazio (2008), Kéry & Schaub (2012), Kéry & Royle (2016, 2021) and Schaub & Kéry (2022)
---
You received this message because you are subscribed to the Google Groups "hmecology: Hierarchical Modeling in Ecology" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hmecology+...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/hmecology/9b0921f4-a120-4e0e-b112-a66bd7cab2d8n%40googlegroups.com.

Thomas Duchesne

unread,

Feb 11, 2025, 6:19:57 PM2/11/25

to hmecology: Hierarchical Modeling in Ecology

Dear Mark,

Thanks a lot for your answer. My code is now working perfectly even if the computation time is huge for species with more than 1500 sites, 14 primary sessions and 10 secondary sessions (~90% of NA). I can reduce NAs by sampling by prioriziting 5 secondary survey per year but this is still time consuming for model convergence.

For these species, I will try to rewrite the model with vectorised response and covariates. I hope this will save a lot of time by reducing NAs.

Best regards,

Thomas Duchesne

Marc Kery

unread,

Feb 12, 2025, 2:11:02 AM2/12/25

to hmecology: Hierarchical Modeling in Ecology

Dear Thomas,

in Switzerland, for citizen-science data on birds, we may have 10000 sites, 30 years and daily 150 occasions, with most site-year combos having NAs. 99% of a regular array may be filled with NAs. Totally undoable in this form, but doable when the data are vectorized.

Anyone who has largish arrays (e.g., more than a couple hundred sites, >5-10 years, and potentially lots of occasions) should probably write the analysis in the vectorized form.

Best regards --- Marc

From: hmec...@googlegroups.com <hmec...@googlegroups.com> on behalf of Thomas Duchesne <thomasd...@hotmail.be>
Sent: Tuesday, February 11, 2025 14:57

To: hmecology: Hierarchical Modeling in Ecology <hmec...@googlegroups.com>

Subject: Re: Full dynamic occupancy model using JAGS with NA in observation matrix ans observation covariates

To view this discussion visit https://groups.google.com/d/msgid/hmecology/363fc375-0bed-4914-abf3-cb926fc9679fn%40googlegroups.com.

Perry de Valpine

unread,

Feb 18, 2025, 4:15:03 PM2/18/25

to Marc Kery, hmecology: Hierarchical Modeling in Ecology

Hi Thomas and Marc,

I'm coming to the discussion late but will chime in a few ideas. I hope I'm following the issue. Thomas, I may not be following all of your needs, but I'll point out a couple of idioms that might be relevant.

One is a "ragged array", roughly like:

for (i in 1:nSites) {

for (t in 2:nYears[i]) {

logit(psi[i, t]) <- logit(phi[t-1] * z[i, t-1] + (1 - z[i, t-1]) * gamma[t-1])
z[i, t] ~ dbern(psi[i, t])
}

}

so that the number of years can vary by site. Another is nested indexing, roughly like (using another part of your model):

for (t in 1:nYears) {

for (j in 1:nSurvey[t]) {

# Quadratic effect of Julian date

logit(p[i, surveys[t, j], t]) <- beta0 +
beta1 * julian_date[i, surveys[t, j], t] +

beta2 * pow(julian_date[i, surveys[t, j], t], 2) +

alpha[list_type[i, surveys[t, j], t]] # Categorical effect

y[i, surveys[t, j], t] ~ dbern(p[i, surveys[t, j], t] * z[i, t])

}
}

I hope I got that right, but maybe the general idea is clear. A problem with having many NA data nodes is that, in JAGS or nimble, you'll end up with MCMC sampling effort dedicated to imputing those missing data. If you skip over declaring them in code when they're not needed, they shouldn't be sampled. If you don't really need those imputed values for posterior interest, the wasted computation can be substantial.

If you want to make the jump to nimble (very similar to JAGS, with a conversion guide under documentation at r-nimble.org), you can gain further control. Following the idea that computing is about both algorithms and data structures, there is often less attention to the data structures we use for big sparse data sets. One idea that worked well in making spatial capture-recapture models more efficient was to move away from a huge largely empty (0 or NA) data structure to more compact ones (e.e., sites where detections occurred and how many detections for each animal) with the same information content in a much smaller format. One can then write a new "distribution" that simply takes the more compact data structures as inputs and can run more efficient versions of the calculations with them. I'm curious if that would be helpful for the cases you and Marc are describing.

Best wishes,

Perry

To view this discussion visit https://groups.google.com/d/msgid/hmecology/ZR0P278MB08693C3ECBFA3FB5F3CBA1A1EBFC2%40ZR0P278MB0869.CHEP278.PROD.OUTLOOK.COM.

Reply all

Reply to author

Forward