Advice on species selection for a Spatial Multi-Species Occupancy Model

Susana Requena

Oct 16, 2025, 11:05:37 AM
to spocc-spa...@googlegroups.com
Hi all,

We are developing a Spatial Multi-Species Occupancy Model for an
ensemble of trans-Saharan migratory birds. The dataset is based on
detection–nondetection data from point counts across West Africa, with
remote-sensing covariates used as predictors of occurrence.

The sampling covers eleven “winter” seasons (2009–2019 and 2021), five
countries, 156 places, 200 transects (sites), and 2,598 count points.
However, the dataset structure has proven challenging, and so far I
haven’t managed to obtain results, even with the conceptually simplest
model. I’m stuck in the analysis. I think I should simplify the
structure, and I’d appreciate your advice before proceeding. I describe
the dataset below so you can get an idea of it.

From what I have found in the literature, and from my intuition in this
case (I'm very inexperienced; this will be my first serious try with
spOccupancy), I think that simplifying the structure of the data (i.e.,
summarising categories) and reducing the number of species to those
that are "reasonably" well-sampled across seasons and points would
help. At the same time, I want to retain rare species that still
contain enough information for modelling. I’ve seen approaches (most
recently, the brilliant paper by Doser et al., 2025) that apply a
threshold on the minimum number of sites (e.g., species detected at
≥ 50 locations) and/or summarise the data for each species at each
site into a fixed number of spatial replicates (5 in the example
mentioned).

However, I'm concerned that doing so could worsen the sampling bias
even further, especially since some sites were sampled more
intensively than others. So I need to find a compromise while
balancing the dataset.

My (initial) questions to you are:

1. What criteria or exploratory analyses (or rule of thumb!) would you
suggest to decide whether and how to reduce the number of species?
2. If a detection threshold is appropriate, what would be a reasonable
cutoff (e.g., 50 sites) for inclusion?

At this stage, my basic objective is to fit and complete a very basic
model that includes detection variables (e.g., time, distance) and a
couple of remote-sensing predictors before gradually increasing the
complexity. I'd be immensely happy and grateful.

Many thanks for your time and any suggestions you can share, and
apologies for this cumbersome message. Please feel free to ask me for
more details if needed.

Cheers,

Susana

--------------------------------------------------------------------
Dataset overview

Temporal coverage: 11 seasons (2009–2019, 2021), from September to early May

Spatial coverage: 5 countries, 156 places, 200 transects, 2,598 points.
Range of transects by place: 1 – 11, mode =1 (93% of the places).
Range of points per transect: 1 – 41, mode =10 (17.5% transects).

Survey method: Point counts spaced every 200 m, recorded at 3, 5, and
5+ minutes, within a 0.5 km radius.

Sampling effort: Range of visits per point: 1 – 53, mode = 1 (57% of
the points were visited just once); 2 – 4 visits (14%); 5 visits (18%).
About 4.5% of the points, from just two sites, totalled 53 visits
across all these seasons. Within a single season, some points received
as many as 15 or 25 visits.

Detection frequency of focal species (N = 39)
13 species detected in > 100 points
8 species detected in 42–85 points
18 species detected in ≤ 20 points
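
For the threshold question above, a tally like the detection-frequency summary can be recomputed for any candidate cutoff. The following is only a minimal sketch in Python, assuming the data are available as long-format records of (species, point, detected); the species names, point IDs, and cutoffs are hypothetical toy values, not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical long-format records: (species, point_id, detected 0/1).
records = [
    ("sp_a", "p1", 1), ("sp_a", "p2", 1), ("sp_a", "p3", 0), ("sp_a", "p1", 1),
    ("sp_b", "p1", 1), ("sp_b", "p2", 0), ("sp_b", "p3", 0),
    ("sp_c", "p1", 0), ("sp_c", "p2", 0), ("sp_c", "p3", 0),
]


def points_with_detections(records):
    """Count the distinct points where each species was detected at least once."""
    seen = defaultdict(set)
    for species, point, detected in records:
        seen[species]  # touch the key so never-detected species show up with 0
        if detected:
            seen[species].add(point)
    return {species: len(points) for species, points in seen.items()}


def species_retained(counts, min_points):
    """Species that meet a candidate cutoff on the number of points with detections."""
    return sorted(sp for sp, n in counts.items() if n >= min_points)


counts = points_with_detections(records)
for cutoff in (1, 2, 3):  # illustrative cutoffs only
    print(cutoff, species_retained(counts, cutoff))
```

Running the same tally over several cutoffs shows how quickly the species list shrinks, which may be more informative than committing to a single number up front.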

Bruce Wayne

Oct 17, 2025, 4:00:29 PM
to spOccupancy and spAbundance users
Susana,

We might need a bit more clarification on the questions you are asking of the data. For example, I have a data set in which some species are detected at just one site. However, I am interested in community-level occupancy, so including those rare species is important for the community metric. I included these species and got relatively good community-level Bayesian p-values for model fit (with three covariates for detection and one categorical covariate, with three levels, for occupancy). Are you interested in species-level occupancy or community-level occupancy?

I think that might be a good place to start.
Alan

Susana Requena

Oct 20, 2025, 6:58:03 PM
to Bruce Wayne, spOccupancy and spAbundance users
Hi Alan, many thanks for taking the time to help; it's much appreciated.

The dataset includes information on Afro-Palearctic species listed in
the AEMLAP (African-Eurasian Migratory Landbird Action Plan). These
species—such as warblers, flycatchers, larks, and shrikes—do not form
an ecological community per se, but rather represent a migratory
assemblage. The species are very different in terms of their ecology
and life-history traits.

Initially we are interested in species-level occupancy. Our objective
is to identify habitat associations and compare occupancy patterns
across taxa and seasons, using environmental covariates derived from
remote sensing (e.g., land cover, climate, and some geographical
parameters). These covariates were extracted for each sampling point
at the closest available time and location to the corresponding visit.

Although some points were sampled multiple times within a season,
nearly half of the points were visited only once during the 11 years.
Also, there's a one-year gap in the dataset because of the pandemic.
All this contributes to a very unbalanced dataset, and I am also
concerned about the closure assumption. I wonder whether excluding
those species and/or sites with limited sampling effort could help to
balance the dataset, so that we could get the model to converge. If
so, which strategy would you suggest?

Cheers,
S.

Jeffrey Doser

Oct 31, 2025, 10:34:24 AM
to Susana Requena, Bruce Wayne, spOccupancy and spAbundance users
Hi Susana,

Thanks for the message, and apologies for the delay. I would echo what Alan said: what you do will be highly dependent on your objectives. Since you are focused on species-specific inference, you will very likely require more detections for a given species to be included in the analysis than Alan had when his inferences were primarily at the community level. The number of detections you need for a given species in order to get reliable estimates from the model depends on a variety of factors, such as: (1) where the detections occur and how those locations relate to the covariates in the model; (2) how many covariate effects you are trying to estimate; and (3) how the detections are distributed over time.

With all that said, it's hard to recommend a specific threshold, but I would suggest taking a very iterative approach to fitting the models. I would first fit a non-spatial, single-species model that includes all the covariates you are interested in (and potentially random effects). Fit that model first for a common species with a large number of detections. Once you get that to work, try to fit the same model for some of the rarer species in the data set. See which species you can get a single-species model to work for, and how many detections those species have. That will be helpful in determining an eventual threshold for a multi-species model. You will likely find that, for many of your rare species, the models will not converge, or they will give extremely unrealistic estimates (e.g., massive uncertainty in the parameter estimates), which indicates there is not enough data to estimate all model parameters.

Once you've done that, I would suggest moving on to a multi-species model. You could start by including all species that were detected, or you could set some baseline threshold designed to remove any species that are likely not truly using the area and whose detections may have been spurious. If you find that this model struggles to converge, you can gradually add restrictions to remove some of the rarer species from the analysis.
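
For the single-species step, the unbalanced visits have to be reshaped into a sites-by-replicates detection matrix. The following is a minimal sketch in Python, assuming a hypothetical visit log of (site, visit number, species detected). As far as I understand, spOccupancy expects this kind of matrix with NA where a visit did not take place, but treat that as an assumption and check the package documentation. Here `None` stands in for NA, and all names and values are illustrative.

```python
def detection_matrix(visits, species):
    """Build a sites x replicates matrix for one species.

    visits: list of (site, visit_number, set_of_species_detected), with
    visit_number starting at 1. Cells for visits that never happened stay
    None (the stand-in for NA in this sketch).
    """
    sites = sorted({site for site, _, _ in visits})
    n_rep = max(visit for _, visit, _ in visits)
    row = {site: i for i, site in enumerate(sites)}
    y = [[None] * n_rep for _ in sites]
    for site, visit, detected in visits:
        y[row[site]][visit - 1] = int(species in detected)
    return sites, y


# Hypothetical toy visit log with unequal numbers of visits per site.
visits = [
    ("t1", 1, {"sp_a"}), ("t1", 2, set()), ("t1", 3, {"sp_a", "sp_b"}),
    ("t2", 1, set()),
    ("t3", 1, {"sp_b"}), ("t3", 2, {"sp_a"}),
]

sites, y = detection_matrix(visits, "sp_a")
for site, row_values in zip(sites, y):
    print(site, row_values)
```

Building the matrix species by species makes it easy to start with a common species and then rerun the same model for progressively rarer ones.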

As far as the unbalanced sampling goes, it is certainly possible to get estimates from such an unbalanced data set, even when the majority of sites have only one replicate per season (if I'm understanding correctly). A paper by Sara Stoudt and myself talks a bit about how reliable estimates from such a scenario can be under different assumptions; it suggested that even with just a few sites having replication, the estimates prove to be quite reliable. Of course, this depends on the patterns of sampling and on whether there is any systematic pattern in which sites received more visits than others (e.g., if sites thought to have more rare species were visited more than others). There is a nice preprint by Luza et al. that I believe looks at some of these concepts in more depth, which might be worth checking out.
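
To see how much replication there actually is, it can help to summarise, for each season, how many sites received more than one visit. The following is a minimal sketch in Python, assuming a hypothetical log with one (site, season) entry per visit; the site IDs and seasons are toy values.

```python
from collections import Counter

# Hypothetical visit log: one (site, season) entry per visit.
visit_log = [
    ("p1", 2009), ("p1", 2009), ("p1", 2010),
    ("p2", 2009),
    ("p3", 2009), ("p3", 2009), ("p3", 2009), ("p3", 2010),
]


def replication_summary(visit_log):
    """Per season: sites surveyed, sites with repeat visits, and their share."""
    visits_per_site_season = Counter(visit_log)
    per_season = {}
    for (site, season), n_visits in visits_per_site_season.items():
        total, repeated = per_season.get(season, (0, 0))
        per_season[season] = (total + 1, repeated + int(n_visits > 1))
    return {
        season: {"sites": total, "with_repeats": repeated, "share": repeated / total}
        for season, (total, repeated) in sorted(per_season.items())
    }


summary = replication_summary(visit_log)
for season, stats in summary.items():
    print(season, stats)
```

Cross-tabulating the same summary against a site-level covariate would be one way to check whether the heavily visited sites differ systematically from the rest.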

All that to say, I think the best thing to do is start simple and try to gradually chip away at determining how much of the data you can use in the analysis.

Hope all is well!

Jeff



--
Jeffrey W. Doser, Ph.D.
Assistant Professor
Department of Forestry and Environmental Resources
North Carolina State University
Pronouns: he/him/his

Susana Requena

Nov 4, 2025, 5:26:03 AM
to Jeffrey Doser, Bruce Wayne, spOccupancy and spAbundance users
Hey Jeff,

Thanks so much for the detailed reply — that’s super helpful. No
worries at all about the delay! Your suggestion to start simple and
build up iteratively makes a lot of sense, classic parsimony principle
approach ;). Indeed, after reflecting on Alan’s reply, I started with
a single-species model for some common species, as you suggested, with
mixed results. I’ll keep working on the approach and see how things
behave with some of the rarer ones before moving toward a
multi-species setup.

It’s also reassuring to hear that the unbalanced sampling shouldn’t be
a major obstacle. Thank you for the references.

I’ll keep you posted on how things progress — I’m looking forward to
seeing how the data respond as the models get more complex and,
eventually, a multi-species and multi-season one. Really appreciate
you taking the time to lay all this out so clearly. Hope all’s well on
your end too!

Best,
Susana