Dealing with an uneven number of visits when using palaeontological data


christophe...@gmail.com

Nov 29, 2022, 8:07:59 AM
to unmarked

Hi everyone,

I’m going to preface this by saying I’m very new to occupancy modelling and using unmarked in general, so apologies if there is a simple answer to this!

To give some background, I’m a palaeontologist who’s become interested in occupancy modelling as a way to better understand the factors influencing preservation in the fossil record. In my current project, I’m using unmarked to apply single-season occupancy models to a dataset of dinosaur occurrences from the end Cretaceous, to answer questions about how the fossil record influences perceived patterns prior to the end-Cretaceous mass extinction 66 million years ago (I’m happy to go into the specifics of this project if people are interested, but didn’t want to write an essay!). Aside from a few papers, occupancy modelling hasn’t really caught on in palaeo yet, so I’m unsure about how to set up my data given the peculiarities of fossil data.

In terms of my setup, I’ve downloaded North American fossil occurrences from the Paleobiology Database (http://paleobiodb.org for those interested), which contains presence-only records of fossil occurrences reported in ‘collections’ (basically a collecting event from a distinct rock horizon/geologic time at a specific geographic locality). Collections can have one or many fossil occurrences within them. I have then used R to grid up collections from the same time intervals into both 0.1- and 0.5-degree grids, with the aim of testing these datasets in different occupancy models. I am using these grid cells as my ‘sites’; each collection is then a ‘visit’, with the presence of my target taxon within a collection counting as a detection and its absence as a non-detection. This is similar to how I’ve seen some occupancy approaches deal with historical data not specifically collected for occupancy modelling, and also similar to some other palaeo papers that have used occupancy modelling.
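In case it helps anyone reproduce this, here is a minimal base-R sketch of that setup. The column names (`lng`, `lat`, `collection_no`, and a logical `target_present` flag) are my assumptions, not actual PBDB field names, so a real download would need renaming first:

```r
# Build a site-by-visit detection history from point occurrences:
# sites are grid cells, visits are collections within a cell.
make_det_hist <- function(occ, cell_size = 0.5, max_visits = NULL) {
  # Assign each occurrence to a grid cell (assumed columns: lng, lat)
  occ$cell <- paste(floor(occ$lng / cell_size), floor(occ$lat / cell_size))
  # One row per collection ("visit"): was the target taxon present in it?
  visits <- aggregate(target_present ~ cell + collection_no, occ, any)
  by_cell <- split(visits$target_present, visits$cell)
  J <- if (is.null(max_visits)) max(lengths(by_cell)) else max_visits
  # Ragged visit histories, padded with NA to a rectangular site x visit matrix
  t(vapply(by_cell, function(v) {
    v <- as.integer(v)[seq_len(min(length(v), J))]
    c(v, rep(NA_integer_, J - length(v)))
  }, integer(J)))
}
```

The resulting matrix is exactly the `y` that `unmarkedFrameOccu()` expects, with sites that received fewer visits padded out with NAs.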

However, my issue comes from the fact that I have a strongly uneven number of visits between sites, which introduces an enormous number of NAs in the site-by-species matrix. To give an example, the majority of my sites (grid cells) have between 1 and 3 collections in them, but maybe 5% of them have between 80 and 130 collections. This is obviously extremely skewed, and I’ve read elsewhere that having such uneven numbers of visits can cause problems for the model through the introduction of NAs. As such, I’ve been trying to decide on a way to deal with it, and I’ve come up with a number of options:

1) Leave it, and accept that this might cause issues (definitely not my preferred method!).

2) Truncation of sites. There’s a paper (Lawing, A.M., Blois, J.L., Maguire, K.C., Goring, S.J., Wang, Y. and McGuire, J.L., 2021. Occupancy models reveal regional differences in detectability and improve relative abundance estimations in fossil pollen assemblages. Quaternary Science Reviews, 253, p.106747) that’s applied occupancy modelling to fossil pollen records, which also had issues with uneven ‘visits’ between ‘sites’ (although their setup differed from mine due to their data types). To deal with this, they introduced a cut off of 10 ‘visits’, and removed any further visits from their matrix. They found that this improved model results, but I was somewhat unsure of the validity of removing any visits after the first 10 or so.

3) Remove those sites. Another possibility would be removing any sites which have a vastly larger number of visits. However, fossil data is already fairly sparse, and I would then be losing a large amount of potentially valuable information.

4) Subsample/make an ‘average’ occupancy. My final thought was to effectively create an ‘average’ detection history. I wrote a quick script in R that randomly samples the detection history of a site with more visits than a specified limit (e.g. 10 visits per site) 1,000 times to generate an ‘average’ detection history. However, this obviously means I can’t attach observation-level detection covariates within this model. I also thought about randomly sampling a specific number of visits and just running the model, but then each model would have to be run many times to create an ‘average’, and interpreting the results from that would be very problematic.

As you can see, I am unsure of the best procedure to use, or whether any of these options make sense. I am also unsure whether this is also indicative of a larger issue of spatial autocorrelation, and whether that should be incorporated in my model somehow.

Does anyone have any suggestions or thoughts on what the best course of action for me may be?

Thanks so much for reading (sorry this ended up so long!), and if you have any questions just let me know and I’ll do my best to clarify for you!

Best wishes,

Chris

Ken Kellner

Nov 30, 2022, 11:05:01 AM
to unmarked
Hi Chris,

I would probably first try to fit the model with the full dataset. There's nothing strictly wrong (that I can think of) about having such unbalanced visits, but it is possible it would cause problems with the optimization.
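For what it's worth, a quick simulated sketch of fitting the full ragged design (the parameter values and skew are arbitrary, and the `unmarked` call is guarded so the sketch runs even without the package installed):

```r
set.seed(1)
n_sites <- 100
# Mimic the skewed design: most sites get 1-3 visits, 5% get 80
J <- c(sample(1:3, 95, replace = TRUE), rep(80, 5))
psi <- 0.4; p <- 0.3                 # "true" occupancy and detection
z <- rbinom(n_sites, 1, psi)         # latent occupancy state per site
y <- t(sapply(seq_len(n_sites), function(i) {
  # Observed visits, then NA-pad unvisited columns out to max(J)
  c(rbinom(J[i], 1, p * z[i]), rep(NA, max(J) - J[i]))
}))
if (requireNamespace("unmarked", quietly = TRUE)) {
  umf <- unmarked::unmarkedFrameOccu(y = y)
  fm  <- unmarked::occu(~ 1 ~ 1, data = umf)  # p(.) psi(.)
  print(plogis(coef(fm)))  # back-transformed estimates of psi and p
}
```

The NA padding itself is not a problem for the likelihood; visits that are NA simply contribute nothing for that site.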

If you want to try to reduce the number of visits, I don't think I would use the 'average' occupancy approach as you suggest. Two ideas:

(1) For sites with J > 10 visits (or whatever value of J), keep only a maximum of 10 randomly selected visits. Similar to what the paper did, but don't always keep just the first 10, in case that introduces some bias. You can keep the associated detection covariates this way.
(2) Approach 1, but fit the model many times, each time selecting different random sets of visits for the sites with J > 10 visits. Then look at the distribution of the parameter estimates from all your models and see how sensitive the model results are to this random truncation. You could also do this for varying J to see if that choice matters.
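A base-R sketch of idea (1) (the function and argument names here are mine, not part of unmarked): keep at most `max_j` randomly chosen visits per site, and return the kept column indices so any observation-level covariates can be subset identically:

```r
subsample_visits <- function(y, max_j = 10) {
  # For each site, indices of visits that actually happened (non-NA)
  keep <- lapply(seq_len(nrow(y)), function(i) {
    done <- which(!is.na(y[i, ]))
    if (length(done) > max_j) sort(sample(done, max_j)) else done
  })
  # Rebuild a rectangular site x max_j matrix, NA-padding short sites
  y_sub <- t(vapply(seq_along(keep), function(i) {
    v <- y[i, keep[[i]]]
    c(v, rep(NA_real_, max_j - length(v)))
  }, numeric(max_j)))
  list(y = y_sub, keep = keep)  # subset obs covariates with `keep` too
}
```

Idea (2) would then just wrap this in a loop: resample, refit `occu()`, collect the coefficients, and look at their spread across replicates (and across different choices of `max_j`).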

Ken

Jim Baldwin

Nov 30, 2022, 12:37:58 PM
to unma...@googlegroups.com
Just to echo previous comments:  There should be no computational issues with lots of NA's.  (While the unmarked code could be modified to use "sparse arrays" to eliminate NA's under the hood and make the y matrix more compact, that seems unnecessary - unless some dataset had thousands of "visits" at some sites.)

The NA issue is a "computational" issue (or maybe non-issue).  But the large variability in the number of site visits is a potential statistical issue in that maybe the heavily surveyed sites are just different in the detection model (and maybe the occupancy model).  Why not just have a different detection model for sites with greater than 10 visits?  That way you use all of the data.  If you find that the detection parameters are essentially the same between the two different survey intensities, that would be good to know.  And likewise if you do find a difference.  (This certainly doesn't eliminate all sources of "selection bias" but it would seem a reasonable way to start.)
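A sketch of that two-intensity idea (the 10-visit cutoff and the simulated data are placeholders; the unmarked lines are commented out because they need the package):

```r
set.seed(42)
J <- c(rep(2, 8), rep(30, 2))   # 8 lightly and 2 heavily surveyed sites
y <- t(sapply(J, function(j) c(rbinom(j, 1, 0.3), rep(NA, max(J) - j))))
# Classify each site by survey effort, as a site-level factor
effort <- factor(ifelse(rowSums(!is.na(y)) > 10, "high", "low"))
table(effort)
# With unmarked, effort enters the detection model as a site covariate:
# umf <- unmarkedFrameOccu(y = y, siteCovs = data.frame(effort = effort))
# fm  <- occu(~ effort ~ 1, data = umf)  # detection differs by survey intensity
```

Comparing the two detection intercepts (or the AIC of this model against `~ 1 ~ 1`) would then show whether the heavily surveyed sites really behave differently.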

Jim



Christopher Dean

Dec 5, 2022, 6:16:07 AM
to unma...@googlegroups.com

Hi both,


Sorry for the delay in my reply. Thank you so much for your suggestions and advice. It’s good to know there’s nothing inherently wrong with running a dataset with highly uneven visits.

I like the suggestion of randomly sampling and re-fitting models to assess the variability – that seems like a sensible thing to test and provide a quantifiable estimate for, so will make sure to give that a go.


In regard to the different detection model for sites with >10 visits, do you mean adding in a site-level detection covariate specifying whether each site is over/under a certain number of visits? Again, this seems very sensible and a good general test.

Thanks again so much for your help, and have a great week,


Chris

Stefano Anile

Jun 2, 2023, 2:44:28 AM
to unmarked
Hi, 

The function filter_repeat_events might be well suited to your purpose.
Stef