Deriving estimates manually based of clustered observations

48 views
Skip to first unread message

Don Carlos

unread,
Apr 23, 2021, 4:47:30 AMApr 23
to distance-sampling
Dear Eric et al.,

I am trying to understand some of the math in deriving the estimates based on surveys with cluster size variable. I have tried to dig into the source code of Distance/mrds, but cannot produce equivalent outputs. Any pointer and help hugely appreciated. I have to apologise in advance for any basic math mistakes I must have made...

Using the mink ClusterExercise from Distance as a reproducible example.

library(Distance)
library(tidyverse)
options(digits = 7)
data(ClusterExercise)
mod.cs <- ds(ClusterExercise)


Question 1: How is the Expected.S derived?

In the link below your response indicates that this is derived from the average observed group size in the sample.
https://groups.google.com/g/distance-sampling/c/2W7DJtaB82Q/m/eg3f1BzNBAAJ

However, although that seems to be the case of the strata specific estimates, it does not seem to be the case for the total mean - as given in the output in ds.summary().

CALCULATED.Expected.cluster.size <- ClusterExercise %>%
  group_by(Region.Label) %>%
  summarise(mean=mean(size, na.rm=T)) %>%
  as.data.frame() %>%
  pull(mean)

DS.Expected.cluster.size <- summary(mod.cs)[["dht"]][["Expected.S"]][["Expected.S"]][-3]
all.equal(CALCULATED.Expected.cluster.size, DS.Expected.cluster.size) # TRUE

Calculating the overall survey wide mean cluster size

CALCULATED.TOTAL.Expected.cluster.size.total <- ClusterExercise %>%
  summarise(mean=mean(size, na.rm=T)) %>%
  as.data.frame() %>%
  pull(mean)


DS.TOTAL.Expected.cluster.size <- summary(mod.cs)[["dht"]][["Expected.S"]][["Expected.S"]][3]
all.equal(CALCULATED.TOTAL.Expected.cluster.size.total, DS.TOTAL.Expected.cluster.size) # FALSE


Not equivalent, and there is a difference of 0.05 in mean. Guess this is a relatively small number, but as it is also happening in my actual analysis I would like to know if there is some weighting of means going on?


Question 2: Getting the SE and CV for the Expected.S

Following the above attempt to derive Expected.S I have tried to estimate its SE and CV, which do not add up to what Distance provides me, based on the following steps, and referencing the following link:
  https://groups.google.com/g/distance-sampling/c/QYQ4Oysf9JY/m/aEkV6_JpAwAJ

# using outputs from ds
size.vct <- mod.cs[["ddf"]][["data"]][["size"]]
n.clst <- length(size.vct)

var.s <- var(size.vct, na.rm=T)
se.s <- sqrt(var.s/n.clst)
cv.s <- se.s/mean(size.vct) # mean not equivilant to ds Expected.S as per Q1
cv.s


I get cv.s = 0.1013096 and ds outputs cv.s = 0.2072411

Question 3: Deriving final estimates

# pulling the results from the mod.cs, I get:
average.p <- summary(mod.cs)[["ds"]][["average.p"]]
Expected.S <- summary(mod.cs)[["dht"]][["Expected.S"]][["Expected.S"]][3] # mean group size
effort <- summary(mod.cs)[["dht"]][["clusters"]][["summary"]][["Effort"]][[3]] #total effort
truncation <- summary(mod.cs)[["ddf"]][["meta.data"]][["int.range"]][2]


# and estimating D.hat, based on clustered group size, I get outputs which are not equivalent to the ds outputs:
# Using D = n * f(0) * E(s) / 2L
D.hat <-  n.clst *(1/average.p)*Expected.S/((2*truncation)*effort)
summary(mod.cs)

I get D.hat = 0.06556279 and ds gets D.hat = 0.05723343
, similar, but not equivalent.

cv.er <- 0.2416576
cv.pa <- 0.07679084
cv.s <- 0.2072411
ds.D.hat <- 0.05723343

se.D <- ds.D.hat * sqrt(cv.er^2 + cv.pa^2 + cv.s^2)
cv.D <- se.D / ds.D.hat


I get cv.D = 0.3274815 and ds gets cv.D = 0.3682045

Eric Rexstad

unread,
Apr 25, 2021, 10:24:54 AMApr 25
to Don Carlos, distance-sampling

Don Carlos

Detailed question and supporting calculations regarding computations involving animals occurring in groups.  The fundamental challenge with your work is the choice of data set.  Indeed this data set does include animals (minke whales) that occur in groups, but with the added complication that the survey employed stratification.  Hence the computations include not only details involved in group size, but also details involved in stratified estimates.  I think it is the later that is the cause of the discrepancies you describe.

Sprinkled below are some modifications to your calculations along with some narrative and supporting documentation

library(Distance)
data(ClusterExercise)
mod.cs <- ds(ClusterExercise)
thesummary <- summary(mod.cs)

Expected.S <- thesummary[["dht"]][["Expected.S"]][["Expected.S"]][3] # mean group size
effort <- thesummary[["dht"]][["clusters"]][["summary"]][["Effort"]] #effort
abundance.indiv <- thesummary[["dht"]][["individuals"]][["N"]][["Estimate"]] #abund indiv
abundance.groups <- thesummary[["dht"]][["clusters"]][["N"]][["Estimate"]] #abund group

Question 1 (expected group size)

For the study area, expected group size is the ratio of estimated individual abundance to estimated group abundance:

expected.group.size.studyarea <- abundance.indiv / abundance.groups
print(expected.group.size.studyarea)
print(thesummary$dht$Expected.S)

Question 2 (precision of estimated expected group size)

From comments in the function dht.se in the mrds package:

This computes the se(E(s)). It essentially uses 3.37 from Buckland et al. (2004) but in place of using 3.25, 3.34 and 3.38, it uses 3.27, 3.35 and an equivalent cov replacement term for 3.38. This uses line to line variability whereas the other formula measure the variance of E(s) within the lines and it goes to zero as p approaches 1.

Here is Eqn. 3.37


Question 3 (density in study area)

Because the study area was stratified, the overall density is a weighted mean of stratum-specific densities, with the weighting factor being the sizes of the respective strata.  See Buckland et al. (2001) Sect 3.7.1, Eqn 3.122 or equivalently (below) Buckland et al. (1993), Sect 3.8.1

area <- thesummary[["dht"]][["clusters"]][["summary"]][["Area"]] #area
density <- thesummary[["dht"]][["individuals"]][["D"]][["Estimate"]] #density

density.studyarea <- (area[1]/area[3] * density[1]) + (area[2]/area[3] * density[2])
print(density.studyarea)
print(density[3])


Question 3b (precision of overall density estimate)

Variance (or other measures of precision) of the density estimate from a stratified survey is described in Section 3.7.1 of Buckland et al. (2001) detailed in Eqns. 3.123 through 3.126; equivalently Buckland et al. (1993), Sect. 3.8.1 (below)

Perhaps others can provide more detailed answers to your questions.

--
You received this message because you are subscribed to the Google Groups "distance-sampling" group.
To unsubscribe from this group and stop receiving emails from it, send an email to distance-sampl...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/distance-sampling/CAFTQVoD%3D22CNpA5FB3s2tbywkbdp9zzw0MKrGgVK3ykz-_ZS9g%40mail.gmail.com.
-- 
Eric Rexstad
Centre for Ecological and Environmental Modelling
University of St Andrews
St Andrews is a charity registered in Scotland SC013532

Jamie McKaughan

unread,
May 20, 2021, 6:03:40 AMMay 20
to distance-sampling
Hi team,

I was hoping you might be able to help me on a similar scenario. I have used three camera grids in one survey area to increase the number of survey locations without needing more cameras. This has given me three strata that I have fitted different detection functions for, and whose combined AIC is better than a common detection function. I have used the above directions to estimate an overall density using the weighted mean of stratum-specific densities (my final subset of code below).

Eff1 <- bh1.90.hn0$dht$individuals$summary$Effort
Eff2 <- bh2.90.hr0$dht$individuals$summary$Effort
Eff3 <- bh3.90.hr0$dht$individuals$summary$Effort
TotalArea <- sum(Eff1, Eff2, Eff3)
Dens1 <- bh1.90.hn0$dht$individuals$D$Estimate
Dens2 <- bh2.90.hr0$dht$individuals$D$Estimate
Dens3 <- bh3.90.hr0$dht$individuals$D$Estimate

density.studyarea <- (Eff1/TotalArea * Dens1) + (Eff2/TotalArea * Dens2) + (Eff3/TotalArea * Dens3) 
print(density.studyarea)

To calculate the other elements that are present in the normal summary can I just substitute the density for the respective criteria I want to estimate (e.g. se or %cv) into the above code?
e.g. 
se1 <- bh1.90.hn0$dht$individuals$D$se
se2 <- bh2.90.hr0$dht$individuals$D$se
se3 <- bh3.90.hr0$dht$individuals$D$se
se.studyarea <- (Eff1/TotalArea * se1) + (Eff2/TotalArea * se2) + (Eff3/TotalArea * se3) 

I also would like to bootstrap my estimates - is this simply a case of bootstrapping all three of my strata models and then using the combination method as above to establish the weighted bootstrap LCI and UCI for example?

I hope that makes sense.

Many thanks in advance,
Jamie

Eric Rexstad

unread,
May 21, 2021, 3:48:27 AMMay 21
to Jamie McKaughan, distance-sampling

Jamie

I would approach your problem in a slightly different coding way.  It is as if you've conducted multiple surveys in the same area.  I would combine the three grids into a single data set and use "grid" as a covariate in the detection function model and also as you stratification criterion.

Using the function 'dht2' in the Distance package, it can perform the effort-weighted computations you are performing by hand.  The function will also properly handle the computation of uncertainty.  The scenario is depicted thus



For your second question about the bootstrap, the function `bootdht` in the Distance package can perform bootstrap computations, but it is not yet capable of bootstrapping complex situations such as yours.  

Jamie McKaughan

unread,
May 21, 2021, 4:42:04 AMMay 21
to distance-sampling
Hi Eric

Thanks for this - I will give it a go instead!

Many thanks
Jamie

Jamie McKaughan

unread,
May 25, 2021, 11:33:59 AMMay 25
to distance-sampling
Hi Eric

Does using covariates restrict each covariate to have to use the same detection function? Is there a way to make R choose which is best for each? i.e. Grid1 might use HR, but Grid3 use HN, and provide effort-weighted computations accordingly? Or is that not good practice regardless?

Thanks
Jamie

Eric Rexstad

unread,
May 25, 2021, 11:49:48 AMMay 25
to Jamie McKaughan, distance-sampling

Jamie

Use of "grid" as a covariate in the detection function will make the scale parameter (sigma) of a key function differ for each grid.  Catch being that you must assume that all grids share the same key function (no mixing of hazard rate and half normals between grids).

You can contrast the grid-specific detection function model against a model without the grid covariate--that model then assumes the detection function is the same across all grids.

Jamie McKaughan

unread,
May 25, 2021, 11:55:10 AMMay 25
to distance-sampling
Okay - cool.

Thanks
Jamie

Jamie McKaughan

unread,
May 26, 2021, 9:53:35 AMMay 26
to distance-sampling
Hi Eric

I have run a couple of analyses this way, but in all cases the 'total' value does not provide se, cv or CI's of any use - I was expecting a weighted version from what you wrote before - is this normal or does this suggest an error in my code/data files?

Summary statistics:
 Region.Label Area CoveredArea  Effort   n  k ER se.ER cv.ER
        Grid1    1    3583.315 3520385 234 19  0     0 0.228
        Grid2    1    2768.840 2720213 276 19  0     0 0.460
        Grid3    1    3458.497 3397759 452 21  0     0 0.450
        Total    1    9810.652 9638357 962 59  0     0 0.256

Density estimates:
 Region.Label Estimate    se    cv    LCI    UCI     df
        Grid1   0.2035 0.047 0.232 0.1262 0.3282 19.239
        Grid2   0.3106 0.144 0.462 0.1233 0.7824 18.299
        Grid3   0.4073 0.184 0.452 0.1657 1.0009 20.347
        Total   0.3056 0.000 0.000 0.3056 0.3056 38.195

Component percentages of variance:
 Region.Label Detection    ER
        Grid1      3.27 96.73
        Grid2      0.82 99.18
        Grid3      0.86 99.14
        Total      2.62 97.38

Many thanks
Jamie

Eric Rexstad

unread,
May 26, 2021, 10:42:51 AMMay 26
to Jamie McKaughan, distance-sampling

Jamie

Not sure what's happening here, not surprised you are concerned.  Note that the first table of results (Summary statistics).  I'm guessing the encounter rates are minute (because they are detections per snapshot).  The cv.ER values in that first table seem plausible, so those zeros are likely just formatting problems.

But the line you have highlighted does not seem right--the upper and lower confidence interval bounds are equal, implying SE=0, which clearly isn't right.  What code did you use to get this result?

Jamie McKaughan

unread,
May 27, 2021, 11:13:37 AMMay 27
to distance-sampling
Hi Eric

Yes I think you are right regarding encounter rate. Table below shows the value a little better! Like you say the remaining elements are not plausible though.

BBJ.30 issues.JPG
I used the below to apply my model:

bh.90.hr0.Grid <- ds(bh.90, transect = "point", key="hr", adjustment = NULL,
                cutpoints = mybreaks, truncation = trunc.list, formula = ~Region.Label)

And then applied the stratification as you showed above:

bh.90.hr0.Grid.dens <- dht2(bh.90.hr0.Grid, flatfile=bh.90, strat_formula = ~Region.Label,
                            er_est = "P2", convert_units = conversion, stratification = 'replicate')

Then produced the density report:

print(bh.90.hr0.Grid.dens, report="density")

Thanks,
Jamie

Jamie McKaughan

unread,
Jun 1, 2021, 6:31:13 AMJun 1
to distance-sampling
Hi Eric,

I quit restarted R and reran all my code and unfortunately the same result occurs. Does my code look correct?

Many thanks
Jamie

Eric Rexstad

unread,
Jun 1, 2021, 7:29:29 AMJun 1
to Jamie McKaughan, distance-sampling

Jamie

I haven't spotted problems with your code.  If you want to send more details to me offline, I might be able to have a look tomorrow.

Reply all
Reply to author
Forward
0 new messages