very different GOF results from two similar models (distsamp and gdistsamp) with identical coef estimates and SEs

413 views
Skip to first unread message

EMA

unread,
Mar 3, 2016, 6:33:16 PM3/3/16
to unmarked

Hi all,


This Group has been a huge help to me as I’ve attempted to learn unmarked—thanks to everyone who devotes time to helping folks like me.  Here’s the problem that finally spurred me to post a question:


I have distance sampling data for breeding birds from 140 point count stations that were surveyed four times during the breeding season.  Temporary emigration models (Chandler et al. 2011) seemed ideal for these data, but after a lot of attempts, I’ve decided to not model availability.  I have a good idea of true density from intensive territory mapping for one species of interest, but model estimates for that species are more than twice the “true” density.  The reason for this, I believe, is that replicate counts in this system (desert grasslands) vary widely in detections, which results in low estimates of phi (mean <0.5) despite any survey-level (“yearly site-level” in gdistsamp) covariates that I include. 


Instead, my approach until recently was to sum the detections from all surveys for each distance class and include an offset of the number of surveys in distsamp models.  Although I’ve never been confident about this approach, the estimates seem reasonable, and I have good model fit for most species.  After reading pertinent chapters in the new Kery and Royle and playing with the yellow wagtail data, I decided to try stacking my data and using gdistsamp on unmarked frames with no yearlySiteCovs.  This is appealing because it would allow me to try negative binomial mixtures and model survey-level covariates on the detection function.  I suspected that if I fit two models with the same abundance and detection covariates: 1) a distsamp model with summed data and an offset, and 2) a gdistsamp model with stacked data and no yearlySiteCovs, estimates would be very similar.  Sure enough, these two models returned identical parameter estimates and SEs.  The problem arises, however, when I run GOF tests on these two models.  Fit statistics for the distsamp model look good, but for the gdistsamp model, results are abysmal.


So this brings me to two questions:

1) Are there any obvious problems with my approach that are glaringly obvious?

2) If not, why are the fit statistics from two models with identical parameter estimates and SEs so different?


Hopefully, I’m just being dense and missing something obvious, but figuring it out seems to be beyond me at this point.  Below are some simplified code and outputs that (maybe) make this more clear.  I tried to alter the wagtail data a bit to reproduce my problem, but it seems that I can’t get a Poisson mixture with only site-level covariates (which I need to demonstrate my distsamp/offset approach) to return decent fit statistics.


Any help would be highly appreciated!

 

> # distance data from four surveys, binned into 8 distance classes (22m each)
> yDat2 <- formatDistData(ptct, distCol="meters", transectNameCol="plot", dist.breaks=seq(0,176,22), occasionCol="ct.num")
> str(yDat2)
 int [1:140, 1:32] 0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:140] "aw1" "aw2" "aw4" "aw400" ...
  ..$ : chr [1:32] "[0,22]1" "(22,44]1" "(44,66]1" "(66,88]1" ...
> 
> # sum distance data from each survey by distance classes (sum of all detections in each class )
> yDat <- formatDistData(ptct, distCol="meters", transectNameCol="plot", dist.breaks=seq(0,176,22))
> head(yDat)
      [0,22] (22,44] (44,66] (66,88] (88,110] (110,132] (132,154] (154,176]
aw1        0       2       1       1        1         1         2         0
aw2        0       1       3       0        3         1         1         0
aw4        1       1       3       2        1         1         1         0
aw400      0       0       3       3        3         2         1         0
aw402      0       0       0       0        0         1         1         1
aw403      0       0       1       1        1         0         1         0
> 
> # create umf with three site covariates: 2 continuous (scaled) and 1 to be used as an offset (number surveys = 4; not scaled)
> umf.offset <- unmarkedFrameDS(y=as.matrix(yDat),survey="point",dist.breaks=seq(0,176,22),unitsIn="m",
+                        siteCovs=data.frame(sc.scaled$cv.tr.mean,sc.scaled$gr.nat.nn.RA.dvrsty.PC1,sc.scaled$surveys))
> str(umf.offset)
Formal class 'unmarkedFrameDS' [package "unmarked"] with 9 slots
  ..@ dist.breaks: num [1:9] 0 22 44 66 88 110 132 154 176
  ..@ tlength    : num [1:140] NA NA NA NA NA NA NA NA NA NA ...
  ..@ survey     : chr "point"
  ..@ unitsIn    : chr "m"
  ..@ y          : int [1:140, 1:8] 0 0 1 0 0 0 0 0 0 1 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:140] "aw1" "aw2" "aw4" "aw400" ...
  .. .. ..$ : chr [1:8] "[0,22]" "(22,44]" "(44,66]" "(66,88]" ...
  ..@ obsCovs    : NULL
  ..@ siteCovs   :'data.frame':       140 obs. of  3 variables:
  .. ..$ sc.scaled.cv.tr.mean             : num [1:140] -0.85 -0.85 -0.85 -0.56 -0.353 ...
  .. ..$ sc.scaled.gr.nat.nn.RA.dvrsty.PC1: num [1:140] -0.116 -1.122 -1.321 -0.238 0.938 ...
  .. ..$ sc.scaled.surveys                : num [1:140] 4 4 4 4 4 4 4 4 4 4 ...
  ..@ mapInfo    : NULL
  ..@ obsToY     : num [1, 1:8] 1 1 1 1 1 1 1 1
> 
> # stack data
> ystacked<- rbind(yDat2[,1:8], yDat2[,9:16], yDat2[,17:24], yDat2[,25:32]) 
> str(ystacked)
 int [1:560, 1:8] 0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:560] "aw1" "aw2" "aw4" "aw400" ...
  ..$ : chr [1:8] "[0,22]1" "(22,44]1" "(44,66]1" "(66,88]1" ...
> 
> # umf with stacked data, no yearly site covariates, site covariates repeated 4 times
> cv.tr.mean <- rep(sc.scaled$cv.tr.mean,4)
> gr.nat.nn.RA.dvrsty.PC1 <- rep(sc.scaled$gr.nat.nn.RA.dvrsty.PC1,4)
> umf.stacked <- unmarkedFrameGDS(y = ystacked, survey="point", unitsIn="m",
+                                 dist.breaks=seq(0,176,22), numPrimary=1, 
+                                 siteCovs=data.frame(cv.tr.mean, gr.nat.nn.RA.dvrsty.PC1))
> str(umf.stacked)
Formal class 'unmarkedFrameGDS' [package "unmarked"] with 11 slots
  ..@ dist.breaks   : num [1:9] 0 22 44 66 88 110 132 154 176
  ..@ tlength       : num [1:560] 1 1 1 1 1 1 1 1 1 1 ...
  ..@ survey        : chr "point"
  ..@ unitsIn       : chr "m"
  ..@ numPrimary    : num 1
  ..@ yearlySiteCovs: NULL
  ..@ y             : int [1:560, 1:8] 0 0 0 0 0 0 0 0 0 0 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:560] "aw1" "aw2" "aw4" "aw400" ...
  .. .. ..$ : chr [1:8] "[0,22]1" "(22,44]1" "(44,66]1" "(66,88]1" ...
  ..@ obsCovs       : NULL
  ..@ siteCovs      :'data.frame':    560 obs. of  2 variables:
  .. ..$ cv.tr.mean             : num [1:560] -0.85 -0.85 -0.85 -0.56 -0.353 ...
  .. ..$ gr.nat.nn.RA.dvrsty.PC1: num [1:560] -0.116 -1.122 -1.321 -0.238 0.938 ...
  ..@ mapInfo       : NULL
  ..@ obsToY        : num [1, 1:8] 1 1 1 1 1 1 1 1
 
 
> ### fit distsamp model with offset
> (distMod <- distsamp(~cv.tr.mean
+                      ~ offset(log(surveys))+cv.tr.mean+I(cv.tr.mean^2)+gr.nat.nn.RA.dvrsty.PC1,
+                      umf, keyfun = "hazard", output = "density",unitsOut = "ha"))
Call:
distsamp(formula = ~cv.tr.mean ~ offset(log(surveys)) + cv.tr.mean + 
    I(cv.tr.mean^2) + gr.nat.nn.RA.dvrsty.PC1, data = umf, keyfun = "hazard", 
    output = "density", unitsOut = "ha")
 
Density:
                        Estimate     SE      z  P(>|z|)
(Intercept)              -0.5934 0.0543 -10.92 8.83e-28
cv.tr.mean                0.3168 0.0509   6.22 4.93e-10
I(cv.tr.mean^2)          -0.0631 0.0284  -2.22 2.64e-02
gr.nat.nn.RA.dvrsty.PC1   0.0754 0.0268   2.82 4.87e-03
 
Detection:
                 Estimate     SE      z  P(>|z|)
shape(Intercept)    4.625 0.0363 127.46 0.00e+00
shapecv.tr.mean    -0.156 0.0232  -6.71 1.97e-11
 
Hazard-rate(scale):
 Estimate   SE    z P(>|z|)
     1.24 0.08 15.6 1.5e-54
 
AIC: 3049.603 
> 
> ### fit gistsamp model with same covariates (no offset, no phi)
> (gdistMod <- gdistsamp(~ cv.tr.mean+I(cv.tr.mean^2)+gr.nat.nn.RA.dvrsty.PC1, 
+                        ~ 1, 
+                        ~ cv.tr.mean,
+                        data = umf.stacked, keyfun = "hazard", output = "density",unitsOut = "ha", 
+                        mixture = "P", K = 100, se = TRUE))
 
Call:
gdistsamp(lambdaformula = ~cv.tr.mean + I(cv.tr.mean^2) + gr.nat.nn.RA.dvrsty.PC1, 
    phiformula = ~1, pformula = ~cv.tr.mean, data = umf.stacked, 
    keyfun = "hazard", output = "density", unitsOut = "ha", mixture = "P", 
    K = 100, se = TRUE)
 
Abundance:
                        Estimate     SE      z  P(>|z|)
(Intercept)              -0.5934 0.0543 -10.92 8.78e-28
cv.tr.mean                0.3169 0.0509   6.22 4.90e-10
I(cv.tr.mean^2)          -0.0631 0.0284  -2.22 2.64e-02
gr.nat.nn.RA.dvrsty.PC1   0.0754 0.0268   2.82 4.87e-03
 
Detection:
            Estimate     SE      z  P(>|z|)
(Intercept)    4.625 0.0363 127.47 0.00e+00
cv.tr.mean    -0.156 0.0232  -6.71 1.95e-11
 
Hazard-rate(scale):
 Estimate   SE    z P(>|z|)
     1.24 0.08 15.6 1.5e-54
 
AIC: 5750.533 
> ### all coefficients and SEs identical
 
 
> ### Goodness of fit: distsamp model
> fitstatsDS <- function(distMod) {
+   observed <- getY(distMod@data)
+   expected <- fitted(distMod)
+   resids <- residuals(distMod)
+   sse <- sum(resids^2)
+   chisq <- sum((observed - expected)^2 / expected)
+   freeTuke <- sum((sqrt(observed) - sqrt(expected))^2)
+   out <- c(SSE=sse, Chisq=chisq, freemanTukey=freeTuke)
+   return(out)
+ }
> 
> (pbDS <- parboot(distMod, fitstatsDS, nsim=50, report=10))
Call: parboot(object = distMod, statistic = fitstatsDS, nsim = 50, report = 10)
 
Parametric Bootstrap Statistics:
               t0 mean(t0 - t_B) StdDev(t0 - t_B) Pr(t_B > t0)
SSE          1367         -35.63             76.5        0.686
Chisq        1075         -38.94             47.8        0.843
freemanTukey  467           1.99             16.2        0.451
 
t_B quantiles:
               0% 2.5%  25%  50%  75% 97.5% 100%
SSE          1191 1252 1343 1433 1457  1510 1537
Chisq         965 1027 1085 1112 1142  1205 1239
freemanTukey  405  437  456  466  474   489  508
 
t0 = Original statistic compuated from data
t_B = Vector of bootstrap samples
> ### distsamp with offset: pretty good fit
 
 
> ### Goodness of fit: gdistsamp model
> fitstatsGDS <- function(gdistMod) {
+   observed <- getY(gdistMod@data)
+   expected <- fitted(gdistMod)
+   resids <- residuals(gdistMod)
+   sse <- sum(resids^2)
+   chisq <- sum((observed - expected)^2 / expected)
+   freeTuke <- sum((sqrt(observed) - sqrt(expected))^2)
+   out <- c(SSE=sse, Chisq=chisq, freemanTukey=freeTuke)
+   return(out)
+ }
> 
> (pbGDS <- parboot(gdistMod, fitstatsGDS, nsim=50, report=1))
Call: parboot(object = gdistMod, statistic = fitstatsGDS, nsim = 50, report = 1)
 
Parametric Bootstrap Statistics:
                t0 mean(t0 - t_B) StdDev(t0 - t_B) Pr(t_B > t0)
SSE           1682            291             44.8         0.00
Chisq        49155          -2981           1188.3         0.98
freemanTukey  1051            175             18.5         0.00
 
t_B quantiles:
                0%  2.5%   25%   50%   75% 97.5%  100%
SSE           1293  1298  1356  1400  1423  1467  1471
Chisq        49524 50060 51111 52365 52971 54279 55035
freemanTukey   839   842   862   876   890   908   923
 
t0 = Original statistic compuated from data
t_B = Vector of bootstrap samples
> ### gdistsamp with stacked data: poor fit
 
 

 

Jeffrey Royle

unread,
Mar 3, 2016, 6:41:22 PM3/3/16
to unma...@googlegroups.com
Could you check the observed, fitted and residuals objects defined in the fitStats functions to confirm that they're all the same (or not)? 
Possibly one of the utility functions producing the fitted values, or the data, has an error...

--
You received this message because you are subscribed to the Google Groups "unmarked" group.
To unsubscribe from this group and stop receiving emails from it, send an email to unmarked+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

EMA

unread,
Mar 4, 2016, 3:35:44 PM3/4/16
to unmarked
Hi Andy,

Thanks for looking into this.  The fitstats functions are coded the same way, but the object values are different.  I would expect them to be different because in the distsamp/offset model, Y is a 140 row (stations) by 8 column (distance classes) matrix where detections from all four counts are summed by distance class, while the gdistsamp/stacked Y matrix is 560x8.  Here are the first few rows of the observed, expected and residual objects for each model:

> head(observed <- getY(distMod@data))
      [0,22] (22,44] (44,66] (66,88] (88,110] (110,132] (132,154] (154,176]
aw1        0       2       1       1        1         1         2         0
aw2        0       1       3       0        3         1         1         0
aw4        1       1       3       2        1         1         1         0
aw400      0       0       3       3        3         2         1         0
aw402      0       0       0       0        0         1         1         1
aw403      0       0       1       1        1         0         1         0
> head(observed <- getY(gdistMod@data))
      [0,22]1 (22,44]1 (44,66]1 (66,88]1 (88,110]1 (110,132]1 (132,154]1 (154,176]1
aw1         0        1        1        1         0          0          1          0
aw2         0        0        0        0         0          0          0          0
aw4         0        0        2        1         0          1          1          0
aw400       0        0        2        2         0          1          0          0
aw402       0        0        0        0         0          0          1          0
aw403       0        0        0        0         0          0          0          0
> head(expected <- fitted(distMod))
          [,1]      [,2]     [,3]     [,4]     [,5]     [,6]     [,7]      [,8]
[1,] 0.2430734 0.7292201 1.215255 1.661171 1.793406 1.557631 1.227865 0.9443200
[2,] 0.2253136 0.6759408 1.126464 1.539800 1.662374 1.443825 1.138153 0.8753247
[3,] 0.2219546 0.6658639 1.109671 1.516845 1.637591 1.422301 1.121185 0.8622754
[4,] 0.2709168 0.8127504 1.354182 1.823019 1.879913 1.570436 1.210270 0.9191614
[5,] 0.3199107 0.9597321 1.598557 2.120951 2.112475 1.718878 1.305408 0.9835162
[6,] 0.2522307 0.7566920 1.261037 1.723752 1.860969 1.616312 1.274122 0.9798953
> head(expected <- fitted(gdistMod))
            [,1]       [,2]       [,3]       [,4]       [,5]       [,6]       [,7]       [,8]
[1,] 0.006244105 0.01873232 0.03121765 0.04267284 0.04607126 0.04001549 0.03154427 0.02426006
[2,] 0.005787910 0.01736373 0.02893689 0.03955515 0.04270529 0.03709195 0.02923964 0.02248762
[3,] 0.005701628 0.01710488 0.02850552 0.03896549 0.04206867 0.03653901 0.02880375 0.02215239
[4,] 0.006959499 0.02087850 0.03478717 0.04683155 0.04829447 0.04034491 0.03109246 0.02361382
[5,] 0.008218166 0.02465450 0.04106524 0.05448567 0.05426928 0.04415850 0.03353655 0.02526704
[6,] 0.006479328 0.01943798 0.03239366 0.04428037 0.04780682 0.04152292 0.03273258 0.02517397
> head(resids <- residuals(distMod))
          [0,22]    (22,44]    (44,66]    (66,88]   (88,110]  (110,132]  (132,154]   (154,176]
aw1   -0.2430734  1.2707799 -0.2152549 -0.6611709 -0.7934058 -0.5576310  0.7721352 -0.94431998
aw2   -0.2253136  0.3240592  1.8735357 -1.5398001  1.3376265 -0.4438252 -0.1381528 -0.87532475
aw4    0.7780454  0.3341361  1.8903291  0.4831554 -0.6375907 -0.4223006 -0.1211852 -0.86227535
aw400 -0.2709168 -0.8127504  1.6458184  1.1769806  1.1200875  0.4295644 -0.2102700 -0.91916144
aw402 -0.3199107 -0.9597321 -1.5985565 -2.1209508 -2.1124748 -0.7188783 -0.3054082  0.01648382
aw403 -0.2522307 -0.7566920 -0.2610372 -0.7237522 -0.8609688 -1.6163117 -0.2741222 -0.97989535
> head(resids <- residuals(gdistMod))
           [0,22]1    (22,44]1    (44,66]1    (66,88]1   (88,110]1  (110,132]1  (132,154]1  (154,176]1
aw1   -0.006244105  0.98126768  0.96878235  0.95732716 -0.04607126 -0.04001549  0.96845573 -0.02426006
aw2   -0.005787910 -0.01736373 -0.02893689 -0.03955515 -0.04270529 -0.03709195 -0.02923964 -0.02248762
aw4   -0.005701628 -0.01710488  1.97149448  0.96103451 -0.04206867  0.96346099  0.97119625 -0.02215239
aw400 -0.006959499 -0.02087850  1.96521283  1.95316845 -0.04829447  0.95965509 -0.03109246 -0.02361382
aw402 -0.008218166 -0.02465450 -0.04106524 -0.05448567 -0.05426928 -0.04415850  0.96646345 -0.02526704
aw403 -0.006479328 -0.01943798 -0.03239366 -0.04428037 -0.04780682 -0.04152292 -0.03273258 -0.02517397

Any thoughts?

Thanks

Jeffrey Royle

unread,
Mar 5, 2016, 1:29:46 PM3/5/16
to unma...@googlegroups.com
those fitted values look really goofy for the distmod case.....

EMA

unread,
Mar 7, 2016, 1:55:40 PM3/7/16
to unmarked
Yeah, the distMod fitted values are all 38.93 times larger than the gdistMod fitted values.  Could the problem be related to how fitted() deals with the offset?  I've noticed that something funny happens when I use predict() with one of my models that includes an offset on the number of surveys.  Here's a simplified model based on the same umf to demonstrate:

> d.off <- distsamp(~1 ~ 1+ offset(log(surveys)),
+                   umf.offset, keyfun = "hazard", output = "density",unitsOut = "ha")
> d.no.off <- distsamp(~1 ~ 1,
+                      umf.offset, keyfun = "hazard", output = "density",unitsOut = "ha")
> backTransform(d.no.off, type="state")
Backtransformed linear combination(s) of Density estimate(s)

 Estimate    SE LinComb (Intercept)
     2.18 0.113    0.78           1

Transformation: exp 
> backTransform(d.off, type="state")
Backtransformed linear combination(s) of Density estimate(s)

 Estimate     SE LinComb (Intercept)
    0.545 0.0283  -0.606           1

Transformation: exp 
> mean(predict(d.no.off, type="state")$Predicted)
[1] 2.1816
> mean(predict(d.off, type="state")$Predicted)
[1] 2.181601
> mean(predict(d.off, type="state",newdata=data.frame(surveys=1))$Predicted)
[1] 0.5454003






Reply all
Reply to author
Forward
0 new messages