finicky distance breaks


Philip Shirk

Dec 6, 2017, 7:24:59 PM
to unmarked
I'm simulating some data and analyzing it in unmarked, and I noticed that the analysis is often (but not always) extremely sensitive to the distance breaks. Below is some example code. When I bin the data into bins that seem perfectly sensible to me, the analysis fails miserably, but if I use the same distance breaks as most of the vignette examples (i.e. c(0,5,10,15,20)), then the estimates are quite sensible. What's going on?

E.g.

library('unmarked')

y <- data.frame(transect = factor(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10)),distance=c(5.54,2.8,1.87,11.43,4.16,12.3,2.37,3.65,11.05,10.94,4.62,13.61,11.48,4.9,8.26,4.1,1.49,1.28,9.15,0.4,4.2,0.2,5.81,1.86,12.9,4.27,5.96,4.67,4.22,10.22,6.22,8.37,9.62,1.04,0.6,7.04,7.55,5.46,4.55,1.17,5.48,0.8,9.86,3.95,13.09,9.55,10.69,13.07,5.47,10.13,7.8,0.9,8.98,1.83,5.11,12.45,4.15,11.97,9.52,2.79,14.45,8.22,12.47,3.94,12.48,6.52,11.71,3.6,12.3,7.17,2.58,10.75,5.38,13.26,11.82,3.51,5.59,7.53,10.94,4.11,0.15,14.11,3.32,4.63,13.52,0.45,0.22,9.07,2.18,3.16,6.62,0.41,8.54,8.53,1.14,5.16,9.6,6.51,4.22,3.47,6.28,6.44,3.13,10.58,3.2,0.61,14.48,5.84,5.44,9.31,2.66,5.76,0.89,6.93,1.42,14.14,0.82,1.94,7.03,12.11,5.3,2.1,11.14,9.83,0.86,10,0.2,11.47,7.7,2.85,1.12,8.52,2.87,4.86,3.55,12.44,1.32))

# number of distance-break categories
cats <- 5

# Two options for distance breaks. The first doesn't work for me; the
# second does. (The second assignment overwrites the first, so comment
# out whichever one you don't want.)
breaks <- c(0, round((max(y$distance)+1)/cats)*(1:cats))  # c(0,3,6,9,12,15) here
breaks <- c(0,5,10,15,20)

# format the data
unmk.dat <- formatDistData(distData = y,
                           distCol="distance",
                           transectNameCol="transect",
                           dist.breaks = breaks)

umf <- unmarkedFrameDS(y=unmk.dat,
                       siteCovs=NULL,
                       survey="line",
                       dist.breaks=breaks,
                       tlength=rep(100,10),
                       unitsIn="m")

# Analyze the data (double formula: detection covariates first, then density)
hn_Null <- distsamp(formula = ~1 ~1,
                    data = umf,
                    keyfun="halfnorm",
                    output="density",
                    unitsOut="ha")

# extract density estimate & calculate bias
unmarkeddensity <- backTransform(hn_Null, type="state")
unmarkeddensity_bias <- (unmarkeddensity@estimate - 60)/60

# extract estimate of detection parameter & calculate bias
unmarkedsigma <- backTransform(hn_Null, type='det')
unmarkedsigma_bias <- (unmarkedsigma@estimate - 10)/10

unmarked.estimates <- data.frame(Item = c("density", "sigma"),
                                 True = c(60, 10),
                                 Estimated = c(unmarkeddensity@estimate, unmarkedsigma@estimate),
                                 Bias = c(unmarkeddensity_bias, unmarkedsigma_bias))

unmarked.estimates

Jeffrey Royle

Dec 6, 2017, 7:28:21 PM
to unma...@googlegroups.com
hi Phil,
what do you mean by "fails miserably"?

--
You received this message because you are subscribed to the Google Groups "unmarked" group.
To unsubscribe from this group and stop receiving emails from it, send an email to unmarked+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Dan Linden

Dec 6, 2017, 8:51:43 PM
to unmarked
It looks like the sigma estimate is really poor without that last break.  If you add 18 to the first set of breaks, the estimates work.  If you remove 20 from the second set of breaks, the estimates no longer work.  I think this is because the distribution of distances is pretty flat, so a half-normal detection function isn't estimated well without an additional bin that confirms the detection rate drops off beyond 15 units.
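The intuition here can be sketched in a few lines of base R, assuming the simulated sigma = 10 (for these data the break rule in the original post gives c(0, 3, 6, 9, 12, 15)). The expected share of detections per bin is nearly flat out to 15, while the vignette-style breaks include a (15,20] bin where the expected share has visibly dropped:

```r
# Expected share of detections per distance bin under a half-normal
# detection function with sigma = 10 (the simulated truth); a bin's
# share is proportional to the integral of g over that bin
sigma <- 10
g <- function(x) exp(-x^2 / (2 * sigma^2))

bin_props <- function(breaks) {
  cell <- sapply(seq_len(length(breaks) - 1), function(i)
    integrate(g, breaks[i], breaks[i + 1])$value)
  round(cell / sum(cell), 3)
}

bin_props(c(0, 3, 6, 9, 12, 15))  # fairly flat across bins
bin_props(c(0, 5, 10, 15, 20))    # clear drop-off into the last bin
```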

Philip Shirk

Dec 7, 2017, 10:53:44 AM
to unmarked
It does run, but the estimates (particularly the estimate of sigma) are very biased.

Richard Chandler

Dec 7, 2017, 11:05:21 AM
to Unmarked package
Hi Philip,

To assess bias, you would need to simulate many datasets. Have you tried that? 

Also, I think Dan identified the key issue. Your second set of break points includes the distance interval (15,20], whereas the first one has a maximum distance of 15. As a result, the second one is essentially using more data, because it tells the model that there were no detections between 15 and 20 m. This provides a lot of information about sigma that is not available when you use a maximum distance of 15. To do a fair comparison, you would need to use the same maximum distance in each scenario. 
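This point can be illustrated with a toy maximum-likelihood fit outside of unmarked (hypothetical counts, not the posted data; the functions here are just for illustration). The observed-empty (15,20] bin pins sigma down, while truncating at 15 leaves sigma only weakly identified:

```r
# Toy ML fit of half-normal sigma to binned line-transect distances,
# conditional on detection within the strip (hypothetical counts)
g <- function(x, sigma) exp(-x^2 / (2 * sigma^2))

negloglik <- function(sigma, counts, breaks) {
  cell <- sapply(seq_along(counts), function(i)
    integrate(g, breaks[i], breaks[i + 1], sigma = sigma)$value)
  -sum(counts * log(cell / sum(cell)))   # multinomial cell probabilities
}

counts <- c(40, 35, 30)  # gently declining, like the posted distances

# Max distance 15: nothing says detections drop off beyond 15
fit15 <- optimize(negloglik, c(1, 100), counts = counts,
                  breaks = c(0, 5, 10, 15))

# Same counts plus an observed-empty (15,20] bin
fit20 <- optimize(negloglik, c(1, 100), counts = c(counts, 0),
                  breaks = c(0, 5, 10, 15, 20))

c(sigma_trunc15 = fit15$minimum, sigma_trunc20 = fit20$minimum)
```

With the empty bin, sigma-hat lands near the true single digits; without it, the likelihood surface is nearly flat and sigma-hat drifts much higher.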

Richard







--
Richard Chandler
Assistant Professor
Warnell School of Forestry and Natural Resources
University of Georgia

Philip Shirk

Dec 7, 2017, 11:17:01 AM
to unmarked
I noticed that the distance data look fairly flat (at a glance), but I'm not convinced that entirely explains the issue. With the data I'm simulating, it only seems to be an issue when the sample size is between roughly 122 and 148 observations. Obviously, there might be an error somewhere else in my code (but I haven't been able to figure out where...). When simulating a lot of datasets (in the same way I produced the dataset in my original post) and analyzing the results, the bias looks like this:


(Note that I capped bias at 2 so that the blue gam() line fit on the plot.)
This morning I tried a few things to see if I could fix it:
1) In the above example I was truncating the furthest 10% of data. I decreased that to 1%.
2) I included one bin beyond the furthest observation in each dataset, by changing the line defining the breaks to:
        breaks <- c(0, round((max(y$distance+1)/cats)*(1:(cats+1))))
3) I included a wider range of sample sizes, just to see what would happen. As you can see, the proportion of highly biased estimates decreased greatly, but there are still some, and again only in the N = ~120-140 range...
[Auto-generated inline images (the bias plots) not included]
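For what it's worth, the revised break rule in (2) does guarantee an empty top bin: the second-to-last break, round(max(d)+1), always exceeds max(d). A quick check with a few made-up distances:

```r
# Check the revised break rule from point (2) with hypothetical distances
d <- c(5.54, 2.8, 14.48, 9.31, 0.2)
cats <- 5

breaks <- c(0, round((max(d + 1) / cats) * (1:(cats + 1))))
breaks                                # 0 3 6 9 12 15 19

# The last bin (15,19] lies entirely above the largest observed distance,
# so it is guaranteed to contain no observations
max(d) < breaks[length(breaks) - 1]   # TRUE
```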

Dan Linden

Dec 7, 2017, 9:12:29 PM
to unmarked
Hey Phil, can you share your simulation code?

Under a half-normal distribution with sigma = 10, if you truncated at 15 you could be removing >13% of potential data points: pnorm(15,0,10,lower.tail=F)*2

Not sure why you would want to do that, but it seems fixing this has greatly reduced the relative error.
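That tail-area calculation checks out, and the same one-liner shows how quickly the discarded fraction shrinks as the truncation distance grows (sigma = 10 as in the simulation):

```r
# Fraction of half-normal perpendicular distances (sigma = 10) falling
# beyond a truncation distance w: P(|X| > w) with X ~ N(0, sigma)
sigma <- 10
trunc_frac <- function(w) 2 * pnorm(w, mean = 0, sd = sigma, lower.tail = FALSE)

round(c(w15 = trunc_frac(15), w20 = trunc_frac(20)), 3)  # 0.134, 0.046
```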

Philip Shirk

Dec 8, 2017, 11:21:21 AM
to unmarked
Sure, I'll email it to you. And sorry for all the imprecise wording in my posts. The example dataset I posted above was already truncated, so no need to do it again. The reason I'm truncating is that I was trying to mimic common practice in distance sampling, but perhaps that's counter-productive when the distance data are *actually* normally distributed?

Dan Linden

Dec 8, 2017, 4:54:18 PM
to unmarked

Yeah, that truncating is for removing outliers that might otherwise cause trouble for the distance function.  But truncating too much would cause trouble also, as we've seen.

So I took a quick look at the code -- you've got quite a good setup for the simulation and model fitting.  But you looked at a very dense range of density values, and had only been simulating 5 iterations for each one.  That number of iterations is much too small, particularly for trying to assess bias.

In addition, you were taking relative error and turning it into absolute error.  You can't assess bias with absolute error because the values are constrained to be positive.

I've attached a version here with 125 iterations per density value (which is still low).  Not sure what the small wave is between 100 and 150, but maybe it has something to do with the simulation settings.  Either way, there doesn't appear to be bias here, except at small values.
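The relative-vs-absolute-error point can be seen with a tiny simulation (made-up unbiased estimates around the true density of 60, not the thread's simulation): the mean relative error is near zero, but the mean absolute error stays positive no matter how unbiased the estimator is.

```r
# Mean relative error vs mean absolute relative error for an estimator
# that is unbiased by construction
set.seed(1)
truth <- 60
est   <- rnorm(1e4, mean = truth, sd = 10)

rel_err <- (est - truth) / truth
c(mean_relative = mean(rel_err),        # ~0: correctly shows no bias
  mean_absolute = mean(abs(rel_err)))   # positive even with zero bias
```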