Why do kde estimates from ks and adehabitatHR packages in R produce different results? (with same bandwidth and grid)

671 views

Skip to first unread message

Brad Nissen

unread,

Feb 5, 2020, 2:56:41 AM2/5/20

to GIS In Ecology Forum

This question is highly R focused, but I have seen other R questions on this forum and I am hoping someone here can help me get to the root of this problem.

I have a series of VHF radio-telemetry data from animals that I am hoping to be able analyze in R. I have done my research on the different packages available for this, and decided that it appears utilizing the package ks to select a plug-in bandwidth method for all individuals will be most appropriate, based on previous research with my study species. This is reinforced in section 4.3 of this spatial ecology book and in this paper by Millspaugh et al (2006).

I have written two R functions, initially to check that I got similar results, and then hopefully select one to analyze the home ranges of my study animals. My results from the two functions were not as similar as I hoped, and that is what brings me to this forum - to figure out why. The first function relies entirely on the ks package, by using the kde() function and a pre-defined grid expanded by a factor of the bandwidth to get the volume of the Utilization distribution (UD). The UD was then sorted by high to low density. I calculated the size of a grid cell, then calculated the cumulative sum of the grid cell volumes. I then took the number of cells in this list needed to give a cumulative volume of 0.95 or 0.50 (depending on if I am interested in the 95% or 50% kde area) and then multiplied that number of cells by the cell size to get the area. Perhaps there is a flaw in my logic here that I am not seeing, if so please alert me.

The annotated R code for that function is here:

kde_ks.hpi <- function(filename, percentage, gridsize, multiplier){

data <- read.csv(file = filename)

x <- as.data.frame(data$X)

y <- as.data.frame(data$Y)

loc <- cbind(x,y)

data.h <-Hpi(loc)

x.grid.size<- (gridsize) #Set grid size as no. of nodes in the x direction

band.mult<- (multiplier)

x=seq(min(loc[,1])-band.mult*sqrt(data.h[1,1]), max(loc[,1]) + band.mult *sqrt(data.h[1,1]), length.out=x.grid.size)

y=seq(min(loc[,2])-band.mult*sqrt(data.h[2,2]), max(loc[,2])+ band.mult*sqrt(data.h[2,2]), by=(x[2]-x[1])) #sets out nodes in y axis, spaced the same as x axis

eval.pts <-expand.grid(x,y)

UD <-kde(loc,H=data.h,eval.points=eval.pts)

output<- data.frame(cbind(UD$eval.points[,1],UD$eval.points[,2],UD$estimate))

colnames(output)<- c("xcoord","ycoord","UD_ht")

cell.area <- (x[2]-x[1])*(y[2]-y[1])

grid.vol <- sum(cell.area*output[,3]) #check that grid UD vol is >=0.99

vol<-output[,3]/sum(output[,3]) #standardize volume to add to 1

output<-data.frame(cbind(output,vol))

output<-output[order(-output$vol),] #sort by descending volume

cumV<-cumsum(output$vol) #calculate cumulative UD volume

output<-data.frame(cbind(output,cumV)) #add cumulative vol to output

sub.UD<-subset(output,output$cumV<=(percentage/100)) #subset output by percent vol

cellcount <- nrow(sub.UD)

area <- cellcount*cell.area

KDE <- data.frame(c(area,grid.vol))

KDE

}

So as an example -

kde_ks.hpi("C01 .csv", percentage=50,gridsize =150,multiplier =3) #I will attach a stripped down version of this data file if anyone is interested in reproducing my results.

In this case - I found a 99% home range area of 3958 meter^2 for this individual.

My second function is largely based on code that I found from the Spatial Ecology textbook that I referenced above. This code transforms the bandwidth and kde from the ks package with rasters to utilize functions in the adehabitatHR function to easily get UD volumes and areas. The difference between my code and the one in the book is that I define the grid using the same methods as the first function. Yet, I don't get the same results, despite using the same grid and bandwidth as I did in the previous function. Below is the annotated code for that function:

CRS.SC <- CRS("+init=epsg:32616") #define study area projection first

kde_ade.ks.hpi <- function(filename, percentage, gridsize, multiplier){

data <- read.csv(file = filename) #read in data

x <- as.data.frame(data$X)

y <- as.data.frame(data$Y)

xy <- c(x,y)

loc <- cbind(x,y)

data.h <-Hpi(loc) #define bandwidth using plug-in method

data.proj <- SpatialPointsDataFrame(xy,data, proj4string = CRS.SC)

boundingVals <- data.proj@bbox #get the bounding values of the animal locations

band.mult <- (multiplier) #define the value used to expand the evaluation grid

y.expand <- band.mult*sqrt(data.h[2,2]) #set the value used to expand the grid by a function of the multiplier and the bandwidth

x.expand <- band.mult*sqrt(data.h[1,1]) #note that this expansion is different in the x and y direction

deltaX <- as.integer(((boundingVals[1,2]) - (boundingVals[1,1])) + (2*x.expand)) #get the total length of the grid axis

deltaY <- as.integer(((boundingVals[2,2]) - (boundingVals[2,1])) + (2*y.expand))

x.grid.size <- (gridsize) #set the number of grid nodes in the X direction

gridRes <- deltaX/x.grid.size #determine the grid resolution, (i.e the size of one side of a cell)

y.grid.size <- deltaY/gridRes #determine the number of nodes in the Y direction using the same cell size as the x axis

boundingVals[2,1] <- boundingVals[2,1] - y.expand #min Y - expand the in each direction grid by the previously determined value

boundingVals[2,2] <- boundingVals[2,2] + y.expand #max Y

boundingVals[1,1] <- boundingVals[1,1] - x.expand #min X

boundingVals[1,2] <- boundingVals[1,2] + x.expand #max X

gridTopo <- GridTopology((boundingVals[,1]), c(gridRes,gridRes),c(x.grid.size,y.grid.size)) #Grid Topology object is basis for sampling grid (offset, cellsize, dim)

sampGrid <- SpatialGrid(gridTopo, proj4string = CRS.SC) #Using the Grid Topology and projection create a SpatialGrid class

sampSP <- as(sampGrid, "SpatialPixels") #Cast over to Spatial Pixels

sampRaster <- raster(sampGrid) #convert the SpatialGrid class to a raster

sampRaster[] <- 1 #set all the raster values to 1 such as to make a data mask

evalPoints <- xyFromCell(sampRaster, 1:ncell(sampRaster)) #Get the center points of the mask raster with values set to 1

hpikde <- kde(x=loc, H=data.h, eval.points=evalPoints) #Create the KDE using the evaluation points

hpikde.raster <- raster(sampRaster) #Create a template raster based upon the mask and then assign the values from the kde to the template

hpikde.raster <- setValues(hpikde.raster,hpikde$estimate)

hpikde.px <- as(hpikde.raster,"SpatialPixelsDataFrame") #Cast over to SPxDF

hpikde.ud <- new("estUD", hpikde.px) #create new estUD using the SPxDF

hpikde.ud@vol = FALSE #Assign values to a couple slots of the estUD

hpikde.ud@h$meth = "Plug-in Bandwidth"

hpikde.ud.vol <- getvolumeUD(hpikde.ud, standardize=TRUE) #Convert the UD values to volume using getvolumeUD from adehabitatHR and cast over to a raster

hpikde.ud.vol.raster <- raster(hpikde.ud.vol)

hpikde.vol <- getverticeshr(hpikde.ud, percent = percentage,ida = NULL, unin = "m", unout = "m2", standardize=TRUE) #Here we generate volume contours using the UD

hpikde.vol$area #Determine UD area at that contour

}

So for example:

> kde_ade.ks.hpi(filename = "C01 .csv", percentage=50,gridsize=150,multiplier=3)

[1] 3965.43

So in this case - I found a 50% kde home range area of 3965 meter^2 for this individual. Not really that big a difference from the other value of 3958 m^2- but I would love to know where it comes from, as it the difference between the two functions can vary based on how I define the grid, multiplier, or percentage. At 99% the same data set produces a value of 67182 m^2 and 67453 m^2 for the first and second methods, respectively. In this case the second method produces a larger value both times, but when I try this with a different animal, the first method gives me larger areas. What is going on here, why is there a difference? Is one method better than another? What do you all recommend for moving forward?

I am hoping to use these functions and alter them to produce shapefiles that I can then map in GIS, once my methodology is settled.

Thank you very much for your time,

Brad Nissen

C01 .csv

GIS in Ecology

unread,

Feb 7, 2020, 5:09:56 AM2/7/20

to GIS In Ecology Forum

Hi Brad,

This is definitely a good question for this forum.

I don't use the KDE and UD functions in R much, but I can see two possible reasons for the discrepencies.

The first is due to the extent used to calculate the KDEs/UDs. As these are based on a grid structure, if you process the same data with different extents (that is different coordinates for the four corners of the grid that is generated), you cna get slightly different resutls. For this reason, it is always better to specify the exact extent that your outputs will use rather than using any default options.

The second is due to the function used to smooth the known densities into neighbouring areas. Different tools can use different default functions for this smoothing process (such as Gausian or quadratic) and this can result in different density estimates for specific grid cells for the outputs.

The way to check whether either of these are the reason for the differences you are observing is to read carefully through the help files/ducumentation or the R functions to see what the default options are and how these can be changed by additng additional arguments to your R code.

Either way, if you are calculating home range estimate sof any kind in R, I would always recommend visualising the outputs in GIS package like QGIS as this will allow you to examine and compare the spatial distirbution of your outputs. This makes it much easier to spot potential problems, trouble-shoot them and also to undderstand whether the differences you are getting between different methods of calculating the home ranges are providing very similar or very different understandings of the inidividual's home range. You may find that even though there are differences in the exact areas for the home ranges generated froim different tools, when you look at the actual home ranges, that these are not meaningful differences in terms of understanding where a specific individual regularly occurs.

Hope this helps. If if doesn't post back on this thread and I'll see what else I can suggest. If you do find a solution to this issue, could you post it here too as it will be of interest to others?

Thanks.

Colin

Brad Nissen

unread,

Feb 9, 2020, 8:16:37 AM2/9/20

to GIS In Ecology Forum

Hi Colin,

Thank you for taking the time to write out this well-thought out response. You are right about the extent and the function being two obvious reasons for why kde estimates could be different. However, in my case I specifically defined the grid extent and resolution, as well as the bandwidth - so I would assume we can at least rule those out as reasons for the difference. Notably, in the second method, I did not use that functions default method of defining the bandwidth and instead used the ks package to define the bandwidth with the plug-in method in both cases. In the second case, I converted this grid extent and hpi for use in the adehabitatHR package afterwards.

As for the actual function used in the smoothing process, that is certainly a possibility and I investigated into the R documentation, although to be honest I'm not sure I can entirely decipher what I have found. To someone a little more math savvy, the answer may be right in front of us. The documentation for the ks package reads as follows:

Details

There are three main types of functions in this package:

• computing kernel estimators - these function names begin with ‘k’

• computing bandwidth selectors - these begin with ‘h’ (1-d) or ‘H’ (>1-d)

• displaying kernel estimators - these begin with ‘plot’.

The kernel used throughout is the normal (Gaussian) kernel K. For 1-d data, the bandwidth h is the standard deviation of the normal kernel, whereas for multivariate data, the bandwidth matrix H is the variance matrix. –For kernel density estimation, kde computes:

The bandwidth matrix H is a matrix of smoothing parameters and its choice is crucial for the performance of kernel estimators.

For the adehabitatHR package - the documentation on the kernelUD function reads:

Details

"The Utilization Distribution (UD) is the bivariate function giving the probability density that an animal is found at a point according to its geographical coordinates. Using this model, one can define the home range as the minimum area in which an animal has some specified probability of being located. The functions used here correspond to the approach described in Worton (1995). The kernel method has been recommended by many authors for the estimation of the utilization distribution (e.g. Worton, 1989, 1995).

[Note: I looked into this paper and found this function:

...which does look different from the one given in the ks package documentation. How to interpret that - I'm not sure, and maybe this is our answer?]

The default method for the estimation of the smoothing parameter is the ad hoc method, i.e. for a bivariate normal kernel

h = σn− ^(1/6) where σ^2 = 0.5(var(x) + var(y)) which supposes that the UD is bivariate normal.

If an Epanechnikov kernel is used, this value is multiplied by 1.77 (Silverman, 1986, p. 86). Alternatively, the smoothing parameter h may be computed by Least Square Cross Validation (LSCV). The estimated value then minimizes the Mean Integrated Square Error (MISE), i.e. the difference in volume between the true UD and the estimated UD. Note that the cross-validation criterion cannot be minimized in some cases. According to Seaman and Powell (1998) "This is a difficult problem that has not been worked out by statistical theoreticians, so no definitive response is available at this time" (see Seaman and Powell, 1998 for further details and tricky solutions). plotLSCV allows to have a diagnostic of the success of minimization of the cross validation criterion (i.e. to know whether the minimum of the CV criterion occurs within the scanned range). Finally, the UD is then estimated over a grid."

Again, it is worth noting that I did not use this package to calculate the H value, so the sections on that can be ignored for my case - but it is worth noting that they are different.

That's all I have for now - I do plan to plot those areas that were calculated today in ArcGIS Pro, so once I do that - I will have a better idea of how large this difference actually is, and whether it should be considered biologically significant. If you think it would be helpful/relevant to folks, I can try posting some images from my results.

I will happily post what I find out here, I hope if anyone else has questions about this sort of thing that this post can be of help to them in some way. I have also cross-posted this in a couple other forums so if I hear useful anything back from those realms - I will be sure to share it here as well.

Thanks again for all the help,

Brad

Brad Nissen

unread,

Feb 11, 2020, 4:44:05 PM2/11/20

to GIS In Ecology Forum

Hi All,

I wanted to follow up that after much more testing of these two methods with a few different datasets - it seems that despite producing different results, the differences are subtle enough that it doesn't really warrant further investigation. I've added a couple charts indicating this - and I plan to use the second method that involves both the ks and adeHabitatHR package for my home range calculations (in the chart called "adeHR") - as this will allow me to easily write shapefiles from the outputs. I hope this is helpful to someone out there, and thank you for providing a forum to discuss this!