Limits to the size of dataset ctmm can handle?


Jillian Rutherford

Sep 24, 2020, 5:48:34 PM
to ctmm R user group
Greetings!

I am attempting to use the ctmm package to perform AKDE home-range estimation on very large GPS datasets, and I am receiving error messages related to insufficient memory.

For instance, I have a telemetry object with 87,211 obs. of 8 variables, and when I try to generate a variogram using the variogram() function with fast=FALSE, I get the message: "Error: cannot allocate vector of size 56.7 Gb".

If I accept that I cannot run fast=FALSE and instead opt for fast=TRUE, the operation succeeds, but then I am met with a similar "cannot allocate vector of size X" error when I attempt to use the akde() function.

These error messages occur even when running the script on a server with 250 GB of RAM, and the help documentation I have consulted leads me to believe the issue is with R's ability to allocate contiguous blocks of memory.

My question is: are there known limits to the size of dataset/telemetry object that can be successfully analyzed with the functions in the ctmm package? And are there any good workarounds for dealing with very large datasets? I am hesitant to subset or thin the data in any way, because the purpose of my research is to compare results when different characteristics of the data are perturbed, and I would therefore like to include a full estimate of the original dataset.

Thank you very much in advance for your time and help!
Jillian

Christen Fleming

Sep 24, 2020, 7:13:43 PM
to ctmm R user group
Hi Jillian,

There are efficient algorithms in the package for almost every calculation, though there are options and complications that can cause trouble. I try to document this as much as possible in the individual help files.

variogram(...,fast=FALSE) is an inherently O(n²) calculation; it is vectorized for speed, but it is never going to work well on larger datasets. O(n²) means that the computational cost (and, in this case, memory) scales with the square of the amount of data. However, the default variogram(...,fast=TRUE) gives exactly the same result if there is no irregularity in the sampling schedule beyond missingness, and even when there is irregularity in the data, the differences are often quite small in larger datasets. variogram(...,fast=FALSE) is more for squeezing blood out of smaller, lower-quality datasets.
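
For scale: a single 87,211 × 87,211 matrix of double-precision numbers is 87,211² × 8 bytes ≈ 56.7 GB, which lines up with the allocation R refused. If your sampling is close to regular, a minimal sketch of the fast calculation would look like this (DATA is a placeholder name for your telemetry object):

library(ctmm)

# DATA is a placeholder for a single telemetry object
SVF <- variogram(DATA, fast=TRUE)  # the FFT-based O(n log n) estimator mentioned above
plot(SVF)                          # inspect the empirical variogram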

An opposite problem that people sometimes have with variogram(...,fast=TRUE) arises when their data contain something like 1 Hz bursts separated by huge gaps. variogram(...,fast=TRUE) is O(n log n), but it involves constructing a discrete time grid over the entire sampling period to perform an FFT on. In that case you can also run out of memory, and you need to invoke the dt option to choose a larger sampling interval on which to grid the dataset, or to break up the dataset and average the individual variograms together.
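
A minimal sketch of the dt workaround, with a purely placeholder interval of one hour (set dt near your nominal, non-burst sampling interval; DATA is again a stand-in for your telemetry object):

library(ctmm)

# coarsen the FFT time grid to hourly lags (placeholder value) so the grid stays small
SVF <- variogram(DATA, fast=TRUE, dt = 1 %#% "hour")  # %#% converts 1 hour to seconds
plot(SVF)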

With akde(), the default res option is larger than necessary, for aesthetic reasons. I'm kind of surprised that you ran out of memory with a single individual? But you should be able to safely go from res=10 down to res=1 without issue, and that would decrease the memory cost by a factor of 100. Tell me if that works here and I can take a look at tweaking the default options.
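
For example, a sketch under the assumption that DATA is your telemetry object and FIT is its fitted movement model (both names are placeholders):

library(ctmm)

# fit a movement model first if you haven't already (can be slow on ~87k locations)
GUESS <- ctmm.guess(DATA, interactive=FALSE)
FIT <- ctmm.select(DATA, GUESS)

# res=1 uses a grid 10x coarser per axis than the default res=10, so ~100x less memory
UD <- akde(DATA, FIT, res=1)
summary(UD)
plot(DATA, UD=UD)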

For akde() with multiple individuals calculated simultaneously on the same grid, it's easier to run out of memory, because I need to rewrite that code to be more efficient. But nobody has raised a real issue there yet, so I haven't put priority on it.

Best,
Chris

Jillian Rutherford

Sep 30, 2020, 12:59:01 PM
to ctmm R user group
Hello Chris,

It worked!! Thank you!! I was able to fiddle with res and get it up to 5, which provided a decent visualization while also not throwing an error. 

Also, great to know that fast=TRUE is an okay choice given the size of my admittedly irregular dataset. I have been running this only with a single individual, and plan to continue to do so in the future, so there's no pressure (at least from me, yet) to rewrite that part of the code any time soon.

Thanks again for your speedy and insightful reply,
Jillian
