akde with very large data sets


Marian

Jan 28, 2024, 7:28:01 PM
to ctmm R user group
Hi,

I've got some vulture GPS data at high (but variable) resolution, with birds ranging between just 8,000 (RIP) and 175,000 observations. My goal is to run iRSFs, but I'm hung up on the AKDE part.

I've had success running just the first month of observations for a bird: 11,000 observations take about 10 minutes. Scaling this up to the entire deployment period means that an average bird of around 85,000 points takes over 36 hours. Is there any way for me to speed this up? Do I just have to be patient?

Thanks,
Marian

Christen Fleming

Jan 28, 2024, 10:18:16 PM
to ctmm R user group
Hi Marian,

The model fitting algorithms are the slowest part of the calculation but are O(n), so if you have 10x the data, then it should only take 10x longer to calculate. Is it really taking 200x longer to fit 10x the amount of data? If so, I could profile that and try to see what the bottleneck is.
But for speed, you can parallelize over individuals in a foreach loop and fit one individual per physical CPU core. I would save the fit objects to separate RDA files as you go, because sometimes parallelized computation is weird in R.
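Something like this rough sketch, for example (tel_list here is an assumed named list of telemetry objects, one per bird, e.g. from as.telemetry()):

library(ctmm)
library(foreach)
library(doParallel)

# one worker per physical CPU core
registerDoParallel(cores = parallel::detectCores(logical = FALSE))

foreach(name = names(tel_list), .packages = "ctmm") %dopar% {
  DATA  <- tel_list[[name]]
  GUESS <- ctmm.guess(DATA, interactive = FALSE)
  FIT   <- ctmm.select(DATA, GUESS)
  # save each fit immediately, so a crashed or hung worker doesn't lose finished work
  save(FIT, file = paste0("fit_", name, ".rda"))
  NULL
}
stopImplicitCluster()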

Best,
Chris

Marian

Jan 29, 2024, 7:03:16 PM
to ctmm R user group
Interesting... this suggests, then, that I may have set something up very wrong somewhere. How does one go about profiling things?

I set a foreach loop running on the university's computing cluster, but as it's been running since Thursday on something like 90 cores, I'm guessing it has run afoul of whatever has been messing things up on my home computer. Unfortunately, I didn't set it up to save the objects as it goes, which I am certainly now regretting.

Christen Fleming

Jan 31, 2024, 10:25:38 AM
to ctmm R user group
Hi Marian,

I don't recommend that you profile the code yourself; that's more for the developer to diagnose the issue.
I do recommend saving individual fit objects as you go. That way, if something gums up in the loop (whether it's a bug or a bad dataset), you don't have to redo everything. This also lets you see your progress.

Best,
Chris

Jesse Alston

Jan 31, 2024, 2:27:52 PM
to Christen Fleming, ctmm R user group
You can also use array jobs (e.g., https://www.uwyo.edu/data-science/resources/knowledge-base/how-to-submit-array-of-tasks-to-slurm.html) to parallelize things in bash rather than in your R code itself, which I have found to be more flexible: if one animal fails or takes forever, the other ones will still finish, and you can later check out the problem animal on its own.

If you poke around on my GitHub, you should be able to find some bash files that take advantage of array jobs.
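For illustration, the R side of one array task might look roughly like this (a sketch, not my actual files; it assumes the telemetry objects were saved beforehand as telemetry_list.rds):

# fit_one.R -- one SLURM array task fits one animal
# submitted with something like: sbatch --array=1-20 --wrap="Rscript fit_one.R"
library(ctmm)

i <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID"))  # index supplied by SLURM

tel_list <- readRDS("telemetry_list.rds")  # assumed pre-saved list of telemetry objects
DATA <- tel_list[[i]]

GUESS <- ctmm.guess(DATA, interactive = FALSE)
FIT   <- ctmm.select(DATA, GUESS)

saveRDS(FIT, file = paste0("fit_", names(tel_list)[i], ".rds"))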

Jesse


Marian

Jan 31, 2024, 5:42:14 PM
to ctmm R user group
Thanks Jesse! I've only used SLURM once before, but it sounds like it could be the way to go here. I'll have to look into how to do it properly.

Chris, what would you need from me for the profiling? I'd love to figure out what is slowing this all down, although I'm afraid it's just going to end up being me setting something up very unwisely.

Christen Fleming

Feb 2, 2024, 4:31:32 PM
to ctmm R user group
Hi Marian,

If you have an individual that takes 200x longer to fit 10x the amount of data, then I just need the dataset and minimal script to reproduce that difference in computation time.

Best,
Chris

Marian

Feb 9, 2024, 12:15:09 PM
to ctmm R user group
Here's the theoretically working code and data. I'd be willing to bet money that this is simply me doing something very stupid. Thanks for taking a look at this!

adke large data set.zip

Christen Fleming

Feb 22, 2024, 5:41:27 PM
to ctmm R user group
Hi Marian,

Sorry for the delay.
You have a huge 500 m/s outlier in the data, and with sub-minute sampling intervals you probably also need a location-error model.
I would run the data through outlie() to filter out the outliers and then run a subset with and without an error model to see how much of a difference that makes.
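Roughly like this sketch (the 100 m/s cutoff is only illustrative, not a recommendation):

library(ctmm)

OUT <- outlie(DATA, plot = TRUE)  # per-fix outlier statistics
BAD <- OUT$speed > 100            # minimum speed (m/s) required to explain each fix
DATA <- DATA[!BAD, ]              # drop the flagged locations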

Best,
Chris

Marian

Mar 29, 2024, 10:33:51 AM
to ctmm R user group
Sorry for the delay; I've been doing some much less cool SSF stuff. I've cut out outliers with unreasonable speed and vertical speed, but unfortunately it hasn't helped the model. How would I go about fitting a location-error model? I've gone through my notes and I can't seem to find it. Thanks!

Christen Fleming

Apr 1, 2024, 12:04:07 AM
to ctmm R user group
Hi Marian,

There's vignette('error') in the package and a workshop script in the ctmm-learn materials.
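The basic pattern is roughly this (a sketch only; the 10 meter value is a hypothetical calibration, and calibrated error data are better when you have them):

library(ctmm)

uere(DATA) <- 10  # hypothetical: assign a 10 m RMS UERE to uncalibrated data

GUESS <- ctmm.guess(DATA, CTMM = ctmm(error = TRUE), interactive = FALSE)
FIT   <- ctmm.select(DATA, GUESS)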

Best,
Chris

Jessica Gorzo

Apr 26, 2024, 6:58:29 PM
to ctmm R user group
Hello,

I came across this post and am a new user of ctmm (and new to home range analysis in general). I have been starting to use outlie() to look at my data. In this thread I see...
"I would run the data through outlie() to filter out the outliers..."

I am trying to do just that; I was able to run my data through outlie() (with some doing, as it isn't Movebank based) and generate the plots showing outliers. Looking at the example, I can see that the blue line corresponds to high speed, and I was able to sort and find this record in the output of outlie(). I'm trying to use this to automatically flag points in my data, though. I looked at the outlie() source code but could only disentangle so much, because it calls a function that lives somewhere in a dependency I couldn't find. Anyway, it would be great if, alongside these plots, an outlier detection column could be added or something (e.g., flag the record that corresponds to the heavily weighted blue line). Maybe it already exists somewhere?

Anyway, if you have guidance on a workflow to automate using the output of outlie() that would be helpful!

Thanks,

Jesse Alston

Apr 29, 2024, 2:05:22 PM
to Jessica Gorzo, ctmm R user group
Hi Jessica,

If you haven't looked at this vignette yet, it has some helpful information. You can just add a column to your data that marks outliers, whereas in the vignette outliers are marked in stand-alone vectors.
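For example (a sketch with a made-up 50 m/s threshold):

OUT <- outlie(DATA, plot = FALSE)
DATA$outlier <- OUT$speed > 50  # flag fixes needing > 50 m/s to explain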

Jesse


Jessica Gorzo

Apr 29, 2024, 2:18:21 PM
to ctmm R user group
Thanks Jesse,

I do not see the vignette here, but I was able to "reverse engineer" this enough to get the outliers shown in the outlie() plots. Can you send me a link to the vignette? Maybe I reinvented the wheel here a bit, but might I suggest that the package simply add the lwd and cex values as columns in the output data frame of the outlie object, to make it easy to identify the points highlighted in the output plot?

Jesse Alston

Apr 29, 2024, 2:37:42 PM
to Jessica Gorzo, ctmm R user group

The points highlighted in the outlier plot can be identified using the code in the third code box in the outlier section of this vignette.

Jesse

Jessica Gorzo

Apr 29, 2024, 5:31:47 PM
to Jesse Alston, ctmm R user group
Thanks Jesse! I see now the example you're referring to. I stand by my suggestion to add lwd (prefaced with a 0, i.e., c(0, lwd)) and cex as columns in the data frame that is output from outlie().

Since the plot is automatically generated, it would allow the user to quickly and easily filter the data frame to see which points correspond to the red dots and blue lines. I'll see if I can put this as a suggestion through the proper channels on GitHub.

Thanks for your reply and cheers!