Increasing processing speed of ctmm.select(), speed(), speeds()


jenny....@gmail.com

Feb 4, 2021, 1:05:19 PM
to ctmm R user group
Hello! 

I am trying to use ctmm models to estimate distance and speed for >1800 foraging trips of a seabird. My code (developed during Animove 2018) works fairly well but is quite slow: I've processed fewer than 1/3 of these trips in a week. I've tried adjusting the cores argument in ctmm.select(), speed(), and speeds() to take advantage of the computer's 16 cores, but I still only see one core being used per RStudio window. Is there anything else I can do to speed up this processing? My code is below.

DATA.SUBSET is an individual trip's data and FIT[[MIN]] is the best model output from ctmm.select(). In speeds() I switched from prior = FALSE, fast = FALSE to prior = FALSE, fast = TRUE, which sped up processing, but not as much as I hoped. I just realized I have prior = TRUE and no fast argument in speed(). Which is more accurate while still fast: prior = TRUE with fast = TRUE, or prior = FALSE with fast = TRUE? The number of cores varies in the code below because it didn't seem to affect processing speed.

GUESS <- ctmm.guess(DATA.SUBSET,
                    variogram = variogram(input_total),
                    interactive = FALSE)

FIT <- ctmm.select(DATA.SUBSET,
                   CTMM = GUESS,
                   method = "pHREML",
                   control = list(method = "pNewton"),
                   level = 1, # 1 means it will always attempt a simpler model
                   verbose = TRUE, # return the sorted list of all attempted models
                   MSPE = "velocity", # reject non-stationary features that increase velocity error
                   IC = "AICc", # information criterion used for selection
                   trace = TRUE, # report progress updates
                   cores = 8) # number of cores to use during model fitting

ctmm_speed <- ctmm::speed(object = DATA.SUBSET,
                          CTMM = FIT[[MIN]],
                          robust = TRUE, # use robust (median-based) statistics, as the CI hits the 0 boundary
                          units = FALSE, # report everything in consistent SI units
                          prior = TRUE, # simulation approach; can take a long time for large tracks
                          level = 0.95, # confidence level to report
                          cores = 2)

ctmm_speeds <- ctmm::speeds(object = DATA.SUBSET,
                            CTMM = FIT[[MIN]],
                            robust = TRUE, # use robust (median-based) statistics, as the CI hits the 0 boundary
                            units = FALSE, # report everything in consistent SI units
                            prior = FALSE, # simulation approach; can take a long time for large tracks
                            fast = TRUE,
                            level = 0.95, # confidence level to report
                            cores = 10) # number of cores to use

Thanks in advance for your help!

Jenny

  --
Jenny Howard
she/her/hers
PhD Candidate
Anderson Lab
Wake Forest University

Christen Fleming

Feb 4, 2021, 5:00:54 PM
to ctmm R user group
Hi Jenny,

I have not forgotten you! I had an interview for the quantitative ecologist position at Wake Forest about a month ago and your PI was on the committee. I've been slowly improving these methods and still plan to incorporate segmentation and behavioral switching into the package—it's just not in the current grant, unfortunately, and I have other deliverables to prioritize.

If you are running on Windows, the cores arguments mostly don't do anything, because Windows can't fork processes and socket parallelization can actually slow things down. They should work on UNIX-like systems (Linux or macOS). However, with a large project like this, you don't want to be parallelizing at that level anyway. What you probably want to do is parallelize over your ~1800 trips, such as with a foreach loop. If 1/3 of them take a week on one core, then the entire dataset should take only ~1.4 days total when distributed over 15 of the 16 cores in a foreach loop. [Also, I always make sure to progressively save my results in large loops like that.]
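A minimal sketch of the foreach approach with progressive saving, assuming the trips are stored in a list (TRIP.LIST and the "fits" output directory are hypothetical names, not from the original code):

```r
library(doParallel) # also loads foreach and parallel
library(ctmm)

dir.create("fits", showWarnings = FALSE)
cl <- makeCluster(15) # leave one core free
registerDoParallel(cl)

FITS <- foreach(i = seq_along(TRIP.LIST), .packages = "ctmm") %dopar% {
  DATA.SUBSET <- TRIP.LIST[[i]]
  GUESS <- ctmm.guess(DATA.SUBSET, interactive = FALSE)
  FIT <- ctmm.select(DATA.SUBSET, CTMM = GUESS, verbose = TRUE)
  # progressively save, so a crash doesn't lose finished trips
  saveRDS(FIT, file.path("fits", paste0("trip_", i, ".rds")))
  FIT
}

stopCluster(cl)
```

Each iteration is independent, so previously saved trips can be skipped on a restart by checking whether the .rds file already exists.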

You don't want to adjust the prior argument 99% of the time (maybe I should even remove that option). fast=TRUE is fine as long as it works, but it can fail if parameters are too close to their boundaries, where the central limit theorem does not hold.

You might try fitting one model to all of the foraging trips (sans nesting/colony data) and then using that as the starting guess for the subset fitting. That might get you closer to the most likely parameters for the subset.
If the subsetted movement models appear similar (within uncertainty) across foraging trips, you might also consider just using the one model (fit to all trips). I don't recall enough details of your data to remember if this makes sense.
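A rough sketch of that pooled-fit-as-guess idea (ALL.TRIPS and TRIP.LIST are placeholder names, not from the original code):

```r
library(ctmm)

# Fit one model to all foraging-trip data pooled together
GUESS.ALL <- ctmm.guess(ALL.TRIPS, interactive = FALSE)
FIT.ALL <- ctmm.select(ALL.TRIPS, CTMM = GUESS.ALL, trace = TRUE)

# Per-trip fits, initialized from the pooled model instead of ctmm.guess()
FIT.TRIP <- ctmm.select(TRIP.LIST[[1]], CTMM = FIT.ALL, trace = TRUE)
```

Starting each trip's optimization near the population-level parameters should reduce the number of iterations per fit.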

Some smaller details:
In ctmm.select(), method="pHREML", control=list(method="pNewton"), and level=1 are all default arguments now, so you no longer need them.
I don't think MSPE="velocity" will do anything for a stationary model if IC="AICc". If you want to use the former, then you would need to set IC=NA. I recall that working better in some cases for you guys.
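With those defaults, the selection call from the first message reduces to something like this sketch (keeping IC="AICc" and leaving method, control, and level at their defaults):

```r
FIT <- ctmm.select(DATA.SUBSET,
                   CTMM = GUESS,
                   verbose = TRUE, # return the full ranked list of models
                   IC = "AICc",
                   trace = TRUE,
                   cores = 8)
```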

Best,
Chris

Jenny Howard

Feb 5, 2021, 11:43:19 AM
to Christen Fleming, ctmm R user group
Hi Chris!

Thanks for your quick reply. That's really cool to hear you applied for the quantitative ecologist position! I hope you find something this year; I can only imagine how hard it is to find a job during the pandemic.

Thanks for the information on the cores argument. I totally missed that it only works on non-Windows computers. I am playing around with a foreach loop now (great suggestion by Genevieve Finerty) on a small subset of the data before I run it on all of them and really hog my computer's time. If I want to print progress updates (like "processing file 1 of 1881"), I assume I need to put that within the foreach loop, or does that not work with parallel processing?

Briefly about the data: Data were sampled with a GPS logger every 3 or 5 minutes depending on logger and breeding season. I have data from approximately 700 birds, with multiple trips for each bird across five different breeding seasons. Because of this data collection, I was processing each trip separately. The birds are central-place foragers, so a trip starts and ends at the colony. I have removed GPS data from when the birds are on the nest, so that I only input data from their foraging trip. However, they do switch up behaviors during a foraging trip; because some trips last for days, they often rest on the water at night and move during the day. Some trips are incomplete, but most are complete. I remember in earlier emails we had talked about how current ctmm modeling would only work for these complete trips.

My main goals are to get a bird's distance travelled during a foraging trip, and to get an estimated instantaneous speed to be able to calculate a bird's airspeed. Is MSPE = "velocity" the best approach for those goals? 

I had tried setting IC = NA and was getting more warning messages than when I set IC = AICc: 
1: In ctmm.fit(data, GUESS, trace = trace2, ...) :
  pREML failure: indefinite ML Hessian or divergent REML gradient.
2: In ctmm.fit(data, GUESS[[1]], trace = trace2, ...) :
  pREML failure: indefinite ML Hessian or divergent REML gradient.

From looking at the ctmm pdf it doesn't seem like this is necessarily a concern, but it probably isn't good if that happens for most of them? Would I expect the model to be the same whether I use IC = NA or IC = "AICc"? When you say fitting the same movement model to the subsetted data, would that mean using ctmm.fit() instead of ctmm.select() and specifying the model in the CTMM argument? The most common model for these trips seems to be OUF isotropic, but I occasionally see OUf anisotropic.

When you suggest fitting a model to all the data and using that as my guess for the subset fitting, that would be ALL 1881 tracks? Or do it on a bird basis (slightly larger subset than subsetting it by trip)?

Re: prior, prior = TRUE is preferred for accuracy, but combining it with fast = TRUE keeps it both accurate and fast? In what cases would it fail at the boundaries: with smaller sample sizes, or with coarser sampling? prior = TRUE, fast = FALSE is incredibly slow (i.e., 3-4x slower), but I guess that should become less of a problem with the foreach loop.

Thanks for your help,

Jenny


Christen Fleming

Feb 5, 2021, 1:08:55 PM
to ctmm R user group
Hi Jenny,

Thanks. I have another year of funding after this one, so I'll be okay if nothing comes through this year. I'm also not very good at interviewing and stuff, so this year is kind of a practice run.

I think printing works under parallelization. You can also click the refresh button in the Environment panel in RStudio while any script is running.
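One way to make worker progress messages reliably visible is to redirect worker output to a log file via the outfile argument of makeCluster; a sketch (the file name and loop body are hypothetical):

```r
library(doParallel)

# Workers append their stdout/stderr to this file instead of the
# (often hidden) worker consoles
cl <- makeCluster(15, outfile = "progress.log")
registerDoParallel(cl)

results <- foreach(i = 1:1881) %dopar% {
  cat(sprintf("processing file %d of 1881\n", i))
  # ... fit models for trip i here ...
  i
}

stopCluster(cl)
```

Tailing progress.log (or refreshing it in a file viewer) then shows which trips have been processed.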

You might try fitting one model to all data of one bird (sans colony, but including incomplete trips) and seeing if there is a significant difference between that and the individual trip fits.

Those warnings are fine; they just indicate that parameter-estimate quality is not ideal, for reasons you can't control. If you get more warnings, it's probably because more models were attempted.

When you have abundant data, most of the sensible model selection criteria give you a good (usually the same) model. When you have tiny amounts of data, then AIC/BIC start to perform poorly and you might consider LOOCV or MSPE... especially if they can select a more parsimonious model (like isotropic).

prior=TRUE is necessary to propagate model parameter uncertainty. fast=TRUE is reasonably accurate when the parameter estimates are away from boundaries. You get parameter estimates near boundaries from data quality issues, including those you mention.

Best,
Chris