Tips for improving code parallelization

Bert van der Veen

Jun 19, 2024, 3:51:30 AM
to TMB Users
Hello all,

I have recently been looking more into parallelizing model fitting with TMB, and am in need of some guidance.

I started out by using parallel_accumulator, but quickly found that some parts of my code parallelized better than others. Under certain configurations, CPU usage looked very spiky, which I took as a sign that some threads were waiting for others to finish their tasks.

Is there a better way to narrow down which part of the code is causing these issues than placing PARALLEL_REGION in different parts of the code? This trial-and-error approach is very time consuming. Further, how do I decide where to place PARALLEL_REGION?

Kind regards,
Bert

Kasper Kristensen

Jun 19, 2024, 10:11:14 AM
to TMB Users
Parallelization in TMB has evolved a lot since the beginning. There are now three options:

1. PARALLEL_REGION macro (example: https://github.com/kaskr/adcomp/blob/master/tmb_examples/register_atomic_parallel.cpp)
2. parallel_accumulator (example: https://github.com/kaskr/adcomp/blob/master/tmb_examples/linreg_parallel.cpp)
3. Automatic parallelization (see ?TMB::openmp)

Of these, I recommend option (3) because you don't have to change anything in your program. Despite its simplicity, it often finds better parallel splits than you could come up with manually. It also prints an 'autopar work split' output you can use to assess the quality of the split.
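As a rough sketch (file and variable names are placeholders), enabling option (3) looks like:

library(TMB)
compile("model.cpp", framework = "TMBad", openmp = TRUE)  # autopar requires the TMBad framework
dyn.load(dynlib("model"))
openmp(4, autopar = TRUE)  # request e.g. 4 threads with automatic work splitting
obj <- MakeADFun(data, parameters, DLL = "model")  # should report the 'autopar work split'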

Bert van der Veen

Jun 19, 2024, 10:14:36 AM
to TMB Users
Thanks Kasper.

I did notice option 3) and have tried it out; it seemed to do the same as option 2). How different should I expect these to be?

After writing my first post I noticed that parallelization worked better when fitting the model with optim's L-BFGS-B, so I wonder whether it is the size of the approximate Hessian that is slowing things down, rather than TMB's ability to parallelize the code efficiently.

Kasper Kristensen

Jun 19, 2024, 10:27:15 AM
to TMB Users
Option 3 does a better job at balancing the work on threads, but perhaps that's not an issue for your model.

To test the parallel scalability you may want to try TMB::benchmark (see e.g. https://github.com/glmmTMB/glmmTMB/issues/733). That should tell you whether TMB parallelization is a dead end.
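If I read the help page right, something like this should do (with obj the object returned by MakeADFun; the core counts are illustrative):

ben <- TMB::benchmark(obj, n = 100, cores = 1:4)  # time fn/gr evaluations on 1-4 threads
plot(ben)  # visualize the speed-up per number of cores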

Bert van der Veen

Jun 19, 2024, 1:48:27 PM
to TMB Users
The benchmark confirms that parallelization significantly speeds up the code, although more so for the likelihood than for the gradient. This suggests the issue is related to the optimisation routine used. optim's L-BFGS-B evaluates the likelihood less often than BFGS or nlminb, which I suppose is somehow the cause.
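For completeness, the comparison was along these lines (obj from MakeADFun):

fit_lbfgsb <- optim(obj$par, obj$fn, obj$gr, method = "L-BFGS-B")  # fewer likelihood evaluations
fit_bfgs   <- optim(obj$par, obj$fn, obj$gr, method = "BFGS")
fit_nlminb <- nlminb(obj$par, obj$fn, obj$gr)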

Ben Bolker

Jun 19, 2024, 3:33:17 PM
to brrtj...@gmail.com, tmb-...@googlegroups.com

If the optimizer is becoming important it may be worth trying out a variety from optimx and/or nloptr ...
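E.g. something along these lines (untested sketch; the algorithm choice is just an illustration):

library(nloptr)
fit <- nloptr(x0 = obj$par, eval_f = obj$fn, eval_grad_f = obj$gr,
              opts = list(algorithm = "NLOPT_LD_LBFGS", xtol_rel = 1e-8))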


--
To post to this group, send email to us...@tmb-project.org. Before posting, please check the wiki and issuetracker at https://github.com/kaskr/adcomp/. Please try to create a simple repeatable example to go with your question (e.g issues 154, 134, 51). Use the issuetracker to report bugs.
---
You received this message because you are subscribed to the Google Groups "TMB Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tmb-users+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tmb-users/ffddab18-7854-4ddd-b934-8ca5ae1dd25fn%40googlegroups.com.

Bert van der Veen

Jun 20, 2024, 6:18:39 AM
to TMB Users
I am happy to get suggestions re. optimisation; it is something I have been looking into. AFAIK optimx does not offer any optimisation routines that help here beyond what base R provides, and my experience with unconstrained (derivative-based) optimisation in nloptr has not been much more successful than in base R.

Andrea Havron

Jun 21, 2024, 2:16:55 AM
to Bert van der Veen, TMB Users
Hi Bert,

There is new and better functionality for running TMB models in parallel with OpenMP when you compile with TMBad instead of CppAD. TMBad automatically determines the optimal split of the computational graph, and you don't need to modify the .cpp code: there is no need to insert parallel_accumulator; you just set up the nll as you normally would.

Here's an example of the R code needed to run a model in parallel with TMBad:
compile("example.cpp", openmp = TRUE, framework = "TMBad")
openmp(ncores, autopar = TRUE)
obj <- MakeADFun(data, parameters, DLL="example")
TMB:::op_table(obj$env$ADFun) #shows how processes are split into separate threads
openmp(1, autopar = FALSE) #switch back to one core

Cheers,
Andrea 




Ben Bolker

Jun 21, 2024, 2:16:58 AM
to brrtj...@gmail.com, tmb-...@googlegroups.com

Have you tried the new(ish) autopar feature? See ?openmp



Bert van der Veen

Jul 2, 2024, 1:21:33 PM
to TMB Users
Apologies for the slow response. I did try the autopar feature, and have been using TMBad all the way. Autopar did not make a difference.