This is quite exciting... evaluating ODEs for many subjects is very costly and screams for threaded parallelization, i.e. allocating multiple CPUs per chain and then executing the ODE integration for each patient in parallel.
So far there were two problems and I have possibly found a solution to both of these:
1. threading... well, OpenMP just does the job. The nice thing is that any compiler not supporting OpenMP will simply execute things serially (the default clang on macOS is such a case, but gcc 4.6 can run OpenMP)
2. the second big fat problem was to get autodiff to play nicely with the threading... there is a very simple solution to this: just don't use autodiff - that solves the problem!
Not using AD during ODE integration is possible as long as one supplies an ode_system object which calculates the Jacobian analytically.
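Schematically, this is what I mean (the interface below is made up for illustration, not the actual Stan class):

#include <vector>

// Toy system dy/dt = -k * y: both the RHS and its Jacobian are
// computed with plain doubles, so the integrator never needs AD.
struct ode_system {
  double k_;
  explicit ode_system(double k) : k_(k) {}

  // right-hand side f(t, y)
  void rhs(double t, const std::vector<double>& y,
           std::vector<double>& dydt) const {
    dydt[0] = -k_ * y[0];
  }

  // analytic Jacobian df/dy, supplied instead of an AD-generated one
  void jacobian(double t, const std::vector<double>& y,
                std::vector<double>& J) const {
    J[0] = -k_;
  }
};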
Now, this approach should be usable for just about any very expensive operation we have! We use pre-computed gradients for our expensive operations anyway; ODEs are just one of those, and I guess there are more.
So the solution to getting multiple CPUs per chain may be darn simple: only parallelize code which does not use AD. Of course, having the AD parts run in parallel as well would be even better, but this is a start. Moreover, I am absolutely confident that applying this approach to ODEs will be a BIG step forward for Stan. I know there are more problems to overcome, but this is a very straightforward possibility now.
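Stan Math already has precomputed_gradients for exactly this pattern; roughly, a double-only expensive operation would look like this (the quadratic body is just a stand-in for the real work):

#include <stan/math.hpp>
#include <vector>

using stan::math::var;

// All the heavy lifting runs on plain doubles (and could run inside a
// parallel region); only the final precomputed_gradients call touches
// the AD stack and must stay serial.
var expensive_op(const std::vector<var>& theta) {
  std::vector<double> theta_d(theta.size());
  for (size_t i = 0; i < theta.size(); ++i)
    theta_d[i] = theta[i].val();

  double value = 0;
  std::vector<double> grad(theta.size());
  for (size_t i = 0; i < theta_d.size(); ++i) {
    value += theta_d[i] * theta_d[i];  // stand-in for the expensive part
    grad[i] = 2 * theta_d[i];          // analytically known gradient
  }

  // one AD-stack interaction: value plus gradients become a single node
  return stan::math::precomputed_gradients(value, theta, grad);
}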
Ahh, I forgot: I have a working prototype which grabs 2 CPUs and happily crunches the ODE system in 26s while on a single CPU it takes twice the time, 52s - this should give major speedups during inference.
Best,
Sebastian
I don't know how to roll parallelism into the language
for general use. Any ideas?
#pragma omp parallel for
for (i in 1:N) {
  // do safe stuff in parallel
}
pfor (i in 1:N) {
  // do safe stuff in parallel
}
#pragma omp parallel for
for (int i = 1; i < N; i++) {
  // do safe stuff in parallel
}
> #pragma omp parallel for
> for (int i = 1; i < N; i++) {
>   // do safe stuff in parallel
> }
>
> But I don't know how we would detect whether the body of the for loop is parallel safe or not.
Adding the pragma would be trivial, but it doesn't buy
us anything.
#pragma omp parallel for
for (int i = 1; i < N; i++) {
  mu[i] = integrate_ode(...);
} // end parallel part
target += normal_lpdf(y | mu, sigma);
If I do not disable the decoupling operation, then I get a segfault. In short, what I empirically observed is that one must not do anything AD-related while running in parallel (presumably because the threads would then touch the shared AD stack concurrently); with that restriction it works nicely.
Best,
Sebastian
With OpenMP one can easily instruct the code to execute certain critical parts only serially. Maybe this is the key to parallelizing code which runs mainly with doubles and only very rarely needs access to the shared AD stack. The code parts which access the shared AD stack can be marked so that the access is performed serially. One can even request that those code blocks run in the order of the iterations of the enclosing for loop.
Enforcing such an ordering, which is internally handled with mutex locking and unlocking, will degrade performance somewhat, but maybe it is a good compromise.
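A minimal sketch of what I mean by the marking (heavy_double_work and the counter are just stand-ins for the ODE integration and the AD stack):

#include <omp.h>
#include <cstdio>

static double heavy_double_work(int i) {
  double s = 0;
  for (int k = 0; k < 1000000; ++k)
    s += (i + k) * 1e-9;
  return s;
}

int main() {
  const int N = 16;
  int shared_stack_size = 0;  // stand-in for the shared AD stack

  #pragma omp parallel for
  for (int i = 0; i < N; ++i) {
    double r = heavy_double_work(i);  // runs fully in parallel

    // serialized across all threads, but in no particular order
    #pragma omp critical(ad_stack)
    {
      ++shared_stack_size;
      std::printf("i=%d r=%f\n", i, r);
    }
  }
  return 0;
}

If the order of the serial blocks matters, OpenMP's ordered clause can enforce it; see further down.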
Sebastian
On Monday, August 8, 2016 at 10:04:35 PM UTC+2, Bob Carpenter wrote:
Yes, OpenMP is cross-platform and synchronization can easily be enforced with a few pragma statements. What I learned from the meeting yesterday is that reentrancy is also something to worry about - going through the OpenMP documentation, it looks like this can also be taken care of.
So OpenMP offers quite some control over the flow of the program. I don't mean to say that this is easy to do, not at all, but it is worthwhile to follow up on. Is there some documentation available on which parts of AD in Stan have to be executed serially? This is probably not documented specifically, but where would one start?
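To make the question concrete, here is a toy picture of the kind of shared state I have in mind (this is not Stan's actual implementation):

#include <vector>

// Reverse-mode AD records every operation on one global tape; two
// threads appending at the same time corrupt it, which is why anything
// that creates vars has to run serially.
struct node {
  double value;
  double adjoint;
  // ... operand pointers, chain-rule callback, etc. ...
};

static std::vector<node*> global_tape;  // shared by all threads

node* record(node* n) {
  global_tape.push_back(n);  // NOT thread-safe on its own
  return n;
}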
Sebastian
On Wednesday, August 10, 2016 at 2:10:07 PM UTC+2, Bob Carpenter wrote:
> If it's portable across Windows/Linux/Mac OSX, that'd be
> really useful.
>
> For most of our use cases, the serial part (adding
> the expression into the expression graph) only needs to be
> synchronized, not ordered. For instance, in the ODE
> case for multiple patients, each patient is independent
> of the others conditioned on the parameters, so their
> likelihood contributions can be added independently.
>
> - Bob
I am still on a toy example, yes, but this looks even more interesting: when parallelizing on 2 CPUs I go from 52s to 26s for integrating an ODE system, but without doing the decoupling operation. This was my first experiment. Not doing the decoupling operation means that my program only deals with doubles, so the result is not yet usable inside the Stan ecosystem, which needs vars.
Now I added an omp critical statement inside the ODE state decoupling operation - this is the part of the code which interacts with the AD stack. Moreover, I used an "ordered" statement so that at the end of my loop I can collect things in the correct serial order. All of this adds heavy synchronization duties for OpenMP, but the runtime only increases by 3s, to an overall 29s. So I am still getting a large speedup, since the really heavy lifting is done inside the ODE integrator.
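Roughly, the loop now has this structure (the two function names are made up, but the pragmas are the ones I use):

#include <vector>

std::vector<double> integrate_subject(int j);          // heavy, doubles only
double decouple_states(const std::vector<double>& y);  // touches the AD stack

void run_subjects(int J, std::vector<double>& mu) {
  #pragma omp parallel for ordered
  for (int j = 0; j < J; ++j) {
    // the expensive integration runs on doubles, fully in parallel
    std::vector<double> y = integrate_subject(j);

    // the decoupling runs serialized and in loop order j = 0, 1, ...
    #pragma omp ordered
    {
      mu[j] = decouple_states(y);
    }
  }
}

(As noted above, for independent contributions plain synchronization would suffice; the ordered variant just also keeps the collection deterministic.)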
Are there other parts of Stan which could benefit from this? What about GP stuff?
Sebastian
http://stackoverflow.com/questions/11095309/openmp-set-num-threads-is-not-working
https://eigen.tuxfamily.org/dox/TopicMultiThreading.html
In case one calls OpenMP-enabled functions recursively, one can define what happens then. The safe choice is to let parallelization occur only at the top level and disable nested parallelism; this is all configurable in OpenMP.
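The relevant knobs in a small self-contained example:

#include <omp.h>
#include <cstdio>

int main() {
  omp_set_num_threads(4);  // cap threads, e.g. at the number of cores
                           // (or export OMP_NUM_THREADS=4)
  omp_set_nested(0);       // nested parallel regions run single-threaded

  #pragma omp parallel for
  for (int i = 0; i < 8; ++i) {
    // a parallel region opened inside here (e.g. by an OpenMP-enabled
    // Eigen call) would now execute with a single thread, which avoids
    // oversubscribing the cores
    std::printf("i=%d on thread %d of %d\n",
                i, omp_get_thread_num(), omp_get_num_threads());
  }
  return 0;
}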
Sebastian
On Wednesday, August 10, 2016 at 11:26:17 PM UTC+2, Bob Carpenter wrote:
> I think for most of the GP operations, we want to parallelize
> the underlying matrix operations unless there's something specific
> to GPs.
>
> How do you control the number of threads with OpenMP? The problem
> you're going to run into is that you want at most as many threads
> as there are cores (assuming they're all being kept busy), but once
> you start nesting parallel algorithms inside one another this becomes
> tricky and you can get big slowdowns.
>
> - Bob