parallelization - working prototype!!!


Sebastian Weber

Aug 7, 2016, 1:08:42 PM
to stan development mailing list
Hi!

This is quite exciting... evaluating ODEs for many subjects is very costly and screams for threaded parallelization, i.e. allocating multiple CPUs per chain and then executing the per-patient ODE integrations in parallel.

So far there were two problems and I have possibly found a solution to both of these:

1. Threading... well, OpenMP just does the job. The nice thing is that any compiler not supporting OpenMP will simply execute things serially (the default clang on macOS is such a case, but gcc 4.6 can run OpenMP).

2. The second big fat problem was getting autodiff to play with the threading... there is a very simple solution to this: just don't use autodiff - that solves the problem!

Not using AD during ODE integration is possible as long as one supplies an ode_system object which calculates the Jacobian analytically.
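
To illustrate the idea, here is a minimal sketch of such a functor (the names are made up, this is not the actual Stan math interface): both the right-hand side and the Jacobian are computed with plain doubles, so the integrator never touches the AD stack.

#include <vector>

// Hypothetical example: one-compartment decay, dy/dt = -k * y.
// RHS and Jacobian use plain doubles only, so no autodiff is needed
// while the integrator is running.
struct decay_ode_system {
  double k;  // elimination rate constant

  // right-hand side dy/dt
  std::vector<double> operator()(double /* t */,
                                 const std::vector<double>& y) const {
    return { -k * y[0] };
  }

  // analytic Jacobian d(dy/dt)/dy (here a 1x1 matrix)
  std::vector<double> jacobian(double /* t */,
                               const std::vector<double>& /* y */) const {
    return { -k };
  }
};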

Now, this approach should be usable for just about any very expensive operation we have! We already use pre-computed gradients for our expensive operations; ODEs are just one of them, and I guess there are more.

So the solution to getting multiple CPUs per chain may be darn simple - only parallelize stuff which does not use AD. Of course, making the AD parts parallel as well would be even better, but this is a start. Moreover, I am absolutely confident that applying this approach to ODEs will be a BIG step forward for Stan. I know there are more problems to overcome, but this is a very straightforward possibility now.

Ahh, I forgot: I have a working prototype which grabs 2 CPUs and happily crunches the ODE system in 26s, while on a single CPU it takes twice the time, 52s - this should give major speedups during inference.

Best,
Sebastian

Bob Carpenter

Aug 7, 2016, 1:52:14 PM
to stan...@googlegroups.com
That's fantastic. It should give you pretty close to
an N times speedup with N cores if memory contention doesn't
become a bottleneck. And you can run multiple groups of
cores on different physical machines for different chains.

We can do the same thing for matrix operations and
linear algebra because we're mainly doing the heavy
lifting with double values at this point.

I don't know how to roll parallelism into the language
for general use. Any ideas?

For this case, it could tie into some specialized
functions that evaluate the ODE systems in parallel.

An alternative that would deal with autodiff would be
to do thread-specific nested autodiff. There's a performance
hit to doing that of about 20% I think vs. running
single threaded. Then as long as the entire function
(solving all the ODEs) can be tied up with gradients for
the outside, it should work.

Does OpenMP run cross-platform? If not, we'll have to figure
out something platform-specific.

- Bob

Ben Goodrich

Aug 7, 2016, 7:53:06 PM
to stan development mailing list
On Sunday, August 7, 2016 at 1:52:14 PM UTC-4, Bob Carpenter wrote:
I don't know how to roll parallelism into the language
for general use.  Any ideas?

OpenMP --- for our purposes --- is basically just a way to parallelize for loops. So, I think we would either need to have the parser not strip out lines with # so that users could do OpenMP stuff themselves

#pragma omp parallel for
for (i in 1:N) {
  // do safe stuff in parallel
}

which I suggested two and a half years ago

https://github.com/stan-dev/stan/issues/439

or introduce a parallel for loop construct in the Stan language like

pfor (i in 1:N) {
  // do safe stuff in parallel
}

that parses to

#pragma omp parallel for
for (int i = 1; i <= N; i++) {
  // do safe stuff in parallel
}

But I don't know how we would detect whether the body of the for loop is parallel safe or not.

I am continuing to think that the next lowest-hanging fruit (having already parallelized chains) is to parallelize the leapfrog steps rather than parallelizing within a leapfrog step. But to do that, we would still need to make some things thread-local. And to do that, we would need to use C++11. And to do that, we would need to get the size of the compiled models down to a reasonable size without resorting to LTO.
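
Just to spell out the thread-local part, this is the C++11 feature I mean (a toy illustration only, not Stan's actual autodiff stack):

#include <iostream>
#include <thread>

// Each thread gets its own copy; writes in one thread are invisible to the
// others, which is the property we would want for a per-thread AD stack.
thread_local int counter = 0;

void work() {
  for (int i = 0; i < 1000; ++i) ++counter;
  std::cout << "thread-local counter: " << counter << std::endl;  // 1000 in each thread
}

int main() {
  std::thread t1(work), t2(work);
  t1.join();
  t2.join();
  return 0;
}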

Ben

Sebastian Weber

Aug 8, 2016, 3:34:09 AM
to stan development mailing list
Hi!

To my knowledge OpenMP is cross-platform: https://en.wikipedia.org/wiki/OpenMP

So this includes Windows! You only need a compiler which supports it. The default clang++ on macOS does not support it at the moment, but you can install clang-omp, as I understand it. Moreover, compilers not supporting the OpenMP directives will simply ignore them.

There is one caveat to take care of - exceptions must not propagate outside of an OpenMP thread. I remember that on Stack Overflow there were some suggested solutions which take care of this (roughly speaking, you catch all exceptions inside the parallel region and then rethrow them as needed).
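
Roughly like this (just a sketch; solve_one is a made-up stand-in for whatever may throw inside the loop):

#include <exception>
#include <stdexcept>

// hypothetical worker standing in for one ODE integration; may throw
void solve_one(int i) {
  if (i == 3) throw std::runtime_error("integration failed");
}

void solve_all(int N) {
  std::exception_ptr err = nullptr;
  #pragma omp parallel for
  for (int i = 0; i < N; ++i) {
    try {
      solve_one(i);
    } catch (...) {
      // exceptions must not escape the OpenMP region, so store the first one
      #pragma omp critical
      if (!err) err = std::current_exception();
    }
  }
  // back on the master thread: rethrow as if the loop had been serial
  if (err) std::rethrow_exception(err);
}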

So a pfor construct would be great, but then we need to figure out how to detect code which can run in parallel. No idea how to do that (traits in C++?).

Another idea would be to have functions with a "_par" suffix for which users must specify the gradients. Then we would allow only _par functions to be called in a pfor loop, for example.

For the ODE stuff... I know there is still some work to be done. I mean for PK problems I would have to stop and restart the integrator from time to time; no idea how to do this at this stage. I guess we would need to have the event handling inside Stan.

I also expect the speedup to scale really nicely with the number of CPUs... just awesome. What Ben suggests - parallelizing over the tree which gets built up - also sounds very appealing.

I am really looking forward to throwing a single chain onto one of our 20-CPU computing nodes.

Best,
Sebastian

Joshua N Pritikin

Aug 8, 2016, 6:22:45 AM
to stan...@googlegroups.com
On Sun, Aug 07, 2016 at 04:53:05PM -0700, Ben Goodrich wrote:
> I am continuing to think that the next lowest hanging fruit (having
> already parallelized chains) is to parallelize the leapfrog steps
> rather than parellelizing within a leapfrog step.

That sounds like a good idea.

In OpenMx, we often use BFGS optimizers with a numerical approximation
of the gradient. An approach we took is to parallelize the numerical
approximation of the gradient. This works well because the central
difference algorithm requires 2 * #parameters evaluations of the fit
(usually more than the # of cores) and each evaluation is roughly the
same amount of work (so the work is evenly distributed among the cores).
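
In sketch form (generic f, not our actual code), the loop over parameters is what gets parallelized:

#include <functional>
#include <vector>

// Central-difference gradient: 2 * n independent evaluations of f, so each
// parameter's pair of evaluations can run on its own core. f must be safe
// to call concurrently.
std::vector<double> numerical_gradient(
    const std::function<double(const std::vector<double>&)>& f,
    const std::vector<double>& x, double h = 1e-5) {
  const int n = static_cast<int>(x.size());
  std::vector<double> grad(n);
  #pragma omp parallel for
  for (int i = 0; i < n; ++i) {
    std::vector<double> hi = x, lo = x;   // per-thread copies
    hi[i] += h;
    lo[i] -= h;
    grad[i] = (f(hi) - f(lo)) / (2 * h);  // disjoint writes: thread safe
  }
  return grad;
}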

--
Joshua N. Pritikin, Ph.D.
Virginia Institute for Psychiatric and Behavioral Genetics
Virginia Commonwealth University
PO Box 980126
800 E Leigh St, Biotech One, Suite 1-133
Richmond, VA 23219
http://people.virginia.edu/~jnp3bc

Bob Carpenter

Aug 8, 2016, 2:03:55 PM
to stan...@googlegroups.com

> On Aug 7, 2016, at 7:53 PM, Ben Goodrich <goodri...@gmail.com> wrote:
>
> On Sunday, August 7, 2016 at 1:52:14 PM UTC-4, Bob Carpenter wrote:
> I don't know how to roll parallelism into the language
> for general use. Any ideas?
>
> OpenMP --- for our purposes --- is basically just a way to parallelize for loops. So, I think we would either need to have the parser not strip out lines with # so that users could do OpenMP stuff themselves
>
> #pragma omp parallel for
> for (i in 1:N) {
> // do safe stuff in parallel
> }
>
> which I suggested two and a half years ago
>
> https://github.com/stan-dev/stan/issues/439
>
> or introduce a parallel for loop construct in the Stan language like
>
> pfor (i in 1:N) {
> // do safe stuff in parallel
> }
>
> that parses to
>
> #pragma omp parallel for
> for (int = 1; i < N; i++) {
> // do safe stuff in parallel
> }
>
> But I don't know how we would detect whether the body of the for loop is parallel safe or not.

Adding the pragma would be trivial, but it doesn't buy
us anything.

The problem is that neither of our two key operations are thread safe:

* autodiff

* incrementing log density

What we need to do is run each of the pieces in what the
autodiff folks call "pre-accumulation" mode --- that is,
calculate the partials of the outputs with respect to the
inputs in each parallel segment and then put everything together.

Almost everything we do can be broken down into adding into
the log density, so we could try synchronizing there. Then
we'd need each of the autodiff pieces to run in their own sandbox
to produce something that could be synched into the log density.

I'm not sure parallel loops are the right abstraction here, as
opposed to, say, a work queue of functions to evaluate, each
in its own autodiff sandbox, producing a pre-accumulated result
(output and partials with respect to all inputs) that could
then be synched back into the result.
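
In rough sketch form (names made up, the real machinery would be more involved), the shape of it is:

#include <vector>

// One pre-accumulated result: the value of a sub-expression plus its
// partials with respect to its inputs, all plain doubles.
struct preaccumulated {
  double value;
  std::vector<double> partials;
};

// Hypothetical task: stands in for, e.g., one patient's ODE solve, computed
// without touching the global AD tape (nested autodiff or analytic math).
preaccumulated evaluate_task(int task_id) {
  return { static_cast<double>(task_id), { 1.0 } };
}

void accumulate(int num_tasks, std::vector<preaccumulated>& results) {
  results.resize(num_tasks);

  // parallel part: pure double work, no shared AD state
  #pragma omp parallel for
  for (int i = 0; i < num_tasks; ++i)
    results[i] = evaluate_task(i);

  // serial part: walk over results and add each value with its precomputed
  // partials into the expression graph (roughly what precomputed_gradients
  // is for in Stan math), then increment the log density.
}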

- Bob


>
> I am continuing to think that the next lowest hanging fruit (having already parallelized chains) is to parallelize the leapfrog steps rather than parellelizing within a leapfrog step. But to do that, we would still need to thread-local some stuff. And to do that, we would need to utilize C++11. And to do that, we would need to get the size of the compiled models down to reasonable sizes without resorting to LTO.
>
> Ben
>
>

Bob Carpenter

Aug 8, 2016, 2:05:59 PM
to stan...@googlegroups.com

> On Aug 8, 2016, at 3:34 AM, Sebastian Weber <sdw....@gmail.com> wrote:
>
> Hi!
>
> To my knowledge OpenMP is cross-platform: https://en.wikipedia.org/wiki/OpenMP
>
> So this includes Windows! You only need a compiler which supports it. The clang++ on Macos does not support it at the moment, but you can install clang-omp as I understood. Moreover, compilers not supporting the OpenMP directives will simply just ignore it.
>
> There is one caveat to take care of - exceptions must not propagate outside of an OpenMP thread. I remember that on stackoverflow there were some solutions suggested which take care of this (roughly speaking you catch all exceptions during parallelism and then rethrow as needed).

Yup, basically devolves back to C practices.

> So a pfor construct would be great, but then we need to figure out how to detect code which can run in parallel. No idea how to do that (traits in C++?).
>
> Another idea would be to have functions with a "_par" suffix for which users must specify the gradients. Then we allow only _par functions to be called in a pfor loop.

That eliminates having to sandbox the gradients thread locally.
Then we'd just need to synch additions to the larger expression
graph.

- Bob

Bob Carpenter

Aug 8, 2016, 2:08:29 PM
to stan...@googlegroups.com
Finite differencing is embarrassingly parallel.

And you could do the same thing with forward mode autodiff to
compute gradients, but it's not worth it when you have
reverse mode autodiff---you'd need N CPUs where N is the
number of parameters for maybe a factor of 2 speedup.

Where parallelism would be hugely helpful would be in Jacobian
calculations for multivariate functions and in (dense or diagonal)
Hessians.

- Bob

Ben Goodrich

Aug 8, 2016, 3:04:01 PM
to stan development mailing list
On Monday, August 8, 2016 at 2:03:55 PM UTC-4, Bob Carpenter wrote:
> #pragma omp parallel for
> for (int = 1; i < N; i++) {
>   // do safe stuff in parallel
> }
>
> But I don't know how we would detect whether the body of the for loop is parallel safe or not.

Adding the pragma would be trivial, but it doesn't buy
us anything.

But that is pretty much all Sebastian was doing

#pragma omp parallel for
for (int i = 1; i <= N; i++) {
  mu[i] = integrate_ode(...);
} // end parallel part
target += normal_lpdf(y | mu, sigma);

If the parallel loop only does assignment and doesn't involve any i - 1 stuff, then it is fine, right? The target += line is evaluated in the main thread.

Ben

Sebastian Weber

Aug 8, 2016, 3:19:57 PM
to stan development mailing list
That's almost what I did, but not quite. That is, I had to disable the decouple-ODE-states operation, which apparently messes with the AD stack. So the ODE integration call returns a large vector of vectors of doubles, which one would then need to assemble in the main thread after integrating all ODEs.

If I do not disable the decoupling operation, then I get a segfault. In short, what I empirically observed is that one must not do anything AD-related while running in parallel - but then it works nicely.

Best,
Sebastian

Bob Carpenter

Aug 8, 2016, 4:04:35 PM
to stan...@googlegroups.com

> On Aug 8, 2016, at 3:04 PM, Ben Goodrich <goodri...@gmail.com> wrote:
>
> On Monday, August 8, 2016 at 2:03:55 PM UTC-4, Bob Carpenter wrote:
> > #pragma omp parallel for
> > for (int = 1; i < N; i++) {
> > // do safe stuff in parallel
> > }
> >
> > But I don't know how we would detect whether the body of the for loop is parallel safe or not.
>
> Adding the pragma would be trivial, but it doesn't buy
> us anything.
>
> But that is pretty much all Sebastian was doing
>
> #pragma omp parallel for
> for (int i = 1; i < N; i++) {
> mu[i] = integrate_ode(...);
> } // end parallel part
> target += normal_lpdf(y | mu, sigma);

If there aren't any parameters in the ..., then that's OK, because
setting mu[i] for different i should be thread safe.

The tricky part is getting the derivatives from mu[i] back down
to the parameters in the ... bits.


>
> If the parallel loop only does assignment

The key is the mu[i] assignment, which is disjoint across
the loop iterations. But you couldn't do sum += ... that way.

> and doesn't involve any i - 1 stuff, then it is fine, right? The target += line is evaluated in the main thread.

Right, but it needs the derivatives to be reconstructed from
target back through normal_lpdf to the parameters in the
... parts. Sebastian's pulling out the parallel part and
implementing it using purely double values.

- Bob

Sebastian Weber

Aug 9, 2016, 2:59:03 AM
to stan development mailing list
Hi!

With OpenMP one can easily instruct the program to execute certain critical parts of the code serially only. Maybe this is the key to parallelizing code which runs mainly with doubles and only very rarely needs access to the shared AD stack: the code parts which access the shared AD stack can be marked so that the access happens serially only. One can even require that those code blocks execute in the iteration order of the wrapping for-loop.

Enforcing such ordering, which is handled internally with mutex locking and unlocking, will degrade the performance somewhat, but maybe it is a good compromise.
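
Schematically (made-up worker names, not the actual decoupling code):

#include <vector>

// hypothetical stand-in for the expensive double-only work (e.g. one ODE solve)
std::vector<double> heavy_double_work(int i) { return std::vector<double>(3, i); }

// hypothetical stand-in for the AD-stack interaction that must stay serial
void append_serially(std::vector<std::vector<double> >& out,
                     const std::vector<double>& r) { out.push_back(r); }

void process(int N, std::vector<std::vector<double> >& out) {
  // the `ordered` clause on the loop directive permits an ordered block inside
  #pragma omp parallel for ordered schedule(static)
  for (int i = 0; i < N; ++i) {
    std::vector<double> r = heavy_double_work(i);  // pure doubles: fully parallel

    // runs one thread at a time, in loop order (i = 0, 1, 2, ...); a plain
    // `#pragma omp critical` would serialize without enforcing the order
    #pragma omp ordered
    append_serially(out, r);
  }
}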

Sebastian


Bob Carpenter

Aug 10, 2016, 8:10:07 AM
to stan...@googlegroups.com
If it's portable across Windows/Linux/Mac OSX, that'd be
really useful.

For most of our use cases, the serial part (adding
the expression into the expression graph) only needs to be
synchronized, not ordered. For instance, in the ODE
case for multiple patients, each patient is independent
of the others conditioned on the parameters, so their
likelihood contributions can be added independently.

- Bob

Sebastian Weber

Aug 10, 2016, 8:40:27 AM
to stan development mailing list
Hi!

Yes, OpenMP is cross-platform, and synchronization can easily be enforced with a few pragma statements. What I learned from the meeting yesterday is that reentrancy is also something to worry about - going through the OpenMP documentation, it looks like this can also be taken care of.

So OpenMP offers quite some control over the flow of the program. I don't want to say that this is easy to do - not at all - but it is worthwhile to follow up. Is there some documentation available on which parts of AD in Stan have to be executed serially? This is probably not documented specifically, but where would one start?

Sebastian

On Wednesday, August 10, 2016 at 2:10:07 PM UTC+2, Bob Carpenter wrote:
> If it's portable across Windows/Linux/Mac OSX, that'd be
> really useful.
>
> For most of our use cases, the serial part (adding
> the expression into the expression graph) only needs to be
> synchronized, not ordered. For instance, in the ODE
> case for multiple patients, each patient is independent
> of the others conditioned on the parameters, so their
> likelihood contributions can be added independently.
>
> - Bob
>
> > On Aug 9, 2016, at 8:59 AM, Sebastian Weber
> >

Bob Carpenter

Aug 10, 2016, 8:58:06 AM
to stan...@googlegroups.com
While constructing the expression graph, these have to be serial:

* any operation allocating memory in the arena (memalloc_ calls, including
new vari)

* anything that manipulates the stack (setting adjoints to zero, calculating
gradients)

All of the built-in var functions do the former, and anything calculating
gradients or higher-order derivatives does the latter.

That's just reverse mode. Forward mode's fine in parallel,
but we tend to only use fvar<var> and fvar<fvar<var> > instances,
so we have to play by reverse-mode rules anyway.

- Bob

Joshua N Pritikin

Aug 10, 2016, 9:08:41 AM
to stan...@googlegroups.com
On Wed, Aug 10, 2016 at 05:40:27AM -0700, Sebastian Weber wrote:
> So OpenMP offers quite some control over the flow of the program. I don't
> want to say with this that it is easy to do, not at all, but worthwhile to
> follow up. Is there some documentation available on what parts of AD in
> Stan have to be executed serially? This is probably not documented
> specifically, but where would one start?

I don't want to discourage experimentation, but if you want to see a big
speed up then you need to think about how to avoid synchronization
primitives and balance the work among threads. A mutex, or even an atomic
read/write, will often destroy most of the advantage of multiple cores.

Bob Carpenter

Aug 10, 2016, 9:18:31 AM
to stan...@googlegroups.com
That's true with highly parallel and naive implementations,
which is why I generally tell people it's hard to parallelize
autodiff.

In the use case outlined here by Sebastian, there's a big gain
to be had even with synchronization. For each of, say, a few thousand
patients, we need to solve a compute-intensive ODE with sensitivities.
Putting it back together in Stan is simple (just a handful of assignment
statements), so the synchronization isn't locking out most of the work.
Even the expected load balance comes for free.

If you have suggestions for concurrent, lock-free, reverse-mode
autodiff, I'm all ears.

- Bob

Sebastian Weber

Aug 10, 2016, 4:04:52 PM
to stan development mailing list
Hi!

I am still on a toy example, yes, but this looks even more interesting: when parallelizing on 2 CPUs I go from 52s to 26s when integrating an ODE system, but without doing the decoupling operation. This was my first experiment. Not doing the decoupling operation means that my program only deals with doubles, and the result is not yet usable inside the Stan ecosystem, which needs vars.

Now I added an omp critical statement inside the ODE state decoupling operation - this is the part of the code which interacts with the AD stack. Moreover, I used an "ordered" statement so that at the end of my loop I can collect things in the correct serial order. All of this adds heavily to the synchronization duties for OpenMP, but the runtime only increases by 3s to an overall 29s. So I am still getting a large speedup, since the really heavy lifting is done inside the ODE integrator.

Are there other parts of Stan which could benefit from this? What about GP stuff?

Sebastian

Bob Carpenter

Aug 10, 2016, 5:26:17 PM
to stan...@googlegroups.com
I think for most of the GP operations, we want to parallelize
the underlying matrix operations unless there's something specific
to GPs.

How do you control the number of threads with OpenMP? The problem
you're going to run into is that you want at most as many threads
as there are cores (assuming they're all being kept busy), but once
you start nesting parallel algorithms inside one another this becomes
tricky and you can get big slowdowns.

- Bob

Krzysztof Sakrejda

Aug 10, 2016, 5:33:11 PM
to stan development mailing list
On Wednesday, August 10, 2016 at 5:26:17 PM UTC-4, Bob Carpenter wrote:
> I think for most of the GP operations, we want to parallelize
> the underlying matrix operations unless there's something specific
> to GPs.
>
> How do you control the number of threads with OpenMP?

http://stackoverflow.com/questions/11095309/openmp-set-num-threads-is-not-working
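
Short version: either set OMP_NUM_THREADS in the environment or call omp_set_num_threads() before the parallel region, e.g.

#include <omp.h>
#include <cstdio>

int main() {
  // either `export OMP_NUM_THREADS=4` before running, or set it in code:
  omp_set_num_threads(4);

  #pragma omp parallel
  {
    #pragma omp single
    std::printf("running with %d threads\n", omp_get_num_threads());
  }
  return 0;
}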

Sebastian Weber

Aug 11, 2016, 4:48:45 AM
to stan development mailing list
Eigen does support OpenMP out of the box for a limited set of algorithms (general matrix-matrix products and PartialPivLU). The Intel MKL also advertises that it can use OpenMP.

https://eigen.tuxfamily.org/dox/TopicMultiThreading.html

If one calls OpenMP-enabled functions recursively, one can define what happens in that case. The safe choice is to let parallelization occur only at the top level and disable nested parallelization; this is all configurable when using OpenMP.
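
Something like this (a sketch of the relevant knobs):

#include <omp.h>
#include <Eigen/Dense>

void configure_threading(int n_threads) {
  omp_set_num_threads(n_threads);  // threads for our own top-level loops
  omp_set_nested(0);               // disable nested parallel regions
  Eigen::setNbThreads(1);          // keep Eigen serial inside our parallel loops
  // (alternatively, Eigen::setNbThreads(n_threads) if Eigen runs at the top level)
}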

Sebastian

On Wednesday, August 10, 2016 at 11:26:17 PM UTC+2, Bob Carpenter wrote:
> I think for most of the GP operations, we want to parallelize
> the underlying matrix operations unless there's something specific
> to GPs.
>
> How do you control the number of threads with OpenMP? The problem
> you're going to run into is that you want at most as many threads
> as there are cores (assuming they're all being kept busy), but once
> you start nesting parallel algorithms inside one another this becomes
> tricky and you can get big slowdowns.
>
> - Bob
>
> > On Aug 10, 2016, at 10:04 PM, Sebastian Weber
> >

Bob Carpenter

Aug 11, 2016, 8:28:06 AM
to stan...@googlegroups.com
Great news all around. The matrix-matrix products are
going to improve the low levels of many higher-level
algorithms.

I'm not sure which LU decomposition we're using --- it
varies for positive definite and general cases.

- Bob