benchmark on matrix-matrix products?

Sebastian Weber

Aug 21, 2016, 7:09:14 AM8/21/16
to stan development mailing list
Hi!

Encouraged by this benchmark on matrix-matrix products with OpenMP in Eigen,

https://plafrim.bordeaux.inria.fr/doku.php?id=people:guenneba

I thought it would be nice to try this out in a Stan program... does someone have a compute-intensive Stan model around that would be suitable for testing this? It should be as easy as enabling -fopenmp to get a speedup roughly proportional to the number of cores on the machine. So this speedup is almost free, provided you have enough CPUs available and a Stan program that uses so-called GEMM expressions.
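
To make it concrete, here is a minimal standalone sketch (file name and matrix size are just illustrative) for checking Eigen's OpenMP-backed GEMM outside of Stan:

#include <Eigen/Dense>
#include <iostream>

// Compile with: g++ -O3 -fopenmp gemm_bench.cpp -I /path/to/eigen
// then vary OMP_NUM_THREADS to see the scaling.
int main() {
  const int n = 2000;  // illustrative size
  Eigen::MatrixXd a = Eigen::MatrixXd::Random(n, n);
  Eigen::MatrixXd b = Eigen::MatrixXd::Random(n, n);
  std::cout << "Eigen threads: " << Eigen::nbThreads() << "\n";
  Eigen::MatrixXd c = a * b;  // a GEMM expression; Eigen parallelizes it
  std::cout << c(0, 0) << "\n";  // use the result so it isn't optimized away
  return 0;
}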

Best,
Sebastian

Bob Carpenter

Aug 21, 2016, 7:58:13 AM8/21/16
to stan...@googlegroups.com
That is nearly perfect parallelization up to 8 cores, which
is indeed very encouraging. It's a highly parallelizable
problem, so the result is believable.

Parallelizing double calls to Eigen is a great way forward
for us because it doesn't interfere with autodiff.

I don't know how much it'd help in the various factoring cases,
like when we have a big covariance/precision matrix, as in a GP.

Also, it won't be able to achieve speedup on top of what
you'll get by parallelizing independent ODE integrations.

This is related to the thread Rob brought up about all the copying
involved in our current approach of sending double values into Eigen
when we store Matrix<var,...>. Any usage now is going to be bounded
by the copy costs, but it could still be a big win: copying costs
only O(N^2) operations plus memory contention for an N x N
matrix, whereas the multiplication is O(N^3).
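
To make the cost argument concrete, here's a minimal sketch of the copy in question (assuming stan::math::var and its .val() accessor; the function name is illustrative, and Stan Math has its own helpers for this):

#include <Eigen/Dense>
#include <stan/math.hpp>

// Illustrative only: pull the double values out of a Matrix<var> so the
// product can run through Eigen's optimized double GEMM. The copy touches
// each entry once, so it costs O(N^2); the product itself is O(N^3).
Eigen::MatrixXd extract_values(
    const Eigen::Matrix<stan::math::var, Eigen::Dynamic, Eigen::Dynamic>& m) {
  Eigen::MatrixXd out(m.rows(), m.cols());
  for (int j = 0; j < m.cols(); ++j)
    for (int i = 0; i < m.rows(); ++i)
      out(i, j) = m(i, j).val();
  return out;
}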

I think the original poster is a bit confused about the advantages
of hyperthreading---it's pretty much useless for compute-intensive
jobs (both in my own experience and in what I've heard from others).
I do believe it can help with swapping among lightweight processes,
such as OS background jobs.

- Bob

P.S. I've thought the same about GPUs, but I don't know if we can get the
arithmetic precision we need with them. I know almost nothing
about GPU approaches in practice, but the FUD (fear, uncertainty, doubt---
a marketing strategy of large companies) on the street is
that they only work well on single-precision (32-bit) floating point,
and the word from Michael is that that won't be good enough for HMC
(even more divergences, I suspect).

Sebastian Weber

Aug 22, 2016, 8:28:46 AM8/22/16
to stan development mailing list
The recent change to how matrix multiplications are done is what made me think about GEMM with OpenMP. My hope is that this way we get some real benefit here.

Of course, this is not useful for the ODE side of things. For that I have a few ideas, namely

1. one autodiff arena per thread, which would allow a general Stan language "pfor" construct.

2. enabling OpenMP in a natural way for a few commands in the language. For example, the integrate_ode_* functions have a "natural" generalization to multiple cases: introduce a variant that takes ragged arrays (for the time vectors) and an array of parameters. I guess other Stan functions could be generalized similarly.

Option 1 is best, I think, but could be hard to do. Option 2 is straightforward (see the sketch below), but it does not generalize easily (and needs ragged arrays to exist first).
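
To sketch what option 2 would look like at the double level (solve_one is a hypothetical stand-in for a single integrate_ode_* call; only the loop structure matters here):

#include <omp.h>
#include <vector>

double solve_one(double param) {  // hypothetical stand-in for one ODE solve
  return param * param;  // placeholder work
}

// Sketch of option 2: independent integrations run in parallel.
// Each iteration is independent and touches no shared autodiff state,
// so a plain OpenMP parallel for is safe.
std::vector<double> solve_all(const std::vector<double>& params) {
  std::vector<double> results(params.size());
  #pragma omp parallel for
  for (int j = 0; j < static_cast<int>(params.size()); ++j)
    results[j] = solve_one(params[j]);
  return results;
}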

Sebastian


Bob Carpenter

Aug 22, 2016, 8:48:17 AM8/22/16
to stan...@googlegroups.com

> On Aug 22, 2016, at 2:28 PM, Sebastian Weber <sdw....@gmail.com> wrote:
>
> The recent change to how matrix multiplications are done is what made me think about GEMM with OpenMP. My hope is that this way we get some real benefit here.

I hope the same thing!

> Of course, this is not useful for the ODE side of things. For that I have a few ideas, namely
>
> 1. one autodiff arena per thread, which would allow a general Stan language "pfor" construct.

That's not sufficient on its own. We then need to merge
the partials of the concurrently evaluated fragment of
the density back into the global expression graph.

There's a thread-local construct for global variables which
will do just what you're asking---create one copy per thread.
There was a performance hit of around 20% in our earlier tests,
if I'm remembering correctly; it may be higher now that we've
optimized elsewhere, or lower if thread-local implementations
have gotten better.
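
Roughly this pattern, in C++11 terms (the type name is illustrative, not Stan's actual internals):

// One autodiff arena per thread via C++11 thread_local. Each thread
// touching autodiff_stack gets its own instance---the "one copy per
// thread" behavior described above.
struct AutodiffArena {
  // per-thread tape storage would live here
};
thread_local AutodiffArena autodiff_stack;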

- Bob