Stan's custom matrix multiply() slower than Eigen operator*


Bob Carpenter

Apr 13, 2015, 1:50:58 AM
to stan...@googlegroups.com
I've been doing some careful evaluations for the paper on Stan's
autodiff and realized that for two matrices of type Matrix<var, Dynamic, Dynamic>,
Eigen's built-in operator*(a,b) is about 50% faster for autodiff
than Stan's custom stan::agrad::multiply(a,b) function!

I'm attaching plots of time taken relative to Stan for pretty much all
the other open-source, operator overloading C++ autodiff packages and
for just calling the functions with double arguments.

matrix_product_eigen_rel_eval.pdf
matrix_product_stan_rel_eval.pdf

Marcus Brubaker

Apr 13, 2015, 9:47:42 AM
to stan...@googlegroups.com
Hmmm, strange.  This definitely goes against our experience and intuition elsewhere, e.g., dot_product() being much faster than a looped sum.  Maybe there's something fishy going on with multiply().

Bob, can you check if this holds true for just a*b vs. multiply(a,b)?  I'd like to know whether it's Eigen being excessively clever with lazy evaluation of the sum or something else.

Memory access patterns could be at play, but I'm hard-pressed to believe that's what's happening here.  One possibility is that there is excessive bounds checking happening inside multiply().  That could definitely kill performance, but it would depend on how things were compiled.  Can you run things again with -DEIGEN_NO_DEBUG?
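For reference, the flags would go on the compile line something like this (the source file name and include paths here are placeholders, not the actual test setup):

```shell
# Hypothetical compile line; file and -I paths are placeholders.
# -DNDEBUG disables assert() (and any checks built on it);
# -DEIGEN_NO_DEBUG disables Eigen's internal runtime bounds checking.
g++ -O3 -DNDEBUG -DEIGEN_NO_DEBUG \
    -I lib/eigen -I src \
    matrix_product_timing.cpp -o matrix_product_timing
```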

Cheers,
Marcus


--
You received this message because you are subscribed to the Google Groups "stan development mailing list" group.
To unsubscribe from this group and stop receiving emails from it, send an email to stan-dev+u...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


I ran everything twice and the results are stable (each size is evaluated
2000 times for each system).

The horizontal axis is total number of elements in two matrices
a and b of type Matrix<var, Dynamic, Dynamic>.  So the biggest eval, with
almost 2^12 (4096 or so) variables is multiplying two
(45 x 45) matrices.  The functions being evaluated (and autodiffed for
gradients) are:

   eigen:  (a * b).sum()

   stan:   sum(multiply(a,b))

The operator*() and .sum() operations are built into Eigen, whereas
the sum() and multiply() operations are from stan::agrad.
For systems other than Stan, the Eigen version is used in both graphs.

I imagine what's going on is that Eigen is very clever about
using the traits defined in std::numeric_limits<stan::agrad::var>
to increase memory locality by doing blocking operations.
In all other evaluations, CppAD was slower than Adept, but here
CppAD is a win because it also defines numeric_limits.

Maybe we should just be calling Eigen's operator*() for multiplying two var
matrices, though it'll probably take about 30--50% more memory.  Our
implementation of multiply() reduces the number of vari instances, each of
which carries 16 additional bytes of overhead for a vtable pointer and adjoint,
so using Eigen's operator*() costs something like an extra
32 * (N^3 - N^2) bytes.

Of course, we can't just call Eigen for the var * double versions, and we
don't exactly do a lot of matrix-matrix products of parameters, so it's
probably no big deal.  Just a bit embarrassing for the paper!

- Bob



Marcus Brubaker

Apr 13, 2015, 10:00:42 AM
to stan...@googlegroups.com
One last thought.  Can you confirm that the proper reverse-mode overloaded version of multiply() is getting called?  These results would make perfect sense if the default one were somehow getting called instead, because we'd lose all the AD benefits of our version *and* lose the benefits of Eigen's expression templates.

Cheers,
Marcus

Bob Carpenter

Apr 13, 2015, 6:08:58 PM
to stan...@googlegroups.com
That was with both -DNDEBUG and -DEIGEN_NO_DEBUG. I'll set up the
test to just do the multiply without the summation later today.

- Bob

Bob Carpenter

Apr 13, 2015, 6:09:59 PM
to stan...@googlegroups.com
And yes, I already verified that our overloaded multiply was
getting called --- that was my first thought as to what was
going wrong.

- Bob

Marcus Brubaker

Apr 13, 2015, 6:13:23 PM
to stan...@googlegroups.com
Drat, there went my easy-to-fix ideas.  Let me know what you find.  If nothing obvious pops up, can you send me the code you're using to do the testing so I can take a look?  Unless we have somehow completely misunderstood where our computational costs are, this seems likely to be some kind of bug (booo) but is also an opportunity for optimization (wooo).

Cheers,
Marcus

Bob Carpenter

Apr 13, 2015, 6:13:59 PM
to stan...@googlegroups.com
By the way, the reason I think it's memory is that the only
two systems that provide the appropriate memory size traits
are CppAD and Stan, and they're relatively much faster with
the Eigen calls than they are with just basic arithmetic, compared
to Adept or Sacado, which are the other two systems that are
neck-and-neck with Stan in other evaluations.

- Bob


Bob Carpenter

Apr 14, 2015, 7:32:59 AM
to stan...@googlegroups.com
Here are the results with no sum involved, which show the same relative
slowdown as the version with the sum.

I'm attaching the test programs and makefile, but you'll need to
either download the rest from stan-dev/papers/agrad_rev/experiments
or comment out the other systems' eval code.

- Bob

matrix_product_eigen_timing2.cpp
matrix_product_stan_timing2.cpp
test_harness.hpp
makefile
plot_results.R
matrix_product_eigen_nosum_rel_eval.pdf
matrix_product_stan_nosum_rel_eval.pdf