high dimensional dot products: improving performance / GPU / R-ops


Jeremiah Lowin

Oct 4, 2012, 1:54:08 PM
to thean...@googlegroups.com
Hi everyone -- I've submitted a PR that addresses "numpy semantics" for dot products. We've been iterating on it and I think it's looking OK. Since that's working in a basic form, I wanted to check in with some thoughts on improving it.

Presently, the PR uses numpy's dot for everything. However, high-dimensional dots are implemented very inefficiently in numpy. It's almost always faster to take a tensordot or to express the dot in 2D with transposes and reshapes. The difference can be dramatic:

def dot2d(x, y):
    # one 2D matrix product instead of np.dot's high-dimensional loop
    # (matches np.dot for the 3D examples below; y.ndim > 3 would need
    # a full transpose rather than a single swapaxes)
    xr = x.reshape(-1, x.shape[-1])
    yr = y.swapaxes(0, -2).reshape(y.shape[-2], -1)
    return np.dot(xr, yr).reshape(x.shape[:-1] + y.shape[:-2] + y.shape[-1:])

x = np.random.random((50, 50, 50))
y = np.random.random((50, 50, 50))
timeit np.dot(x, y) # 313 ms
timeit np.tensordot(x, y, [[x.ndim - 1], [y.ndim - 2]]) # 63 ms
timeit dot2d(x, y) # 63 ms

x = np.random.random((5, 100, 100))
y = np.random.random((5, 100, 100))
timeit np.dot(x, y) # 20 ms
timeit np.tensordot(x, y, [[x.ndim - 1], [y.ndim - 2]]) # 3 ms
timeit dot2d(x,y) # 3 ms

So my first thought was to improve tt.dot by applying the reshapes/transposes appropriately for high-dimensional inputs. This has a major side benefit -- besides the raw speedup from numpy alone, I believe this would allow theano to run high-dimensional dot products on the GPU.

A parallel issue is that my current PR uses tt.tensordot to compute gradients for high-dim inputs. This is problematic because tensordot has no R-op -- therefore neither do my high-dim dot products. Looking for a way around this, I came across a thread in theano-users (https://groups.google.com/forum/#!topic/theano-users/FyPJXu0Simk) that introduces an optimization to run tensordot on the GPU by expressing it in terms of a reshaped matrix product (see related PRs https://github.com/Theano/Theano/pull/334 and https://github.com/Theano/Theano/pull/410). Note that this is basically a general case of what I'm proposing for computing the dot product itself.
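For concreteness, here is a minimal NumPy sketch of the idea behind those PRs -- expressing a tensordot as a single 2D matrix product. The name tensordot_as_2d is just illustrative, and this version assumes the contracted axes are already the last axis of x and the first axis of y (the general case would transpose them into those positions first):

```python
import numpy as np

def tensordot_as_2d(x, y):
    # Contract x's last axis with y's first axis via one 2D BLAS dot:
    # flatten each operand to a matrix, multiply, restore the shape.
    # (Sketch only: the general case transposes the contracted axes
    # into these positions first.)
    xr = x.reshape(-1, x.shape[-1])       # (prod(x.shape[:-1]), K)
    yr = y.reshape(y.shape[0], -1)        # (K, prod(y.shape[1:]))
    return np.dot(xr, yr).reshape(x.shape[:-1] + y.shape[1:])

x = np.random.random((5, 6, 7))
y = np.random.random((7, 8, 9))
out = tensordot_as_2d(x, y)
assert out.shape == (5, 6, 8, 9)
assert np.allclose(out, np.tensordot(x, y, axes=1))
```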

So my second thought is to use a version of this new tensordot to compute gradients, instead of tt.tensordot. Reason #1 is that it comes with an R-op. Reason #2 is that it can run on the GPU without an optimization. And finally, I've been testing it, and when at least one of the inputs has ndim > 2 it is almost never slower than tt.tensordot (it seems the reason it's GPU-only is a concern that on the CPU it would be slower than tt.tensordot due to reshape/transpose overhead). In fact, it's possible to accelerate it by almost 25% in some cases, because for dots the first array doesn't need to be transposed at all.

So I propose to implement these two changes to tt.dot on top of the existing PR, but I wanted to check: do you think I'm overlooking anything?

Thanks to everyone for their time and efforts.

James Bergstra

Oct 4, 2012, 2:10:49 PM
to thean...@googlegroups.com
Thanks for the good work and thorough summary. A few thoughts:

1. Adding an R-op to a linear operation such as tensordot is probably
easy, so if an important user-facing Op (either dot or tensordot)
should have one, then I see no reason to work around adding it instead
of simply adding it.

2. It seems to me from the numpy documentation that np.dot is a
special case of np.tensordot. How is it, then, that np.tensordot is
actually faster than np.dot? (Your sample code demonstrated this,
right?) This seems like a potentially simple oversight in the numpy
implementation, and not something we should take care to design
around.

3. If tensordot is indeed a more general form of dot, then perhaps
tt.dot should just be a shorthand for the construction of a tensordot
Op, and we should revise our various optimizations to operate on
tensordot Ops instead of dot Ops. It seems relatively straightforward
to test whether a tensordot corresponds to a dot -- the axis must be a
single constant integer equal to some ndim - 2 or something.

I guess if I understand your points, then it seems that tt.dot should
be a constructor for a TensorDot op, and optimizations should
transform TensorDot nodes into blas Ops, transposes, reshapes, etc.

Maybe this is more of a change than you had in mind though, and I
think it would deserve a new and separate PR.

- James

Jeremiah Lowin

Oct 4, 2012, 2:53:29 PM
to thean...@googlegroups.com
I've been browsing around and have pieced together the following: 
1. np.dot calls a BLAS-optimized multiplication function, but falls back on a slower method when ndim > 2 (see this numpy ticket: http://projects.scipy.org/numpy/ticket/2163).
2. (This surprised me.) np.tensordot actually does what the GPU optimization does for Theano's tensordot -- it transposes, reshapes, and takes a 2D matrix product (see the source code for np.tensordot).

So for ndim > 2, tensordot uses a 2D version of np.dot, but if you had called np.dot itself, it would have used a slower high-dimensional routine. Presumably, that's the source of the performance difference!
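A quick check of that equivalence (my own sketch, not code from the PR): for ndim > 2 the two routines compute the same numbers, just via different code paths -- np.dot through its slow high-dimensional loop, np.tensordot through transpose/reshape and a single 2D BLAS product.

```python
import numpy as np

x = np.random.random((4, 5, 6))
y = np.random.random((3, 6, 2))

# np.dot contracts x's last axis with y's second-to-last axis.
a = np.dot(x, y)
# np.tensordot over the same axis pair goes through transpose/reshape
# and one 2D matrix product instead.
b = np.tensordot(x, y, axes=([x.ndim - 1], [y.ndim - 2]))

assert a.shape == (4, 5, 3, 2)
assert np.allclose(a, b)
```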

One thing I take from this is that it might make sense to have Theano's "GPU-only" version of tensordot replace the CPU one (which simply calls numpy's tensordot). After all, they're apparently performing the identical operation. There shouldn't be a performance hit on the CPU unless theano's reshape/transpose operations are much more expensive than their numpy analogues.

Also, this might be simpler than completely refactoring the TensorDot and Dot ops. In this scenario, tensordot becomes a convenience function for chaining transpose/reshape/dot ops together, just as it presently is on the GPU, and there's no need to add an R-op or gradient at all (as they already exist on the component ops).

However, I'm certainly not qualified nor sufficiently informed to be weighing in on Theano architectural decisions -- I defer to everyone else's expertise here.

Jeremiah

James Bergstra

Oct 4, 2012, 3:01:43 PM
to thean...@googlegroups.com
Thanks for clarifying, this makes good sense to me. If there's an
equivalence between a tensordot and a mess of reshapes, transposes,
and dot, then it would be cleaner and perhaps easier down the road to
represent this as a tensordot, but certainly simpler in the short term
to represent it as a mess. As you say, R_op works immediately, and
all of these ops are among the best-tested in Theano. +1 for making
tensordot build this larger subgraph rather than constructing a
tensordot op at all (this was your suggestion right?). And yes, I
think this *would* make sense in the current PR.

Jeremiah Lowin

Oct 4, 2012, 3:20:18 PM
to thean...@googlegroups.com
Yes, that was my suggestion. And fortunately, it's a rather small mess.

I'll wait a while to see if anyone disagrees, but if not I will start updating the PR so we can see how this would look and if it actually makes any sense.

Thanks

Razvan Pascanu

Oct 4, 2012, 3:30:01 PM
to thean...@googlegroups.com
+1 from me

Sent from my iPad

Frédéric Bastien

Oct 5, 2012, 11:51:25 AM
to thean...@googlegroups.com
+1,

The one small objection could be that reshape may do a copy if the
data is not contiguous in memory. But with the 10x speed difference
you showed, it should still be faster with your modification even if
we copy.

This also removes some ops from Theano, which is always a good thing,
as it makes the optimizations simpler to write.

I'm not sure it is worth rewriting everything in a new PR to use a
tensordot op instead of a graph around the dot op. What is sure is
that I won't do it, as I don't see a good reason for it versus the
significant time it would take.

Fred

Jeremiah Lowin

Oct 5, 2012, 5:33:25 PM
to thean...@googlegroups.com, no...@nouiz.org
There are many opportunities to optimize the new tensordot function by skipping operations that are unnecessary or redundant, like if the input is already a matrix (no reshape) or already has its dimensions in the required order (no transpose). Relatedly, there are a number of cases where I could choose to use vector/matrix multiplication or matrix/matrix multiplication. I'm curious if there's a good reason to prefer one over the other? I believe Theano has an optimization for matrix/matrix products; I don't know about matrix/vector.

As an example of when this choice comes up, if all the axes of an input array are being summed over, then its reshaped form could be either a vector or a row/column matrix (depending on whether it's the first or second input). Alternatively, if a vector is passed in as an input, it can either be left as a vector or reshaped to a row/column matrix. In the first example, reshaping must take place and I just need to decide whether to reshape it as (N) or (1xN). In the second example, the reshape is optional and probably only worth it if there is a significant speedup from some theano-specific optimization. Basically this is a question of reshape overhead vs optimization speedup.

Right now, when reshaping is required I use row/column matrices instead of vectors; when reshaping is optional, I don't do it at all. I'm curious if there's a better option?
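To illustrate the choice with a NumPy sketch (my own example, not the PR code): when every axis of the first input is contracted, flattening it to a vector or to a 1-row matrix gives the same numbers, differing only in the output's shape.

```python
import numpy as np

x = np.random.random((3, 4))      # all of x's axes are summed over
y = np.random.random((3, 4, 5))   # contracted against x's 12 elements

y2d = y.reshape(12, 5)

# Option 1: flatten x to a vector (N,) -> vector/matrix product.
v = np.dot(x.reshape(-1), y2d)
# Option 2: flatten x to a row matrix (1, N) -> matrix/matrix product.
m = np.dot(x.reshape(1, -1), y2d)

assert v.shape == (5,)
assert m.shape == (1, 5)
assert np.allclose(v, m.ravel())
```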

Thanks!

James Bergstra

Oct 5, 2012, 5:42:13 PM
to thean...@googlegroups.com

If you implement tensordot in terms of reshapes, transposes, and dot,
then there are several optimizations that should already kick in,
shouldn't they? Time has been spent on optimizing such subgraphs
already, but I'm sure there are still badly-handled cases. I'd recommend holding
off on such optimizations for this PR, and then putting together some
badly-handled cases to motivate a new PR for any missing
optimizations. Theano doesn't currently have support for
profile-driven optimizations, so sometimes we just have to make a
choice without knowing all the information we'd like.

I don't think it will make any difference whether you make your
vectors rows or columns, they should be put into the same canonical
form during the optimization path regardless.

HTH,
- James

Jeremiah Lowin

Oct 5, 2012, 6:24:49 PM
to thean...@googlegroups.com
Well, I also thought there were such optimizations, so I didn't bother with them at first. But then I added a guard against transposing a matrix to its same dimensions (i.e. m.transpose(0, 1)) and saw some significant speedups, so I thought perhaps I was mistaken and did some other tests, like m.reshape([m.shape[0], m.shape[1]]).

I'm not at my desk right now, so I can't check to be sure -- it's possible I got some spurious results. For now I'll assume I was wrong and commit the non-guarded functions -- simpler is always better to start.
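The guard was something along these lines (a sketch from memory; maybe_transpose is just an illustrative name -- it simply skips transposes that would be no-ops):

```python
import numpy as np

def maybe_transpose(x, axes):
    # Skip the transpose entirely when the requested axis order is
    # already the identity, e.g. m.transpose(0, 1) on a matrix.
    if tuple(axes) == tuple(range(x.ndim)):
        return x
    return np.transpose(x, axes)

m = np.arange(6).reshape(2, 3)
assert maybe_transpose(m, (0, 1)) is m    # identity order: same object back
assert np.array_equal(maybe_transpose(m, (1, 0)), m.T)
```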

Thank you

James Bergstra

Oct 16, 2012, 4:11:20 PM
to thean...@googlegroups.com
Has this been logged in github as an issue?

Jeremiah Lowin

Oct 16, 2012, 5:11:20 PM
to thean...@googlegroups.com
No -- I can't reproduce what I saw. It was probably some combination of caching and my CPU doing something else. The entire change is logged as PR #986, however. Once the non-aligned arrays bug is fixed (now in GH issues), I can finish the unit tests for the new tensordot.

Jeremiah Lowin

Oct 17, 2012, 5:14:04 PM
to thean...@googlegroups.com, no...@nouiz.org
The new tensordot wrapper is getting close to done. The unit tests don't run on my machines because of the _dot22 bug we've been discussing elsewhere, so ironing everything out has been a little rough (and a little too reliant on Travis!). I've caught a couple more corner cases this afternoon.

Here are some interesting things I've learned -- but keep in mind this is the result of very casual exploration:
1. The new tensordot is much faster than the existing dot for non-trivial high-dimensional arrays. For two random (50 x 50 x 50) arrays, I time dot at 304ms and tensordot at 56ms. The old tensordot Op clocks in at 63ms, so call it even.

2. The new tensordot is sometimes *slower* than dot at calculating gradients due to optimizations. For the same 50^3 arrays, dot takes 113ms to compute the gradient and new tensordot takes 159ms. The old tensordot was also quick -- 111ms. However, without optimizations, dot takes 1.9 seconds and tensordot takes 1.6 seconds (same as old tensordot). Since dot's grad method actually calls tensordot, I feel like there's probably some work that can be done here to get them to the same place.

3. The new tensordot is much faster than dot at calculating an R-op: 630ms vs. 91ms. In fairness, neither of them could do it at all prior to this PR. This is somewhat surprising for the same reason as above -- dot's grad method calls tensordot, so I'm not sure why there's a difference at all.

Given all that, I would like dot to actually call tensordot for inputs with ndim > 2. In other words, I think it would be very clean to have the Dot op handle 2D dot products and to access that op through two functions: tensor.tensordot for general cases and tensor.dot for very specific ones (calling the op directly or tensordot, as appropriate). We're about 90% there right now, except that I can't get tensor.dot to work as a function. I tried writing a simple "tensor.dot" function to do what I've described, but it broke tests all over the place. It seems there is an expectation elsewhere in Theano that tensor.dot is a Dot op instance and not a function. Does anyone have any suggestions, or am I just making a mistake? In the meantime I've left the simple wrapper as tensor.dotfunc.

Thanks very much,
Jeremiah



colman Tse

Sep 28, 2016, 11:31:28 AM
to theano-dev, no...@nouiz.org
Dear Jeremiah,

To my understanding, np.tensordot would be the simplest way to modify np.dot for a speedup. I wonder, does it help with matrix-vector calculations as well?

np.tensordot(x, y, [[x.ndim - 1], [y.ndim - 2]])

However, when I tried it by tensordotting a matrix (30000x300) with a vector (300x1), it gave a memory error, while np.dot easily completed the process. Does that mean tensordot is much more memory intensive, or am I just doing something wrong?

np.tensordot(A, v, axes=0)

A is the matrix, v the vector.

best,
Colman
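A small NumPy sketch (with smaller stand-in shapes) of what the two calls above compute: axes=0 contracts nothing and forms the full outer product of shape A.shape + v.shape -- far too large to allocate for the (30000, 300) x (300, 1) case -- whereas contracting A's last axis with v's first reproduces np.dot.

```python
import numpy as np

A = np.random.random((30, 3))   # stand-in for the (30000, 300) matrix
v = np.random.random((3, 1))    # stand-in for the (300, 1) vector

# axes=0 contracts nothing: it forms the full outer product, with
# shape A.shape + v.shape. At the original sizes that would be
# 30000*300*300*1 doubles, which explains the memory error.
outer = np.tensordot(A, v, axes=0)
assert outer.shape == (30, 3, 3, 1)

# Contracting A's last axis with v's first reproduces np.dot:
prod = np.tensordot(A, v, axes=([1], [0]))
assert prod.shape == (30, 1)
assert np.allclose(prod, np.dot(A, v))
```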