(CPU) Performance, support for BLAS instead of Eigen?


Lucas Beyer

Nov 10, 2015, 6:49:43 PM
to Discuss
First of all, let me thank you for open-sourcing TensorFlow; it looks like a well-designed library.

I noticed that some of the core Ops for CPU seem to be implemented using Eigen3. This made me concerned about performance, so I timed it. (Timing code is available at https://github.com/lucasb-eyer/gemm.) These are the important results:
  • Julia CPU: 4.3072s
  • Theano CPU: 4.3492s
  • Tensorflow CPU: 13.3393s
Julia and Theano clearly use the same underlying BLAS library and thus have similar timings. TensorFlow is horribly slow on the CPU in comparison! I believe this comes from the use of Eigen3. (Yes, I verified that all my cores are being fully utilized during computation.) This brings me to my main question:

Is there a plan to support using a BLAS library as the back-end for TensorFlow on CPU in the future?

Cheers,

Lucas


PS: These are GPU timings for a GTX980:

  • Julia GPU: 0.5368s
  • Theano GPU: 0.6000s
  • Tensorflow GPU: 0.6979s

You'll note the almost 100 ms difference, which is quite a lot. I'm not sure where it comes from, but if I had to guess, I'd guess different memory management. I'm running the Op 10 times and my GPU doesn't have enough memory for 10 result tensors, so freeing is happening somewhere.
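For context, the core of the TensorFlow CPU timing loop looks roughly like this (a minimal sketch against the 2015-era Python API, not the actual benchmark code, which lives in the linked repo; it follows the same pattern of a warm-up run followed by 10 timed runs, reporting the fastest):

import time
import numpy as np
import tensorflow as tf

n = 10000  # 10k x 10k single-precision matrices
A = tf.constant(np.random.randn(n, n).astype(np.float32))
B = tf.constant(np.random.randn(n, n).astype(np.float32))
C = tf.matmul(A, B)

s = tf.Session()
s.run(C)  # warm-up run, excluded from the timings

times = []
for _ in range(10):
    t0 = time.time()
    s.run(C)  # computes the product and fetches it back as a numpy array
    times.append(time.time() - t0)
print(min(times))  # report the fastest of the 10 runs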

Andrew Tomazos

Nov 10, 2015, 7:00:44 PM
to Lucas Beyer, Discuss
Out-of-the-box Eigen single-core CPU performance is comparable to optimized BLAS implementations.  The difference is that Eigen doesn't use multiple cores, whereas something like ATLAS does.

If the underlying linear-algebra library dominates the performance, then looking at these numbers:
  • Julia CPU: 4.3072s
  • Theano CPU: 4.3492s
  • Tensorflow CPU: 13.3393s
I ask myself whether this was run on an x86 machine with 2 cores (4 hyperthreads). That would account for the 3x performance difference.

If you run the test again, watch the per-core CPU utilization in your system monitor. When TensorFlow is running, is one core at 100% and the rest idling? (You can also look in `top` if you are on Linux: 100% CPU usage means one core, 400% CPU usage means 4 cores.)
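If you'd rather script the check than watch a system monitor, something along these lines works; it assumes the third-party `psutil` package (not part of TensorFlow), run in a second terminal while the benchmark is going:

import psutil

# Print per-core CPU utilization once a second; ~100% on one core and
# ~0% on the rest means the computation is effectively single-threaded.
while True:
    per_core = psutil.cpu_percent(interval=1.0, percpu=True)
    print(" ".join("%5.1f%%" % p for p in per_core))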


Vijay Vasudevan

Nov 10, 2015, 7:19:40 PM
to Lucas Beyer, Discuss
On Tue, Nov 10, 2015 at 3:49 PM, Lucas Beyer <lucasb....@gmail.com> wrote:
First of all, let me thank you for open-sourcing tensorflow, it looks like a well-designed library.

I noticed that some of the core Ops for CPU seem to be implemented using Eigen3. This made me concerned about performance, thus I timed it. (Timing code available at https://github.com/lucasb-eyer/gemm) These are the important results:
  • Julia CPU: 4.3072s
  • Theano CPU: 4.3492s
  • Tensorflow CPU: 13.3393s
Hi Lucas,

How did you build TensorFlow?  From source using bazel, or using our pip binary?

What kind of processor are you running on? Does it support AVX or AVX2?
 
Both Julia and Theano clearly use the same underlying BLAS library, thus have similar timings. Tensorflow is horribly slow on the CPU in comparison! I believe this comes from the use of Eigen3. (Yes, I verified that all my cores are being fully utilized during computation.) This brings me to my main question:

Is there a plan to support using a BLAS library as back-end for Tensorflow CPU in the future?

One of the design goals of TensorFlow is to allow you to plug in your own implementations of operations (kernels).

For example, we have two different implementations of gradient kernels for conv2d here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/conv_grad_ops.cc#L428

So yes, it is possible to plug in an alternative implementation of, say, the matmul operation using a different underlying BLAS library :).

If this is something that interests you, let us know and we can help you prototype and benchmark!
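As a quick way to see what a BLAS-backed matmul costs on your machine without touching TensorFlow at all, you could also time NumPy's dot, since NumPy is typically linked against a system BLAS (ATLAS, OpenBLAS, or MKL, depending on the install); a rough sketch:

import time
import numpy as np

n = 10000
A = np.random.randn(n, n).astype(np.float32)
B = np.random.randn(n, n).astype(np.float32)

A.dot(B)  # warm-up
t0 = time.time()
A.dot(B)  # runs on whatever BLAS NumPy is linked against
print(time.time() - t0)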



Cheers,

Lucas


PS: These are GPU timings for a GTX980:

  • Julia GPU: 0.5368s
  • Theano GPU: 0.6000s
  • Tensorflow GPU: 0.6979s

You'll note the almost 100ms difference, which is quite a lot. I'm not sure where it comes from, but if I had to guess, I'd guess different memory-management. I'm running the Op 10 times and my GPU doesn't have enough memory for 10 result tensors, so freeing is happening somewhere.


Vijay Vasudevan

Nov 10, 2015, 7:32:12 PM
to Lucas Beyer, Discuss
On Tue, Nov 10, 2015 at 3:49 PM, Lucas Beyer <lucasb....@gmail.com> wrote:
In your benchmark, you have:

C = tf.matmul(A, B)
...
s.run(C)

C is a Python Tensor object, and calling s.run(C) has the effect that the contents of the matrix-multiply result are copied back from GPU to CPU so they can be used in your Python program. If you want to just run the operation and fetch nothing in return, you would have to do:

s.run(C.op)


On GPU we're just calling CUBLAS underneath for the matmul op, so I wouldn't actually expect any performance difference for this simple benchmark.

Lucas Beyer

Nov 11, 2015, 4:18:16 AM
to Discuss
Thanks for the answers.

@Andrew: I know the difference between BLAS and Eigen, which is why I hoped for BLAS support. Eigen does use multiple cores; it's just worse at using them. When I wrote "(Yes, I verified that all my cores are being fully utilized during computation.)", I may not have been clear enough, sorry. I meant that each virtual core was at ~100%, i.e. ~1200% total.

My machine is an i7-5930K with 6 cores (12 hyperthreads) which supports AVX2.

@Vijay:


How did you build TensorFlow?  From source using bazel, or using our pip binary?
I used pip. I also have a bazel-compiled version available for testing, if necessary.


If this is something that interests you, let us know and we can help you prototype and benchmark!
It does, but it's quite hard for me to believe Google doesn't have that internally (cf. the difference in performance!), so I hoped to avoid wasted effort in case it's planned to be done/released anyway.

C is a python Tensor object, and calling s.run(C) has the effect that the contents of the matrix multiply result are being copied back from GPU to CPU so it can be used in your python program.
I know, and it's the same in the Theano version of my benchmark, barring mistakes. Only the Julia GPU benchmark doesn't transfer back, which is unfair and which I'll fix.

s.run(C.op)
That only measures the time needed to schedule the computation, not the time for it to complete, because cuBLAS is asynchronous by default. Proof: the time it measures is "0.0001380443573s".

On GPU we're just calling CUBLAS underneath for the matmul op, so I wouldn't actually expect any performance difference for this simple benchmark.
Yes, but there's still a 100 ms (~16-20%) difference to Theano, which I guess comes from memory management. (But I didn't verify.)

Cheers,
Lucas

Gökçen Eraslan

Nov 11, 2015, 4:35:48 AM
to Lucas Beyer, Discuss
On 2015-11-11 10:18, Lucas Beyer wrote:
> Thanks for the answers.
>
> @Andrew: I know the difference between BLAS and Eigen, which is why I
> hoped for BLAS support. Eigen does
> <http://eigen.tuxfamily.org/dox/TopicMultiThreading.html> use multiple
> cores, it's just worse
> <http://gcdart.blogspot.co.uk/2013/06/fast-matrix-multiply-and-ml.html>
> at using them. When I wrote "(Yes, I verified that all my cores are
> being fully utilized during computation.)", I may not have been clear
> enough, sorry. I meant each virtual core was at ~100%, i.e. ~1200% total.
>
> My machine is an i7-5930K with 6 cores (12 hyperthreads) which supports
> AVX2.
>

It would be nice to see how your results compare to TensorFlow compiled with MKL-linked Eigen.

Lucas Beyer

Nov 11, 2015, 5:08:43 AM
to Discuss

Would be nice to see how your results compare to tensorflow compiled with MKL-linked eigen.
Ah, very good point. I'll have to sift through the build scripts to see how to do that, but it sounds like a better option than implementing a new Op.

Thanks,
Lucas

Vijay Vasudevan

Nov 11, 2015, 2:39:43 PM
to Lucas Beyer, Discuss
On Wed, Nov 11, 2015 at 1:18 AM, Lucas Beyer <lucasb....@gmail.com> wrote:
Thanks for the answers.

@Andrew: I know the difference between BLAS and Eigen, which is why I hoped for BLAS support. Eigen does use multiple cores, it's just worse at using them. When I wrote "(Yes, I verified that all my cores are being fully utilized during computation.)", I may not have been clear enough, sorry. I meant each virtual core was at ~100%, i.e. ~1200% total.

My machine is an i7-5930K with 6 cores (12 hyperthreads) which supports AVX2.

@Vijay:

How did you build TensorFlow?  From source using bazel, or using our pip binary?
I used pip. I also have a bazel compiled version available for testing, if necessary.

Yeah, our pip package isn't built with AVX or AVX2, so you'd have to build from source and pass the compiler flags to enable -mavx or -mavx2. I'm sure you'll see better performance that way :)
 


If this is something that interests you, let us know and we can help you prototype and benchmark!
It does, but it's quite hard for me to believe Google doesn't have that internally (cf. the diff in performance!), and so I hoped to avoid vain effort in the case it's planned to be done/released anyways.

Indeed, we do have a lot of benchmarks internally that show much more competitive results, but our internal environment is different from the open-source environment, so it's nice to have someone like you help us understand the differences in the open-source version.



C is a python Tensor object, and calling s.run(C) has the effect that the contents of the matrix multiply result are being copied back from GPU to CPU so it can be used in your python program.
I know and it's the same in the Theano version of my benchmark, barring mistakes. Only the Julia GPU benchmark doesn't transfer back, which is unfair and I'll fix.

s.run(C.op)
That only measures the time necessary for scheduling the computation, not for it to complete, because cuBLAS is async by default. Proof: the time it measures is "0.0001380443573s".

Indeed -- everything is asynchronous :). If you wanted to minimize the amount of data transferred back to the CPU, you could probably slice a single number from the output and copy only that back.
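For example, something like this (a sketch; C is the matmul result from your benchmark) forces the matmul to complete while copying only a single element back to the host:

c00 = tf.slice(C, [0, 0], [1, 1])  # a 1x1 slice of the result
s.run(c00)  # the matmul must finish, but only one float is transferred back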
 

On GPU we're just calling CUBLAS underneath for the matmul op, so I wouldn't actually expect any performance difference for this simple benchmark.
Yes, but there's still 100ms (~16-20%) difference to Theano, which I guess comes from memory-management. (But I didn't verify)

For something as small as this, I doubt it's memory management, but feel free to verify. If you run the benchmarks for longer (more than 10 matmuls, say), does the gap shrink? It's possible that the first call to run is a bit slower than the others, so you could run the graph once or twice for each framework before timing to see if there's some small startup cost.

 

Cheers,
Lucas


Lucas Beyer

Nov 11, 2015, 3:52:35 PM
to Discuss
On GPU we're just calling CUBLAS underneath for the matmul op, so I wouldn't actually expect any performance difference for this simple benchmark.
Yes, but there's still 100ms (~16-20%) difference to Theano, which I guess comes from memory-management. (But I didn't verify)

For something as small as this, I doubt it's memory management, but feel free to verify.  If you run the benchmarks for longer (more than 10 matmuls, say), does the gap shrink?
No: running 200 times, the best time is still 0.6948 s.

  It's possible that the first call to run is a bit slower than the others, so you could either run the graph once or twice for each framework before timing to see if there's some small startup cost.
Actually, I've been doing exactly that for all of the benchmarks :) I also only report the fastest of the 10 runs, not the average.
I posted a link to the code in the first post in case you're curious about the details, or want to verify I'm not making a Tensorflow-noob mistake.

Actually, looking at all the times: they're all within [0.69, 0.71] s for TensorFlow, whereas Theano's are all within [0.60, 0.62] s and Julia's (having added the device->host copy of the result) are within [0.60, 0.61] s. Using 15k² matrices instead of 10k², the times are within [2.25, 2.28] s for TensorFlow, [2.00, 2.01] s for Theano, and [1.99, 2.00] s for Julia.

The Julia and Theano times being virtually identical suggests I really am measuring both computation and transfer of results, because the Julia code is explicit about this.
Still, it looks like TensorFlow has an overhead whose magnitude scales with the problem size/runtime, which strikes me as odd. I don't know TensorFlow's internals (yet) and have limited time, so this is where my investigation ends (for now), unless someone has more ideas.

Thanks for all the input so far,
Lucas

Vijay Vasudevan

Nov 11, 2015, 4:01:31 PM
to Lucas Beyer, Discuss
Indeed, that is odd. We're generally working on performance improvements for our OSS release, so I'm sure this will shake out in short order.

FWIW, we also have some C++-only microbenchmarks of matmul to cross-check against, if you were so inclined: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/matmul_op_test.cc

Thanks for the effort!


pb...@google.com

Nov 11, 2015, 4:48:28 PM
to Discuss
I have no idea how much difference this makes to performance, but the way this benchmark is written causes the result to be both DMA'd back from the GPU into a host-memory Tensor and then (AFAIK) converted into a numpy array. This second step might explain the latency difference.

I am aware of at least one performance regression in the DMA path which crept in over the last couple of weeks, for which a fix is in the pipeline.

In order to just copy the result back to host memory, you can do

with tf.device('cpu:0'):

It would be possible to check 

pb...@google.com

Nov 11, 2015, 4:50:51 PM
to Discuss

...In order to just copy the result back to host memory, you can do something like:

with tf.device('cpu:0'):
    D = tf.identity(C)  # copies the GPU result into a host-memory tensor
sess.run(D.op)  # matmul + device->host copy, but never materializes a numpy array

Paul

Lucas Beyer

Nov 11, 2015, 7:16:26 PM
to Discuss, pb...@google.com
and then (AFAIK) converted into a numpy array.  This second step might explain the latency difference.
Aha, good catch; this sounds very plausible.

Adding the `np.array(..)` to Theano brings its timings into the range [2.18, 2.20] for the 15k² matmul and [0.68, 0.69] for the original 10k².
Meanwhile, using your `tf.identity` trick (neat, thanks) brings TensorFlow into the range [2.05, 2.08] for 15k² and [0.61, 0.62] for the original 10k².
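For reference, the Theano-side adjustment is essentially forcing an explicit copy of the returned result into a fresh host ndarray, roughly like this (a sketch of the idea only; the actual setup in the linked benchmark differs in its details):

import numpy as np
import theano
import theano.tensor as T

n = 10000  # or 15000 for the larger case
A = theano.shared(np.random.randn(n, n).astype(np.float32))
B = theano.shared(np.random.randn(n, n).astype(np.float32))
f = theano.function([], T.dot(A, B))

result = np.array(f())  # explicit copy into a fresh host ndarray, matching what s.run(C) hands back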

I believe this experiment confirms your idea. The remaining discrepancy when converting to an np.array (2.18 s vs. 2.25 s) is irrelevant for all practical purposes IMHO, and I actually prefer what TensorFlow does here.

Thanks everyone for chiming in!

Cheers,
Lucas