First of all, let me thank you for open-sourcing TensorFlow; it looks like a well-designed library.
I noticed that some of the core Ops for CPU seem to be implemented using Eigen3. This made me concerned about performance, so I timed it. (Timing code available at https://github.com/lucasb-eyer/gemm) These are the important results:
- Julia CPU: 4.3072s
- Theano CPU: 4.3492s
- TensorFlow CPU: 13.3393s
Both Julia and Theano clearly use the same underlying BLAS library and thus have similar timings. TensorFlow is horribly slow on the CPU in comparison! I believe this comes from the use of Eigen3. (Yes, I verified that all my cores are being fully utilized during computation.) This brings me to my main question:
Is there a plan to support using a BLAS library as the back-end for TensorFlow on the CPU in the future?
Cheers,
Lucas
PS: These are GPU timings for a GTX980:
- Julia GPU: 0.5368s
- Theano GPU: 0.6000s
- TensorFlow GPU: 0.6979s
You'll note the almost 100ms difference, which is quite a lot. I'm not sure where it comes from, but if I had to guess, I'd guess differences in memory management. I'm running the Op 10 times and my GPU doesn't have enough memory for 10 result tensors, so freeing is happening somewhere.
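In case it helps to see what I'm timing, here is a rough sketch of the kind of loop I mean. It is not the exact benchmark code (that's in the repo linked above), and the matrix size and the use of tf.constant are just illustrative:

    import time
    import numpy as np
    import tensorflow as tf

    N = 6144  # illustrative size, not necessarily the one used in the benchmark
    a = np.random.randn(N, N).astype(np.float32)
    b = np.random.randn(N, N).astype(np.float32)

    A = tf.constant(a)
    B = tf.constant(b)
    C = tf.matmul(A, B)

    s = tf.Session()
    s.run(C)  # one warm-up run

    t0 = time.time()
    for _ in range(10):  # the Op is run 10 times, as described above
        s.run(C)  # fetching C also copies the result back to the host
    print(time.time() - t0)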
--
How did you build TensorFlow? From source using bazel, or using our pip binary?
If this is something that interests you, let us know and we can help you prototype and benchmark!
C is a python Tensor object, and calling s.run(C) has the effect that the contents of the matrix multiply result are being copied back from GPU to CPU so it can be used in your python program. You can also run just the op, without fetching the result, via s.run(C.op).
On GPU we're just calling CUBLAS underneath for the matmul op, so I wouldn't actually expect any performance difference for this simple benchmark.
It would be nice to see how your results compare to TensorFlow compiled with MKL-linked Eigen.
Thanks for the answers. My machine is an i7-5930K with 6 cores (12 hyperthreads), which supports AVX2.
@Andrew: I know the difference between BLAS and Eigen, which is why I hoped for BLAS support. Eigen does use multiple cores; it's just worse at using them. When I wrote "(Yes, I verified that all my cores are being fully utilized during computation.)", I may not have been clear enough, sorry. I meant each virtual core was at ~100%, i.e. ~1200% total.
@Vijay:
> How did you build TensorFlow? From source using bazel, or using our pip binary?
I used pip. I also have a bazel-compiled version available for testing, if necessary.
> If this is something that interests you, let us know and we can help you prototype and benchmark!
It does, but it's quite hard for me to believe Google doesn't have that internally (cf. the difference in performance!), so I hoped to avoid wasted effort in case it's planned to be done/released anyway.
> C is a python Tensor object, and calling s.run(C) has the effect that the contents of the matrix multiply result are being copied back from GPU to CPU so it can be used in your python program.
I know, and it's the same in the Theano version of my benchmark, barring mistakes. Only the Julia GPU benchmark doesn't transfer back, which is unfair and I'll fix.

> s.run(C.op)
That only measures the time necessary for scheduling the computation, not for it to complete, because cuBLAS is async by default. Proof: the time it measures is "0.0001380443573s".
> On GPU we're just calling CUBLAS underneath for the matmul op, so I wouldn't actually expect any performance difference for this simple benchmark.
Yes, but there's still an almost 100ms (~16-20%) difference to Theano, which I guess comes from memory management. (But I didn't verify.)
Cheers,
Lucas
--
> Yes, but there's still an almost 100ms (~16-20%) difference to Theano, which I guess comes from memory management. (But I didn't verify.)
For something as small as this, I doubt it's memory management, but feel free to verify. If you run the benchmarks for longer (more than 10 matmuls, say), does the gap shrink?
It's possible that the first call to run is a bit slower than the others, so you could run the graph once or twice for each framework before timing, to see if there's some small startup cost.
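For example, something like this (a sketch; s and C as in the earlier messages, and the iteration count is arbitrary):

    import time

    s.run(C)  # run once up front so any one-time startup cost is excluded

    n = 100  # more than the 10 matmuls timed so far
    t0 = time.time()
    for _ in range(n):
        s.run(C)
    print((time.time() - t0) / n)  # per-matmul time: does the gap to Theano shrink?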
--
The result of s.run(C) is copied back from GPU to CPU and then (AFAIK) converted into a numpy array. This second step might explain the latency difference.