Hi, I did some quick experiments: for two 1024x1024 matrices, matrix multiplication (tf.matmul) and element-wise multiplication (tf.mul, or simply *) have similar time costs. Even when N = 10k, the performance is comparable. But in terms of algorithmic complexity, tf.matmul (N^3) is clearly more expensive than tf.mul (N^2). So is there any way to optimize tf.mul so it can be much faster, reaching the same level of optimization as tf.matmul? If not, is there any theoretical reason why matmul is fundamentally more efficient than element-wise mul on a GPU?
import tensorflow as tf
import numpy as np
from time import time

def foo(X, Y, op=tf.matmul):
    # Times graph construction, session creation, data transfer, and the op itself.
    start = time()
    with tf.Graph().as_default():
        with tf.Session() as sess:
            A = tf.placeholder(dtype=tf.float32)
            B = tf.placeholder(dtype=tf.float32)
            C = op(A, B)
            sess.run(C, feed_dict={A: X, B: Y})
    return time() - start

def time_muls(num_rows, num_cols):
    X = np.random.random((num_rows, num_cols))
    XT = X.T
    # op=lambda x, y: x just fetches the fed tensor, so it mostly measures transfer cost.
    print 'transfer', foo(X, X, op=lambda x, y: x), \
        'matmul', foo(X, XT, op=tf.matmul), \
        'mul', foo(X, X, op=tf.mul)
time_muls(1000, 1000)
transfer 0.00689506530762 matmul 0.0163590908051 mul 0.0144131183624
time_muls(10000, 10000)
transfer 0.242649078369 matmul 1.8508541584 mul 0.751158952713
Most likely your benchmark code is not measuring what you think it is measuring: it might be measuring the cost of the first run, or it may be benchmarking data transfers rather than computation. Showing us your benchmarking code is the only way to know for sure, but I would indeed expect element-wise multiply to be faster than matmul once you get to a large enough size.
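One common way to separate those one-time costs from the steady-state op time is to build the graph once, keep the inputs on the device, do a few warm-up runs, and then average over many timed runs. A minimal sketch along those lines (TF 1.x API to match the snippet above; the helper name and iteration counts are just placeholders):

import tensorflow as tf
from time import time

def time_op(op, n=1024, warmup=3, iters=50):
    with tf.Graph().as_default():
        # Keep the inputs on the device as variables, so no feed/transfer happens per run.
        a = tf.Variable(tf.random_uniform((n, n), dtype=tf.float32))
        b = tf.Variable(tf.random_uniform((n, n), dtype=tf.float32))
        c = op(a, b)
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            for _ in range(warmup):
                sess.run(c.op)          # warm-up runs absorb one-time startup costs
            start = time()
            for _ in range(iters):
                sess.run(c.op)          # run the op without fetching the result back to the host
            return (time() - start) / iters

print 'matmul', time_op(tf.matmul), 'mul', time_op(tf.mul)  # tf.mul is tf.multiply in newer releases

Timed this way, after warm-up and without per-run transfers, the N^2 vs N^3 gap should show up much more clearly at large sizes.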
('op1', 2.2579541206359863) ('op2', 2.782616138458252)
However, op2 involves 16 times more computation than op1.
Hi all,

Sorry for the late reply! I misread Sanjoy's question and thought he was asking Xuan for the benchmark he ran.

We do have per-op benchmarks in TF. The current element-wise binary op benchmark doesn't include Mul, so I modified //tensorflow/core/kernels/cwise_ops_test.cc slightly to include it in my TF branch (linked below). To test:
git clone https://github.com/penpornk/tensorflow
cd tensorflow; git checkout mul_vs_matmul
# To test with just AVX instructions.
export CC_OPT_FLAGS='-mavx'
yes "" | "$PYTHON_BIN_PATH" configure.py
bazel run --config=opt //tensorflow/core/kernels:cwise_ops_test -- --benchmarks=BM_cpu_Mul_scalar
bazel run --config=opt //tensorflow/core/kernels:matmul_op_test -- --benchmarks=BM_Matmul
# To test with up to AVX-512 instructions.
export CC_OPT_FLAGS='-mavx -mavx2 -mfma -mavx512f'
yes "" | "$PYTHON_BIN_PATH" configure.py
bazel run --config=opt //tensorflow/core/kernels:cwise_ops_test -- --benchmarks=BM_cpu_Mul_scalar
bazel run --config=opt //tensorflow/core/kernels:matmul_op_test -- --benchmarks=BM_Matmul

You will see results like this. I'm comparing Mul with 1024 * 1024 = 1048576 elements against Matmul with n = 1024 (square matrices):
BM_cpu_Mul_scalar/1048576                            139588    4388    30047.8MB/s    7512.0M items/s
BM_Matmul_1024_1024_1024_false_false_DT_FLOAT_cpu    1209870    453    1774970.9M items/s

The `items/s` is `flops/s` here. AVX vs AVX-512 results on my Skylake machine:

Mul:    7.512 Gflops/s vs 7.2921 Gflops/s
Matmul: 1,774.9709 Gflops/s vs 1,696.4339 Gflops/s

Matmul here attained a 236x and 232x higher flop rate than element-wise mul because it has more memory reuse and can take advantage of fused multiply-add. Take a look at their runtime breakdowns:

Matmul runtime = time to calculate 2 * n^3 flops + time to load 2 * n^2 elements + time to store n^2 elements
Mul runtime    = time to calculate n^2 flops + time to load 2 * n^2 elements + time to store n^2 elements

In reality, Matmul's overall load/store time will be larger than Mul's, because fast memories (cache/registers) often cannot fit the whole matrix, so the reuse factor is not n per element.
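To put rough numbers on that breakdown (an illustrative calculation only, assuming single-precision floats and ideal reuse, not a measurement from the benchmarks above), the flops-per-byte ratio for n = 1024 works out like this:

n = 1024
bytes_per_float = 4                     # single precision
traffic = 3 * n**2 * bytes_per_float    # load two n x n matrices, store one
matmul_flops = 2 * n**3                 # ~2.1 Gflop
mul_flops = n**2                        # ~1.0 Mflop
print 'matmul flops/byte:', matmul_flops / float(traffic)   # ~170.7
print 'mul    flops/byte:', mul_flops / float(traffic)      # ~0.083

At roughly 170 flops per byte of traffic, matmul has enough work per loaded element to keep the FMA units busy; at ~0.08 flops per byte, mul is limited by memory bandwidth, so wider vector units cannot help much.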
Now, back to AVX-512 for tf.mul:

> I found that tf.matmul takes advantage of AVX-512, while tf.mul does not.
Generic TensorFlow wheels are built with AVX optimizations, which support bulk load/store/multiply on 256-bit registers. Eigen stores data in packets that are vectorized based on compilation flags, so tf.mul is already (properly) vectorized on 256-bit registers. Using AVX-512 with tf.mul could have given at most 2x speedup, not 16x. In my results above, compiling with AVX-512 actually gives slower results than AVX. This might be because AVX-512 is new and has not been optimized much in Eigen.
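As a quick, machine-dependent sanity check (not from the thread's benchmarks), you can compare an element-wise multiply against a plain copy of the same data in NumPy; for a memory-bound op, the multiply should cost roughly as much as moving its operands (it touches three arrays instead of two, so expect the same order of magnitude rather than an exact match):

import numpy as np
from time import time

n = 4096
a = np.random.random((n, n)).astype(np.float32)
b = np.random.random((n, n)).astype(np.float32)
out = np.empty_like(a)

def avg_time(f, reps=20):
    f()                                  # warm up
    start = time()
    for _ in range(reps):
        f()
    return (time() - start) / reps

print 'copy    ', avg_time(lambda: np.copyto(out, a))
print 'multiply', avg_time(lambda: np.multiply(a, b, out=out))

If the multiply is nowhere near compute-bound here, wider vector units cannot buy much, which matches the ~3% AVX vs AVX-512 difference reported below.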
I also wrote a simple element-wise multiply code (in my mul_vs_matmul branch) with direct calls to AVX and AVX-512 intrinsics to make a clearer side-by-side comparison:

cd <tensorflow_root>/tensorflow/mul_vs_matmul
make mul_avx
make mul_avx512
./mul_avx
./mul_avx512

My AVX vs AVX-512 numbers are 9.8 vs 10.0988 Gflops/s (median of 5 runs). That's about a 3% speedup.

As for why tf.matmul can use AVX-512 when the wheel is only compiled with AVX: all single-precision matmul- and convolution-like ops in TensorFlow call the MKL-DNN matrix multiplication routine (sgemm) as a building block. MKL-DNN's sgemm detects the CPU architecture at run time and generates vectorized code that can take full advantage of the CPU's capabilities.

Best,
Penporn