Efficiency of element-wise matrix multiplication (tf.mul)


tc

Jan 11, 2017, 7:52:02 PM
to Discuss
Hi,

I did some quick experiments: for two 1024x1024 matrices, matrix multiplication (tf.matmul) and element-wise multiplication (tf.mul, or simply *) have similar time costs. Even when N = 10k, the performance is comparable. But in terms of algorithmic complexity, tf.matmul (O(N^3)) is clearly more expensive than tf.mul (O(N^2)). So is there any way to optimize tf.mul so it becomes much faster, e.g. reaching the same level of efficiency as tf.matmul? If not, is there any theoretical reason why matmul is fundamentally more efficient than element-wise mul on a GPU?

Vijay Vasudevan

Jan 11, 2017, 8:00:13 PM
to tc, Discuss
Most likely your benchmark code is not measuring what you think it is measuring: it might be timing the cost of the first run, or it might be benchmarking data transfers rather than computation.  Showing us your benchmarking code is the only way to know for sure, but I would indeed expect element-wise multiply to be faster than matmul once you get to a large enough size.


tc

Jan 11, 2017, 10:47:19 PM
to Discuss, v...@google.com
Here is the code:

import tensorflow as tf
import numpy as np
from time import time
def foo(X, Y, op=tf.matmul):
    start = time()
    with tf.Graph().as_default():
        with tf.Session() as sess:
            A = tf.placeholder(dtype=tf.float32)
            B = tf.placeholder(dtype=tf.float32)
            C = op(A, B)
            sess.run(C, feed_dict={A:X, B:Y})
    return (time() - start)
def time_muls(num_rows, num_cols):
    X = np.random.random((num_rows, num_cols))
    XT = X.T
    print 'transfer', foo(X, X, op=lambda x,y: x),\
          'matmul', foo(X, XT, op=tf.matmul), \
          'mul', foo(X, X, op=tf.mul)

And two runs on a Titan X GPU:

time_muls(1000, 1000)
transfer 0.00689506530762 matmul 0.0163590908051 mul 0.0144131183624
time_muls(10000, 10000)
transfer 0.242649078369 matmul 1.8508541584 mul 0.751158952713

You can see mul is faster than matmul, which is not surprising, but how small the speed-up is did surprise me: comparing an O(n^3) algorithm with an O(n^2) one, I would expect mul to be roughly n times faster.
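Just to spell out the expectation, here is the rough back-of-the-envelope estimate I have in mind (counting multiplications only, so the exact constant doesn't matter):

N = 1000
matmul_mults = N**3   # multiplications in an N x N matmul
mul_mults = N**2      # multiplications in an element-wise multiply
print('expected speed-up if compute were the only cost: ~%dx' % (matmul_mults // mul_mults))  # ~1000x
print('measured speed-up from the run above: ~%.2fx' % (0.0163590908051 / 0.0144131183624))   # ~1.1x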


Vijay Vasudevan

Jan 11, 2017, 11:26:26 PM
to tc, Discuss
Indeed, you are benchmarking warmup time and CPU->GPU transfers more than compute.  Take a look at the following code:

def bar(X, Y, op):
    with tf.Graph().as_default():
        with tf.Session() as sess:
            A = tf.Variable(X, dtype=tf.float32)
            B = tf.Variable(Y, dtype=tf.float32)
            C = op(A, B)

            # Initialize variables
            tf.global_variables_initializer().run()
            start = time()
            # Run the op but don't fetch the results.
            sess.run(C.op)

    return (time() - start)


def time_muls(num_rows, num_cols):
    X = np.random.random((num_rows, num_cols))
    XT = X.T

    print 'matmul', bar(X, XT, op=tf.matmul), \
          'mul', bar(X, X, op=tf.mul)

time_muls(1000, 1000)

def bar_warmup(X, Y, op):
    with tf.Graph().as_default():
        with tf.Session() as sess:
            A = tf.Variable(X, dtype=tf.float32)
            B = tf.Variable(Y, dtype=tf.float32)
            C = op(A, B)

            # Initialize variables
            tf.global_variables_initializer().run()
            # Do two warmup runs.
            sess.run(C.op)
            sess.run(C.op)

            start = time()
            for iter in range(10):
              sess.run(C.op)
            return (time() - start) / 10

def time_muls_warm(num_rows, num_cols):
    X = np.random.random((num_rows, num_cols))
    XT = X.T

    print 'matmul', bar_warmup(X, XT, op=tf.matmul), \
          'mul', bar_warmup(X, X, op=tf.mul)

time_muls_warm(1000, 1000)



The results from running this for me are:

matmul 0.138916015625 mul 0.00278496742249
matmul 0.000651097297668 mul 0.000140810012817

The results from your original program on my machine:
transfer 0.458722114563 matmul 0.162073135376 mul 0.159493923187



The differences:
1) bar() places the variables on the GPU, initializes them once ahead of time, and executes the graph without fetching the result back to the CPU and converting it into a numpy array.  foo()'s sess.run(C, ...) times feeding the data from numpy to the CPU, copying it from CPU to GPU, and fetching the result back, so you are timing transfer time in every case.

2) bar_warmup() is like bar() except that it runs the graph twice first, to warm up any caches and perform any one-time setup TensorFlow needs, then times 10 run calls and divides the total by 10 to get the average per-step time.  As you can see, the three results are very different and measure different things.

bar_warmup() is the one that most accurately measures the time to execute the matmul and element-wise multiply kernels.  The others are measuring one-time overhead and/or data transfer.

If I run time_muls_warm(10000, 10000), I see:

matmul 0.370167517662 mul 0.00492670536041

and you can start to see the real algorithmic differences show up.
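As a rough sanity check, assuming ballpark Titan X peaks of about 6 Tflop/s FP32 and about 330 GB/s of memory bandwidth (numbers I'm assuming from the spec sheet, not measuring here), both kernels are now close to their respective hardware limits:

# Rough sanity check on the 10000x10000 numbers above (float32 everywhere).
# Assumed (not measured) ballpark Titan X peaks: ~6 Tflop/s FP32, ~330 GB/s.
n = 10000
matmul_time, mul_time = 0.370167517662, 0.00492670536041
matmul_flops = 2.0 * n**3            # multiply-adds in an n x n matmul
mul_bytes = 3.0 * n**2 * 4           # read A, read B, write the result
print('matmul: ~%.1f Tflop/s' % (matmul_flops / matmul_time / 1e12))         # ~5.4 Tflop/s
print('mul:    ~%.0f GB/s of memory traffic' % (mul_bytes / mul_time / 1e9)) # ~240 GB/s

If those peaks are roughly right, the matmul is close to compute-bound and the element-wise multiply is close to bandwidth-bound, so there isn't much per-op overhead left to remove from mul at this size.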

I hope this helps.

tc

Jan 12, 2017, 1:37:05 AM
to Discuss, v...@google.com
Thanks! That's very helpful. I originally thought that testing with op=lambda x, y: x would give an estimate of the data-transfer cost, but it seems there are other costs that cannot be captured this way.

BTW, in real NN models a matmul between the same X and Y may not be run multiple times; wouldn't time_muls be more accurate in that case? Unless matmul(X, Y) or other operations can warm up matmul(X', Y').

Vijay Vasudevan

Jan 12, 2017, 1:43:13 AM
to tc, Discuss
The start-up cost is usually per graph, so if you use the same graph but feed in different data as you would in normal training, then it will be fast.

Some ops (I think just convolution right now) do auto-tuning based on the *shapes* of the data (different shapes have a different optimal algorithm in cudnn), so if your shapes are fixed, it will be fast after the first run with a specific shape.

In other words:
- fixed graph, different data, same shape = fast after the first run of the graph
- fixed graph, different shapes on each run() = can be fast, depending on the operations.
- fixed graph, different shapes, convolution = some autotuning is done on each new "shape" configuration (this can be turned off using environment variables, at the cost of optimal performance for convolution).
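To make the first case concrete, here is a minimal sketch (shapes are arbitrary; the point is that the graph is built once and only the fed data changes between run() calls, so only the first run pays the startup cost):

# Build the graph once, then feed different data of the same shape each step.
import numpy as np
import tensorflow as tf
from time import time

with tf.Graph().as_default():
    A = tf.placeholder(tf.float32, shape=[1000, 1000])
    B = tf.Variable(np.random.random((1000, 1000)), dtype=tf.float32)
    C = tf.matmul(A, B)
    with tf.Session() as sess:
        tf.global_variables_initializer().run()
        for step in range(5):
            # New data every step, but the same graph and the same shape.
            X = np.random.random((1000, 1000)).astype(np.float32)
            start = time()
            sess.run(C.op, feed_dict={A: X})
            print('step %d: %.4f s' % (step, time() - start))  # step 0 is slow, later steps are fast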


tc

Jan 12, 2017, 2:25:09 AM
to Discuss, v...@google.com
Thanks for the great summary!

I also plotted the figure below based on time_muls_warm, and it turns out that mul is indeed O(N) faster than matmul when N (the square matrix size) is large enough. But on my machine the observed factor is roughly N / 120, which means matmul has a much smaller (constant) per-operation overhead than mul. Are there ways to improve the mul kernel so this gap can be narrowed or closed?

Xuan Ju

Mar 20, 2020, 12:35:14 PM
to Discuss, v...@google.com
I tested similar code on CPU and found that tf.matmul takes advantage of AVX-512 while tf.mul does not, so tf.matmul is 16 times faster than tf.mul.

Sanjoy Das

Mar 20, 2020, 1:04:06 PM
to Xuan Ju, Penporn Koanantakool, Discuss, Vijay Vasudevan
Hi Xuan,


Do you have a benchmark that shows tf.mul is 16x faster than tf.matmul? 

-- Sanjoy

Xuan Ju

Mar 23, 2020, 10:09:24 AM
to Discuss
test code:

import time
import timeit
import tensorflow as tf
a = tf.get_variable(shape=[512, 16, 16], name="A2",dtype=tf.float32)
b = tf.get_variable(shape=[512, 16, 16], name="B2",dtype=tf.float32)

op1 = tf.multiply(a,b)
op2 = tf.matmul(a, b)

sess = tf.Session()

sess.run(tf.initialize_all_variables())
start = timeit.default_timer()
loop_num = 10000
for i in range(loop_num):
    sess.run(op1.op)
tim = timeit.default_timer()-start
print("op1", tim)

start = timeit.default_timer()
for i in range(loop_num):
    sess.run(op2.op)
tim = timeit.default_timer()-start
print("op2", tim)


result:
('op1', 2.2579541206359863)
('op2', 2.782616138458252)
however, op2 does 16 times more computation than op1
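(Where the factor of 16 comes from, counting multiplications per 16x16 matrix in the batch:)

n = 16
print('matmul multiplications per matrix:       %d' % n**3)  # 4096
print('element-wise multiplications per matrix: %d' % n**2)  # 256
print('ratio: %dx' % (n**3 // n**2))                         # 16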

Penporn Koanantakool

Mar 24, 2020, 5:10:15 PM
to Sanjoy Das, Xuan Ju, Discuss, Vijay Vasudevan
My email was rejected from dis...@tensorflow.org because I wasn't a member of the group. I've just joined, so I'm resending it below.

On Tue, Mar 24, 2020 at 1:31 PM Penporn Koanantakool <pen...@google.com> wrote:
Hi all,

Sorry for the late reply! I misread Sanjoy's question and thought he was asking Xuan for the benchmark he ran.

We do have per-op benchmarks in TF. The current element-wise binary op benchmark doesn't include Mul. I modified //tensorflow/core/kernels/cwise_ops_test.cc slightly to include it in my TF branch here. To test:

git clone https://github.com/penpornk/tensorflow
cd tensorflow; git checkout mul_vs_matmul

# To test with just AVX instructions.
export CC_OPT_FLAGS='-mavx'
yes "" | "$PYTHON_BIN_PATH" configure.py
bazel run --config=opt //tensorflow/core/kernels:cwise_ops_test -- --benchmarks=BM_cpu_Mul_scalar
bazel run --config=opt //tensorflow/core/kernels:matmul_op_test -- --benchmarks=BM_Matmul

# To test with up to AVX-512 instructions.
export CC_OPT_FLAGS='-mavx -mavx2 -mfma -mavx512f'
yes "" | "$PYTHON_BIN_PATH" configure.py
bazel run --config=opt //tensorflow/core/kernels:cwise_ops_test -- --benchmarks=BM_cpu_Mul_scalar
bazel run --config=opt //tensorflow/core/kernels:matmul_op_test -- --benchmarks=BM_Matmul

You will see results like this. I'm comparing Mul with 1024 * 1024 =  1048576 elements with Matmul with n = 1024 (square matrices):
BM_cpu_Mul_scalar/1048576     139588 4388 30047.8MB/s 7512.0M items/s
BM_Matmul_1024_1024_1024_false_false_DT_FLOAT_cpu        1209870 453 1774970.9M items/s

The `items/s` is `flops/s` here. AVX vs AVX-512 results on my Skylake machine:
Mul:          7.512 Gflops/s vs     7.2921 Gflops/s
Matmul:  1,774.9709 Gflops/s vs 1,696.4339 Gflops/s 

Matmul here attained a 236x (AVX) and 232x (AVX-512) higher flop rate than element-wise mul because it has more memory reuse and can take advantage of fused multiply-add. Take a look at their runtime breakdowns:
Matmul runtime = time to calculate 2 * n^3 flops + time to load 2 * n^2 elements + time to store n^2 elements
   Mul runtime = time to calculate     n^2 flops + time to load 2 * n^2 elements + time to store n^2 elements

In reality, Matmul's overall load/store time will be larger than Mul's, because fast memories (cache/registers) often cannot fit the whole matrix, so the reuse factor is less than n per element.
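Plugging n = 1024 into the breakdown above (float32, and assuming perfect reuse for the moment) gives a feel for the gap in arithmetic intensity:

n = 1024
bytes_moved = 3.0 * n * n * 4                  # load 2*n^2 + store n^2 float32 elements
matmul_intensity = 2.0 * n**3 / bytes_moved    # ~170 flops per byte moved
mul_intensity = 1.0 * n**2 / bytes_moved       # ~0.08 flops per byte moved
print('matmul: %.1f flops/byte' % matmul_intensity)
print('mul:    %.3f flops/byte' % mul_intensity)
print('ideal ratio: %.0fx' % (matmul_intensity / mul_intensity))  # 2n = 2048x with perfect reuse

Even though the realized reuse is lower than that, matmul has far more work per byte, which is why it can become compute-bound and benefit from FMA, while element-wise mul stays limited by memory bandwidth.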
 
Now, back to AVX-512 for tf.mul

I found that tf.matmul takes advantage of AVX-512, while tf.mul does not
Generic TensorFlow wheels are built with AVX optimizations, which support bulk load/store/multiply on 256-bit registers. Eigen stores data in packets that are vectorized based on compilation flags, so tf.mul is already (properly) vectorized on 256-bit registers. Using AVX-512 with tf.mul could have given at most a 2x speedup, not 16x. In my results above, compiling with AVX-512 actually gives slower results than AVX; this might be because AVX-512 is newer and has not been optimized much in Eigen yet.

I also wrote a simple element-wise multiply code (in my mul_vs_matmul branch) with direct calls to AVX and AVX-512 intrinsics to make a clearer side-by-side comparison.
cd <tensorflow_root>/tensorflow/mul_vs_matmul
make mul_avx
make mul_avx512
./mul_avx
./mul_avx512

My AVX vs AVX-512 numbers are 9.8 vs 10.0988 Gflops/s (median of 5 runs). That's about 3% speedup.

As for why tf.matmul can use AVX-512 when the wheel is only compiled with AVX: all single-precision matmul- and convolution-like ops in TensorFlow call the MKL-DNN matrix multiplication routine (sgemm) as a building block, and MKL-DNN's sgemm detects the CPU architecture at run time and generates vectorized code that takes full advantage of the CPU's capabilities.

Best,
Penporn