Implementing XNOR + bitcounting matrix multiplication


edward....@gmail.com

May 19, 2016, 7:31:39 PM
to Discuss, Yina Tang
Hello,

I am interested in implementing the XNOR + bit-counting method described at http://arxiv.org/abs/1603.05279. So far the following code works, but it's really slow (30x slower than a plain tf.matmul()):


      // Naive triple loop: for each output element, XOR the packed rows,
      // count the mismatching bits, and map the count back to a dot product.
      for (int ar = 0; ar < a_.rows(); ar++) {
          for (int br = 0; br < b_.rows(); br++) {
              unsigned int Cvalue = 0;
              for (int c = 0; c < a_.cols(); c++) {
                  // popcnt = population count (e.g. __builtin_popcount)
                  Cvalue += popcnt(a_(ar, c) ^ b_(br, c));
              }
              // a.dimension(1) = unpacked width, i.e. bits per row
              out(ar, br) = -(2 * (float)Cvalue - a.dimension(1));
          }
      }

a_ and b_ are both Eigen::Matrix<uint32_t>.
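
(For reference, the last line uses the identity for ±1 values packed into bits: dot(a, b) = N - 2 * popcount(a XOR b), where N is the total number of bits per row.)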

Is there any way to optimize these nested loops? The tf.matmul implementation below is really fast:

    matOut.device(ctx->eigen_device<CPUDevice>()) = matA.contract(matB, dim_pair);


Thank you.

Rasmus Larsen

May 19, 2016, 7:45:50 PM
to edward....@gmail.com, Discuss, Yina Tang
Since this is still essentially a matrix multiply, a naive triple-loop implementation like this will be limited by memory speed. You have to block the algorithm to reuse data in the caches and take advantage of the vector instructions (SSE/AVX/FMA, etc.) on modern CPUs. This is an art form, and a lot of tuning goes into matrix multiply libraries. This is a good reference: https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf
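
For illustration, blocking the loops from the original post might look something like this (not a tuned kernel; TILE is an arbitrary size and bit_width stands in for the unpacked bits per row used in the original last line):

    const int TILE = 64;  // illustrative tile size; tune for your cache sizes
    for (int ar0 = 0; ar0 < a_.rows(); ar0 += TILE) {
        for (int br0 = 0; br0 < b_.rows(); br0 += TILE) {
            const int arEnd = std::min<int>(ar0 + TILE, a_.rows());
            const int brEnd = std::min<int>(br0 + TILE, b_.rows());
            // Each TILE x TILE block of outputs reuses the same small set
            // of rows of a_ and b_ while they are still hot in cache.
            for (int ar = ar0; ar < arEnd; ar++) {
                for (int br = br0; br < brEnd; br++) {
                    unsigned int Cvalue = 0;
                    for (int c = 0; c < a_.cols(); c++)
                        Cvalue += popcnt(a_(ar, c) ^ b_(br, c));
                    out(ar, br) = -(2 * (float)Cvalue - bit_width);
                }
            }
        }
    }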

Have you looked at the assembly code to verify that your compiler is generating the vector instructions you expect? Sometimes you have to resort to using compiler intrinsics for this kind of thing. https://software.intel.com/sites/landingpage/IntrinsicsGuide/ is a place to start.
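
As a rough sketch of the intrinsics route (assuming x86 with POPCNT, compiled with -mpopcnt or -msse4.2; xor_popcount is a made-up helper name):

    #include <nmmintrin.h>  // _mm_popcnt_u64
    #include <cstdint>
    #include <cstring>

    // Popcount of (a XOR b) over n 32-bit words, two words per instruction.
    inline uint64_t xor_popcount(const uint32_t* a, const uint32_t* b, int n) {
        uint64_t total = 0;
        int i = 0;
        for (; i + 2 <= n; i += 2) {
            uint64_t wa, wb;
            std::memcpy(&wa, a + i, sizeof wa);  // unaligned-safe 64-bit load
            std::memcpy(&wb, b + i, sizeof wb);
            total += _mm_popcnt_u64(wa ^ wb);
        }
        if (i < n)  // leftover odd word
            total += _mm_popcnt_u64((uint64_t)(a[i] ^ b[i]));
        return total;
    }

With -mpopcnt, __builtin_popcount usually compiles to the same instruction anyway, so checking the generated assembly first is worthwhile before hand-writing this.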

Hope this helps,
   Rasmus


edward....@gmail.com

May 20, 2016, 9:22:23 PM
to Discuss, edward....@gmail.com, yn.t...@gmail.com
Thanks for the pointer, I will look at the paper.
 
For now I think the main bottleneck is that the loops are not multithreaded. When using Eigen's contract() for the matmul, these two lines of code show a 4x speed difference (on a Core i7):
(slow)
    matOut = matA.contract(matB, dim_pair);

vs this (fast)

    matOut.device(ctx->eigen_device<CPUDevice>()) = matA.contract(matB, dim_pair);

So maybe a multithreaded solution could be an easy first step.
Is there any documentation on the device(), slice(), contract() and broadcast() calls in TensorFlow/Eigen?
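
For the multithreading idea, a minimal standalone sketch using Eigen's thread pool (the thread count of 4 is arbitrary; inside an op, ctx->eigen_device<CPUDevice>() already hands you such a device):

    #define EIGEN_USE_THREADS
    #include <unsupported/Eigen/CXX11/Tensor>
    #include <unsupported/Eigen/CXX11/ThreadPool>

    // Build a pool once and reuse it across evaluations.
    Eigen::ThreadPool pool(4);
    Eigen::ThreadPoolDevice device(&pool, 4);

    // Any tensor expression assigned through .device(...) is evaluated in
    // parallel on the pool, including contractions.
    matOut.device(device) = matA.contract(matB, dim_pair);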

edward....@gmail.com

May 31, 2016, 11:21:19 AM
to Discuss, edward....@gmail.com, yn.t...@gmail.com
Now I have implemented a CUDA kernel for XNOR and bit counting,
but the cuBLAS handle creation takes a lot of time. What is the proper way to call cublasSgeam()? I couldn't find an example in the TensorFlow repository.

    auto d = ctx->eigen_device<Eigen::GpuDevice>();
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSetStream(handle, d.stream());
    cublasSgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N, n, k, &alpha, in1, k, &beta, in1, n, in1T, n);
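
One plausible fix, sketched under the assumption that handle creation is the dominant cost (GetCublasHandle is a made-up helper; real code would need per-GPU handles and error checks): create the handle once and reuse it across invocations, since only cublasSetStream() needs to run per call and it is cheap.

    #include <cublas_v2.h>

    // Create the (expensive) handle once per process and cache it.
    cublasHandle_t GetCublasHandle() {
        static cublasHandle_t handle = [] {
            cublasHandle_t h;
            cublasCreate(&h);  // error checking omitted for brevity
            return h;
        }();
        return handle;
    }

    // Per call: bind the op's stream, then launch as before.
    cublasHandle_t handle = GetCublasHandle();
    cublasSetStream(handle, d.stream());
    cublasSgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N, n, k,
                &alpha, in1, k, &beta, in1, n, in1T, n);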


Thanks!

bharath...@gmail.com

May 6, 2017, 12:33:40 PM
to Discuss, edward....@gmail.com

Hi Edward,

Were you able to run XNOR-Net on any neural network?
If so, could you please share some performance results or outcomes?

Thanks in advance.

edward....@gmail.com

May 6, 2017, 1:47:06 PM
to Discuss, edward....@gmail.com, bharath...@gmail.com
There are many quantization results online now. I personally didn't find this method helpful because the neural net loses too much information.