Hello,
I am interested in implementing the XNOR + bit-counting method described in
http://arxiv.org/abs/1603.05279. The following code works, but it is really slow (30x slower than a plain tf.matmul()):
for (int ar = 0; ar < a_.rows(); ar++)
{
    for (int br = 0; br < b_.rows(); br++)
    {
        unsigned int Cvalue = 0;
        for (int c = 0; c < a_.cols(); c++)
        {
            unsigned int value = popcnt(a_(ar, c) ^ b_(br, c));
            Cvalue += value;
        }
        out(ar, br) = -(2 * (float)Cvalue - a.dimension(1));
    }
}
a_ and b_ are both of type Eigen::Matrix<uint32_t>.
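For context, the last line of the inner loop body relies on the identity that, for two vectors of +1/-1 values packed one bit per element, the dot product equals n - 2 * popcount(a XOR b), where n is the number of elements. Below is a minimal, self-contained sketch of that identity, not the code from my op. The helpers pack_signs, binary_dot and plain_dot are hypothetical names used only for illustration, the bit convention (1 for +1, 0 for -1) is an assumption, and std::bitset stands in for the popcnt call above.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: pack a vector of +1/-1 values into 32-bit words.
// Assumed convention: bit = 1 encodes +1, bit = 0 encodes -1.
// Unused padding bits in the last word stay 0 in every packed vector,
// so they never show up as mismatches after the XOR.
std::vector<uint32_t> pack_signs(const std::vector<int>& v) {
    std::vector<uint32_t> words((v.size() + 31) / 32, 0u);
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (v[i] > 0) {
            words[i / 32] |= (uint32_t{1} << (i % 32));
        }
    }
    return words;
}

// Dot product of two packed +1/-1 vectors with n logical elements.
// Each differing bit contributes -1 and each matching bit +1,
// hence dot = (n - mismatches) - mismatches = n - 2 * mismatches.
int binary_dot(const std::vector<uint32_t>& a,
               const std::vector<uint32_t>& b,
               int n) {
    assert(a.size() == b.size());
    int mismatches = 0;
    for (std::size_t w = 0; w < a.size(); ++w) {
        mismatches += static_cast<int>(std::bitset<32>(a[w] ^ b[w]).count());
    }
    return n - 2 * mismatches;
}

// Reference implementation on the unpacked values, for comparison.
int plain_dot(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    int sum = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Calling binary_dot(pack_signs(x), pack_signs(y), n) should agree with plain_dot(x, y) for any two +1/-1 vectors of length n. This matches the expression -(2 * Cvalue - a.dimension(1)) above, assuming a.dimension(1) is the unpacked element count.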
Is there anything I can do to optimize these nested for loops? The tf.matmul implementation is really fast:
matOut.device(ctx->eigen_device<CPUDevice>()) = matA.contract(matB, dim_pair);
Thank you.