I have run some matrix multiplication experiments comparing gemmlowp and cblas_sgemm. cblas_sgemm uses 32-bit floating-point values, while gemmlowp uses uint8_t inputs and int32_t outputs.

First, the gemmlowp computation time seems to depend heavily on the MapOrder of the input matrices: the difference between the best and worst combinations is more than 10x for multiplying a 20x800 matrix by an 800x500 matrix.

Second, even with the fastest MapOrder combination, gemmlowp still takes about 2x longer than cblas_sgemm on a MacBook Pro. The times compared do not include the quantization time for the inputs.

I was wondering whether this kind of performance is expected, or whether there is anything I need to set correctly to make gemmlowp faster than cblas_sgemm?
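For concreteness, here is a minimal sketch of the kind of call I am timing. The offset values are illustrative placeholders rather than real quantization parameters, the include path depends on where gemmlowp is checked out, and RunGemmlowp is just my own wrapper:

#include <cstdint>
#include <tuple>

#include "public/gemmlowp.h"  // adjust path to your gemmlowp checkout

// C (20x500, int32) = A (20x800, uint8) * B (800x500, uint8).
void RunGemmlowp(const std::uint8_t* a_data, const std::uint8_t* b_data,
                 std::int32_t* c_data) {
  const int rows = 20, depth = 800, cols = 500;

  // MapOrder is a compile-time template parameter; my benchmark varies
  // this combination across RowMajor/ColMajor for LHS and RHS.
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::RowMajor>
      lhs(a_data, rows, depth);
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::ColMajor>
      rhs(b_data, depth, cols);
  gemmlowp::MatrixMap<std::int32_t, gemmlowp::MapOrder::ColMajor>
      result(c_data, rows, cols);

  const int lhs_offset = -128;  // placeholder zero-point offsets
  const int rhs_offset = -128;

  // Passing an empty output pipeline leaves the raw int32 accumulators
  // in the result, which is what I am comparing against cblas_sgemm.
  gemmlowp::GemmContext context;
  gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::int32_t,
                                   gemmlowp::DefaultL8R8BitDepthParams>(
      &context, lhs, rhs, &result, lhs_offset, rhs_offset, std::make_tuple());
}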
Thanks, Benoit! You are right. I experimented further on an iPhone 6, and the fastest version of gemmlowp is about 3 times faster than cblas_sgemm from the Accelerate framework for the same matrix multiplication as in my last post. However, I did not include the time spent on pre- and post-processing of the input and output. Are there any optimized APIs in gemmlowp that I can call for matrix quantization, rotation, and de-quantization?
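For the pre-processing I am currently using a hand-rolled scalar helper along these lines; it assumes a simple affine min/max quantization scheme with a non-degenerate range, and it is my own code, not a gemmlowp API:

#include <algorithm>
#include <cmath>
#include <cstdint>

// Affine quantization: real = scale * (quantized - zero_point).
struct QuantParams {
  float scale;     // real-value size of one quantized step
  int zero_point;  // uint8 value representing real 0.0f
};

QuantParams ChooseQuantParams(float min, float max) {
  min = std::min(min, 0.f);  // the range must contain 0
  max = std::max(max, 0.f);
  if (max == min) max = min + 1.f;  // avoid a zero scale for degenerate input
  QuantParams p;
  p.scale = (max - min) / 255.f;
  p.zero_point =
      std::min(255, std::max(0, static_cast<int>(std::round(-min / p.scale))));
  return p;
}

void Quantize(const QuantParams& p, const float* src, std::uint8_t* dst,
              int n) {
  for (int i = 0; i < n; ++i) {
    const int q = p.zero_point + static_cast<int>(std::round(src[i] / p.scale));
    dst[i] = static_cast<std::uint8_t>(std::min(255, std::max(0, q)));
  }
}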
I found a bug in my previous implementation, so here are some updated numbers. While the MapOrder still matters, the difference between the best and worst combinations is only about 1.5x on iOS. On OS X, cblas_sgemm from the Accelerate framework is about 200 times faster than gemmlowp, while on iOS it is only about 2 times faster.
What are the sizes of M, N, K for your benchmark? And I'm assuming you're including the cost of doing the quantization/dequantization?

Benoit: is there a reason why quantization/dequantization routines aren't built into gemmlowp? Are quant/dequant routines available in TensorFlow, perhaps through an experimental build? While it's relatively easy to write quant/dequant for a particular convnet, having this quant/dequant boilerplate taken care of by the library would be super convenient.
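For example, with an affine scheme like the one sketched earlier in the thread, the dequantization half reduces to a single multiply over the raw int32 accumulators, assuming the gemmlowp call already subtracted the zero points via lhs_offset/rhs_offset (again a hand-rolled sketch, not a gemmlowp or TensorFlow API):

#include <cstdint>

// If acc = sum_k (lhs_q - lhs_zp) * (rhs_q - rhs_zp), then each real
// result value is simply lhs_scale * rhs_scale * acc.
void Dequantize(float lhs_scale, float rhs_scale, const std::int32_t* acc,
                float* dst, int n) {
  const float scale = lhs_scale * rhs_scale;
  for (int i = 0; i < n; ++i) {
    dst[i] = scale * acc[i];
  }
}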