computation time of gemmlowp vs accelerate framework veclib cblas_sgemm

Viewed 840 times

Ning Xu

Aug 6, 2016, 13:42:33
to gemmlowp
I have done some experiments on matrix multiplication using gemmlowp and cblas_sgemm. cblas_sgemm uses 32-bit floating-point values, while gemmlowp uses uint8_t inputs and int32_t outputs.

First of all, the gemmlowp computation time seems to depend greatly on the input matrix MapOrder; the difference is more than 10x for a 20x800 matrix times an 800x500 matrix.

Secondly, even for the fastest combination of MapOrder, the computation time of gemmlowp is still about 2x that of cblas_sgemm, on a MacBook Pro. The compared computation times don't include the quantization time for the input.
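For concreteness, a minimal sketch of the float baseline being timed, assuming the Accelerate framework's CBLAS interface; the matrix sizes are the ones above, and the buffer names are illustrative:

```cpp
// Sketch of the float32 baseline: C (20x500) = A (20x800) * B (800x500),
// using cblas_sgemm from the Accelerate framework on macOS/iOS.
#include <Accelerate/Accelerate.h>
#include <vector>

void SgemmBaseline() {
  const int M = 20, K = 800, N = 500;
  std::vector<float> A(M * K), B(K * N), C(M * N);
  // Row-major storage; leading dimensions are the row lengths.
  cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
              M, N, K,
              1.0f, A.data(), K,
              B.data(), N,
              0.0f, C.data(), N);
}
```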

I was wondering whether this kind of performance is expected, or whether there is anything I need to set correctly to make gemmlowp faster than cblas_sgemm?

Benoit Jacob

Aug 6, 2016, 22:59:48
to Ning Xu, gemmlowp
On Sat, Aug 6, 2016 at 1:42 PM, Ning Xu <ning...@gmail.com> wrote:
I have done some experiments on matrix multiplication using gemmlowp and cblas_sgemm. cblas_sgemm uses 32-bit floating-point values, while gemmlowp uses uint8_t inputs and int32_t outputs.

First of all, the gemmlowp computation time seems to depend greatly on the input matrix MapOrder; the difference is more than 10x for a 20x800 matrix times an 800x500 matrix.

A substantial impact of the MapOrder is indeed expected (though I didn't expect a 10x difference) because at the moment only the case of a RowMajor left-hand side times a Column-Major right-hand side has been carefully optimized. The reason for that is a strong focus on the needs of mobile neural network applications, where this combination of storage orders has been the most common one.

This would be addressed by the addition of packing paths for the opposite storage orders in pack*.h.
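For reference, a minimal sketch of the fast combination described above (RowMajor left-hand side times ColMajor right-hand side), using the public GemmWithOutputPipeline entry point with an empty output pipeline so the raw int32 accumulators are returned; the matrix sizes are from the earlier post, and the offsets are placeholders:

```cpp
// Sketch: uint8 inputs, raw int32 accumulators as output, with the
// RowMajor-LHS x ColMajor-RHS combination that gemmlowp currently optimizes.
#include <cstdint>
#include <tuple>
#include <vector>
#include "public/gemmlowp.h"  // path relative to the gemmlowp checkout

void QuantizedGemmSketch() {
  const int rows = 20, depth = 800, cols = 500;
  std::vector<std::uint8_t> lhs_data(rows * depth), rhs_data(depth * cols);
  std::vector<std::int32_t> result_data(rows * cols);

  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::RowMajor>
      lhs(lhs_data.data(), rows, depth);
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::ColMajor>
      rhs(rhs_data.data(), depth, cols);
  gemmlowp::MatrixMap<std::int32_t, gemmlowp::MapOrder::ColMajor>
      result(result_data.data(), rows, cols);

  const int lhs_offset = -128, rhs_offset = -128;  // placeholder zero-point offsets
  gemmlowp::GemmContext context;
  // Empty output pipeline: the int32 accumulators are written out as-is.
  gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::int32_t,
                                   gemmlowp::DefaultL8R8BitDepthParams>(
      &context, lhs, rhs, &result, lhs_offset, rhs_offset, std::make_tuple());
}
```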


Secondly, even for the fastest combination of MapOrder, the computation time of gemmlowp is still about 2x that of cblas_sgemm, on a MacBook Pro. The compared computation times don't include the quantization time for the input.

At the moment the primary focus of our optimization effort (at Google) in gemmlowp has been ARM architectures.
We have received great contributions from Intel adding SSE4 kernels and packing paths, however:
 - there isn't yet a fast x86 unpacking path (there is no unpack_sse.h at the moment).
 - SSE4 likely isn't taking full advantage of your MacBook Pro hardware. For that, an AVX2 path would likely be needed. If your BLAS implementation does take advantage of AVX or AVX2, that would thus put it at a strong advantage on your hardware.

Cheers,
Benoit
 

I was wondering whether this kind of performance is expected, or whether there is anything I need to set correctly to make gemmlowp faster than cblas_sgemm?


Ning Xu

Aug 7, 2016, 11:16:52
to gemmlowp
Thanks, Benoit!

You are right. I further experimented on an iPhone 6, and the fastest version of gemmlowp is about 3x faster than cblas_sgemm in the Accelerate framework for the same matrix multiplication as in the last post.

However, I didn't include the time spent on pre- and post-processing of the input and output. Are there any optimized APIs in gemmlowp that I can call for matrix quantization, rotation, and de-quantization?

Thanks!
Ning


Benoit Jacob

Aug 7, 2016, 11:35:30
to Ning Xu, gemmlowp
On Sun, Aug 7, 2016 at 11:16 AM, Ning Xu <ning...@gmail.com> wrote:
Thanks, Benoit!

You are right. I further experimented on an iPhone 6, and the fastest version of gemmlowp is about 3x faster than cblas_sgemm in the Accelerate framework for the same matrix multiplication as in the last post.

However, I didn't include the time spent on pre- and post-processing of the input and output. Are there any optimized APIs in gemmlowp that I can call for matrix quantization, rotation, and de-quantization?

By quantization do you mean converting between floating-point and uint8 buffers? If yes: that is outside the scope of gemmlowp and gemmlowp doesn't offer such features.
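For readers wondering what such a conversion involves, here is a minimal hand-rolled sketch of an affine float-to-uint8 mapping; this is illustrative user-side code under the assumption of a known [min, max] float range, not a gemmlowp API:

```cpp
// Illustrative user-side code (not a gemmlowp API): affine mapping between
// float values in [min, max] and uint8 codes in [0, 255]. Assumes max > min.
#include <algorithm>
#include <cmath>
#include <cstdint>

std::uint8_t QuantizeValue(float x, float min, float max) {
  const float scale = (max - min) / 255.0f;
  const float zero_point = -min / scale;  // code value that represents min
  const float code = std::round(x / scale + zero_point);
  return static_cast<std::uint8_t>(std::min(255.0f, std::max(0.0f, code)));
}

float DequantizeValue(std::uint8_t q, float min, float max) {
  const float scale = (max - min) / 255.0f;
  return min + scale * static_cast<float>(q);
}
```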

What do you mean by rotation?

If you meant lower-level APIs to perform pack/kernel/unpack separately, then no, gemmlowp does not offer such lower-level APIs, though you could still call internal code directly, since it's all in (internal) headers included by the public headers. See how SingleThreadGemm.h does it.

Benoit
 

Ning Xu

Aug 8, 2016, 20:50:29
to gemmlowp, ning...@gmail.com
I found a bug in my previous implementation, so here are some updated numbers:

While the MapOrder matters, there is only about a 1.5x difference on iOS between the best and worst combinations.

On OS X, the cblas_sgemm from the Accelerate framework is about 200 times faster, while on iOS, cblas_sgemm is just about 2 times faster.


Ning Xu

Aug 8, 2016, 22:36:52
to gemmlowp, ning...@gmail.com


On Monday, August 8, 2016 at 5:50:29 PM UTC-7, Ning Xu wrote:
I found a bug in my previous implementation, so here are some updated numbers:

While the MapOrder matters, there is only about a 1.5x difference on iOS between the best and worst combinations.

On OS X, the cblas_sgemm from the Accelerate framework is about 200 times faster, while on iOS, cblas_sgemm is just about 2 times faster.

Sorry, it's about 20 times instead of 200 times.

Benoit Jacob

Aug 8, 2016, 23:19:05
to Ning Xu, gemmlowp
It would be interesting to look at profiles!

Apple makes great profilers; alternatively, you can enable the built-in profiler in gemmlowp's benchmark.cc by defining GEMMLOWP_TEST_PROFILE.
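For a program other than benchmark.cc, the instrumentation can also be driven directly; a minimal sketch, assuming the interface in gemmlowp's profiling/profiler.h and that the build defines GEMMLOWP_PROFILING so the instrumentation is active:

```cpp
// Sketch: wrapping one's own benchmark loop with gemmlowp's built-in profiler.
// Compile with -DGEMMLOWP_PROFILING so the profiling hooks are compiled in.
#include "profiling/profiler.h"

void ProfileMyBenchmark() {
  gemmlowp::RegisterCurrentThreadForProfiling();
  gemmlowp::StartProfiling();
  for (int i = 0; i < 100; ++i) {
    // ... run the gemmlowp multiplication being benchmarked ...
  }
  gemmlowp::FinishProfiling();  // prints a treeview of where time was spent
}
```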



car...@allenai.org

Aug 9, 2016, 0:35:13
to gemmlowp, ning...@gmail.com
What are the sizes of M, N, K for your benchmark? And I'm assuming you're including the cost of doing the quantization/dequantization?

Benoit: is there a reason why quantization/dequantization routines aren't built into gemmlowp? Is there quant/dequant built into TensorFlow, perhaps through an experimental build? While it's relatively easy to write quant/dequant for a particular convnet, taking care of this quant/dequant boilerplate would be super convenient.



Ning Xu

Aug 9, 2016, 9:22:03
to car...@allenai.org, gemmlowp
The m, n, k are 20, 500, 800 in the benchmark, and I didn't include the cost of doing quantization/de-quantization.

Benoit Jacob

Aug 9, 2016, 10:48:38
to car...@allenai.org, gemmlowp, Ning Xu
On Tue, Aug 9, 2016 at 12:35 AM, <car...@allenai.org> wrote:
What are the sizes of M, N, K for your benchmark? And I'm assuming you're including the cost of doing the quantization/dequantization?

Benoit: is there a reason why quantization/dequantization routines aren't built into gemmlowp? Is there quant/dequant built into TensorFlow, perhaps through an experimental build? While it's relatively easy to write quant/dequant for a particular convnet, taking care of this quant/dequant boilerplate would be super convenient.

It is just a matter of scoping. We thought that quantization/dequantization was the responsibility of the user (such as an NN framework like TensorFlow), and limited the scope of gemmlowp to doing just integer arithmetic.

See this recent thread where a similar question was asked and Pete gave some links to relevant TensorFlow code:

Perhaps one reason to keep this out of gemmlowp is that there are multiple inequivalent ways to do it; accordingly, gemmlowp supports different output stages that correspond to different inequivalent quantization schemes:
 - this is the "standard one" that we've been using since the beginning of gemmlowp:
 - this is a newer one that I just added as I was looking into quantized inference more closely
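To make the "standard" scheme above concrete, here is a sketch of assembling an output pipeline from a quantize-down stage followed by a saturating cast; the struct and field names follow public/output_stages.h as I read it, and the parameter values shown are placeholders that a real scheme would derive from the matrices' float ranges:

```cpp
// Sketch: the "standard" output pipeline, which scales the raw int32
// accumulators back down to uint8. Parameter values here are placeholders.
#include <tuple>
#include "public/gemmlowp.h"  // declares the output stage structs

auto MakeStandardPipelineSketch() {
  gemmlowp::OutputStageQuantizeDownInt32ToUint8Scale quantize_down_stage;
  quantize_down_stage.result_offset = 0;    // offset added to the accumulator
  quantize_down_stage.result_mult_int = 1;  // integer multiplier
  quantize_down_stage.result_shift = 0;     // rounding right-shift
  gemmlowp::OutputStageSaturatingCastToUint8 saturating_cast_stage;
  // The newer fixed-point-multiplier stage mentioned above plays the same
  // role with different arithmetic; its parameters are analogous.
  return std::make_tuple(quantize_down_stage, saturating_cast_stage);
}
```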

I hope to share more details soon as this settles down, but the gist of it is that keeping quantization out of gemmlowp has limited the amount of gemmlowp code that has needed to change as we learned more about how to do quantized NN inference.

Benoit
 

quantized_guy

Oct 3, 2017, 6:55:23
to gemmlowp
Hi,

Were you able to run gemmlowp faster than cblas_sgemm on ARM Android?