Anybody lucky enough to run gemmlowp faster than cblas gemm on ARM-Android?


quantized_guy

Oct 3, 2017, 6:53:48 AM
to gemmlowp
Hi @Group,
   Did anyone get lucky enough to run gemmlowp faster than cblas gemm on ARM-Android?

-
Regards

Benoit Jacob

Oct 3, 2017, 12:35:25 PM
to quantized_guy, gemmlowp
Hello,

Please refer to the new benchmark results that I just shared.

These are only gemmlowp results. I do not have any Apple vecLib results.

Benoit


quantized_guy

Oct 4, 2017, 9:57:47 AM
to gemmlowp

Hello Benoit,
 Thanks for the attachments. I have a Snapdragon 835 chipset board here.

 Regarding https://docs.google.com/spreadsheets/d/1E3FRXsQlEPzSv9NK81iTdj2w2FS45KV1bwVrj2ppyqM/edit?usp=sharing :
 1000 x 1000 x 1000 => ~89 GFLOPS (multi-threaded) / ~31 GFLOPS (single-threaded),

 as we verified on a Snapdragon 835 with https://github.com/google/gemmlowp/blob/master/test/benchmark.cc (using L8R8WithLhsNonzeroBitDepthParams):
 
==================================
 Benchmarking small model GEMMs...
running for 10 seconds...
Graph latency (over 23 iterations):
  Best:             0.0245683s
  Worst:            0.0246404s
  Mean:             0.0245996s
  25% trimmed mean: 0.0245996s
  Mean of 10% best: 0.024569s
Benchmarking typical GoogLeNet GEMMs...
running for 20 seconds...
Graph latency (over 70 iterations):
  Best:             0.158489s
  Worst:            0.16043s
  Mean:             0.159729s
  25% trimmed mean: 0.159806s
  Mean of 10% best: 0.158836s
Benchmarking multi-threaded mode...
10x10x10 : 1.021 GFlops/s
20x20x20 : 2.556 GFlops/s
30x30x30 : 4.091 GFlops/s
40x40x40 : 5.492 GFlops/s
50x50x50 : 6.201 GFlops/s
60x60x60 : 19.68 GFlops/s
64x256x147 : 46.68 GFlops/s
100x100x1 : 1.68 GFlops/s
100x100x100 : 25.09 GFlops/s
100x1000x100 : 42.9 GFlops/s
1000x1000x1 : 7.203 GFlops/s
1000x1000x10 : 41.87 GFlops/s
1000x1000x100 : 69.82 GFlops/s
1000x1000x1000 : 73.77 GFlops/s

Benchmarking single-threaded mode...
10x10x10 : 1.024 GFlops/s
20x20x20 : 2.559 GFlops/s
30x30x30 : 4.077 GFlops/s
40x40x40 : 5.482 GFlops/s
50x50x50 : 6.204 GFlops/s
60x60x60 : 7.906 GFlops/s
64x256x147 : 18.74 GFlops/s
100x100x1 : 1.682 GFlops/s
100x100x100 : 10.83 GFlops/s
100x1000x100 : 19.3 GFlops/s
1000x1000x1 : 1.994 GFlops/s
1000x1000x10 : 11.84 GFlops/s
1000x1000x100 : 21.14 GFlops/s
1000x1000x1000 : 21.11 GFlops/s
===================================


I have a couple of queries:
1) Is ~150 ms the best we can achieve with gemmlowp on the Snapdragon 835, or is there room to improve it?

2) What configuration did you use to benchmark the A53 and A73 clusters of the Snapdragon 835 independently? (FYI: running https://github.com/google/gemmlowp/blob/master/test/benchmark_all_sizes.cc, we were hitting ~83 Gop/s.)

---
Regards

Benoit Jacob

Oct 4, 2017, 10:15:32 AM
to quantized_guy, gemmlowp
On Wed, Oct 4, 2017 at 9:57 AM, quantized_guy <zaheers...@gmail.com> wrote:

Hello Benoit,
 Thanks for the attachments. I have a Snapdragon 835 chipset board here.

 Regarding https://docs.google.com/spreadsheets/d/1E3FRXsQlEPzSv9NK81iTdj2w2FS45KV1bwVrj2ppyqM/edit?usp=sharing :

This spreadsheet is the GEMM kernels micro-benchmark, not a whole-GEMM benchmark. In other words: it measures only the performance of the arithmetic inner loop, with only L1 memory accesses, and none of the pack/unpack work that the whole GEMM in gemmlowp does. That is why it returns higher values than the whole-GEMM benchmarks. Its main use, in relation to a whole GEMM benchmark, is to add some perspective: it shows how fast the CPU can perform the arithmetic kernel, so it's an upper limit on how fast an actual GEMM might go, but that upper limit is never quite attained. 

 1000 x 1000 x 1000 => ~89 GFLOPS (multi-threaded) / ~31 GFLOPS (single-threaded).

Where are these numbers coming from? The above spreadsheet (the GEMM kernel micro-benchmark) does not refer to specific matrix sizes (like "1000 x 1000 x 1000"), and 31 Gop/s single-threaded is plainly out of reach of the Qualcomm Snapdragon 835 SoC; at least, I have never observed such high performance. The GEMM kernel micro-benchmark spreadsheet reports results for this SoC in columns E (big cores) and F (little cores). In cell E17 you can see 26.71 Gop/s, the highest arithmetic throughput that I know how to get on this SoC with a single thread, and that's on the kernel alone.
Right, the benchmark.cc numbers you posted seem fairly in line with what I'm reporting in the whole-GEMM benchmark spreadsheet for this SoC.
You report somewhat lower performance for multi-threaded (73 vs. 89 Gop/s); that might be related to the fact that I use a cooling plate (as explained in the spreadsheet), or to the fact that I use 'taskset' to pin cores (see below).

 

I have a couple of queries:
1) Is ~150 ms the best we can achieve with gemmlowp on the Snapdragon 835, or is there room to improve it?

I haven't looked at GoogLeNet latencies for a long time, so I can't answer that. I don't believe that Inception/GoogLeNet networks are very useful benchmarks anymore, given how fast mobile NNs have evolved over the past 3 years (MobileNet, ShuffleNet). benchmark.cc should be updated at some point...
 

2) What configuration did you use to benchmark the A53 and A73 clusters of the Snapdragon 835 independently? (FYI: running https://github.com/google/gemmlowp/blob/master/test/benchmark_all_sizes.cc, we were hitting ~83 Gop/s.)

Use 'taskset'. For a Snapdragon 835 SoC with 4 big and 4 little cores:

adb shell taskset f0 /path/to/binary   # run on big cores
adb shell taskset 0f /path/to/binary   # run on little cores

Generally, taskset takes a hexadecimal CPU mask, where the low bits correspond to the little cores and the high bits correspond to the big cores.
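
If it is more convenient to pin from inside the process than via taskset, here is a minimal sketch using the standard Linux sched_setaffinity call (this is not part of gemmlowp; the assumption that CPUs 4-7 are the big cores matches the 0xf0 mask above, but core numbering should be verified per device):

#include <sched.h>   // may need -D_GNU_SOURCE on glibc; available in Android's bionic
#include <cstdio>

int main() {
  cpu_set_t set;
  CPU_ZERO(&set);
  // Pin to CPUs 4-7, assumed to be the big cluster (equivalent to 'taskset f0').
  for (int cpu = 4; cpu <= 7; ++cpu) {
    CPU_SET(cpu, &set);
  }
  if (sched_setaffinity(0, sizeof(set), &set) != 0) {
    std::perror("sched_setaffinity");
    return 1;
  }
  // ... run the benchmark from here ...
  return 0;
}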
 




quantized_guy

Oct 4, 2017, 11:43:21 AM
to gemmlowp
Hello Benoit,

>> 1000 x 1000 x 1000 => ~89 GFLOPS (multi-threaded) / ~31 GFLOPS (single-threaded)
>> Where are these numbers coming from? The above spreadsheet (the GEMM kernel micro-benchmark) does not refer to specific matrix sizes (like "1000 x 1000 x 1000"), and 31 Gop/s single-threaded is plainly out of reach of the Qualcomm Snapdragon 835 SoC; at least, I have never observed such high performance.

I assumed "Matrix: 1024" in that spreadsheet corresponded to 1000 x 1000 x 1000, and the ~31 GFLOPS single-threaded figure was my mistake; it should have been ~21 GFLOPS.

>> 1) Is ~150 ms the best we can achieve with gemmlowp on the Snapdragon 835, or is there room to improve it?

>> I haven't looked at GoogLeNet latencies for a long time, so I can't answer that. I don't believe that Inception/GoogLeNet networks are very useful benchmarks anymore, given how fast mobile NNs have evolved over the past 3 years (MobileNet, ShuffleNet). benchmark.cc should be updated at some point...

Would replacing the GEMM sizes in test/benchmark.cc (around line 285) with MobileNet dimensions be useful for benchmarking, or are any extra changes needed?



--
Thanks

Benoit Jacob

Oct 4, 2017, 11:50:21 AM
to quantized_guy, gemmlowp
On Wed, Oct 4, 2017 at 11:43 AM, quantized_guy <zaheers...@gmail.com> wrote:

Replacing "test/benchmark.cc" +285 with mobileNet dimensions will be useful? for benchmarking or any extra changes needed.

Ultimately it would be best to run actual end-to-end, efficient NN inference code for a whole NN, rather than extracting just a collection of GEMMs. I'm hoping to be able to share something along these lines at some point.

Failing that, replacing the sizes with MobileNet sizes would help, but MobileNet is a whole family of models, not a single model, straddling two orders of magnitude (from perhaps 10 million to 500 million multiply-adds). So one would have to be careful in picking a set of "representative" MobileNets.
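
For illustration only, taking MobileNetV1-1.0-224 as one possible representative point, the pointwise (1x1) convolutions (plus the final fully-connected layer) lower to roughly the following rows x depth x cols GEMM shapes (output channels x input channels x spatial positions). These are recalled from the MobileNet paper, so double-check them before editing benchmark.cc; the struct below is a placeholder, not the one benchmark.cc actually uses:

struct GemmShape { int rows, depth, cols; };

// Approximate 1x1-conv GEMM shapes for MobileNetV1-1.0-224 (illustrative only).
const GemmShape mobilenet_v1_pointwise_gemms[] = {
    {64, 32, 112 * 112},
    {128, 64, 56 * 56},
    {128, 128, 56 * 56},
    {256, 128, 28 * 28},
    {256, 256, 28 * 28},
    {512, 256, 14 * 14},
    {512, 512, 14 * 14},  // this shape occurs 5 times in the model
    {1024, 512, 7 * 7},
    {1024, 1024, 7 * 7},
    {1000, 1024, 1},      // final fully-connected layer
};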
 

Dmitry Belenko

Oct 4, 2017, 3:36:30 PM
to gemmlowp
>> Did anyone get lucky enough

We (xnor.ai) tried pretty hard, even had gemmlowp as a part of our benchmark suite for a few months. We were not able to make it outperform full precision BLAS.

Intuitively it seems it should be faster because memory bandwidth is so scarce on lower end ARM and caches are so small, but in practice, testing with the actual matrix sizes that our networks use, OpenBLAS or even Eigen seems to run circles around it. Naturally when this happens there's always the inclination to think that we're "holding it wrong", so it's refreshing to see this question posed in this forum.

And we know there's still headroom in OpenBLAS and Eigen with the more recent ARM CPUs. And that's also before you consider that stride-1 convs can also be sped up with e.g. Winograd by a factor of at least 50%, or even more on low-end CPUs, and that we have a binary convolution implementation that runs circles around even Winograd and supports arbitrary strides and dilations, which Winograd does not. It's not a direct competitor to full-precision (or even quantized) GEMM, but we've found that it works really well if we train it right, especially in deeper models.

I would like to know how to make gemmlowp go fast on realistic matrix sizes, too. We would very much like to get rid of full precision where it's not needed, provided that there's not just a memory usage improvement but also a speedup. Realistic (for us) means the filter/input tensor sizes/strides/dilations (that is, the lowered GEMM matrix sizes that correspond to those) found in e.g. ResNet34/50, YOLO Tiny, and some other modern classifiers and object detectors. This list is not going to fit everyone's needs, but as Benoit mentioned, the recent trend is toward fully convolutional models and fairly small filter sizes, so it should cover a lot of recent ground.

For us in particular, the most critical parts are the first couple of layers (very large input face, shallow input tensor, not a lot of filters). I imagine the same holds for just about everybody else who has to deal with visual data.

I think that to be worthwhile (for us), gemmlowp has to be low double-digit percentage points faster than OpenBLAS in this very asymmetric case. That is currently not the case.

Benoit Jacob

Oct 4, 2017, 3:48:06 PM
to Dmitry Belenko, gemmlowp
On Wed, Oct 4, 2017 at 3:36 PM, Dmitry Belenko <dmi...@xnor.ai> wrote:
>> Did anyone get lucky enough

We (xnor.ai) tried pretty hard, even had gemmlowp as a part of our benchmark suite for a few months. We were not able to make it outperform full precision BLAS.

Intuitively it seems it should be faster because memory bandwidth is so scarce on lower end ARM and caches are so small, but in practice, testing with the actual matrix sizes that our networks use, OpenBLAS or even Eigen seems to run circles around it. Naturally when this happens there's always the inclination to think that we're "holding it wrong", so it's refreshing to see this question posed in this forum.

I'm quite familiar with Eigen, especially on ARM, as I did the work of making its float GEMM run fast on ARM. Also, Eigen is what we use for float inference over here. gemmlowp (8-bit) is currently consistently faster than Eigen (float32) on about any matrix size/shape, but do make sure that you use L8R8WithLhsNonzeroBitDepthParams, as I explained in https://groups.google.com/forum/#!msg/gemmlowp/pJTwxijEGHU/b_szEXIwAgAJ
In any case, Eigen does not "run circles around" gemmlowp on ARM/ARM64 architectures; rather, it's the other way around.

Also, as I tried to explain in https://groups.google.com/forum/#!msg/gemmlowp/pJTwxijEGHU/b_szEXIwAgAJ and the links there, there actually isn't much room to further improve 8-bit performance beyond gemmlowp with L8R8WithLhsNonzeroBitDepthParams on ARM, and that already performs faster than the theoretical maximum that any float32 GEMM might achieve. Refer to the GEMM kernels micro-benchmark spreadsheet there.
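
For concreteness, here is a minimal sketch of invoking the 8-bit GEMM with that bit-depth setting, following the public interface described in gemmlowp's doc/public.md; the offsets and quantize-down parameters below are placeholders rather than tuned values, and the include path depends on how gemmlowp is vendored:

#include <cstdint>
#include <tuple>

#include "public/gemmlowp.h"  // adjust the path to your checkout

// Multiplies a (rows x depth) uint8 LHS by a (depth x cols) uint8 RHS and
// requantizes the int32 accumulators back down to uint8.
void ExampleGemm(const std::uint8_t* lhs_data, const std::uint8_t* rhs_data,
                 std::uint8_t* result_data, int rows, int depth, int cols) {
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::RowMajor> lhs(
      lhs_data, rows, depth);
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::ColMajor> rhs(
      rhs_data, depth, cols);
  gemmlowp::MatrixMap<std::uint8_t, gemmlowp::MapOrder::ColMajor> result(
      result_data, rows, cols);

  // Placeholder quantization parameters; real values come from how the
  // matrices were quantized.
  const int lhs_offset = -128;
  const int rhs_offset = -128;
  gemmlowp::OutputStageQuantizeDownInt32ToUint8Scale quantize_down_stage;
  quantize_down_stage.result_offset = 128;
  quantize_down_stage.result_mult_int = 1;
  quantize_down_stage.result_shift = 8;
  gemmlowp::OutputStageSaturatingCastToUint8 saturating_cast_stage;
  const auto output_pipeline =
      std::make_tuple(quantize_down_stage, saturating_cast_stage);

  gemmlowp::GemmContext gemm_context;
  // The fast path discussed above is selected by the BitDepthParams template argument.
  gemmlowp::GemmWithOutputPipeline<std::uint8_t, std::uint8_t,
                                   gemmlowp::L8R8WithLhsNonzeroBitDepthParams>(
      &gemm_context, lhs, rhs, &result, lhs_offset, rhs_offset,
      output_pipeline);
}

(To force single-threaded execution, as in the single-threaded numbers quoted earlier in this thread, I believe gemm_context.set_max_num_threads(1) can be called before the GEMM.)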


And we know there's still headroom in OpenBLAS and Eigen with the more recent ARM CPUs.

Right, there is a little bit of room left in Eigen/float32/ARM, mostly due to Eigen using a non-optimal float32 GEMM kernel because Eigen's SIMD abstraction doesn't map well to the NEON instruction set, and secondarily because Eigen's GEMM kernels are written in C++/intrinsics, which penalizes them somewhat on many (though perhaps not all) compilers compared to the assembly kernels used in gemmlowp.
 
And that's also before you consider that stride-1 convs can also be sped up with e.g. Winograd by a factor of at least 50%, [...]

The current trend in mobile CNNs (e.g. MobileNet) is to have Conv nodes that aren't convolutional at all (kernel size: 1x1), the only convolutional aspect remaining in the so-called "depthwise separable conv" nodes, which don't account for a large part of the profile. Accordingly, nontrivial convolution algorithms seem to be declining in relevance on current mobile CNNs.
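
To spell out why those 1x1 Conv nodes reduce to plain GEMMs, here is a tiny illustrative sketch (deliberately naive, and not gemmlowp's internals): with a 1x1 kernel there is no im2col lowering at all, and the output at each spatial position is just a matrix product of the weights with the input channels at that position.

// Naive reference: a 1x1 convolution is exactly a (c_out x c_in) * (c_in x H*W) GEMM.
void Conv1x1AsGemm(const float* weights,  // [c_out][c_in], row-major
                   const float* input,    // [c_in][h*w], row-major
                   float* output,         // [c_out][h*w], row-major
                   int c_out, int c_in, int spatial) {
  for (int o = 0; o < c_out; ++o) {
    for (int p = 0; p < spatial; ++p) {
      float acc = 0.0f;
      for (int i = 0; i < c_in; ++i) {
        acc += weights[o * c_in + i] * input[i * spatial + p];
      }
      output[o * spatial + p] = acc;
    }
  }
}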
 

Dmitry Belenko

Oct 4, 2017, 4:35:56 PM
to gemmlowp
Yes, we are aware of MobileNet. We just don't think it's the be-all and end-all, given all the other options available for getting rid of the flops entirely. With the shrinkage in model sizes and at least 1 GB of memory in each device, mobile and embedded platforms are no longer all that memory-constrained. They are energy-constrained (and therefore flops-constrained), which is why binarized and quantized approximations are receiving so much attention.

Thanks for the link and for the explanation; maybe we were "holding it wrong" after all. It's worth another try to be sure: real-valued stuff is a persistent thorn in our side, and we already have a benchmark in our Git history.
