Hello Benoit,
Thanks for the attachments. I have an 835 chipset board here.
With reference to https://docs.google.com/spreadsheets/d/1E3FRXsQlEPzSv9NK81iTdj2w2FS45KV1bwVrj2ppyqM/edit?usp=sharing :
1000 x 1000 x 1000 => ~89 GFLOPS (multithread) / ~31 GFLOPS (single thread),
as we verified on Snapdragon 835 with https://github.com/google/gemmlowp/blob/master/test/benchmark.cc (with L8R8WithLhsNonzeroBitDepthParams; a minimal invocation sketch follows the log below).
==================================
Benchmarking small model GEMMs...
running for 10 seconds...
Graph latency (over 23 iterations):
Best: 0.0245683s
Worst: 0.0246404s
Mean: 0.0245996s
25% trimmed mean: 0.0245996s
Mean of 10% best: 0.024569s
Benchmarking typical GoogLeNet GEMMs...
running for 20 seconds...
Graph latency (over 70 iterations):
Best: 0.158489s
Worst: 0.16043s
Mean: 0.159729s
25% trimmed mean: 0.159806s
Mean of 10% best: 0.158836s
Benchmarking multi-threaded mode...
10x10x10 : 1.021 GFlops/s
20x20x20 : 2.556 GFlops/s
30x30x30 : 4.091 GFlops/s
40x40x40 : 5.492 GFlops/s
50x50x50 : 6.201 GFlops/s
60x60x60 : 19.68 GFlops/s
64x256x147 : 46.68 GFlops/s
100x100x1 : 1.68 GFlops/s
100x100x100 : 25.09 GFlops/s
100x1000x100 : 42.9 GFlops/s
1000x1000x1 : 7.203 GFlops/s
1000x1000x10 : 41.87 GFlops/s
1000x1000x100 : 69.82 GFlops/s
1000x1000x1000 : 73.77 GFlops/s
Benchmarking single-threaded mode...
10x10x10 : 1.024 GFlops/s
20x20x20 : 2.559 GFlops/s
30x30x30 : 4.077 GFlops/s
40x40x40 : 5.482 GFlops/s
50x50x50 : 6.204 GFlops/s
60x60x60 : 7.906 GFlops/s
64x256x147 : 18.74 GFlops/s
100x100x1 : 1.682 GFlops/s
100x100x100 : 10.83 GFlops/s
100x1000x100 : 19.3 GFlops/s
1000x1000x1 : 1.994 GFlops/s
1000x1000x10 : 11.84 GFlops/s
1000x1000x100 : 21.14 GFlops/s
1000x1000x1000 : 21.11 GFlops/s
===================================
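For reference, a minimal sketch of the kind of gemmlowp call that benchmark.cc times with those bit-depth params (this is my own illustration, not the benchmark source; the sizes, offsets, and quantization parameters are made up):

#include <cstdint>
#include <tuple>
#include <vector>
#include "public/gemmlowp.h"

void ExampleQuantizedGemm() {
  const int rows = 1000, depth = 1000, cols = 1000;
  std::vector<std::uint8_t> lhs_data(rows * depth, 1);
  std::vector<std::uint8_t> rhs_data(depth * cols, 1);
  std::vector<std::uint8_t> result_data(rows * cols, 0);

  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::RowMajor>
      lhs(lhs_data.data(), rows, depth);
  gemmlowp::MatrixMap<const std::uint8_t, gemmlowp::MapOrder::ColMajor>
      rhs(rhs_data.data(), depth, cols);
  gemmlowp::MatrixMap<std::uint8_t, gemmlowp::MapOrder::ColMajor>
      result(result_data.data(), rows, cols);

  // Illustrative requantization parameters; real code derives these from
  // the layer's scales and zero points.
  gemmlowp::OutputStageQuantizeDownInt32ToUint8Scale quantize_down;
  quantize_down.result_offset = 501;
  quantize_down.result_mult_int = 12345;
  quantize_down.result_shift = 20;
  gemmlowp::OutputStageSaturatingCastToUint8 saturating_cast;
  const auto output_pipeline = std::make_tuple(quantize_down, saturating_cast);

  gemmlowp::GemmContext context;
  context.set_max_num_threads(4);  // use up to 4 worker threads

  gemmlowp::GemmWithOutputPipeline<
      std::uint8_t, std::uint8_t,
      gemmlowp::L8R8WithLhsNonzeroBitDepthParams>(
      &context, lhs, rhs, &result,
      /*lhs_offset=*/-128, /*rhs_offset=*/-128, output_pipeline);
}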
Have some queries:
1) Is 150ms the best we can achieve with gemmlowp on Snapdragon 835, or is there room for improving it?
2) What config was used to benchmark the A53/A73 clusters (Snapdragon 835) independently? (FYI: we ran https://github.com/google/gemmlowp/blob/master/test/benchmark_all_sizes.cc and were hitting ~83 GOPS; see the pinning sketch below.)
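To clarify what I mean by benchmarking one cluster independently, here is a rough sketch of pinning the process to a single cluster before running the GEMMs (my own sketch, assuming Linux sched_setaffinity; core numbering is board-specific, and on many Snapdragon 835 boards cores 0-3 are the little A53-class cluster and 4-7 the big A73-class cluster -- verify on your device):

#define _GNU_SOURCE  // CPU_SET/sched_setaffinity need this on glibc; bionic exposes them by default
#include <sched.h>

// Pin the calling thread to cores [first, last] (inclusive).
bool PinToCores(int first, int last) {
  cpu_set_t set;
  CPU_ZERO(&set);
  for (int cpu = first; cpu <= last; ++cpu) CPU_SET(cpu, &set);
  return sched_setaffinity(/*pid=*/0, sizeof(set), &set) == 0;  // pid 0 = calling thread
}

// Usage before the benchmark loop:
//   PinToCores(4, 7);  // big cluster only
//   PinToCores(0, 3);  // little cluster only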
---Regards
On Tuesday, October 3, 2017 at 10:05:25 PM UTC+5:30, Benoit Jacob wrote:
> Hello,
> Please refer to the new benchmark results that I just shared. These are only gemmlowp results; I do not have any Apple vecLib results.
> Benoit
>
> On Tue, Oct 3, 2017 at 6:53 AM, quantized_guy <zaheers...@gmail.com> wrote:
>> Hi @Group,
>> Did anyone get lucky enough to run gemmlowp faster than cblasgemm on ARM-Android?
>> -Regards
Hello Benoit,
>> 1000 x 1000 x 1000 => ~89 GFLOPS (multithread) / ~31 GFLOPS (single thread).

> Where are these numbers coming from? The above spreadsheet (GEMM kernel micro-benchmark) does not refer to specific matrix sizes (like "1000 x 1000 x 1000"), and 31 Gop/s single-threaded is plainly out of reach of the Qualcomm Snapdragon 835 SoC; at least I have never observed such high performance.
I took the "1024" matrix size in that spreadsheet as roughly corresponding to 1000 x 1000 x 1000, and the single-thread figure was my mistake; it should have been ~21 GFLOPS.

>> Have some queries:
>> 1) Is 150ms the best we can achieve with gemmlowp on Snapdragon 835 (is there room for improving it)?

> I haven't looked at GoogLeNet latencies for a long time, so I can't answer that. I don't believe that Inception/GoogLeNet networks are very useful benchmarks anymore, given how fast mobile NNs have evolved over the past 3 years (MobileNet, ShuffleNet). benchmark.cc should be updated at some point...

Would replacing the GEMM sizes at "test/benchmark.cc" +285 with MobileNet dimensions be enough for benchmarking, or are extra changes needed?
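Concretely, something like this is what I had in mind (a sketch only: I am assuming benchmark.cc's gemm_t(rows, depth, cols) convention, and the shapes below are my own im2col lowering of MobileNet v1 at 224x224, width multiplier 1.0; the depthwise 3x3 layers are not GEMMs, so they are omitted):

#include <vector>

// Assumed to mirror benchmark.cc's gemm_t; check the actual struct before pasting.
struct gemm_t {
  int rows, depth, cols;
  gemm_t(int r, int d, int c) : rows(r), depth(d), cols(c) {}
};

// rows = out_h * out_w, depth = k_h * k_w * in_c, cols = out_c.
std::vector<gemm_t> MobileNetV1Gemms() {
  std::vector<gemm_t> gemms;
  gemms.push_back(gemm_t(12544, 27, 32));   // 3x3/s2 stem conv, 112x112 output
  gemms.push_back(gemm_t(12544, 32, 64));   // 1x1 pointwise at 112x112
  gemms.push_back(gemm_t(3136, 64, 128));   // 1x1 at 56x56
  gemms.push_back(gemm_t(3136, 128, 128));
  gemms.push_back(gemm_t(784, 128, 256));   // 1x1 at 28x28
  gemms.push_back(gemm_t(784, 256, 256));
  gemms.push_back(gemm_t(196, 256, 512));   // 1x1 at 14x14
  for (int i = 0; i < 5; ++i) gemms.push_back(gemm_t(196, 512, 512));
  gemms.push_back(gemm_t(49, 512, 1024));   // 1x1 at 7x7
  gemms.push_back(gemm_t(49, 1024, 1024));
  return gemms;
}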
> Did anyone get lucky enough

We (xnor.ai) tried pretty hard; we even had gemmlowp as part of our benchmark suite for a few months. We were not able to make it outperform full-precision BLAS.

Intuitively it seems it should be faster, because memory bandwidth is so scarce on lower-end ARM and caches are so small, but in practice, testing with the actual matrix sizes that our networks use, OpenBLAS or even Eigen seems to run circles around it. Naturally, when this happens there's always the inclination to think that we're "holding it wrong", so it's refreshing to see this question posed in this forum.
And we know there's still headroom in OpenBLAS and Eigen with the more recent ARM CPUs.
And that's also before you consider that stride-1 convs can be sped up by at least 50% with e.g. Winograd, or even more on low-end CPUs, and that we have a binary convolution implementation that runs circles around even Winograd and supports arbitrary strides and dilations, which Winograd does not. It's not a direct competitor to full-precision (or even quantized) GEMM, but we've found that it works really well if we train it right, especially in deeper models.

I would like to know how to make gemmlowp go fast on realistic matrix sizes, too. We would very much like to get rid of full precision where it's not needed, provided that there's not just a memory-usage improvement but also a speedup. Realistic (for us) means the filter/input tensor sizes/strides/dilations (that is, the lowered GEMM matrix sizes that correspond to them) found in e.g. ResNet34/50, YOLO Tiny, and some other modern classifiers and object detectors. This list is not going to fit everyone's needs, but as Benoit mentioned, the recent trend is toward fully convolutional models and fairly small filter sizes, so it should cover a lot of recent ground.

For us in particular, the most critical parts are the first couple of layers (very large input face, shallow input tensor, not a lot of filters); a sketch of the corresponding lowered GEMM shape follows below. I imagine the same holds for just about everybody else who has to deal with visual data.

I think that to be worthwhile (for us), gemmlowp has to be low double-digit percentage points faster than OpenBLAS in this very asymmetric case. That is currently not so.
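To make the "lowered GEMM sizes" above concrete, here is a small sketch (names and numbers are mine, not from any library mentioned here) of how conv parameters map to im2col GEMM shapes:

#include <cstdio>

struct Conv2D {
  int in_h, in_w, in_c;       // input tensor
  int k_h, k_w, out_c;        // filter bank
  int stride, dilation, pad;  // same on both axes, for brevity
};

// im2col lowering: an (out_h*out_w) x (k_h*k_w*in_c) matrix times a
// (k_h*k_w*in_c) x out_c matrix.
void PrintLoweredGemm(const Conv2D& c) {
  const int eff_kh = c.dilation * (c.k_h - 1) + 1;
  const int eff_kw = c.dilation * (c.k_w - 1) + 1;
  const int out_h = (c.in_h + 2 * c.pad - eff_kh) / c.stride + 1;
  const int out_w = (c.in_w + 2 * c.pad - eff_kw) / c.stride + 1;
  std::printf("rows=%d depth=%d cols=%d\n",
              out_h * out_w, c.k_h * c.k_w * c.in_c, c.out_c);
}

int main() {
  // The asymmetric first-layer case: huge input face, shallow input
  // tensor, few filters (ResNet-style 7x7/s2 stem, illustrative).
  PrintLoweredGemm({224, 224, 3, 7, 7, 64, 2, 1, 3});
  // prints: rows=12544 depth=147 cols=64
}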
On Tuesday, October 3, 2017 at 3:53:48 AM UTC-7, quantized_guy wrote:
> Hi @Group,
> Did anyone get lucky enough to run gemmlowp faster than cblasgemm on ARM-Android?
> -Regards