Published GEMM benchmark with results on Android and iOS phones


Benoit Jacob

Oct 3, 2017, 12:31:17 PM
to gemmlowp
Dear gemmlowp community:

This was overdue, but today I finally published an up-to-date GEMM benchmark, here:
In particular, it allows benchmarking the newer L8R8WithLhsNonzeroBitDepthParams path, which performs best and is what we tend to use in production over here.

Here is a spreadsheet collecting benchmark results from various Android and iOS phones:
Notes:

1. To help put these results in context, here are our results on the GEMM kernel micro-benchmark,
https://docs.google.com/spreadsheets/d/1E3FRXsQlEPzSv9NK81iTdj2w2FS45KV1bwVrj2ppyqM/edit?usp=sharing .  The new GEMM L8R8WithLhsNonzeroBitDepthParams path corresponds to row 31, NEON_64bit_GEMM_Int8Operands_AccumTwoWithin16Bits, in that spreadsheet.
The micro-benchmark result can be taken as a theoretical/asymptotic upper bound on the performance of actual whole GEMMs. In particular, seeing 8bit whole GEMMs achieve more Gop/s than the best float32 kernels in the micro-benchmark shows that gemmlowp/8bit achieves higher performance than is feasible in float32 on the same CPU, unless of course a faster float32 kernel exists (but that's unlikely given how much we've researched that).

2. As a whole-GEMM benchmark (not a kernel micro-benchmark), the new results are very dependent on matrix size. In practical mobile neural network applications, the sizes that matter most are typically not so large, so it is often more important to look at rows in the middle of the table (e.g. size=128) than at the "best" rows at the bottom, which report on the more favorable, larger sizes. This matters all the more because current trends in mobile neural networks (e.g. MobileNets) tend toward ever smaller (cheaper) networks, involving smaller matrix sizes. It is especially important to keep in mind when comparing against GPU implementations, as on GPUs the size-dependence of performance is even greater, and unfortunately many published GPU benchmark results focus on neural networks that are very large by 2017 standards (what was 'normal' in 2014, e.g. Inception/GoogLeNet, is very large by current standards). Accordingly, many published GPU benchmark results are unreasonably favorable to GPUs. (A rough way to translate the table's Gop/s figures into per-GEMM latencies is sketched after these notes.)

3. Note the 3 sheets reporting respectively on 1-thread, 2-thread, and 4-thread performance. In practice, the overwhelming majority of applications that I know of use only 1 thread, for a variety of reasons, ranging from the fact that small matrix sizes are less able to take advantage of multiple threads (as you can see from the tables) to thermal issues on many Android devices.
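
As an aside, here is a rough way to translate a Gop/s figure from the spreadsheet into a per-GEMM latency at a given size, using the usual 2*N^3 operation count for an N x N x N GEMM (each multiply-accumulate counted as 2 ops). This is only a back-of-the-envelope sketch; the 20 Gop/s number below is purely illustrative, not a figure from the spreadsheet.

    #include <cstdio>

    // Back-of-the-envelope conversion from a reported Gop/s figure to an
    // estimated latency for a single square GEMM of size n, counting each
    // multiply-accumulate as 2 ops, i.e. 2*n^3 ops per GEMM.
    double EstimatedGemmLatencySeconds(int n, double gops_per_second) {
      const double ops = 2.0 * n * n * n;
      return ops / (gops_per_second * 1e9);
    }

    int main() {
      // Illustrative only: at size=128 and a hypothetical 20 Gop/s,
      // one GEMM takes roughly 2*128^3 / 20e9 ~= 0.21 ms.
      std::printf("%g ms\n", EstimatedGemmLatencySeconds(128, 20.0) * 1e3);
      return 0;
    }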

Cheers,
Benoit

Benoit Jacob

Oct 3, 2017, 12:32:39 PM
to gemmlowp
On Tue, Oct 3, 2017 at 12:30 PM, Benoit Jacob <benoi...@google.com> wrote:
Dear gemmlowp community:

This was overdue, but today I finally published an up-to-date GEMM benchmark, here:
In particular, it allows benchmarking the newer L8R8WithLhsNonzeroBitDepthParams path, which performs best and is what we tend to use in production over here.

Here is a spreadsheet collecting benchmark results from various Android and iOS phones:
Notes:

1. To help put these results in context, here are our results on the GEMM kernel micro-benchmark,
https://docs.google.com/spreadsheets/d/1E3FRXsQlEPzSv9NK81iTdj2w2FS45KV1bwVrj2ppyqM/edit?usp=sharing .  The new GEMM L8R8WithLhsNonzeroBitDepthParams path corresponds to row 31

Sorry, I meant: row 17.

beeblebrox

Oct 26, 2017, 7:11:44 AM
to gemmlowp
Hi Benoit,

I'm currently running gemmlowp on an Android device and noticing large gaps between the GEMM latency seen in the benchmark and in a real use case.

Specifically, benchmark_all_sizes.cc calls an untimed warm-up GEMM followed by several timed GEMMs for (roughly) kBenchmarkSecs, and reports their average latency. In a real NN use case, however, there is no such repetition: successive GEMMs each run only once, and on different data (corresponding to successive layers of the network).

Thus I ran benchmark_all_sizes.cc on a list of shapes from some real NN models; first with the above repetition, and then with only one timed call (per shape) to GemmWithOutputPipeline after an untimed warm-up run. While the former case is significantly faster than full-precision GEMM, the latter turns out to be slower, by up to 5x in some cases (with models like AlexNet/GoogLeNet).
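
In case it helps to make the comparison concrete, here is roughly what the two timing strategies look like. This is only a minimal sketch, not the actual benchmark_all_sizes.cc code; RunOneGemm() is a hypothetical stand-in for the real call to GemmWithOutputPipeline with whatever shape, offsets and output pipeline the benchmark uses.

    #include <chrono>

    // Hypothetical stand-in for a single 8-bit GEMM at the given shape; in the
    // real code this would be a call to gemmlowp::GemmWithOutputPipeline.
    void RunOneGemm(int rows, int depth, int cols) {
      // ... perform one GEMM ...
    }

    using Clock = std::chrono::steady_clock;

    double SecondsSince(Clock::time_point start) {
      return std::chrono::duration<double>(Clock::now() - start).count();
    }

    // Strategy A (what the benchmark does): one untimed warm-up GEMM, then the
    // same GEMM repeated for roughly benchmark_secs, reporting the average.
    double AveragedLatencySecs(int rows, int depth, int cols, double benchmark_secs) {
      RunOneGemm(rows, depth, cols);  // warm-up, not timed
      int iters = 0;
      const auto start = Clock::now();
      while (SecondsSince(start) < benchmark_secs) {
        RunOneGemm(rows, depth, cols);
        ++iters;
      }
      return SecondsSince(start) / iters;
    }

    // Strategy B (closer to a real network): one untimed warm-up, then a single
    // timed call per shape.
    double SingleCallLatencySecs(int rows, int depth, int cols) {
      RunOneGemm(rows, depth, cols);  // warm-up, not timed
      const auto start = Clock::now();
      RunOneGemm(rows, depth, cols);
      return SecondsSince(start);
    }

The gap I'm describing is between what Strategy A reports and what Strategy B reports for the same shapes.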

I would appreciate it if you could hint at some internal factors that could be related to this behaviour.


Thanks

Pete Warden

Oct 26, 2017, 12:56:29 PM
to beeblebrox, gemmlowp
Thanks for digging into this, very interesting results! I can't speak for Benoit, but I'd love to see a runnable version of your altered benchmark so we could reproduce the results.


beeblebrox

Oct 26, 2017, 1:37:55 PM
to gemmlowp
Hi Pete,
Sure, I'll share it as soon as I can (although it's essentially a minor edit to run the 2nd GEMM here only once).
Meanwhile, I'd be glad if you could share whether you folks have seen this behavior before, especially with the quantized operations in TensorFlow. Does quantized matmul give a reasonable speedup over full-precision GEMM during inference?

Benoit Jacob

Oct 27, 2017, 12:49:11 PM
to beeblebrox, gemmlowp
Hello,

Benchmarking on mobile devices is hard, as many factors affect the measured latencies.

One factor is the CPU scheduler. Coming from a desktop perspective, one would expect the CPU scheduler to be designed to minimize latency. On mobile, that is not the case; instead, the CPU scheduler makes complex compromises to limit power usage. This includes not immediately ramping up the CPU clock speed, not immediately waking up additional CPU cores that were sleeping (i.e. initially queuing all work on one core), etc. As a result, the first few milliseconds, or few dozen milliseconds, of a benchmark on Android are "meaningless": they measure the behavior of a CPU scheduler that does not even try to minimize latency.

Another factor is device thermals. On most Android devices, thermal throttling is not an exceptional circumstance but rather the standard behavior under an intensive computational workload. This has many implications: a longer-running benchmark is penalized vs. a shorter one (because it will throttle more), and benchmark results will depend on room temperature, on whether the device is held in the palm or laid on a thermally conductive surface, and so on.

Another factor is competing tasks running on the device concurrently. For example, many Android NN applications with vision models tend to run concurrently with the Camera stack (e.g. when showing a Camera view while processing frames in a NN). The Camera stack is very computationally intensive and has higher priority than user NN code, so its impact tends to be major.

These factors are made even more complex on current Android devices, which have multiple cores, often in a "big.LITTLE" arrangement. Basically, it is very difficult to predict how many cores a given application will be able to use for NN inference, as that depends on what else it is doing (e.g. using the Camera stack will take cores away from what is usable for NNs), and it is difficult to even know whether a task is running on big or little cores. To address that specific concern, I recommend running all gemmlowp benchmarks with 'taskset' to pin them to one type of core, so that at least you know what you ran them on. Example: adb shell taskset f0 /.../benchmark.
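
If it is more convenient to pin from inside the benchmark process rather than via taskset, something along these lines achieves the same effect. This is only a sketch: the assumption that cores 4-7 form the big cluster is device-specific (the f0 mask in the taskset example above corresponds to cores 4-7 as well), so check your particular SoC before relying on it.

    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE  // for sched_setaffinity / CPU_SET on glibc; not needed on bionic
    #endif
    #include <sched.h>
    #include <cstdio>

    // Pin the calling process to a fixed set of cores, so that we at least know
    // which kind of core (big or little) the benchmark ran on. The core numbers
    // used below (4..7) are only an assumption and vary from device to device.
    bool PinToCores(const int* cores, int num_cores) {
      cpu_set_t set;
      CPU_ZERO(&set);
      for (int i = 0; i < num_cores; ++i) {
        CPU_SET(cores[i], &set);
      }
      if (sched_setaffinity(0 /* calling process */, sizeof(set), &set) != 0) {
        std::perror("sched_setaffinity");
        return false;
      }
      return true;
    }

    int main() {
      const int big_cores[] = {4, 5, 6, 7};  // hypothetical big cluster
      if (!PinToCores(big_cores, 4)) return 1;
      // ... run the GEMM benchmark here ...
      return 0;
    }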

For all these reasons, ultimately the only thing that matters is real application benchmarks: e.g. build an FPS counter, or log actual NN latencies directly from your app or from a realistic prototype of your app.

However, it is not practical to have only such application benchmarks. In a project like gemmlowp, we need performance metrics to track that are less application-dependent and less fluctuating, even if that makes them less directly relevant to actual applications. This is a compromise that we have to make.

That is why gemmlowp provides two such levels of "micro-benchmarking". The more "micro" the benchmark, the more application-independent and universal it is, and the less directly relevant to applications.

The most micro level of gemmlowp benchmarking is the standalone kernels benchmark. It does not even depend on actual matrix sizes; it gives a measure of how fast GEMMs could go in the case of asymptotically large sizes and efficient memory accesses. The benchmark code is here:
and the results are here:

The other level of gemmlowp benchmarking is a bit less "micro": it measures whole GEMMs of various sizes. That is what I announced in this thread:
with results at:

Even though real application benchmarks are very different from these, it is our experience that gemmlowp's benchmarks are a reasonably accurate predictor of actually observed performance in real application benchmarks, once the application is properly set up.

In particular, yes, it is our experience that gemmlowp allows for large latency gains over any float GEMM implementation, i.e. the gains seen at the level of micro benchmarks do translate into application-level gains.

Sorry not to be able to share more at the moment, as our gemmlowp-based efficient inference code is not currently open-source. However, if you saw the TensorFlow Lite announcement at I/O this year, this is something that you can watch out for eventually.

Benoit
