Hello,
Benchmarking on mobile devices is hard, as many factors affect the measured latencies.
One factor is the CPU scheduler. Coming from a desktop perspective, one would expect the CPU scheduler to be designed to minimize latencies. On mobile, that is not the case; instead, the CPU scheduler makes complex compromises to limit power usage. This includes not immediately ramping up CPU clock speed, not immediately waking up additional CPU cores that were sleeping (i.e. initially queuing all work on one core), etc. As a result, the first few or few dozen milliseconds of a benchmark on Android are "meaningless": they measure the behavior of a CPU scheduler that does not even try to minimize latency.
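For that reason it helps to run the workload untimed for a short warm-up period before recording any measurements. Here is a minimal sketch of what I mean, in plain C++; the callable, the warm-up duration and the iteration count are arbitrary placeholders, not anything provided by gemmlowp:

  #include <chrono>
  #include <functional>
  #include <vector>

  // Run `workload` untimed for a short warm-up period so that the CPU
  // scheduler has had time to ramp up clock speeds and wake up sleeping
  // cores, then record per-iteration latencies in milliseconds.
  std::vector<double> BenchmarkWithWarmup(const std::function<void()>& workload,
                                          int timed_iterations = 100) {
    using Clock = std::chrono::steady_clock;
    const auto warmup_end = Clock::now() + std::chrono::milliseconds(500);
    while (Clock::now() < warmup_end) {
      workload();  // warm-up iterations, results discarded
    }
    std::vector<double> latencies_ms;
    for (int i = 0; i < timed_iterations; ++i) {
      const auto start = Clock::now();
      workload();
      const auto end = Clock::now();
      latencies_ms.push_back(
          std::chrono::duration<double, std::milli>(end - start).count());
    }
    return latencies_ms;
  }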
Another factor is device thermals. On most Android devices, thermal throttling is not an exceptional circumstance; rather, it is the standard behavior under intensive computational workloads. This has many implications, including that a longer-running benchmark is penalized compared to a shorter one (because it will throttle more), and that benchmark results depend on room temperature and on whether the device is held in the hand or laid on a thermally conductive surface, etc.
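If you want to check whether throttling is happening during a run, many devices expose their thermal sensors under /sys/class/thermal (the exact zones, their readability and their units vary across devices; values are often in millidegrees Celsius), e.g.:

  adb shell cat /sys/class/thermal/thermal_zone*/type
  adb shell cat /sys/class/thermal/thermal_zone*/temp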
Another factor is competing tasks running concurrently on the device. For example, many Android NN applications with vision models tend to run concurrently with the Camera stack (e.g. when showing a Camera view while processing frames in an NN). The Camera stack is very computationally intensive and has higher priority than user NN code, so its impact tends to be major.
These factors are made even more complex on current Android devices, which have multiple cores, often in heterogeneous "big.LITTLE" configurations. Basically, it is very difficult to predict how many cores a given application will be able to use for NN inference, as that depends on what else it is doing (e.g. using the Camera stack takes cores away from what is usable for NNs), and it is difficult to even know whether a task is running on big or little cores. To address that specific concern, I recommend running all gemmlowp benchmarks with 'taskset' to pin them to one type of core, so that at least you know what you ran them on. Example: adb shell taskset f0 /.../benchmark.
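The taskset argument is a hexadecimal bitmask of the CPUs the process is allowed to run on. For example, assuming an 8-core SoC where cores 0-3 are the little cores and cores 4-7 are the big cores (that numbering is common but varies across devices, so check your particular SoC):

  adb shell taskset f0 /.../benchmark   # mask 0xf0 = cores 4-7 (big cores, under that assumption)
  adb shell taskset 0f /.../benchmark   # mask 0x0f = cores 0-3 (little cores, under that assumption)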
For all these reasons, ultimately the only thing that matters is real application benchmarks: e.g. build an FPS counter, or log actual NN latencies directly from your app or from a realistic prototype of your app.
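Logging latencies from the app can be as simple as timing the inference call and writing the result to logcat. A minimal sketch, where run_inference stands in for whatever your app's per-frame inference call actually is (it is not a gemmlowp API):

  #include <android/log.h>
  #include <chrono>
  #include <functional>

  // Time one inference call and log its latency to logcat. Because this runs
  // inside the real app, the measurement includes the effect of everything
  // else going on (Camera stack, other threads, thermals, scheduler).
  void TimeAndLogInference(const std::function<void()>& run_inference) {
    using Clock = std::chrono::steady_clock;
    const auto start = Clock::now();
    run_inference();
    const auto end = Clock::now();
    const double latency_ms =
        std::chrono::duration<double, std::milli>(end - start).count();
    __android_log_print(ANDROID_LOG_INFO, "NNLatency",
                        "inference latency: %.2f ms", latency_ms);
  }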
However, it is not practical to have only such application benchmarks. In a project like gemmlowp, we need performance metrics to track that are less application-dependent and less fluctuating, even if that makes them less directly relevant to actual applications. This is a compromise that we have to make.
That is why gemmlowp provides two such levels of "micro-benchmarking": the more "micro" a benchmark is, the more application-independent and universal it is, and the less directly relevant to applications.
The most micro level of gemmlowp benchmarking is the standalone kernels benchmark. It does not even depend on actual matrix sizes: it gives a measure of how fast GEMMs could go in the limit of asymptotically large sizes and efficient memory accesses. The benchmark code is
and the results are
The other level of gemmlowp benchmarking is a bit less "micro": it measures whole GEMMs of various sizes. That is what I announced in this thread:
results at
Even though real application benchmarks are very different from these, it is our experience that gemmlowp's benchmarks are a reasonably accurate predictor of the performance actually observed in real applications, once the application is properly set up.
In particular, yes, it is our experience that gemmlowp allows for large latency gains over any float GEMM implementation, i.e. the gains seen at the level of micro benchmarks do translate into application-level gains.
Sorry not to be able to share more at the moment, as our gemmlowp-based efficient inference code is not currently open-source. However, if you saw the TensorFlow Lite announcement at I/O this year, this is something that you can watch out for eventually.
Benoit