Hello,
This is to let this group know that most gemmlowp users should be using ruy instead:
TensorFlow Lite switched from gemmlowp to ruy on ARM around 1.5 years ago.
There isn't a lot of documentation, but there are example programs here.
Ruy's strengths over gemmlowp include:
- higher performance.
- support for ARMv8.2+ optional dot-product instructions.
- ruy handles runtime CPU feature detection and dispatch.
- both float and quantized paths are supported in a single package.
- very general support for quantized data types: LHS and RHS can freely mix and match int8 and uint8, and destination can be int8, uint8, int16, int32.
- storage order is now a runtime (not template) parameter.
- output features such as bias-addition and clamp ("ReLU" etc) are also runtime controlled now.
- despite all of the above, code size is very small (about 50k of code for TFLite's full set of instantiations).
- quantization scheme: completely general support for per-channel quantization (per-row or per-column), plus more recent refinements (the sketch after this list exercises the quantized path).
- CMake and Bazel build systems are fully supported.
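To make the above concrete, here is a rough sketch of a quantized multiplication with ruy, modeled on its example programs. The exact names (ruy::Mul, ruy::Matrix, ruy::MulParams, MakeSimpleLayout and the setters below) are my reading of the current headers, so please check them against the examples in the repository:

#include <cstdint>

#include "ruy/ruy.h"

// Sketch: int8 x int8 -> int8 on 2x2 matrices, with zero points, per-channel
// fixed-point multipliers, bias addition and clamping, all set up at runtime.
void QuantizedMulSketch(ruy::Context* context) {
  const std::int8_t lhs_data[] = {1, 2, 3, 4};
  const std::int8_t rhs_data[] = {1, 2, 3, 4};
  const std::int32_t bias_data[] = {1, 2};
  const std::int32_t multiplier_data[] = {3 << 28, 5 << 28};  // per-row multipliers
  const int exponent_data[] = {1, -2};
  std::int8_t dst_data[4];

  ruy::Matrix<std::int8_t> lhs;
  ruy::MakeSimpleLayout(2, 2, ruy::Order::kRowMajor, lhs.mutable_layout());
  lhs.set_data(lhs_data);
  lhs.set_zero_point(1);

  ruy::Matrix<std::int8_t> rhs;
  // Storage order is a runtime value, not a template parameter.
  ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, rhs.mutable_layout());
  rhs.set_data(rhs_data);
  rhs.set_zero_point(2);

  ruy::Matrix<std::int8_t> dst;
  ruy::MakeSimpleLayout(2, 2, ruy::Order::kColMajor, dst.mutable_layout());
  dst.set_data(dst_data);
  dst.set_zero_point(3);

  // Accumulate in int32, write int8; bias, clamp and the quantized
  // multipliers are all runtime-controlled parameters.
  ruy::MulParams<std::int32_t, std::int8_t> mul_params;
  mul_params.set_bias(bias_data);
  mul_params.set_multiplier_fixedpoint_perchannel(multiplier_data);
  mul_params.set_multiplier_exponent_perchannel(exponent_data);
  mul_params.set_clamp_min(3);  // clamping at the dst zero point acts as a ReLU
  mul_params.set_clamp_max(127);

  ruy::Mul(lhs, rhs, mul_params, context, &dst);
}

Everything in the sketch (storage order, zero points, bias, clamp, per-channel multipliers) is a plain runtime value, which is the point of the items above.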
Things that gemmlowp does that ruy does not:
- MIPS MSA support.
- the gemmlowp/fixedpoint library still does not have a replacement.
While ruy is the most direct successor to gemmlowp as a standalone matrix multiplication library, for neural network inference purposes consider using XNNPACK instead. The main reason to use ruy over XNNPACK is if you need the greater generality of the matrix multiplications that ruy supports, including multiplying two runtime-variable matrices (XNNPACK is centered more on the typical NN case where one of the matrices is constant weights).
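To illustrate that runtime-variable case, here is a minimal float sketch under the same API assumptions as above; m, k and n stand for dimensions known only at runtime:

#include "ruy/ruy.h"

// Sketch: multiply two float matrices whose shapes and contents are only
// known at runtime -- neither side needs to be constant weights.
void FloatMulSketch(ruy::Context* context,
                    const float* lhs_data, const float* rhs_data,
                    float* dst_data, int m, int k, int n) {
  ruy::Matrix<float> lhs;
  ruy::MakeSimpleLayout(m, k, ruy::Order::kRowMajor, lhs.mutable_layout());
  lhs.set_data(lhs_data);

  ruy::Matrix<float> rhs;
  ruy::MakeSimpleLayout(k, n, ruy::Order::kColMajor, rhs.mutable_layout());
  rhs.set_data(rhs_data);

  ruy::Matrix<float> dst;
  ruy::MakeSimpleLayout(m, n, ruy::Order::kColMajor, dst.mutable_layout());
  dst.set_data(dst_data);

  // No quantization parameters needed for float; bias/clamp could still be set.
  ruy::MulParams<float, float> mul_params;
  ruy::Mul(lhs, rhs, mul_params, context, &dst);
}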
Cheers,
Benoit