I see that your benchmark results in the preprint are on actual GEMMs, not just GEMM kernel micro-benchmarks like my spreadsheet.
Please still confirm whether you've benchmarked gemmlowp with L8R8WithLhsNonzeroBitDepthParams.
Benchmarking actual GEMMs, as you know, introduces 3 new parameters: the GEMM shape, which you call M, K, N in Table 2. That table refers to a very large neural network, "AlexNet", whose GEMM shapes are much larger than we typically see in the kind of mobile NNs that gemmlowp is being optimized for. I would find it very interesting if you could benchmark some actual production mobile NNs, such as MobileNets (that still leaves quite a range of sizes, from ~0.01 to ~1 billion ops per inference; maybe you could benchmark a mid-size MobileNet at ~0.1 billion ops).
A typical matrix size there is between 50 and 500; it would already be very interesting if you could benchmark just (M, K, N) = (100, 100, 100).
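For reference, the op count of a single GEMM of that shape follows from the standard 2*M*K*N estimate (counting multiplies and adds separately); this is a minimal back-of-envelope sketch, and the helper name `gemm_ops` is just an illustration, not anything from gemmlowp:

```python
def gemm_ops(m, k, n):
    # C (MxN) = A (MxK) * B (KxN) performs M*K*N multiply-adds,
    # i.e. roughly 2*M*K*N ops counting multiplies and adds separately.
    return 2 * m * k * n

# The (100, 100, 100) shape suggested above:
print(gemm_ops(100, 100, 100))  # 2,000,000 ops per GEMM
```

At 2 million ops per GEMM, such shapes sit comfortably within the ~0.01 to ~1 billion ops-per-inference range of the mobile NNs mentioned above.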
Benoit