Hi,
First, given the specific shape of your GEMM here, with a very narrow right-hand side, gemmlowp is far from optimal. If you care about getting optimal performance for such a shape, you will need custom code. Since there are only two right-hand-side columns, this is roughly a GEMV: you will want to perform it in a single pass (no packing pass) over the large LHS matrix, and, if that large LHS matrix is a constant, you will typically want to consider pre-shuffling it for optimal traversal. Here is an example of that in TensorFlow Lite code; however, there I had only coded kernels for the 1-column (GEMV) and 4-column cases, not the 2-column case you have, though you could trivially derive that from the 4-column case:
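To make the idea concrete, here is a minimal scalar sketch of such a single-pass, 2-column kernel. This is not gemmlowp's or TFLite's actual code; the function name `GemvLike2Cols` and the layouts are assumptions for illustration, and a real kernel would use SIMD intrinsics and a pre-shuffled LHS layout instead of this naive loop:

```cpp
#include <cstdint>

// Hypothetical single-pass kernel computing a (rows x depth) * (depth x 2)
// product of uint8 matrices into int32 accumulators, with no packing pass.
// LHS is row-major; RHS is column-major (column 1 starts at rhs + depth).
void GemvLike2Cols(const std::uint8_t* lhs, int rows, int depth,
                   const std::uint8_t* rhs,   // depth x 2, column-major
                   std::int32_t* result) {    // rows x 2, row-major
  for (int r = 0; r < rows; ++r) {
    std::int32_t acc0 = 0, acc1 = 0;
    const std::uint8_t* lhs_row = lhs + r * depth;
    // One traversal of the LHS row feeds both output columns.
    for (int d = 0; d < depth; ++d) {
      const std::int32_t l = lhs_row[d];
      acc0 += l * rhs[d];           // RHS column 0
      acc1 += l * rhs[depth + d];   // RHS column 1
    }
    result[2 * r + 0] = acc0;
    result[2 * r + 1] = acc1;
  }
}
```

The point of the single pass is that each LHS byte is loaded exactly once and used for both output columns, which is why packing the LHS (as gemmlowp does for general GEMM) is wasted work at this shape.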
Going back to your more general benchmarking question.
Since the latency you're measuring here is so small (< 1 ms), there will inevitably be some fluctuation.
If you're OK with measuring only warm-cache performance, a typical approach to stabilizing the result is to run the workload many times in a loop and average. You may also want to ignore the first iteration, as special one-off costs occur only there.
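For example, a minimal timing harness along those lines could look like this, where `RunGemm` is a hypothetical stand-in for your actual gemmlowp call:

```cpp
#include <chrono>
#include <cstdio>

void RunGemm();  // stand-in for your actual gemmlowp invocation

int main() {
  const int kIterations = 1000;

  // Discard the first run: one-off costs (allocation, cache warm-up)
  // only occur here.
  RunGemm();

  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < kIterations; ++i) {
    RunGemm();
  }
  const auto end = std::chrono::steady_clock::now();

  const double seconds = std::chrono::duration<double>(end - start).count();
  std::printf("average latency: %.3f us\n", 1e6 * seconds / kIterations);
}
```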
If on the other hand you want to measure cold-cache performance, that is very hard to do in any case, and probably only meaningful when measured as integrated into your application, not as a micro-benchmark.