TL;DR: This can't be fixed in the existing API after all, so I am keeping the existing API and code unchanged, and adding the new behavior as a separate non-default code path ("output stage").
Here is how this resolved in the end:
It turned out that the differences in rounding between the existing and proposed schemes (integer vs. fixed-point multiplier), although small, did result in off-by-1 errors every roughly 2^12 to 2^14 matrix entries, and that was enough to cause some trouble in the tests. I could have adapted the test, but I thought that similar trouble would be caused as well to existing gemmlowp users, and I wanted to avoid that.
So I instead kept the old behavior unchanged, and added the new behavior as a new separate non-default output stage alongside it:
here is the new output stage:
compare to the old output stage:
Moreover, another change appeared to be needed, and it would also not exactly reproduce the old behavior, causing more subtle off-by-1 errors: as explained here,
the new output stage is doing the offset addition at the end, after the multiplier and shift.
I hope to share more details about what I just learned about doing quantized NN inference, which led to this new output stage, soon. (I will be away for the next 2 weeks, so I can't do it now). Meanwhile, rest assured that the existing gemmlowp behavior stays unchanged.
Cheers,
Benoit