Heads-up: fixing int32 multiplication overflow issue, should preserve compatibility

128 views

Skip to first unread message

Benoit Jacob

unread,

Jul 19, 2016, 2:30:33 PM7/19/16

to gemmlowp

Hi,

This is about an integer overflow issue in the contract that gemmlowp implements, as described here,

https://github.com/google/gemmlowp/blob/master/doc/low-precision.txt#L22

and as seen in reference code such as this,

https://github.com/google/gemmlowp/blob/master/test/test.cc#L36

The issue is that when multiplying the int32 accumulators with int32 output multiplier, overflow may occur. Specifically in the reference code mentioned above, in

int32_t output =(((total + c_offset) * c_mult_int) + kRoundingTerm) >> c_shift;

the multiplication by c_mult_int may overflow.

This has been worked around in practice, so far, by the combination of:

1. making assumptions about the actual range of int32 accumulators (these assumptions hold on real-world use cases, but would not hold on a matrix filled with the value 255).

2. using not-too-large values for int32 output_mult, typically less than 2^12.

The downsides of this situation include:

1. this requires care when choosing quantization parameters (c_mult_int and c_shift).

2. the resulting multiplier is moderately accurate (a multiplier less than 2^12 means less than 12 bits of accuracy). This may matter even when the result is 8bit because this is a form of (multiplicative) bias.

Fortunately, there is an easy way out: we can update the 'contract' to specify that the results are to be as if the internal computation had been done on actual real numbers, then casted to integers at the end. Implementation-wise, this can be achieved by suitably shifting, then using a int32 multiply-high instruction (such as ARM VQRDMULH). Such multiply-high instructions are already in use in gemmlowp (and in any fixed-point arithmetic code), so this doesn't even widen gemmlowp's surface of contact with hardware:

generic implementation:

https://github.com/google/gemmlowp/blob/master/internal/fixedpoint.h#L157

ARM implementation:

https://github.com/google/gemmlowp/blob/master/internal/fixedpoint_neon.h#L138

Cheers,

Benoit

Benoit Jacob

unread,

Aug 9, 2016, 11:08:03 AM8/9/16

to gemmlowp

TL;DR: This can't be fixed in the existing API after all, so I am keeping the existing API and code unchanged, and adding the new behavior as a separate non-default code path ("output stage").

Here is how this resolved in the end:

It turned out that the differences in rounding between the existing and proposed schemes (integer vs. fixed-point multiplier), although small, did result in off-by-1 errors every roughly 2^12 to 2^14 matrix entries, and that was enough to cause some trouble in the tests. I could have adapted the test, but I thought that similar trouble would be caused as well to existing gemmlowp users, and I wanted to avoid that.

So I instead kept the old behavior unchanged, and added the new behavior as a new separate non-default output stage alongside it:

here is the new output stage:

https://github.com/google/gemmlowp/blob/master/public/output_stages.h#L68

compare to the old output stage:

https://github.com/google/gemmlowp/blob/master/public/output_stages.h#L29

Moreover, another change appeared to be needed, and it would also not exactly reproduce the old behavior, causing more subtle off-by-1 errors: as explained here,

https://github.com/google/gemmlowp/blob/master/public/output_stages.h#L89

the new output stage is doing the offset addition at the end, after the multiplier and shift.

I hope to share more details about what I just learned about doing quantized NN inference, which led to this new output stage, soon. (I will be away for the next 2 weeks, so I can't do it now). Meanwhile, rest assured that the existing gemmlowp behavior stays unchanged.

Cheers,

Benoit

Reply all

Reply to author

Forward

0 new messages