Hi Ning,
here's a brief description of how we use gemmlowp in TensorFlow to do arbitrary multiplications:
- We assume that A and B are linearly quantized, with A's codes covering the real range [Amin, Amax] and B's codes covering [Bmin, Bmax], so that code 0 represents the float value Amin (or Bmin), and code 255 represents Amax (or Bmax). (The first sketch after this list shows that mapping, along with the zero-point calculation from the next point.)
- To get the offset values, we figure out what the value 0.0 would be as a quantized code for each input range, so it can be subtracted from the actual codes before multiplication. There's an example of that here:
- We then run gemmlowp. The raw results from summing the 8-bit x 8-bit multiply-adds are 32-bit numbers. There are options in gemmlowp to output these directly, or you can specify offset, shift, and multiply conversion factors to scale them down into eight-bit results. (The second sketch below shows what that raw accumulation computes.)
- The 32-bit results will represent values in a known float range. This range is calculated by the QuantizationRangeForMultiplication() function, which figures out the float difference between one code level and the next in each input range, and multiplies those together to get the float delta between code levels in the output. (The third sketch below mirrors that calculation.)
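
Here's a rough C++ sketch of the code-to-float mapping and the zero-point (offset) calculation I mean. The names and the exact rounding are just illustrative, not the real TensorFlow code:

```cpp
// Illustrative sketch of the affine mapping between 8-bit codes and real
// values, and of the "offset": the code that represents the float value 0.0,
// which gets subtracted from the actual codes before multiplication.
#include <cmath>
#include <cstdint>
#include <cstdio>

// Real value covered by one step between adjacent codes.
float FloatForOneQuantizedLevel(float range_min, float range_max) {
  return (range_max - range_min) / 255.0f;
}

// Real value represented by a given 8-bit code.
float QuantizedToFloat(uint8_t code, float range_min, float range_max) {
  return range_min + code * FloatForOneQuantizedLevel(range_min, range_max);
}

// The code whose real value is closest to 0.0.
int32_t ZeroPointCode(float range_min, float range_max) {
  const float scale = FloatForOneQuantizedLevel(range_min, range_max);
  return static_cast<int32_t>(std::lround(-range_min / scale));
}

int main() {
  const float a_min = -1.0f, a_max = 3.0f;  // made-up example range
  printf("code 0   -> %f\n", QuantizedToFloat(0, a_min, a_max));    // Amin
  printf("code 255 -> %f\n", QuantizedToFloat(255, a_min, a_max));  // Amax
  printf("zero point code = %d\n",
         static_cast<int>(ZeroPointCode(a_min, a_max)));
  return 0;
}
```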
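And here's a sketch of what the raw 32-bit accumulation works out to for one output cell; the function names here are mine, not gemmlowp's API:

```cpp
// Illustrative sketch: subtract each input's offset (its zero-point code),
// multiply, and accumulate in 32 bits. This is the "raw" gemmlowp result.
#include <cstdint>
#include <cstdio>
#include <vector>

int32_t QuantizedDotProduct(const std::vector<uint8_t>& a_codes,
                            const std::vector<uint8_t>& b_codes,
                            int32_t a_offset, int32_t b_offset) {
  int32_t acc = 0;
  for (size_t i = 0; i < a_codes.size(); ++i) {
    acc += (static_cast<int32_t>(a_codes[i]) - a_offset) *
           (static_cast<int32_t>(b_codes[i]) - b_offset);
  }
  return acc;  // Raw 32-bit output; gemmlowp can also scale this down to
               // 8 bits using the offset/shift/multiplier you pass in.
}

int main() {
  // Hypothetical codes, with zero-point offsets of 64 for A and 128 for B.
  const std::vector<uint8_t> a = {70, 64, 200};
  const std::vector<uint8_t> b = {130, 128, 0};
  printf("raw 32-bit result = %d\n",
         static_cast<int>(QuantizedDotProduct(a, b, 64, 128)));
  return 0;
}
```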
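Finally, a simplified version of the reasoning inside QuantizationRangeForMultiplication(). This mirrors the idea but isn't the exact TensorFlow code:

```cpp
// One code step in the 32-bit output represents (one step of A) * (one step
// of B), so scaling that by the int32 limits gives the output's float range.
#include <cstdint>
#include <cstdio>
#include <limits>

void RangeForMultiplication(float a_min, float a_max, float b_min, float b_max,
                            float* c_min, float* c_max) {
  const float a_step = (a_max - a_min) / 255.0f;  // float delta per code in A
  const float b_step = (b_max - b_min) / 255.0f;  // float delta per code in B
  const float c_step = a_step * b_step;           // float delta per code in C
  *c_min = c_step * std::numeric_limits<int32_t>::lowest();
  *c_max = c_step * std::numeric_limits<int32_t>::max();
}

int main() {
  float c_min = 0.0f, c_max = 0.0f;
  RangeForMultiplication(-1.0f, 3.0f, 0.0f, 5.0f, &c_min, &c_max);
  printf("32-bit output range: [%f, %f]\n", c_min, c_max);
  return 0;
}
```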
To be honest, I still find it all a bit mind-bending myself, and I usually end up sitting down with pen and paper to work through the math when I come back to it. It's a very general way of representing float values with a limited set of codes, but I find fixed-point a lot easier to reason about.
Does that help?
Pete