Hi Tei,
Let me explain a little bit more.
I'm looking at the hardware required to implement an 8-bit output version of the HSWISH activation that can support what the TFLite quantizer is expecting.
Now, HSWISH is specifically defined as x*RELU6(x+3)/6. The value 6 is, of course, tied to the scale factor applied to the 8-bit matrix multiplication result; i.e. it is 6.0 in floating point, but may take any value in the 8-bit representation.
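For reference, the floating-point definition above can be written as a minimal sketch (function names are mine, not from any particular library):

```python
def relu6(x):
    # Saturating ReLU used inside hard-swish: clamp to [0, 6].
    return min(max(x, 0.0), 6.0)

def hswish(x):
    # Hard-swish as defined above: x * ReLU6(x + 3) / 6.
    return x * relu6(x + 3.0) / 6.0
```

Note that the minimum of this function sits at x = -1.5, where hswish(-1.5) = -0.375, which is the most negative value referred to later in this message.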
Hardware has to be able to dequantize the matrix multiplication result (essentially, multiply by a floating-point factor, typically per-channel) before the activation is applied. A per-channel dequantization factor is typically expected in order to get good quantization accuracy.
So far so good.
Now, if you had a ReLU, all results are positive, so you can also fold into the dequantization factor the scale you want in the linear part of the ReLU before it saturates (an 8-bit representation is not unbounded like a floating-point ReLU, so at some point it must saturate). As such, if you pick the zero point to be -128, then you have 255 values of linear range before saturating (in signed 8-bit). That same per-channel dequantization factor can then also control what that output range represents.
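The ReLU case above can be sketched as follows (an illustrative toy, not TFLite's actual kernel; the function name and rounding mode are my assumptions):

```python
def quantize_relu_output(acc_fp, ch_scale, zero_point=-128):
    # Map a non-negative ReLU output into signed 8-bit using a
    # per-channel scale. The zero point can stay fixed at -128 for
    # every channel, because a ReLU output range always starts at 0:
    # only the positive span differs per channel.
    q = round(acc_fp / ch_scale) + zero_point
    return max(-128, min(127, q))  # saturate instead of wrapping
```

Because only the scale varies, each channel gets its own range while the zero-point offset remains common to the whole tensor.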
However, H-SWISH is not made up of only positive values; it also has a negative component.
That negative component can matter if the quantizer needs the activations to fit in a range where the most important positive values are not very high. The most negative value H-SWISH can produce in floating point is -0.375. So if, for example, the quantizer believes the most useful range of activations lies between -0.375 and 0.5, it would quantize so that this range (i.e. 0.875) is described in 8 bits; that also determines the zero point (offset so the useful range is fully describable in 8 bits). Now, in the ReLU case, each channel can use a different range, because the zero-point offset can remain common: you are only scaling the positive range, so every channel can have its own range. In the H-SWISH case that is not possible, because the zero point would then need to be set on a per-channel basis (each range being unique to its channel).
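The zero-point problem can be made concrete with the standard affine mapping (a sketch under my own naming; it follows the usual q = round(x/scale) + zero_point convention with rmin mapped to -128):

```python
def qparams_from_range(rmin, rmax):
    # Affine 8-bit quantization parameters for a real range [rmin, rmax]:
    # the scale covers the span in 255 steps, and the zero point is the
    # quantized value that 0.0 lands on (rmin maps to -128).
    scale = (rmax - rmin) / 255.0
    zero_point = round(-128 - rmin / scale)
    return scale, zero_point
```

For any ReLU-style range (rmin = 0) the zero point is always -128, no matter the scale, so per-channel scales are free. For asymmetric H-SWISH ranges such as [-0.375, 0.5] vs [-0.375, 1.0], the resulting zero points differ, which is exactly why per-channel ranges would force per-channel zero points.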
I would therefore infer that if you only support a per-tensor zero-point offset, the quantizer must necessarily use a per-tensor scaling factor for the activation (i.e. to describe the HSWISH output in 8 bits, after the activation has taken place).
This would mean that, from a hardware point of view, you can use a single output scale factor to translate the HSWISH value into an 8-bit representation.
This is what my query is trying to confirm. It is, of course, cheaper to implement in hardware than an additional per-channel activation scale factor. So you would still have a per-channel dequantization factor to translate your matrix multiplication result into the range you quantized BEFORE the activation, such that you get correct HSWISH behaviour, but the mapping of that output to 8 bits would then be on a per-tensor basis.
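Putting the pieces together, the whole pipeline I am assuming would look something like this (a hypothetical sketch of my reading, not TFLite's actual implementation; all names and the float intermediate are my assumptions):

```python
def hswish_requantize(acc, ch_scale, out_scale, out_zp):
    # acc: int32 accumulator from the 8-bit matrix multiplication.
    # ch_scale: per-channel dequantization factor (before activation).
    # out_scale / out_zp: single per-tensor output mapping (after activation).
    x = acc * ch_scale                           # per-channel dequantize
    y = x * min(max(x + 3.0, 0.0), 6.0) / 6.0    # hard-swish in float
    q = round(y / out_scale) + out_zp            # per-tensor requantize
    return max(-128, min(127, q))                # saturate to signed 8-bit
```

With an output range of [-0.375, 0.5] (out_scale = 0.875/255, out_zp = -19), the most negative hard-swish value maps to -128 and large positive values saturate at 127, as expected.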
Please let me know if this clarifies my question. I just want to understand if my assumption about what your quantizer is doing is correct.
Thanks,
Adam