H-SWISH 8bit quantization in TensorFlowLite


Adam Fuks

Sep 30, 2021, 11:55:26 AM
To: TensorFlow Lite
Hi,

I'd like to understand the scheme used by the TensorFlowLite quantizer to quantize H-SWISH based NN layers into 8bit.

Specifically, the typical approach of TFLite is to have a per-tensor zero-point offset. For other activation functions (such as ReLU), you have a clear zero point regardless of the scale of the activation value (i.e. 0 * anything is still 0). However, in the case of H-SWISH, having a common zero-point offset should also mean that the scaling of the H-SWISH output has to be per-tensor, in order to keep the same zero-point offset.

From a flow point of view, the H-SWISH functionality, could look as follows :

<8bit matrix mult>  ->  <dequant scaler per-channel> -> <H-SWISH activation> -> <per-tensor scale to fit into 8bit>
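
To make the question concrete, here is a rough sketch of that flow in plain NumPy (the helper and all names are my own illustration, not actual TFLite internals):

import numpy as np

def relu6(x):
    return np.minimum(np.maximum(x, 0.0), 6.0)

def h_swish(x):
    # H-SWISH in float: x * ReLU6(x + 3) / 6
    return x * relu6(x + 3.0) / 6.0

def hswish_layer_int8(acc_int32, per_channel_scale, out_scale, out_zero_point):
    # acc_int32: int32 accumulators from the 8bit matrix multiply, shape (..., channels)
    # per_channel_scale: float32 dequant factors, one per output channel
    # out_scale / out_zero_point: single per-tensor output quantization parameters
    x = acc_int32.astype(np.float32) * per_channel_scale   # per-channel dequant
    y = h_swish(x)                                         # activation in float
    q = np.round(y / out_scale) + out_zero_point           # per-tensor requant
    return np.clip(q, -128, 127).astype(np.int8)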

Is this how TensorFlowLite treats H-SWISH in 8bit, or does it require a separate per-channel scaler after the activation?

Thanks,
Adam







Tei (Taehee) Jeong

Sep 30, 2021, 7:37:56 PM
To: Adam Fuks, TensorFlow Lite
Hi Adam,

Activations are always per-tensor, asymmetrically quantized. I'm not sure if I understood you correctly, but the zero points are always nudged to make sure that the real value zero is exactly representable.
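
For reference, a minimal sketch of what that nudging typically looks like for per-tensor asymmetric int8 parameters (an illustration of the general recipe, not the exact TFLite code; assumes rmax > rmin):

def choose_quant_params(rmin, rmax, qmin=-128, qmax=127):
    # Per-tensor asymmetric int8 parameters, with the zero point "nudged"
    # so that the real value 0.0 maps to an exact integer code.
    rmin, rmax = min(rmin, 0.0), max(rmax, 0.0)    # the range must contain 0
    scale = (rmax - rmin) / (qmax - qmin)
    zero_point = int(round(qmin - rmin / scale))
    zero_point = max(qmin, min(qmax, zero_point))  # clamp into the int8 range
    return scale, zero_point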

Could you elaborate on your question if this does not answer it?

Regards,
Tei

On Fri, Oct 1, 2021 at 12:55 AM Adam Fuks <adam...@gmail.com> wrote:

Adam Fuks

Oct 1, 2021, 5:10:33 PM
To: TensorFlow Lite, tae...@google.com, TensorFlow Lite, Adam Fuks

Hi Tei,

Let me explain a little bit more.

I'm looking at the hardware necessary to implement an 8bit output version of the H-SWISH activation that can support what the TFLite quantizer is expecting.

Now, H-SWISH is specifically defined as x*ReLU6(x+3)/6. The meaning of the value 6 is, of course, related to the scale factor applied at the 8bit matrix multiplication step: it is 6.0 in floating point, but may take any value in the 8bit representation.
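
As a toy numeric illustration (the scale below is an arbitrary assumption, not something the quantizer produced):

# Once the input is expressed in its quantized domain, the float constants
# in x*ReLU6(x+3)/6 become plain integer levels.
input_scale = 0.05            # assumed per-tensor input scale
three_q = 3.0 / input_scale   # -> 60.0 : the "+3" offset in integer units
six_q = 6.0 / input_scale     # -> 120.0 : the "6" clamp level in integer units
print(three_q, six_q)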

Hardware has to implement the capability of dequantizing the matrix multiplication result (essentially multiplying by a per-channel floating-point factor, typically) before the activation is applied.

A per-channel dequantization factor is typically expected in order to get a good quantization.

So far so good.

Now, if you had a ReLU, all results are positive, so you can also encode in the dequantization factor the scale you want for the linear part of the ReLU before it saturates (an 8bit representation is not unbounded like a floating-point ReLU, so at some point we will saturate). As such, if you pick the zero point to be -128, then you have 255 values of linear space before you saturate (in 8bit signed). That very same per-channel dequantization factor can then also control what that output range represents.
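
For example (with an assumed, arbitrary scale):

# With a signed int8 output and the zero point pinned at -128, the real value
# represented by a code q is scale * (q - zero_point).
scale, zero_point = 0.02, -128
real_min = scale * (-128 - zero_point)  # 0.0 -> ReLU output never goes below 0
real_max = scale * (127 - zero_point)   # 5.1 -> saturation point for this scale
print(real_min, real_max)               # 255 linear steps between the two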

However, H-SWISH is not comprised only of positive values; it also has a negative component.

That negative component may take up a relatively large share of the quantized range if the quantizer needs the activations to fit in a range where the most important positive values are not very high. The most negative number that H-SWISH can produce (in floating point) is -0.375. So if, for example, the quantizer believes that the most useful range of activations to quantize lies between -0.375 and 0.5, then it would quantize such that this range (i.e. 0.875) is described in 8bits; this would also determine the zero point (which will be offset so that you can fully describe the useful range in 8bits).

Now, in the case of ReLU, each channel could describe a different range, because the zero-point offset can remain common: you are only scaling the positive range, and thus every channel can have its own range. In the case of H-SWISH, that is not possible, because the zero point would then need to be chosen on a per-channel basis (since each range would be unique to its channel).
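
Working that example through with the standard asymmetric int8 mapping (my own arithmetic, just to be explicit):

rmin, rmax = -0.375, 0.5
qmin, qmax = -128, 127
scale = (rmax - rmin) / (qmax - qmin)    # 0.875 / 255 ~= 0.00343
zero_point = round(qmin - rmin / scale)  # ~= -19, so real 0.0 sits at code -19
# -0.375 maps near code -128 and 0.5 near code 127; a different range per
# channel would therefore force a different zero point per channel.
print(scale, zero_point)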

I would therefore infer that if you only support a per-tensor zero-point offset, the quantizer must necessarily use a per-tensor scaling factor for the activation (i.e. to describe the H-SWISH output in 8bits after the activation has taken place).

This would mean that from a hardware point of view, you can use a single output scale factor to translate the HSWISH value into an 8bit representation.

This is what my query is trying to confirm. This is, of course, cheaper to implement in hardware than an additional per-channel activation scale factor. So you'd still have a per-channel dequantization factor to translate your matrix multiplication result into the range you quantized BEFORE the activation, such that you get correct H-SWISH behaviour, but the mapping of that output would then need to be on a per-tensor basis.

Please let me know if this clarifies my question. I just want to understand if my assumption about what your quantizer is doing is correct.

Thanks,
Adam
