I am trying to understand how post-training quantization works.
I have read the quantization paper
that the TensorFlow quantization scheme is based on.
The TensorFlow documentation states:
For full integer quantization, you need to calibrate or estimate the range, i.e., (min, max) of all floating-point tensors in the model. Unlike constant tensors such as weights and biases, variable tensors such as model input, activations (outputs of intermediate layers) and model output cannot be calibrated unless we run a few inference cycles. As a result, the converter requires a representative dataset to calibrate them.
Does this mean that post-training quantization runs inference with the representative dataset, stores the values of the different floating-point tensors, and then uses the minimum and maximum values observed for each floating-point tensor?
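To make my question concrete, here is how I currently picture the calibration step, as a sketch. The `RangeTracker` class and the "activation" data below are placeholders of my own, not actual TensorFlow internals:

```python
import numpy as np

class RangeTracker:
    """Tracks the observed (min, max) of one tensor across inference runs."""

    def __init__(self):
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, tensor):
        self.min_val = min(self.min_val, float(np.min(tensor)))
        self.max_val = max(self.max_val, float(np.max(tensor)))

# Stand-in for one tensor (e.g. an intermediate layer's output) seen while
# running inference over a representative dataset.
tracker = RangeTracker()
for batch in [np.array([-0.5, 0.3]), np.array([1.2, -2.0])]:
    activation = batch  # in reality: the output produced by the model
    tracker.observe(activation)

print(tracker.min_val, tracker.max_val)  # -2.0 1.2
```

Is this roughly what the converter does internally for every variable tensor?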
Also, what determines the zero-point?
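My current guess, from the paper's affine scheme real_value = scale * (q - zero_point), is that the zero-point is the integer that represents real 0.0 exactly. Sketched below from a calibrated (min, max); the function name and details are my own assumptions:

```python
def choose_quant_params(min_val, max_val, num_bits=8):
    """My guess at deriving affine quantization params from a calibrated range.

    Assumed scheme (from the paper): real_value = scale * (q - zero_point),
    with q an unsigned integer in [0, 2**num_bits - 1].
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so it is exactly representable.
    min_val = min(min_val, 0.0)
    max_val = max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    # zero_point: the integer q that maps back to real 0.0, clamped to [qmin, qmax].
    zero_point = int(round(qmin - min_val / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point

scale, zero_point = choose_quant_params(-1.0, 1.0)
print(scale, zero_point)  # 2/255 and 128 for this symmetric range
```

Is the zero-point really fixed by the requirement that real 0.0 maps to an exact integer, or is there more to it?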
Feedback would be greatly appreciated.