Hello Team,

As part of working on "Consider introducing QuantizedTensorType based on integer multiplier/shift" #1404, I was trying to define a conversion between a floating-point scale and (multiplier + shift). While doing so, I was curious to know how the multiplier and shift parameters are derived in the TOSA rescale operations. For example, in the section "TOSA quantized implementation" of the RFC "Align StableHLO and TOSA arithmetic", we have a framework operator like
```
%framework_sum = framework.add %a, %b : (tensor<2x2x!quant.uniform<i8:f32, 0.025:-1>>, tensor<2x2x!quant.uniform<i8:f32, 0.075:-1>>) -> tensor<2x2x!quant.uniform<i8:f32, 0.15:-1>>
```
which can be realized using TOSA rescale operations, as shown below:
```
%scaled_a = tosa.rescale %a {input_zp = -1 : i32, multiplier = [1431655765 : i32], shift = [13 : i32]} : (tensor<2x2xi8>) -> tensor<2x2xi32>
%scaled_b = tosa.rescale %b {input_zp = -1 : i32, multiplier = [1073741824 : i32], shift = [11 : i32]} : (tensor<2x2xi8>) -> tensor<2x2xi32>
%sum = tosa.add %scaled_a, %scaled_b : (tensor<2x2xi32>, tensor<2x2xi32>) -> tensor<2x2xi32>
%scaled_sum = tosa.rescale %sum {input_zp = -1 : i32, multiplier = [1073741824 : i32], shift = [50 : i32]} : (tensor<2x2xi32>) -> tensor<2x2xi8>
```
I am not sure how the multipliers and shifts for `%scaled_a` and `%scaled_b` are derived. Any help on how to derive them would be pretty useful.

On my side, I was following quantization_util.cc and math_utils.h, but the calculations do not add up. For example, following math_utils.h, the scale for `%a`, which is 0.025, should be represented by the multiplier, round, and shift values
```
multiplier = 0.8 * 2^31 = 1717986918
round = 2^35 = 34359738368
shift = 36
```
such that
```
std::floor(0.025 + 0.5) == (static_cast<int64_t>(multiplier) + round) >> shift
```
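For concreteness, here is a minimal C++ sketch of the multiplier/shift convention I am assuming from math_utils.h (`DecomposeScale` is an illustrative name, not the actual function):
```
#include <cassert>
#include <cmath>
#include <cstdint>

// Illustrative sketch: represent a positive float scale as
//   scale ~= multiplier * 2^-shift,  with multiplier in [2^30, 2^31).
// For scale = 0.025 this yields multiplier = 1717986918 and shift = 36,
// matching the numbers above.
void DecomposeScale(double scale, int32_t* multiplier, int32_t* shift) {
  assert(scale > 0.0 && std::isfinite(scale));
  int exp;
  // frexp: scale = mantissa * 2^exp, with mantissa in [0.5, 1).
  const double mantissa = std::frexp(scale, &exp);
  int64_t m = std::llround(mantissa * (1LL << 31));  // in [2^30, 2^31]
  int s = 31 - exp;
  if (m == (1LL << 31)) {  // rounding bumped the mantissa up to exactly 1.0
    m >>= 1;
    --s;
  }
  *multiplier = static_cast<int32_t>(m);
  *shift = s;
}
```
Applying the scale to a value would then be `(value * static_cast<int64_t>(multiplier) + round) >> shift` with `round = int64_t{1} << (shift - 1)`.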
--
Hi all, sorry for the slow response.
Focusing on your original question:
Integer rescaling requires knowing which operator the rescale is being used for. For the example of the add operation (c = a + b), you need to derive the rescale parameters from both inputs. It will look something like this:
- rescale_a = scale_a / scale_add
- rescale_b = scale_b / scale_add
- rescale_sum = scale_add / scale_c
where scale_add (the quantized scale at which the add is performed) is chosen so that the add cannot overflow the accumulator (in this case int32) and has good precision.
For example, for 8-bit data with zero point we make scale_add = max(scale_a, scale_b) / (1 << n) with n = 20, since 20 + 9 + 1 <= 31.
You can see the scale calculation in the TensorFlow Lite to TOSA legalization code here: https://github.com/tensorflow/tensorflow/blob/60f7a770d64cb6cb3a93f84c291272ec51304d31/tensorflow/compiler/mlir/tosa/transforms/legalize_tfl.cc#L701
There are multiple possible scale_add values. When we are trying to match TensorFlow Lite results, we make the same choice for scale_add that the TFL implementation makes.
For other operators (like multiply), you would choose different rescales based on the needs of the operator.
These rescale calculations can be done if your quantization parameters are floating-point or expressed as multiplier and shift. The multiplier/shift option has the advantage of being bit exact.
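As a rough C++ sketch of the derivation above (hedged: `DeriveAddRescales` is an illustrative name, and the actual choices live in the TFL-to-TOSA legalization linked above):
```
#include <algorithm>

// Illustrative only: derive the three rescale ratios for c = a + b,
// following the formulas above.
void DeriveAddRescales(double scale_a, double scale_b, double scale_c,
                       double* rescale_a, double* rescale_b,
                       double* rescale_sum) {
  const int n = 20;  // per the TFL add kernel; see the 20 + 9 + 1 <= 31 budget
  const double scale_add = std::max(scale_a, scale_b) / (1 << n);
  *rescale_a = scale_a / scale_add;
  *rescale_b = scale_b / scale_add;
  *rescale_sum = scale_add / scale_c;
}
```
Each ratio would then be converted to a (multiplier, shift) pair, e.g. with a helper like the `DecomposeScale` sketch earlier in the thread, before emitting tosa.rescale.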
Thanks,
Eric
On Tue, May 2, 2023 at 4:50 PM 'Sandeep Dasgupta' via OpenXLA Discuss <openxla...@openxla.org> wrote:
> I am not sure how the multipliers and shifts for `%scaled_a` and `%scaled_b` are derived. Any help on how to derive them would be pretty useful.
> …
Thanks Sandeep. Pulling your questions from below:
> For example, for 8-bit data with zero point we make scale_add = max(scale_a, scale_b) / (1 << n) with n = 20, since 20 + 9 + 1 <= 31.
> What does the 20+9+1 correspond to?
When you subtract the 8-bit zero point from the 8-bit value, you get a range of 9 bits.
The 1 bit is because adding two numbers together can grow the result by at most 1 bit.
The 20 is somewhat arbitrary, but it moves the "interesting" bits higher in the 32-bit value to maximize the precision of the result.
Earlier we pointed at the TFL add kernel code as the reference, as that is the source of the 20. An add lowered from a different framework could choose a number other than 20, depending on how closely we want to match that framework.
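A small illustrative snippet of that bit accounting (not the actual TFL kernel code):
```
#include <cstdint>

// Bit budget for the int32 accumulator:
//   (int8 value) - (int8 zero point) spans [-255, 255]  -> 9 bits
//   * (1 << 20), i.e. a left shift by 20                -> 29 bits
//   adding two such values grows at most one more bit   -> 30 bits
// so 20 + 9 + 1 <= 31 keeps everything inside a signed int32.
int32_t ShiftedInput(int8_t value, int32_t zero_point) {
  return (static_cast<int32_t>(value) - zero_point) * (1 << 20);
}
```
`ShiftedInput(a, zp_a) + ShiftedInput(b, zp_b)` therefore still fits in an int32.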
Thanks,
Eric
Hi Sandeep,
Responses inline below.
Thanks,
Eric
From: Sandeep Dasgupta <sda...@google.com>
Date: Tuesday, May 9, 2023 at 2:46 PM
To: OpenXLA Discuss <openxla...@openxla.org>
Cc: Sandeep Dasgupta <sda...@google.com>, OpenXLA Discuss <openxla...@openxla.org>, Eugene Burmako <bur...@google.com>, Mehdi AMINI <joke...@gmail.com>, Eric Kunze <Eric....@arm.com>
Subject: Re: Speccing StableHLO quantization
Hello Eric,

I have a few follow-up questions related to computeMultiplierAndShiftTosaScale32 and computeMultiplierAndShiftTosaScale16.
Q1: The above two variants are called on the basis of a `scaleWidth`, which is defined in isScale32 as:
```
bool isScale32(mlir::quant::UniformQuantizedType output_element_type) {
  return (output_element_type.getStorageTypeIntegralWidth() == 8);
}
```
It is based on the output storage-type bitwidth. So for an 8-bit quantized output type we use the 32-bit scaling (computeMultiplierAndShiftTosaScale32), and for others we use computeMultiplierAndShiftTosaScale16. That is, even if the output bit width is > 8, we still use the 32-bit scaling. I am not sure why that is. Or is there an implicit assumption that the output quantized type is 8 bits or below?
Eric: If the output bit width is > 8, we should use the 16-bit scaling, which should happen with the above conditional. The primary use case we have for the 16-bit scaling is scaling down TOSA's CONV2D, as the output for 16-bit int is defined to be 48 bits. Then the 48-bit accumulator * 16-bit scale multiplier won't overflow 64 bits in RESCALE.
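An illustrative bit accounting for that, under the assumptions Eric describes (the function name is hypothetical):
```
#include <cstdint>

// A 48-bit accumulator times a 16-bit multiplier is at most 48 + 16 = 64
// bits, so the product still fits the 64-bit intermediate that RESCALE
// computes with; a 32-bit multiplier could overflow it.
int64_t Rescale16(int64_t acc48, int16_t multiplier, int32_t shift) {
  const int64_t round = int64_t{1} << (shift - 1);  // assumes shift >= 1
  return (acc48 * static_cast<int64_t>(multiplier) + round) >> shift;
}
```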
Q2: What if the scale value used as input to computeMultiplierAndShiftTosaScale32 or computeMultiplierAndShiftTosaScale16 is a very large value? That could make the right-shift value negative. Are there any assumptions in this code about the range of input scale values, other than being finite and positive?
Eric: In the TOSA specification, we have REQUIRE statements, which are like the StableHLO constraints. If a REQUIRE is violated, then the output of the graph is marked as 'unpredictable'. For RESCALE, we require 2 <= shift <= 62, or the result is unpredictable. It's possible we don't have enough checks for extreme scale values on the TensorFlow side, and we should error out in the legalization to TOSA at that point.
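A minimal sketch of such a guard at legalization time (placement and naming are illustrative, not the actual TensorFlow code):
```
#include <cstdint>

// TOSA's RESCALE REQUIRE demands 2 <= shift <= 62; a legalization could
// reject extreme scale values up front instead of emitting a graph whose
// result is unpredictable.
bool IsValidRescaleShift(int32_t shift) {
  return shift >= 2 && shift <= 62;
}
```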
Regards,
Sandeep
The PR augments a few ops, whose specs are already published, with constraints in the following ways:
- Constraints added w.r.t. the per-axis quantization scheme
- Constraints added to comply with other similar specs
The PR proposes the specification for quantized element-wise operations (47 in total).
Details:
- Dequant-float-compute-quant (30): The semantics of all these ops follow dequantize -> float computation -> quantize. All the ops in this category follow per-tensor quantization granularity. To simplify the specification, we introduced three meta functions for the first three categories of ops.
- Computation in the quantized integer domain (13): All the ops in this category follow per-tensor quantization granularity.
- Currently not supporting quantized types (4):
For the ops under "Currently not supporting quantized types", I am happy to discuss any use case for supporting quantized types for these ops.
Looking forward to your feedback.
Hello Folks,
Here is the announcement of a PR which proposes the specification of "remaining" data movement ops. Specifically, the PR covers concatenate, sort, broadcast_in_dim, gather, scatter, dynamic_slice, and dynamic_update_slice. The rest are covered in a separate PR.
I note that the TFLite fully_connected op does not seem to support per-axis quantization.
The per-axis scheme was added to the StableHLO dot_general op mainly because dot_general is used in the specification of the convolution op. IMO, that should not be the only reason to include the per-axis scheme in dot_general.
I opened a ticket to revisit the spec of dot_general and discuss potential use cases for it to support the per-axis quantization scheme. Please let me know your input.
Thanks
Sandeep
Hello Folks,
Here is the announcement of a PR which proposes the quantization specification of all the remaining StableHLO ops (other than the ones already proposed). Please refer to the description of the PR for a consolidated summary of the proposed changes.
Looking forward to your feedback.
Thanks
Sandeep