Speccing StableHLO quantization


Eugene Burmako

unread,
Mar 23, 2023, 10:51:34 PM3/23/23
to openxla...@openxla.org
Hi everyone,

Following up on the RFC for aligning StableHLO and TOSA arithmetic, I would like to propose that we collaborate on pull requests to the StableHLO specification.

@sdasgup3 has created an initial PR that drafts a specification for QuantizedType and for the semantics of quantized add, modelled after TFLite quantization semantics. Let's get together as a community and discuss the details, with the plan to progress to more involved ops like convolution in future pull requests.

Cheers,
Eugene

Sandeep Dasgupta

unread,
Apr 17, 2023, 1:47:41 PM4/17/23
to OpenXLA Discuss, Eugene Burmako
Hello Everyone,
Last week, we published our next PR https://github.com/openxla/stablehlo/pull/1413 for review. 

Summary
The PR proposes the specification for the quantized dot_general op, along with specifications for a few other ops on which dot_general depends, for example slice, transpose, and reshape.

A few details
Given fp = a tensor with floating-point type and q = a tensor with uniform quantized type, the PR covers the semantics of
(1) the static range quantized dot_general op "dot_general(q, q)", and
(2) the hybrid quantized dot_general op "dot_general(fp, q)". Currently, this version of the op only supports dynamic range quantization, where the on-the-fly quantization of the lhs is fused into the op semantics.

Please refer to the PR description for details.

What comes next
Our plan, in the very near future, is to propose a PR for the convolution op. Since the spec for convolution depends on dot_general, we are publishing the PRs in phases to streamline the review process.


Looking forward to your valuable feedback!



Sandeep Dasgupta

unread,
Apr 28, 2023, 7:36:10 PM4/28/23
to OpenXLA Discuss, Sandeep Dasgupta, Eugene Burmako
Hello Everyone,
I am super excited to share my next PR #1446 for the specification of the quantized add op. Given the size and nature of the PR, I am hopeful that it will not create any congestion with other ongoing reviews.


Summary

The PR proposes the specification for quantized add op.

Some background details

At some point we decided to drop the introduction of the specification of this op, mainly because we were unsure about the fate of issue #1406. Please have a look at my revised proposal on issue #1406 and let me know if I am missing something. Otherwise, let us review this op, and let me know your feedback.
Side note: For those who are already aware of the context of the prior introduction of this op, please note that the current proposal is almost the same as before, except that it does not have any additional constraints imposed by the op's semantics on `storage_min` or `storage_max`.

Looking forward to your valuable feedback!

Sandeep Dasgupta

unread,
May 2, 2023, 7:50:20 PM5/2/23
to OpenXLA Discuss, Sandeep Dasgupta, Eugene Burmako, eric....@arm.com
Hello Team,

As part of working on Consider introducing QuantizedTensorType based on integer multiplier/shift #1404, I was trying to define a conversion between a floating-point scale and (multiplier + shift). While doing so, I was curious how the multiplier and shift parameters are derived in the TOSA rescale operations. For example, in the section "TOSA quantized implementation" of the RFC: Align StableHLO and TOSA arithmetic, we have

A framework operator like
```
%framework_sum = framework.add %a %b : (tensor<2x2x!quant.uniform<i8:f32, 0.025:-1>>, tensor<2x2x!quant.uniform<i8:f32, 0.075:-1>>) -> tensor<2x2x!quant.uniform<i8:f32, 0.15:-1>>
```

can be realized using TOSA rescale operations, as shown below:

```
%scaled_a = tosa.rescale %a {input_zp = -1 : i32, multiplier = [1431655765 : i32], shift = [13 : i32]} : (tensor<2x2xi8>) -> tensor<2x2xi32>

%scaled_b = tosa.rescale %b {input_zp = -1 : i32, multiplier = [1073741824 : i32], shift = [11 : i32]} : (tensor<2x2xi8>) -> tensor<2x2xi32>

%sum = tosa.add %scaled_a %scaled_b : (tensor<2x2xi32>, tensor<2x2xi32>) -> tensor<2x2xi32>

%scaled_sum = tosa.rescale %sum {input_zp = -1 : i32, multiplier = [1073741824 : i32], shift = [50 : i32]} : (tensor<2x2xi32>) -> tensor<2x2xi8>
```

I am not sure how the multipliers and shifts, as mentioned for `%scaled_a` and `%scaled_b`, are derived. Any help on how to derive them would be very useful.

On my side, I was following quantization_util.cc and math_utils.h, but the calculations do not add up.

For example, following math_utils.h, the scale for `%a`, which is 0.025, should be represented by the multiplier, round, and shift values
multiplier = 0.8 * 2^31 = 1717986918
round = 2^35 = 34359738368
shift = 36

such that
std::floor(0.025 + 0.5) = (static_cast<int64_t>(multiplier) + round) >> shift

I assume that the multiplier and shift values used in the TOSA rescale operations could be a function of the scales of %a, %b, and/or the result scale.
Also, a broader question I was trying to answer: is a defined conversion between the float scale and the integer multiplier and shift sufficient to derive the parameters of the rescale operators? Or do we need something more, like information about which operator the rescale is used for? Please let me know what you think.
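For reference, here is a minimal sketch of the float-scale-to-(multiplier, shift) decomposition I have in mind, assuming the frexp-based normalization used in math_utils.h; the helper name is mine, not the library's. For scale = 0.025 it yields the multiplier 1717986918 and shift 36 quoted above, though I do not yet see how that relates to the parameters in the rescale ops.

```cpp
// Minimal sketch of scale -> (multiplier, shift), assuming the frexp-based
// normalization in math_utils.h. Helper name is mine, not the library's.
#include <cmath>
#include <cstdint>

void DecomposeScale(double scale, int32_t& multiplier, int32_t& shift) {
  // scale = mantissa * 2^exp, with mantissa in [0.5, 1).
  int exp = 0;
  const double mantissa = std::frexp(scale, &exp);
  // Normalize to a 31-bit fixed-point multiplier so that
  // scale ~= multiplier * 2^-shift, with multiplier in [2^30, 2^31).
  int64_t m = static_cast<int64_t>(std::round(mantissa * (1LL << 31)));
  shift = 31 - exp;
  if (m == (1LL << 31)) {  // rounding pushed the mantissa to 1.0; renormalize
    m >>= 1;
    --shift;
  }
  multiplier = static_cast<int32_t>(m);
}
// For scale = 0.025: mantissa = 0.8 and exp = -5, so multiplier = 1717986918
// and shift = 36, matching the numbers above.
```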

Regards,
Sandeep


Stella Laurenzo

unread,
May 2, 2023, 11:01:16 PM5/2/23
to Sandeep Dasgupta, OpenXLA Discuss, Eugene Burmako, eric....@arm.com
It's been a long time since I've been at that level of detail, but I recall the tosa utilities being where it is derived: https://github.com/tensorflow/tensorflow/blob/60f7a770d64cb6cb3a93f84c291272ec51304d31/tensorflow/compiler/mlir/tosa/transforms/legalize_utils.h#L57


Mehdi AMINI

unread,
May 3, 2023, 12:30:30 AM5/3/23
to Sandeep Dasgupta, OpenXLA Discuss, Eugene Burmako, eric....@arm.com
On Tue, May 2, 2023 at 4:50 PM 'Sandeep Dasgupta' via OpenXLA Discuss <openxla...@openxla.org> wrote:
Hello Team,

As part of working on Consider introducing QuantizedTensorType based on integer multiplier/shift #1404, I was trying to define a conversion between floating point scale and (multiplier + shift). While doing so I was curious to know how the multiplier and shift parameters are derived in the TOSA rescale operations. For example, in the section TOSA quantized implementation of the RFC: Align StableHLO and TOSA arithmetic, we have 

A framework operator like
```
%framework_sum = framework.add %a %b : (tensor<2x2x!quant.uniform<i8:f32, 0.025:-1>>, tensor<2x2x!quant.uniform<i8:f32, 0.075:-1>>) -> tensor<2x2x!quant.uniform<i8:f32, 0.15:-1>>
```

can be realized using TOSA rescales operations, as shown below

```
%scaled_a = tosa.rescale %a {input_zp = -1 : i32, multiplier = [1431655765 : i32], shift = [13 : i32]} : (tensor<2x2xi8>) -> tensor<2x2xi32>

%scaled_b = tosa.rescale %b {input_zp = -1 : i32, multiplier = [1073741824 : i32], shift = [11 : i32]} : (tensor<2x2xi8>) -> tensor<2x2xi32>

%sum = tosa.add %scaled_a %scaled_b : (tensor<2x2xi32>, tensor<2x2xi32>) -> tensor<2x2xi32>

%scaled_sum = tosa.rescale %sum {input_zp = -1 : i32, multiplier = [1073741824 : i32], shift = [50 : i32]} : (tensor<2x2xi32>) -> tensor<2x2xi8>
```

I am not sure how the multipliers and shifts, as mentioned for `%scaled_a` and `%scaled_b`, are derived. Any help on how to derive them would be very useful.

On my side, I was following quantization_util.cc and math_utils.h, but the calculations do not add up.

For example, following math_utils.h, the scale for `%a`, which is 0.025, should be represented by the multiplier, round, and shift values
multiplier = 0.8 * 2^31 = 1717986918
round = 2^35 = 34359738368
shift = 36

such that
std::floor(0.025 + 0.5) = (static_cast<int64_t>(multiplier) + round) >> shift

I'm a bit surprised by the formula here; the file you're linking to indeed indicates:

// quantized_val = floor(input_val * inv_scale + 0.5f) + zero_point

but I don't know where this 0.5f is coming from. It's not part of the MLIR uniform quant types as far as I can tell: https://mlir.llvm.org/docs/Quantization/#affine-to-real
 
Also, I don't understand how the multipliers are computed, but note that 1431655765 is (2^32-1)*1/3 and 1073741824 is 2^31/2.

And in the quant.uniform types: 0.15 / 0.075 = 2 ; 0.075 / 0.025 = 3

-- 
Mehdi



Eric Kunze

unread,
May 3, 2023, 2:03:29 PM5/3/23
to Mehdi AMINI, Sandeep Dasgupta, OpenXLA Discuss, Eugene Burmako

Hi all, sorry for the slow response.

 

Focusing on your original question:

 

The integer rescaling requires knowing which operator the rescale is being used for. With the example of the add operation (c = a + b), you need to derive the rescale parameters from both inputs. It will look something like this:

 

  • rescale_a = scale_a / scale_add
  • rescale_b = scale_b / scale_add
  • rescale_sum = scale_add / scale_c

 

where scale_add (the quantized scale at which the add is performed) is chosen so that the add cannot overflow the accumulator (in this case int32) and has good precision.

For example, for 8-bit data with zero point we make scale_add = max(scale_a, scale_b)/(1<<n) with n=20 since 20+9+1<=31.

You can see the scale calculation in the TensorFlow Lite to TOSA legalization code here: https://github.com/tensorflow/tensorflow/blob/60f7a770d64cb6cb3a93f84c291272ec51304d31/tensorflow/compiler/mlir/tosa/transforms/legalize_tfl.cc#L701

 

There are multiple possible scale_add values. When we are trying to match TensorFlow Lite results we make the same choice that the TFL implementation chooses for scale_add.

 

For other operators (like multiply), you would choose different rescales based on the needs of the operator.

 

These rescale calculations can be done if your quantization parameters are floating-point or expressed as multiplier and shift. The multiplier/shift option has the advantage of being bit exact.
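As a concrete sketch of that derivation for the add case, using the floating-point scales from the example earlier in the thread (this is not the actual TFLite/TOSA legalization code; each ratio would still be converted to an integer multiplier/shift pair for the tosa.rescale ops):

```cpp
// Sketch of the add rescale derivation described above, using float scales.
// Not the actual TFLite/TOSA legalization code.
#include <algorithm>
#include <cstdio>

int main() {
  const double scale_a = 0.025, scale_b = 0.075, scale_c = 0.15;
  const int n = 20;  // moves the "interesting" bits higher; matches the TFL choice
  // Scale at which the add itself is performed in the int32 accumulator.
  const double scale_add = std::max(scale_a, scale_b) / (1 << n);
  const double rescale_a = scale_a / scale_add;    // applied to input a
  const double rescale_b = scale_b / scale_add;    // applied to input b
  const double rescale_sum = scale_add / scale_c;  // applied to the sum
  std::printf("%g %g %g\n", rescale_a, rescale_b, rescale_sum);
  // Each ratio would then be expressed as an integer (multiplier, shift)
  // pair for the corresponding tosa.rescale op.
  return 0;
}
```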

 

Thanks,
Eric

 

On Tue, May 2, 2023 at 4:50 PM 'Sandeep Dasgupta' via OpenXLA Discuss <openxla...@openxla.org> wrote:

I am not sure how the multipliers and shifts, as mentioned for `%scaled_a` and `%scaled_b`, are derived. Any help on how to derive them would be very useful.


Mehdi AMINI

unread,
May 3, 2023, 3:49:14 PM5/3/23
to Eric Kunze, Sandeep Dasgupta, OpenXLA Discuss, Eugene Burmako
On Wed, May 3, 2023 at 11:03 AM Eric Kunze <Eric....@arm.com> wrote:

Hi all, sorry for the slow response.

 

Focusing on your original question:

 

The integer rescaling requires knowing which operator the rescale is being used for. With the example of the add operation (c = a + b), you need to derive the rescale parameters from both inputs. It will look something like this:

 

  • rescale_a = scale_a / scale_add
  • rescale_b = scale_b / scale_add
  • rescale_sum = scale_add / scale_c

 

where scale_add (the quantized scale at which the add is performed) is chosen so that the add cannot overflow the accumulator (in this case int32) and has good precision.

For example, for 8-bit data with zero point we make scale_add = max(scale_a, scale_b)/(1<<n) with n=20 since 20+9+1<=31.


What does the 20+9+1 correspond to?

I think the coefficients in Sandeep's example match this formula, except with n=31 for some reason.

Thanks

-- 
Mehdi

Eric Kunze

unread,
May 3, 2023, 7:55:58 PM5/3/23
to Mehdi AMINI, Sandeep Dasgupta, OpenXLA Discuss, Eugene Burmako

> What does the 20+9+1 correspond to?

 

When you subtract the 8-bit zero point from the 8-bit value you get a range of 9 bits.

The 1 bit is because you can add at most 1 bit when adding two numbers together.

The 20 is somewhat arbitrary but moves the “interesting” bits higher in the 32-bit value to maximize the precision of the result.

 

Earlier in the code we point at the TFL add kernel code as the reference, as that is the source of the 20. An add implemented for a different framework could choose a number other than 20, depending on how close we want to get to that framework.

https://github.com/tensorflow/tensorflow/blob/60f7a770d64cb6cb3a93f84c291272ec51304d31/tensorflow/compiler/mlir/tosa/transforms/legalize_tfl.cc#L695-L700
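To make the bit budget concrete, here is a small standalone check (my own illustration, not code from either repository):

```cpp
// Worst-case magnitude check for the 20 + 9 + 1 bit budget described above.
// My own illustration, not code from the TOSA or TFLite repositories.
#include <cassert>
#include <cstdint>

int main() {
  // int8 value minus int8 zero point spans [-255, 255]: 9 bits including sign.
  const int64_t max_centered = 255;
  const int n = 20;  // left shift applied before the add
  const int64_t max_scaled = max_centered << n;
  // Adding two such values adds at most one more bit: 20 + 9 + 1 = 30 <= 31,
  // so the sum still fits in a signed 32-bit accumulator.
  const int64_t max_sum = 2 * max_scaled;
  assert(max_sum <= INT32_MAX);
  return 0;
}
```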

 

Thanks,

Eric

Sandeep Dasgupta

unread,
May 3, 2023, 8:58:57 PM5/3/23
to OpenXLA Discuss, Eric Kunze, Sandeep Dasgupta, OpenXLA Discuss, Eugene Burmako, Mehdi AMINI
Thanks a lot (@Stella, @Mehdi, @Eric) for all the valuable pointers and help.
 
@Mehdi >> It's not part of the MLIR uniform quant types as far as I can tell ...
You are right, and this also struck me at some point. The only information I recall from my conversation with Songyi Han is that using "round half up" as the rounding mode is the approach most commonly used for quantization, for efficiency.
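For reference, a minimal sketch of what that round-half-up quantization looks like (my own illustration; the actual TFLite code may differ in types and clamping):

```cpp
// Round-half-up quantization, as in the comment quoted above. My own
// illustration; the actual TFLite code may differ in types and clamping.
#include <cmath>
#include <cstdint>

int8_t Quantize(float value, float inv_scale, int32_t zero_point) {
  // floor(x + 0.5) rounds ties toward +infinity ("round half up").
  int32_t q = static_cast<int32_t>(std::floor(value * inv_scale + 0.5f)) + zero_point;
  if (q < -128) q = -128;
  if (q > 127) q = 127;
  return static_cast<int8_t>(q);
}
```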

@Eric >> These rescale calculations can be done if your quantization parameters are floating-point or expressed as multiplier and shift. The multiplier/shift option has the advantage of being bit exact.

A few things are evident to me from the conversation:
1. Once the uniform quantized scales are provided, the rescale ops can be generated, with proper parameters, for each different op, as mentioned in the link.
2. Having the scales of the uniform quantized type expressed as multiplier and shift has the advantage of being bit exact and ensuring portability across implementations.

With that being said, all we need, from the POV of addressing Consider introducing QuantizedTensorType based on integer multiplier/shift #1404, is to provide a well-defined conversion (to and fro) between the fp scale and the integer multiplier and shift.
Towards that, the conversions defined in QuantUtils.cpp could be a reasonable choice to begin with. In fact, that is what is used to compute the rescale parameters shown above. (@Eric, correct me if I am off here.)
Using the conversion, the relation between a double scale (say double_scale) and the corresponding integer parameters (quantized_multiplier, shift) can be established as follows:

Converting double_scale to quantized_multiplier and shift

Let's say we use, for a 32-bit scale width, computeMultiplierAndShiftTosaScale32 to get the quantized_multiplier and shift corresponding to a double_scale. That is:
computeMultiplierAndShiftTosaScale32(double_scale, &quantized_multiplier, &shift);

Converting quantized_multiplier and shift back to a quantized scale

```
const int64_t round = static_cast<int64_t>(1) << (shift - 1);
int64_t result1 = static_cast<int64_t>(quantized_multiplier) + round;
int64_t result2 = result1 >> shift;
int64_t result3 = std::clamp(result2,
    static_cast<int64_t>(std::numeric_limits<int32_t>::min()),
    static_cast<int64_t>(std::numeric_limits<int32_t>::max()));
quantized_scale = static_cast<int32_t>(result3);
```

Relation between double_scale and quantized_scale
assert(static_cast<int32_t>(std::floor(double_scale + 0.5)) == quantized_scale);  (ref)

I note that the conversion introduces double rounding: first when converting the float32 scale to a (multiplier, shift) pair, and second when converting the (multiplier, shift) pair back to float32.

A few questions for the team:
  • Do we agree on the above as the defined conversion? Or are there other considerations that need to be taken care of?
  • Can this conversion be used to generate rescale operators with the desired bit-exactness? IMO, this question is relevant because of the loss of precision of the original floating-point scales. In other words, are the rescale parameters derived from the (multiplier, shift), as opposed to derived from the floating-point scales, "acceptable"?

Looking forward to your feedback.

Eric Kunze

unread,
May 4, 2023, 4:09:29 PM5/4/23
to Sandeep Dasgupta, OpenXLA Discuss, Sandeep Dasgupta, OpenXLA Discuss, Eugene Burmako, Mehdi AMINI

Thanks Sandeep. Pulling your questions from below:

 

    • Do we agree on the above as the defined conversion? Or are there other considerations that need to be taken care of?
      • (Eric) – Yes, we can agree on this conversion; I think it is sufficient.
    • Can this conversion be used to generate rescale operators with the desired bit-exactness? IMO, this question is relevant because of the loss of precision of the original floating-point scales. In other words, are the rescale parameters derived from the (multiplier, shift), as opposed to derived from the floating-point scales, "acceptable"?
      • (Eric) – Yes, this can generate rescale operators with bit exactness. Once we have the multiplier and shift in integer form, we can define conversions in a bit-exact way since they are integers.
      • (Eric) – The result might differ from TFLite, where the scale calculations (such as scale_a / scale_add) are done in floating point compared to the integer-based versions, but the fully integer version would be defined in a bit-exact way. On the TOSA side, we would be able to implement this with the RESCALEs as you quoted earlier.

    Eric

    Sandeep Dasgupta

    unread,
    May 5, 2023, 8:46:58 PM5/5/23
    to Eric Kunze, OpenXLA Discuss, Eugene Burmako, Mehdi AMINI
    Thanks Eric, for clarifying my questions. 
    I am looking forward to sharing an outcome from my side soon. I will keep you updated on my progress.

    Sandeep Dasgupta

    unread,
    May 9, 2023, 5:45:59 PM5/9/23
    to OpenXLA Discuss, Sandeep Dasgupta, OpenXLA Discuss, Eugene Burmako, Mehdi AMINI, Eric Kunze
    Hello Eric
    I have a few follow up questions related to computeMultiplierAndShiftTosaScale32 and computeMultiplierAndShiftTosaScale16

    Q1: The above two variants are selected on the basis of a `scaleWidth`, which is determined here by isScale32, defined as:


    ```
    bool isScale32(mlir::quant::UniformQuantizedType output_element_type) {
      return (output_element_type.getStorageTypeIntegralWidth() == 8);
    }

    ```

    It is based on the output storage-type bitwidth: the 32-bit scaling computeMultiplierAndShiftTosaScale32 is used only when the output bit width is exactly 8, and computeMultiplierAndShiftTosaScale16 is used otherwise, including when the output bit width is > 8.
    I am not sure why that is. Or is there an implicit assumption that the output quantized type is 8 bits or below?

    Q2: What if the scale value used as input to computeMultiplierAndShiftTosaScale32 or computeMultiplierAndShiftTosaScale16 is very large? That could make the right-shift value here negative. Are there any assumptions in this code about the range of input scale values, other than being finite and positive?



    Regards,
    Sandeep

    Eric Kunze

    unread,
    May 9, 2023, 6:37:03 PM5/9/23
    to Sandeep Dasgupta, OpenXLA Discuss, Sandeep Dasgupta, OpenXLA Discuss, Eugene Burmako, Mehdi AMINI

    Hi Sandeep,

     

    Responses inline below.

     

    Thanks,

    Eric

     

    From: Sandeep Dasgupta <sda...@google.com>
    Date: Tuesday, May 9, 2023 at 2:46 PM
    To: OpenXLA Discuss <openxla...@openxla.org>
    Cc: Sandeep Dasgupta <sda...@google.com>, OpenXLA Discuss <openxla...@openxla.org>, Eugene Burmako <bur...@google.com>, Mehdi AMINI <joke...@gmail.com>, Eric Kunze <Eric....@arm.com>
    Subject: Re: Speccing StableHLO quantization

    Hello Eric
    I have a few follow up questions related to computeMultiplierAndShiftTosaScale32 and computeMultiplierAndShiftTosaScale16

    Q1: The above two variants are selected on the basis of a `scaleWidth`, which is determined here by isScale32, defined as:


    ```
    bool isScale32(mlir::quant::UniformQuantizedType output_element_type) {
      return (output_element_type.getStorageTypeIntegralWidth() == 8);
    }

    ```

    It is based on the output storage-type bitwidth: the 32-bit scaling computeMultiplierAndShiftTosaScale32 is used only when the output bit width is exactly 8, and computeMultiplierAndShiftTosaScale16 is used otherwise, including when the output bit width is > 8.
    I am not sure why that is. Or is there an implicit assumption that the output quantized type is 8 bits or below?

     

    Eric: If output bit width is > 8, we should use the 16-bit scaling, which should happen with the above conditional. The primary use case we have for the 16-bit scaling is scaling down TOSA’s CONV2D, as the output for 16-bit int is defined to be 48-bits. Then the 48-bit accumulator * 16-bit scale multiplier won’t overflow 64-bits in RESCALE.



    Q2: What if the scale value used as input to computeMultiplierAndShiftTosaScale32 or computeMultiplierAndShiftTosaScale16 is very large? That could make the right-shift value here negative. Are there any assumptions in this code about the range of input scale values, other than being finite and positive?

    Eric: In the TOSA specification, we have REQUIRE statements, which are like the StableHLO constraints. If a REQUIRE is violated, then the output of the graph is marked as 'unpredictable'. For RESCALE, we require the shift to be 2 <= shift <= 62, or the result is unpredictable. It's possible we don't have enough checks for extreme scale values on the TensorFlow side, and we should error out in the legalization to TOSA at that point.

    Regards,
    Sandeep

    Sandeep Dasgupta

    unread,
    May 9, 2023, 8:10:28 PM5/9/23
    to Eric Kunze, OpenXLA Discuss, Eugene Burmako, Mehdi AMINI
    Thanks Eric, for the reply and all the information.  

    I am very excited to share that we now have a PR, Introduce the specification of quantized type with multiplier and shift #1475, to extend this discussion to some concrete artifacts. Also, as part of the review, I am hoping to address some of the open questions mentioned in the PR description using your feedback.
     
    Looking forward to your review!

    Regards,
    Sandeep 

    Sandeep Dasgupta

    unread,
    May 10, 2023, 5:53:47 PM5/10/23
    to OpenXLA Discuss, Sandeep Dasgupta, OpenXLA Discuss, Eugene Burmako, Mehdi AMINI, Eric Kunze
    Hello Team,

    Please find the PR related to the quantized convolution op, along with the specifications for a few other ops on which the convolution specification depends, for example pad and reverse.

    A few details about the PR
    1. Given q = a tensor with uniform quantized type, the PR covers the semantics of the static range quantized convolution(q, q). The specification is very similar to the earlier proposed quantized dot_general op.
    2. Given that many ops support both per-tensor and per-axis quantization schemes, it would be redundant to specify for each op the semantics of the computation involved in both schemes. To avoid that, we decided that the op-level semantics will only specify the per-tensor semantics, whereas the semantics of per-axis computation will be described generally in one place.

    Please let me know your review feedback.

    Thanks
    Sandeep

    Sandeep Dasgupta

    unread,
    May 15, 2023, 3:06:02 PM5/15/23
    to OpenXLA Discuss, Sandeep Dasgupta, OpenXLA Discuss, Eugene Burmako, Mehdi AMINI, Eric Kunze
    Hello Folks,

    Let's discuss the long-term plan for speccing StableHLO quantization. To date, we have specified the semantics of quantized AddOp, DotGeneralOp, SliceOp, TransposeOp, and ReshapeOp. Moreover, we have the draft specification for ConvolutionOp, PadOp, and ReverseOp (relevant PR) available for review.
    To further our efforts, I was pondering what other operations would be of interest for this effort, and how we should prioritize those operations.
    To start the discussion, let me begin with a few operation categories and propose some candidates in each one, following the lexicographical order of the operator names.  In Q2, I was thinking of prioritizing the following ops which are aligned with some of the use cases that we have in mind at Google.

    Element-wise operations (12)
    abs
    clamp
    compare
    divide
    logistic
    maximum
    minimum
    multiply
    sqrt
    rsqrt
    subtract
    tanh

    Operations for generating/converting quantized tensors (4)
    constant
    dequantize
    iota
    quantize

    Data movement operations (4)
    broadcast_in_dim
    concatenate
    gather
    scatter

    Reduction operations (2)
    reduce
    reduce_window

    Control-flow operations (3)
    case
    if
    while

    If you can share, how does this compare with the list of ops that you find important for your use cases? If there are some high-priority ops that were not included in the list above, let's discuss and potentially collaborate on speccing them.




    With Thanks,
    Sandeep

    Logan (Tzu-hsiang) Chien

    unread,
    May 17, 2023, 1:27:19 PM5/17/23
    to OpenXLA Discuss, Sandeep Dasgupta, OpenXLA Discuss, Eugene Burmako, Mehdi AMINI, Eric Kunze
    Hi all,

    This is Logan from the Google EdgeTPU team. We are currently reviewing these ops and their usages in real static range quantization models.

    In terms of priority, I think the following ops should have higher priority to get an MVP:

    (in the order of importance)
    reduce
    reduce_window
    maximum
    minimum
    multiply
    divide

    reduce and reduce_window will be more complicated. IMO, we should start with SumPooling/AvgPooling/MaxPooling/MinPooling first.

    We will share more details soon.

    Regards,
    Logan

    Sandeep Dasgupta

    unread,
    May 17, 2023, 8:53:43 PM5/17/23
    to OpenXLA Discuss, Logan (Tzu-hsiang) Chien, Sandeep Dasgupta, OpenXLA Discuss, Eugene Burmako, Mehdi AMINI, Eric Kunze

    Hello Folks,

    As the pull request for convolution is nearing approval, I believe it is an opportune time to introduce the PR for the specification of the uniform_quantize and uniform_dequantize ops.
    Please let me know your feedback. 

    Thanks
    Sandeep

    Sandeep Dasgupta

    unread,
    May 24, 2023, 4:06:13 PM5/24/23
    to OpenXLA Discuss, Sandeep Dasgupta, Logan (Tzu-hsiang) Chien, OpenXLA Discuss, Eugene Burmako, Mehdi AMINI, Eric Kunze
    Hello All,
    I have proposed spec updates for a few ops whose specs have already landed in the StableHLO specification doc. The update mostly captures missing constraints for the per-axis quantization scheme.

    Summary

    The PR augments a few ops, whose specs are already published, with constraints in the following way:

    Constraints added w.r.t. the per-axis quantization scheme:

    • transpose: Transpose the quantization_dimension of the output (see the sketch after this list).
    • reshape: Avoid merging or splitting of elements along the quantization dimension of the operand.
    • slice: Slice the zero_points(operand) and scales(operand).
    • reverse: Reverse the scales and zero_points of the result if quantization_dimension(operand) is among the dimensions to reverse.
    • pad: The op is simplified to use only the per-tensor scheme. The rationale for this is as follows:
      • If the padding_value is added along an axis other than quantization_dimension(operand), then zero_point(padding_value) should match zero_points(operand)[i] for all i. That means all the values in zero_points(operand) are the same, which is basically the per-tensor case.
      • The only meaningful case is if the padding_value is added only along quantization_dimension(operand). In this case, the zero_points(result) and scales(result) need to be updated accordingly. I am not sure if there is a use case for this, and hence kept things simple by allowing only the per-tensor scheme for this op.
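    To make the transpose and slice constraints above concrete, here is a hypothetical sketch of how the per-axis parameters would follow the data movement. This is illustrative only (my own names and types), not the StableHLO reference implementation.

    ```cpp
    // Hypothetical illustration of how per-axis parameters follow transpose and
    // slice, as described above. Not the StableHLO reference implementation.
    #include <cstdint>
    #include <vector>

    struct PerAxisParams {
      std::vector<double> scales;
      std::vector<int64_t> zero_points;
      int64_t quantization_dimension;
    };

    // transpose: result dim i comes from operand dim permutation[i], so the
    // quantization dimension moves to wherever the operand's dimension ended up.
    PerAxisParams TransposeParams(const PerAxisParams& p,
                                  const std::vector<int64_t>& permutation) {
      PerAxisParams r = p;
      for (int64_t i = 0; i < static_cast<int64_t>(permutation.size()); ++i) {
        if (permutation[i] == p.quantization_dimension) r.quantization_dimension = i;
      }
      return r;
    }

    // slice: scales and zero_points are sliced along the quantization dimension.
    PerAxisParams SliceParams(const PerAxisParams& p, int64_t start, int64_t limit,
                              int64_t stride) {
      PerAxisParams r = p;
      r.scales.clear();
      r.zero_points.clear();
      for (int64_t i = start; i < limit; i += stride) {
        r.scales.push_back(p.scales[i]);
        r.zero_points.push_back(p.zero_points[i]);
      }
      return r;
    }
    ```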

    Constraints added to comply with other similar specs

    • add: For QuantizedTensorType, we introduced a new non-terminal QuantizedElementType which is different from the non-terminal ElementType used for TensorType. The proposed change accommodates that.

    Please let me know your review comments. 

    Sandeep Dasgupta

    unread,
    May 25, 2023, 1:29:22 PM5/25/23
    to OpenXLA Discuss, Sandeep Dasgupta, Logan (Tzu-hsiang) Chien, OpenXLA Discuss, Eugene Burmako, Mehdi AMINI, Eric Kunze
    Hello All,
    Please find the PR for the specifications of the quantized ReduceOp, ReduceWindowOp, and SelectAndScatterOp #1538. Among them, ReduceOp and ReduceWindowOp were highlighted as priorities in our previous discussion.
    The specifications of these ops are proposed with similar considerations for input and output types as other reduction ops like dot_general. Please find the description at https://github.com/openxla/stablehlo/pull/1538#issue-1726209092 for more details.


    Please let me know the review comments. 
    Thanks
    Sandeep

    Sandeep Dasgupta

    unread,
    May 26, 2023, 1:38:02 PM5/26/23
    to OpenXLA Discuss, Sandeep Dasgupta, Logan (Tzu-hsiang) Chien, OpenXLA Discuss, Eugene Burmako, Mehdi AMINI, Eric Kunze
    Hello All

    After further consideration of quantized ops, it seems to me that not all ops have a use case for per-axis quantization scheme specifications. It is evident in https://github.com/openxla/stablehlo/pull/1535 that specifying ops with per-axis quantization demands considerable adjustments to operation constraints. Moreover, all these constraints are supposed to be statically checked by the StableHLO verifier and/or type inference, and without any use case all this code would be dead.

    In light of the above, I propose that we consider per-axis quantization during the specification of an op only if there is a use case for it. Based on my understanding (ref), convolution and dot_general seem like valid choices for allowing the per-axis quantization scheme.
    I would like to discuss further whether there are any other ops that we should consider for per-axis quantization.

    Please let me know your input.

    Thanks
    Sandeep

    Sandeep Dasgupta

    unread,
    May 31, 2023, 11:30:44 PM5/31/23
    to OpenXLA Discuss, Sandeep Dasgupta, Logan (Tzu-hsiang) Chien, OpenXLA Discuss, Eugene Burmako, Mehdi AMINI, Eric Kunze
    Hello Folks,

    Please find the PR for the specification of quantized element-wise ops #1566, which addresses a total of 47 ops. I have categorized the ops in the following manner to assist the review process.

    Summary

    The PR proposes the specification for quantized element-wise operations (47 in total).

    Details
    • Dequantize-float-compute-quantize (30): The semantics of all these ops follow dequantize -> float computation -> quantize (see the sketch after the list below). All the ops in this category follow per-tensor quantization granularity. In order to simplify the specification, we introduced three meta functions for the first three categories of ops.

      • Binary (8): add, atan2, maximum, minimum, multiply, subtract, power, remainder, divide.
      • Unary (20): cbrt, ceil, cosine, floor, sine, tanh, exponential, exponential_minus_one, log, log_plus_one, rsqrt, sqrt, logistic, round_nearest_afz, round_nearest_even, is_finite, abs, sign, negate.
      • Ternary (1): clamp.
      • Others (1): reduce_precision.
    • Computation in the quantized integer domain (13): All the ops in this category follow per-tensor quantization granularity.

      • compare:
        • Operands lhs and rhs have the same quantized type. The output is a tensor of boolean type.
      • select, count_leading_zeros, popcnt, shift_left, shift_right_arithmetic, shift_right_logical, and, not, or, xor:
        • Operands and results have the same quantized type.
      • bitcast_convert: operand and result have the same quantized_element_type, except that storage_type(operand) and storage_type(result) may differ.
      • map: All the inputs have the same quantized_element_type, which may differ from quantized_element_type(result).
    • Currently not supporting quantized types (4):

      • imag, real, complex: These seem very specific to the complex type.
      • convert: A convert from a quantized type to any type can be realized using stablehlo.uniform_dequantize followed by stablehlo.convert to convert the dequantized floating-point type to the type of choice. Similarly, a convert from any type to a quantized type can be realized using stablehlo.convert to a floating-point type followed by stablehlo.uniform_quantize.

    For the ops under "Currently not supporting quantized types", I am happy to discuss any use case for supporting quantized types for these ops.
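    As a quick illustration of the dequantize -> float compute -> quantize pattern used for the first category, here is a minimal sketch for a binary op under per-tensor quantization. This is my own illustration; the function names, rounding, and clamping details are not taken from the spec text.

    ```cpp
    // Sketch of dequantize -> float compute -> quantize for a binary op under
    // per-tensor quantization. Names, rounding, and clamping are mine, not the
    // spec's.
    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <functional>

    struct PerTensorParams {
      double scale;
      int32_t zero_point;
    };

    double Dequantize(int8_t v, const PerTensorParams& p) {
      return (static_cast<int32_t>(v) - p.zero_point) * p.scale;
    }

    int8_t Quantize(double v, const PerTensorParams& p) {
      const double q = std::round(v / p.scale) + p.zero_point;
      return static_cast<int8_t>(std::clamp(q, -128.0, 127.0));
    }

    // dequantize_op_quantize(f, lhs, rhs): evaluate f in float, then requantize.
    int8_t DequantizeOpQuantize(const std::function<double(double, double)>& f,
                                int8_t lhs, const PerTensorParams& lhs_params,
                                int8_t rhs, const PerTensorParams& rhs_params,
                                const PerTensorParams& result_params) {
      return Quantize(f(Dequantize(lhs, lhs_params), Dequantize(rhs, rhs_params)),
                      result_params);
    }
    ```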

    Looking forward to your feedback.

    Sandeep Dasgupta

    unread,
    Jun 1, 2023, 3:09:09 PM6/1/23
    to OpenXLA Discuss, Sandeep Dasgupta, Logan (Tzu-hsiang) Chien, OpenXLA Discuss, Eugene Burmako, Mehdi AMINI, Eric Kunze

    Hello Folks,

    Here is the announcement of a PR which proposes the specification of the "remaining" data movement ops. Specifically, the PR covers concatenate, sort, broadcast_in_dim, gather, scatter, dynamic_slice, and dynamic_update_slice. The rest are covered in a separate PR.

    I note that

    • The specifications are very similar to each other, and hence I thought about putting them in a single PR.
    • Following the discussion item in the link, I proposed the specifications using per-tensor quantization granularity. Feel free to provide your input if any of these ops has a use case for supporting the per-axis scheme.

    Looking forward to your feedback.

    Thanks
    Sandeep

    Sandeep Dasgupta

    unread,
    Jun 9, 2023, 6:35:07 PM6/9/23
    to OpenXLA Discuss, Sandeep Dasgupta, OpenXLA Discuss, Eugene Burmako

    Hello All
    As brought up in the issue, the StableHLO specification in its current form does not cover hybrid convolution or dot_general ops, but it seems that hybrid ops, in general, can have some practical use cases. To set the stage for further discussion on the topic, I have drafted a post where I lay out some details and some pros and cons of expressing hybrid ops in StableHLO.
    As part of the discussion, I would like to understand some practical use cases for expressing hybrid ops in StableHLO. For example, it would be great to have some concrete example(s) of StableHLO op(s) which, when expressed in hybrid form, would benefit the compilation flow in a certain way.

    Please feel free to provide your input.

    Regards,
    Sandeep

    Sandeep Dasgupta

    unread,
    Jun 14, 2023, 11:41:45 AM6/14/23
    to OpenXLA Discuss, Sandeep Dasgupta, OpenXLA Discuss, Eugene Burmako
    Hello Folks

    From the

    It seems like the TFLite fully_connected op does not support per-axis quantization.

    The per-axis scheme was added to the StableHLO dot_general op mainly because dot_general is used in the specification of the convolution op. IMO, that should not be the only reason to include the per-axis scheme in dot_general.

    I opened a ticket to revisit the spec of dot_general and discuss potential use cases for it to support the per-axis quantization scheme. Please let me know your input.


    Thanks

    Sandeep

    Sandeep Dasgupta

    unread,
    Jun 25, 2023, 7:26:35 PM6/25/23
    to OpenXLA Discuss, Sandeep Dasgupta, OpenXLA Discuss, Eugene Burmako

    Hello Folks,

    Here is the announcement of a PR which proposes the quantization specification of all the remaining StableHLO ops (other than the ones already proposed). Please refer to the description of the PR to get a consolidated summary of the changes proposed by the PR.

    Looking forward to your feedback.


    Thanks

    Sandeep 


    Sandeep Dasgupta

    unread,
    Jun 30, 2023, 7:14:33 PM6/30/23
    to OpenXLA Discuss, Sandeep Dasgupta, OpenXLA Discuss, Eugene Burmako

    Hello Folks
    Please find the RFC for supporting quantized specification for reduce op. Please let me know your feedback. 

    Thanks
    Sandeep

    Sandeep Dasgupta

    unread,
    Jul 26, 2023, 1:50:02 PM7/26/23
    to OpenXLA Discuss, Sandeep Dasgupta, OpenXLA Discuss, Eugene Burmako
    Hello Folks,
    I have a PR (https://github.com/openxla/stablehlo/pull/1700) proposing a formalization of the quantization semantics for dot_general and convolution along similar lines to how most of the element-wise operations are specified, that is, using the dequantize-op-quantize strategy. This has benefits, as mentioned in the PR description.
    Looking forward to your feedback on this.   


    Regards,
    Sandeep

    Sandeep Dasgupta

    unread,
    Jul 26, 2023, 5:13:24 PM7/26/23
    to OpenXLA Discuss, Sandeep Dasgupta, OpenXLA Discuss, Eugene Burmako
    Hello Everybody
    Added a PR to define the semantics of per-axis quantization scheme. Feel free to take a look and let me know your feedback.

    Thanks,
    Sandeep

    Sandeep Dasgupta

    unread,
    Aug 22, 2023, 2:09:07 PM8/22/23
    to OpenXLA Discuss, Sandeep Dasgupta, OpenXLA Discuss, Eugene Burmako
    Hello Folks,

    Please find the per-axis quantized specification PR for reshape, transpose and broadcast_in_dim ops. 
    Looking forward to your feedback. 

    Thanks

    Sandeep Dasgupta

    unread,
    Oct 9, 2023, 11:49:24 AM10/9/23
    to OpenXLA Discuss, Sandeep Dasgupta, OpenXLA Discuss, Eugene Burmako
    Hello Folks,
    Re: RFC for supporting quantized specification for reduce op
    The RFC is approved now. Thanks to everybody for the feedback!
    There will be follow up PRs to implement the proposals made in the RFC.

    Thanks
    Sandeep

    Sandeep Dasgupta

    unread,
    Oct 9, 2023, 5:30:45 PM10/9/23
    to OpenXLA Discuss, Sandeep Dasgupta, OpenXLA Discuss, Eugene Burmako
    Hello Folks,
    Please find the PR to define the semantics of reduce, reduce_window and select_and_scatter ops following the approved RFC.
    Next steps: Once the PR is approved, the plan is to propose the corresponding changes in the verifier/shape functions for these ops.

    Please let me know your feedback!

    Regards,
    Sandeep
