[RFC] Increase StableHLO Compatibility Guarantees.

Pulkit Bhuwalka

Jun 23, 2023, 7:20:45 PM

to openxla...@openxla.org

Hi everyone,

The original StableHLO Compatibility RFC promised 5 years of backward and forward compatibility within a major release, and backward compatibility for serialized artifacts across 1 major release. This guarantee was then reduced in a follow-up RFC.

I propose we bring the stability guarantees back to the same level, since this is critical for mobile ML deployments where the execution environment is not tightly controlled by the model author. Details below.

ML models deployed on-device (e.g., Android) need strict backward and forward compatibility guarantees.

A deployed ML Model should never break due to a software update. This could be an update to the ML runtime, Mobile OS, or the App itself. OEMs regularly update phones, which can break functionality if the Opset changes.
ML models are often long-lived. Even when the application is updated, the model it uses may be older or the application team may not have access to the source model it uses. Said differently, a mobile ML runtime needs to support older versions of StableHLO Ops.
There are a significant number of users on old Mobile/Android phones, often 5+ years old. App developers should be able to target older phones when deploying their ML features, should they choose to.

Due to the above, it's essential that Opset definitions are maintained long-term. It's reasonable for us as a community to iterate on the Opset and utilize the VHLO mechanism to version them. But once we have ML models deployed in the market, especially on mobile phones, it's not feasible to update the execution environment.

Best,

Pulkit

Anush Elangovan

Jul 2, 2023, 6:39:19 PM

to Pulkit Bhuwalka, openxla...@openxla.org

Hi Pulkit,

Should we wait until we get broader adoption before extending the compatibility window, so we don't trade off development velocity?

Thanks

Anush

--
You received this message because you are subscribed to the Google Groups "OpenXLA Discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to openxla-discu...@openxla.org.
To view this discussion on the web visit https://groups.google.com/a/openxla.org/d/msgid/openxla-discuss/CAKsyoMCYKXZHTgtojNesc996NgcaN%2Bb84XSHfEPCrkf7zN621w%40mail.gmail.com.
For more options, visit https://groups.google.com/a/openxla.org/d/optout.

Jacques Pienaar

Jul 6, 2023, 3:53:34 PM

to Anush Elangovan, Pulkit Bhuwalka, openxla...@openxla.org

Hey Anush,

This is actually a good question for Eugene:

What is the cost to development velocity and who pays for it?

To Pulkit:

By when are you proposing there needs to be this guarantee?

Best,

Jacques

Pulkit Bhuwalka

Aug 17, 2023, 3:50:36 AM

to jpie...@openxla.org, Zichuan Wei, Cormac Brick, Jacques Pienaar, Anush Elangovan, openxla...@openxla.org

Hi folks,

Sorry I missed the replies to my RFC.

@Anush Elangovan - That's a great point. I would break it up into 2 pieces- (1) given the requirements for ODML deployment I mentioned above, do we think the compatibility requirements make sense, and (2) when do we introduce them.

Realistically, given the complexities of on-device ML deployment we can't feasibly ship StableHLO Ops until we have some compatibility guarantees. TFLite ships on billions of devices, and tens of thousands of Apps use it. We can't control their update cycles and can't afford models to break in the field. We really want to ship and support StableHLO, hence the push.

@Jacques Pienaar - To some extent, we as a community need to figure out when we're most okay with adding these guarantees. However, like I said, unless such a guarantee exists, any MLIR code which reads these Ops on device will fail backward compatibility.

Within the TFLite team, we are working on a separate document which goes into a lot more detail on the sorts of use cases in the field for on-device ML deployments and the requirements from the Opset. We'll share that soon; hopefully it'll help foster discussion and explain the rationale. We can feasibly handle a bunch of this complexity in the runtime ourselves.

Best,

Pulkit

Jacques Pienaar

Aug 23, 2023, 6:43:26 PM

to Pulkit Bhuwalka, Zichuan Wei, Cormac Brick, Jacques Pienaar, Anush Elangovan, OpenXLA Discuss

Hey,

Currently StableHLO has some period of forward and backward compatibility, so how long is sufficient and by when are both very good questions, and I think they are independent of MLIR. E.g., in the MLIR ecosystem we have some other widely used deployments that don't need this level and employ different mechanisms. It is quite interesting that there are already 3 solutions in this space, which would make for a good retrospective in ~2 years.

I mean more stability is probably good, but I'm mostly curious about where the cost is as that affects how deeply this needs to be discussed. For example, it was unclear if this affects anything beyond VHLO layer.

When StableHLO was forked off MHLO, there were a couple of ops already marked deprecated. Personally I'd like to not promise 5-year guarantees on ops that have had a foot in the grave for more than a year already. I know Eugene has some cleanups he wants to do too. All of this just affects timelines, not whether it should be done, though.

Best,

Jacques

Zichuan Wei

Sep 18, 2023, 4:35:12 PM

to OpenXLA Discuss, Jacques Pienaar, Zichuan Wei, Cormac Brick, Jacques Pienaar, Anush Elangovan, OpenXLA Discuss, Pulkit Bhuwalka

Hi everyone, we have spent more time exploring TFLite use cases and compatibility requirements. As mentioned in TensorFlow Lite update on StableHLO use, TFLite would like to consume StableHLO as an input source and leverage VHLO versioning facilities. In order to achieve this goal, we propose the following changes to the StableHLO compatibility - please let us know of any feedback you have!

Proposed Compatibility Window

On average, Android phones receive a 3-year OS update guarantee (e.g. the Pixel update policy and update policies from other companies), and such OS updates must not break existing on-device models. Given that, we would like to propose a backward compatibility window of 3 years for StableHLO to support this use case. Additionally, new models that don’t use new features must be runnable on these supported OS versions, so we would also like to propose a 3-year forward compatibility window. We would also like to work with you to determine when the compatibility window should be extended.
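
To make the proposed window concrete, here is a minimal Python sketch of how a consumer might decide whether an artifact falls inside a 3-year backward / 3-year forward window. The function name `can_consume`, the use of release dates as the version axis, and the 365-day year are all assumptions made for illustration; this is not part of StableHLO.

```python
from datetime import date, timedelta

# Window lengths taken from the proposal above, approximated in days.
BACKWARD_WINDOW = timedelta(days=3 * 365)
FORWARD_WINDOW = timedelta(days=3 * 365)


def can_consume(artifact_released: date, consumer_released: date) -> bool:
    """Return True if the consumer release is expected to read the artifact.

    Assumes the artifact uses no features newer than the consumer, which
    is the condition the forward compatibility proposal relies on.
    """
    if artifact_released <= consumer_released:
        # Older artifact, newer consumer: backward compatibility.
        return consumer_released - artifact_released <= BACKWARD_WINDOW
    # Newer artifact, older consumer: forward compatibility.
    return artifact_released - consumer_released <= FORWARD_WINDOW
```

For example, a runtime released on 2023-06-01 would accept a model serialized by a 2021-01-01 release, but a model from a 2019-01-01 release would fall outside the backward window.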

Proposed VHLO Compatibility Guarantees

Up until this point the StableHLO project has focused on compatibility guarantees from the opset perspective, without providing specific guarantees for implementation details like the VHLO dialect. However, on the TFLite side, we found VHLO to be really useful, and we would like to propose to formalize some of its properties - for the most part these properties are already maintained in practice, and this RFC proposes to formally document and maintain them:

A VHLO op's version number must change only by increment, and if and only if there is a change to operator behavior (e.g. add_v1 → add_v2).
VHLO ops must not be deleted within the compatibility window.
VHLO ops must always be convertible to StableHLO ops within the compatibility window using machinery maintained in the openxla/stablehlo repository (i.e. not an external tool).
VHLO programs must be roundtrippable with StableHLO ops (an equivalent of today's --vhlo-to-version='target=current' --vhlo-legalize-to-stablehlo --stablehlo-legalize-to-vhlo --vhlo-to-version='target=...'), meaning a VHLO program from an older version must be able to be converted to the StableHLO dialect and returned back to the original version number. This allows running StableHLO passes on an older VHLO program and re-serializing for the original version of that program.
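
The first three properties can be pictured with a toy registry, sketched here in plain Python. Everything in it is hypothetical: the op histories, the attribute names, and the helpers `check_registry` and `upgrade_to_current` are invented for illustration and do not reflect the real VHLO opset or its MLIR-based implementation.

```python
# Toy model: each op name maps to its known versions, oldest first.
# The histories below are hypothetical, not the real VHLO opset.
OP_VERSIONS = {"add": [1, 2], "dot": [1, 2, 3]}

# One upgrade step per (op name, from_version). Each step returns the
# attributes of the next version. Attribute names are made up.
UPGRADES = {
    ("add", 1): lambda attrs: {**attrs, "mode": "default"},
    ("dot", 1): lambda attrs: {**attrs, "precision": "default"},
    ("dot", 2): lambda attrs: {**attrs, "accumulate_type": "f32"},
}


def check_registry() -> None:
    """Check the first three properties on the toy registry."""
    for name, versions in OP_VERSIONS.items():
        # Versions only ever increment and none has been deleted.
        if versions != list(range(1, len(versions) + 1)):
            raise ValueError(f"{name}: versions must be 1..N with no gaps")
        # Every old version has an in-repo path towards the current op.
        for version in versions[:-1]:
            if (name, version) not in UPGRADES:
                raise ValueError(f"{name}_v{version}: missing upgrade step")


def upgrade_to_current(name: str, version: int, attrs: dict):
    """Apply upgrade steps one version at a time until the latest version."""
    latest = OP_VERSIONS[name][-1]
    while version < latest:
        attrs = UPGRADES[(name, version)](attrs)
        version += 1
    return name, version, attrs
```

Chaining single-step upgrades keeps the number of patterns linear in the number of versions, at the cost of walking every intermediate version when reading an old artifact.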

Proposed Documentation Enhancement

For developers who may interact directly with this serialized VHLO, we propose a documentation enhancement. Namely, there must be an easy way to access documentation detailing the changes between different versions of the same op. The exact mechanism can be determined in a follow-up RFC.

Best regards,
Zichuan Wei

Stella Laurenzo

Sep 18, 2023, 4:54:54 PM

to Zichuan Wei, OpenXLA Discuss, Jacques Pienaar, Cormac Brick, Jacques Pienaar, Anush Elangovan, Pulkit Bhuwalka

#4 (the round-trip requirement) seems like an infectious constraint, no? In the limit, it would mean that the semantics of StableHLO could never be changed (i.e. to break ops apart or combine them) without non-trivial logic to somehow roundtrip the exact versions. I might be missing something, but aside from convenience, I don't understand why you need round-tripping. However, I've never really understood why people are using these opsets in this way (versus lowering to something that is more amenable to re-targeting and having one-way conversions).

I think that the pathways here need to be one way, not bidirectional.

Zichuan Wei

Sep 25, 2023, 7:33:18 PM

to OpenXLA Discuss, Stella Laurenzo, OpenXLA Discuss, Jacques Pienaar, Cormac Brick, Jacques Pienaar, Anush Elangovan, Pulkit Bhuwalka, Zichuan Wei

Thanks Stella for your comment!

For on-device use cases, the server producing the IR tends to be more frequently updated than the on-device consumer. In order for the newer version of the on-server compiler to generate portable artifacts that can be parsed correctly by the on-device consumer, we will need the downgrade capability.

In addition, there is also a difference between when an artifact is created and when additional optimization is performed, e.g. a float model created a year ago needs to be quantized today, and the quantization passes are only written at the head. Then we first need to upgrade the model to the latest head so passes can be executed correctly.

I want to point out that these use cases are not unique to on-device, as the current stablehlo compatibility has already been providing such guarantees and we’re simply proposing to extend the guarantee to 3 years:

“Portable artifacts serialized by a new version of libStablehlo have the same semantics when deserialized by an old version of libStablehlo if these versions are built from openxla/stablehlo commits which are less than 1 month apart, unless the program is using new features introduced since the old version.”

As long as no new features are introduced during the IR upgrade, I think it’s reasonable for the user to be able to downgrade the model back to the original version. This allows for authoring passes on the latest opset, instead of on VHLO directly, and passes which don’t introduce new features can still leverage StableHLO compatibility guarantees.
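
The upgrade, run a pass at head, then downgrade flow can be sketched in plain Python as below. This is only a toy illustration: the `Op` class, the `mode` attribute on a hypothetical add_v2, and the helper `run_pass_at_head` are assumptions, not StableHLO or VHLO APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    version: int
    attrs: tuple = ()  # sorted (key, value) pairs, so ops compare by value


# Hypothetical change: add_v2 introduced a "mode" attribute, and the value
# "default" reproduces the add_v1 behaviour exactly.
NEW_ATTR, LEGACY_VALUE = "mode", "default"


def upgrade(op: Op) -> Op:
    """Rewrite add_v1 as add_v2 with the legacy-equivalent attribute."""
    if op.name == "add" and op.version == 1:
        attrs = dict(op.attrs)
        attrs[NEW_ATTR] = LEGACY_VALUE
        return Op("add", 2, tuple(sorted(attrs.items())))
    return op


def downgrade(op: Op) -> Op:
    """Rewrite add_v2 as add_v1, refusing if a new feature is in use."""
    if op.name == "add" and op.version == 2:
        attrs = dict(op.attrs)
        if attrs.pop(NEW_ATTR, LEGACY_VALUE) != LEGACY_VALUE:
            raise ValueError("op uses a feature that add_v1 cannot express")
        return Op("add", 1, tuple(sorted(attrs.items())))
    return op


def run_pass_at_head(program, pass_fn):
    """Upgrade an old program, run a pass written at head, downgrade again."""
    upgraded = [upgrade(op) for op in program]
    transformed = pass_fn(upgraded)
    return [downgrade(op) for op in transformed]
```

A pass that leaves `mode` at its legacy value round-trips cleanly, while a pass that sets it to some new value makes the downgrade fail loudly instead of silently changing semantics.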

Best,

Zichuan Wei

Feb 21, 2024, 3:56:05 PM

to OpenXLA Discuss, Zichuan Wei, Stella Laurenzo, OpenXLA Discuss, Jacques Pienaar, Cormac Brick, Jacques Pienaar, Anush Elangovan, Pulkit Bhuwalka

Hi all,

Today TFLite offers open-ended compatibility, and we are aware of some Android apps that ship model assets created more than 4 years ago (e.g. MobileNetV3 remains very popular). We propose to extend the backward compatibility window to 5 years. Additionally, the TFLite team will follow up by soliciting feedback from their developer community, as well as exploring other mechanisms to meet their developers' platform stability requirements, and assess if there is any scope to further refine the compatibility window in a future RFC.

Reflecting on the community feedback on round-tripping: this is not a requirement for TFLite. We do propose to extend the forward compatibility requirement to 2 years, in order to support TFLite and other community members on an annual release cycle.

All other proposed features remain the same:

VHLO op version number must only change by increment, if and only if there is a change to Operator behavior. (i.e. add_v1 → add_v2).
VHLO ops must not be deleted within the compatibility window.
VHLO ops must always be convertible to StableHLO ops within the compatibility window using machinery maintained in the openxla/stablehlo repository (i.e. not an external tool).

Best,

Zichuan Wei

Kevin Gleason

Apr 11, 2024, 4:02:49 PM

to OpenXLA Discuss, Zichuan Wei, Stella Laurenzo, OpenXLA Discuss, Jacques Pienaar, Cormac Brick, Jacques Pienaar, Anush Elangovan, Pulkit Bhuwalka

Thanks for the RFC Zichuan / Pulkit!

Overall I’m supportive of this RFC: the compatibility window is based on existing known use cases for compatibility, it enables important use cases within the OpenXLA ecosystem (mobile deployment), there are other community members on similar annual update cycles who can leverage these guarantees, and we now have over a year of experience evolving the opset under forward/backward compatibility guarantees, with maintenance costs proving to be fairly low. I’ve spent a bit of time thinking / iterating within our team on maintenance costs, evolution implications, and potential alternatives. A brief summary:

The maintenance cost boils down to maintaining an ever-growing VHLO opset, along with the MLIR passes and tests, including:

Maintain the VHLO opset, which grows for all StableHLO opset changes.
Maintain IR upgrade / downgrade patterns, which grow at about the same pace as VHLO.
Applying upstream MLIR changes to the VHLO opset (ex: properties).
Maintaining compatibility tests for all versions within the compatibility window.

In practice, this has not amounted to much effort at all -- a few hours here and there for upstream MLIR integration, and a little extra per-op effort which is unavoidable regardless of the duration of compatibility.

The evolution cost is another potential concern. I don't foresee many changes to opset evolution with extended compatibility, nor changes to the review process. We aim to provide means of experimenting / escape hatches, and an RFC review process for standardizing useful features. Although we currently guarantee 1mo forward and 6mo backward compatibility, in practice we have >1yr forward/backward compatibility today, which I don’t believe has greatly hindered evolution. Back when authoring the initial StableHLO Compatibility RFC, I went through the MHLO dialect history, and in the ~3yrs I looked at there were no opset changes that would have required breaking compatibility. If push came to shove we could add something like V2 ops/attrs/types to StableHLO to keep evolving, but we should do our best to avoid that. Overall we may incur some tech debt, but given our current experience, it will likely be very manageable.

We also explored several alternatives, almost all of which push the burden of maintenance to on-device users, as deserialization is where the compatibility issues are likely to occur. It seems likely that on-device compilers will want to use StableHLO with similar compatibility guarantees, and pushing the maintenance elsewhere will have additional costs on the entire on-device ecosystem, likely amounting to a similar amount of maintenance on StableHLO maintainers, on-device deployment teams, and on-device compilers. Given that, I'm on board with proceeding with this RFC, as all of this amounts to a very reasonable cost to enable on-device StableHLO.

Interested in any feedback! Next step otherwise is to discuss when extended compatibility should kick in.