PSA: Risky changes should be guarded by a Runtime Enabled Feature

138 views
Skip to first unread message

Rick Byers

unread,
Mar 29, 2023, 2:53:26 PM3/29/23
to blink-dev

Hey blink-dev,

Within Google we’ve had a lot of discussion over the past year about how we should make increasing use of flags and kill switches to reduce the risk of breaking changes, but I realized there hasn’t been discussion here on blink-dev.


Once a change hits stable it’s quite expensive and time-consuming to do an emergency patch, but much faster and cheaper to update a server config. For Google Chrome this means “Finch”, but other big chromium embedders like Edge also have their own system for setting flags dynamically. So increasingly we want to ensure that changes which might reasonably cause a regression are guarded by a flag wherever practical.


For Blink this simply means using a Runtime Enabled Feature (or other base::Feature). I think we’re all largely in the habit of using this for new APIs, but we should also be using one for all intentional breaking changes and this is something the API owners will be asking about.


See Flag Guarding Guidelines for more details and feel free to reach out to me and/or chrome-at...@google.com with questions or concerns.


Thanks,

   Rick


Charles Harrison

unread,
Mar 29, 2023, 3:50:03 PM3/29/23
to Rick Byers, blink-dev
Thanks for posting this Rick. One call-out that I think is worth mentioning is the distinction between enabled-by-default flags and disabled-by-default flags. The general guidelines for choosing between these is in the links you posted, but given that it can be sometimes a subjective call it might be worth clarifying in e.g. API owners feedback on a particular I2S. I often see requests to flag guard a feature but sometimes it's ambiguous in which direction.

--
You received this message because you are subscribed to the Google Groups "blink-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to blink-dev+...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/blink-dev/CAFUtAY9iR-n102RPZWfqpUGjYm9W2LpFFb5%2BodimGFUC8Dv8ZA%40mail.gmail.com.

Rick Byers

unread,
Apr 4, 2023, 2:16:03 PM4/4/23
to Charles Harrison, blink-dev
Thanks Charlie. Yes I should have been specific, I'm really talking about enabled-by-default flags here (eg. Runtime Enabled Feature with status=stable).

Note that our process for flipping a kill switch via finch includes first landing a CL that disables the flag (confirming it doesn't break any tests etc.). We're still learning how best to communicate these rare situations, but that'll be an ongoing process. In general I've certainly found it helpful when folks update their I2S threads with rollout status whenever it deviates from the original plan.

Rick

Charles Harrison

unread,
Apr 4, 2023, 2:24:33 PM4/4/23
to Rick Byers, blink-dev
Thanks Rick this is really helpful. I have been involved in a few changes where API OWNERS said "flag-gated" and the implementer chose to use a disabled-by-default flag which I didn't push back on. I think just I needed this recalibration :)

Rick Byers

unread,
Apr 4, 2023, 4:52:30 PM4/4/23
to Charles Harrison, blink-dev
Oh that's interesting. It's possible there are cases where disabled-by-default is the better choice. What do you see as the main trade offs? Obviously it's important that we're testing the default configuration - so either way, we shouldn't be relying on finch to change behavior except in the emergency situation (or intentional A/B tests of course). 

I was mainly thinking that for Runtime Enabled Features, we have the status mechanism that biases towards status=stable. I suppose breaking changes are more ambiguous. Eg. if we're removing an API that still has a Runtime Enabled Feature, then setting it back to no status to unship it could be a reasonable option. But I'd worry about the risk of confusion there. Eg. if someone sets it to status=test or status=experimental after it's gone to Chrome Stable with status=stable, there's risk of breaking changes showing up in official builds but not in a lot of chromium developer configurations that have --enable-experimental-web-platform-features set. I don't think that usage is a good fit for how Runtime Enabled Features are designed and communicated.

To me behavior changes of any type go through a flow of in-development -> under test -> experimental -> stable (with killswitch) -> mature (no flags). Once we hit the 'stable' feature in stable release for some period of time, I don't think there's really any going back - only the possibility of a new breaking change following this path. I guess this mirrors our launch process as well. You don't need an I2D&R to flip the killswitch on a new API in the first week of being on stable. But before long (maybe a couple weeks?) you cross some threshold where any change or disabling is to a launched feature and needs an I2D&R, and so logically a new Runtime Enabled Feature working up to being on-by-default. Perhaps we should be more thoughtful and explicit about this flow?

Rick

Charles Harrison

unread,
Apr 4, 2023, 8:29:02 PM4/4/23
to Rick Byers, blink-dev
For me, I really just needed the link to this guidance which you posted:

This positions the "default" rollout to be enabled-by-default with a killswitch, rather than a finch-enabled rollout. This matches what you have above, where we want deprecations to go through that flow of in-development -> under test -> experimental -> stable (with killswitch) -> mature (no flags). I also think this is the right guidance, as asking risky features to go through a finch-enabled rollout is a large overhead (and requires Chrome internal changes).

For removals, I agree that this implies that we should remove their Runtime Enabled Feature and replace it with a Runtime Enabled Feature that removes the feature, which we can then killswitch with status=stable.

Rick Byers

unread,
Apr 5, 2023, 10:48:29 AM4/5/23
to Charles Harrison, blink-dev
Right. I've heard some debate and confusion on this. While performance work and major new user-facing browser features tend to go through finch-based rollout and A/B testing to validate quality, platform API work tends instead to rely on developers being in control of their own A/B testing as they see fit (often with origin trials). It's confusing for developers to have APIs available for a random subset of their users, and even more confusing when they need to repro / debug an issue. But there's obviously a lot of grey area and judgement needed here, especially for features which are both platform APIs and browser UI features. I'm happy to help think through the tradeoffs in particular situations if desired.

Rick
Reply all
Reply to author
Forward
0 new messages