Changing the "fuse" threshold to allow short-term experimentation even by large sites?


Alex Komoroske

Feb 10, 2017, 1:17:51 PM
to experimentation-dev
Hey folks,

At BlinkOn7 there was a break-out discussion about Origin Trials and the criteria for the global "fuse" blowing. There have also been a number of recent discussions about performance-related features that are primarily useful to big, mature, savvy sites, but whose use could easily blow the fuse accidentally. We're stuck in a position where we don't know if the design of the API will actually have the right performance characteristics in the wild, but the only way to get that feedback (experimentation in the wild) is closed off.

In some ways, the criteria are working as intended (preventing a large customer from accidentally "burning in" the API). But in other ways, it seems like it's being too restrictive (not allowing any big customers, even when they want to be responsible and do a short, time-limited trial).

Is there a way to modify the "fuse" criteria, so that even large customers can do experiments, as long as they're short-term and time-limited? Today the criteria are effectively "the fuse blows if at any moment there is more than the deprecation threshold of uses." But what if it was something like "the fuse blows if the median daily usage across the past 14 days is greater than the deprecation threshold"? That way, large customers could do experiments for whatever number of users they needed, but if they ran the experiment for more than 7 days, the fuse would blow.
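To make the strawman concrete, here's a sketch of how a median-based fuse check could behave. The threshold value and data plumbing are illustrative assumptions, not the real deprecation threshold:

```python
from statistics import median

# Illustrative sketch of the proposed criterion: the fuse blows only if the
# *median* daily usage over the past 14 days exceeds the deprecation threshold.
DEPRECATION_THRESHOLD = 0.0003  # e.g. 0.03% of page loads (assumed value)

def fuse_blows(daily_usage):
    """daily_usage: the last 14 days of usage, as fractions of page loads."""
    return median(daily_usage[-14:]) > DEPRECATION_THRESHOLD

# A spike covering fewer than half the days in the window leaves the
# median at the quiet baseline, so the fuse holds...
short_experiment = [0.0001] * 8 + [0.002] * 6
assert not fuse_blows(short_experiment)

# ...while sustained overuse across most of the window blows it.
long_experiment = [0.0001] * 6 + [0.002] * 8
assert fuse_blows(long_experiment)
```

Note that the median makes the check insensitive to the *size* of the spike and sensitive only to its *duration*, which is exactly the property that lets a large site run a big but short experiment.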

Thoughts?

--Alex

Jason Chase

Feb 10, 2017, 3:53:17 PM
to experimentation-dev
To me, the goal is to prevent "burn-in", and the "fuse" criteria are simply a tool to achieve the goal. We defined the fuse criteria at the outset, before we actually ran any trials. So, I think it makes sense to iterate on the criteria, as we identify different use cases (like large customers doing performance-related experimentation) and become more comfortable about the risk of burn-in.

As for the actual criteria, the origin trials team is currently looking at the 7 day aggregation of usage data. Already, it's not based on a single peak usage value. The intent is to cover a typical usage period (i.e. weekdays and weekends), and look for sustained overuse. I think it makes sense to be somewhat forgiving of accidental/temporary overuse. For example, imagine a large customer that inadvertently exposed a trial to a larger-than-intended subset of their user population. We could tweak the criteria further as suggested.

Another point is that we've often considered the usage limits as both a safeguard and communication tool. We have contact info for the registered origins, so we can reach out before usage exceeds the criteria.

Just a few thoughts from someone on the origin trials team (so I might be a little biased).

Thanks,
Jason

Alex Komoroske

Feb 10, 2017, 6:39:18 PM
to Jason Chase, experimentation-dev
I definitely support us revisiting the "fuse" criteria to make sure they fit the goals we want while avoiding the cases we want to avoid.

One thing I do think is a hard constraint, however, is to have a very specific, automatic threshold. 

Ultimately this whole thing is a bit of game theory, similar to the game of chicken. To win a game of chicken you want conspicuous, irrevocable strategic commitment (a summary of various approaches). Ultimately we of course could override whatever automatic criteria we enacted, but the trolley problem shows us that there is a distinctly different moral situation when you must proactively intervene to cause the desired outcome, versus when the desired outcome is automatic.

So I really want us to set a specific, concrete threshold, and then do everything we can to make that fully automatic.

The 7 day average today seems a bit too short, because most large companies will likely want to do an experiment that is longer than that, and "average" captures the wrong value. That's why I proposed the 14-day median in my strawman.

Very interested to hear everyone's thoughts!

--
You received this message because you are subscribed to the Google Groups "experimentation-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to experimentation-dev+unsub...@chromium.org.
To post to this group, send email to experimentation-dev@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/experimentation-dev/7feff4a4-5043-4471-9da1-194369301222%40chromium.org.

Rick

Feb 10, 2017, 11:25:51 PM
to experimentation-dev, cha...@chromium.org
I agree that the fuse criteria are unnecessarily strict.  From a predictability standpoint I'd be comfortable with relaxing them substantially without much fear that we'd risk lock-in (the time limit is our main protection against lock-in IMHO).

There's another reason for the fuse limit in addition to just lock-in prevention: ensuring the openness of the platform.  Often features aren't spec'd out in sufficient detail at origin trial time to reasonably enable alternate implementations.  If it became common for a large number of users to experience the benefits of origin trials, other browsers could feel pressure to try to ship these experimental features in order to get the same user benefit.  But as long as it's a small fraction of the user base who ever has an origin trial active, the risk here should be minimal.

I don't have a strong opinion on exactly how we define the fuse limit, but I agree we should try to get automated enforcement set up ASAP.  We really don't want to risk being in a position where there's judgement required to decide if/when to disable an experiment which has gotten suddenly popular (and might even break an improperly-coded popular site if disabled).

Owen

Feb 12, 2017, 7:55:39 PM
to experimentation-dev, cha...@chromium.org
I agree that it makes a lot of sense to revise these criteria now that we have learned more, and definitely agree that more important than the specific constants is a clearly expressed policy and a system that does not apply any judgement or discretion.

Alex's suggestion of a 14 day threshold makes sense to me. As Rick says, the automatic disabling is the most powerful lever we have to prevent burn-in anyway, and I see almost no increased risk from a few additional days over the target threshold.

Chris Harrelson has also confirmed in blink-dev that the 0.03% deprecation threshold is explicitly out of date. This further supports updating it.

Specifically, as a starting point I'd suggest capping 14-day average usage at 0.5% of page loads. 

Half a percent feels about right to me as preventing the feature from having too significant positive impact on Chrome users such that other browsers feel forced to fast follow on an unproven design, but also large enough to allow developers to experiment and collect statistically significant data which is often vital, such as in the case of Navigation Preload.
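As a concrete sketch of this proposal (only the 0.5% figure comes from the proposal itself; the data plumbing is assumed):

```python
FUSE_THRESHOLD = 0.005  # 0.5% of page loads, per the proposal above

def fuse_blows(daily_usage):
    """daily_usage: the last 14 days of usage, as fractions of page loads."""
    window = daily_usage[-14:]
    return sum(window) / len(window) > FUSE_THRESHOLD

# An average is forgiving of brief overshoot: three days at 1% inside an
# otherwise-quiet window keeps the mean well under 0.5%...
assert not fuse_blows([0.001] * 11 + [0.01] * 3)
# ...but sustained usage just above the cap blows the fuse.
assert fuse_blows([0.006] * 14)
```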

Alex Komoroske

Feb 14, 2017, 9:29:18 AM
to Owen, experimentation-dev, Jason Chase
On Sun, Feb 12, 2017 at 4:55 PM, Owen <owe...@chromium.org> wrote:
I agree that it makes a lot of sense to revise these criteria now that we have learned more, and definitely agree that more important than the specific constants is a clearly expressed policy and a system that does not apply any judgement or discretion.

Alex's suggestion of a 14 day threshold makes sense to me. As Rick says, the automatic disabling is the most powerful lever we have to prevent burn-in anyway, and I see almost no increased risk from a few additional days over the target threshold.

Chris Harrelson has also confirmed in blink-dev that the 0.03% deprecation threshold is explicitly out of date. This further supports updating it.

Specifically, as a starting point I'd suggest capping 14-day average usage at 0.5% of page loads. 

Is the "average" vs "median" intentional? I imagine it might be easier to calculate? 

Half a percent feels about right to me as preventing the feature from having too significant positive impact on Chrome users such that other browsers feel forced to fast follow on an unproven design, but also large enough to allow developers to experiment and collect statistically significant data which is often vital, such as in the case of Navigation Preload.

This generally seems good to me. What are the next steps? 

Ben Maurer

Feb 14, 2017, 12:47:57 PM
to experimentation-dev
Something I'd point out here is that the "burn in" of an API is not necessarily only related to its global usage.

Let's say Facebook is responsible for X% of global page views (let's assume X is higher than any threshold we're talking about today). If Facebook launches an origin trial to 90% of its users, but nobody else uses the feature, I'd argue that the burn-in is minimal. First, since 10% of FB users currently manage without the feature, it can't become something so critical that we depend on it. Second, we would launch such an experiment in active collaboration with Chrome -- we're likely to be able to work through any issues with the experiment amicably. On Rick's openness point, we would never want to push browsers to standardize a substandard API just to get a few extra months of user benefit from it. We've been bitten way too many times by bad APIs to go down this route :-)

On the other hand, let's say that 90 sites that are 1/100th the size of Facebook launch the feature to 100% of their users. I'd argue that this feature is at risk of being burned in. Now there are 90 sites that never test their code without this origin trial enabled.

IMHO the key to avoiding burn in is forcing a percentage of users to not be in the experiment. I'm not sure of the best technical way to do that. One way would be to ignore the opt-in on N% of page views, but this gets tricky because it can really mess with A/B tests (most of our A/B tests are based on a hash of the user ID, so it's hard to A/B test something which has its own opt-out mechanism).
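A rough sketch of what a hash-based holdback could look like (all names, constants, and the bucketing scheme are invented for illustration), including the salting that would be needed to keep it independent of a site's own user-ID-hash A/B buckets:

```python
import hashlib

HOLDBACK_PERCENT = 10  # hypothetical: 10% of users never get the feature

def _bucket(user_id: str, salt: str) -> int:
    # Deterministic 0-99 bucket from a salted hash of the user ID.
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def in_holdback(user_id: str) -> bool:
    # A salt distinct from the site's A/B framework keeps this holdback
    # statistically independent of experiment buckets derived from the
    # same user ID -- the interaction Ben is worried about.
    return _bucket(user_id, salt="origin-trial-holdback") < HOLDBACK_PERCENT

def feature_enabled(user_id: str, has_trial_token: bool) -> bool:
    return has_trial_token and not in_holdback(user_id)

# Over many users, roughly 10% land in the holdback.
rate = sum(in_holdback(str(i)) for i in range(10_000)) / 10_000
assert 0.07 < rate < 0.13
```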

Perhaps a simple solution here is to mandate that any site which wishes to do an origin trial that would exceed 0.1% of page views engage in a deeper relationship, where they would detail their plan to avoid burn-in and commit to maintaining the openness of the platform.

-b

Alex Komoroske

Feb 14, 2017, 3:31:42 PM
to Ben Maurer, experimentation-dev

Very insightful, thanks Ben!

In practice the risk is roughly proportional to how widely used a service is and how badly users depend on it--that's the strongest predictor of how loud the blow-back will be if a site breaks: how many angry bugs will be filed on crbug, how much loud complaining there will be, etc.

We've also found that the "commitments" from large companies often don't work in practice, because the people who make the commitment get overruled by others in the organization who primarily just care about getting useful features to their end users. There are a lot of examples of this happening historically, unfortunately. It's rare for big sites to have as productive a relationship with us as Facebook does. :-)





Ben Maurer

Feb 14, 2017, 3:42:54 PM
to Alex Komoroske, experimentation-dev
On Tue, Feb 14, 2017 at 12:31 PM, Alex Komoroske <komo...@chromium.org> wrote:

In practice the risk is roughly proportional to how widely used a service is and how badly users depend on it--that's the strongest predictor of how loud the blow-back will be if a site breaks: how many angry bugs will be filed on crbug, how much loud complaining there will be, etc.

My point here is that a breakage takes two steps:

1) A site must take on a dependency to the feature in question
2) The feature must be removed

"don't let origin trials go beyond 0.5% of page views" is targeted at step 2. It says "reduce the damage caused by removing the feature".

I'm suggesting we target step 1 for the larger sites. If a large site is only ever able to roll out an origin trial to 90% of chrome users they'll have to have a plan in place for the remaining 10%. No site is going to deploy a feature that will cause 10% of their users to file crbugs the day they launch it. Regardless of how big that 90% of users who have the feature is you know that the site has a vested interest in keeping their site usable if the trial is turned off. 

-b

Alex Komoroske

Feb 14, 2017, 5:08:51 PM
to Ben Maurer, experimentation-dev
Ah, that's a great point.

We've designed OT to date specifically for the general case that we can enforce easily, but I wonder if there are other approaches we can take to enforce a 90/10 for the use cases of large sites.


Ben Maurer

Feb 14, 2017, 5:20:19 PM
to Alex Komoroske, experimentation-dev
Do you guys collect per-site usage data for origin trials? If so, maybe you could have a "placebo" origin trial token for each origin trial that would still be keyed per site but would not enable the trial. You could collect usage stats for each site and blow the fuse on a site if its global page views were > X and it didn't have a normal:placebo ratio of at least Y. Sites could still game this (eg by putting the placebo impression on trivial pages that don't use the feature) but it would take a fair bit of effort to do.
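A sketch of how that per-site placebo-ratio check might work; the mechanism and every constant here are hypothetical:

```python
USAGE_CAP = 0.001        # hypothetical: sites under 0.1% of global page views are exempt
MIN_PLACEBO_RATIO = 0.1  # hypothetical: at least 1 placebo hit per 10 normal hits

def site_fuse_blows(global_usage_fraction, normal_hits, placebo_hits):
    """Blow the fuse for one origin if it is large AND lacks a holdback."""
    if global_usage_fraction <= USAGE_CAP:
        return False  # small sites may enable the trial for everyone
    if normal_hits == 0:
        return False
    return placebo_hits / normal_hits < MIN_PLACEBO_RATIO

# A big site with almost no placebo traffic trips the per-site fuse...
assert site_fuse_blows(0.005, normal_hits=1000, placebo_hits=50)
# ...a big site with a healthy holdback, or a small site, does not.
assert not site_fuse_blows(0.005, normal_hits=1000, placebo_hits=150)
assert not site_fuse_blows(0.0005, normal_hits=1000, placebo_hits=0)
```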

Alternatively, you could have X% of chrome users simply disable all origin trials. You'd need to communicate this to sites via a header so they could run valid A/B tests. It'd also be helpful to allow enterprise settings to override this for a given origin (eg so that we can make all our corp chrome machines enable origin trials on facebook.com for all our internal users)


Owen

Feb 23, 2017, 1:30:10 AM
to experimentation-dev, komo...@chromium.org
We are working on collecting some per-site metrics, but it is challenging due to privacy constraints, so it's unlikely to work as well as we'd like, at least any time soon.

I think these are all really great points and I think we should explicitly explore options that would allow large sites to experiment so long as there is a very meaningful holdback group with feature detection etc.

I suggest we keep discussing options, but first finalize the initial topic since simply increasing the current fuse threshold and moving it to an average over a few days seems to have general consensus on this thread.

Specifically, unless there are any further points or counter proposals, I suggest we send a concrete proposal to blink-dev to update the fuse such that an origin trial is disabled if the 14-day average usage exceeds 0.5% of page loads. 

I'll confirm with the origin trials team directly that they're all on board. In the meanwhile, I suggest we go with this on Friday if nobody has a counter proposal. 

Once that goes out I'd love to loop back around and discuss Ben's suggestion.


Ian Clelland

Feb 23, 2017, 3:36:41 PM
to Owen, experimentation-dev, komo...@chromium.org, Ben Maurer
I'd like to defer to the API owners on where the threshold should be -- 0.5% seems very high to me; if it weren't explicitly in an origin trial, we would have a really hard time justifying breaking 1/200 page loads. Maybe just the fact that a feature *is* in a trial makes enough of a difference that we can do that.

As a strawman counter-proposal, another thing we could do is drop the threat of a global disable, except in extreme cases -- we could say that if global usage rises above a certain level, then we will automatically disable the tokens with the highest usage, and will continue to do that, in order, until usage drops below the threshold.

To respond to some of Ben's comments, we had considered early on doing randomized disabling of trials for individual users, or individual page loads (or even something arbitrary and deterministic, like "Origin-trial-free Thursdays") to force people to code for the chance that the trial simply wasn't available.

In the end, we decided not to do that, since there's a very important category of sites which will exist *just* to demonstrate a new feature and to let users try it out -- developer's blog posts, DevRel demo sites, and the like.

It would be a terrible experience for them and their users if a significant number of them couldn't just take the steps of

 -- Update Chrome.
 -- Visit site.

and have the feature work. For those sites, though their traffic is extremely small compared to FB, we would be making things much more difficult by not allowing 100% usage.



Rick Byers

Feb 23, 2017, 4:00:21 PM
to Ian Clelland, Owen, experimentation-dev, Alex Komoroske, Ben Maurer
I like Ben's point about the risk being correlated with the number of sites depending on it.  Certainly we've had other examples where we've said we're not going to allow any single site to keep us from fixing an important interop bug. But it's true that we can't necessarily guarantee we will always win such a game of chicken (eg. imagine a feature that turns out to result in significant revenue somehow).

I also like Ian's idea of disabling the most-used token first when the fuse is blown. 0.5% is definitely a lot of usage and perhaps almost at the point where it's not much different in practice from 100% (there are probably very few sites who could hit 0.5% on their own).

I don't know that we're going to be able to pick a perfect policy up-front, but I think it's OK to iterate and learn.  We've now got multiple examples of our policy being too strict, and AFAIK not a single example of harm caused by the policy not being strict enough.  I'd say we shouldn't be too afraid to relax it significantly as a result.  Worst case, we hit a single painful case someday; it's probably not the end of the world, and it'll lead to us learning something that helps us find the right tradeoff.

IMHO if we're very clear when signing up for a trial that the time limit is immovable and short, and that we expect sites to have at least a 10% user holdback (despite having no way to enforce that), then I think it's unlikely that we'll find ourselves in a situation where we feel strong pressure to break our promise.  So subject to those constraints, I'm OK trying out whatever specific policy you folks feel is reasonable.


Owen

Feb 28, 2017, 1:00:10 AM
to experimentation-dev, icle...@chromium.org, owe...@chromium.org, komo...@chromium.org, ben.m...@gmail.com
I think we should definitely explore all of these options, but in the short term I don't see any harm in making the usage change while allowing us room to iterate so I've started the thread with blink-dev.

As for disabling tokens in order of usage, that sounds like a really awesome idea, Ian!

I thought about it today and realized that we don't yet have any mechanism to determine how much each token is being used (as we don't report the token as part of UMA), but once we have RAPPOR-based tracking in place I imagine that's totally something we could do!

I do agree with Rick that the time limits of the trials are very effective, and that having at least a 10% holdback for the largest sites is very important (I agree with Ian that there are legitimate reasons to need the feature to be enabled for 100% of very small sites).

Just one other thought about how we could achieve that (assuming it's very hard to do otherwise, due to privacy constraints and the challenges of per-origin tracking): if the number of sites that fall into the "very big" category is sufficiently small, I wonder whether we could create some fully transparent, contract-based solution, where companies could sign a contract for a special token that exempts them from the regular kill switch but contractually commits them to a holdback group of at least 10%. There could be hurdles unknown to me in making such contract text, and who has signed it, fully transparent, but if we could do it with full transparency, and allow anyone who wants to go through the fuss to participate, it could be another way of effectively solving the issue.

I'll try to grab one of the lawyers and see if they think something could be workable in that space.

(also since I'm suggesting something legal-ish I'd be remiss to not remind people to generally exercise extreme caution when speculating about legal matters!)

Thanks


Rick Byers

Feb 28, 2017, 9:54:15 AM
to Owen, experimentation-dev, Ian Clelland, Alex Komoroske, Ben Maurer
The contract idea is interesting.  Personally I don't think we need something legally binding, just a transparent public commitment.  As long as we have something public we can point to saying "you explicitly agreed that this would be auto-disabled after 6 weeks" that should be plenty strong IMHO.

