disabling field trials for tests


Paweł Hajdan, Jr.

Sep 27, 2011, 7:06:21 PM
to chromium-dev
I remember a case where a field trial was making tests fail randomly. It was quite difficult and annoying to diagnose.

I'd like to disable field trials for tests. Does that sound like a good idea? If so, what'd be the best way to do that?

Responding to a possible counter-point that our tests should reflect the product as closely as possible: well, we already disable "web resources", preconnect, first run, the default browser check, and safebrowsing auto-update; force the native password store on Linux; and override GPU settings. Field trials by definition add randomness to the tests, and in my opinion having more stable tests hugely outweighs any coverage possibly lost by disabling them. If something is really important, a test should be added that has that field trial always enabled.

Dominic Hamon

Sep 27, 2011, 7:48:50 PM
to phajd...@chromium.org, chromium-dev
Apologies if anyone receives this multiple times. Between using the wrong mail account and missing Reply All, I sent it three times to various addresses.


On Tue, Sep 27, 2011 at 4:06 PM, Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:
> I remember a case where a field trial was making tests fail randomly. It was quite difficult and annoying to diagnose.
>
> I'd like to disable field trials for tests. Does that sound like a good idea? If so, what'd be the best way to do that?

Disable them, or have them always return the default group? Some field trials enable a feature by default and disable it in the field trial, and some do the opposite. I think there's a tacit assumption that the default group is what is expected to be stable, and that the alternatives might become the default in time.

Returning the default group for all field trials is trivial if you can determine that you're running in a test environment. That would have to be a compile-time flag, I think. It would be necessary to allow some tests to override this, however, so that they can test all code-paths.
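
For illustration, such an override could look something like this (a minimal sketch; it assumes a FieldTrialList helper along the lines of CreateFieldTrial(trial_name, group_name) that pins a trial to a fixed group, so the exact name and signature are assumptions):

  // Sketch: pin one trial to a specific group so a test exercises a
  // non-default code path deterministically instead of randomly.
  #include "base/metrics/field_trial.h"

  void ForceTrialGroupForTest() {
    // "Prerender" and "Enabled" are made-up names for illustration.
    base::FieldTrialList::CreateFieldTrial("Prerender", "Enabled");
  }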

Bottom-line is that the alternative code paths from a field trial should not cause regressions. If you agree, then we should keep them enabled but make it easy to either force a test to be run under a particular field trial group or enforce checking of the field trial group before running a particular test.
 

> Responding to a possible counter-point that our tests should reflect the product as closely as possible: well, we already disable "web resources", preconnect, first run, the default browser check, and safebrowsing auto-update; force the native password store on Linux; and override GPU settings. Field trials by definition add randomness to the tests, and in my opinion having more stable tests hugely outweighs any coverage possibly lost by disabling them. If something is really important, a test should be added that has that field trial always enabled.


Paweł Hajdan, Jr.

Sep 28, 2011, 4:17:15 PM
to Dominic Hamon, chromium-dev
On Tue, Sep 27, 2011 at 16:45, Dominic Hamon <domi...@google.com> wrote:
> On Tue, Sep 27, 2011 at 4:06 PM, Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:
>> I remember a case where a field trial was making tests fail randomly. It was quite difficult and annoying to diagnose.
>>
>> I'd like to disable field trials for tests. Does that sound like a good idea? If so, what'd be the best way to do that?
>
> Disable them, or have them always return the default group? Some field trials enable a feature by default and disable it in the field trial, and some do the opposite. I think there's a tacit assumption that the default group is what is expected to be stable, and that the alternatives might become the default in time.

Yeah, to be precise, I think I mean always using the default group.
 
> Returning the default group for all field trials is trivial if you can determine that you're running in a test environment. That would have to be a compile-time flag, I think. It would be necessary to allow some tests to override this, however, so that they can test all code-paths.

Why a compile-time flag? Can't we make that a command-line flag?
 
> Bottom-line is that the alternative code paths from a field trial should not cause regressions. If you agree, then we should keep them enabled but make it easy to either force a test to be run under a particular field trial group or enforce checking of the field trial group before running a particular test.

I anticipated that point. See my original e-mail:

> If something is really important, a test should be added [that] has that field trial always enabled.

Dominic Hamon

Sep 28, 2011, 4:20:54 PM
to Paweł Hajdan, Jr., chromium-dev
On Wed, Sep 28, 2011 at 1:17 PM, Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:
> On Tue, Sep 27, 2011 at 16:45, Dominic Hamon <domi...@google.com> wrote:
>> On Tue, Sep 27, 2011 at 4:06 PM, Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:
>>> I remember a case where a field trial was making tests fail randomly. It was quite difficult and annoying to diagnose.
>>>
>>> I'd like to disable field trials for tests. Does that sound like a good idea? If so, what'd be the best way to do that?
>>
>> Disable them, or have them always return the default group? Some field trials enable a feature by default and disable it in the field trial, and some do the opposite. I think there's a tacit assumption that the default group is what is expected to be stable, and that the alternatives might become the default in time.

> Yeah, to be precise, I think I mean always using the default group.
 
>> Returning the default group for all field trials is trivial if you can determine that you're running in a test environment. That would have to be a compile-time flag, I think. It would be necessary to allow some tests to override this, however, so that they can test all code-paths.

> Why a compile-time flag? Can't we make that a command-line flag?

I meant that there already is a compile-time flag. We could add a command-line flag that is passed in to all test runs, I suppose.
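
Something like this, say (a sketch only; the switch name is hypothetical):

  // Sketch: a hypothetical --disable-field-trials switch, checked early in
  // startup so every trial silently stays in its default group.
  #include "base/command_line.h"

  bool FieldTrialsDisabledForTesting() {
    return CommandLine::ForCurrentProcess()->HasSwitch(
        "disable-field-trials");  // hypothetical switch name
  }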
 
 
>> Bottom-line is that the alternative code paths from a field trial should not cause regressions. If you agree, then we should keep them enabled but make it easy to either force a test to be run under a particular field trial group or enforce checking of the field trial group before running a particular test.

> I anticipated that point. See my original e-mail:
>
>> If something is really important, a test should be added [that] has that field trial always enabled.


My point is a little stronger than that. I think we should allow field trials in tests but enforce that tests that are affected by field trials should still be green. If this means running different tests for field trial versions, then so be it.

Paweł Hajdan, Jr.

Sep 29, 2011, 1:44:58 PM
to Dominic Hamon, chromium-dev
On Wed, Sep 28, 2011 at 13:19, Dominic Hamon <domi...@google.com> wrote:
> I meant that there already is a compile-time flag.

What's the name of that flag?

And then I'd really like to have something runtime-based. What do you think about adding a command-line flag with the same effect? 

>> If something is really important, a test should be added [that] has that field trial always enabled.
>
> My point is a little stronger than that. I think we should allow field trials in tests but enforce that tests that are affected by field trials should still be green. If this means running different tests for field trial versions, then so be it.

I believe this is unrealistic. We have a lot of problems with flakiness. This is a tradeoff: yeah, it would be great to have 100% coverage and no flaky tests, but we're not there. If tests are flaky because of field trials activating randomly (and this has happened at least with prerendering), you are effectively losing coverage because people don't trust the tests and ignore them. And you'll always be behind if you try to fix tests in a reactive way, i.e. detect and patch. I'm looking at a more general solution now, and this effort is a part of it.

By the way, if you want the field trials, feel free to add separate tests with those trials enabled, a separate bot (not on the main waterfall) that runs all tests with those trials enabled, or even a separate waterfall.

Just keep randomness off the main waterfall, please!

Dominic Hamon

Sep 29, 2011, 2:05:00 PM
to Paweł Hajdan, Jr., chromium-dev
On Thu, Sep 29, 2011 at 10:44 AM, Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:
> On Wed, Sep 28, 2011 at 13:19, Dominic Hamon <domi...@google.com> wrote:
>> I meant that there already is a compile-time flag.
>
> What's the name of that flag?

UNIT_TEST is certainly defined for some subset of tests.
 

> And then I'd really like to have something runtime-based. What do you think about adding a command-line flag with the same effect?

I think this opens the door to (more) code that behaves differently during testing than in released builds. I'm already suspicious of things like TestingProfile, which are worthwhile for keeping test footprints small but do change the order of construction/destruction and may hide race conditions, or even cause them. We should be working to minimize the differences between test code and release code.
 

>>> If something is really important, a test should be added [that] has that field trial always enabled.
>>
>> My point is a little stronger than that. I think we should allow field trials in tests but enforce that tests that are affected by field trials should still be green. If this means running different tests for field trial versions, then so be it.
>
> I believe this is unrealistic. We have a lot of problems with flakiness. This is a tradeoff: yeah, it would be great to have 100% coverage and no flaky tests, but we're not there. If tests are flaky because of field trials activating randomly (and this has happened at least with prerendering), you are effectively losing coverage because people don't trust the tests and ignore them. And you'll always be behind if you try to fix tests in a reactive way, i.e. detect and patch. I'm looking at a more general solution now, and this effort is a part of it.

I agree that it's challenging, but you're proposing a technical solution to a procedural problem. If people ignore the tests then what can we do to make it impossible to ignore them?

Under your proposal field trials would still introduce test regressions but they wouldn't be caught until the field trial is made the default, and that is far later than I'm comfortable with. I'd rather that engineers are forced to consider the impact of introducing an alternative code path at the time they introduce it. We shouldn't accept flakiness, and if a field trial introduces it then it should also introduce a fix for that flakiness.

As I'm typing this, I realize that I'm guilty of this with a recent field trial that I've enabled, though it doesn't seem to have caused any flakiness on the waterfalls, which seems strange. I think I should change the tests to check which version of the code they're running against and change the expectations accordingly.

Paweł Hajdan, Jr.

Sep 29, 2011, 2:21:51 PM
to Dominic Hamon, chromium-dev
On Thu, Sep 29, 2011 at 11:05, Dominic Hamon <domi...@chromium.org> wrote:
> On Thu, Sep 29, 2011 at 10:44 AM, Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:
>> On Wed, Sep 28, 2011 at 13:19, Dominic Hamon <domi...@google.com> wrote:
>>> I meant that there already is a compile-time flag.
>>
>> What's the name of that flag?
>
> UNIT_TEST is certainly defined for some subset of tests.

Right, I was thinking you meant something field-trial-specific. But we can't use UNIT_TEST, because the field trial files are compiled into base, which is never compiled with UNIT_TEST. That symbol can only be used in headers or test-only files.
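
For reference, the usual header-only pattern looks like this (a sketch; a .cc file built into base is compiled once, without the define, so it can't be gated this way):

  // In a header: compiles differently per target, because UNIT_TEST is
  // only defined when building test targets.
  #if defined(UNIT_TEST)
  inline bool RunningInUnitTest() { return true; }
  #else
  inline bool RunningInUnitTest() { return false; }
  #endif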
 
>> And then I'd really like to have something runtime-based. What do you think about adding a command-line flag with the same effect?

> I think this opens the door to (more) code that behaves differently during testing than in released builds.

As I said before, we already change the behavior quite a lot, e.g. disabling safe browsing auto-update, first run, etc. This is not opening any doors that aren't already open. In fact, we have places in the testing infrastructure specifically dedicated to adding flags for different runtime behavior (test_launcher_utils).

And our GPU behavior is just completely different in tests than in the real world. In my opinion this point just doesn't apply here.
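
A sketch of where such a flag would be wired up, assuming test_launcher_utils keeps roughly the shape it has today (the disable-field-trials switch itself is hypothetical):

  // chrome/test/base/test_launcher_utils.cc (illustrative excerpt).
  // PrepareBrowserCommandLineForTests already centralizes test-only
  // switches; a field trial kill switch would slot in alongside them.
  void PrepareBrowserCommandLineForTests(CommandLine* command_line) {
    command_line->AppendSwitch("no-first-run");          // existing kind of switch
    command_line->AppendSwitch("disable-field-trials");  // hypothetical
  }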
 
> I'm already suspicious of things like TestingProfile, which are worthwhile for keeping test footprints small but do change the order of construction/destruction and may hide race conditions, or even cause them. We should be working to minimize the differences between test code and release code.

TestingProfile is only used in unit tests, which should mock things out. In browser and UI tests we don't use it (if people do, that's bad). Browser and UI tests start the full browser. They are extremely susceptible to browser-global issues like crashes and so on. That includes field trials.

We can't control the crash rate directly, but it should be possible to disable field trials very easily. This is a different thing from a mock (which is a unit-test thing). This is about keeping large tests stable by eliminating as much randomness from them as possible.
 
> I agree that it's challenging, but you're proposing a technical solution to a procedural problem. If people ignore the tests then what can we do to make it impossible to ignore them?

Not impossible, but if we make tests more reliable, ignoring them is much more painful. Actually, I don't think the problem is procedural; it's in fact a technical one. The tests are not stable enough. You can't force everyone who gets unrelated test failures to chase random flakes. The test author/owner needs to fix those problems; otherwise he should be aware that there is effectively no coverage from a flaky test. This is a fact.
 
> Under your proposal field trials would still introduce test regressions but they wouldn't be caught until the field trial is made the default, and that is far later than I'm comfortable with.

Are we catching regressions earlier because tests are more flaky? No! People just file bugs which sit there for months. Seriously, the case that made me write this is precisely an argument against your point: there was no bug in the product; the tests just got more unstable. We're not going to magically fix more problems by running more unstable tests and then ignoring them.
 
> I'd rather that engineers are forced to consider the impact of introducing an alternative code path at the time they introduce it.

Agreed, this is sort of obvious.
 
> We shouldn't accept flakiness, and if a field trial introduces it then it should also introduce a fix for that flakiness.

Yeah, again this is unrealistic. We can't detect flakiness on the first check-in that introduces it. By the time people detect it, the reputation of the test is already damaged. And many people ignore flakiness bugs, so the fix isn't always going to be quick.

Furthermore, those issues frequently affect multiple tests, which increases the scale of the problem. Unless you have an equally magical fix for increasing our flakiness removal rate, I think this is not going to work.
 
> I think I should change the tests to check which version of the code they're running against and change the expectations accordingly.

Are you volunteering to do that for the thousands of tests we have and every field trial anyone introduces? Sorry, it just sounds unrealistic to me.

Keep randomness off the main waterfall! Feel free to have your own waterfall with more flaky tests. ;-)

Dominic Hamon

Sep 29, 2011, 2:36:22 PM
to Paweł Hajdan, Jr., chromium-dev
It's clear that we have different philosophies here, and I'm not sure we'll agree on the right approach, so someone else should probably weigh in. This'll be my last mail, as I'm about to start repeating myself.

On Thu, Sep 29, 2011 at 11:21 AM, Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:
> On Thu, Sep 29, 2011 at 11:05, Dominic Hamon <domi...@chromium.org> wrote:
>> On Thu, Sep 29, 2011 at 10:44 AM, Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:
>>> On Wed, Sep 28, 2011 at 13:19, Dominic Hamon <domi...@google.com> wrote:
>>>> I meant that there already is a compile-time flag.
>>>
>>> What's the name of that flag?
>>
>> UNIT_TEST is certainly defined for some subset of tests.
>
> Right, I was thinking you meant something field-trial-specific. But we can't use UNIT_TEST, because the field trial files are compiled into base, which is never compiled with UNIT_TEST. That symbol can only be used in headers or test-only files.
>
>>> And then I'd really like to have something runtime-based. What do you think about adding a command-line flag with the same effect?
>>
>> I think this opens the door to (more) code that behaves differently during testing than in released builds.
>
> As I said before, we already change the behavior quite a lot, e.g. disabling safe browsing auto-update, first run, etc. This is not opening any doors that aren't already open. In fact, we have places in the testing infrastructure specifically dedicated to adding flags for different runtime behavior (test_launcher_utils).

It doesn't close any, either. I understand that you're coming at this from the point of view of reducing flakiness, and that is something I think everyone agrees is important. I'd also like to make sure that we don't diverge our testing code from our release code. That we already do is not a reason to do it more; it's a reason to stop doing it in every place that we can.
 

> And our GPU behavior is just completely different in tests than in the real world. In my opinion this point just doesn't apply here.
>
>> I'm already suspicious of things like TestingProfile, which are worthwhile for keeping test footprints small but do change the order of construction/destruction and may hide race conditions, or even cause them. We should be working to minimize the differences between test code and release code.
>
> TestingProfile is only used in unit tests, which should mock things out. In browser and UI tests we don't use it (if people do, that's bad). Browser and UI tests start the full browser. They are extremely susceptible to browser-global issues like crashes and so on. That includes field trials.
>
> We can't control the crash rate directly, but it should be possible to disable field trials very easily. This is a different thing from a mock (which is a unit-test thing). This is about keeping large tests stable by eliminating as much randomness from them as possible.
 
>> I agree that it's challenging, but you're proposing a technical solution to a procedural problem. If people ignore the tests then what can we do to make it impossible to ignore them?
>
> Not impossible, but if we make tests more reliable, ignoring them is much more painful. Actually, I don't think the problem is procedural; it's in fact a technical one. The tests are not stable enough. You can't force everyone who gets unrelated test failures to chase random flakes. The test author/owner needs to fix those problems; otherwise he should be aware that there is effectively no coverage from a flaky test. This is a fact.
>
>> Under your proposal field trials would still introduce test regressions but they wouldn't be caught until the field trial is made the default, and that is far later than I'm comfortable with.
>
> Are we catching regressions earlier because tests are more flaky? No! People just file bugs which sit there for months. Seriously, the case that made me write this is precisely an argument against your point: there was no bug in the product; the tests just got more unstable. We're not going to magically fix more problems by running more unstable tests and then ignoring them.
>
>> I'd rather that engineers are forced to consider the impact of introducing an alternative code path at the time they introduce it.
>
> Agreed, this is sort of obvious.
>
>> We shouldn't accept flakiness, and if a field trial introduces it then it should also introduce a fix for that flakiness.
>
> Yeah, again this is unrealistic. We can't detect flakiness on the first check-in that introduces it. By the time people detect it, the reputation of the test is already damaged. And many people ignore flakiness bugs, so the fix isn't always going to be quick.

We absolutely can detect a change in code that changes the expectations of a test. The author and code reviewers should be working to make sure existing tests don't break under the field trials when they're introduced. I think this is realistic moving forward. I'm not suggesting running more unstable tests; I'm suggesting we don't put the unstable tests in in the first place.

Detecting flakiness is a problem, and people not trusting tests is a problem. I think your proposal will encourage more stability in tests, but I don't think it's the best way to do that, as the potential regression is still there, just delayed.
 

> Furthermore, those issues frequently affect multiple tests, which increases the scale of the problem. Unless you have an equally magical fix for increasing our flakiness removal rate, I think this is not going to work.

Perhaps organising a flaky test fixit would help clear the backlog.
 
 
>> I think I should change the tests to check which version of the code they're running against and change the expectations accordingly.
>
> Are you volunteering to do that for the thousands of tests we have and every field trial anyone introduces? Sorry, it just sounds unrealistic to me.

I'm volunteering to do it for the tests affected by my field trial, and I'm proposing that every field trial author should do the same when introducing a trial. I don't think it's unrealistic for future field trials. Clearing the backlog is a larger issue that I don't have a good solution to.

Jói Sigurðsson

Sep 29, 2011, 4:33:11 PM
to domi...@chromium.org, Paweł Hajdan, Jr., chromium-dev
> I'm volunteering to do it for the tests affected by my field trial, and I'm proposing
> that every field trial author should do the same when introducing a trial. I don't
> think it's unrealistic for future field trials. Clearing the backlog is a larger issue
> that I don't have a good solution to.
I think this points to a reasonable compromise: to reduce flakiness, use the default group for all field trials when running tests. The exception would be when the field trial author explicitly enables testing of the trial's different groups. Personally, I think this should most often happen at the level of unit tests that explicitly set the trial group whose behavior they want to test, but I can see why you might also want to run subsets of our existing tests with a field trial set to some non-default group. Is the first (explicit unit tests) enough for now?
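
A sketch of what such a unit test could look like (the FieldTrialList constructor arguments have changed over time, so treat the details as illustrative):

  // Sketch: a unit test that pins the trial group explicitly before
  // exercising the code under test.
  #include "base/metrics/field_trial.h"
  #include "testing/gtest/include/gtest/gtest.h"

  TEST(MyFeatureTest, BehaviorInExperimentGroup) {
    base::FieldTrialList field_trial_list("client_id");  // ctor args illustrative
    base::FieldTrialList::CreateFieldTrial("MyTrial", "Experiment");
    EXPECT_EQ("Experiment", base::FieldTrialList::FindFullName("MyTrial"));
    // ...now run the code that branches on the trial and assert the
    // Experiment-group expectations.
  }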

Cheers,
Jói

Paweł Hajdan, Jr.

Sep 30, 2011, 8:25:49 PM
to Jói Sigurðsson, domi...@chromium.org, chromium-dev
I talked to Jim Roskind today and he suggested a great solution: we can just print information about which field trials are active in tests, making it easier to correlate test failures with field trials. I'm going to implement that. This should address all the concerns posted here.
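
Roughly along these lines (a sketch; it assumes GetActiveFieldTrialGroups from base/metrics/field_trial.h, whose exact signature details may differ):

  // Sketch: log every trial that has chosen a group, so a flaky failure
  // can be correlated with the field trial state of that run.
  #include "base/logging.h"
  #include "base/metrics/field_trial.h"

  void LogActiveFieldTrials() {
    base::FieldTrial::ActiveGroups active_groups;
    base::FieldTrialList::GetActiveFieldTrialGroups(&active_groups);
    for (size_t i = 0; i < active_groups.size(); ++i) {
      LOG(INFO) << "Active field trial: " << active_groups[i].trial_name
                << " group: " << active_groups[i].group_name;
    }
  }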

William Chan (陈智昌)

Sep 30, 2011, 8:44:04 PM
to phajd...@chromium.org, Jói Sigurðsson, domi...@chromium.org, chromium-dev
I'm with joi@ here. Use targeted unit tests; stop depending on randomized behavior.