Flakes & retries (was Re: [chromium-dev] Handling flaky tests)


Wez

unread,
Nov 14, 2017, 5:15:07 PM
to s...@chromium.org, Dirk Pranke, chromium-dev, lind...@chromium.org, bax...@chromium.org, Yuke Liao
(forking to avoid detracting from Dirk's policy point. :)

While porting Chromium to the new Fuchsia operating system, we've encountered a number of flakes that turned out to be cross-platform issues that simply repro'd more often for us because the scheduler & emulation introduce timing differences. Many such issues also repro on other platforms, but rarely enough that they're effectively masked by the retry logic in both the test-launcher and the bot infrastructure.

We also have tests which appear to pass but are actually taking longer than our allotted per-test timeout, and get away with it because they happen to be run in a batch with a load of fast-running tests. These slow tests "pass" only because they're gobbling up other tests' timeout allocation; if the sharding & batching of tests changes, we'll see them start to mysteriously fail.
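A minimal Python sketch of this masking effect. The pooled-budget model and all numbers here are illustrative assumptions, not the actual test launcher's accounting:

```python
# Illustrative sketch (not the real test launcher): suppose a batch of
# tests shares a budget of per_test_timeout * len(batch), so one slow
# test can "pass" by consuming the allocation of its fast batch-mates.

PER_TEST_TIMEOUT = 10  # seconds (hypothetical value)

def batch_passes(durations, per_test_timeout=PER_TEST_TIMEOUT):
    """The whole batch fits if total runtime fits the pooled budget."""
    return sum(durations) <= per_test_timeout * len(durations)

def alone_passes(duration, per_test_timeout=PER_TEST_TIMEOUT):
    """Run in isolation, a test must fit its own timeout."""
    return duration <= per_test_timeout

# One 35-second test batched with nine 1-second tests: 44s <= 100s budget.
batch = [35, 1, 1, 1, 1, 1, 1, 1, 1, 1]
assert batch_passes(batch)          # masked by the batch
assert not alone_passes(batch[0])   # would time out if sharding changes
```

If the slow test lands in a different shard with fewer fast neighbors, the pooled budget shrinks and it starts failing, which matches the "mysterious" failures described above.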

There are two bugs with proposals for reducing flake:

1. Add bots which run "10x" testing (crbug.com/782656 was filed for this for layout tests), to increase the likelihood of repro'ing flake.

2. Remove retries from most tests (crbug.com/697215), so that flaky tests can't sneak past the bots so easily.

Do these proposals appeal to anyone?  Or are there better ideas to spot & address flake? :)

Cheers!

Wez



On 14 November 2017 at 12:50, Scott Violet <s...@chromium.org> wrote:
+1 to this policy.

Flaky tests have a direct impact on productivity insofar as they increase the time it takes to land patches (flaky tests cause retries) and make it harder to sheriff. I'm all for disabling flaky tests. That said, sheriffs need to make an effort before blindly disabling tests. A test that has run reliably for a long time and suddenly starts flaking is typically an indication that something recent is to blame and should be reverted.

  -Scott

On Tue, Nov 14, 2017 at 12:25 PM, Dirk Pranke <dpr...@chromium.org> wrote:
Hi all,

I'm sheriff today, and in the course of my sheriff-y duties, I've realized that our policy for how we're supposed to handle flaky tests is perhaps unclear or not in broad agreement.

Specifically, it is my belief that we are supposed to disable any flaky tests we come across [1]. However, the documentation I can point at [2] suggests that you're supposed to let them run. I think the documentation needs to be changed.

However, before I do that, I wanted to see if there was something I'm missing.

From the discussions we've had in the past, the objections to disabling tests generally tend to be twofold:

1) If you disable a flaky test, you lose whatever potential coverage that test might be giving us.

2) If you disable a flaky test, you can't easily tell if the test stops being flaky and starts passing (or failing consistently).

These are both valid objections.

The counterargument I've heard (and made) to the first is that you can't actually trust the coverage the test might be giving you, since it's too hard to tell different failure modes apart; the value you get from continuing to run it is outweighed by the confusion of looking at ongoing failures and trying to figure out whether they're new or not.

The counterargument to the second is that the cost of continuing to run the flaky test (both in itself, and in potential knock-on costs due to the way we run tests, where one test might interfere with another) outweighs the value you get by doing so (i.e., things don't fix themselves often enough, or rapidly enough).

These objections were discussed in the thread [3] linked from the current documentation. That discussion happened in 2012, so, quite some time ago. I don't remember offhand discussions that have happened since then, but I do know that we got rid of the ability -- in gtests -- to mark a test as FLAKY, so there currently isn't a way to keep running a test but ignore failures. We could arguably make this possible, but for the moment I'm going to declare that a separate discussion.

There are related objections and topics that are relevant. First, there's the question of how we would track disabled tests, and lindsayw@ has been working on processes that we'll start following soon for this. Second, there's the question of measuring test (and code) coverage, so we can start to learn whether this matters or not, and baxley@ and liaoyuke@ and others are working on a plan for that as well.

So, to repeat: I think we should be disabling flaky tests, and I think the docs need to be updated to make that extra clear. 

Anyone strongly object to this?

-- Dirk

[1] I double-checked with a few other folks just to make sure I wasn't misremembering

--
--
Chromium Developers mailing list: chromi...@chromium.org
View archives, change email options, or unsubscribe:
http://groups.google.com/a/chromium.org/group/chromium-dev
---
You received this message because you are subscribed to the Google Groups "Chromium-dev" group.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/chromium-dev/CAEoffTC%2BfeGZ3CUkAvN4E9PHfBYkkr3CK92Ta%2Bgef0-1%2BkCyYg%40mail.gmail.com.


Dirk Pranke

unread,
Nov 14, 2017, 5:27:47 PM
to Wez, Scott Violet, chromium-dev, lind...@chromium.org, bax...@chromium.org, Yuke Liao
Thanks, Wez.

I support the removal of "retry-by-default" for test suites that are believed to be non-flaky (and should stay that way). I don't think we currently have a great sense of *which* test suites meet that bar, though I can name a few that I'm sure don't (*cough* layout tests *cough* ;). But, I think this is something we can work on pretty easily.

I am less convinced of the "10x" testing effort, if for no other reason than I don't know how we'd actually deploy that and be effective across all of the different build configurations that we have. *However*, I am very supportive of *investigating* that effort, seeing if we can come up with a plan, trying it, and seeing how it goes.

The Ops folks as a whole are spending a decent amount of time right now trying to wrangle our different flakiness efforts together (there's a bunch) and we will hopefully have a better story about what we're doing soon.

One of the challenges we have to face, particularly with the first point, is that we put the retries in to make things more resilient to failure; if we make things *less* resilient to failure in order to better identify the failures, we have to make sure we get better at fixing the failures faster (repairing). With progress on tools like FindIt and FlakeAnalyzer auto-reverting things, we're getting closer here, but there's still a lot more we can do.

-- Dirk

dan...@chromium.org

unread,
Nov 14, 2017, 5:54:12 PM
to Wez, Scott Violet, Dirk Pranke, chromium-dev, lind...@chromium.org, bax...@chromium.org, Yuke Liao
On Tue, Nov 14, 2017 at 5:13 PM, Wez <w...@chromium.org> wrote:

There are two bugs with proposals for reducing flake:

1. Add bots which run "10x" testing (crbug.com/782656 was filed for this for layout tests), to increase likelihood of repro'ing flake.

2. Remove retries from most tests (crbug.com/697215), so that flakey tests can't sneak past the bots so easily.

I feel that retries have made it easier to land code in the presence of flaky tests, at the expense of our ability to reliably detect bugs we introduce and ship a stable product. I have to give a +1 to not retrying tests.
 

Scott Violet

unread,
Nov 14, 2017, 7:00:18 PM
to Dana Jansens, Wez, Dirk Pranke, chromium-dev, Lindsay Pasricha, bax...@chromium.org, Yuke Liao
+1 to what Dana says.

  -Scott

Dirk Pranke

unread,
Nov 14, 2017, 7:13:37 PM
to Scott Violet, Dana Jansens, Wez, chromium-dev, Lindsay Pasricha, bax...@chromium.org, Yuke Liao
I don't think there's any doubt that Dana was right, at least about the first part: we intentionally chose to enable retries, knowing that it would allow more flakiness to get through.

I suspect it has indeed also made the product less stable. Of course, it has also allowed us to leave the tree open longer, and make the CQ run more effectively, so it's a mixed bag.

That said, I am in favor of changing what we can to prevent more flakiness without hurting us too much in the other direction. We're not, for example, at the point where I would turn off retries everywhere yet.

-- Dirk

Ken Russell

unread,
Nov 14, 2017, 7:17:31 PM
to Scott Violet, Dana Jansens, Wez, Dirk Pranke, chromium-dev, Lindsay Pasricha, bax...@chromium.org, Yuke Liao
I agree with Dana's point about removing retries from most if not all test suites. However, it needs to be recognized that sometimes core code changes cause existing tests to become flaky, rather than flaky tests allowing buggy or unreliable code to land. It's important that the cause of flakiness be understood before taking the next step, whether that's reverting a patch, disabling a test, or something else.

-Ken



Scott Violet

unread,
Nov 14, 2017, 11:52:15 PM
to Ken Russell, Dana Jansens, Wez, Dirk Pranke, chromium-dev, Lindsay Pasricha, bax...@chromium.org, Yuke Liao
Agreed. I'm going to quote myself in the thread this was split from: "That said, sheriffs need to make an effort before blindly disabling all tests. A test that has run reliably for a long time and suddenly starts flaking is typically an indication that something recent is to blame and should be reverted."

  -Scott

Colin Blundell

unread,
Nov 15, 2017, 4:48:33 AM
to s...@chromium.org, Ken Russell, Dana Jansens, Wez, Dirk Pranke, chromium-dev, Lindsay Pasricha, bax...@chromium.org, Yuke Liao
While I agree with everything that everyone's saying here, the cost to morale of repeatedly needing to retry CQ runs due to test flake (and of needing to investigate whether a CQ failure was really due to your change or not) was extremely high. In my experience, it was high enough that it was a major pain point of being a Chrome developer for some engineers. If we go back to that world, I think it needs to be coupled with an organizational commitment to valuing effort spent de-flaking the tree. Test flake is the definition of a tragedy-of-the-commons problem.

I would also note that, basically by definition, adding test retries presumably helped by far the most in exactly the test suites that exhibit the most flake :P.

Best,

Colin

Anthony Berent

unread,
Nov 15, 2017, 6:17:06 AM
to blun...@chromium.org, s...@chromium.org, Ken Russell, Dana Jansens, Wez, Dirk Pranke, chromium-dev, Lindsay Pasricha, bax...@chromium.org, Yuke Liao
Retries are already being removed on Java instrumentation tests (see https://docs.google.com/document/d/1u3PIlYZ8Op4k9jsVG-WNZ5UvBpbEjNEqDCoWG257Krs/edit). As part of this, tests that are known to be flaky, and are not easy to fix, are being annotated with @RetryOnFailure and will, for now, continue to be retried. Could something similar be done with other tests?

kyle...@chromium.org

unread,
Nov 15, 2017, 10:19:41 AM
to Chromium-dev, blun...@chromium.org, s...@chromium.org, k...@chromium.org, dan...@chromium.org, w...@chromium.org, dpr...@chromium.org, lind...@chromium.org, bax...@chromium.org, liao...@chromium.org
I've noticed that test retries have allowed tests that flake depending on the test execution order to land. Test T1 runs and modifies some global state; then test T2 runs and fails because of that global state. When T2 is retried it passes, because the retry happens in a new process where that global state hasn't been changed. I would imagine running most of the big test targets with no retries and a permuted order (e.g. "--gtest_shuffle --test-launcher-retry-limit=0 --gtest_repeat=10") produces interesting flakes. For example, here is what I got running ash_unittests 10 times.

Summary of all test iterations:
2 tests failed:
    AutoclickTest.MouseMovement (../../ash/autoclick/autoclick_unittest.cc:140)
    WindowManagerTest.UpdateCursorVisibilityOnKeyEvent (../../ash/wm/window_manager_unittest.cc:776)
4 tests crashed:
    HighlighterControllerTest.DetachedClient (../../ash/highlighter/highlighter_controller_unittest.cc:431)
    MultiMirroringTest.SwitchToAndFromSoftwareMirrorMode (../../ash/display/display_manager_unittest.cc:3940)
    PaletteTrayTestWithVoiceInteraction.MetalayerToolActivatesHighlighter (../../ash/system/palette/palette_tray_unittest.cc:332)
    ToplevelWindowEventHandlerTest.GestureDragCaptureLoss (../../ash/wm/toplevel_window_event_handler_unittest.cc:974)
7 tests had unknown result:
    HighlighterControllerTest.HighlighterGestures (../../ash/highlighter/highlighter_controller_unittest.cc:173)
    HighlighterControllerTest.HighlighterPrediction (../../ash/highlighter/highlighter_controller_unittest.cc:148)
    MultiMirroringTest.HardwareMirrorMode (../../ash/display/display_manager_unittest.cc:3828)
    ToplevelWindowEventHandlerTest.EasyResizerUsedForTransient (../../ash/wm/toplevel_window_event_handler_unittest.cc:742)
    ToplevelWindowEventHandlerTest.EscapeReverts (../../ash/wm/toplevel_window_event_handler_unittest.cc:849)
    ToplevelWindowEventHandlerTest.GestureDragForUnresizableWindow (../../ash/wm/toplevel_window_event_handler_unittest.cc:773)
    ToplevelWindowEventHandlerTest.MinimizeMaximizeCompletes (../../ash/wm/toplevel_window_event_handler_unittest.cc:866)
End of the summary.
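The mechanism can be shown with a toy Python analogue (hypothetical tests, not Chromium code): T1 leaks global state, T2 only fails when it runs after T1 in the same process, and a retry in a fresh process hides the bug.

```python
# Toy illustration of order-dependent flake: module-level state shared
# by all tests that run in one process.
_cache_enabled = True

def test_t1():
    global _cache_enabled
    _cache_enabled = False  # mutates global state and never restores it
    return True

def test_t2():
    return _cache_enabled   # implicitly depends on the default state

def run_in_fresh_process(tests):
    """Simulate launching a new process: state starts clean."""
    global _cache_enabled
    _cache_enabled = True
    return [t() for t in tests]

assert run_in_fresh_process([test_t2, test_t1]) == [True, True]   # lucky order
assert run_in_fresh_process([test_t1, test_t2]) == [True, False]  # order flake
assert run_in_fresh_process([test_t2]) == [True]  # retry alone: "passes"
```

Shuffling the order (as --gtest_shuffle does) makes the unlucky ordering show up; retrying the failure alone in a clean process makes it vanish again.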

James Cook

unread,
Nov 15, 2017, 10:44:34 AM
to kyle...@chromium.org, Chromium-dev, Colin Blundell, Scott Violet, Kenneth Russell, Dana Jansens, w...@chromium.org, Dirk Pranke, lind...@chromium.org, bax...@chromium.org, liao...@chromium.org, Zelidrag Hornung, Darin Fisher
Can we have some managers weigh in on whether they're willing to commit people to fixing test flake instead of their other team priorities? Flaky tests are expensive to debug and fix, especially if the failures have been around for a while.

I'm happy to fix flake in areas of the code I'm familiar with, but I want to know that other teams are committed to doing the same in the other areas. In particular, I don't want someone to turn off retries for a test suite (browser_tests, I'm looking at you) before a group of people commits to fixing the flake this will reveal.

Just turning off retries without committing people to fixing the flake seems like it will make patches slow to land again and demoralize our teams, as blundell@ mentioned above.

dan...@chromium.org

unread,
Nov 15, 2017, 10:49:13 AM
to James Cook, kyle...@chromium.org, Chromium-dev, Colin Blundell, Scott Violet, Kenneth Russell, Wez, Dirk Pranke, Lindsay Pasricha, bax...@chromium.org, Yuke Liao, Zelidrag Hornung, Darin Fisher
I think it's also worth considering tooling that automatically disables tests that are flaky. That would greatly reduce the number of flaky tests we see blocking the CQ, without retrying them and hiding the failures from their authors. It seems to me that, given the many ways to write tests (TEST_P, macros that generate multiple test instances, and so forth), we'd need tooling external to the C++ code in order to accomplish this reliably, but I think it would be a huge win for our ability to reduce the impact of flake without hiding it.
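One plausible heuristic such external tooling could use (a sketch under assumptions, not an actual Chromium service or its data format): scan per-revision results and flag any test that both passed and failed at the same revision, since the code didn't change between those runs.

```python
from collections import defaultdict

def find_flaky(results):
    """results: iterable of (test_name, revision, passed) tuples.

    A test that both passed and failed at the same revision is flaky;
    a test that failed consistently is just broken, not flaky.
    """
    outcomes = defaultdict(set)
    for name, rev, passed in results:
        outcomes[(name, rev)].add(passed)
    return sorted({name for (name, _rev), seen in outcomes.items()
                   if seen == {True, False}})

history = [
    ("A.Solid",  "r100", True),  ("A.Solid",  "r100", True),
    ("B.Flaky",  "r100", True),  ("B.Flaky",  "r100", False),
    ("C.Broken", "r100", False), ("C.Broken", "r100", False),
]
assert find_flaky(history) == ["B.Flaky"]
```

Keying on (test, revision) pairs is what distinguishes flake from a genuine regression, which matters for Ken's later point that consistent new failures should lead to a revert, not a disable.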

Dirk Pranke

unread,
Nov 15, 2017, 12:57:45 PM
to Dana Jansens, kyle...@chromium.org, James Cook, Colin Blundell, Anthony Berent, Chromium-dev, Scott Violet, Kenneth Russell, Wez, Lindsay Pasricha, bax...@chromium.org, Yuke Liao, Zelidrag Hornung, Darin Fisher
Responding to a few different things at once:

To Colin's (and, presumably, others') concern: we will not change policies if it makes the CQ even flakier. The flake rate is something that we actively monitor, and it's already too high. This is not something I consider up for debate :).

And, given what I wrote in the other thread that started this, I expect us to take a stronger stand on disabling flaky tests. That will mean that people will "lose coverage", and hopefully be incentivized to work on fixing that. We are currently tracking the number of disabled tests we have, and we are also working on actually measuring coverage; we plan to regularly report these things to mgmt so that we can prioritize properly. I and others can talk more about what we're up to here if there's interest.

To Dana's point: we in ops are already working on tooling to make it easier to identify flakes and disable tests. I'm happy to talk about what we're doing here in a separate thread as well.

To Kyle's point: yes, absolutely some tests have ordering dependencies. We do some things to shuffle test execution to help identify this, and we can certainly do more.

And, finally, to Anthony's comment about instrumentation tests that are being annotated with @RetryOnFailure, that's exactly the sort of thing I think we should *not* be doing. We should disable such tests, not retry them. But, let's fork yet another thread to talk about that specific example.

-- Dirk


Colin Blundell

unread,
Nov 15, 2017, 1:01:52 PM
to Dirk Pranke, Dana Jansens, kyle...@chromium.org, James Cook, Colin Blundell, Anthony Berent, Chromium-dev, Scott Violet, Kenneth Russell, Wez, Lindsay Pasricha, bax...@chromium.org, Yuke Liao, Zelidrag Hornung, Darin Fisher
On Wed, Nov 15, 2017 at 6:56 PM Dirk Pranke <dpr...@chromium.org> wrote:
Responding to a few different things at once:

To Colin's (and, presumably, others') concern: we will not change policies if it makes the CQ even flakier. The flake rate is something that we actively monitor, and it's already too high. This is not something I consider up for debate :).

This statement sounds good to me! But how would removing retry *not* make the CQ flakier as an immediate consequence?

Dirk Pranke

unread,
Nov 15, 2017, 1:30:50 PM
to Colin Blundell, Dana Jansens, kyle...@chromium.org, James Cook, Anthony Berent, Chromium-dev, Scott Violet, Kenneth Russell, Wez, Lindsay Pasricha, bax...@chromium.org, Yuke Liao, Zelidrag Hornung, Darin Fisher
On Wed, Nov 15, 2017 at 9:59 AM, Colin Blundell <blun...@chromium.org> wrote:
This statement sounds good to me! But how would removing retry *not* make the CQ flakier as an immediate consequence?

Well, yeah, okay, any removal of retries should actually make things flakier, by definition :).

But, the lion's share of CQ failures comes from things that are known to be flaky (like telemetry_unittests). If we stop retrying failures for something that basically never fails, like webkit_python_unittests, that'll have no net effect on the overall flakiness, but it might help catch new flakes a bit faster (by not retrying them).

We added retries (partially) because we weren't doing enough to stop things from being flaky in the first place. I think the intent is to do more to make things less flaky, but having retries can actively hinder such efforts.
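The back-of-the-envelope arithmetic for how retries hide flake, assuming independent runs and illustrative failure rates:

```python
# With k retries, a test that fails a fraction p of its runs only turns
# the bot red when all 1 + k attempts fail (assuming independence).

def observed_failure_rate(p, retries):
    return p ** (1 + retries)

# A 10%-flaky test behind 2 retries reddens only ~0.1% of runs...
assert observed_failure_rate(0.10, 2) == 0.10 ** 3
# ...while turning off retries exposes the full 10% flake rate.
assert observed_failure_rate(0.10, 0) == 0.10
```

This is why removing retries both makes the CQ immediately flakier and makes new flakes roughly a thousand times easier to notice, for the rates in this example.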

-- Dirk

Scott Graham

unread,
Nov 15, 2017, 1:34:52 PM11/15/17
to kyle...@chromium.org, chromium-dev, Colin Blundell, Scott Violet, Kenneth Russell, Dana Jansens, Wez, Dirk Pranke, lind...@chromium.org, bax...@chromium.org, liao...@chromium.org
On Wed, Nov 15, 2017 at 7:21 AM <kyle...@chromium.org> wrote:
I've noticed that test retries have allowed tests that flake depending on the test execution order to land. Test T1 runs and modifies some global state, then test T2 runs and fails because of that global state. When T2 is retried it passes because the retry happens in a new process where that global state hasn't been changed. I would imagine running most of the big test targets with no retries and permuting the order (eg. "--gtest_shuffle --test-launcher-retry-limit=0 --gtest_repeat=10") produces interesting flakes. For example, here is what I got running ash_unittests 10 times.

Even on this seemingly simple point, the problem is non-trivial. I wrote a little script to do exhaustive pairwise checking and ran it on base_unittests. At the time I ran it, 14331 of the 8151025 pairs failed; here's the list*. This is pretty "easy" flake, but sometimes even these can take some work to fix, and that's only one of the smaller binaries on one platform (Linux ASAN base_unittests in this case).

I feel our tests are in a deep hole. They require substantial investment. We don't need more tools/dashboards/bots/policies/instrumentation; it's easy to find flaky tests. We "just" need many engineering hours of work, and the will to spend them on this.



* I can share the whole log with failure output if anyone wants it, but it's 1.3G.

 

Lei Zhang

unread,
Nov 15, 2017, 2:33:27 PM
to Scott Graham, kyle...@chromium.org, chromium-dev, Colin Blundell, Scott Violet, Kenneth Russell, Dana Jansens, Wez, Dirk Pranke, lind...@chromium.org, bax...@chromium.org, liao...@chromium.org
Unit tests are especially fragile, as test cases share the same process address space. It is easy for one test to set a global and then forget to restore it, potentially ruining the rest of the test run in subtle ways. Test sharding and retries mask some of these problems, and that concerns me. Do we have stats on which tests need retrying the most? That's a strong signal that those tests are problematic in some way.
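The stat asked about here is cheap to compute if per-test attempt counts are logged. A sketch under an assumed log format (the tuple shape is hypothetical, not an actual launcher output):

```python
from collections import Counter

def retry_counts(runs):
    """runs: iterable of (test_name, attempts_until_pass) tuples.

    Returns tests ranked by how many extra attempts they consumed,
    which is a direct measure of how often retries saved them.
    """
    c = Counter()
    for name, attempts in runs:
        if attempts > 1:
            c[name] += attempts - 1
    return c.most_common()

log = [("Tab.Close", 3), ("Tab.Close", 2), ("Net.Fetch", 1), ("Gfx.Draw", 2)]
assert retry_counts(log) == [("Tab.Close", 3), ("Gfx.Draw", 1)]
```

Tests at the top of this ranking are exactly the ones whose flake is currently being absorbed silently by the retry machinery.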

Ken Russell

unread,
Nov 15, 2017, 2:43:05 PM
to Dana Jansens, James Cook, kyle...@chromium.org, Chromium-dev, Colin Blundell, Scott Violet, Wez, Dirk Pranke, Lindsay Pasricha, bax...@chromium.org, Yuke Liao, Zelidrag Hornung, Darin Fisher
There are some test suites which are rock solid and in which the vast majority of newly flaky failures are caused by bugs introduced in the browser itself. I'm not in favor of automatically disabling flaky tests because a bug which introduces bad crashes in the browser (for example) will cause many random tests to fail and then to be automatically disabled. This has happened in our test suite as recently as the past week. Analysis needs to be done on every instance of test failures to understand what the true root cause was.


Ken Russell

unread,
Nov 15, 2017, 2:51:53 PM
to Dirk Pranke, Colin Blundell, Dana Jansens, kyle...@chromium.org, James Cook, Anthony Berent, Chromium-dev, Scott Violet, Wez, Lindsay Pasricha, bax...@chromium.org, Yuke Liao, Zelidrag Hornung, Darin Fisher
On Wed, Nov 15, 2017 at 10:28 AM, Dirk Pranke <dpr...@chromium.org> wrote:

But, the lion's share of CQ failures comes from things that are known to be flaky (like telemetry_unittests). If we stop retrying failures for something that basically never fails, like webkit_python_unittests, that'll have no net effect on the overall flakiness, but it might help catch new flakes a bit faster (by not retrying them).

I disagree with this assessment and think it unfairly singles out a test suite which inherently acts as a full-browser integration test, uncovering random crashes throughout the code base.

Over the past couple of weeks the flakiness situation has gotten really bad, with the win7_chromium_rel_ng, android_n5x_swarming_rel and linux_android_rel_ng tryservers sometimes flaking on as many as 50% of their try runs, and in multiple test suites, none of which were telemetry_unittests. I've personally filed multiple bugs when seeing failures on CLs, and so have others, and most of the resolutions were real bugs in the product, not just bugs in the tests.

We need to commit as a team to prioritize diagnosing and fixing these flaky failures. Also, if the chromium-try-flakes tool isn't working (there seems to be at least one issue with it -- Issue 784739), we should prioritize bringing it back on line.

-Ken

Ken Rockot

unread,
Nov 15, 2017, 3:15:58 PM
to Kenneth Russell, Dirk Pranke, Colin Blundell, Dana Jansens, kyle...@chromium.org, James Cook, Anthony Berent, Chromium-dev, Scott Violet, Wez, Lindsay Webster, Mike Baxley, Yuke Liao, Zelidrag Hornung, Darin Fisher
+1 to much of this thread.

I'd like to see no more retries - or even better, retries (2x or 3x may be more reasonable than 10x?) with the requirement that all runs must pass - but I think it first requires a dedication of resources toward de-flaking existing tests.

Once that's done, we'd ideally then shift those same resources toward re-enabling all the forgotten disabled tests across the tree.


Yuri Wiitala

unread,
Nov 15, 2017, 3:28:26 PM
to Ken Rockot, Kenneth Russell, Dirk Pranke, Colin Blundell, Dana Jansens, kyle...@chromium.org, James Cook, Anthony Berent, Chromium-dev, Scott Violet, Wez, Lindsay Webster, Mike Baxley, Yuke Liao, Zelidrag Hornung, Darin Fisher
I'm going to take a very different stance here and disagree with much of this thread. I think the focus has been on the tests themselves, and on only a single choice: whether to disable them. This is like throwing the baby out with the bath water.

A test was originally written because someone thought it was worth their time to sanity-check some product feature, heuristic, etc. So, by disabling the test, you (as an agent that has no context whatsoever) are overriding that decision. Thus, original authors and/or OWNERS really do need to be a part of the process here. For example, in the past I've had people wanting to disable a unit test related to base::Time that would have left open the possibility for security vulnerabilities (e.g., certificate validation requires a reliable clock).

I would propose that tests be allowed to be flaky for 180 days. When a test is discovered to be flaky, create a crbug and start the clock. During the 180 days, `git blame` history and OWNERS could be used to generate weekly nag e-mails to incentivize action. After 180 days with no action, the test can be automatically disabled and the crbug left open.
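The proposed clock is simple to state in code. The 180-day grace period and weekly nag cadence come from the proposal above; the helper itself is a hypothetical sketch, not existing tooling:

```python
from datetime import date, timedelta

GRACE = timedelta(days=180)  # grace period from the proposal
NAG_PERIOD_DAYS = 7          # weekly nag e-mails

def next_action(discovered, today):
    """Decide what the tooling should do for a flaky test today."""
    age = today - discovered
    if age >= GRACE:
        return "auto-disable"        # crbug stays open
    if age.days % NAG_PERIOD_DAYS == 0:
        return "nag"                 # e-mail blame/OWNERS
    return "wait"

d = date(2017, 11, 15)
assert next_action(d, d + timedelta(days=7)) == "nag"
assert next_action(d, d + timedelta(days=10)) == "wait"
assert next_action(d, d + timedelta(days=180)) == "auto-disable"
```

The point of encoding it this way is that "disable" stops being a judgment call made by a context-free sheriff and becomes the documented endpoint of a process the owners had 180 days to interrupt.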


Yuri Wiitala

unread,
Nov 15, 2017, 3:32:29 PM
to Ken Rockot, Kenneth Russell, Dirk Pranke, Colin Blundell, Dana Jansens, kyle...@chromium.org, James Cook, Anthony Berent, Chromium-dev, Scott Violet, Wez, Lindsay Webster, Mike Baxley, Yuke Liao, Zelidrag Hornung, Darin Fisher
There is also another problem: in the Chromium project we completely lack an incentive structure for maintaining our tests and tree. People are free to commit code changes, even substantial major features, with ZERO responsibility for the outcome. Sometimes this is due to laziness, but more often it's that teams never allocate devs to take the time to do the right thing (i.e., attend to tests that have broken in a timely manner).

I'm not sure what to do about this. Perhaps we should ban new code commits until tests are fixed? But, that'd probably cause folks to just delete tests; and that's not necessarily the best solution, right? Any ideas for fixing our incentive structure?


Yuri Wiitala

unread,
Nov 15, 2017, 3:35:07 PM
to Ken Rockot, Kenneth Russell, Dirk Pranke, Colin Blundell, Dana Jansens, kyle...@chromium.org, James Cook, Anthony Berent, Chromium-dev, Scott Violet, Wez, Lindsay Webster, Mike Baxley, Yuke Liao, Zelidrag Hornung, Darin Fisher
...and I should add, sometimes people can't attend to tests because, logistically, nobody is active anymore on a particular project.

Fady Samuel

unread,
Nov 15, 2017, 3:36:10 PM
to m...@chromium.org, Ken Rockot, Kenneth Russell, Dirk Pranke, Colin Blundell, Dana Jansens, kyle...@chromium.org, James Cook, Anthony Berent, Chromium-dev, Scott Violet, Wez, Lindsay Webster, Mike Baxley, Yuke Liao, Zelidrag Hornung, Darin Fisher
Could a test flake fixit be useful here to encourage folks to step up?

Fady

To unsubscribe from this group and stop receiving emails from it, send an email to chromium-dev...@chromium.org.

--
--
Chromium Developers mailing list: chromi...@chromium.org
View archives, change email options, or unsubscribe:
http://groups.google.com/a/chromium.org/group/chromium-dev
---
You received this message because you are subscribed to the Google Groups "Chromium-dev" group.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/chromium-dev/CA%2BapAgE57%2BNbBSZuvbvZz3_OvKnPhKWVKAze%3DmkornO-U2RmJw%40mail.gmail.com.


Dirk Pranke

unread,
Nov 15, 2017, 4:36:26 PM11/15/17
to Ken Russell, Colin Blundell, Dana Jansens, kyle...@chromium.org, James Cook, Anthony Berent, Chromium-dev, Scott Violet, Wez, Lindsay Pasricha, bax...@chromium.org, Yuke Liao, Zelidrag Hornung, Darin Fisher
On Wed, Nov 15, 2017 at 11:50 AM, Ken Russell <k...@chromium.org> wrote:
On Wed, Nov 15, 2017 at 10:28 AM, Dirk Pranke <dpr...@chromium.org> wrote:
On Wed, Nov 15, 2017 at 9:59 AM, Colin Blundell <blun...@chromium.org> wrote:


On Wed, Nov 15, 2017 at 6:56 PM Dirk Pranke <dpr...@chromium.org> wrote:
Responding to a few different things at once:

To Colin's (and, presumably, others') concern: we will not change policies if it makes the CQ even flakier. The flake rate is something that we actively monitor, and it's already too high. This is not something I consider up for debate :).

This statement sounds good to me! But how would removing retry *not* make the CQ flakier as an immediate consequence?

Well, yeah, okay, any removal of retries should actually make things flakier, by definition :).

But the lion's share of CQ failures comes from things that are known to be flaky (like telemetry_unittests). If we stop retrying failures for something that basically never fails, like webkit_python_unittests, that'll have no net effect on overall flakiness, but it might help catch new flakes a bit faster (by not retrying them).

I disagree with this assessment and think it unfairly singles out a test suite which inherently acts as a full-browser integration test, uncovering random crashes throughout the code base.

You're right, of course; there are multiple highly flaky test suites. I meant no ill will to the telemetry_unittests' maintainers, who I know are at least working very hard to try and deflake their suite, which is more than I can say for some of the others.

-- Dirk

John Budorick

unread,
Nov 15, 2017, 5:58:08 PM11/15/17
to Dirk Pranke, Kenneth Russell, blun...@chromium.org, Dana Jansens, kyle...@chromium.org, jame...@chromium.org, Anthony Berent, Chromium-dev, Scott Violet, w...@chromium.org, lind...@chromium.org, bax...@chromium.org, liao...@chromium.org, zeli...@chromium.org, da...@chromium.org
I'm supportive of working toward eliminating default retries, but I'd caution that we're a long way from being able to do so.

Anthony brings up a relevant and, in my opinion, cautionary tale here. We started trying to do this last year in Android instrumentation tests. I've come to believe that our effort, while well-intentioned, was flawed. We started with some of the largest tests in Chrome, and we made the default assumption that tests were not flaky until we saw evidence otherwise. Given that we've built both our tests and our code on this unstable foundation of retries, I don't think either choice was correct. (I believe this also ties into Ken's comment -- the lack of default retries is part of the reason why linux_android_rel_ng and android_n5x_swarming_rel flake at an elevated rate.)

Lei, you asked about information as to when tests are retried. To my knowledge, that's only readily available on the flakiness dashboard. For example, here's browser_tests on linux_chromium_rel_ng. Light grey is a failure followed by a pass. Black is at least two failures. Were we to disable retries on this suite, it would basically never pass.

If we want to eliminate the default retries, I think we'll need to:
 - Improve (or resurrect) tooling to track flaky tests and inform owners when their tests are flaky, and enable it for tests that pass on retry. Dirk mentioned the Ops team's current efforts at consolidating and improving our flakiness tooling; this falls into that bucket. (Scott, while I'd agree that we have more flaky tests than we can shake a stick at, I don't think we can rely on folks to find and fix flaky tests of their own accord -- that's basically our current state.)
 - Gradually disable retries on a per-suite basis, starting from the unit test suites, once we have enough information to believe that they won't flake without the crutch of retries.

John Abd-El-Malek

unread,
Nov 15, 2017, 6:06:08 PM11/15/17
to John Budorick, Dirk Pranke, Kenneth Russell, Colin Blundell, Dana Jansens, kyle...@chromium.org, James Cook, Anthony Berent, Chromium-dev, Scott Violet, Wez, lind...@chromium.org, bax...@chromium.org, liao...@chromium.org, Zelidrag Hornung, Darin Fisher
One thing to add: test retries were originally added when we started sharding tests (i.e., multiple tests running at the same time on a single machine). This improved machine utilization a lot, but it did lead to more flakiness because some tests were starved for resources. This doesn't mean that the tests are flaky; it's just a tradeoff. Of course, it also doesn't mean we need to retry the tests 3x in serial; maybe we should aim for only 1x.

Dirk Pranke

unread,
Nov 15, 2017, 7:51:16 PM11/15/17
to Yuri Wiitala, Ken Rockot, Kenneth Russell, Colin Blundell, Dana Jansens, kyle...@chromium.org, James Cook, Anthony Berent, Chromium-dev, Scott Violet, Wez, Lindsay Webster, Mike Baxley, Yuke Liao, Zelidrag Hornung, Darin Fisher
Replying to these three messages, in reverse order.

First, many people have said on the bug that the sheriff (or whomever) should do due diligence to figure out why a test is failing before disabling it. Usually this means there's a CL that landed that should be reverted instead. However, failing tests, although they no longer close the tree, do impose a cost on the project, both in terms of ignoring "irrelevant" failures and in terms of increased retries of jobs in the CQ (the with/without-the-patch dance).

Second, once a test starts flaking, it contributes almost no signal. Depending on the failure, it might be impossible to tell whether a failure is something new or just "the flake". More importantly, most of the time it does more harm than good, because of the extra time you have to spend figuring out what's going on.

Though, I do agree there is an important caveat: sometimes the flakes are the fault of bugs in the product, in the test harnesses, or in the way that we're running tests, so you probably shouldn't just disable tests the first time something fails. And that's why it is important that we get to a baseline stable state, which we currently don't have in many places.

Third, regarding the incentives for fixing things: generally speaking, in Google at least we mostly set our own goals, and I've almost never heard mgmt say that we should prioritize shipping features over fixing tests and ensuring product quality. So, if you think you're being asked to do that, you should speak up. If you need more voices to your cause, come talk to me and I'll see if I can help :).

Fourth, I do understand that people leave and move around, and I understand that we have a very large codebase. However, we (Ops) at least expect all of the code to be owned and for engineering to be on the hook to fix things in the codebase, because we're shipping this code. If you think there are places where things are currently un-owned and you're having trouble getting mgmt to step up for this, again, come talk to me and I'll see if I can help.

I hope this helps,

-- Dirk

Dirk Pranke

unread,
Nov 15, 2017, 7:55:14 PM11/15/17
to Scott Graham, kyle...@chromium.org, chromium-dev, Colin Blundell, Scott Violet, Kenneth Russell, Dana Jansens, Wez, Lindsay Pasricha, bax...@chromium.org, Yuke Liao
On Wed, Nov 15, 2017 at 10:33 AM, Scott Graham <sco...@chromium.org> wrote:



On Wed, Nov 15, 2017 at 7:21 AM <kyle...@chromium.org> wrote:
I've noticed that test retries have allowed tests to land that flake depending on the test execution order. Test T1 runs and modifies some global state, then test T2 runs and fails because of that global state. When T2 is retried, it passes, because the retry happens in a new process where that global state hasn't been changed. I would imagine that running most of the big test targets with no retries and permuting the order (e.g. "--gtest_shuffle --test-launcher-retry-limit=0 --gtest_repeat=10") produces interesting flakes. For example, here is what I got running ash_unittests 10 times.

Summary of all test iterations:
2 tests failed:
    AutoclickTest.MouseMovement (../../ash/autoclick/autoclick_unittest.cc:140)
    WindowManagerTest.UpdateCursorVisibilityOnKeyEvent (../../ash/wm/window_manager_unittest.cc:776)
4 tests crashed:
    HighlighterControllerTest.DetachedClient (../../ash/highlighter/highlighter_controller_unittest.cc:431)
    MultiMirroringTest.SwitchToAndFromSoftwareMirrorMode (../../ash/display/display_manager_unittest.cc:3940)
    PaletteTrayTestWithVoiceInteraction.MetalayerToolActivatesHighlighter (../../ash/system/palette/palette_tray_unittest.cc:332)
    ToplevelWindowEventHandlerTest.GestureDragCaptureLoss (../../ash/wm/toplevel_window_event_handler_unittest.cc:974)
7 tests had unknown result:
    HighlighterControllerTest.HighlighterGestures (../../ash/highlighter/highlighter_controller_unittest.cc:173)
    HighlighterControllerTest.HighlighterPrediction (../../ash/highlighter/highlighter_controller_unittest.cc:148)
    MultiMirroringTest.HardwareMirrorMode (../../ash/display/display_manager_unittest.cc:3828)
    ToplevelWindowEventHandlerTest.EasyResizerUsedForTransient (../../ash/wm/toplevel_window_event_handler_unittest.cc:742)
    ToplevelWindowEventHandlerTest.EscapeReverts (../../ash/wm/toplevel_window_event_handler_unittest.cc:849)
    ToplevelWindowEventHandlerTest.GestureDragForUnresizableWindow (../../ash/wm/toplevel_window_event_handler_unittest.cc:773)
    ToplevelWindowEventHandlerTest.MinimizeMaximizeCompletes (../../ash/wm/toplevel_window_event_handler_unittest.cc:866)
End of the summary.
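The flags quoted above can be wrapped in a tiny driver to make the intent explicit. A sketch (the binary path is whatever target you built, e.g. out/Release/ash_unittests; the flags themselves are exactly the ones quoted in the message):

```python
import subprocess

# Flags from the message above: shuffle order, disable retries,
# and repeat the whole suite 10 times in one process.
FLAKE_HUNT_FLAGS = [
    "--gtest_shuffle",
    "--test-launcher-retry-limit=0",
    "--gtest_repeat=10",
]

def hunt_flakes(test_binary: str) -> int:
    """Run a gtest binary with order-shuffling and retries disabled.

    Returns the binary's exit code; nonzero means at least one
    iteration failed, which is exactly the flake signal we want.
    """
    result = subprocess.run([test_binary, *FLAKE_HUNT_FLAGS])
    return result.returncode
```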

Even on this seemingly simple point, the problem is non-trivial. I wrote a little script to do exhaustive pairwise checking, and then ran it on base_unittests. At the time I ran it, 14331 of the 8151025 pairs failed, here's the list*. This is pretty "easy" flake, but sometimes even these can take some work to fix, and that's only one of the smaller binaries on one platform (Linux ASAN base_unittests in this case). 

I feel our tests are in a deep hole. They require substantial investment. We don't need more tools/dashboards/bots/policies/instrumentation; it's easy to find flaky tests. We "just" need many engineering hours of work, and the will to spend them on this.
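Scott's actual script isn't shown, but an exhaustive pairwise check can be sketched roughly as follows (the function names are mine, not his; note that gtest runs filtered tests in declaration order, so this covers each unordered pair once, and the loop is O(n²) process launches, which is why base_unittests yields millions of pairs):

```python
import subprocess
from itertools import combinations

def parse_test_list(listing: str) -> list:
    """Parse `--gtest_list_tests` output ("Suite." lines followed by
    indented case lines) into fully qualified Suite.Case names."""
    tests, suite = [], ""
    for line in listing.splitlines():
        if not line.strip() or line.startswith("Running"):
            continue  # skip blanks and the gtest banner line
        if line.startswith(" "):
            # Drop trailing "# GetParam() = ..." comments on case lines.
            tests.append(suite + line.split("#")[0].strip())
        else:
            suite = line.strip()  # suite names already end with "."
    return tests

def failing_pairs(binary: str) -> list:
    """Run each pair of tests together in one process; a nonzero exit
    flags the pair as interacting badly (assuming each passes alone)."""
    listing = subprocess.run(
        [binary, "--gtest_list_tests"], capture_output=True, text=True
    ).stdout
    bad = []
    for t1, t2 in combinations(parse_test_list(listing), 2):
        rc = subprocess.run(
            [binary, f"--gtest_filter={t1}:{t2}"], capture_output=True
        ).returncode
        if rc != 0:
            bad.append((t1, t2))
    return bad
```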

I think we need to be clear about our existing policies (which is what started this :). I tend to agree that we don't need more tools or dashboards or instrumentation for finding flaky tests. I also agree that we need the will of engineering to work on this (Ops can't and won't try to fix the flakiness problem for you).

I do actually think we need a bit more dashboards and reports to figure out exactly what flakiness is costing us, which is what I've asked my team to work on getting. It may be that we're willing to accept a flakiness rate of X%, because getting better than that is simply too expensive in terms of hardware or dev time. I think if we could surface and agree to that, then at least we'd avoid some of these sorts of discussions, too.

-- Dirk

Mike Lawther

unread,
Nov 15, 2017, 8:22:37 PM11/15/17
to dpr...@chromium.org, Scott Graham, kyle...@chromium.org, chromium-dev, Colin Blundell, Scott Violet, Kenneth Russell, Dana Jansens, Wez, Lindsay Pasricha, bax...@chromium.org, Yuke Liao
I also came here to say what Dirk beat me to: a flaky test has lost its signal. It may have once tested something, but if it no longer does so reliably, then it's not doing the job it's supposed to.

We need to have confidence in our tests. Too many times I've heard from a new developer 'uh oh, the trybots failed on my change!', and the advice is 'ah, just try it again, it's probably flake somewhere'.

I think this can be sliced into two problems:
 1) harden our existing corpus of tests
 2) do not introduce any more flaky tests

There's been a lot of discussion about 1), and it's a hard problem. I don't think it's a prerequisite for 2) though.

For 2), we could aim to test the heck out of *new* tests. Strawman - any new tests added with a CL get the 10x/random-ordering etc. treatment to try our hardest to determine if they are flaky. If they pass, we have higher confidence that we're not making the 1) problem any worse.


erik...@chromium.org

unread,
Nov 16, 2017, 4:15:50 PM11/16/17
to Chromium-dev, dpr...@chromium.org, sco...@chromium.org, kyle...@chromium.org, blun...@chromium.org, s...@chromium.org, k...@chromium.org, dan...@chromium.org, w...@chromium.org, lind...@chromium.org, bax...@chromium.org, liao...@chromium.org
There's been some discussion on this thread about disabling retries built into tests or test suites. The very existence of these retries is indicative of an underlying problem [poorly written test, unstable harness, etc.] but getting rid of the retries amplifies the problem.

AFAICT, the CQ will automatically retry entire builds on certain types of failure [starting from compile, and running all test suites]. And as a developer, if I see a failure that looks like a flake, I'll just mash the CQ button until the CL actually goes in.

I think that there's agreement that:
  • Flaky tests are really bad.
  • We'd like to move towards a world with no retries.
  • Fixing flaky tests is hard. 
    • There's disagreement about what we should do if a test flakes. Should we disable it, or try to fix it, or contact an original owner? I think the answer is context dependent.
Proposal:
All tests should be divided into two buckets:
  • Bucket 1: Finely tuned tests that are NOT flaky. These tests are robust to being sharded, run in different orders, and subjected to different types of thread/task scheduling. If such a test begins to fail or flake, something REALLY BAD is happening. Fixing the test should be at least Priority-1. If the test needs to be disabled, a ReleaseBlock bug should be filed. Ideally, these tests will have long-lasting teams as OWNERs. If they do not, responsibility for them falls to the Chromium sheriffs.
  • Bucket 2: All other tests.
By default, all tests start in Bucket 2. Moving tests into Bucket 1 requires specifying explicit owners. After some period of non-flakiness [6 months?], owners can be removed if no long-lasting team is available. Bucket 2 tests can be disabled on flake [with a non-RB bug filed]. We'd like all tests to eventually be in Bucket 1. We incentivize teams/projects to move tests over with a couple of fix-its and a graph showing Bucket 1 vs. Bucket 2 stats. New projects *must* land tests in Bucket 1. Yes, this requires being around [e.g. for at least 6 months] to fix flaky tests, potentially after the feature has shipped and the team has disbanded.

In order to root out flakiness earlier in Bucket 1, tests are always run with random sharding parameters, with random orders, potentially with randomness introduced into the Chrome task schedulers.
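The "random sharding parameters, random orders" idea in the last paragraph could be driven by something like this sketch (GTEST_TOTAL_SHARDS/GTEST_SHARD_INDEX and --gtest_shuffle/--gtest_random_seed are standard gtest knobs; the driver itself, including the shard-count range, is hypothetical):

```python
import os
import random
import subprocess

def run_with_random_scheduling(binary: str, max_shards: int = 8) -> list:
    """Run a gtest binary with a randomly chosen shard count and a
    random shuffle seed, so order- and sharding-dependent flakes
    surface instead of hiding behind one fixed configuration.

    Returns the per-shard exit codes (all zero means a clean run).
    """
    shards = random.randint(1, max_shards)
    seed = random.randint(1, 99998)  # gtest accepts seeds 0..99998
    codes = []
    for index in range(shards):
        env = {**os.environ,
               "GTEST_TOTAL_SHARDS": str(shards),
               "GTEST_SHARD_INDEX": str(index)}
        codes.append(subprocess.run(
            [binary, "--gtest_shuffle", f"--gtest_random_seed={seed}"],
            env=env,
        ).returncode)
    return codes
```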

Thoughts?

Dirk Pranke

unread,
Nov 16, 2017, 11:18:12 PM11/16/17
to Alexei Svitkine, Mike Lawther, Scott Graham, kyle...@chromium.org, chromium-dev, Colin Blundell, Scott Violet, Kenneth Russell, Dana Jansens, Wez, Lindsay Pasricha, Mike Baxley, Yuke Liao
This could work, but I'd like to gather a bit more data before we start trying to do things like this. My team is working on this right now, and so hopefully we will have something in the very near future that we can share.

-- Dirk

On Thu, Nov 16, 2017 at 7:49 AM, Alexei Svitkine <asvi...@google.com> wrote:
(from the right email this time)

Scott's list of failing base unit test pairs seems like a good, actionable list for deflaking that test suite. Perhaps we can start by finding owners for all of those failures, getting to a point where all those tests are either fixed or disabled (at the owners' discretion), and then turning off retries for that test suite to see how that looks? Then we can use that to inform whether we want to do the same for other suites or whether we need a different approach.

I'm certainly happy to look at any flakes in base metrics tests for my share. Hopefully we can find owners for the other tests too.


Shuotao Gao

unread,
Nov 20, 2017, 3:10:39 PM11/20/17
to Mike Lawther, Dirk Pranke, Scott Graham, kyle...@chromium.org, chromium-dev, Colin Blundell, Scott Violet, Kenneth Russell, Dana Jansens, Wez, Lindsay Pasricha, bax...@chromium.org, Yuke Liao, John Abd-El-Malek
Actually, the Findit team is looking into 2): crbug.com/698385.

For all *gtest* steps on chromium.linux/Linux Tests, we found that:
  * there are ~10k *new* gtests added in the last 2 months
  * the total number of gtests grows almost linearly
as shown below and here if you want to interact with the data or look into a specific gtest step.

Based on the observation in Flake Analyzer, a good number of new tests are already flaky when they are added.
To get conclusive numbers, our next step is to figure out:
    * the percentage of these *new* tests that are flaky
    * the number of reruns needed to surface the flakiness of new tests
    * the potential overhead in CQ time and VM resources
All feedback is welcome!




Wez

unread,
Jan 8, 2018, 12:45:59 PM1/8/18
to Shuotao Gao, Mike Lawther, Dirk Pranke, Scott Graham, kyle...@chromium.org, chromium-dev, Colin Blundell, Scott Violet, Kenneth Russell, Dana Jansens, Lindsay Pasricha, bax...@chromium.org, Yuke Liao, John Abd-El-Malek
Interesting to observe that a substantial number of new tests are flaky when they land. Can we tell whether these tests landed successfully because they have a low flake rate, versus because they passed the second/third time thanks to the allowed retries?