Hi all,
I'm sheriff today, and in the course of my sheriff-y duties, I've realized that our policy for how we're supposed to handle flaky tests is perhaps unclear or not in broad agreement.
Specifically, it is my belief that we are supposed to disable any flaky tests we come across [1]. However, the documentation I can point at [2] suggests that you're supposed to let them run. I think the documentation needs to be changed.
However, before I do that, I wanted to see if there was something I'm missing.
From the discussions we've had in the past, the objections to disabling tests are generally twofold:
1) If you disable a flaky test, you lose whatever potential coverage that test might be giving us.
2) If you disable a flaky test, you can't easily tell if the test stops being flaky and starts passing (or failing consistently).
These are both valid objections.
The counterargument I've heard (and made) to the first is that you can't actually trust the coverage the test might be giving you, since it's too hard to tell different failure modes apart. The value of continuing to run the test is therefore outweighed by the confusion of looking at ongoing failures and trying to figure out whether they're new or not.
The counterargument to the second is that the cost of continuing to run the flaky test (both directly, and in potential knock-on costs, since the way we run tests means one test can interfere with another) outweighs the value of doing so; i.e., tests don't fix themselves often enough, or rapidly enough, to justify it.
These objections were discussed in the thread [3] linked from the current documentation. That discussion happened in 2012, so, quite some time ago. I don't remember offhand discussions that have happened since then, but I do know that we got rid of the ability -- in gtests -- to mark a test as FLAKY, so there currently isn't a way to keep running a test but ignore failures. We could arguably make this possible, but for the moment I'm going to declare that a separate discussion.
Two related topics are also relevant. First, there's the question of how we track disabled tests; lindsayw@ has been working on processes for this that we'll start following soon. Second, there's the question of measuring test (and code) coverage, so we can start to learn how much this matters; baxley@, liaoyuke@, and others are working on a plan for that as well.
So, to repeat: I think we should be disabling flaky tests, and I think the docs need to be updated to make that extra clear.
Anyone strongly object to this?
-- Dirk
[1] I double-checked with a few other folks just to make sure I wasn't misremembering