disabling flaky tests

Evan Martin

Feb 13, 2012, 1:18:31 PM
to chromium-dev
Summary: This week I’m going to mass-mark every test we’ve currently
marked as “flaky” as instead “disabled”.
This will cause such tests to stop running on the build bots. If
you’re currently tracking down why a test is flaky, or intend to do so
in the future, read on for how and why.


Currently, tests marked with a “FLAKY_” prefix still run on the bots,
but whether they pass or fail is ignored. In a recent run, 1% of all
tests we ran (more than a thousand) were marked FLAKY.

We currently have a web of interdependent problems around flaky tests
and it’s difficult to find which thread to first pull. When a test
begins to unreliably fail it’s difficult to determine the cause,
because the bots take a long time to cycle (providing fewer runs worth
of data and larger blamelists per change), but the bots take a long
time to cycle in part due to flaky tests taking time.

I intend to start pulling. Concretely, I plan to:
1) Use a sed script to rename every FLAKY_foo test to DISABLED_foo
(see the sketch just below for what this means at the gtest level).
2) Institute a zero-tolerance policy for flaky tests. If a test ever
fails for reasons other than a bad change causing it to regress, we
disable it.
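
To make the mechanics concrete, here is a minimal sketch of what the
rename means at the gtest level (illustrative test names, not real
Chromium tests): gtest itself skips anything prefixed DISABLED_ unless
the binary is run with --gtest_also_run_disabled_tests, while FLAKY_
means nothing to gtest; the bot side runs such tests but ignores their
results.

  // Illustrative only -- not an actual Chromium test.
  #include "testing/gtest/include/gtest/gtest.h"

  // Before: runs on the bots, but its pass/fail result is ignored.
  TEST(TabRestoreTest, FLAKY_RestoresPinnedTab) {
    // ... test body ...
  }

  // After the sweep: still compiled, but skipped at runtime by gtest.
  TEST(TabRestoreTest, DISABLED_RestoresPinnedTab) {
    // ... test body ...
  }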

Q1. Won't we lose test coverage?
A. We're already ignoring the success/fail of flaky tests. When they
fail, the bot stays green.

Q2. Shouldn't we instead fix those tests?
A. Yes, I agree 100%. But realistically that hasn't happened. We can
talk a lot about blame or what ought to be done but nobody has stepped
up. I am stepping up.

Q3. Shouldn’t we instead do [insert complicated engineering solution here]?
A. I have lots of ideas like this too, but unless you’re volunteering
to do such work then your idea isn’t very useful.

Q4. How will people diagnose a test that they think is flaky?
A. Once I’ve done this sweeping change, you can mark a test as FLAKY_
again and collect data for it. I just want to reset all tests that
aren't currently being worked on. (When I looked into which flaky
tests were slowest, I found the slowest one has had a bug open on it
for nearly a year with no action taken.) I would like to follow up this
change with some policy (either verbal or in code) like "if a test is
marked flaky for more than a few days it goes back to disabled unless
someone says they're working on it".
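
For reference, putting a test back under investigation is just the
rename in reverse. A rough sketch (made-up test name; the MAYBE_
indirection is the usual way to scope the marking to only the platform
where the flake shows up, and the exact platform define here is an
assumption):

  // Sketch, not a real Chromium test: re-enable as FLAKY_ on the one
  // platform being investigated, keep it enforced everywhere else.
  #include "build/build_config.h"
  #include "testing/gtest/include/gtest/gtest.h"

  #if defined(OS_MACOSX)
  #define MAYBE_RestoresPinnedTab FLAKY_RestoresPinnedTab
  #else
  #define MAYBE_RestoresPinnedTab RestoresPinnedTab
  #endif

  TEST(TabRestoreTest, MAYBE_RestoresPinnedTab) {
    // ... test body ...
  }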

Q5. Why not leave the tests marked flaky, and just only run the tests
marked flaky on a different bot?
A. When a test is flaky it's often flaky on just a specific
configuration or specific machine. To get full configuration coverage
for a separate flaky set of tests we'd need a second copy of every bot
in existence, from Mac 10.5 debug to Linux ChromeOS Aura Clang. It's
better to let people who want to analyze a test run it on exactly the
same bots that normal tests run on, via the above FLAKY_ mechanism.

Q6. Often a test is marked flaky but it’s due to some unrelated
problem, like the browser crashing on startup. It’s not fair to
disable the test!
A. Regardless of who is at fault, an unreliable test has a cost across
the team that outweighs its utility. (In fact, people who didn’t
author the test are perhaps even more impacted by it, as they
redundantly attempt to track down whether they’re at fault for
regressing it.)

Q7. How can I prevent my test from getting disabled?
A. The more isolated a test is from the rest of Chrome, the more
reliable it becomes and the faster it runs. (For example, each
individual browser_test takes around ten seconds to run on XP debug.)
Converting your tests to unit tests (as distinct from the integration
tests we usually do, see
http://en.wikipedia.org/wiki/Software_testing#Unit_testing ) will
help.
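
As a purely hypothetical illustration of that isolation point: logic
that today is only exercised end-to-end through a browser_test can
often be pulled out and verified directly, with no browser startup in
the loop.

  // Hypothetical example, not existing Chromium code: a small helper
  // tested as a plain unit test. No profile, no UI, no browser process
  // to boot, so it runs in milliseconds and has little to flake on.
  #include <string>

  #include "testing/gtest/include/gtest/gtest.h"

  namespace {

  std::string EnsureScheme(const std::string& input) {
    return input.find("://") == std::string::npos ? "http://" + input
                                                  : input;
  }

  }  // namespace

  TEST(EnsureSchemeTest, AddsSchemeWhenMissing) {
    EXPECT_EQ("http://example.com", EnsureScheme("example.com"));
    EXPECT_EQ("https://example.com", EnsureScheme("https://example.com"));
  }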

Q8. What is the performance impact of this?
A. I grabbed the output for a single cycle across all the bots earlier
this week.
Skipping all flaky tests would reduce the total wall time spent in
tests by 9%, saving just under an hour. But because we run bots in
parallel, the actual cycle-time impact varies per bot. The most
affected test suite (by percentage) would shave off 96% of its
runtime, and the most affected bot (by absolute time) would shave off
6 minutes (10%).

James Hawkins

Feb 13, 2012, 4:02:40 PM
to ev...@chromium.org, chromium-dev
You didn't ask, but I'm fully supportive.

James



Stuart Morgan

Feb 13, 2012, 4:54:47 PM
to ev...@chromium.org, chromium-dev
So now every time a random test fails once because of, say, a rare
memory smasher in another test in the same binary, or some sort of
random bot issue, instead of ignoring that one-off failure and moving
on, the process will be:
1) Someone disables it
2) They file a bug
3) Someone (hopefully) takes the bug
4) That person re-enables it FLAKY to get data
5) They monitor it for several days to see that it's always green
6) They remove the FLAKY marker

This seems like a huge amount of overhead, especially for something
that is supposed to be reducing the drain on the team. Can you explain
what compelling advantage "zero tolerance" has over "much lower than
current, but non-zero tolerance" that is worth this cost? We already
have dashboards that let us distinguish one-off failures from tests
that are actually flaky.

-Stuart

Evan Martin

Feb 13, 2012, 4:56:17 PM
to Stuart Morgan, chromium-dev

Could you contrast the above procedure against the alternative you're
proposing? I'm not sure I follow.

Stuart Morgan

Feb 13, 2012, 5:06:06 PM
to Evan Martin, chromium-dev
Le 13 février 2012 22:56, Evan Martin <ev...@chromium.org> a écrit :
> Could you contrast the above procedure against the alternative you're
> proposing?  I'm not sure I follow.

The alternative procedure is to check the flakiness dashboard when a
test flakes, and if there's no indication that the test has any
history of flake, skip all the other steps and save a lot of
unnecessary work for one-off failures. If it really is flaky, it'll
fail a second time and get disabled then.

The cost of waiting for a second failure seems a lot lower than the
cost of dealing with a lot of false-positive disabling.

-Stuart

Evan Martin

Feb 13, 2012, 5:50:24 PM
to Stuart Morgan, chromium-dev

I guess you're talking about which action should be taken when a test
fails. I think I was premature in bringing that up, as it's not a
direct conclusion of what I'm doing right now but rather where I
intend to head.

There are different options with different tradeoffs, but no option
seems very satisfactory to me. Yours optimizes for forging ahead, at
the cost of requiring every person who makes a change to Chrome to
reason through whether any given test failure is theirs. Mine has
collateral test-suite damage but simpler reasoning.

Let me walk you through an example of where my reasoning comes from.
I don't know about you, but my experience with the commit queue is
that the majority of my changes take multiple (automated) tries to get
through. E.g. looking through my recent history my second-most-recent
was https://chromium-status.appspot.com/cq/evan%40chromium.org/9316110/4002
. It took three tries on Mac, the second of which involved a
45-second timeout.
Prompted by this email, I looked into the problem. An adjacent test
that uses the same signalling mechanism is already marked flaky, and
the test that hit me also frequently times out on the commit queue
(http://code.google.com/p/chromium/issues/detail?id=95058). My point
is not to pick on this test or Macs in particular, but rather to note
that I don't have the time to track down why my Linux-only change
(literally: it adjusted Linux compiler flags) failed on a Mac bot. And
apparently neither does anyone else; from that bug's log, it appears
this test (not the one already marked flaky, but the one that delayed
my commit by nearly thirty minutes) has timed out many times, causing
many different people's jobs to be delayed and retried.

We'd all have been better off with this test turned off a long time
ago. That isn't to say that the test isn't useful, or that its
failure isn't a real problem, but rather that given that it's not
fixing itself, we're better off without it than with it. I don't see
a better way to amortize the cost of fixing this kind of problem than
by making it everyone's problem. I would have personally
disabled the test if I had felt I had the authority to.


PS: At some level if you have a random memory smasher in your test
suite, no amount of disabling will save you (unless you manage to
accidentally disable the memory smasher). But at the same time, a
test suite with a random memory smasher is not very useful.

James Cook

Feb 13, 2012, 6:51:21 PM
to ev...@chromium.org, Stuart Morgan, chromium-dev
As a developer of purely ChromeOS-specific Chrome code, I very much
appreciate changes that reduce the time I spend looking at test
failures in code I didn't affect.

I've been hacking on Chrome for about a year now and my first few
months were filled with terror whenever I got email from the build
system -- I had no idea what the failing tests were supposed to do,
and I was 95% sure I didn't affect them, but with hundreds of people
tracking ToT was I *sure* I didn't break something?

Disabling all our current flaky tests has my support. It's like the
step in "Inbox Zero" that says "delete your inbox." It's painful, and
might lose something, but it can help you get to a better place.

James

Dirk Pranke

Feb 13, 2012, 7:27:42 PM
to ev...@chromium.org, Stuart Morgan, chromium-dev
On Mon, Feb 13, 2012 at 2:50 PM, Evan Martin <ev...@chromium.org> wrote:
> On Mon, Feb 13, 2012 at 2:06 PM, Stuart Morgan
> <stuart...@chromium.org> wrote:
>> Le 13 février 2012 22:56, Evan Martin <ev...@chromium.org> a écrit :
>>> Could you contrast the above procedure against the alternative you're
>>> proposing?  I'm not sure I follow.
>>
>> The alternative procedure is to check the flakiness dashboard when a
>> test flakes, and if there's no indication that the test has any
>> history of flake, skip all the other steps and save a lot of
>> unnecessary work for one-off failures. If it really is flaky, it'll
>> fail a second time and get disabled then.
>>
>> The cost of waiting for a second failure seems a lot lower than the
>> cost of dealing with a lot of false-positive disabling.
>
> I guess you're talking about which action should be taken when a test
> fails.  I think I was premature in bringing that up, as it's not a
> direct conclusion of what I'm doing right now but rather where I
> intend to head.
>
> There are different options with different tradeoffs, but no option
> seems very satisfactory to me.  Yours seems to optimize forging ahead
> at the cost of making every person who makes a change a Chrome needing
> to be able to reason through whether any given test fails.  Mine has
> collateral test suite damage but simpler reasoning.
>

I think perhaps the question is how long a grace period a flaky test
should have. Clearly, we shouldn't disable a test immediately after it
fails once - you wouldn't be able to distinguish a flake from a
regression. Arguably, a test that was passing fine and is now failing
every other time probably indicates a regression (and not just a flaky
test) as well. However, if a test fails 1 out of 10 times, or 1/100,
or 1/1000, and you don't have a good past history of passing to draw
on ... I'm not sure where the best place to draw the line is.
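
A back-of-the-envelope sketch of why that line is hard to draw (the
flake rates and run count below are assumptions for illustration, not
measurements): even a 1/100 flake rate produces a spurious red more
often than not once you accumulate a hundred runs.

  // Rough arithmetic sketch with assumed numbers: chance of at least
  // one spurious failure is 1 - (1 - p)^runs for per-run flake rate p.
  #include <cmath>
  #include <cstdio>

  int main() {
    const double kFlakeRates[] = {0.1, 0.01, 0.001};  // 1/10, 1/100, 1/1000
    const int kNumRates = 3;
    const int kRuns = 100;  // assumed number of bot runs considered
    for (int i = 0; i < kNumRates; ++i) {
      const double p = kFlakeRates[i];
      const double p_any_red = 1.0 - std::pow(1.0 - p, kRuns);
      std::printf("flake rate %.3f -> P(>=1 red in %d runs) = %.2f\n",
                  p, kRuns, p_any_red);
    }
    return 0;
  }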

(We have this problem in spades in the layout tests right now ...
fortunately marking tests as flaky or not just involves tweaking
test_expectations.txt, not changing code, making this more amenable to
automation).

-- Dirk

Shawn Singh

Feb 13, 2012, 9:34:41 PM
to dpr...@google.com, ev...@chromium.org, Stuart Morgan, chromium-dev
Apologies if this question is a repeat or just a dumb question...

I thought the idea was to skip FLAKY tests, not re-categorize them to DISABLED.  I'm not against skipping FLAKY tests, and I did briefly see the previous discussion about this.  But my immediate reaction to this announcement was that DISABLED means we lose the FLAKY label on the test, which is worse than just skipping it.

I'm sure I'm just missing something obvious; can you please fill it in?

Thanks,
~Shawn

Peter Kasting

Feb 13, 2012, 9:41:09 PM
to shawn...@chromium.org, dpr...@google.com, ev...@chromium.org, Stuart Morgan, chromium-dev
On Mon, Feb 13, 2012 at 6:34 PM, Shawn Singh <shawn...@chromium.org> wrote:
> I thought the idea was to skip FLAKY tests, not re-categorize them to DISABLED.

How were you proposing to skip these other than by marking them DISABLED?  That's normally how we skip tests.

PK

Fred Akalin

Feb 13, 2012, 9:49:15 PM
to pkas...@google.com, shawn...@chromium.org, dpr...@google.com, ev...@chromium.org, Stuart Morgan, chromium-dev
Perhaps adding "-*FLAKY*" to --gtest_filters for the command lines of our build/trybots?

Mihai Parparita

Feb 13, 2012, 9:55:22 PM
to aka...@chromium.org, pkas...@google.com, shawn...@chromium.org, dpr...@google.com, ev...@chromium.org, Stuart Morgan, chromium-dev
On Mon, Feb 13, 2012 at 6:49 PM, Fred Akalin <aka...@chromium.org> wrote:
> Perhaps adding "-*FLAKY*" to --gtest_filters for the command lines of our build/trybots?

But then if you wanted to temporarily re-enable the test (to see if it's still flaky and/or gather more data), how would you do that but still signal to the tree/sheriffs that they can ignore the failures?

Mihai

Peter Kasting

Feb 13, 2012, 9:55:39 PM
to Fred Akalin, shawn...@chromium.org, dpr...@google.com, ev...@chromium.org, Stuart Morgan, chromium-dev
On Mon, Feb 13, 2012 at 6:49 PM, Fred Akalin <aka...@chromium.org> wrote:
> Perhaps adding "-*FLAKY*" to --gtest_filters for the command lines of our build/trybots?

That won't also skip them for developers, which seems like a suboptimal inconsistency.

Mihai's objection is better though.

PK

Shawn Singh

Feb 13, 2012, 9:58:45 PM
to Peter Kasting, Fred Akalin, dpr...@google.com, ev...@chromium.org, Stuart Morgan, chromium-dev
OK, thanks for the replies.   As for suggestions of my own, I don't have any.  I'm not familiar with the infrastructure at all.  I just felt it was better to ask, just in case, since it's a pretty big change. =)

But clearly you guys all have already understood and considered that point...  that's good enough for me.

Thanks!
~Shawn

Fred Akalin

Feb 13, 2012, 10:29:22 PM
to Peter Kasting, shawn...@chromium.org, dpr...@google.com, ev...@chromium.org, Stuart Morgan, chromium-dev
Ah, that makes sense.

As for losing the flaky bit when mass-renaming to disabled, I think the fact that that info is still in svn/git is good enough.

Mark Larson (Google)

Feb 13, 2012, 11:05:51 PM
to ev...@chromium.org, chromium-dev
Go, Evan.

I think this is going to be a net win for developer productivity across Chromium, and I know there's going to be some collateral damage. The problem is that FLAKY_ tests are growing (overall... I know a lot of them get fixed), and it's costing every developer 2^10 paper cuts at this point for no real gain in quality signal. If you think disabling a particular flaky test is doing harm to the project, you can either fix the test or add new, non-flaky tests for the feature.

On Mon, Feb 13, 2012 at 10:18, Evan Martin <ev...@chromium.org> wrote:
> Summary: This week I’m going to mass-mark every test we’ve currently
> marked as “flaky” as instead “disabled”.
> This will cause such tests to stop running on the build bots. If
> you’re currently tracking down why a test is flaky, or intend to do so
> in the future, read on for how and why.
>
>
> Currently tests that are marked with a “FLAKY_” prefix still run on
> the bots, but whether they pass or fail is ignored by the bots. For a
> recent run on the bots 1% of all tests we ran, more than a thousand,
> were marked FLAKY.
>
> We currently have a web of interdependent problems around flaky tests
> and it’s difficult to find which thread to first pull. When a test
> begins to unreliably fail it’s difficult to determine the cause,
> because the bots take a long time to cycle (providing fewer runs worth
> of data and larger blamelists per change), but the bots take a long
> time to cycle in part due to flaky tests taking time.
>
> I intend to start pulling. Concretely, I plan to:
> 1) Use a sed script to rename every FLAKY_foo test to DISABLED_foo.
> 2) Institute a zero-tolerance policy for flaky tests. If a test ever
> fails for reasons other than a bad change causing it to regress we
> disable it.

This might be a bit trigger-happy, but I could be wrong. I'd hope that we're not marking tests as FLAKY_ based on one unpredicted failure. However, I agree with the philosophy of disabling tests instead of marking them flaky... the point is that, in aggregate, we've become dependent on the flaky crutch and used it to sweep tests under the rug too often. 
 

> Q1. Won't we lose test coverage?
> A. We're already ignoring the success/fail of flaky tests. When they
> fail, the bot stays green.
>
> Q2. Shouldn't we instead fix those tests?
> A. Yes, I agree 100%. But realistically that hasn't happened. We can
> talk a lot about blame or what ought to be done but nobody has stepped
> up. I am stepping up.

I am glad to see Evan stepping up. You can step up, too, if you have ideas about how to de-flake tests.
 

> Q3. Shouldn’t we instead do [insert complicated engineering solution here]?
> A. I have lots of ideas like this too, but unless you’re volunteering
> to do such work then your idea isn’t very useful.
>
> Q4. How will people diagnose a test that they think is flaky?
> A. Once I’ve done this sweeping change, you can mark a test as FLAKY_
> again and collect data for it. I just want to reset all tests that
> aren't currently being worked on. (When I looked into which flaky
> tests were slowest I found the slowest one has had a bug open on it
> for nearly a year, no action taken.)  I would like to follow up this
> change with some policy (either verbal or in code) like "if a test is
> marked flaky for more than a few days it goes back to disabled unless
> someone says they're working on it".

Can we make a policy that any test that's been disabled/flaky/failing for 10K or 20K revisions just gets deleted? If we're shipping a browser through dev, beta, and stable without needing this test's input, is it worth it to have the test code in the tree, to be compiling that code over and over, and to be bloating the size of the test binaries that we pass around from bot to bot?

I guess there are tests that are disabled for specific platforms only, but for any disable that's not platform-specific, I think it'd be good to delete the test code after some period. 

This sort of Logan's Run algorithm for test code would need someone to step up and implement... if we think rebuilding dead test code all the time is really a problem.
 

Sreeram Ramachandran

Feb 13, 2012, 11:08:53 PM
to aka...@chromium.org, Peter Kasting, shawn...@chromium.org, dpr...@google.com, ev...@chromium.org, Stuart Morgan, chromium-dev
On Mon, Feb 13, 2012 at 19:29, Fred Akalin <aka...@chromium.org> wrote:
> As for losing the flaky bit when mass-renaming to disabled, I think the fact
> that that info is still in svn/git is good enough.

Better still might be to rename the test to DISABLED_FLAKY_Test, so
that it's clear when looking at the code what happened to the test (or
if that's too confusing, perhaps DISABLED_WasFlaky_Test or such).

Rachel Blum

Feb 13, 2012, 11:19:20 PM
to sre...@chromium.org, aka...@chromium.org, Peter Kasting, shawn...@chromium.org, dpr...@google.com, ev...@chromium.org, Stuart Morgan, chromium-dev
 
> Better still might be to rename the test to DISABLED_FLAKY_Test, so
> that it's clear when looking at the code what happened to the test (or
> if that's too confusing, perhaps DISABLED_WasFlaky_Test or such).

There's really no point in that. Unless people are frequently changing DISABLED tests (why?), it'll be immediately obvious from the change history anyways. Let's keep history in revision control, not in the source.

Rachel

Szymon Jakubczak

Feb 14, 2012, 10:11:32 AM
to aka...@chromium.org, Peter Kasting, shawn...@chromium.org, dpr...@google.com, ev...@chromium.org, Stuart Morgan, chromium-dev
If recategorizing FLAKY to DISABLED is a loss, does this mean the
DISABLED tests are a graveyard (once a test enters, it never comes
back to life)? If so, shouldn't we actually cut them out of the source
to speed up build times?

James Cook

Feb 14, 2012, 11:09:45 AM
to ev...@chromium.org, Stuart Morgan, chromium-dev
> I've been hacking on Chrome for about a year now and my first few
> months were filled with terror whenever I got email from the build
> system -- I had no idea what the failing tests were supposed to do,
> and I was 95% sure I didn't affect them, but with hundreds of people
> tracking ToT was I *sure* I didn't break something?

I realized last night that my comment has nothing to do with Evan's
work. His plan cuts cycle times. It neither increases nor decreases
the chance a particular build run will hit a not-known-to-be-flaky
test.

However, shorter cycle times mean narrower revision ranges to examine
when investigating a failure, and less anxious waiting for test
results. Full steam ahead!

James

Scott Hess

Feb 14, 2012, 11:16:36 AM
to jame...@google.com, ev...@chromium.org, Stuart Morgan, chromium-dev
On Tue, Feb 14, 2012 at 8:09 AM, James Cook <jame...@chromium.org> wrote:
>> I've been hacking on Chrome for about a year now and my first few
>> months were filled with terror whenever I got email from the build
>> system -- I had no idea what the failing tests were supposed to do,
>> and I was 95% sure I didn't affect them, but with hundreds of people
>> tracking ToT was I *sure* I didn't break something?
>
> I realized last night that my comment has nothing to do with Evan's
> work.  His plan cuts cycle times.  It neither increases nor decreases
> the chance a particular build run will hit a not-known-to-be-flaky
> test.

It decreases the chances, because what you know FOR CERTAIN about
flaky tests is that something is wrong. In some cases, the test is
flaky because it tickles something latent; this won't improve those.
In other cases the test really is messing things up, and skipping
those will improve the environment for later tests.

-scott

Paweł Hajdan, Jr.

Feb 14, 2012, 11:57:56 AM
to sh...@google.com, jame...@google.com, ev...@chromium.org, Stuart Morgan, chromium-dev
On Tue, Feb 14, 2012 at 16:11, Szymon Jakubczak <sz...@google.com> wrote:
> If recategorizing FLAKY to DISABLED is a loss, does this mean the
> DISABLED tests are a graveyard (once a test enters, it never comes
> back to life)? If so, shouldn't we actually cut them out of the source
> to speed up build times?

I wouldn't call it a graveyard, but there is a correlation between a test being disabled for a long time and being useless.

No matter what happens, it's still going to stay in the version control history in case we really need to restore it (and if it's so important, it should be fixed in the first place).

On Tue, Feb 14, 2012 at 17:16, Scott Hess <sh...@chromium.org> wrote:
> In other cases the test really is messing things up, and skipping
> those will improve the environment for later tests.

Yup, I think it's an important point, especially for test binaries where all tests run in the same address space (most of them, except e.g. browser_tests) or where they can otherwise mess things up easily (e.g. interactive_ui_tests, focus-related breakages).

Scott Byer

Feb 14, 2012, 2:13:59 PM
to phajd...@chromium.org, sh...@google.com, jame...@google.com, ev...@chromium.org, Stuart Morgan, chromium-dev
Whether a test is marked FLAKY or DISABLED, it should have a bug filed on it. I find searching the issues list even faster than digging up source history. I had originally thought that keeping FLAKY either as a comment or as an additional part of the identifier would be a good thing, but really, it's just visual noise.

As for DISABLED being a graveyard, it's really no more so than FLAKY is. Both are a shame. I've recently been trying to fix a few flaky tests, and it's been hard, but instructive. I think I'm much less likely at this point to create a flaky test than before this exercise, and now I can recognize some of the anti-patterns easily.

Having been sheriff enough to know the pain of flaky tests, I say Go Evan!

Ricardo Vargas

Feb 14, 2012, 2:43:17 PM
to m...@chromium.org, ev...@chromium.org, chromium-dev
Isn't that targeting the symptoms but not the problem? We should be requiring any new feature and significant behavior-changing CL to have relevant tests. Deleting tests without removing the code that the test is supposed to verify is basically the same as allowing new code that is not tested. I don't think that the fact that the code "works" in the stable channel is an indicator of the usefulness of unit tests, and the number of bug reports is not exactly going down.

Albert Bodenhamer

Feb 14, 2012, 3:01:27 PM
to rva...@chromium.org, m...@chromium.org, ev...@chromium.org, chromium-dev
Flaky tests just provide the illusion of test coverage without actually providing any valuable data.  

Perhaps a good second step to this would be to automatically generate a high-priority bug on any code without good test coverage?  Perhaps some consequences for people who leave untested code lying around?

That way we can call out uncovered code as a risk and provide solid incentives to fix it.



--
Albert Bodenhamer | Software Engineer | abodenha@chromium.org 

Evan Martin

Feb 14, 2012, 3:05:52 PM
to Albert Bodenhamer, rva...@chromium.org, m...@chromium.org, chromium-dev
On Tue, Feb 14, 2012 at 12:01 PM, Albert Bodenhamer
<abod...@chromium.org> wrote:
> Perhaps a good second step to this would be to automatically generate a high
> priority bug on any code without good test coverage?  Perhaps some
> consequences for people who leave untested code laying around?
>
> That way we can call out uncovered code as a risk and provide solid
> incentives to fix it.

See question #3 in the initial email. ;)

In seriousness, test coverage is a really weak point of our process.
Even being able to identify which code is covered by tests is
something that nobody is currently working on, as far as I know.

Albert Bodenhamer

Feb 14, 2012, 3:12:07 PM
to Evan Martin, rva...@chromium.org, m...@chromium.org, chromium-dev
On Tue, Feb 14, 2012 at 12:05 PM, Evan Martin <ev...@chromium.org> wrote:
> On Tue, Feb 14, 2012 at 12:01 PM, Albert Bodenhamer
> <abod...@chromium.org> wrote:
> > Perhaps a good second step to this would be to automatically generate a high
> > priority bug on any code without good test coverage?  Perhaps some
> > consequences for people who leave untested code laying around?
> >
> > That way we can call out uncovered code as a risk and provide solid
> > incentives to fix it.
>
> See question #3 in the initial email.  ;)

:-) Fair enough.  I was thinking more of an "AND" rather than "INSTEAD OF", but I'll put some thought into it.  Flaky tests need to die.
 

> In seriousness, test coverage is a really weak point of our process.
> Even being able to identify which code is covered by tests is
> something that nobody is currently working on, as far as I know.

Yep.  In a lot of areas I'm seeing that even the non-broken tests often aren't really testing things well.  Good tests are hard.

Chris Bentzel

Feb 14, 2012, 3:21:52 PM
to ev...@chromium.org, Albert Bodenhamer, rva...@chromium.org, m...@chromium.org, chromium-dev, Randy Smith

I know he hopes to find time to document his process so others can repeat this.

Randy Smith

Feb 14, 2012, 3:23:31 PM
to ev...@chromium.org, Albert Bodenhamer, rva...@chromium.org, m...@chromium.org, chromium-dev
On Tue, Feb 14, 2012 at 3:05 PM, Evan Martin <ev...@chromium.org> wrote:
> On Tue, Feb 14, 2012 at 12:01 PM, Albert Bodenhamer
> <abod...@chromium.org> wrote:
>> Perhaps a good second step to this would be to automatically generate a high
>> priority bug on any code without good test coverage?  Perhaps some
>> consequences for people who leave untested code laying around?
>>
>> That way we can call out uncovered code as a risk and provide solid
>> incentives to fix it.

It's a bit of a thread hijack, but I'd like to +1 this. We're
spending a lot of time talking about flaky and failing tests (and for
good reasons, as Evan points out--what he's doing is just step one,
but I'm excited about it). But when we talk about feedback loops
(Developers shouldn't have flaky tests! They should be forced to fix
them!) I think we miss the point. Code shouldn't be in the tree
without good tests. Someone who checks in code with minimal testing
is in the same position as someone who checks in code with flaky tests
that provided good coverage before they were disabled. The feedback
loops we should put in place for getting people to write/fix tests
should have to do with code coverage (*), not with flaky or disabled
tests.

-- Randy

P.S. (*) I know that code coverage is an imperfect metric for quality
of testing. I just think it's a better one than a lack of test flake
:-J.

>
> See question #3 in the initial email.  ;)
>
> In seriousness, test coverage is a really weak point of our process.
> Even being able to identify which code is covered by tests is
> something that nobody is currently working on, as far as I know.

I've done some work to make it easier to get coverage run for a test
subset, which I need to write up and put on the wiki.

Scott Hess

Feb 14, 2012, 3:23:50 PM
to rva...@chromium.org, m...@chromium.org, ev...@chromium.org, chromium-dev
On Tue, Feb 14, 2012 at 11:43 AM, Ricardo Vargas <rva...@chromium.org> wrote:
> Isn't that targeting the symptoms but not the problem?. We should be
> requiring any new feature and significant behavior changing CL to have
> relevant tests. Deleting tests without removing the code that the test is
> supposed to verify is basically the same as allowing new code that is not
> tested. I don't think that the fact that the code "works" in the stable
> channel is an indicator of the usefulness of unit tests, and the number of
> bug reports is not exactly going down.

Very old flaky (and disabled) tests are like three-digit bug numbers -
the easy ones don't get to that point. When you go to look, you find
out that it's going to be a 12-week yak-shave to fix it.

In many cases, the problem may not be that the test needs to be fixed
so much as the code needs to be refactored to be more testable, or, in
many cases, more portable (there are tons of "MAYBE" cases in that
bin). I am quite sure that we have a fair amount of code which only
exists because a disabled test somewhere refers to it.

-scott

Evan Martin

Feb 16, 2012, 1:26:37 PM
to chromium-dev
On Mon, Feb 13, 2012 at 10:18 AM, Evan Martin <ev...@chromium.org> wrote:
> Summary: This week I’m going to mass-mark every test we’ve currently
> marked as “flaky” as instead “disabled”.
> This will cause such tests to stop running on the build bots. If
> you’re currently tracking down why a test is flaky, or intend to do so
> in the future, read on for how and why.

This is now more or less complete. The attached image shows the sort
of impact it had on a random bot. It appears in many cases the bot
cycle time variance has gone down a lot. (Caveat: I am not certain
this was all due to my change. The buildbot UI makes it difficult to
be certain.)

Here are some random ideas I've had for places to take this in the future:
- remove all FAILS_ tests that always time out
- eliminate the XP debug bot, instead using Release+DCHECKs
- construct a new test framework for testing browser UI that doesn't
involve rebooting Chrome for each test
- construct a new test framework for testing browser UI that builds
Chrome without webkit (less code = faster boots / faster links)
- more tooling to monitor and auto-report on problematic tests

However, I found taking even this small step incredibly frustrating
and not worth the pain, so implementing these sorts of ideas will
unfortunately have to fall on someone else.

[Attachment: cycle.png]

Ilya Sherman

Feb 16, 2012, 5:48:21 PM
to ev...@chromium.org, chromium-dev
On Thu, Feb 16, 2012 at 10:26 AM, Evan Martin <ev...@chromium.org> wrote:
> On Mon, Feb 13, 2012 at 10:18 AM, Evan Martin <ev...@chromium.org> wrote:
> > Summary: This week I’m going to mass-mark every test we’ve currently
> > marked as “flaky” as instead “disabled”.
> > This will cause such tests to stop running on the build bots. If
> > you’re currently tracking down why a test is flaky, or intend to do so
> > in the future, read on for how and why.

> This is now more or less complete.  The attached image shows the sort
> of impact it had on a random bot.  It appears in many cases the bot
> cycle time variance has gone down a lot.  (Caveat: I am not certain
> this was all due to my change.  The buildbot UI makes it difficult to
> be certain.)
>
> Here are some random ideas I've had for places to take this in the future:
> - remove all FAILS_ tests that always time out

Tests that time out are already supposed to be marked DISABLED -- historically, this has been one of the most common reasons for disabling a test.
 
> - eliminate the XP debug bot, instead using Release+DCHECKs
> - construct a new test framework for testing browser UI that doesn't
> involve rebooting Chrome for each test
> - construct a new test framework for testing browser UI that builds
> Chrome without webkit (less code = faster boots / faster links)
> - more tooling to monitor and auto-report on problematic tests
>
> However, I found taking even this small step incredibly frustrating
> and not worth the pain, so it implementing these sorts of ideas will
> unfortunately have to fall on someone else.

Reid Kleckner

Feb 17, 2012, 11:28:05 AM
to ev...@chromium.org, chromium-dev
On Thu, Feb 16, 2012 at 1:26 PM, Evan Martin <ev...@chromium.org> wrote:
> - construct a new test framework for testing browser UI that doesn't
> involve rebooting Chrome for each test

Obviously there are isolation issues here, and this would be a lot of work.

However, it would also really benefit the Valgrind and DrMemory bots.  With all these short-lived processes, these tools have to re-do translation for all the code on Chrome's startup path.  It's pretty much the worst possible case for us.  Having longer-lived processes would allow them to reuse more of the code cache.

Reid

Paweł Hajdan, Jr.

Feb 17, 2012, 12:38:00 PM
to r...@google.com, ev...@chromium.org, chromium-dev
Actually adapting ui_tests to do this would be fairly easy (not so much with browser_tests though).

Of course that could create steps to repro like "any of those 100 tests could leak / corrupt memory".

Vincent Scheib

May 31, 2012, 5:30:06 PM
to chromium-dev, phajd...@chromium.org, r...@google.com, ev...@chromium.org
I have updated chromium.org documentation to reflect this policy: https://sites.google.com/a/chromium.org/dev/developers/tree-sheriffs/sheriff-details-chromium/handling-a-failing-test

Corrections welcome if I got it wrong.

Paweł Hajdan, Jr.

Jun 1, 2012, 3:35:00 AM
to Vincent Scheib, chromium-dev, r...@google.com, ev...@chromium.org
Looks good, thank you!