Hi all,
I'm sheriff today, and in the course of my sheriff-y duties, I've realized that our policy for how we're supposed to handle flaky tests is perhaps unclear or not in broad agreement.
Specifically, it is my belief that we are supposed to disable any flaky tests we come across [1]. However, the documentation I can point at [2] suggests that you're supposed to let them run. I think the documentation needs to be changed.
However, before I do that, I wanted to see if there was something I'm missing.
From the discussions we've had in the past, the objections to disabling tests are generally twofold:
1) If you disable a flaky test, you lose whatever potential coverage that test might be giving us.
2) If you disable a flaky test, you can't easily tell if the test stops being flaky and starts passing (or failing consistently).
These are both valid objections.
The counterargument I've heard (and made) to the first is that you can't actually trust the coverage the test might be giving you, since it's too hard to tell different failure modes apart. The value of continuing to run the test is therefore outweighed by the confusion of looking at ongoing failures and trying to figure out whether they're new or not.
The counterargument to the second is that the cost of continuing to run the flaky test (both directly, and in potential knock-on costs, since the way we run tests means one test can interfere with another) outweighs the value of doing so; i.e., tests don't fix themselves often enough, or rapidly enough, to justify it.
These objections were discussed in the thread [3] linked from the current documentation. That discussion happened in 2012, so, quite some time ago. I don't remember offhand discussions that have happened since then, but I do know that we got rid of the ability -- in gtests -- to mark a test as FLAKY, so there currently isn't a way to keep running a test but ignore failures. We could arguably make this possible, but for the moment I'm going to declare that a separate discussion.
Two related topics are also relevant. First, there's the question of how we track disabled tests; lindsayw@ has been working on processes for this that we'll start following soon. Second, there's the question of measuring test (and code) coverage, so we can start to learn how much this matters; baxley@, liaoyuke@, and others are working on a plan for that as well.
So, to repeat: I think we should be disabling flaky tests, and I think the docs need to be updated to make that extra clear.
Anyone strongly object to this?
-- Dirk
[1] I double-checked with a few other folks just to make sure I wasn't misremembering