The all-green tree would be read-only, receiving new changesets only
from Mozilla Central as a mechanical response to tinderbox results. It
would thus be single-headed, as Mozilla Central is now.
Naturally, developers trying to land a patch would still need to work
from the latest M-C head and take the usual care shepherding their work
into an all-green changeset that has been fully merged with others'
work. So the all-green repo wouldn't really change the critical path
for landing patches; it would only help with keeping one's tree fresh.
This would relax, but not eliminate, the role of sheriffs, because while
burnage wouldn't stop others from getting a reasonably fresh known-good
state, there would still be a need for someone to sort out which of
several landed changes are causing problems, and revert or fix up as
needed to produce a new all-green head.
If we have good experiences with an all-green tree, we might eventually
want to change our rules so that patches are not considered landed until
they appear in the all-green tree. That is, after all, our actual goal
condition: a tree that passes all tests with the change in it. This
would further change the incentives around sheriffing: committers would
have a direct interest in getting the tree to green, since they'd be
unable to close their bugs until it was.
Some details:
- As I understand it, the slaves currently pull whenever they have
finished a task and there is a new changeset available. This means that
not every changeset necessarily gets tested: a slave can grab one, and
then several commits may be pushed before the slave gets a chance to
start a new cycle. But for the concept of a "green changeset" to really
be useful, we need to get the exact same changeset through all the
builds and tests. If slaves skip changesets often enough, all-green
changesets may be rare simply because they don't usually get enough
attention, not because they are in fact flawed. But this can be solved
by closing the tree to let it "cycle green", which we have to do now
under some circumstances, or by adding more slaves.
- If we decide that all-green changesets are valuable enough, we could
direct slaves to ignore changesets that have been rejected by other
tests, and concentrate their effort on the changesets that might be
all-green. I'm sure there are other scheduling tricks we can use to
improve performance further: for example, preferring the changeset that
has the most greens so far from other tests, but has not yet gotten our
attention, would lead to the slaves "dogpiling" on promising changesets,
and then abandoning them as soon as a failure is found.
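A minimal sketch of the "dogpiling" heuristic described above, in Python. The ChangesetStatus record and the pick_next function are hypothetical names for illustration; nothing like them is known to exist in tinderbox or buildbot. It assumes each slave runs one named test and can see the results other slaves have reported so far.

```python
from dataclasses import dataclass, field


@dataclass
class ChangesetStatus:
    """Test results collected so far for one changeset (hypothetical record)."""
    node: str
    green: set = field(default_factory=set)   # names of tests that passed
    failed: set = field(default_factory=set)  # names of tests that failed


def pick_next(statuses, my_test):
    """Choose the changeset a slave running `my_test` should take next.

    Skips changesets already rejected by some other test, skips ones this
    test has already passed, and prefers the one with the most greens.
    Ties go to the later changeset in `statuses` (assumed oldest-first).
    """
    candidates = [
        (len(s.green), index, s)
        for index, s in enumerate(statuses)
        if not s.failed and my_test not in s.green
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda c: (c[0], c[1]))[2]


queue = [
    ChangesetStatus("a1", green={"build", "reftest"}),
    ChangesetStatus("b2", green={"build"}, failed={"mochitest"}),
    ChangesetStatus("c3", green={"build"}),
]
print(pick_next(queue, "xpcshell").node)  # a1: most greens, no failures yet
```

As soon as any test reports a failure for a changeset, it drops out of every slave's candidate list, which is the "abandoning" half of the idea.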
Also, I think you mean a green push, not changeset. I know I, and
others push a series of changesets together because they could not live
on their own. Testing each changeset would certainly cause failed tests
at a minimum in those cases.
Cheers,
Shawn
Okay. This may be a non-issue, then. That would be nice.
> Also, I think you mean a green push, not changeset. I know I, and others
> push a series of changesets together because they could not live on
> their own. Testing each changeset would certainly cause failed tests at
> a minimum in those cases.
Well, a push can carry many changesets, but the thing being tested, that
can be said to be all-green or not, is a changeset. But you have a
point in that, although the *head* of the all-green repo will always be
green, its parents might not be: if something was pushed to M-C and then
fixed, the intermediate, non-green changesets will be in the all-green
repo, too. So the subject I chose for the thread is not quite right:
rather, it's an all-green-head repository.
(I guess one *could* have a repo that contained only the green
changesets, or their contents anyway; the non-green intermediate
changesets would be folded out. But since a changeset's parents figure
in its ID, all the IDs would be different; Mercurial would treat such a
repo as being an essentially unrelated line to M-C. Limited usefulness.)
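To illustrate why folding out the non-green changesets changes every ID, here is a simplified model in Python. It only assumes what is said above, that a changeset's parents figure in its ID; the node_id function is a stand-in for illustration and does not reproduce Mercurial's actual storage format.

```python
import hashlib

NULL_ID = b"\0" * 20  # stand-in for "no parent"


def node_id(parent1, parent2, content):
    """Simplified changeset ID: a SHA-1 over both parent IDs and the content."""
    h = hashlib.sha1()
    for parent in sorted([parent1, parent2]):
        h.update(parent)
    h.update(content)
    return h.digest()


# M-C history: root -> broken -> fixed
root = node_id(NULL_ID, NULL_ID, b"root")
broken = node_id(root, NULL_ID, b"broken change")
fixed = node_id(broken, NULL_ID, b"fix")

# "Folded" history: the same final content committed directly on top of root
fixed_folded = node_id(root, NULL_ID, b"fix")

print(fixed == fixed_folded)  # False: different parent, different ID
```

Every descendant of the folded changeset would inherit the mismatch, so the two repositories would share no IDs past the fold point.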
I consider it wasted work to invest in infrastructure based on our flaky
results.
If you as a developer want to get a local tree to work on, just go to
whatever crufty display we have right now and look up a revision and
just update to that locally. I don't see why we need another hg clone
for that.
Or, to put my general grmpf into a question, are you sure you're not
just trying to game a different problem that you figure you can't fix
anyway? If so, it'd be much more useful to call that one out instead of
proposing workarounds.
Axel
My thoughts are the same as Axel's. How is this even remotely practical
given how much random orange there currently is?
I want to strongly support the general idea proposed by Jim Blandy. I've
wasted many hours trying to build firefox, much of it trying to decode
obscure instructions and crufty displays. I think I've figured it out
but the procedure is a stupid waste of time.
jjb
If the results are that bad, then something that consumes them as input
isn't going to work very well. But don't we ourselves consume them as input?
It seems to me that one of our institutional problems is that we permit
ourselves to work around test failures, because we don't really believe
the tests. But if we were really sure that the tests were what's flaky,
we would delete those tests. Our reluctance to do so indicates that we
don't *know* whether these random oranges are really the tests' fault,
or whether there's a real intermittent bug in the product.
I want to encourage us to treat the tests the way we treat our code, or
our build system: something which is expected to work, and which gets
fixed when it doesn't. If a popular all-green repository makes
dismissing failing tests no longer practical (because your work never
gets out to anyone, and your bug is never closed), then that creates an
incentive to fix the tests, or the bugs they detect.
> Or, to put my general grmpf into a question, are you sure you're not
> just trying to game a different problem that you figure you can't fix
> anyway? If so, it'd be much more useful to call that one out instead of
> proposing workarounds.
Actually, I don't have regular problems finding trees that are good
enough to use as the basis for my work. So I'm not personally trying to
get around flaky tests. If I'm trying to game something, it's the social
mechanisms we use to work around the technical failure.
I think there are three categories of problems:
* real intermittent bugs in the product
* tests that are flaky
* timeouts that start becoming flaky when the tests are run on a
system with high load
It's not always trivial to distinguish the first two cases. In
fact, once we have, it's usually relatively easy to fix, and we
generally do.
I don't think we want to just start disabling tests and covering up
real intermittent bugs in the product; we've found quite a few such
bugs by debugging intermittently-failing tests.
-David
--
L. David Baron http://dbaron.org/
Mozilla Corporation http://www.mozilla.com/
> Or, to put my general grmpf into a question, are you sure you're not
> just trying to game a different problem that you figure you can't fix
> anyway? If so, it'd be much more useful to call that one out instead of
> proposing workarounds.
One of the core issues that I see is: Where is the try to push that
Jesse asked for in February?
http://www.squarefree.com/2009/02/19/continuous-integration-at-mozilla/
A push to mozilla-central is such a waste of time compared to a push to
try-server, where you are not nailed to the computer and just check the
next morning how the patch went. That's 90 sec. compared to the 2-3 hours
it takes until everything has cycled on mozilla-central. And on
mozilla-central you need another 3-4 hours to spare in case something
goes wrong.
Bernd
It introduces an extra step for developers wanting a green tree, but it
solves the original issue without needing a big investment in
infrastructure.
- Blair
I agree with all this. If, in fact, we do have a good record of fixing
product code and tests to the point that the results are generally
dependable, then it seems to me that an all-green repo would be useful.
You sound more sanguine about the quality of our test results than Axel
does, his argument being that they are so unreliable that it's pointless
to build automated tools driven by them.
In the bigger picture, if we could figure out appropriate rules for
automatic testing and merging in a multi-headed incoming repository,
then one changeset's regressions wouldn't even need to prevent others
from "playing through". There are subtleties there which I don't know
how to deal with yet, but with some care we could eventually eliminate
the sheriff role altogether. A test failure would block only the person
whose change introduced it, and they would be responsible for
apportioning blame to tests, code, framework, etc., or at least driving
that process with help from others. (Certainly we'd still want to have
test/build specialists on call.)
Our own Graydon Hoare once designed (and implemented) a version control
system called Monotone which had mechanisms by which a test system could
give a changeset its stamp of approval; this made it possible to
mechanically select changesets that had passed a given set of tests, and
create policies like the one I'm suggesting we move towards.
That is essentially what I'm suggesting. This is a very minor thing
work-wise. One more public clone is not a big investment, unless I'm
missing something.
However, I think it will have a big impact on the way people think about
landing their patches, and about the test results. And it's a step
towards the non-failure-blocked system I sketched in my reply to dbaron.
Bots can be great with handling data, but they generally suck at making
decisions.
What bothers me personally is that our bots suck at handling data, and
the noise in the data is just what makes piling stuff on top even more
troublesome.
Axel
You're missing a ton. There's no single place that would know when all
tests pass. I doubt that the talos graph analysis is strong enough yet,
either.
Plus this is a repo per ... not sure. What about comm-central failures?
Etc.
Axel
> In bug 528293 I filed a suggestion that we create a Mercurial repository
> which automatically pulls only those changesets from Mozilla Central for
> which all builds and tests have passed. Developers simply wanting to
> keep ongoing, uncommitted work based on a recent-known-good state could
> pull from this all-green tree.
Jim, is this something you were proposing to implement yourself? It doesn't
sound like something that needs a lot of help from RelEng to implement as an
experiment, to see whether our random-failure rate prevents it from working
correctly or whether people actually would use it.
--BDS
Axel is telling me that the build and test results are not nearly as
accessible as I imagine; I may need some help with that. But that aside,
yes, it's something that would be very easy to set up as an experiment;
it doesn't require any real buy-in to demo.
I wanted to see what more experienced people thought.
In reality, I strongly agree that our random orange situation isn't yet
good enough to make this worth the investment and as someone else
pointed out, the automated Talos analysis isn't yet good enough to be
relied upon.
> Axel is telling me that the build and test results are not nearly as
> accessible as I imagine; I may need some help with that.
> But that aside,
> yes, it's something that would be very easy to set up as an experiment;
As an experiment, it might be easy to set-up but as a full blown,
production system it would be a *lot* more work, and touch many existing
systems. That's not to say we shouldn't necessarily do it, but just that
it's far from a trivial project.
re: getting test results, the only way to do it right now is scraping
Tinderbox. The TBPL code that does that lives here:
http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/file/87aaa378c9a4/TinderboxJSONUser.js
http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/file/87aaa378c9a4/TinderboxHTMLParser.js
http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/file/87aaa378c9a4/TinderboxData.js
HTH
- Ben
I heartily agree with this. But maybe we can get to this goal from a
different direction: before focusing on the subset of changesets that
are always on green, we should figure out how to make green and orange
mean what we want them to mean. At that point, keeping a green tree on
the side becomes both easier and more useful.
As a first cut how to do that, how about if once a test fails
intermittently, we:
1. File a bug on that test. The target of the bug is that either the
test is fixed, or the code is fixed, whichever one is busted.
2. Mark the test as failing intermittently. Tests so marked will not
make the tree turn orange when they fail intermittently. But, if the
test starts failing consistently, that will be counted as a failure. If
the test starts succeeding consistently, that is also a kind of bug,
because the test needs to have the intermittent failure mark removed.
Another idea is that if we start getting a new intermittent orange, run
more tests until we find out what changeset created the orange, and then
back it out.
--
Dave
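A rough sketch, in Python, of how the marking rule in step 2 might be evaluated. All names here are hypothetical, and the thresholds are arbitrary placeholders: with the defaults, "consistent" simply means every run in the window agreed. Picking a real window size and real thresholds is exactly the open question.

```python
def classify(results, low=0.0, high=1.0):
    """Classify a test from its recent results (True = pass, False = fail).

    `low` and `high` are hypothetical thresholds on the failure rate.
    """
    if not results:
        return "unknown"
    failure_rate = results.count(False) / len(results)
    if failure_rate >= high:
        return "consistent failure"
    if failure_rate <= low:
        return "consistent pass"
    return "intermittent"


def turns_tree_orange(results, marked_intermittent):
    """A marked test only counts against the tree when it fails consistently."""
    status = classify(results)
    if status == "consistent failure":
        return True
    if status == "intermittent":
        return not marked_intermittent
    return False


def needs_mark_removed(results, marked_intermittent):
    """A marked test that now passes consistently should lose its mark."""
    return marked_intermittent and classify(results) == "consistent pass"


recent = [True, False, True, True]
print(classify(recent))                 # intermittent
print(turns_tree_orange(recent, True))  # False: marked, so tolerated
```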
People do that, AFAICT.
> 2. Mark the test as failing intermittently. Tests so marked will not
> make the tree turn orange when they fail intermittently. But, if the
> test starts failing consistently, that will be counted as a failure. If
> the test starts succeeding consistently, that is also a kind of bug,
> because the test needs to have the intermittent failure mark removed.
I don't see how one is supposed to do that. What's the criteria between
consistent and intermittent, and how would you report on that?
> Another idea is that if we start getting a new intermittent orange, run
> more tests until we find out what changeset created the orange, and then
> back it out.
Same thing, basically. What's the criteria on "intermittent".
For reference, roc reports on
http://weblogs.mozillazine.org/roc/archives/2009/11/today_for_the_f.html
of over 100 test runs before being able to reproduce a random failure.
That's just not feasible without the infra that he has put in in that
particular case.
Axel
I predict that intermittent failures will continue to predominate. The
try-server makes it relatively easy to catch non-intermittent failures
before you check in. Therefore most of the test failures that reach
mozilla-central will be intermittent ones. Nondeterminism in our code
will increase over time as we exploit more CPU parallelism.
Our only hope is to get better at analyzing and fixing intermittent
failures. Over the last few days I've found and fixed three intermittent
test failures using VMWare's record-and-replay debugging. It's fun.
(Well, I think I've fixed them --- verifying fixes for non-reproducible
bugs is still a hard problem.) We're still learning how to use these
tools most effectively, but we'll blog some more information soon.
I think the best we can realistically hope for in the long term is that
our Tinderbox test VMs record all the time. (This should not be a
short-term goal, the logistics would be formidable.) Then there will
still be intermittent failures, but you'll always be able to figure out
what went wrong. Hopefully then we can hold intermittent failures to an
acceptable level. In this world, we'd actually encourage variation in
the Tinderbox test environment, since it amounts to increased test coverage.
Rob
That may be true, but the system doesn't need to be perfect to be
valuable. Some tests may pass when they shouldn't, but a changeset with
no known failures is still, on average, better than a changeset known to
have failures.
In other words, a bug that crops up rarely and is detected by any of a
number of otherwise innocent tests --- is that a fair characterization
of the sort of thing you've been fixing? --- is something that is simply
not addressed effectively in our setup. The existence of such bugs
doesn't mean that it's not useful to automatically screen for the
classes of bugs our system can catch reliably. An all-green repo brings
us forward as much as it would if the heisenbugs didn't exist.
> I predict that intermittent failures will continue to predominate. The
> try-server makes it relatively easy to catch non-intermittent failures
> before you check in.
I'm proposing, in effect, to automate this part of the process.
I agree with roc. If the all-green repo isn't reliably all-green,
what's the point of using it?
Nick
But I think part of the problem is that we're lumping together two kinds
of testing. (I'm not sure of my terminology here --- Wikipedia wasn't
much help --- so corrections are welcome.)
- Regression testing (in this categorization) checks for reliably
reproducible failure conditions. If a test does not fail reliably
pre-fix, and pass reliably post-fix, it is not a regression test. Once a
regression test suite has been performed on a given revision of the
product, there is no point in running it again.
- Continuous testing checks for intermittent bugs by simply exercising
the product in some fashion over and over again. The tests need not even
be specific to a particular bug, as long as they provide a broad survey
of the conditions of live use. Fuzzing is continuous testing. Running
the test suite over and over is continuous testing. Continuous testing
should simply run all the time, on as many machines as possible, ideally
with execution recorded for later replay.
Part of what's confusing is that regression tests often make a good test
load for continuous testing, because they probably have the broadest
coverage of any readily available synthetic load.
But these two classes of testing need different kinds of support, and
yield different kinds of results. Roc and Axel are pointing out that it
doesn't make sense to use continuous testing results to drive an
all-green repo, because the failures don't necessarily reveal a
difference between one changeset and the next --- the failures are so
rare that we probably won't catch them the moment they're introduced.
So I understand roc's argument to be that the try server has effectively
solved the problem as far as the regression tests are concerned, and
what we struggle with in M-C are the bugs found by our approximation to
continuous testing.
In this sense, the fact that tinderbox stops testing once it has
processed all the landed changes is a misfeature. Each slave should run
over and over again on whatever the latest changeset is, to make it more
likely to catch intermittent failures, and to make it clearer when a
failure is intermittent (we can see that it comes and goes on the same
changeset).
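The "comes and goes on the same changeset" check could be sketched like this in Python. The input format, a list of (changeset, passed) pairs for one test, is an assumption made for illustration and not something tinderbox is known to provide.

```python
from collections import defaultdict


def intermittent_changesets(runs):
    """Return changesets on which the same test both passed and failed.

    `runs` is an iterable of (changeset, passed) pairs for a single test.
    """
    outcomes = defaultdict(set)
    for changeset, passed in runs:
        outcomes[changeset].add(passed)
    return {cs for cs, seen in outcomes.items() if seen == {True, False}}


runs = [
    ("abc", True), ("abc", False), ("abc", True),  # flaps on one changeset
    ("def", False), ("def", False),                # fails every time
]
print(intermittent_changesets(runs))  # {'abc'}
```

This only works if slaves rerun the same changeset; with a single run per changeset there is nothing to compare.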
Going further, it would be helpful to do continuous testing on as many
machines as we can afford, even if they are configured identically,
because this makes it more likely that we will catch intermittent
failures as soon as the bug is introduced. If we had an infinite number
of machines, we'd be guaranteed to see all the intermittent failures
immediately. Covering all our intended platforms is a minimum, not a
stopping point.
It would be known to lack build problems and reliably reproducible bugs.
s/bugs/test failures/
It's not clear to me whether the concern each of you is expressing
relates to false negatives (orange when it's really ok, except for
some existing random orange you happened to hit) or false positives
(green when it actually just introduced a new random orange).
I *think* roc was talking about false negatives, though, and I think
that's a serious concern. More concretely: if we define the
all-green-changeset repository for mozilla-central as containing up
to the most-recent changeset in mozilla-central that had all tests
green (i.e., if a particular test wasn't run on that changeset, then
count the next changeset on which it was run, but no further), then
its current tip would be
http://hg.mozilla.org/mozilla-central/rev/095b7beae53b , which will,
in less than 12 hours, have been pushed a week ago. (This may be
unusually bad, see next paragraph, but I don't think it would be
abnormal for this definition of an all-green-changeset repository to
go without update for a few days on end.)
However, an alternative (and, I suspect, more useful) definition of
the all-green-changeset repository would be a repository containing
the most recent changeset in mozilla-central that had all tests
green on that changeset or any later changeset; it would generally
be reasonably current (although not right now, since Windows Ed has
been orange since midday Friday, see bug 528699).
That said, it has been an unusually bad few weeks, and we really
need to fix it.
I strongly agree with almost everything you say in this message, but
I just want to make two comments.
> In this sense, the fact that tinderbox stops testing once it has
> processed all the landed changes is a misfeature. Each slave should run
> over and over again on whatever the latest changeset is, to make it more
> likely to catch intermittent failures, and to make it clearer when a
> failure is intermittent (we can see that it comes and goes on the same
> changeset).
We actually used to do this. Tinderbox used to test continuously,
and buildbot used to trigger builds every 2 hours. This is good
both for random failures and (especially) for performance testing,
which has inherent randomness. But we apparently don't have enough
hardware to maintain that level of coverage on the number of tests
and branches we now have to cover.
> Going further, it would be helpful to do continuous testing on as many
> machines as we can afford, even if they are configured identically,
> because this makes it more likely that we will catch intermittent
> failures as soon as the bug is introduced. If we had an infinite number
> of machines, we'd be guaranteed to see all the intermittent failures
> immediately. Covering all our intended platforms is a minimum, not a
> stopping point.
I disagree slightly here about the "configured identically" bit: I
think machines configured differently are better overall: you'll
see bugs that happen only in specific configurations *and* you'll
see the random bugs quickly, but it sometimes might take a little
additional time to distinguish which case you hit.
Can you use probability?
I notice from the incident you blogged that you ran the VM only until
the test failed. While you are working with that recording, could you
start it again and log the number of failures over a given time? Then,
after you deploy the fix, you could do the same thing and a (I'm sure
fairly simple) statistical calculation will give you a probability that
you've fixed the problem. We could set a threshold of 95%, or something.
E.g. if a bug occurs 4 times in 100 runs before the fix, and 0 times in
100 runs after the fix, and we assume the probability of the bug
occurring on a given run is constant, the percentage likelihood that the
bug is no longer present is 92% (1 - 0.975^100). This may, of course, be
bogus maths, but I'm sure this sort of calculation is possible.
Gerv
I was trying to think of some sort of relaxed condition like this, but
what you've written would include changesets with known reproducible
failures. Did you mean: "All tests that were applied to that changeset
are green, and all other tests are green on the first successor
changeset they were applied to"?
I think that sounds like my first (less useful) definition.
However, this definition would not include any changesets that cause
a particular test to be orange every time. Once the orange is fixed
by a later changeset, that later changeset (along with its
perma-orange ancestors) would get pulled in.
Oh --- yes, it does. Sorry about that.
> However, this definition would not include any changesets that cause
> a particular test to be orange every time. Once the orange is fixed
> by a later changeset, that later changeset (along with its
> perma-orange ancestors) would get pulled in.
By "this definition" are you talking about the "alternative (and, I
suspect, more useful) definition"? With a history like this:
changeset test 1 test 2
2 orange green
1 green orange
a repository that pulls "the most recent changeset in mozilla-central
that had all tests green on that changeset or any later changeset" would
pull changeset 1, even if the orange in test 2 is an always-reproducible
failure fixed by changeset 2. So it would have a tip changeset "that
causes a particular test to be orange every time".
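The two-changeset history above can be checked mechanically. This Python sketch uses hypothetical function names and simplifies the first definition by assuming every test ran on every changeset.

```python
def strict_tip(history, tests):
    """Latest changeset on which every test was itself green (None if none)."""
    for node, results in reversed(history):
        if all(results.get(t) == "green" for t in tests):
            return node
    return None


def relaxed_tip(history, tests):
    """Latest changeset such that each test was green on it or on a later one."""
    seen_green = set()
    for node, results in reversed(history):
        seen_green |= {t for t in tests if results.get(t) == "green"}
        if seen_green == set(tests):
            return node
    return None


history = [  # oldest first
    (1, {"test 1": "green", "test 2": "orange"}),
    (2, {"test 1": "orange", "test 2": "green"}),
]
tests = ["test 1", "test 2"]
print(strict_tip(history, tests))   # None
print(relaxed_tip(history, tests))  # 1
```

The relaxed definition selects changeset 1 even though test 2 was never green on it.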
I agree. In an ideal world, no slave would ever be idle for long.
With that said, there are barriers to this:
* We're underpowered still. For example, last night between m-c and
1.9.2 over jobs waited over 15 minutes before they even started. On the
try server this number was much higher. This is a problem we're
addressing, but we're not there yet.
* The need for smarter scheduling, queuing, and displays. With the
technology and setup we have now we _could_ ensure that no slave was
ever idle by retesting things over and over but that would hurt us in
other ways, such as incoming builds taking even _longer_ to start,
because they would wait on retests completing first. (We can't kill
those off in favour of incoming builds due to the burning it would
cause.) We also need better displays before this becomes reasonably useful.
> Going further, it would be helpful to do continuous testing on as many
> machines as we can afford, even if they are configured identically,
> because this makes it more likely that we will catch intermittent
> failures as soon as the bug is introduced. If we had an infinite number
> of machines, we'd be guaranteed to see all the intermittent failures
> immediately. Covering all our intended platforms is a minimum, not a
> stopping point.
I understand your point here, but I think it's a bit ignorant. We're not
shooting for the minimum, but we're still trying to catch up with the
constant demand, which continues to grow. It's not simply a matter of
spending money on hardware, it's a matter of ensuring we can keep those
machines consistent and functioning. This is not as easy as it may
sound. We are actively working on this.
Oh, right. Then it doesn't work.
I'm just talking theory. An infinite number of machines wouldn't leave
room for my house and some good restaurants. I just wanted to point out
that these two kinds of testing have different characteristics, and
work through some consequences.
Since you brought up the issue of practicality, let me counter with a
completely impractical idea. :)
An array of machines doing continuous testing can be arranged to cover
lots of platforms for whatever changeset is current. Or, they can be
arranged to cover lots of changesets for one selected platform. (Suppose
each machine has several VMs installed on it for the various platforms,
and we shut down all but the VM running the platform we care about.)
This would be an automated way to track down intermittent oranges: let N
machines all churn away on (say) every Nth changeset (thus covering the
last N*N changesets), and see where the oranges begin. Then rearrange
the machines to cover the tighter range and find the exact changeset.
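To make the rearrangement concrete, here is a sketch in Python. It is not a description of any existing tool: is_orange is a hypothetical stand-in for "run the tests on this changeset enough times to trust the answer", and the sketch assumes the orange, once introduced, persists in all later changesets.

```python
def find_first_orange(changesets, is_orange, n_machines):
    """Narrow down the first orange changeset, n_machines probes per round.

    `changesets` is ordered oldest-first; the oldest is assumed green and
    the newest orange. Returns (first orange changeset, rounds used).
    """
    lo, hi = 0, len(changesets) - 1   # invariant: lo is green, hi is orange
    rounds = 0
    while hi - lo > 1:
        # Spread the machines evenly over the changesets still in doubt.
        step = max(1, (hi - lo) // (n_machines + 1))
        probes = list(range(lo + step, hi, step))[:n_machines]
        rounds += 1
        for i in probes:              # in practice these would run in parallel
            if is_orange(changesets[i]):
                hi = i
                break
            lo = i
    return changesets[hi], rounds


# 100 changesets, orange introduced by changeset 37, 10 machines.
print(find_first_orange(list(range(100)), lambda cs: cs >= 37, 10))  # (37, 2)
```

With 10 machines, two rounds are enough to cover 100 changesets: one coarse pass, then one pass over the narrowed range.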
Can you rephrase that? I don't get it.
Axel
Yes, I expect you're right. Someone who remembers more stats than me
(Zack?) can probably work it out under some assumptions.
But suppose we had recording enabled on a set of test machines and over
a long period of time we see a particular failure exactly once. After
the fix, you might need to keep running those machines for an extremely
long period of time to get any confidence that the failure is fixed. The
problem is a lot worse considering that in real life those machines
would be testing a constantly moving trunk, or at least unable to spend
more than a limited amount of time testing a particular changeset.
Anyway, let's not worry about this yet.
Rob
I don't think that's completely accurate. We have regression tests
where, if a regression occurs, we expect that the test will fail
intermittently, because we don't know how to write a more reliable test.
> - Continuous testing checks for intermittent bugs by simply exercising
> the product in some fashion over and over again. The tests need not even
> be specific to a particular bug, as long as they provide a broad survey
> of the conditions of live use. Fuzzing is continuous testing. Running
> the test suite over and over is continuous testing. Continuous testing
> should simply run all the time, on as many machines as possible, ideally
> with execution recorded for later replay.
I think this is a good distinction, modulo the limitation noted above.
Unfortunately you simply can't do ongoing regression testing without
also effectively doing continuous testing.
> Going further, it would be helpful to do continuous testing on as many
> machines as we can afford, even if they are configured identically,
> because this makes it more likely that we will catch intermittent
> failures as soon as the bug is introduced. If we had an infinite number
> of machines, we'd be guaranteed to see all the intermittent failures
> immediately. Covering all our intended platforms is a minimum, not a
> stopping point.
Of course there are diminishing returns, so it wouldn't actually make
sense to max out on "as many machines as we can afford". The returns are
actually quite meagre since an intermittent test failure is pretty
low-value for us right now, since it's so hard to work from most test
logs to a fix. Maybe that will change.
Rob
Chris Pearce put some information up here:
http://pearce.org.nz/2009/11/replay-debugging-mochitest-failures.html
https://developer.mozilla.org/En/Debugging/Record_and_Replay_Debugging_Firefox
Rob
I am going to assert without evidence that we can make improvements
sufficient to make the tree green enough by removing the more frequently
occurring sources of random orange.
:-)
IOW, yes, let's cross that bridge when we come to it. If the code does,
in fact, contain hundreds of independent extremely hard-to-trigger bugs
which together make the product noticeably unstable, then... er...
Gerv
The statistical question here is formally equivalent to testing for a
fair coin. Before the fix, we had a Bernoulli process with 4
"tails" (failures) out of 100 trials. Now we have a Bernoulli process
with 0 tails out of 100 trials. The null hypothesis is that these are
samples from the same distribution. Since the probability of failure
is so low, we need the (exact) binomial test rather than the
(approximate but much easier to compute) chi-square test. We set it up
with the null probability equal to 4/100, and our sample is 0/100. A
one-sided test is appropriate, because we wouldn't be asking the
statistical question if we had seen *any* failures after the fix, let
alone *more* failures.
R's 'binom.test' gives me 98.3% chance that the bug in this example was
fixed.
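For anyone without R to hand, the same one-sided exact binomial calculation can be sketched in Python. The function name is illustrative, and reading one minus the p-value as the "chance that the bug was fixed" follows the usage above rather than strict statistical terminology.

```python
from math import comb


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Null hypothesis: the failure rate is still 4/100. Observed: 0 failures in 100.
p_value = binomial_cdf(0, 100, 4 / 100)
print(f"p-value: {p_value:.4f}")      # p-value: 0.0169
print(f"1 - p:   {1 - p_value:.3f}")  # 1 - p:   0.983
```

With zero observed failures the sum collapses to a single term, 0.96^100.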
> But suppose we had recording enabled on a set of test machines and
> over a long period of time we see a particular failure exactly once.
> After the fix, you might need to keep running those machines for an
> extremely long period of time to get any confidence that the failure
> is fixed. The problem is a lot worse considering that in real life
> those machines would be testing a constantly moving trunk, or at
> least unable to spend more than a limited amount of time testing a
> particular changeset.
I ran up a graph (attached) of the probability that a bug is fixed as a
function of the number of successful test runs since the fix, for
several different rates of incidence before the fix. It looks to me
like we only have to worry about that for failure rates much lower than
the ones we're dealing with right now. For a test that fails one time
in ten, we need about 30 runs for 95% confidence that it's been fixed;
a test that fails one time in 100 needs more like 300 runs for
that level of confidence, but how often do we have one of those?
zw
We only have a handful of random orange bugs that fail more often
than 1 time in 100; the bulk of them fail significantly less than
that. (I'd think 100 would probably be a reasonable
order-of-magnitude guess at the number of unit test runs per day on
mozilla-central, so the threshold for that rate is probably
reasonably close to the 1 failure per day threshold.)
I put them here:
http://people.mozilla.org/~zweinberg/intermittent-failures.png
http://people.mozilla.org/~zweinberg/intermittent-failures.R
zw
Alas. I guess the near-constant "random" orange in the past few weeks
made me overconfident, in a funny sort of way.
Seems to me that what we need to deal with intermittent failures at
those frequencies is an easy way for the developers working on the bugs
to run the appropriate test in an endless loop, locally. Unfortunately
I don't know if it's even *possible* to run Talos locally, at present...
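As a rough idea of what such a loop could look like, here is a Python sketch that reruns one command until it fails or a run cap is reached. The helper name and the placeholder command are hypothetical; the real Mochitest/Reftest/Talos invocation would have to be substituted, and the sketch assumes the harness signals failure through a non-zero exit status:

```python
import subprocess
import sys


def run_until_failure(cmd, max_runs):
    """Run cmd repeatedly; stop at the first non-zero exit status.

    Returns (green_runs, failed), where green_runs counts the passes
    seen before stopping and failed says whether a failure occurred.
    """
    for i in range(max_runs):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return i, True
    return max_runs, False


if __name__ == "__main__":
    # Placeholder command: substitute the real test harness invocation.
    placeholder = [sys.executable, "-c", "pass"]
    print(run_until_failure(placeholder, max_runs=5))
```

The cap is there so the loop also yields a count of consecutive green runs, which is the number the confidence estimates in this thread need.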
zw
> Alas. I guess the near-constant "random" orange in the past few weeks
> made me overconfident, in a funny sort of way.
>
> Seems to me that what we need to deal with intermittent failures at
> those frequencies is an easy way for the developers working on the bugs
> to run the appropriate test in an endless loop, locally. Unfortunately
> I don't know if it's even *possible* to run Talos locally, at present...
>
>
The vast majority of random orange is unit test failures, and it's quite
easy to run Mochitest/Reftest/xpcshell tests at home. The problem is that
most of these tests only exhibit failure when run in a VM, not on a fast
developer machine. Of course, it's not that much harder to run tests in a
VM.
-Ted
I'd dispute that insofar as there have always been a handful of mochi
and reftests that fail spuriously on my development machines. And it
always seems to be one of the talos things that goes rando-orange when I
push stuff.
If releng published VM images that precisely matched what runs on the
build farm, with instructions for mounting a build tree from the host
OS inside and running all the tests, that would be excellent.
(Does MoCo have a site license for VMWare that can be extended to
remote employees?)
zw
> Ted Mielczarek <ted.mie...@gmail.com> wrote:
> > The vast majority of random orange is unit test failures, and it's
> > quite easy to run Mochitest/Reftest/xpcshell tests at home. The
> > problem is that most of these tests only exhibit failure when run in
> > a VM, not on a fast developer machine. Of course, it's not that much
> > harder to run tests in a VM.
>
> I'd dispute that insofar as there have always been a handful of mochi
> and reftests that fail spuriously on my development machines. And it
> always seems to be one of the talos things that goes rando-orange when I
> push stuff.
>
You don't have to take my word for it:
https://bugzilla.mozilla.org/buglist.cgi?quicksearch=sw%3A[orange]
> If releng published VM images that precisely matched what runs on the
> build farm, with instructions for mounting a build tree from the host
> OS inside and running all the tests, that would be excellent.
>
> (Does MoCo have a site license for VMWare that can be extended to
> remote employees?)
>
>
The Linux refplatform (not 100% up-to-date, but pretty close) is available:
https://wiki.mozilla.org/ReferencePlatforms/Linux-Public
Windows obviously presents licensing problems, and our OS X build machines
are physical hardware. I don't really think that most of the failures
require the exact reference platform, however, usually just a VM is good
enough. roc and cpearce had success debugging random test failures in a
record-and-replay VM, it just took a lot of cycles to catch the failure.
-Ted
Fabulous, thanks!
> I ran up a graph (attached) of the probability that a bug is fixed as a
> function of the number of successful test runs since the fix, for
> several different rates of incidence before the fix. It looks to me
> like we only have to worry about that for failure rates much lower than
> the ones we're dealing with right now. For a test that fails one time
> in ten, we need about 30 runs for 95% confidence that it's been fixed;
> a test that fails one time in 100 needs more like 300 runs for
> that level of confidence, but how often do we have one of those?
One of the random oranges I fixed last week was on the order of 1-in-100
on my test machine.
http://weblogs.mozillazine.org/roc/archives/2009/11/today_for_the_f.html
I got 1000 green iterations with the fix, though, so I guess it probably
is fixed!
Rob
> Windows obviously presents licensing problems, and our OS X build machines
> are physical hardware. I don't really think that most of the failures
> require the exact reference platform, however, usually just a VM is good
> enough.
Hmmm. Has anyone compared the rate of OSX oranges to Windows/Linux
oranges? There was some talk the other day that overloaded VMs were
contributing to intermittent orange. OS X, being on real HW, would be a
control to that hypothesis.
[And then there's the question of how many of these oranges are caused
by timing-sensitive product code, vs. timing-sensitive tests. Time spent
fixing the product is well worth it, time spent fixing tests is not.]
Justin
> Hmmm. Has anyone compared the rate of OSX oranges to Windows/Linux oranges?
> There was some talk the other day that overloaded VMs were contributing to
> intermittent orange. OS X, being on real HW, would be a control to that
> hypothesis.
>
I personally haven't, although I know of one bug that only seems to manifest
on our 4-core Xserves (not on the 2-core minis):
https://bugzilla.mozilla.org/show_bug.cgi?id=524014.
>
> [And then there's the question of how many of these oranges are caused by
> timing-sensitive product code, vs. timing-sensitive tests. Time spent fixing
> the product is well worth it, time spent fixing tests is not.]
>
All of the bugs roc has fixed with record-and-replay debugging so far have
been product bugs, AFAIK.
-Ted
> I ran up a graph (attached)
(stripped out of the news version :-( )
> of the probability that a bug is fixed as a
> function of the number of successful test runs since the fix, for
> several different rates of incidence before the fix. It looks to me
> like we only have to worry about that for failure rates much lower than
> the ones we're dealing with right now. For a test that fails one time
> in ten, we need about 30 runs for 95% confidence that it's been fixed;
> a test that fails one time in 100 needs more like 300 runs for
> that level of confidence, but how often do we have one of those?
Let's pick a confidence level we're happy with, call it 95%, and just
have a web page with a lookup table:
Bug Prevalence    Runs Reqd for 95% Confidence
1%                300
2%                275
3%                230
...
Gerv
As Ted already mentioned, the Linux ref platform is readily available.
We can get you access to a copy of the Windows ref platform without much
trouble. Macs can be loaned out in extreme cases, but we're very short
on them right now, so we can't give them out as readily.
Doing builds on build machines isn't really different than elsewhere as
long as you use the "official" mozconfigs.
> (Does MoCo have a site license for VMWare that can be extended to
> remote employees?)
>
File an IT bug, they'll get you a license.
Just to point out the obvious here, with 300M users, you get 300 runs a
million times every few days, so at 1% you get 10,000 failures every few
days.
So in addition to all of this, you need an estimate of how often the
tested code sequence can be visited by average users or you have to go
for a higher margin. Difficult business!
jjb
> On 24/11/09 20:45, Zack Weinberg wrote:
> > I ran up a graph (attached)
>
> (stripped out of the news version :-( )
I put it on people.m.o and posted a URL in a followup; here it is again:
http://people.mozilla.org/~zweinberg/intermittent-failures.png
http://people.mozilla.org/~zweinberg/intermittent-failures.R
> Let's pick a confidence level we're happy with, call it 95%, and just
> have a web page with a lookup table:
>
> Bug Prevalence    Runs Reqd for 95% Confidence
> 1%                300
> 2%                275
> 3%                230
> ...
Here's the raw numbers:
failure          confidence level
probability      0.95      0.99
0.5                 5         7
0.4                 6        10
0.3                 9        13
0.2                14        21
0.1                29        44
0.09               32        49
0.08               36        56
0.07               42        64
0.06               49        75
0.05               59        90
0.04               74       113
0.03               99       152
0.02              149       228
0.01              299       459
0.009             332       510
0.008             373       574
0.007             427       656
0.006             498       766
0.005             598       919
0.004             748      1149
0.003             998      1533
0.002            1497      2301
0.001            2995      4603
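
These numbers look consistent with taking the smallest n for which 1 - (1 - p)^n reaches the confidence level. A minimal Python sketch under that assumption follows; the formula is inferred from the table rather than taken from the .R script, and the function name is made up for illustration:

```python
import math


def runs_required(failure_probability: float, confidence: float) -> int:
    """Smallest n with 1 - (1 - p) ** n >= confidence."""
    return math.ceil(
        math.log(1.0 - confidence) / math.log(1.0 - failure_probability)
    )


# Print a few rows in the same column order as the table.
for p in (0.5, 0.1, 0.01, 0.001):
    print(p, runs_required(p, 0.95), runs_required(p, 0.99))
```

A lookup page could then be generated for whatever confidence level is agreed on, rather than maintained by hand.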
zw
While that's true, we know there are outstanding orange bugs caused by
httpd.js issues (which are being worked on). And I have traced orange
bugs to bad tests in the past.
But since you can't usually know ahead of time whether the bug is in the
test, you might as well figure it out and then fix the test if that's
where the problem is.
Rob