Andreas
On Aug 10, 2010, at 10:45 AM, sayrer wrote:
> See https://wiki.mozilla.org/Platform/2010-08-10#Tree_Health.
>
> What are some steps we can take to improve the situation?
> _______________________________________________
> dev-platform mailing list
> dev-pl...@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-platform
> I disagree with item 2). The try server is my compiler for platforms I am not running on my machine (windows + linux). Developer time is a lot more costly than CPU time. Making me install 2 VMs and wait for compiles costs a ton more than scaling up try server. If we compile on cheap beefy boxes and have a ton of those (hundreds, not dozens) instead of costly and slow VMs, we can scale try server and our build infrastructure arbitrarily. Or even better, use a cloud computing service.
I think you're making some assumptions, and I know from talking to Justin and John that cloud computing services have been tested and shown not to be a viable route for scaling. As the wiki notes state, work is being done to increase capacity as well.
If you personally require a bank of Minis to act as your compilers (so you don't have to do it across VMs) I'm sure that we can arrange for that hardware to be made available. I don't think that's a universal solution, but I think that you're a pretty special case :)
In the meantime, the issue is people who aren't even trying to compile & test even on one development platform, not across all three, breaking all builds and tests. I think that it's reasonable to ask that people compile and test locally before pushing to try to get the cross-platform coverage.
cheers,
mike
On Aug 10, 2010 2:10 PM, "Andreas Gal" <andre...@gmail.com> wrote:
I disagree with item 2). The try server is my compiler for platforms I am
not running on my machine (windows + linux). Developer time is a lot more
costly than CPU time. Making me install 2 VMs and wait for compiles costs a
ton more than scaling up try server. If we compile on cheap beefy boxes and
have a ton of those (hundreds, not dozens) instead of costly and slow VMs,
we can scale try server and our build infrastructure arbitrarily. Or even
better, use a cloud computing service.
Andreas
On Aug 10, 2010, at 10:45 AM, sayrer wrote:
I haven't seen the notes you refer to, but it must be possible to scale a compiler farm in some way. Justin and John are the experts here. If cloud computing isn't the right answer, I am sure they will find some other way. Building is the most parallelizable task in CS I can think of. This problem can be solved.
Andreas
On Aug 10, 2010, at 11:15 AM, Mike Beltzner wrote:
> On 2010-08-10, at 11:10 AM, Andreas Gal wrote:
>
>> I disagree with item 2). The try server is my compiler for platforms I am not running on my machine (windows + linux). Developer time is a lot more costly than CPU time. Making me install 2 VMs and wait for compiles costs a ton more than scaling up try server. If we compile on cheap beefy boxes and have a ton of those (hundreds, not dozens) instead of costly and slow VMs, we can scale try server and our build infrastructure arbitrarily. Or even better, use a cloud computing service.
>
- Kyle
I think it makes sense to say that your cset has to build and start (and pass
the new tests) on your primary platform. But even running mochitests on a
platform takes a couple hours of local computer time, and even worse, you
can't really touch that computer while the tests are running because of
focus issues. So I think that perhaps we need a more nuanced version of the
rule!
--BDS
Yes, that is what I meant. Your build should compile on your platform,
and some relevant set of tests should be run.
We are having problems with uncompiled and/or untested patches being
pushed to shared infrastructure.
- Rob
Given that our tests now require focus to remain on the test window
while running, so effectively require a dedicated machine to run, or
hours of downtime on the part of the person running them, I'm not sure
it's reasonable to ask that people test locally...
-Boris
It is reasonable to ask that relevant tests are run.
- Rob
I think we need to
- eliminate the tests that are focus-sensitive from the default run
- make the test infrastructure run multiple suites in parallel, since
most developers have multiple cores
- figure out if we have too many tests, and whether they have value
commensurate in their cost
- figure out why tests take so long to run on your computer, because
on *minis* we see all of mochitest taking 1h to 1h20m depending on OS,
and I think that includes transferring the builds!
(http://bit.ly/9uJjaB)
And probably other things too. In the interim, people need to at
least not be *reckless* by using try just because (and I am not making
this up) they don't know how to run our test suites.
But most people don't get review turned around in a few hours either,
so I think putting the patch up for review and then doing the full
test run overnight or whatever is pretty OK. It might mean that we
find some bugs twice (once in review, once in test suite) but I don't
think that's a big deal.
Mike
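Shaver's "run multiple suites in parallel" item could be sketched roughly like this; the suite commands below are placeholders (plain `echo`), not the real mochitest/reftest invocations, and the pool layout is just one obvious way to use the cores:

```python
# Rough sketch of running several test suites in parallel, one worker per
# core. The commands are placeholders, not the real harness invocations.
from multiprocessing.pool import ThreadPool
import os
import subprocess

SUITES = {
    "mochitest": ["echo", "mochitest done"],
    "reftest": ["echo", "reftest done"],
    "xpcshell": ["echo", "xpcshell done"],
}

def run_suite(item):
    """Run one suite command; return (suite name, exit code)."""
    name, cmd = item
    return name, subprocess.call(cmd, stdout=subprocess.DEVNULL)

pool = ThreadPool(processes=min(len(SUITES), os.cpu_count() or 2))
results = dict(pool.map(run_suite, sorted(SUITES.items())))
pool.close()
pool.join()
print(results)  # all zero exit codes if every suite "passed"
```

As the thread notes, this only becomes practical for GUI suites once the focus requirement goes away, since two suites can't share the screen.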
Yes, agreed.
-Boris
I don't think that's practical for a large chunk of our tests.
> - make the test infrastructure run multiple suites in parallel, since
> most developers have multiple cores
That would be very helpful indeed, but it depends on the focus
requirement to go away.
> - figure out if we have too many tests, and whether they have value
> commensurate in their cost
I don't think that we have too many tests. Although the number is
very high, I have witnessed tons of bugs in my own patches being
uncovered by seemingly unrelated (and sometimes seemingly unneeded)
tests. Bugs that I would not be able to detect if it were not for
those tests.
> - figure out why tests take so long to run on your computer, because
> on *minis* we see all of mochitest taking 1h to 1h20m depending on OS,
> and I think that includes transferring the builds!
> (http://bit.ly/9uJjaB)
I've experienced very similar test run times on my own machine. (And
I've had to run the entire mochitest-plain suite quite a few times).
> But most people don't get review turned around in a few hours either,
> so I think putting the patch up for review and then doing the full
> test run overnight or whatever is pretty OK. It might mean that we
> find some bugs twice (once in review, once in test suite) but I don't
> think that's a big deal.
I think that's a very good suggestion. I've sometimes posted
follow-up patches for review once the try server runs finish and
uncover a problem. This will save the reviewer's time if (s)he has
looked at the patch before the try server finishes.
--
Ehsan
<http://ehsanakhgari.org/>
But continue running them on tbox, yes? All the focus tests, for
example, fall into this category.
> - figure out if we have too many tests, and whether they have value
> commensurate in their cost
Fwiw, we don't have enough tests... In my opinion.
> - figure out why tests take so long to run on your computer, because
> on *minis* we see all of mochitest taking 1h to 1h20m depending on OS,
> and I think that includes transferring the builds!
> (http://bit.ly/9uJjaB)
Picking numbers at random, that link shows "all of mochitest" (adding up
parts 1-5) taking about 2h50min on Mac OS X 10.5.5 debug. Where did
that 1h to 1h20min number come from?
(For reference, the comparable numbers on other OSes are: OSX 10.6 ==
1h50min, F12 == 2h21min, F12x64 == 2h1min, Win5.1 == 2h16min, win6.1 ==
2h26min, win6.1x64 == 2h26min (in fact looks like copy-paste from the
32-bit times).)
> But most people don't get review turned around in a few hours either,
> so I think putting the patch up for review and then doing the full
> test run overnight or whatever is pretty OK. It might mean that we
> find some bugs twice (once in review, once in test suite) but I don't
> think that's a big deal.
Is there a single command that will do the full test run (mochitest,
reftest, chrome test, browser test, etc)? If there is, I'd love to run
tests overnight using it...
-Boris
AFAIK no, but that would be fairly easy to add.
Cheers,
Shawn
Sure, or even running them nightly.
>> - figure out if we have too many tests, and whether they have value
>> commensurate in their cost
>
> Fwiw, we don't have enough tests... In my opinion.
I agree that we don't have enough test coverage, yeah; I was
advocating my own devilry, or similar. It may be that consolidating
tests to reduce the amount of time in setup/teardown or similar would
pay dividends, though.
>> - figure out why tests take so long to run on your computer, because
>> on *minis* we see all of mochitest taking 1h to 1h20m depending on OS,
>> and I think that includes transferring the builds!
>> (http://bit.ly/9uJjaB)
>
> Picking numbers at random, that link shows "all of mochitest" (adding up
> parts 1-5) taking about 2h50min on Mac OS X 10.5.5 debug. Where did that 1h
> to 1h20min number come from?
I was looking at opt builds (1-5 + other), which are the test results
that people mostly watch for on try server as well, I think.
> Is there a single command that will do the full test run (mochitest,
> reftest, chrome test, browser test, etc)? If there is, I'd love to run
> tests overnight using it...
I don't believe there is, though I thought there was until I went
looking for it a couple of weeks ago. :-/
Mike
Note, that doesn't do reftests (not sure how to do that offhand), but it
covers most of our tests.
Cheers,
Shawn
Then I would think that
make mochitest check xpcshell-tests reftest crashtest
is the holy grail here...
--
Ehsan
<http://ehsanakhgari.org/>
- Kyle
On Tue, Aug 10, 2010 at 12:48 PM, Ehsan Akhgari <ehsan....@gmail.com> wrote:
> On Tue, Aug 10, 2010 at 3:45 PM, Shawn Wilsher <sdw...@mozilla.com> wrote:
>> On 8/10/2010 12:32 PM, Boris Zbarsky wrote:
>>>
>>> Is there a single command that will do the full test run (mochitest,
>>> reftest, chrome test, browser test, etc)? If there is, I'd love to run
>>> tests overnight using it...
>>
>> In your object directory:
>> [py]make mochitest check xpcshell-tests
>
> Then I would think that
>
> make mochitest check xpcshell-tests reftest crashtest
>
> is the holy grail here...
>
> --
> Ehsan
> <http://ehsanakhgari.org/>
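A minimal "overnight" wrapper around the make targets Ehsan lists might look like this. The target names come from the thread; the script structure, logging, and dry-run flag are purely illustrative, not an existing tool:

```python
#!/usr/bin/env python
# Sketch of an overnight full-test-run wrapper. Only the make target names
# are from the thread; everything else is an illustrative assumption.
import subprocess
import sys
import time

SUITES = ["mochitest", "check", "xpcshell-tests", "reftest", "crashtest"]

def run_all(objdir, make="make", dry_run=False):
    """Run each suite in sequence, keep going on failure, report results."""
    results = {}
    for target in SUITES:
        cmd = [make, "-C", objdir, target]
        if dry_run:
            print("would run: " + " ".join(cmd))
            results[target] = None
            continue
        start = time.time()
        results[target] = subprocess.call(cmd)
        status = "OK" if results[target] == 0 else "FAILED"
        print("%s: %s (%.0fs)" % (target, status, time.time() - start))
    return results

if __name__ == "__main__":
    # Pass your objdir; set dry_run=False for a real overnight run.
    run_all(sys.argv[1] if len(sys.argv) > 1 else ".", dry_run=True)
```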
I don't think that backing people out aggressively should be discussed
as long as #developers is basically 50% about how our infrastructure
just failed to report/build/star/whichever.
Axel
The ability to stop a try server run partway through would be great.
Eg. if I have a stupid windows compile error that'll usually show up
really quickly. I've heard that RelEng is working on this, it'll be
great.
I wonder if the ability to do a try server run on a subset of machines
is useful (eg. just windows machines). It could get complicated,
though.
N
There's no self serve way to do this yet, as you mention, but RelEng is
more than happy to do it for you. You can file a bug or ping the
buildduty person to get this done. Be sure to mention the changeset and
which jobs you want killed.
Platform / job selection is also being worked on.
In reverse polish notation:
My numbers aren't skewed, they're biased by a European timezone. Which
means that I get a bit of silence, and then, most of the time, a
struggle to get the tree back to life.
To your first question, backing people out because the logs on tbpl are
cached as empty and the star magic won't work sounds like a bad
solution. And I'm surprised to see real orange among all that random orange.
Practically, if I land in my daytime, folks that could help me to
diagnose the test results beyond what tbpl does wouldn't be awake before
someone else stepped up and quoted a rule to back me out. Yielding yet
another rule to make me not land code.
Axel
> Practically, if I land in my daytime, folks that could help me to
> diagnose the test results beyond what tbpl does wouldn't be awake before
> someone else stepped up and quoted a rule to back me out. Yielding yet
> another rule to make me not land code.
I was under the impression that we had build folks covering basically
every waking hour of the day. Is this no longer true?
Cheers,
Shawn
Perhaps "immediately" in my original post was the wrong word to use.
I'm not advocating setting up a bot that backs people out 30 seconds
after a build turns orange. Both Friday and Monday the tree was
closed for the better part of the day because once we discovered that
a patch was causing orange/red/whatever we tried to fix it on
Tinderbox rather than back it out. Once we know that a given *push*
(and I say push, not changeset, because we've wasted a lot of time in
the past trying to figure out which changeset in a six changeset push
causes an issue, and then we back out changeset N and find an hour
later that changeset N+2 depended on it to work properly) has caused
red or new orange that push should be backed out. We can't afford to
use mozilla-central to debug and fix patches.
And FWIW, I don't think potentially new randomorange is a big issue
here. At least over the past few days these patches that have failed
have failed cleanly and clearly, either as red or as orange on
multiple platforms (or in platform specific tests).
- Kyle
> And I'm surprised to see real orange among all that random orange.
They're not random. They are intermittent, and we need to understand what is causing them to be so, and fix that. Thinking of them as random means that you are permitting yourselves and others to ignore them. That's a problem.
Honestly, I'd rather we enforce "do not check in on orange" even if the cost is that we lose code commits. The alternative is ignoring tests, which is dangerous.
cheers,
mike
FWIW, running tests in a local VM is a good way to ameliorate those
sorts of problems.
Rob
This is possible already. e.g.
http://blog.mozilla.com/cjones/2010/08/10/filter-your-tryserver-builds-a-bit-more-easily/
Rob
That script is awesome, but it has a hardwired, probably already
out-of-date list of platforms in it. (It appears, for instance, that it
cannot disable the win7 tests, which are less-than-useful on the
tryserver at present due to permaorange.) I'm not sure if we even
*have* a canonical, guaranteed-up-to-date list of try server platforms
anywhere in scrapeable format, but wouldn't it be nice if the script
didn't have to be hand-edited?
(Putting my money where my mouth is: if releng commits to maintain a
URL with a machine-parseable list of identifiers that can follow
"mozconfig-extra-", I'll make the script use it.)
zw
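Zack's scrape-the-platform-list idea might look something like the sketch below. The `mozconfig-extra-` prefix comes from the thread, but the listing format (and any URL serving it) is hypothetical; releng would define the real one:

```python
# Sketch: derive the try platform list from a machine-parseable listing of
# "mozconfig-extra-" identifiers instead of a hardcoded table. The listing
# format here is invented for illustration.
def parse_platforms(text, prefix="mozconfig-extra-"):
    """Return the identifiers that follow `prefix`, ignoring everything else."""
    platforms = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            platforms.append(line[len(prefix):])
    return platforms

# A fake listing, standing in for whatever releng would publish.
sample = """\
mozconfig-extra-linux
mozconfig-extra-linux64
mozconfig-extra-win32
# comments and unrelated lines are skipped
mozconfig-extra-macosx64
"""

print(parse_platforms(sample))  # → ['linux', 'linux64', 'win32', 'macosx64']
```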
On 10-08-10 7:14 PM, Shawn Wilsher wrote:
>> Practically, if I land in my daytime, folks that could help me to
>> diagnose the test results beyond what tbpl does wouldn't be awake before
>> someone else stepped up and quoted a rule to back me out. Yielding yet
>> another rule to make me not land code.
We can definitely help with infrastructure issues, but we're far from
experts when it comes to random orange and other such things.
> I was under the impression that we had build folks covering basically
> every waking hour of the day. Is this no longer true?
Somebody is probably around most of the time (we have people in -8, -5,
+3, and +12 of UTC), except for some parts west coast Fridays and much
of that weekend. We only have one person in +3 and one in +12, so if
either of them are gone for any reason we lose coverage. They've also
got other work to do too, and aren't at a buildduty level of
responsiveness all of the time.
That'd be great, I'd love that. FTR the script was just a quick hack to
save me some time. Another option I thought of a while ago was an hg
extension in the same spirit as qimportbz, where one could just |hg pull
-u| its repo to update. Should be more flexible and would have access
to better mq state checking etc. (Maybe they should merge into an mozhg
extension?)
> (Putting my money where my mouth is: if releng commits to maintain a
> URL with a machine-parseable list of identifiers that can follow
> "mozconfig-extra-", I'll make the script use it.)
>
Sure, script or hg extension, this sounds really useful.
Cheers,
Chris
Note that the recommended way of "skipping" a build (adding 'exit' in
mozconfig-extra-$platform) will actually turn that build red.
So in cases where a developer is using TryServer to debug an issue with a
specific platform that they don't have local builds for, they're likely to
get near-full-red or full-red on their TryServer push, simply because
they've "turned off" the other platforms where they already know their
patch builds correctly.
(I'm not claiming that this is the most common cause of full-red TryServer
cycles -- I just wanted to point out that a full-red TryServer cycle
doesn't necessarily mean that the developer neglected to test locally --
in this case, it'd actually mean they're being a good citizen of the tree
by passing on TryServer cycles that they don't need.)
~Daniel
On 08/10/2010 10:45 AM, sayrer wrote:
> See https://wiki.mozilla.org/Platform/2010-08-10#Tree_Health.
>
> What are some steps we can take to improve the situation?
https://bugzilla.mozilla.org/show_bug.cgi?id=578895#c1
:-)
--
Ehsan
<http://ehsanakhgari.org/>
That's a good point. I updated
http://people.mozilla.com/~cjones/tryselect to print "CANCELLED-BUILD"
before exiting, so that when we have tbpl parsing logs client-side (bug
585187), it can filter these out based on "CANCELLED-BUILD".
Cheers,
Chris
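Putting dholbert's and Chris's points together, a hypothetical `mozconfig-extra-<platform>` fragment for skipping a platform would then be just the following (the `TinderboxPrint:` form is the one suggested later in the thread for making tinderbox/TBPL display it):

```shell
# Hypothetical mozconfig-extra-<platform> fragment: skip this platform's
# build but leave a marker that log parsers / TBPL can recognize as a
# deliberate cancellation rather than a broken build.
echo "TinderboxPrint: CANCELLED-BUILD"
exit 0
```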
> On Tue, Aug 10, 2010 at 3:15 PM, Mike Shaver <mike....@gmail.com> wrote:
>> I think we need to
>> - eliminate the tests that are focus-sensitive from the default run
> I don't think that's practical for a large chunk of our tests.
Maybe we could split up the focus-sensitive tests, and run those
first... then the other tests could run, possibly in parallel.
--
Warning: May contain traces of nuts.
Please file a bug!
I have thought about that a few times idly as well, but I rarely am
working on anything where I need it.
People like you surely would profit from it though, and it should be
easy to add.
I'll make sure it's ported to comm-central as well.
Robert Kaiser
--
Note that any statements of mine - no matter how passionate - are never
meant to be offensive but very often as food for thought or possible
arguments that we as a community needs answers to. And most of the time,
I even appreciate irony and fun! :)
> On Tue, Aug 10, 2010 at 8:37 PM, Zack Weinberg
> <zwei...@mozilla.com> wrote:
> > Robert O'Callahan <rob...@ocallahan.org> wrote:
> >
> > That script is awesome, but it has a hardwired, probably already
> > out-of-date list of platforms in it. (It appears, for instance,
> > that it cannot disable the win7 tests, which are less-than-useful
> > on the tryserver at present due to permaorange.) I'm not sure if
> > we even *have* a canonical, guaranteed-up-to-date list of try
> > server platforms anywhere in scrapeable format, but wouldn't it be
> > nice if the script didn't have to be hand-edited?
> >
> > (Putting my money where my mouth is: if releng commits to maintain a
> > URL with a machine-parseable list of identifiers that can follow
> > "mozconfig-extra-", I'll make the script use it.)
>
> https://bugzilla.mozilla.org/show_bug.cgi?id=578895#c1
>
> :-)
That's ... less conveniently parseable than I'd like, but I'll see what
I can do. Tomorrow :)
zw
I think trying to optimize the tests is a better first approach.
Aren't a lot of these starting up and shutting down the engine? Running
several tests in the same JS engine instance, maybe in a separate
context, might speed things up by a factor of 10.
Also, at least in the Thunderbird tests, there's a lot of "wait 10
seconds", because there were concurrency issues that nobody bothered to
find out and just put that hack in. That costs a lot of time as well,
aggregated.
> be *reckless* by using try just because (and I am not making
> this up) they don't know how to run our test suites.
Yes, that's a bad reason. However, the testsuites should be easier to
run. For several suites, there's barely any documentation and I had to
ask people to know how to run them, and I was told I need a certain,
freaky setup.
Ignoring for the moment the fanciful builders that finish jobs in one
minute, an automated backout system is probably impossible because of
the intermittent failures.
- Kyle
Actually, AFAIK most of those have been taken out, except in the bloat
tests (which afaik most people don't actually run).
Standard8
Ah yes? I happen to have such a "fanciful" builder: "make -s -j4",
assuming the tree is already fully built, i.e. ./configure && time make
-s -j4 && touch netwerk/protocol/http/nsHttpChannel.cpp && time make -s
-j4. The first build takes ~35 minutes, the second takes ~1 minute. All
make should do is check which files changed, compile that, and link it.
On 2-core Athlon which sells for 200-400 bucks for the whole machine.
To get to the good tree: You just use hg.
> an automated backout system is probably impossible because of
> the intermittent failures.
Intermittent red? I specifically wrote not to back out on orange. Have
you even read and mentally processed my "bunch of stuff" before tossing
it away?
And Beltzner wrote that intermittent oranges are bugs that *need* to be
fixed, and I totally agree. If you care about the tests, you'd better make
them reliable. It's been years now that this state has existed.
Please, let's keep it civil and not accuse people of not reading posts.
Cheers,
Shawn
I'm sure Releng would be really interested (and I know I am!) in your
configuration because our builders have some fairly beefy hardware and
are nowhere near that fast.
>> an automated backout system is probably impossible because of
>> the intermittent failures.
>
> Intermittent red? I specifically wrote not to back out on orange. Have you
> even read and mentally processed my "bunch of stuff" before tossing it away?
We have a decent amount of intermittent red (as dholbert notes). You
said backing out on orange was optional, which certainly wouldn't be
possible. I did read your post, I just don't think that most of the
solutions are practical (mostly because of the speed issue, but I'm
more than happy to be proven wrong).
> And Beltzner wrote that intermittent oranges are bugs that *need* to be
> fixed, and I totally agree. If you care about the tests, you'd better make
> them reliable. It's been years now that this state has existed.
I completely agree, but right now our choice is to spend resources
fixing orange or fixing blockers. That situation is not likely to
change until Fx 4 gets close to shipping.
- Kyle
I agree that this is a serious problem. And the amount of intermittent
orange is very substantial - I see a few known intermittent oranges on
try pretty much each time I push, no matter what I push. Time is
wasted figuring out what is a real orange and what isn't, both in
nervously looking at your own pushes, and also at others' (when
deciding if to push after them), and far worse than that, it prevents
the possibility of automated backing out of bad pushes.
This may be a naive thing to say, but is there a reason not to, at
some future point in time, focus almost entirely on fixing important
intermittent oranges, and after that, 'weaken' the remaining ones, so
that when they do fail, they show up as less than orange but more than
success (that is, a warning, not a failure)?
If we did that, then we could automatically back out any test that
causes an orange. Zero tolerance. The tree is always green (+ some
warnings, which are all on tests known to be unreliable, but of low
importance). No need for human intervention at all to fix oranged
trees. No need for people to carefully look at the tree before
pushing, in order to decide if it's "too orange" to push - just push
away (assuming what you are pushing passed try), as the automatic
system will back out any patch before yours that causes failures
anyhow. Push and forget, and get an email back later sometime whether
it stuck or not.
The preparation for this (fixing all important intermittent oranges)
would be hard, and could only be started sometime in the future,
obviously - but maybe it would be worth it?
- azakai
There's nothing special about it, completely standard opt build. Try the
commands above.
> We have a decent amount of intermittent red (as dholbert notes).
Ah, I didn't know that (might have seen a few of those, but considered
them a puzzling hiccup). Since when are compiles non-deterministic? I'd
think there's something seriously wrong, then.
> I completely agree, but right now our choice is to spend resources
> fixing orange or fixing blockers. That situation is not likely to
> change until Fx 4 gets close to shipping.
You're also spending a huge amount of time on creating and reviewing
tests (on these very blockers), so obviously the tests are considered
important, so that's not an argument. A testsuite which is ignored
because it's crying wolf all the time is not terribly useful. As this
very argument (we can't do certain reactions to orange, because it may
be wrong) shows.
> On Wed, Aug 11, 2010 at 2:12 PM, Ben Bucksch
> <ben.buck...@beonex.com> wrote:
> > Ah yes? I happen to have such a "fanciful" builder: "make -s -j4",
> > assuming the tree is already full build, i.e. ./configure && time
> > make -s -j4 && touch netwerk/protocol/http/nsHttpChannel.cpp &&
> > time make -s -j4. First build takes ~35 minutes, the second takes
> > ~1 minute. All make should do is check which files changed, compile
> > that, link it.
> >
> > On 2-core Athlon which sells for 200-400 bucks for the whole
> > machine.
>
> I'm sure Releng would be really interested (and I know I am!) in your
> configuration because our builders have some fairly beefy hardware and
> are nowhere near that fast.
My development machine cost on the order of US$1500, has eight cores, I run
Linux and use ccache.
$ time make -j8 # no changes at all
...
real 0m40.191s
user 0m35.034s
sys 0m9.501s
The numbers are similar if there really is only one file to recompile -
nearly all of the additional 30s here is constructing libgklayout.a and
then libxul.so:
$ touch ../moz-central/layout/style/nsCSSParser.cpp
$ time make -j8
real 1m9.679s
user 0m51.615s
sys 0m17.389s
But let's take a look at a nontrivial rebuild, eh?
$ (cd ../moz-central && hg pull -u)
...
183 files updated, 0 files merged, 1 files removed, 0 files unresolved
...
$ time make -j8
...
real 8m32.606s
user 36m44.846s
sys 3m4.316s
So I don't think <1 minute turnaround for typical pushes is gonna
happen.
zw
There are a number of causes -- probably the most frequent is network
congestion (which can make hg clone or wget commands time out on build
boxes).
There's also occasional redness from filesystem corruption or other
random machine-specific blips.
One red that I saw just yesterday is as-yet-undiagnosed:
https://bugzilla.mozilla.org/show_bug.cgi?id=579790
These reds are all relatively infrequent, but the point is that they
*do* happen -- and our developer-caused redness is (mercifully)
infrequent-enough that these sporadic reds would probably represent a
significant proportion of any auto-backouts. And that'd be bad.
~Daniel
Yes, I should have been more polite. I felt offended by that "> A bunch
of stuff" etc.
It was just a proposal, an idea, I felt it would dramatically improve
the situation. I'm not a build person and can't help. Sorry if that idea
was old news.
Ben
echo "TinderboxPrint: CANCELLED-BUILD" and you can do this with
tinderbox & TBPL already.
Cool. Done.
Cheers,
Chris
> Also, at least in the Thunderbird tests, there's a lot of "wait 10
> seconds", because there were concurrency issues that nobody bothered to
> find out and just put that hack in. That costs a lot of time as well,
> aggregated.
We removed all the waits in Firebug's FBTest suite; they're not reliable
and they slow down testing. We watch mutation events until the UI has what
the test expects or we fail. Our tests are more reliable now, if more
painful to write.
jjb
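The wait-for-the-condition pattern John describes can be sketched like this in Python; the function and variable names are illustrative, not any real harness API:

```python
# Poll for the state the test expects instead of sleeping a fixed interval:
# fast when the condition is already true, and fails promptly at the timeout.
import time

def wait_for(condition, timeout=10.0, interval=0.05):
    """Poll condition() until it returns true; fail fast at the timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if condition():
            return True
        time.sleep(interval)
    raise AssertionError("condition not met within %.1fs" % timeout)

# Example: a condition that only becomes true on the third poll.
calls = {"n": 0}
def ui_ready():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_for(ui_ready))  # → True
```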
Yes. That would mean spending a long period of time working on issues
which mostly do not affect our users.
We need to make it easier to fix intermittent oranges. We've taken some
steps here --- crash stacks, hang stacks, increasing use of VM record
and replay --- but more could be done.
Rob
It's not being ignored. We frequently back out patches because they
caused test failures. Even more so, people find and fix lots of bugs
before checking in by noticing test failures on try-server.
Intermittent orange sucks, and we need to get better at fixing it, but
we need to keep the problem in perspective. We run hundreds of thousands
of tests on every push and usually we get a handful of intermittent
failures.
Rob
I think one thing we need to get better at is backing out changes
that cause new high-frequency intermittent oranges.
Given a graph of when an orange occurred, it's often pretty easy to
tell approximately when it started, and what changesets might have
been likely to cause it. Better tools would make looking at this
sort of thing easier.
I've seen very high frequency intermittent oranges stay in the tree
for weeks, and when I looked, I figured out which change caused them
in only a few minutes. We need to get better about doing that
rather than just starring and moving on.
-David
--
L. David Baron http://dbaron.org/
Mozilla Corporation http://www.mozilla.com/
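The tooling dbaron asks for could start as simply as the sketch below: given one test's per-push pass/fail history, find the push where the failure rate jumps the most. The window size and scoring here are arbitrary choices for illustration:

```python
# Sketch: locate where an intermittent orange started. `failures` is a list
# of 0/1 per push (0 = green, 1 = orange), oldest first; we compare the
# orange rate in a window before and after each candidate push.
def likely_regression_range(failures, window=20):
    """Return the index of the push after which the orange rate rises most."""
    best_idx, best_jump = None, 0.0
    for i in range(1, len(failures)):
        before = failures[max(0, i - window):i]
        after = failures[i:i + window]
        jump = sum(after) / float(len(after)) - sum(before) / float(len(before))
        if jump > best_jump:
            best_idx, best_jump = i, jump
    return best_idx

# 30 green pushes, then an intermittent orange appears.
history = [0] * 30 + [0, 1, 0, 1, 1, 0, 1] * 3
print(likely_regression_range(history))  # points near push 30
```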
Is the list around line 1079 equivalent? That is, grep for
"BRANCHES['tryserver']" and "TRY_SLAVES".
(I am not the build team, don't trust anything I say)
--
Mook
> On Aug 10, 5:07 pm, Mike Beltzner <beltz...@mozilla.com> wrote:
>> On 2010-08-10, at 4:03 PM, Axel Hecht wrote:
>>
>> > And I'm surprised to see real orange among all that random orange.
>>
>> They're not random. They are intermittent, and we need to understand
>> what is causing them to be so, and fix that.
>
> I agree that this is a serious problem. And the amount of intermittent
> orange is very substantial - I see a few known intermittent oranges on
> try pretty much each time I push, no matter what I push. Time is wasted
> figuring out what is a real orange and what isn't, both in nervously
> looking at your own pushes, and also at others' (when deciding if to
> push after them), and far worse than that, it prevents the possibility
> of automated backing out of bad pushes.
Can someone explain to me why we can't temporarily remove whatever test
generates an intermittent orange after a bug has been opened on it?
I'm sure there are very good reasons, just I don't know them ;)
To me it seems like if a test is known to be an intermittent orange then
people won't tend to care about it anyway when submitting, and maybe
worse, they might not care about it after it has actually been fixed.
^__^
MikeK
Actually I interpreted "cause orange" as excluding known intermittent
orange.
>On 08/11/2010 03:22 PM, Ben Bucksch wrote:
>
>
>>>We have a decent amount of intermittent red (as dholbert notes).
>>>
>>>
>>Ah, I didn't know that (might have seen a few of those, but considered them a puzzling hiccup). Since when are compiles non-deterministic? I'd think there's something seriously wrong, then.
>>
>>
>There are a number of causes -- probably the most frequent is network congestion (which can make hg clone or wget commands time out on build boxes).
>
I can't wait for purple to be implemented.
That's certainly one perspective which is quite true. However, if you
take a look at kaie's stats [1], the tree has only been green for about
1% of the time since the stats began.
I'm quite sure that this adds up to a significant amount of developer
time spent figuring out which oranges are known intermittent bugs (and
which aren't), starring them, etc.
Plus you may also start to get the depression factor of: I want to push,
but oh no, the tree is orange again, do I really want to star all these
oranges? (This has been even worse in the past, when people would check
in on top of unstarred oranges while you were starring them.)
Standard8
[1] http://www.kuix.de/mozilla/tinderboxstat/index.php?tree=Firefox
> Can someone explain to me why we can't temporarily remove whatever test
> generates an intermittent orange once a bug has been opened on it?
This has been done quite a few times for tests that are falling over
very frequently.
It's a little less clear what to do when the test only fails
infrequently, but is doing worthwhile testing the rest of the time
(perhaps in an undertested area!)... Intermittent orange sucks, but so
do real regressions.
Justin
> Plus you may also start to get the depression factor of: I want to push,
> but oh no, the tree is orange again, do I really want to star all these
> oranges? (This has been even worse in the past, when people would check
> in on top of unstarred oranges while you were starring them.)
The sheriff should be starring builds. If sheriffs aren't doing their
job, this should be raised separately.
Cheers,
Shawn
When it's the test's fault, sure.
But it's also not uncommon for some low-level breakage to cause
intermittent orange across a decent number of tests. This happened
last week for a few of the style system mochitests as a result of
something in the tracemonkey merge (bug 583262, I think), and I know
this because people started commenting in the random orange bugs
(e.g., bug 527614) for those tests quite frequently.
When that sort of thing happens, we shouldn't start disabling tests.
We should back out the change that caused the intermittent orange.
This. Even tests that time out or otherwise fail intermittently can
(and do) catch actual regressions by failing differently.
- Kyle
Good point. But why not do this: whenever a push has an orange, the
system re-runs that same failing test 10 more times. If all 10 pass,
the test is marked green, with a comment noting an intermittent
failure, and the push sticks. If one or more fail, it's treated as an
actual failure and the push is automatically backed out.
(Obviously there would still be some false positives, but 10, or some
other number, can be chosen to reduce those to almost zero.)
- azakai
The main obstacle to doing this would be that sometimes a test fails
and leaves the browser in a state that causes later failures (tabs
lying around, focus lost, etc). We'd probably have to change the test
harness to restart the browser on the test in question, but that
doesn't sound too difficult (I know, famous last words ...).
- Kyle
If re-running the test itself 10 times doesn't settle it (we still see
failures), then I guess we could run the entire set of tests it belongs
to 10 times after that. This would take a while, but only for
intermittent tests that *also* mess up the browser for later tests (in
that case you need to re-run all those tests anyhow, since they were
also marked as failures). So probably a rare occurrence.
(This would add some overhead to *real* test failures. But not all 10
additional runs would be done: since they all fail, you can stop after
the first few. So this seems worth the benefits of completely
automating pushing and backing out, and of removing random oranges
from the tree.)
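Here's a rough sketch of that "stop after the first few" idea; all the names and thresholds are illustrative assumptions, not a real harness interface:

```python
# Illustrative sketch only (hypothetical names and thresholds): if
# re-running a single test is unreliable because earlier tests polluted
# the browser state, re-run the whole suite, bailing out early once
# repeated failures confirm the regression is real.

def rerun_suite(run_suite, retries=10, stop_after=3):
    failures = 0
    for _ in range(retries):
        if not run_suite():
            failures += 1
            if failures >= stop_after:
                # Consistent failures: stop early, no need to burn the
                # remaining runs on what is clearly a real regression.
                return "real failure"
    # No failures in any re-run: the original orange was intermittent.
    return "intermittent" if failures == 0 else "inconclusive"
```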
- azakai
* Our intern Andrew has been looking into ways we can solve the
"focus" problem through changes to the frameworks and has made some
good progress on this.
* We are looking into what we need to do to improve "Topfails" which
should enhance our ability to detect, track and deal with test
failures.
* We are trying to create one or more metrics to measure where we
currently are and to tell what kind of progress we are making against
the "Oranges" (the "Orange Factor", if you will).
* We are working on developing a mechanism with which we can run/re-
run/skip individual or groups of tests.
* We are looking into how we can run tests in parallel and/or with a
remote webserver (in mobile testing, we have seen a significant
speed-up, not to mention the reduction in required resources).
Any and all feedback is welcome.
Additional references for the interested.
Jesse's etherpad: http://etherpad.mozilla.com:9000/WarOnOrange
Clint's wiki page: https://wiki.mozilla.org/User:Ctalbert/WarOnOrange
Joel's blog post: http://elvis314.wordpress.com/2010/07/05/improving-personal-hygiene-by-adjusting-mochitests/
Is there a bug, user repo, or person that those of us interested in
this should look at or talk to?
- Kyle
I don't see what's doing the hg merging here?
PS: tinderbox doesn't operate anything; it's kind of sad that we hardly
visualize at all what is actually happening in the releng infra. Most of
that gets thrown away, then dumped onto tinderbox, and then tbpl
comes in and makes up a completely new story on top of that incomplete
data. It may not be too far from what happens, but it's not the
complete picture either.
Axel
Actually, just re-running an entire test job (as in, Debug Linux
Mochitest #5, etc.) would be great. With that we could entirely
automate pushing and backing out (as discussed earlier).
Being able to run individual tests or small groups of them would be a
significant optimization, but the move from manually looking for
oranges and manually backing out to an automatic system that handles
oranges and backouts would be much more important.
- azakai
Clint or Joel would be the right folks to chat with.
Bob
The JS team has their own tracemonkey repo, and Sayre merges it to m-c
every few days. This seems to work well. From my point of view, I
find landing a patch on TM much less stressful than I would landing it
on mozilla-central, because if I stuff it up I'll be inconveniencing a
much smaller group of people.
(A similar thing happens with Nanojit -- it has its own repo because
it's shared with Adobe, and I'm responsible for merging changes to TM
every so often. It too works well.)
Would having more staging repos like this cause problems?
Nick
I have no idea, but Lukas might!
Ehsan
Hmm... Do we have any indication of typical frequency ranges that we see
our intermittent failures?
Now, statistics was never my strong point, but assuming I understand
the above correctly, wouldn't it follow that a push which doesn't fail
100% of the time has a significant chance of passing its first test
run, and hence of making following pushes more likely to be backed out
in error?
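To put rough numbers on that worry (the failure rates here are illustrative assumptions, not measured data):

```python
# Back-of-the-envelope numbers for the concern above. The failure
# rates are illustrative assumptions, not measured data.

def p_all_retries_pass(fail_rate, retries=10):
    """Chance that a test failing with probability fail_rate per run
    passes all `retries` re-runs through sheer luck."""
    return (1 - fail_rate) ** retries

# A bad push that fails only 50% of the time passes its first run half
# the time, so the blame can land on whoever pushes next.
p_first_run_passes = 1 - 0.5  # 0.5

# But once it does fail, ten re-runs almost never all pass by luck:
p_lucky = p_all_retries_pass(0.5)  # 0.5 ** 10, about 0.001
```

So, if these assumptions hold, the 10x policy mostly risks misattributing blame when the bad push happens to pass its *first* run, which is exactly why re-running the previous version as well might be worth considering.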
I don't know if we should do the 10x run on the previous version to
detect if it was indeed the latest push that introduced the error.
I'm all in for automation of anything that can be automated, but I think
we should consider that if we automate the push process too much, there
is a risk that some people will think "Oh, I'll just try to push my
patch, it will be backed out automatically anyway if there is a problem
with it" (like what has happened on the try server and some people seem
to think is a misuse of shared resources).
^__^
MikeK
btw: Thank you to all of you who helped me understand the reasons for
not automatically disabling tests that fail intermittently after
opening bugs on them; still digesting the explanations.
I think it is a good idea, but as individual patches now land in
bigger chunks, I would guess that the risk of merge problems increases?
Not sure if the total merging effort increases or decreases, though...
^__^
MikeK
Not if the code that's being worked on is something that's mostly
self-contained, or not otherwise worked on in mozilla-central. If not,
then merging from mozilla-central into the staging repo regularly should
(usually) make any merge issues minor. Then once you merge back into
mozilla-central (which would happen less often), there shouldn't be any
merge issues. Of course, that strategy does require someone to actively
maintain the staging repo.
- Blair
I have working code in staging right now for bug 73184, so using your
commit message to specify builds should become available sometime
during the next week. Please see:
https://wiki.mozilla.org/User:Lukasblakk/TryServerSyntax#Usage_Examples
for details on how that will work. The bonus of the commit message is
that you will be able to not only ask for certain desktop/mobile
platform options but you will also be able to run a particular test or
talos suite.
Cheers,
Lukas
The assumption is that a test with intermittent failures will fail,
for no reason, at some frequency - but that if it *passes*, it must be
valid. So we can rule out those intermittent failures using
statistics.
If that is not the case - if bad patches can succeed through luck -
then we have far worse problems than this: it would mean we are
*currently* landing bad patches and have very little way to prevent
that. In other words, not only would an automatic system fail here,
but so does the current one. But I don't think that is the case.
(Please correct me if I'm wrong, though.)
>
> I don't know if we should do the 10x run on the previous version to
> detect if it was indeed the latest push that introduced the error.
>
> I'm all in for automation of anything that can be automated, but I think
> we should consider that if we automate the push process too much, there
> is a risk that some people will think "Oh, I'll just try to push my
> patch, it will be backed out automatically anyway if there is a problem
> with it" (like what has happened on the try server and some people seem
> to think is a misuse of shared resources).
This is a big risk, I fully agree. It can be dealt with in various
ways. But I don't think the solution is "keep things slow and
unautomated, because if we make things fast then people will abuse the
speed" ;)
- azakai
> I have working code in staging right now for bug 73184,
Sorry - that should have been https://bugzilla.mozilla.org/show_bug.cgi?id=473184
Cheers,
Lukas
The syntax looks pretty good. The only comment I have is that instead of
'b' for both, maybe allow 'od'/'do'? That's probably slightly easier to
remember and more extensible if we ever add other build types.
Rob
As a general comment, it seems to me that we do things backwards: The
reverse of our "check in, then test" paradigm seems much more
sensible.
I realize this isn't a place we can get to tomorrow, but I think
there's a bigger picture here than just the pain points of our current
system (intermittent orange, having to watch the tree for hours,
difficulty of deciding what to back out and when).
> I realize this isn't a place we can get to tomorrow, but I think
> there's a bigger picture here than just the pain points of our current
> system (intermittent orange, having to watch the tree for hours,
> difficulty of deciding what to back out and when).
We can get pretty close today with the Tryserver, though. Two big projects (Tab Candy, Firefox Sync) recently landed on mozilla-central with very little surprise and pain; both were tested repeatedly and thoroughly on tryserver for test and performance regressions beforehand, and many changes were made based on those test cycles.
cheers,
mike
We have some tests that will almost always fail on bad patches but might
succeed through a freak coincidence every so often. Not many, but a few.
-Boris