http://tinderbox.mozilla.org/showbuilds.cgi?tree=Mozilla2
Currently warns:
The tree is OPEN (but beware of bug 442169).
This is an easy way for us to start ignoring test results altogether.
I believe there are a number of rather straightforward things we
can do right now to a) make our unit tests more deterministic and b) make
the analysis of test failures much easier.  I've documented those
things here:
https://bugzilla.mozilla.org/showdependencytree.cgi?id=443323&hide_resolved=1
Some of these changes, like
https://bugzilla.mozilla.org/show_bug.cgi?id=443090
are straightforward but will require coordinated changes to the unit 
test infrastructure and the tinderbox setup to avoid breaking everything. 
They will also require help from multiple people.  I'd like us to take 
the time right now to try and make progress on this stuff - even if we 
have to close the tree to land the fixes and divert folks temporarily 
from other tasks.
Please add your thoughts on specific solutions to the bugs (and/or file 
new ones) or add them here.
Best,
Schrep
1) Two additional unittest machines are now visible again on tinderbox 
for mozilla-central. qm-win2k3-moz2-01 and qm-centos5-03 now give us 
duplication on win32 and linux, which should help with debugging.
2) I've updated tinderbox to now point instead to bug#438871, where 
we've been tracking a bunch of intermittent unittest problems. 
Bug#442169 was just one of those problems.
tc
John.
My pet bug: https://bugzilla.mozilla.org/show_bug.cgi?id=438954. If
that was set up, then whenever an orange happened on the test box we'd
have a fighting chance of debugging it.
Rob
1) Ted has been making great progress on getting our logging 
rationalized (https://bugzilla.mozilla.org/show_bug.cgi?id=443090)
2) Coop is making great progress on getting better output for oranges in make 
check that currently report no obvious failures 
(https://bugzilla.mozilla.org/show_bug.cgi?id=438324)
3) Lukas is getting the basic log consolidation infrastructure up - once 
443090 completes this should move fast
4) Sayre has filed a number of real code bugs as a result of valgrind 
analysis (see dependency tree of 
https://bugzilla.mozilla.org/show_bug.cgi?id=438871)
5) Gavin has been doing some great debugging work on the mochitest 
failures (https://bugzilla.mozilla.org/show_bug.cgi?id=431745)
6) We are bringing up a few physical boxes (minis) this week to augment 
the VMs for side-by-side comparisons.
From #4 it is clear that at least some of the failures are related to 
real code bugs that just get tickled by the different i/o, cpu, and 
timing behavior of the VMs.  As further proof of this, the physical 
box (qm-moz2mini01) running OSX fails about once a day on mochitest 
with similar strange behavior 
(http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1215429798.1215436152.29149.gz&fulltext=1). 
So there are *definitely* code or test issues causing problems.
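As an illustration of why VM timing variance tickles these bugs (a minimal Python sketch with hypothetical function names, not Mozilla code): a fixed-delay wait that was tuned on fast hardware races the work it is waiting for, while polling the condition itself under a generous deadline tolerates i/o, cpu, and scheduling jitter.

```python
import threading
import time

def start_async_work(result, delay):
    """Simulate work whose completion time varies with machine load."""
    def worker():
        time.sleep(delay)
        result["done"] = True
    threading.Thread(target=worker).start()

def fragile_wait(result):
    # Fixed sleep tuned on fast hardware: fails whenever a loaded VM
    # stretches the work past 50 ms.
    time.sleep(0.05)
    return result.get("done", False)

def robust_wait(result, deadline=2.0):
    # Poll until the condition holds or a generous deadline passes:
    # tolerant of i/o, cpu, and scheduling variation.
    end = time.monotonic() + deadline
    while time.monotonic() < end:
        if result.get("done"):
            return True
        time.sleep(0.01)
    return False

if __name__ == "__main__":
    slow = {}
    start_async_work(slow, delay=0.2)  # a "slow VM" run
    print(fragile_wait(slow))          # the intermittent orange
    slow2 = {}
    start_async_work(slow2, delay=0.2)
    print(robust_wait(slow2))          # waits on the event itself
```

The fragile version goes orange exactly when the machine is slow enough, which matches failures showing up on loaded VMs while only rarely hitting physical boxes.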
So if you get a valgrind bug or other related issue filed I'd appreciate 
it if you could move that to the top of your priority queue so we can 
get these machines cycling more reliably.
Super-big thanks to everyone who has dove into this - esp Ted, Sayre, 
Gavin, Coop, and Lukas.  You folks rock!
Cheers,
Schrep
The bad news is that we have what appears to be a set of permanent  
orange boxes (Linux and Windows PGO dep unit test boxes) on the  
Firefox 3 tree (cvsroot) which are preventing work on 1.9.0.2. Can we  
get some of the same attention?
cheers,
mike
When did it start? Did we release Firefox 3 with them orange?
- Rob
I see two failures recently:
qm-pmac03 (talos) is consistently orange.  The other two members of the 
mac triad are not.
qm-centos-01 is showing intermittent orange that was much worse before 
today.  Some of them look to be due to make check's lack of error reporting:
https://bugzilla.mozilla.org/show_bug.cgi?id=438324
If we could get r+ on this that would help!
Test failures look like:
REFTEST UNEXPECTED FAIL (LOADING): 
file:///builds/slave/trunk_centos5/mozilla/layout/reftests/bugs/413292-1.html
-- and --
*** 12913 ERROR FAIL | Test timed out. |  | 
/tests/docshell/test/navigation/test_opener.html
*** 12916 ERROR FAIL | Unable to restore focus, expect failures and 
timeouts. |  | 
/tests/docshell/test/navigation/test_popup-navigates-children.html
*** 12923 ERROR FAIL | Unable to restore focus, expect failures and 
timeouts. |  | /tests/docshell/test/navigation/test_reserved.html
*** 12934 ERROR FAIL | Unable to restore focus, expect failures and 
timeouts. |  | 
/tests/docshell/test/navigation/test_sibling-matching-parent.html
*** 12941 ERROR FAIL | Unable to restore focus, expect failures and 
timeouts. |  | /tests/docshell/test/navigation/test_sibling-off-domain.html
*** 12948 ERROR FAIL | Unable to restore focus, expect failures and 
timeouts. |  | /tests/docshell/test/test_bug344861.html
*** 12953 ERROR FAIL | Unable to restore focus, expect failures and 
timeouts. |  | /tests/docshell/test/test_bug369814.html
Thanks to the special joy of tinderbox renaming, "when did it start?"
is pretty much unanswerable.
However, the "perma" orange beltzner was talking about was two things:
the Linux part was the bug 431745 test_sleep_wake.js thing, and the
Win/PGO one was the way that build-on-checkin on a controlled and
essentially closed branch, combined with a very slow machine, makes
any orange look permanent. Assuming I'm getting the timing of beltzner's
message right, the box hit bug 427142 just before midnight the night
before, took a couple of hours to fail, then took a five hour rest,
then took 3.5 hours to fail out of a download manager test where it
threw trying to get various directories including CurProcD, followed
by a test_sleep_wake failure that was always nice for a slow timeout,
so that from start of orange until it turned green with just two
failed builds was just shy of 14 hours.
Unfortunately, other than the fix for test_sleep_wake that's since
landed in 1.9.0.x, I'm not sure what's fixable there.
Having a way to look back at the history of just one box, rather than
the PITA scroll-over, scroll-down, scroll-back, load-a-huge-table,
over-down-back routine, would be nice: beltzner could have seen it was
just two builds for the PGO box, and whatever it was before that (back
to the July 4th dawn of time; maybe mostly green, I haven't looked,
because it's a PITA). But actual changes to the way the waterfall
works are... rare. I've certainly never seen one.
Continuous builds are nice when you've got nothing to check in and you
want to see if the tree is going to green up (or just to increase your
chances of random green), but they're a pain when you have to check in
to an old stale branch full of mangy slow machines, miss catching the
start of a build by just a few minutes, and end up on the hook for
nearly two full cycles.
There's probably *something* there in that throwing getting CurProcD
thing, but, how on earth are you going to find it? Something happened
on an unwatched box that nobody has access to, and now it's gone. What
can we change, that will make it possible to debug that?
What can we change about the way we rename trees, to make it possible
to see whether or not it has happened before, when a box on the
Firefox4.0 tree does it a week after the current Firefox tree has been
renamed?
Actually, while I'm wishing for a pony, I'd like a herd, too: being
able to see both "A week of this box" and "A week of this class" (in
this case, all the Windows unit test boxes) would be lovely.
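For what the wished-for view might look like (a sketch only, against a hypothetical record store; tinderbox itself offers no such query and these names are made up), "a week of this box" and "a week of this class" are a few lines once the results are out of the waterfall:

```python
from datetime import datetime, timedelta

def box_history(records, box, days=7, now=None):
    """Return the last `days` of results for one box, newest first.

    `records` is an iterable of (timestamp, box_name, status) tuples,
    e.g. (datetime(...), "qm-win2k3-moz2-01", "orange").  This stands
    in for whatever store the waterfall would export.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(days=days)
    hits = [r for r in records if r[1] == box and r[0] >= cutoff]
    return sorted(hits, key=lambda r: r[0], reverse=True)

def class_history(records, boxes, days=7, now=None):
    """'A week of this class': merged histories of several boxes,
    e.g. all the Windows unit test boxes."""
    out = []
    for b in boxes:
        out.extend(box_history(records, b, days, now))
    return sorted(out, key=lambda r: r[0], reverse=True)
```

The hard part, of course, is not the query but getting the data out of the waterfall in the first place, and keeping it findable across tree renames.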
> There's probably *something* there in that throwing getting CurProcD
> thing, but, how on earth are you going to find it? Something happened
> on an unwatched box that nobody has access to, and now it's gone. What
> can we change, that will make it possible to debug that?
That throwing is perfectly normal.  Those are all caught, but our error 
reporting still likes to show them for whatever reason.
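A minimal sketch of the pattern being described (Python, hypothetical names, not the actual directory-service code): a fallback lookup where each provider miss raises, the caller catches and moves on, yet anything that reports every raised exception still surfaces the harmless misses as "failures".

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("dirsvc")

def get_dir(key, providers):
    """Ask each provider for `key`; the first one that has it wins.

    A provider that doesn't handle a key raises, loosely mirroring how
    a directory-service lookup can throw for keys a given provider
    doesn't supply.  (These names are illustrative, not Mozilla APIs.)
    """
    for provider in providers:
        try:
            return provider[key]          # may raise KeyError
        except KeyError:
            # Fully handled: we just try the next provider.  But a log
            # scraper that flags every reported exception would still
            # count this miss as a failure.
            log.warning("no %s from this provider (caught, harmless)", key)
    raise KeyError(key)

# Usage: the first provider misses CurProcD, the second supplies it.
providers = [{"ProfD": "/profile"}, {"CurProcD": "/builds/app"}]
```

The lookup succeeds, the warning is noise, and that is exactly what the throwing-on-CurProcD lines in the log amount to.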
Cheers,
Shawn