This morning I came to land and discovered that crashtests were failing
on OSX 10.5 runs. There were a couple of other failures too, but they
have all been resolved; all that remains are the crashtests.
On both Opt and Debug the crashtest run hangs during shutdown; on Debug
we also see a timeout in 472237-1.html. Since the two failures were
slightly different and showed up on different changesets, I ended up
backing out both mstange's and jkew's changesets. After the backouts,
though, both failures were still present.
To try to rule out machine issues, bhearsum triggered another three
rounds of crashtests on jkew's changeset, on mstange's, and on another
changeset from a little earlier that had been green. All of them failed,
which suggests the problem isn't caused by any particular change to the
code and may indeed be machine related; however, we do not believe that
there has been any change to the slave config. There was a buildbot
master config change at 2am, but that is early enough that I think we
can rule it out as a suspect for now.
It seems that any OSX 10.5 crashtest run started after around 5am this
morning fails (on one changeset, a crashtest run started before then
passed and one started after failed). This is true on mozilla-central
as well as TraceMonkey and TryServer (in the latter case even on builds
based on m-c from yesterday).
Bhearsum tried setting the system clock of a slave back to yesterday
and running the tests, and still saw the hangs, which suggests the
problem isn't necessarily time related.
Ehsan has been trying to run tests on a 10.5 machine but cannot
reproduce the hangs and is now trying to get access to a slave to do the
same tests there (bug 616042).
At this point the only course of action we can think of is to try to
reproduce the hangs on a slave and get a stack trace to see why it is
hanging.
All this is probably not going to be finished tonight, so I'm afraid
that mozilla-central will remain closed at least until tomorrow. I'll
do my best to resolve this situation as soon as possible, though, and
I'd appreciate any ideas that others might have as to the cause of
these failures.
Cheers,
Ehsan
That master change was actually at 2*pm* today, so I think it's even
more unlikely to be a suspect here.
OK, short version: I tracked down the problem, pushed a fix, and the
tree can be reopened once we get green runs on that.

Longer version: I started to bisect the crashtest suite (among other
things, such as hours of locally running a 32-bit build under gdb!) to
see whether this is caused by what a single test does. This process
brought me to
<http://mxr.mozilla.org/mozilla-central/source/layout/generic/crashtests/508908-1.html?force=1#23>.
This test tried to load a Flash file from zombo.com. For some reason,
that domain was experiencing DNS resolution problems on our internal
network, which meant that Firefox could not load it (well, to be
precise, Flash couldn't load it), and that somehow caused Firefox to
get stuck at shutdown.
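For illustration, the bisection here amounts to a binary search over the
test list; a minimal sketch (assuming a single, deterministic culprit and
a run_suite() helper that reports whether a subset runs green -- both of
which are assumptions, not the actual harness; in practice run_suite
would also need a timeout, since the failure was a hang):

    def bisect_tests(tests, run_suite):
        """Binary-search a test list for the one test whose presence makes
        the suite fail; run_suite(subset) returns True when the run is green."""
        lo, hi = 0, len(tests)
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if run_suite(tests[lo:mid]):
                lo = mid  # first half is green, so the culprit is in the rest
            else:
                hi = mid  # first half still fails, so narrow the search to it
        return tests[lo]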
I filed bug 616085 to disable the test so that we can reopen
mozilla-central, and filed bug 616088 to fix the underlying issue.
I also pushed the fix for bug 616085, disabling the test. I ran a local
crashtest suite on the slave with that test disabled, which finished
successfully, so the tree can be reopened once we get green runs on my
push <http://hg.mozilla.org/mozilla-central/rev/baa5ae44f0ba>.
Thanks to all of the people who helped me in this process, *especially*
the wonderful releng folks.
Cheers, and good night all!
Ehsan
So after we realized what the culprit was, there seemed to be consensus
that having external dependencies in our tests is a Bad Thing. We even
have https://bugzilla.mozilla.org/show_bug.cgi?id=472007 on file for
forcing tests to not rely on external dependencies.

I'd like to start planning on locking down the firewall the test
machines are behind so that they _can't_ access external resources.
This will most likely cause some tests to fail, so we need to find a way
to phase this in. Is there any reason we shouldn't lock down the test
machines' external access?
On 02/12/10 02:27 AM, Ehsan Akhgari wrote:
> OK, short version: I tracked down the problem, pushed a fix, and the
> tree can be reopened once we get green runs on that.
> [...]
> On 10-12-01 8:28 PM, Dave Townsend wrote:
That sounds like a great idea.
-Boris
We don't want to lose another day because some other site has DNS issues.
Cheers,
Shawn
That implies that the tests will fail cleanly, and the reason we lost a
whole day to zombo.com is precisely because that test *didn't* fail
cleanly. And the reason Mossop dropped his patch for bug 472007 was "a
bunch of tests were failing in ways I couldn't understand," which sounds
rather like the same thing. I would guess that someone's in for hours,
possibly tens or hundreds of hours, of bisecting test suites to find
which tests are actually causing failures which appear in other tests or
appear nonspecific.
How about planning to do this on a specific day over the holidays? I'm
going to be around for most of the rest of December and if I have a few
developer buddies that can commit to helping debug/disable/otherwise
deal with tests I'd be happy to coordinate this.
The tough thing is that we'd be doing this for _all_ trees, including try.
Another approach might be patching our network library to NS_ABORT on
external network accesses, and pushing the patch to the try server to
start catching offending tests.
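A rough model of that idea (this is only a sketch, not Gecko's actual
network code; the allowed-host list and the function name are made up):

    # Before any outgoing connection, check the target host against the
    # hosts the test harness is expected to use, and abort hard so the
    # offending test shows up in the try run.  The host list is illustrative.
    ALLOWED_TEST_HOSTS = {
        "localhost",
        "127.0.0.1",
        "mochi.test",   # assumed to be served by the local test web server
        "example.org",  # assumed to be redirected to it via the PAC file
    }

    def check_outgoing_connection(host):
        """Abort the process if a test tries to reach an external host."""
        if host not in ALLOWED_TEST_HOSTS:
            raise SystemExit("ABORT: test attempted external access to %r" % host)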
But the important question is, who will own this side of things?
Ehsan
On 10-12-02 9:00 AM, Chris AtLee wrote:
> [...]
> Another approach might be patching our network library to NS_ABORT on
> external network accesses, and pushing the patch to the try server to
> start catching offending tests.
>
> But the important question is, who will own this side of things?
Clint and the WOO team, or Jason Duell, seem like obvious choices.
cheers,
mike
Another approach might be to have (and enable for tests) a pref that
flips our DNS code into a mode where it returns 127.0.0.1 for
everything rather than hitting real DNS.
This would probably be a good thing anyway, since our mochitests
currently get really slow when DNS is flaky: they wait for real DNS
lookups before proceeding to use the PAC redirection.
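A minimal sketch of the behaviour being described (a model only, not the
real DNS service; the pref name is made up):

    import socket

    # Hypothetical pref name -- the message doesn't specify one.
    PREF_LOOPBACK_ONLY = "network.dns.loopback-only"

    def resolve(hostname, prefs):
        """Return 127.0.0.1 for every lookup when the test-only pref is set,
        otherwise fall through to a real DNS query."""
        if prefs.get(PREF_LOOPBACK_ONLY, False):
            return "127.0.0.1"
        return socket.gethostbyname(hostname)

    # e.g. resolve("www.zombo.com", {PREF_LOOPBACK_ONLY: True}) -> "127.0.0.1"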
-David
--
L. David Baron http://dbaron.org/
Mozilla Corporation http://www.mozilla.com/
Good idea, I filed bug 616182 for that.
Ehsan
CCing Clint in case he doesn't read this group.
Ehsan
I'd rather go forward with dbaron's pref proposal before patching
network code when enable-tests is defined. This is one of
those areas where it's easy to engineer a bunch of solutions, but you
have to be careful that you are staying true to testing the real
product, and I'd worry that patching the network library is pushing that
boundary.
Also, having this be a pref will make it easy to debug intermittent
tests to see if this change is affecting them as we make the switch
(i.e. you're flipping a pref rather than going through this workflow:
patch code, build, run tests, unpatch code, build, run tests).
My two cents.
Clint
I still think it's worth having the network itself restricted. Any
takers for Ben's suggestion of a December "debug that failing test" day?
Wouldn't this potentially be a much higher-impact change than just returning 127.0.0.1? It could affect services running on the machines, which could then affect the test runs. Also, the effects might not be seen right when we cut off access, as jobs could be scheduled to start at certain times that don't happen on debug test day.
Also, the pref could be used by people running the tests on non-Mozilla infrastructure (Songbird, InstantBird, locally, etc) where they may not have an active firewall block.
Christian
FWIW, I wasn't suggesting to NS_ABORT if enable-tests is defined! I was
suggesting to do that if a magic pref which we turn on in our test
suites is set. :-)
Ehsan
I think this is actually an advantage of doing this rather than a
deficiency. This will catch cases where the pref isn't getting set, or
isn't working, or the code being used doesn't respect the pref. It's an
extra level of verification.
> I'd like to start planning on locking down the firewall the test
> machines are behind so that they _can't_ access external resources. This
> will most likely cause some tests to fail, so we need to find a way to
> phase this in. Is there any reason we shouldn't lock down the test
> machines' external access?
+1 to locking them down. Clearly they're ticking orangebombs.
Can you enable logging on the firewall? Might help to see what they're
currently doing on the wire and work backwards from that. Alternatively,
if we could segment off just a handful of machines, that could also
help... Ideally we should try and fix things incrementally, and avoid
having massive orange-day where we scramble to fix everything at once.
Justin
This would probably be do-able, but I'm not sure how it would help. I
think it would actually be worse if we had a set of machines configured
differently for an extended period of time -- oranges could very easily
end up being called rando-orange instead of debugged.
> Ideally we should try and fix things incrementally, and avoid
> having massive orange-day where we scramble to fix everything at once.
I agree, for the most part. I think that having some initial period --
even if it's just 3 or 4 hours -- where we try to catch as much as
possible, would be helpful. We could easily get through at least 1000 or
2000 test runs in that time, which seems like it would be a good start.
Does that affect your opinion on it at all?
>> if we could segment off just a handful of machines, that could also
>> help...
>
> This would probably be do-able, but I'm not sure how it would help. I
> think it would actually be worse if we had a set of machines configured
> differently for an extended period of time -- oranges could very easily
> end up being called rando-orange instead of debugged.
Sorry, guess I only explained half my thought. Segment off a few
firewalled machines, not have them participate in the normal
mozilla-central pool, and use them to figure out what breaks. Once
they're running green, firewall the whole pool.
>> Ideally we should try and fix things incrementally, and avoid
>> having massive orange-day where we scramble to fix everything at once.
>
> I agree, for the most part. I think that having some initial period --
> even if it's just 3 or 4 hours -- where we try to catch as much as
> possible, would be helpful.
That's still basically a tree closure. I'm not fundamentally opposed to
it, but if we can avoid doing it that way without too much pain that
would be best.
Justin
Someone could just run all the tests locally while offline and
investigate all the failures.

Rob
--
"Now the Bereans were of more noble character than the Thessalonians, for
they received the message with great eagerness and examined the Scriptures
every day to see if what Paul said was true." [Acts 17:11]
While I agree that testing this is necessary before turning networking
off for the live test machines, this is far from the truth. We have a
known set of tests to look into, and we have a known set of criteria for
which tests need to be inspected (those which include "http://" but not
any of the known URLs we use in our tests, those which access things
like XHR, WebSockets, etc.).
This does *not* need to be a trial and error process.
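As an illustration, such an audit could be scripted along these lines
(a sketch only; the "known" host patterns below are assumptions, not the
exact criteria described above):

    import os
    import re

    # Hosts the harness itself serves or redirects; anything else is suspect.
    KNOWN_LOCAL = re.compile(
        r"https?://(localhost|127\.0\.0\.1|mochi\.test|example\.(com|org))([:/]|$)")
    EXTERNAL_URL = re.compile(r"https?://[^\s\"'<>)]+")

    def suspect_tests(root):
        """Yield (path, url) for test files that reference non-local URLs."""
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                if not name.endswith((".html", ".xhtml", ".xul", ".js")):
                    continue
                path = os.path.join(dirpath, name)
                with open(path, errors="ignore") as f:
                    for url in EXTERNAL_URL.findall(f.read()):
                        if not KNOWN_LOCAL.match(url):
                            yield path, url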
On 10-12-07 9:58 PM, Robert O'Callahan wrote:
> Someone could just run all the tests locally while offline and investigate
> all the failures.
That wouldn't be a very good set of results, because the code paths that
we take in offline and online mode are going to be different (as I've
painfully discovered during the past week!).
Ehsan
> On 10-12-07 9:58 PM, Robert O'Callahan wrote:
> > Someone could just run all the tests locally while offline and
> investigate
> > all the failures.
>
>
> That wouldn't be a very good set of results, because the code paths that
> we take in offline and online mode are going to be different (as I've
> painfully discovered during the past week!).
>
OK, but it's easy to trick Gecko into thinking you're online when you don't
have a connection to the Internet.