This morning I came to land and discovered that crashtests were failing
on OSX 10.5 runs. There were a couple of other failures too, but they
have all been resolved; all that remains are the crashtests.
On both Opt and Debug the crashtest run hangs during shutdown; on Debug
we also see a timeout in 472237-1.html. Since the two failures were
slightly different and showed up on different changesets, I ended up
backing out both mstange's and jkew's changesets. After the backouts,
though, both failures were still present.
To try to rule out machine issues, bhearsum triggered another three
rounds of crashtests on jkew's changeset, on mstange's, and on another
changeset from a little earlier that had been green. All of them failed,
which suggests the problem isn't caused by any particular change to the
code and may indeed be machine related; however, we do not believe that
there has been any change to the slave config. There was a buildbot
master config change at 2am, but that is early enough that I think we
can rule it out as a suspect for now.
It seems that any OSX 10.5 crashtest run started after around 5am this
morning fails (on one changeset, a crashtest run started before then
passed and one started after failed). This is true on mozilla-central
as well as TraceMonkey and TryServer (in the latter case even on builds
based on m-c from yesterday).
Bhearsum tried setting the system clock of a slave back to yesterday
and running the tests, and still saw the hangs, which suggests the
problem isn't necessarily time related.
Ehsan has been trying to run tests on a 10.5 machine but cannot
reproduce the hangs and is now trying to get access to a slave to do the
same tests there (bug 616042).
At this point the only course of action we can think of is to try to
reproduce the hangs on a slave and get a stack trace to see why it is
hanging.
All this is probably not going to be finished tonight, so I'm afraid
that mozilla-central will remain closed at least until tomorrow. I'll
do my best to resolve this situation as soon as possible, though, and
I'd appreciate any ideas that others might have as to the cause of
these failures.
Cheers,
Ehsan
That master change was actually at 2*pm* today, so I think it's even
more unlikely to be a suspect here.
OK, short version: I tracked down the problem, pushed a fix, and the
tree can be reopened once we get green runs on that.

Longer version: I started to bisect the crashtest suite (among other
things, such as hours of locally running a 32-bit build under gdb!) to
see whether this is caused by what a single test does. This process
brought me to
<http://mxr.mozilla.org/mozilla-central/source/layout/generic/crashtests/508908-1.html?force=1#23>.
This test tried to load a Flash file from zombo.com. For some reason,
that domain was experiencing DNS resolution problems on our internal
network, which meant that Firefox could not load it (well, to be
precise, Flash couldn't load it), and that somehow caused Firefox to
get stuck at shutdown.
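For illustration, the bisection here amounts to a binary search over the
test list; a minimal sketch (assuming a single, deterministic culprit and
a run_suite() helper that reports whether a subset runs green -- both of
which are assumptions, not the actual harness; in practice run_suite
would also need a timeout, since the failure was a hang):

    def bisect_tests(tests, run_suite):
        """Binary-search a test list for the one test whose presence makes
        the suite fail; run_suite(subset) returns True when the run is green."""
        lo, hi = 0, len(tests)
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if run_suite(tests[lo:mid]):
                lo = mid  # first half is green, so the culprit is in the rest
            else:
                hi = mid  # first half still fails, so narrow the search to it
        return tests[lo]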
I filed bug 616085 to disable the test so that we can reopen
mozilla-central, and filed bug 616088 to fix the underlying issue.
I also pushed the fix for bug 616085, disabling the test. I ran a local
crashtest suite on the slave with that test disabled, which finished
successfully, so the tree can be reopened once we get green runs on my
push <http://hg.mozilla.org/mozilla-central/rev/baa5ae44f0ba>.
Thanks to all of the people who helped me in this process, *especially*
the wonderful releng folks.
Cheers, and good night all!
Ehsan
So after we realized what the culprit was, there seemed to be consensus
that having external dependencies in our tests is a Bad Thing. We even
have https://bugzilla.mozilla.org/show_bug.cgi?id=472007 on file for
forcing tests to not rely on external dependencies.

I'd like to start planning on locking down the firewall the test
machines are behind so that they _can't_ access external resources.
This will most likely cause some tests to fail, so we need to find a way
to phase this in. Is there any reason we shouldn't lock down the test
machines' external access?
On 02/12/10 02:27 AM, Ehsan Akhgari wrote:
> OK, short version: I tracked down the problem, pushed a fix, and the
> tree can be reopened once we get green runs on that.
> [...]
> On 10-12-01 8:28 PM, Dave Townsend wrote:
That sounds like a great idea.
-Boris
We don't want to lose another day because some other site has DNS issues.
Cheers,
Shawn
That implies that the tests will fail cleanly, and the reason we lost a
whole day to zombo.com is precisely because that test *didn't* fail
cleanly. And the reason Mossop dropped his patch for bug 472007 was "a
bunch of tests were failing in ways I couldn't understand," which sounds
rather like the same thing. I would guess that someone's in for hours,
possibly tens or hundreds of hours, of bisecting test suites to find
which tests are actually causing failures which appear in other tests or
appear nonspecific.
How about planning to do this on a specific day over the holidays? I'm
going to be around for most of the rest of December and if I have a few
developer buddies that can commit to helping debug/disable/otherwise
deal with tests I'd be happy to coordinate this.
The tough thing is that we'd be doing this for _all_ trees, including try.
Another approach might be patching our network library to NS_ABORT on
external network accesses, and pushing the patch to the try server to
start catching offending tests.
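A rough model of that idea (this is only a sketch, not Gecko's actual
network code; the allowed-host list and the function name are made up):

    # Before any outgoing connection, check the target host against the
    # hosts the test harness is expected to use, and abort hard so the
    # offending test shows up in the try run.  The host list is illustrative.
    ALLOWED_TEST_HOSTS = {
        "localhost",
        "127.0.0.1",
        "mochi.test",   # assumed to be served by the local test web server
        "example.org",  # assumed to be redirected to it via the PAC file
    }

    def check_outgoing_connection(host):
        """Abort the process if a test tries to reach an external host."""
        if host not in ALLOWED_TEST_HOSTS:
            raise SystemExit("ABORT: test attempted external access to %r" % host)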
But the important question is, who will own this side of things?
Ehsan
On 10-12-02 9:00 AM, Chris AtLee wrote:
> [...]
> Another approach might be patching our network library to NS_ABORT on
> external network accesses, and pushing the patch to the try server to
> start catching offending tests.
>
> But the important question is, who will own this side of things?
Clint and the WOO team, or Jason Duell, seem like obvious choices.
cheers,
mike
Another approach might be to have (and enable for tests) a pref that
flips our DNS code into a mode where it returns 127.0.0.1 for
everything rather than hitting real DNS.
This would probably be a good thing anyway, since our mochitests
currently get really slow when DNS is flaky: they wait for real DNS
lookups before proceeding to use the PAC redirection.
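A minimal sketch of the behaviour being described (a model only, not the
real DNS service; the pref name is made up):

    import socket

    # Hypothetical pref name -- the message doesn't specify one.
    PREF_LOOPBACK_ONLY = "network.dns.loopback-only"

    def resolve(hostname, prefs):
        """Return 127.0.0.1 for every lookup when the test-only pref is set,
        otherwise fall through to a real DNS query."""
        if prefs.get(PREF_LOOPBACK_ONLY, False):
            return "127.0.0.1"
        return socket.gethostbyname(hostname)

    # e.g. resolve("www.zombo.com", {PREF_LOOPBACK_ONLY: True}) -> "127.0.0.1"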
-David
--
L. David Baron http://dbaron.org/
Mozilla Corporation http://www.mozilla.com/
Good idea, I filed bug 616182 for that.
Ehsan
CCing Clint in case he doesn't read this group.
Ehsan
I'd rather go forward with dbaron's pref proposal before patching
network code when enable-tests is defined. This is one of
those areas where it's easy to engineer a bunch of solutions, but you
have to be careful that you are staying true to testing the real
product, and I'd worry that patching the network library is pushing that
boundary.
Also, having this be a pref will make it easy to debug intermittent
tests to see if this change is affecting them as we make the switch
(i.e. you're flipping a pref rather than going through this workflow:
patch code, build, run tests, unpatch code, build, run tests).
My two cents.
Clint
I still think it's worth having the network itself restricted. Any
takers for Ben's suggestion of a December "debug that failing test" day?
Wouldn't this potentially be a much higher-impact change than just returning 127.0.0.1? It could affect services running on the machines, which could then affect the test runs. Also, the effects might not be seen right when we cut off access, as jobs could be scheduled to start at certain times that don't happen on debug test day.
Also, the pref could be used by people running the tests on non-Mozilla infrastructure (Songbird, InstantBird, locally, etc) where they may not have an active firewall block.
Christian
FWIW, I wasn't suggesting to NS_ABORT if enable-tests is defined! I was
suggesting to do that if a magic pref which we turn on in our test
suites is set. :-)
Ehsan
I think this is actually an advantage of doing this rather than a
deficiency. This will catch cases where the pref isn't getting set, or
isn't working, or the code being used doesn't respect the pref. It's an
extra level of verification.
> I'd like to start planning on locking down the firewall the test
> machines are behind so that they _can't_ access external resources. This
> will most likely cause some tests to fail, so we need to find a way to
> phase this in. Is there any reason we shouldn't lock down the test
> machines' external access?
+1 to locking them down. Clearly they're ticking orangebombs.
Can you enable logging on the firewall? Might help to see what they're
currently doing on the wire and work backwards from that. Alternatively,
if we could segment off just a handful of machines, that could also
help... Ideally we should try and fix things incrementally, and avoid
having massive orange-day where we scramble to fix everything at once.
Justin
This would probably be do-able, but I'm not sure how it would help. I
think it would actually be worse if we had a set of machines configured
differently for an extended period of time -- oranges could very easily
end up being called rando-orange instead of debugged.
> Ideally we should try and fix things incrementally, and avoid
> having massive orange-day where we scramble to fix everything at once.
I agree, for the most part. I think that having some initial period --
even if it's just 3 or 4 hours -- where we try to catch as much as
possible, would be helpful. We could easily get through at least 1000 or
2000 test runs in that time, which seems like it would be a good start.
Does that affect your opinion on it at all?
>> if we could segment off just a handful of machines, that could also
>> help...
>
> This would probably be do-able, but I'm not sure how it would help. I
> think it would actually be worse if we had a set of machines configured
> differently for an extended period of time -- oranges could very easily
> end up being called rando-orange instead of debugged.
Sorry, guess I only explained half my thought. Segment off a few
firewalled machines, not have them participate in the normal
mozilla-central pool, and use them to figure out what breaks. Once
they're running green, firewall the whole pool.
>> Ideally we should try and fix things incrementally, and avoid
>> having massive orange-day where we scramble to fix everything at once.
>
> I agree, for the most part. I think that having some initial period --
> even if it's just 3 or 4 hours -- where we try to catch as much as
> possible, would be helpful.
That's still basically a tree closure. I'm not fundamentally opposed to
it, but if we can avoid doing it that way without too much pain that
would be best.
Justin
Someone could just run all the tests locally while offline and
investigate all the failures.

Rob
--
"Now the Bereans were of more noble character than the Thessalonians, for
they received the message with great eagerness and examined the Scriptures
every day to see if what Paul said was true." [Acts 17:11]
While I agree that testing this is necessary before turning networking
off for the live test machines, this is far from the truth. We have a
known set of tests to look into, and we have a known set of criteria for
which tests need to be inspected (those which include "http://" but not
any of the known URLs we use in our tests, those which access things
like XHR, WebSockets, etc.).
This does *not* need to be a trial and error process.
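As an illustration, such an audit could be scripted along these lines
(a sketch only; the "known" host patterns below are assumptions, not the
exact criteria described above):

    import os
    import re

    # Hosts the harness itself serves or redirects; anything else is suspect.
    KNOWN_LOCAL = re.compile(
        r"https?://(localhost|127\.0\.0\.1|mochi\.test|example\.(com|org))([:/]|$)")
    EXTERNAL_URL = re.compile(r"https?://[^\s\"'<>)]+")

    def suspect_tests(root):
        """Yield (path, url) for test files that reference non-local URLs."""
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                if not name.endswith((".html", ".xhtml", ".xul", ".js")):
                    continue
                path = os.path.join(dirpath, name)
                with open(path, errors="ignore") as f:
                    for url in EXTERNAL_URL.findall(f.read()):
                        if not KNOWN_LOCAL.match(url):
                            yield path, url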
On 10-12-07 9:58 PM, Robert O'Callahan wrote:
> Someone could just run all the tests locally while offline and investigate
> all the failures.
That wouldn't be a very good set of results, because the code paths that
we take in offline and online mode are going to be different (as I've
painfully discovered during the past week!).
Ehsan
> On 10-12-07 9:58 PM, Robert O'Callahan wrote:
> > Someone could just run all the tests locally while offline and
> investigate
> > all the failures.
>
>
> That wouldn't be a very good set of results, because the code paths that
> we take in offline and online mode are going to be different (as I've
> painfully discovered during the past week!).
>
OK, but it's easy to trick Gecko into thinking you're online when you don't
have a connection to the Internet.