The state of the Aurora branch

Ehsan Akhgari

Jan 17, 2013, 5:58:20 PM
to dev-pl...@lists.mozilla.org
The Aurora tree was closed yesterday by Ed because of the perma-orange
failure filed in bug 823989, which went unnoticed for quite some time
before Ed closed the tree. This morning, I tried to reproduce the bug
locally using the information posted there and found that it was easily
reproducible, so I spent some time debugging it to figure out what's
happening, and landed a work-around.

Then I noticed that there seems to be a bunch of other perma-oranges as
well (leaks on Mac and Linux mochitest-bc caused by devtools tests and test
failures on mochitest-bc). I worked on bug 823989 in the hope that I
could reopen the tree, and I'm quite disappointed that these failures have
been ignored on Aurora for such a long time. As such, I'm keeping the
Aurora tree closed. I think keeping the trees green is a responsibility
shared by everybody, and I would appreciate it if some heroes would step up and
work on fixing the rest of the perma-oranges.

Until that happens, there is no ETA on when Aurora will be reopened.
Needless to say, this is a risk for Firefox 20, so I urge everybody to
please help fix the test failures.

Cheers,
--
Ehsan
<http://ehsanakhgari.org/>

Ed Morley

Jan 17, 2013, 6:21:28 PM
to Ehsan Akhgari, dev-pl...@lists.mozilla.org
On 17 January 2013 22:58:20, Ehsan Akhgari wrote:
> The Aurora tree was closed yesterday by Ed because of the perma-orange
> failure filed in bug 823989, which went unnoticed for quite some time

Both the failure fixed by Ehsan & the remaining ones on aurora are
Nightly-only.

Unfortunately tests run on nightly builds have 'pgo' in their
buildername (they are pgo after all) rather than 'nightly', meaning they
are indistinguishable from pgo build test results. This means a later
green pgo result can imply that the earlier (nightly) orange was only an
intermittent - hence how the permaoranges have gone unnoticed until now.

I've filed bug 832050 for making the buildbot buildernames for tests run
on nightlies use 'nightly' instead of 'pgo', so we can put these tests
on their own row in TBPL.

Best wishes,

Ed
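The masking effect Ed describes can be illustrated with a small Python sketch. It assumes a TBPL-like view that groups results by builder name and reads a row as perma-orange only when every run in that row fails; the builder names, the `builder_name()` helper and the `classify()` rule are hypothetical simplifications, not the real buildbot or TBPL logic. With nightlies labelled 'pgo', the always-failing nightly runs share a row with passing PGO runs and look merely intermittent; with a separate 'nightly' label, as proposed in bug 832050, the failure shows up on its own row.

```python
from collections import defaultdict


def builder_name(build_type: str, label_nightlies: bool) -> str:
    """Hypothetical naming scheme: nightlies are PGO builds, so they get the
    'pgo' label unless a distinct 'nightly' label is used."""
    if build_type == "nightly" and not label_nightlies:
        build_type = "pgo"
    return f"mozilla-aurora {build_type} test mochitest-browser-chrome"


def classify(results: list[bool]) -> str:
    """Summarize one row of pass/fail results (simplified heuristic)."""
    if all(results):
        return "green"
    if not any(results):
        return "permaorange"
    return "intermittent"


def rows(runs: list[tuple[str, bool]], label_nightlies: bool) -> dict[str, str]:
    """Group runs by builder name, then classify each row."""
    grouped: dict[str, list[bool]] = defaultdict(list)
    for build_type, passed in runs:
        grouped[builder_name(build_type, label_nightlies)].append(passed)
    return {name: classify(results) for name, results in grouped.items()}


# Illustrative data only: the nightly always fails, regular PGO builds always pass.
runs = [("nightly", False), ("pgo", True), ("nightly", False), ("pgo", True)]

print(rows(runs, label_nightlies=False))  # one mixed row, read as intermittent
print(rows(runs, label_nightlies=True))   # nightly row is clearly perma-orange
```

The only change between the two calls is the label, which is the point of the proposed builder-name fix: the underlying results are identical, but separating the rows stops a later green PGO run from hiding a consistently failing nightly.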

Boris Zbarsky

Jan 17, 2013, 10:19:22 PM
On 1/17/13 6:21 PM, Ed Morley wrote:
> On 17 January 2013 22:58:20, Ehsan Akhgari wrote:
>> The Aurora tree was closed yesterday by Ed because of the perma-orange
>> failure filed in bug 823989, which went unnoticed for quite some time
>
> Both the failure fixed by Ehsan & the remaining ones on aurora are
> Nightly-only.

Hmm. How do the nightly builds differ from the 'pgo' builds? It sounds
like it might be relevant for the problems that need fixing?

-Boris

Ehsan Akhgari

Jan 17, 2013, 10:58:44 PM
to Boris Zbarsky, dev-pl...@lists.mozilla.org
They define MOZ_UPDATE_CHANNEL=aurora which causes the testpilot extension
to be built among other things.

Cheers,
Ehsan

Ben Hearsum

Jan 17, 2013, 11:03:30 PM
Seems like we should make whether test pilot is built an explicit
decision rather than one dependent on the channel name... or make sure
in some other way that it's built for all aurora builds rather than just nightlies.

Ehsan Akhgari

Jan 17, 2013, 11:08:31 PM
to Ben Hearsum, dev-pl...@lists.mozilla.org
See https://bugzilla.mozilla.org/show_bug.cgi?id=831868.

Cheers,
Ehsan

L. David Baron

Jan 18, 2013, 5:39:06 AM
to Ehsan Akhgari, dev-pl...@lists.mozilla.org
On Thursday 2013-01-17 17:58 -0500, Ehsan Akhgari wrote:
> The Aurora tree was closed yesterday by Ed because of the perma-orange
> failure filed in bug 823989, which went unnoticed for quite some time
> before Ed closed the tree. This morning, I tried to reproduce the bug
> locally using the information posted on there and I saw that it was easily
> reproducible, so I spent some time to debug it and figure out what's
> happening, and landed a work-around.
>
> Then I noticed that there seems to be a bunch of other perma-oranges as
> well (leaks on Mac and Linux mochitest-bc caused by devtools tests and test
> failures on mochitest-bc). I did work on bug 823989 in the hopes that I
> can reopen the tree, and I'm quite disappointed that these failures have
> been ignored on Aurora for such a long time. As such, I'm keeping the
> Aurora tree closed. I think keeping the trees green is a responsibility
> shared by everybody, and I would appreciate it if some heroes would step up and
> work on fixing the rest of the perma-oranges.
>
> Until that happens, there is no ETA on when Aurora will be reopened.
> Needless to say, this is a risk for Firefox 20, so I urge everybody to
> please help fix the test failures.

So given that this is a regression in Firefox 19 (which is now on
beta), and the only reason we're not seeing this permaorange on beta
is because we don't generate non-debug nightly builds on beta (and I
don't think we run tests on any of our debug nightlies), it seems
odd to close only Aurora for this. It seems like depending on what
we think of its seriousness, we should either close both aurora and
beta, or we should close neither.

I'm inclined to say that this shouldn't hold the tree closed, though
I also think Ehsan's workaround in the bug (to say that testpilot
already customized the toolbar) is perfectly reasonable; it
just means we're testing a slightly different configuration.

(We have *horrible* diversity of testing configuration compared to
what our users actually use; I don't think we should hold the tree
closed over this minuscule piece of it; I also think we should try
to improve it.)

-David

--
𝄞 L. David Baron http://dbaron.org/ 𝄂
𝄢 Mozilla http://www.mozilla.org/ 𝄂

Ehsan Akhgari

Jan 18, 2013, 10:35:52 AM
to L. David Baron, dev-pl...@lists.mozilla.org
On Fri, Jan 18, 2013 at 5:39 AM, L. David Baron <dba...@dbaron.org> wrote:

> So given that this is a regression in Firefox 19 (which is now on
> beta), and the only reason we're not seeing this permaorange on beta
> is because we don't generate non-debug nightly builds on beta (and I
> don't think we run tests on any of our debug nightlies), it seems
> odd to close only Aurora for this. It seems like depending on what
> we think of its seriousness, we should either close both aurora and
> beta, or we should close neither.
>

I don't think we've ever closed a tree for test failures which _would_ show
up there if we ran tests there but don't because we don't do that...


> I'm inclined to say that this shouldn't hold the tree closed, though
> I also think Ehsan's workaround in the bug (to say that testpilot
> already customized the toolbar) is also perfectly reasonable; it
> just means we're testing a slightly different configuration.
>

I don't think that anybody has done enough debugging to understand exactly
why the failures happen so I'm extremely uncomfortable with reopening the
tree until such investigation has been done.


> (We have *horrible* diversity of testing configuration compared to
> what our users actually use; I don't think we should hold the tree
> closed over this minuscule piece of it; I also think we should try
> to improve it.)


I don't see why we should assume that this is a benign failure at this
point.

In related news, this thread diverged into multiple different private
threads, and it seems like the devtools team has two patches in bugs 824016
and 774619 which can probably help. I have asked them to land both patches
as they don't require approval (since they're both test-only changes).
Hopefully those will help.

Justin Lebar

Jan 18, 2013, 11:03:41 AM
to Ehsan Akhgari, L. David Baron, dev-pl...@lists.mozilla.org
On Fri, Jan 18, 2013 at 10:35 AM, Ehsan Akhgari <ehsan....@gmail.com> wrote:
> On Fri, Jan 18, 2013 at 5:39 AM, L. David Baron <dba...@dbaron.org> wrote:
>
>> So given that this is a regression in Firefox 19 (which is now on
>> beta), and the only reason we're not seeing this permaorange on beta
>> is because we don't generate non-debug nightly builds on beta (and I
>> don't think we run tests on any of our debug nightlies), it seems
>> odd to close only Aurora for this. It seems like depending on what
>> we think of its seriousness, we should either close both aurora and
>> beta, or we should close neither.
>>
>
> I don't think we've ever closed a tree for test failures which _would_ show
> up there if we ran tests there but don't because we don't do that...

This is an is/ought fallacy: dbaron is answering the question "what
ought we to do?", while the response above is an answer to the
question "what /do/ we do?".

See also http://en.wikipedia.org/wiki/Is%E2%80%93ought_problem

There may be a good reason not to close beta, but "we haven't done so
in the past" isn't particularly compelling.

-Justin

Ehsan Akhgari

Jan 18, 2013, 11:17:19 AM
to Justin Lebar, L. David Baron, dev-pl...@lists.mozilla.org
I'm not sure where you're going with this, Justin. My intention was not
to present a fallacy. I was trying to suggest that we usually close
trees for build/test bustage, not for the mere presence of regressions, so
I don't see a reason to close beta. I don't understand whether you're
arguing that we should close beta or just pointing out a problem
in what I said. In the latter case, I stand corrected, and apologies for
not getting my sentence quite right. In the former case, I think you need
a better argument.

Ehsan

Justin Lebar

Jan 18, 2013, 11:35:41 AM
to Ehsan Akhgari, L. David Baron, dev-pl...@lists.mozilla.org
> I was trying to suggest that we usually close trees for
> build/test bustage, not for there being regressions there, so I don't see a
> reason to close beta. I don't understand whether you're arguing that we
> should close beta or are you just pointing out a problem in what I said.

I was more trying to point out that I don't think you addressed
dbaron's argument. I happen to agree with him, although that wasn't
really what I was getting at.

To restate dbaron's argument in my own words:

1. There is a known issue affecting both beta and aurora nightly builds.

2. Either the issue is or isn't serious enough to warrant closing the
aurora tree.

3. If it is serious enough to warrant closing the aurora tree, it
seems unlikely to me that the mere fact that we don't run these tests
on beta nightlies means that it is not serious enough to warrant
closing the beta tree.

4. If on the other hand it's not serious enough to warrant closing
the beta tree, that indicates we're willing to ship with these
failures, which indicates that perhaps the Aurora tree should not
remain closed.

5. Therefore we should probably either close both Aurora and Beta or
close neither, unless something other than the fact that we don't run
the relevant tests on Beta mitigates the issue's impact there.

The key point of this argument is that the fact of whether we do or
don't run a given set of tests on the beta tree does not affect the
seriousness of the issue on that tree.

-Justin

Ehsan Akhgari

Jan 18, 2013, 11:49:11 AM
to Justin Lebar, L. David Baron, dev-pl...@lists.mozilla.org
On 2013-01-18 11:35 AM, Justin Lebar wrote:
> To restate dbaron's argument in my own words:
>
> 1. There is a known issue affecting both beta and aurora nightly builds.
>
> 2. Either the issue is or isn't serious enough to warrant closing the
> aurora tree.
>
> 3. If it is serious enough to warrant closing the aurora tree, it
> seems unlikely to me that the mere fact that we don't run these tests
> on beta nightlies means that it is not serious enough to warrant
> closing the beta tree.
>
> 4. If on the other hand it's not serious enough to warrant closing
> the beta tree, that indicates we're willing to ship with these
> failures, which indicates that perhaps the Aurora tree should not
> remain closed.
>
> 5. Therefore we should probably either close both Aurora and Beta or
> close neither, unless something other than the fact that we don't run
> the relevant tests on Beta mitigates the issue's impact there.
>
> The key point of this argument is that the fact of whether we do or
> don't run a given set of tests on the beta tree does not affect the
> seriousness of the issue on that tree.

I see. I think your assumption in point #2 above is mistaken. We do
not close trees because of the gravity of issues affecting the code
base. We do close them when there are busted builds or failing tests
because those prevent proper testing of changesets landed on top of
them. Since your conclusion (and dbaron's) is based on this incorrect
assumption, I don't agree that we should close the Beta tree at this point.

Cheers,
Ehsan
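The disagreement above comes down to which criterion decides a closure. A minimal Python sketch, with hypothetical names, contrasts the two framings: under a severity rule (dbaron's and Justin's reading) the decision depends only on whether a serious issue is present, so Aurora and Beta always get the same answer; under a testability rule (Ehsan's reading) it depends on whether failing tests prevent proper testing of changesets landed on top, so only Aurora closes. The `TreeState` fields and both functions are illustrative simplifications, not an actual sheriffing policy.

```python
from dataclasses import dataclass


@dataclass
class TreeState:
    # Hypothetical, simplified view of a tree for this illustration.
    name: str
    issue_present: bool  # the regression exists in the code on this tree
    tests_failing: bool  # automation on this tree actually reports the failure


def close_by_severity(tree: TreeState, serious: bool) -> bool:
    """Close whenever a serious issue is present, whether or not tests exercise it."""
    return tree.issue_present and serious


def close_by_testability(tree: TreeState) -> bool:
    """Close when failing tests prevent proper testing of later changesets."""
    return tree.tests_failing


# Both trees carry the regression, but the failing tests only run on Aurora nightlies.
aurora = TreeState("aurora", issue_present=True, tests_failing=True)
beta = TreeState("beta", issue_present=True, tests_failing=False)

for serious in (True, False):
    # The severity rule never distinguishes the two trees.
    assert close_by_severity(aurora, serious) == close_by_severity(beta, serious)

print(close_by_testability(aurora), close_by_testability(beta))  # True False
```

Framed this way, neither side is being inconsistent; they are applying different rules, and the Beta question only has a different answer under the testability rule.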

L. David Baron

Jan 18, 2013, 1:50:33 PM
to Ehsan Akhgari, dev-pl...@lists.mozilla.org, Justin Lebar
On Friday 2013-01-18 11:49 -0500, Ehsan Akhgari wrote:
> I see. I think your assumption in point #2 above is mistaken. We
> do not close trees because of the gravity of issues affecting the
> code base. We do close them when there are busted builds or failing
> tests because those prevent proper testing of changesets landed on
> top of them. Since your conclusion (and dbaron's) is based on this
> incorrect assumption, I don't agree that we should close the Beta
> tree at this point.

But there's the question of what constitutes "proper testing".
We're still running all of our usual tests across our usual
platforms on every push (modulo coalescing). We're just not running
quite all of them for the additional testing we do on nightlies.

I think the amount of additional test coverage we get from testing
nightlies is pretty small, since the differences between nightlies
and the PGO builds we generate for every-few-hours testing are
pretty small. I value that additional testing much less than I'd
value, say, running our tests on a localized build, or with
different system locale and timezone settings, or with common
extensions installed, or with other variations in commonly-changed
profile settings.

Ehsan Akhgari

Jan 18, 2013, 2:35:47 PM
to L. David Baron, dev-pl...@lists.mozilla.org
On 2013-01-18 10:35 AM, Ehsan Akhgari wrote:
> In related news, this thread diverged into multiple different private
> threads, and it seems like the devtools team has two patches in bugs
> 824016 and 774619 which can probably help. I have asked them to land
> both patches as they don't require approval (since they're both test
> only changes.) Hopefully those would help.

...and it seems like the said patches did not help.

Ehsan

Ehsan Akhgari

Jan 18, 2013, 2:52:14 PM
to L. David Baron, dev-pl...@lists.mozilla.org, Justin Lebar
I absolutely share all of these concerns, but I don't think this is going
to help us with the problem at hand... They're definitely worth looking
into once we're past this.

Mihai Sucan

Jan 18, 2013, 5:06:12 PM
to L. David Baron, Ehsan Akhgari, dev-pl...@lists.mozilla.org
Hello everyone!

A summary of the situation:

1. bug 824016 was a known intermittent failure that we believe we fixed in
m-c with bug 827083. I made some important changes to how the web console
initializes / destroys - changes that we hope allow us to better ensure in
our tests that we listen for the correct initialization/destroy event.

2. bug 824016 seemed to be a failure caused by a timeout of 5 seconds -
waiting to open 4 web console instances, in 2 tabs (in 2 windows). On slow
machines 5s may not be enough. The new toolbox we landed slowed
initialization a bit in its early landing; however, our team has been
working on improving the performance.

3. In the m-c patch for bug 827083 we've increased the test timeout, along
with the web console initialization/destroy changes.

4. For aurora I made a minimal patch that fixes only bug 827083 - I did not
increase the timeout for the multiple_windows web console test (as usual,
for aurora/beta we try to keep patches minimal, with no unrelated
changes). I did not expect this test to cause so much trouble in aurora.
My bad here!

5. Today I was notified of the situation, and together with Panos we've
landed 3 test fixes. For the two failing web console tests I've attempted
to pick a better event for detecting when the web console is destroyed. It
seems to me the two tests call finish() too early, before all console/toolbox
cleanups complete, hence the memory leaks. I also tried to increase the
timeout for the multiple_windows test.

6. The above attempt was not successful. Together with Ehsan and
Panos we decided to disable the two web console tests. Reasoning: this is
almost certainly a test issue, not a problem with how the toolbox
cleans up after itself. To actually fix the memory leaks in the two
tests we would have to land the big patch from bug 827083 and potentially
other toolbox fixes that may have improved the situation. This is, if I am
not mistaken, too much m-c backporting into aurora - it would require
approvals and might be much riskier than disabling the tests.


At this point I hope aurora reopens ASAP. Apologies for the trouble. Thank
you!


Best regards,
Mihai

Phil Ringnalda

Jan 19, 2013, 1:16:38 AM
On 1/18/13 2:06 PM, Mihai Sucan wrote:
> At this point I hope aurora reopens ASAP. Apologies for the trouble.

Nope. The devtools leaks, while interesting and potentially troublesome,
weren't really a significant tree-closing problem.

Now we're down to Linux64 and Win7 both failing (by which I mean the
test suite does not complete, so we have no way of knowing when people
introduce new failures) in ways that look exactly like the June/July and
October 2012 episodes where we started out by disabling webtools tests,
until we had disabled them all and we still OOMed in other tests.

Are our users also having OOM troubles on 20, or on 20 and on 21 if they
have testpilot installed? Dunno.

Have we introduced bustage since 20 merged to Aurora which affects
either or both of Linux64 or Win7 when testpilot is installed, bustage
which is caught by browser-chrome tests, but those tests do not run
because the suite dies before it gets to them, bustage which affects
users? We *cannot* know. The thing that we ship to Aurora users? We
don't know whether it's any good or not, we only know that the thing
which we do not ship to them is good.

Ehsan Akhgari

Jan 19, 2013, 10:01:09 AM
to Phil Ringnalda, dev-pl...@lists.mozilla.org
dbaron posted a summary of our options on release-drivers. He and I
recommended disabling the testpilot extension completely as a solution.
I guess we'll wait until somebody approves doing that.

Cheers,
Ehsan

Ed Morley

Jan 19, 2013, 11:17:42 AM
to Ehsan Akhgari, Phil Ringnalda, dev-pl...@lists.mozilla.org
On 19 January 2013 15:01:09, Ehsan Akhgari wrote:
> dbaron posted a summary of our options on release-drivers

Please can that be posted somewhere public for those of us not on
release-drivers?

Cheers,

Ed
(Away until Monday 21st Jan)

Justin Wood (Callek)

Jan 20, 2013, 6:19:39 PM
Ed Morley wrote:
> On 19 January 2013 15:01:09, Ehsan Akhgari wrote:
>> dbaron posted a summary of our options on release-drivers
>
> Please can that be posted somewhere public for those of us not on
> release-drivers?

Not seeing anything that need be kept private, I'll forward a post or
two here:



----- Original Message -----
> From: "L. David Baron" <stripped>
> To: release-drivers
> Sent: Saturday, January 19, 2013 4:34:34 AM
> Subject: extended Aurora tree closure and options for reopening (disabling testpilot extension?)
>
> The mozilla-aurora tree is currently closed (and has been since
> Wednesday) due to a set of permanent test failures. Failure to
> reopen the tree and allow fixes to land puts Firefox 20 at risk
> (with the risk increasing with the length of the closure).
>
> I wrote a detailed description of the situation and laid out the
> known options for moving forward in:
> https://bugzilla.mozilla.org/show_bug.cgi?id=823989#c52
> (with a few clarifications in the two comments after).
>
> The prior discussion of this issue that I'm aware of is mostly in
> that bug and in
> https://groups.google.com/forum/?fromgroups=#!topic/mozilla.dev.platform/fffQo85eM8Y
>
> Of the three options I present, the one that I think has the
> strongest support and least opposition among the developers
> investigating the problems is option 2:
>
> # (2) Disable the testpilot extension on aurora using the patch in
> # comment 48, and reopen mozilla-aurora. comment 43 says that
> # we're not currently running any studies using testpilot (and
> # also that ehsan supports this solution).
>
> I think release-drivers should be aware that this is currently the
> leading option; I'm not sure who should make the final call here,
> but probably somebody a bit more informed about testpilot than I am.
>
>
> (It's currently Saturday morning in London, and I plan to spend the
> weekend as a tourist, so I expect to be only intermittently online
> today and tomorrow.)
>
> -David



----- Original Message -----
> From: "Alex Keybl" <stripped>
> Sent: Saturday, January 19, 2013 7:36:36 PM
> Subject: Re: extended Aurora tree closure and options for reopening
(disabling testpilot extension?)
>
> Let's move forward with option 2 to re-open the tree, and continue
> investigating how to find final resolution allowing test pilot to be
> re-enabled on Aurora. Cheng and Jinghua - please let us know when
> you were hoping to push out the next survey, so we can put a date on
> re-enabling.
>
> -Alex

Nicholas Nethercote

Jan 20, 2013, 9:40:45 PM
to Justin Wood (Callek), dev-pl...@lists.mozilla.org
>> Of the three options I present, the one that I think has the
>> strongest support and least opposition among the developers
>> investigating the problems is option 2:
>>
>> # (2) Disable the testpilot extension on aurora using the patch in
>> # comment 48, and reopen mozilla-aurora. comment 43 says that
>> # we're not currently running any studies using testpilot (and
>> # also that ehsan supports this solution).
>>
>> I think release-drivers should be aware that this is currently the
>> leading option; I'm not sure who should make the final call here,
>> but probably somebody a bit more informed about testpilot than I am.

Disabling it on mozilla-central should be an option too!
https://bugzilla.mozilla.org/show_bug.cgi?id=719455 was a bad memory
leak caused by Test Pilot that took *10 months* to fix, basically
because its development is moribund. And now it's causing another
problem. And it's not being used in any useful capacity... you can
probably tell what I think should happen to it.

Nick

Ehsan Akhgari

Jan 21, 2013, 1:01:05 AM
to Nicholas Nethercote, Justin Wood (Callek), dev-pl...@lists.mozilla.org
On 2013-01-20 9:40 PM, Nicholas Nethercote wrote:
>>> Of the three options I present, the one that I think has the
>>> strongest support and least opposition among the developers
>>> investigating the problems is option 2:
>>>
>>> # (2) Disable the testpilot extension on aurora using the patch in
>>> # comment 48, and reopen mozilla-aurora. comment 43 says that
>>> # we're not currently running any studies using testpilot (and
>>> # also that ehsan supports this solution).
>>>
>>> I think release-drivers should be aware that this is currently the
>>> leading option; I'm not sure who should make the final call here,
>>> but probably somebody a bit more informed about testpilot than I am.
>
> Disabling it on mozilla-central should be an option too!
> https://bugzilla.mozilla.org/show_bug.cgi?id=719455 was a bad memory
> leak caused by Test Pilot that took *10 months* to fix, basically
> because its development is moribund. And now it's causing another
> problem. And it's not being used in any useful capacity... you can
> probably tell what I think should happen to it.

The testpilot extension is already disabled on trunk (which of course
doesn't mean that users cannot go ahead and install it manually.)

Cheers,
Ehsan
