Future of Linux64 support

Chris Cooper

Aug 29, 2012, 5:13:07 PM

to dev-pl...@lists.mozilla.org, dev-tree-...@lists.mozilla.org

catlee brought this up yesterday at the Platform meeting [1], so here is
the follow-up discussion I promised.

We (Mozilla-the-project) have had a lot of discussions recently about
how to fix our current capacity issues. Release engineering and IT
continue to try to bring new testing infrastructure online [2], but we're
also re-examining our platform support needs as we go (e.g. 10.5 support
[3]) in case the support choices we made years ago no longer apply.

More to the point, release engineering and IT would like to *not* stand
up replacement infrastructure for a platform that we don't care about.

We currently generate linux 64-bit builds and run them through the full
gamut of testing. It is listed as a tier 1 platform [4], but we don't
actually ship linux 64-bit builds [5][6]. To produce these tier 1 builds
but not publish them (officially) seems like a waste to me.

Are these builds important enough to publish, or can we reduce our linux
64-bit build/test capacity and re-purpose it elsewhere?

I am not overly concerned about the build capacity for 64-bit linux
builds. Those machines are already slated to be re-imaged as Windows
64-bit builders as the builds themselves get moved into AWS.

The test hardware, however, is another story. There are 70 rev3 minis
testing linux 64-bit builds. If we don't need the 64-bit coverage, we
could re-image these minis to provide more coverage for 32-bit linux,
Windows 7, and Windows XP builds.

As you consider options, please keep in mind also that:
* our linux end-user population is a small fraction (1.6%) of our total
ADIs [7]
* 32-bit linux builds account for almost 2/3s (62%) of those linux ADIs [7]
* historically most users get their linux builds via their distro

Note: there's lots of middle ground to be found here: reduced frequency
of builds/tests, build only 32-bit linux but test them on both linux
32-bit & linux 64-bit machines, etc.

We should decide what's important to us here.

cheers,
--
coop

1. https://wiki.mozilla.org/Platform/2012-08-28#Tree_Management
2. https://bugzilla.mozilla.org/show_bug.cgi?id=758624
3. https://groups.google.com/forum/?fromgroups=#!topicsearchin/mozilla.dev.planning/10.5/mozilla.dev.planning/70QOVDgbNEo
4. https://developer.mozilla.org/en-US/docs/Supported_build_configurations
5. https://www.mozilla.org/en-US/firefox/all.html
6. https://bugzilla.mozilla.org/show_bug.cgi?id=527907
7. Stats for Firefox 14 on 2012-08-28 taken from
https://metrics.mozilla.com/stats/firefox.shtml

Anthony Hughes

Aug 29, 2012, 5:32:56 PM

to dev-pl...@lists.mozilla.org

Just my two cents.

As a Linux 64-bit user, I would still like to be able to get my nightly
builds regardless of whether they've been put through the whole gamut of
testing. If we're talking about turning off test infrastructure for
Linux64 builds, I'm okay with that. If we're talking about not building
nightly Linux64 builds, I'm more concerned: it would make it especially
hard to track down regression ranges for Linux64-specific regressions.

On a side note, I think percentage becomes less and less useful as a
metric as our user base grows. 38% of 1.6% of 400 million users is still
2.4 million people. We've committed to doing much more for fewer people
in the past.
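
The arithmetic behind that estimate can be checked with a short sketch. The percentages are the ones quoted in this thread, the 400 million total is the round number used above, and the function names are made up for illustration.

```python
# Figures quoted in this thread; the total is a round-number assumption.
TOTAL_USERS = 400_000_000
LINUX_SHARE = 0.016     # Linux share of all ADIs (1.6%)
LINUX64_SHARE = 0.38    # 64-bit share of Linux ADIs (100% - 62%)


def linux64_share_of_all(linux=LINUX_SHARE, linux64=LINUX64_SHARE):
    """Fraction of all users that are on 64-bit Linux builds."""
    return linux * linux64


def linux64_users(total=TOTAL_USERS, linux=LINUX_SHARE, linux64=LINUX64_SHARE):
    """Estimated number of 64-bit Linux users."""
    return total * linux64_share_of_all(linux, linux64)


if __name__ == "__main__":
    print(f"{linux64_users():,.0f} users")
    print(f"{linux64_share_of_all():.2%} of all users")
```

With these inputs the estimate is about 2.4 million users, or roughly 0.6% of the total user base.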

--
Anthony Hughes
Quality Engineer
Mozilla QA (Desktop)

Ben Hearsum

Aug 29, 2012, 5:42:12 PM

to Anthony Hughes, dev-pl...@lists.mozilla.org

On 08/29/12 05:32 PM, Anthony Hughes wrote:
> Just my two cents.
>
> As a Linux 64-bit user, I would still like to be able to get my nightly
> builds regardless if they've been put through the whole gamut of testing

It should be noted that you can still run our official 32-bit builds on
Linux. We don't ship 64-bit Windows builds, but we still have 64-bit
Windows users.

Matt Brubeck

Aug 29, 2012, 5:44:48 PM

to dev-pl...@lists.mozilla.org, Chris Cooper, dev-tree-...@lists.mozilla.org

On 08/29/2012 02:13 PM, Chris Cooper wrote:
> We currently generate linux 64-bit builds and run them through the full
> gamut of testing. It is listed as a tier 1 platform [4], but we don't
> actually ship linux 64-bit builds [5][6].

While we don't expose Linux64 builds on our main web pages (except for
the Nightly page), we do make them publicly available for all channels
and releases, e.g.:
https://ftp.mozilla.org/pub/mozilla.org/firefox/releases/15.0/linux-x86_64/

> * 32-bit linux builds account for almost 2/3s (62%) of those linux ADIs [7]

> 7. Stats for Firefox 14 on 2012-08-28 taken from
> https://metrics.mozilla.com/stats/firefox.shtml

The same page shows that on Nightly (the one place where we do expose
Linux64 builds on a main download page) about 70% of Linux ADI are 64-bit.

The fact that only ~40% of release ADI are 64-bit is partly a situation
*we* created by failing to update our download pages (bug 527907). I
think the correct thing to do is to fix the download pages, rather
than dropping these builds.

> * historically most users get their linux builds via their distro

Bug 527907 has some notes about which types of Firefox builds are
shipped by Linux distributions. Increasingly, major distros offer
64-bit builds as their main or default option. We should expect usage
of 32-bit builds to drop steadily (especially if we update our own
download pages), and 64-bit to grow. If we test only one type of Linux
build, I think it should be 64-bit.

Anthony Hughes

Aug 29, 2012, 5:45:49 PM

to Ben Hearsum, dev-pl...@lists.mozilla.org

Yup, however I've traditionally had issues with mixing architecture on
Linux, particularly when it comes to plug-ins.

Reed Loden

Aug 29, 2012, 5:49:39 PM

to dev-pl...@lists.mozilla.org, cco...@deadsquid.com, dev-tree-...@lists.mozilla.org

I disagree with your reading of the stats... On Firefox 3.6, there are
still a ton of 32-bit Linux users, but all other Firefox versions above
that have a much higher 64-bit following (by a large majority). If
anything, I'd say that getting rid of 32-bit builds makes way more
sense moving forward.

As somebody who has actually been using the Linux 64-bit nightlies for
years, I would be very grumpy if they suddenly went away.

I've been trying for years to get the 64-bit builds shown on
www.mozilla.org, but it's been hard to get webdev time to modify
product-details and bouncer to support it.

~reed

Dao

Aug 29, 2012, 5:54:16 PM

On 29.08.2012 23:44, Matt Brubeck wrote:
>> * 32-bit linux builds account for almost 2/3s (62%) of those linux
>> ADIs [7]
> > 7. Stats for Firefox 14 on 2012-08-28 taken from
> > https://metrics.mozilla.com/stats/firefox.shtml
>
> The same page shows that on Nightly (the one place where we do expose
> Linux64 builds on a main download page) about 70% of Linux ADI are 64-bit.
>
> The fact that only ~40% of release ADI are 64-bit is partly a situation
> *we* created by failing to update our download pages (bug 527907). I
> think the correct thing to do is to fix the download pages, rather
> than dropping these builds.

Oh, so the stats don't take distro builds into account. Seems like
they're practically useless then.

Dao

Aug 29, 2012, 5:56:29 PM

On 29.08.2012 23:13, Chris Cooper wrote:
> * 32-bit linux builds account for almost 2/3s (62%) of those linux ADIs [7]

What does the trend look like? I would expect the 64-bit share to grow.

> * historically most users get their linux builds via their distro

Why does this matter? Are you saying distros should continuously track
mozilla-central/aurora/beta and run our tests?

Mike Connor

Aug 29, 2012, 5:56:54 PM

to Anthony Hughes, dev-pl...@lists.mozilla.org

On 2012-08-29 5:32 PM, Anthony Hughes wrote:
> Just my two cents.
>
> As a Linux 64-bit user, I would still like to be able to get my
> nightly builds regardless if they've been put through the whole gamut
> of testing or not. If we're talking about turning off test
> infrastructure for Linux64 builds I'm okay. If we're talking about not
> building nightly Linux64 builds I'm more concerned. This would make it
> especially hard tracking down regression ranges for Linux64 specific
> regressions.

I think we can and should continue to produce builds unless we're unable
to do so. With AWS, I think we'll be okay here.

> On a side note, I think percentage becomes less and less useful a
> metric as our user base grows. 38% of 1.6% of 400 million users is
> still 2.4 million people. We've committed to doing much more for less
> people in the past.

We have, though it hasn't always been a great choice in hindsight. If
we're getting less support for Windows/Mac/Linux32 users because of 0.6%
of our users, I find it hard to believe that it's the right choice.

That said, we should decide whether Linux64 is something we feel is
important for the future of the project, and explicitly support it at
the level we feel is appropriate.

I would, perhaps naively, counter-propose that we be more explicit
here:

1) Make Linux64 a Tier 2 platform, and identify a platform owner, if one
can be found.
2) Take it out of the default platforms built on try, but allow requests
for that platform.
3) Discontinue building on demand, and only do nightly builds
4) Maintain just enough testing infra to deliver results for those
builds, and retask the rest for tier-1 platforms.

Coop/Ben, would this suffice to free up sufficient capacity? Anthony,
does this admittedly low bar feel sufficient from your POV?

-- Mike

Boris Zbarsky

Aug 29, 2012, 5:58:46 PM

On 8/29/12 5:56 PM, Mike Connor wrote:
> I would, perhaps naively, counter-propose that we be more explicit
> here:
>
> 1) Make Linux64 a Tier 2 platform, and identify a platform owner, if one
> can be found.

It might make more sense to do that for linux32, given the data on
non-3.6 users.

-Boris

Karl Tomlinson

Aug 29, 2012, 5:59:16 PM

Ben Hearsum writes:

> It should be noted that you can still run our official 32-bit builds on
> Linux. We don't ship 64-bit Windows builds, but we still have 64-bit
> Windows users.

This tends to come with its share of problems, such as gtk themes
not installed, crash submission not working, other system
integration issues...

Anthony Hughes

Aug 29, 2012, 6:06:00 PM

to Mike Connor, dev-pl...@lists.mozilla.org

On 12-08-29 02:56 PM, Mike Connor wrote:
> On 2012-08-29 5:32 PM, Anthony Hughes wrote:
>> Just my two cents.
>>
>> As a Linux 64-bit user, I would still like to be able to get my
>> nightly builds regardless if they've been put through the whole gamut
>> of testing or not. If we're talking about turning off test
>> infrastructure for Linux64 builds I'm okay. If we're talking about
>> not building nightly Linux64 builds I'm more concerned. This would
>> make it especially hard tracking down regression ranges for Linux64
>> specific regressions.
>
> I think we can and should continue to produce builds unless we're
> unable to do so. With AWS, I think we'll be okay here.
>
>> On a side note, I think percentage becomes less and less useful a
>> metric as our user base grows. 38% of 1.6% of 400 million users is
>> still 2.4 million people. We've committed to doing much more for less
>> people in the past.
>
> We have, though it hasn't always been a great choice in hindsight. If
> we're getting less support for Windows/Mac/Linux32 users because of
> 0.6% of our users, I find it hard to believe that it's the right choice.
>
> That said, we should decide whether Linux64 is something we feel is
> important for the future of the project, and explicitly support it at
> the level we feel is appropriate.
>
> I would, perhaps naively, counter-propose that we be more
> explicit here:
>
> 1) Make Linux64 a Tier 2 platform, and identify a platform owner, if
> one can be found.
> 2) Take it out of the default platforms built on try, but allow
> requests for that platform.
> 3) Discontinue building on demand, and only do nightly builds
> 4) Maintain just enough testing infra to deliver results for those
> builds, and retask the rest for tier-1 platforms.
>
> Coop/Ben, would this suffice to free up sufficient capacity? Anthony,
> does this admittedly low bar feel sufficient from your POV?
>
> -- Mike

I *think* so, thanks Mike.

Joshua Cranmer

Aug 29, 2012, 6:07:09 PM

The rationale is different between Linux and Windows: Windows has its
full 32-bit system installed by default, while it turns out that at
least my distro (Debian) doesn't install the necessary core 32-bit
libraries by default. The main reason for not providing Windows 64-bit
builds is often cited to be the lack of plugin support, but on Linux,
the most important plugin (Flash) has both 32-bit and 64-bit versions,
and probably most of the rest are compiled individually by distros
anyhow (or provided in both 32-bit/64-bit forms). Linux32-on-Linux64 is
a lot more difficult and error-prone than Win32-on-Win64.

Mike Connor

Aug 29, 2012, 6:13:25 PM

to dev-pl...@lists.mozilla.org

Looking at the data, we're still at around 2/3 Linux32 on Fx13/14,
though Beta/Aurora/Nightly builds skew to 64-bit users. We made the
call on Windows to not invest in 64-bit, as there was no significant
user benefit to 64-bit, and 32-bit builds run fine. I'm not sure why
we'd make the inverse choice on Linux...

-- Mike

Karl Tomlinson

Aug 29, 2012, 6:18:16 PM

There are plenty of good reasons to keep both 32 and 64-bit
builds running. Most importantly they make it easier for people
to run and test our nightlies.

Chris Cooper writes:

> Note: there's lots of middle ground to be found here: reduced
> frequency of builds/tests, build only 32-bit linux but test them
> on both linux 32-bit & linux 64-bit machines, etc.
>
> We should decide what's important to us here.

Yes, what is more in question is whether we need to run the full
set of tests on both platforms.

I imagine it is rare to have failures on 32 but not 64 or
vice-versa, and 32-bit NT and 64-bit Mac provide some coverage of
these issues.

Some things I would consider would be:

1) Dropping debug tests on x86(32) Linux.
2) Dropping talos tests on x86(32) Linux.

There is some benefit in running these, but if this is costing us
too dearly, then that might be a good place to start.

Boris Zbarsky

Aug 29, 2012, 6:19:59 PM

On 8/29/12 6:13 PM, Mike Connor wrote:
> Looking at the data, we're still at around 2/3 Linux32 on Fx13/14,
> though Beta/Aurora/Nightly builds skew to 64-bit users. We made the
> call on Windows to not invest in 64-bit, as there was no significant
> user benefit to 64-bit, and 32-bit builds run fine. I'm not sure why
> we'd make the inverse choice on Linux...

Well, for a start because 32-bit builds don't run fine on 64-bit Linux
unless the user goes and installs a bunch of extra packages. And maybe
not even then.

But also because I keep hearing distros are mostly shipping 64-bit
builds, and I keep hearing that most of our Linux users are on distro
builds. If those are both true (which is an assumption), then we should
probably be focusing on what most of our Linux users (as opposed to just
the mozilla.org build Linux users) are using.

-Boris

Boris Zbarsky

Aug 29, 2012, 6:21:10 PM

On 8/29/12 6:13 PM, Mike Connor wrote:

> Looking at the data, we're still at around 2/3 Linux32 on Fx13/14,

Actually, just to check. This is for final releases, right? Is this
including distro builds, or only for mozilla.org builds?

-Boris

L. David Baron

Aug 29, 2012, 7:00:55 PM

to Ben Hearsum, dev-pl...@lists.mozilla.org, Anthony Hughes

This is basically true on Fedora-based systems, but it's harder to
do on Debian/Ubuntu-based systems and only half works when it works
at all.

-David

--
𝄞 L. David Baron http://dbaron.org/ 𝄂
𝄢 Mozilla http://www.mozilla.org/ 𝄂

L. David Baron

Aug 29, 2012, 7:14:59 PM

to dev-pl...@lists.mozilla.org, dev-tree-...@lists.mozilla.org

On Wednesday 2012-08-29 17:13 -0400, Chris Cooper wrote:
> * 32-bit linux builds account for almost 2/3s (62%) of those linux ADIs [7]

Does that include distro builds? And how much of it is because
those are the builds that we push on the Web site?

> * historically most users get their linux builds via their distro

Given that:

(1) the plugin situation on 64-bit Linux is, for most users,
basically the same as the plugin situation on 32-bit Linux (i.e.,
Flash and Java work)

(2) it's easier for users, especially on Debian/Ubuntu, to install
the builds that match the architecture of the distro (to the point
that on Ubuntu, 32-bit builds have a bunch of the system integration
stuff broken because ia32-libs doesn't have libraries for it)

(3) anybody with 4GB or more of RAM needs to install a 64-bit distro
to use that RAM

I think we should, if anything, consider dropping 32-bit Linux to
tier 2 because it's a dying platform. I think dropping 64-bit Linux
is basically equivalent to saying that we're planning to drop Linux
in 2 years.

> Note: there's lots of middle ground to be found here: reduced
> frequency of builds/tests, build only 32-bit linux but test them on
> both linux 32-bit & linux 64-bit machines, etc.

That said, I don't think the difference between the two matters that
much, especially since we're running our tests of both on the same
long-since-unsupported [1] distro (Fedora 12, if I'm reading the
machine names correctly, which is the same chronological age as OS X
10.6, but I suspect Linux users are on a faster upgrade cycle).

-David

[1] http://fedoraproject.org/wiki/LifeCycle/EOL

Matt Brubeck

Aug 29, 2012, 7:15:14 PM

Our ADI stats (including the ones quoted above from [1]) are based on
blocklist pings, so they should include distro builds (as long as the
blocklist settings are unchanged, which I believe is generally the case).

I don't know exactly what portion of our Linux users are on distro
builds; it would be interesting to see that.

I expect a large portion of Linux users are still on 32-bit distros, but
I also expect this to tilt toward 64-bit soon. Distributions like
Fedora have already made 64-bit the default for new installs, and others
will probably follow shortly. (This is possible in part because both
Firefox and Flash released stable 64-bit versions in 2011).

[1]: https://metrics.mozilla.com/stats/firefox.shtml

Benoit Jacob

Aug 29, 2012, 7:56:23 PM

to dev-pl...@lists.mozilla.org, dev-tree-...@lists.mozilla.org

2012/8/29 Chris Cooper <cco...@deadsquid.com>:

> I am not overly concerned about the build capacity for 64-bit linux builds.
> Those machines are already slated to be re-imaged as Windows 64-bit builders
> as the build themselves get moved into AWS.
>
> The test hardware, however, is another story.

I'm concerned that we are making too many compromises because our
testing is too expensive. Shouldn't we focus on making testing less
expensive, rather than continuing to make compromises to accommodate
that?

Recently John O'Duinn blogged about how expensive testing is:
http://oduinn.com/blog/2012/08/21/137-hours-compute-hours-every-6-minutes/

I tried to propose a solution in comments there, let me repeat it here:

Maybe it's time to switch to running tests only at a certain time
interval, instead of on every push?

Then on regression, the system could automatically start bisecting.

On another note, if we look at the population of Nightly users on
Linux, according to the crash reports that we get, 64bit builds
account for more than half of the population.

bjacob:~/crash-stats$ zcat 20120810-pub-crashdata.csv.gz | grep 17.0a1 | grep Linux | grep -v Android | wc -l
37
bjacob:~/crash-stats$ zcat 20120810-pub-crashdata.csv.gz | grep 17.0a1 | grep Linux | grep -v Android | grep x86_64 | wc -l
27
bjacob:~/crash-stats$ zcat 20120811-pub-crashdata.csv.gz | grep 17.0a1 | grep Linux | grep -v Android | wc -l
41
bjacob:~/crash-stats$ zcat 20120811-pub-crashdata.csv.gz | grep 17.0a1 | grep Linux | grep -v Android | grep x86_64 | wc -l
26
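
For reference, the same filtering and the resulting shares can be sketched in Python. This is only an illustration: the helper names are hypothetical, the filter uses plain substring matching (slightly stricter than grep, where the dots in the version string match any character), and nothing here assumes a particular column layout in the crash data files.

```python
# Counts copied from the shell output above:
# (all Linux desktop crash rows, x86_64 crash rows) per day.
COUNTS = {
    "2012-08-10": (37, 27),
    "2012-08-11": (41, 26),
}


def count_linux_crashes(lines, version="17.0a1"):
    """Mimic the grep pipeline: return (Linux desktop rows, x86_64 rows)."""
    total = x86_64 = 0
    for line in lines:
        if version in line and "Linux" in line and "Android" not in line:
            total += 1
            if "x86_64" in line:
                x86_64 += 1
    return total, x86_64


def x86_64_share(total, x86_64):
    """Fraction of the counted crash rows that came from 64-bit builds."""
    return x86_64 / total


if __name__ == "__main__":
    for day, (total, x86_64) in COUNTS.items():
        print(f"{day}: {x86_64}/{total} = {x86_64_share(total, x86_64):.0%}")
```

On these two days the 64-bit share works out to roughly 73% and 63%. The samples are small (37 and 41 crash reports), so the figures are indicative at best.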

Benoit

Justin Lebar

Aug 29, 2012, 7:56:45 PM

to dev-pl...@lists.mozilla.org, dev-tree-...@lists.mozilla.org

Given that many developers use Linux 64, I think it's vitally
important if only for that reason that we continue to run tests on
Linux 64.

OTOH, with the death of XUL Fennec, Linux-32 builds are the closest
thing we have to B2G that we currently run tests on. (Many tests
relevant to B2G don't work on native Fennec.) I'd want to look
closely at how rarely we find 32-bit-only bugs before declaring that
Linux-32 doesn't matter in the short-term.

That said, I see no reason that we need to continue running Linux
tests specifically on Mac minis. I understood that there's work
being done on that front, although I don't have a bug. ISTM that
prioritizing that work over dropping any particular platforms would
get us the ability to continue scaling without asking us to trade off
test coverage on important platforms. Dropping tests on platform X is
only a band-aid to our inability to scale.

-Justin

Ehsan Akhgari

Aug 29, 2012, 7:58:53 PM

to Benoit Jacob, dev-pl...@lists.mozilla.org, dev-tree-...@lists.mozilla.org

On 12-08-29 7:56 PM, Benoit Jacob wrote:
> Maybe it's time to switch to running tests only at a certain time
> interval, instead of on every push?
>
> Then on regression, the system could automatically start bisecting.

That might not be very practical due to turnaround times, but backing
out everybody in that range might be. ;-)

Ehsan

Benoit Jacob

Aug 29, 2012, 8:03:50 PM

to Ehsan Akhgari, dev-pl...@lists.mozilla.org, dev-tree-...@lists.mozilla.org

2012/8/29 Ehsan Akhgari <ehsan....@gmail.com>:

In fact, the turnaround time problem already exists with our current
solution: as test slaves are lagging behind m-i, there is a turnaround
time which was about 4 hours last Friday when the tree was closed. The
ones suffering the most from this are the perma-sheriffs. So a fixed
turnaround time of, say, 30 min would be much better than what we
currently have.

Benoit

Kyle Huey

Aug 29, 2012, 8:18:54 PM

to Benoit Jacob, dev-pl...@lists.mozilla.org, dev-tree-...@lists.mozilla.org

On Wed, Aug 29, 2012 at 4:56 PM, Benoit Jacob <jacob.b...@gmail.com> wrote:

> Maybe it's time to switch to running tests only at a certain time
> interval, instead of on every push?
>

Thanks to test coalescing, we certainly don't run tests on every push.

- Kyle

Benoit Jacob

Aug 29, 2012, 8:21:56 PM

to Kyle Huey, dev-pl...@lists.mozilla.org, dev-tree-...@lists.mozilla.org

2012/8/29 Kyle Huey <m...@kylehuey.com>:

That's nitpicking: whatever the coalescing delay is, it's low enough
that our test infrastructure lags behind.

Benoit

Nicholas Nethercote

Aug 30, 2012, 1:22:10 AM

to Justin Lebar, dev-pl...@lists.mozilla.org, dev-tree-...@lists.mozilla.org

On Thu, Aug 30, 2012 at 9:56 AM, Justin Lebar <justin...@gmail.com> wrote:
> Given that many developers use Linux 64, I think it's vitally
> important if only for that reason that we continue to run tests on
> Linux 64.

Yes, yes, a thousand times yes. I do 99% of my dev work on a Linux64 box.

I think downgrading Linux64 to tier-2 is an appalling idea. Please don't do it.

Nick

Mike Hommey

Aug 30, 2012, 2:09:37 AM

to Justin Lebar, dev-pl...@lists.mozilla.org, dev-tree-...@lists.mozilla.org

On Wed, Aug 29, 2012 at 08:56:45PM -0300, Justin Lebar wrote:
> Given that many developers use Linux 64, I think it's vitally
> important if only for that reason that we continue to run tests on
> Linux 64.
>

> OTOH, with the death of XUL Fennec, Linux-32 builds are the closest
> thing we have to B2G that we currently run tests on. (Many tests
> relevant to B2G don't work on native Fennec.) I'd want to look
> closely at how rarely we find 32-bit-only bugs before declaring that
> Linux-32 doesn't matter in the short-term.

I'd say most linux-32-only bugs are related to floats, and since that
behavior is compiler-dependent on top of being x86-32-dependent, these
bugs are usually not seen on other 32-bit platforms, including Android.

Fortunately, these bugs usually only show up on reftests, so as long
as we still run those, we should be kind of safe.

Mike

Ms2ger

Aug 30, 2012, 4:40:22 AM

On 08/30/2012 01:56 AM, Benoit Jacob wrote:
> 2012/8/29 Chris Cooper <cco...@deadsquid.com>:
>> I am not overly concerned about the build capacity for 64-bit linux builds.
>> Those machines are already slated to be re-imaged as Windows 64-bit builders
>> as the build themselves get moved into AWS.
>>
>> The test hardware, however, is another story.
>
> I'm concerned that we are making too many compromises due to our
> testing being too expensive. Shouldn't we focus on making testing less
> expensive, rather than continuing making compromises to accomodate
> that?
>
> Recently John O'Duinn blogged about how expensive testing is:
> http://oduinn.com/blog/2012/08/21/137-hours-compute-hours-every-6-minutes/
>
> I tried to propose a solution in comments there, let me repeat it here:
>
> Maybe it's time to switch to running tests only at a certain time
> interval, instead of on every push?
>
> Then on regression, the system could automatically start bisecting.

We need full tests on every run, because we can't trust developers to
only push patches they have tested. Last week, I took the time to
categorize which kinds of commits end up in the repo
(8707f5ddeda3:1c0ac073dc65 if you're interested). The results were as
follows:

* 6 merges
* 20 backout commits
* 25 commits that were backed out
* 57 commits that stuck

That means that about half of those commits could have been avoided by
simply using tryserver. I think instating a policy that punishes those
people that waste resources like this (be that by revoking their commit
access or through other means) would be a more effective way to reduce load.
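
As a quick check of the "about half" figure, here is a small sketch of the arithmetic. Treating backouts plus backed-out commits as the avoidable share is an assumption about how the tally above is meant to be read; the variable names are illustrative only.

```python
# Counts reported above for the range 8707f5ddeda3:1c0ac073dc65.
counts = {
    "merges": 6,
    "backouts": 20,
    "backed_out": 25,
    "stuck": 57,
}

total = sum(counts.values())
# Assumption: "avoidable" means the backouts plus the commits they reverted.
churn = counts["backouts"] + counts["backed_out"]

print(f"total commits: {total}")
print(f"avoidable share: {churn / total:.1%}")
```

By this reading, 45 of 108 commits (a little over 40%) left no lasting change in the repo.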

HTH
Ms2ger

Justin Lebar

unread,

Aug 30, 2012, 5:34:25 AM8/30/12

to Ms2ger, dev-pl...@lists.mozilla.org

On Thu, Aug 30, 2012 at 5:40 AM, Ms2ger <ms2...@gmail.com> wrote:
> That means that about half of those commits could have been avoided by
> simply using tryserver. I think instating a policy that punishes those
> people that waste resources like this (be that by revoking their commit
> access or through other means) would be a more effective way to reduce load.

Since the test capacity for tryserver and m-i is shared, it's not
clear to me that people are wasting as much CPU time as you imply by
pushing busted code to m-i.

In particular, a successful push to m-i without a preceding tryserver
run has a cost of 1 round of tests. A successful push to try plus a
successful push to m-i has a cost of 2 rounds of tests. An
unsuccessful push to m-i followed by a successful push to try + m-i
has a cost of 4 rounds of tests.

So suppose I have a changeset which I think has a 90% chance of being
green on m-i. We can compute the expected infra load of my options:

1) Push to m-i only: 90% chance of 1 round, 10% chance of 4 rounds =
1.3 rounds expected.
2) Push to try and then m-i: 90% chance of 2 rounds, 10% chance of 3
rounds (busted try, successful try, m-i) = 2.1 rounds expected.

and observe that not pushing to try actually /saves/ CPU cycles overall!

The break even point, when (1) and (2) result in the same number of
rounds in expectation is x in the equation

x + 4*(1-x) = 2x + 3*(1-x) <==>
1-x = x <==>
x = .5

At a 50% chance of failure, both (1) and (2) lead to an expected 2.5
rounds of tests. (This ignores the effect of coalescing on m-i, which
I'm not sure how to model, but I think the point stands.)
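
The expected-load comparison above can be written out as a short sketch. The function names are hypothetical, and the model is only the simplified one described here: it ignores coalescing and assumes a failed push is always followed by exactly one successful retry.

```python
def expected_rounds_direct(p_green: float) -> float:
    """Option (1): push straight to m-i.

    1 round if the push is green; 4 rounds if it bounces
    (busted m-i push, then a successful try push, then m-i again).
    """
    return p_green * 1 + (1 - p_green) * 4


def expected_rounds_try_first(p_green: float) -> float:
    """Option (2): push to try, then to m-i.

    2 rounds if the try push is green; 3 rounds if the first try push
    is busted (busted try, successful try, m-i).
    """
    return p_green * 2 + (1 - p_green) * 3


for p in (0.9, 0.5):
    print(p, expected_rounds_direct(p), expected_rounds_try_first(p))
```

Up to floating-point rounding, this reproduces the 1.3 versus 2.1 rounds at a 90% chance of green, and 2.5 rounds for both options at 50%.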

I'll grant that many of the successful pushes to m-i in your sample
likely had try runs backing them, so our prior probability for the
success of an m-i push should be lower than 66%. However, I hope
you'll grant that there exist circumstances in which it would be more
responsible to risk burning m-i than to push to try.

Indeed, if we really wanted to save test cycles, we could suggest that
developers push to try with |-b d -p X -u all| and |-b d -p win32 -u
none|, for X selected uniformly at random from the set {linux32,
linux64, macos64}. The logic above suggests that if this caught
slightly more than 50% of all failures, it would be a win overall.
("Slightly more than 50%" because we have to take into account the CPU
load of this try push, which is something much smaller than "1
round".) There'd be more bustage on m-i as a result, but fewer test
cycles consumed overall, /even if people continued pushing to try at
the same rate as they do today/. Also, the turnaround time for that
try push could be pretty fast.

But I wouldn't seriously recommend doing that so long as we have so
much coalescing on m-i, because CPU cycles aren't the only resource
we're consuming here. The time and sanity of our sheriffs is valuable
as well.

And that brings me to my real point here: The time and sanity of all
our engineers (sheriffs and otherwise) is in fact far more valuable
than CPU cycles. So I don't think the solution is to quibble about
how we can save CPU time and punish those who don't. The solution is
to grow our infrastructure to handle the load.

There does exist some X at which we might say "it's better for a human
to waste 30 minutes on this than to let our CPUs waste X days on it."
It seems to me that we are nowhere /near/ this particular break-even
point. If I'm wrong, then maybe we /should/ seriously consider
running fewer tests by default on try, because the extra cost to
sheriff would be justified.

-Justin

Mike Hommey

unread,

Aug 30, 2012, 5:58:12 AM8/30/12

to Justin Lebar, dev-pl...@lists.mozilla.org, Ms2ger

On Thu, Aug 30, 2012 at 06:34:25AM -0300, Justin Lebar wrote:
> And that brings me to my real point here: The time and sanity of all
> our engineers (sheriffs and otherwise) is in fact far more valuable
> than CPU cycles. So I don't think the solution is to quibble about
> how we can save CPU time and punish those who don't. The solution is
> to grow our infrastructure to handle the load.

Growing infrastructure is not the sole problem. We need to address the
fact that we actually waste a whole lot of CPU (as in use much less than
what we have, not use too much) from our existing infrastructure because
running tests is mostly single-process single-threaded. We should find
ways to run more tests *simultaneously* on the same hardware. It would
not only help the test infrastructure load, but it would also help
developers. I don't know if you've tried to run tests locally, but it
just takes forever, and you can't do much else at the same time.
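
To make the idea concrete, here is a minimal sketch of running several test commands side by side on one machine. It is not how our harnesses work today; the chunk names and commands are stand-ins, and it assumes the chunks are independent and do not compete for window focus or a shared profile.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_command(command: list[str]) -> int:
    """Run one test command to completion and return its exit code."""
    return subprocess.run(command, capture_output=True).returncode


def run_in_parallel(commands: dict[str, list[str]], max_workers: int = 4) -> dict[str, int]:
    """Run independent test commands concurrently, at most max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(run_command, cmd) for name, cmd in commands.items()}
        return {name: future.result() for name, future in futures.items()}


# Stand-in commands: a real harness would launch actual test chunks here.
demo = {
    "chunk-1": [sys.executable, "-c", "raise SystemExit(0)"],
    "chunk-2": [sys.executable, "-c", "raise SystemExit(1)"],
}
print(run_in_parallel(demo))
```

Threads are enough here because each worker only waits on a child process; the child processes are what occupy the otherwise idle cores.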

Mike

Ms2ger

unread,

Aug 30, 2012, 6:23:22 AM8/30/12

to

On 08/30/2012 11:34 AM, Justin Lebar wrote:
> On Thu, Aug 30, 2012 at 5:40 AM, Ms2ger <ms2...@gmail.com> wrote:
>> That means that about half of those commits could have been avoided by
>> simply using tryserver. I think instating a policy that punishes those
>> people that waste resources like this (be that by revoking their commit
>> access or through other means) would be a more effective way to reduce load.
>
> Since the test capacity for tryserver and m-i is shared, it's not
> clear to me that people are wasting as much CPU time as you imply by
> pushing busted code to m-i.
>
> In particular, a successful push to m-i without a preceding tryserver
> run has a cost of 1 round of tests. A successful push to try plus a
> successful push to m-i has a cost of 2 rounds of tests. An
> unsuccessful push to m-i followed by a successful push to try + m-i
> has a cost of 4 rounds of tests.

I'd like to add scenarios that actually happened recently:

* Developer pushed to try, failed to notice that XPCShell tests go
orange across the board; developer pushed to inbound anyway; sheriff
backs out of inbound. Total cost: frustrated sheriffs, inbound closed
for a few hours, 3 rounds of tests, bug still not fixed.
* Developer doesn't bother with try. First push to inbound has an
incorrect commit message; developer backs out and relands with a better
commit message, and then notices that his patches also fail to build;
another backout follows. Some time later, developer pushes to inbound
again, still doesn't go green; yet another backout. Total cost: unhappy
sheriffs, 6 rounds of tests, bug still not fixed.

I believe it would be useful to avoid such cases.

> So suppose I have a changeset which I think has a 90% chance of being
> green on m-i.

Alright, you *think* you've got a 90% chance. In your case, that might
very well be correct; I don't remember having to back you out
repeatedly. However, the scenarios I mentioned above show, I think,
that a lot of developers fail to correctly assess the risk of their
patches, so I wouldn't generally base any decision on such self-assessments.

> And that brings me to my real point here: The time and sanity of all
> our engineers (sheriffs and otherwise) is in fact far more valuable
> than CPU cycles.

As someone who occasionally does sheriff duty, I do not feel that
sheriffs' time and sanity is valued much. On the contrary, the
continuous claims that not running tests on all pushes will solve all
our load problems, suggest that the people who make those claims would
not agree with your valuation.

HTH
Ms2ger

Justin Lebar

unread,

Aug 30, 2012, 7:55:15 AM8/30/12

to Ms2ger, dev-pl...@lists.mozilla.org

> I'd like to add scenarios that actually happened recently:
>
> * Developer pushed to try, failed to notice that XPCShell tests go orange
> across the board; developer pushed to inbound anyway; sheriff backs out of
> inbound. Total cost: frustrated sheriffs, inbound closed for a few hours, 3
> rounds of tests, bug still not fixed.

Although of course nobody ought to make careless mistakes, it's also
not OK that we had to close inbound for a few hours because one person
made a mistake. We all make mistakes, and we should structure our
processes so as to minimize the cost of those mistakes. So in
particular, I think it's unacceptable that we should ever have to
coalesce test runs on m-i or m-c, because that's choosing to use fewer
CPU cycles at the expense of human cycles.

> * Developer doesn't bother with try. First push to inbound has an incorrect
> commit message; developer backs out and relands with a better commit
> message, and then notices that his patches also fail to build; another
> backout follows. Some time later, developer pushes to inbound again, still
> doesn't go green; yet another backout. Total cost: unhappy sheriffs, 6
> rounds of tests, bug still not fixed.

If we could cancel builds on m-i without breaking things, we could
bring this down to 4+epsilon rounds of tests. Maybe that's something
we should prioritize so as to reduce our infra load.

Again, from the perspective of CPU cycles, if there was a 51% chance
that the second push to inbound would have been green, then the
developer did the right thing by pushing straight to m-i. (Or, if you
prefer, it was the right thing to do if the dev thought there was a
75% chance of success, and the dev is not over-optimistic by 25
percentage points more than 50% of the time.)

From my perspective (and it sounds like perhaps from yours as well),
the problem isn't the waste of cycles, but the waste of sheriffs'
time. (And in this particular case, it's not clear to me that
sheriffs' time /was/ inordinately wasted, since these were simple
backouts; the build failures weren't mysterious.)

The fact that someone made a mistake doesn't mean that the solution is
to scare that person into submission by threatening to revoke his or
her commit access. Everyone has to bear the costs of those threats,
and those costs are both material and psychological.

> As someone who occasionally does sheriff duty, I do not feel that sheriffs'
> time and sanity is valued much. On the contrary, the continuous claims that
> not running tests on all pushes will solve all our load problems, suggest
> that the people who make those claims would not agree with your valuation.

I'll let people make those claims themselves; I'm not going to argue
with someone who agrees with me. :)

(Mike Hommey wrote)

> We should find ways to run more tests *simultaneously* on the same hardware.

+1. My point is, we should have sufficient capacity so we don't have
to trade off CPU time and human time. Whether we achieve that by
adding more machines or by getting more out of our existing machines
(or both) doesn't make a difference to me!

-Justin

Ben Hearsum

unread,

Aug 30, 2012, 8:46:53 AM8/30/12

to Benoit Jacob, Kyle Huey, dev-pl...@lists.mozilla.org, dev-tree-...@lists.mozilla.org

There is no coalescing delay. When a test starts, if there's any queue
for it, it collapses it entirely. Except on try, where we can't coalesce
at all.
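
A rough sketch of that behaviour, for readers unfamiliar with it. The class and method names are made up for illustration, and the assumption that the newest queued push is the one that actually gets tested is mine, not a statement about the real scheduler code.

```python
from collections import defaultdict
from typing import Optional


class CoalescingQueue:
    """Pending pushes per test suite; starting a job collapses the whole queue."""

    def __init__(self) -> None:
        self.pending: dict[str, list[str]] = defaultdict(list)

    def add_push(self, suite: str, revision: str) -> None:
        """Queue a request to run `suite` against `revision`."""
        self.pending[suite].append(revision)

    def start_next(self, suite: str) -> Optional[tuple[str, list[str]]]:
        """Return (revision to test, revisions coalesced into it), or None if idle."""
        queue = self.pending[suite]
        if not queue:
            return None
        # Assumption: the newest push is tested and stands in for the older ones.
        *coalesced, newest = queue
        queue.clear()
        return newest, coalesced
```

On try, the equivalent would hand out one revision at a time and never return a non-empty coalesced list.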

On 08/29/12 08:21 PM, Benoit Jacob wrote:

Kyle Huey

unread,

Aug 30, 2012, 9:09:14 AM8/30/12

to Justin Lebar, dev-pl...@lists.mozilla.org, Ms2ger

On Thu, Aug 30, 2012 at 4:55 AM, Justin Lebar <justin...@gmail.com> wrote:

> > I'd like to add scenarios that actually happened recently:
> >
> > * Developer pushed to try, failed to notice that XPCShell tests go orange
> > across the board; developer pushed to inbound anyway; sheriff backs out of
> > inbound. Total cost: frustrated sheriffs, inbound closed for a few hours, 3
> > rounds of tests, bug still not fixed.
>

> Although of course nobody ought to make careless mistakes, it's also
> not OK that we had to close inbound for a few hours because one person
> made a mistake. We all make mistakes, and we should structure our
> processes so as to minimize the cost of those mistakes. So in
> particular, I think it's unacceptable that we should ever have to
> coalesce test runs on m-i or m-c, because that's choosing to use fewer
> CPU cycles at the expense of human cycles.
>

I don't disagree with anything you've said, but I want to point out that if
it weren't for test coalescing the test infrastructure would be unable to
keep up with the load. Right now it tends to fall behind during the day
(in California) and catch up overnight. Not coalescing would add several
thousand more tests to the backlog that needs to be made up overnight, and
I don't think that backlog would clear before the next morning.

I haven't actually counted (somebody should!) but in the afternoon most
builds on inbound get at least a third, and often many more, of their tests
coalesced with later pushes.

- Kyle

Ralph Giles

unread,

Aug 30, 2012, 11:40:28 AM8/30/12

to Justin Lebar, dev-pl...@lists.mozilla.org

On 30/08/12 04:55 AM, Justin Lebar wrote:

> +1. My point is, we should have sufficient capacity so we don't have
> to trade off CPU time and human time. Whether we achieve that by
> adding more machines or by getting more out of our existing machines
> (or both) doesn't make a difference to me!

This. There are other effects of our overstrained testing infrastructure
as well. I would be adding a lot more tests if we weren't constantly
talking about how the tests take too long.

-r

Randell Jesup

unread,

Aug 30, 2012, 11:41:57 AM8/30/12

to

>I think we should, if anything, consider dropping 32-bit Linux to
>tier 2 because it's a dying platform. I think dropping 64-bit Linux
>is basically equivalent to saying that we're planning to drop Linux
>in 2 years.

+1. Run 32-bit linux tests every N checkins or even only a few times a
day, if you want - or don't run them at all. They rarely will be
different than 64-bit. In the odd case they are, it will be more
painful, but that will be rare.

>> Note: there's lots of middle ground to be found here: reduced
>> frequency of builds/tests, build only 32-bit linux but test them on
>> both linux 32-bit & linux 64-bit machines, etc.
>
>That said, I don't think the difference between the two matters that
>much, especially since we're running our tests of both on the same
>long-since-unsupported [1] distro (Fedora 12, if I'm reading the
>machine names correctly, which is the same chronological age as OS X
>10.6, but I suspect Linux users are on a faster upgrade cycle).

If we test with Fedora it should be no more than 1, maybe 2 years old.
If we test with Ubuntu it should be the most recent LTS. And don't get
me started about the RHEL5 builders (with updates off)... :-/

--
Randell Jesup, Mozilla Corp
remove ".news" for personal email

Zack Weinberg

unread,

Aug 30, 2012, 12:06:06 PM8/30/12

to

On 2012-08-29 7:56 PM, Justin Lebar wrote:
> Given that many developers use Linux 64, I think it's vitally
> important if only for that reason that we continue to run tests on
> Linux 64.

I also concur with this, and would like to expand on it a little.

My primary development box is Linux64. For the kind of bugs I work on,
test failures are usually, but not always, across-the-board. Build-farm
test failures on Linux64 almost always correspond directly to something
I can reproduce locally; test failures on other platforms -- even on
Linux32 -- usually *don't*. Furthermore, for whatever reason, there seem
to be fewer intermittent oranges on Linux64 than on other
platforms, which means I can generally assume that orange on Linux64 try
really is a mistake in whatever code I just wrote.

My usual workflow, therefore, goes like this:

1. Run some subset of the tests (the ones I think are relevant) locally.
2. Once they all succeed, push to try.
3. Wait half an hour or so.
4. Investigate Linux64 try failures by running more tests locally, fix
errors.
5. By now other platforms may have also posted results. If not, wait
six to eight hours.
6. Investigate other-platform-only test failures by staring at logs,
tearing hair out, adding debugging printouts and pushing to try again, etc.

Please take note of the delays at steps 3 and 5. The
edit-push-wait-read-logs cycle for Linux is *orders of magnitude* faster
than the cycle for OSX and Windows; this is in fact a major reason why I
do my development on Linux in the first place. Step 6 can therefore
take days to weeks. The fewer bugs I have to deal with in step 6, the
better.

But more importantly for *this* conversation, not running all tests on
every push to m-c/m-i for Linux64 risks contaminating step 4 with
unrelated bugs to the point where it's as bad as step 6.

zw

Matt Brubeck

unread,

Aug 30, 2012, 12:50:24 PM8/30/12

to

On 08/30/2012 01:40 AM, Ms2ger wrote:
> That means that about half of those commits could have been avoided by
> simply using tryserver. I think instating a policy that punishes those
> people that waste resources like this (be that by revoking their commit
> access or through other means) would be a more effective way to reduce
> load.

Here's something I've considered (but not started actually doing) when
acting as sheriff: When a patch lands that breaks the build, instead of
just backing out that one patch, it could save time in the long run to
also back out all the patches that landed on top of it -- except those
that had recent green Try runs.

This should reduce the risk of extended closures, *and* encourage
developers to use the Try server.
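
The selection rule could look something like the sketch below. The function and its inputs are hypothetical: in practice a sheriff would have to find out by hand which pushes had a recent green Try run.

```python
def select_backouts(pushes_since_breakage: list[str], green_on_try: set[str]) -> list[str]:
    """Pick the pushes to back out after a build-breaking landing.

    pushes_since_breakage is ordered oldest first and starts with the
    breaking push itself. Everything that landed on top of it is backed
    out too, except pushes with a recent green Try run.
    """
    if not pushes_since_breakage:
        return []
    breaking, *later = pushes_since_breakage
    return [breaking] + [rev for rev in later if rev not in green_on_try]
```

The result is listed oldest first; the actual backouts would be applied in the reverse order.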

Mike Connor

unread,

Aug 30, 2012, 12:54:38 PM8/30/12

to Justin Lebar, dev-pl...@lists.mozilla.org, Ms2ger

On 2012-08-30 7:55 AM, Justin Lebar wrote:
>> We should find ways to run more tests *simultaneously* on the same hardware.

> +1. My point is, we should have sufficient capacity so we don't have
> to trade off CPU time and human time. Whether we achieve that by
> adding more machines or by getting more out of our existing machines
> (or both) doesn't make a difference to me!

Throwing hardware at the problem is still going to use a lot of human
time. Someone has to buy/rack/manage that hardware, and the more of it
we have, the more complex and expensive it gets to maintain these
systems. I think that after five years of "just get more hardware" as a
solution we need to start investing developer time in making the system
more efficient.

-- Mike

Justin Lebar

unread,

Aug 30, 2012, 1:36:06 PM8/30/12

to Matt Brubeck, dev-pl...@lists.mozilla.org

> Here's something I've considered (but not started actually doing) when
> acting as sheriff: When a patch lands that breaks the build, instead of
> just backing out that one patch, it could save time in the long run to also
> back out all the patches that landed on top of it -- except those that had
> recent green Try runs.
>
> This should reduce the risk of extended closures, *and* encourage developers
> to use the Try server.

...at the expense of your time (trolling through bugs to find links to
try, then starring those builds), and (innocent) developers' time (to
reland patches, unless you're going to do that too). And possibly
without significantly reducing load on infra, since if we get more
people pushing to try, that's not necessarily less infra load.

Maybe this is an acceptable short-term solution, given the necessity
of coalescing on m-i. But let's be clear that it's a work-around and
that we shouldn't have to live with it.

-Justin

Steve Fink

unread,

Aug 30, 2012, 2:33:30 PM8/30/12

to Matt Brubeck, dev-pl...@lists.mozilla.org

It seems reasonable to expect that we will require H units of hardware
resources per developer. Really, that should be X units per unit of code
change, but as long as we're running all tests on any change, the latter
can be approximated proportionally to the number of pushes, which can be
approximated proportionally to the number of active developers, which
these days can be approximated proportionally to the number of MoCo
employees. (Actually, that last is not a bad thing to do in any case; it
doesn't say that all developers should be employees, just that each
developer should represent about the same number of community members.)

Is this ongoing need taken into account? As in, we need to buy K
additional machines per developer hire? (Or allocate K VMs, etc.) I know
the cost of additional machine resources is not linear, but that's a
separate issue.

If we're going to continue to grow, then shrinking H through efficiency
isn't enough.

But in the meantime, shrinking H would seem to be highest priority. So
what are the sources of inefficiency? I'll write a sad little story:

Story: I do a push. My push touches some subset of the code, probably a
very small subset. We then run (almost) all tests on all platforms and
configurations. (The exception I am aware of is that touching js/src
will kick off some additional tests.) So the opportunities include:

1. Don't run all platforms x configurations. For example, we need all
the opt builds if we're going to run performance tests, but do we need
debug builds on every platform? Or vice versa, the debug tests do more
checking and so are more useful, so perhaps we shouldn't do perf tests
everywhere.

When mozilla-inbound gets colorful, there are usually many more than one
red/oranges. In the limit, all but one of those are waste.

2. Don't run all of the tests. Run tests based on what changed. (Or run
all the tests on the fastest platform, and a subset on the rest.) I
understand that it's a pain to identify which tests to run, and we'll
bikeshed that endlessly, but it seems like it's a pretty straightforward
application of machine learning based on a very, very large dataset of
past failures. (Do we have enough history in a database somewhere for that?)

3. Coalesce intelligently. Not every push needs to be the same. Every N
pushes can get a full test suite run. Listen to jlebar, and multiply the
probability of failure (based on data) by the cost of failure. Though I
would include human costs in the latter -- tree closures should cost
more than a few hundred CPU hours, and sheriffs' hair-pulling is
expensive. It's not just the Rogaine.
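
As a sketch of what #3 might mean in practice: run the full suite when the expected cost of a missed failure exceeds the cost of the run, and at least every N pushes regardless. All names, units, and defaults below are hypothetical; real values would have to come from data.

```python
def should_run_full_suite(
    pushes_since_full_run: int,
    p_failure: float,
    cost_of_failure_hours: float,
    cost_of_full_run_hours: float,
    max_gap: int = 5,
) -> bool:
    """Decide whether this push gets a full test run.

    pushes_since_full_run counts pushes since the last full run, not
    including this one. cost_of_failure_hours should fold in human costs
    (tree closure, sheriff time), not only CPU time.
    """
    # Every max_gap-th push gets a full run no matter what.
    if pushes_since_full_run + 1 >= max_gap:
        return True
    # Otherwise compare expected cost of a missed failure with the run cost.
    expected_failure_cost = p_failure * cost_of_failure_hours
    return expected_failure_cost > cost_of_full_run_hours
```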

Story, cont'd: Now the tests are running. Some finish quickly, some wait
in a queue and then take longer. Many CPU cores are idle most of the time.

4. Run a VM per core, and just let the tests run singlethreaded. Dunno
if GPU access is an issue for some tests, but maybe you only use VMs for
some types of tests.

5. Or go the other way, and make more tests runnable in parallel. More
efficient than #4 because it avoids the VM overhead, much harder to
implement, would also improve testing locally. (Though making it easy to
set up test VMs could help local testing too.) Needing window focus will
again bite us here.

6. Which tests actually catch problems? Machine learning again.
Partition the individual tests within a suite into <intermittent,almost
always passes,good>. Don't run the ones that almost always pass as
often. ("Almost always pass" == "Almost always pass when buggy code is
committed", I mean.)
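
A deliberately crude sketch of the partition in #6, using fixed thresholds rather than any actual machine learning. The record layout and threshold values are invented for illustration, and it assumes "buggy code" can be approximated by "pushes that were later backed out".

```python
from dataclasses import dataclass


@dataclass
class HistoryRecord:
    runs_on_good_code: int      # runs against pushes that stuck
    failures_on_good_code: int  # failures there are false alarms
    runs_on_bad_code: int       # runs against pushes later backed out
    failures_on_bad_code: int   # failures there are genuine catches


def classify(history: HistoryRecord,
             flaky_threshold: float = 0.02,
             catch_threshold: float = 0.05) -> str:
    """Sort one test into intermittent / almost always passes / good."""
    false_alarm_rate = history.failures_on_good_code / max(history.runs_on_good_code, 1)
    catch_rate = history.failures_on_bad_code / max(history.runs_on_bad_code, 1)
    if false_alarm_rate > flaky_threshold:
        return "intermittent"
    if catch_rate < catch_threshold:
        return "almost always passes"
    return "good"
```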

Story, cont'd: My push crashes and burns. Some number of test suites go
orange. By then, 3 other people have pushed on top of me, possibly
triggering more orange. The sheriff wades into the wreckage, backing out
pieces. It gets ugly, so the sheriff closes the tree.

7. Automatically cancel tests that are going to fail. Due to the
intermittent orange, this may be hard to determine, but if the same test
suite fails in the same way more than once (more than twice?), I'd say
it's good enough. Just be sure to report the canceled builds as
"cancelled because expected to fail based on builds B1 and B2".

8. Make multiple inbounds. This would reduce the number of things piled
on top of a failure and reduce the cost of a tree closure, but would
also require another set of jobs per merge.

9. Automate analysis. When a test suite fails, rerun only up to the
first dozen individual failed tests within that suite. Run on the
revisions before and after the first one that failed. (Do the same trick
to check whether an orange is intermittent first?) Make the individual
tests runnable independently to support this, or when that's too
painful, require metadata to describe the interdependencies.

10. Simpler automation: when a test suite fails, wait until the results
of another run of the same suite come in. If it also fails,
automatically retrigger the identical test on the failed and previous
revisions. If there was coalescing, automatically trigger that test
suite on every single potentially bad push.

11. Make tbpl report "if you liked this failure, you might also
like...": add an analysis button for a failure that gives (a) the
results of the past 5 jobs on that slave, (b) the set of pushes
coalesced, if any (aka a list of pushes since the last test), (c) all
other runs of that test suite on pushes that include the changeset of
interest as well as the most recent runs that did not include the
changeset, and maybe (d) an indication of which failures look the same.
And of course, any stars that have been added.
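
For the coalescing case in #10, working out which pushes are "potentially bad" might look like this sketch. The data shapes are assumptions: pushes in landing order, and a per-push result of passed, failed, or never run because it was coalesced away.

```python
from typing import Optional


def suspect_pushes(pushes: list[str], results: dict[str, Optional[bool]]) -> list[str]:
    """Return the pushes that could have introduced the first failure.

    pushes is ordered oldest first. results maps a push to True (suite
    passed), False (suite failed), or None (suite never ran on it).
    The suspects are the first failing push plus every untested push
    between it and the last push known to pass.
    """
    suspects: list[str] = []
    for push in pushes:
        outcome = results.get(push)
        if outcome is True:
            suspects = []          # known good: earlier untested pushes are cleared
        elif outcome is None:
            suspects.append(push)  # never ran: cannot be ruled out yet
        else:
            suspects.append(push)  # first failure: stop here
            return suspects
    return []                      # no failure seen
```

Every suspect except the last has no result yet, so those are the pushes the suite would be retriggered on.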

Story, cont'd: the sheriff in question is ehsan. Or mbrubeck. Or....
This person just spent a big chunk of time dealing with someone else's
tree failure instead of writing code. Anyone noticed that these people
tend to be rather high-calibre?

12. Hire dedicated permasheriffs, at least to cover the crunch times.

13. Beef up tbpl and related tools until a trained monkey can handle 95%
of the cases.

To avoid animal cruelty laws, do not allow philor to do the training.

Note that I haven't mentioned the try server, and it's responsible
for... what, half of the current load? That's because I haven't thought
about it. But I'll add a final pair for that anyway:

14. Extend the try syntax to select different subsets of tests for each
platform/configuration. But generating the syntax is already too painful, so

15. Create a curses-based UI for doing the try pushes, something halfway
between the try server syntax web UI and the trychooser extension. (I
have a prototype of this in which I glued together the trychooser and
crecord extensions, and I've found that I have become much more precise
in my try pushes. YMMV.) This isn't just useful for #14; having the set
of possible unit tests in front of you is much nicer than a linear
questionnaire, and it does the push for you from the command line so
you're far more likely to use it than a separate web UI.

So *something* in the above list ought to be both useful and doable, I hope?

Ben Hearsum

unread,

Aug 30, 2012, 2:36:19 PM8/30/12

to

On 08/30/12 02:33 PM, Steve Fink wrote:
> It seems reasonable to expect that we will require H units of hardware
> resources per developer.

We also need X, Y, and Z amounts of IT/RelEng/QA resources per developer
(to scale up the hardware as well as keep up with brand new
projects/features/requests). These teams haven't scaled with Development.

Benoit Jacob

unread,

Aug 30, 2012, 2:36:17 PM8/30/12

to Justin Lebar, dev-pl...@lists.mozilla.org, Matt Brubeck

2012/8/30 Justin Lebar <justin...@gmail.com>:

>> Here's something I've considered (but not started actually doing) when
>> acting as sheriff: When a patch lands that breaks the build, instead of
>> just backing out that one patch, it could save time in the long run to also
>> back out all the patches that landed on top of it -- except those that had
>> recent green Try runs.
>>
>> This should reduce the risk of extended closures, *and* encourage developers
>> to use the Try server.
>

> ...at the expense of your time (trolling through bugs to find links to
> try, then starring those builds), and (innocent) developers' time (to
> reland patches, unless you're going to do that too). And possibly
> without significantly reducing load on infra, since if we get more
> people pushing to try, that's not necessarily less infra load.

Except if this brings/forces a culture change, making developers much
more careful about properly pushing to try before pushing to inbound,
leading to fewer bad pushes on inbound.

Benoit

Steve Fink

unread,

Aug 30, 2012, 2:57:59 PM8/30/12

to Ben Hearsum, dev-pl...@lists.mozilla.org

Yes, very much yes. Additional hardware costs way more than just the
price of the hardware, and there are many additional costs that are
loosely proportional to the number of developers.

Ehsan Akhgari

unread,

Aug 30, 2012, 3:15:53 PM8/30/12

to Steve Fink, dev-pl...@lists.mozilla.org, Matt Brubeck

On 12-08-30 2:33 PM, Steve Fink wrote:
> Story: I do a push. My push touches some subset of the code, probably a
> very small subset. We then run (almost) all tests on all platforms and
> configurations. (The exception I am aware of is that touching js/src
> will kick off some additional tests.) So the opportunities include:

JS is sort of special in that regard. Many parts of Gecko are so
intertwined with one another that the idea of running only the relevant
subset of tests sounds like daydreaming!

> 1. Don't run all platforms x configurations. For example, we need all
> the opt builds if we're going to run performance tests, but do we need
> debug builds on every platform? Or vice versa, the debug tests do more
> checking and so are more useful, so perhaps we shouldn't do perf tests
> everywhere.

Choosing the subset of the matrix is very hard to decide...

> 2. Don't run all of the tests. Run tests based on what changed. (Or run
> all the tests on the fastest platform, and a subset on the rest.) I
> understand that it's a pain to identify which tests to run, and we'll
> bikeshed that endlessly, but it seems like it's a pretty straightforward
> application of machine learning based on a very, very large dataset of
> past failures. (Do we have enough history in a database somewhere for
> that?)

That won't be good enough unless the data includes test failures that
the developers found and fixed _locally_ before their patch hit the
try server (or inbound or whatnot.) Based on my experience on Gecko,
those local failures are by far the majority of the benefit that our
tests have provided for me.

> 3. Coalesce intelligently. Not every push needs to be the same. Every N
> pushes can get a full test suite run. Listen to jlebar, and multiply the
> probability of failure (based on data) by the cost of failure. Though I
> would include human costs in the latter -- tree closures should cost
> more than a few hundred CPU hours, and sheriffs' hair-pulling is
> expensive. It's not just the Rogaine.

+1.

> 4. Run a VM per core, and just let the tests run singlethreaded. Dunno
> if GPU access is an issue for some tests, but maybe you only use VMs for
> some types of tests.

That is a very good idea. I believe that a lot of our tests don't
really rely on the GPU, so it shouldn't be too hard to separate them out.

> 5. Or go the other way, and make more tests runnable in parallel. More
> efficient than #4 because it avoids the VM overhead, much harder to
> implement, would also improve testing locally. (Though making it easy to
> set up test VMs could help local testing too.) Needing window focus will
> again bite us here.

I'm not convinced that this is feasible in the short to middle term for
any of our graphical test suites.

> 6. Which tests actually catch problems? Machine learning again.
> Partition the individual tests within a suite into <intermittent,almost
> always passes,good>. Don't run the ones that almost always pass as
> often. ("Almost always pass" == "Almost always pass when buggy code is
> committed", I mean.)

Again, the story of missing local data.

> 7. Automatically cancel tests that are going to fail. Due to the
> intermittent orange, this may be hard to determine, but if the same test
> suite fails in the same way more than once (more than twice?), I'd say
> it's good enough. Just be sure to report the canceled builds as
> "cancelled because expected to fail based on builds B1 and B2".

If you're talking about individual tests, then this won't help much with
the infra load.

> 8. Make multiple inbounds. This would reduce the number of things piled
> on top of a failure and reduce the cost of a tree closure, but would
> also require another set of jobs per merge.

Absolutely not. This is a very bad idea as it will increase the merging
pain. Whatever infra optimizations that can happen on multiple inbounds
can happen on the same inbound as well. It's just that thinking about
multiple inbounds is easier!

> 9. Automate analysis. When a test suite fails, rerun only up to the
> first dozen individual failed tests within that suite. Run on the
> revisions before and after the first one that failed. (Do the same trick
> to check whether an orange is intermittent first?) Make the individual
> tests runnable independently to support this, or when that's too
> painful, require metadata to describe the interdependencies.

Nice thing you mentioned this, since our mochitest runner has gained the
ability of re-running the failed tests. :-)

> 10. Simpler automation: when a test suite fails, wait until the results
> of another run of the same suite come in. If it also fails,
> automatically retrigger the identical test on the failed and previous
> revisions. If there was coalescing, automatically trigger that test
> suite on every single potentially bad push.

See above.

> 11. Make tbpl report "if you liked this failure, you might also
> like...": add an analysis button for a failure that gives (a) the
> results of the past 5 jobs on that slave, (b) the set of pushes
> coalesced, if any (aka a list of pushes since the last test), (c) all
> other runs of that test suite on pushes that include the changeset of
> interest as well as the most recent runs that did not include the
> changeset, and maybe (d) an indication of which failures look the same.
> And of course, any stars that have been added.

Good idea.

> 12. Hire dedicated permasheriffs, at least to cover the crunch times.

I hear they're a pretty rare crowd. :-)

> 13. Beef up tbpl and related tools until a trained monkey can handle 95%
> of the cases.

I think we should automate whatever we can. RelEng has been working on
this kind of automation, but I'm afraid with the current infra load,
those types of automations cannot be implemented, since they require
more machine time to run tests etc. :(

Ehsan

Chris AtLee

Aug 30, 2012, 3:43:27 PM

> 1. Don't run all platforms x configurations. For example, we need all
> the opt builds if we're going to run performance tests, but do we need
> debug builds on every platform? Or vice versa, the debug tests do more
> checking and so are more useful, so perhaps we shouldn't do perf tests
> everywhere.
>
> When mozilla-inbound gets colorful, there are usually many more than one
> red/oranges. In the limit, all but one of those are waste.

Maybe we can run only perf tests per-checkin for opt builds, and only
unit tests per-checkin for debug builds. We can run opt unittests on a
less frequent basis.
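
A rough sketch of that split, in Python. The function name and the
`opt_unittest_every` knob are made up for illustration; nothing here
reflects how the real scheduler is configured.

```python
def jobs_for_push(push_index, opt_unittest_every=5):
    """Return the (build type, test kind) pairs to schedule for one push.

    push_index counts pushes from 0. opt_unittest_every is a hypothetical
    knob: opt unit tests run only on every Nth push.
    """
    # Every push gets perf tests on opt builds and unit tests on debug builds.
    jobs = [("opt", "perf"), ("debug", "unittest")]
    # Opt unit tests are the "less frequent" part.
    if push_index % opt_unittest_every == 0:
        jobs.append(("opt", "unittest"))
    return jobs
```

With the default of 5, pushes 0, 5, 10, ... get all three job types and
every other push gets two.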

> 2. Don't run all of the tests. Run tests based on what changed. (Or run
> all the tests on the fastest platform, and a subset on the rest.) I
> understand that it's a pain to identify which tests to run, and we'll
> bikeshed that endlessly, but it seems like it's a pretty straightforward
> application of machine learning based on a very, very large dataset of
> past failures. (Do we have enough history in a database somewhere for
> that?)

I'd love to be able to do this. I don't know how the tests are
structured well enough to know how easy this would be to do. For
example, I think Benoit suggested that only a subset of tests could be
run when the webgl code has been changed.

> 3. Coalesce intelligently. Not every push needs to be the same. Every N
> pushes can get a full test suite run. Listen to jlebar, and multiply the
> probability of failure (based on data) by the cost of failure. Though I
> would include human costs in the latter -- tree closures should cost
> more than a few hundred CPU hours, and sheriffs' hair-pulling is
> expensive. It's not just the Rogaine.

How does automation make the decision about which pushes to coalesce?
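
One possible shape for that decision, following the quoted "probability
of failure times cost of failure" idea. This is only a sketch: the
function names, the probabilities, and the hour figures are all
hypothetical, and a real scheduler would need per-push failure estimates
from historical data.

```python
def should_run_suite(p_fail, failure_cost_hours, run_cost_hours):
    """Run the suite when the expected cost of skipping it exceeds the cost of running it."""
    return p_fail * failure_cost_hours > run_cost_hours


def pushes_to_test(p_fail_by_push, failure_cost_hours, run_cost_hours, full_run_every=10):
    """Return the indexes of the pushes that should get this suite.

    Pushes not returned are coalesced into the next tested push.
    Every full_run_every-th push is always tested, whatever its estimate.
    """
    selected = []
    for i, p_fail in enumerate(p_fail_by_push):
        forced = i % full_run_every == 0
        if forced or should_run_suite(p_fail, failure_cost_hours, run_cost_hours):
            selected.append(i)
    return selected
```

The failure cost is where human costs (tree closures, sheriff time)
would be folded in, expressed in the same unit as the run cost.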

> Story, cont'd: Now the tests are running. Some finish quickly, some wait
> in a queue and then take longer. Many CPU cores are idle most of the time.
>
> 4. Run a VM per core, and just let the tests run singlethreaded. Dunno
> if GPU access is an issue for some tests, but maybe you only use VMs for
> some types of tests.
>
> 5. Or go the other way, and make more tests runnable in parallel. More
> efficient than #4 because it avoids the VM overhead, much harder to
> implement, would also improve testing locally. (Though making it easy to
> set up test VMs could help local testing too.) Needing window focus will
> again bite us here.
>
> 6. Which tests actually catch problems? Machine learning again.
> Partition the individual tests within a suite into <intermittent,almost
> always passes,good>. Don't run the ones that almost always pass as
> often. ("Almost always pass" == "Almost always pass when buggy code is
> committed", I mean.)

To add to that, maybe we could run all the "good" tests first and then
only run the rest of the suites if the "good" tests pass. This is
assuming that "good" == "good indicator of code quality". If the "good"
set is small/fast enough, this would cut down the cost of busted builds
significantly.
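
A minimal sketch of that two-stage run. `run_staged` and the `run_test`
callback are assumptions for illustration, not part of any existing
harness.

```python
def run_staged(good_tests, other_tests, run_test):
    """Run the "good" tests first; run the rest only if all of them pass.

    run_test is a caller-supplied function taking a test name and
    returning True on pass, False on failure.
    Returns a dict of test name -> result for the tests that actually ran.
    """
    results = {name: run_test(name) for name in good_tests}
    if not all(results.values()):
        # A good indicator failed: skip the remaining tests entirely.
        return results
    for name in other_tests:
        results[name] = run_test(name)
    return results
```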

Steve Fink

Aug 30, 2012, 3:45:50 PM

to Ehsan Akhgari, dev-pl...@lists.mozilla.org, Matt Brubeck

On Thu 30 Aug 2012 12:15:53 PM PDT, Ehsan Akhgari wrote:
> On 12-08-30 2:33 PM, Steve Fink wrote:
>> Story: I do a push. My push touches some subset of the code, probably a
>> very small subset. We then run (almost) all tests on all platforms and
>> configurations. (The exception I am aware of is that touching js/src
>> will kick off some additional tests.) So the opportunities include:
>
> JS is sort of special in that regard. Many parts of Gecko are so
> intertwined into another that the idea of running only the relevant
> subset of tests sounds like daydreaming!

If you have to do it manually, yes.

>> 1. Don't run all platforms x configurations. For example, we need all
>> the opt builds if we're going to run performance tests, but do we need
>> debug builds on every platform? Or vice versa, the debug tests do more
>> checking and so are more useful, so perhaps we shouldn't do perf tests
>> everywhere.
>

> Choosing the subset of the matrix is very hard to decide...

Fortunately, you don't have to get it right. The cost of getting it
wrong is that you have a few more pushes coalesced for the failure.

>> 2. Don't run all of the tests. Run tests based on what changed. (Or run
>> all the tests on the fastest platform, and a subset on the rest.) I
>> understand that it's a pain to identify which tests to run, and we'll
>> bikeshed that endlessly, but it seems like it's a pretty straightforward
>> application of machine learning based on a very, very large dataset of
>> past failures. (Do we have enough history in a database somewhere for
>> that?)
>

> That won't be good enough unless the data includes test failures that
> the developers found and fixed _locally_ before their patch hitting
> the try server (or inbound or whatnot.) Based on my experience on
> Gecko, those local failures are by far the majority of the benefit
> that our tests have provided for me.

But does that matter? Developers should still run all the tests they
currently run locally. The cost/benefit calculation we're interested in
here is only with respect to what hits the server. If there's a great
test that catches a ton of failures locally, but is guaranteed to
succeed when run on the server because it's already done its work
locally, then we gain nothing by running it on the server.

Btw, does anybody actually run eg mochitests locally? I find it far too
painful, and only run them via try except when I'm trying to analyze a
failure. The js tests are useful locally. The xpcshell tests seem
pretty good too. But it still seems like a large subset of the tests
are, in practice, only useful on a server.

>> 5. Or go the other way, and make more tests runnable in parallel. More
>> efficient than #4 because it avoids the VM overhead, much harder to
>> implement, would also improve testing locally. (Though making it easy to
>> set up test VMs could help local testing too.) Needing window focus will
>> again bite us here.
>

> I'm not convinced that this is feasible in the short to middle term
> for any of our graphical test suites.

Me neither. I mostly threw it in for completeness.

>> 6. Which tests actually catch problems? Machine learning again.
>> Partition the individual tests within a suite into <intermittent,almost
>> always passes,good>. Don't run the ones that almost always pass as
>> often. ("Almost always pass" == "Almost always pass when buggy code is
>> committed", I mean.)
>

> Again, the story of missing local data.

And again, I assert that local data doesn't matter. :-)

Note that I'm thinking of mozilla-inbound here. Try pushes should be
based on try failure rates, and I would expect the useful subset of
tests to be much larger for try.

>> 7. Automatically cancel tests that are going to fail. Due to the
>> intermittent orange, this may be hard to determine, but if the same test
>> suite fails in the same way more than once (more than twice?), I'd say
>> it's good enough. Just be sure to report the canceled builds as
>> "cancelled because expected to fail based on builds B1 and B2".
>
> If you're talking about individual tests, then this won't help much
> with the infra load.

What do you mean by "tests"? I'm talking about test jobs, not the
individual tests within the jobs. Sorry for being vague.
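
For illustration, a sketch of the job-level cancellation check. The
tuple layout, the "failure signature" string, and the function name are
all assumptions; how two failures would be judged "the same" is left
open here.

```python
def cancellation_reason(suite, earlier_results, min_matches=2):
    """Decide whether a pending job for `suite` should be cancelled.

    earlier_results is a list of (build_id, suite_name, failure_signature)
    tuples, oldest first, where failure_signature is None for a green run.
    Returns the message to report, or None if the job should still run.
    """
    same_suite = [(build, sig) for build, name, sig in earlier_results if name == suite]
    recent = same_suite[-min_matches:]
    if len(recent) < min_matches:
        return None
    signatures = {sig for _, sig in recent}
    # Cancel only if the latest runs all failed, and all in the same way.
    if None in signatures or len(signatures) != 1:
        return None
    builds = " and ".join(build for build, _ in recent)
    return f"cancelled because expected to fail based on builds {builds}"
```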

>> 8. Make multiple inbounds. This would reduce the number of things piled
>> on top of a failure and reduce the cost of a tree closure, but would
>> also require another set of jobs per merge.
>
> Absolutely not. This is a very bad idea as it will increase the
> merging pain. Whatever infra optimizations that can happen on
> multiple inbounds can happen on the same inbound as well. It's just
> that thinking about multiple inbounds is easier!

To be pedantic, infra optimizations help in either case, but multiple
inbounds do have benefits that cannot be realized on a single inbound.
Those benefits just have to be balanced against the merge pain, and if
you assert that merge pain trumps pile-on pain, I am not going to doubt
you.

>> 9. Automate analysis. When a test suite fails, rerun only up to the
>> first dozen individual failed tests within that suite. Run on the
>> revisions before and after the first one that failed. (Do the same trick
>> to check whether an orange is intermittent first?) Make the individual
>> tests runnable independently to support this, or when that's too
>> painful, require metadata to describe the interdependencies.
>
> Nice thing you mentioned this, since our mochitest runner has gained
> the ability of re-running the failed tests. :-)

The JS test harness has this ability too (or at least, it can output a
list of failed tests to a file, and run the tests listed in a file.) Or
at least one of them does; we're still working on merging the two.

Having the functionality in the runner is only useful for the
intermittent check. For the rest, you need to communicate the list of
failed tests between jobs. But maybe that's what you mean the mochitest
runner can do? (Run on a list of failures, not just automatically
re-run failures within the same invocation.)
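
To make "communicate the list of failed tests between jobs" concrete,
here is a sketch that passes the list through a JSON file. The file
format and both function names are assumptions; neither harness is known
to use this layout.

```python
import json
from pathlib import Path


def write_failures(path, failed_tests):
    """Called by the failing job: record which individual tests failed."""
    Path(path).write_text(json.dumps(sorted(failed_tests)))


def read_failures(path, limit=12):
    """Called by a follow-up job: load at most `limit` failed tests to re-run.

    The default of 12 mirrors the "first dozen" from the proposal.
    """
    return json.loads(Path(path).read_text())[:limit]
```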

>> 12. Hire dedicated permasheriffs, at least to cover the crunch times.
>
> I hear they're a pretty rare crowd. :-)

Willing volunteers among the existing developers are rare, yes. But
would it be as hard to get responses to a job posting specifically for
this role?

>> 13. Beef up tbpl and related tools until a trained monkey can handle 95%
>> of the cases.
>
> I think we should automate whatever we can. RelEng has been working
> on this kind of automation, but I'm afraid with the current infra
> load, those types of automations cannot be implemented, since they
> require more machine time to run tests etc. :(

Fully understood. I don't blame them a bit.

Zack Weinberg

Aug 30, 2012, 3:55:22 PM

I like a lot of the ideas in your list, and would like to add another one:

16. Make available canned virtual machine images which precisely
replicate the configuration of the test runners, and scripts which will
use those VM images to auto-run the full set of tests (or a subset)
against a patch stack "just as" try would.

(For what I, personally, want this for, it would suffice to have Linux
images, but if the licensing issues can be sorted, Windows, OSX,
Android, etc images would be very useful as well.)

The point of this is, it's a PITA to run the full test suite locally.
Small variations in e.g. system fonts, video drivers, library patch
levels, etc. often cause spurious failures. More importantly, many of
the tests malfunction if *anything else* is happening in the same
desktop environment at the same time, so you have to set the tests
running and walk away from the computer and hope that nobody IMs you at
the wrong moment. Canned VMs with the "official" test environments
solve both those problems.

zw

Mike Connor

Aug 30, 2012, 4:08:46 PM

to Matt Brubeck, dev-pl...@lists.mozilla.org

On 2012-08-30 12:50 PM, Matt Brubeck wrote:
> On 08/30/2012 01:40 AM, Ms2ger wrote:
>> That means that about half of those commits could have been avoided by
>> simply using tryserver. I think instating a policy that punishes those
>> people that waste resources like this (be that by revoking their commit
>> access or through other means) would be a more effective way to reduce
>> load.
>
> Here's something I've considered (but not started actually doing) when
> acting as sheriff: When a patch lands that breaks the build, instead
> of just backing out that one patch, it could save time in the long run
> to also back out all the patches that landed on top of it -- except
> those that had recent green Try runs.

When we first implemented mozilla-inbound, this was a part of the
policy. If a change breaks stuff, we back out to last-known-good
changeset, notify the people we backed out, and move on. Since tree
watching isn't required, there isn't a major chunk of lost work if we
try this.
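
As a sketch, the policy amounts to the following, with the green-Try
exemption from the quoted message as an option. The function name and
the plain list of changeset ids are assumptions; no existing sheriffing
tool is implied.

```python
def changesets_to_back_out(pushes, first_bad, green_on_try=frozenset()):
    """Return the changesets to back out, newest first.

    pushes lists changeset ids oldest first; first_bad is the changeset
    that broke the tree. Everything that landed on top of it is backed
    out too, except changesets listed in green_on_try. The bad changeset
    itself is always backed out.
    """
    start = pushes.index(first_bad)
    to_back_out = [first_bad]
    to_back_out += [cs for cs in pushes[start + 1:] if cs not in green_on_try]
    return list(reversed(to_back_out))
```

With an empty `green_on_try` this is a plain reset to the last-known-good
changeset.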

-- Mike

L. David Baron

Aug 30, 2012, 4:12:23 PM

to Matt Brubeck, dev-pl...@lists.mozilla.org

I think this creates the wrong incentives: it will encourage people
to overuse try. For example, it will lead people to run tests on
all platforms for platform-agnostic work where running on one
platform is sufficient 99% of the time (and having to back out in
the 1% case is a much smaller drain on resources). And a push to
try is a bigger drain on infrastructure load than a push to inbound,
since it can't be coalesced.

If we're frequently hitting the problem that we have bustage on
inbound that's hard to get fixed because of bustage piled on top of
bustage, then maybe it would help to have more than one tree
operating under mozilla-inbound rules? That would distribute the
latency of getting test results across fewer pushes.

-David

--
𝄞 L. David Baron http://dbaron.org/ 𝄂
𝄢 Mozilla http://www.mozilla.org/ 𝄂

Steve Fink

Aug 30, 2012, 4:17:18 PM

to Zack Weinberg, dev-pl...@lists.mozilla.org

On 08/30/2012 12:55 PM, Zack Weinberg wrote:
> I like a lot of the ideas in your list, and would like to add another
> one:
>
> 16. Make available canned virtual machine images which precisely
> replicate the configuration of the test runners, and scripts which
> will use those VM images to auto-run the full set of tests (or a
> subset) against a patch stack "just as" try would.
>
> (For what I, personally, want this for, it would suffice to have Linux
> images, but if the licensing issues can be sorted, Windows, OSX,
> Android, etc images would be very useful as well.)

Heh. I just filed a closely related bug 2 days ago:
https://bugzilla.mozilla.org/show_bug.cgi?id=786510

But that's for build images, not test images. I think test VM images are
a great idea, for the reasons you give.

Kyle Huey

Aug 30, 2012, 4:17:42 PM

to Steve Fink, Ehsan Akhgari, dev-pl...@lists.mozilla.org, Matt Brubeck

On Thu, Aug 30, 2012 at 12:45 PM, Steve Fink <sf...@mozilla.com> wrote:

> Btw, does anybody actually run eg mochitests locally? I find it far too
> painful, and only run them via try except when I'm trying to analyze a
> failure. The js tests are useful locally. The xpcshell tests seem pretty
> good too. But it still seems like a large subset of the tests are, in
> practice, only useful on a server.

Yes. Nobody runs the whole test suite locally, but running the tests most
likely to be relevant to a given change is very common. Usually those are
the tests in the same directory as the code you're changing, but that
varies a lot.

- Kyle

Mike Connor

Aug 30, 2012, 4:23:51 PM

to L. David Baron, dev-pl...@lists.mozilla.org, Matt Brubeck

On 2012-08-30 4:12 PM, L. David Baron wrote:
> On Thursday 2012-08-30 09:50 -0700, Matt Brubeck wrote:

> I think this creates the wrong incentives: it will encourage people
> to overuse try. For example, it will lead people to run tests on
> all platforms for platform-agnostic work where running on one
> platform is sufficient 99% of the time (and having to back out in
> the 1% case is a much smaller drain on resources). And a push to
> try is a bigger drain on infrastructure load than a push to inbound,
> since it can't be coalesced.

I'm not actually sure why more people would use try here, since the only
penalty for probably-ok csets is "I have to re-land the patch that
landed on bustage."

When m-i was proposed, the idea was to back out everything on top of a
bad cset (and kill all pending builds/test runs), since there's no fast
way to re-validate each of those changes otherwise, and the knock-on
effects of tree closures hurt a lot more. I don't think the cost of
re-landing a patch is high enough to justify keeping inbound closed for
more than 15-30 minutes, especially since developers are not expected to
watch inbound.

> If we're frequently hitting the problem that we have bustage on
> inbound that's hard to get fixed because of bustage piled on top of
> bustage, then maybe it would help to have more than one tree
> operating under mozilla-inbound rules? That would distribute the
> latency of getting test results across fewer pushes.

I don't think it'd change latency appreciably, and would reduce
coalescing and increase the number of runs, so it might actually make it
worse.

-- Mike

Mike Connor

Aug 30, 2012, 4:42:57 PM

to Benoit Jacob, dev-pl...@lists.mozilla.org, Justin Lebar, Matt Brubeck

On 2012-08-30 2:36 PM, Benoit Jacob wrote:
> Except if this brings/forces a culture change, making developers much
> more careful about properly pushing to try before pushing to inbound,
> leading to fewer bad pushes on inbound.

If we're advocating increased try use to keep inbound greener, I think
it's a sign we've lost sight of the original point of having
mozilla-inbound, which was a place to reduce effort for devs by having a
tree that was explicitly _allowed_ to break, and didn't carry the heavy
individual and collective cost of breaking mozilla-central. If we start
treating mozilla-inbound as a "must be protected from bustage" tree,
there's little point in having it as an additional step. So I'm
completely clear:

Breaking mozilla-inbound should be 100% acceptable, and trivial for a
sheriff or others to fix.

The entire point is to enable a developer workflow of "this should be
good, pushing to -inbound, if it stays green I'm completely done" which
we can't have on try, and makes try more focused on "I think this might
break" patches rather than routine validation of patches.

-- Mike

Boris Zbarsky

Aug 30, 2012, 4:45:27 PM

On 8/30/12 4:23 PM, Mike Connor wrote:
> I'm not actually sure why more people would use try here, since the only
> penalty for probably-ok csets is "I have to re-land the patch that
> landed on bustage."

Relanding is not free. Just reimporting the changesets into an mq (or
otherwise rebasing, while dealing with bitrot) takes time.

> When m-i was proposed, the idea was to back out everything on top of a
> bad cset (and kill all pending builds/test runs)

And at some point there was talk of the sheriffs doing the relandings...

-Boris

Steve Fink

Aug 30, 2012, 4:55:32 PM

to Mike Connor, L. David Baron, dev-pl...@lists.mozilla.org, Matt Brubeck

On Thu 30 Aug 2012 01:23:51 PM PDT, Mike Connor wrote:
> When m-i was proposed, the idea was to back out everything on top of a
> bad cset (and kill all pending builds/test runs), since there's no
> fast way to re-validate each of those changes otherwise, and the
> knock-on effects of tree closures hurt a lot more. I don't think the
> cost of re-landing a patch is high enough to justify keeping inbound
> closed for more than 15-30 minutes, especially since developers are
> not expected to watch inbound.

Someone with a clue should correct me if I'm wrong:

As I understand it, the sheriffs have found in practice that individual
("surgical") backouts produce the best throughput. Note that it's not
uncommon for the first failure to show up -- or at least be identified
as real -- well after its push, and there can be a couple of backouts
and re-pushes for unrelated problems in the interim. Re-backing out,
necessitating a re-re-push, is ugly especially when it happens 8 hours
after the initial push and you have to do some serious rebasing the
next day -- when you've already done the work of fixing up whatever was
wrong.

The basic problem is that we have a certain flow rate of incoming
patches, and when the mozilla-inbound outgoing flow cannot accommodate
the inflow rate, things start piling up. Doing blanket backouts might
seem simpler, but because the overall average throughput is lower, you
really just get more stuff piled up. That backlog will then land in
rapid succession and inevitably trigger the next wave of backouts.

Surgical backouts also make more efficient use of the information
gathered by intervening jobs, and when it's a throughput problem,
efficiency is what matters.

Either that, or I'm full of crap and the whole thing is a social
problem caused by the sheriffs not wanting to annoy developers.
Although that's arguably a valid reason too -- the current setup mostly
avoids punishing the blameless, which sets up the right incentive
structure and hopefully moves people towards generally reducing their
crash landing rates. You don't want to move people towards "Why bother
pushing to try? Even if my push is fine, I'll probably get backed out
for somebody else's push anyway."

Mike Connor

Aug 30, 2012, 5:11:20 PM

to Boris Zbarsky, dev-pl...@lists.mozilla.org

On 2012-08-30 4:45 PM, Boris Zbarsky wrote:
> On 8/30/12 4:23 PM, Mike Connor wrote:
>> I'm not actually sure why more people would use try here, since the only
>> penalty for probably-ok csets is "I have to re-land the patch that
>> landed on bustage."
>
> Relanding is not free. Just reimporting the changesets into an mq (or
> otherwise rebasing, while dealing with bitrot) takes time.

I didn't say it was free. But it doesn't seem appreciably harder when
it happens than going through try, digging through results, rebasing on
top of everything that's landed while the try run finished, and then
landing on inbound.

I don't have a current sense of what inbound is like these days. How
often are people landing on bustage?

>> When m-i was proposed, the idea was to back out everything on top of a
>> bad cset (and kill all pending builds/test runs)
>
> And at some point there was talk of the sheriffs doing the relandings...

It's part of the sheriff duty, but not in 100% of cases, per
https://wiki.mozilla.org/Inbound_Sheriff_Duty#Sheriff_Duty

"If it's not possible to identify the guilty changeset, the sheriffs may
backout more changesets to minimize the overhead/time to fix the tree.
Completely innocent pushes will be relanded by the sheriff once the
bustage is cleared or, in case of doubts, a Try server run will be
requested in the bug, before the next landing"

-- Mike

Ehsan Akhgari

Aug 30, 2012, 5:41:13 PM

to Steve Fink, dev-pl...@lists.mozilla.org, Matt Brubeck

On 12-08-30 3:45 PM, Steve Fink wrote:
> On Thu 30 Aug 2012 12:15:53 PM PDT, Ehsan Akhgari wrote:
>> On 12-08-30 2:33 PM, Steve Fink wrote:
>>> Story: I do a push. My push touches some subset of the code, probably a
>>> very small subset. We then run (almost) all tests on all platforms and
>>> configurations. (The exception I am aware of is that touching js/src
>>> will kick off some additional tests.) So the opportunities include:
>>
>> JS is sort of special in that regard. Many parts of Gecko are so
>> intertwined into another that the idea of running only the relevant
>> subset of tests sounds like daydreaming!
>
> If you have to do it manually, yes.

Do you have any idea how one would approach doing this automatically?

>>> 1. Don't run all platforms x configurations. For example, we need all
>>> the opt builds if we're going to run performance tests, but do we need
>>> debug builds on every platform? Or vice versa, the debug tests do more
>>> checking and so are more useful, so perhaps we shouldn't do perf tests
>>> everywhere.
>>
>> Choosing the subset of the matrix is very hard to decide...
>
> Fortunately, you don't have to get it right. The cost of getting it
> wrong is that you have a few more pushes coalesced for the failure.

Fair enough.

>>> 2. Don't run all of the tests. Run tests based on what changed. (Or run
>>> all the tests on the fastest platform, and a subset on the rest.) I
>>> understand that it's a pain to identify which tests to run, and we'll
>>> bikeshed that endlessly, but it seems like it's a pretty straightforward
>>> application of machine learning based on a very, very large dataset of
>>> past failures. (Do we have enough history in a database somewhere for
>>> that?)
>>
>> That won't be good enough unless the data includes test failures that
>> the developers found and fixed _locally_ before their patch hitting
>> the try server (or inbound or whatnot.) Based on my experience on
>> Gecko, those local failures are by far the majority of the benefit
>> that our tests have provided for me.
>
> But does that matter? Developers should still run all the tests they
> currently run locally. The cost/benefit calculation we're interested in
> here is only with respect to what hits the server. If there's a great
> test that catches a ton of failures locally, but is guaranteed to
> succeed when run on the server because it's already done its work
> locally, then we gain nothing by running it on the server.

Hmm, OK, maybe you're right here. The key thing which I was not
thinking about (and that you mentioned below) was that this doesn't need
to be the same data used on the try server.

At any rate, I think ideas like this must be approached very carefully.

> Btw, does anybody actually run eg mochitests locally? I find it far too
> painful, and only run them via try except when I'm trying to analyze a
> failure. The js tests are useful locally. The xpcshell tests seem pretty
> good too. But it still seems like a large subset of the tests are, in
> practice, only useful on a server.

Yes, running a subset of tests locally is very common.

>>> 7. Automatically cancel tests that are going to fail. Due to the
>>> intermittent orange, this may be hard to determine, but if the same test
>>> suite fails in the same way more than once (more than twice?), I'd say
>>> it's good enough. Just be sure to report the canceled builds as
>>> "cancelled because expected to fail based on builds B1 and B2".
>>
>> If you're talking about individual tests, then this won't help much
>> with the infra load.
>
> What do you mean by "tests"? I'm talking about test jobs, not the
> individual tests within the jobs. Sorry for being vague.

OK, this makes sense for whole test jobs.

>>> 8. Make multiple inbounds. This would reduce the number of things piled
>>> on top of a failure and reduce the cost of a tree closure, but would
>>> also require another set of jobs per merge.
>>
>> Absolutely not. This is a very bad idea as it will increase the
>> merging pain. Whatever infra optimizations that can happen on
>> multiple inbounds can happen on the same inbound as well. It's just
>> that thinking about multiple inbounds is easier!
>
> To be pedantic, infra optimizations help in either case, but multiple
> inbounds do have benefits that cannot be realized on a single inbound.
> Those benefits just have to be balanced against the merge pain, and if
> you assert that merge pain trumps pile-on pain, I am not going to doubt
> you.

Yes, that is what I'm asserting. :-)

>>> 9. Automate analysis. When a test suite fails, rerun only up to the
>>> first dozen individual failed tests within that suite. Run on the
>>> revisions before and after the first one that failed. (Do the same trick
>>> to check whether an orange is intermittent first?) Make the individual
>>> tests runnable independently to support this, or when that's too
>>> painful, require metadata to describe the interdependencies.
>>
>> Nice thing you mentioned this, since our mochitest runner has gained
>> the ability of re-running the failed tests. :-)
>
> The JS test harness has this ability too (or at least, it can output a
> list of failed tests to a file, and run the tests listed in a file.) Or
> at least one of them does; we're still working on merging the two.
>
> Having the functionality in the runner is only useful for the
> intermittent check. For the rest, you need to communicate the list of
> failed tests between jobs. But maybe that's what you mean the mochitest
> runner can do? (Run on a list of failures, not just automatically re-run
> failures within the same invocation.)

No, the mochitest runner capability is only useful for the intermittent
check...

>>> 12. Hire dedicated permasheriffs, at least to cover the crunch times.
>>
>> I hear they're a pretty rare crowd. :-)
>
> Willing volunteers among the existing developers are rare, yes. But
> would it be as hard to get responses to a job posting specifically for
> this role?

I have no idea!

Cheers,
Ehsan

Ehsan Akhgari

Aug 30, 2012, 5:49:27 PM

to Mike Connor, Boris Zbarsky, dev-pl...@lists.mozilla.org

On 12-08-30 5:11 PM, Mike Connor wrote:
> On 2012-08-30 4:45 PM, Boris Zbarsky wrote:

>> On 8/30/12 4:23 PM, Mike Connor wrote:
>>> I'm not actually sure why more people would use try here, since the only
>>> penalty for probably-ok csets is "I have to re-land the patch that
>>> landed on bustage."
>>
>> Relanding is not free. Just reimporting the changesets into an mq (or
>> otherwise rebasing, while dealing with bitrot) takes time.
>

> I didn't say it was free. But it doesn't seem appreciably harder when
> it happens than going through try, digging through results, rebasing on
> top of everything that's landed while the try run finished, and then
> landing on inbound.
>
> I don't have a current sense of what inbound is like these days, how
> often are people landing on bustage?

Very often. Too often.

I think a big problem now, one that makes it harder to identify which
patch to back out, is the excessive load, which results in excessive
coalescing. But originally I think the biggest reason why the "reset to
the last known good changeset" rule was not practiced very often was
that the sheriffs were trying to avoid causing more work for developers,
and now occasionally people are surprised when they get backed out after
landing on bustage.

Ehsan

Steve Fink

Aug 30, 2012, 6:19:29 PM

to Ehsan Akhgari, dev-pl...@lists.mozilla.org, Matt Brubeck

On Thu 30 Aug 2012 02:41:13 PM PDT, Ehsan Akhgari wrote:
> On 12-08-30 3:45 PM, Steve Fink wrote:
>> On Thu 30 Aug 2012 12:15:53 PM PDT, Ehsan Akhgari wrote:
>>> On 12-08-30 2:33 PM, Steve Fink wrote:
>>>> Story: I do a push. My push touches some subset of the code,
>>>> probably a
>>>> very small subset. We then run (almost) all tests on all platforms and
>>>> configurations. (The exception I am aware of is that touching js/src
>>>> will kick off some additional tests.) So the opportunities include:
>>>
>>> JS is sort of special in that regard. Many parts of Gecko are so
>>> intertwined with one another that the idea of running only the relevant
>>> subset of tests sounds like daydreaming!
>>
>> If you have to do it manually, yes.
>
> Do you have any idea how one would approach doing this automatically?

Well, I don't know much of anything about machine learning. But if I
were to handwave, I would bucket the source tree by top-level directory
and significant-sounding 2nd-level directories. (This selection can
actually be learned too, but let's keep it simple for now.) Then make a
2d matrix with the buckets being columns, each test a row. For each
test failure in the last 12 bajillion pushes, increment the cell for
that test in every column whose bucket the push touched. Divide the
whole matrix by 12 bajillion. Zero out teeny tiny numbers.

When considering a push, look in all columns corresponding to the files
touched by the push. Run all the tests corresponding to rows with
nonzeroes in them. ("Test" is a type of test job, though you could do
the same thing with individual tests if you wanted to go more granular.)

The matrix is just the probability that test X fails given that your
push has a detectable failure. Pr(X fails|at least 1 failure if running
whole suite).
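
A minimal Python sketch of the matrix described above, assuming a hypothetical push history in which each record lists the files a push touched and the test jobs that failed on it. The names `bucket_for`, `build_matrix`, and `jobs_to_run`, the two-directory bucketing rule, and the `floor` cutoff are all illustrative assumptions, not existing tooling.

```python
from collections import defaultdict


def bucket_for(path, depth=2):
    """Bucket a source path by its first `depth` directories (an assumed rule)."""
    dirs = path.split("/")[:-1]  # drop the file name
    return "/".join(dirs[:depth]) or "<root>"


def build_matrix(pushes, floor=0.01):
    """Build {bucket: {job: score}} from (touched_files, failed_jobs) pairs.

    score = (pushes that touched the bucket and failed the job) / (all pushes).
    Cells scoring below `floor` are dropped, i.e. treated as zero.
    """
    pushes = list(pushes)
    counts = defaultdict(lambda: defaultdict(int))
    for touched_files, failed_jobs in pushes:
        for bucket in {bucket_for(f) for f in touched_files}:
            for job in failed_jobs:
                counts[bucket][job] += 1
    matrix = {}
    for bucket, row in counts.items():
        kept = {job: n / len(pushes) for job, n in row.items()
                if n / len(pushes) >= floor}
        if kept:
            matrix[bucket] = kept
    return matrix


def jobs_to_run(matrix, touched_files):
    """Union of jobs with a nonzero cell in any bucket the push touches."""
    jobs = set()
    for bucket in {bucket_for(f) for f in touched_files}:
        jobs.update(matrix.get(bucket, {}))
    return jobs
```

Note that a push touching only buckets with no recorded failures gets an empty job set here; a real scheduler would presumably want a fallback for that case.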

This is actually wrong, because many bad pushes will cause all of the
tests to fail, and that should provide far less support for running a
given test than if only that one test failed, but getting that right
would require somebody who actually knows this stuff. You really want a
subset S of tests to run such that Pr(some test in S fails|at least 1
test in the full set fails) > 0.99, and cost(S) is minimized, but the
tests are correlated so you can't just keep grabbing the "best" test
from what's left until you hit your threshold... never mind, I'll shut
up. Maybe I should go take that online machine learning class or
something. Or maybe the metrics guys would be all over this.

I suppose if there aren't that many types of tests, you could just use
all possible subsets of tests for the rows, and pick the smallest
subset that has a value over your threshold.
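
The brute-force version of this can be sketched as follows, assuming a hypothetical list of past failing pushes (each a set of failed job names) and a made-up per-job cost table. Coverage is estimated directly from the observed failure sets, so correlation between jobs is accounted for without modelling it. `coverage` and `cheapest_subset` are illustrative names only.

```python
from itertools import combinations


def coverage(subset, failing_pushes):
    """Fraction of failing pushes in which at least one job in `subset` failed."""
    hits = sum(1 for failed in failing_pushes if failed & subset)
    return hits / len(failing_pushes)


def cheapest_subset(costs, failing_pushes, threshold=0.99):
    """Try every non-empty subset of jobs; keep the cheapest one whose
    coverage exceeds `threshold`. Returns (subset, cost), or (None, inf)
    if no subset qualifies.
    """
    jobs = sorted(costs)
    best, best_cost = None, float("inf")
    for size in range(1, len(jobs) + 1):
        for combo in combinations(jobs, size):
            subset = frozenset(combo)
            cost = sum(costs[job] for job in subset)
            if cost < best_cost and coverage(subset, failing_pushes) > threshold:
                best, best_cost = subset, cost
    return best, best_cost
```

This examines 2^n - 1 subsets for n job types, so it is only practical when, as suggested above, there aren't that many types of tests.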

Anyway, the naive approach might work well enough.

Jeff Hammel

Aug 30, 2012, 6:27:37 PM

to dev-pl...@lists.mozilla.org

I believe for our code base that so much depends on so much that this
sort of division will not work. While I would love to see our tests
bucketized, such that if a change in (say) layout would only run layout
tests, I think just figuring out what tests would have to be run for
what files would be very hard. I also think the answer, with probably a
few special cases, is that most changes could, at least in theory,
affect most tests.

Zack Weinberg

Aug 30, 2012, 6:45:16 PM

to

On 2012-08-30 4:42 PM, Mike Connor wrote:
> On 2012-08-30 2:36 PM, Benoit Jacob wrote:
>> Except if this brings/forces a culture change, making developers much
>> more careful about properly pushing to try before pushing to inbound,
>> leading to fewer bad pushes on inbound.
>
> If we're advocating increased try use to keep inbound greener, I think
> it's a sign we've lost sight of the original point of having
> mozilla-inbound, which was a place to reduce effort for devs by having a
> tree that was explicitly _allowed_ to break, and didn't carry the heavy
> individual and collective cost of breaking mozilla-central.

All along in this conversation I've had the feeling that my
understanding of "how things are done" was out of kilter with lots of
other people's, but now I finally realize what it is.

My process is: only ask for review *after* the patch is green on try.
Until then, for all I know I'm going to need major architectural changes
just to make the testsuite happy, and there's no point wasting
reviewers' time, which is a *far* scarcer resource in this organization
than CPU hours.

I thought this was what everyone did, and the inbound queue was just a
way to make dealing with intermittent orange less painful for everyone
except the sheriffs, not a try-substitute. Apparently I'm wrong? What
_is_ the expected process, then?

zw

Kyle Huey

Aug 30, 2012, 7:02:24 PM

to Zack Weinberg, dev-pl...@lists.mozilla.org

On Thu, Aug 30, 2012 at 3:45 PM, Zack Weinberg <za...@panix.com> wrote:

> My process is: only ask for review *after* the patch is green on try.
> Until then, for all I know I'm going to need major architectural changes
> just to make the testsuite happy, and there's no point wasting reviewers'
> time, which is a *far* scarcer resource in this organization than CPU hours.
>

For a lot of changes, one can be reasonably certain that will not be the
case beforehand.

- Kyle

L. David Baron

Aug 30, 2012, 7:11:10 PM

to Zack Weinberg, dev-pl...@lists.mozilla.org

On Thursday 2012-08-30 18:45 -0400, Zack Weinberg wrote:
> My process is: only ask for review *after* the patch is green on
> try. Until then, for all I know I'm going to need major
> architectural changes just to make the testsuite happy, and there's
> no point wasting reviewers' time, which is a *far* scarcer resource
> in this organization than CPU hours.

But there's also the opposite problem, which is that in some cases
reviewers might require changes that will invalidate (or make
unnecessary) the work that's been done to get the patch green.

I don't think there's a single correct solution here. Testing and
peer review are both tools we use to improve the quality of our
code; they don't necessarily belong in a particular order.

Nicholas Nethercote

Aug 30, 2012, 7:14:26 PM

to Kyle Huey, dev-pl...@lists.mozilla.org, Zack Weinberg

On Thu, Aug 30, 2012 at 4:02 PM, Kyle Huey <m...@kylehuey.com> wrote:
>
>> My process is: only ask for review *after* the patch is green on try.
>> Until then, for all I know I'm going to need major architectural changes
>> just to make the testsuite happy, and there's no point wasting reviewers'
>> time, which is a *far* scarcer resource in this organization than CPU hours.
>>
>

> For a lot of changes, one can be reasonably certain that will not be the
> case beforehand.

A theme of this thread: Your Experience Is Not Universal (YEINU).

Nick

Steve Fink

Aug 30, 2012, 8:48:44 PM

to Jeff Hammel, dev-pl...@lists.mozilla.org

On 08/30/2012 03:27 PM, Jeff Hammel wrote:
> I believe for our code base that so much depends on so much that this
> sort of division will not work. While I would love to see our tests
> bucketized, such that if a change in (say) layout would only run
> layout tests, I think just figuring out what tests would have to be
> run for what files would be very hard. I also think the answer, with
> probably a few special cases, is that most changes could, at least in
> theory, affect most tests.

Maybe you're right. But one thing I want to make clear -- this doesn't
require changes in layout to only run layout tests. This is *only* based
on after-the-fact observation of what tests break from layout changes.
So if they run tests outside of layout, but those tests never break as a
result of layout changes, then this wouldn't run them. But if those
tests *did* break, then we would. There is nothing in this algorithm
that uses the location of tests for anything. (The source tree-based
"buckets" I was referring to were only for code.)

The source location of tests is probably correlated with the test jobs
that we have defined (M1 vs Moth vs xpcshell tests etc), and those are
probably not the ideal buckets for this purpose. But they're probably
not horrible, and that's what I imagine the easy-to-gather data is based
on, so perhaps they're good enough. If the data is available, it's easy
enough to test -- generate the matrix, and look for zeroes. If there
aren't many, then the bucketing on one or the other axis is useless.
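
The "look for zeroes" check could be as small as the sketch below. It assumes the matrix is held as a nested dict `{bucket: {job: score}}` with missing entries meaning zero; that representation and the name `zero_fraction` are assumptions made for illustration.

```python
def zero_fraction(matrix, buckets, jobs):
    """Fraction of (bucket, job) cells with no observed failures.

    A value near 0 means almost every bucket has broken almost every job,
    so the bucketing would save little; a value near 1 means the matrix is
    sparse and many jobs could be skipped for many pushes.
    """
    cells = len(buckets) * len(jobs)
    nonzero = sum(
        1 for bucket in buckets for job in jobs
        if matrix.get(bucket, {}).get(job, 0) > 0
    )
    return 1 - nonzero / cells
```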

Sorry; maybe that's what you meant in the first place. All I'm saying is
that "...most changes could, at least in theory, affect most tests"
doesn't matter. What matters is what happens *in practice*.

I strongly suspect that *some* particular partitioning would work well
for this, even if it didn't end up making a whole lot of sense just
looking at it. If we had data on exact tests that failed, we could
automatically generate a good partitioning. But that would mean
shuffling around tests between test suites, which would kinda suck.

Steve Fink

Aug 30, 2012, 8:50:37 PM

to L. David Baron, dev-pl...@lists.mozilla.org, Zack Weinberg

On 08/30/2012 04:11 PM, L. David Baron wrote:
> On Thursday 2012-08-30 18:45 -0400, Zack Weinberg wrote:

>> My process is: only ask for review *after* the patch is green on
>> try. Until then, for all I know I'm going to need major
>> architectural changes just to make the testsuite happy, and there's
>> no point wasting reviewers' time, which is a *far* scarcer resource
>> in this organization than CPU hours.

> But there's also the opposite problem, which is that in some cases
> reviewers might require changes that will invalidate (or make
> unnecessary) the work that's been done to get the patch green.
>
> I don't think there's a single correct solution here. Testing and
> peer review are both tools we use to improve the quality of our
> code; they don't necessarily belong in a particular order.
>
> -David
>

Amen. I've done it both ways, so from bitter experience I can tell you
that they're both wrong. :-|

Boris Zbarsky

Aug 30, 2012, 10:18:56 PM

to

On 8/30/12 5:11 PM, Mike Connor wrote:
> I didn't say it was free. But it doesn't seem appreciably harder when
> it happens than going through try, digging through results, rebasing on
> top of everything that's landed while the try run finished, and then
> landing on inbound.

It's easier because you don't have to reimport the patches in any way.

Consider what's involved in relanding something like
https://bugzilla.mozilla.org/show_bug.cgi?id=655877 or any of a number
of other layout/DOM bugfixes that come in the form of multiple
changesets, just in terms of the branch mechanics.

> I don't have a current sense of what inbound is like these days, how
> often are people landing on bustage?

Often enough. People do push without looking, since that's kinda the
premise of inbound.

But more importantly, people are landing _before_ the bustage. Let's
just run the numbers here, and apologies for the length that follows....

http://oduinn.com/blog/2012/08/04/infrastructure-load-for-july-2012/
says we had 0.261 * 5635 == 1470 pushes to inbound last month. That's
an average of two per hour if they were uniformly spread across all
times of day and all days of the month. Except they're not, of course.
Eyeballing the linked graph at
http://oduinn.com/images/2012/blog_2012_07_pushes.png it looks like on
weekdays we have 70-ish pushes per day to inbound, while on weekends the
volume is a lot lower. There is no chart of pushes-by-hour for inbound
(though I bet John can get you one if you want), but pushes-by-hour
overall is at
http://oduinn.com/images/2012/blog_2012_07_pushes_per_hour.png and also
decidedly nonuniform.

The upshot is that we have pushes to inbound every 10-15 minutes if not
more often during "peak" times like actual work time in the US.

Now looking at Ms2ger's stats earlier in this thread, 25 commits out of
the 82 commits that were not merges or backouts got backed out. So
figure one commit in 3 is going to go red or orange. How long does that
take to happen? If it's going to go red, probably at least 10 minutes,
might be as long as 20 if it shows up on Linux or 50+ if Windows is
involved. That's assuming no queuing. If it's going to go orange,
figure at least 30, more likely 40-60, might be as long as 100+ minutes
if Windows is involved. Again, if there's no queuing. Let's be
generous and say 30 minutes on average to find out that a checkin is
busted, and pushes come every 15 minutes.

Now say you have a developer who _never_ pushes anything broken, but is
pushing in this environment and whenever bustage happens everything on
top of the bustage gets backed out. By the time this developer pushes,
there are on average (this is a bit bogus; the actual distribution
almost certainly matters here for exact numbers, though probably not for
ballpark estimates) two changesets in the tree that have no test results
yet. The chance that they will both be green is (2/3)*(2/3) = 4/9.

So if this developer never lands anything orange, and with what I think
are rather generous assumptions about test latency and checkin
frequency, he or she will get backed out 55% of the time. At least if
the checkin range ms2ger categorized is representative. If we have
better data, I'm all ears.
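
The arithmetic above can be reproduced with a few lines of Python. This is only a restatement of the estimate under the same simplifying assumptions (independent pushes, a fixed detection time, evenly spaced pushes); `backout_probability` is an illustrative name, not part of any tool.

```python
# Figures quoted above: inbound's share of all pushes in July 2012.
pushes_last_month = 0.261 * 5635
per_hour_if_uniform = pushes_last_month / (31 * 24)  # July has 31 days


def backout_probability(minutes_to_detect, minutes_between_pushes, p_good=2 / 3):
    """Chance that a correct push lands on top of at least one bad push
    whose results are not in yet, and so gets backed out along with it.

    p_good is the chance that any single pending push is fine.
    """
    pending = minutes_to_detect / minutes_between_pushes
    return 1 - p_good ** pending


# The case worked through above: 30 minutes to detect, a push every 15.
estimate = backout_probability(30, 15)        # 1 - 4/9, about 0.56

# Same latency with a push every 10 minutes: three pending pushes.
busier = backout_probability(30, 10)          # 1 - 8/27, about 0.70

# Using 25/82 unrounded instead of "one in 3" gives roughly 0.52.
unrounded = backout_probability(30, 15, p_good=1 - 25 / 82)
```

The second call illustrates the point about volume: with test latency held fixed, more frequent pushes raise the odds of being caught in someone else's backout.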

Note that as checkin volume rises, this problem gets worse unless test
latency drops in parallel.

When m-i was first put in place, checkin volume was lower, by the way.

By the way, this is something multiple integration branches _could_ help
with, if we're gated only on test latency, not test bandwidth. Sadly,
it sounds like a large part of our latency

-Boris

Nicholas Nethercote

Aug 30, 2012, 11:48:29 PM

to Boris Zbarsky, dev-pl...@lists.mozilla.org

philor (who knows as much about this stuff as anyone) just mentioned
the following on IRC:

"did anyone point out that we take 60 minutes to run Win xpcshell,
when locally it takes 7 minutes, or that we build and test desktop on
pushes that only touch mobile/ or b2g/?"

Sounds like two pieces of large, low-hanging fruit.

Nick

Robert O'Callahan

Aug 31, 2012, 12:39:49 AM

to dev-pl...@lists.mozilla.org

If we could run Linux functional tests on AWS, then maybe we could keep the
Linux build/functional-test backlog at zero and encourage people to try the
Linux non-functional tests before every non-trivial commit to inbound. It
seems to me that would greatly reduce bustage. (I suppose we have enough
data for someone to compute the fraction of bustage-inducing pushes that
did not break Linux functional tests.)

Rob
--
“You have heard that it was said, ‘Love your neighbor and hate your enemy.’
But I tell you, love your enemies and pray for those who persecute you,
that you may be children of your Father in heaven. ... If you love those
who love you, what reward will you get? Are not even the tax collectors
doing that? And if you greet only your own people, what are you doing more
than others?" [Matthew 5:43-47]

Ben Hearsum

Aug 31, 2012, 8:16:33 AM

to rob...@ocallahan.org, dev-pl...@lists.mozilla.org

On 08/31/12 12:39 AM, Robert O'Callahan wrote:
> If we could run Linux functional tests on AWS, then maybe we could keep the
> Linux build/functional-test backlog at zero and encourage people to try the
> Linux non-functional tests before every non-trivial commit to inbound. It
> seems to me that would greatly reduce bustage. (I suppose we have enough
> data for someone to compute the fraction of bustage-inducing pushes that
> did not break Linux functional tests.)

I know this is in the works (sorry, I don't know which bug it's happening in),
but we can't quite run all of our unit tests on AWS. Anything that
depends on a GPU (reftest, some crashtests, and even some mochitests
I've heard) can't run there. We definitely want to move everything we
can to the cloud, though.

Ehsan Akhgari

Aug 31, 2012, 10:50:06 AM

to L. David Baron, dev-pl...@lists.mozilla.org, Zack Weinberg

On 12-08-30 7:11 PM, L. David Baron wrote:
> On Thursday 2012-08-30 18:45 -0400, Zack Weinberg wrote:

>> My process is: only ask for review *after* the patch is green on
>> try. Until then, for all I know I'm going to need major
>> architectural changes just to make the testsuite happy, and there's
>> no point wasting reviewers' time, which is a *far* scarcer resource
>> in this organization than CPU hours.
>

> But there's also the opposite problem, which is that in some cases
> reviewers might require changes that will invalidate (or make
> unnecessary) the work that's been done to get the patch green.
>
> I don't think there's a single correct solution here. Testing and
> peer review are both tools we use to improve the quality of our
> code; they don't necessarily belong in a particular order.

Usually when I'm working on a bug, my goal is to get the patch landed as
soon as possible and move on to other work. I have a hard time when I
have a lot of pending patches (more than 5 really) since they incur a
cognitive load for me as I always have to keep thinking about them. On
average, the single biggest thing which gets in the way of me landing
the patch is waiting for the review. In many cases, all of the other
steps in fixing a bug (understanding the bug, thinking of a solution,
coding it, testing it, pushing to the try server and waiting for
results) take less time than it takes for the reviewer to start looking
at my patch. For this reason, I optimize for attaching the patch
to the bug and asking for review *as soon as possible*.

Ehsan

Ehsan Akhgari

Aug 31, 2012, 11:04:53 AM

to Steve Fink, dev-pl...@lists.mozilla.org, Jeff Hammel

To give you a concrete example of why this kind of stuff does not work,
when I was working on bug 157681 (which was a layout optimization), I
came across a single browser-chrome test failure happening only on Mac,
which seemed pretty unrelated to my changes at first, but it turned out
that it actually uncovered a subtle bug in my patch which none of the
other layout tests that we have managed to catch.

This kind of stuff is rare, true, but it happens frequently enough that
it really matters. I don't think we can seriously consider bucketing
tests based on which files have changed in a patch without losing this
important aspect of catching bugs in patches -- except perhaps for
extremely localized components of the code.

Ehsan

Ehsan Akhgari

Aug 31, 2012, 11:14:01 AM

to Ben Hearsum, dev-pl...@lists.mozilla.org, rob...@ocallahan.org

I think that only mochitests which test canvas fall into that category,
which means mochitest-1 (and we could bucket up those tests into a
separate suite if needed). The rest should be possible to be pushed to
the cloud.

Ehsan

Ehsan Akhgari

Aug 31, 2012, 11:15:02 AM

to Nicholas Nethercote, Boris Zbarsky, dev-pl...@lists.mozilla.org

Except that as I understand things, we don't have a reliable way to
handle them, since our infrastructure is only capable of looking at the
tip of a push, not every changeset in it.

Ehsan

Chris AtLee

Aug 31, 2012, 11:33:03 AM

to

That's just how it's currently implemented; it's certainly changeable
with enough effort.
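
As a rough sketch of what looking at every changeset might involve, assume a push is available as a list of changesets, each a dict with a `"files"` list. That shape, the prefix list, and both function names are hypothetical; this is not how the current infrastructure represents pushes.

```python
# Directories whose changes are assumed not to affect desktop builds.
DESKTOP_IRRELEVANT_PREFIXES = ("mobile/", "b2g/")


def files_in_push(changesets):
    """Union of the files touched by every changeset in a push, not just the tip."""
    touched = set()
    for changeset in changesets:
        touched.update(changeset["files"])
    return touched


def needs_desktop_jobs(changesets):
    """True unless every touched file lives under an irrelevant prefix."""
    files = files_in_push(changesets)
    if not files:
        return True  # nothing known about the push: be conservative
    return any(not f.startswith(DESKTOP_IRRELEVANT_PREFIXES) for f in files)
```

Checking only the tip would wrongly skip desktop jobs for a push whose last changeset touches mobile/ but whose earlier changesets touch shared code, which is why the union matters.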

Are the win xpcshell test times something to be concerned about? Are
there other tests that are taking unreasonably long?

Ehsan Akhgari

Aug 31, 2012, 11:40:31 AM

to Chris AtLee, dev-pl...@lists.mozilla.org

On 12-08-31 11:33 AM, Chris AtLee wrote:
> On 31/08/12 11:15 AM, Ehsan Akhgari wrote:
>> On 12-08-30 11:48 PM, Nicholas Nethercote wrote:
>>> philor (who knows as much about this stuff as anyone) just mentioned
>>> the following on IRC:
>>>
>>> "did anyone point out that we take 60 minutes to run Win xpcshell,
>>> when locally it takes 7 minutes, or that we build and test desktop on
>>> pushes that only touch mobile/ or b2g/?"
>>>
>>> Sounds like two pieces of large, low-hanging fruit.
>>
>> Except that as I understand things, we don't have a reliable way to
>> handle them, since our infrastructure is only capable of looking at the
>> tip of a push, not every changeset in it.
>
> That's just how it's currently implemented; it's certainly changeable
> with enough effort.

Good point! -> bug 787449

> Are the win xpcshell test times something to be concerned about? Are
> there other tests that are taking unreasonably long?

Absolutely! Filed bug 787448 for the investigation on why this happens.
I don't know if the same problem happens with other tests as well.

Ehsan

Robert Kaiser

Aug 31, 2012, 12:15:18 PM

to

Mike Connor wrote:
> Looking at the data, we're still at around 2/3 Linux32 on Fx13/14

Are you looking at the "release" channel there or at the "default"
channel as well? Distro builds are usually on the "default" channel so
that we don't provide updates (as the distro does that).

Robert Kaiser

Steve Fink

Aug 31, 2012, 12:55:38 PM

to Ehsan Akhgari, dev-pl...@lists.mozilla.org, Jeff Hammel

I can't argue convincingly without concrete data, but this sounds wrong
to me.

You give an example of where the test restrictions would fail due to the
bucketing, but you also say "This kind of stuff is rare...". So when
something like this happens, you wouldn't get a test build and wouldn't
see the failure until several pushes later when the test *did* get run.
So we get bad coalescing in rare cases.

In return, we lower the infrastructure load across the board, resulting
in less coalescing in the common case.

I think the tradeoff is likely to be worth it, but it totally depends on
the numbers. And predicting how much coalescing will be reduced, but
only during busy times when it matters, based on a certain reduction in
test load, is Hard.

Your example does point out that we'd also want test suppression to be
relative to current load -- no need to suppress any tests during off
hours, and in fact you'd probably want to set the threshold based on
current activity/backlog. Perhaps that makes it more palatable: "we're
overloaded and can't run everything, so what jobs would be least harmful
if we suppressed them?"
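
A load-relative threshold might look something like the sketch below. The linear ramp, the `max_floor` value, and the idea of measuring load as pending jobs over capacity are all assumptions chosen to keep the example small; the per-job failure probabilities are taken as given.

```python
def suppression_floor(pending_jobs, capacity, max_floor=0.05):
    """Minimum failure probability a job needs in order to be scheduled.

    At or below capacity the floor is 0, so nothing is suppressed. Above
    capacity it rises linearly with the overload, capped at `max_floor`.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    overload = max(0.0, pending_jobs / capacity - 1.0)
    return min(max_floor, max_floor * overload)


def select_jobs(failure_prob, pending_jobs, capacity):
    """Keep the jobs whose estimated failure probability meets the floor."""
    floor = suppression_floor(pending_jobs, capacity)
    return {job for job, p in failure_prob.items() if p >= floor}
```

During off hours the floor is zero and every job runs, matching the point that there is no need to suppress anything when the machines are idle.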

Ehsan Akhgari

Aug 31, 2012, 3:02:32 PM

to Steve Fink, dev-pl...@lists.mozilla.org, Jeff Hammel

OK, thinking more about this, I see your point now. And I definitely
agree that this is the sort of thing which is hard to evaluate without
the data.

> Your example does point out that we'd also want test suppression to be
> relative to current load -- no need to suppress any tests during off
> hours, and in fact you'd probably want to set the threshold based on
> current activity/backlog. Perhaps that makes it more palatable: "we're
> overloaded and can't run everything, so what jobs would be least harmful
> if we suppressed them?"

Makes sense.

Cheers,
Ehsan

Chris Pearce

Aug 31, 2012, 9:08:25 PM

to

On 31/08/12 07:15, Ehsan Akhgari wrote:
> On 12-08-30 2:33 PM, Steve Fink wrote:
>

>> 5. Or go the other way, and make more tests runnable in parallel. More
>> efficient than #4 because it avoids the VM overhead, much harder to
>> implement, would also improve testing locally. (Though making it easy to
>> set up test VMs could help local testing too.) Needing window focus will
>> again bite us here.
>
> I'm not convinced that this is feasible in the short to middle term
> for any of our graphical test suites.
>

The <audio> and <video> mochitests have a pretty simple test manager [1]
written in JS which can run multiple sub-tests in parallel. The level of
parallelism can be cranked up or down by changing a simple parameter
[2]. Not all mochitests could be written like this (fullscreen
mochitests couldn't for example), but some of the slower running tests
may be able to be refactored to use techniques like this.
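
The general technique, running independent sub-tests concurrently with a tunable cap, can be illustrated in Python with asyncio. This is an analogy only and not a port of the JS test manager referenced above; `run_all` and its `parallelism` parameter are hypothetical names.

```python
import asyncio


async def run_all(subtests, parallelism=2):
    """Run the coroutine functions in `subtests` ({name: fn}) with at most
    `parallelism` of them in flight at once. Returns {name: result}.
    """
    gate = asyncio.Semaphore(parallelism)
    results = {}

    async def run_one(name, fn):
        async with gate:  # blocks while `parallelism` sub-tests are running
            results[name] = await fn()

    await asyncio.gather(*(run_one(name, fn) for name, fn in subtests.items()))
    return results
```

Raising or lowering `parallelism` is the single knob, which mirrors the "simple parameter" mentioned above. Sub-tests that need exclusive resources such as window focus would still have to run with a cap of 1.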

Chris P.

[1]
http://mxr.mozilla.org/mozilla-central/source/content/media/test/manifest.js#381
[2]
http://mxr.mozilla.org/mozilla-central/source/content/media/test/manifest.js#375

Johnathan Nightingale

Sep 3, 2012, 12:59:46 PM

to Steve Fink, Ehsan Akhgari, Jeff Hammel, dev-pl...@lists.mozilla.org

On Aug 31, 2012, at 12:55 PM, Steve Fink wrote:
> On 08/31/2012 08:04 AM, Ehsan Akhgari wrote:
>> To give you a concrete example of why this kind of stuff does not work, when I was working on bug 157681 (which was a layout optimization), I came across a single browser-chrome test failure happening only on Mac, which seemed pretty unrelated to my changes at first, but it turned out that it actually uncovered a subtle bug in my patch which none of the other layout tests that we have managed to catch.
>>
>> This kind of stuff is rare, true, but it happens frequently enough that it really matters. I don't think we can seriously consider bucketing tests based on which files have changed in a patch without losing this important aspect of catching bugs in patches -- except perhaps for extremely localized components of the code.
>

> I can't argue convincingly without concrete data, but this sounds wrong to me.
>
> You give an example of where the test restrictions would fail due to the bucketing, but you also say "This kind of stuff is rare...". So when something like this happens, you wouldn't get a test build and wouldn't see the failure until several pushes later when the test *did* get run. So we get bad coalescing in rare cases.
>
> In return, we lower the infrastructure load across the board, resulting in less coalescing in the common case.

I'd also be perfectly okay with saying that changes someone like Ehsan makes to something like layout are gonna run the full suite every time. Layout pushes in general are likely to touch surprising things. But even granting that, Steve's suggestions could help firefox, toolkit, mobile, js, nss, webgl, &c pushes get out of the way by running subsets.

We could label whole directories as "touching this ends the world, test everything" and be pretty liberal about where we apply that label because at the moment, we effectively apply it to everything.
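
A strawman of that labelling, as a sketch: the directory lists and job names below are invented for illustration, and anything not explicitly labelled falls back to running everything, which matches the current behaviour.

```python
# Assumed labels: touching anything under these runs the full suite.
RUN_EVERYTHING = ("layout/", "build/", "configure.in")

# Assumed subsets for directories believed to be well isolated.
SUBSETS = {
    "js/": {"jsreftest", "xpcshell"},
    "toolkit/": {"xpcshell", "mochitest"},
}


def jobs_for_push(touched_files, all_jobs):
    """Pick the jobs for a push from the directory labels above."""
    if not touched_files:
        return set(all_jobs)
    jobs = set()
    for path in touched_files:
        if path.startswith(RUN_EVERYTHING):
            return set(all_jobs)
        for prefix, subset in SUBSETS.items():
            if path.startswith(prefix):
                jobs |= subset
                break
        else:
            # Unlabelled directory: treat it as "test everything".
            return set(all_jobs)
    return jobs
```

Starting with a liberal `RUN_EVERYTHING` list costs nothing relative to today, and entries can be moved into `SUBSETS` as data accumulates.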

So who's gonna volunteer to do the strawman test-bucket vs code location matrix? :)

J

---
Johnathan Nightingale
Sr. Director of Firefox Engineering
@johnath

Chris Cooper

Sep 4, 2012, 1:16:35 PM

to dev-pl...@lists.mozilla.org

This thread has wandered far, far away from the original purpose
(surprise) which was to assess whether we still needed/wanted Linux64 as
both a build and test platform.

Aside from the expected "OMGCHANGE" reactions, there were valid
arguments for keeping Linux64. We should invest the effort to get bug
527907 fixed.

However...

I'm not feeling a lot of love for 32-bit linux. Many people suggested
turning off linux32 instead if we needed to make a choice.

Would we consider stopping builds and tests on linux32 instead of
linux64, or at least putting some sort of horizon on how long we would
plan to support 32-bit linux as a tier 1 platform?

Again, no one is (necessarily) talking in absolutes here. We can
continue to run both linux platforms, we could demote linux32 to tier 2,
etc.

While it would obviously help unburden release engineering to reduce the
number of build/test environments we support, our primary goal here is
to make sure we're expending effort on relevant platforms and architectures.

cheers,
--
coop

Ed Morley

Sep 4, 2012, 1:30:28 PM

to

On Thursday, 30 August 2012 21:43:06 UTC+1, Mike Connor wrote:
> If we're advocating increased try use to keep inbound greener, I think
> it's a sign we've lost sight of the original point of having
> mozilla-inbound, which was a place to reduce effort for devs by having a
> tree that was explicitly _allowed_ to break, and didn't carry the heavy

> individual and collective cost of breaking mozilla-central. If we start
> treating mozilla-inbound as a "must be protected from bustage" tree,
> there's little point in having it as an additional step. So I'm
> completely clear:
>
> Breaking mozilla-inbound should be 100% acceptable, and trivial for a
> sheriff or others to fix.
>
> The entire point is to enable a developer workflow of "this should be
> good, pushing to -inbound, if it stays green I'm completely done" which
> we can't have on try, and makes try more focused on "I think this might
> break" patches rather than routine validation of patches.

This was never the purpose of mozilla-inbound.

The idea was to:
a) Have a tree where people did not have to watch their pushes for 4-6+ hours, since someone would keep an eye on it. (The primary dev incentive).
b) Mean that other branches could confidently pull from mozilla-central, knowing that it would be green.
c) Reduce the number of push races when merging other repos into mozilla-central (which can be more of a pain to rebase than normal sized pushes), since the traffic is lower.
d) Give us a way to not tie up mozilla-central if we end up with extreme bustage on mozilla-inbound. We also gained the ability to reset (mozilla-inbound) to a previous revision without reverting merges from other repos.

However, it was not ever meant to be a replacement for Try, and the inbound tree rules explicitly state this:
https://wiki.mozilla.org/Tree_Rules/Inbound#What_are_the_tree_rules_for_mozilla-inbound.3F

If people feel that it would be preferable (either from infra load or workflow) to change this policy, please can they start a dev.{platform,planning} discussion proposing a change - but in the meantime I would prefer it if they don't ignore the tree rules - since it results in very sadfaces sheriffs :-(

Best wishes,

Ed

Benoit Jacob

Sep 4, 2012, 2:31:51 PM

to Chris Cooper, dev-pl...@lists.mozilla.org

2012/9/4 Chris Cooper <cco...@deadsquid.com>:

I tried to make that point in the previous thread:

The problem with dropping a platform is not just that that platform
may be worth keeping, it is also that the fact that you feel the need
to drop a platform is probably a consequence of a deeper problem which
is the right thing to fix: our testing is too expensive. How do we
make it less expensive? People discussed possible ideas in the other
thread, including running tests less often and/or skipping part of the
tests depending on what didn't change.

Benoit

Chris Pearce

Sep 4, 2012, 3:47:26 PM

to Benoit Jacob, Chris Cooper

On 05/09/12 06:31, Benoit Jacob wrote:
> The problem with dropping a platform is not just that that platform
> may be worth keeping, it is also that the fact that you feel the need
> to drop a platform is probably a consequence of a deeper problem which
> is the right thing to fix: our testing is too expensive.

I agree.

But if we *were* to consider dropping a platform to tier 2, we should
make that decision with data to back it up, which for Linux should also
include data regarding the Firefox x86/x64 split in the major distros,
since most roll their own Firefox packages which we don't track.

Chris P.

Ehsan Akhgari

Sep 4, 2012, 5:20:37 PM

to Johnathan Nightingale, Steve Fink, dev-pl...@lists.mozilla.org, Jeff Hammel

This makes sense. Do you wanna file a bug in Core::Build Config and
assign it to Steve? ;-)

Ehsan

Justin Dolske

Sep 5, 2012, 4:32:11 PM

On 9/4/12 10:30 AM, Ed Morley wrote:

>> Breaking mozilla-inbound should be 100% acceptable, and trivial for a
>> sheriff or others to fix.

...

>
> However, it was not ever meant to be a replacement for Try, and the inbound tree rules explicitly state this:
> https://wiki.mozilla.org/Tree_Rules/Inbound#What_are_the_tree_rules_for_mozilla-inbound.3F

I _suspect_ you both may be saying basically the same thing,
although differing on exactly where the line is...

I think it's true that m-i isn't a playground; developers should be
surprised if a m-i push fails, and not just expect it to flush out
problems. At a _minimum_, developers should have at least built and run
relevant tests locally. No push'n'pray.

I also think it's true that using Try is a best-practice. It's easy and
helps to spot the unexpected without causing work for other people. It's
even essentially _required_ if you're doing things that have a history
of being touchy -- C++ magic that various compilers might dislike,
invasive build system changes, platform-specific changes that you can't
check yourself, etc.

But in-between we're trusting developers to use their best judgement.
Trivial, well-understood changes might not need Try at all. When Try is
used, we ask them to use TryChooser to limit resource usage by running
only what's needed. Not everything needs a full Talos run + debug + opt + all
tests + all platforms (+ multiple runs to ensure no new random orange is
added or you got lucky with a random green).

Assuming this is all true, it seems what we might really want here are
some better guidelines for helping developers tune/improve their "best
judgement". Some of the quote-unquote-obvious things raised in this thread
would be a good start.

Justin

Jonathan Kew

Sep 5, 2012, 6:33:37 PM9/5/12

to dev-pl...@lists.mozilla.org

On 5/9/12 21:32, Justin Dolske wrote:
> On 9/4/12 10:30 AM, Ed Morley wrote:
>
>>> Breaking mozilla-inbound should be 100% acceptable, and trivial for a
>>> sheriff or others to fix.
> ...
>>
>> However, it was not ever meant to be a replacement for Try, and the
>> inbound tree rules explicitly state this:
>> https://wiki.mozilla.org/Tree_Rules/Inbound#What_are_the_tree_rules_for_mozilla-inbound.3F
>>
>
> I _suspect_ you both may be saying basically the same thing,
> although differing on exactly where the line is...
>
> I think it's true that m-i isn't a playground; developers should be
> surprised if a m-i push fails, and not just expect it to flush out
> problems. At a _minimum_, developers should have at least built and run
> relevant tests locally. No push'n'pray.

> [etc]

Indeed. My understanding has always been that the expected "patch
quality" for m-i is the same as for m-c. I wouldn't push a patch to
inbound unless I believe that patch is ready for mozilla-central. If I
have any significant level of doubt about this, I'd push to tryserver
first to verify whatever tests/platforms/etc I'm concerned about.

Of course, I may misjudge this sometimes, in which case our faithful
sheriffs will rescue the tree by backing me out. But the primary reason
for me to land on inbound rather than m-c is simply that it frees me
from tree-watching responsibilities -- not that it lets me push stuff
that I feel is too risky for m-c.

JK

Steve Fink

Sep 6, 2012, 12:16:24 AM9/6/12

to Ehsan Akhgari, dev-pl...@lists.mozilla.org, Jeff Hammel, Johnathan Nightingale

I'd be fine with that, though I also wouldn't get to it for a while
unless I make it through a couple of other projects faster than I have
been so far.

Then again... ok, here's v1, in bash:

echo "run everything"

or in Python

print("run everything")

Now, who can hook this into buildbot? I'll patch it from there. :-)

Except I'm not kidding. I can go to town on some crazy algorithm, but
I've no clue about the code or the process for getting anything
actually hooked in and deployed.

Btw, upon further reflection, my previously sketched-out algorithm is
all wrong. You don't just want to have a per-push trigger that says
"what tests should we kick off for this push?" You really want a job
completion trigger that says "what could I do with this now-available
machine that would give me the most information, given what I currently
know?" Or maybe that's units of information per machine-minute, I'm not
sure. But that formulation gives way more possibilities -- it might
choose to bisect a past coalesced failure rather than just kicking off
an almost-certain-to-be-useless test for the latest push. And as long
as you don't hint to it that intermittent *greens* are possible, it'll
naturally decay to running everything for every push if resources are
available. (If you don't limit it, it'll also use idle resources to
rerun every failure forever to make sure it's not intermittent, too.)
Welcome to our new overlord, the Robosheriff!
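
A minimal sketch of the "job completion trigger" idea above, assuming each candidate job already carries an estimate of how much it would tell us and how long it would take. The Candidate class, the pick_job function, and all the numbers are hypothetical and only illustrate ranking by information per machine-minute; nothing here reflects real buildbot code.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    expected_info: float  # hypothetical score: how much we expect to learn
    minutes: float        # estimated machine time for the job


def pick_job(candidates):
    """Return the candidate with the highest expected information per machine-minute."""
    runnable = [c for c in candidates if c.minutes > 0]
    if not runnable:
        return None
    return max(runnable, key=lambda c: c.expected_info / c.minutes)


# Made-up estimates for a machine that has just become available.
candidates = [
    Candidate("mochitests-1 on latest push", expected_info=0.05, minutes=30),
    Candidate("bisect coalesced M1 failure", expected_info=0.90, minutes=30),
    Candidate("rerun possibly intermittent failure", expected_info=0.30, minutes=15),
]

best = pick_job(candidates)
print(best.name)
```

With these made-up numbers the bisection job wins, which matches the intuition that an almost-certain-to-be-green run on the latest push tells us little.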

Steve Fink

Sep 7, 2012, 2:35:57 PM9/7/12

to mozilla.dev.planning group, Chris AtLee, Ehsan Akhgari

On Thu 06 Sep 2012 07:16:32 PM PDT, Ehsan Akhgari wrote:

> Steve Fink wrote:
>> On Tue 04 Sep 2012 02:20:37 PM PDT, Ehsan Akhgari wrote:
>>> On 12-09-03 12:59 PM, Johnathan Nightingale wrote:
>>>>
>>>> I'd also be perfectly okay with saying that changes someone like
>>>> Ehsan makes to something like layout are gonna run the full suite
>>>> every time. Layout pushes in general are likely to touch surprising
>>>> things. But even granting that, Steve's suggestions could help
>>>> firefox, toolkit, mobile, js, nss, webgl, &c pushes get out of the
>>>> way by running subsets.
>>>>
>>>> We could label whole directories as "touching this ends the world,
>>>> test everything" and be pretty liberal about where we apply that
>>>> label because at the moment, we effectively apply it to everything.
>>>>
>>>> So who's gonna volunteer to do the strawman test-bucket vs code
>>>> location matrix? :)
>>>
>>> This makes sense. Do you wanna file a bug in Core::Build Config and
>>> assign it to Steve? ;-)
>>
>> I'd be fine with that, though I also wouldn't get to it for a while
>> unless I make it through a couple of other projects faster than I
>> have been so far.
>>
>> Then again... ok, here's v1, in bash:
>>
>> echo "run everything"
>>
>> or in Python
>>
>> print("run everything")
>>
>> Now, who can hook this into buildbot? I'll patch it from there. :-)
>

> So, I discussed this idea briefly with catlee today. Here's the
> gist. Doing this is not as easy as I thought it would be, since it is
> the build machine which schedules the test jobs once the build is
> finished, and buildbot is not involved in the decision. However, it
> is the buildbot who knows which files have changed in a given push.
> So, we need to stream that information into the builder somehow so
> that it can make the call on which test suites to run.

How does coalescing happen? Does the build machine always request the
full set of tests, and then buildbot ignores the request if it's
overloaded? Or does the build machine actually know something about the
overload state? If the former, then plainly the build machine can
continue doing exactly what it's doing, and whatever is currently aware
of the overload would just need to be given information on the changes
made so that it could selectively suppress jobs. But I somehow doubt
it's that simple.
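
As a rough illustration of "selectively suppress jobs" based on what a push changed, here is a sketch of the code-location vs. test-bucket matrix discussed earlier in the thread. The directory prefixes, the suite names, and the suites_for_push function are assumptions for the example, not an agreed mapping. Anything not covered by the matrix falls back to running everything, which is today's behaviour.

```python
# Hypothetical set of test buckets; the real list would come from the tree.
ALL_SUITES = {"mochitests", "reftest", "crashtest", "xpcshell", "jsreftest"}

# Hypothetical matrix from code location to the test buckets worth running.
# "layout/" is labelled "touching this ends the world, test everything".
SUITES_BY_PREFIX = {
    "js/src/": {"jsreftest", "xpcshell"},
    "browser/": {"mochitests", "xpcshell"},
    "layout/": ALL_SUITES,
}


def suites_for_push(changed_files):
    """Return the set of suites to schedule for a push touching changed_files."""
    if not changed_files:
        return set(ALL_SUITES)
    selected = set()
    for path in changed_files:
        for prefix, suites in SUITES_BY_PREFIX.items():
            if path.startswith(prefix):
                selected |= suites
                break
        else:
            # Unknown location: be conservative and run everything.
            return set(ALL_SUITES)
    return selected


print(sorted(suites_for_push(["js/src/jsapi.cpp"])))
```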

The pie-in-the-sky optimal interface would integrate more deeply, and
might require a bit of rearchitecting. It really wants to be a daemon
monitoring these notifications:

- job completion, with status
- new slave available (probably because it completed a job, but also
when adding to the pool or rebooting or whatever)
- changes pushed, with a way of knowing what's in that change
- star comment added

The "new slave available" notification might actually be a synchronous
call, since it would be the only thing kicking off new jobs. Optionally,
this daemon could cancel known-to-be-bad jobs, trigger clobbers, and
auto-star in limited cases.

Oh, and it wants to be able to distinguish regular pushes from merges
and backouts, because failure probabilities are totally different across
those. But a regex match is good enough for that.
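
A sketch of that regex match, assuming the classification is done on the first line of the commit message. The patterns and the classify_push function are illustrative guesses at common message conventions, not an existing hook.

```python
import re

# Assumed conventions: backouts start with "Backed out", "Backout",
# "Backing out" or "Revert"; merges start with "Merge".
BACKOUT_RE = re.compile(r"^\s*(back(ed|ing)?\s*out|revert)\b", re.IGNORECASE)
MERGE_RE = re.compile(r"^\s*merge\b", re.IGNORECASE)


def classify_push(message):
    """Classify a push as 'backout', 'merge' or 'regular' from its commit message."""
    first_line = message.splitlines()[0] if message else ""
    if BACKOUT_RE.match(first_line):
        return "backout"
    if MERGE_RE.match(first_line):
        return "merge"
    return "regular"


for msg in (
    "Backed out changeset 0123abcd for bustage",
    "Merge mozilla-inbound to mozilla-central",
    "Bug 12345 - Fix a crash in the widget code",
):
    print(classify_push(msg), "<-", msg)
```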

In other words, it kind of wants to be the global scheduler. It would
maintain state. Version 1 would watch incoming pushes and queue up all
the build jobs. When a build job completed, it would queue up the test
jobs, only it wouldn't be a linear queue because when another build came
in it would need to reimplement the current coalescing strategy. When a
slave became available, it would throw a job at it. Ignoring the
(enormous) buildbot architectural questions, this should be pretty quick
and straightforward to implement.

Later versions would be maintaining state to quickly and correctly
answer the question, when a new slave is available, "what is the most
useful job to run on this machine?" Usually that would be grabbing one
of the test jobs from the most recent build, but could be bisecting
coalesced failures or retriggering possibly intermittent failures.
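
Bisecting a coalesced failure could look roughly like the following sketch. It assumes an ordered list of pushes, a last push known to be green, and a first push known to be red, with the pushes in between never tested because they were coalesced. The next_bisect_push function is a hypothetical helper.

```python
def next_bisect_push(pushes, last_good, first_bad):
    """Return the untested push to run next, or None once the culprit is pinned down."""
    lo = pushes.index(last_good)
    hi = pushes.index(first_bad)
    if hi - lo <= 1:
        return None  # first_bad directly follows last_good, so it is the culprit
    return pushes[(lo + hi) // 2]


pushes = ["A", "B", "C", "D", "E"]
# A was green, E is red, and B, C, D were coalesced away.
print(next_bisect_push(pushes, "A", "E"))  # test C next
# Suppose C comes back green: the search narrows to D.
print(next_bisect_push(pushes, "C", "E"))
# Suppose D comes back red: D directly follows the good push C.
print(next_bisect_push(pushes, "C", "D"))
```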

To correctly answer the "most useful job" question, it would need to
maintain estimates of the probability of any given job failing, as well
as an estimate of the current state of every type of job in the tree (eg
M1 is (85% probability) failing from one of the last 3 pushes, or (15%
probability) is a not-yet-starred intermittent failure; M2 is totally
happy with respect to the latest push.) That means it could eventually
provide a sheriff's dashboard, enumerating the possible causes of the
current horrific breakage and its plan for figuring out what's going on
(which of course can be overridden at any time via manual retriggers or
whatever.) It could even give its logic for why it picked each upcoming
job. It should be written to be reactive, though, so it doesn't depend
on anything following its advice.
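
One way to maintain such an estimate is a simple Bayesian update after each retrigger. The sketch below starts from the 85% / 15% split in the example above; the two likelihoods (how often a real regression or an intermittent shows red on a rerun) are made-up assumptions, as is the update_real_failure function.

```python
def update_real_failure(prior_real, retrigger_failed,
                        p_fail_if_real=0.98, p_fail_if_intermittent=0.10):
    """Return P(real regression) after observing one retrigger result.

    prior_real is the current belief that the red job is a real regression;
    the remaining probability mass is "not-yet-starred intermittent".
    """
    if retrigger_failed:
        like_real, like_int = p_fail_if_real, p_fail_if_intermittent
    else:
        like_real, like_int = 1 - p_fail_if_real, 1 - p_fail_if_intermittent
    numerator = prior_real * like_real
    return numerator / (numerator + (1 - prior_real) * like_int)


prior = 0.85  # "M1 is (85% probability) failing from one of the last 3 pushes"
print(round(update_real_failure(prior, retrigger_failed=True), 3))
print(round(update_real_failure(prior, retrigger_failed=False), 3))
```

Under these assumed likelihoods, a second red pushes the estimate toward a real regression, while a green retrigger shifts most of the probability to the intermittent explanation.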

In fact, an alternative implementation route would be to implement the
dashboard with all the crazy estimation stuff first, but not give it any
ability to start/stop/star jobs. Then it could be validated on actual
data before giving it the reins.

This would not want to live on the builders, though. It needs global
visibility.

Chris AtLee

Sep 7, 2012, 3:56:45 PM9/7/12

> How does coalescing happen? Does the build machine always request the
> full set of tests, and then buildbot ignores the request if it's
> overloaded? Or does the build machine actually know something about the
> overload state? If the former, then plainly the build machine can
> continue doing exactly what it's doing, and whatever is currently aware
> of the overload would just need to be given information on the changes
> made so that it could selectively suppress jobs. But I somehow doubt
> it's that simple.

tl;dr - builds and tests are greedy - a machine will grab all pending
work of the same type when it starts a build/test

Coalescing happens on the buildbot master at the time when a build
starts. Once a machine is available to start a job the default behaviour
is to grab all other pending jobs of the same type. The primary
exceptions to this are try jobs where coalescing is disabled completely.

For builds, this turns into something like this when we're running at
full capacity:
* push A -> pending build requests for win32, linux64, etc.
* push B -> pending build requests for win32, linux64, etc.
* push C -> pending build requests for win32, linux64, etc.
* win32 build slave becomes available. build master coalesces pending
requests for win32 A,B,C into a single job, and tells slave to
checkout/build the latest code (C).
* push D -> pending build requests for win32, linux64, etc.
* push E -> pending build requests for win32, linux64, etc.
* win32 build slave becomes available. build master coalesces pending
requests for win32 D,E into a single job, and tells slave to
checkout/build the latest code (E).

At this point, the build master has a lot of information about the
changes going into A,B,C,D,E, including which files have changed. This
data isn't currently communicated to the build slave, nor does it
influence decisions about what should be built in most cases.

For each build platform, when the builds of C,E finish, they trigger
tests by notifying the build master of a few pieces of data: branch,
revision, platform as well as urls to the builds, tests, and symbols.
This results in the pending queue for tests looking like:

* win32 mozilla-central C mochitests-1 http://....
* win32 mozilla-central C mochitests-2 http://....
...
* win32 mozilla-central E mochitests-1 http://....

These are subject to the same coalescing behaviours as the builds. So
all the "mochitests-1" jobs for win32 mozilla-central will be coalesced
the next time a slave is free. Note that the pending requests for test
jobs only include the revision, not the list of files that were changed
for the build. Also note that the test requests give no indication of
how many pushes were coalesced into one build. Pushes A, B, and D never
existed as far as tests are concerned.
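
The build coalescing sequence described above can be sketched as a small simulation. This is an illustration only, not buildbot code: the event tuples, the PLATFORMS list, and the simulate_coalescing function are hypothetical names.

```python
PLATFORMS = ["win32", "linux64"]


def simulate_coalescing(events):
    """Replay push / slave-free events and return (started jobs, still-pending requests).

    Each started job is (platform, revision built, pushes coalesced into it).
    """
    pending = {platform: [] for platform in PLATFORMS}
    started = []
    for kind, value in events:
        if kind == "push":
            # Every push creates one pending build request per platform.
            for platform in PLATFORMS:
                pending[platform].append(value)
        elif kind == "slave_free":
            queue = pending[value]
            if queue:
                # Greedy: grab all pending requests of this type, build the latest.
                started.append((value, queue[-1], list(queue)))
                queue.clear()
    return started, pending


events = [
    ("push", "A"), ("push", "B"), ("push", "C"),
    ("slave_free", "win32"),
    ("push", "D"), ("push", "E"),
    ("slave_free", "win32"),
]
started, pending = simulate_coalescing(events)
for job in started:
    print(job)
```

Only C and E are ever built for win32 in this replay, so any test requests triggered afterwards see just those two revisions.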

This isn't to say that we *can't* change which tests are run in response
to which files are changed, rather that it's a significant change from
the current implementation.

I hope this helps!
Chris

Chris AtLee

Sep 7, 2012, 4:09:33 PM9/7/12

On 07/09/12 02:35 PM, Steve Fink wrote:
> The pie-in-the-sky optimal interface would integrate more deeply, and
> might require a bit of rearchitecting. It really wants to be a daemon
> monitoring these notifications:
>
> - job completion, with status
> - new slave available (probably because it completed a job, but also
> when adding to the pool or rebooting or whatever)
> - changes pushed, with a way of knowing what's in that change
> - star comment added
>
> The "new slave available" notification might actually be a synchronous
> call, since it would be the only thing kicking off new jobs. Optionally,
> this daemon could cancel known-to-be-bad jobs, trigger clobbers, and
> auto-star in limited cases.
>
> Oh, and it wants to be able to distinguish regular pushes from merges
> and backouts, because failure probabilities are totally different across
> those. But a regex match is good enough for that.
>
> In other words, it kind of wants to be the global scheduler. It would
> maintain state. Version 1 would watch incoming pushes and queue up all
> the build jobs. When a build job completed, it would queue up the test
> jobs, only it wouldn't be a linear queue because when another build came
> in it would need to reimplement the current coalescing strategy. When a
> slave became available, it would throw a job at it. Ignoring the
> (enormous) buildbot architectural questions, this should be pretty quick
> and straightforward to implement.

Indeed the architectural issues there are enormous...intelligent
scheduling is really tricky to get right, and then trickier to implement
in buildbot. We've been working on a few approaches that may help: one is
to basically dump out all of the relevant state and events from the
buildbot master to make it consumable from external processes. These
processes can then inject new work into the system at their own pace.

This should make it easier to implement schedulers that require more
state, or that may require some "expensive" operations to figure out
what to do next (e.g. looking up past results in a DB, checking starred
status, etc.)

One big difference from what you describe is that buildbot doesn't
generate work in response to slave availability; it instead keeps a list
of pending work, and which slaves are eligible to do it, and assigns the
work out when slaves are free.
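
To make that difference concrete, here is a toy model of "a list of pending work, and which slaves are eligible to do it". The job names, slave names, and the assign function are invented for the example; real buildbot scheduling is considerably more involved.

```python
def assign(pending, free_slave):
    """Hand the oldest pending job that free_slave is eligible for to that slave.

    pending is an ordered list of (job, eligible_slaves) pairs; the chosen
    entry is removed from the list. Returns None if nothing matches.
    """
    for i, (job, eligible) in enumerate(pending):
        if free_slave in eligible:
            del pending[i]
            return job
    return None


# Hypothetical pending list, oldest first.
pending = [
    ("win32 build", {"w32-slave-1", "w32-slave-2"}),
    ("linux64 build", {"lx64-slave-1"}),
    ("win32 mochitests-1", {"w7-test-1"}),
]

print(assign(pending, "lx64-slave-1"))
print(assign(pending, "lx64-slave-1"))
print(len(pending))
```

The work exists before any slave frees up; a free slave only determines which pending entry is handed out next.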

Henri Sivonen

Sep 18, 2012, 7:57:57 AM9/18/12

to mozilla.dev.planning group

On Fri, Aug 31, 2012 at 3:16 PM, Ben Hearsum <bhea...@mozilla.com> wrote:
> I know this is in the works (sorry, I don't know which bug it is happening in),
> but we can't quite run all of our unit tests on AWS. Anything that
> depends on a GPU (reftest, some crashtests, and even some mochitests
> I've heard) can't run there. We definitely want to move everything we
> can to the cloud, though.

Would it be possible to use llvmpipe to make a CPU-only config on AWS
that to Firefox looks like a config with a GPU?

--
Henri Sivonen
hsiv...@iki.fi
http://hsivonen.iki.fi/

Ben Hearsum

Sep 18, 2012, 8:11:29 AM9/18/12

to Henri Sivonen

On 09/18/12 07:57 AM, Henri Sivonen wrote:
> On Fri, Aug 31, 2012 at 3:16 PM, Ben Hearsum <bhea...@mozilla.com> wrote:
>> I know this is in the works (sorry, I don't know which bug it is happening in),
>> but we can't quite run all of our unit tests on AWS. Anything that
>> depends on a GPU (reftest, some crashtests, and even some mochitests
>> I've heard) can't run there. We definitely want to move everything we
>> can to the cloud, though.
>
> Would it be possible to use llvmpipe to make a CPU-only config on AWS
> that to Firefox looks like a config with a GPU?

Would testing like that constitute a valid test? We've been pretty
insistent that we test on real-world things in the past.

Henri Sivonen

Sep 18, 2012, 8:28:53 AM9/18/12

to mozilla.dev.planning group

To the extent there are already Linux distros that run Gnome Shell on
llvmpipe when suitable GPU OpenGL drivers are missing and Ubuntu is
moving to running the 3D version of Unity on llvmpipe when suitable
GPU OpenGL drivers are missing, I'd expect running on top of llvmpipe
to correspond to one kind of real-world situation, though I don't
actually know how Firefox sees the OpenGL stack when running with
Gnome Shell/llvmpipe or Unity/llvmpipe.