
Increase in mozilla-inbound bustage due to people not using Try


Ed Morley

Aug 9, 2012, 10:19:23 AM
Increasingly over the last few weeks, people have been landing untested changes on mozilla-inbound that have resulted in bustage - which, combined with high levels of coalescing (which makes finding the culprit require multiple time-consuming re-triggers), has meant closing the tree for several hours.

For example - last night the tree was closed for a total of roughly 6 hours whilst 7 backouts were performed. This makes three days in a row where the tree has had to be closed during peak times.

This is neither fair on other developers (who are understandably frustrated when the tree is closed day after day), nor on the sheriffs (who at that time of day are only volunteers).

As such, I would like to request that people adjust their risk thresholds for when to use Try - so we can minimise the impact on the tree.

I realise that Try Server end-to-end times have been less than ideal recently, so using Try can be frustrating - however the situation is slowly improving (tegra reboot issues fixed, pymake for Windows almost ready; see/ping bug 772458 for more). In addition, please bear in mind that landing bustage on trunk trees actually makes the Try wait times worse (since the trunk backouts/retriggers take test job priority over Try) - leading to others not bothering to use Try either, and so the situation cascades.

A quick reminder: if your patch(es) have been sent to Try, please add the URL to the bug, so that in the case of bustage it's easier for the sheriffs to eliminate your push as the cause.
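
(If it's been a while since your last Try push, a sketch of a typical push using mq is below - the trychooser flags are illustrative only, so pick the platforms/suites that match your change:)

  # Create a temporary patch whose message carries the trychooser syntax,
  # push it, then drop it again.
  hg qnew -m "try: -b do -p all -u all -t none" try-syntax
  hg push -f ssh://hg.mozilla.org/try   # -f because try heads always diverge
  hg qpop                               # unapply the temporary syntax patch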

Finally, the list of steps to perform after pushing to inbound has recently been simplified (thanks to the new merge tool we use) - please take a look if you haven't in the last 4-6 weeks:
https://wiki.mozilla.org/Tree_Rules/Inbound#Please_do_the_following_after_pushing_to_inbound

Best wishes,

Ed

Justin Lebar

Aug 9, 2012, 10:35:28 AM
to Ed Morley, dev-pl...@lists.mozilla.org
> In addition, please bear in mind that landing bustage on trunk trees actually
> makes the Try wait times worse (since the trunk backouts/retriggers take test
> job priority over Try) - leading to others not bothering to use Try either, and so
> the situation cascades.

I thought tryserver used a different pool of machines isolated from
all the other trees, because we treated the tryserver machines as
pwned. Is that not or no longer the case?

> Increasingly over the last few weeks, people have been landing untested
> changes on mozilla-inbound that have resulted in bustage - which, combined
> with high levels of coalescing (which makes finding the culprit require
> multiple time-consuming re-triggers)

Is there a plan to mitigate the coalescing on m-i? It seems like that
is a big part of the problem.

Ryan VanderMeulen

Aug 9, 2012, 11:32:53 AM
On 8/9/2012 10:35 AM, Justin Lebar wrote:
> Is there a plan to mitigate the coalescing on m-i? It seems like that
> is a big part of the problem.
>

One thing that might help would be a limit on the number of consecutive
pushes that a test can be coalesced across before another run has to be
scheduled. I would say 3 max (and even that makes me somewhat
uncomfortable unless the next paragraph is addressed).

Bug 690672 would also help so that sheriffs could more quickly turn
around retriggers on coalesced builds. Along with that, retriggers
should take priority over regularly-scheduled builds.

-Ryan

Justin Wood (Callek)

Aug 10, 2012, 1:35:12 AM
to Justin Lebar
Justin Lebar wrote:
>> In addition, please bear in mind that landing bustage on trunk trees actually
>> makes the Try wait times worse (since the trunk backouts/retriggers take test
>> job priority over Try) - leading to others not bothering to use Try either, and so
>> the situation cascades.
>
> I thought tryserver used a different pool of machines isolated from
> all the other trees, because we treated the tryserver machines as
> pwned. Is that not or no longer the case?
>

Yes and no: the build machines are completely different; the test
machines -- not so much.

The testers, however, are shared. Testers have a completely different
set of passwords, as well as other mitigations. The idea here is that our
test machines also have no permissions to upload anyway, nor any way to
leak/get secrets. And all machines are in a restricted network
environment overall anyway.

So load on inbound affects *test* load on try, yes.

--
~Justin Wood (Callek)


Ed Morley

Aug 14, 2012, 3:14:07 PM
On Thursday, 9 August 2012 15:35:28 UTC+1, Justin Lebar wrote:
> Is there a plan to mitigate the coalescing on m-i? It seems like that
> is a big part of the problem.

Reducing the amount of coalescing permitted would just mean we end up with a backlog of pending tests on the repo tip - which would result in tree closures regardless. So other than bug 690672 making sheriffs' lives easier, we just need more machines in the test pool - since it's simply a case of demand exceeding capacity.

The situation is made worse now that we're adding new platforms (OS X 10.7, B2G GB, B2G ICS, Android Armv6, soon OS X 10.8, Win8 desktop, Win8 metro) faster than we're EOLing them - and we're pushing more changes per day than ever before [1]. From what I understand, Apple's aggressive hardware cycle is also making it difficult to expand the test pool [2].

On a more positive note, at the end of this cycle we should be able to turn off Android XUL on trunk trees [3], which will at least help improve the wait on that platform :-)


[1] http://oduinn.com/blog/2012/08/04/infrastructure-load-for-july-2012/
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=772458#c3
[3] https://bugzilla.mozilla.org/show_bug.cgi?id=777037#c4

Justin Lebar

Aug 14, 2012, 3:38:08 PM
to Ed Morley, John O'Duinn, dev-pl...@lists.mozilla.org
>> Is there a plan to mitigate the coalescing on m-i? It seems like that
>> is a big part of the problem.
>
> it's simply a case of demand exceeding capacity.

Understood.

But I think my question still stands: Is there a plan to address the
fact that we do not have capacity to run all the tests we need to run?

It sounds like [2] the answer is no, at least for the medium term,
because releng is busy deploying OS X 10.8 and Windows 8.

I do not think we can afford to wait on these large projects before
deploying more hardware. I'd like to see data, but it seems to me
that we've hugely regressed tryserver turnaround times in the past few
months. Unless we're able to add more machines to the pool, there is
no end in sight.

It seems that we need a concrete promise from releng/IT to keep
end-to-end tryserver times (push to final test finished) below X hours
at the 90th percentile, and to coalesce fewer than Y% of pushes to
m-i/m-c (measured during the busiest Z hours of each day). Then
there's no need to guess about whether the pool is unacceptably backed
up, or whether fixing the pile-up should be a priority.

-Justin

[2] https://bugzilla.mozilla.org/show_bug.cgi?id=772458#c3

Gregory Szorc

Aug 14, 2012, 3:47:17 PM
to Ed Morley, dev-pl...@lists.mozilla.org
On 8/14/12 12:14 PM, Ed Morley wrote:
> On Thursday, 9 August 2012 15:35:28 UTC+1, Justin Lebar wrote:
>> Is there a plan to mitigate the coalescing on m-i? It seems like that
>> is a big part of the problem.
>
> Reducing the amount of coalescing permitted would just mean we end up with a backlog of pending tests on the repo tip - which would result in tree closures regardless. So other than bug 690672 making sheriffs' lives easier, we just need more machines in the test pool - since it's simply a case of demand exceeding capacity.
>
> The situation is made worse now that we're adding new platforms (OS X 10.7, B2G GB, B2G ICS, Android Armv6, soon OS X 10.8, Win8 desktop, Win8 metro) faster than we're EOLing them - and we're pushing more changes per day than ever before [1]. From what I understand, Apple's aggressive hardware cycle is also making it difficult to expand the test pool [2].

Is there a tracking bug for areas where we could gain efficiency? We all
know the build phase is full of clownshoes. But, I believe we also do
silly things like execute some tests serially, only taking advantage of
1/N CPU cores in the process. This is just wasting resources. See [1]
for a concrete example.

Do we have data on the actual hardware load for the test runners? If we
are throwing away significant CPU cycles, etc, we could probably
alleviate a lot of the problems just with software changes.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=686240

Justin Lebar

Aug 14, 2012, 4:04:38 PM
to Gregory Szorc, Ed Morley, dev-pl...@lists.mozilla.org
> But, I believe we also do silly
> things like execute some tests serially, only taking advantage of 1/N CPU
> cores in the process. This is just wasting resources. See [1] for a concrete
> example.

It would be very cool if we could run mochitests inside xvfb on Linux
(and maybe Mac?). But it is another point of failure -- for example,
on my machine, xvfb-run causes mochitest to randomly segfault. (I
think it's Firefox, not xvfb, that's dying, although I'm not
positive.)
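
(For concreteness, the kind of invocation I mean is roughly the following - a sketch only, assuming $OBJDIR is your object directory; the screen geometry is arbitrary and our automation does not run this today:)

  # Run the plain mochitest suite under a virtual X server, so the browser
  # window neither needs nor steals the desktop's focus.
  xvfb-run -a -s "-screen 0 1600x1200x24" make -C "$OBJDIR" mochitest-plain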

Of course, investigating and implementing this would require
resources, which would require us to acknowledge that we're failing by
some metric, which would require us to agree on specific goals, which
would require us first to agree that we should have such goals in the
first place! :)

On Tue, Aug 14, 2012 at 3:47 PM, Gregory Szorc <g...@mozilla.com> wrote:
> On 8/14/12 12:14 PM, Ed Morley wrote:
>>
>> On Thursday, 9 August 2012 15:35:28 UTC+1, Justin Lebar wrote:
>>>
>>> Is there a plan to mitigate the coalescing on m-i? It seems like that
>>> is a big part of the problem.
>>
>>
>> Reducing the amount of coalescing permitted would just mean we end up with
>> a backlog of pending tests on the repo tip - which would result in tree
>> closures regardless. So other than bug 690672 making sheriffs' lives easier,
>> we just need more machines in the test pool - since it's simply a case of
>> demand exceeding capacity.
>>
>> The situation is made worse now that we're adding new platforms (OS X
>> 10.7, B2G GB, B2G ICS, Android Armv6, soon OS X 10.8, Win8 desktop, Win8
>> metro) faster than we're EOLing them - and we're pushing more changes per
>> day than ever before [1]. From what I understand, Apple's aggressive
>> hardware cycle is also making it difficult to expand the test pool [2].
>
>
> Is there a tracking bug for areas where we could gain efficiency? We all
> know the build phase is full of clownshoes. But, I believe we also do silly
> things like execute some tests serially, only taking advantage of 1/N CPU
> cores in the process. This is just wasting resources. See [1] for a concrete
> example.
>
> Do we have data on the actual hardware load for the test runners? If we are
> throwing away significant CPU cycles, etc, we could probably alleviate a lot
> of the problems just with software changes.
>
> [1] https://bugzilla.mozilla.org/show_bug.cgi?id=686240
>

Aryeh Gregor

Aug 15, 2012, 5:52:34 AM
to Gregory Szorc, Ed Morley, dev-pl...@lists.mozilla.org
On Tue, Aug 14, 2012 at 10:47 PM, Gregory Szorc <g...@mozilla.com> wrote:
> Is there a tracking bug for areas where we could gain efficiency? We all
> know the build phase is full of clownshoes. But, I believe we also do silly
> things like execute some tests serially, only taking advantage of 1/N CPU
> cores in the process. This is just wasting resources. See [1] for a concrete
> example.

Don't we execute *all* tests serially? Many of our tests require
focus, so you can't do two runs in parallel on the same desktop. In
theory we could specially flag the ones that don't need focus, and
make sure to always run them without focus -- that would probably be
most of the tests. Then those could be run in parallel. They could
also be run in the background on developer machines, which would be
nice. This would require a bunch of developer work.

Alternatively, the test machines could be set up with multiple
desktops with independent focus. At least Windows and Linux should
support this, AFAIK -- it's necessary if you want to allow a
thin-client setup in corporate environments. This would require a
bunch of IT work.

(I don't think xvfb-run is a good solution, because it's not exactly
the same as a normal X session. In my experience, a small fraction of
tests unexpectedly fail using xvfb-run. By the same token, I'm
guessing some will incorrectly pass. It doesn't seem like a good idea
to use a different environment for test machines than users will use,
if we can avoid it.)

Gregory Szorc

Aug 15, 2012, 6:10:16 PM
to Aryeh Gregor, Ed Morley, dev-pl...@lists.mozilla.org
On 8/15/12 2:52 AM, Aryeh Gregor wrote:
> On Tue, Aug 14, 2012 at 10:47 PM, Gregory Szorc <g...@mozilla.com> wrote:
>> Is there a tracking bug for areas where we could gain efficiency? We all
>> know the build phase is full of clownshoes. But, I believe we also do silly
>> things like execute some tests serially, only taking advantage of 1/N CPU
>> cores in the process. This is just wasting resources. See [1] for a concrete
>> example.
>
> Don't we execute *all* tests serially?

Outside of splitting some suites into chunks and running them on
different builders, I'm pretty sure we do.

> Many of our tests require
> focus, so you can't do two runs in parallel on the same desktop. In
> theory we could specially flag the ones that don't need focus, and
> make sure to always run them without focus -- that would probably be
> most of the tests. Then those could be run in parallel. They could
> also be run in the background on developer machines, which would be
> nice. This would require a bunch of developer work.

There are test suites like xpcshell and the js reftests that AFAIK don't
have special hardware constraints (like the visual tests do). They can
at least be executed with full process + profile isolation. In some
cases, we could probably re-use processes to avoid that overhead. I view
each test suite as independent, and I'm sure there is some low-hanging
fruit in there.
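
As a strawman, something like the following could fan an xpcshell run out across local processes - purely a sketch; the chunking flags do exist in the harness, but the make/EXTRA_TEST_ARGS plumbing shown here is an assumption, not what automation runs:

  # Hypothetical: run 4 xpcshell chunks concurrently; each chunk already
  # gets per-test process + profile isolation from the harness itself.
  for i in 1 2 3 4; do
    make -C "$OBJDIR" xpcshell-tests \
      EXTRA_TEST_ARGS="--total-chunks=4 --this-chunk=$i" &
  done
  wait  # a real script would also collect the four exit statuses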

William Lachance

Aug 15, 2012, 6:17:42 PM
On 08/14/2012 03:47 PM, Gregory Szorc wrote:
> On 8/14/12 12:14 PM, Ed Morley wrote:
>> On Thursday, 9 August 2012 15:35:28 UTC+1, Justin Lebar wrote:
>>> Is there a plan to mitigate the coalescing on m-i? It seems like that
>>> is a big part of the problem.
>>
>> Reducing the amount of coalescing permitted would just mean we end up
>> with a backlog of pending tests on the repo tip - which would result
>> in tree closures regardless. So other than bug 690672 making sheriffs'
>> lives easier, we just need more machines in the test pool - since it's
>> simply a case of demand exceeding capacity.
>>
>> The situation is made worse now that we're adding new platforms (OS X
>> 10.7, B2G GB, B2G ICS, Android Armv6, soon OS X 10.8, Win8 desktop,
>> Win8 metro) faster than we're EOLing them - and we're pushing more
>> changes per day than ever before [1]. From what I understand, Apple's
>> aggressive hardware cycle is also making it difficult to expand the
>> test pool [2].
>
> Is there a tracking bug for areas where we could gain efficiency? We all
> know the build phase is full of clownshoes. But, I believe we also do
> silly things like execute some tests serially, only taking advantage of
> 1/N CPU cores in the process. This is just wasting resources. See [1]
> for a concrete example.

Last year we had a buildfaster project to try and improve our end-to-end
build/test times:

https://wiki.mozilla.org/ReleaseEngineering/BuildFaster

I think it's been recently reactivated, I believe mostly with the
intention of working on build times (which is important, but only one
small part of the overall picture):

http://coop.deadsquid.com/2012/07/reviving-buildfaster-fixing-makefiles/

In general I would be very careful before tackling any particular bug
for the sake of improving our build/test times. If something is slow,
but not on the critical path as far as build/test is concerned, fixing
it will not result in any tangible improvement.

When I was working on this project last year, I designed a build charts
view to help visualize which parts were taking the longest (you can see
implicit dependencies between build/test tasks by seeing when certain
jobs run), which proved very helpful to determine which areas we needed
to optimize:

http://brasstacks.mozilla.com/gofaster/#/buildcharts

I'm not sure if the data feeding into that is still valid (some things
look suspiciously low, and at the very least it doesn't seem
completely up to date). Anyway, if I were going to look into this again
(I don't have time right now, unfortunately), I would first spend a lot
of time staring at data. :)

Will

Jonas Sicking

Aug 15, 2012, 6:22:10 PM
to Justin Lebar, Ed Morley, dev-pl...@lists.mozilla.org, John O'Duinn
On Tue, Aug 14, 2012 at 12:38 PM, Justin Lebar <justin...@gmail.com> wrote:
>>> Is there a plan to mitigate the coalescing on m-i? It seems like that
>>> is a big part of the problem.
>>
>> it's simply a case of demand exceeding capacity.
>
> Understood.
>
> But I think my question still stands: Is there a plan to address the
> fact that we do not have capacity to run all the tests we need to run?
>
> It sounds like [2] the answer is no, for at least the medium-term,
> because releng is busy deploying Mac 10.8 and Windows 8.

Would it be a win if we changed how we prioritize tryserver runs? Could
we make pushes which run lots of tests, or run on lots of platforms,
have lower priority than pushes which only run on one platform with
just a small number of tests?

Of course, it's likely releng that would implement this, and they are
the ones currently swamped.

/ Jonas

Gregory Szorc

Aug 15, 2012, 7:08:48 PM
to William Lachance, dev-pl...@lists.mozilla.org
On 8/15/12 3:17 PM, William Lachance wrote:
> In general I would be very careful before tackling any particular bug
> for the sake of improving our build/test times. If something is slow,
> but not on the critical path as far as build/test is concerned, fixing
> it will not result in any tangible improvement.
>
> When I was working on this project last year, I designed a build charts
> view to help visualize which parts were taking the longest (you can see
> implicit dependencies between build/test tasks by seeing when certain
> jobs run), which proved very helpful to determine which areas we needed
> to optimize:
>
> http://brasstacks.mozilla.com/gofaster/#/buildcharts

Very nice. If you are accepting feature requests, I think the most
helpful would be checkboxes to filter hardware platforms. It's kind of
hard sorting through everything when all the platforms are mixed together.

I would also like to see hardware utilization in this chart somehow. If
a build step is consuming all local hardware resources (mainly CPU and
I/O), that is a completely different optimization strategy from one
where we are not fully utilizing local capacity or are waiting on
external resources, such as those on a network.

Ehsan Akhgari

Aug 15, 2012, 9:05:53 PM
to dev-pl...@lists.mozilla.org
On 12-08-15 6:10 PM, Gregory Szorc wrote:
> On 8/15/12 2:52 AM, Aryeh Gregor wrote:
>> On Tue, Aug 14, 2012 at 10:47 PM, Gregory Szorc <g...@mozilla.com> wrote:
>>> Is there a tracking bug for areas where we could gain efficiency? We all
>>> know the build phase is full of clownshoes. But, I believe we also do
>>> silly
>>> things like execute some tests serially, only taking advantage of 1/N
>>> CPU
>>> cores in the process. This is just wasting resources. See [1] for a
>>> concrete
>>> example.
>>
>> Don't we execute *all* tests serially?
>
> Outside of splitting some suites into chunks and running on different
> builders, I'm pretty sure we do.

Yes, that's true.

>> Many of our tests require
>> focus, so you can't do two runs in parallel on the same desktop. In
>> theory we could specially flag the ones that don't need focus, and
>> make sure to always run them without focus -- that would probably be
>> most of the tests. Then those could be run in parallel. They could
>> also be run in the background on developer machines, which would be
>> nice. This would require a bunch of developer work.
>
> There are test suites like xpcshell and js reftests that AFAIK don't
> have special hardware constraints (like the visual tests do). They can
> at least be executed with full process + profile isolation. In some
> cases, we could probably re-use processes to avoid that overhead. I view
> each test suite as independent and I'm sure there are some low-hanging
> fruits in there.

There are xpcshell tests which do things like file system operations
that might race against other tests if we run them in parallel. Of
course that's something that could be fixed, but it is not necessarily
low-hanging fruit (unless someone tries this and proves me wrong, which
I will be happy to see!).

Tests in the rest of our framework might implicitly depend on the
execution environment left behind by previous tests. This is bad
practice for sure, but it's the fact of the matter - which is why the
focus issue is not the only thing that needs to be fixed if we're
looking at parallelizing other test runs.

That all being said, it would be *fantastic* if we could pull this off at
some point. It just requires engineering and releng resources.

Cheers,
Ehsan

Ehsan Akhgari

Aug 15, 2012, 9:16:04 PM
to William Lachance, dev-pl...@lists.mozilla.org
This looks great, William. But looking at how our load has been for the
past few weeks, I think we're not going to benefit much from incremental
improvements to end-to-end times.

Honestly, the only big thing that we can probably fix to improve our
end-to-end times is enabling pymake on our Windows builders to do
parallel builds. Developers on Windows have been using pymake to get
parallel builds for quite a while now; somebody needs to figure out
what's happening on our build machines that prevents us from using
pymake there, and fix it. That should significantly decrease our
Windows build times, depending on the number of cores available on our
Windows builders.
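
(For reference, the local invocation looks something like this - the job count is illustrative:)

  # pymake ships in the tree; -jN controls the build parallelism.
  python build/pymake/make.py -j8 -f client.mk build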

Any other low-hanging fruit that I can think of amounts to small
incremental improvements which, although very nice, stand no chance
against the rate at which our load is increasing. So unfortunately I
don't see any way to address the problem we're facing in the short term
except adding hardware.

Cheers,
Ehsan

Gregory Szorc

Aug 15, 2012, 9:18:44 PM
to Ehsan Akhgari, dev-pl...@lists.mozilla.org, William Lachance
On 8/15/12 6:16 PM, Ehsan Akhgari wrote:
> Honestly, the only big thing that we can probably fix to improve our
> end-to-end times is enabling pymake on our Windows builders to do
> parallel builds. Developers on Windows have been using pymake to get
> parallel builds for quite a while now; somebody needs to figure out
> what's happening on our build machines that prevents us from using
> pymake there, and fix it. That should significantly decrease our
> Windows build times, depending on the number of cores available on our
> Windows builders.

https://bugzilla.mozilla.org/show_bug.cgi?id=593585 has seen lots of
activity in the last few weeks. pymake builds on the Windows builders
are near.

Ehsan Akhgari

Aug 15, 2012, 9:22:49 PM
to Gregory Szorc, dev-pl...@lists.mozilla.org, William Lachance
This almost made my eyes full of happy tears! <3 to the people working
on this!

Ehsan

Mike Hommey

Aug 16, 2012, 2:10:00 AM
to Ehsan Akhgari, dev-pl...@lists.mozilla.org, William Lachance
On Wed, Aug 15, 2012 at 09:16:04PM -0400, Ehsan Akhgari wrote:
> On 12-08-15 6:17 PM, William Lachance wrote:
> This looks great, William. But looking at how our load has been for
> the past few weeks, I think we're not going to benefit much from
> incremental improvements to end-to-end times.
>
> Honestly, the only big thing that we can probably fix to improve our
> end-to-end times is enabling pymake on our Windows builders
> to do parallel builds. Developers on Windows have been using pymake
> to get parallel builds for quite a while now; somebody needs to
> figure out what's happening on our build machines that prevents us
> from using pymake there, and fix it. That should
> significantly decrease our Windows build times, depending on the
> number of cores available on our Windows builders.
>
> Any other low-hanging fruit that I can think of amounts to small
> incremental improvements which, although very nice,
> stand no chance against the rate at which our load is increasing.
> So unfortunately I don't see any way to address the problem
> we're facing in the short term except adding hardware.

Something I noticed recently is that we spend more than 5 minutes (!)
during windows clobber builds to do the clobber (rm -rf). All try builds
are clobbers. A lot of time is wasted on mercurial cloning, too.

What is interesting is that the corresponding times are in the order of
seconds on linux and osx. We're just hitting the fact that windows sucks
at I/O.

But maybe we can work around this. At least for rm -rf, instead of
rm -rf'ing before the build, we could move the objdir away so that a
fresh new one is created. The older one could be removed much later.
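
Roughly this, in other words (paths illustrative):

  # A rename on the same filesystem is near-instant; the expensive
  # deletion then happens off the critical path.
  mv "$OBJDIR" "$OBJDIR.old.$$"
  # ... build into a fresh $OBJDIR, and only much later:
  rm -rf "$OBJDIR".old.*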

Mike

Gregory Szorc

Aug 16, 2012, 2:24:41 AM
to Mike Hommey, Ehsan Akhgari, dev-pl...@lists.mozilla.org, jhf...@mozilla.com, William Lachance
On 8/15/12 11:10 PM, Mike Hommey wrote:
> Something I noticed recently is that we spend more than 5 minutes (!)
> during windows clobber builds to do the clobber (rm -rf). All try builds
> are clobbers. A lot of time is wasted on mercurial cloning, too.
>
> What is interesting is that the corresponding times are in the order of
> seconds on linux and osx. We're just hitting the fact that windows sucks
> at I/O.

That is an over-generalization. I/O on Windows itself does not suck. I/O
on Windows sucks when you are using the POSIX APIs instead of the Win32
ones.

And, I'm willing to bet that rm (along with most of the GNU tools in our
MozillaBuild environment) is using the POSIX APIs or is at least not
using the most optimal Win32 API for the desired task.

A few months back, John Ford wrote a standalone win32 executable that
used the proper APIs to delete an entire directory. I think he said that
it deleted the object directory 5-10x faster or something. No clue what
happened with that.

Mike Hommey

Aug 16, 2012, 2:41:45 AM
to Gregory Szorc, Ehsan Akhgari, dev-pl...@lists.mozilla.org, jhf...@mozilla.com, William Lachance
On Wed, Aug 15, 2012 at 11:24:41PM -0700, Gregory Szorc wrote:
> On 8/15/12 11:10 PM, Mike Hommey wrote:
> >Something I noticed recently is that we spend more than 5 minutes (!)
> >during windows clobber builds to do the clobber (rm -rf). All try builds
> >are clobbers. A lot of time is wasted on mercurial cloning, too.
> >
> >What is interesting is that the corresponding times are in the order of
> >seconds on linux and osx. We're just hitting the fact that windows sucks
> >at I/O.
>
> That is an over-generalization. I/O on Windows itself does not suck.
> I/O on Windows sucks when you are using the POSIX APIs instead of
> the Win32 ones.

Removing thousands of files and gigabytes of data on Windows is
slow. That's a fact.
For example, see http://superuser.com/questions/19762/mass-deleting-files-in-windows/289399

(And if you /really/ want to waste time, try deleting files from the
file explorer instead of the command line)

> And, I'm willing to bet that rm (along with most of the GNU tools in
> our MozillaBuild environment) is using the POSIX APIs or is at least
> not using the most optimal Win32 API for the desired task.
>
> A few months back, John Ford wrote a standalone win32 executable
> that used the proper APIs to delete an entire directory. I think he
> said that it deleted the object directory 5-10x faster or something.
> No clue what happened with that.

I wish this were true, but I seriously doubt it. I can buy that it's
faster, but not 5-10 times so.

Mike

Nicholas Nethercote

Aug 16, 2012, 3:03:22 AM
to Mike Hommey, William Lachance, Ehsan Akhgari, dev-pl...@lists.mozilla.org, jhf...@mozilla.com, Gregory Szorc
On Wed, Aug 15, 2012 at 11:41 PM, Mike Hommey <m...@glandium.org> wrote:
>>
>> A few months back, John Ford wrote a standalone win32 executable
>> that used the proper APIs to delete an entire directory. I think he
>> said that it deleted the object directory 5-10x faster or something.
>> No clue what happened with that.
>
> I wish this were true, but I seriously doubt it. I can buy that it's
> faster, but not 5-10 times so.

http://blog.johnford.org/writting-a-native-rm-program-for-windows/
says that it deleted a mozilla-central clone 3x faster.

https://bugzilla.mozilla.org/show_bug.cgi?id=727551 is the bug
tracking it; there appear to be some problems blocking progress.

Nick

Jason Duell

Aug 16, 2012, 3:17:14 AM
to dev-pl...@lists.mozilla.org
On 08/16/2012 12:03 AM, Nicholas Nethercote wrote:
> On Wed, Aug 15, 2012 at 11:41 PM, Mike Hommey <m...@glandium.org> wrote:
>>> A few months back, John Ford wrote a standalone win32 executable
>>> that used the proper APIs to delete an entire directory. I think he
>>> said that it deleted the object directory 5-10x faster or something.
>>> No clue what happened with that.
>> I wish this were true, but I seriously doubt it. I can buy that it's
>> faster, but not 5-10 times so.
> http://blog.johnford.org/writting-a-native-rm-program-for-windows/
> says that it deleted a mozilla-central clone 3x faster.

And renaming the directory (then deleting it in parallel with the build,
or later) ought to be some power of ten faster than that, at least from
the build-time perspective. At least if you don't do anything expensive
like our nsIFile NTFS renaming goopage (that traverses the directory
tree making sure NTFS ACLs are preserved for all files). Which most
versions of 'rm' aren't going to do, I'd guess.

Jason

Robert O'Callahan

Aug 16, 2012, 5:57:49 AM
to Jason Duell, dev-pl...@lists.mozilla.org
Whenever I need to delete a large directory on Windows I always move it to
a junk directory and then rm -rf the junk directory in the background. It
saves a lot of time.

Rob
--
“You have heard that it was said, ‘Love your neighbor and hate your enemy.’
But I tell you, love your enemies and pray for those who persecute you,
that you may be children of your Father in heaven. ... If you love those
who love you, what reward will you get? Are not even the tax collectors
doing that? And if you greet only your own people, what are you doing more
than others?" [Matthew 5:43-47]

Ben Hearsum

Aug 16, 2012, 9:18:11 AM
to Mike Hommey, Ehsan Akhgari, William Lachance
On 08/16/12 02:10 AM, Mike Hommey wrote:
> But maybe we can work around this. At least for rm -rf, instead of
> rm -rf'ing before the build, we could move the objdir away so that a
> fresh new one is created. The older one could be removed much later.

I don't think this would be any more than a one-time win until the disk
fills up. At the start of each job we ensure there's enough space to do
the current job. By moving the objdir away we'd avoid doing any cleanup
until we need more space than is available. After that, each job
would still end up removing roughly one old objdir to free enough
space to run.

A common technique for dealing with this on Windows is to have a
dedicated partition for the builds, and to format it on start-up rather
than deleting things, because a quick format is much quicker than
a delete. I don't think it's something RelEng could implement quickly,
but it might be worth looking at in the longer term.
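
(Roughly, assuming the builds lived on a dedicated E: partition - the drive letter is made up:)

  REM Quick-format the build partition instead of deleting directory trees.
  REM /Q = quick format, /Y = skip the confirmation prompt.
  format E: /FS:NTFS /Q /Y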

Aryeh Gregor

Aug 16, 2012, 9:23:54 AM
to Ben Hearsum, Ehsan Akhgari, dev-pl...@lists.mozilla.org, William Lachance
On Thu, Aug 16, 2012 at 4:18 PM, Ben Hearsum <bhea...@mozilla.com> wrote:
> I don't think this would be any more than a one-time win until the disk
> fills up. At the start of each job we ensure there's enough space to do
> the current job. By moving the objdir away we'd avoid doing any cleanup
> until we need more space than is available. After that, each job
> would still end up removing roughly one old objdir to free enough
> space to run.

Why can't you move it, then spawn a background thread to remove it at
minimum priority? IIUC, Vista and later support I/O prioritization,
and the lowest priority will throttle down to two I/Os a second if
other I/O is happening. Or are the build slaves already I/O-saturated?
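
Something like this from cmd would be a crude approximation (a sketch - "objdir.old" is illustrative, and start /LOW only lowers CPU priority; true background I/O priority needs PROCESS_MODE_BACKGROUND_BEGIN, which plain cmd can't request):

  REM Hypothetical: delete the moved-aside objdir in a low-priority
  REM background task.
  start /B /LOW cmd /C "rd /S /Q objdir.old"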

Ben Hearsum

Aug 16, 2012, 9:30:14 AM
to Aryeh Gregor, Ehsan Akhgari, William Lachance
I hadn't considered using a background thread to remove it. During
pulling/updating we're I/O-saturated; I'm not sure about during compile.
Implementing this would be very tricky, though... the way the build works
is by executing commands serially, so I'm not sure how we'd do this in
parallel with compilation. There's probably a way, but we'd have to be
reasonably sure it's useful before diving deeper, I think.

Mike Hommey

Aug 16, 2012, 9:33:45 AM
to Ben Hearsum, Ehsan Akhgari, dev-pl...@lists.mozilla.org, William Lachance
On Thu, Aug 16, 2012 at 09:18:11AM -0400, Ben Hearsum wrote:
> On 08/16/12 02:10 AM, Mike Hommey wrote:
> > But maybe we can work around this. At least for rm -rf, instead of
> > rm -rf'ing before the build, we could move the objdir away so that a
> > fresh new one is created. The older one could be removed much later.
>
> I don't think this would be any more than a one-time win until the disk
> fills up. At the start of each job we ensure there's enough space to do
> the current job. By moving the objdir away we'd avoid doing any cleanup
> until we need more space than is available. After that, each job
> would still end up removing roughly one old objdir to free enough
> space to run.

If the cleanup happened at the end of the build, rather than at the
beginning, tests could start earlier.

Mike

Ben Hearsum

Aug 16, 2012, 9:38:37 AM
to Mike Hommey, Ehsan Akhgari, dev-pl...@lists.mozilla.org, William Lachance
Good point. This won't help much when we have long wait times, but I
filed https://bugzilla.mozilla.org/show_bug.cgi?id=783253 on it.

Robert Kaiser

Aug 16, 2012, 1:33:53 PM
Gregory Szorc wrote:
> On 8/15/12 11:10 PM, Mike Hommey wrote:
>> What is interesting is that the corresponding times are in the order of
>> seconds on linux and osx. We're just hitting the fact that windows sucks
>> at I/O.
>
> That is an over-generalization. I/O on Windows itself does not suck. I/O
> on Windows sucks when you are using the POSIX APIs instead of the Win32
> ones.

From all I've heard so far, the truth lies somewhere between your
position and Mike's: I/O on Windows sucks, but it sucks even more when
you are using POSIX APIs on top of it.

An interesting data point is that the Wine team found that running tests
involving file/disk I/O is significantly slower on native Windows than
on Wine-on-Linux on the same hardware. This implies that Windows I/O
really sucks all by itself (and I know from my own experience how
painful it is even with native Windows applications to delete larger
trees - even more so when they are VMs, which we have eliminated from
our build pools nowadays). Emulating POSIX on top of that already slow
I/O makes it even worse.

Robert Kaiser

William Lachance

Aug 16, 2012, 4:32:26 PM
On 08/15/2012 07:08 PM, Gregory Szorc wrote:
>>
>> When I was working on this project last year, I designed a build charts
>> view to help visualize which parts were taking the longest (you can see
>> implicit dependencies between build/test tasks by seeing when certain
>> jobs run), which proved very helpful to determine which areas we needed
>> to optimize:
>>
>> http://brasstacks.mozilla.com/gofaster/#/buildcharts
>
> Very nice. If you are accepting feature requests, I think the most
> helpful would be checkboxes to filter hardware platforms. It's kind of
> hard sorting through everything when all the platforms are mixed together.

We have a bugzilla component for filing these sorts of things (though
note that AFAIK no one's actively working on the dashboard atm):

https://bugzilla.mozilla.org/enter_bug.cgi?component=GoFaster&product=Testing

I do agree that more filtering options would be useful. I think the
first thing to do would be to confirm the data in these charts is valid
though.

> I would also like to see hardware utilization in this chart somehow. If
> a build step is consuming all local hardware resources (mainly CPU and
> I/O), that is a completely different optimization strategy from one
> where we are not fully utilizing local capacity or are waiting on
> external resources, such as those on a network.

I'm not sure if this works at all anymore, but it used to be that you
could click on a particular build to get the breakdown of the amount of
time spent on any particular step. We could certainly do a similar thing
with hardware utilization -- it's just a matter of getting the
information available somewhere we can access it (we used Elasticsearch
for the build steps).

Will

Jason Duell

Aug 17, 2012, 12:31:23 AM
to dev-pl...@lists.mozilla.org
On 08/16/2012 06:23 AM, Aryeh Gregor wrote:
> On Thu, Aug 16, 2012 at 4:18 PM, Ben Hearsum <bhea...@mozilla.com> wrote:
>> I don't think this would be any more than a one-time win until the disk
>> fills up. At the start of each job we ensure there's enough space to do
>> the current job. By moving the objdir away we'd avoid doing any cleanup
>> until we need more space than is available. After that, each job
>> would still end up removing roughly one old objdir to free enough
>> space to run.
> Why can't you move it, then spawn a background thread to remove it at
> minimum priority? IIUC, Vista and later support I/O prioritization,

Brian Bondy just added I/O prioritization to our code that removes
corrupt HTTP caches, in bug 773518, in case that code helps.

Jason


Mark Hammond

Aug 17, 2012, 3:15:55 AM
On 16/08/2012 4:10 PM, Mike Hommey wrote:
...
> Something I noticed recently is that we spend more than 5 minutes (!)
> during windows clobber builds to do the clobber (rm -rf). All try builds
> are clobbers.

IME, "rd /s/q" is usually much faster than "rm -rf" - using "cmd /c rd
/s/q obj-xxx" might be worth investigating...

Mark
