
Preparing for the next Windows PGO build memory exhaustion


Mike Hommey

Apr 13, 2013, 4:28:47 AM
to dev-pl...@lists.mozilla.org
Hi,

For almost three months now, we've had graphs following the amount of
memory used by the linker on Windows builders during PGO builds. The
result can be seen here:

http://graphs.mozilla.org/graph.html#tests=[[205,63,8]]&sel=none&displayrange=90&datatype=running

The first thing to notice in here is the 13 spikes down. The last one is
bug 860371. I wasn't aware of any of the 12 others. It might be worth
looking into them to understand why they happen. Interestingly, my
dev-tree-management archive doesn't show any notification for these
(except for the last one), nor for any of the progressive "regressions".

The second thing to notice is the graph starts a little over 3.2GB and
ends a little below 3.6GB, for a 360MB growth in less than three months.
At this pace, we'll run out of address space around June or July.
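
A back-of-the-envelope check of that estimate, assuming roughly linear
growth and a 4GB ceiling (the address space of a large-address-aware
32-bit process; the exact limit is an assumption here):

    GB = 1024 ** 3
    start, end = 3.2 * GB, 3.6 * GB      # the graph's endpoints, ~90 days apart
    growth_per_day = (end - start) / 90  # roughly 4.5MB/day
    days_left = (4 * GB - end) / growth_per_day
    print("about %.0f days until the 4GB limit" % days_left)  # ~90: June/July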

So, it's this time of year again. But for once, we can get things in order
before they blow up, not when it's too late and we have to rush things.

And as bug 860371 reminded us (look for recent massive regressions across
the board on dev-tree-management), PGO is a big deal. Note bug 860371
only removed the data used by the compiler during PGO builds, so link
time code generation was still happening. But we already knew LTCG alone
wasn't much of a win.

We need to look back (and looking around the times where the graph jumps
up might be good starting points) and see what accounts for this
growth. I suspect part of it is due to newly imported code. Possibly, this
new code might not need to be PGOed. There may be other areas that can
be unPGOed without much of an impact, like we did last time.
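
As a starting point for that archaeology, here is a minimal sketch (not
graph server code; the input format is assumed) that flags the
build-to-build jumps worth cross-referencing against the pushlog:

    def find_jumps(series, threshold=20 * 2**20):
        """series: list of (changeset, linker_memory_bytes) in build order.
        Flag any build-to-build increase above the threshold (20MB here)."""
        for (prev_cs, prev_mem), (cs, mem) in zip(series, series[1:]):
            delta = mem - prev_mem
            if delta > threshold:
                print("+%.0fMB between %s and %s"
                      % (delta / 2.0**20, prev_cs, cs))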

Are there any volunteers?

I think we need to start thinking how to make PGO opt-in instead of
opt-out, while keeping performance where it is now.

We also need to ensure we do get regression notifications on
dev-tree-management. If I hadn't looked at the graph after bug 860371
blew things up, I wouldn't have noticed we were getting into the dangerous
zone again.

Mike

Mike Hommey

Apr 13, 2013, 4:59:59 AM
to dev-pl...@lists.mozilla.org
On Sat, Apr 13, 2013 at 10:28:47AM +0200, Mike Hommey wrote:
> I think we need to start thinking how to make PGO opt-in instead of
> opt-out, while keeping performance where it is now.

In fact, I'm wondering if at this point it wouldn't just make sense to
start the other way around, that is, to start from nothing PGOed, and add
PGO to directories or individual files until we're back to the same
performance for what we care about.
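
To make the shape of that concrete (a toy sketch; the names and the
starting set are hypothetical, not the actual build machinery):

    def should_pgo(path, optin_dirs):
        # Opt-in model: only explicitly listed directories get PGO flags;
        # everything else is built without profile-guided optimization.
        return any(path == d or path.startswith(d + "/") for d in optin_dirs)

    # Start from an empty set, then add hot directories back one at a
    # time until Ts/Tp5 match the fully-PGOed build, e.g.:
    PGO_OPTIN = {"js/src", "layout/base"}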

Mike

Asa Dotzler

Apr 13, 2013, 11:40:12 AM
On 4/13/2013 1:59 AM, Mike Hommey wrote:
> On Sat, Apr 13, 2013 at 10:28:47AM +0200, Mike Hommey wrote:
>> I think we need to start thinking how to make PGO opt-in instead of
>> opt-out, while keeping performance where it is now.

I have a really basic question. Are PGO's performance gains something
users are actually going to notice, or are we mostly talking about
synthetic benchmark pissing contests here? What's PGO's impact on
start-up time or new window time or the loading of a Twitter or Facebook
web page?

I've also seen over the years that we have a really difficult time
diagnosing and fixing some high profile crashes because of PGO and if we
can eliminate those or make them easier to diagnose and fix and all it
costs us is some points on SunSpider, I'm wondering if it doesn't make
sense to just drop PGO and focus on finding performance wins elsewhere.

- A

Kyle Huey

Apr 13, 2013, 12:01:57 PM
to Asa Dotzler, dev-pl...@lists.mozilla.org
On Sat, Apr 13, 2013 at 8:40 AM, Asa Dotzler <a...@mozilla.com> wrote:

> I have a really basic question. Are PGO's performance gains something users
> are actually going to notice, or are we mostly talking about synthetic
> benchmark pissing contests here?


As we saw when we accidentally disabled it the other day, it shows up all
over our performance tests.


> What's PGO's impact on start-up time or new window time or the loading of
> a Twitter or Facebook web page?
>

I don't know about Twitter or Facebook specifically, but we do have
measurements for our talos pageset.

Some examples of what we saw disabling PGO:

Ts, Paint - XP - 16.3% increase (Startup)
Ts Paint, MAX Dirty Profile - XP - 16% increase (Startup)
Ts, Paint - Win7 - 15.3% increase (Startup)
Ts Paint, MED Dirty Profile - Win7 - 15.5% increase (Startup)
Ts Paint, MAX Dirty Profile - Win7 - 14% increase (Startup)

Tp5 Optimized - XP - 23.9% increase (Pageload)
Tp5 Optimized Responsiveness - XP - 42.3% increase (Pageload)
Tp5 Optimized (%CPU) - XP - 7.68% increase (Pageload)
Tp5 Optimized - Win7 - 21.9% increase (Pageload)
Tp5 Optimized Responsiveness - Win7 - 38.9% increase (Pageload)
Tp5 Optimized (%CPU) - Win7 - 5.36% increase (Pageload)

15% startup regressions and 20% pageload regressions aren't anything to
sneeze at.

> I've also seen over the years that we have a really difficult time
> diagnosing and fixing some high profile crashes because of PGO and if we
> can eliminate those or make them easier to diagnose and fix and all it
> costs us is some points on SunSpider, I'm wondering if it doesn't make
> sense to just drop PGO and focus on finding performance wins elsewhere.


Amusingly enough, SunSpider is probably the benchmark least affected by
PGO because it spends all its time in JIT code. But if Ts and Tp5 are models
of what users see then it costs us a lot more than a few Sunspider points.
On the other hand, if they aren't good models then we essentially have no
performance testing ...

- Kyle

Robert O'Callahan

Apr 14, 2013, 7:02:26 PM
to Asa Dotzler, dev-pl...@lists.mozilla.org
On Sun, Apr 14, 2013 at 3:40 AM, Asa Dotzler <a...@mozilla.com> wrote:

> I have a really basic question. Are PGO's performance gains something users
> are actually going to notice, or are we mostly talking about synthetic
> benchmark pissing contests here?
>

It seems to me that benchmark results affect the beliefs of at least some
users.

Rob
--
“If you love those who love you, what credit is that to you? Even
sinners love those who love them. And if you do good to those who are
good to you, what credit is that to you? Even sinners do that.”

Ehsan Akhgari

Apr 15, 2013, 9:40:03 PM
to Mike Hommey, dev-pl...@lists.mozilla.org
On 2013-04-13 4:28 AM, Mike Hommey wrote:
> Hi,
>
> For almost three months now, we've had graphs following the amount of
> memory used by the linker on Windows builders during PGO builds. The
> result can be seen here:
>
> http://graphs.mozilla.org/graph.html#tests=[[205,63,8]]&sel=none&displayrange=90&datatype=running

/me shivers

> The first thing to notice in here is the 13 spikes down. The last one is
> bug 860371. I wasn't aware of any of the 12 others. It might be worth
> looking into them to understand why they happen. Interestingly, my
> dev-tree-management archive doesn't show any notification for these
> (except for the last one), nor for any of the progressive "regressions".

This made me curious. I went ahead and looked at 4 of those spikes.
They included the following code changes:

<http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=0531bbbb0ee1&tochange=459afca0e391>
<http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=d75e34da1a9f&tochange=336b6586074e>
<http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=7ac3f76249e7&tochange=d764382ed4cf>
<http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=30b977b2b911&tochange=4d7684259549>

None of them contained any obvious reasons why there should be such a
significant memory usage difference when linking PGO builds. *But*
there was something common in all four of them: they _all_ included
changes to the build system which would cause most of the tree to be
rebuilt, respectively:

configure.in change:
<http://hg.mozilla.org/integration/mozilla-inbound/rev/2a450024df4e>
removal of empty Makefile.in's:
<http://hg.mozilla.org/integration/mozilla-inbound/rev/336b6586074e>
configure.in change:
<http://hg.mozilla.org/integration/mozilla-inbound/rev/d764382ed4cf>
configure.in change:
<http://hg.mozilla.org/integration/mozilla-inbound/rev/cf75954e488f>

Is this merely a correlation?

> The second thing to notice is the graph starts a little over 3.2GB and
> ends a little below 3.6GB, for a 360MB growth in less than three months.
> At this pace, we'll run out of address space around June or July.
>
> So, it's this time of year again. But for once, we can get things in order
> before they blow up, not when it's too late and we have to rush things.
>
> And as bug 860371 reminded us (look for recent massive regressions across
> the board on dev-tree-management), PGO is a big deal. Note bug 860371
> only removed the data used by the compiler during PGO builds, so link
> time code generation was still happening. But we already knew LTCG alone
> wasn't much of a win.
>
> We need to look back (and looking around the times where the graph jumps
> up might be good starting points) and see what accounts for this
> growth. I suspect part of it is due to newly imported code. Possibly, this
> new code might not need to be PGOed. There may be other areas that can
> be unPGOed without much of an impact, like we did last time.
>
> Are there any volunteers?

If nobody volunteers, I guess we're going to have to volunteer somebody.
;-) We can't go ahead and ignore this...

> I think we need to start thinking how to make PGO opt-in instead of
> opt-out, while keeping performance where it is now.

Doing that would make sense to me.

> We also need to ensure we do get regression notifications on
> dev-tree-management. If I hadn't looked at the graph after bug 860371
> blew things up, I wouldn't have noticed we were getting into the dangerous
> zone again.

Good idea, is there a bug on file for that?

Cheers,
Ehsan

Neil

Apr 16, 2013, 4:59:51 AM
Ehsan Akhgari wrote:

> there was something common in all four of them: they _all_ included
> changes to the build system which would cause most of the tree to be
> rebuilt

*snip*

> Is this merely a correlation?

Surely if it wasn't, then clobber builds would have the lowest memory
consumption?

--
Warning: May contain traces of nuts.

Jim Mathies

Apr 16, 2013, 6:39:30 AM
to dev-pl...@lists.mozilla.org
We also still have bug 845840 - "File a support request with ms on our pgo
problems". As soon as we sort out the account stuff we can file something.

Jim

Mike Hommey

Apr 16, 2013, 7:27:00 AM
to Jim Mathies, dev-pl...@lists.mozilla.org
I doubt we can get a satisfactory response from MS before things blow
up (if at all).

Mike

Ehsan Akhgari

Apr 16, 2013, 10:32:45 AM
to Neil, dev-pl...@lists.mozilla.org
On 2013-04-16 4:59 AM, Neil wrote:
> Ehsan Akhgari wrote:
>> Is this merely a correlation?
>
> Surely if it wasn't, then clobber builds would have the lowest memory
> consumption?

Well, clobber builds delete everything in your objdir, but these builds
just cause a whole lot to be rebuilt (I think).

Anyways, it would definitely be interesting to see if anyone can
replicate those spikes down.

Ehsan

Gervase Markham

Apr 17, 2013, 8:46:13 AM
On 16/04/13 12:27, Mike Hommey wrote:
> I doubt we can get a satisfactory response from MS before things blow
> up (if at all).

But if we'd asked them last time, we might have one by now. And if we
don't ask them this time, then we'll get to next time and still not have
one. :-)

Gerv


Matt Brubeck

Apr 17, 2013, 6:07:29 PM
to Mike Hommey
On 4/13/2013 1:28 AM, Mike Hommey wrote:
> The first thing to notice in here is the 13 spikes down. The last one is
> bug 860371. I wasn't aware of any of the 12 others. It might be worth
> looking into them to understand why they happen. Interestingly, my
> dev-tree-management archive doesn't show any notification for these
> (except for the last one), nor for any of the progressive "regressions".

analyze_talos.py intentionally ignores transient changes. Until
recently it required 5 new test runs to show a statistically significant
difference from the preceding 30 runs. (Bug 858877 changed this to 12
new runs and 12 preceding runs, but the concept is still the same.) So
it's expected that the one-time spikes were ignored.
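
For reference, a rough sketch of that windowed comparison (assumed
logic, not analyze_talos.py's actual code; the t cutoff is a made-up
number), including the 2% email threshold discussed in the next
paragraph:

    import math
    from statistics import mean, variance

    def classify(old_runs, new_runs, t_cutoff=9.0, pct_cutoff=2.0):
        """Welch's t-test between the preceding window (12 runs) and the
        new window (12 runs); a change is only mailed out if it is both
        statistically significant and larger than the percent threshold."""
        m_old, m_new = mean(old_runs), mean(new_runs)
        se = math.sqrt(variance(old_runs) / len(old_runs) +
                       variance(new_runs) / len(new_runs))
        t = abs(m_new - m_old) / se if se else float("inf")
        pct = 100.0 * (m_new - m_old) / m_old
        detected = t >= t_cutoff
        mailed = detected and abs(pct) >= pct_cutoff
        return detected, mailed, t, pct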

analyze_talos.py does identify several real regressions in that dataset
-- but it suppresses emails for them because each one is an increase of
less than 2% (see bug 822249). Here are regressions identified by the
latest version of analyze.py:

changeset mem (bytes) t-test % change
----------------------------------------
c1ee454506f6 3296860000 139.519 1.05%
0acac77dd920 3330290000 11.809 0.99%
e6ca584f4fe7 3358010000 9.753 0.86%
80d52655c8b8 3398300000 74.742 0.45%
d27445d1eac5 3450400000 26.214 0.99%
eaff15332579 3468970000 70.525 0.52%
0ff1755d6359 2799760000 263.117 -21.58%
1c690be31939 3586590000 11.241 28.06%

Perhaps we should modify bug 822249 to ignore the 2% threshold for
specific tests where we *do* care about small changes. Or for any
change with a very high t-test score. I filed this bug:

https://bugzilla.mozilla.org/show_bug.cgi?id=863061

Matt Brubeck

Apr 18, 2013, 3:10:53 PM
On 4/17/2013 3:07 PM, Matt Brubeck wrote:
> analyze_talos.py does identify several real regressions in that dataset
> -- but it suppresses emails for them because each one is an increase of
> less than 2% (see bug 822249). Here are regressions identified by the
> latest version of analyze.py:

In case anyone wants to investigate these, I've added regression ranges
below. These are automatically generated; I haven't double-checked them
manually.

Since we run PGO builds only a few times a day, the ranges can be large.
For those that include m-c merges, you could narrow them down using
the m-c data. WebIDL seems to be a common theme.

> changeset mem (bytes) t-test % change
> ----------------------------------------
> c1ee454506f6 3296860000 139.519 1.05%

http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=88543d623c3f&tochange=c1ee454506f6

(Includes several Paris bindings changes, among others)

> 0acac77dd920 3330290000 11.809 0.99%

http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=16ddbb6852ec&tochange=0acac77dd920

(More Paris binding changes, and others)

> e6ca584f4fe7 3358010000 9.753 0.86%

http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=a69f329fc7ee&tochange=e6ca584f4fe7

(includes the Win8 Metro merge)

> 80d52655c8b8 3398300000 74.742 0.45%

http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=4637a1449900&tochange=80d52655c8b8

(includes some WebIDL-related changes)

> d27445d1eac5 3450400000 26.214 0.99%

http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=92824d900e25&tochange=d27445d1eac5

> eaff15332579 3468970000 70.525 0.52%

http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=e38c5c346840&tochange=eaff15332579

(Olli Pettay - Bug 822399 - Make Event to use Paris bindings)

> Perhaps we should modify bug 822249 to ignore the 2% threshold for
> specific tests where we *do* care about small changes. Or for any
> change with a very high t-test score. I filed this bug:
> https://bugzilla.mozilla.org/show_bug.cgi?id=863061

A fix has been checked in, so we should start getting alerts for future
linker memory regressions of any size.

Matt Brubeck

Apr 18, 2013, 7:39:24 PM
On 4/18/2013 12:10 PM, Matt Brubeck wrote:
> Since we run PGO builds only a few times a day, the ranges can be large.
> For those that include m-c merges, you could narrow them down using
> the m-c data. WebIDL seems to be a common theme.

I filed https://bugzilla.mozilla.org/show_bug.cgi?id=863492 to
investigate whether Paris bindings are a particular contributor to
linker memory usage, and possible solutions.