At the end of last week our Windows PGO builds started failing on
mozilla-inbound (https://bugzilla.mozilla.org/show_bug.cgi?id=709193).
After some investigation we determined that the problem seems to be that
the linker is running out of virtual address space during the optimization
phase.
This is not the first time we've run into this problem (e.g. Bug 543034).
A couple years ago we hit the 2 GB virtual address space limit. The build
machines were changed to use /3GB and that additional GB of address space
bought us some time. This time unfortunately the options aren't as easy as
flipping a switch.
As a temporary measure, we've turned off or ripped out a few new pieces of
code (Graphite, SPDY, libreg) which has brought us back down under the
limit for the moment. We don't really know how much breathing space we
have (but it's probably pretty small).
Our three options at this point:
1) Make libxul smaller - Either by removing code entirely or by splitting
things into separate shared libraries.
2) Move to MSVC 2010 - We know that changesets that reliably failed to link
on MSVC 2005 linked successfully with MSVC 2010. What we don't know is how
much this helps (I expect the answer is somewhere between a lot and a
little). We can't really do this for (at the bare minimum) a couple more
weeks anyways due to product considerations about what OSs we support.
3) Do our 32 bit builds on machines running a 64 bit OS. This will allow
the linker to use 4 GB of address space.
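
A 32-bit Windows process normally gets address space beyond 2 GB only if its executable is marked large-address-aware. The sketch below shows one way to check that bit on a PE image such as the linker binary. It assumes the standard PE/COFF header layout; the function name is illustrative and nothing here is taken from the build system.

```python
import struct

# Bit in the COFF header's Characteristics field that marks an image as
# able to use addresses above 2 GB.
IMAGE_FILE_LARGE_ADDRESS_AWARE = 0x0020


def is_large_address_aware(image: bytes) -> bool:
    """Return True if a PE image sets the large-address-aware bit."""
    if image[:2] != b"MZ":
        raise ValueError("not a PE image: missing MZ signature")
    # The DOS header stores the offset of the PE signature at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("not a PE image: missing PE signature")
    # The 20-byte COFF header follows the 4-byte signature; Characteristics
    # is its last 2-byte field, 18 bytes in.
    (characteristics,) = struct.unpack_from("<H", image, pe_offset + 4 + 18)
    return bool(characteristics & IMAGE_FILE_LARGE_ADDRESS_AWARE)
```

To use it, read the executable's bytes from disk (for example with `open(path, "rb").read()`, where `path` is wherever the toolchain happens to be installed) and pass them in.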
I think we need to pursue a combination of (1) in the short term and (3) in
the slightly less short term. Gal has some ideas on what we can do for (1)
that I'm investigating.
In the meantime, mozilla-inbound is closed, and mozilla-central is
restricted to approvals only. The only things currently allowed to land on
mozilla-central are:
- Test-only/NPOTB changes
- Changes that only touch SpiderMonkey (which is not part of libxul on
Windows, and thus not contributing to the problem).
- Changes that only touch other cpp code that doesn't end up in libxul
(cpp code in browser/, things like sqlite, angle, nss, nspr, etc).
- JS/XUL/HTML changes.
I'm hopeful that we can hack libxul enough to get the tree open
provisionally soon.
If needed, WebGL offers some opportunities for splitting stuff away from libxul:
- the ANGLE shader compiler can easily be split to a separate lib.
- so could probably the WebGL implementation itself.
The ANGLE implementation of OpenGL ES2 on top of D3D9 is already separate DLLs.
Notice that external libs are already dlopen'd when one creates a
WebGL context: libGL.so.1 on Linux, the ANGLE GLES2 and
D3DX/D3DCompiler DLLs on Windows, etc. So it wouldn't make a big
difference. The WebGL impl is 180 K:
> so, adding the ANGLE shader compiler, we'd probably have a library
> weighing around 300 K of code (file size would be bigger).
This sounds good. Can you please start on this? We aren't sure how much we have to take out to safely reopen the tree until we have a better fix (64-bit linker).
If any other module owners know of large chunks they can split out without affecting startup, please file bugs.
On Sun, Dec 11, 2011 at 09:53:33PM -0500, Benoit Jacob wrote:
> (Replying only to dev-platform)
> If needed, WebGL offers some opportunities for splitting stuff away from libxul:
> - the ANGLE shader compiler can easily be split to a separate lib.
> - so could probably the WebGL implementation itself.
> The ANGLE implementation of OpenGL ES2 on top of D3D9 is already separate DLLs.
> Notice that external lib's are dlopen'd already when one creates a
> WebGL context: libGL.so.1 on linux, the ANGLE GLES2 and
> D3DX/D3DCompiler DLLs on Windows, etc. So it wouldn't make a big
> difference. The WebGL impl is 180 K:
> 1) Make libxul smaller - Either by removing code entirely or by splitting
> things into separate shared libraries.
If we're going with this, we should take a look at what code is not in the hot startup path and split that out. AFAIK, the reason for linking everything into libxul was that startup is faster if we only need to open one library instead of several. If we split off parts we don't usually need at startup, we probably even make startup faster, because the library to be loaded is smaller, and we work around the Windows PGO limit as well.
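
The idea of keeping rarely used code out of the startup path can be sketched as a lazy loader: the library is opened only when one of its symbols is first touched. This is a Python illustration of the pattern, not how Gecko loads its libraries; `LazyLibrary` and any library path passed to it are hypothetical.

```python
import ctypes
from typing import Any, Callable, Optional


class LazyLibrary:
    """Defer opening a shared library until a symbol is first requested."""

    def __init__(self, path: str, loader: Callable[[str], Any] = ctypes.CDLL):
        self._path = path
        self._loader = loader  # ctypes.CDLL by default; injectable for tests
        self._handle: Optional[Any] = None

    def _load(self) -> Any:
        # Open the library on first use and cache the handle afterwards.
        if self._handle is None:
            self._handle = self._loader(self._path)
        return self._handle

    def __getattr__(self, name: str) -> Any:
        # Only reached for names not found on the instance itself,
        # i.e. symbols that live in the library.
        return getattr(self._load(), name)
```

Constructing a `LazyLibrary` costs nothing at startup; the cost of opening the file is paid by the first caller that actually needs the feature.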
> Our three options at this point:
> 1) Make libxul smaller - Either by removing code entirely or by splitting
> things into separate shared libraries.
> 2) Move to MSVC 2010 - We know that changesets that reliably failed to link
> on MSVC 2005 linked successfully with MSVC 2010. What we don't know is how
> much this helps (I expect the answer is somewhere between a lot and a
> little). We can't really do this for (at the bare minimum) a couple more
> weeks anyways due to product considerations about what OSs we support.
> 3) Do our 32 bit builds on machines running a 64 bit OS. This will allow
> the linker to use 4 GB of address space.
I'd like to propose
4) Stop doing PGO.
I think it's worth looking at what PGO is buying us these days. It costs a lot in terms of build times and therefore build machine capacity. It's also non-deterministic, which scares me a lot.
If we can determine where PGO is helping us, maybe move just those pieces into a distinct library where we can do PGO.
Whether or not it's the right thing to do long term, turning PGO off short term should let us quickly re-open the tree, correct? We can still go down the path of splitting big chunks out of libxul, and when we feel that's ready based on try results, we can turn PGO on again if need be.
I'll make this the main discussion topic for the engineering meeting tomorrow if people agree.
----- Original Message -----
From: "Chris AtLee" <cat...@mozilla.com>
To: dev-platf...@lists.mozilla.org
Sent: Monday, December 12, 2011 10:36:11 AM
Subject: Re: Gecko Is Too Big (Or, Why the Tree Is Closed)
On 11/12/11 09:27 PM, Kyle Huey wrote:
On 12 Dec 2011, at 15:47, Jean-Paul Rosevear wrote:
> Whether or not its the right thing to do long term, turning it off short term should let us quickly re-open the tree, correct?
Yes, IMO. This would mean shipping non-PGO nightlies for the time being, which would presumably result in a significant perf regression, but one that we'd expect to recover when we update the build systems and can re-enable PGO.
I think some folk are concerned that we might land stuff in the meantime that seems fine on non-PGO builds/tests, but then fails under PGO when we eventually re-enable it. While that's a risk, I think it's a relatively small one, and we should accept it at this point as better than keeping m-c closed to most C++ development for an extended period.
I assume PGO is continuing to work as expected on mozilla-beta and mozilla-aurora trees, and so we have a breathing space before this problem hits the release channel. We'll need to disable PGO on aurora when the next m-c merge happens (unless we have overcome the problem by then), but I think we could live with that; we should aim to have a solution deployed before mozilla11 hits beta, however, so that we have the beta period to resolve any unexpected PGO-related failures that might crop up before this version goes to release.
So that gives us until the end of January to get the new compiler deployed, move to 64-bit builders, or whatever solution(s) we're going to use, or about 7 weeks, minus the Christmas and New Year holiday season.
> We can still go down the splitting out of big libxul chunks path and when we feel that's ready based on try results, we can turn PGO on again if need be.
> I'll make this the main discussion topic for the engineering meeting tomorrow if people agree.
PGO is a large performance win. TP5 goes from 400 to 330, a speedup of 1.2x.
> [PGO is] also non-deterministic, which scares me a lot.
Are you sure non-pgo is deterministic? :) Not that you shouldn't be
scared by the extra non-determinism in PGO.
> If we can determine where PGO is helping us, maybe move just those pieces into a distinct library where we can
> do PGO.
This isn't a bad idea, but we need to be careful. "Where PGO is
helping us" doesn't mean "code we can compile without PGO without
causing a regression on our performance tests." Our performance tests
are hardly comprehensive.
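
For reference, the arithmetic behind the TP5 figures quoted above (400 without PGO, 330 with PGO, lower is better) can be written out as a small sketch. The helper names are illustrative only.

```python
def speedup(before: float, after: float) -> float:
    """Speedup factor for a lower-is-better metric such as page load time."""
    return before / after


def percent_reduction(before: float, after: float) -> float:
    """How much smaller the metric became, as a percentage of the old value."""
    return 100.0 * (before - after) / before


tp5_without_pgo = 400.0
tp5_with_pgo = 330.0

print(f"speedup: {speedup(tp5_without_pgo, tp5_with_pgo):.2f}x")  # speedup: 1.21x
print(f"reduction: {percent_reduction(tp5_without_pgo, tp5_with_pgo):.1f}%")  # reduction: 17.5%
```

The two numbers describe the same change: a 1.2x speedup corresponds to the score dropping by 17.5%.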
> _______________________________________________
> dev-platform mailing list
> dev-platf...@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-platform
> I think some folk are concerned that we might land stuff in the meantime that seems fine on non-PGO builds/tests, but then fails under PGO when we eventually re-enable it. While that's a risk, I think it's a relatively small one, and we should accept it at this point as better than keeping m-c closed to most C++ development for an extended period.
It's not such a small risk: it happened 3 times in the last 2 months iirc. That's the original reason philor asked to go back to always doing PGO, since with intermittent PGO builds it was hard to track down the original changeset causing the problem.
-m
> I think some folk are concerned that we might land stuff in the meantime that seems fine on non-PGO builds/tests, but then fails under PGO when we eventually re-enable it. While that's a risk, I think it's a relatively small one
The data seems to show that such a checkin happens about once a week on average (see the recent "Proposal to switch mozilla-inbound back to always doing PGO builds" thread in dev.planning).
So either we think that we'll have PGO builds back up within much less than a week, or the risk is decidedly not small, right?
> So that gives us until the end of January to get the new compiler deployed, move to 64-bit builders, or whatever solution(s) we're going to use, or about 7 weeks, minus the Christmas and New Year holiday season.
At which point we will need to find the roughly 7 checkins, on average, that no longer build with PGO and that will land between now and then...
Moving code out of libxul is only a band-aid over the problem. Since we
don't have any reason to believe that the memory usage of the linker is
linear in terms of the code size, we can't be sure that removing 10% of the
code in libxul will give us 10% more breathing space. Also, moving code
out of libxul might break the sorts of optimizations that we've been doing
assuming that most of our code lives inside libxul (for example, libxul
preloading, etc.)
I agree with JP that the shortest path to reopening the trees is disabling
PGO builds. But we should also note that we're pretty close to the cut-off
date, which would mean that we would end up in a situation where we would
need to release Firefox 11 for Windows with PGO disabled, unless RelEng can
deploy 64-bit builders in time.
Moving to 64-bit builders gives us 33% more address space, which should be
enough for a while. But there is ultimately a hard limit on how much code
we can have in libxul before we hit the 4GB address space limit of the
linker. That might take a couple more years, but my pessimistic side
thinks that it's going to happen sooner this time. ;-)
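
The 33% figure follows from going from 3 GB to 4 GB, and the same arithmetic shows why each step buys less than the one before. A small sketch, assuming the linker is a 32-bit, large-address-aware process (the table and helper are illustrative):

```python
# User-mode address space, in GB, for a 32-bit large-address-aware process.
ADDRESS_SPACE_GB = {
    "32-bit OS": 2,
    "32-bit OS with /3GB": 3,
    "64-bit OS": 4,  # hard ceiling for any 32-bit process
}


def percent_gain(old_gb: float, new_gb: float) -> float:
    """Relative increase in address space when moving between configurations."""
    return 100.0 * (new_gb - old_gb) / old_gb


gain_3gb = percent_gain(ADDRESS_SPACE_GB["32-bit OS"],
                        ADDRESS_SPACE_GB["32-bit OS with /3GB"])
gain_64bit = percent_gain(ADDRESS_SPACE_GB["32-bit OS with /3GB"],
                          ADDRESS_SPACE_GB["64-bit OS"])
print(f"/3GB switch: +{gain_3gb:.0f}%")     # /3GB switch: +50%
print(f"64-bit OS:   +{gain_64bit:.0f}%")   # 64-bit OS:   +33%
```

After the 64-bit OS step there is no further row to move to without a 64-bit linker.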
The only real fix is for us to get a 64-bit linker. I remember some folks
mentioning how Microsoft doesn't have plans on shipping one (my memory
might not be serving me well here). But really, we should talk to
Microsoft and find this out. If they're not planning to ship a 64-bit
linker within the next year or so, turning PGO off is just something that
we would have to do at some point in the future.
On Mon, Dec 12, 2011 at 11:34:32AM -0500, Ehsan Akhgari wrote:
> Moving to 64-bit builders gives us 33% more address space, which should be
> enough for a while. But there is ultimately a hard limit on how much code
> we can have in libxul before we hit the 4GB address space limit of the
> linker. That might take a couple of more years, but my pessimistic side
> thinks that it's going to happen sooner this time. ;-)
> The only real fix is for us to get a 64-bit linker. I remember some folks
> mentioning how Microsoft doesn't have plans on shipping one (my memory
> might not be serving me well here). But really, we should talk to
> Microsoft and find this out. If they're not planning to ship a 64-bit
> linker within the next year or so, turning PGO off is just something that
> we would have to do at some point in the future.
Note that MSVC2010 uses less memory, since it can link with 3GB memory
with PGO enabled.
> On 12/12/11 11:12 AM, Jonathan Kew wrote:
>> I think some folk are concerned that we might land stuff in the meantime that seems fine on non-PGO builds/tests, but then fails under PGO when we eventually re-enable it. While that's a risk, I think it's a relatively small one
> The data seems to show that such a checkin happens about once a week on average (see the recent "Proposal to switch mozilla-inbound back to always doing PGO builds" thread in dev.planning).
Has that always been the case, or is this relatively high frequency a relatively recent phenomenon?
I'm assuming the address-space limit we've hit is not based simply on "raw" codesize (we don't have 3GB of code, do we?) but rather the total of various structures that the compiler/linker builds internally in order to support its optimization and code-gen process. And so it relates somehow to complexity/inter-relationships as well as raw size, and given that we've presumably been fairly close to the breaking point for a while, I'd think it quite possible that some of the "internal compiler error" failures were in fact out-of-address-space failures, due to a checkin modifying code (without necessarily _adding_ much) in a way that happens to be more memory-hungry for the compiler to handle.
So once we raise that ceiling, we may see a reduction in the incidence of PGO failure on apparently-innocent checkins.
> So either we think that we'll have PGO builds back up within much less than a week, or the risk is decidedly not small, right?
>> So that gives us until the end of January to get the new compiler deployed, move to 64-bit builders, or whatever solution(s) we're going to use, or about 7 weeks, minus the Christmas and New Year holiday season.
> At which point we will need to find the average of 7 checkins that no longer build with pgo that will land between now and then...
I don't doubt that it happens, but I think having to tackle a handful of these on aurora during January and/or beta during February would be better than blocking much C++ development for an extended period - and dealing with the resulting pressure on the tree when it re-opens and everyone wants to land the stuff they've been holding back in the meantime.
And if releng can get us onto VS2010 and/or 64-bit builders more quickly - which I hope is possible, but don't know what's actually involved in making the switch - the number of such problematic checkins will presumably be correspondingly smaller.
> I don't doubt that it happens, but I think having to tackle a handful of these on aurora during January and/or beta during February would be better than blocking much C++ development for an extended period
I'm not actually sure it is. At that point we'll have to first find the checkins responsible, then figure out how to fix them, possibly backing them and other things out.
I suspect the net effect will be similar to holding the tree closed for several days now, but time-shifted into January/February.
If we think we'll need to have the tree closed for longer than a few days, I agree that disabling PGO temporarily sounds more palatable.
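
Finding the responsible checkin after the fact is essentially a bisection over the range that landed while PGO was off. A minimal sketch, assuming a PGO build of the last changeset fails and that once the build breaks it stays broken; `pgo_build_ok` is a hypothetical callback standing in for a full PGO build of one changeset.

```python
from typing import Callable, Sequence


def first_bad_changeset(changesets: Sequence[str],
                        pgo_build_ok: Callable[[str], bool]) -> str:
    """Return the earliest changeset whose PGO build fails.

    Assumes changesets are in landing order and that every changeset after
    the first bad one also fails.
    """
    lo, hi = 0, len(changesets) - 1
    if pgo_build_ok(changesets[hi]):
        raise ValueError("tip builds fine under PGO; nothing to bisect")
    # Invariant: the first bad changeset lies in changesets[lo..hi].
    while lo < hi:
        mid = (lo + hi) // 2
        if pgo_build_ok(changesets[mid]):
            lo = mid + 1  # breakage landed after mid
        else:
            hi = mid      # mid is bad, so the first bad one is at or before it
    return changesets[lo]
```

Each step costs one full PGO build, so a range of N changesets needs about log2(N) builds per culprit. With several independent breaking checkins in the range, this finds only the earliest; the search has to be repeated after each fix or backout, and non-deterministic PGO failures would make individual results less trustworthy.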
> And if releng can get us onto VS2010 and/or 64-bit builders more quickly - which I hope is possible, but don't know what's actually involved in making the switch - the number of such problematic checkins will presumably be correspondingly smaller.
The 32-bit builders currently have VS2010 installed on them in addition to VS2005. There are other issues preventing a switch to 2010, however; iirc switching to 2010 breaks Firefox on older versions of Windows XP.
Both are pure JS tests. For pure JS tests, time is either spent in jitcode (not affected by PGO) or in libmozjs (which is compiled with PGO disabled already on Windows because as far as we can tell VS 2005 PGO miscompiles it; see https://bugzilla.mozilla.org/show_bug.cgi?id=673518 ).
It shows the PGO builds doing about 267 runs/s while the non-PGO ones are doing about 209 runs/s. So about 25% speedup.
(Amusingly, http://graphs-new.mozilla.org/graph.html#tests=[[72,94,1],[72,1,1]]&sel=none&displayrange=7&datatype=running also shows no speedup, because contrary to its name Dromaeo-CSS is largely a JS test in practice.)
> Moving code out of libxul is only a band-aid over the problem. Since we
> don't have any reason to believe that the memory usage of the linker is
> linear in terms of the code size, we can't be sure that removing 10% of the
> code in libxul will give us 10% more breathing space. Also, moving code
> out of libxul might break the sorts of optimizations that we've been doing
> assuming that most of our code lives inside libxul (for example, libxul
> preloading, etc.)
This argument, however, doesn't apply equally well to all parts of
libxul. Some parts are relatively self-contained, with critical loops
that are well-identified, don't interact with other parts of libxul,
and are already optimized, i.e. coded in such a way that PGO won't make
them faster than -O2. I think that WebGL is such an example.
To put it another way, there's a limit to the scale at which PGO makes
sense, or else we should just link all the software on a computer as a
single file...
> I agree with JP that the shortest path to reopening the trees is disabling
> PGO builds. But we should also note that we're pretty close to the cut-off
> date, which would mean that we would end up in a situation where we would
> need to release Firefox 11 for Windows with PGO disabled, unless RelEng can
> deploy 64-bit builders in time.
> Moving to 64-bit builders gives us 33% more address space, which should be
> enough for a while. But there is ultimately a hard limit on how much code
> we can have in libxul before we hit the 4GB address space limit of the
> linker. That might take a couple of more years, but my pessimistic side
> thinks that it's going to happen sooner this time. ;-)
> The only real fix is for us to get a 64-bit linker. I remember some folks
> mentioning how Microsoft doesn't have plans on shipping one (my memory
> might not be serving me well here). But really, we should talk to
> Microsoft and find this out. If they're not planning to ship a 64-bit
> linker within the next year or so, turning PGO off is just something that
> we would have to do at some point in the future.
> On Mon, Dec 12, 2011 at 11:12 AM, Justin Lebar <justin.le...@gmail.com>wrote:
>> > I think it's worth looking at what PGO is buying us these days. It costs
>> a lot in terms of build times and therefore
>> > build machine capacity.
>> PGO is a large performance win. TP5 goes from 400 to 330, a speedup of
>> 1.2x.
>> > [PGO is] also non-deterministic, which scares me a lot.
>> Are you sure non-pgo is deterministic? :) Not that you shouldn't be
>> scared by the extra non-determinism in PGO.
>> > If we can determine where PGO is helping us, maybe move just those
>> > pieces into a distinct library where we can do PGO.
>> This isn't a bad idea, but we need to be careful. "Where PGO is
>> helping us" doesn't mean "code we can compile without PGO without
>> causing a regression on our performance tests." Our performance tests
>> are hardly comprehensive.
>> On Mon, Dec 12, 2011 at 10:47 AM, Jean-Paul Rosevear <j...@mozilla.com>
>> wrote:
>> > Whether or not it's the right thing to do long term, turning it off short
>> > term should let us quickly re-open the tree, correct? We can still go down
>> > the path of splitting out big libxul chunks, and when we feel that's ready
>> > based on try results, we can turn PGO on again if need be.
>> > I'll make this the main discussion topic for the engineering meeting
>> > tomorrow if people agree.
>> > -JP
>> > ----- Original Message -----
>> > From: "Chris AtLee" <cat...@mozilla.com>
>> > To: dev-platf...@lists.mozilla.org
>> > Sent: Monday, December 12, 2011 10:36:11 AM
>> > Subject: Re: Gecko Is Too Big (Or, Why the Tree Is Closed)
>> > On 11/12/11 09:27 PM, Kyle Huey wrote:
>> >> At the end of last week our Windows PGO builds started failing on
>> >> mozilla-inbound (https://bugzilla.mozilla.org/show_bug.cgi?id=709193).
>> >> After some investigation we determined that the problem seems to be that
>> >> the linker is running out of virtual address space during the optimization
>> >> phase.
>> >> This is not the first time we've run into this problem (e.g. Bug 543034).
>> >> A couple years ago we hit the 2 GB virtual address space limit. The build
>> >> machines were changed to use /3GB and that additional GB of address space
>> >> bought us some time. This time unfortunately the options aren't as easy as
>> >> flipping a switch.
>> >> As a temporary measure, we've turned off or ripped out a few new pieces of
>> >> code (Graphite, SPDY, libreg) which has brought us back down under the
>> >> limit for the moment. We don't really know how much breathing space we
>> >> have (but it's probably pretty small).
>> >> Our three options at this point:
>> >> 1) Make libxul smaller - Either by removing code entirely or by splitting
>> >> things into separate shared libraries.
>> >> 2) Move to MSVC 2010 - We know that changesets that reliably failed to link
>> >> on MSVC 2005 linked successfully with MSVC 2010. What we don't know is how
>> >> much this helps (I expect the answer is somewhere between a lot and a
>> >> little). We can't really do this for (at the bare minimum) a couple more
>> >> weeks anyways due to product considerations about what OSs we support.
>> >> 3) Do our 32 bit builds on machines running a 64 bit OS. This will allow
>> >> the linker to use 4 GB of address space.
>> > I'd like to propose
>> > 4) Stop doing PGO.
>> > I think it's worth looking at what PGO is buying us these days. It costs
>> > a lot in terms of build times and therefore build machine capacity. It's
>> > also non-deterministic, which scares me a lot.
>> > If we can determine where PGO is helping us, maybe move just those
>> > pieces into a distinct library where we can do PGO.
>> > _______________________________________________
>> > dev-platform mailing list
>> > dev-platf...@lists.mozilla.org
>> > https://lists.mozilla.org/listinfo/dev-platform
> 2011/12/12 Ehsan Akhgari <ehsan.akhg...@gmail.com>:
> > Moving code out of libxul is only a band-aid over the problem. Since we
> > don't have any reason to believe that the memory usage of the linker is
> > linear in terms of the code size, we can't be sure that removing 10% of the
> > code in libxul will give us 10% more breathing space. Also, moving code
> > out of libxul might break the sorts of optimizations that we've been doing
> > assuming that most of our code lives inside libxul (for example, libxul
> > preloading, etc.)
> This argument, however, doesn't apply equally well to all parts of
> libxul. Some parts are relatively self-contained, with critical loops
> that are well-identified, don't interact with other parts of libxul,
> and already optimized, i.e. coded in such a way that PGO won't make
> them faster than -O2. I think that WebGL is such an example.
There is also the question of which interfaces the code in question can
use. For example, if the code in question calls a function on an object,
and the code for the said object lives outside of its module, the function
needs to either be virtual or publicly exported.
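To make that constraint concrete, here is a minimal single-file sketch. It
is only an illustration: Shape, Triangle, EXAMPLE_EXPORT and the two
"other module" functions are hypothetical names, not Gecko code, and a
real split would put the two halves into separate DLLs, which a one-file
example cannot show.

```cpp
#include <string>

// EXAMPLE_EXPORT is a hypothetical macro made up for this sketch.
#if defined(_MSC_VER)
#  define EXAMPLE_EXPORT __declspec(dllexport)
#else
#  define EXAMPLE_EXPORT __attribute__((visibility("default")))
#endif

// Imagine this class is implemented inside the big library.
class Shape {
public:
  virtual ~Shape() {}

  // Virtual: a caller in another module dispatches through the vtable
  // pointer stored in the object, so it needs no linker-visible symbol.
  virtual int Sides() const { return 0; }

  // Non-virtual and defined out of line: a caller in another module has
  // to link against this symbol, so it must be exported.
  EXAMPLE_EXPORT std::string Describe() const;
};

class Triangle : public Shape {
public:
  int Sides() const override { return 3; }
};

std::string Shape::Describe() const {
  return "shape with " + std::to_string(Sides()) + " sides";
}

// Imagine these two functions live in the split-out module.
int SidesFromOtherModule(const Shape& aShape) {
  return aShape.Sides();     // fine without an export: virtual call
}

std::string DescribeFromOtherModule(const Shape& aShape) {
  return aShape.Describe();  // needs the export: direct call
}
```

With a Triangle instance, SidesFromOtherModule() goes through the vtable
and DescribeFromOtherModule() goes through the exported symbol. Drop the
export in a real two-DLL build and the second call would fail to link.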
> To put it another way, there's a limit to the scale at which PGO makes
> sense, or else we should just link all the software on a computer as a
> single file...
I think that's an unfair comparison. Theoretically, if we had linkers
which could use 64-bit address space, we could take advantage of PGO
without needing to put all of the code inside a single source file.
Problem is, we don't have those linkers for now. :(
We have two patches in hand (Bug 709657 and Bug 709721) to split out a
couple chunks of libxul. I tested one of them last night and it got the
final xul.dll size below the size of mozilla-beta's xul.dll by a couple
hundred kilobytes.
If we're willing to make the assumption that final binary size and peak
linker memory consumption are somewhat correlated then these two bugs
should buy us a fair amount of time (or code size, I suppose).
> On 12/12/11 12:20 PM, Jonathan Kew wrote:
>> I don't doubt that it happens, but I think having to tackle a handful of these on aurora during January and/or beta during February would be better than blocking much C++ development for an extended period
> I'm not actually sure it is. At that point we'll have to first find the checkins responsible, then figure out how to fix them, possibly backing them and other things out.
> I suspect the net effect will be similar to holding the tree closed for several days now, but time-shifted into January/February.
> If we think we'll need to have the tree closed for longer than a few days, I agree that disabling PGO temporarily sounds more palatable.
But if we expect we'll be able to re-open (with PGO) within a few days anyway, then we'll only be dealing with a few days' worth of non-PGO'd checkins that might have problems that need to be tracked down once PGO is back.
So I don't see much benefit to holding the tree mostly-closed at this point. Either we can get the PGO builds working again soon, in which case the odds are pretty good that they'll "just work" with whatever patches have landed - it's not like we break them on a daily basis - or it's going to take "longer than a few days", in which case we really can't afford to block development while we wait for it.
On Mon, Dec 12, 2011 at 1:56 PM, Benoit Jacob <jacob.benoi...@gmail.com> wrote:
> This argument, however, doesn't apply equally well to all parts of
> libxul. Some parts are relatively self-contained, with critical loops
> that are well-identified, don't interact with other parts of libxul,
> and already optimized, i.e. coded in such a way that PGO won't make
> them faster than -O2. I think that WebGL is such an example.
This is an almost impossible statement to make. Even highly optimized
code can be made faster by the PGO optimizer, because it does
optimizations like:
* massive inlining
* speculative virtual call inlining
* hot+cold function block separation
which are incredibly hard to replicate without hand-crafting unreadable code.
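As a rough illustration of the second item, this is what speculative
virtual call inlining looks like when approximated by hand. It is a
sketch under assumptions: Node, Leaf and Branch are made-up types, the
"profile" (almost every node is a Leaf) is simply asserted, and a real
compiler would typically compare a vtable or function pointer rather
than use typeid.

```cpp
#include <cassert>
#include <typeinfo>
#include <vector>

struct Node {
  virtual ~Node() {}
  virtual int Weight() const = 0;
};

struct Leaf final : Node {
  int Weight() const override { return 1; }
};

struct Branch final : Node {
  explicit Branch(int aChildren) : mChildren(aChildren) {}
  int Weight() const override { return 2 * mChildren; }
  int mChildren;
};

// What the source says: one indirect call per element.
int TotalWeight(const std::vector<const Node*>& aNodes) {
  int total = 0;
  for (const Node* node : aNodes) {
    assert(node != nullptr);
    total += node->Weight();
  }
  return total;
}

// Hand-written approximation of the speculated form: guess the hot type,
// inline its body behind a cheap check, keep the virtual call as fallback.
int TotalWeightSpeculative(const std::vector<const Node*>& aNodes) {
  int total = 0;
  for (const Node* node : aNodes) {
    assert(node != nullptr);
    if (typeid(*node) == typeid(Leaf)) {
      total += 1;               // body of Leaf::Weight(), inlined by hand
    } else {
      total += node->Weight();  // cold path: ordinary virtual dispatch
    }
  }
  return total;
}
```

Both functions return the same total for any input. The second one has
to be rewritten whenever the hot type changes, which is the
maintainability cost described above; the optimizer redoes the guess
from fresh profile data on every build.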
> To put it another way, there's a limit to the scale at which PGO makes
> sense, or else we should just link all the software on a computer as a
> single file...
This is probably false. If the compiler could inline your system
library calls and things like that, your software would likely be
faster. It's only because of API boundaries that things like that
don't happen.
I have an idea which might enable us to use VS2010 to build binaries that
will run with Win2k, XP and XP SP1. We were going to switch to VS2010 for
Gecko 12 anyways, so if we can get this to work, we can switch to VS2010
today and we wouldn't need to rip out anything either.
On Mon, Dec 12, 2011 at 2:18 PM, Kyle Huey <m...@kylehuey.com> wrote:
> Status update:
> We have two patches in hand (Bug 709657 and Bug 709721) to split out a
> couple chunks of libxul. I tested one of them last night and it got the
> final xul.dll size below the size of mozilla-beta's xul.dll by a couple
> hundred kilobytes.
> If we're willing to make the assumption that final binary size and peak
> linker memory consumption are somewhat correlated then these two bugs
> should buy us a fair amount of time (or code size, I suppose).