PGO build proposal

Ted Mielczarek

May 19, 2011, 10:14:26 AM

to mozilla.dev.planning group, Boris Zbarsky, rob...@ocallahan.org

On Tue, May 17, 2011 at 8:53 PM, Robert O'Callahan <rob...@ocallahan.org> wrote:
> On Wed, May 18, 2011 at 12:47 PM, Kyle Huey <m...@kylehuey.com> wrote:
>
>> Even with a very powerful machine you're still looking at multiple hours.
>> PGO essentially serializes compilation ... and we compile twice. We could
>> be smart in the build system and put in a bunch of hacks to compile
>> mozjs.dll, nspr4.dll, etc in parallel, but you're still limited by how long
>> it takes to compile all of the code in xul.dll at essentially -j1, which is
>> most of our code.
>>
>
> How often do we encounter PGO-only bugs?
>
> I wonder if we could drop PGO from the main test matrix and do nightly PGO
> builds and tests, or something like that. If the nightly shows a new PGO
> test failure someone would have to bisect. That could conceivably be
> automated.

On Tue, May 17, 2011 at 9:12 PM, Boris Zbarsky <bzba...@mit.edu> wrote:
> On 5/17/11 8:56 PM, Kyle Huey wrote:
>>
>> Well this would mean that we'd only be getting daily perf numbers on our
>> most used platform ... that may be too high a cost to pay.
>
> Can we build both PGO and non-PGO, and use the latter for the correctness
> tests and the former for talos?
>
> We've already established that it takes 8-36 hours or so to find out about
> perf regressions (even on Linux, where the builds are quick), so this won't
> make things much worse for the perf number latency, while still giving us
> much shorter latency to test green.

I think I've made this proposal elsewhere in the past[1], but I think
it bears repeating and investigating. I believe we ought to:
1) Switch our per-checkin builds to be non-PGO builds, but continue to
run Talos/unittests on them. This means that cycle time for an
individual checkin would go way down on Windows (and back down on
Linux now, as well).
2) Also do PGO builds, but not per-checkin, just as fast as one build
slave at a time can generate them, and run Talos/unittests on them.
This way we will still get the sanity of testing the configuration we
ship to users, and a double-check on our perf results, but we don't
force everyone to pay that cost for every checkin. If we do hit a
PGO-only regression, we'll have a larger regression range, but it
should be manageable since we expect to hit them very infrequently.
3) Continue to build Nightly and Release builds with PGO, no change.

-Ted

1. https://bugzilla.mozilla.org/show_bug.cgi?id=420320#c0
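
A rough sketch of how point 2 could play out, assuming a single slave that always starts a PGO build on the newest push as soon as it is free. The 165-minute build time is the Windows average quoted later in this thread; the push pattern, the function name, and the data class are hypothetical and only illustrate how the regression range per PGO build grows with push frequency.

```python
from dataclasses import dataclass

PGO_BUILD_MINUTES = 165  # 2h45m, the Windows average quoted later in this thread


@dataclass
class PgoBuild:
    start: int            # minute at which the PGO build started
    pushes_covered: list  # push times first covered by this build


def schedule_pgo_builds(push_times, build_minutes=PGO_BUILD_MINUTES):
    """One slave builds the newest push whenever it is free.

    Each PGO build covers every push that landed since the previous PGO
    build started; that set is the regression range for the build.
    """
    builds = []
    pending = sorted(push_times)
    free_at = 0
    while pending:
        # The slave starts once it is free and at least one push is waiting.
        start = max(free_at, pending[0])
        covered = [t for t in pending if t <= start]
        pending = [t for t in pending if t > start]
        builds.append(PgoBuild(start, covered))
        free_at = start + build_minutes
    return builds


if __name__ == "__main__":
    # Hypothetical load: one push every 30 minutes for 12 hours.
    pushes = list(range(0, 12 * 60, 30))
    for b in schedule_pgo_builds(pushes):
        print(f"PGO build at t={b.start:4d} min covers {len(b.pushes_covered)} push(es)")
```

Under these assumptions a PGO-only regression would have to be narrowed down among five or six pushes rather than one.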

Axel Hecht

May 19, 2011, 10:27:52 AM

I like that so much, I even had the same idea ;-)

Axel

Johnathan Nightingale

May 19, 2011, 12:07:18 PM

to Ted Mielczarek, mozilla.dev.planning group

On 2011-05-19, at 10:14 AM, Ted Mielczarek wrote:
> I believe we ought to:
> 1) Switch our per-checkin builds to be non-PGO builds, but continue to
> run Talos/unittests on them. This means that cycle time for an
> individual checkin would go way down on Windows (and back down on
> Linux now, as well).
> 2) Also do PGO builds, but not per-checkin, just as fast as one build
> slave at a time can generate them, and run Talos/unittests on them.
> This way we will still get the sanity of testing the configuration we
> ship to users, and a double-check on our perf results, but we don't
> force everyone to pay that cost for every checkin. If we do hit a
> PGO-only regression, we'll have a larger regression range, but it
> should be manageable since we expect to hit them very infrequently.
> 3) Continue to build Nightly and Release builds with PGO, no change.

I like this idea; it's a quick and significant win that can happen regardless of all the other threads about how to manage the tree. The costs seem low to me, but I would like to know how RelEng feels.

(I don't think the latter should stop us from filing a bug, though! Is there one already, Ted? Happy to file, if not)

J

---
Johnathan Nightingale
Director of Firefox Engineering
joh...@mozilla.com

Kyle Huey

May 19, 2011, 12:25:10 PM

to Ted Mielczarek, mozilla.dev.planning group, Boris Zbarsky, rob...@ocallahan.org

> it bears repeating and investigating. I believe we ought to:

> 1) Switch our per-checkin builds to be non-PGO builds, but continue to
> run Talos/unittests on them. This means that cycle time for an
> individual checkin would go way down on Windows (and back down on
> Linux now, as well).
> 2) Also do PGO builds, but not per-checkin, just as fast as one build
> slave at a time can generate them, and run Talos/unittests on them.
> This way we will still get the sanity of testing the configuration we
> ship to users, and a double-check on our perf results, but we don't
> force everyone to pay that cost for every checkin. If we do hit a
> PGO-only regression, we'll have a larger regression range, but it
> should be manageable since we expect to hit them very infrequently.
> 3) Continue to build Nightly and Release builds with PGO, no change.
>

> -Ted
>
> 1. https://bugzilla.mozilla.org/show_bug.cgi?id=420320#c0

I fully support this.

- Kyle

Shawn Wilsher

May 19, 2011, 12:54:57 PM

to dev-pl...@lists.mozilla.org

We should, however, make sure you can get a PGO run off of try (or opt).
In fact, try should probably do both since any change pushed to m-c
has the chance of doing both.

Cheers,

Shawn

Chris Cooper

May 19, 2011, 1:08:25 PM

to dev-pl...@lists.mozilla.org

On 2011-05-19 10:14 AM, Ted Mielczarek wrote:
> I think I've made this proposal elsewhere in the past[1], but I think
> it bears repeating and investigating. I believe we ought to:
> 1) Switch our per-checkin builds to be non-PGO builds, but continue to
> run Talos/unittests on them. This means that cycle time for an
> individual checkin would go way down on Windows (and back down on
> Linux now, as well).
> 2) Also do PGO builds, but not per-checkin, just as fast as one build
> slave at a time can generate them, and run Talos/unittests on them.
> This way we will still get the sanity of testing the configuration we
> ship to users, and a double-check on our perf results, but we don't
> force everyone to pay that cost for every checkin. If we do hit a
> PGO-only regression, we'll have a larger regression range, but it
> should be manageable since we expect to hit them very infrequently.
> 3) Continue to build Nightly and Release builds with PGO, no change.

So this *may* improve end-to-end times by cutting build times in half,
but won't actually help with wait times overall, especially for tests.

Right now, we're pretty consistently able to start over 95% of build
requests within 15 minutes of a push, even with PGO. We currently can't
say that about test requests. If we cut build times in half, we'll
simply have finished builds queuing up for testing faster, which will
actually make the problem worse if we end up with a greater number of
build requests.

I'm not saying we shouldn't disable PGO builds per-checkin, but
improving build time won't necessarily help you get your test results
faster.

If we end up generating *both* PGO and non-PGO builds per-checkin and
are expected to test both, the testing wait time numbers are going to
get ugly(er).

RelEng is continuing to look for solutions to improving test turnaround
time:

* reducing test overhead in general
* increasing (or reducing) test chunking to minimize overhead
* aggressively disabling tests on branches where they aren't required or
don't make sense
* increasing the number of testing machines to improve throughput

Mike Shaver

May 19, 2011, 1:11:18 PM

to Ted Mielczarek, Boris Zbarsky, rob...@ocallahan.org, mozilla.dev.planning group

On May 19, 2011 7:15 AM, "Ted Mielczarek" <t...@mielczarek.org> wrote:
>
> 1) Switch our per-checkin builds to be non-PGO builds, but continue to
> run Talos/unittests on them.

> 2) Also do PGO builds, but not per-checkin, just as fast as one build
> slave at a time can generate them, and run Talos/unittests on them.

This.

Mike

Ted Mielczarek

May 19, 2011, 1:22:36 PM

to Chris Cooper, dev-pl...@lists.mozilla.org

On Thu, May 19, 2011 at 1:08 PM, Chris Cooper <cco...@deadsquid.com> wrote:
> Right now, we're pretty consistently able to start over 95% of build
> requests within 15 minutes of a push, even with PGO. We currently can't
> say that about test requests. If we cut build times in half, we'll
> simply have finished builds queuing up for testing faster, which will
> actually make the problem worse if we end up with a greater number of
> build requests.
>
> I'm not saying we shouldn't disable PGO builds per-checkin, but
> improving build time won't necessarily help you get your test results
> faster.

Certainly, if we produce builds faster we will wind up with the
opportunity to test more changes. However, currently Windows builds
take 2 hours 45 minutes on average, which means that that is currently
our lower bound for getting any Windows test results, which is pretty
awful. I think my proposal is the only way to usefully reduce that
number per-checkin.

> If we end up generating *both* PGO and non-PGO builds per-checkin and
> are expected to test both, the testing wait time numbers are going to
> get ugly(er).

I am not suggesting this at all, only that we produce PGO builds
serially and test them. This would increase the overall volume of
builds to be tested, but I don't know by how much. It should overall
reduce the load on our build pool.

-Ted
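
For a rough upper bound on that extra volume: if the serial PGO build takes the same 2 hours 45 minutes as the current average Windows build (an assumption), one slave building back to back can finish only a handful of PGO builds per day. The helper name below is hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def max_serial_builds_per_day(build_minutes):
    """Complete builds one slave can finish in 24 hours, building back to back."""
    return MINUTES_PER_DAY // build_minutes


# 2 hours 45 minutes, the average Windows build time quoted above.
pgo_build_minutes = 2 * 60 + 45

print(max_serial_builds_per_day(pgo_build_minutes))  # 8
```

So the added test load would be capped at about eight extra builds to test per day per platform, however many pushes land.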

Chris AtLee

May 19, 2011, 1:26:27 PM

On 19/05/11 12:07 PM, Johnathan Nightingale wrote:
>
> On 2011-05-19, at 10:14 AM, Ted Mielczarek wrote:
>> I believe we ought to:
>> 1) Switch our per-checkin builds to be non-PGO builds, but continue to
>> run Talos/unittests on them. This means that cycle time for an
>> individual checkin would go way down on Windows (and back down on
>> Linux now, as well).
>> 2) Also do PGO builds, but not per-checkin, just as fast as one build
>> slave at a time can generate them, and run Talos/unittests on them.
>> This way we will still get the sanity of testing the configuration we
>> ship to users, and a double-check on our perf results, but we don't
>> force everyone to pay that cost for every checkin. If we do hit a
>> PGO-only regression, we'll have a larger regression range, but it
>> should be manageable since we expect to hit them very infrequently.
>> 3) Continue to build Nightly and Release builds with PGO, no change.
>
>
> I like this idea, it's a quick and significant win that can happen regardless of all the other threads about how to manage the tree. The costs seem low to me, but I would like to know how RelEng feels.

I like it! I don't know how we're going to implement it, but I still
like it!

I think that since the number of builds happening will be about the
same, the impact on our test pool will be small.

Paul Biggar

May 19, 2011, 1:30:21 PM

to Chris Cooper, dev-pl...@lists.mozilla.org

On Thu, May 19, 2011 at 10:08, Chris Cooper <cco...@deadsquid.com> wrote:
> If we end up generating *both* PGO and non-PGO builds per-checkin and
> are expected to test both, the testing wait time numbers are going to
> get ugly(er).

Are we able to build both without increasing build times? For example,
stopping after the first build, and packaging that up as non-PGO, then
carrying on? In that case, we could have tests much sooner, and still
have Talos on PGO builds.

--
Paul Biggar
Compiler Geek
pbi...@mozilla.com

Ehsan Akhgari

May 19, 2011, 1:37:38 PM

to Ted Mielczarek, dev-pl...@lists.mozilla.org, Chris Cooper

On 11-05-19 1:22 PM, Ted Mielczarek wrote:
>> If we end up generating *both* PGO and non-PGO builds per-checkin and
>> are expected to test both, the testing wait time numbers are going to
>> get ugly(er).
>

> I am not suggesting this at all, only that we produce PGO builds
> serially and test them. This would increase the overall volume of
> builds to be tested, but I don't know by how much. It should overall
> reduce the load on our build pool.

Also, I think we should only run Talos tests on the PGO builds, and rely
on unit tests running on the nightly builds (which would also be PGO) to
catch the extremely rare unit test regressions that only happen on PGO
builds.

Cheers,
Ehsan

Ehsan Akhgari

May 19, 2011, 1:39:12 PM

to Shawn Wilsher, dev-pl...@lists.mozilla.org

On 11-05-19 12:54 PM, Shawn Wilsher wrote:
> We should, however, make sure you can get a PGO run off of try (or opt).
> In fact, try should probably do both since any change pushed to m-c has
> the chance of doing both.

I don't think that we need to do the latter. We do want the ability to
generate PGO builds on Talos (probably using a trychooser syntax), but
as long as we're talking about 2 PGO-related bugs so far, it's not worth
doing a PGO build for every try push that triggers a Windows build.

Cheers,
Ehsan

Ted Mielczarek

May 19, 2011, 1:47:16 PM

to Paul Biggar, dev-pl...@lists.mozilla.org, Chris Cooper

On Thu, May 19, 2011 at 1:30 PM, Paul Biggar <pbi...@mozilla.com> wrote:

> On Thu, May 19, 2011 at 10:08, Chris Cooper <cco...@deadsquid.com> wrote:
>> If we end up generating *both* PGO and non-PGO builds per-checkin and
>> are expected to test both, the testing wait time numbers are going to
>> get ugly(er).
>

> Are we able to build both without increasing build times? For example,
> stopping after the first build, and packaging that up as non-PGO, then
> carrying on? In that case, we could have tests much sooner, and still
> have Talos on PGO builds.

I was going to say no, but I think given the right combination of
compiler flags this could be made to work. I'm not sure what the build
times would look like, since you'd basically wind up doing:
a) Normal build
b) Another build pass that doesn't compile any object files, but
re-links all the binaries.
c) Profiling run
d) Another build pass to re-link with optimization.

That being said, I'm not sure it's worth jumping through hoops for. I
think building a PGO build per-checkin is a waste of resources.

-Ted

Kyle Huey

May 19, 2011, 1:50:05 PM

to Ted Mielczarek, Paul Biggar, dev-pl...@lists.mozilla.org, Chris Cooper

This won't work, because a "normal" build doesn't use link time code
generation (-LTCG), and in order to do PGO we need to use -LTCG. -LTCG is
the magical thing that kills the build perf ... since it defers code
generation until the link phase and results in what is essentially single
threaded compilation.

- Kyle
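
A back-of-the-envelope model of why deferring code generation to the link phase hurts so much on multi-core machines. It assumes compilation divides evenly across cores while linking runs on one core; the function name and every number are made up for illustration, and the second compile pass that PGO requires is not modeled.

```python
def build_minutes(compile_cpu_minutes, link_minutes, cores):
    """Wall-clock estimate: compilation scales with cores, linking does not."""
    return compile_cpu_minutes / cores + link_minutes


# Hypothetical workloads. With LTCG most code generation moves out of the
# parallel compile step and into the single-threaded link step.
normal = {"compile_cpu_minutes": 240, "link_minutes": 5}
ltcg = {"compile_cpu_minutes": 60, "link_minutes": 120}

for cores in (1, 8, 16):
    print(
        f"{cores:2d} cores: normal {build_minutes(cores=cores, **normal):6.1f} min, "
        f"LTCG {build_minutes(cores=cores, **ltcg):6.1f} min"
    )
```

In this toy model, adding cores keeps shrinking the normal build but barely moves the LTCG build, because the serial link step dominates.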

Mike Hommey

May 19, 2011, 1:56:43 PM

to Ehsan Akhgari, Shawn Wilsher, dev-pl...@lists.mozilla.org

It's actually easy to trigger a PGO build when PGO is disabled:
mk_add_options MOZ_PGO=1

Mike

Shawn Wilsher

May 19, 2011, 4:39:54 PM

to dev-pl...@lists.mozilla.org

On 5/19/2011 10:37 AM, Ehsan Akhgari wrote:
> Also, I think we should only run Talos tests on the PGO builds, and rely
> on unit tests running on the nightly builds (which would also be PGO) to
> catch the extremely rare unit test regressions that only happen on PGO
> builds.

I think that's a really bad idea. We should be measuring what our users
use, not some build that is similar but totally different.

Cheers,

Shawn

Mike Connor

May 19, 2011, 5:23:12 PM

to Shawn Wilsher, dev-pl...@lists.mozilla.org

Agreed. I don't think the minimal savings are worth the undefined risk here.

-- Mike

Shawn Wilsher

May 19, 2011, 5:43:59 PM

to dev-pl...@lists.mozilla.org

On 5/19/2011 1:39 PM, Shawn Wilsher wrote:
> On 5/19/2011 10:37 AM, Ehsan Akhgari wrote:
>> Also, I think we should only run Talos tests on the PGO builds, and rely
>> on unit tests running on the nightly builds (which would also be PGO) to
>> catch the extremely rare unit test regressions that only happen on PGO
>> builds.
> I think that's a really bad idea. We should be measuring what our users
> use, not some build that is similar but totally different.

So, ignore the second sentence, and let me try again. Some talos data
is better than no talos data here, and opt builds are still better than
nothing for talos data. This can give us an approximation of whether we
are going to have a regression, and the PGO builds can tell us for sure (but
with a larger regression range). I hate it when talos runs get merged
now because that just means we get a bigger regression range if we have
a problem, and fewer data points to establish a moving baseline with.

Cheers,

Shawn
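
Narrowing a larger regression range, as discussed above and as suggested earlier in the thread when bisection was said to be automatable, comes down to a binary search over the pushes in the range. A minimal sketch, assuming the failure persists once introduced and the last push in the range is known to fail; `is_bad` is a hypothetical callback that would build and test one revision.

```python
def first_bad_revision(revisions, is_bad):
    """Return the earliest revision for which is_bad() is true.

    revisions must be ordered oldest to newest, the last one must be
    known bad, and every revision after the first bad one must be bad too.
    """
    lo, hi = 0, len(revisions) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid          # first bad revision is mid or earlier
        else:
            lo = mid + 1      # first bad revision is after mid
    return revisions[lo]


if __name__ == "__main__":
    # Hypothetical range of six pushes where the regression landed in r4.
    pushes = ["r1", "r2", "r3", "r4", "r5", "r6"]
    tested = []

    def is_bad(rev):
        tested.append(rev)
        return pushes.index(rev) >= pushes.index("r4")

    print(first_bad_revision(pushes, is_bad), "found after testing", tested)
```

Each call to `is_bad` stands for a full PGO build plus test run, so even a short range takes several build cycles to bisect.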

Nicholas Nethercote

May 19, 2011, 7:39:56 PM

to Chris Cooper, dev-pl...@lists.mozilla.org

On Fri, May 20, 2011 at 3:08 AM, Chris Cooper <cco...@deadsquid.com> wrote:
>
> So this *may* improve end-to-end times by cutting build times in half,
> but won't actually help with wait times overall, especially for tests.
>

> Right now, we're pretty consistently able to start over 95% of build
> requests within 15 minutes of a push, even with PGO. We currently can't
> say that about test requests. If we cut build times in half, we'll
> simply have finished builds queuing up for testing faster, which will
> actually make the problem worse if we end up with a greater number of
> build requests.

Is that because there are separate pools of build and test machines?
If so, this change would reduce the load on the build pool -- could
some of the build pool machines be moved into the test pool?

Nick

Justin Dolske

May 19, 2011, 8:59:46 PM

On 5/19/11 10:08 AM, Chris Cooper wrote:

> Right now, we're pretty consistently able to start over 95% of build
> requests within 15 minutes of a push, even with PGO. We currently can't
> say that about test requests. If we cut build times in half, we'll
> simply have finished builds queuing up for testing faster, which will
> actually make the problem worse if we end up with a greater number of
> build requests.

Perhaps I'm being dense, but I don't see how things would get worse.
Every checkin currently creates N jobs (build, test, talos). That would
still be the case after this change, except 1 of those jobs (Windows
build) will be non-PGO and thus faster. Otherwise, the total volume of
work is the same.

[Ok, there's actually a slight bump from having a new PGO builder and
the additional test/talos work it generates, but that seems like a drop
in the bucket compared to current load.]

Justin

Armen Zambrano Gasparnian

May 24, 2011, 9:14:13 AM

On 11-05-19 7:39 PM, Nicholas Nethercote wrote:
> could some of the build pool machines be moved into the test pool?
>
> Nick

No, not really. The builders would need to be of the same spec as the
testing machines (rev3 minis), which they are not.

-- armenzg