On Tue, May 17, 2011 at 9:12 PM, Boris Zbarsky <bzba...@mit.edu> wrote:
> On 5/17/11 8:56 PM, Kyle Huey wrote:
>>
>> Well this would mean that we'd only be getting daily perf numbers on our
>> most used platform ... that may be too high a cost to pay.
>
> Can we build both PGO and non-PGO, and use the latter for the correctness
> tests and the former for talos?
>
> We've already established that it takes 8-36 hours or so to find out about
> perf regressions (even on Linux, where the builds are quick), so this won't
> make things much worse for the perf number latency, while still giving us
> much shorter latency to test green.
I think I've made this proposal elsewhere in the past[1], but I think
it bears repeating and investigating. I believe we ought to:
1) Switch our per-checkin builds to be non-PGO builds, but continue to
run Talos/unittests on them. This means that cycle time for an
individual checkin would go way down on Windows (and back down on
Linux now, as well).
2) Also do PGO builds, but not per-checkin, just as fast as one build
slave at a time can generate them, and run Talos/unittests on them.
This way we will still get the sanity of testing the configuration we
ship to users, and a double-check on our perf results, but we don't
force everyone to pay that cost for every checkin. If we do hit a
PGO-only regression, we'll have a larger regression range, but it
should be manageable since we expect to hit them very infrequently.
3) Continue to build Nightly and Release builds with PGO, no change.
-Ted
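The scheduling in points 1 and 2 above can be sketched roughly as follows. This is a toy model, not the actual buildbot configuration: the `Scheduler` class and its method names are hypothetical, and it assumes the single PGO slave always picks up the newest revision that has not yet had a PGO build, which is one possible reading of "as fast as one build slave at a time can generate them".

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Scheduler:
    """Toy model: every checkin gets a non-PGO build; PGO builds run one at a time."""
    pgo_in_progress: Optional[str] = None  # revision the single PGO slave is building
    pending_pgo: Optional[str] = None      # newest revision without a PGO build yet
    started: List[Tuple[str, str]] = field(default_factory=list)

    def on_checkin(self, rev: str) -> None:
        # Point 1: start a fast non-PGO build for every checkin.
        self.started.append(("non-pgo", rev))
        # Point 2: PGO builds are serialized, so only the newest revision is remembered
        # (assumption: older pending revisions are skipped, widening the regression range).
        self.pending_pgo = rev
        self._maybe_start_pgo()

    def on_pgo_finished(self) -> None:
        self.pgo_in_progress = None
        self._maybe_start_pgo()

    def _maybe_start_pgo(self) -> None:
        if self.pgo_in_progress is None and self.pending_pgo is not None:
            self.pgo_in_progress = self.pending_pgo
            self.pending_pgo = None
            self.started.append(("pgo", self.pgo_in_progress))


sched = Scheduler()
for rev in ["r1", "r2", "r3"]:
    sched.on_checkin(rev)
sched.on_pgo_finished()  # the PGO build of r1 completes; the slave moves on to r3
print(sched.started)
```

In this sketch the second PGO build covers both r2 and r3, which is the "larger regression range" mentioned in point 2.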
I like that so much, I even had the same idea ;-)
Axel
I like this idea, it's a quick and significant win that can happen regardless of all the other threads about how to manage the tree. The costs seem low to me, but I would like to know how RelEng feels.
(I don't think the latter should stop us from filing a bug, though! Is there one already, Ted? Happy to file, if not)
J
---
Johnathan Nightingale
Director of Firefox Engineering
joh...@mozilla.com
> it bears repeating and investigating. I believe we ought to:
> 1) Switch our per-checkin builds to be non-PGO builds, but continue to
> run Talos/unittests on them. This means that cycle time for an
> individual checkin would go way down on Windows (and back down on
> Linux now, as well).
> 2) Also do PGO builds, but not per-checkin, just as fast as one build
> slave at a time can generate them, and run Talos/unittests on them.
> This way we will still get the sanity of testing the configuration we
> ship to users, and a double-check on our perf results, but we don't
> force everyone to pay that cost for every checkin. If we do hit a
> PGO-only regression, we'll have a larger regression range, but it
> should be manageable since we expect to hit them very infrequently.
> 3) Continue to build Nightly and Release builds with PGO, no change.
>
> -Ted
>
> 1. https://bugzilla.mozilla.org/show_bug.cgi?id=420320#c0
> _______________________________________________
> dev-planning mailing list
> dev-pl...@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-planning
>
I fully support this.
- Kyle
Cheers,
Shawn
So this *may* improve end-to-end times by cutting build times in half,
but won't actually help with wait times overall, especially for tests.
Right now, we're pretty consistently able to start over 95% of build
requests within 15 minutes of a push, even with PGO. We currently can't
say that about test requests. If we cut build times in half, we'll
simply have finished builds queuing up for testing faster, which will
actually make the problem worse if we end up with a greater number of
build requests.
I'm not saying we shouldn't disable PGO builds per-checkin, but
improving build time won't necessarily help you get your test results
faster.
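A rough way to see this is a small queue simulation. It is only a sketch under stated assumptions: builds start immediately on push, tests are served first-come, first-served by a fixed pool of identical machines, and every number below (push interval, build and test durations, pool size) is hypothetical rather than measured.

```python
import heapq


def simulate(push_times, build_minutes, test_minutes, test_machines):
    """Return (mean test-queue wait, mean push-to-results time) in minutes.

    Assumes builds start immediately on push and that tests are served
    first-come, first-served by `test_machines` identical machines.
    """
    free_at = [0.0] * test_machines  # time at which each test machine is next free
    heapq.heapify(free_at)
    total_wait = 0.0
    total_turnaround = 0.0
    for pushed in push_times:
        build_done = pushed + build_minutes
        start = max(build_done, heapq.heappop(free_at))  # wait for a machine if needed
        finished = start + test_minutes
        heapq.heappush(free_at, finished)
        total_wait += start - build_done
        total_turnaround += finished - pushed
    n = len(push_times)
    return total_wait / n, total_turnaround / n


# Hypothetical load: 30 pushes, 10 minutes apart, 2 test machines, 60-minute test runs.
pushes = [10 * i for i in range(30)]
slow = simulate(pushes, build_minutes=160, test_minutes=60, test_machines=2)
fast = simulate(pushes, build_minutes=80, test_minutes=60, test_machines=2)
print("160-minute builds:", slow)
print("80-minute builds: ", fast)
```

Under these assumptions the time spent waiting for a test machine does not depend on the build time at all; the faster build shortens the total only by the build time saved.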
If we end up generating *both* PGO and non-PGO builds per-checkin and
are expected to test both, the testing wait time numbers are going to
get ugly(er).
RelEng is continuing to look for ways to improve test turnaround
time:
* reducing test overhead in general
* increasing (or reducing) test chunking to minimize overhead
* aggressively disabling tests on branches where they aren't required or
don't make sense
* increasing the number of testing machines to improve throughput
Other solutions are welcome.
cheers,
--
coop
This.
Mike
Certainly, if we produce builds faster we will wind up with the
opportunity to test more changes. However, currently Windows builds
take 2 hours 45 minutes on average, which means that is currently
our lower bound for getting any Windows test results, which is pretty
awful. I think my proposal is the only way to usefully reduce that
number per-checkin.
> If we end up generating *both* PGO and non-PGO builds per-checkin and
> are expected to test both, the testing wait time numbers are going to
> get ugly(er).
I am not suggesting this at all, only that we produce PGO builds
serially and test them. This would increase the overall volume of
builds to be tested, but I don't know by how much. It should overall
reduce the load on our build pool.
-Ted
I like it! I don't know how we're going to implement it, but I still
like it!
I think that since the number of builds happening will be about the
same, the impact on our test pool will be small.
Are we able to build both without increasing build times? For example,
stopping after the first build, and packaging that up as non-PGO, then
carrying on? In that case, we could have tests much sooner, and still
have Talos on PGO builds.
--
Paul Biggar
Compiler Geek
pbi...@mozilla.com
Also, I think we should only run Talos tests on the PGO builds, and rely
on unit tests running on the nightly builds (which would also be PGO) to
catch the extremely rare unit test regressions that only happen on PGO
builds.
Cheers,
Ehsan
I don't think that we need to do the latter. We do want the ability to
generate PGO builds on Talos (probably using a trychooser syntax), but
as long as we're talking about 2 PGO-related bugs so far, it's not worth
doing a PGO build for every try push which triggers a Windows build.
Cheers,
Ehsan
I was going to say no, but I think given the right combination of
compiler flags this could be made to work. I'm not sure what the build
times would look like, since you'd basically wind up doing:
a) Normal build
b) Another build pass that doesn't compile any object files, but
re-links all the binaries.
c) Profiling run
d) Another build pass to re-link with optimization.
That being said, I'm not sure it's worth jumping through hoops for. I
think building a PGO build per-checkin is a waste of resources.
-Ted
This won't work, because a "normal" build doesn't use link time code
generation (-LTCG), and in order to do PGO we need to use -LTCG. -LTCG is
the magical thing that kills the build perf ... since it defers code
generation until the link phase and results in what is essentially single
threaded compilation.
- Kyle
It's actually easy to trigger a PGO build when PGO is disabled:
mk_add_options MOZ_PGO=1
Mike
Agreed. I don't think the minimal savings are worth the undefined risk here.
-- Mike
Is that because there are separate pools of build and test machines?
If so, this change would reduce the load on the build pool -- could
some of the build pool machines be moved into the test pool?
Nick
> Right now, we're pretty consistently able to start over 95% of build
> requests within 15 minutes of a push, even with PGO. We currently can't
> say that about test requests. If we cut build times in half, we'll
> simply have finished builds queuing up for testing faster, which will
> actually make the problem worse if we end up with a greater number of
> build requests.
Perhaps I'm being dense, but I don't see how things would get worse.
Every checkin currently creates N jobs (build, test, talos). That would
still be the case after this change, except 1 of those jobs (Windows
build) will be non-PGO and thus faster. Otherwise, the total volume of
work is the same.
[Ok, there's actually a slight bump from having a new PGO builder and
the additional test/talos work it generates, but that seems like a drop
in the bucket compared to current load.]
Justin
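The "drop in the bucket" estimate above can be put into rough numbers. The sketch below assumes a single serial PGO builder and takes the 2 hours 45 minutes average Windows build time quoted earlier in the thread as the PGO build time; the checkin rate, jobs per checkin, and test jobs per build are hypothetical placeholders, not measured figures.

```python
MINUTES_PER_DAY = 24 * 60


def extra_jobs_per_day(pgo_build_minutes, test_jobs_per_build):
    """Jobs added per day by one PGO builder running builds back to back."""
    pgo_builds = MINUTES_PER_DAY // pgo_build_minutes  # whole builds that fit in a day
    return pgo_builds * (1 + test_jobs_per_build)      # each build plus its test/talos jobs


def relative_increase(checkins_per_day, jobs_per_checkin,
                      pgo_build_minutes, test_jobs_per_build):
    """Extra load as a fraction of the existing per-checkin job volume."""
    baseline = checkins_per_day * jobs_per_checkin
    return extra_jobs_per_day(pgo_build_minutes, test_jobs_per_build) / baseline


# 165 minutes is the 2h45m figure from the thread; the other values are hypothetical.
extra = extra_jobs_per_day(pgo_build_minutes=165, test_jobs_per_build=20)
share = relative_increase(checkins_per_day=50, jobs_per_checkin=100,
                          pgo_build_minutes=165, test_jobs_per_build=20)
print(extra, share)
```

With these placeholder values the serial PGO builder adds a few percent to the daily job count; the real figure depends on the actual checkin rate and the number of test suites per build.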
-- armenzg