Proposal to switch mozilla-inbound back to always doing PGO builds

Phil Ringnalda

Dec 9, 2011, 3:43:53 PM
Followup-to mozilla.dev.tree-management, please.

We switched from doing PGO builds on Linux and Windows on every push to
doing non-PGO on push, and only doing PGO every few hours, in order to
speed up getting results from a push, so that the person pushing
wouldn't be on the hook for so many hours.

Unfortunately, between the idea and the execution, we also switched from
"everybody pushes to mozilla-central and in theory has to watch the
tree" to "nearly everybody pushes to mozilla-inbound, and doesn't watch
the tree at all." So now the only people who are in a hurry to get
results from mozilla-inbound are the merge vikings, and only then when
there is some blazing hurry to be able to merge a particular push, which
means nearly never, and when it happens, that thing probably shouldn't
have been pushed to inbound.

The reason we felt able to switch to only periodically building and
testing what we ship is because we assured ourselves that we almost
never have PGO bugs or PGO bustage.

That was a lie. We have it all too frequently. Inbound is currently
completely hosed for Windows PGO in what I think is either the seventh
or eighth PGO bustage we've had since the start of periodic PGO, though
I may well have lost count.

Because of the way periodic PGO is triggered, you cannot currently
retrigger just one PGO build, so when you have your sheriff hat on, and
are faced with

push 7 - red Win32 PGO
push 6
push 5
push 4
push 3
push 2
push 1 - green Win32 PGO

you can only see if the red was a fluke by retriggering all four sorts
of PGO builds on push 7, waiting four hours for the Windows build, then
when you find that it was not a fluke, triggering all four sorts on push
6, waiting until they are running before you trigger all four sorts on
push 5 (because otherwise your request for push 5 would just get
coalesced into the triggering on push 6), waiting until push 5 is
running, rinse, repeat.
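
The coalescing behaviour described above can be sketched as a toy model. This is not the real scheduler code: the class name, the method names and the rule that a merged build runs the newest requested push are all assumptions made for illustration.

```python
from collections import deque


class CoalescingQueue:
    """Toy model of a build queue that merges pending requests per builder."""

    def __init__(self):
        self.pending = deque()  # (builder, push) requests not yet started
        self.running = []       # (builder, push) builds already started

    def request(self, builder, push):
        """Queue a build request for one builder on one push."""
        self.pending.append((builder, push))

    def start_next(self):
        """Start one build, folding every pending request for that builder into it."""
        if not self.pending:
            return None
        builder = self.pending[0][0]
        same = [push for b, push in self.pending if b == builder]
        self.pending = deque((b, p) for b, p in self.pending if b != builder)
        # Assumption: the merged build runs on the newest requested push.
        build = (builder, max(same))
        self.running.append(build)
        return build
```

In this model, requesting push 6 and push 5 back to back produces a single build on push 6, while requesting push 5 only after the push 6 build has started produces two separate builds, which is the waiting game described above.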

In https://bugzilla.mozilla.org/show_bug.cgi?id=709192 I propose that we
drop periodic PGO for mozilla-inbound, and go back to doing PGO on-push
on that tree.

Andrew McCreight

Dec 9, 2011, 5:05:15 PM
to dev-tree-...@lists.mozilla.org
Would it be possible to change the way PGO builds are triggered/coalesced
so that you could fire off PGO builds for pushes 2 through 6 all at once?
I don't know if that is technically feasible, but it would bound the amount
of time it took to blame a particular set to the cost of a PGO build.
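
As a rough sketch of that bound: the four-hour build time is the figure quoted earlier in the thread, while the time spent waiting for each build to start (start_wait_hours) is a hypothetical parameter, and the model assumes there are enough free slaves to run every build at once.

```python
def staggered_hours(candidates, build_hours, start_wait_hours):
    """Time until the last result when each trigger waits for the previous build to start."""
    if candidates == 0:
        return 0
    return (candidates - 1) * start_wait_hours + build_hours


def all_at_once_hours(candidates, build_hours):
    """Time until the last result when every candidate push is triggered together."""
    return build_hours if candidates else 0


# Pushes 2 through 6 are five candidates; 0.5 hours of start wait is made up.
print(staggered_hours(5, build_hours=4, start_wait_hours=0.5))  # 6.0
print(all_at_once_hours(5, build_hours=4))                      # 4
```

The staggered figure grows with the number of pushes in the range, while the all-at-once figure stays at the cost of one PGO build.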

Andrew
> In https://bugzilla.mozilla.org/show_bug.cgi?id=709192 I propose that we
> drop periodic PGO for mozilla-inbound, and go back to doing PGO on-push
> on that tree.
> _______________________________________________
> dev-tree-management mailing list
> dev-tree-management@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-tree-management

Marco Bonardo

Dec 9, 2011, 5:27:00 PM
On 09/12/2011 21:43, Phil Ringnalda wrote:
> In https://bugzilla.mozilla.org/show_bug.cgi?id=709192 I propose that we
> drop periodic PGO for mozilla-inbound, and go back to doing PGO on-push
> on that tree.

I think one of the points regarding intermittent PGO builds was to
reduce the load, so finding a way to both reduce the load and avoid
missing failures would be the best approach.

In the bug I proposed that rather than doing PGO builds on a timer or on
a counter, we do them when they are needed, that is, when a push changes
any code that is involved in a PGO build: c, cpp, h, Makefiles and so on.
Alternatively, as a safer measure, the rule of thumb could be to always
do PGO, except when the push touches only files that are not involved,
like js, jsm, css, xml, xhtml, html, jpg, png and similar.
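
A minimal sketch of the second, safer rule, assuming the scheduler can see the list of files a push changed. The function name and the idea of passing plain paths are hypothetical; the extension list is the one above, and whether every file with those extensions is really irrelevant to a PGO build is an assumption that would need checking.

```python
import os

# Extensions from the list above, assumed never to affect compiled code.
NON_COMPILED_EXTENSIONS = {
    ".js", ".jsm", ".css", ".xml", ".xhtml", ".html", ".jpg", ".png",
}


def needs_pgo(changed_files):
    """Return True unless every changed file has a known non-compiled extension."""
    for path in changed_files:
        ext = os.path.splitext(path)[1].lower()
        if ext not in NON_COMPILED_EXTENSIONS:
            # Unknown or compiled file type: err on the side of doing PGO.
            return True
    return False
```

A push touching only browser.js and browser.css would skip PGO under this rule, while anything touching a .cpp file, a Makefile.in, or a file type not on the list would still get a PGO build.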

-m

Philip Ringnalda

Dec 9, 2011, 5:41:43 PM
I filed a bug about being able to disable coalescing, a couple of months
ago. It's a P5 enh, unassigned. A reasonable guess would be a 50:50
chance within two years.

John Ford

Dec 12, 2011, 7:26:37 PM
to Philip Ringnalda, dev-tree-...@lists.mozilla.org
Mozilla-inbound has a large portion of our load. If we turn PGO on for all builds on this branch, we will lose a lot of the reason for turning off PGO in the first place.

Before we convert mozilla-inbound to having PGO on by default, let's look into the following mitigations:

1) bumping mozilla-inbound periodic PGO timing to every 3 hours (from the current 6); bug 710048 filed
2) making PGO builds non-mergeable; bug 710050 filed
3) following this "bustage is backed out right away" imperative on the wiki [1] and update it to clarify that "pgo bustage means backing out until last green pgo build"

John Ford

[1] https://wiki.mozilla.org/Tree_Rules/Inbound#Sheriff_Duty

Philip Ringnalda

Dec 13, 2011, 12:00:24 AM
On 12/12/11 4:26 PM, John Ford wrote:
> 3) following this "bustage is backed out right away" imperative on
> the wiki [1] and update it to clarify that "pgo bustage means backing
> out until last green pgo build"
>
> [1] https://wiki.mozilla.org/Tree_Rules/Inbound#Sheriff_Duty

Heh. Aren't you a little curious about why that rule has never, ever,
not even once, been applied? Let's flesh it out a bit with some examples.

"""
Bustage is backed out right away. Changes landed on top of bustage will
be backed out to minimize overhead/time to fix the tree. They will be
relanded by the sheriff once the bustage is cleared.

Say, for example, that a push with six csets includes one which adds a
new test, which fails on 10.6 opt. That failure will show up after two
to three hours, during which time another ten or fifteen pushes will
have landed on top of it. The sheriff will immediately back out all
eleven or sixteen pushes, and then will immediately reland all but the
single offensive cset, making him wonder why on earth he just doubled
the build and test load when he knew exactly what to back out, and no
harm had been done to the later pushes. The next time, rather than
relanding in eleven or sixteen separate pushes, he relands them all in a
single blob, which then gets the blame from the talos regression finding
script. The sheriff begins drinking.

---

Say, for example, that we do green PGO builds on a push, then there are
three more pushes including a merge from mozilla-central, then we do PGO
builds on the fourth push, then there are two more pushes, then one of
the PGO builds on the fourth push fails. Call this "philor's Tuesday the
6th of December."

The sheriff will not retrigger that build to see if it was some
transitory problem with the slave of the sort we have all the time. The
sheriff will not close the tree while he tries to figure out the
problem. The sheriff will back out all twenty five csets, leaving the
tree open while he does, rebasing his backout as more things land, and
backing them out too. When he finally wins the push race (there were
pushes every ten minutes or so for the rest of that day) and lands his
backout of twenty five or thirty or forty pushes, he will immediately
reland all but the first four pushes, because he had no reason to
suspect them, and no reason to back them out, either, but that is The Rule.

Then, the sheriff will reschedule all his meetings for the day, or if he
is a volunteer, call in sick at his day job, and push the first four to
try, triggering PGO on the affected build. As each one comes back green
from try, he will reland them. Because it was both intermittent and
cumulative, he will find that every single one of them is green on try,
and will reland all of them. If, by chance, something lands while he is
doing that which shrinks libxul slightly, then even when he relands all
of them on inbound, they will still be green. If, by chance, something
lands while he is doing that which expands libxul slightly, then the
ones which were green on try will fail on inbound. The sheriff will
begin drinking again.
"""

That rule solves two problems: it completely removes from the sheriff's
toolbox the option of closing the tree to deal with bustage, and it
makes actually being the inbound sheriff an intolerable job that nobody
would want to do, rather than the current intolerable job that nearly
nobody wants to do.

Oddly enough, those are not problems that we need to solve.

Had we applied it when you wanted us to, on Friday (that being the only
time we closed the tree for this instance of PGO bustage; I dealt with
Tuesday and Wednesday without feeling the need to close), the net effect
would have been to allow us to churn like crazy while we still didn't
know what the problem was, backing out blameless patches along with
things that went into libxul, then letting a new thing that went into
libxul fill up the remaining space before we relanded any of the things
that had made it under the limit but got caught up in the last backout.

We didn't ever follow that rule in the past because it's a sledgehammer
we don't need, a sledgehammer which mostly applies itself to the
sheriff's head; in the future, we won't follow it both because it's a
sledgehammer we don't need, and because we now know that there are times
when backing out to the last green and allowing anything other than what
landed after it to land is not always the solution to a problem.

Dao

Dec 13, 2011, 5:54:05 AM
On 13.12.2011 06:00, Philip Ringnalda wrote:
> On 12/12/11 4:26 PM, John Ford wrote:
>> 3) following this "bustage is backed out right away" imperative on
>> the wiki [1] and update it to clarify that "pgo bustage means backing
>> out until last green pgo build"
>>
>> [1] https://wiki.mozilla.org/Tree_Rules/Inbound#Sheriff_Duty
>
> Heh. Aren't you a little curious about why that rule has never, ever,
> not even once, been applied? Let's flesh it out a bit with some examples.

You're right, and this was raised before inbound was set up.
Now, since this wasn't actually practiced, I guess we don't need to
debate this further and can just edit that page?