>> My main concern is that we seem to be pre-solving problems we're not
>> sure we have with solutions that will definitely cause issues.
>
> I agree with Boris.
Do we have data upon which we can help base these decisions (one way or
the other)?
My impression is the tree has two major problems -- intermittent oranges
and closure for bustage. Oranges seem to be slowly getting better
according to OrangeFactor. Tree closure _seems_ to be less frequent to
me, but that's just my unscientific impression.
This isn't to say we shouldn't try to make things better. But, TBH, I
also get the impression from these threads that there's a lot of
repository idealism, and it's not clear to me that some proposals will
be a net win, or even how we'd know they're a net win if we try them out.
E.g., what percentage of the time is the tree open, idle, and awaiting
checkins? Or not ready but basically green as it's processing existing
commits? Or essentially closed to checkins (if not formally) because of
some problem (test failure, bustage, infra outage, etc.)?
Justin
While it hasn't always been perfectly aligned with what we actually do
(it sees things on Tinderbox that we don't see on tbpl because they
didn't TinderboxPrint a cset id, for instance), Kai's thing at
http://kuix.de/mozilla/tinderboxstat/ just had its first birthday the
other week, and thanks to recent improvements we've managed to get the
lifetime stats for either green or starred, so you can push, to almost
40% of the time!
It'd be interesting if someone crunched that into both how we're doing
over time, and how much of the bad time is in each of the three unequal
times, 6am-5pm Pacific (the employee hours), 5pm - 11pm (the contributor
hours), and 11pm - 6am (Europe's hours).
http://kuix.de/mozilla/tinderboxstat/stat.txt
2.41 % green
32.47 % known orange
3.90 % known red and orange
41.66 % new unknown orange
19.53 % new unknown red
---
in other words:
2.41 % green
36.38 % problems, but allowed to commit
61.20 % forbidden to commit
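
For anyone who wants to re-derive the three buckets from the five raw categories, here is a minimal sketch. The percentages are copied from the listing above; the function name and dictionary keys are made up for illustration. The sums land within 0.01 of the figures listed, presumably because the raw values were already rounded.

```python
# Percentages copied from the stat.txt listing above.
stats = {
    "green": 2.41,
    "known orange": 32.47,
    "known red and orange": 3.90,
    "new unknown orange": 41.66,
    "new unknown red": 19.53,
}


def summarize(raw):
    """Group the five raw categories into the three buckets (hypothetical helper)."""
    return {
        "green": raw["green"],
        "problems, but allowed to commit": raw["known orange"] + raw["known red and orange"],
        "forbidden to commit": raw["new unknown orange"] + raw["new unknown red"],
    }


summary = summarize(stats)
for label, pct in summary.items():
    print(f"{pct:6.2f} % {label}")
```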
I suppose the question is "do people think this is good enough?" If
not, what does success look like? Personally, I think that we should
aim for >90% on our primary repository, but that is because we have the
technology to make that possible.
-- Mike
This is misleading. Since starring is a manual process, it can't happen
immediately, and changing some rules won't change the fact that we have
about one intermittent orange per push.
Things may even remain unstarred for hours when nobody wants to push,
which happens all the time during my work hours. This again does not
mean the tree is broken. When I want to push, I star things and then I
push. I usually don't have to wait.
> On 17.05.2011 06:46, Mike Connor wrote:
>> 61.20 % forbidden to commit
>
> This is misleading. Since starring is a manual process, it can't happen immediately, and changing some rules won't change the fact that we have about one intermittent orange per push.
I'm not sure how we're defining "known" and "unknown" here. I was assuming that these stats are not looking at starred vs. unstarred, since starred certainly doesn't mean "known, can land anyway" despite what some people have been doing!
> Things may even remain unstarred for hours when nobody wants to push, which happens all the time during my work hours. This again does not mean the tree is broken. When I want to push, I star things and then I push. I usually don't have to wait.
It is currently the responsibility of the committer to star the oranges on their push, no? Your statement here implies that isn't happening. If people aren't doing that, then we have another problem.
-- Mike
Yes, we have another problem. A committer must not land on a known
orange, so while you're trying to land, you need to investigate the
prior 2-5 landings, too. Just pointing fingers on #developers didn't
work for me, at least.
Axel
>> Things may even remain unstarred for hours when nobody wants to push, which happens all the time during my work hours. This again does not mean the tree is broken. When I want to push, I star things and then I push. I usually don't have to wait.
>
> It is currently the responsibility of the committer to star the oranges on their push, no? Your statement here implies that isn't happening. If people aren't doing that, then we have another problem.
Often the person wanting to commit will star whatever oranges are around before landing. In practice this seems to work just as well and has a better incentive structure than having the committer star the oranges on their push.
-Jeff
>
>
> > Things may even remain unstarred for hours when nobody wants to push,
> which happens all the time during my work hours. This again does not mean
> the tree is broken. When I want to push, I star things and then I push. I
> usually don't have to wait.
>
> It is currently the responsibility of the committer to star the oranges on
> their push, no? Your statement here implies that isn't happening. If
> people aren't doing that, then we have another problem.
>
>
This doesn't happen for a variety of reasons.
1) Windows results don't finish coming in till approximately 6 hours after
their push. Almost nobody sticks around that long.
2) There's nobody really responsible to star nightly builds. The original
committer is almost certainly gone by the time the nightly test results come
in (on weekends, this could be almost a day after your original push.)
3) Having people who want to commit star the tree *works better* than the
alternative. As long as your push doesn't break anything (which can be
verified for most types of changes on tryserver) you don't have to stick
around anywhere near as long as you would otherwise, and the people who have
to interact with the tree to star it are the people who want to be
interacting with the tree (to push).
tl;dr, as long as test results are still trickling in basically a full
work-day later expecting a committer to star every orange on their push is
unreasonable.
- Kyle
No. The reason that you have to star oranges on the tree before pushing
is that other people are not doing their job. It is the responsibility
of the person who pushes to make sure that their push passes all of the
unit tests and does not cause any regression on the performance tests.
But unfortunately many people completely ignore this rule, which is why
you see oranges on other people's pushes when you want to push yourself.
Not starring your own pushes only makes the problem worse for everyone.
Cheers,
Ehsan
>
>
> On Tue, May 17, 2011 at 6:43 AM, Mike Connor <mco...@mozilla.com> wrote:
>
>
> > Things may even remain unstarred for hours when nobody wants to push, which happens all the time during my work hours. This again does not mean the tree is broken. When I want to push, I star things and then I push. I usually don't have to wait.
>
> It is currently the responsibility of the committer to star the oranges on their push, no? Your statement here implies that isn't happening. If people aren't doing that, then we have another problem.
>
>
> This doesn't happen for a variety of reasons.
>
> 1) Windows results don't finish coming in till approximately 6 hours after their push. Almost nobody sticks around that long.
IOW: people don't watch the tree that closely, and rely on other people to clean up after them.
> 2) There's nobody really responsible to star nightly builds. The original committer is almost certainly gone by the time the nightly test results come in (on weekends, this could be almost a day after your original push.)
Okay, so one build/day. We can figure that part out.
> 3) Having people who want to commit star the tree *works better* than the alternative. As long as your push doesn't break anything (which can be verified for most types of changes on tryserver) you don't have to stick around anywhere near as long as you would otherwise, and the people who have to interact with the tree to star it are the people who want to be interacting with the tree (to push).
For intermittent oranges, sure. For new/unknown issues, you're passing the buck on backouts to the next committer. The degenerate case I see anecdotally is that people like Boris/Ehsan/Shawn end up having to fix someone else's problem because they want to commit. I do not think that is fair or sustainable.
> tl;dr, as long as test results are still trickling in basically a full work-day later expecting a committer to star every orange on their push is unreasonable.
Then the system as currently defined is broken, and we should figure out a better solution.
-- Mike
I really don't see what Kai could possibly be measuring other than
starred vs. unstarred.
> It is currently the responsibility of the committer to star the oranges on their push, no? Your statement here implies that isn't happening. If people aren't doing that, then we have another problem.
As others have explained, I'm not sure it's a problem. It's a fact that
doesn't bother me much, personally. And it's distinctly different from
saying "people aren't allowed to commit 60% of the time".
It very much looks like starred vs. unstarred, with a polling every n
minutes, meaning until something is starred, it appears as unknown.
Mike
> On Tue, May 17, 2011 at 04:20:39PM +0200, Dao wrote:
>> On 17.05.2011 15:43, Mike Connor wrote:
>>> I'm not sure how we're defining "known" and "unknown" here. I was assuming that these stats are not looking at starred vs. unstarred, since starred certainly doesn't mean "known, can land anyway" despite what some people have been doing!
>>
>> I really don't see what Kai could possibly be measuring other than
>> starred vs. unstarred.
>
> It very much looks like starred vs. unstarred, with a polling every n
> minutes, meaning until something is starred, it appears as unknown.
Okay, so if that's the case, then here's what we can say, definitively:
* 2.4% of the time, I can commit without doing extra work
* 32.5% of the time, there are starred oranges that I may or may not be able to land on
* 23.5% of the time, one or more builds is broken entirely (red), which often means I should wait
* 61% of the time, I have to dig through previous commits and star builds to even figure out if I _can_ land.
None of those numbers look to me like a viable, healthy system.
-- Mike
These numbers look right.
> None of those numbers look to me like a viable, healthy system.
The 32.5% number is just a reflection of randomorange: we have an
average of one randomorange per push right now. I can try to run the
numbers if you want (though I'll need to know more about what exactly
kaie is measuring to be sure), but I'd be very surprised if the expected
value of "all green" were much higher than a few percent.
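
A back-of-the-envelope sketch of that expectation, under assumptions that are not in the stats themselves: intermittent oranges per push are treated as independent and Poisson-distributed with a mean of one, and "all green" means no orange on any of the pushes currently visible on the tree. The function name and the number of visible pushes are hypothetical.

```python
import math


def p_all_green(visible_pushes, oranges_per_push=1.0):
    """Chance of zero intermittent oranges across the visible pushes.

    Assumes a Poisson model with independent pushes (an assumption for
    illustration, not something measured).
    """
    return math.exp(-oranges_per_push * visible_pushes)


for k in (1, 2, 3, 4):
    print(f"{k} pushes visible: {p_all_green(k):.1%} chance of all green")
```

Under this model a single push is all green roughly a third of the time, and with three or four pushes in view the chance drops to a few percent.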
The 23.5% is bad, I agree. I would be interested to see how much of
that is things like Jetpack recently.... Might not move the needle
much. We do have a real problem here.
The 61% thing is interesting. In my experience, when I go to push there
is typically an unstarred orange or two; starring them takes 30-60
seconds and then things are fine. Sometimes the tree really is broken,
but not that often.
I would be really interested in stats about how much of that time
we're really seeing randomorange that no one has gotten to yet.
One other comment about randomorange and tree-watching. Right now you
have to poll the tree. Getting those fixes about notifying committers
when tests fail in place would help a lot with people starring their own
orange, I suspect. But there'd still be a lag between the orange
appearing and the committer reading the mail (the mail propagation
delay, if nothing else); figure 10-15 mins at least. Again, I'd have to
run some numbers to come out with an expected orange percentage for that.
-Boris
Cheers,
Shawn
> No. The reason that you have to star oranges on the tree before pushing is
> that other people are not doing their job. It is the responsibility of the
> person who pushes to make sure that their push passes all of the unit tests
> and does not cause any regression on the performance tests. But
> unfortunately many people completely ignore this rule, which is why you see
> oranges on other people's pushes when you want to push yourself.
>
> Not starring your own pushes only makes the problem worse for everyone.
>
Yeah, but what Kyle said is also true; watching mozilla-central until every
last test result comes in is not really practical for people with normal
lives. And even if you're around to see a test failure six hours after you
checked in, the delay can make it painful to address.
The time taken to get test results seems to be a core problem driving a lot
of the others, and none of the rules proposed so far will affect it as far
as I can tell. We need to reduce the time it takes to get those test
results. I don't have any ideas of my own other than to keep pushing on what
we're already pushing on: more infrastructure bandwidth, faster builds,
perhaps split up the slow test suites more. Profiling the slow test suites
to see if we can speed them up might help.
Rob
--
"Now the Bereans were of more noble character than the Thessalonians, for
they received the message with great eagerness and examined the Scriptures
every day to see if what Paul said was true." [Acts 17:11]
> The time taken to get test results seems to be a core problem driving a lot of the others, and none of the rules proposed so far will affect it as far as I can tell. We need to reduce the time it takes to get those test results. I don't have any ideas of my own other than to keep pushing on what we're already pushing on: more infrastructure bandwidth, faster builds, perhaps split up the slow test suites more. Profiling the slow test suites to see if we can speed them up might help.
As I said from the beginning, there is no silver bullet. We have to push on all fronts here, and take the wins we can find as fast as we can find them. Limiting scope of bustages/closures in both time and people blocked is a win. Making everything cycle faster means problems can be addressed faster. I believe we'll need to do all of these things to keep scaling active development and stay on this fast cycle.
-- Mike
That's true, and that is why mconnor suggested the -staging proposal.
But it doesn't change the fact that according to the current tree rules
for m-c, people are supposed to watch the tree, and many of them don't.
I don't like watching the tree myself, and I can totally get why it's
a pain, but that's not the point here.
The strictness of this proposal stems from this fact, among others. I
think we now know from experience that expecting a few people to read
other people's patches, debug them, and try to figure out what's
wrong with them (if anything) does not scale. And people have not
gotten more careful with what they're pushing to m-c. The proposal here
might not be ideal, but so far I haven't seen any other proposals in
this thread.
> The time taken to get test results seems to be a core problem driving a lot
> of the others, and none of the rules proposed so far will affect it as far
> as I can tell. We need to reduce the time it takes to get those test
> results. I don't have any ideas of my own other than to keep pushing on what
> we're already pushing on: more infrastructure bandwidth, faster builds,
> perhaps split up the slow test suites more. Profiling the slow test suites
> to see if we can speed them up might help.
I agree that it's a problem which we need to fix, but judging by the
fact that nobody owns debugging this and determining how we can make our
builds and tests faster, I'm not holding my breath for that happening
any time soon. And as our data shows, without having people working on
this, we've been regressing our build and test run speed significantly,
so at least in the near future I think this will only get worse.
Cheers,
Ehsan
The last 5 Windows opt builds that completed on m-c took the following
amount of time to complete from the push time: 155min, 214min, 165min,
158min, 148min.
That does include |make check|, I believe, so tests can start a bit
before this, but nevertheless. Changing our test suites won't help this
issue. Much faster builds somehow would.
Oh, I said "that completed" because the Windows opt build for Mounir's
6:34am Pacific push is still not done. It is currently 12:45pm Pacific,
so it's been 371 minutes....
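
A quick sketch of the arithmetic on those figures, using only the numbers quoted above (the helper name is made up for illustration):

```python
from datetime import datetime
from statistics import mean, median

# Push-to-completion times for the last five Windows opt builds, in minutes.
build_minutes = [155, 214, 165, 158, 148]

avg = mean(build_minutes)    # 168 minutes
med = median(build_minutes)  # 158 minutes


def minutes_between(start, end, fmt="%H:%M"):
    """Whole minutes between two same-day clock times (hypothetical helper)."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# The still-pending build: pushed 6:34am, checked at 12:45pm Pacific.
pending = minutes_between("06:34", "12:45")

print(f"mean {avg} min, median {med} min, pending build at {pending} min")
```

So the typical completed build takes a bit under three hours from push, and the pending one is already at more than twice the mean.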
For comparison, the longest test suites (still looking at the Win opt
builds) are reftest at 40min, xpcshell tests at 50min, dromaeo at 50min.
Chopping these up or speeding them up might save us 20 min or so.
Interestingly, those test suites don't seem to take much more time in
debug builds.
-Boris
For instance, I no longer watch the tree closely (apart from the fact
that there's a big screen by my desk that has it displayed at all times)
since we shipped Firefox 4. During the run-up to Firefox 4, we had a
number of issues with bad pushes that would end up making us close the
tree for hours. Compound that with the fact that a good number of
sheriffs are either not around or not sheriffing means that someone like
Ehsan or myself would come along and sheriff until the problem was
resolved. All of that led to me effectively being burned out about
watching the tree and helping to make sure things run smoothly. This
just increases the burden placed on those remaining who do that, and
increases the likelihood of them suffering the same fate.
The only thing I do do is watch dev.tree-management for performance
regression e-mails, which very few other people seem to pay attention
to. Frequently, regression e-mails seem to get zero responses from
anyone in the range until someone nags them about it.
Cheers,
Shawn
There's a bug[1] on the fact that a bunch of our tests are really slow
on Windows 7.
-Ted
> For comparison, the longest test suites (still looking at the Win opt
> builds) are reftest at 40min, xpcshell tests at 50min, dromaeo at 50min.
> Chopping these up or speeding them up might save us 20 min or so.
I think that tp5 has fewer pages than tp4, so it should cycle faster.
What's your guess about the average amount of time it takes to make sure
that the tree has been fixed? Here's a rough rundown on what happened
on Monday this week:
08:45 - glandium pushed a changeset which turned out to turn the tree
red [1].
09:16 - mstange pushed.
09:24 - volkmar pushed.
09:24 - the first build went red.
09:52 - glandium landed his first trial fix, which was ineffective.
10:04 - mfinkle pushed (on a red tree which might have been mis-starred)
10:04 - the first build on the ineffective fix went red.
10:30 - glandium backed out the original cset.
- we waited for the builds to cycle to make sure the new red that
we saw on the backout were just a clobber problem.
~13:50 - I reopened the tree.
This shows that we wasted about 4:30 hours trying to fix the bustage,
during which time the tree was closed for everybody.
[1] For the record, this did not happen because he landed anything
untested, it was a problem which would only happen with non-clobber
builds, and you don't get those on the try server.
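
For reference, the intervals in that timeline can be computed with a small sketch. The times are copied from the rundown above (the reopening time is approximate there); the event labels and helper name are made up for illustration.

```python
from datetime import datetime

FMT = "%H:%M"

# Times copied from the rundown above; "tree reopened" is approximate.
events = {
    "bad push": "08:45",
    "first red build": "09:24",
    "trial fix landed": "09:52",
    "backout": "10:30",
    "tree reopened": "13:50",
}


def elapsed(start, end):
    """Whole minutes between two named events (hypothetical helper)."""
    delta = datetime.strptime(events[end], FMT) - datetime.strptime(events[start], FMT)
    return int(delta.total_seconds() // 60)


push_to_red = elapsed("bad push", "first red build")
red_to_reopen = elapsed("first red build", "tree reopened")
backout_to_reopen = elapsed("backout", "tree reopened")

print(f"push to first red:  {push_to_red} min")
print(f"first red to open:  {red_to_reopen} min")
print(f"backout to reopen:  {backout_to_reopen} min")
```

The first-red-to-reopen interval is 266 minutes, which matches the "about 4:30 hours" figure, and 200 of those minutes fall after the backout.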
Cheers,
Ehsan
4-5 hours. I agree that this is really bad.
> 08:45 - glandium pushed a changeset which turned out to turn the tree
> red [1].
> 09:16 - mstange pushed.
> 09:24 - volkmar pushed.
> 09:24 - the first build went red.
> 09:52 - glandium landed his first trial fix, which was ineffective.
> 10:04 - mfinkle pushed (on a red tree which might have been mis-starred)
> 10:04 - the first build on the ineffective fix went red.
> 10:30 - glandium backed out the original cset.
> - we waited for the builds to cycle to make sure the new red that we saw
> on the backout were just a clobber problem.
> ~13:50 - I reopened the tree.
>
> This shows that we wasted about 4:30 hours trying to fix the bustage,
> during which time the tree was closed for everybody.
Hold on. In this sort of situation I think we should have backed out
glandium right around 9:25. I think we all agree on that. The only
question we're discussing here is whether mstange and volkmar should
have been backed out as well.
-Boris
For the record, actually, it was effective... for Linux. And the Windows
build for my first push turned red after that fixup and it was a different
red.
> 10:04 - mfinkle pushed (on a red tree which might have been mis-starred)
> 10:04 - the first build on the ineffective fix went red.
Only needed clobber.
> 10:30 - glandium backed out the original cset.
> - we waited for the builds to cycle to make sure the new red
> that we saw on the backout were just a clobber problem.
> ~13:50 - I reopened the tree.
>
> This shows that we wasted about 4:30 hours trying to fix the
> bustage, during which time the tree was closed for everybody.
This shows that the time spent on Windows builds makes tree closure
really too long.
Mike
I seriously don't consider that a proposal, that's just the natural
reaction to change: "OMG no change!". ;-)
> For instance, I no longer watch the tree closely (apart from the fact
> that there's a big screen by my desk that has it displayed at all times)
> since we shipped Firefox 4. During the run-up to Firefox 4, we had a
> number of issues with bad pushes that would end up making us close the
> tree for hours. Compound that with the fact that a good number of
> sheriffs are either not around or not sheriffing means that someone like
> Ehsan or myself would come along and sheriff until the problem was
> resolved. All of that led to me effectively being burned out about
> watching the tree and helping to make sure things run smoothly. This
> just increases the burden placed on those remaining who do that, and
> increases the likelihood of them suffering the same fate.
I will second Shawn here. I've been paying a lot less close attention to the
tree on a daily basis than I used to (basically still recovering from
the cedar maintenance burn-out), which makes me even more worried about
this problem.
> The only thing I do do is watch dev.tree-management for performance
> regression e-mails, which very few other people seem to pay attention
> to. Frequently, regression e-mails seem to get zero responses from
> anyone in the range until someone nags them about it.
You're right. I wonder how many people even read that list... :(
Ehsan
On another note, we are most likely going to update the RAM on the
linux/linux64 IX builders and get enterprise hard drives on the IX machines.
--armenzg
The goal was to fix the regression from that patch, and that fix didn't
achieve the goal. This is one of the very common patterns which causes
people to land some spot-fixes and back the original patch out in the
end. :-)
>> 10:30 - glandium backed out the original cset.
>> - we waited for the builds to cycle to make sure the new red
>> that we saw on the backout were just a clobber problem.
>> ~13:50 - I reopened the tree.
>>
>> This shows that we wasted about 4:30 hours trying to fix the
>> bustage, during which time the tree was closed for everybody.
>
> This shows that the time spent on Windows builds makes tree closure
> really too long.
Contrast this with us reverting the tree to the parent revision of
your cset and clobbering at 9:30. It would have saved us a lot of time,
with no guesswork involved.
Cheers,
Ehsan
OK, I assert that without waiting on test runs of mstange and volkmar's
pushes, we would never know if we achieve our goal[1] by only backing
glandium out. Do you agree?
Ehsan
[1] Our goal being bringing the tree to a good state so that we can
reopen and let people pull from/push to it without worry.
Cheers,
Shawn
Yes.
However if you back out all three and then reland the other two
immediately, you're in the same boat. And if you don't reland them,
you're in that boat as soon as anyone pushes anything (in that you no
longer know whether the tree is green).
And also, we can't quite wait for a push to go fully green before the
next push, right?
So no matter what we do, we will be pushing on trees of unknown
greenness. The question is whether this matters.
-Boris
For one thing, normal lives are overrated. ;-)
For another, if we could magically make Windows fast, that would make
things much easier from what I hear. There are efforts under way to
make building faster on Windows, but I'm not sure how much they help. We
probably should try to switch the build machines over to pymake on
Windows, which I hear also should save time there. And if someone can
find more angles for solutions, e.g. making pymake faster altogether,
building less code (is that possible at all?), making tests on Win7 run
with decent speed, or other stuff, I'm sure everyone would be happy.
Robert Kaiser
--
Note that any statements of mine - no matter how passionate - are never
meant to be offensive but very often as food for thought or possible
arguments that we as a community needs answers to. And most of the time,
I even appreciate irony and fun! :)
Also, to add to the fun, we also had a (theoretically) unrelated
infrastructure bustage that also required clobber.
Mike
To be clear, in the timeline here:
> 08:45 - glandium pushed a changeset which turned out to turn the tree
> red [1].
> 09:16 - mstange pushed.
> 09:24 - volkmar pushed.
> 09:24 - the first build went red.
you would back out glandium at 9:25 and then the only differences from
that push not having happened at all are:
1) You get your test results for mstange's push 9 minutes later than
you would have otherwise (this is true even if you back out both other
changesets).
2) _If_ the backout changeset is still orange or red then you don't
know whether it's mstange or volkmar that needs backing out.
I can see backing them both out and then immediately relanding to avoid
issue #2 there.
-Boris
It would have saved 1 hour of tree closure. Still would have remained
3:30 hours.
Mike
It did fix the regression from that patch. Except there was another
regression that wasn't known until after the fix landed.
Anyways, I'm not trying to say not backing out was a good idea, I'm just
saying that the time spent for Windows builds is *really* not helping in
many different ways.
Mike
Actually, at the time, we could pretty much assert the backout would
have been enough considering the red weren't across all platforms (only
Linux/Linux64 opt and Win (but we didn't know for Win until after
9:30)), and considering the content of the subsequent pushes.
Mike
Even on other platforms, it would be interesting to know why non clobber
builds, happening on the same slave, still take much more time compared
to local builds under the same conditions.
Mike
Last I heard from releng, pymake didn't really make anything faster on
our build machines. I'm not currently aware of any other projects with
the promise of giving us faster Windows builds.
Ehsan
I made a quick attempt today on try, but unfortunately, I didn't manage to
get python-multiprocessing working.
I'm quite surprised to hear pymake doesn't make a difference on build
bots, considering how much difference it makes on my windows 7 VM.
Mike
Exactly, this is why I suggested that everything after the broken cset
should be backed out. This problem didn't happen on Monday
(un)fortunately, but it has happened often enough in the past.
> I can see backing them both out and then immediately relanding to avoid
> issue #2 there.
I'm glad that we finally came to an agreement. :-)
Ehsan
What would it take to get the *complete cycle* time down to the point
where we could make the rule be that nobody pushes until the previous
cycle is done? I think that would be livable with a complete cycle time
on the order of 15 minutes to half an hour. Quicker would of course be
better.
All other suggestions in this conversation seem to be trying to cope
with not being able to make that rule.
zw
I would guess that the only way to get people to pay more attention to
perf regressions is to make them turn the relevant T orange. I have no
idea how doable that would be.
zw
>> Contrast this with us reverting the tree to the parent revision
>> of your cset and clobbering at 9:30. It would have saved us a lot
>> of time, with no guesswork involved.
>
> It would have saved 1 hour of tree closure. Still would have remained
> 3:30 hours.
Actually, it could have been five minutes. If we backed out everything at 9:25, we would be back to the previous changeset, which already built successfully and didn't fail anything unexpected. As we already know that changeset is good, there is no need for the tree to be closed at all, since we know we are now at a known-good state. We could wait five minutes to ensure that new pushes collapse into the backout run.
-- M
Why is it more important to have a green tree with no progress since
that last-known-good changeset than a tree with a few not-yet-tested
pushes in it (i.e., a state just like the things that had been
pushed since the changeset that caused bustage had just been pushed
and hadn't cycled yet)?
It seems to me you're prioritizing greenness over everything else.
If that's what we want we should close the tree to everything other
than random orange fixes, permanently, and otherwise just go home.
But that's not what I want.
-David
--
L. David Baron http://dbaron.org/
Mozilla Corporation http://www.mozilla.com/
Because if there are N patches landed after a patch which breaks the
build, and one of them regresses the build or a test, we will have a
much harder time figuring out which one has caused the regression (it is
effectively as if the N patches have landed in the same push).
> It seems to me you're prioritizing greenness over everything else.
> If that's what we want we should close the tree to everything other
> than random orange fixes, permanently, and otherwise just go home.
> But that's not what I want.
The goal here is to get the tree back to a state where we have _some_
data about its health as soon as possible, with minimum amount of time
where the tree is closed. I think this has been stated several times in
this thread so far.
Ehsan
We're doing this (landing N patches in the same push) all the time,
aren't we? Are you saying we should stop doing that?
I still think this is trying to solve a problem we don't actually
have.
> On Tuesday 2011-05-17 18:01 -0400, Ehsan Akhgari wrote:
>> Because if there are N patches landed after a patch which breaks the
>> build, and one of them regresses the build or a test, we will have a
>> much harder time figuring out which one has caused the regression
>> (it is effectively as if the N patches have landed in the same
>> push).
>
> I still think this is trying to solve a problem we don't actually
> have.
We've had this problem a number of times, but no one has data either way. Clearly we need better data on extended tree closures that come as a result of this type of thing.
-- Mike
> On 5/17/2011 2:02 PM, Ehsan Akhgari wrote:
>
>> Last I heard from releng, pymake didn't really make anything faster on
>> our build machines. I'm not currently aware of any other projects with
>> the promise of giving us faster Windows builds.
>>
> I believe that was pymake with -j1, which is why I was not surprised that
> it did not make a difference at the time. Our build platform at the time
> was just VMs with only one or two CPUs IIRC.
>
If it were -j1, I would expect it to be slower. pymake is a net perf loss
at -j1.
- Kyle
> On Tuesday 2011-05-17 17:16 -0400, Mike Connor wrote:
>>
>> On 2011-05-17, at 4:43 PM, Mike Hommey wrote:
>>
>>>> Contrast this with us reverting the tree to the parent revision
>>>> of your cset and clobbering at 9:30. It would have saved us a lot
>>>> of time, with no guesswork involved.
>>>
>>> It would have saved 1 hour of tree closure. Still would have remained
>>> 3:30 hours.
>>
>> Actually, it could have been five minutes. If we backed out everything at 9:25, we would be back to the previous changeset, which already built successfully and didn't fail anything unexpected. As we already know that changeset is good, there is no need for the tree to be closed at all, since we know we are now at a known-good state. We could wait five minutes to ensure that new pushes collapse into the backout run.
>
> Why is it more important to have a green tree with no progress since
> that last-known-good changeset than a tree with a few not-yet-tested
> pushes in it (i.e., a state just like the things that had been
> pushed since the changeset that caused bustage had just been pushed
> and hadn't cycled yet)?
Ehsan answered this already, but concrete example:
A: known good
B: bustage
C: unknown
D: unknown
E: unknown
Now you discover B is busted, and in a way that builds for C/D/E will not deliver a full set of results, since we couldn't even build on those platforms.
Option 1 (what happened yesterday):
* Back out B. Next run is A+C+D+E.
* If everything is fine when that run finishes, reopen the tree.
* If we have a new failure, we have to back out one or more of C+D+E and wait longer.
Option 2 (what I'd like to see):
* Back out B+C+D+E. Next run will be green, since A was green before.
* Wait five minutes, land C. Repeat, land D. Repeat, land E.
* Now you have four builds in flight on five minute delays, but if something breaks you know what changeset broke it.
* You can reopen the tree now, with confidence that A will go green, and any new bustage will be easily identified.
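The difference between the two options can be sketched in a few lines of Python. This is only a toy model, not how tbpl or the build infrastructure actually schedules runs: blame_candidates is a hypothetical helper, and it assumes every run tests all changesets landed since the previous run.

```python
def blame_candidates(runs, bad):
    """Return the changesets that must be suspected after the first failing run.

    runs: runs in order; each run is the list of changesets newly included
          since the previous run (all on top of the known-good changeset A).
    bad:  the set of changesets that really break the tree. The sheriff does
          not know this set; it is only used to decide which run fails.
    """
    for new_changesets in runs:
        if any(cs in bad for cs in new_changesets):
            # This run fails, and every changeset new in it is a suspect.
            return list(new_changesets)
    return []  # every run was green


# Suppose D is also broken (hypothetical).
# Option 1: back out B only, so C, D and E are first tested together.
option_1 = blame_candidates([["C", "D", "E"]], bad={"D"})

# Option 2: back out B+C+D+E, then reland one changeset per run.
option_2 = blame_candidates([["C"], ["D"], ["E"]], bad={"D"})

print(option_1)  # three suspects
print(option_2)  # one suspect
```

In this model Option 1 leaves three suspects after a failure, while Option 2 leaves exactly one, which is the "you know what changeset broke it" property described above.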
> It seems to me you're prioritizing greenness over everything else.
> If that's what we want we should close the tree to everything other
> than random orange fixes, permanently, and otherwise just go home.
> But that's not what I want.
I'm prioritizing getting the tree back open when stuff gets broken. We had a 4.5 hour closure that could have been addressed with 99% confidence, and subsequent patches relanded as separate pushes on a green base, in 15 minutes. I feel like that's a pretty big win for everyone involved.
-- Mike
We're doing this (landing N patches in the same push) all the time,
No, nobody is saying that. I should also mention that as an unofficial
rule, people like the entire push which breaks the tree to be backed out
(although many people do not follow this rule).
Ehsan
Or Option 3:
* Back out B. Next run is A+C+D+E.
* If everything is fine when that run finishes, reopen the tree.
* If we have a new failure, back out all of C, D, and E, and you're at green.
* You can reopen the tree now, knowing that it's at a green state.
Option 3 gets you to a good state strictly faster than Option 2 ... it also
backs out C, D, and E only if there is a bad push in the bunch.
The seductive thing about option 2 is that you think you can land F five
minutes after E and it feels okay, whereas with Option 3 it's obvious that's
not a good idea. Stacking up pushes in rapid succession is a bad idea to
begin with, because that's how you end up in this situation. In an ideal
world we wouldn't push anything before all test runs were green on the last
push ... but obviously with 3 hour Windows builds on a good day this is not
an ideal world.
- Kyle
I suspect this would require not doing PGO... or magic.
-Boris
If the bustage is the sort of bustage that will prevent C, D and E from
returning any test results (e.g. broken build or test-suite timeout/crash),
the case for backing out all of B, C, D and E is relatively strong. If the
bustage is just "B's new test failed", or "one test failed", then we can
expect the C, D and E pushes to produce usable data and the probability that
C, D or E busted the same test is low.
Rob
--
"Now the Bereans were of more noble character than the Thessalonians, for
they received the message with great eagerness and examined the Scriptures
every day to see if what Paul said was true." [Acts 17:11]
Sounds like we have an action item for Releng!
It requires Microsoft to release a compiler that can do the recompilation
phase of PGO in parallel, along with a variety of build system fixes, or any
number of probably unacceptable options (turn off PGO, switch to mingw,
investigate cross compiling on Linux, etc.), or hardware that is a couple of
orders of magnitude faster than what we have now.
- Kyle
> Even with a very powerful machine you're still looking at multiple hours.
> PGO essentially serializes compilation ... and we compile twice. We could
> be smart in the build system and put in a bunch of hacks to compile
> mozjs.dll, nspr4.dll, etc in parallel, but you're still limited by how long
> it takes to compile all of the code in xul.dll at essentially -j1, which is
> most of our code.
>
How often do we encounter PGO-only bugs?
I wonder if we could drop PGO from the main test matrix and do nightly PGO
builds and tests, or something like that. If the nightly shows a new PGO
test failure someone would have to bisect. That could conceivably be
automated.
> The correct answer may depend on what the bustage is.
>
> If the bustage is the sort of bustage that will prevent C, D and E from returning any test results (e.g. broken build or test-suite timeout/crash), the case for backing out all of B, C, D and E is relatively strong. If the bustage is just "B's new test failed", or "one test failed", then we can expect the C, D and E pushes to produce usable data and the probability that C, D or E busted the same test is low.
Yeah, that's an important distinction, I'm now convinced. I think the tricky part is how to define that bustage, so we have a clear guideline for people operating without a sheriff around.
-- Mike
> I wonder if we could drop PGO from the main test matrix and do nightly PGO
> builds and tests, or something like that. If the nightly shows a new PGO
> test failure someone would have to bisect. That could conceivably be
> automated.
... using regular Windows opt builds for every-push testing, instead of PGO,
of course.
> On Wed, May 18, 2011 at 12:47 PM, Kyle Huey <m...@kylehuey.com> wrote:
>
>> Even with a very powerful machine you're still looking at multiple hours.
>> PGO essentially serializes compilation ... and we compile twice. We could
>> be smart in the build system and put in a bunch of hacks to compile
>> mozjs.dll, nspr4.dll, etc in parallel, but you're still limited by how long
>> it takes to compile all of the code in xul.dll at essentially -j1, which is
>> most of our code.
>>
>
> How often do we encounter PGO-only bugs?
>
I think we've had a total of two, one of which was a compiler bug.
>
> I wonder if we could drop PGO from the main test matrix and do nightly PGO
> builds and tests, or something like that. If the nightly shows a new PGO
> test failure someone would have to bisect. That could conceivably be
> automated.
>
>
Well this would mean that we'd only be getting daily perf numbers on our
most used platform ... that may be too high a cost to pay.
> Rob
> --
> "Now the Bereans were of more noble character than the Thessalonians, for
> they received the message with great eagerness and examined the Scriptures
> every day to see if what Paul said was true." [Acts 17:11]
>
- Kyle
> Or Option 3:
>
> * Back out B. Next run is A+C+D+E.
> * If everything is fine when that run finishes, reopen the tree.
> * If we have a new failure, back out all of C, D, and E, and you're at green.
> * You can reopen the tree now, knowing that it's at a green state.
>
> Option 3 gets you to a good state strictly faster than Option 2 ... it also backs out C, D, and E only if there is a bad push in the bunch.
Option 3 means the tree is open in >= 4 hours, and if one or more of C/D/E is broken, you took that delay for no benefit. With Option 2 you have a known good state immediately, and if you reland C/D/E at five minute intervals (so they get their own builds/talos runs) you get to the same place in N + 15 minutes, except that you can then sort out which patch caused the regression.
> The seductive thing about option 2 is that you think you can land F five minutes after E and it feels okay, whereas with Option 3 it's obvious that's not a good idea. Stacking up pushes in rapid succession is a bad idea to begin with, because that's how you end up in this situation. In an ideal world we wouldn't push anything before all test runs were green on the last push ... but obviously with 3 hour Windows builds on a good day this is not an ideal world.
Stacking up pushes certainly means you have more landings between bad landing and discovery of bad landing. What I'm proposing at least splits the C+D+E stack so we get better answers if failures happen.
Landing badness on top of badness is what seems to cause the biggest/ugliest/most painful tree closures. I'm pushing for a method of a) minimizing closure time and b) ensuring that we have good data on each changeset so if there's further bustage it's just another backout.
-- Mike
PGO Windows builds take ~1.5 hours to link libxul the second time no
matter what. The linker is single-threaded when doing the optimizing
compile pass, and completely CPU-bound. We can optimize some other
time out of the build, certainly, but that is always going to be the
long pole. I asked on the VC blog if making that process parallel was
something they were looking at, and I got a reply saying that it
wasn't anything they had planned.
-Ted
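The "long pole" argument above is essentially Amdahl's law. A rough sketch, assuming a 3-hour total build (the "good day" figure mentioned earlier in the thread) of which ~1.5 hours is the serial libxul link. The 1.5-hour remainder, the speedup values and the function name are illustrative assumptions, not measurements.

```python
def build_hours(serial_hours, parallel_hours, speedup):
    """Amdahl-style estimate of total build time.

    Only the parallelizable part shrinks when more cores are added;
    the single-threaded link stays the same length.
    """
    return serial_hours + parallel_hours / speedup


SERIAL_LINK_HOURS = 1.5  # from the thread: second libxul link, single-threaded
OTHER_WORK_HOURS = 1.5   # assumption: 3.0 hours total minus the serial link

for speedup in (1, 2, 10, 100):
    hours = build_hours(SERIAL_LINK_HOURS, OTHER_WORK_HOURS, speedup)
    print(f"parallel part {speedup}x faster: {hours:.2f} hours")
```

Whatever speedup is applied to the rest of the build, the estimate never drops below the 1.5 hours spent in the serial link.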
> On Tue, May 17, 2011 at 5:53 PM, Robert O'Callahan <rob...@ocallahan.org> wrote:
>
>> On Wed, May 18, 2011 at 12:47 PM, Kyle Huey <m...@kylehuey.com> wrote:
>>
>>> Even with a very powerful machine you're still looking at multiple
>>> hours. PGO essentially serializes compilation ... and we compile twice. We
>>> could be smart in the build system and put in a bunch of hacks to compile
>>> mozjs.dll, nspr4.dll, etc in parallel, but you're still limited by how long
>>> it takes to compile all of the code in xul.dll at essentially -j1, which is
>>> most of our code.
>>>
>>
>> How often do we encounter PGO-only bugs?
>>
>
> I think we've had a total of two, one of which was a compiler bug.
>
If we're eating two hours of test latency per push to catch two bugs ever,
I'd drop PGO tests in a heartbeat.
On the other hand, there is also the question of how often we catch a real
perf regression in PGO that would not have shown up in a regular Windows opt
build. On the third hand, I know that we have seen several false performance
regressions due to PGO...
>
>> I wonder if we could drop PGO from the main test matrix and do nightly PGO
>> builds and tests, or something like that. If the nightly shows a new PGO
>> test failure someone would have to bisect. That could conceivably be
>> automated.
>>
>>
> Well this would mean that we'd only be getting daily perf numbers on our
> most used platform ... that may be too high a cost to pay.
>
We'd still have perf numbers for regular Windows opt builds. We could do
several perf test runs off the single nightly PGO build to make sure we've
got significant data. It might work.
Can we build both PGO and non-PGO, and use the latter for the
correctness tests and the former for talos?
We've already established that it takes 8-36 hours or so to find out
about perf regressions (even on Linux, where the builds are quick), so
this won't make things much worse for the perf number latency, while
still giving us much shorter latency to test green.
Heck, we could run all of the tests on both PGO and non-PGO but
speculatively assume things are OK once the non-PGO correctness tests
are done.
-Boris
Means doubling the Talos Windows pool + probably 50% more builders. Definitely an interesting thing to consider, but a very very nonzero cost. Aside from people who watch the tree, what's the benefit here?
-- Mike
Considering that we're talking about drastic changes to commit/backout
policy that are proving to be highly controversial, I think maybe some
of those highly-controversial technical changes should also be on the table.
zw
Nit: if you really *know* that the next run (A+B+C+D+E-E-D-C-B=A) will
be green, there's no need to do that build. You may as well just back
out B, D, and E (A+B+C+D+E-E-D-B=A+C). But you don't *really* know that
it'll be green, because B might require a clobber, and perhaps another
clobber after the backout. But those are just details.
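The landing/backout arithmetic in the nit above can be checked mechanically. A minimal sketch, where apply_ops is a hypothetical helper that treats "+X" as landing changeset X and "-X" as backing it out; it deliberately ignores the clobber caveat.

```python
def apply_ops(ops):
    """Apply landings ("+X") and backouts ("-X") in order.

    Returns the changesets still present in the tree, in landing order.
    """
    tree = []
    for op in ops:
        sign, changeset = op[0], op[1:]
        if sign == "+":
            tree.append(changeset)
        elif sign == "-":
            tree.remove(changeset)  # raises ValueError if it never landed
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return tree


landed = ["+A", "+B", "+C", "+D", "+E"]

# A+B+C+D+E-E-D-C-B = A
print(apply_ops(landed + ["-E", "-D", "-C", "-B"]))

# A+B+C+D+E-E-D-B = A+C
print(apply_ops(landed + ["-E", "-D", "-B"]))
```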
Thanks for the concrete example. Maybe we should make a fighting card
game: "Mozilla Build Sheriff". Your goal: land as many good patches in
an 8-hour period as possible. The obstacle: your opponent has some number of build
and test failure cards to use against you, each labeled with a varying
delay before you see the failure. He also starts out with two of the
dreaded "needs clobber!" cards that he can throw onto a build following
a build failure card for extra difficulty.
Once you master that, you can buy separately "Mozilla Build Sheriff 2:
European Timezone".
If I were given the task of writing an AI for the game, and having
unlimited automation (but not build/test) resources, I'd probably do
something like... wait. I originally wrote up a detailed proposal here,
but I think it's a tangent and therefore off-topic for this thread. So
I'll put it over at
http://blog.mozilla.com/sfink/2011/05/17/mozilla-central-automated-landing-proposal/
instead. It's an alternative proposal for an automated landing procedure
for mozilla-central -- alternative to Ehsan's, that is; I don't intend
it to take the place of the mozilla-staging proposal on the table here.
Cheers,
Shawn
For my last suggestion, yes. For my first one, no.
> Definitely an interesting thing to consider, but a very very nonzero cost.
Look, we're trying to get the tree to a better state. You're willing to
impose costs on people but not hardware costs?
> Aside from people who watch the tree, what's the benefit here?
1) That's a pretty big aside.
2) It's a benefit for everyone who doesn't have to reland their patches
because some build under them went orange or red.
-Boris
"Aside from"? This whole thread is about trying to save man hours
spent watching the tree. People-time is more expensive than hardware.
There is a difference between controversial changes to the tree rules,
and talking about technology which simply doesn't exist!
Oh, and while on the topic, I think this proposal being highly
controversial is part of its definition. It is a change, and it
prohibits things which are now considered as common practice. So, it is
bound to be controversial. :-)
Ehsan
I agree.
How about defining it as "any kind of bustage which would prevent C, D
and E from getting test coverage"? This should cover build bustages and
test harness crashes/timeouts, without making it too complex
(determining the order in which the tests covering the code changes in
C, D and E are executed).
Ehsan
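The guideline proposed above could be written down as a small decision rule. This is a sketch only: the Bustage categories and the changesets_to_back_out helper are hypothetical names used for illustration, not an existing tool or an agreed policy.

```python
from enum import Enum


class Bustage(Enum):
    BUILD_BROKEN = "build broken"
    HARNESS_CRASH_OR_TIMEOUT = "test harness crash/timeout"
    TEST_FAILURE = "individual test failure"


# Kinds of bustage that prevent later pushes from getting test coverage.
BLOCKS_COVERAGE = {Bustage.BUILD_BROKEN, Bustage.HARNESS_CRASH_OR_TIMEOUT}


def changesets_to_back_out(kind, culprit, later_pushes):
    """Decide what to back out when `culprit` is found to be busted.

    If the bustage hides test results for everything landed on top of it,
    back out the later pushes as well; otherwise back out the culprit only.
    """
    if kind in BLOCKS_COVERAGE:
        return [culprit] + list(later_pushes)
    return [culprit]


print(changesets_to_back_out(Bustage.BUILD_BROKEN, "B", ["C", "D", "E"]))
print(changesets_to_back_out(Bustage.TEST_FAILURE, "B", ["C", "D", "E"]))
```

A rule this coarse needs no knowledge of which tests cover which code changes, which is the point of keeping the definition simple.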
Can we do a week-long test run with both PGO and non-PGO builds
using our existing number, and watch the perf numbers closely to see
if it's really worth doing PGO builds per push for performance
regression testing?
Ehsan
> On 5/17/11 9:24 PM, Mike Connor wrote:
>>> Can we build both PGO and non-PGO, and use the latter for the correctness tests and the former for talos?
>>>
>>> We've already established that it takes 8-36 hours or so to find out about perf regressions (even on Linux, where the builds are quick), so this won't make things much worse for the perf number latency, while still giving us much shorter latency to test green.
>>>
>>> Heck, we could run all of the tests on both PGO and non-PGO but speculatively assume things are OK once the non-PGO correctness tests are done.
>>
>> Means doubling the Talos Windows pool + probably 50% more builders.
>
> For my last suggestion, yes. For my first one, no.
>
> > Definitely an interesting thing to consider, but a very very nonzero cost.
>
> Look, we're trying to get the tree to a better state. You're willing to impose costs on people but not hardware costs?
Where did I say that? I said it's expensive (true), so we should paint a clear picture of the benefit. SpecOps (the IT group that supports releng) is in early planning for some new infra, so I'd like to make the case with as much data as possible.
>> Aside from people who watch the tree, what's the benefit here?
>
> 1) That's a pretty big aside.
Remember that I've stated repeatedly that watching the tree (whether for two hours or five) is an unfortunate timesink we need to dramatically cull. If a) we have automated notifications, b) better sheriff coverage and c) -staging, the scope of "people watching the tree" gets much smaller, and the cost of that time shrinks. If it's 500 hours a year of dev time vs. $500k of hardware/colo
> 2) It's a benefit for everyone who doesn't have to reland their patches because some build under them went orange or red.
FWIW, I'm starting to think that the right thing to do is have the sheriff do this, in the scoped-down refinement of the original proposal.
-- Mike
Also note that PGO may hide compiler bugs:
https://bugzilla.mozilla.org/show_bug.cgi?id=657569
(thus disabling PGO will trigger these bugs; at the moment, that's the
only one)
Mike
Sold.
-Boris
OK, fair.
>>> Aside from people who watch the tree, what's the benefit here?
>>
>> 1) That's a pretty big aside.
>
> Remember that I've stated repeatedly that watching the tree (whether for two hours or five) is an unfortunate timesink we need to dramatically cull. If a) we have automated notifications, b) better sheriff coverage and c) -staging, the scope of "people watching the tree" gets much smaller, and the cost of that time shrinks. If it's 500 hours a year of dev time vs. $500k of hardware/colo
There are 250 work days per year. I fully expect sheriffs to spend a
full workday on the tree. So figure we're talking a full-time position,
if we have only one sheriff shift.
I agree that if we're talking $500k maybe it's not worth it. ;)
>> 2) It's a benefit for everyone who doesn't have to reland their patches because some build under them went orange or red.
>
> FWIW, I'm starting to think that the right thing to do is have the sheriff do this, in the scoped-down refinement of the original proposal.
Sounds good to me.
-Boris
The official rule I've observed is that everybody tries to figure out
which of the patches could have caused the problem and backs out this
one, not the others.
Anyway, we're not even talking about one of N patches actually having a
problem. We're talking about the possibility that one of them could turn
out to be a problem, and therefore the N patches should be backed out.
By that rule we should never land N patches in one push, unless you're
saying landing N patches when the tree was red is worse than landing N
patches when the tree was green.
Yet, we land N patches in one push all the time.
I think backing out patches that landed after the one showing first
bustage would make sense if we had the expectation that roughly every
second or every third patch will break the tree. But that's not the
expectation we should have. Patches generally don't break the tree.
Since getting an innocent patch backed out wastes people-time, I think
it would make more sense to back out the patch that broke the tree ASAP
and not back out whatever landed after it on the assumption that patches
generally don't break the tree.
Of course, this means that there's more to sort out if the subsequent
patches that landed before the backout actually also broke the tree, but
at least in the more common case, we would be no worse off than the
subsequent patches having landed in one push on a green tree.
--
Henri Sivonen
hsiv...@iki.fi
http://hsivonen.iki.fi/
Do you take the buildsymbols step into account there? If so, that's
really interesting and something that would probably be worth working on.
Robert Kaiser
--
Note that any statements of mine - no matter how passionate - are never
meant to be offensive but very often as food for thought or possible
arguments that we as a community needs answers to. And most of the time,
I even appreciate irony and fun! :)
And every merge is basically that as well, just as a note.
That's 2000 hours just for the sheriff, not counting the
tree-watching time spent by people who have landed or are waiting to
land.
- Mike
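The 2000-hour figure follows from the numbers given in the thread. A quick check, assuming "a full workday" means 8 hours (that value is an assumption, as is the single shift):

```python
WORK_DAYS_PER_YEAR = 250  # figure given in the thread
HOURS_PER_WORKDAY = 8     # assumption: "a full workday" taken as 8 hours
SHERIFF_SHIFTS = 1        # "if we have only one sheriff shift"

sheriff_hours = WORK_DAYS_PER_YEAR * HOURS_PER_WORKDAY * SHERIFF_SHIFTS
print(sheriff_hours)  # 2000

# Compared with the earlier 500 hours/year estimate of dev time:
print(sheriff_hours / 500)  # 4.0
```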
_______________________________________________
dev-planning mailing list
dev-pl...@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-planning
Ben
On Tue, May 17, 2011 at 5:56 PM, Kyle Huey <m...@kylehuey.com> wrote:
> On Tue, May 17, 2011 at 5:53 PM, Robert O'Callahan <rob...@ocallahan.org> wrote:
>
> > On Wed, May 18, 2011 at 12:47 PM, Kyle Huey <m...@kylehuey.com> wrote:
> >
> >> Even with a very powerful machine you're still looking at multiple hours.
> >> PGO essentially serializes compilation ... and we compile twice. We could
> >> be smart in the build system and put in a bunch of hacks to compile
> >> mozjs.dll, nspr4.dll, etc in parallel, but you're still limited by how long
> >> it takes to compile all of the code in xul.dll at essentially -j1, which is
> >> most of our code.
> >>
> >
> > How often do we encounter PGO-only bugs?
> >
>
> I think we've had a total of two, one of which was a compiler bug.
>
> >
> > I wonder if we could drop PGO from the main test matrix and do nightly PGO
> > builds and tests, or something like that. If the nightly shows a new PGO
> > test failure someone would have to bisect. That could conceivably be
> > automated.
> >
>
> Well this would mean that we'd only be getting daily perf numbers on our
> most used platform ... that may be too high a cost to pay.
>
> > Rob
> > --
> > "Now the Bereans were of more noble character than the Thessalonians, for
> > they received the message with great eagerness and examined the Scriptures
> > every day to see if what Paul said was true." [Acts 17:11]
> >
>
> - Kyle
> What about pushing bugs that are expected to change PGO with a tag like we
> do for NOBUILD (except in this case, it builds extra much!). Something like
> this might also be useful for clobbering on try.
>
> Ben
>
The general problem in dealing with the tree is not the things people expect to
happen, it's the things they don't expect to happen.
I'm not sure what you're referring to w.r.t tryserver. All builds are
clobbers there ... if you're talking about not clobbering, you really can't
do that, since you can't guarantee that the previous try run left the objdir
in any remotely sane state.
- Kyle
Anyways, PGO tag could be supplemental to nightly PGO builds if we were to
go that route.
Ben
Try builds are clobbered, always. m-c is not. A patch that needed a
clobber was pushed to m-c without a clobber being triggered and that
broke m-c. The author did not realize the patch needed a clobber,
because he'd run it on try and everything was fine... because try always
clobbers.
Is that clear? ;)
-Boris