Merge Queue Prioritization (ab)Use

60 views
Skip to first unread message

Paris, Eric

unread,
May 12, 2016, 4:54:11 PM5/12/16
to kuberne...@googlegroups.com
tl;dr;

- If you fix a test failure/flake you may raise the priority of one of
your PRs by one.
- Otherwise your PR is unprioritized/P3 just like everyone else's.
- https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Akind%2Fflake

----------

I brought this up on the community call today and heard little
resistance. I propose we, as a community, start following this
immediately. I'll put a doc in the repo assuming most are okay with
the idea.

We are suffering from 2 problems:

#1 Too many test failures means the merge queue is slow
#2 People are semi-arbitrarily adding priority/p* labels to their
issues to skip ahead in the queue

If there were no test failures/flakes the merge queue could handle
between 40-50 PRs per day. It would stay ~empty at our current
development rates. We need to get there. I think we can try to better
both of those issues at the same time with a new agreement among
developers. The new agreement is basically:

- test failure fixes are p0/p1 - your discretion
- if you fix a test failure/flake you may raise the priority of one of
your PRs by 1.
- otherwise your PR is p3 just like everybody else.


If you are suffering from "rebase-hell" go fix a flake, then move your
PR from P3 to P2. Just be sure to state in the PR WHY you are raising
the priority.

If there are a bunch of other PRs in P2 and you want to be P1? Go fix
another flake!

Fixing 2 flakes will allow you to move to the top of the queue (just
behind your flake fixes, most likely).

I will be sporadically going through all PRs which are not P3 and
asking/complaining if you seem to have prioritized your PR without
comment as to why. Everyone please leave a comment why you think your
PR should not be default/P3 when you change the priority. I'm going to
try to help people keep themselves honest. I'm not going to 'enforce'
anything here. We are (mostly) adults and need to work together. The
people who fix the test failure will be making the system better for
the rest of us.

Want to know what to fix? Start with:
https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+is%3Aissue+label%3Akind%2Fflake
or
https://github.com/kubernetes/kubernetes/issues/24753

Honestly, do anything to help the test system, then reward yourself
with a priority bump!

Comments, complaints? Lets squash those bugs and get moving again!

-Eric

David Oppenheimer

unread,
May 12, 2016, 6:36:22 PM5/12/16
to Paris, Eric, kuberne...@googlegroups.com
It seems to me that everyone in the community benefits if we are able to prioritize PRs that are considered high priority for an upcoming release, in two ways:
(1) people, whether at Google, RedHat, other companies, or private individuals, have stuff they need to get into a particular release
(2) people who don't have stuff they need to get into a release, probably don't want the release date to slip, but this is likely to happen if critical stuff is not able to get in

My suggestion would be transparency rather than prohibition. In particular, I would propose that we have a list of features that are considered critical for a release (exactly how "critical" gets defined TBD). There is a process for anyone in the community to add to this list (we can start with no review process for adding to the list, and only add one if it starts getting abused). PRs related to features on the list receive increased merge queue priority, and anything not on the list is not allowed to have increased merge queue priority.

Of course flake fixes would always receive increase merge queue priority.

Does that sound reasonable?



-Eric

--
You received this message because you are subscribed to the Google Groups "kubernetes-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kubernetes-de...@googlegroups.com.
To post to this group, send email to kuberne...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kubernetes-dev/CACLa4pvMVxf9LQOF-wGQsxFBGWeBpMaiW%3DY01YXVAH4ZVDgY1A%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

David Oppenheimer

unread,
May 12, 2016, 6:37:56 PM5/12/16
to Paris, Eric, kuberne...@googlegroups.com
(That said, I really really really like the idea of incentivizing test fixing -- I'm just not sure this is the best approach.)

Eric Tune

unread,
May 12, 2016, 7:01:01 PM5/12/16
to kubernetes-dev, epa...@parisplace.org
The submit queue prioritizes items with 1.3 milestone above those without, within the same priority band.  Do you view putting the milestone on as okay?  

If so, then that, combined with what Eric Paris suggested, means that the priority order is:

1. test flakes, and PRs of test flake fixers
2. Stuff for 1.3
3. Other stuff.

Which seems okay to me.

I don't want 1.3 to be randomly ordered with non-1.3.

Clayton Coleman

unread,
May 12, 2016, 7:39:06 PM5/12/16
to David Oppenheimer, Paris, Eric, kuberne...@googlegroups.com
I'll take the other side here - I've had 4 or 5 LGTM performance / bug fixes in the queue for almost two weeks.   When flakes peak, we waste a lot of time doing unproductive work.  Incentivizing people to seek out and fix flakes benefits everyone - I don't think today we have so many people fixing flakes that we are at risk of wasting development time.  

As an example, and this is not intended to pick on anyone, just to highlight a reality of merging large feature PRs - deployments flaked the queue for weeks.  I am absolutely sure that we will see an increase in flakes over the next two weeks that will put all 1.3 features at risk, and starting now to incentivize fixes seems like the right direction.

David Oppenheimer

unread,
May 12, 2016, 7:57:25 PM5/12/16
to Clayton Coleman, Paris, Eric, kuberne...@googlegroups.com
So IIUC, this is primarily about incentivizing people to fix the tests, as opposed to philosophical opposition to prioritizing PRs for the release? If so, can we try to think about other ways to incentivize?

I'm of course all in favor of fixing tests (who isn't ? :-) but from a practical standpoint, if people stop doing the remaining feature work for 1.3 to work on tests now, those features won't be finished by the code freeze.

If we really want to tackle the tests, I would prefer something like saying "we are going to work exclusively on tests for N weeks, and we are going to slip the code freeze by N weeks." But it's not clear to me that the community prefers slipping the release to prioritizing 1.3 PRs and then tackling the tests after the code freze.


Eric Tune

unread,
May 12, 2016, 8:08:57 PM5/12/16
to David Oppenheimer, Clayton Coleman, Paris, Eric, kuberne...@googlegroups.com
David, you didn't address Clayton's point that "we will see an increase in flakes over the next two weeks that will put all 1.3 features at risk".  

David Oppenheimer

unread,
May 12, 2016, 8:25:01 PM5/12/16
to Eric Tune, Clayton Coleman, Paris, Eric, kuberne...@googlegroups.com
Well, it's not two weeks--it's one week and one day...

I'm not convinced that a strategy of continuing to prioritize the 1.3 PRs and then work on the tests after the code freeze won't work.

But if it won't work, then there is no option that will allow us to met the code freeze deadline, and we should start discussing how much to slip it ASAP. But I thought people were opposed to slipping the code freeze deadline. 


Eric Paris

unread,
May 12, 2016, 8:33:18 PM5/12/16
to Eric Tune, David Oppenheimer, Clayton Coleman, Paris, Eric, kuberne...@googlegroups.com
I am philosophical not on the side of defacto prioritization of 'for
release' PRs. (Different answer likely Monday, May 23.) However that
does not mean I don't recognize why you (and others) have suggested it.

I'm not sure I buy the premise (although I may well be wrong) that
among the 79 PRs which wanted to be merged today that there were more
than a handful from authors who would happily wait until 1.4. People
are pushing even more than usual numbers of PRs because of the 1.3
deadline; because they want them in 1.3.

I am good with Eric Tune's idea about the 1.3 milestone. If you want
your PR in 1.3, add the milestone. If you don't care, leave it off.
That sounds very reasonable to me. And if I'm wrong, and people don't
want their work in 1.3 it 'solves' the problem as your suggest.

But we have many PRs LGTM'd more than a week before the feature freeze
date and the 'critical' list shouldn't be a "too bad too sad list"
until at least Monday 23rd. The door should be open to all. I think
anyone with a LGTM'd PR that gets turned away now because their work
isn't on some list, despite being complete in plenty of time is a bad
idea.

I think no matter what, we are going to have to go into high-gear for
test failures (I hate calling them flakes, they are failures even if
they don't fail 100% of the time). I'd just rather start now than
refuse people's work and keep digging deeper as we jam code in by hand.

For reference:
Saad merged 13 PRs in the last 24 hours by hand (thank you)
The bot merged 15 PRs which required testing

We merged 28 'meaningful' PRs today. With no flakes the bot can merge
between 40-50/day. So a bit over 36 hours we could have an empty queue.

The problem as we hit on 1.3 is not the number of PRs. Or even how late
they are coming. The problem is the flakes. Fix the failures and we
have none of this priority issue. So I don't want to say no to work
when the problem isn't the amount of work. I want to say no to flakes.
(And next week we do more saying no to work)

-Eric



On Thu, 2016-05-12 at 17:08 -0700, 'Eric Tune' via kubernetes-dev
wrote:
> David, you didn't address Clayton's point that "we will see an
> increase in flakes over the next two weeks that will put all 1.3
> features at risk".  
>
> On Thu, May 12, 2016 at 4:57 PM, 'David Oppenheimer' via kubernetes-
> dev <kuberne...@googlegroups.com> wrote:
> > So IIUC, this is primarily about incentivizing people to fix the
> > tests, as opposed to philosophical opposition to prioritizing PRs
> > for the release? If so, can we try to think about other ways to
> > incentivize?
> >
> > I'm of course all in favor of fixing tests (who isn't ? :-) but
> > from a practical standpoint, if people stop doing the remaining
> > feature work for 1.3 to work on tests now, those features won't be
> > finished by the code freeze.
> >
> > If we really want to tackle the tests, I would prefer something
> > like saying "we are going to work exclusively on tests for N weeks,
> > and we are going to slip the code freeze by N weeks." But it's not
> > clear to me that the community prefers slipping the release to
> > prioritizing 1.3 PRs and then tackling the tests after the code
> > freze.
> >
> >
> >
> > On Thu, May 12, 2016 at 4:38 PM, Clayton Coleman <ccoleman@redhat.c
> > > > On Thu, May 12, 2016 at 1:54 PM, Paris, Eric <eparis@parisplace
> > > > > it, send an email to kubernetes-dev+unsubscribe@googlegroups.
> > > > > com.
> > > > > To post to this group, send email to kubernetes-dev@googlegro
> > > > > ups.com.
> > > > > To view this discussion on the web visit https://groups.googl
> > > > > e.com/d/msgid/kubernetes-dev/CACLa4pvMVxf9LQOF-
> > > > > wGQsxFBGWeBpMaiW%3DY01YXVAH4ZVDgY1A%40mail.gmail.com.
> > > > > For more options, visit https://groups.google.com/d/optout.
> > > > >
> > > >
> > > > -- 
> > > > You received this message because you are subscribed to the
> > > > Google Groups "kubernetes-dev" group.
> > > > To unsubscribe from this group and stop receiving emails from
> > > > it, send an email to kubernetes-de...@googlegroups.co
> > > > m.
> > > > To post to this group, send email to kubernetes-dev@googlegroup
> > > > s.com.
> > > > To view this discussion on the web visit https://groups.google.
> > > > com/d/msgid/kubernetes-
> > > > dev/CAOU1bzcJvph4XTAMkXtT%2BPHiPZetyT3yUzh1znCR39i3MgRBYQ%40mai
> > > > l.gmail.com.

Eric Paris

unread,
May 12, 2016, 10:22:48 PM5/12/16
to David Oppenheimer, Eric Tune, Clayton Coleman, Paris, Eric, kuberne...@googlegroups.com
I think you are correct in some respect. If you have a large work item
trying to race it in at the last minute you (and your team) are going
to have to do a delicate balance between helping yourself get your
feature merged and helping every single other person get their features
and bug fixes merged.

I think my suggestion is going to largely impact the coders at large
companies. Individuals or contributers with reasonable features should
be able to get in the queue and be happy (eventually). It will be us,
the Googlers, the Red Hatters, and other larger entities who are trying
to push large multi-staged, multi-PR features with dependency chains at
the last moment. And those people should, in my mind, be incentivised
to weigh fixing test failures.

We seem to be all agreeing, no matter what, those 'critical' patches
are going in to 1.3. Doesn't matter if we just jam them in by hand or
start fixing flakes first. The work is the same. We must for release:
(a) get the features in
(b) fix the test failures.
The release date is not impacted by the order of (a) and (b).

I believe doing a bit more (b) right now gives the non-critical path
people a fighting change. Doing all (a) sort of thumbs our nose at the
non-critical path patches.

I'm talking specifically about what we do for the next 1-2 weeks here,
but think the Eric/Eric compromise might help us not get quite so far
behind in 1.4...

Then again, I'm open to ANY solution people have to drain the merge
queue besides me just merging everything and testing nothing  :)

On Thu, 2016-05-12 at 17:37 -0700, David Oppenheimer wrote:
> But saying no to flakes right now is saying no to work, because
> people would have to stop their work to work on flakes. Which is fine
> if that's what we want to do, but then we need to push back the
> deadline (for everyone).

David Oppenheimer

unread,
May 12, 2016, 10:58:08 PM5/12/16
to Eric Paris, Clayton Coleman, kuberne...@googlegroups.com, Paris, Eric, Eric Tune

I agree that the order doesn't really affect the ship date, but it does affect the freeze date: if we stop feature work now to fix flaky tests, then we'll have to push back the code freeze date. If we wait to fix flaky tests until after the code freeze, then we don't need to push back the code freeze date (at least not for the reason that we spent time fixing flaky tests). I don't know how many people care whether we push back the code freeze date (I know some people do, though).

Clayton Coleman

unread,
May 12, 2016, 11:28:42 PM5/12/16
to David Oppenheimer, Eric Paris, kuberne...@googlegroups.com, Paris, Eric, Eric Tune
I'm specifically thinking of the end of 1.2, where we got very flaky right at the end and were unable to close things out because each multi-stage PR was caught up in flakes - to the point where we had to stop everything anyway to address them.  And we still slipped.

T.J. Goltermann

unread,
May 13, 2016, 2:22:15 AM5/13/16
to Clayton Coleman, David Oppenheimer, Eric Paris, kuberne...@googlegroups.com, Paris, Eric, Eric Tune
I thought a bit about it this evening, and I'll guess I'll register what sounds like a dissenting opinion.

We all know flakes are bad and the merge queue throughput is bad.  However, there are two really easy things we can do to help the situation short term:
1) The build cop oncall can manually merge small, low risk PRs.  This along with the judicious use of e2e-not-required will focus the queue's time on verifying the more complex (risky) fixes.
2) We can (uniformly and fairly) apply the v1.3 milestone tag to any feature work in 1.3.  We would not use that label for PRs we'd take after the feature complete date next week (bug fixes, cleanups, etc.)

This creates a three-tiered queue: some stuff flies right in, some gets the appropriate bump up and gets tested, and some gets put in the low pri bin.  I suspect that's enough to get us through next week to the feature complete date of May 20 (perhaps including the weekend after to drain it).

The alternative to slide out the feature complete date is an appealing option, but I think there's real issues with it:
- it delays the highest churn work until later, which in turn delays finding and fixing those inevitable flakes (i.e. the deployment code in 1.2)
- it puts us in the unfortunate world like 1.2, where we're feature complete...except X, Y and Z - which makes it harder to hold the line on large feature-sized fixes incoming
- it's a heavy context switch for everyone working hard to finish their features, right in the crunch week when they need to focus the most....then a context switch a week or two later when we come back off flake surge to pick that feature work up again

My vote is that we're so close (6 business days) from finishing feature complete and drastically tamping down churn that we should just push through, even with a painful submit queue.  I just don't see that slipping feature complete is a much better option.

--
You received this message because you are subscribed to the Google Groups "kubernetes-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kubernetes-de...@googlegroups.com.
To post to this group, send email to kuberne...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kubernetes-dev/205102684334216995%40unknownmsgid.

For more options, visit https://groups.google.com/d/optout.



--

-T.J. Goltermann

David Oppenheimer

unread,
May 13, 2016, 9:56:45 AM5/13/16
to Eric Paris, Eric Tune, Clayton Coleman, Paris, Eric, kuberne...@googlegroups.com
But saying no to flakes right now is saying no to work, because people would have to stop their work to work on flakes. Which is fine if that's what we want to do, but then we need to push back the deadline (for everyone).


On Thu, May 12, 2016 at 5:33 PM, Eric Paris <epa...@redhat.com> wrote:
Reply all
Reply to author
Forward
This conversation is locked
You cannot reply and perform actions on locked conversations.
0 new messages