Kubernetes Reliability


Wojciech Tyczynski

Jul 24, 2020, 6:51:29 AM7/24/20
to kubernetes-sig-architecture, Tim Hockin, John Belamaric
 Hi SIG Architecture,
  I'm sure that Kubernetes reliability is near and dear to most if not all of us. All our users rely on Kubernetes being stable and reliable, and as Kubernetes matures further, that bar only rises.

  As SIG Architecture, we created the "Production Readiness" subproject with exactly that goal: raising the bar (ensuring that Kubernetes is easier to support, that features can be monitored, and that feature owners think about upgrades, scalability, and so on).
  But, in my personal opinion, this is not enough. As a community we need to do more. And I think SIG Architecture is the best place to discuss this and decide how we (as the whole community) can increase our investment in the reliability of the system.

  I'm fairly sure all of you can provide a number of other examples, so just to name a few areas from my own list where we're not investing nearly enough:

1. soak tests - we have nothing; the longest-running tests as of now are scalability tests that take ~12h on a 5k-node cluster; if a problem only appears after a day, we're blind;

2. upgrade testing - I think we all agree that we're far from where we would like to be; when creating the "Production Readiness Template" we considered making an "upgrade test" a requirement for GA. But neither the upgrade test infrastructure nor the existing tests are at a stage where we can require everyone to rely on them and treat them as examples built on recommended best practices.

3. chaos testing - SIG Scalability is doing a bit of this, but not nearly enough; we have some disruptive tests, but they in turn are not being run at scale. There is a lot that can be done in this area too.

4. hardening the control plane & protecting people from shooting themselves in the foot; currently it's way too easy to overload or break your cluster; the Priority & Fairness effort driven by SIG API Machinery will improve this, but again, we need much more (and I guess we could also use many more people on that effort).
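As an illustration of the long-horizon signal a soak test would give us (a minimal sketch only; the probe here is a hypothetical stand-in for a real cluster health check, e.g. polling /readyz or listing nodes):

```python
import time


def soak(probe, iterations, interval_s=0.0):
    """Run `probe` repeatedly, recording the iteration of every failure.

    A scalability run that stops after ~12h only covers the early
    iterations; problems that surface later show up here as late failures.
    """
    failures = []
    for i in range(iterations):
        if not probe(i):
            failures.append(i)
        time.sleep(interval_s)
    return failures


# Stub probe standing in for a real health check: healthy for the first
# 100 iterations, then degraded -- the kind of issue a short-lived test
# run never sees.
def degrades_after_100(i):
    return i < 100


print(soak(degrades_after_100, 120))  # failures begin at iteration 100
```

The point of the sketch is only that failure indices past the horizon of today's longest runs would currently go unobserved.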

  I wanted to bring this topic to your attention and get your thoughts on what we can do in this situation. 
1. Recommending the creation of a "SIG Reliability" is one thing that comes to mind, but there are counterarguments (overlap with existing SIGs, fear that other SIGs will start to think they no longer need to care about those things, etc. - though the SIG Scalability example shows that it doesn't have to be the case).
2. Creating a WG, with the goal of creating a strategy to address the problem and with people from other SIGs coming to it, is another idea. But it's not obvious that any reasonable "exit criteria" exist here.
3. Empowering existing SIGs to invest in reliability-related efforts is another option. But I think some coordination across the different efforts would be very helpful.

 There are also other challenges, like the fact that much of this is not fancy feature work but grungy work. So how can we attract people to actually do it?


  I don't have any concrete proposal at this point, but I would really like to get the discussion going.
I'm adding this topic to the upcoming SIG Architecture meeting to discuss further, in the meantime all the ideas, comments and concerns are more than welcome in this thread.

 thanks
wojtek

Benjamin Elder

Jul 24, 2020, 11:48:49 AM7/24/20
to Wojciech Tyczynski, kubernetes-sig-architecture, Tim Hockin, John Belamaric
1, 2, and 3 sound like SIG Testing, nominally.
I would encourage people interested in these topics to consider participating more in SIG Testing; we would be very interested to see work in these areas.

I personally have an objective to explore upgrade testing in Q3, but I've been unable to give it attention yet because of the demands of keeping what we have now healthy (e.g. https://github.com/kubernetes/kubernetes/issues/92937), which few people are working on.

4. Sort of sounds like an ask for API Machinery?

I wholeheartedly agree that we need to do more here, thanks for starting the conversation.

--
You received this message because you are subscribed to the Google Groups "kubernetes-sig-architecture" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kubernetes-sig-arch...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kubernetes-sig-architecture/CAPgfqyXH02dMwd4c-6eC7uKNnTtZfA2fshtc%3DC4nSEE%2B75Dw1Q%40mail.gmail.com.


Daniel Smith

Jul 24, 2020, 11:59:52 AM7/24/20
to Wojciech Tyczynski, K8s API Machinery SIG, kubernetes-...@googlegroups.com, kubernetes-sig-architecture, Tim Hockin, John Belamaric
On Fri, Jul 24, 2020 at 3:51 AM 'Wojciech Tyczynski' via kubernetes-sig-architecture <kubernetes-si...@googlegroups.com> wrote:
4. hardening the control plane & protecting people from shooting themselves in the foot; currently it's way too easy to overload or break your cluster; the Priority & Fairness effort driven by SIG API Machinery will improve this, but again, we need much more (and I guess we could also use many more people on that effort).

As Ben said, APF is an API Machinery effort. However, much like the resource quota mechanism, it's not a complete solution. Specifically, we have built tools to enforce limits, but we haven't yet built any tools to figure out what the limits should be, or to automatically set them.

(Also, yes, we could use some additional folks, especially those already in reviewer/approvers, hit me up on slack if you want to participate.)
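To illustrate the gap described above: APF enforces per-priority-level concurrency limits derived from configured shares, but nothing yet computes the shares for you. A toy version of the proportional split (illustrative arithmetic only, not the actual apiserver code; the level names are made up):

```python
def nominal_concurrency(total, shares):
    """Split a total concurrency budget proportionally to each priority
    level's shares, roughly as API Priority and Fairness divides the
    apiserver's concurrency among its priority levels."""
    denom = sum(shares.values())
    return {name: total * s / denom for name, s in shares.items()}


# Hypothetical configuration: 600 total in-flight request slots split
# across four made-up priority levels.
limits = nominal_concurrency(600, {"system": 30, "workload-high": 40,
                                   "workload-low": 100, "catch-all": 5})
# "system" gets 30/175 of 600; the shares themselves are still a human guess.
```

The enforcement side exists; the open question is tooling that observes real traffic and proposes the `shares` dict automatically.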
 


Juan-Lee Pang

Jul 24, 2020, 1:11:26 PM7/24/20
to kubernetes-sig-architecture
Is there currently any SIG that drives reliability issues across SIGs? Is there anyone collecting reliability scenarios, defining SLOs for those scenarios, and figuring out how to measure them? If not, these might be good opportunities for a potential reliability SIG.

Mike Spreitzer

Jul 24, 2020, 1:29:16 PM7/24/20
to Benjamin Elder, John Belamaric, kubernetes-sig-architecture, Tim Hockin, Wojciech Tyczynski
And speaking of https://github.com/kubernetes/kubernetes/issues/92937 ,
that raises another big issue --- perhaps even a more important one. We
are dying of flakiness now.


Lauri Apple

Jul 24, 2020, 2:57:41 PM7/24/20
to Mike Spreitzer, Benjamin Elder, John Belamaric, kubernetes-sig-architecture, Tim Hockin, Wojciech Tyczynski
My concern would be that creating a SIG for reliability would end up siloing the topic (which, from my understanding, didn't serve the goals of sig-pm, RIP, very well). Embedding reliability in existing SIGs' work as a value/attribute to consistently strive for ensures that reliability remains everyone's concern.

One approach could be to handle reliability in a federated way — SIGs make specific tactical efforts to maintain or increase it, then have reps meet on some cadence to collaborate on knowledge exchange and strategic issues. (If this idea sounds familiar, it's because I proposed it in the thread on the proposed CI Signal subproject issue.) If we don't have any SIG empowered and actively working on the questions Juan-Lee posed — "is there anyone collecting reliability scenarios, defining SLOs for those scenarios, and figuring out how to measure them?" — maybe a working group could be formed to prototype those tools and then collaborate with SIGs on refining and adopting them.

While we're on this topic, I highly recommend watching John Allspaw's talk "People are the adaptable element of complex systems," if you've got ~an hour.


Wojciech Tyczynski

Jul 27, 2020, 12:42:10 PM7/27/20
to Lauri Apple, Mike Spreitzer, Benjamin Elder, John Belamaric, kubernetes-sig-architecture, Tim Hockin
 Thanks for the replies. To respond to the points that have come up in this thread:

1. Re P&F - sure, this is an api-machinery effort. But that was just an example. As Daniel mentioned, it's just an initial mechanism. Now we will need to tune it (or ideally auto-tune it). And that is another effort.
 But that is still just one example. Things like smarter retries (instead of always sending Retry-After=1) and smarter batching for endpoints (https://github.com/kubernetes/kubernetes/issues/88967) are just
 two other examples.
 I think the discussion has focused too much on the examples I gave (which were meant to be just examples), rather than on the general problem itself. There are dozens if not hundreds of these smaller issues.
 And each of them can probably belong to an individual SIG, but we need to look at the problem more holistically to decide which ones are actually the most important and prioritize correctly *across SIGs*.
 As an example, some problems may become obsolete if we build more generic solutions.
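As a sketch of what "smarter retries" might look like on the client side (illustrative only, not actual client-go behavior): honor a server-provided Retry-After when present, and otherwise use capped exponential backoff with jitter instead of a fixed one-second delay:

```python
import random


def retry_delay(attempt, retry_after=None, base=0.5, cap=30.0):
    """Pick the next retry delay in seconds.

    Prefer the server's Retry-After hint when present; otherwise use
    capped exponential backoff with full jitter so that many clients
    retrying at once don't stampede the apiserver in lockstep.
    """
    if retry_after is not None:
        return float(retry_after)
    return random.uniform(0, min(cap, base * (2 ** attempt)))


# With no server hint, the upper bound doubles per attempt up to the cap.
for attempt in range(8):
    assert 0 <= retry_delay(attempt) <= min(30.0, 0.5 * 2 ** attempt)
# A server hint always wins over the computed backoff.
assert retry_delay(3, retry_after=7) == 7.0
```

The names and defaults here are made up; the point is only that a dynamic hint plus jittered backoff spreads retry load better than a constant one-second delay.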

2. Re testing efforts. Sure, that may be part of SIG Testing (as I also wrote in my initial email). But the problem is that SIG Testing in its current shape doesn't
  seem to have the capacity to handle them (or they don't have a high enough priority).
  And I think the root problem is that we don't really have people who are both interested in those efforts AND have enough capacity to push them forward (as opposed to
  helping shape and contributing to the overall direction). And the quality of the whole project is suffering for it.
  [And as I mentioned before, these aren't particularly fancy things, so it's hard to attract people to them too.]

3. Re SLOs, etc. - we are making them part of the Production Readiness Template, and SIG Scalability has a subproject to define and measure performance-oriented ones (in coordination
   with other SIGs, like networking). So this particular area is probably one of the places where we're in relatively good shape (compared to the others mentioned above)
   [though obviously we can still do much better].

 thanks
wojtek

Tim Hockin

Jul 27, 2020, 3:37:08 PM7/27/20
to Wojciech Tyczynski, Lauri Apple, Mike Spreitzer, Benjamin Elder, John Belamaric, kubernetes-sig-architecture
I agree STRONGLY with the fear that making reliability "someone else's
problem" means people will pay even less attention to it. Not what we
need.

I also agree with Mike - our test health is horrific right now and,
frankly, the incentives are wrong.

I don't know what to do about it. If an internal project were this
bad, we would have called a "Code Yellow" a long time ago, and we
would not have shipped another release until the flakes were all but
gone. We can't really do that here, or rather we have not had the
political will to do it. What would happen if we just shut the
project down indefinitely? No more merges unless you are fixing test
flakes or real bugs (not "lack of feature" bugs). Period. No
exceptions? Would people help fix flakes, or would they just work in
their own feature trees until the doors re-opened? I don't know how
to crack this nut. It's not the first time we have discussed it,
either.

Tim

Jordan Liggitt

Jul 27, 2020, 4:30:20 PM7/27/20
to Tim Hockin, Wojciech Tyczynski, Lauri Apple, Mike Spreitzer, Benjamin Elder, John Belamaric, kubernetes-sig-architecture, Aaron Crickenberger, le...@kubernetes.io, kubernetes-sig-testing, kubernetes-sig-release
On Mon, Jul 27, 2020 at 3:37 PM 'Tim Hockin' via kubernetes-sig-architecture <kubernetes-si...@googlegroups.com> wrote:
I also agree with Mike - our test health is horrific right now and,
frankly, the incentives are wrong.

I don't know what to do about it.  If an internal project were this
bad, we would have called a "Code Yellow" a long time ago, and we
would not have shipped another release until the flakes were all but
gone.  We can't really do that here, or rather we have not had the
political will to do it.  What would happen if we just shut the
project down indefinitely?  No more merges unless you are fixing test
flakes or real bugs (not "lack of feature" bugs).  Period.  No
exceptions?  Would people help fix flakes, or would they just work in
their own feature trees until the doors re-opened.  I don't know how
to crack this nut.  It's not the first time we have discussed it,
either.

(adding sig-testing/sig-release and sig leads for visibility)

I've had a lot of conversations on this topic over the last week. The last month has made it clear that changes in our testing approach are badly needed. We have to hold SIGs/component owners accountable for their test quality.

For that to be reasonable, those tests have to execute consistently, which requires changes in the way we run test jobs. Aaron, Ben, and I are working on a proposal for some concrete changes to prow / CI configuration and policy to accomplish that.

Once we have a solid test env foundation, we can better identify test problems (flakes due to test problems or component bugs, constant test job failures, excessively long presubmits, etc) and require SIGs to resolve them.
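One simple heuristic for separating flakes from real regressions, assuming we can group job results by the commit they ran against (a sketch, not any existing triage tool): a test that both passes and fails on the same commit is flaky rather than deterministically broken:

```python
from collections import defaultdict


def find_flakes(results):
    """results: iterable of (test_name, commit, passed) tuples.

    A test that both passed and failed against the same commit cannot be
    a deterministic regression introduced by that commit -- flag it as a
    flake. Consistent failures stay attributed to real breakage.
    """
    seen = defaultdict(set)  # (test, commit) -> set of observed outcomes
    for test, commit, passed in results:
        seen[(test, commit)].add(passed)
    return sorted({test for (test, _), outcomes in seen.items()
                   if outcomes == {True, False}})


# Hypothetical run history: TestA flips on the same commit (flake),
# TestB fails consistently (likely a real bug).
runs = [("TestA", "abc123", True), ("TestA", "abc123", False),
        ("TestB", "abc123", False), ("TestB", "abc123", False)]
print(find_flakes(runs))  # ['TestA']
```

Consistent job environments are exactly what makes this grouping meaningful: without them, environment noise masquerades as per-commit flips.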

I am in favor of prioritizing this stabilization and test health (which is likely to require effort from all SIGs) over opening master for 1.20 features. As history has shown, the alternative is reduced velocity, wasted resources, and riskier releases for everyone.

Jordan

 

Benjamin Elder

Jul 27, 2020, 4:34:47 PM7/27/20
to Jordan Liggitt, Tim Hockin, Wojciech Tyczynski, Lauri Apple, Mike Spreitzer, John Belamaric, kubernetes-sig-architecture, Aaron Crickenberger, le...@kubernetes.io, kubernetes-sig-testing, kubernetes-sig-release
> 2. Re testing efforts. Sure that may be part of SIG testing (as I also wrote in my initial email). But the problem is that SIG testing in the current shape doesn't
  seem to have capacity to handle them (or they don't have high enough priority).

I would agree that some form of this is currently true to a certain extent.
I don't think forming more groups solves this in any way.

If anything it adds more overhead to be involved because now you must track N+1 meetings / groups.

Tomorrow we will be discussing the proposal Jordan linked in the SIG Testing meeting, I hope you all can join.

Tim Hockin

Jul 27, 2020, 4:58:05 PM7/27/20
to Benjamin Elder, Jordan Liggitt, Wojciech Tyczynski, Lauri Apple, Mike Spreitzer, John Belamaric, kubernetes-sig-architecture, Aaron Crickenberger, le...@kubernetes.io, kubernetes-sig-testing, kubernetes-sig-release
I don't think I can join, but I'd love to see notes or slides or something.

Daniel Smith

Jul 27, 2020, 6:22:44 PM7/27/20
to Benjamin Elder, Jordan Liggitt, Tim Hockin, Wojciech Tyczynski, Lauri Apple, Mike Spreitzer, John Belamaric, kubernetes-sig-architecture, Aaron Crickenberger, le...@kubernetes.io, kubernetes-sig-testing, kubernetes-sig-release
On Mon, Jul 27, 2020 at 1:34 PM 'Benjamin Elder' via leads <le...@kubernetes.io> wrote:

Tomorrow we will be discussing the proposal Jordan linked in the SIG Testing meeting, I hope you all can join.

I won't be able to join, but I hope the proposal is agreed on. Otherwise I have a list of even more draconian ideas... :)


Vallery Lancey

Jul 27, 2020, 7:01:17 PM7/27/20
to Jordan Liggitt, Tim Hockin, Wojciech Tyczynski, Lauri Apple, Mike Spreitzer, Benjamin Elder, John Belamaric, kubernetes-sig-architecture, Aaron Crickenberger, le...@kubernetes.io, kubernetes-sig-testing, kubernetes-sig-release
I'm skeptical about the viability of horizontal SIGs in general. I may well be wrong, but speaking as an observer and former co-chair, horizontal SIGs seem to have a major silo and lack-of-influence problem. There is also a lack of people power being pumped into these less visible chop-wood-carry-water jobs. Companies want specific features, and typically don't have the same stake in wide-scope security, reliability, usability, etc. (I would argue the "lack of" is even a plus/opportunity for many in the space). Also, I have typically not seen a dedicated, discrete "reliability team" play out well within various commercial orgs and products.

I would be very happy to see the testing situation improved, but test running and deflaking is only a part of reliability. There are other critical and broad concerns like feature cross-interaction, scalability-related reliability issues, and operator mistakes arising from complexity. I don't pretend to have a solution on hand, but I strongly think that tighter visibility of trends and more participation in KEPs from SIG-Arch is the path forward. The PRR subproject is a great step in that direction, but we also need to pay closer attention to the existing body of work.


Aaron Crickenberger

Jul 29, 2020, 12:35:43 PM7/29/20
to Vallery Lancey, Jordan Liggitt, Tim Hockin, Wojciech Tyczynski, Lauri Apple, Mike Spreitzer, Benjamin Elder, John Belamaric, kubernetes-sig-architecture, le...@kubernetes.io, kubernetes-sig-testing, kubernetes-sig-release
Just to close the loop, we discussed the proposed policy changes Jordan linked at SIG Testing yesterday (Notes, Recording).  There were no objections to moving forward.

Explicit resource requirements/limits have been set for most of the release-blocking jobs as a proof of concept, and the CI Signal team is monitoring and tuning appropriately.

Next steps are to break down the changes into delegatable chunks of work, and consider what else we feel is necessary beyond the proposed changes thus far.  Lauri has volunteered to help coalesce the N docs, chats, slack threads, etc. into one place.

Like Jordan, I favor holding off on new 1.20 feature/enhancement work until we feel the situation has improved.

- aaron

Lauri Apple

Jul 29, 2020, 12:43:29 PM7/29/20
to Aaron Crickenberger, Vallery Lancey, Jordan Liggitt, Tim Hockin, Wojciech Tyczynski, Mike Spreitzer, Benjamin Elder, John Belamaric, kubernetes-sig-architecture, le...@kubernetes.io, kubernetes-sig-testing, kubernetes-sig-release
Thanks for updating, Aaron. Collating the docs is the first step—I'm hopeful we can turn the results into a brief roadmap-type document that clarifies our objectives, priorities, and the boundaries of who does/owns what. 

Tim Hockin

Jul 29, 2020, 2:03:55 PM7/29/20
to Lauri Apple, Aaron Crickenberger, Vallery Lancey, Jordan Liggitt, Wojciech Tyczynski, Mike Spreitzer, Benjamin Elder, John Belamaric, kubernetes-sig-architecture, le...@kubernetes.io, kubernetes-sig-testing, kubernetes-sig-release
Can we craft a set of small, pointed messages that we can start to
socialize about this? AKA "This is why I am not approving your PR"?

Wojciech Tyczynski

Aug 4, 2020, 6:49:42 AM8/4/20
to Tim Hockin, Jordan Liggitt, Davanum Srinivas, Lauri Apple, Aaron Crickenberger, Vallery Lancey, Mike Spreitzer, Benjamin Elder, John Belamaric, kubernetes-sig-architecture, le...@kubernetes.io, kubernetes-sig-testing, kubernetes-sig-release
 To follow up from the last SIG meeting, I started a doc with a proposal to create a working group:

 It's not fully finished, but hopefully it contains the most critical information, so I would appreciate any feedback (especially if you disagree with any points there)
so that we can resolve it as soon as possible.

 I shared that with a couple of SIG mailing lists; if you don't have access, let me know.

 thanks
wojtek

Davanum Srinivas

Aug 4, 2020, 9:56:02 PM8/4/20
to Wojciech Tyczynski, Tim Hockin, Jordan Liggitt, Lauri Apple, Aaron Crickenberger, Vallery Lancey, Mike Spreitzer, Benjamin Elder, John Belamaric, kubernetes-sig-architecture, le...@kubernetes.io, kubernetes-sig-testing, kubernetes-sig-release
wojtek,

+1 i’d support this WG and the goals listed.

thanks,
Dims

Wojciech Tyczynski

Aug 31, 2020, 4:12:25 AM8/31/20
to Davanum Srinivas, Tim Hockin, Jordan Liggitt, John Belamaric, Aaron Crickenberger, Lauri Apple, Vallery Lancey, Mike Spreitzer, Benjamin Elder, kubernetes-sig-architecture, le...@kubernetes.io, kubernetes-sig-testing, kubernetes-sig-release
 Thanks everyone for the great feedback (and sorry for sitting on it for so long - it was a busy period for me).

 I have updated the proposal to reflect different kinds of feedback I got.

 Please take another look at the doc (I would like to kick off the process of officially creating this WG in the next week or so, but I want to ensure there is no fundamental disagreement on anything in the proposal).

 thanks
wojtek

--
To unsubscribe from this group and stop receiving emails from it, send an email to leads+un...@kubernetes.io.

Wojciech Tyczynski

Sep 16, 2020, 5:04:22 AM9/16/20
to Davanum Srinivas, Tim Hockin, Jordan Liggitt, John Belamaric, Aaron Crickenberger, Lauri Apple, Vallery Lancey, Mike Spreitzer, Benjamin Elder, kubernetes-sig-architecture, le...@kubernetes.io, kubernetes-sig-testing, kubernetes-sig-release
 To close the loop here. The proposal went official last week: https://github.com/kubernetes/community/pull/5127 - I hope we will have that approved in the next week or so.

 Thanks for all the feedback along the way!

 thanks
wojtek
