Hi SIG Architecture,

I'm sure that Kubernetes reliability is near and dear to most, if not all, of us. All our users rely on Kubernetes being stable and reliable, and as Kubernetes matures further, that bar only rises.

As SIG Architecture, we created the "Production Readiness" subproject with exactly that goal of raising the bar (ensuring that Kubernetes is easier to support, that features can be monitored, that feature owners think about upgrades, scalability, and so on).

But, in my personal opinion, this is not enough. As a community we need to do more. And I think SIG Architecture is the best place to discuss this and decide how we (as the whole community) can increase our engagement in the reliability of the system.

I'm fairly sure all of you can provide a number of other examples, so just to name a few from my own list of things we're by far not investing enough in:

1. Soak tests - we have nothing; the longest-running tests as of now are the scalability tests that take ~12h on a 5k-node cluster; if a problem appears after a day, we're blind (a rough sketch of what a longer-running probe could look like follows this message).

2. Upgrade testing - I think we all agree that we're far from where we would like to be; when creating the "Production Readiness Template" we considered making an "upgrade test" a requirement for GA, but neither the upgrade test infrastructure nor the existing tests are at the stage where we can require everyone to rely on them and treat them as examples built on recommended best practices.

3. Chaos testing - SIG Scalability is doing a little bit of it, but by far not enough; we have some disruptive tests, but those in turn are not being run at scale. There is a lot that can be done in this area too.

4. Hardening the control plane and protecting people from shooting themselves in the foot - currently it's way too easy to overload or break your cluster; the Priority & Fairness effort driven by SIG API Machinery will improve this, but again, we need much more (and I guess we could also put many more people to work on that effort).
I wanted to bring this topic to your attention and get your thoughts on what we can do about it.

1. Recommending the creation of a "SIG Reliability" is one thing that comes to mind, but there are counterarguments (overlap with existing SIGs, fear that other SIGs will start to think they no longer need to care about those things, etc. - though the SIG Scalability example shows that it doesn't have to be the case).

2. Creating a WG, with people from other SIGs joining it, whose goal is to build a strategy for addressing the problem is another idea. But it's not obvious that any reasonable "exit criteria" exist here.

3. Empowering existing SIGs to invest in reliability-related efforts is another option. But I think some coordination across the different efforts would be very helpful.

There are also other challenges, such as the fact that much of this is not fancy feature work but rather grungy work. So how can we attract people to actually do it?

I don't have a concrete proposal at this point, but I would really like to kick the tires on this.

I'm adding this topic to the upcoming SIG Architecture meeting to discuss further; in the meantime, all ideas, comments, and concerns are more than welcome in this thread.

Thanks,
wojtek
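To make the soak-test gap in item 1 above concrete, here is a minimal illustrative sketch of what a long-running API-server probe could look like. This is not an existing Kubernetes test suite; it only assumes client-go, and the 48-hour duration, probe interval, and error accounting are illustrative choices rather than established requirements.

// soak.go - hypothetical sketch of a minimal soak-style check: poll the API
// server for an extended period and record error rates over time. Names and
// thresholds are illustrative assumptions, not part of any existing suite.
package main

import (
	"context"
	"log"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default path; in CI this would come from the job environment.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatalf("loading kubeconfig: %v", err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatalf("building client: %v", err)
	}

	const duration = 48 * time.Hour // run well past the ~12h scalability-test window
	const interval = 30 * time.Second
	deadline := time.Now().Add(duration)

	var total, failures int
	for time.Now().Before(deadline) {
		// Cheap, bounded probe: list a single pod across all namespaces.
		ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
		_, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{Limit: 1})
		cancel()
		total++
		if err != nil {
			failures++
			log.Printf("list pods failed (%d/%d): %v", failures, total, err)
		}
		time.Sleep(interval)
	}
	log.Printf("soak finished: %d failures out of %d probes", failures, total)
}

In a real soak job this probe would of course cover far more than pod listing (controllers, workloads, upgrades in flight), but even a loop this simple would tell us something we cannot see today: whether the control plane stays healthy past the first day.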
On Mon, Jul 27, 2020 at 3:37 PM 'Tim Hockin' via kubernetes-sig-architecture <kubernetes-si...@googlegroups.com> wrote:

I also agree with Mike - our test health is horrific right now and, frankly, the incentives are wrong.

I don't know what to do about it. If an internal project were this bad, we would have called a "Code Yellow" a long time ago, and we would not have shipped another release until the flakes were all but gone. We can't really do that here, or rather we have not had the political will to do it. What would happen if we just shut the project down indefinitely? No more merges unless you are fixing test flakes or real bugs (not "lack of feature" bugs). Period. No exceptions. Would people help fix flakes, or would they just work in their own feature trees until the doors re-opened? I don't know how to crack this nut. It's not the first time we have discussed it, either.
> 2. Re testing efforts. Sure, that may be part of SIG Testing (as I also wrote in my initial email). But the problem is that SIG Testing in its current shape doesn't seem to have the capacity to handle them (or they don't have a high enough priority).
I would agree that some form of this is currently true to a certain extent.
I don't think forming more groups solves this in any way.
If anything, it adds more overhead to being involved, because now you must track N+1 meetings/groups.
Tomorrow we will be discussing the proposal Jordan linked in the SIG Testing meeting; I hope you all can join.
On Mon, Jul 27, 2020 at 1:30 PM Jordan Liggitt <lig...@google.com> wrote:

On Mon, Jul 27, 2020 at 3:37 PM 'Tim Hockin' via kubernetes-sig-architecture <kubernetes-si...@googlegroups.com> wrote:
> [Tim's message, quoted in full above]

(adding sig-testing/sig-release and sig leads for visibility)

I've had a lot of conversations on this topic over the last week. The last month has made it clear that changes in our testing approach are badly needed. We have to hold SIGs/component owners accountable for their test quality.

For that to be reasonable, those tests have to execute consistently, which requires changes in the way we run test jobs. Aaron, Ben, and I are working on a proposal for some concrete changes to prow / CI configuration and policy to accomplish that.

Once we have a solid test env foundation, we can better identify test problems (flakes due to test problems or component bugs, constant test job failures, excessively long presubmits, etc.) and require SIGs to resolve them.

I am in favor of prioritizing this stabilization and test health (which is likely to require effort from all SIGs) over opening master for 1.20 features. As history has shown, the alternative is reduced velocity, wasted resources, and riskier releases for everyone.

Jordan