Worrying state of Etcd community


Marek Siarkowicz

Mar 7, 2022, 1:11:51 PM
to stee...@kubernetes.io, Tim Hockin, Piotr Tabor

We (@serathius, @ptabor) are reaching out to the K8s steering committee to bring to their attention recent changes in, and the current state of, the etcd community.


In the last few months, primary maintainers Gyuho Lee (@gyuho, Amazon, announcement) and Sam Batschelet (@hexfusion, Red Hat) have stopped actively participating in the project. This leaves the project with only one active and two occasionally reviewing maintainers: Marek Siarkowicz (@serathius, Google) and Piotr Tabor (@ptabor, Google), both relatively new to the project (1 month and 1 year of tenure), and Sahdev P Zala (@spzala, IBM). Other maintainers are either dormant or have had very minimal activity over the last six months. The project is effectively unmaintained.


This lack of maintainers is impacting the community:

  • Cannot make important project decisions (like conflict resolution) based on governance, as it requires a supermajority of maintainers to agree. This has an especially bad impact on the design process, where major proposals don’t get enough feedback and scrutiny. Due to the lack of maintainer activity, we cannot introduce a proper approval process, resulting in important features getting reviews from only one maintainer. For example, #13168 was reviewed by only @ptabor (a relatively new maintainer) and @lilic (a reviewer, no longer active in the project).

  • Unable to reliably triage issues and release bug fixes. Fixes for critical bugs can take months to be released, causing users to lose trust and not adopt new releases. For example v3.5 was released with multiple critical bugs (#13196, #13192) and it took the community over a quarter to release fixes, making it unusable in production. As of v1.23.3 Kubernetes still recommends the mostly broken Etcd version v3.5.0 (#106589). 

  • Slowed or blocked contributions. In theory, all changes should be reviewed by 2 maintainers before submitting. A second viewpoint is especially important for Etcd, to ensure the security and correctness of changes, as they can be difficult to verify. We have been forced to break this rule and rely on lazy consensus, making the whole process error-prone. When a mistake slips through, we are only able to catch it via production releases (which are 2 years apart). There is no healthy feedback loop due to maintainers changing too frequently.


Etcd is a critical dependency of Kubernetes. If the situation in etcd doesn’t improve it will create a significant risk for the future of the K8s project. This may impede improvements in K8s reliability or other areas that require changes on the etcd side. It may also lead to a situation where a severe etcd bug, like data corruption, gets detected after it’s already present in tens or hundreds of thousands of Kubernetes clusters around the globe. This could irreparably break users' trust in Kubernetes.

We're hoping that by bringing this to attention we can start discussing and planning proper steps to mitigate the issue.

Thanks,
Marek

Marek Siarkowicz

Mar 7, 2022, 2:03:10 PM
to stee...@kubernetes.io, cncf-pri...@lists.cncf.io, Tim Hockin, Piotr Tabor
I was asked privately to loop in +cncf-pr...@lists.cncf.io 

Clayton Coleman

Mar 9, 2022, 10:56:39 AM
to Marek Siarkowicz, steering, cncf-pri...@lists.cncf.io, Tim Hockin, Piotr Tabor
Thank you for raising this Marek.  Speaking for Red Hat, we're also concerned, and I've heard from other vendors as well.

Have there been additional responses outside this thread, or a different forum where people are talking (seeing so few responses here)?

--
You received this message because you are subscribed to the Google Groups "steering" group.
To unsubscribe from this group and stop receiving emails from it, send an email to steering+u...@kubernetes.io.
To view this discussion on the web visit https://groups.google.com/a/kubernetes.io/d/msgid/steering/CAJs3Yt3JeE%3DbnhZ9V%3D15sRJ_mPzZ7MzfRwJcPy9_98Ape290NA%40mail.gmail.com.

Marek Siarkowicz

Mar 9, 2022, 11:37:56 AM
to Clayton Coleman, steering, cncf-pri...@lists.cncf.io, Tim Hockin, Piotr Tabor

On Wed, Mar 9, 2022, 16:56 Clayton Coleman <ccol...@redhat.com> wrote:
Thank you for raising this Marek.  Speaking for Red Hat, we're also concerned, and I've heard from other vendors as well.

Have there been additional responses outside this thread, or a different forum where people are talking (seeing so few responses here)?

Not that I know of. 

Stephen Augustus

Mar 9, 2022, 11:52:16 AM
to Clayton Coleman, Marek Siarkowicz, steering, cncf-pri...@lists.cncf.io, Tim Hockin, Piotr Tabor
All --

Just want to send a quick note to ACK this on behalf of Steering.
(We're the ones that requested TOC private be included here)

As soon as we have more, we'll comment on this thread.

Marek --

Would you mind opening a tracking issue on k/steering and linking that here?

-- Stephen

Davanum Srinivas

Mar 9, 2022, 11:53:08 AM
to Clayton Coleman, Marek Siarkowicz, steering, cncf-pri...@lists.cncf.io, Tim Hockin, Piotr Tabor
Clayton, Marek,

k8s steering hasn't yet met after this email :) yes, definitely worried with that hat on. 

TOC met yesterday and this was the first item on the agenda. 

One way or another we need to get this on the CNCF GB agenda for the next upcoming meeting. 

I believe this has happened before with etcd, so this is the second time trying to come up with options for them. 

thanks,
Dims


Emily Moss

Mar 9, 2022, 12:19:41 PM
to steering, dav...@gmail.com, Marek Siarkowicz, steering, cncf-pri...@lists.cncf.io, Tim Hockin, Piotr Tabor, ccoleman
Just adding quickly that I Slacked Marek and Piotr yesterday with details on the Red Hat etcd team updates.

Jordan Liggitt

Mar 10, 2022, 2:25:29 PM
to steering, Emily Moss, dav...@gmail.com, Marek Siarkowicz, steering, cncf-pri...@lists.cncf.io, Tim Hockin, Piotr Tabor, ccoleman
Talked about this today in sig-arch. Specifically to this point:
  • Unable to reliably triage issues and release bug fixes. Fixes for critical bugs can take months to be released, causing users to lose trust and not adopt new releases. For example v3.5 was released with multiple critical bugs (#13196, #13192) and it took the community over a quarter to release fixes, making it unusable in production. As of v1.23.3 Kubernetes still recommends the mostly broken Etcd version v3.5.0 (#106589). 

My immediate interest is stabilizing the current state and resolving regressions in already-shipped releases. The 3.5.1 update was blocked on test coverage of the regression (https://github.com/kubernetes/kubernetes/pull/106591#discussion_r780357459). It looks like those have been added, so that could be unblocked at this point. Are there other critical regressions in the etcd client being tracked? Does sig-apimachinery have visibility to this?

Jordan Liggitt

Mar 10, 2022, 2:35:51 PM
to steering, Jordan Liggitt, Emily Moss, dav...@gmail.com, Marek Siarkowicz, steering, cncf-pri...@lists.cncf.io, Tim Hockin, Piotr Tabor, ccoleman, K8s API Machinery SIG, kubernetes-sig-architecture
+cc arch (from a dependencies perspective) and api-machinery (from an etcd client use perspective) lists

Added to the next sig-api-machinery agenda to sync on closing specific gaps in testing of Kubernetes' use of the client (whether the test gaps get closed in kubernetes tests or in etcd tests)

Vipul Sabhaya

Mar 10, 2022, 8:26:42 PM
to steering, Jordan Liggitt, Emily Moss, dav...@gmail.com, Marek Siarkowicz, steering, cncf-pri...@lists.cncf.io, Tim Hockin, Piotr Tabor, ccoleman, K8s API Machinery SIG, kubernetes-sig-architecture
We are adding folks from AWS to contribute to etcd. Folks from the EKS etcd team, Chao Chen (@chaochn47) and Geeta Gharpure (@geetasg), have already made contributions and will continue to be regulars in etcd community meetings and contributions.

Davanum Srinivas

Mar 10, 2022, 8:55:42 PM
to Vipul Sabhaya, steering, Jordan Liggitt, Emily Moss, Marek Siarkowicz, cncf-pri...@lists.cncf.io, Tim Hockin, Piotr Tabor, ccoleman, K8s API Machinery SIG, kubernetes-sig-architecture
Thanks Vipul! Welcome aboard Chao and Geeta

Davanum Srinivas

Jul 18, 2022, 1:51:57 PM
to Marek Siarkowicz, etcd...@googlegroups.com, cncf...@lists.cncf.io, stee...@kubernetes.io, Tim Hockin, Piotr Tabor, canis...@linuxfoundation.org
Marek, Sahdev, 
It has been a few months since this email about lack of enough hands to help with etcd. Has the situation improved at all? 

Paris, RichiH,
Any feedback you are hearing with your Dev Rep hats on?

ChrisA,
Looks like this came up in both TOC and GB meetings, but we have feedback that the folks working hard on etcd are not really seeing changes in their day-to-day work. Anything we can do from the CNCF side to help? 

Do we all want to meet on a call? I can offer up an upcoming TOC call to talk about this? Please don't wait for the call to discuss this, feel free to send your thoughts/ideas/status here on this thread.

thanks,
Dims


Alex Chircop

Jul 18, 2022, 2:03:31 PM
to Davanum Srinivas, Marek Siarkowicz, etcd...@googlegroups.com, CNCF TOC, stee...@kubernetes.io, Tim Hockin, Piotr Tabor, Chris Aniszczyk
Hi,

I don't know if this is helpful for context, but one of the etcd maintainers, Benjamin Wang from VMware, presented to the TAG in June 2022. This is the deck with the update: https://docs.google.com/presentation/d/e/2PACX-1vRGyr3gSUJVm23Zv6bQNmZRaCeU-2nQD1U1vBEbZJZc2vhEbN4w8Rxf6e-01L1B8w/pub?start=false&loop=false&delayms=3000&slide=id.p1

Kind regards,
Alex

Marek Siarkowicz

Jul 18, 2022, 3:04:21 PM
to Davanum Srinivas, etcd-maintainers, cncf...@lists.cncf.io, stee...@kubernetes.io, Tim Hockin, Piotr Tabor, canis...@linuxfoundation.org
+etcd-maintainers 
Moving etcd-dev@ to bcc

Marek Siarkowicz

Jul 18, 2022, 4:49:22 PM
to Davanum Srinivas, etcd-maintainers, paris....@gmail.com, cncf...@lists.cncf.io, stee...@kubernetes.io, Tim Hockin, Piotr Tabor, canis...@linuxfoundation.org
On Mon, Jul 18, 2022 at 7:52 PM Davanum Srinivas <dav...@gmail.com> wrote:
Marek, Sahdev, 
It has been a few months since this email about lack of enough hands to help with etcd. Has the situation improved at all? 


Benjamin Wang from VMware became a new maintainer; however, our capacity didn't grow much, as Piotr Tabor is on a long holiday (until October). I'm not sure whether he still plans to be active afterwards, as he no longer works on etcd at Google.

There was some progress on the issues mentioned in the original email; however, the underlying issues were not addressed:
* We patched the etcd governance; it now officially supports lazy consensus after 2 weeks. However, we are still struggling to restore unwritten knowledge that was lost with previous maintainers, for example, interaction with the CNCF. We just discovered that the only remaining active etcd maintainers were unaware of, and didn't have access to, the CNCF helpdesk. https://github.com/cncf/foundation/pull/387 <- still waiting
* In the last couple of months we discovered and fixed a data inconsistency issue that was hiding within untriaged issues (postmortem). However, it took us over a year to fix, the number of new untriaged issues isn't going down (#14138), we keep getting new reports about critical issues (#14211, #14143, #14098), and we are still unable to qualify the latest release. In my opinion there is still a significant risk of undiscovered issues present in the v3.5 release.
* With Benjamin joining, we just managed to fill all the release manager positions (#13912). We have enough capacity to review and merge bug fixes. However, we still don't review non-bugfix changes. As this creates a bad experience for new contributors who are unaware of this policy, I'm planning to make it official that etcd doesn't accept new features until we are happy with reliability and qualification.

cc +paris....@gmail.com 

Paris

Jul 18, 2022, 5:45:57 PM
to Marek Siarkowicz, Davanum Srinivas, etcd-maintainers, CNCF TOC, steering, Tim Hockin, Piotr Tabor, Chris Aniszczyk
I think that it's more than worthwhile to have a senior+ community engineer onboard to the etcd maintainer crew and, frankly, I think all graduated projects should have this kind of support for someone part time (50%) or full time (see: /issues/43). Are there any orgs that could step up and provide this support now? Would the etcd maintainer folks welcome this?

The communities are too large and the maintainer burden is too high to do the necessary work to build and then maintain the community. Things that a senior+ community engineer could help the etcd crew with:
- video tutorials for reviewing code/advanced contributing/maintainer training 
- run your community meetings 
- ama sessions for new contributors and those interested in maintaining
- outreach for new maintainers/future maintainers 
- help with continuity and institutional knowledge gathering for onboarding and offboarding maintainers 
- detailed contributing and developer guides 