When we added resourceVersion=0 to reflectors, we didn't properly reason about its impact on nodes. Its current behavior can cause two nodes to run a pod with the same name at the same time, which violates the pod safety guarantees on the cluster. Because the read is served from the watch cache rather than as a quorum read, it can be arbitrarily delayed, which means a pod that was previously deleted can be resurrected.
Scenario:

- T1: pod-0 (uid 1) is scheduled to node-1
- T2: pod-0 is deleted as part of a rolling upgrade
- T3: node-1 sees that pod-0 is deleted and cleans it up, then deletes the pod in the api
- T4: a new pod-0 (uid 2) is created and assigned to node-2
- T5: node-2 sees that pod-0 has been scheduled to it and starts pod-0
- T6: node-1 crashes and restarts, then performs an initial list of pods scheduled to it against an API server in an HA setup (more than one API server) that is partitioned from the master (watch cache is arbitrarily delayed)
- T7: node-1 observes a list of pods from before T2
- T8: node-1 starts pod-0 (uid 1) and node-2 is already running pod-0 (uid 2)

This violates pod safety. Since we support HA masters, we cannot use resourceVersion=0 from reflectors on the node, and probably should not use it on the masters. We can only safely use resourceVersion=0 after we have retrieved at least one list, and only if we verify that resourceVersion is in the future.
@kubernetes/sig-apps-bugs @kubernetes/sig-api-machinery-bugs @kubernetes/sig-scalability-bugs This is a fairly serious issue that can lead to cluster identity guarantees being lost, which means clustered software cannot run safely if it has assumed the pod safety guarantee prevents two pods with the same name running on the cluster at the same time.
This is also something that could happen for controllers - during a controller lease failover the next leader could be working from a very old cache and undo recent work done.
No matter what, the first list of a component with a clean state that must preserve "happens-before" must perform a live quorum read against etcd to fill its cache. That can only be done by omitting resourceVersion=0.
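To make the difference concrete, here is a minimal client-side sketch (assuming client-go; the node-name field selector and function name are illustrative) contrasting a watch-cache list with a quorum list:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// listPodsForNode illustrates the two kinds of initial list. With
// ResourceVersion "0" the apiserver may answer from its (possibly stale)
// watch cache; with ResourceVersion left empty the request is served by a
// quorum read against etcd, which preserves happens-before.
func listPodsForNode(ctx context.Context, cfg *rest.Config, nodeName string) error {
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	// Potentially stale: may be served from an arbitrarily delayed watch cache.
	_, err = client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector:   "spec.nodeName=" + nodeName,
		ResourceVersion: "0",
	})
	if err != nil {
		return err
	}

	// Consistent: omitting resourceVersion forces a quorum read from etcd.
	_, err = client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
		// ResourceVersion intentionally unset.
	})
	return err
}
```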
Fixes:
Option 1 is a pretty significant performance regression, but is the most correct and safest option (just like when we enabled quorum reads everywhere). Option 2 is more complex, and there are a few people trying to remove the monotonicity guarantees from resource version, but it would retain most of the performance benefits of using this in the reflector. Option 3 is probably less complex than 2, but I'm not positive it actually works. Option 4 is hideous and won't fix other usage.
Probably needs to be backported to 1.6.
Note that we added this as a performance-at-scale win, so we are almost certainly going to regress to some degree (how much has been mitigated by other improvements in the last year is uncertain).
/sub
Disabling resourceVersion=0 for lists when more than one master is present is an option as well.
As Jordan noted, the optimization here is impactful because we avoid having to fetch many full pod lists from etcd when we only return a subset to each node. It's possible we could require all resourceVersion=0 calls to acquire a read lease on etcd, which bounds the delay but doesn't guarantee happens-before if the cache is delayed. If the watch returned a freshness guarantee we could synchronize on that as well.
We can do a synthetic write to the range and wait until it is observed by the cache, then service the rv=0 request. The logical equivalent is a serializable read on etcd for the range, but we need to know the highest observed RV on the range.
We can accomplish that by executing a range request with min create and mod revisions equal to the latest observed revision, and then performing the watch cache list at the highest RV on the range.
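A rough sketch of that range request, assuming etcd's Go client (modern import path; prefix and lastObserved are illustrative names), and using only the mod-revision bound for simplicity:

```go
package consistency

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// highestObservedRevision sketches the range request described above.
// Bounding the read by the last revision the cache has observed keeps the
// response small, and the response header revision tells us how far the
// watch cache must catch up before an rv=0 list can be served from it.
func highestObservedRevision(ctx context.Context, cli *clientv3.Client, prefix string, lastObserved int64) (int64, error) {
	resp, err := cli.Get(ctx, prefix,
		clientv3.WithPrefix(),
		clientv3.WithMinModRev(lastObserved+1), // only keys modified after what the cache has seen
		clientv3.WithKeysOnly(),
		clientv3.WithLimit(1),
	)
	if err != nil {
		return 0, err
	}
	// The header revision reflects the store revision at the time of the read.
	return resp.Header.Revision, nil
}
```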
So:
We can mitigate the performance impact by ensuring that an initial list from the watch cache preserves happens before. We can safely serve historical lists (rv=N) at any time. The watch cache already handles rv=N mostly correctly.
So the proposed change here is:
That resolves this issue.
Serializable read over the range, that is.
Also:
If this MUST be resolved for 1.10, please add status/approved-for-milestone to it before Code Freeze. Thanks!
Hi Clayton, could you go ahead and add "approved-for-milestone" label to this, as well as status (in progress)? That will help it stay in the milestone if this is a 1.10 blocker. Thanks!
Caveat on the label - still trying to identify how we fix this - but no matter what, because this is a critical data integrity issue, we'll end up backporting this several releases. Will try to get a better estimate of timeframe soon.
ooops, fixing labels.
This can happen on a singleton master using an HA etcd cluster:
So the proposed change here is:
1. Clients remove rv=0 and we stop honoring it
2. Verify the watch cache correctly waits for rv=N queries
3. We can serve a list or get from the watch cache iff we perform a serializable read against etcd and retrieve the highest create/mod revisions (see the sketch after this list)
4. We can make that query efficient by using min_create/mod_revision on the etcd list call
5. Clients currently sending rv=0 where perf is critical (node) should start using the last observed resource version, which allows the watch cache to serve correctly.
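A minimal sketch of what items 2-4 could look like on the serving side, with hypothetical freshRevision/cacheRevision hooks standing in for the real cacher internals:

```go
package cacherutil

import (
	"context"
	"fmt"
	"time"
)

// freshRevision returns the current etcd revision for the range (e.g. via the
// cheap min-mod-revision read sketched earlier); cacheRevision reports how far
// the watch cache has progressed. Both are assumptions for illustration only.
type freshRevision func(ctx context.Context) (int64, error)
type cacheRevision func() int64

// waitCacheFresh learns the current revision from etcd and waits until the
// cache has observed at least that revision, so a list served from the cache
// preserves happens-before with respect to everything committed before the
// request arrived.
func waitCacheFresh(ctx context.Context, fresh freshRevision, cacheRV cacheRevision) (int64, error) {
	target, err := fresh(ctx)
	if err != nil {
		return 0, err
	}
	for cacheRV() < target {
		select {
		case <-ctx.Done():
			return 0, fmt.Errorf("watch cache did not reach revision %d: %w", target, ctx.Err())
		case <-time.After(10 * time.Millisecond):
			// Polling keeps the sketch simple; a real implementation would
			// block on a condition variable instead.
		}
	}
	return target, nil
}
```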
Why do we need (3) and (4)?
If we verify that rv=N works fine (and it should), why can't we just start setting rv= instead.
If we just do that there are two options:
In the second case, reflector may retry without setting any RV for consistent read from etcd (that should be pretty rare so it should be fine from performance point of view).
@smarterclayton What am I missing?
Please stay in close contact with the release team on this.
[MILESTONENOTIFIER] Milestone Issue Needs Attention
@smarterclayton @kubernetes/sig-api-machinery-misc @kubernetes/sig-apps-misc @kubernetes/sig-scalability-misc
Action required: During code freeze, issues in the milestone should be in progress. If this issue is not being actively worked on, please remove it from the milestone. If it is being worked on, please add the status/in-progress label so it can be tracked with other in-flight issues.
Action Required: This issue has not been updated since Mar 1. Please provide an update.
Note: This issue is marked as priority/critical-urgent, and must be updated every 1 day during code freeze.
Example update:
ACK. In progress
ETA: DD/MM/YYYY
Risks: Complicated fix required
Issue Labels
- sig/api-machinery, sig/apps, sig/scalability: Issue will be escalated to these SIGs if needed.
- priority/critical-urgent: Never automatically move issue out of a release milestone; continually escalate to contributor and SIG through all available channels.
- kind/bug: Fixes a bug discovered during the current release.

If we verify that rv=N works fine (and it should), why can't we just start setting rv= instead.
The hard part is finding N. If we can find N, then yes, the solution is easy (that's the second part of the attempted PR before I realized that finding N is not possible today without executing a write).
In order to safely serve a list from the watch cache, you must ensure the watch cache is at an RV that is at the same level or higher than if you executed a linearizable read against etcd over the range at the current time and then started a watch from that RV (must include all modifications and deletes up to and including N).
Is anything else affected by it? Could the same ReadWriteOnce PV end up being mounted on multiple nodes simultaneously due to this problem?
Any controller or client using an informer that is sending rv=0 and has the stock kubernetes config AND is running against either etcd HA or apiserver HA is vulnerable to stale reads until we put a mitigation in place. Any client that doesn't execute a write or perform a live read prior to operating on some external state (like a remote api or state of a host) would potentially be able to go back in time. So it's possible, although I haven't looked through that particular code path.
The hard part is finding N. If we can find N, then yes, the solution is easy (that's the second part of the attempted PR before I realized that finding N is not possible today without executing a write).
In order to safely serve a list from the watch cache, you must ensure the watch cache is at an RV that is at the same level or higher than if you executed a linearizable read against etcd over the range at the current time and then started a watch from that RV (must include all modifications and deletes up to and including N).
Thanks - so my understanding was correct.
So I guess what I was trying to write was: Why not:
With that approach we would have a single list from etcd per component lifetime.
But I guess there are two main problems with this approach:
For the second case, we can e.g. incorporate the current RV in the apiserver into a structured "too old" error and use that one in the next List call (we can double-check that it's really greater than what we wanted in the previous call).
So the question is how often we restart components and how serious the first problem is. Maybe it's good enough fix for now and we can work on more proper solution a bit later?
So ... given the difficulty of addressing this issue (darn that CAP theorem!), is it realistic to have a fix in time for the 1.10 release?
So the question is how often we restart components and how serious the first problem is. Maybe it's good enough fix for now and we can work on more proper solution a bit later?
My concern is that it regresses a 5000-node cluster completely (because if we list from etcd then we have to filter in the apiserver once for each node). So we'd probably drop back down to 1k nodes as the max.
My concern is that it regresses a 5000-node cluster completely (because if we list from etcd then we have to filter in the apiserver once for each node). So we'd probably drop back down to 1k nodes as the max.
That should happen only if we restart all nodes at once, right? That shouldn't really happen...
Or am I missing something big here?
@smarterclayton should this issue also move to v1.11 now?
That should happen only if we restart all nodes at once, right? That shouldn't really happen...
Or am I missing something big here?
I wouldn't be surprised if it caused the api server to die during bringup. I'm not necessarily against regressing here, I just think we need to do it with the understanding that we're doing so and that cluster operators may need a mitigation or way to mitigate at those extreme scales (until we come up with a better solution).
Removing from 1.10, per discussion on sig-release.
We have a similar issue here: #61343.
It shows up especially when the cluster scale is large and there is heavy load from List requests issued by clients without a ResourceVersion specified in the request.
What we found is that the gRPC connection throughput is limited, which causes the request timeouts: during the slowness both etcd and the apiserver still have enough CPU and memory, and if we exec into the etcd pod and do the list directly, it is fast enough.
@smarterclayton's proposed approach is also not ideal for our case, because there is no way to ask etcd for the latest revision of a specific resource type like Pod. The only way to get it is to perform a write and then read it from the response.
Since there is already wait logic in https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/cacher.go#L513, if we had a way to get the latest revision for a specific resource type at https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/cacher.go#L479, we could serve the data from the cache without hitting etcd every time.
But having to issue a write every time is not a perfect way to get it.
I think maybe @xiang90 can comment further?
@smarterclayton - I just realized one thing - please correct me if I'm wrong here.
I think that in the majority of cases we may already be good. This is because:
So I think that there are two options:
TBH, I've never seen a request with continue token set anywhere in logs, so I'm afraid it may be the former.
So what is this line doing:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/cacher.go#L472 ?
on an initial list with ?resourceVersion=0&limit=500, all of those conditions will be false
on an initial list with ?resourceVersion=0&limit=500, all of those conditions will be false
I missed "hasLimit" has "&& resourceVersion != "0" inside - thanks @liggitt
@smarterclayton - yes, I agree that with etcd API change it would be much easier. And what you wrote makes perfect sense to me.
I was just wondering why we are not bypassing cacher - the fact that "hasLimit" has a condition on resourceVersion inside is a bit misleading to me...
cc @jpbetz
/cc @ccding
Sorry this is the first I've seen of this issue.
When did we allow reflectors to read out of the watch cache ( = not consistent read)? That seems wrong. Hopefully I am not the one that approved that :) I think I have always argued that RV=0 is always wrong to use.
Alternative solution: do a consistent list against etcd, but set the limit to 1 (or 0 if that works), record the returned global index. Wait for that index to show up in the watch cache. Then serve the request out of the watch cache. This only works if the watch cache's source is watching all objects (and not a subdir), otherwise there's no guarantee that the write which set the current global index will show up in a particular subdir.
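A rough sketch of that alternative with etcd's Go client (current import path; the prefix argument is illustrative), using a count-only read in place of limit=1:

```go
package consistency

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// currentGlobalRevision performs a count-only, linearizable range read. It
// returns essentially no data, but its response header carries the current
// global revision, which the watch cache must then reach before the request
// can be served from it.
func currentGlobalRevision(ctx context.Context, cli *clientv3.Client, prefix string) (int64, error) {
	resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix(), clientv3.WithCountOnly())
	if err != nil {
		return 0, err
	}
	return resp.Header.Revision, nil
}
```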
The watch periodic notify channel is close to what we need, but only fires every ten minutes. If we were able to "ping" a watch and have the server send an ordered event response to that watch (such that we can verify it's a response to our request), that would be sufficient (because the event has the etcd index in it). Then the watch cache would hold the response until it sees the result of its ping returned on the stream. Multiple such simultaneous requests can be batched, and a single "ping" could then clear all of them.
Alternative solution: do a consistent list against etcd, but set the limit to 1 (or 0 if that works), record the returned global index. Wait for that index to show up in the watch cache. Then serve the request out of the watch cache. This only works if the watch cache's source is watching all objects (and not a subdir), otherwise there's no guarantee that the write which set the current global index will show up in a particular subdir.
Basically, we want to know if the progress of the etcd watcher has passed a given logical time. Adding a progress request type into the etcd watcher API seems easy enough.
We'll have a look and propose a fix.
When did we allow reflectors to read out of the watch cache ( = not consistent read)?
@lavalamp 5 releases ago or something like that :)
Alternative solution: do a consistent list against etcd, but set the limit to 1 (or 0 if that works), record the returned global index. Wait for that index to show up in the watch cache. Then serve the request out of the watch cache.
That's interesting - I'm wondering what the cost of listing 1 (or 0) elements from etcd is.
@xiang90 - is that proportional to the amount of returned elements? Or does it still copy/read all elements that would be returned if no limit were set?
This only works if the watch cache's source is watching all objects (and not a subdir), otherwise there's no guarantee that the write which set the current global index will show up in a particular subdir.
@lavalamp - not sure if I understand that. Watch cache is per API type and is watching all objects of a given type. But I'm not sure if that's what you were asking about...
Quick status update. Current plan is:
then checking/waiting for the watch cache to catch up with that global index
If you pass an RV to the watch cache, it already waits for that...
... though there is one problem with it: the internal RV is updated based on watch events delivered from etcd, so if there wasn't any change to an object of the given type after the provided RV, the watch cache will never reach that RV.
That is something where, in my opinion, we need some etcd extension to handle it better.
Sort out what the performance impact of this change would be (how often RV=0 requests are made, looking for worse case situations)
Since we will still be listing from apiserver cache, this can only add some latency to those requests.
RV=0 is used in three main cases:
So I think we should mostly focus on the impact of first one.
Sort out what to do for kube-apiserver versions running versions of etcd that do not have a watch progress request op (any other options beside re-list?)
This is a long-standing bug (we have had it in k8s for a year+).
Would it be possible to e.g. patch previous etcd versions (e.g. 3.1 and 3.2) and say that you should update your etcd to get that?
That is something where, in my opinion, we need some etcd extension to handle it better.
Agreed. The progress op @xiang90 suggested we consider adding to the watch API should do the trick. If I understand the idea right, it would allow the client, at any time, to send a "what's my progress" request on the bi-directional watch stream and then wait for a "latest revision of this watch stream is X" response back from the server. I'll work on details.
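A sketch of how a client of the watch stream might use such a progress request, written against the RequestProgress/IsProgressNotify API that etcd's clientv3 eventually shipped; at the time of this comment the op did not exist yet, so treat this purely as an illustration:

```go
package progress

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// waitForProgress asks the server where the watch stream is, then reads
// responses until a progress notification (or event) at or beyond the wanted
// revision arrives, proving the stream has observed everything up to wantRev.
func waitForProgress(ctx context.Context, cli *clientv3.Client, prefix string, startRev, wantRev int64) error {
	ch := cli.Watch(ctx, prefix, clientv3.WithPrefix(), clientv3.WithRev(startRev))
	if err := cli.RequestProgress(ctx); err != nil {
		return err
	}
	for resp := range ch {
		if err := resp.Err(); err != nil {
			return err
		}
		if resp.Header.Revision >= wantRev && (resp.IsProgressNotify() || len(resp.Events) > 0) {
			return nil
		}
	}
	return fmt.Errorf("watch channel closed before reaching revision %d", wantRev)
}
```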
So I think we should mostly focus on the impact of first one.
Thanks for summarizing where RV=0 is used. That helps.
Would it be possible to e.g. patch previous etcd versions (e.g. 3.1 and 3.2) and say that you should update your etcd to get that?
API additions should be backward-compatible and safe. Once I have details I'll confirm whether we can backport.
Oh, and not knowing which server version of etcd the client is connecting to is a problem. I'll look into that.
Would it be possible to e.g. patch previous etcd versions (e.g. 3.1 and 3.2) and say that you should update your etcd to get that?
It is OK to introduce the new feature I proposed to etcd. But we should not introduce it to an old version of etcd. It is not really a patch but a new feature...
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/lifecycle frozen
/remove-lifecycle stale
We've written up a design doc covering both this issue and #67998. Please have a look!
Where's the design doc?
@smarterclayton it's here - #67998 (comment)
Where's the design doc?
Sorry, forgot the link - Apiserver Resource Version Semantics is the design doc.
Any news on this @jpbetz?
@yastij See the above linked design doc. Our main discussion/activity has been there. We'll be sending out PRs for the initial set of changes shortly.
/priority important-soon
/milestone v1.16
(to match the priority and milestone in the PR)
@jpbetz @dims Hello! I'm the bug triage shadow for the 1.16 release cycle, and since this issue is tagged for 1.16 but has not been updated for a long time, I'd like to check its status. The code freeze is starting on August 29th (about 1.5 weeks from now), which means that there should be a PR ready (and merged) by then.
Do we still target this issue to be fixed for 1.16?
/remove-priority critical-urgent
/milestone clear
@jpbetz please update the status
/priority critical-urgent
This is still the most severe possible known vulnerability in Kubernetes safety guarantees.
Joe is OOO; I was under the impression that this was mostly or completely done, but I could be wrong.
I don't think it is - we didn't reach a conclusion:
https://docs.google.com/document/d/1x3JXKwPTpum8S6bC-YXvoGS583iixuQSskEGukTAzcI/edit
Generally "critical-urgent" means "needs to be, and will be, fixed in 3-10 days", and, importantly, "Kubernetes cannot be released unless we fix this first".
Given that we have had 5 releases since this issue was first opened, that's manifestly not true.
Apologies. This got stalled out. We had split this into two sub-problems:
I can circle back on the resourceVersion semantics this week. If we can get closure on that then we can work on updating reflectors to be consistent across restarts and then talk more with sig-node about options for fixing the node problem.
@jpbetz Bug triage for 1.17 here with a gentle reminder that code freeze for this release is on November 18. Is this issue still intended for 1.17?
@jpbetz A kind reminder that code freeze for this release is on November 14. Is this issue still intended for 1.17? If there is no traction by week's end, we will have to move this to milestone 1.18.
@markyjackson-taulia Trying to get #83520 in this week. It solves the "relist" problem that contributes to this issue but not the "restart" issue, so this issue will definitely not be fully resolved in 1.17.
@jpbetz ok, I am going to move to 1.18 milestone
/milestone v1.18
kubernetes/enhancements#1404 proposes how we might fix the last remaining underlying problem we need to fix to resolve this issue.
/milestone v1.19
Hello from the bug triage team! This issue has not been updated for a long time, so I'd like to check what the status is. The code freeze is starting June 25th (about 4 weeks from now) and while there is still plenty of time, we want to ensure that each PR has a chance to be merged on time.
As the issue is tagged for 1.19, is it still planned for this release?
This issue hasn't been updated in 4 months. Are you still expecting to complete it for 1.19? @kubernetes/sig-api-machinery-bugs @kubernetes/sig-apps-bugs @kubernetes/sig-scalability-bugs
We slightly improved that in 1.18, but nothing more is planned for 1.19.
👋🏽 Hello! I'm from the Bug Triage Team again! The code freeze for 1.20 release is starting November 12th which is about 3 weeks from now. Is this issue still planned for 1.20 release?
👋 Hi Bug Triage here, seeing that there hasn't been recent activity on this issue, I am going to remove the milestone target.
/milestone clear
Hi,
Maybe it would be a good idea to have a set of options to completely disable the watch cache?
As I understand it, today we can disable the watch cache in the apiserver by using --watch-cache=false, but the local cache cannot be disabled in controllers.
These options would be helpful for testing purposes, but would also more broadly open the way for etcd shims such as Kine or FoundationDB.
As I understand it, today we can disable the watch cache in the apiserver by using --watch-cache=false, but the local cache cannot be disabled in controllers.
What do you mean by local cache? The cache.Store interface? What would be the benefit or use case of not using a local cache?
I am not very familiar with exactly how many caching layers there are between etcd and the final client, but the idea would be to deactivate all of them except the one in the final client (we do not convert Watch into Get semantics).
As I understand it, the final clients that currently observe stale reads are controllers, and the source of truth is etcd.
The apiserver is certainly one layer of cache, but I have the feeling that due to shared caches / local caches there can also be several layers of cache after the apiserver.
@julienlau you might want to read the details shared in this project, explaining the time travel issues in detail and how to detect/prevent them: https://github.com/sieve-project/sieve
Just setting a status: this is still fundamentally broken in Kube, and the use of the watch cache is unsafe with HA apiservers in a way that allows two kubelets to be running a pod with the same name at the same time, which violates assumptions that statefulsets and PVCs depend on. Any cache that is HA must establish that it has the freshest data (data from time T > time the request was made) to preserve the guarantees that Kube controllers implicitly rely on.
I have not seen a solution proposed other than requiring the watch cache to verify it is up to date before serving a LIST @ rv=0, but we do need to address this soon.
Welcome to the CAP theorem, I guess?
Maybe defining specific test cases to work it out would help?
It seems to me that several other issues linked to this one use some sort of workaround to get things working, but could use a proper way to handle this at the controller level.
Sieve seems to be designed to debug controllers, but the same failure scenarios could be tested at the apiserver level to exercise the apiserver watch cache and load-balancing behavior?
Maybe a read-through/write-through debug interface to the apiserver would also help?
I have wanted to make an update on this for quite some time and am finally getting to it (motivated by a need to link it).
With #83520 (so since 1.18), this problem has much lower exposure.
That particular PR ensures that when a reflector has to relist (e.g. because of out-of-history in watch or whatever other reason), the relist will NOT go back in time (we either list from the cache by providing the currently known RV, which ensures that our LIST will be at least that fresh, or list from etcd using quorum reads).
This means, that within a single incarnation of a component, it will never go back in time.
The remaining (and unsolved) case is what happens on component start. It still lists from the cache then (without any preconditions), which means that it can go back in time compared to where it was before it was restarted.
But within a single incarnation of a component - we're now safe.
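For illustration, a hedged sketch of the relist behaviour described above (not the actual reflector code; ResourceVersionMatch is the later, explicit way to spell the "not older than" semantics):

```go
package relist

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// relistOptions sketches the list options used by a reflector within a single
// incarnation: pass the last resource version it has already observed, so the
// apiserver either serves a list at least that fresh from its cache or falls
// back to etcd. Names are illustrative.
func relistOptions(lastSyncedRV string) metav1.ListOptions {
	if lastSyncedRV == "" {
		// First list of this incarnation: no freshness floor to enforce yet.
		// This is the still-unsolved restart case discussed above.
		return metav1.ListOptions{ResourceVersion: "0"}
	}
	return metav1.ListOptions{
		ResourceVersion:      lastSyncedRV,
		ResourceVersionMatch: metav1.ResourceVersionMatchNotOlderThan,
	}
}
```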
Update appreciated.
OK for RV usage in general, but regarding the corner case you mentioned, "component start":
in your view, do Kubernetes updates also check these criteria?
It still lists from the cache then (without any preconditions), which means that it can go back in time compared to where it was before it was restarted.
Naive solution, which likely is not applicable, but I'll ask nevertheless: is listing from etcd without a quorum read an option here?
Naive solution, which likely is not applicable, but I'll ask nevertheless, is listing from etcd with quorum read an option here?
This breaks scalability, especially badly for some resources, pods in particular, because you can't list "pods from my node" from etcd - you can only list all pods, deserialize them in kube-apiserver, and filter, for each node independently. If you have a 5k-node cluster with 150k pods that had an incident and all of the kubelets are now starting at the same time...
With https://github.com/sieve-project/sieve we found several bugs in real-world controllers, some of those caused by time travel anomalies.
While not a perfect mitigation (because it always depends on the problem domain), using optimistic concurrency control patterns not only for updates but also for (conditional) deletes can mitigate some of the time travel anomalies by fencing off operations from stale views. Again, not perfect, but a good practice that does not sacrifice scalability.
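As an illustration of that pattern with client-go (names and the pod example are hypothetical):

```go
package fencing

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// deleteIfUnchanged sketches a conditional delete: the request carries UID and
// resourceVersion preconditions taken from the controller's last observed
// copy, so if that view is stale (the object was replaced or modified since),
// the apiserver rejects the delete with a conflict instead of removing the
// newer object.
func deleteIfUnchanged(ctx context.Context, client kubernetes.Interface, namespace, name string, observedUID types.UID, observedRV string) error {
	return client.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{
		Preconditions: &metav1.Preconditions{
			UID:             &observedUID,
			ResourceVersion: &observedRV,
		},
	})
}
```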
On smaller clusters, disabling the API server cache might be a safe alternative, too. This eliminates the aforementioned (re)start time travel anomaly in controllers.
Thoughts @wojtek-t ?
Sure - those are all good, but are only partial mitigations, not a full solution. And what we're trying to do is to provide full solution.
FWIW - kubernetes/enhancements#3142 is one of the things we're looking at.