When we added resourceVersion=0 to reflectors, we didn't properly reason about its impact on nodes. Its current behavior can cause two nodes to run a pod with the same name at the same time, which violates the pod safety guarantees on the cluster. Because the read is served from the watch cache rather than as a quorum read, it can be arbitrarily delayed, which means a pod that was previously deleted can be resurrected.
Scenario:

- T1: pod-0 (uid 1) is scheduled to node-1
- T2: pod-0 is deleted as part of a rolling upgrade
- T3: node-1 sees that pod-0 is deleted and cleans it up, then deletes the pod in the api
- T4: a new pod-0 (uid 2) is created and assigned to node-2
- T5: node-2 sees that pod-0 has been scheduled to it and starts pod-0
- T6: node-1 crashes and restarts, then performs an initial list of pods scheduled to it against an API server in an HA setup (more than one API server) that is partitioned from the master (watch cache is arbitrarily delayed)
- T7: node-1 observes a list of pods from before T2
- T8: node-1 starts pod-0 (uid 1) and node-2 is already running pod-0 (uid 2)

This violates pod safety. Since we support HA masters, we cannot use resourceVersion=0 from reflectors on the node, and probably should not use it on the masters. We can only safely use resourceVersion=0 after we have retrieved at least one list, and only if we verify that resourceVersion is in the future.
@kubernetes/sig-apps-bugs @kubernetes/sig-api-machinery-bugs @kubernetes/sig-scalability-bugs This is a fairly serious issue that can lead to cluster identity guarantees being lost, which means clustered software cannot run safely if it has assumed the pod safety guarantee prevents two pods with the same name running on the cluster at the same time.
This is also something that could happen for controllers - during a controller lease failover the next leader could be working from a very old cache and undo recent work done.
No matter what, the first list of a component with a clean state that must preserve "happens-before" must perform a live quorum read against etcd to fill its cache. That can only be done by omitting resourceVersion=0.
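To make the difference concrete, here is a minimal client-side sketch (assuming client-go; the node-name field selector and function name are illustrative) contrasting a watch-cache list with a quorum list:

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// listPodsForNode illustrates the two kinds of initial list. With
// ResourceVersion "0" the apiserver may answer from its (possibly stale)
// watch cache; with ResourceVersion left empty the request is served by a
// quorum read against etcd, which preserves happens-before.
func listPodsForNode(ctx context.Context, cfg *rest.Config, nodeName string) error {
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	// Potentially stale: may be served from an arbitrarily delayed watch cache.
	_, err = client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector:   "spec.nodeName=" + nodeName,
		ResourceVersion: "0",
	})
	if err != nil {
		return err
	}

	// Consistent: omitting resourceVersion forces a quorum read from etcd.
	_, err = client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
		// ResourceVersion intentionally unset.
	})
	return err
}
```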
Fixes:
Option 1 is a pretty significant performance regression, but is the most correct and safest option (just like when we enabled quorum reads everywhere). Option 2 is more complex, and there are a few people trying to remove the monotonicity guarantees from resource version, but it would retain most of the performance benefits of using this in the reflector. Option 3 is probably less complex than 2, but I'm not positive it actually works. Option 4 is hideous and won't fix other usage.
Probably needs to be backported to 1.6.
Note that we added this as a performance-at-scale win, so we are almost certainly going to regress to some degree (how much has been mitigated by other improvements in the last year is uncertain).
/sub
Disabling resourceVersion=0 for lists when more than one master is present is an option as well.
As Jordan noted, the optimization here is impactful because we avoid having to fetch many full pod lists from etcd when we only return a subset to each node. It's possible we could require all resourceVersion=0 calls to acquire a read lease on etcd, which bounds the delay but doesn't guarantee happens-before if the cache is delayed. If the watch returned a freshness guarantee we could synchronize on that as well.
We can do a synthetic write to the range and wait until it is observed by the cache, then service the rv=0 request. The logical equivalent is a serializable read on etcd for the range, but we need to know the highest observed RV on the range.
We can accomplish that by executing a range request with min create and mod revisions equal to the latest observed revision, and then performing the watch cache list at the highest RV on the range.
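A rough sketch of that range request, assuming etcd's Go client (modern import path; prefix and lastObserved are illustrative names), and using only the mod-revision bound for simplicity:

```go
package consistency

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// highestObservedRevision sketches the range request described above.
// Bounding the read by the last revision the cache has observed keeps the
// response small, and the response header revision tells us how far the
// watch cache must catch up before an rv=0 list can be served from it.
func highestObservedRevision(ctx context.Context, cli *clientv3.Client, prefix string, lastObserved int64) (int64, error) {
	resp, err := cli.Get(ctx, prefix,
		clientv3.WithPrefix(),
		clientv3.WithMinModRev(lastObserved+1), // only keys modified after what the cache has seen
		clientv3.WithKeysOnly(),
		clientv3.WithLimit(1),
	)
	if err != nil {
		return 0, err
	}
	// The header revision reflects the store revision at the time of the read.
	return resp.Header.Revision, nil
}
```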
So:
We can mitigate the performance impact by ensuring that an initial list from the watch cache preserves happens before. We can safely serve historical lists (rv=N) at any time. The watch cache already handles rv=N mostly correctly.
So the proposed change here is:
That resolves this issue.
Serializable read over the range, that is.
Also:
If this MUST be resolved for 1.10, please add status/approved-for-milestone to it before Code Freeze. Thanks!
Hi Clayton, could you go ahead and add "approved-for-milestone" label to this, as well as status (in progress)? That will help it stay in the milestone if this is a 1.10 blocker. Thanks!
Caveat on the label - still trying to identify how we fix this - but no matter what, because this is a critical data integrity issue, we'll end up backporting this several releases. Will try to get a better estimate of timeframe soon.
ooops, fixing labels.
This can happen on a singleton master using an HA etcd cluster:
So the proposed change here is:
1. Clients remove rv=0 and we stop honoring it
2. Verify the watch cache correctly waits for rv=N queries
3. We can serve a list or get from the watch cache iff we perform a serializable read against etcd and retrieve the highest create/mod revisions (see the sketch after this list)
4. We can make that query efficient by using min_create/mod_revision on the etcd list call
5. Clients currently sending rv=0 where perf is critical (node) should start using the last observed resource version, which allows the watch cache to serve correctly.
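A minimal sketch of what items 2-4 could look like on the serving side, with hypothetical freshRevision/cacheRevision hooks standing in for the real cacher internals:

```go
package cacherutil

import (
	"context"
	"fmt"
	"time"
)

// freshRevision returns the current etcd revision for the range (e.g. via the
// cheap min-mod-revision read sketched earlier); cacheRevision reports how far
// the watch cache has progressed. Both are assumptions for illustration only.
type freshRevision func(ctx context.Context) (int64, error)
type cacheRevision func() int64

// waitCacheFresh learns the current revision from etcd and waits until the
// cache has observed at least that revision, so a list served from the cache
// preserves happens-before with respect to everything committed before the
// request arrived.
func waitCacheFresh(ctx context.Context, fresh freshRevision, cacheRV cacheRevision) (int64, error) {
	target, err := fresh(ctx)
	if err != nil {
		return 0, err
	}
	for cacheRV() < target {
		select {
		case <-ctx.Done():
			return 0, fmt.Errorf("watch cache did not reach revision %d: %w", target, ctx.Err())
		case <-time.After(10 * time.Millisecond):
			// Polling keeps the sketch simple; a real implementation would
			// block on a condition variable instead.
		}
	}
	return target, nil
}
```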
Why do we need (3) and (4)?
If we verify that rv=N works fine (and it should), why can't we just start setting rv= instead.
If we just do that there are two options:
In the second case, reflector may retry without setting any RV for consistent read from etcd (that should be pretty rare so it should be fine from performance point of view).
@smarterclayton What am I missing?
Please stay in close contact with the release team on this.
[MILESTONENOTIFIER] Milestone Issue Needs Attention
@smarterclayton @kubernetes/sig-api-machinery-misc @kubernetes/sig-apps-misc @kubernetes/sig-scalability-misc
Action required: During code freeze, issues in the milestone should be in progress. If this issue is not being actively worked on, please remove it from the milestone. If it is being worked on, please add the status/in-progress label so it can be tracked with other in-flight issues.
Action Required: This issue has not been updated since Mar 1. Please provide an update.
Note: This issue is marked as priority/critical-urgent, and must be updated every 1 day during code freeze.
Example update:
ACK. In progress
ETA: DD/MM/YYYY
Risks: Complicated fix required
Issue Labels
- sig/api-machinery, sig/apps, sig/scalability: Issue will be escalated to these SIGs if needed.
- priority/critical-urgent: Never automatically move issue out of a release milestone; continually escalate to contributor and SIG through all available channels.
- kind/bug: Fixes a bug discovered during the current release.

If we verify that rv=N works fine (and it should), why can't we just start setting rv= instead.
The hard part is finding N. If we can find N, then yes, the solution is easy (that's the second part of the attempted PR before I realized that finding N is not possible today without executing a write).
In order to safely serve a list from the watch cache, you must ensure the watch cache is at an RV that is at the same level or higher than if you executed a linearizable read against etcd over the range at the current time and then started a watch from that RV (must include all modifications and deletes up to and including N).
Is anything else affected by it? Could the same ReadWriteOnce PV end up being mounted on multiple nodes simultaneously due to this problem?
Any controller or client using an informer that is sending rv=0 and has the stock kubernetes config AND is running against either etcd HA or apiserver HA is vulnerable to stale reads until we put a mitigation in place. Any client that doesn't execute a write or perform a live read prior to operating on some external state (like a remote api or state of a host) would potentially be able to go back in time. So it's possible, although I haven't looked through that particular code path.
The hard part is finding N. If we can find N, then yes, the solution is easy (that's the second part of the attempted PR before I realized that finding N is not possible today without executing a write).
In order to safely serve a list from the watch cache, you must ensure the watch cache is at an RV that is at the same level or higher than if you executed a linearizable read against etcd over the range at the current time and then started a watch from that RV (must include all modifications and deletes up to and including N).
Thanks - so my understanding was correct.
So I guess what I was trying to write was: Why not:
With that approach we would have a single list from etcd per component lifetime.
But I guess there are two main problems with this approach:
For the second case, we can e.g. incorporate the current RV in the apiserver into a structured "too old" error and use that one in the next List call (we can double-check that it's really greater than what we wanted in the previous call).
So the question is how often we restart components and how serious the first problem is. Maybe it's good enough fix for now and we can work on more proper solution a bit later?
So ... given the difficulty of addressing this issue (darn that CAP theorem!), is it realistic to have a fix in time for the 1.10 release?
So the question is how often we restart components and how serious the first problem is. Maybe it's good enough fix for now and we can work on more proper solution a bit later?
My concern is that it regresses a 5000-node cluster completely (because if we list from etcd then we have to filter in the apiserver once for each node). So we'd probably drop back down to 1k nodes as the max.
My concern is that it regresses a 5000-node cluster completely (because if we list from etcd then we have to filter in the apiserver once for each node). So we'd probably drop back down to 1k nodes as the max.
That should happen only if we restart all nodes at once, right? That shouldn't really happen...
Or am I missing something big here?
@smarterclayton should this issue also move to v1.11 now?
That should happen only if we restart all nodes at once, right? That shouldn't really happen...
Or am I missing something big here?
I wouldn't be surprised if it caused the api server to die during bringup. I'm not necessarily against regressing here, I just think we need to do it with the understanding that we're doing so and that cluster operators may need a mitigation or way to mitigate at those extreme scales (until we come up with a better solution).
Removing from 1.10, per discussion on sig-release.
We have a similar issue here: #61343.
It shows up especially when the cluster scale is large and there is heavy load from List requests issued by clients without a ResourceVersion specified in the request.
What we found is that the gRPC connection throughput is limited, which causes the request timeouts: during the slowness both etcd and the apiserver still have enough CPU and memory, and if we exec into the etcd pod and do the list directly, it is fast enough.
@smarterclayton's proposed approach is also not ideal for our case, because there is no way to ask etcd for the latest revision of a specific resource type like Pod. The only way to get it is to perform a write and then read it from the response.
Since there is already wait logic in https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/cacher.go#L513, if we had a way to get the latest revision for a specific resource type at https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/cacher.go#L479, we could serve the data from the cache without hitting etcd every time.
But having to issue a write every time is not a perfect way to get it.
I think maybe @xiang90 can comment further?
@smarterclayton - I just realized one thing - please correct me if I'm wrong here.
I think that in the majority of cases we may already be good. This is because:
So I think that there are two options:
TBH, I've never seen a request with continue token set anywhere in logs, so I'm afraid it may be the former.
So what is this line doing:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/pkg/storage/cacher.go#L472 ?
on an initial list with ?resourceVersion=0&limit=500, all of those conditions will be false
on an initial list with ?resourceVersion=0&limit=500, all of those conditions will be false
I missed "hasLimit" has "&& resourceVersion != "0" inside - thanks @liggitt
@smarterclayton - yes, I agree that with etcd API change it would be much easier. And what you wrote makes perfect sense to me.
I was just wondering why we are not bypassing cacher - the fact that "hasLimit" has a condition on resourceVersion inside is a bit misleading to me...
cc @jpbetz
/cc @ccding
Sorry this is the first I've seen of this issue.
When did we allow reflectors to read out of the watch cache ( = not consistent read)? That seems wrong. Hopefully I am not the one that approved that :) I think I have always argued that RV=0 is always wrong to use.
Alternative solution: do a consistent list against etcd, but set the limit to 1 (or 0 if that works), record the returned global index. Wait for that index to show up in the watch cache. Then serve the request out of the watch cache. This only works if the watch cache's source is watching all objects (and not a subdir), otherwise there's no guarantee that the write which set the current global index will show up in a particular subdir.
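A rough sketch of that alternative with etcd's Go client (current import path; the prefix argument is illustrative), using a count-only read in place of limit=1:

```go
package consistency

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// currentGlobalRevision performs a count-only, linearizable range read. It
// returns essentially no data, but its response header carries the current
// global revision, which the watch cache must then reach before the request
// can be served from it.
func currentGlobalRevision(ctx context.Context, cli *clientv3.Client, prefix string) (int64, error) {
	resp, err := cli.Get(ctx, prefix, clientv3.WithPrefix(), clientv3.WithCountOnly())
	if err != nil {
		return 0, err
	}
	return resp.Header.Revision, nil
}
```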
The watch periodic notify channel is close to what we need, but only fires every ten minutes. If we were able to "ping" a watch and have the server send an ordered event response to that watch (such that we can verify it's a response to our request), that would be sufficient (because the event has the etcd index in it). Then the watch cache would hold the response until it sees the result of its ping returned on the stream. Multiple such simultaneous requests can be batched, and a single "ping" could then clear all of them.
Alternative solution: do a consistent list against etcd, but set the limit to 1 (or 0 if that works), record the returned global index. Wait for that index to show up in the watch cache. Then serve the request out of the watch cache. This only works if the watch cache's source is watching all objects (and not a subdir), otherwise there's no guarantee that the write which set the current global index will show up in a particular subdir.
Basically, we want to know if the progress of the etcd watcher has passed a given logical time. Adding a progress request type into the etcd watcher API seems easy enough.
We'll have a look and propose a fix.
When did we allow reflectors to read out of the watch cache ( = not consistent read)?
@lavalamp 5 releases ago or something like that :)
Alternative solution: do a consistent list against etcd, but set the limit to 1 (or 0 if that works), record the returned global index. Wait for that index to show up in the watch cache. Then serve the request out of the watch cache.
That's interesting - I'm wondering what the cost of listing 1 (or 0) elements from etcd is.
@xiang90 - is that proportional to the amount of returned elements? Or does it still copy/read all elements that would be returned if no limit were set?
This only works if the watch cache's source is watching all objects (and not a subdir), otherwise there's no guarantee that the write which set the current global index will show up in a particular subdir.
@lavalamp - not sure if I understand that. Watch cache is per API type and is watching all objects of a given type. But I'm not sure if that's what you were asking about...
Quick status update. Current plan is:
then checking/waiting for the watch cache to catch up with that global index
If you pass an RV to the watch cache, it already waits for that...
... though there is one problem with it: the internal RV is updated based on watch events delivered from etcd, so if there wasn't any change to an object of the given type after the provided RV, the watch cache will never reach that RV.
That is something where, in my opinion, we need some etcd extension to handle it better.
Sort out what the performance impact of this change would be (how often RV=0 requests are made, looking for worse case situations)
Since we will still be listing from apiserver cache, this can only add some latency to those requests.
RV=0 is used in three main cases:
So I think we should mostly focus on the impact of first one.
Sort out what to do for kube-apiserver versions running versions of etcd that do not have a watch progress request op (any other options beside re-list?)
This is a long-standing bug (we have had it in k8s for a year+).
Would it be possible to e.g. patch previous etcd versions (e.g. 3.1 and 3.2) and say that you should update your etcd to get that?
That is something where, in my opinion, we need some etcd extension to handle it better.
Agreed. The progress op @xiang90 suggested we consider adding to the watch API should do the trick. If I understand the idea right, it would allow the client, at any time, to send a "what's my progress" request on the bi-directional watch stream and then wait for a "latest revision of this watch stream is X" response back from the server. I'll work on details.
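A sketch of how a client of the watch stream might use such a progress request, written against the RequestProgress/IsProgressNotify API that etcd's clientv3 eventually shipped; at the time of this comment the op did not exist yet, so treat this purely as an illustration:

```go
package progress

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// waitForProgress asks the server where the watch stream is, then reads
// responses until a progress notification (or event) at or beyond the wanted
// revision arrives, proving the stream has observed everything up to wantRev.
func waitForProgress(ctx context.Context, cli *clientv3.Client, prefix string, startRev, wantRev int64) error {
	ch := cli.Watch(ctx, prefix, clientv3.WithPrefix(), clientv3.WithRev(startRev))
	if err := cli.RequestProgress(ctx); err != nil {
		return err
	}
	for resp := range ch {
		if err := resp.Err(); err != nil {
			return err
		}
		if resp.Header.Revision >= wantRev && (resp.IsProgressNotify() || len(resp.Events) > 0) {
			return nil
		}
	}
	return fmt.Errorf("watch channel closed before reaching revision %d", wantRev)
}
```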
So I think we should mostly focus on the impact of first one.
Thanks for summarizing where RV=0 is used. That helps.
Would it be possible to e.g. patch previous etcd versions (e.g. 3.1 and 3.2) and say that you should update your etcd to get that?
API additions should be backward-compatible and safe. Once I have details I'll confirm whether we can backport.
Oh, and not knowing which server version of etcd the client is connecting to is a problem. I'll look into that.
Would it be possible to e.g. patch previous etcd versions (e.g. 3.1 and 3.2) and say that you should update your etcd to get that?
It is OK to introduce the new feature I proposed to etcd. But we should not introduce it to an old version of etcd. It is not really a patch but a new feature...
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/lifecycle frozen
/remove-lifecycle stale
We've written up a design doc covering both this issue and #67998. Please have a look!
Where's the design doc?
@smarterclayton it's here - #67998 (comment)
Where's the design doc?
Sorry, forgot the link - Apiserver Resource Version Semantics is the design doc.
Any news on this @jpbetz?
@yastij See the above linked design doc. Our main discussion/activity has been there. We'll be sending out PRs for the initial set of changes shortly.
/priority important-soon
/milestone v1.16
(to match the priority and milestone in the PR)
@jpbetz @dims Hello! I'm the bug triage shadow for the 1.16 release cycle, and since this issue is tagged for 1.16 but has not been updated for a long time, I'd like to check its status. The code freeze is starting on August 29th (about 1.5 weeks from now), which means that there should be a PR ready (and merged) by then.
Do we still target this issue to be fixed for 1.16?
/remove-priority critical-urgent
/milestone clear
@jpbetz please update the status
/priority critical-urgent
This is still the most severe possible known vulnerability in Kubernetes safety guarantees.
Joe is OOO; I was under the impression that this was mostly or completely done, but I could be wrong.
I don't think it is - we didn't reach a conclusion:
https://docs.google.com/document/d/1x3JXKwPTpum8S6bC-YXvoGS583iixuQSskEGukTAzcI/edit
Generally "critical-urgent" means "needs to be, and will be, fixed in 3-10 days", and, importantly, "Kubernetes cannot be released unless we fix this first".
Given that we have had 5 releases since this issue was first opened, that's manifestly not true.
Apologies. This got stalled out. We had split this into two sub-problems:
I can circle back on the resourceVersion semantics this week. If we can get closure on that then we can work on updating reflectors to be consistent across restarts and then talk more with sig-node about options for fixing the node problem.
@jpbetz Bug triage for 1.17 here with a gentle reminder that code freeze for this release is on November 18. Is this issue still intended for 1.17?
@jpbetz A kind reminder that code freeze for this release is on November 14. Is this issue still intended for 1.17? If there is no traction by week's end, we will have to move this to milestone 1.18.
@markyjackson-taulia Trying to get #83520 in this week. It solves the "relist" problem that contributes to this issue but not the "restart" issue, so this issue will definitely not be fully resolved in 1.17.
@jpbetz ok, I am going to move to 1.18 milestone
/milestone v1.18
kubernetes/enhancements#1404 proposes how we might fix the last remaining underlying problem we need to fix to resolve this issue.
/milestone v1.19
Hello from the bug triage team! This issue has not been updated for a long time, so I'd like to check what the status is. The code freeze is starting June 25th (about 4 weeks from now) and while there is still plenty of time, we want to ensure that each PR has a chance to be merged on time.
As the issue is tagged for 1.19, is it still planned for this release?
This issue hasn't been updated in 4 months. Are you still expecting to complete it for 1.19? @kubernetes/sig-api-machinery-bugs @kubernetes/sig-apps-bugs @kubernetes/sig-scalability-bugs
We slightly improved that in 1.18, but nothing more is planned for 1.19.
👋🏽 Hello! I'm from the Bug Triage Team again! The code freeze for 1.20 release is starting November 12th which is about 3 weeks from now. Is this issue still planned for 1.20 release?
👋 Hi Bug Triage here, seeing that there hasn't been recent activity on this issue, I am going to remove the milestone target.
/milestone clear
Hi,
Maybe it would be a good idea to have a set of options to completely disable the watch cache?
As I understand it, today we can disable the watch cache in the apiserver by using --watch-cache=false, but the local cache cannot be disabled in controllers.
These options would be helpful for testing purposes, but would also more broadly open the way for etcd shims such as Kine or FoundationDB.
As I understand it, today we can disable the watch cache in the apiserver by using --watch-cache=false, but the local cache cannot be disabled in controllers.
What do you mean by local cache? The cache.Store interface? What would be the benefit or use case of not using a local cache?
I am not very familiar with exactly how many caching layers there are between etcd and the final client, but the idea would be to deactivate all of them except the one in the final client (we do not convert Watch into Get semantics).
As I understand it, the final clients that currently observe stale reads are controllers, and the source of truth is etcd.
The apiserver is certainly one layer of cache, but I have the feeling that due to shared caches / local caches there can also be several layers of cache after the apiserver.
@julienlau you might want to read the details shared in this project, explaining the time travel issues in detail and how to detect/prevent them: https://github.com/sieve-project/sieve
Just setting a status: this is still fundamentally broken in Kube, and the use of the watch cache is unsafe with HA apiservers in a way that allows two kubelets to be running a pod with the same name at the same time, which violates assumptions that statefulsets and PVCs depend on. Any cache that is HA must establish that it has the freshest data (data from time T > time the request was made) to preserve the guarantees that Kube controllers implicitly rely on.
I have not seen a solution proposed other than requiring the watch cache to verify it is up to date before serving a LIST @ rv=0, but we do need to address this soon.
Welcome to the CAP theorem, I guess?
Maybe defining specific test cases to work it out would help?
It seems to me that several other issues linked to this one use some sort of workaround to get things working, but could use a proper way to handle this at the controller level.
Sieve seems to be designed to debug controllers, but the same failure scenarios could be tested at the apiserver level to exercise the apiserver watch cache and load-balancing behavior?
Maybe a read-through/write-through debug interface to the apiserver would also help?
I have wanted to make an update on this for quite some time and am finally getting to it (motivated by a need to link it).
With #83520 (so since 1.18), this problem has much lower exposure.
That particular PR ensures that when a reflector has to relist (e.g. because of out-of-history in watch or whatever other reason), the relist will NOT go back in time (we either list from the cache by providing the currently known RV, which ensures that our LIST will be at least that fresh, or list from etcd using quorum reads).
This means, that within a single incarnation of a component, it will never go back in time.
The remaining (and unsolved) case is what happens on component start. It still lists from the cache then (without any preconditions), which means that it can go back in time compared to where it was before it was restarted.
But within a single incarnation of a component - we're now safe.
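For illustration, a hedged sketch of the relist behaviour described above (not the actual reflector code; ResourceVersionMatch is the later, explicit way to spell the "not older than" semantics):

```go
package relist

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// relistOptions sketches the list options used by a reflector within a single
// incarnation: pass the last resource version it has already observed, so the
// apiserver either serves a list at least that fresh from its cache or falls
// back to etcd. Names are illustrative.
func relistOptions(lastSyncedRV string) metav1.ListOptions {
	if lastSyncedRV == "" {
		// First list of this incarnation: no freshness floor to enforce yet.
		// This is the still-unsolved restart case discussed above.
		return metav1.ListOptions{ResourceVersion: "0"}
	}
	return metav1.ListOptions{
		ResourceVersion:      lastSyncedRV,
		ResourceVersionMatch: metav1.ResourceVersionMatchNotOlderThan,
	}
}
```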
Update appreciated.
OK for RV usage in general, but regarding the corner case you mentioned, "component start":
in your view, do Kubernetes updates also check these criteria?
It still lists from the cache then (without any preconditions), which means that it can go back in time compared to where it was before it was restarted.
Naive solution, which likely is not applicable, but I'll ask nevertheless: is listing from etcd without a quorum read an option here?
Naive solution, which likely is not applicable, but I'll ask nevertheless, is listing from etcd with quorum read an option here?
This breaks scalability, especially badly for some resources, pods in particular, because you can't list "pods from my node" from etcd - you can only list all pods, deserialize them in kube-apiserver, and filter, for each node independently. If you have a 5k-node cluster with 150k pods that had an incident and all of the kubelets are now starting at the same time...
With https://github.com/sieve-project/sieve we found several bugs in real-world controllers, some of those caused by time travel anomalies.
While not a perfect mitigation (because it always depends on the problem domain), using optimistic concurrency control patterns not only for updates but also for (conditional) deletes can mitigate some of the time travel anomalies by fencing off operations from stale views. Again, not perfect, but a good practice that does not sacrifice scalability.
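As an illustration of that pattern with client-go (names and the pod example are hypothetical):

```go
package fencing

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// deleteIfUnchanged sketches a conditional delete: the request carries UID and
// resourceVersion preconditions taken from the controller's last observed
// copy, so if that view is stale (the object was replaced or modified since),
// the apiserver rejects the delete with a conflict instead of removing the
// newer object.
func deleteIfUnchanged(ctx context.Context, client kubernetes.Interface, namespace, name string, observedUID types.UID, observedRV string) error {
	return client.CoreV1().Pods(namespace).Delete(ctx, name, metav1.DeleteOptions{
		Preconditions: &metav1.Preconditions{
			UID:             &observedUID,
			ResourceVersion: &observedRV,
		},
	})
}
```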
On smaller clusters, disabling the API server cache might be a safe alternative, too. This eliminates the aforementioned (re)start time travel anomaly in controllers.
Thoughts @wojtek-t ?
Sure - those are all good, but are only partial mitigations, not a full solution. And what we're trying to do is to provide full solution.
FWIW - kubernetes/enhancements#3142 is one of the things we're looking at.