I stand by my (roughly stated above) litmus test.
If a stop/restart cycle can be described as "a really slow reboot" (ie. local disk state (eg. emptyDirs - irreplaceable), probably remote disk attachedness, (maybe) IP address + secondary static routes, etc. are preserved) then the Node object should probably persist. Perhaps more concretely: if a Pod existed on the Node at the time it was stopped, and the restart was within the node timeout (ie. the Pod was not moved off), then the pod should restart EXACTLY as if the Node had just surprise-rebooted. If anything important is lost during the stop/restart that would prevent this, the Node MUST be evacuated and deleted, including detaching storage, releasing routes, etc.
I said "probably" to disk attachedness. If a machine rebooted and all its network disks became unattached, we SHOULD recover. If we don't it's probably a bug. That said, I don't think we have any such test, so no confidence. For sanity, a reboot should not detach disks. @kubernetes/sig-storage-misc
I said "maybe" to IP. It's not 100% clear to me if a new IP on Node reboot would break anything. Probably would make some long-tail apps that do DNS lookups of Nodes unhappy but I am not sure that's enough to outlaw it. It's certainly less worrisome if the IP didn't change, but I could be swayed on this.
Related to disk attachedness issue during vm stop/restart cycle.
For GCE PD
When a VM is stopped and restarted, the disk is automatically detached from the VM. Our attach_detach_controller periodically checks disk attached/detached state through the cloud provider. If, during the stop/restart cycle, the node and pods are not deleted from the API server, the controller will detect that the disk is detached and try to attach it again to the VM automatically.
For AWS EBS
my understanding is that during a stop/restart cycle, the EBS volume remains attached to the VM, so the controller does not really need to do anything. There is a bug fix related to this issue: #46463
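To make the reconciliation behaviour described above concrete, here is a minimal sketch of the idea; the type and function names are illustrative, not the real attach/detach controller code:

```go
package reconcile

import "fmt"

// CloudDisks is a tiny stand-in for a cloud provider's volume API.
type CloudDisks interface {
	// DiskIsAttached reports whether diskName is currently attached to nodeName.
	DiskIsAttached(diskName, nodeName string) (bool, error)
	// AttachDisk attaches diskName to nodeName.
	AttachDisk(diskName, nodeName string) error
}

// Reconcile walks the desired attachments (node name -> disks that pods on
// that node still need) and re-attaches any disk the cloud reports as
// detached, e.g. after a GCE stop/restart where the Node object was kept.
func Reconcile(cloud CloudDisks, desired map[string][]string) {
	for node, disks := range desired {
		for _, disk := range disks {
			attached, err := cloud.DiskIsAttached(disk, node)
			if err != nil {
				fmt.Printf("skipping %s on %s: %v\n", disk, node, err)
				continue
			}
			if !attached {
				// The VM came back without its disk; restore the desired state.
				if err := cloud.AttachDisk(disk, node); err != nil {
					fmt.Printf("re-attach of %s to %s failed: %v\n", disk, node, err)
				}
			}
		}
	}
}
```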
/sig azure
adding for visibility
So I think we're all furiously agreeing here, and we would like to standardise on the GCE behaviour (ie: include stopped/inactive/disabled nodes in Instances.List()).
Can we build a list of reasons why we don't want this? (and if the list is empty, then we're done :)
@divyenpatel made the point above (if I can unravel the triple negative correctly) that a "cordoned" (but otherwise live) node might still end up running active k8s jobs, and this would be undesirable.
I feel we should just say we don't automatically support this at present: if the node is able to pass kubelet health checks, and you still don't want k8s to use it for some reason, then you need to also use explicit k8s mechanisms to drain the node - we won't automatically propagate whatever underlying cloud feature exists for this situation.
At least in the case of vSphere, I think we should delete the node from the kubernetes cluster when the node is powered off in vSphere.
Approach 1:
I observe the following issues when the node is not deleted by the node controller and is in "NotReady" state:
Approach 2:
But if the node is removed by the node controller altogether, then we don't observe the above issues at all.
So the currently viable solution in the case of vSphere would be to delete the node from the kubernetes cluster when the node is powered off in vCenter.
But if we still want the node to remain "Inactive" when the node is powered off, we need a new power-off handler in the attach/detach controller that checks whether the node is powered off and performs the necessary attach/detach operations, as described in the 2nd approach.
#60009 could address the concerns with nodes going away in the cloud provider.
Do we want to revisit this issue and look at creating a well defined node lifecycle? Possibly with Running, Suspended, and Terminated states?
@rrati Thanks for rejuvenating this issue. Different cloud providers manage states in their own ways, and the semantics of particular states are lost. I can see this adding technical debt to the CCM; #60009 is a harbinger of this debt, IMO.
I do agree that we need a well-defined lifecycle for a node, so that we can present a clean, uniform API to the clouds that implement it. As we discussed in #wg-cloud-provider this morning, having these three states sounds adequate.
In the case of the Suspended state, I think it is important to define what happens when this node transitions back to Running.
There are two ways I can see this being modeled
One question that might come up while considering this state model is that the IP address of the node can change, depending on the cloud and on whether the instance uses an elastic IP. In such cases, since we already tolerate network address changes in nodes, I believe this is ok.
Lastly, knowing that a node exists (even when suspended) can factor into the automated decision-making processes of node selection, affinity, scheduling, auto scaling, etc.
If there are any other concerns with this approach, please discuss.
I think there is probably value to nodes which are halted/suspended - they exist and are registered, but are not schedulable. Terminated nodes should be deleted.
So no "states" needed, just a bool that indicates haltedness, and this might simply be the unschedulable field.
I think there's another issue aside from how kubernetes handles the nodes. When I looked at the cloud providers, some of them were filtering the instances and returning only running instances. In that case kubernetes would not be given information on a node that is halted/suspended and would think the node has been deleted.
We may want to look at a way to remove that decision making from the cloud provider and bring it back into kubernetes.
Thanks for this comment @rrati.
@thockin, the above comment reinforces what I was saying. Since the test of node existing is open to interpretation (in retrospect), cloud providers have varying implementations of this test.
GKE considers stopped nodes to be existing, whereas AWS and OpenStack don't. This disparity leads to stopped nodes being deleted in AWS/OpenStack clusters (via the logic in the CCM that deletes nodes which are not responding and cannot be found in the cloud API) when the user turns them off in the cloud for scheduled maintenance, but not in GKE.
This in-turn makes it hard to reason about volumes attached to the nodes that have been shutdown. Should they be recycled and put back into the unused pool or left attached, since the node might come back up? This dilemma led to the PR I linked above (#60009)
I proposed the three "states" of a node to come up with a clean API for nodes, where the semantics of this test is clear and nothing is left to interpretation. The implementation of this may not necessarily include defining these three states explicitly.
It sounds to me like we're all agreeing that some differentiation is required between the "states", my proposal is to "enforce" this difference through the CloudProvider interface{}.
Instead of inferring that a node exists or doesn't by looking for its ExternalID (https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/cloud/node_controller.go#L420), I propose we use this function instead
```go
func InstanceStateByProviderID(providerID string) (InstanceState, error)
```
where InstanceState can be one of the three states discussed above (Running, Suspended, Terminated).
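A minimal Go sketch of what this could look like, reusing those three proposed states; the type names, constants, and interface placement are illustrative, not an agreed-upon API:

```go
package cloudprovider

// InstanceState is the coarse lifecycle state of the underlying cloud instance.
type InstanceState int

const (
	// InstanceRunning: the instance exists and is active.
	InstanceRunning InstanceState = iota
	// InstanceSuspended: the instance exists but is stopped/halted and may resume.
	InstanceSuspended
	// InstanceTerminated: the instance is gone and will not come back.
	InstanceTerminated
)

// Instances is the fragment of the cloud provider interface under discussion.
type Instances interface {
	// InstanceStateByProviderID replaces the existence check: instead of the
	// controller inferring "exists" vs "gone" from an ExternalID lookup, the
	// provider reports which of the three states the instance is in.
	InstanceStateByProviderID(providerID string) (InstanceState, error)
}
```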
I like the idea of an interface that removes the decision making from the cloud providers themselves. A core issue, imo, is that the cloud provider modules are making decisions for kubernetes.
As proposal #58635 for PR #60009 mentioned, a single node state "NotReady" controlled by the node controller is not enough in some cases (e.g., determining whether a volume can be safely detached or not). I agree we should have at least these states (Running, Shutdown/Suspended, Terminated). The Running and Terminated states are clearer, but Shutdown or Suspended might cause some confusion. From a volume management point of view, we want to know whether the OS is shut down and the mounts are gone. Some systems support pause/suspend, which is not an OS shutdown. So in PR #60009 we propose adding a "shutdown" taint. For some use cases, it might also be useful to have a "NoSchedule" taint for shutdown/suspended/paused machines.
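As an illustration of the taint-based approach (the taint key below is a placeholder; the real key and effect would be whatever PR #60009 settles on):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// Hypothetical taint a cloud node controller could place on a node whose
	// instance the cloud reports as shut down; the key is illustrative only.
	shutdownTaint := v1.Taint{
		Key:    "node.cloudprovider.kubernetes.io/shutdown",
		Effect: v1.TaintEffectNoSchedule,
	}
	fmt.Printf("would taint node with %s:%s\n", shutdownTaint.Key, shutdownTaint.Effect)
}
```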
cc @yujuhong
@anguslees As @rrati mentioned, the core issue is that cloudproviders, which are now going to be maintained by providers instead of kubernetes, can make decisions about kubernetes itself. This is what we want to avoid in the future.
In the short term, yes, fixing the OpenStack provider might sound good, but that would also probably break some functionality for users who are expecting suspended nodes not to get deleted. I'd rather add a new API, move to that API, and manage this change carefully.
Also, I want to define just 3 states. @jingxu97 is confusing NodeState with NodeConditions. We are not going to evolve NodeConditions any further. They were being used by the scheduler, but we moved to taints quite a while back.
The three states are:
`CanDetach` method, this can serve as the behavior. This is close to existing functionality, where volumes attached to stopped nodes cannot be detached. WDYT?
@jingxu97 I've already addressed your concerns in my comment earlier - #46442 (comment)
Please read it. The three states will serve the functionality that you mention. The Terminated state and Suspended state are there exactly to make the distinction of whether a volume can be detached (`CanDetachVolume`) or not.
Just FYI to all: I'm not proposing we change NodeConditions. All of these will and should be implemented as taints.
Looping in @kubernetes/sig-storage-feature-requests since we're talking about volume lifecycle while making changes to node lifecycle.
A WIP implementation by @zetaab
zetaab/kubernetes@0313435
@wlan0 I am not proposing to add new NodeConditions either. The proposal is just to add "shutdown" taint, not condition.
I am trying to see whether we have a clear definition of "suspended". For example, in openstack (https://docs.openstack.org/nova/latest/reference/vm-states.html), a VM can be paused, suspended, stopped, etc. How do we map these states onto our definition of "suspended"?
> @wlan0 I am not proposing to add new NodeConditions either. The proposal is just to add "shutdown" taint, not condition.
Thanks for clarifying
> I am trying to see whether we have a clear definition of "suspended"
The reason for adding this state is to provide a uniform API, which clearly allows kubernetes and its plugins (CCMs in this case) to go through its lifecycle without implicit assumptions.
Today, plugins are evolving independently, and it is becoming difficult to reason about behavior of workloads in a consistent manner.
One such inconsistency is the decision to detach a volume from a node in NotReady state. It is not clear today because NotReady can mean either of the two things below
These two situations are encapsulated in Suspended and Terminated states respectively.
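A sketch of how a controller could branch on those two states when deciding whether attached volumes may be released; the names are illustrative and mirror the hypothetical InstanceState sketch earlier in this thread:

```go
package lifecycle

// InstanceState mirrors the hypothetical states sketched earlier in the thread.
type InstanceState int

const (
	InstanceRunning InstanceState = iota
	InstanceSuspended
	InstanceTerminated
)

// shouldDetachVolumes captures the distinction being discussed: a Suspended
// node may come back, so its volumes stay attached; a Terminated node will
// not, so its volumes can be detached and reused elsewhere.
func shouldDetachVolumes(state InstanceState) bool {
	switch state {
	case InstanceTerminated:
		return true
	case InstanceSuspended, InstanceRunning:
		return false
	default:
		// Unknown state: be conservative and keep volumes attached.
		return false
	}
}
```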
I'm having difficulty determining how compatible this issue and #58635 are. Can anyone clarify for me? As far as I can tell so far the proposal of a "node shutdown taint" that would allow the attach/detach controller to detach volumes from Nova instances (in the Openstack case which particularly interests me) is incompatible with the proposal for NodeState with three different values of Running, Suspended, and Terminated. If #60009 is merged will this issue be closed? Or will someone go ahead and implement the recommended NodeState here in addition to the node shutdown taint proposed there?
@wlan0, I'm also confused because in two different places in this thread of discussion you seem to be saying different things about the meaning of the Suspended State:
> Unused volumes may be detached from the node.

> not schedulable, and attached volumes SHOULD NOT be detached, since the node might come back
If you could clarify, I would appreciate it -- thanks!
Also, for what it's worth, I don't understand why it would ever be desirable for a Deployment or StatefulSet associated with a persistent volume claim to have a new pod rescheduled onto a different node while disallowing the volume backing that PVC from being detached and following the Pod, even if it's only for a matter of ~10 minutes. This is the problem I am encountering in my test clusters running on Openstack when simulating a node failure by using the Openstack CLI to shut down and later restart a node.
Maybe that's not a very good way to simulate node failure? Maybe there is some way to configure Nova or Cinder to automatically detach volumes from a shutdown node? I'm open to suggestions if anyone can enlighten me on a good path forward or even a better way to think about node failures vs node shutdowns. Thanks!
I lost track of this - did we reach some conclusion? In general I dislike thinking about things in terms of states (collections of properties), and we have found that state-oriented APIs are hard to evolve.
I do think we need to disambiguate machines that are offline but can resume from machines that are gone and will be replaced by another, equivalent-but-different machine.
@thockin I'm not married to the state idea, but I think whatever solution we decided upon needs to leave the decision making for what to do with the nodes in kubernetes vs in the cloud provider. The states make sense to me because effectively it separates the nodes into 3 groups: Running, offline but could come back, gone and won't return. We would need the cloud provider to "sort" the nodes into the different groups but then what is done to the nodes in each group is up to kubernetes.
Is there a better way to achieve those separate node groups?
Is there any plan here?
I am trying to fix a bug where a customer shuts down their instance in AWS. The Node resource is deleted because of this:
https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go?#L4278
On reboot the Node resource is recreated and loses all its user-defined labels.
The simple removal of that query filter resolves the issue.
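For context, the filter in question is of this general kind (a hedged sketch using the aws-sdk-go EC2 API; the actual call site in the AWS provider may differ):

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession())
	client := ec2.New(sess)

	// Restricting DescribeInstances to running instances is what makes a
	// stopped (but still existing) instance look "gone" to Kubernetes.
	// Dropping this filter lets the node controller keep the Node object.
	input := &ec2.DescribeInstancesInput{
		Filters: []*ec2.Filter{
			{
				Name:   aws.String("instance-state-name"),
				Values: aws.StringSlice([]string{"running"}),
			},
		},
	}
	out, err := client.DescribeInstances(input)
	if err != nil {
		fmt.Println("describe failed:", err)
		return
	}
	fmt.Println("reservations:", len(out.Reservations))
}
```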
> We would need the cloud provider to "sort" the nodes into the different groups but then what is done to the nodes in each group is up to kubernetes.
@rrati I think this is a good separation of concerns. I also think volume detachment behavior should not be coupled with the ability for kubernetes to differentiate between stopped/suspended and terminated/deleted nodes.
Is this expected to reach a conclusion as part of the 1.11 milestone?
We just encountered this (again) in our aws instances, with the node controller deleting Node API objects while instances were temporarily shut down, losing node labels applied by the scale-up process.
This is pretty clearly a data loss issue, and I think resolving that should be prioritized over responsiveness of auto-draining/rescheduling pods in node shutdown cases. Administrators already have tools they can use to improve that:
If we want to automatically add information to represent a shut down Node, and make auto-drain/reschedule decisions based on that, that could be a useful follow-up, but the next best thing is not to delete potentially unrecoverable Node config for instances which still exist, but are shut down.
I'm convinced that (for openstack) the right solution is just to remove the ACTIVE filter, so the node exists in k8s for as long as it exists in openstack. This is a ~1 line change.
The only reason I haven't fixed this already, is that this issue calls for a consistent approach across cloud providers. I believe this has left us in analysis paralysis and unable to move forward, since there is no-one in the project with sufficient breadth across all providers to be able to confidently declare that any proposed approach is the "right" one.
So a counter proposal: I propose we just improve this independently in each provider, with release notes, etc as for any other visible provider change.
To be clear, my understanding of "improve" is that the node should exist in k8s (ie: be included in the results of Instances.List()) for as long as it exists in the underlying IaaS. There might be some never-to-be-restored corner cases in your favourite IaaS, so apply some editorial judgement here. My understanding is that the issues being reported are all cases where k8s should have retained knowledge of the node for longer - so "better" is in that direction somewhere.
> So a counter proposal: I propose we just improve this independently in each provider, with release notes, etc as for any other visible provider change.
It is important to clarify the meaning of the interface cloud providers are implementing, so that Kubernetes can make consistent decisions based on what they return, but we might be able to do so in bite-size pieces. As we pin down the meaning of the interface where we currently have divergent implementations, per-provider efforts to reconcile their implementations to the agreed on meaning is reasonable.
InstanceExistsByProviderID() and InstanceID() seem like a good place to start. I think it is incorrect to answer "the instance does not exist" when it does exist and is in a not-running-but-resumable state (as aws and openstack do). I agree with @thockin that the distinction between "offline but can resume" and "terminated" is important.
My immediate concerns are around data loss and security implications of removing the Node objects for instances which still exist and can reactivate in the cluster:
We discussed this today in sig-aws, and I've added it to the agenda for the 8/8 meeting of sig-cloud-provider.
One point was raised that attributes about the instance can be changed while stopped. Having the node controller delete the Node object, and the restarted kubelet re-register is one way to flush out potentially stale data on the Node object.
While convenient, that approach is not a reliable way to ensure the Node object is correct. It depends on the node controller observing the stopped instance state. That can fail if the node controller is momentarily not running, or doesn't happen to check that instance in the window before the instance is started again.
It would be good for the aws and openstack experts to weigh in with the types of changes that can happen with an instance stop/start cycle, and check whether the kubelet (or cloud node controller) reflects those changes correctly on an existing Node. I think that effort should happen independent of the fix for the node deletion.
> the types of changes that can happen with an instance stop/start cycle
We can start building such a list, but I think we should also keep in mind that these are different complex systems, and the lifecycle state diagrams will not match perfectly.
In other words, the list of mishandled cases is going to be non-empty, and that's ok! Imo the correct response to a "doctor, it hurts when I ..." bug report is "so don't do that", not to attempt to turn k8s into a grand unified openstack+aws+gce+vsphere+etc.
For openstack, the list of server states reported in the openstack API is here. In particular a server instance with the same ID at the openstack level can:
(*) In the k8s openstack provider, we match k8s node name with openstack instance name (not ID) in order to go to/from the two worlds. Amongst other issues, the openstack instance name is not unique or immutable at the openstack API level. There's been a long effort to improve this, but the only real fix is to persist the openstack instance ID in the k8s Node structure somewhere, and pass that back to the provider code rather than node name. Note also that the k8s node name and openstack instance name are often not the same as the guest hostname, and very often not resolvable in DNS.
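One concrete place to persist that ID is the Node's providerID field, which controllers could parse back into an OpenStack instance UUID. A hedged sketch, assuming an "openstack:///&lt;instance-uuid&gt;" format (treat the exact format as an assumption here):

```go
package main

import (
	"fmt"
	"strings"
)

// instanceIDFromProviderID extracts the OpenStack instance UUID from a
// Node.Spec.ProviderID of the assumed form "openstack:///<instance-uuid>".
func instanceIDFromProviderID(providerID string) (string, error) {
	const prefix = "openstack:///"
	if !strings.HasPrefix(providerID, prefix) {
		return "", fmt.Errorf("unexpected providerID format: %q", providerID)
	}
	id := strings.TrimPrefix(providerID, prefix)
	if id == "" {
		return "", fmt.Errorf("empty instance ID in providerID %q", providerID)
	}
	return id, nil
}

func main() {
	id, err := instanceIDFromProviderID("openstack:///0df4c1f8-2e6b-4d0a-9f3a-example")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Looking instances up by UUID avoids the name ambiguity described above.
	fmt.Println("instance UUID:", id)
}
```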
> I'm convinced that (for openstack) the right solution is just to remove the ACTIVE filter, so the node exists in k8s for as long as it exists in openstack. This is a ~1 line change.
@anguslees was this already resolved for openstack in #59931?