I stand by my (roughly stated above) litmus test.
If a stop/restart cycle can be described as "a really slow reboot" (ie. local disk state (eg. emptyDirs - irreplaceable), probably remote disk attachedness, (maybe) IP address + secondary static routes, etc. are preserved) then the Node object should probably persist. Perhaps more concretely: if a Pod existed on the Node at the time it was stopped, and the restart was within the node timeout (ie. the Pod was not moved off), then the pod should restart EXACTLY as if the Node had just surprise-rebooted. If anything important is lost during the stop/restart that would prevent this, the Node MUST be evacuated and deleted, including detaching storage, releasing routes, etc.
I said "probably" to disk attachedness. If a machine rebooted and all its network disks became unattached, we SHOULD recover. If we don't it's probably a bug. That said, I don't think we have any such test, so no confidence. For sanity, a reboot should not detach disks. @kubernetes/sig-storage-misc
I said "maybe" to IP. It's not 100% clear to me if a new IP on Node reboot would break anything. Probably would make some long-tail apps that do DNS lookups of Nodes unhappy but I am not sure that's enough to outlaw it. It's certainly less worrisome if the IP didn't change, but I could be swayed on this.
Related to disk attachedness issue during vm stop/restart cycle.
For GCE PD
When a VM is stopped and restarted, the disk is automatically detached from the VM. Our attach_detach_controller periodically checks disk attached/detached state through the cloud provider. If, during the stop/restart cycle, the node and pods are not deleted from the API server, the controller will detect that the disk is detached and try to attach it again to the VM automatically.
For AWS EBS
my understanding is that during a stop/restart cycle, the EBS volume remains attached to the VM, so the controller does not really need to do anything. There is a bug fix related to this issue: #46463
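To make the reconciliation behaviour described above concrete, here is a minimal sketch of the idea; the type and function names are illustrative, not the real attach/detach controller code:

```go
package reconcile

import "fmt"

// CloudDisks is a tiny stand-in for a cloud provider's volume API.
type CloudDisks interface {
	// DiskIsAttached reports whether diskName is currently attached to nodeName.
	DiskIsAttached(diskName, nodeName string) (bool, error)
	// AttachDisk attaches diskName to nodeName.
	AttachDisk(diskName, nodeName string) error
}

// Reconcile walks the desired attachments (node name -> disks that pods on
// that node still need) and re-attaches any disk the cloud reports as
// detached, e.g. after a GCE stop/restart where the Node object was kept.
func Reconcile(cloud CloudDisks, desired map[string][]string) {
	for node, disks := range desired {
		for _, disk := range disks {
			attached, err := cloud.DiskIsAttached(disk, node)
			if err != nil {
				fmt.Printf("skipping %s on %s: %v\n", disk, node, err)
				continue
			}
			if !attached {
				// The VM came back without its disk; restore the desired state.
				if err := cloud.AttachDisk(disk, node); err != nil {
					fmt.Printf("re-attach of %s to %s failed: %v\n", disk, node, err)
				}
			}
		}
	}
}
```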
/sig azure
adding for visibility
So I think we're all furiously agreeing here, and we would like to standardise on the GCE behaviour (ie: include stopped/inactive/disabled nodes in Instances.List()).
Can we build a list of reasons why we don't want this? (and if the list is empty, then we're done :)
@divyenpatel made the point above (if I can unravel the triple negative correctly) that a "cordoned" (but otherwise live) node might still end up running active k8s jobs, and this would be undesirable.
I feel we should just say we don't automatically support this at present: if the node is able to pass kubelet health checks, and you still don't want k8s to use it for some reason, then you need to also use explicit k8s mechanisms to drain the node - we won't automatically propagate whatever underlying cloud feature exists for this situation.
At least in the case of vSphere, I think we should delete the node from the kubernetes cluster when the node is powered off in vSphere.
Approach 1:
I observe the following issues when the node is not deleted by the node controller and is in "NotReady" state:
Approach 2:
But if the node is removed by the node controller altogether, then we don't observe the above issues at all.
So the currently viable solution in the case of vSphere would be to delete the node from the kubernetes cluster when the node is powered off in vCenter.
But if we still want the node to remain "Inactive" when the node is powered off, we need a new power-off handler in the attach/detach controller that checks whether the node is powered off and performs the necessary attach/detach operations, as described in the 2nd approach.
#60009 could address the concerns with nodes going away in the cloud provider.
Do we want to revisit this issue and look at creating a well defined node lifecycle? Possibly with Running, Suspended, and Terminated states?
@rrati Thanks for rejuvenating this issue. Different cloud providers manage states in their own ways, and the semantics of particular states are lost. I can see this adding technical debt to the CCM; #60009 is a harbinger of this debt, IMO.
I do agree that we need a well-defined lifecycle for a node, so that we can present a clean, uniform API to the clouds that implement it. As we discussed in #wg-cloud-provider this morning, having these three states sounds adequate.
In the case of the Suspended state, I think it is important to define what happens when this node transitions back to Running.
There are two ways I can see this being modeled
One question that might come up while considering this state model is that the IP address of the node can change, depending on the cloud and on whether the instance uses an elastic IP. In such cases, since we already tolerate network address changes in nodes, I believe this is ok.
Lastly, knowing that a node exists (even when suspended) can factor into the automated decision-making processes of node selection, affinity, scheduling, auto scaling, etc.
If there are any other concerns with this approach, please discuss.
I think there is probably value to nodes which are halted/suspended - they exist and are registered, but are not schedulable. Terminated nodes should be deleted.
So no "states" needed, just a bool that indicates haltedness, and this might simply be the unschedulable field.
I think there's another issue aside from how kubernetes handles the nodes. When I looked at the cloud providers, some of them were filtering the instances and returning only running instances. In that case kubernetes would not be given information on a node that is halted/suspended and would think the node has been deleted.
We may want to look at a way to remove that decision making from the cloud provider and bring it back into kubernetes.
Thanks for this comment @rrati.
@thockin, the above comment reinforces what I was saying. Since the test of node existing is open to interpretation (in retrospect), cloud providers have varying implementations of this test.
GKE considers stopped nodes to be existing, whereas AWS and OpenStack don't. This disparity leads to stopped nodes being deleted in AWS/OpenStack clusters (via the logic in the CCM that deletes nodes which are not responding and cannot be found in the cloud API) when the user turns them off in the cloud for scheduled maintenance, but not in GKE.
This in-turn makes it hard to reason about volumes attached to the nodes that have been shutdown. Should they be recycled and put back into the unused pool or left attached, since the node might come back up? This dilemma led to the PR I linked above (#60009)
I proposed the three "states" of a node to come up with a clean API for nodes, where the semantics of this test is clear and nothing is left to interpretation. The implementation of this may not necessarily include defining these three states explicitly.
It sounds to me like we're all agreeing that some differentiation is required between the "states", my proposal is to "enforce" this difference through the CloudProvider interface{}.
Instead of inferring that a node exists or doesn't by looking for its ExternalID (https://github.com/kubernetes/kubernetes/blob/master/pkg/controller/cloud/node_controller.go#L420), I propose we use this function instead
```go
func InstanceStateByProviderID(providerID string) (InstanceState, error)
```
where InstanceState can be one of the three states discussed above (Running, Suspended, Terminated).
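A minimal Go sketch of what this could look like, reusing those three proposed states; the type names, constants, and interface placement are illustrative, not an agreed-upon API:

```go
package cloudprovider

// InstanceState is the coarse lifecycle state of the underlying cloud instance.
type InstanceState int

const (
	// InstanceRunning: the instance exists and is active.
	InstanceRunning InstanceState = iota
	// InstanceSuspended: the instance exists but is stopped/halted and may resume.
	InstanceSuspended
	// InstanceTerminated: the instance is gone and will not come back.
	InstanceTerminated
)

// Instances is the fragment of the cloud provider interface under discussion.
type Instances interface {
	// InstanceStateByProviderID replaces the existence check: instead of the
	// controller inferring "exists" vs "gone" from an ExternalID lookup, the
	// provider reports which of the three states the instance is in.
	InstanceStateByProviderID(providerID string) (InstanceState, error)
}
```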
I like the idea of an interface that removes the decision making from the cloud providers themselves. A core issue, imo, is that the cloud provider modules are making decisions for kubernetes.
As proposal #58635 for PR #60009 mentioned, a single node state "NotReady" controlled by the node controller is not enough in some cases (e.g., determining whether a volume can be safely detached or not). I agree we should have at least these states (Running, Shutdown/Suspended, Terminated). The Running and Terminated states are clearer, but Shutdown or Suspended might cause some confusion. From a volume management point of view, we want to know whether the OS is shut down and the mounts are gone. Some systems support pause/suspend, which is not an OS shutdown. So in PR #60009 we propose adding a "shutdown" taint. For some use cases, it might also be useful to have a "NoSchedule" taint for shutdown/suspended/paused machines.
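As an illustration of the taint-based approach (the taint key below is a placeholder; the real key and effect would be whatever PR #60009 settles on):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func main() {
	// Hypothetical taint a cloud node controller could place on a node whose
	// instance the cloud reports as shut down; the key is illustrative only.
	shutdownTaint := v1.Taint{
		Key:    "node.cloudprovider.kubernetes.io/shutdown",
		Effect: v1.TaintEffectNoSchedule,
	}
	fmt.Printf("would taint node with %s:%s\n", shutdownTaint.Key, shutdownTaint.Effect)
}
```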
cc @yujuhong
@anguslees As @rrati mentioned, the core issue is that cloudproviders, which are now going to be maintained by providers instead of kubernetes, can make decisions about kubernetes itself. This is what we want to avoid in the future.
In the short term, yes, fixing the OpenStack provider might sound good, but that would also probably break some functionality for users who are expecting suspended nodes not to get deleted. I'd rather add a new API, move to that API, and manage this change carefully.
Also, I want to define just 3 states. @jingxu97 is confusing NodeState with NodeConditions. We are not going to evolve NodeConditions any further. They were being used by the scheduler, but we moved to taints quite a while back.
The three states are:
`CanDetach` method, this can serve as the behavior. This is close to existing functionality, where volumes attached to stopped nodes cannot be detached. WDYT?
@jingxu97 I've already addressed your concerns in my comment earlier - #46442 (comment)
Please read it. The three states will serve the functionality that you mention. The Terminated state and Suspended state are there exactly to make the distinction of whether a volume can be detached (`CanDetachVolume`) or not.
Just FYI to all: I'm not proposing we change NodeConditions. All of these will and should be implemented as taints.
Looping in @kubernetes/sig-storage-feature-requests since we're talking about volume lifecycle while making changes to node lifecycle.
A WIP implementation by @zetaab
zetaab/kubernetes@0313435
@wlan0 I am not proposing to add new NodeConditions either. The proposal is just to add "shutdown" taint, not condition.
I am trying to see whether we have a clear definition of "suspended". For example, in openstack (https://docs.openstack.org/nova/latest/reference/vm-states.html), a VM can be paused, suspended, stopped, etc. How do we map these states onto our definition of "suspended"?
> @wlan0 I am not proposing to add new NodeConditions either. The proposal is just to add "shutdown" taint, not condition.
Thanks for clarifying
> I am trying to see whether we have a clear definition of "suspended"
The reason for adding this state is to provide a uniform API, which clearly allows kubernetes and its plugins (CCMs in this case) to go through its lifecycle without implicit assumptions.
Today, plugins are evolving independently, and it is becoming difficult to reason about behavior of workloads in a consistent manner.
One such inconsistency is the decision to detach a volume from a node in NotReady state. It is not clear today because NotReady can mean either of the two things below
These two situations are encapsulated in Suspended and Terminated states respectively.
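A sketch of how a controller could branch on those two states when deciding whether attached volumes may be released; the names are illustrative and mirror the hypothetical InstanceState sketch earlier in this thread:

```go
package lifecycle

// InstanceState mirrors the hypothetical states sketched earlier in the thread.
type InstanceState int

const (
	InstanceRunning InstanceState = iota
	InstanceSuspended
	InstanceTerminated
)

// shouldDetachVolumes captures the distinction being discussed: a Suspended
// node may come back, so its volumes stay attached; a Terminated node will
// not, so its volumes can be detached and reused elsewhere.
func shouldDetachVolumes(state InstanceState) bool {
	switch state {
	case InstanceTerminated:
		return true
	case InstanceSuspended, InstanceRunning:
		return false
	default:
		// Unknown state: be conservative and keep volumes attached.
		return false
	}
}
```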
I'm having difficulty determining how compatible this issue and #58635 are. Can anyone clarify for me? As far as I can tell so far the proposal of a "node shutdown taint" that would allow the attach/detach controller to detach volumes from Nova instances (in the Openstack case which particularly interests me) is incompatible with the proposal for NodeState with three different values of Running, Suspended, and Terminated. If #60009 is merged will this issue be closed? Or will someone go ahead and implement the recommended NodeState here in addition to the node shutdown taint proposed there?
@wlan0, I'm also confused because in two different places in this thread of discussion you seem to be saying different things about the meaning of the Suspended State:
> Unused volumes may be detached from the node.

> not schedulable, and attached volumes SHOULD NOT be detached, since the node might come back
If you could clarify, I would appreciate it -- thanks!
Also, for what it's worth, I don't understand why it would ever be desirable for a Deployment or StatefulSet associated with a persistent volume claim to have a new pod rescheduled onto a different node while disallowing the volume backing that PVC from being detached and following the Pod, even if it's only for a matter of ~10 minutes. This is the problem I am encountering in my test clusters running on Openstack when simulating a node failure by using the Openstack CLI to shut down and later restart a node.
Maybe that's not a very good way to simulate node failure? Maybe there is some way to configure Nova or Cinder to automatically detach volumes from a shutdown node? I'm open to suggestions if anyone can enlighten me on a good path forward or even a better way to think about node failures vs node shutdowns. Thanks!
I lost track of this - did we reach some conclusion? In general I dislike thinking about things in terms of states (collections of properties), and we have found that state-oriented APIs are hard to evolve.
I do think we need to disambiguate machines that are offline but can resume from machines that are gone and will be replaced by another, equivalent-but-different machine.
@thockin I'm not married to the state idea, but I think whatever solution we decided upon needs to leave the decision making for what to do with the nodes in kubernetes vs in the cloud provider. The states make sense to me because effectively it separates the nodes into 3 groups: Running, offline but could come back, gone and won't return. We would need the cloud provider to "sort" the nodes into the different groups but then what is done to the nodes in each group is up to kubernetes.
Is there a better way to achieve those separate node groups?
Is there any plan here?
I am trying to fix a bug where a customer shuts down their instance in AWS. The Node resource is deleted because of this:
https://github.com/kubernetes/kubernetes/blob/master/pkg/cloudprovider/providers/aws/aws.go?#L4278
On reboot the Node resource is recreated and loses all its user-defined labels.
The simple removal of that query filter resolves the issue.
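For context, the filter in question is of this general kind (a hedged sketch using the aws-sdk-go EC2 API; the actual call site in the AWS provider may differ):

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession())
	client := ec2.New(sess)

	// Restricting DescribeInstances to running instances is what makes a
	// stopped (but still existing) instance look "gone" to Kubernetes.
	// Dropping this filter lets the node controller keep the Node object.
	input := &ec2.DescribeInstancesInput{
		Filters: []*ec2.Filter{
			{
				Name:   aws.String("instance-state-name"),
				Values: aws.StringSlice([]string{"running"}),
			},
		},
	}
	out, err := client.DescribeInstances(input)
	if err != nil {
		fmt.Println("describe failed:", err)
		return
	}
	fmt.Println("reservations:", len(out.Reservations))
}
```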
> We would need the cloud provider to "sort" the nodes into the different groups but then what is done to the nodes in each group is up to kubernetes.
@rrati I think this is a good separation of concerns. I also think volume detachment behavior should not be coupled with the ability for kubernetes to differentiate between stopped/suspended and terminated/deleted nodes.
Is this expected to reach a conclusion as part of the 1.11 milestone?
We just encountered this (again) in our aws instances, with the node controller deleting Node API objects while instances were temporarily shut down, losing node labels applied by the scale-up process.
This is pretty clearly a data loss issue, and I think resolving that should be prioritized over responsiveness of auto-draining/rescheduling pods in node shutdown cases. Administrators already have tools they can use to improve that:
If we want to automatically add information to represent a shut down Node, and make auto-drain/reschedule decisions based on that, that could be a useful follow-up, but the next best thing is not to delete potentially unrecoverable Node config for instances which still exist, but are shut down.
I'm convinced that (for openstack) the right solution is just to remove the ACTIVE filter, so the node exists in k8s for as long as it exists in openstack. This is a ~1 line change.
The only reason I haven't fixed this already, is that this issue calls for a consistent approach across cloud providers. I believe this has left us in analysis paralysis and unable to move forward, since there is no-one in the project with sufficient breadth across all providers to be able to confidently declare that any proposed approach is the "right" one.
So a counter proposal: I propose we just improve this independently in each provider, with release notes, etc as for any other visible provider change.
To be clear, my understanding of "improve" is that the node should exist in k8s (ie: be included in the results of Instances.List()) for as long as it exists in the underlying IaaS. There might be some never-to-be-restored corner cases in your favourite IaaS, so apply some editorial judgement here. My understanding is that the issues being reported are all cases where k8s should have retained knowledge of the node for longer - so "better" is in that direction somewhere.
> So a counter proposal: I propose we just improve this independently in each provider, with release notes, etc as for any other visible provider change.
It is important to clarify the meaning of the interface cloud providers are implementing, so that Kubernetes can make consistent decisions based on what they return, but we might be able to do so in bite-size pieces. As we pin down the meaning of the interface where we currently have divergent implementations, per-provider efforts to reconcile their implementations to the agreed on meaning is reasonable.
InstanceExistsByProviderID() and InstanceID() seem like a good place to start. I think it is incorrect to answer "the instance does not exist" when it does exist and is in a not-running-but-resumable state (as aws and openstack do). I agree with @thockin that the distinction between "offline but can resume" and "terminated" is important.
My immediate concerns are around data loss and security implications of removing the Node objects for instances which still exist and can reactivate in the cluster:
We discussed this today in sig-aws, and I've added it to the agenda for the 8/8 meeting of sig-cloud-provider.
One point was raised that attributes about the instance can be changed while stopped. Having the node controller delete the Node object, and the restarted kubelet re-register is one way to flush out potentially stale data on the Node object.
While convenient, that approach is not a reliable way to ensure the Node object is correct. It depends on the node controller observing the stopped instance state. That can fail if the node controller is momentarily not running, or doesn't happen to check that instance in the window before the instance is started again.
It would be good for the aws and openstack experts to weigh in with the types of changes that can happen with an instance stop/start cycle, and check whether the kubelet (or cloud node controller) reflects those changes correctly on an existing Node. I think that effort should happen independent of the fix for the node deletion.
> the types of changes that can happen with an instance stop/start cycle
We can start building such a list, but I think we should also keep in mind that these are different complex systems, and the lifecycle state diagrams will not match perfectly.
In other words, the list of mishandled cases is going to be non-empty, and that's ok! Imo the correct response to a "doctor, it hurts when I ..." bug report is "so don't do that", not to attempt to turn k8s into a grand unified openstack+aws+gce+vsphere+etc.
For openstack, the list of server states reported in the openstack API is here. In particular a server instance with the same ID at the openstack level can:
(*) In the k8s openstack provider, we match k8s node name with openstack instance name (not ID) in order to go to/from the two worlds. Amongst other issues, the openstack instance name is not unique or immutable at the openstack API level. There's been a long effort to improve this, but the only real fix is to persist the openstack instance ID in the k8s Node structure somewhere, and pass that back to the provider code rather than node name. Note also that the k8s node name and openstack instance name are often not the same as the guest hostname, and very often not resolvable in DNS.
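One concrete place to persist that ID is the Node's providerID field, which controllers could parse back into an OpenStack instance UUID. A hedged sketch, assuming an "openstack:///&lt;instance-uuid&gt;" format (treat the exact format as an assumption here):

```go
package main

import (
	"fmt"
	"strings"
)

// instanceIDFromProviderID extracts the OpenStack instance UUID from a
// Node.Spec.ProviderID of the assumed form "openstack:///<instance-uuid>".
func instanceIDFromProviderID(providerID string) (string, error) {
	const prefix = "openstack:///"
	if !strings.HasPrefix(providerID, prefix) {
		return "", fmt.Errorf("unexpected providerID format: %q", providerID)
	}
	id := strings.TrimPrefix(providerID, prefix)
	if id == "" {
		return "", fmt.Errorf("empty instance ID in providerID %q", providerID)
	}
	return id, nil
}

func main() {
	id, err := instanceIDFromProviderID("openstack:///0df4c1f8-2e6b-4d0a-9f3a-example")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Looking instances up by UUID avoids the name ambiguity described above.
	fmt.Println("instance UUID:", id)
}
```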
> I'm convinced that (for openstack) the right solution is just to remove the ACTIVE filter, so the node exists in k8s for as long as it exists in openstack. This is a ~1 line change.
@anguslees was this already resolved for openstack in #59931?