VM behavior when the node fails


Zang Li

Jun 2, 2021, 7:29:00 PM
to kubevirt-dev
Hi,

When a kubernetes node fails, such as losing connection, the node will
be marked as NotReady. Pods on the node will be evicted after
pod-eviction-timeout, and a new pod will be scheduled on a different
node [1].
I thought a kubevirt vm pod would have the same behavior, but when I
tried it, it did not. The virt-launcher pod got into the terminating
state as expected, but a new pod was not created for the vm. Only
after the failed node reconnected did the old pod terminate
successfully and a new pod for the vm get created. Is this behavior
configurable? We may want to fail over the vm to a different node when
the original node hosting it fails non-gracefully.

Thanks,
Zang

[1] https://medium.com/tailwinds-navigator/kubernetes-tip-what-happens-to-pods-running-on-node-that-become-unreachable-3d409f734e5d
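For context, the eviction timeline described in [1] can be sketched with the commonly cited Kubernetes defaults. The timing constants below are illustrative assumptions (real clusters tune them), not values taken from this thread:

```python
# Illustrative timeline of pod eviction after a node becomes unreachable.
# Assumed defaults: node-monitor-grace-period=40s until NotReady, and a
# 300s tolerationSeconds for the node.kubernetes.io/unreachable taint.
NODE_MONITOR_GRACE_PERIOD = 40   # seconds until the node is marked NotReady
UNREACHABLE_TOLERATION = 300     # seconds pods tolerate the NotReady taint

def eviction_timeline(t_failure: float) -> dict:
    """Return the key timestamps (seconds) after a node stops responding."""
    t_not_ready = t_failure + NODE_MONITOR_GRACE_PERIOD
    t_eviction = t_not_ready + UNREACHABLE_TOLERATION
    return {
        "node_marked_not_ready": t_not_ready,
        "pods_marked_for_deletion": t_eviction,
    }

timeline = eviction_timeline(0)
print(timeline)  # NotReady at 40s, eviction begins at 340s
```

Note that for a VM pod the story stops at "Terminating": as observed above, the replacement is not created until the node returns.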

Dan Kenigsberg

Jun 3, 2021, 1:14:32 AM
to Zang Li, kubevirt-dev, Andrew Beekhof
On Thu, Jun 3, 2021 at 2:29 AM 'Zang Li' via kubevirt-dev <kubevi...@googlegroups.com> wrote:
> Hi,
>
> When a kubernetes node fails, such as losing connection, the node will
> be marked as NotReady.  Pods on the node will be evicted after
> pod-eviction-timeout, and a new pod will be scheduled on a different
> node [1].
> I thought kubevirt vm pod would have the same behavior, but when I
> tried it did not. The virt-launcher pod got into the terminating state
> as expected, but a new pod was not generated for the vm. Only after
> the failed node reconnected back, the old pod terminated successfully
> and the new pod for the vm was created. Is this behavior configurable?
> We may want to failover the vm to a different node when the original
> node hosting it failed non-gracefully.

When a node becomes unreachable, the cluster cannot tell if the VM running there is still writing to shared storage or serving clients over a Multus connection. Allowing another VMI to run elsewhere creates a risk of "split brains" and data corruption. This is true for pods too - if they use persistent storage or serve on secondary interfaces, which historically they have not.

I like the idea behind https://github.com/medik8s : it should ensure that after a known period of time the unreachable worker no longer runs anything, so that slightly afterwards the cluster may reschedule the workload elsewhere. Would you consider running medik8s's poison-pill and node-healthcheck on your cluster?

--
You received this message because you are subscribed to the Google Groups "kubevirt-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kubevirt-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kubevirt-dev/CAO_S94i-9fHdG8XgOwP%3DVjRhHE2zMJ5WERe8P8HSosM56_bANg%40mail.gmail.com.

Zang Li

Jun 3, 2021, 12:50:08 PM
to Dan Kenigsberg, kubevirt-dev, Andrew Beekhof
Hi Dan,

Yes, the pod may still connect to the outside if it uses multus, but
isn't the persistent storage using the node's connection, so most
likely it cannot read/write to the storage? There is a corner case
where the node cannot reach the API server but can reach the storage -
would the status of the vm change if storage access fails? Is it
possible for the local virt-handler to detect the node connection
failure and stop the local VM to prevent the split brain situation? I
can check how medik8s works, but I think we might be able to solve the
issue within kubevirt. What do you think?

Thanks,
Zang

dvo...@redhat.com

Jun 3, 2021, 5:33:01 PM
to kubevirt-dev
On Thursday, June 3, 2021 at 12:50:08 PM UTC-4 Zang Li wrote:
> Hi Dan,
>
> Yes, the pod may still connect to outside if they use multus, but
> isn't the persistent storage using the node connection so most likely
> it cannot read/write to the storage? There is a corner case that the
> node cannot reach API server but can reach the storage - would the
> status of the vm change if storage access fail? Is it possible for the
> local virt-handler to detect the node connection failure and stop the
> local VM to prevent the split brain situation? I can check how medik8s
> works, but I think we might be able to solve the issue within
> kubevirt. What do you think?

KubeVirt is designed to treat VMI pods the same way the StatefulSet controller treats pods during node failure. My expectation is that the behavior you're describing, where a pod comes back somewhere else after node failure, involves a ReplicaSet. Try the same thing with a StatefulSet to see our expected behavior for VMs.

The most reliable way to guarantee that a workload on a failed node is no longer accessing shared storage or any other shared resource is to have something external to that failed node pull the plug and report to the cluster that the node is really gone. So: fencing [1], or as medik8s calls it, "node remediation".

Theoretically it might be possible for virt-handler to detect a node connection failure and stop VMs, but I think we'd rather stay out of the business of fencing entirely because it gets complex quickly and there's no "one size fits all" solution. For the behavior you're talking about, maybe take a look at the poison pill controller [2] [3]. It's a software-based fencing method similar to what you were describing virt-handler could do. From what I gather, after node failure occurs the poison pill controller reboots the node it's running on.
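The node-local half of this idea can be sketched in a few lines. The deadline value and function name below are made up for illustration; this is not the actual poison-pill controller logic:

```python
# Toy sketch of "poison pill" style self-fencing: a node-local agent
# triggers a reboot once it has been unable to reach the API server
# for longer than a fixed deadline. FENCE_DEADLINE is an assumed value.
FENCE_DEADLINE = 120  # seconds of API-server unreachability before self-fence

def should_self_fence(seconds_unreachable: float) -> bool:
    """Decide whether the node must reboot itself to release its workloads."""
    return seconds_unreachable >= FENCE_DEADLINE

print(should_self_fence(30))    # transient blip: keep running
print(should_self_fence(150))   # deadline exceeded: reboot the node
```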

- David

Zang Li

Jun 3, 2021, 5:50:00 PM
to dvo...@redhat.com, kubevirt-dev
Hi David,

Thank you for your inputs! Please see my comments below:

> The most reliable way to guarantee that a workload on a failed node is no longer accessing shared storage or any other shared resource is to have something external of that failed node pull the plug and report to the cluster that the node is really gone. so, fencing [1] or as medik8s calls it "node remediation".

If the node failed to connect on its node interface, how can something
external make sure that the workload on the failed node doesn't access
the storage?

> Theoretically it might be possible for virt-handler to detect a node connection failure and stop VMs, but i think we'd rather stay out of the business of fencing entirely because it get's complex quickly and there's no "one size fits all" kind of solution. For the behavior you're talking about, maybe take a look at the poison pill controller [2] [3]. It's a software based fencing method similar to what you where describing virt-handler could do. From what I gather, after node failure occurs the poison pill controller reboots the node it's running on after.
>

The poison pill controller would try to reboot the node only if it can
connect to the node or is local to the node, right? This may not
always be possible for some failures, such as when the server loses
power or network connectivity.
Also, what if rebooting the node doesn't fix the issue? You still need
the kubevirt controller to be able to reschedule the VM, right?

Zang

Andrew Beekhof

Jun 3, 2021, 8:44:21 PM
to Zang Li, Dan Kenigsberg, kubevirt-dev, Nir Yehia, Roy Golan
On Fri, 4 Jun 2021 at 02:50, Zang Li <zan...@google.com> wrote:
> Hi Dan,
>
> Yes, the pod may still connect to outside if they use multus, but
> isn't the persistent storage using the node connection so most likely
> it cannot read/write to the storage?

That will be true for some failure scenarios, but definitely not all, and the problem is that they generally look the same from inside the cluster.
To make this a safe assumption, KubeVirt would have to intimately understand the physical and logical layout of the cluster - which would be a lot of work and smells suspiciously like a layering violation.
 
> There is a corner case that the
> node cannot reach API server but can reach the storage - would the
> status of the vm change if storage access fail? Is it possible for the
> local virt-handler to detect the node connection failure and stop the
> local VM to prevent the split brain situation?

That is possible.  We're implementing something for medik8s along those lines right now based on a concept borrowed from my time with traditional HA clustering.
You need to arrange for the VM on the bad node to stop within a known finite period of time, for the peers to wait that amount of time, and then unblock the scheduler (today that's only possible by deleting the Node CR, which is a horrible horrible hack).

What I would caution though, is that this turns into a rabbit hole really fast, and is something you'd want solved once in a common component, so that every application with this kind of requirement doesn't need to re-implement and re-learn the same lessons. Wasted effort is just the beginning, as you'd have different implementations observing the system at slightly different points in time - making different decisions on different inputs and generally tripping over one another, re-recovering nodes that were already recovered by a different application.

Once you have good heuristics that avoid the obvious false positives and negatives, you'd want to integrate with a watchdog (ideally in hardware, not softdog) to ensure that the VM gets stopped even if the machine is suffering a lockup or resource starvation. It also reduces the amount of time your peers must wait before recovery can continue.

The thing about heuristics though, is that you can always find a way to trick them. So if there is a way for the surviving peers to actively stop the VM (possibly using a BMC to power cycle the node, or using something like the Machine API to reprovision it), that will always be preferable for those kinds of clusters. So whatever solution you come up with will want to be able to make use of different mechanisms - which is why we have different components for failure detection and remediation.
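The "stop within a known finite period, then wait that long" protocol described above can be sketched roughly as follows. Names and timing values are invented for illustration; this is not medik8s code:

```python
# Timeout-based fencing sketch: the failed node promises to kill its
# workloads within SELF_FENCE_DEADLINE seconds of losing contact
# (enforced by a watchdog), and peers wait strictly longer than that
# before unblocking the scheduler. Values are illustrative assumptions.
SELF_FENCE_DEADLINE = 60   # node kills its local VMs by this time
CLOCK_SKEW_MARGIN = 10     # safety margin for clock drift
PEER_WAIT = SELF_FENCE_DEADLINE + CLOCK_SKEW_MARGIN

def safe_to_reschedule(seconds_since_last_heartbeat: float) -> bool:
    """Peers may reschedule only once the old VM is provably stopped."""
    return seconds_since_last_heartbeat >= PEER_WAIT

# The invariant that prevents split brain: at every point where peers
# would reschedule, the local watchdog has already fired.
assert not safe_to_reschedule(SELF_FENCE_DEADLINE)  # still too early
assert safe_to_reschedule(PEER_WAIT)                # old VM guaranteed dead
```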
 
> I can check how medik8s
> works, but I think we might be able to solve the issue within
> kubevirt. What do you think?

I'm biased, but in my experience, that's a bad idea.
We're probably a week away from polishing medik8s to the point that I'd be comfortable having kubevirt folks try out a beta.

Definitely reach out if you have questions, the website is also hopelessly incomplete at this stage.

Roman Mohr

Jun 4, 2021, 3:23:55 AM
to dvo...@redhat.com, kubevirt-dev
On Thu, Jun 3, 2021 at 11:33 PM dvo...@redhat.com <dvo...@redhat.com> wrote:


> On Thursday, June 3, 2021 at 12:50:08 PM UTC-4 Zang Li wrote:
> > Hi Dan,
> >
> > Yes, the pod may still connect to outside if they use multus, but
> > isn't the persistent storage using the node connection so most likely
> > it cannot read/write to the storage? There is a corner case that the
> > node cannot reach API server but can reach the storage - would the
> > status of the vm change if storage access fail? Is it possible for the
> > local virt-handler to detect the node connection failure and stop the
> > local VM to prevent the split brain situation? I can check how medik8s
> > works, but I think we might be able to solve the issue within
> > kubevirt. What do you think?
>
> KubeVirt is designed to treat the VMI pods in the same way the StatefulSet controller treats pods during node failure. My expectation is that the behavior you're talking about where a pod comes back somewhere else after node failure occurs is likely involving a ReplicaSet. Try the same thing with a StatefulSet to see our expected behavior for VMs.


That should be it. If you use a VirtualMachineInstanceReplicaSet you get exactly the behaviour of a Deployment or ReplicaSet, because we don't have to care about concurrent disk access in this scenario.
Zang, also have a look at https://github.com/kubevirt/kubevirt/issues/3332 and https://github.com/kubevirt/kubevirt/pull/4398 for additional relevant discussion.

Best regards,
Roman


Zang Li

Jun 4, 2021, 2:18:23 PM
to Roman Mohr, dvo...@redhat.com, kubevirt-dev
Thanks Roman and Andrew for your great insights on this issue! I
understand the concern here is concurrent disk access.
VirtualMachineInstanceReplicaSet works because it avoids concurrent
writes by using an ephemeral disk or a read-only disk. What I was
hoping to achieve is something like a cold migration of the VM when a
failure is detected. Is there a way to tell if a VM is working
properly? If we are worried about concurrent writing, is there a way
to probe for that? Is there a way to block it from outside?

I agree that it is more efficient to solve things in a common
component. In this problem, there are two aspects: failure detection
and failure remediation. The former is already at the infrastructure
level and is indicated by the Kubernetes node status. I understand
that the node status reflects the view from the controller, so it
might not be what the node perceives or what the true situation is. I
think that is part of what medik8s does, from my quick glance. But do
we want the response to node failure to be generic or specific to
each component? How should the user recover their VMs in that
situation? The poison-pill approach couldn't achieve that if the node
cannot be repaired, right? Sometimes it might be desirable to
quarantine the problematic node for further analysis instead of
restarting it.
Andrew, how about you give a talk on medik8s and show us how it works
in the next community meeting? I am very interested in knowing more
about it.

Thanks,
Zang

Roman Mohr

Jun 7, 2021, 3:40:13 AM
to Zang Li, dvo...@redhat.com, kubevirt-dev
On Fri, Jun 4, 2021 at 8:18 PM Zang Li <zan...@google.com> wrote:
> Thanks Roman and Andrew for your great insights on this issue! I
> understand the concerns here are concurrent disk access.
> VirtualMachineInstanceReplicaSet works because it avoids concurrent
> writes by using an ephemeral disk or read only disk. What I was hoping
> to achieve is something like a cold migration of the VM when failure
> is detected. Is there a way to tell if a VM is working properly? If we
> are worried about concurrent writing, is there a way to probe that? Is
> there a way to block that from outside?
>
> I agree that it is more efficient to solve things in a common
> component.  In this problem, there are two aspects: abnormal detection
> and abnormal remedy.  The former is already at the infrastructure
> level and is indicated by Kubernetes node status. I understand that
> the node status there reflects the view from the controller, so it
> might not be what the node perceives or what the true situation is. I
> think that is part of medik8s does from my quick glance. But do we
> want the action to the node failure to be also generic or specific to
> each component? How should the user recover their VMs from that
> situation?

Users or admins can force-delete pods on the corresponding nodes, which will also unblock the VMs.
However, for that they need admin insight to know that the workload is really dead.

All other pure user-level solutions (without node-insight and without automatic fencing) basically require application-level leader election and/or replication in combination with load balancers.
For instance starting 3 VMs where each one runs postgresql with replication, or active-passive standby on activemq.
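The fencing-token idea underlying such application-level protection can be sketched as follows. This is a toy model with invented names, not any real database's replication protocol:

```python
# Minimal fencing-token illustration: writers hold a monotonically
# increasing lease epoch, and the store rejects writes from a stale
# epoch. A leader stranded on a partitioned node can no longer corrupt
# data once a replacement has been elected.
class FencedStore:
    def __init__(self):
        self.epoch = 0   # highest lease ever granted
        self.data = {}

    def grant_lease(self) -> int:
        """Elect a new leader; all older leases become invalid."""
        self.epoch += 1
        return self.epoch

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self.epoch:   # stale leader on a partitioned node
            return False
        self.data[key] = value
        return True

store = FencedStore()
old = store.grant_lease()             # leader on the node that later fails
new = store.grant_lease()             # replacement elected elsewhere
assert store.write(new, "k", "v2")    # new leader writes fine
assert not store.write(old, "k", "v1")  # stale writer is fenced off
```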
 
> Poison-pill approach couldn't achieve that if the node
> cannot be repaired, right? Sometimes it might be desired to quarantine
> the problematic node for further analysis instead of restarting it.

I think unless you somehow ensure that the node gets isolated (hard cut-off of storage and network), or you force-delete all remaining containers on the node, the workloads on that node are stuck
if the applications do not have their own HA models.

Maybe Andrew has more insights.

Best regards,
Roman

Zang Li

Jun 8, 2021, 3:04:23 PM
to Roman Mohr, dvo...@redhat.com, kubevirt-dev
Hi Roman,

I am still debating whether we should do this in a generic way or
solve it at the kubevirt level, because the former will impact the
whole system behavior, which makes it harder to get accepted.
Also, here is some reference on how vmware handles this problem:
https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.avail.doc/GUID-33A65FF7-DA22-4DC5-8B18-5A7F97CCA536.html
They introduced both a host heartbeat and a datastore heartbeat.
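The two-heartbeat scheme from that document can be sketched like this. It is purely illustrative and not vSphere's actual algorithm:

```python
# Sketch of the two-heartbeat idea: a second, independent channel (the
# datastore) disambiguates "network-isolated but alive" from "actually
# dead". Restarting VMs of a merely isolated host risks split brain.
def diagnose(network_heartbeat_ok: bool, datastore_heartbeat_ok: bool) -> str:
    if network_heartbeat_ok:
        return "healthy"
    if datastore_heartbeat_ok:
        # Host lost the management network but still touches storage:
        # its VMs may be running, so do not restart them elsewhere.
        return "isolated"
    # Neither channel responds: host presumed dead, safe to restart VMs.
    return "dead"

print(diagnose(False, True))   # isolated: hold off on failover
print(diagnose(False, False))  # dead: failover is safe
```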

Best,
Zang

David Vossel

Jun 9, 2021, 8:35:12 AM
to Zang Li, Roman Mohr, kubevirt-dev
On Tue, Jun 8, 2021 at 3:04 PM Zang Li <zan...@google.com> wrote:
> Hi Roman,
>
> I am still debating whether we should do this in a generic way or we
> solve it at kubevirt level because the former will impact the whole
> system behavior which makes it harder to be accepted.
> Also here is some reference on how vmware does for this problem:
> https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.avail.doc/GUID-33A65FF7-DA22-4DC5-8B18-5A7F97CCA536.html
> They introduced both host heartbeat and datastore heartbeat.

One thing to note here: vmware has tight coupling between all their respective components. For example, they know exactly what datastore is providing storage to the VMs and have ways of communicating with the datastore independently of the compute nodes.

In a k8s cluster, everything is mix and match. It's difficult to find a solution here regarding how to handle workloads during node failure that's one size fits all. This is why KubeVirt offloads handling the node failure scenario to a system wide component. This allows KubeVirt to simply react to whatever state it observes in the cluster regarding workload availability.

KubeVirt is designed to react to the state the cluster tells us pod workloads are in. So, handling this at a system wide level aligns with that design.

All that said, we still want to be as pragmatic as possible. If there's an approach for handling node failure that makes sense at the KubeVirt level, there's no reason why it can't be explored. We'd just need to make sure we preserve the current expected dependency on system wide behavior as well.

Fabian Deutsch

Jun 14, 2021, 3:14:56 PM
to Zang Li, Roman Mohr, dvo...@redhat.com, kubevirt-dev
On Fri, Jun 4, 2021 at 8:18 PM 'Zang Li' via kubevirt-dev <kubevi...@googlegroups.com> wrote:
> Thanks Roman and Andrew for your great insights on this issue! I
> understand the concerns here are concurrent disk access.
> VirtualMachineInstanceReplicaSet works because it avoids concurrent
> writes by using an ephemeral disk or read only disk. What I was hoping
> to achieve is something like a cold migration of the VM when failure
> is detected. Is there a way to tell if a VM is working properly? If we
> are worried about concurrent writing, is there a way to probe that? Is
> there a way to block that from outside?

I do think that the last three questions outline the dilemma: something is wrong, we just don't know in detail what. The best way to get some ground truth is by killing the node. Then we simply know that it is not running anymore.
 

Fabian Deutsch

Jun 14, 2021, 3:22:34 PM
to David Vossel, Zang Li, Roman Mohr, kubevirt-dev
On Wed, Jun 9, 2021 at 2:35 PM David Vossel <dvo...@redhat.com> wrote:


> On Tue, Jun 8, 2021 at 3:04 PM Zang Li <zan...@google.com> wrote:
> > Hi Roman,
> >
> > I am still debating whether we should do this in a generic way or we
> > solve it at kubevirt level because the former will impact the whole
> > system behavior which makes it harder to be accepted.
> > Also here is some reference on how vmware does for this problem:
> > https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.avail.doc/GUID-33A65FF7-DA22-4DC5-8B18-5A7F97CCA536.html
> > They introduced both host heartbeat and datastore heartbeat.
>
> one thing to note here, vmware has tight coupling between all their respective components.  For example, they know exactly what the datastore is that is providing storage to the VMs and have ways of communicating with the datastore independently of the compute nodes.

+1

This is also the case, e.g., with oVirt.
 

> In a k8s cluster, everything is mix and match. It's difficult to find a solution here regarding how to handle workloads during node failure that's one size fits all. This is why KubeVirt offloads handling the node failure scenario to a system wide component. This allows KubeVirt to simply react to whatever state it observes in the cluster regarding workload availability.
>
> KubeVirt is designed to react to the state the cluster tells us pod workloads are in. So, handling this at a system wide level aligns with that design.

Yes … Maybe the key takeaway here is: If we agree that it is a node problem, then it's a cluster problem. And KubeVirt is an extension to the cluster.
And let me add that the absence of such a mechanism in KubeVirt does not reflect the importance we are giving to it :)

Just like, e.g., hot plug of disks or memory, fencing is a feature that we believe needs to be solved in the platform first, as this problem is also not KubeVirt specific.
Think of a shared PVC which is attached to a node that becomes unavailable. The application requires this PVC to be working - somehow the situation must be cleared in order to allow the application inside the pod to continue to operate. This is the same problem we are having with VMs.
 

> All that said, we still want to be as pragmatic as possible. If there's an approach for handling node failure that makes sense at the KubeVirt level, there's no reason why it can't be explored. We'd just need to make sure we preserve the current expected dependency on system wide behavior as well.

Yes - we'd be open to it. But seeing how much time Andrew has spent on this topic, I became a little disillusioned :)

Again: by solving it in the platform we aim for a higher goal that KubeVirt is part of: strengthening the platform where we see that it's not a VM-only thing. I've mentioned hot-plug; NUMA, SWAP, and rebalancing are other examples.
All three are seeing attention in K8s, and we are slowly starting to push on that side as well, to first let K8s benefit from this, then allow us to (re-)introduce this for VMs as well.

Did we explore if KubeVirt would be interested to "bundle" external components like medik8s?
 