Hi,
When a Kubernetes node fails, for example by losing its network connection, the node
is marked NotReady. Pods on the node are evicted after
pod-eviction-timeout, and replacement pods are scheduled on a different
node [1].
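For context, my understanding is that this timing comes from the default
NoExecute tolerations that the DefaultTolerationSeconds admission plugin
adds to pods, roughly like the sketch below (values assume the 300-second
default; actual values depend on the cluster's configuration):

    # Default not-ready/unreachable tolerations added to pods (sketch).
    # Lowering tolerationSeconds makes the pod get evicted sooner after
    # its node goes NotReady or unreachable.
    apiVersion: v1
    kind: Pod
    metadata:
      name: example
    spec:
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9
      tolerations:
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 300
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 300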
I thought a KubeVirt VM pod would behave the same way, but when I
tried it, it did not. The virt-launcher pod went into the Terminating state
as expected, but no new pod was created for the VM. Only after
the failed node reconnected did the old pod finish terminating
and a new pod for the VM get created. Is this behavior configurable?
We may want to fail the VM over to a different node when the node
originally hosting it fails non-gracefully.
Hi Dan,
Yes, the pod may still have outside connectivity if it uses Multus, but
doesn't the persistent storage go over the node's connection, so most likely
it cannot read from or write to the storage? There is a corner case where the
node cannot reach the API server but can still reach the storage - would the
status of the VM change if storage access fails? Is it possible for the
local virt-handler to detect the node's connection failure and stop the
local VM to prevent a split-brain situation? I can check how medik8s
works, but I think we might be able to solve the issue within
KubeVirt. What do you think?
On Thursday, June 3, 2021 at 12:50:08 PM UTC-4 Zang Li wrote:
> [quoted message snipped]

KubeVirt is designed to treat the VMI pods the same way the StatefulSet controller treats pods during node failure. My expectation is that the behavior you're describing, where a pod comes back on another node after a node failure, involves a ReplicaSet. Try the same thing with a StatefulSet to see our expected behavior for VMs.
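If you want to reproduce the comparison, a minimal StatefulSet along these lines should do it (names are just illustrative):

    # Minimal StatefulSet for comparison. If the node running web-0 goes
    # NotReady, web-0 stays Terminating and is not recreated elsewhere
    # until the node recovers or the pod is force deleted -- the
    # at-most-one-pod-per-identity guarantee that KubeVirt mirrors for VMs.
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: web
    spec:
      serviceName: web
      replicas: 1
      selector:
        matchLabels:
          app: web
      template:
        metadata:
          labels:
            app: web
        spec:
          containers:
          - name: web
            image: registry.k8s.io/pause:3.9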
Thanks Roman and Andrew for your great insights on this issue! I
understand the concern here is concurrent disk access.
VirtualMachineInstanceReplicaSet works because it avoids concurrent
writes by using an ephemeral or read-only disk. What I was hoping
to achieve is something like a cold migration of the VM when a failure
is detected. Is there a way to tell whether a VM is working properly? If we
are worried about concurrent writes, is there a way to probe for that? Is
there a way to block it from outside?
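For the first question, it looks like VMIs already support liveness/readiness
probes, so something along the lines of the sketch below could at least tell
us whether the guest is responding (assuming kubevirt.io/v1 and a guest that
answers on port 22 - it says nothing about whether another writer is touching
the disk):

    # Sketch: VMI with a TCP liveness probe against the guest (port 22
    # assumes an SSH server is running inside the guest). This checks
    # responsiveness only; it does not detect concurrent disk access.
    apiVersion: kubevirt.io/v1
    kind: VirtualMachineInstance
    metadata:
      name: probe-demo
    spec:
      livenessProbe:
        initialDelaySeconds: 120
        periodSeconds: 10
        tcpSocket:
          port: 22
      domain:
        devices:
          disks:
          - name: containerdisk
            disk:
              bus: virtio
        resources:
          requests:
            memory: 1Gi
      volumes:
      - name: containerdisk
        containerDisk:
          image: quay.io/kubevirt/cirros-container-disk-demo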
I agree that it is more efficient to solve this in a common
component. There are two aspects to the problem: failure detection
and failure remediation. The former already exists at the infrastructure
level and is surfaced through the Kubernetes node status. I understand that
the node status reflects the controller's view, so it
might not match what the node itself perceives or what the true situation is;
from my quick glance, that is part of what medik8s addresses. But do we
want the reaction to node failure to be generic as well, or specific to
each component? How should users recover their VMs in that
situation?
A poison-pill approach couldn't achieve that if the node
cannot be repaired, right? Sometimes it might be desirable to quarantine
the problematic node for further analysis instead of restarting it.
Hi Roman,
I am still debating whether we should do this in a generic way or
solve it at the KubeVirt level, because the former impacts the
behavior of the whole system, which makes it harder to get accepted.
Also, here is some reference on how VMware handles this problem:
https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.avail.doc/GUID-33A65FF7-DA22-4DC5-8B18-5A7F97CCA536.html
They introduced both host heartbeats and datastore heartbeats.
On Tue, Jun 8, 2021 at 3:04 PM Zang Li <zan...@google.com> wrote:
> [quoted message snipped]

One thing to note here: VMware has tight coupling between all of their respective components. For example, they know exactly which datastore is providing storage to the VMs and have ways of communicating with the datastore independently of the compute nodes.
In a k8s cluster, everything is mix and match. It's difficult to find a one-size-fits-all solution for handling workloads during node failure. This is why KubeVirt offloads the node-failure scenario to a system-wide component: KubeVirt is designed to react to whatever state the cluster reports for pod workloads, so handling failure at a system-wide level aligns with that design.
All that said, we still want to be as pragmatic as possible. If there's an approach for handling node failure that makes sense at the KubeVirt level, there's no reason why it can't be explored. We'd just need to make sure we also preserve the current expected dependency on system-wide behavior.