[kubevirt-dev] VMI re-enqueued after critical network error


Felix Enrique Llorente Pastora

Jun 17, 2021, 4:51:07 AM
to kubevirt-dev
Hi All,

   Looking at the following issue [1], the problem is that the nodes are unable to create tap devices, which ends up propagating a critical network error. I was expecting this to mark the VMI as Failed and not be re-enqueued. Looking at the code, it seems we always re-enqueue on error [2]. Shouldn't KubeVirt stop re-enqueuing if the error is critical?

BR

Roman Mohr

Jun 17, 2021, 5:18:10 AM
to Felix Enrique Llorente Pastora, kubevirt-dev, Vossel, David
Hi,

On Thu, Jun 17, 2021 at 10:51 AM Felix Enrique Llorente Pastora <ello...@redhat.com> wrote:
Hi All,

   Looking at the following issue [1], the problem is that the nodes are unable to create tap devices, which ends up propagating a critical network error. I was expecting this to mark the VMI as Failed and not be re-enqueued. Looking at the code, it seems we always re-enqueue on error [2]. Shouldn't KubeVirt stop re-enqueuing if the error is critical?

Right now we behave similarly to the kubelet on errors. The kubelet will, for instance, also indefinitely try to create a pod, even if there is a permanent error on the node and the container sandboxes can't be created.

This approach has pros and cons. The con is that your workload can be stuck indefinitely if you don't resolve the error somehow, which can in the worst case lead to full downtime if your application cannot handle replicas.
The pro is that, in combination with the retry back-off, we don't overload the cluster with recreated pods, which would likely end up on the same node again and fail there again. Coincidentally, David just posted a mail [3] where he describes what can happen to a cluster if workloads just fail and get rescheduled.

It is, by the way, something we see pretty often in kubevirt CI as well when nodes have issues. A node with issues tends to kill pods fast, leading to fast re-scheduling onto exactly that node, because from the scheduler's perspective it is the most attractive one (no long-running workloads present, which means a lot of free resources from the scheduler's perspective).

However, there are cases where the kubelet can completely reject pods. That is something we can't do right now. It may make sense to have this flow for certain use cases.

Best regards,
Roman


--
Quique Llorente

CNV networking Senior Software Engineer

Red Hat EMEA

ello...@redhat.com   


--
You received this message because you are subscribed to the Google Groups "kubevirt-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kubevirt-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kubevirt-dev/CAHVoYmLU%2BcdE2w6JaaPJE8eSJ3ti_QneqjgiYB%2BSoZLMHhCZXA%40mail.gmail.com.

David Vossel

Jun 17, 2021, 9:09:35 AM
to Roman Mohr, Felix Enrique Llorente Pastora, kubevirt-dev
On Thu, Jun 17, 2021 at 5:18 AM Roman Mohr <rm...@redhat.com> wrote:

Right now we behave similarly to the kubelet on errors. The kubelet will, for instance, also indefinitely try to create a pod, even if there is a permanent error on the node and the container sandboxes can't be created.

This approach has pros and cons. The con is that your workload can be stuck indefinitely if you don't resolve the error somehow, which can in the worst case lead to full downtime if your application cannot handle replicas.

In our case, once the VMI pod hits the "Running" phase, the VMI times out after 5 minutes regardless of whether we retry or ignore critical errors during the virt-handler -> virt-launcher startup flow.
 
The pro is that, in combination with the retry back-off, we don't overload the cluster with recreated pods, which would likely end up on the same node again and fail there again. Coincidentally, David just posted a mail [3] where he describes what can happen to a cluster if workloads just fail and get rescheduled.

It is, by the way, something we see pretty often in kubevirt CI as well when nodes have issues. A node with issues tends to kill pods fast, leading to fast re-scheduling onto exactly that node, because from the scheduler's perspective it is the most attractive one (no long-running workloads present, which means a lot of free resources from the scheduler's perspective).

However, there are cases where the kubelet can completely reject pods. That is something we can't do right now. It may make sense to have this flow for certain use cases.


If the node is unhealthy for hosting VMIs for some reason, then virt-handler should stop heartbeating, right? That's the signal we have to stop getting VMIs scheduled there.

A combination of enhanced virt-handler health analysis and a VM crash-loop back-off (meaning exponentially backing off the rescheduling of VMI pods when the VMI didn't reach phase=Running) seems like it could help here.

Edward Haas

Jun 17, 2021, 9:34:23 AM
to Roman Mohr, Felix Enrique Llorente Pastora, kubevirt-dev, David Vossel
On Thu, Jun 17, 2021 at 5:18 AM Roman Mohr <rm...@redhat.com> wrote:

Right now we behave similarly to the kubelet on errors. The kubelet will, for instance, also indefinitely try to create a pod, even if there is a permanent error on the node and the container sandboxes can't be created.

This approach has pros and cons. The con is that your workload can be stuck indefinitely if you don't resolve the error somehow, which can in the worst case lead to full downtime if your application cannot handle replicas.

I think there is a subtle difference between the kubelet and what KubeVirt does: the kubelet will retry by starting a new pod if one fails, so any resources it created before failing get cleaned up with the removal of the containers.
But with KubeVirt, I think we try to re-deploy into the same pod rather than start a new one.
I am not 100% sure about this, but based on the errors this seemed to be the case.

Felix Enrique Llorente Pastora

Jun 18, 2021, 1:19:34 AM
to Edward Haas, David Vossel, Roman Mohr, kubevirt-dev
On Thu, 17 Jun 2021 at 15:34, Edward Haas <edw...@redhat.com> wrote:


I think there is a subtle difference between the kubelet and what KubeVirt does: the kubelet will retry by starting a new pod if one fails, so any resources it created before failing get cleaned up with the removal of the containers.
But with KubeVirt, I think we try to re-deploy into the same pod rather than start a new one.
I am not 100% sure about this, but based on the errors this seemed to be the case.

From what I know, I am 99% sure you are right: virt-controller creates the virt-launcher pod, and virt-handler taps into the network namespace to set up networking. If that fails and the VMI is not marked as Failed, it will tap in again on the very same pod, so re-enqueueing does not mean virt-launcher re-creation.

Roman Mohr

Jun 18, 2021, 3:17:14 AM
to Felix Enrique Llorente Pastora, Edward Haas, David Vossel, kubevirt-dev
On Fri, Jun 18, 2021 at 7:19 AM Felix Enrique Llorente Pastora <ello...@redhat.com> wrote:


On Thu, 17 Jun 2021 at 15:34, Edward Haas <edw...@redhat.com> wrote:


On Thu, Jun 17, 2021 at 4:09 PM David Vossel <dvo...@redhat.com> wrote:


On Thu, Jun 17, 2021 at 5:18 AM Roman Mohr <rm...@redhat.com> wrote:
Hi,

On Thu, Jun 17, 2021 at 10:51 AM Felix Enrique Llorente Pastora <ello...@redhat.com> wrote:
Hi All,

   Looking at the following issue [1] the problem is related to the nodes not being able to create tap devices and this ends up propagating a critical network error, I was expecting this to mark the VMI as Failed and not being re-enqueue. Looking at the code looks like we are always re-enqueuing on error [2] shouldn't kubevirt to stop re-enqueing if the error is critical ?

Right now we behave here similar to the kubelet on errors. The kubelet will for instance also indefinitely try to create a pod, even if there is a permanent error on the node and the container sandboxes can't be created.

This approach has pros- and cons. The con is that your workload can be stuck indefinitely if you don't resolve the error somehow, which can worst-case lead to full downtimes if your application can not handle replicas.

I think there is a subtle difference between the kubelet and what KubeVirt does: the kubelet will retry by starting a new pod if one fails, so any resources it created before failing get cleaned up with the removal of the containers.
But with KubeVirt, I think we try to re-deploy into the same pod rather than start a new one.
I am not 100% sure about this, but based on the errors this seemed to be the case.

From what I know, I am 99% sure you are right: virt-controller creates the virt-launcher pod, and virt-handler taps into the network namespace to set up networking. If that fails and the VMI is not marked as Failed, it will tap in again on the very same pod, so re-enqueueing does not mean virt-launcher re-creation.

Yes, but that is not the critical point. The point is that the kubelet has to work with what it has at its level (e.g. a node with a buggy containerd version). It does not matter whether it destroys an already-created network namespace and starts fresh, or retries by picking up the last action in the network namespace. Either way it will retry on such a failure without marking the pod as failed, since failing it could cause re-create loops which would likely end with the same result and put huge pressure on the cluster.

Apart from the overall start timeout which David mentioned (which is problematic, as we see), we do the same on our level. virt-handler has to work with what it has at its level. For instance, a device plugin provides devices with a delay, a kernel version takes longer until a device becomes ready, or the kubelet has a bug which causes things to happen in an unwanted asynchronous order when the pod should already be ready (all real cases we had).
It has to assume that failing the VMI will not resolve the situation on the node, but may instead create a lot of load on the cluster due to rescheduling and so on.

Therefore the kubelet and virt-handler behave as follows:
 1. At any point in the flow, if an error occurs, we either retry directly from that point, or fall back to a checkpoint, restore some state, and then retry ([1] seems to indicate that we have a bug in the setup code which destroys idempotency)
 2. If errors occur, they are reported as events (and there is a suppression mechanism in the event recorder to not overload the system)
 3. If errors occur, they may be expressed on the API, e.g. via conditions (on retries, if a condition already shows the error, we do NOT update it again and just assume we are still stuck at the same point, to not overload the system)

Just to highlight it again, the main point is that we retry on the node level. Whether the retry goes back to a checkpoint and restores something first, or continues directly at the error location, is not important. What is important is that the retry is idempotent and that we don't overload the cluster.
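The idempotency point can be sketched as follows: each setup step checks whether its effect already exists before acting, so a retry after a partial failure simply continues from where it left off. The types and steps here are invented for the illustration; the real virt-handler network setup is considerably more involved:

```go
package main

import "fmt"

// netState is an illustrative stand-in for the contents of the pod's
// network namespace; the real state lives in kernel devices.
type netState struct {
	bridgeCreated bool
	tapCreated    bool
}

// setupNetworking is an idempotent setup sketch: every action first
// checks whether its effect is already present, so re-running it after a
// partial failure continues the work instead of erroring on resources
// that already exist.
func setupNetworking(s *netState) error {
	if !s.bridgeCreated {
		s.bridgeCreated = true // e.g. create the bridge device
	}
	if !s.tapCreated {
		s.tapCreated = true // e.g. create the tap device
	}
	return nil
}

func main() {
	// Partial progress left behind by a failed earlier attempt.
	s := &netState{bridgeCreated: true}
	// Retrying is safe: the bridge step is skipped, only the tap is created.
	_ = setupNetworking(s)
	fmt.Println(s.bridgeCreated, s.tapCreated)
}
```

A non-idempotent version of the same flow (create unconditionally, fail on "already exists") is exactly the kind of bug [1] seems to point at.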

Best regards,
Roman