Background evictions (aka descheduler support) and failed migrations


Simone Tiraboschi

Jul 3, 2025, 7:59:27 AM
to kubevirt-dev, Federico Fossemo, Vladik Romanovsky, Jan Chaloupka, Ricardo Maraschini, kva...@gmail.com
Hi,
I'm writing because I believe there's a design flaw in the current interaction between the descheduler and failed migrations — particularly when failures become recurring for specific VMs.

The original proposal ( https://github.com/kubevirt/community/blob/main/design-proposals/descheduler-support.md ) aligns well with the current implementation, but certain edge cases may lead to unintended behaviors.

Current flow:
- We now have a load-aware descheduler that periodically (and asynchronously) checks node load. It may trigger evictions on virt-launcher pods via the eviction API (without real awareness of the underlying VMs). Due to KubeVirt logic, this will initiate a live migration to rebalance the cluster.
- The eviction request for the virt-launcher pod is intercepted by a validating webhook on the KubeVirt side, which checks if the corresponding VMI is migratable (based on its evictionStrategy and its migration readiness according to a condition).
- If the VMI is migratable, the webhook responds with HTTP 422 (UnprocessableEntity), but annotates the virt-launcher pod with descheduler.alpha.kubernetes.io/eviction-in-progress. This acts as an out-of-band signal to the descheduler. Simultaneously, the Status.EvacuationNodeName field is set on the VMI to signal the evacuation controller (see the rough sketch after this list).
- On subsequent runs, the stateless descheduler notices the annotation and treats the eviction as in progress.
- If the migration succeeds, the new virt-launcher pod (on a different node) won't carry the annotation, and Status.EvacuationNodeName will be cleared from the VMI. Both the descheduler and the evacuation controller consider their work done.
- However, if the migration fails, the annotation is removed from the pod (so the descheduler no longer considers it in progress), but Status.EvacuationNodeName remains on the VMI. The evacuation controller will continue retrying the migration indefinitely, using a backoff strategy.
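
To make the webhook step more concrete, here is a rough Go sketch of the behaviour described above. It is illustrative only and not the actual virt-api code: the vmiMigratable argument stands in for the real evictionStrategy/condition checks, and the update that sets Status.EvacuationNodeName on the VMI is omitted.

// Illustrative sketch only, not the actual virt-api implementation.
package eviction

import (
	"context"
	"fmt"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

const evictionInProgressAnno = "descheduler.alpha.kubernetes.io/eviction-in-progress"

// admitEviction mirrors the flow in the bullets above: deny the eviction (the
// pod is not deleted), annotate the virt-launcher pod as an out-of-band signal
// to the stateless descheduler, and leave the actual live migration to the
// evacuation controller (triggered separately via Status.EvacuationNodeName).
func admitEviction(ctx context.Context, kube kubernetes.Interface, req *admissionv1.AdmissionRequest, vmiMigratable bool) *admissionv1.AdmissionResponse {
	if !vmiMigratable {
		// Non-migratable VMIs follow their evictionStrategy; out of scope here.
		return &admissionv1.AdmissionResponse{UID: req.UID, Allowed: true}
	}

	// Annotate the virt-launcher pod so the descheduler treats the eviction
	// as "in progress" on its next run. Error handling is omitted in this sketch.
	patch := []byte(fmt.Sprintf(`{"metadata":{"annotations":{"%s":""}}}`, evictionInProgressAnno))
	_, _ = kube.CoreV1().Pods(req.Namespace).Patch(ctx, req.Name, types.StrategicMergePatchType, patch, metav1.PatchOptions{})

	// Deny the eviction itself; the VMI will be live-migrated instead.
	return &admissionv1.AdmissionResponse{
		UID:     req.UID,
		Allowed: false,
		Result: &metav1.Status{
			Code:    http.StatusUnprocessableEntity, // the 422 mentioned above
			Message: "eviction intercepted: the VMI will be evacuated via live migration",
		},
	}
}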

The issue:
This leads to a situation where we have two independent retry loops acting on different objects:
the descheduler sees the eviction as failed and may attempt to migrate a second VM from the overloaded node.
Meanwhile, the KubeVirt evacuation controller continues retrying the migration of the original VM, even if the node has been balanced.
This can result in conflicting and counterproductive actions, where older migration attempts — no longer necessary — trigger unnecessary live migrations and potentially unbalance the cluster again.

Complicating the matter further, we don’t currently track:
- the reason for the migration (e.g., descheduler-initiated),
- nor a timestamp,
on either Status.EvacuationNodeName or on the VirtualMachineInstanceMigration (VMIM) object. As a result, admins only see a system-initiated migration that may worsen the balance of the cluster, with no context.
To understand what really happened, the cluster admin has to correlate descheduler and virt-launcher logs, knowing in advance what to look for.
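
Just to illustrate the kind of context I mean, something as small as the following would already help; the annotation keys here are purely hypothetical and do not exist in any current API.

// Hypothetical illustration: the annotation keys below are made up and are
// not part of any current KubeVirt API.
package evacuation

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const (
	annoEvacuationReason      = "kubevirt.io/evacuation-reason"       // e.g. "descheduler", "node-drain"
	annoEvacuationRequestedAt = "kubevirt.io/evacuation-requested-at" // RFC3339 timestamp
)

// tagMigration records why and when a system-initiated migration was created,
// so a VMIM (or the events derived from it) carries the context that is
// missing today.
func tagMigration(meta *metav1.ObjectMeta, reason string, requestedAt time.Time) {
	if meta.Annotations == nil {
		meta.Annotations = map[string]string{}
	}
	meta.Annotations[annoEvacuationReason] = reason
	meta.Annotations[annoEvacuationRequestedAt] = requestedAt.UTC().Format(time.RFC3339)
}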

Ideally, we should avoid having two independent retry loops. Below are a few possible approaches I considered, though none of them looks fully satisfying to me.

Proposal 1: Keep the annotation during failures
Idea: The virt-controller should not remove the descheduler.alpha.kubernetes.io/eviction-in-progress annotation if the migration fails (see the sketch after this proposal).
Pros:
- Semantically correct: The eviction is still "in progress" as there's no cancellation mechanism, and the evacuation controller will keep retrying forever.
- Prevents redundant retries by the descheduler.
Cons:
- The descheduler might have to wait unnecessarily due to the backoff period of the evacuation controller, slowing down rebalancing.
- The descheduler limits the number of concurrent evictions. If several are stuck (due to persistent failures), it might block the descheduler entirely from making further progress.
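
For clarity, this is roughly the behaviour change I have in mind; the types are simplified stand-ins, not the current virt-controller code.

// Sketch of Proposal 1 with simplified stand-in types; not the current
// virt-controller code.
package evacuation

const evictionInProgressAnno = "descheduler.alpha.kubernetes.io/eviction-in-progress"

// launcherPod and vmiStatus are trimmed-down stand-ins for the real objects.
type launcherPod struct {
	Annotations map[string]string
}

type vmiStatus struct {
	EvacuationNodeName string
}

// onMigrationFailed decides whether the eviction-in-progress annotation stays
// on the (still running) source virt-launcher pod after a failed migration.
func onMigrationFailed(pod *launcherPod, status vmiStatus) {
	if status.EvacuationNodeName != "" {
		// The evacuation controller will retry, so from the descheduler's
		// point of view the eviction really is still in progress:
		// keep the annotation (this is the proposed change).
		return
	}
	// No pending evacuation: the signal is no longer needed.
	delete(pod.Annotations, evictionInProgressAnno)
}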


Proposal 2: Try descheduler-initiated evictions only once
Idea: Add a mechanism to detect descheduler-originated eviction requests (e.g., via user-agent or ancillary metadata in the eviction request). Only attempt them once (see the detection sketch after this proposal).
Pros:
- Only the descheduler handles retries, based on actual cluster state.
- Migration intent ("reason") could be propagated to the VMIM and surfaced via events, improving observability for admins.
Cons:
- Potentially confusing behavior: some evictions are retried forever, others are not, depending on the origin. See also https://github.com/kubevirt/kubevirt/issues/9585 which is, in a way, the root cause of this discussion.
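
As a sketch of the detection part: the admission request already carries the caller's identity, and the service-account matching below is just an assumption that would have to be made configurable.

// Hypothetical detection of descheduler-originated evictions; the
// service-account naming convention below is an assumption, not an API.
package eviction

import (
	"strings"

	admissionv1 "k8s.io/api/admission/v1"
)

// isDeschedulerEviction guesses whether the eviction request was issued by
// the descheduler, based on the requesting service account, e.g.
// "system:serviceaccount:kube-system:descheduler-sa" (example value only).
func isDeschedulerEviction(req *admissionv1.AdmissionRequest) bool {
	return strings.HasPrefix(req.UserInfo.Username, "system:serviceaccount:") &&
		strings.Contains(req.UserInfo.Username, ":descheduler")
}

The webhook could then record that origin (e.g. with a reason annotation like the one sketched earlier), and the evacuation controller could skip its retry loop for descheduler-originated requests.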


Proposal 3: Set an upper bound on eviction controller retries
Idea: Introduce a mechanism in the evacuation controller to track the number of migration retry attempts and stop retrying after reaching a defined threshold (see the sketch after this proposal).
Pros:
- Relatively simple to implement.
- Prevents indefinite retries of failed migrations that may no longer be relevant.
Cons:
- It's only a mitigation, not a full solution (during the retry window, the dual retry loop issue still exists).
- Could interfere with other legitimate use cases, like node drains where persistence is expected.
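
A sketch of what I mean; the annotation key and the threshold are made up.

// Sketch of Proposal 3; the annotation key and threshold are made up.
package evacuation

import "strconv"

const (
	evacuationAttemptsAnno = "kubevirt.io/evacuation-attempts" // hypothetical key
	maxEvacuationAttempts  = 5                                 // example threshold
)

// shouldRetryEvacuation bumps the failed-attempt counter stored in the VMI's
// annotations (assumed non-nil here) and reports whether the evacuation
// controller should create another migration. When it returns false, the
// caller would also clear the counter and Status.EvacuationNodeName so that
// neither retry loop keeps acting on a stale evacuation request.
func shouldRetryEvacuation(annotations map[string]string) bool {
	attempts, _ := strconv.Atoi(annotations[evacuationAttemptsAnno])
	attempts++
	annotations[evacuationAttemptsAnno] = strconv.Itoa(attempts)
	return attempts < maxEvacuationAttempts
}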

Do you have any thoughts or alternative ideas?

Thanks,
Simone

Fabian Deutsch

Aug 12, 2025, 6:07:47 AM
to Simone Tiraboschi, Luboslav Pivarc, Igor Bezukh, kubevirt-dev, Federico Fossemo, Vladik Romanovsky, Jan Chaloupka, Ricardo Maraschini, kva...@gmail.com
Simone, please see my comments inline below

On Thu, Jul 3, 2025 at 1:59 PM 'Simone Tiraboschi' via kubevirt-dev <kubevi...@googlegroups.com> wrote:
Hi,
I'm writing because I believe there's a design flaw in the current interaction between the descheduler and failed migrations — particularly when failures become recurring for specific VMs.

The original proposal ( https://github.com/kubevirt/community/blob/main/design-proposals/descheduler-support.md ) aligns well with the current implementation, but certain edge cases may lead to unintended behaviors.

Current flow:
- We now have a load-aware descheduler that periodically (and asynchronously) checks node load. It may trigger evictions on virt-launcher pods via the eviction API (without real awareness of the underlying VMs). Due to KubeVirt logic, this will initiate a live migration to rebalance the cluster.
- The eviction request for the virt-launcher pod is intercepted by a validating webhook on the KubeVirt side, which checks if the corresponding VMI is migratable (based on its evictionStrategy and its migration readiness according to a condition).
- If the VMI is migratable, the webhook responds with HTTP 422 (UnprocessableEntity), but annotates the virt-launcher pod with descheduler.alpha.kubernetes.io/eviction-in-progress. This acts as an out-of-band signal to the descheduler. Simultaneously, the Status.EvacuationNodeName field is set on the VMI to signal the evacuation controller.
- On subsequent runs, the stateless descheduler notices the annotation and treats the eviction as in progress.
- If the migration succeeds, the new virt-launcher pod (on a different node) won't carry the annotation, and Status.EvacuationNodeName will be cleared from the VMI. Both the descheduler and the evacuation controller consider their work done.
- However, if the migration fails, the annotation is removed from the pod (so the descheduler no longer considers it in progress), but Status.EvacuationNodeName remains on the VMI. The evacuation controller will continue retrying the migration indefinitely, using a backoff strategy.

The issue:
This leads to a situation where we have two independent retry loops acting on different objects:
the descheduler sees the eviction as failed and may attempt to migrate a second VM from the overloaded node.
Meanwhile, the KubeVirt evacuation controller continues retrying the migration of the original VM, even if the node has been balanced.
This can result in conflicting and counterproductive actions, where older migration attempts — no longer necessary — trigger unnecessary live migrations and potentially unbalance the cluster again.

Yes, this is problematic. And the upper-level controller, the descheduler in this case, is not even aware of the problem and cannot do the right thing.
 

Complicating the matter further, we don’t currently track:
- the reason for the migration (e.g., descheduler-initiated),

We really want to enhance our migration code path to allow injecting a reason from the different callers.
 
- nor a timestamp,
on either Status.EvacuationNodeName or on the VirtualMachineInstanceMigration (VMIM) object. As a result, admins only see a system-initiated migration that may worsen the balance of the cluster, with no context.
To understand what really happened, the cluster admin has to correlate descheduler and virt-launcher logs, knowing in advance what to look for.

Ideally, we should avoid having two independent retry loops. Below are a few possible approaches I considered, though none of them looks fully satisfying to me.

Proposal 1: Keep the annotation during failures
Idea: The virt-controller should not remove the descheduler.alpha.kubernetes.io/eviction-in-progress annotation if the migration fails.
Pros:
- Semantically correct: The eviction is still "in progress" as there's no cancellation mechanism, and the evacuation controller will keep retrying forever.
- Prevents redundant retries by the descheduler.
Cons:
- The descheduler might have to wait unnecessarily due to the backoff period of the evacuation controller, slowing down rebalancing.

OTOH this is then exactly achieving what we wanted: protecting the cluster from too many VMIMs.
 
- The descheduler limits the number of concurrent evictions. If several are stuck (due to persistent failures), it might block the descheduler entirely from making further progress.

Yes.

Overall this solution seems to get things right: By changing the annotation semantics, KubeVirt is telling the descheduler that the eviction is still in progress (which is correct, the workload (VM) was not yet evicted), therefore the upper controller can make a better decision.

Let me also link to https://github.com/kubevirt/kubevirt/pull/14587, which came up (not too) recently and seems to work around the problem you are describing here.

@Luboslav Pivarc @Igor Bezukh looping you in, as you commented on the PR.

tl;dr
- Probably good to explore whether changing the semantics of the annotation addresses the problem, and to find out whether it leads to new problems
- Consider merging the cancelevict PR, but in reality this is just a last-resort lifeline to clean up a cluster after failure conditions

After all, this is something we have to fix, otherwise we are DoSing our clusters with VMIMs.

- fabian
 


