I'm writing because I believe there's a design flaw in the current interaction between the descheduler and failed migrations — particularly when failures become recurring for specific VMs.
The original proposal (https://github.com/kubevirt/community/blob/main/design-proposals/descheduler-support.md) aligns well with the current implementation, but certain edge cases may lead to unintended behavior.
Current flow:
- We now have a load-aware descheduler that periodically (and asynchronously) checks node load. It may trigger evictions on virt-launcher pods via the eviction API (without real awareness of the underlying VMs). Due to KubeVirt logic, this will initiate a live migration to rebalance the cluster.
- The eviction request for the virt-launcher pod is intercepted by a validating webhook on the KubeVirt side, which checks whether the corresponding VMI is migratable (based on its evictionStrategy and on its LiveMigratable condition).
- If the VMI is migratable, the webhook responds with HTTP 422 (UnprocessableEntity) but annotates the virt-launcher pod with descheduler.alpha.kubernetes.io/eviction-in-progress. This acts as an out-of-band signal to the descheduler. At the same time, the Status.EvacuationNodeName field is set on the VMI to signal the evacuation controller.
- On subsequent runs, the stateless descheduler notices the annotation and treats the eviction as in progress.
- If the migration succeeds, the new virt-launcher pod (on a different node) won't carry the annotation, and Status.EvacuationNodeName will be cleared from the VMI. Both the descheduler and the evacuation controller consider their work done.
- However, if the migration fails, the annotation is removed from the pod (so the descheduler no longer considers the eviction in progress), but Status.EvacuationNodeName remains set on the VMI. The evacuation controller will continue retrying the migration indefinitely, using a backoff strategy (see the sketch below).
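To make the divergence in the last two bullets concrete, here is a minimal, purely illustrative Go sketch (simplified local types, not the real KubeVirt or descheduler ones) of what each side observes after a failed migration:

```go
package main

import "fmt"

// Purely illustrative stand-ins for the objects discussed above;
// these are not the actual KubeVirt or descheduler types.
type launcherPod struct {
	Annotations map[string]string
}

type vmi struct {
	EvacuationNodeName string // stands in for Status.EvacuationNodeName
}

const evictionInProgress = "descheduler.alpha.kubernetes.io/eviction-in-progress"

// What the (stateless) descheduler looks at on its next run.
func deschedulerSeesEvictionInProgress(p launcherPod) bool {
	_, ok := p.Annotations[evictionInProgress]
	return ok
}

// What keeps the evacuation controller retrying the migration.
func evacuationControllerStillRetrying(v vmi) bool {
	return v.EvacuationNodeName != ""
}

func main() {
	// After a failed migration: the annotation has been removed from the pod,
	// but Status.EvacuationNodeName is still set on the VMI.
	pod := launcherPod{Annotations: map[string]string{}}
	instance := vmi{EvacuationNodeName: "overloaded-node"}

	// false -> the descheduler may go after another pod on the same node
	fmt.Println("descheduler sees eviction in progress:", deschedulerSeesEvictionInProgress(pod))
	// true -> the evacuation controller keeps its own retry loop going
	fmt.Println("evacuation controller keeps retrying:", evacuationControllerStillRetrying(instance))
}
```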
The issue:
This leads to a situation where we have two independent retry loops acting on different objects:
- the descheduler sees the eviction as failed and may try to evict a second virt-launcher pod (i.e., migrate another VM) from the overloaded node;
- meanwhile, the KubeVirt evacuation controller keeps retrying the migration of the original VM, even if the node has been rebalanced in the meantime.
This can result in conflicting and counterproductive actions: migration attempts that are no longer necessary still get retried, triggering unnecessary live migrations that can unbalance the cluster again.
Complicating the matter further, we don’t currently track:
- the reason for the migration (e.g., descheduler-initiated),
- the time the eviction was requested,
either alongside Status.EvacuationNodeName on the VMI or on the VirtualMachineInstanceMigration (VMIM) object. As a result, admins only see a system-initiated migration that may even worsen the balance of the cluster, with no context.
To understand what really happened, the cluster admin has to cross-reference descheduler and virt-launcher logs, already knowing in advance what to look for.
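Just to make the missing context concrete, what I would like to see recorded is something along these lines (the annotation keys below are hypothetical; nothing like them exists today):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical annotations that could be attached to the VMIM (or recorded
	// alongside Status.EvacuationNodeName). These keys are made up for the sake
	// of the example and are not part of any current API.
	evacuationContext := map[string]string{
		"kubevirt.io/evacuation-reason":       "descheduler-eviction",
		"kubevirt.io/evacuation-requested-at": time.Now().UTC().Format(time.RFC3339),
	}
	fmt.Println(evacuationContext)
}
```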
Ideally, we should avoid having two independent retry loops. Below are two possible approaches I considered, though neither looks fully satisfying to me.
Proposal 1: Keep the annotation during failures
Idea: The virt-controller should not remove the descheduler.alpha.kubernetes.io/eviction-in-progress annotation if the migration fails (a rough sketch follows after the cons below).
Pros:
- Semantically correct: The eviction is still "in progress" as there's no cancellation mechanism, and the evacuation controller will keep retrying forever.
- Prevents redundant retries by the descheduler.
Cons:
- The descheduler might have to wait unnecessarily due to the backoff period of the evacuation controller, slowing down rebalancing.
- The descheduler limits the number of concurrent evictions. If several evictions are stuck on persistently failing migrations, the descheduler could be blocked entirely from making further progress.
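As a rough illustration of Proposal 1 (hypothetical helper, not actual virt-controller code), the change would essentially gate the annotation removal on the outcome of the migration:

```go
package main

import "fmt"

// Hypothetical, heavily simplified view of the relevant migration outcome;
// the real virt-controller logic is of course more involved.
type migrationPhase string

const (
	migrationSucceeded migrationPhase = "Succeeded"
	migrationFailed    migrationPhase = "Failed"
)

const evictionInProgress = "descheduler.alpha.kubernetes.io/eviction-in-progress"

// shouldRemoveEvictionAnnotation sketches Proposal 1: only drop the descheduler
// annotation once the evacuation actually completed, so a failed migration keeps
// signalling "eviction in progress" to the descheduler.
func shouldRemoveEvictionAnnotation(phase migrationPhase) bool {
	return phase == migrationSucceeded
}

func main() {
	for _, phase := range []migrationPhase{migrationSucceeded, migrationFailed} {
		fmt.Printf("migration %s -> remove %q: %v\n",
			phase, evictionInProgress, shouldRemoveEvictionAnnotation(phase))
	}
}
```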
Proposal 2: Try descheduler-initiated evictions only once
Idea: Add a mechanism to detect descheduler-originated eviction requests (e.g., via the user-agent or other ancillary metadata of the eviction request), and attempt the resulting migration only once (see the sketch at the end of this message).
Pros:
- Only the descheduler handles retries, based on actual cluster state.
- Migration intent ("reason") could be propagated to the VMIM and surfaced via events, improving observability for admins.
Cons:
- Potentially confusing behavior: some evictions would be retried forever and others only once, depending on their origin. See also https://github.com/kubevirt/kubevirt/issues/9585, which is arguably the root cause of this discussion.
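For completeness, here is a minimal sketch of what the detection in the eviction webhook could look like. It assumes the descheduler runs under a dedicated, well-known service account and that we key on the identity carried by the admission request; both the service account name and the detection mechanism are assumptions of mine, not something the current code does:

```go
package main

import (
	"fmt"

	admissionv1 "k8s.io/api/admission/v1"
	authenticationv1 "k8s.io/api/authentication/v1"
)

// Assumed identity of the descheduler; in a real cluster this would have to be
// configurable, since the service account name is deployment-specific.
const deschedulerUser = "system:serviceaccount:kube-system:descheduler"

// isDeschedulerInitiated sketches the detection idea of Proposal 2: look at the
// requesting identity carried by the admission request for the eviction.
func isDeschedulerInitiated(req *admissionv1.AdmissionRequest) bool {
	return req != nil && req.UserInfo.Username == deschedulerUser
}

func main() {
	req := &admissionv1.AdmissionRequest{
		UserInfo: authenticationv1.UserInfo{Username: deschedulerUser},
	}
	// The webhook could then record this as the migration "reason" (e.g. as an
	// annotation on the VMIM), and the evacuation controller could skip retries
	// for such migrations.
	fmt.Println("descheduler-initiated eviction:", isDeschedulerInitiated(req))
}
```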