Disabling Feature Gates with running VMs

41 views
Skip to first unread message

Javier Cano Cano

unread,
Apr 3, 2024, 10:30:38 AM4/3/24
to kubevirt-dev
Hi all!

Recently, we have observed that if you enable a given feature gate, spin a VM using the feature behind the feature, remove the feature gate and try to stop the VM, you get an error from the vmi create admitter.

For instance, if we have a VMI trying to use downwardMetrics:

1. Enable the DownwardMetrics FG.
2. Spin a VM using the DownwardMetrics (add spec.template.spec.domain.devices.downwardMetrics: {})
3. Remove the DownwardMetrics FG.
4. Stop the VM.

Then you will get the following error:

Error stopping VirtualMachine Internal error occurred: admission webhook "virtualmachine-validator.kubevirt.io" denied the request: downwardMetrics virtio serial is not allowed: DownwardMetrics feature gate is not enabled

I've checked it out with other FG such as VSOCK, and it shows the same behavior.

Therefore, I have two questions:

1) Should we add some special case to allow stopping a running VM even if the FG has been disabled?
2) Should we do not allow disabling a FG if a running VM is using a feature protected by that FG?

Thanks a lot!
J.Cano.


Edward Haas

unread,
Apr 3, 2024, 10:56:26 AM4/3/24
to Javier Cano Cano, kubevirt-dev
On Wed, Apr 3, 2024 at 5:30 PM Javier Cano Cano <jcan...@redhat.com> wrote:
Hi all!

Recently, we have observed that if you enable a given feature gate, spin a VM using the feature behind the feature, remove the feature gate and try to stop the VM, you get an error from the vmi create admitter.

For instance, if we have a VMI trying to use downwardMetrics:

1. Enable the DownwardMetrics FG.
2. Spin a VM using the DownwardMetrics (add spec.template.spec.domain.devices.downwardMetrics: {})
3. Remove the DownwardMetrics FG.
4. Stop the VM.

Then you will get the following error:

Error stopping VirtualMachine Internal error occurred: admission webhook "virtualmachine-validator.kubevirt.io" denied the request: downwardMetrics virtio serial is not allowed: DownwardMetrics feature gate is not enabled

I've checked it out with other FG such as VSOCK, and it shows the same behavior.

Nice catch!


Therefore, I have two questions:

1) Should we add some special case to allow stopping a running VM even if the FG has been disabled?
2) Should we do not allow disabling a FG if a running VM is using a feature protected by that FG?

I think this is a classic incident of when validation admitters are a bad solution to a validation need.
A webhook admitter does not provide the ability for the system or the user to fix the problem, potentially
canceling the desired request instead of reconciling (retry) it until success.

The solution IMO is to move the validation to the relevant controller. At which point it will know not
to validate the FG beyond start time (or whatever strategy is chosen).


Thanks a lot!
J.Cano.


--
You received this message because you are subscribed to the Google Groups "kubevirt-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kubevirt-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kubevirt-dev/ea74706c-1a6a-4e66-b25c-0e870df153c6n%40googlegroups.com.

Orel Misan

unread,
Apr 3, 2024, 10:58:42 AM4/3/24
to Javier Cano Cano, kubevirt-dev
Hi Javier,

It makes sense to me that in general, the settings that were present in the cluster when the VM was first created - should persist throughout the VM's life-cycle.
For example, the logic behind the AlignCPUs feature gate [1] persists the state that was determined during the creation of the VMI.
Changing this feature gate will not change the behavior of existing VMs, but will affect new ones.

As you've mentioned in your email, this is not the case with all feature gates, which some are checked at certain checkpoints and could have different values at different times.
This could lead to unexpected behavior.

Best regards, 
Orel


On Wed, Apr 3, 2024 at 5:30 PM Javier Cano Cano <jcan...@redhat.com> wrote:

Javier Cano Cano

unread,
Apr 4, 2024, 5:39:12 AM4/4/24
to Edward Haas, Geetika Kapoor, kubevirt-dev


On Wed, Apr 3, 2024 at 5:06 PM Edward Haas <edw...@redhat.com> wrote:


On Wed, Apr 3, 2024 at 5:55 PM Edward Haas <edw...@redhat.com> wrote:


On Wed, Apr 3, 2024 at 5:30 PM Javier Cano Cano <jcan...@redhat.com> wrote:
Hi all!

Recently, we have observed that if you enable a given feature gate, spin a VM using the feature behind the feature, remove the feature gate and try to stop the VM, you get an error from the vmi create admitter.

For instance, if we have a VMI trying to use downwardMetrics:

1. Enable the DownwardMetrics FG.
2. Spin a VM using the DownwardMetrics (add spec.template.spec.domain.devices.downwardMetrics: {})
3. Remove the DownwardMetrics FG.
4. Stop the VM.

Then you will get the following error:

Error stopping VirtualMachine Internal error occurred: admission webhook "virtualmachine-validator.kubevirt.io" denied the request: downwardMetrics virtio serial is not allowed: DownwardMetrics feature gate is not enabled

I've checked it out with other FG such as VSOCK, and it shows the same behavior.

Nice catch!

All the credit to +Geetika Kapoor. She discovered this issue. I just explored if other FGs were affected.

Therefore, I have two questions:

1) Should we add some special case to allow stopping a running VM even if the FG has been disabled?
2) Should we do not allow disabling a FG if a running VM is using a feature protected by that FG?

I think this is a classic incident of when validation admitters are a bad solution to a validation need.
A webhook admitter does not provide the ability for the system or the user to fix the problem, potentially
canceling the desired request instead of reconciling (retry) it until success.

The solution IMO is to move the validation to the relevant controller. At which point it will know not
to validate the FG beyond start time (or whatever strategy is chosen).

To mitigate this, here is a potential practical (quick?) solution:
Make sure the FG validations are only occurring on creation of the manifest and not on updates.

Any objections to this? I will try to create a PR implementing this approach.

As a follow up, FG validations on the spec (i.e. do not input a field that its FG is not enabled) should be done only
at the controller. But this may be controversial and more effort needs to be put in it.
I suggest a design, as the behavior will change a bit and convestation is expected.

Edward Haas

unread,
Apr 9, 2024, 4:28:25 AM4/9/24
to Javier Cano Cano, kubevirt-dev
On Wed, Apr 3, 2024 at 5:55 PM Edward Haas <edw...@redhat.com> wrote:


On Wed, Apr 3, 2024 at 5:30 PM Javier Cano Cano <jcan...@redhat.com> wrote:
Hi all!

Recently, we have observed that if you enable a given feature gate, spin a VM using the feature behind the feature, remove the feature gate and try to stop the VM, you get an error from the vmi create admitter.

For instance, if we have a VMI trying to use downwardMetrics:

1. Enable the DownwardMetrics FG.
2. Spin a VM using the DownwardMetrics (add spec.template.spec.domain.devices.downwardMetrics: {})
3. Remove the DownwardMetrics FG.
4. Stop the VM.

Then you will get the following error:

Error stopping VirtualMachine Internal error occurred: admission webhook "virtualmachine-validator.kubevirt.io" denied the request: downwardMetrics virtio serial is not allowed: DownwardMetrics feature gate is not enabled

I've checked it out with other FG such as VSOCK, and it shows the same behavior.

Nice catch!


Therefore, I have two questions:

1) Should we add some special case to allow stopping a running VM even if the FG has been disabled?
2) Should we do not allow disabling a FG if a running VM is using a feature protected by that FG?

I think this is a classic incident of when validation admitters are a bad solution to a validation need.
A webhook admitter does not provide the ability for the system or the user to fix the problem, potentially
canceling the desired request instead of reconciling (retry) it until success.

The solution IMO is to move the validation to the relevant controller. At which point it will know not
to validate the FG beyond start time (or whatever strategy is chosen).

To mitigate this, here is a potential practical (quick?) solution:
Make sure the FG validations are only occurring on creation of the manifest and not on updates.

As a follow up, FG validations on the spec (i.e. do not input a field that its FG is not enabled) should be done only
at the controller. But this may be controversial and more effort needs to be put in it.
I suggest a design, as the behavior will change a bit and convestation is expected.

Felix Matouschek

unread,
Apr 9, 2024, 10:02:34 AM4/9/24
to Orel Misan, Javier Cano Cano, kubevirt-dev
Hi,

Am Mittwoch, dem 03.04.2024 um 17:58 +0300 schrieb Orel Misan:
> Hi Javier,
>
> It makes sense to me that in general, the settings that were present
> in the cluster when the VM was first created - should
> persist throughout the VM's life-cycle.

To guarantee the presence of a feature persists throughout the
lifecycle of a VM it sounds to me like we need a webhook guarding the
removal of feature gates still in use. IMO that would be suited best
for the way feature gates work at the moment and without changing
actual controller logic, at least in short term.

Sarah Bennert ( she / her )

unread,
Apr 9, 2024, 12:52:29 PM4/9/24
to Felix Matouschek, Orel Misan, Javier Cano Cano, kubevirt-dev
On Tue, Apr 9, 2024 at 10:02 AM Felix Matouschek <fmato...@redhat.com> wrote:
Hi,

Am Mittwoch, dem 03.04.2024 um 17:58 +0300 schrieb Orel Misan:
> Hi Javier,
>
> It makes sense to me that in general, the settings that were present
> in the cluster when the VM was first created - should
> persist throughout the VM's life-cycle.

To guarantee the presence of a feature persists throughout the
lifecycle of a VM it sounds to me like we need a webhook guarding the
removal of feature gates still in use. IMO that would be suited best
for the way feature gates work at the moment and without changing
actual controller logic, at least in short term.

This may be a good short term work-around, but I am concerned about the long term and larger installations.
This would require an admin to halt new VM creation in-order to shutdown affected VMs and disable feature gates.
 

> For example, the logic behind the AlignCPUs feature gate [1] persists
> the state that was determined during the creation of the VMI.
> Changing this feature gate will not change the behavior of
> existing VMs, but will affect new ones.

As mentioned above re AlignCPU, if it is possible to allow the admin to disable a feature gate, preventing new VMs from using the feature, while allowing existing VMs to be restarted when it is viable, this might be a more comprehensive and admin-friendly solution.

 
--
You received this message because you are subscribed to the Google Groups "kubevirt-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kubevirt-dev...@googlegroups.com.


--
Sarah Bennert

she / her 🏳️‍⚧

RED HAT QE

Red Hat

sben...@redhat.com 

Fabian Deutsch

unread,
Apr 10, 2024, 3:09:51 AM4/10/24
to Sarah Bennert ( she / her ), Felix Matouschek, Orel Misan, Javier Cano Cano, kubevirt-dev
On Tue, Apr 9, 2024 at 6:52 PM Sarah Bennert ( she / her ) <sben...@redhat.com> wrote:


On Tue, Apr 9, 2024 at 10:02 AM Felix Matouschek <fmato...@redhat.com> wrote:
Hi,

Am Mittwoch, dem 03.04.2024 um 17:58 +0300 schrieb Orel Misan:
> Hi Javier,
>
> It makes sense to me that in general, the settings that were present
> in the cluster when the VM was first created - should
> persist throughout the VM's life-cycle.

To guarantee the presence of a feature persists throughout the
lifecycle of a VM it sounds to me like we need a webhook guarding the
removal of feature gates still in use. IMO that would be suited best
for the way feature gates work at the moment and without changing
actual controller logic, at least in short term.

This may be a good short term work-around, but I am concerned about the long term and larger installations.
This would require an admin to halt new VM creation in-order to shutdown affected VMs and disable feature gates.

It is for sure inconvenient, but I tend to agree with Felix that it would be correct.

If a feature is in use, but the FG is representing something else, then this is just asking for error conditions and problems.
 

Sarah Bennert ( she / her )

unread,
Apr 10, 2024, 8:25:55 AM4/10/24
to Fabian Deutsch, Felix Matouschek, Orel Misan, Javier Cano Cano, kubevirt-dev
On Wed, Apr 10, 2024 at 3:09 AM Fabian Deutsch <fdeu...@redhat.com> wrote:


On Tue, Apr 9, 2024 at 6:52 PM Sarah Bennert ( she / her ) <sben...@redhat.com> wrote:


On Tue, Apr 9, 2024 at 10:02 AM Felix Matouschek <fmato...@redhat.com> wrote:
Hi,

Am Mittwoch, dem 03.04.2024 um 17:58 +0300 schrieb Orel Misan:
> Hi Javier,
>
> It makes sense to me that in general, the settings that were present
> in the cluster when the VM was first created - should
> persist throughout the VM's life-cycle.

To guarantee the presence of a feature persists throughout the
lifecycle of a VM it sounds to me like we need a webhook guarding the
removal of feature gates still in use. IMO that would be suited best
for the way feature gates work at the moment and without changing
actual controller logic, at least in short term.

This may be a good short term work-around, but I am concerned about the long term and larger installations.
This would require an admin to halt new VM creation in-order to shutdown affected VMs and disable feature gates.

It is for sure inconvenient, but I tend to agree with Felix that it would be correct.

If a feature is in use, but the FG is representing something else, then this is just asking for error conditions and problems.

I do not disagree.
Perhaps moving forward, a more granular approach to all feature gates might be considered.
If all feature gates could be limited to a namespace, role, or user, then the impact can be managed better than any cluster-wide feature gate would be. 

Fabian Deutsch

unread,
Apr 17, 2024, 7:56:16 AM4/17/24
to Sarah Bennert ( she / her ), Felix Matouschek, Orel Misan, Javier Cano Cano, kubevirt-dev
On Wed, Apr 10, 2024 at 2:25 PM Sarah Bennert ( she / her ) <sben...@redhat.com> wrote:


On Wed, Apr 10, 2024 at 3:09 AM Fabian Deutsch <fdeu...@redhat.com> wrote:


On Tue, Apr 9, 2024 at 6:52 PM Sarah Bennert ( she / her ) <sben...@redhat.com> wrote:


On Tue, Apr 9, 2024 at 10:02 AM Felix Matouschek <fmato...@redhat.com> wrote:
Hi,

Am Mittwoch, dem 03.04.2024 um 17:58 +0300 schrieb Orel Misan:
> Hi Javier,
>
> It makes sense to me that in general, the settings that were present
> in the cluster when the VM was first created - should
> persist throughout the VM's life-cycle.

To guarantee the presence of a feature persists throughout the
lifecycle of a VM it sounds to me like we need a webhook guarding the
removal of feature gates still in use. IMO that would be suited best
for the way feature gates work at the moment and without changing
actual controller logic, at least in short term.

This may be a good short term work-around, but I am concerned about the long term and larger installations.
This would require an admin to halt new VM creation in-order to shutdown affected VMs and disable feature gates.

It is for sure inconvenient, but I tend to agree with Felix that it would be correct.

If a feature is in use, but the FG is representing something else, then this is just asking for error conditions and problems.

I do not disagree.
Perhaps moving forward, a more granular approach to all feature gates might be considered.
If all feature gates could be limited to a namespace, role, or user, then the impact can be managed better than any cluster-wide feature gate would be. 

Maybe we have more critical things to address.

FeatureGates are about enabling - system wide - certain features which can affect workloads, or the system as a whole. Limiting this to a namespace or workooad or or or is for sure technically possible. (Ain't the sky the limit in software engineering? :) )

Let's try to keep it simple.
Reply all
Reply to author
Forward
0 new messages