VMI Crash Loops and Startup Queues


David Vossel

Jun 16, 2021, 3:28:28 PM
to kubevirt-dev
Hey,

We experienced something interesting today. If the majority of a cluster goes offline (let's say as a result of a power failure) and all the nodes come back online at the same time, every VM with runStrategy=Always gets started at pretty much the same time. For very large clusters, this massive start event puts a strain on the control plane components, network, storage, and just about everything else. Due to this strain, we start seeing timeouts occur during VMI startup, which causes even more VMIs to get re-created to satisfy runStrategy=Always.

This has me thinking about two things.

1. I think we need to introduce a VM crash-loop backoff for VMs with runStrategy=Always that continue to crash before ever having their VMI reach phase=Running.

2. I think we should consider a maximum queue length for how many VMIs we allow to be in the startup state before phase=Running.

Item 1 would prevent our control plane from thrashing during a VMI crash loop and item 2 would prevent the control plane from thrashing during a massive startup event.

Has anyone else given these types of scenarios any thought? Do these seem like reasonable approaches? 

Thanks,
- David

Stu Gott

Jun 16, 2021, 3:53:56 PM
to David Vossel, kubevirt-dev
On Wed, Jun 16, 2021 at 3:28 PM David Vossel <dvo...@redhat.com> wrote:
Hey,

We experienced something interesting today. If the majority of a cluster goes offline (let's say as a result of a power failure) and all the nodes come back online at the same time, every VM with runStrategy=Always gets started at pretty much the same time. For very large clusters, this massive start event puts a strain on the control plane components, network, storage, and just about everything else. Due to this strain, we start seeing timeouts occur during VMI startup, which causes even more VMIs to get re-created to satisfy runStrategy=Always.

This has me thinking about two things.

1. I think we need to introduce a VM crash-loop backoff for VMs with runStrategy=Always that continue to crash before ever having their VMI reach phase=Running.


Definitely a good idea. This could certainly help the situation, and doesn't require an API change.

 
2. I think we should consider a maximum queue length for how many VMIs we allow to be in the startup state before phase=Running.


I'd think this should be configurable in some way, with a sensible default of course. There are some cases where we deliberately try to rapidly boot as many VMs as we can, e.g. for testing.
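One way a configurable cap like this could look internally (hypothetical names, not actual virt-controller code): a semaphore that bounds how many VMIs may sit in the pre-Running startup phase at once, where a full gate tells the controller to requeue the VM instead of creating its VMI immediately.

```go
package main

import "fmt"

// startupGate bounds how many VMIs may be in the startup phase
// (created but not yet phase=Running) at once. Purely illustrative:
// a real limit, if added, would live in the virt-controller with a
// configurable size and a sensible default, as suggested above.
type startupGate struct {
	slots chan struct{}
}

func newStartupGate(max int) *startupGate {
	return &startupGate{slots: make(chan struct{}, max)}
}

// tryAcquire reserves a startup slot; it returns false when the
// configured limit is reached, signalling the caller to requeue
// the VM rather than create its VMI now.
func (g *startupGate) tryAcquire() bool {
	select {
	case g.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot once the VMI reaches phase=Running (or fails).
func (g *startupGate) release() { <-g.slots }

func main() {
	g := newStartupGate(2)
	fmt.Println(g.tryAcquire(), g.tryAcquire(), g.tryAcquire()) // true true false
}
```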
 
Item 1 would prevent our control plane from thrashing during a VMI crash loop and item 2 would prevent the control plane from thrashing during a massive startup event.

Has anyone else given these types of scenarios any thought? Do these seem like reasonable approaches? 

Thanks,
- David

--
You received this message because you are subscribed to the Google Groups "kubevirt-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kubevirt-dev...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kubevirt-dev/CAPjOJFsyD6eexGAoJ00giBiRSGcVzHikSe0Bn0avNo-K2CbhEQ%40mail.gmail.com.

Zang Li

Jun 17, 2021, 12:51:14 AM
to David Vossel, kubevirt-dev
Hi David,

I just read what runStrategy is and I am quite confused about its usage. May I ask for some clarification here?
The API comment says runStrategy is mutually exclusive with the running flag.

1. If I set running to true, a VMI will be able to start and stop. Does that correspond to the Manual strategy?
2. If I set running to true, then in your situation, will they all be brought up again after the nodes come back?
3. For the RunStrategyRerunOnFailure run strategy, what failure are we talking about here? Are we continuously monitoring VMI failure, or just the exit code?

As for the issue you see, is VMI recreation handled by a virt-controller? Would some rate limiting at the virt-controller processing queue help? 

Thanks,
Zang


--

Stu Gott

Jun 17, 2021, 7:14:41 AM
to Zang Li, David Vossel, kubevirt-dev
Hi Zang!

On Thu, Jun 17, 2021 at 12:51 AM 'Zang Li' via kubevirt-dev <kubevi...@googlegroups.com> wrote:
Hi David,

I just read what runStrategy is and I am quite confused about its usage. May I ask for some clarification here?
The API comment says runStrategy is mutually exclusive with the running flag.


They're mutually exclusive simply because RunStrategy completely eclipses Running; otherwise it would be possible to issue contradictory directives if both were set at the same time.
 
1. If I set running to true, a VMI will be able to start and stop. Does that correspond to the Manual strategy?

No. Running=True is equivalent to RunStrategy=Always.

We want to move away from "Running" for two reasons.

First off, it's confusing. Running in the spec is a request for a state, not a guarantee that the VMI can actually run (e.g. the resource request might not be satisfiable by any node). People have seen the word "Running" and assumed it was a true reflection of the state of the VM.

Secondly, it turned out we needed richer policies in some cases, like RerunOnFailure. Some users prefer to shut down the VMI from inside the guest, and found it confusing when it came back online immediately.
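For reference, the two spellings look roughly like this on the VirtualMachine spec (a minimal sketch; the VMI template is elided, and only one of the two fields may be set on a given VM):

```yaml
# Legacy form: request that the VM be running.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-legacy
spec:
  running: true
  template:
    # VMI template elided
---
# Equivalent runStrategy form; also allows RerunOnFailure, Manual, Halted.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: vm-strategy
spec:
  runStrategy: Always
  template:
    # VMI template elided
```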
 
2. If I set running to true, and in your situation, will they all be brought up again after the nodes come back?

Yes.
 
3. For the RunStrategyRerunOnFailure run strategy, what failure are we talking about here? Are we continuously monitoring VMI failure, or just the exit code?


Just the exit code. Since qemu continuously monitors the state of the guest, we can trust that the exit code reflects what happened, and we don't need to react until the guest is offline anyway.
 
As for the issue you see, is VMI recreation handled by a virt-controller?

Yes.
 
Would some rate limiting at the virt-controller processing queue help?

We still want to react to cluster state changes as quickly as possible. A backoff should only be introduced for crashed VMIs to lessen the impact of a restart storm.


David Vossel

Jun 23, 2021, 5:00:22 PM
to Stu Gott, Zang Li, kubevirt-dev
I created a PR [1] to handle VM crash looping when runStrategy: Always|RerunOnFailure is used. It ended up being far more invasive than I anticipated, due to needing to provide user feedback that the crash loop is occurring, and needing to adjust our vm stop subresource to handle stopping a VM that has no current VMI, i.e. stopping a VM that is in a crash loop while it's in the middle of an exponential backoff waiting to create the next VMI.
 
As for a per-cluster limit on the number of VMIs in the startup phase, the more I've looked into it, the less convinced I am that this is an area KubeVirt should jump into. I'd expect the k8s scheduler and the individual node kubelets to throttle pod creation if something like that is necessary.
