2026-03-11 sig-ci quarantine catch-up

count: 18 tests in quarantine currently - no change to last week
kubevirt/kubevirt quarantined tests over time (by SIG):
Legend: Red(Total) | Blue(Compute) | Green(Storage) | Orange(Network) | Purple(Monitoring)

check status, i.e. who is working on those

look at PRs that want to fix flakes

nad: https://github.com/kubevirt/kubevirt/pull/17084

see whether we can dequarantine tests

misc topics

Action items

update/create issues with latest flakes spotted
communication

send meeting notes to kubevirt-dev, bcc sig people for spotted flakes (include meeting changes for upcoming instances)

--

Kind regards,

Daniel Hiller

He / Him / His

Principal Software Engineer, KubeVirt CI, OpenShift Virtualization

Red Hat

dhi...@redhat.com

Red Hat GmbH, Registered seat: Werner von Siemens Ring 12, D-85630 Grasbrunn, Germany  
Commercial register: Amtsgericht Muenchen/Munich, HRB 153243,
Managing Directors: Ryan Barnhart, Charles Cachera, Avril Crosse O'Flaherty

Lee Yarwood

unread,

Mar 11, 2026, 5:40:43 AM (9 days ago) Mar 11

to Daniel Hiller, kubevirt-dev, Stu Gott, Kedar Bidarkar, Petr Horacek, Adam Litke, Jan Schintag, Siddu Vamsikrishna, Luboslav Pivarc

On Wed, 11 Mar 2026 at 09:10, 'Daniel Hiller' via kubevirt-dev
<kubevi...@googlegroups.com> wrote:
>
> [..]

>
> flake stats - create issues accordingly
>
> [dhiller] all sig-compute periodics are still suffering from clustered failures

Hey Daniel,

Apologies for not being on the call, I spoke briefly to Lubo about
this while dropping my kids off at school and it looks like these
clustered failures stopped once
https://github.com/kubevirt/project-infra/pull/4784 landed on Monday
afternoon.

I often find the full 7 days of CI data can be very misleading once
fixes have landed. Can I suggest that we also consider sharing 24, 48
and 72 hour trends when summarising the state of CI? That way fixes
like this should show up and reflect the true current state of things.

Cheers,

Lee

Daniel Hiller

unread,

Mar 11, 2026, 6:20:22 AM (9 days ago) Mar 11

to Lee Yarwood, kubevirt-dev, Stu Gott, Kedar Bidarkar, Petr Horacek, Adam Litke, Jan Schintag, Siddu Vamsikrishna, Luboslav Pivarc

Hey Lee,

thanks for raising that!

Interesting - I was under the impression that another PR which was merged last week should have brought improvement. I'll take another look and share my findings here then.

Best,

Daniel

--

Best,

Daniel

Daniel Hiller

unread,

Mar 11, 2026, 6:40:20 AM (9 days ago) Mar 11

to Lee Yarwood, Federico Fossemo, Nir Dothan, Denis Ollier Pinas, kubevirt-dev, Stu Gott, Kedar Bidarkar, Petr Horacek, Adam Litke, Jan Schintag, Siddu Vamsikrishna, Luboslav Pivarc

Hey again,

On Wed, Mar 11, 2026 at 11:20 AM Daniel Hiller <dhi...@redhat.com> wrote:

Hey Lee,

thanks for raising that!

Interesting - I was under the impression that another PR which was merged last week should have brought improvement. I'll take another look and share my findings here then.

indeed I can confirm that no clustered failure occurred after the fix had landed 🎉

Also we are seeing an overall decrease of failures in the periodics from 80 to around 60% - this might not sound like much but since the periodics are running the quarantined tests, this indeed is significant.

https://grafana.ci.kubevirt.io/d/efpTS3t4z/e2e-jobs-overview-v2?orgId=1&from=1773054000000&to=1773223199000&var-job_name=periodic-kubevirt-e2e-k8s-.%2Asig-compute&viewPanel=15

Best,
Daniel

On Wed, Mar 11, 2026 at 10:40 AM Lee Yarwood <lyar...@redhat.com> wrote:
On Wed, 11 Mar 2026 at 09:10, 'Daniel Hiller' via kubevirt-dev
<kubevi...@googlegroups.com> wrote:
>
> [..]
>
> flake stats - create issues accordingly
>
> [dhiller] all sig-compute periodics are still suffering from clustered failures

Hey Daniel,

Apologies for not being on the call, I spoke briefly to Lubo about
this while dropping my kids off at school and it looks like these
clustered failures stopped once
https://github.com/kubevirt/project-infra/pull/4784 landed on Monday
afternoon.

I'm open to moving the call into the afternoon if that fits better - I believe the afternoon around 3PM would also help include the US TZ folks - although we should not overlap with community meeting.

Any suggestions to which time would fit best from the usual attendees?

@Federico Fossemo @Luboslav Pivarc @Nir Dothan @Denis Ollier Pinas ?

I often find the full 7 days of CI data can be very misleading once
fixes have landed. Can I suggest that we also consider sharing 24, 48
and 72 hour trends when summarising the state of CI? That way fixes
like this should show up and reflect the true current state of things.

I agree - ths would be beneficial for ci-health badges also, since short term trends are largely invisible there.

Created https://issues.redhat.com/browse/CNV-81690 to track this.

Cheers,

Lee

--
--
Best,
Daniel

--

Best,

Daniel

Lee Yarwood

unread,

Mar 11, 2026, 9:34:05 AM (9 days ago) Mar 11

to Daniel Hiller, Federico Fossemo, Nir Dothan, Denis Ollier Pinas, kubevirt-dev, Stu Gott, Kedar Bidarkar, Petr Horacek, Adam Litke, Jan Schintag, Siddu Vamsikrishna, Luboslav Pivarc

On Wed, 11 Mar 2026 at 10:40, Daniel Hiller <dhi...@redhat.com> wrote:
>
> Hey again,
>
> On Wed, Mar 11, 2026 at 11:20 AM Daniel Hiller <dhi...@redhat.com> wrote:
>>
>> Hey Lee,
>>
>> thanks for raising that!
>>
>> Interesting - I was under the impression that another PR which was merged last week should have brought improvement. I'll take another look and share my findings here then.
>>
>
> indeed I can confirm that no clustered failure occurred after the fix had landed 🎉

Excellent thanks!

> Also we are seeing an overall decrease of failures in the periodics from 80 to around 60% - this might not sound like much but since the periodics are running the quarantined tests, this indeed is significant.
>

> https://grafana.ci.kubevirt.io/d/efpTS3t4z/e2e-jobs-overview-v2?orgId=1&from=1773054000000&to=1773223199000&var-job_name=periodic-kubevirt-e2e-k8s-.%2Asig-compute&viewPanel=15

Stupid question but what is the benefit in running quarantined as part
of our periodics? If the intention is to confirm if they are still
flaky, shouldn't we run them in their own dedicated, quarantined
periodic jobs to avoid polluting the unquarantined tests?

Cheers,

Lee

Daniel Hiller

unread,

Mar 11, 2026, 10:18:19 AM (9 days ago) Mar 11

to Lee Yarwood, Federico Fossemo, Nir Dothan, Denis Ollier Pinas, kubevirt-dev, Stu Gott, Kedar Bidarkar, Petr Horacek, Adam Litke, Jan Schintag, Siddu Vamsikrishna, Luboslav Pivarc

Hey Lee,

On Wed, Mar 11, 2026 at 2:34 PM 'Lee Yarwood' via kubevirt-dev <kubevi...@googlegroups.com> wrote:

On Wed, 11 Mar 2026 at 10:40, Daniel Hiller <dhi...@redhat.com> wrote:
>
> Hey again,
>
> On Wed, Mar 11, 2026 at 11:20 AM Daniel Hiller <dhi...@redhat.com> wrote:
>>
>> Hey Lee,
>>
>> thanks for raising that!
>>
>> Interesting - I was under the impression that another PR which was merged last week should have brought improvement. I'll take another look and share my findings here then.
>>
>
> indeed I can confirm that no clustered failure occurred after the fix had landed 🎉

Excellent thanks!

> Also we are seeing an overall decrease of failures in the periodics from 80 to around 60% - this might not sound like much but since the periodics are running the quarantined tests, this indeed is significant.
>
> https://grafana.ci.kubevirt.io/d/efpTS3t4z/e2e-jobs-overview-v2?orgId=1&from=1773054000000&to=1773223199000&var-job_name=periodic-kubevirt-e2e-k8s-.%2Asig-compute&viewPanel=15

Stupid question but what is the benefit in running quarantined as part
of our periodics? If the intention is to confirm if they are still
flaky, shouldn't we run them in their own dedicated, quarantined
periodic jobs to avoid polluting the unquarantined tests?

Not stupid at all! We decided looooong ago that we didn't want to have another set of quarantined lanes.

Those lanes would duplicate the existing periodics, since they require the exact same configuration, they would compete with the existing lanes for resources, they would need to get bumped periodically, be looked after, tooling adjusted etc.

I am not sure whether that is really worth the effort.

WDYT?

--
You received this message because you are subscribed to the Google Groups "kubevirt-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kubevirt-dev...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/kubevirt-dev/CAPkJ9DvaAsioJvemLJkvPj12U1LwVDWNChCaFBy57fvWNkCkjg%40mail.gmail.com.

--

Best,

Daniel

Lee Yarwood

unread,

Mar 11, 2026, 12:47:10 PM (9 days ago) Mar 11

to Daniel Hiller, Federico Fossemo, Nir Dothan, Denis Ollier Pinas, kubevirt-dev, Stu Gott, Kedar Bidarkar, Petr Horacek, Adam Litke, Jan Schintag, Siddu Vamsikrishna, Luboslav Pivarc

On Wed, 11 Mar 2026 at 14:18, Daniel Hiller <dhi...@redhat.com> wrote:
>
> Hey Lee,
>
> On Wed, Mar 11, 2026 at 2:34 PM 'Lee Yarwood' via kubevirt-dev <kubevi...@googlegroups.com> wrote:
>>
>> On Wed, 11 Mar 2026 at 10:40, Daniel Hiller <dhi...@redhat.com> wrote:
>> >
>> > Hey again,
>> >
>> > On Wed, Mar 11, 2026 at 11:20 AM Daniel Hiller <dhi...@redhat.com> wrote:
>> >>
>> >> Hey Lee,
>> >>
>> >> thanks for raising that!
>> >>
>> >> Interesting - I was under the impression that another PR which was merged last week should have brought improvement. I'll take another look and share my findings here then.
>> >>
>> >
>> > indeed I can confirm that no clustered failure occurred after the fix had landed 🎉
>>
>> Excellent thanks!
>>
>> > Also we are seeing an overall decrease of failures in the periodics from 80 to around 60% - this might not sound like much but since the periodics are running the quarantined tests, this indeed is significant.
>> >
>> > https://grafana.ci.kubevirt.io/d/efpTS3t4z/e2e-jobs-overview-v2?orgId=1&from=1773054000000&to=1773223199000&var-job_name=periodic-kubevirt-e2e-k8s-.%2Asig-compute&viewPanel=15
>>
>> Stupid question but what is the benefit in running quarantined as part
>> of our periodics? If the intention is to confirm if they are still
>> flaky, shouldn't we run them in their own dedicated, quarantined
>> periodic jobs to avoid polluting the unquarantined tests?
>
>
> Not stupid at all! We decided looooong ago that we didn't want to have another set of quarantined lanes.
>
> Those lanes would duplicate the existing periodics, since they require the exact same configuration, they would compete with the existing lanes for resources, they would need to get bumped periodically, be looked after, tooling adjusted etc.
>
> I am not sure whether that is really worth the effort.
>
> WDYT?

Yeah I understand the overhead involved in maintaining separate jobs
but IMHO that would be outweighed by the improved insight the
resulting unpolluted data would give us into the periodic state of
unquarantined and quarantined tests.

I question the need for the unquarantined periodic jobs on main right
now anyway, given the sheer amount of incoming changes. They make
sense on less active branches but could this job ever reveal something
we wouldn't already be seeing on main in PR and merge related runs?

Cheers,

Lee

Daniel Hiller

unread,

Mar 13, 2026, 6:59:57 AM (7 days ago) Mar 13

to Lee Yarwood, Federico Fossemo, Nir Dothan, Denis Ollier Pinas, kubevirt-dev, Stu Gott, Kedar Bidarkar, Petr Horacek, Adam Litke, Jan Schintag, Siddu Vamsikrishna, Luboslav Pivarc

Fair point - however there's a bunch of flakes that aren't quarantined since they don't exceed the set boundaries for quarantining. Also I would argue that the quarantined tests also need to be stable inside the main periodic test lanes, whereas a separate run with probably different lane configuration might hide test dependency issues.

I question the need for the unquarantined periodic jobs on main right
now anyway, given the sheer amount of incoming changes. They make
sense on less active branches but could this job ever reveal something
we wouldn't already be seeing on main in PR and merge related runs?

I see them as a baseline so we can evaluate whether tests seen on PRs are actually flaky on periodics as well, which is not easy to evaluate if you don't have that data. However I'd be open to suggestions on how we can work on keeping all this aligned while reducing the load. Also compared to the number of presubmit runs the periodics don't play that much of a role in resource consumption.

Best,

Daniel

Cheers,

Lee

--

Best,

Daniel

Daniel Hiller

unread,

Mar 13, 2026, 7:02:48 AM (7 days ago) Mar 13

to Lee Yarwood, Federico Fossemo, Nir Dothan, Denis Ollier Pinas, kubevirt-dev, Stu Gott, Kedar Bidarkar, Petr Horacek, Adam Litke, Jan Schintag, Siddu Vamsikrishna, Luboslav Pivarc

(apologies, forgot to add the link to the dashboard: https://grafana.ci.kubevirt.io/d/lTSssoODz/prow-job-load-v3?orgId=1&from=now-2d&to=now&var-org=All&var-repo=All&var-type=presubmit&var-type=postsubmit&var-type=batch&var-cluster=prow-workloads )

Best,
Daniel

Cheers,

Lee

--
--
Best,
Daniel

--

Best,

Daniel

Reply all

Reply to author

Forward

2026-03-11 sig-ci quarantine catch-up - notes

Daniel Hiller

Lee Yarwood

Daniel Hiller

Daniel Hiller

Lee Yarwood

Daniel Hiller

Lee Yarwood

Daniel Hiller

Daniel Hiller