2026-03-11 sig-ci quarantine catch-up - notes


Daniel Hiller

Mar 11, 2026, 5:10:06 AM
to kubevirt-dev, Stu Gott, Kedar Bidarkar, Petr Horacek, Adam Litke, Jan Schintag, Siddu Vamsikrishna
2026-03-11 sig-ci quarantine catch-up

When: Weekly on Wed, 9:45 – 10:15am

Notes: KubeVirt CI SIG meeting notes

Attendees: dhiller, dollierp,


Reminders:

Topics:

flake stats - create issues accordingly

[dhiller] all sig-compute periodics are still suffering from clustered failures

Action items:

  • [ ] update/create issues with latest flakes spotted
  • [ ] communication
    • [ ] send meeting notes to kubevirt-dev, bcc sig people for spotted flakes (include meeting changes for upcoming instances)




--

Kind regards,


Daniel Hiller

He / Him / His

Principal Software Engineer, KubeVirt CI, OpenShift Virtualization

Red Hat

dhi...@redhat.com   

Red Hat GmbH, Registered seat: Werner von Siemens Ring 12, D-85630 Grasbrunn, Germany  
Commercial register: Amtsgericht Muenchen/Munich, HRB 153243,
Managing Directors: Ryan Barnhart, Charles Cachera, Avril Crosse O'Flaherty  

Lee Yarwood

Mar 11, 2026, 5:40:43 AM
to Daniel Hiller, kubevirt-dev, Stu Gott, Kedar Bidarkar, Petr Horacek, Adam Litke, Jan Schintag, Siddu Vamsikrishna, Luboslav Pivarc
On Wed, 11 Mar 2026 at 09:10, 'Daniel Hiller' via kubevirt-dev
<kubevi...@googlegroups.com> wrote:
>
> [..]
>
> flake stats - create issues accordingly
>
> [dhiller] all sig-compute periodics are still suffering from clustered failures

Hey Daniel,

Apologies for not being on the call, I spoke briefly to Lubo about
this while dropping my kids off at school and it looks like these
clustered failures stopped once
https://github.com/kubevirt/project-infra/pull/4784 landed on Monday
afternoon.

I often find the full 7 days of CI data can be very misleading once
fixes have landed. Can I suggest that we also consider sharing 24, 48
and 72 hour trends when summarising the state of CI? That way fixes
like this should show up and reflect the true current state of things.
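
To make that concrete, something along these lines could produce the short-term trends next to the 7-day number (a rough sketch only; the data shape and the function are assumptions, not existing tooling):

```python
from datetime import datetime, timedelta, timezone

def failure_rates(runs, window_hours=(24, 48, 72, 168)):
    """Return {hours: failure rate} over several trailing windows.

    `runs` is a list of (finished_at, passed) tuples for one job,
    e.g. collected from prow/testgrid results beforehand (this data
    shape is an assumption, not existing tooling).
    """
    now = datetime.now(timezone.utc)
    rates = {}
    for hours in window_hours:
        cutoff = now - timedelta(hours=hours)
        recent = [passed for finished_at, passed in runs if finished_at >= cutoff]
        if recent:
            rates[hours] = sum(1 for passed in recent if not passed) / len(recent)
        else:
            rates[hours] = None  # no runs finished inside this window
    return rates

# A fix that landed ~30h ago already shows up in the 24h and 48h numbers
# while the 7-day (168h) number still looks bad.
```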

Cheers,

Lee

Daniel Hiller

Mar 11, 2026, 6:20:22 AM
to Lee Yarwood, kubevirt-dev, Stu Gott, Kedar Bidarkar, Petr Horacek, Adam Litke, Jan Schintag, Siddu Vamsikrishna, Luboslav Pivarc
Hey Lee,

thanks for raising that!

Interesting - I was under the impression that another PR, which was merged last week, should already have brought an improvement. I'll take another look and share my findings here.

Best,
Daniel
--
Best,
Daniel

Daniel Hiller

Mar 11, 2026, 6:40:20 AM
to Lee Yarwood, Federico Fossemo, Nir Dothan, Denis Ollier Pinas, kubevirt-dev, Stu Gott, Kedar Bidarkar, Petr Horacek, Adam Litke, Jan Schintag, Siddu Vamsikrishna, Luboslav Pivarc
Hey again,

On Wed, Mar 11, 2026 at 11:20 AM Daniel Hiller <dhi...@redhat.com> wrote:
> [..]
>
> Interesting - I was under the impression that another PR, which was merged last week, should already have brought an improvement. I'll take another look and share my findings here.


indeed I can confirm that no clustered failure occurred after the fix had landed 🎉

Also we are seeing an overall decrease of failures in the periodics from 80% to around 60% - this might not sound like much, but since the periodics are running the quarantined tests, this is indeed significant.

[screenshot: Grafana e2e jobs overview]
https://grafana.ci.kubevirt.io/d/efpTS3t4z/e2e-jobs-overview-v2?orgId=1&from=1773054000000&to=1773223199000&var-job_name=periodic-kubevirt-e2e-k8s-.%2Asig-compute&viewPanel=15

 
Best,
Daniel

On Wed, Mar 11, 2026 at 10:40 AM Lee Yarwood <lyar...@redhat.com> wrote:
> [..]
>
> Apologies for not being on the call, I spoke briefly to Lubo about
> this while dropping my kids off at school and it looks like these
> clustered failures stopped once
> https://github.com/kubevirt/project-infra/pull/4784 landed on Monday
> afternoon.

I'm open to moving the call into the afternoon if that fits better - I believe around 3 PM would also help include the US TZ folks - although we should not overlap with the community meeting.

Any suggestions as to which time would fit best for the usual attendees?

> I often find the full 7 days of CI data can be very misleading once
> fixes have landed. Can I suggest that we also consider sharing 24, 48
> and 72 hour trends when summarising the state of CI? That way fixes
> like this should show up and reflect the true current state of things.

I agree - this would be beneficial for the ci-health badges as well, since short-term trends are largely invisible there.



--
Best,
Daniel

Lee Yarwood

Mar 11, 2026, 9:34:05 AM
to Daniel Hiller, Federico Fossemo, Nir Dothan, Denis Ollier Pinas, kubevirt-dev, Stu Gott, Kedar Bidarkar, Petr Horacek, Adam Litke, Jan Schintag, Siddu Vamsikrishna, Luboslav Pivarc
On Wed, 11 Mar 2026 at 10:40, Daniel Hiller <dhi...@redhat.com> wrote:
>
> Hey again,
>
> [..]
>
> indeed I can confirm that no clustered failure occurred after the fix had landed 🎉

Excellent thanks!

> Also we are seeing an overall decrease of failures in the periodics from 80% to around 60% - this might not sound like much, but since the periodics are running the quarantined tests, this is indeed significant.
>
> https://grafana.ci.kubevirt.io/d/efpTS3t4z/e2e-jobs-overview-v2?orgId=1&from=1773054000000&to=1773223199000&var-job_name=periodic-kubevirt-e2e-k8s-.%2Asig-compute&viewPanel=15

Stupid question, but what is the benefit of running quarantined tests as part
of our periodics? If the intention is to confirm whether they are still
flaky, shouldn't we run them in their own dedicated, quarantined
periodic jobs to avoid polluting the unquarantined tests?
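
As a back-of-the-envelope illustration of the "polluting" point (all numbers below are invented purely to show the shape of the effect, they are not real KubeVirt figures):

```python
# A lane passes only if every test in it passes, so a few flaky
# quarantined tests can dominate the pass rate of a combined lane.
def lane_pass_probability(per_test_pass_rates):
    p = 1.0
    for rate in per_test_pass_rates:
        p *= rate
    return p

stable = [0.999] * 300       # 300 healthy tests at 99.9% pass rate each (invented)
quarantined = [0.80] * 10    # 10 quarantined tests at 80% pass rate each (invented)

print(lane_pass_probability(stable))                 # ~0.74
print(lane_pass_probability(stable + quarantined))   # ~0.08
```

Splitting the quarantined tests into their own lane would keep the first number readable, while the second lane tracks whether the quarantined tests are recovering.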

Cheers,

Lee

Daniel Hiller

Mar 11, 2026, 10:18:19 AM
to Lee Yarwood, Federico Fossemo, Nir Dothan, Denis Ollier Pinas, kubevirt-dev, Stu Gott, Kedar Bidarkar, Petr Horacek, Adam Litke, Jan Schintag, Siddu Vamsikrishna, Luboslav Pivarc
Hey Lee,

On Wed, Mar 11, 2026 at 2:34 PM 'Lee Yarwood' via kubevirt-dev <kubevi...@googlegroups.com> wrote:
> [..]
>
> Stupid question, but what is the benefit of running quarantined tests as part
> of our periodics? If the intention is to confirm whether they are still
> flaky, shouldn't we run them in their own dedicated, quarantined
> periodic jobs to avoid polluting the unquarantined tests?

Not stupid at all! We decided looooong ago that we didn't want to have another set of quarantined lanes.

Those lanes would duplicate the existing periodics since they require the exact same configuration; they would compete with the existing lanes for resources, would need to get bumped periodically and looked after, tooling would have to be adjusted, etc.

I am not sure whether that is really worth the effort.

WDYT?




--
Best,
Daniel

Lee Yarwood

Mar 11, 2026, 12:47:10 PM
to Daniel Hiller, Federico Fossemo, Nir Dothan, Denis Ollier Pinas, kubevirt-dev, Stu Gott, Kedar Bidarkar, Petr Horacek, Adam Litke, Jan Schintag, Siddu Vamsikrishna, Luboslav Pivarc
On Wed, 11 Mar 2026 at 14:18, Daniel Hiller <dhi...@redhat.com> wrote:
>
> [..]
>
> Not stupid at all! We decided looooong ago that we didn't want to have another set of quarantined lanes.
>
> Those lanes would duplicate the existing periodics since they require the exact same configuration; they would compete with the existing lanes for resources, would need to get bumped periodically and looked after, tooling would have to be adjusted, etc.
>
> I am not sure whether that is really worth the effort.
>
> WDYT?

Yeah I understand the overhead involved in maintaining separate jobs
but IMHO that would be outweighed by the improved insight the
resulting unpolluted data would give us into the periodic state of
unquarantined and quarantined tests.

I question the need for the unquarantined periodic jobs on main right
now anyway, given the sheer amount of incoming changes. They make
sense on less active branches, but could these jobs ever reveal something
we wouldn't already be seeing on main in PR and merge-related runs?

Cheers,

Lee

Daniel Hiller

Mar 13, 2026, 6:59:57 AM
to Lee Yarwood, Federico Fossemo, Nir Dothan, Denis Ollier Pinas, kubevirt-dev, Stu Gott, Kedar Bidarkar, Petr Horacek, Adam Litke, Jan Schintag, Siddu Vamsikrishna, Luboslav Pivarc
Fair point - however, there's a bunch of flakes that aren't quarantined, since they don't exceed the set boundaries for quarantining. I would also argue that the quarantined tests need to prove stable inside the main periodic test lanes, whereas a separate run, probably with a different lane configuration, might hide test dependency issues.
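
For context on those boundaries: quarantining only kicks in above a threshold, so a test can keep flaking without ever crossing it. A toy sketch of that kind of check (the thresholds and the function are made up for illustration; the real criteria come from the sig-ci process, not from here):

```python
def should_quarantine(failures_last_week, distinct_prs_affected,
                      min_failures=5, min_prs=3):
    """Toy threshold check: a test is only quarantined if it both fails
    often enough and hits enough independent PRs. The numbers here are
    invented for illustration, not the actual policy."""
    return failures_last_week >= min_failures and distinct_prs_affected >= min_prs

# A test failing twice a week on a single PR never crosses the bar and
# keeps flaking in the normal lanes - exactly the residual flakes meant above.
```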
 

> I question the need for the unquarantined periodic jobs on main right
> now anyway, given the sheer amount of incoming changes. They make
> sense on less active branches, but could these jobs ever reveal something
> we wouldn't already be seeing on main in PR and merge-related runs?


I see them as a baseline, so we can evaluate whether tests that flake on PRs are actually flaky in the periodics as well, which is hard to judge if you don't have that data. However, I'd be open to suggestions on how we can keep all this aligned while reducing the load. Also, compared to the number of presubmit runs, the periodics don't play that much of a role in resource consumption.
[two screenshots attached]

Best,
Daniel

 



--
Best,
Daniel

Daniel Hiller

Mar 13, 2026, 7:02:48 AM
to Lee Yarwood, Federico Fossemo, Nir Dothan, Denis Ollier Pinas, kubevirt-dev, Stu Gott, Kedar Bidarkar, Petr Horacek, Adam Litke, Jan Schintag, Siddu Vamsikrishna, Luboslav Pivarc