When: Weekly on Wed, 9:45 – 10:15am
Attendees: dhiller, brianmcarey
Topics:
[urgent]
revisit previous action items
Daniel Hiller check network clustered failures issue state - https://github.com/kubevirt/kubevirt/issues/12898
Ammar: will look into whether there's a correlation with the kci bump
it seems the cluster is more heavily loaded now
bc: will take another look at the cluster load
talked with Howard from the ARM side - it seems the VMIs are taking longer to come up
edy: we might need some more profiling (of k8s?)
looks like something is taking more resources, but we can't confirm whether this is just a side effect; overall it takes more time to reconcile, and tests then fail due to timeouts
we are not sure why the load has increased
proposal: run profiling on a regular basis, in a holistic sense
question: should we ignore intermittent errors caused by the slowness? take this to the community meeting
Daniel Hiller question re VSOCK test PR: https://github.com/kubevirt/kubevirt/pull/12901
the failure rate seems to have decreased - decision: close the PR or not?
Daniel Hiller check: Storage lanes are now running with etcd in memory so we should no longer see etcd timeouts there.
we don’t see the timeouts on the presubmit lanes any more: https://search.ci.kubevirt.io/?search=etcdserver%3A+request+timed+out&maxAge=48h&context=1&type=build-log&name=&excludeName=periodic.*&maxMatches=5&maxBytes=20971520&groupBy=job
discuss imminent topics
Daniel Hiller ARM lane still failing
[bc] Howard might get to look at it this Friday
it runs on their hardware, so failing runs are just wasted resources
we could switch the job to always_run: false, so that it can still be triggered on demand
we might do this for release-1.3 and release-1.2 also, since those lanes are failing too
we could check whether a backport caused this, to help locate the issue
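The switch to always_run: false would be a small change in the prow job config; a hedged sketch (the job name and file layout here are illustrative, the real entries live in kubevirt/project-infra):

```yaml
# Illustrative presubmit entry; only always_run changes.
presubmits:
  kubevirt/kubevirt:
  - name: pull-kubevirt-e2e-arm64
    always_run: false
    # With always_run: false the job no longer runs on every PR,
    # but can still be triggered on demand with:
    #   /test pull-kubevirt-e2e-arm64
```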
Look at flakes
flake stats - create issues accordingly
we still saw rather high numbers of failures on Monday and Tuesday this week
sig-network: major clustered failure - 31 tests: https://prow.ci.kubevirt.io/view/gcs/kubevirt-prow/pr-logs/pull/kubevirt_kubevirt/12667/pull-kubevirt-e2e-k8s-1.29-sig-network/1838121657995104256
sig-compute-migrations: three clustered failures
the VSOCK test is also slightly flaky
will create issues for both tests, since both show up in the flake stats
dequarantine tests:
look at list of quarantined tests
check status, i.e. who is working on those
look at PRs that want to fix flakes
see whether we can dequarantine tests
Action items
Daniel Hiller create a tracker issue - evaluate holistic profiling
Daniel Hiller convert https://github.com/kubevirt/kubevirt/pull/12901 into issue, close PR
Brian Carey increase the range of stored metrics on the workloads cluster
Brian Carey arm lane - switch to always_run: false for main, release-1.3 and release-1.2
Daniel Hiller what commitment do we have w.r.t. the ARM architecture?
Daniel Hiller get back to the terminationGracePeriod test question - who is working on it?
https://storage.googleapis.com/kubevirt-prow/reports/quarantined-tests/kubevirt/kubevirt/index.html
Daniel Hiller create flake issues
Kind regards,
Daniel Hiller
He / Him / His
Senior Software Engineer, KubeVirt CI, OpenShift Virtualization