[Exception Request] Add VMNonRecoverableOSPanic alert (PR #18004) for v1.9


Aviv Litman

Jun 29, 2026, 7:32:26 AM
to kubevi...@googlegroups.com, Fabian Deutsch, Kedar Bidarkar, Luboslav Pivarc, Jed Lejosne, Daniel Hiller, Federico Fossemo, Tal Nisan, Stu Gott, Aviv Litman, João Vilaça, Itamar Holder, Igor Bezukh
Hi kubevirt-dev,
I'm requesting a code-freeze exception for PR #18004 - "Add VMNonRecoverableOSPanic alert."
This PR was approved and LGTM'd before the feature freeze but couldn't merge due to unrelated CI failures in the sig-monitoring lane.

Current state:
The PR is approved + LGTM'd. All review comments have been addressed. The do-not-merge/hold label is the current merge blocker. Alert PRs do not require a VEP.

What this PR does:
This PR adds a single Prometheus alert definition based on the existing kubevirt_vmi_guest_os_panic_total metric. It introduces no new runtime code - the change is 2 files, ~38 lines total (an alert rule and its promtool test). A companion runbook PR (kubevirt/monitoring#387) is already merged.

The alert fires when a VM experiences repeated non-recoverable guest OS panics (more than 5 in a 24-hour window), indicating crash-looping due to issues like driver bugs, filesystem corruption, or OS instability.
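For readers unfamiliar with this kind of threshold, the condition described above can be sketched roughly as follows. This is an illustrative model only, not the rule from PR #18004: the function names, the sample format, and the simplified increase calculation are assumptions, and Prometheus's own `increase()` additionally extrapolates to the window boundaries.

```python
from typing import List, Tuple

WINDOW_SECONDS = 24 * 60 * 60  # 24-hour lookback, as described in the request
PANIC_THRESHOLD = 5            # the alert fires on "more than 5" panics

# Hypothetical sample format: (unix_timestamp, counter_value), sorted by time.
Sample = Tuple[float, float]


def panics_in_window(samples: List[Sample], now: float) -> float:
    """Approximate the increase of a panic counter over the last 24 hours.

    Simplified model: sums the deltas between consecutive samples inside
    the window and treats any drop in value as a counter reset.
    """
    recent = [(t, v) for t, v in samples if now - WINDOW_SECONDS <= t <= now]
    total = 0.0
    for (_, prev), (_, cur) in zip(recent, recent[1:]):
        # A lower value than the previous sample means the counter restarted
        # from zero, so the whole current value counts as new panics.
        total += cur - prev if cur >= prev else cur
    return total


def should_fire(samples: List[Sample], now: float) -> bool:
    """Return True when more than PANIC_THRESHOLD panics fall in the window."""
    return panics_in_window(samples, now) > PANIC_THRESHOLD
```

Under this model, a VM whose counter goes from 0 to 6 within the window would trigger the alert, while one that stops at exactly 5 would not.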

Why it matters to the community:
KubeVirt currently has no way to notify users when their VMs experience guest OS panics. The kubevirt_vmi_guest_os_panic_total metric was added to capture these events (via pvpanic on Linux, Hyper-V on Windows), but without an alert, users must either actively query Prometheus or discover crashes after the fact. With RunStrategy: Always, the VM automatically restarts after a panic and appears healthy again, making it easy for crash-restart cycles to go unnoticed. This alert is the first and only mechanism in KubeVirt to surface these events proactively.

Timeline and what went wrong:
  • June 1 - PR created, originally with a new metric + alert
  • June 8-10 - Review feedback led to simplification: the new metric was dropped, and the alert was reworked to use the existing kubevirt_vmi_guest_os_panic_total metric
  • June 16 - Approved by @iholder101, LGTM'd by @sradco
  • June 16-23 - CI failures on the sig-monitoring lane blocked merge. The failures were caused by a pre-existing Fedora memory mismatch bug, unrelated to this PR. The fix for it (#18119) was itself only able to merge 5 days ago, which also contributed to the delay.
  • June 24 - Put on hold by @dhiller to prioritize approved-VEP PRs

The risk here is minimal: the PR is purely declarative - an alert definition and a promtool test, using an existing already-shipped metric. It cannot break any existing functionality.

I understand that exceptions should be rare and I don't take this request lightly. However, given the minimal risk, the value it provides to users who currently have no visibility into guest OS panics, and the fact that the delay was largely caused by a pre-existing CI issue outside this PR, I believe this is a reasonable case for an exception. Looking back, we should have been more proactive in investigating the CI failures ourselves rather than relying on retests, and we'll take this as a lesson for future release cycles.

I appreciate the community's time in considering this request.
Thanks, Aviv

--

Aviv Litman

BI Software Engineer

Red Hat

ali...@redhat.com   

Itamar Holder

Jun 29, 2026, 12:54:11 PM
to Aviv Litman, kubevi...@googlegroups.com, Fabian Deutsch, Kedar Bidarkar, Luboslav Pivarc, Jed Lejosne, Daniel Hiller, Federico Fossemo, Tal Nisan, Stu Gott, João Vilaça, Igor Bezukh
Hi Aviv,

Thanks for the well-written exception request.

In principle, a PR missing the merge window due to CI load during code-freeze is a reasonable basis for an exception.

However, I think this case is a bit different:
- The PR was LGTM'd and approved on June 16 - two full weeks before the freeze. That's a significant window.
- The failing lane was the sig-monitoring lane itself.
- No one from sig-CI was aware of the fix attempt (#17832) - they were neither pinged on the PR nor received any other message about it. No one from sig-observability showed up to the sig-CI meeting or commented on the meeting notes where the quarantine was first discussed (June 17). The fix that actually landed (#18119) was handled entirely by sig-CI.
- After recurring unnecessary retests, the lane was eventually overridden rather than fixed, which sig-CI has said will not happen again.

In short - there were opportunities to resolve this earlier, and the responsibility for doing so sat closer to sig-monitoring than the timeline suggests.

That said, I lean toward approving this exception for two reasons. First, I told you previously that being LGTM'd and approved before the freeze should be sufficient, which created confusion - and I take responsibility for that. Second, the PR is genuinely low risk - a declarative alert rule and a promtool test, nothing more.

I do want to be clear that the process here wasn't right and shouldn't repeat. For future cycles, I'd expect the owning SIG to proactively investigate CI failures on their own lanes, coordinate with sig-CI visibly (meetings or async), and raise blockers publicly and early.

BR,
Itamar

Federico Fossemo

3:03 AM
to Itamar Holder, Aviv Litman, kubevi...@googlegroups.com, Fabian Deutsch, Kedar Bidarkar, Luboslav Pivarc, Jed Lejosne, Daniel Hiller, Tal Nisan, Stu Gott, João Vilaça, Igor Bezukh
Hey,

I have nothing to add to what Itamar said; I fully agree.

Thank you,
Federico

Aviv Litman

5:14 AM
to kubevirt-dev
Thanks a lot, Itamar and Federico.
I really appreciate your feedback. I'll make sure to follow the guidelines you provided in future releases. I take your feedback seriously and will use it to continue improving.