Hi kubevirt-dev,
I'm requesting a code-freeze exception for PR #18004 - "Add VMNonRecoverableOSPanic alert."
This PR was approved and LGTM'd before the feature freeze but couldn't merge due to unrelated CI failures in the sig-monitoring lane.
Current state: The PR is approved + LGTM'd. All review comments have been addressed. The do-not-merge/hold label is the current merge blocker. Alert PRs do not require a VEP.
What this PR does: This PR adds a single Prometheus alert definition based on the existing kubevirt_vmi_guest_os_panic_total metric. It introduces no new runtime code - the change is 2 files, ~38 lines total (an alert rule and its promtool test). A companion runbook PR (kubevirt/monitoring#387) is already merged.
The alert fires when a VM experiences repeated non-recoverable guest OS panics (more than 5 in a 24-hour window), indicating crash-looping due to issues like driver bugs, filesystem corruption, or OS instability.
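For readers who have not opened the PR, a rule of this kind has roughly the shape below. This is an illustrative sketch only, not a copy of the rule in #18004: the group name, the use of increase() to express the 24-hour window, the severity label, and the summary text are all assumptions. Only the alert name, the metric name, and the "more than 5 in 24 hours" threshold come from the description above.

```yaml
# Illustrative sketch only; not copied from #18004.
groups:
  - name: kubevirt.vm.panic.sketch   # hypothetical group name
    rules:
      - alert: VMNonRecoverableOSPanic
        # Fires when more than 5 panics are counted over the last 24 hours.
        # Expressing the window with increase() is an assumption.
        expr: increase(kubevirt_vmi_guest_os_panic_total[24h]) > 5
        labels:
          severity: warning          # assumed severity
        annotations:
          # Assumed wording for the summary annotation.
          summary: "VM has had repeated guest OS panics in the last 24 hours"
```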
Why it matters to the community: KubeVirt currently has no way to notify users when their VMs experience guest OS panics. The kubevirt_vmi_guest_os_panic_total metric was added to capture these events (via pvpanic on Linux, Hyper-V on Windows), but without an alert, users must either actively query Prometheus or discover crashes after the fact. With RunStrategy: Always, the VM automatically restarts after a panic and appears healthy again, making it easy for crash-restart cycles to go unnoticed. This alert would be the first mechanism in KubeVirt to surface these events proactively.
Timeline and what went wrong:
- June 1 - PR created, originally with a new metric + alert
- June 8-10 - Review feedback led to simplification: the new metric was dropped, and the alert was reworked to use the existing kubevirt_vmi_guest_os_panic_total metric
- June 16 - Approved by @iholder101, LGTM'd by @sradco
- June 16-23 - CI failures on the sig-monitoring lane blocked merge. The failures were caused by a pre-existing Fedora memory mismatch bug, unrelated to this PR. The fix for it (#18119) was itself only able to merge 5 days ago, which also contributed to the delay.
- June 24 - Put on hold by @dhiller to prioritize approved-VEP PRs
The risk here is minimal: the PR is purely declarative - an alert definition and a promtool test, using an existing, already-shipped metric. It cannot break any existing functionality.
I understand that exceptions should be rare and I don't take this request lightly. However, given the minimal risk, the value it provides to users who currently have no visibility into guest OS panics, and the fact that the delay was largely caused by a pre-existing CI issue outside this PR, I believe this is a reasonable case for an exception. Looking back, we should have been more proactive in investigating the CI failures ourselves rather than relying on retests, and we'll take this as a lesson for future release cycles.
I appreciate the community's time in considering this request.
Thanks, Aviv