On Fri, 23 Oct 2020 at 22:41, Jimmy the Greek <matthe...@gmail.com> wrote:
[...]
> As you can see I run prom QL test rate(coredns_nodecache_setup_errors_total{}[5m]) that evaluates to 1.666. Therefore when I test NodeLocalDNSSetupErrorsHigh which will trigger when that value is above 0 for 5 minute period the test only passes if I set eval_time to 6m, and fails if I set it to 5m (alert doesn't trigger).
>
> What is the relation between the for time in the alert rule itself and the eval_time in the test?
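In this case "for: 5m" means that the alert condition has to hold
for 5 minutes before the alert fires; until then the alert is pending.
Because rate() needs two samples in order to calculate a rate, your
rate() expression only starts to return a value at 1m, when the second
sample is available. The alert becomes pending at that point, and it
starts firing when your rules are evaluated at 6m (1m + 5m).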
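To make the timing concrete, here is a rough sketch of a test file for this case. The file name alerts.yml, the series values and the expected labels are assumptions on my side rather than something taken from your setup, so adjust them to match your rule (and add exp_annotations if your rule sets annotations):

```yaml
# Hypothetical rule file containing the NodeLocalDNSSetupErrorsHigh rule.
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Assumed counter that increases every minute; values are illustrative.
      - series: 'coredns_nodecache_setup_errors_total{errortype="configmap", pod="unit-test"}'
        values: '0+100x10'
    alert_rule_test:
      # At 5m the alert has only been pending for 4 minutes (since 1m),
      # so no firing alert is expected; exp_alerts is left out.
      - eval_time: 5m
        alertname: NodeLocalDNSSetupErrorsHigh
      # At 6m the condition has held for 5 minutes, so the alert fires.
      - eval_time: 6m
        alertname: NodeLocalDNSSetupErrorsHigh
        exp_alerts:
          - exp_labels:
              severity: critical
              errortype: configmap
              pod: unit-test
```

With these inputs the alert stays pending from 1m until it fires at 6m, which is why only the 6m check expects it.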
Aside: if you get the promtool from 2.22.0, it's now possible to look
at the ALERTS time series, including pending alerts where the "for"
threshold hasn't been reached. I wouldn't recommend that you actually
test the "for" threshold in rules in most cases (you're kind of testing
Prometheus rather than your rules then), but it is possible to
temporarily add a test for debugging like:
    - expr: ALERTS{alertstate="pending"}
      eval_time: 5m
The failure output of that test will tell you that your alert is
pending at that point, e.g.:
    expr: "ALERTS{alertstate=\"pending\"}", time: 5m,
        exp:"nil"
        got:"{__name__=\"ALERTS\", alertname=\"NodeLocalDNSSetupErrorsHigh\", alertstate=\"pending\", errortype=\"configmap\", pod=\"unit-test\", severity=\"critical\"} 1E+00"
David