ALERT ExternalAlertServiceAvailabilityCritical
IF sum(rate(execution_failure_count[2m])) by (script) / sum(rate(execution_count[2m])) by (script) > 0
FOR 5m

The scrape target doesn't go down, and the query from the conditional reports a consistent 1 during this period. The scrape interval is 60s, the scrape timeout is 20s, and the execution interval is the default 15s. I'm hoping there's something obvious here I'm missing - any insight into why this is happening and what we can do to ensure that the alert moves from pending to firing?
Thanks!
--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscribe@googlegroups.com.
To post to this group, send email to prometheus-users@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/e0638661-868a-4a47-a12f-aafdf396aeab%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
That's odd. Do you use any client-side timestamps that get older than 2 minutes? How often are you scraping the underlying data? If rate() does not always find at least two points within the 2m time window, the result could disappear completely for a brief moment (and it would still superficially look like a constant-1 graph, with invisible gaps underneath). Also, I guess the same alert fires properly if you just change the alerting expression to "up"?
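To make the "fewer than two points in the window" case concrete, here is a minimal Python sketch. It does not use Prometheus itself. The scrape timestamps are made up (one scrape arriving 5s late), the helper name samples_in_window is hypothetical, and treating the window as open on the left and closed on the right is an assumption about boundary handling. It only counts how many samples land in the trailing window at each 15s evaluation.

```python
def samples_in_window(scrape_times, eval_time, window=120.0):
    """Return the scrape timestamps inside the trailing window.

    Assumption: the window is (eval_time - window, eval_time].
    """
    return [t for t in scrape_times if eval_time - window < t <= eval_time]


# Hypothetical scrape timestamps in seconds: nominally every 60s,
# but the third scrape arrives 5s late (125 instead of 120).
scrapes = [0.0, 60.0, 125.0, 180.0, 240.0]

# Rule evaluations every 15s: 107, 122, 137, ..., 212.
evals = [107.0 + 15.0 * i for i in range(8)]

# Evaluations where a 2m window holds fewer than two samples,
# so a rate over that window would have nothing to return.
gaps_2m = [t for t in evals if len(samples_in_window(scrapes, t, 120.0)) < 2]

# Same check with a wider 5m window, for comparison.
gaps_5m = [t for t in evals if len(samples_in_window(scrapes, t, 300.0)) < 2]

print("evaluations with <2 samples in 2m:", gaps_2m)
print("evaluations with <2 samples in 5m:", gaps_5m)
```

In this made-up timeline only the evaluation at t=122 sees a single sample in the 2m window, and none do with the 5m window. If the expression returns nothing at even one evaluation, the pending alert is dropped and the FOR timer starts over, which would match the short drop to 0 described in the question. Whether this is what is actually happening here would still need to be confirmed against the real scrape timestamps.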
On Tue, Dec 20, 2016 at 5:22 PM, <jkas...@gmail.com> wrote:
Howdy! We have an alert configured that doesn't ever appear to transition from alertstate="pending" to alertstate="firing". We execute a script that performs an external service health check and exposes an endpoint with a failure count and total count. We're using the following condition to trigger an alert when the rate of failures is non-zero for a :

ALERT ExternalAlertServiceAvailabilityCritical
IF sum(rate(execution_failure_count[2m])) by (script) / sum(rate(execution_count[2m])) by (script) > 0
FOR 5m

When the condition is met, the execution failure count rate is equal to the execution count rate, and provides a consistent value of 1. This 1 > 0, when active for 5m, is intended to fire the alert. We're seeing the alert go into pending state, but then we see the alert count drop to 0 for ~20s before getting added again in pending state:
The scrape target doesn't go down, and the query from the conditional reports a consistent 1 during this period. The scrape interval is 60s, the scrape timeout is 20s, and the execution interval is the default 15s. I'm hoping there's something obvious here I'm missing - any insight into why this is happening and what we can do to ensure that the alert moves from pending to firing?
Thanks!