Alert not entering firing state


jkas...@gmail.com

Dec 20, 2016, 11:22:31 AM
to Prometheus Users
Howdy! We have an alert configured that never appears to transition from alertstate="pending" to alertstate="firing". We execute a script that performs an external service health check and exposes an endpoint with a failure count and a total count. We're using the following condition to trigger an alert when the rate of failures is non-zero for 5 minutes:


ALERT ExternalAlertServiceAvailabilityCritical
  IF sum(rate(execution_failure_count[2m])) by (script) / sum(rate(execution_count[2m])) by (script) > 0
  FOR 5m


When the condition is met, the execution failure count rate is equal to the execution count rate, and provides a consistent value of 1. This 1 > 0, when active for 5m, is intended to fire the alert. We're seeing the alert go into pending state, but then we see the alert count drop to 0 for ~20s before getting added again in pending state:



The scrape target doesn't go down, and the query from the condition reports a consistent 1 during this period. The scrape interval is 60s, the scrape timeout is 20s, and the evaluation interval is the default 15s. I'm hoping I'm missing something obvious here. Any insight into why this is happening, and what we can do to ensure that the alert moves from pending to firing?


Thanks!
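
For readers following along, here is a minimal Python sketch of why a brief gap matters for a rule with FOR 5m. It is not Prometheus code: `simulate_alert` and its parameters are hypothetical names, and it assumes that an evaluation in which the expression returns nothing drops the pending alert, so the FOR timer starts over. The 15s evaluation interval and 5m hold time are taken from the post above.

```python
# Minimal sketch of how a FOR clause gates pending -> firing.
# Assumption: an evaluation where the expression returns no result
# drops a pending alert, so the FOR timer restarts from zero.

def simulate_alert(results, eval_interval=15, for_duration=300):
    """Return the alert state after each evaluation.

    results: one bool per rule evaluation, True if the expression
             returned a sample for this alert.
    """
    states = []
    active_since = None  # time at which the alert became pending
    for i, has_result in enumerate(results):
        now = i * eval_interval
        if not has_result:
            # No result: the alert is dropped and the timer is lost.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        if now - active_since >= for_duration:
            states.append("firing")
        else:
            states.append("pending")
    return states


# 10 minutes of evaluations with a continuous result: the alert fires.
steady = simulate_alert([True] * 40)

# Same period, but one evaluation in every 4 minutes comes back empty.
gappy = simulate_alert([(i % 16) != 15 for i in range(40)])

print("steady:", steady[-1], "| gappy ever fired:", "firing" in gappy)
```

With a continuous result the alert reaches firing after 5 minutes; with one empty evaluation every 4 minutes it never gets past pending, which matches the symptom described above.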


Julius Volz

Dec 20, 2016, 12:38:23 PM
to jkas...@gmail.com, Prometheus Users
That's odd. Do you use any client-side timestamps that get older than 2 minutes? How often are you scraping the underlying data? If rate() does not always find at least two points within the 2m time window, the result could briefly disappear (while still superficially looking like a constant-1 graph with invisible gaps underneath). Also, I guess the same alert fires properly if you just change the alerting expression to "up"?

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscribe@googlegroups.com.
To post to this group, send email to prometheus-users@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/e0638661-868a-4a47-a12f-aafdf396aeab%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

jkas...@gmail.com

Dec 20, 2016, 4:08:06 PM
to Prometheus Users, jkas...@gmail.com
We're not using any client-side timestamps I'm aware of. We're using the script-exporter https://github.com/adhocteam/script_exporter, which collects the underlying data for each scrape (every ~60s).

Not sure if this is relevant: when we added the script exporter to the Prometheus exporter list, Brian Brazil recommended we use a gauge instead of a histogram for these metrics. We've implemented this (https://github.com/adhocteam/script_exporter/pull/5), but haven't yet integrated it, as we were hoping to understand the underlying issue before throwing a Hail Mary at the new approach. I'm not sure if it matters, but the `_count` metrics in the query are exposed by the histogram; it was my understanding that those are simply counters.

The exporter was up for the affected period. We haven't encountered any similar issues when using `up` gauges for alerting...




Julius Volz

Dec 20, 2016, 5:29:47 PM
to jkas...@gmail.com, Prometheus Users
Yeah, that wouldn't explain it.

When you graph "sum(rate(execution_failure_count[2m])) by (script) / sum(rate(execution_count[2m])) by (script) > 0" with a given resolution (say, 15s) and time range (say, 10m), could you check whether the returned AJAX response really contains data points every 15s, or whether there are points missing at some of the intervals?

Also, again: what's the scrape interval?

Cheers,
Julius
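
As a rough sketch of the check suggested above, the following Python function looks for missing steps in one series of a range-query response. It assumes the `values` layout returned by the Prometheus HTTP API `query_range` endpoint (a list of `[timestamp, "value"]` pairs); `find_gaps`, its `tolerance` parameter, and the sample data are hypothetical.

```python
# Sketch: find missing steps in one series of a range-query response.
# Assumes the `values` layout of the Prometheus HTTP API query_range
# endpoint: a list of [unix_timestamp, "value-as-string"] pairs.

def find_gaps(values, step, tolerance=0.5):
    """Return (previous_ts, next_ts) pairs spaced more than one step apart.

    tolerance (seconds) absorbs floating-point noise in the timestamps.
    """
    timestamps = [float(ts) for ts, _ in values]
    return [
        (prev, cur)
        for prev, cur in zip(timestamps, timestamps[1:])
        if cur - prev > step + tolerance
    ]


# Hypothetical series at 15s resolution with the points at t=45 and t=60 missing.
series_values = [[0, "1"], [15, "1"], [30, "1"], [75, "1"], [90, "1"]]

print(find_gaps(series_values, step=15))  # [(30.0, 75.0)]
```

An empty list means the series has a point at every step of the requested resolution; any returned pair marks the last point before a gap and the first point after it.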


jkas...@gmail.com

Dec 27, 2016, 5:16:26 PM
to Prometheus Users, jkas...@gmail.com
Thanks - I can confirm I see a consistent value of "1" at every 15s point for the entire period, with nothing missing. The scrape interval is 60s. This is super odd. Is there anything else you can think of checking?

Best,
James

Julius Volz

Dec 29, 2016, 8:37:16 PM
to James Kassemi, Prometheus Users
Hmm, not really. It would be interesting to see if you could reduce it to some reproducible case that others could try out to investigate (along with a bug report).

Not sure if that's feasible?
