Repeat alerts not firing

Ionel Sirbu

Jun 25, 2022, 10:42:08 AM
to Prometheus Users
Hello,

I'm trying to set up some alerts that fire on critical errors, so I'm aiming for immediate & consistent reporting wherever possible.

To that end, I defined the alert rule without a for clause:

groups:
- name: Test alerts
  rules:
  - alert: MyService Test Alert
    expr: 'sum(error_counter{service="myservice",other="labels"} unless error_counter{service="myservice",other="labels"} offset 1m) > 0
     or sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0'


Prometheus is configured to scrape & evaluate every 10s:

global:
  scrape_interval: 10s
  scrape_timeout: 10s
  evaluation_interval: 10s


And the alert manager (docker image quay.io/prometheus/alertmanager:v0.23.0) is configured with these parameters:

route:
  group_by: ['alertname', 'node_name']
  group_wait: 30s
  group_interval: 1m # used to be 5m
  repeat_interval: 2m # used to be 3h


Now what happens when testing is this:
- on the very first metric generated, the alert fires as expected;
- on subsequent tests it stops firing;
- I kept on running a new test each minute for 20 minutes, but no alert fired again;
- I can see the alert state going into FIRING in the alerts view in the Prometheus UI;
- I can see the metric values getting generated when executing the expression query in the Prometheus UI;

I redid the same test suite after a 2-hour break & exactly the same thing happened, including the fact that the alert fired on the first test!

What am I missing here? How can I make the alert manager fire that alert on repeated error metric hits? Ok, it doesn't have to be as often as every 2m, but let's go with that for testing's sake.

Pretty please, any advice is much appreciated!

Kind regards,
Ionel

Brian Candler

Jun 25, 2022, 11:52:17 AM
to Prometheus Users
Try putting the whole alerting "expr" into the PromQL query browser, and switching to graph view.

This will show you the alert vector graphically, with a separate line for each alert instance.  If this isn't showing multiple lines, then you won't receive multiple alerts.  Then you can break your query down into parts and try them individually to understand why it's not working as you expect.

Looking at just part of your expression:

sum(error_counter{service="myservice",other="labels"} unless error_counter{service="myservice",other="labels"} offset 1m) > 0

And taking just the part inside sum():

error_counter{service="myservice",other="labels"} unless error_counter{service="myservice",other="labels"} offset 1m

This expression is weird. It will only generate a value when the error counter first springs into existence.  As soon as it has existed for more than 1 minute - even with value zero - then the "unless" clause will suppress the expression completely, i.e. it will be an empty instance vector.

I think this is probably not what you want.  In any case it's not a good idea to have timeseries which come and go; it's very awkward to alert on a timeseries appearing or disappearing, and you may have problems with staleness, i.e. the timeseries may continue to exist for 5 minutes after you've stopped generating points in it.

It's much better to have a timeseries which continues to exist.  That is, "error_counter" should spring into existence with value 0, and increment when errors occur, and stop incrementing when errors don't occur - but continue to keep the value it had before.

If your error_counter timeseries *does* exist continuously, then this 'unless' clause is probably not what you want.
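If it does exist continuously, the rate() half of your rule on its own should keep the alert firing for as long as errors keep arriving - roughly something like this (untested sketch, reusing the placeholder labels from your example):

groups:
- name: Test alerts
  rules:
  - alert: MyService Test Alert
    # with a continuously-existing counter, rate() returns a value on every
    # evaluation; the > 0 filter keeps the alert in the result vector only
    # while errors have occurred within the last 1m window
    expr: 'sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0'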

Ionel Sirbu

Jun 27, 2022, 4:39:45 AM
to Prometheus Users
Hi Brian,

Thanks for your reply! To be honest, you can pretty much ignore that first part of the expression; it doesn't change anything in the "repeat" behaviour. In fact, we don't even have that bit at the moment - it's just something I've been playing with in order to capture that very first springing into existence of the metric, which isn't covered by the current expression, sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0.
Also, I've already done the PromQL graphing that you suggested; I could see those multiple lines you were talking about, but there was still no alert firing... 🤷‍♂️

Any other pointers?

Thanks,
Ionel

Brian Candler

Jun 27, 2022, 4:59:59 AM
to Prometheus Users
I suspect the easiest way to debug this is to focus on "repeat_interval: 2m".  Even if a single alert is statically firing, you should get the same notification resent every 2 minutes.  So don't worry about catching second instances of the same expr; just set a simple alerting expression which fires continuously, say just "expr: vector(0)", to find out why it's not resending.
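Something like this is enough for that test (sketch - the group and alert names are arbitrary):

groups:
- name: Debug alerts
  rules:
  - alert: AlwaysFiring
    # vector(0) always returns exactly one sample, so this alert fires on every
    # evaluation; with repeat_interval: 2m you should then see the notification
    # resent roughly every couple of minutes
    expr: vector(0)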

You can then look at logs from alertmanager (e.g. "journalctl -eu alertmanager" if running under systemd). You can also look at the metrics alertmanager itself generates:

    curl localhost:9093/metrics | grep alertmanager

Hopefully, one of these may give you a clue as to what's happening (e.g. maybe your mail system or other notification endpoint has some sort of rate limiting??).

However, if the vector(0) expression *does* send repeated alerts successfully, then your problem is most likely something to do with your actual alerting expr, and you'll need to break it down into simpler pieces to debug it.

Apart from that, all I can say is "it works for me™": if an alerting expression subsequently generates a second alert in its result vector, then I get another alert after group_interval.

Ionel Sirbu

Jun 27, 2022, 11:28:46 AM
to Prometheus Users
Ok, I added a rule with an expression of vector(1); it went live at 12:31, when it fired 2 alerts (?!), but then went completely silent until 15:36, when it fired again 2x (so more than 3 h in). The alert has been stuck in the FIRING state the whole time, as expected.
Unfortunately the logs don't shed any light - there's nothing logged aside from the bootstrap logs. It isn't a systemd process - it's run in a container & there seems to be just a big executable in there.
The meta-metrics contain quite a lot of data - any particulars I should be looking for?

Either way, I'm now inclined to believe this is definitely an alertmanager settings matter. As I mentioned in my initial email, I've already tweaked group_wait, group_interval & repeat_interval, but they probably didn't take effect as I thought they would. So maybe that's something I need to sort out. Better logging should also help me understand all of that, which I still need to figure out how to do.

Thank you very much for your help Brian!

Brian Candler

Jun 27, 2022, 12:20:29 PM
to Prometheus Users
Look at the container logs then.

Metrics include things like the number of notifications attempted, succeeded and failed.  Those would be the obvious first place to look.  (For example: is it actually trying to send a mail? if so, is it succeeding or failing?)

Aside: vector(0) and vector(1) are the same for generating alerts. It's only the presence of a value that triggers an alert; the actual value itself can be anything.

Ionel Sirbu

Jun 28, 2022, 11:12:53 AM
to Prometheus Users
Hi Brian,

So my previous assumption proved to be correct - it was in fact the alertmanager settings that weren't getting properly applied on the fly. Today I ensured they were applied in a guaranteed way & I can see the alerts firing every 6 minutes now, for these settings:
    group_wait: 30s
    group_interval: 2m
    repeat_interval: 5m

Now I'm trying to sort out the fact that the alerts fire twice each time. We have some form of HA in place, where we spawn 2 pods for the alertmanager; looking at their logs, I can see that each container fires the alert, which explains why I see 2 of them:

prometheus-alertmanager-0 level=debug ts=2022-06-28T14:27:40.121Z caller=notify.go:735 component=dispatcher receiver=pager integration=slack[0] msg="Notify success" attempts=1
prometheus-alertmanager-1 level=debug ts=2022-06-28T14:27:40.418Z caller=notify.go:735 component=dispatcher receiver=pager integration=slack[0] msg="Notify success" attempts=1


Any idea why that is?

Thank you!

Brian Candler

Jun 28, 2022, 2:02:26 PM
to Prometheus Users
It's correct for prometheus to send alerts to both alertmanagers, but I suspect you haven't got the alertmanagers clustered together correctly.
See: https://prometheus.io/docs/alerting/latest/alertmanager/#high-availability

Make sure you've configured the cluster flags, and check your alertmanager container logs for messages relating to clustering or "gossip".
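On Kubernetes that usually means passing the cluster flags to each replica, along these lines (a rough sketch - the headless service name here is made up, adjust to your setup):

# alertmanager container args for a 2-replica cluster (service name is hypothetical)
args:
  - --config.file=/etc/alertmanager/alertmanager.yml
  - --cluster.listen-address=0.0.0.0:9094
  - --cluster.peer=prometheus-alertmanager-0.alertmanager-headless:9094
  - --cluster.peer=prometheus-alertmanager-1.alertmanager-headless:9094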

Ionel Sirbu

Jul 13, 2022, 4:21:25 AM
to Prometheus Users
Hi Brian,

Indeed, that was the issue this time - we didn't have HA properly configured. All seems to work fine after adjusting accordingly.
Thank you very much!
