Hello,
I'm trying to set up alerts that fire on critical errors, and I'm aiming for reporting that is as immediate and consistent as possible.
To that end, I defined the alert rule without a for clause:
groups:
  - name: Test alerts
    rules:
      - alert: MyServiceTestAlert
        expr: 'sum(error_counter{service="myservice",other="labels"} unless error_counter{service="myservice",other="labels"} offset 1m) > 0
          or sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0'
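For what it's worth, the rule file loads without complaints; anyone reproducing this can sanity-check it with promtool (the path below is just a placeholder for wherever the file lives):

promtool check rules /etc/prometheus/rules/test-alerts.yml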
Prometheus is configured to scrape and evaluate every 10s:

global:
  scrape_interval: 10s
  scrape_timeout: 10s
  evaluation_interval: 10s

and the Alertmanager route looks like this:

route:
  group_by: ['alertname', 'node_name']
  group_wait: 30s
  group_interval: 1m # used to be 5m
  repeat_interval: 2m # used to be 3h
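For completeness, that route block sits in alertmanager.yml roughly like this; the receiver name and webhook URL are placeholders, not my real ones:

route:
  receiver: 'team-notifications'
  group_by: ['alertname', 'node_name']
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 2m
receivers:
  - name: 'team-notifications'
    webhook_configs:
      - url: 'http://alert-receiver.internal/hook'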
Here's what happens when testing:
- on the very first metric generated, the alert notification arrives as expected;
- on subsequent tests, no notification arrives;
- I kept running a new test every minute for 20 minutes, but no notification was delivered;
- I can see the alert going into the FIRING state in the Alerts view of the Prometheus UI;
- I can see the metric values being generated when running the expression query in the Prometheus UI.
I redid the same test suite after a two-hour break and exactly the same thing happened, including the notification arriving on the first test!
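In case it helps narrow things down, both sides can be inspected while a test runs (assuming the default ports 9090 and 9093):

curl -s http://localhost:9090/api/v1/alerts   # alerts Prometheus considers firing
curl -s http://localhost:9093/api/v2/alerts   # alerts that actually reached Alertmanager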
What am I missing here? How can I make Alertmanager send a notification for that alert on repeated error metric hits? It doesn't have to be as frequent as every 2m, but let's assume that for testing's sake.
Pretty please, any advice is much appreciated!
Kind regards,
Ionel