reset Alert from batch process / push-gateway


yariv...@gmail.com

Apr 29, 2021, 2:55:38 AM
to Prometheus Users
Hi,
I have a batch process that pushes metrics to the Pushgateway.
The batch process runs every day for ~2 minutes.
I also have alerts on these metrics, and they seem to work fine except in the following scenario.

When an alert fires, nothing "clears" it in Prometheus, and Alertmanager keeps sending notifications. The alert can only be cleared after 24 hours, when the batch process is triggered again.

I was hoping that "resolve_timeout: 5m" would solve this, but it does not.
Any idea how to deal with such a scenario?

thanks

Bjoern Rabenstein

May 4, 2021, 11:20:23 AM
to yariv...@gmail.com, Prometheus Users
Most alerts (or you could say: well designed alerts) fire for as long
as the alerting condition still applies. In your case, it sounds like
if the daily job fails, the alert will fire for a day.
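
To illustrate why the alert stays active, here is a toy Python model (not
Prometheus code). It assumes the Pushgateway keeps exposing the last pushed
value until the next push, and that the alert uses a hypothetical condition
`job_failed == 1`. The metric name, the condition, and the 6-hour evaluation
steps are assumptions chosen only for illustration.

```python
# Toy model, not real Prometheus code: the Pushgateway keeps exposing the
# last pushed value until a later push replaces it.

def alert_states(pushes, eval_hours):
    """Return {hour: firing?} for a hypothetical condition `job_failed == 1`.

    pushes: dict mapping hour -> pushed value (1 = failed, 0 = succeeded).
    eval_hours: iterable of hours at which the rule is evaluated.
    """
    states = {}
    last_value = None  # nothing has been pushed yet
    for hour in sorted(eval_hours):
        # Pick up the most recent push at or before this evaluation time.
        for pushed_at in sorted(pushes):
            if pushed_at <= hour:
                last_value = pushes[pushed_at]
        states[hour] = last_value == 1
    return states


# The job fails at hour 0 and succeeds on the next daily run at hour 24.
states = alert_states({0: 1, 24: 0}, eval_hours=range(0, 30, 6))
print(states)
# {0: True, 6: True, 12: True, 18: True, 24: False}
```

In this model the condition holds at every evaluation until the successful
push at hour 24 replaces the stale value.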

What happens if you trigger an immediate re-run? Would it clear the
alert if that re-run succeeds?

If you can really only run the job once per day, I'd say it is "the
right thing to do" to keep the alert firing until a run has finally
succeeded.

The usual workflow as an operator to say "I have seen the alert, and
now I'm working on it, but I'm aware that it is still firing" is to
place a silence for the expected time you need to fix the issue.
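
As a rough sketch of what placing such a silence could look like, the
Python below builds a request body in the shape accepted by Alertmanager's
v2 API (`POST /api/v2/silences`). The alert name `BatchJobFailed`, the
author, the comment, and the 4-hour duration are made-up placeholders; the
sketch only builds and prints the payload and does not contact any server.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(alertname, hours, author, comment, now=None):
    """Build a body for Alertmanager's POST /api/v2/silences endpoint."""
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        # Match only alerts whose alertname label equals the given name.
        "matchers": [
            {
                "name": "alertname",
                "value": alertname,
                "isRegex": False,
                "isEqual": True,
            }
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": author,
        "comment": comment,
    }


# Hypothetical values: adjust the alert name and duration to your setup.
payload = build_silence(
    "BatchJobFailed",
    hours=4,
    author="operator",
    comment="Investigating failed daily batch run",
)
print(json.dumps(payload, indent=2))
```

The silence expires on its own at `endsAt`, so if the job is still failing
by then, notifications resume.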

--
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email] bjo...@rabenste.in