Hey Brian.
I'm still thinking about whether it is what I want - or not ;-)
Assume the following (arguably a bit made up) example:
One has a metric that counts the number of failed drives in a RAID. One drive fails, so an alert starts firing. Eventually the computing centre replaces the drive and it starts rebuilding (I guess it doesn't matter whether the rebuilding is still considered to cause an alert or not). Eventually the rebuild finishes and the alert should go away (and I should e.g. get a resolved message).
But because of keep_firing_for, it doesn't stop straight away.
Now before it does, yet another disk fails.
But for Prometheus, with keep_firing_for, it will look like the same alert.
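(Just to make sure we're talking about the same thing, this is roughly the kind of rule I have in mind - metric name, threshold and durations are only placeholders:)

```yaml
groups:
  - name: raid
    rules:
      - alert: RaidDiskFailed
        # Placeholder expression, e.g. something like node_exporter's mdadm metrics.
        expr: node_md_disks{state="failed"} > 0
        for: 5m
        # Keep the alert in the firing state for another 15 minutes after the
        # expression stops returning results, to bridge flapping/scrape gaps.
        keep_firing_for: 15m
```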
As said, this example is a bit made up, because even without keep_firing_for, I wouldn't see the next drive failure if it happens *while* the first one is still failing.
But the point is, I will lose follow-up alerts that come close after a previous one when I use keep_firing_for to solve the flapping problem.
Also, depending on how large I have to set keep_firing_for, resolved messages will arrive correspondingly later... which, depending on what one does with the alerts, may also be less desirable.
Well, what I meant is basically the same as above, just outside the flapping scenario (in which, I guess, the scrape failures never last longer than perhaps 1-10 minutes):
- Imagine I have a node with several alerts firing (e.g. again that some upgrades aren't installed yet, or the root fs utilisation is too high - things which typically persist unless there's some manual intervention).
- Also, I have e.g. set up my Alertmanager to repeat these alerts, say, once a week (to nag the admin to finally do something about it), roughly like the config fragment below.
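(Roughly this fragment of an Alertmanager config - the receiver name is just an example and assumed to be defined elsewhere:)

```yaml
route:
  receiver: admins          # assumed to exist in the receivers: section
  group_by: [alertname, instance]
  # Re-send notifications for alerts that are still firing once a week.
  repeat_interval: 7d
```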
What I'd expect should happen is e.g. the following:
- I already got the mails from the above alerts, so unless something changes, they should only be re-sent in a week.
- If one of those alerts resolves (e.g. someone frees up disk space), but disk usage goes over my threshold again later, I'd like a new notification - now, not just in a week.
(but back now to the situation where the alert has been firing since the first time and only one mail has been sent)
What if, e.g., I reboot the system? Maybe the admin upgraded the packages with security updates and also did some firmware upgrades, which can easily take a while (we have servers where that runs for an hour or so... o.O).
So the system is down for an hour, during which scraping fails (and the alert condition would be gone), and any reasonable keep_firing_for: (at least reasonable with its current semantics) will also have run out already.
The system comes up again, but the over-utilisation of the root fs is still there, and the alert that had already fired before starts again - or rather, continues to do so.
At that point, we cannot really know whether it's the same alert (i.e. the alert condition never resolved) or whether it's a new one (it did resolve but came back again).
(Well in my example we can be pretty sure it's the same one, since I rebooted - but generally speaking, when scrapes fail for a longer while, e.g. the network might be down, we cannot really know).
So the question is: what would one want? A new alert (and a new notification), or the same ongoing alert (and no new notification)?
Prometheus, of course, doesn't seem to track the state of an alert across such a gap (at least not as far as I understand it), except via keep_firing_for.
So it will always consider this a new alert (unless keep_firing_for is in effect).
But in practice, from a sysadmin PoV, I'd likely want it to consider the alert the same (and send no new notification until the week is up).
Why? Well, even if it was actually a new alert, I couldn't know anyway (since there was no monitoring data during the downtime). And I will know that something is fishy on the system, because I will have gotten another alert that it was down (so if I want, I can investigate in more depth).
Apart from that, newly sent notifications - after e.g. my little firmware-upgrade downtime - would be rather like spam.
It's basically the same scenario as above with the flapping alerts, just that it expectedly takes much longer, and one likely doesn't want to set keep_firing_for: (in its current implementation) to anything longer than ~10m, because otherwise resolved messages would come in quite delayed.
Again, I think a solution might be an alternative to keep_firing_for: which takes into account whether the respective target is up or not. And this should IMO even allow setting an infinite value, which would cause the alert to keep firing until Prometheus has, at least once, seen the expression return no result while the target was up.
"Target" might actually need to be targetS... if one combines several of them in an expression.