what to do about flapping alerts?


Christoph Anton Mitterer

Apr 5, 2024, 11:03:07 PM
to Prometheus Users
Hey.

I have some simple alerts like:
    - alert: node_upgrades_non-security_apt
      expr:  'sum by (instance,job) ( apt_upgrades_pending{origin!~"(?i)^.*-security(?:\\PL.*)?$"} )'
    - alert: node_upgrades_security_apt
      expr:  'sum by (instance,job) ( apt_upgrades_pending{origin=~"(?i)^.*-security(?:\\PL.*)?$"} )'

If there are no upgrades, these return no value.
Similarly, for all other simple alerts, like free disk space:
1 - node_filesystem_avail_bytes{mountpoint="/", fstype!="rootfs", instance!~"(?i)^.*\\.garching\\.physik\\.uni-muenchen\\.de$"} / node_filesystem_size_bytes  >  0.80

No value => all ok, some value => alert.

I do have some instances which are pretty unstable (i.e. scraping fails every now and then - or more often than that), which are however mostly out of my control, so I cannot do anything about that.

When the target goes down, the alert clears and as soon as it's back, it pops up again, sending a fresh alert notification.

Now I've seen keep_firing_for, which is described as "the minimum amount of time that an alert should remain firing, after the expression does not return any results", respectively in https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#rule :
# How long an alert will continue firing after the condition that triggered it
# has cleared.
[ keep_firing_for: <duration> | default = 0s ]
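
So, if I understand it correctly, I'd simply add it to one of the rules above, e.g. (the duration is just an example value):
    - alert: node_upgrades_security_apt
      expr:  'sum by (instance,job) ( apt_upgrades_pending{origin=~"(?i)^.*-security(?:\\PL.*)?$"} )'
      keep_firing_for: 10m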

but AFAIU that would simply affect all alerts, i.e. it wouldn't just keep firing, when the scraping failed, but also when it actually goes back to an ok state, right?
That's IMO however rather undesirable.

Similarly, when a node goes completely down (maintenance or so) and then up again, all alerts would then start again to fire (and even a generous keep_firing_for would have been exceeded)... and send new notifications.


Is there any way to solve this? Especially so that one doesn't get new notifications sent when the alert never really stopped?

At least I wouldn't understand how keep_firing_for would do this.

Thanks,
Chris.

Brian Candler

Apr 6, 2024, 3:33:27 AM
to Prometheus Users
> but AFAIU that would simply affect all alerts, i.e. it wouldn't just keep firing, when the scraping failed, but also when it actually goes back to an ok state, right?

It affects all alerts individually, and I believe it's exactly what you want. A brief flip from "failing" to "OK" doesn't resolve the alert; it only resolves if it has remained in the "OK" state for the keep_firing_for duration. Therefore you won't get a fresh alert until it's been OK for at least keep_firing_for and *then* fails again.

As you correctly surmise, an alert isn't really a boolean condition, it's a presence/absence condition: the expr returns a vector of 0 or more alerts, each with a unique combination of labels.  "keep_firing_for" retains a particular labelled value in the vector for a period of time even if it's no longer being generated by the alerting "expr".  Hence if it does reappear in the expr output during that time, it's just a continuation of the previous alert.

> Similarly, when a node goes completely down (maintenance or so) and then up again, all alerts would then start again to fire (and even a generous keep_firing_for would have been exceeded)... and send new notifications.

I don't understand what you're saying here. Can you give some specific examples?

If you have an alerting expression like "up == 0" and you take 10 machines down then your alerting expression will return a vector of ten zeros and this will generate ten alerts (typically grouped into a single notification, if you use the default alertmanager config)

When they revert to up == 1 then they won't "start again to fire", because they were already firing. Indeed, it's almost the opposite. Let's say you have keep_firing_for: 10m, then if any machine goes down in the 10 minutes after the end of maintenance then it *won't* generate a new alert, because it will just be a continuation of the old one.

However, when you're doing maintenance, you might also be using silences to prevent notifications. In that case you might want your silence to extend 10 minutes past the end of the maintenance period.

Christoph Anton Mitterer

Apr 8, 2024, 3:57:34 PM
to Prometheus Users
Hey Brian.

On Saturday, April 6, 2024 at 9:33:27 AM UTC+2 Brian Candler wrote:
>> but AFAIU that would simply affect all alerts, i.e. it wouldn't just keep firing, when the scraping failed, but also when it actually goes back to an ok state, right?

> It affects all alerts individually, and I believe it's exactly what you want. A brief flip from "failing" to "OK" doesn't resolve the alert; it only resolves if it has remained in the "OK" state for the keep_firing_for duration. Therefore you won't get a fresh alert until it's been OK for at least keep_firing_for and *then* fails again.

I'm still thinking whether it is what I want - or not ;-)

Assume the following (arguably a bit made up) example:
One has a metric that counts the number of failed drives in a RAID. One drive fails so some alert starts firing. Eventually the computing centre replaces the drive and it starts rebuilding (guess it doesn't matter whether the rebuilding is still considered to cause an alert or not). Eventually it finishes and the alert should go away (and I should e.g. get a resolved message).
But because of keep_firing_for, it doesn't stop straight away.
Now before it does, yet another disk fails.
But for Prometheus, with keep_firing_for, it will be like the same alert.

As said, this example is a bit made up, because even without keep_firing_for, I wouldn't see the next device if it fails *while* the first one is still failing.
But the point is, I will lose follow-up alerts that come close after a previous one when I use keep_firing_for to solve the flapping problem.
Also, depending on how large I have to set keep_firing_for, I will also get resolve messages later... which depending on what one does with the alerts may also be less desirable.


 
> As you correctly surmise, an alert isn't really a boolean condition, it's a presence/absence condition: the expr returns a vector of 0 or more alerts, each with a unique combination of labels.  "keep_firing_for" retains a particular labelled value in the vector for a period of time even if it's no longer being generated by the alerting "expr".  Hence if it does reappear in the expr output during that time, it's just a continuation of the previous alert.

I think the main problem behind may be rather a conceptual one, namely that Prometheus uses "no data" for no alert, which happens as well when there is no data because of e.g. scrape failures, so it can’t really differentiate between the two conditions.

What one would IMO need is a keep_firing_for that works only while the target is down. But as soon as it comes up again (even if just for one scrape), the effect would be gone and the alert would stop firing immediately (unless, of course, the expression still returns a value).
Wouldn't that make sense?
 

>> Similarly, when a node goes completely down (maintenance or so) and then up again, all alerts would then start again to fire (and even a generous keep_firing_for would have been exceeded)... and send new notifications.
> I don't understand what you're saying here. Can you give some specific examples?

Well, what I meant is basically the same as above, just outside of the flapping scenario (in which, I guess, the scrape failures never last longer than perhaps 1-10 mins):
- Imagine I have a node with several alerts firing (e.g. again that some upgrades aren't installed yet, or root fs has too much utilisation, things which typically last unless there's some manual intervention).
- Also, I have e.g. set my Alertmanager to repeat these alerts, say, once a week (to nag the admin to finally do something about it).

What I'd expect should happen is e.g. the following:
- I already got the mails from the above alerts, so unless something changes, they should only be re-sent in a week.
- If one of those alerts resolves (e.g. someone frees up disk space), but disk space runs over my threshold again later, I'd like a new notification - now, not just in a week.
(but back now to the situation where the alert is still firing from the first time and only one mail has been sent)

Say, e.g., I reboot the system. Maybe the admin upgraded the packages with the security updates and also did some firmware upgrades, which may easily take a while (we have servers where that runs for an hour or so... o.O).
So the system is down for one hour, in which scraping fails (and the alert condition would be gone), and any reasonable keep_firing_for (at least reasonable with its current semantics) will also have run out already.

The system comes up again, but the over-utilisation of the root fs is still there, and the alert that had already fired before starts firing again - respectively, continues to do so.

At that point, we cannot really know whether it's the same alert (i.e. the alert condition never resolved) or whether it's a new one (it did resolve but came back again).
(Well, in my example we can be pretty sure it's the same one, since I rebooted - but generally speaking, when scrapes fail for a longer while, e.g. because the network is down, we cannot really know.)

So question is, what would one want? A new alert (and new notification) or the same ongoing alert (and no new notification)?

Prometheus of course doesn't seem to track the state of an alert (at least not as far as I understand), except for keep_firing_for.
So it will always consider this a new alert (unless keep_firing_for is in effect).

But in practice, from a sysadmin PoV, I'd likely want it to consider the alert the same (and send no new notification until the weekly repeat).
Why? Well, even if it was actually a new alert, I cannot know that anyway (since there was no monitoring in between). And I will know that something is fishy on the system, because I will have gotten another alert that it was down (so if I want, I can investigate in more depth).
Apart from that, newly sent alerts - after e.g. my little firmware-upgrade downtime - would be rather like spam.

It's basically the same scenario as above with the flapping alerts, just that it expectedly takes much longer, and one likely doesn't want to set keep_firing_for (in its current implementation) to anything longer than ~10m, because otherwise resolved messages would come in quite delayed.

Again, I think a solution may be an alternative to keep_firing_for which takes into account whether the respective target is up or not. And this should IMO then even allow setting an infinite value, which would cause the alert to keep firing unless Prometheus saw, at least once, no result from the expression while the target was up.
"Target" might actually need to be targetS ... if one combines several of them in an expression.
 

> If you have an alerting expression like "up == 0" and you take 10 machines down then your alerting expression will return a vector of ten zeros and this will generate ten alerts (typically grouped into a single notification, if you use the default alertmanager config)

> When they revert to up == 1 then they won't "start again to fire", because they were already firing. Indeed, it's almost the opposite. Let's say you have keep_firing_for: 10m, then if any machine goes down in the 10 minutes after the end of maintenance then it *won't* generate a new alert, because it will just be a continuation of the old one.

Yes, but I think one cannot reasonably set the current keep_firing_for to much larger lengths than something around ~10mins, or otherwise one will not only increase the chances of losing (new) follow-up alerts, but also delay the resolving for too long.

 
> However, when you're doing maintenance, you might also be using silences to prevent notifications. In that case you might want your silence to extend 10 minutes past the end of the maintenance period.

For a planned maintenance this may be possible (though perhaps a bit cumbersome).
But the example (with the firmware upgrades) that I gave above plays out the same if one has e.g. an unplanned network outage for some longer time ... in which case keep_firing_for would likely also run out, and once scraping works again, the alerts would all fire as new alerts.


Do you guys think that my idea of an alternative keep_firing_for - one that takes into account whether the targets needed for each element of the alert expression are down - would make sense?
If so, I could perhaps submit a feature request.

Cheers,
Chris.

Brian Candler

Apr 8, 2024, 5:05:41 PM
to Prometheus Users
On Monday 8 April 2024 at 20:57:34 UTC+1 Christoph Anton Mitterer wrote:
> Assume the following (arguably a bit made up) example:
> One has a metric that counts the number of failed drives in a RAID. One drive fails so some alert starts firing. Eventually the computing centre replaces the drive and it starts rebuilding (guess it doesn't matter whether the rebuilding is still considered to cause an alert or not). Eventually it finishes and the alert should go away (and I should e.g. get a resolved message).
> But because of keep_firing_for, it doesn't stop straight away.
> Now before it does, yet another disk fails.
> But for Prometheus, with keep_firing_for, it will be like the same alert.

If the alerts have the exact same set of labels (e.g. the alert is at the level of the RAID controller, not at the level of individual drives) then yes.

It failed, it fixed, it failed again within keep_firing_for: then you only get a single alert, with no additional notification.

But that's not the problem you originally asked for:

"When the target goes down, the alert clears and as soon as it's back, it pops up again, sending a fresh alert notification."

keep_firing_for can be set differently for different alerts.  So you can set it to 10m for the "up == 0" alert, and not set it at all for the RAID alert, if that's what you want.
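
For example, roughly like this (just a sketch - names and durations are only examples, and the RAID expression assumes node_exporter's node_md_disks metric):
    groups:
      - name: example
        rules:
          - alert: InstanceDown
            expr: up == 0
            for: 5m
            keep_firing_for: 10m     # ride out brief scrape flaps
          - alert: RaidDiskFailed
            expr: node_md_disks{state="failed"} > 0
            # no keep_firing_for here - resolves as soon as the expression clears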

 

> Also, depending on how large I have to set keep_firing_for, I will also get resolve messages later... which depending on what one does with the alerts may also be less desirable.

Surely that delay is essential for the de-flapping scenario you describe: you can't send the alert resolved message until you are *sure* the alert has resolved (i.e. after keep_firing_for).

Conversely: if you sent the alert resolved message immediately (before keep_firing_for had expired), and the problem recurred, then you'd have to send out a new alert failing message - which is the flap noise I think you are asking to suppress.

In any case, sending out resolved messages is arguably a bad idea:

I turned them off, and:
(a) it immediately reduced notifications by 50%
(b) it encourages alerts to be properly investigated (or the alerting rules to be properly tuned)

That is: if something was important enough to alert on in the first place, then it's important enough to investigate thoroughly, even if the threshold has been crossed back to normal since then. And if it wasn't important enough to alert on, then the alerting rule needs adjusting to make it less noisy.

This is expanded upon in this document:
 

> I think the main problem behind may be rather a conceptual one, namely that Prometheus uses "no data" for no alert, which happens as well when there is no data because of e.g. scrape failures, so it can’t really differentiate between the two conditions.

I think it can.

Scrape failures can be explicitly detected by up == 0.  Alert on those separately.

The odd occasional missed scrape doesn't affect most other queries because of the lookback-delta: i.e. instant vector queries will look up to 5 minutes into the past. As long as you're scraping every 2 minutes, you can always survive a single failed scrape without noticing it.

If your device goes away for longer than 5 minutes, then sure, the alerting data will no longer be there - but then you have no idea whether the condition you were alerting on exists or not (since you have no visibility of the target state).  Instead, you have a "scrape failed" condition, which, as I said already, is easy to alert on.
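
And if you really wanted the alerting expression itself to ride out longer gaps, you could in principle extend the per-metric lookback yourself with last_over_time - a rough sketch based on your root-fs rule, with the label filters abbreviated and an arbitrary 30m window:

    1 - last_over_time(node_filesystem_avail_bytes{mountpoint="/"}[30m])
          / last_over_time(node_filesystem_size_bytes[30m])
      > 0.80

While the target is up this always picks the latest sample, so the alert still resolves promptly; while it's down, the last seen values are carried forward for up to 30 minutes.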

Christoph Anton Mitterer

Apr 8, 2024, 6:33:14 PM
to Prometheus Users
On Monday, April 8, 2024 at 11:05:41 PM UTC+2 Brian Candler wrote:
> On Monday 8 April 2024 at 20:57:34 UTC+1 Christoph Anton Mitterer wrote:
>> But for Prometheus, with keep_firing_for, it will be like the same alert.

> If the alerts have the exact same set of labels (e.g. the alert is at the level of the RAID controller, not at the level of individual drives) then yes.

Which will still be quite often the case, I guess. Sometimes it may not matter, i.e. when a "new" alert (which has the same label set) is "missed" because of keep_firing_for, but sometimes it may.
 

> It failed, it fixed, it failed again within keep_firing_for: then you only get a single alert, with no additional notification.
> But that's not the problem you originally asked for:
> "When the target goes down, the alert clears and as soon as it's back, it pops up again, sending a fresh alert notification."

Sure, and this can be avoided with keep_firing_for, but as far as I can see only in some cases (since one wants to keep keep_firing_for shortish), and at the cost of losing the information that the alert condition actually went away (which Prometheus can in principle know) and came back while still firing.

 
> keep_firing_for can be set differently for different alerts.  So you can set it to 10m for the "up == 0" alert, and not set it at all for the RAID alert, if that's what you want.

If there is no other way than the current keep_firing_for - i.e. if my idea of an alternative keep_firing_for that considers the up/down state of the queried metrics isn't possible and/or reasonable - then, rather than being able to set keep_firing_for per alert, I'd wish to be able to set it per queried instance.

For some of the cases I'm working on at the university, it might be nice to (automatically) query the status of an alert and take action while it fires, but then I'd also like that action to stop rather soon after the alert (actually) stops. If I have to use a longer keep_firing_for because of a set of unstable nodes, then either I get the penalty of unnecessarily long-firing alerts for all nodes, or I maintain a different set of alerts, which would be possible but also quite ugly.


 
> Surely that delay is essential for the de-flapping scenario you describe: you can't send the alert resolved message until you are *sure* the alert has resolved (i.e. after keep_firing_for).

> Conversely: if you sent the alert resolved message immediately (before keep_firing_for had expired), and the problem recurred, then you'd have to send out a new alert failing message - which is the flap noise I think you are asking to suppress.

Okay maybe we have a misunderstanding here, or better said, I guess there are two kinds of flapping alerts:

For example, assume an alert that monitors the utilised disk space on the root fs, and fires whenever that's above 80%.

Type 1 Flapping:
- The scraping of the metrics works all the time (i.e. `up` is all the time 1).
- But IO is happening that causes the 80% threshold to be exceeded and then dropped below again every few seconds.

Type 2 Flapping
- There is IO, but the utilisation is always above 80%, say it's already at ~ 90% all the time.
- My scrapes fail every now and then[0]

I honestly haven't even thought about type 1 yet, but I think those are the ones which would be perfectly solved by keep_firing_for.
Well, even there I'd still like to be able to have keep_firing_for applied only to a given label set, e.g. something like: keep_firing_for: 10m on {alertname=~"regex-for-my-known-flapping-alerts"}

Type 2 is the one that causes me headaches right now.

That is why I thought before that it could be solved by something like keep_firing_for, but one that also takes into account whether any of the metrics it queries came from a target that is "currently" down - and only then lets keep_firing_for take effect.


Thanks,
Chris.


[0] I do have a number of hosts where this constantly happens, not really sure why TBH, but even with a niceness of -20 and an IO niceness of 0 (though in the best-effort class) it happens quite often. The node is under high load (it's one of our compute nodes for the LHC Computing Grid)... so I guess maybe it's just "overloaded". So I don't think this will go away, and I somehow have to get it working with the scrapes failing every now and then.

What actually puzzled me more is this:
[attached screenshot: "Screenshot from 2024-04-09 00-24-59.png"]
Those are some of the graphs from the Node Exporter Full Grafana dashboard, all for one node (which is one of the flapping ones).
As you can see, Memory Basic and Disc Space Used Basic have a gap, where scraping failed.
My assumption was that - for a given target & instance - scraping either fails for all metrics or succeeds for all.
But here, only the right side plots have gaps, the left side ones don't.

Maybe that's just some consequence of these using counters and rate() or irate()?