Frequent Prometheus alerts


nihra l

Apr 2, 2023, 4:54:57 AM
to Prometheus Users
Hi,

We are using the following expression for our node memory alert:

node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10

We get an alert when less than 10% of memory is available. But when the value changes in the next evaluation cycle and is still below 10%, the previous alert is resolved and a fresh alert is fired, instead of following the repeat interval.
Configuration details: 
"for" condition in the alert rule: 1m
group_wait: 30s, group_interval: 5m, repeat_interval: 24h

Example:
8:00 AM: memory available is 9.5%, alert1 is triggered.
8:05 AM: memory available is 9.8% (still below 10%), but instead of following the repeat cycle, alert1 is resolved and a fresh alert is fired.
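For reference, the rule described above would look roughly like this as a Prometheus alerting rule; the alert name, labels, and annotations below are illustrative assumptions, not taken from the actual setup:

```yaml
# Sketch of the alerting rule described in this thread.
# Alert name, labels, and annotations are hypothetical.
groups:
  - name: node-memory
    rules:
      - alert: NodeMemoryLow
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% memory available on {{ $labels.instance }}"
```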

Thanks


Julius Volz

Apr 2, 2023, 5:54:03 AM
to nihra l, Prometheus Users
Hi Nihra,

* Is your rule evaluation interval in Prometheus really 5 minutes, as your examples show? In that case it would explain why the Alertmanager auto-resolves your alerts in between cycles, because the default resolve timeout for alerts received by the Alertmanager is 5 minutes. So Alertmanager expects to receive the same firing alert from Prometheus every <5m in order for it to not auto-resolve. See https://github.com/prometheus/alertmanager/blob/747430cd42e1aad5a5f0a25737e9a398b0ef371e/config/config.go#L621
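To make the interaction concrete, here is a sketch of the two settings involved; the values shown are the defaults/examples under discussion, not the poster's confirmed configuration:

```yaml
# prometheus.yml -- how often alerting rules are evaluated and re-sent.
# This must stay comfortably below Alertmanager's resolve_timeout,
# or firing alerts will be auto-resolved between evaluations.
global:
  evaluation_interval: 1m

# alertmanager.yml -- default timeout after which an alert that has not
# been re-sent by Prometheus is treated as resolved (5m is the default).
global:
  resolve_timeout: 5m
```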

* If that's not the case, are you 100% sure that no evaluation cycle ever goes above 10% in between? What happens if you manually evaluate your expression at a high resolution, do you see it constantly under 10%?

Regards,
Julius



--
Julius Volz
PromLabs - promlabs.com

Brian Candler

Apr 2, 2023, 6:15:27 AM
to Prometheus Users
Enter your alerting expression exactly as-is into the PromQL query browser in the Prometheus web interface:
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10

If you see any gaps between 8:00 AM and 8:05 AM, then this is where the alert was resolved.  Or you can look at the negated expression to see more clearly whether there are values which exceed 10% between 8:00 and 8:05:
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 >= 10

The important point here is: if the alerting expression doesn't return a value for even a single rule evaluation interval, then the alert will be immediately resolved.  However, Prometheus 2.42 has added a new feature which should help you:
  • [FEATURE] Add 'keep_firing_for' field to alerting rules. #11827

For example: if you set "keep_firing_for: 5m" then your alert won't resolve until the alerting value has been absent for 5 minutes continuously (that is, on every evaluation cycle over 5 minutes).  This is complementary to the existing feature "for: 5m" which means that the alert won't fire until the alerting value has been present for 5 minutes continuously.
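Applied to the rule from this thread, that might look like the following sketch (requires Prometheus 2.42 or later; the alert name and durations are illustrative):

```yaml
# keep_firing_for delays resolution: the alert stays firing until the
# condition has been absent on every evaluation for 5m continuously.
groups:
  - name: node-memory
    rules:
      - alert: NodeMemoryLow
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
        for: 1m             # condition must hold for 1m before firing
        keep_firing_for: 5m # and be gone for 5m before resolving
```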
