Prometheus not restoring Alert state on restart.

soha...@gmail.com

Jun 26, 2020, 4:40:00 AM
to Prometheus Users
Say an alert is in the firing state and, a few minutes later, I restart Prometheus. After the restart, Prometheus sends a resolved notification to Alertmanager (and Alertmanager forwards it to the webhook).

Is there any way to restore the alert state? For example, if an alert was firing before the restart, it should remain firing afterwards and not send a resolved notification.
Note: the alert rule has for=5m.

Prometheus version: 2.17.2
Alertmanager version: 0.21.0
Env: running as stateful sets in Kubernetes

Prometheus runtime args:
--rules.alert.for-grace-period=3m
--rules.alert.for-outage-tolerance=1h
--storage.tsdb.retention.time=2h

Prometheus config:
global:
  scrape_interval: 30s
  scrape_timeout: 10s
  evaluation_interval: 30s

Please let me know if you need more information.

Brian Brazil

Jun 26, 2020, 4:57:53 AM
to soha...@gmail.com, Prometheus Users
On Fri, 26 Jun 2020 at 09:40, soha...@gmail.com <soha...@gmail.com> wrote:
Say an alert is in firing state then after a few minutes, I restart the Prometheus. After the restart, Prometheus sends the resolved notification to the alert-manager (and alert-manager to the webhook).

Is there any way to restore an alert state for example if it was in firing state than it should remain in firing state after a restart and not send resolved alert.
Note alert rule for=5m

The alert restoration at startup feature is intended for really long "for" clauses, in the hours-to-days range. For short periods such as 5m, something like a long restart is enough to cause issues, and you should look at making the alerts and your handling of them more robust. Keep in mind that right after a restart Prometheus may not yet have scraped much data for the alert.

Brian
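
One way to read "making your handling of them more robust" is to have the webhook receiver hold back resolved notifications for a while and drop them if the same alert fires again. The following is only a minimal sketch of that idea, not a Prometheus or Alertmanager feature: the class name ResolveDebouncer, the hold_seconds parameter, and the use of a plain string as the alert key are all hypothetical, and timestamps are passed in explicitly so the logic is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class ResolveDebouncer:
    """Hypothetical receiver-side filter: delays 'resolved' events and
    drops them if the same alert fires again within hold_seconds."""

    hold_seconds: float
    # alert key -> time at which the held 'resolved' event was received
    _pending: Dict[str, float] = field(default_factory=dict)
    # alert keys already forwarded downstream as firing
    _firing: Set[str] = field(default_factory=set)

    def on_event(self, key: str, status: str, now: float) -> List[Tuple[str, str]]:
        """Feed one notification; return the (key, status) events to forward now."""
        out = self.flush(now)
        if status == "firing":
            if key in self._pending:
                # Fired again inside the hold window: treat the earlier
                # 'resolved' as a restart blip and discard it.
                del self._pending[key]
            elif key not in self._firing:
                self._firing.add(key)
                out.append((key, "firing"))
        elif status == "resolved" and key in self._firing:
            # Hold the resolve instead of forwarding it immediately.
            self._pending.setdefault(key, now)
        return out

    def flush(self, now: float) -> List[Tuple[str, str]]:
        """Release resolves that have been held for at least hold_seconds."""
        due = [k for k, t in self._pending.items() if now - t >= self.hold_seconds]
        out = []
        for k in due:
            del self._pending[k]
            self._firing.discard(k)
            out.append((k, "resolved"))
        return out


if __name__ == "__main__":
    d = ResolveDebouncer(hold_seconds=600)
    print(d.on_event("HighLatency", "firing", now=0))
    print(d.on_event("HighLatency", "resolved", now=100))  # held, nothing forwarded
    print(d.on_event("HighLatency", "firing", now=400))    # fired again, resolve dropped
```

The hold window is an assumption to tune: it would need to cover a restart plus the time the rule needs to reach firing again (at least the 5m "for" in the question). The trade-off is that genuine resolves are also delayed by that amount, and flush has to be called periodically so held resolves are eventually delivered.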
 


Sohaib Omar

Jun 26, 2020, 6:53:18 AM
to Brian Brazil, Prometheus Users
Thanks for the quick response. I get your point.