repeat_interval in AlertManager not working

talk...@gmail.com

Mar 23, 2019, 2:35:27 PM
to Prometheus Users
I've configured the below to generate alerts for pod restart metrics. The group_wait and group_interval values work fine; however,
the repeat_interval does not. I've tried a number of configurations without any success. I'd appreciate any recommendations.


alertmanager:v0.14.0
rancher 2.0.11
prometheus 2.2.1

prometheus.yml file
global:
  scrape_interval:     15s
  evaluation_interval: 15s

alertmanager.yml file
route:
  receiver: test-email
  # All alerts that do not match the following child routes
  # will remain at the root node and be dispatched to 'default-receiver'.
  routes:
  # All alerts with severity=pod-restart label match this sub-route.
  # They are grouped by pod and namespace
  - receiver: pod-restart
    group_by: [alertname, pod, namespace]
    group_wait: 30s
    group_interval: 40m
    repeat_interval: 3h
  - match_re:
      severity: pod-restart
  # All alerts with severity=pod-critical label match this sub-route.
  # They are grouped by pod and namespace
  - receiver: pod-critical
    group_by: [alertname, pod, namespace]
    group_wait: 30s
    group_interval: 35m
    repeat_interval: 3h
  - match_re:
      severity: pod-critical

rules.yml file
#Monitoring for Container/Pod Restart
- name: Pod Restart
  rules:
  - alert: Pod Restart
    expr: rate(kube_pod_container_status_restarts_total[5m]) * 300 > 0
    for: 2m
    labels:
      severity: pod-restart
    annotations:
     description: 'The {{$labels.pod}} Pod running in Namespace {{$labels.namespace}} located in Container {{$labels.container}} has restarted in the previous 5 minutes.'
     summary: 'Container {{$labels.container}} in Pod {{$labels.namespace}}/{{$labels.pod}} has restarted in the previous 5 minutes.'

# ---------------------------------------
#Monitoring for Pod ErrImagePull Error
- name: Pod ErrImagePull Error
  rules:
  - alert: Pod ErrImagePull Error
    expr: kube_pod_container_status_waiting_reason{reason=~"ErrImagePull|ImagePullBackOff"} > 0
    for: 2m
    labels:
      severity: pod-critical
    annotations:
     description: 'The {{$labels.pod}} Pod running in Namespace {{$labels.namespace}} located in Container {{$labels.container}} has failed due to a {{$labels.reason}} error in the previous 5 minutes.'
     summary: 'Container {{$labels.container}} in Pod {{$labels.namespace}}/{{$labels.pod}} has failed due to a {{$labels.reason}} error in the previous 5 minutes.'

Brian Brazil

Mar 23, 2019, 4:20:35 PM
to Jeff O'Hara, Prometheus Users
On Sat, 23 Mar 2019 at 18:35, <talk...@gmail.com> wrote:
I've configured the below to generate alerts for pod restart metrics. The group_wait and group_interval values work fine; however,
the repeat_interval does not. I've tried a number of configurations without any success. I'd appreciate any recommendations.

What are you seeing that makes you think it's not working?

Brian
 
--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To post to this group, send email to promethe...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/fc604043-a2cd-464a-8142-805d00fc63f6%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


talk...@gmail.com

Mar 24, 2019, 1:40:44 PM
to Prometheus Users
Thanks, Brian, for reviewing. For the alerts listed below, the receiver in alertmanager.yml is configured with a 35-minute group_interval
(severity: pod-critical), and the Prometheus rule is configured with for: 2m. Going by the Prometheus documentation quoted below, group_interval
appears to be working the way I would expect repeat_interval to work, which would let me suppress duplicates for alerts that were already sent
successfully. I may be interpreting this incorrectly, so I want to determine whether this is working properly and whether there's another way to
suppress the alerts. Thanks.


Subject (identical for all five notifications):
[FIRING:1] | Pod ErrImagePull Error | 10.x.x.x:8080 | The crash-test-599864b657-6t4nr Pod running in Namespace web1 located in Container crash-test has failed due to a ImagePullBackOff error in the previous 5 minutes.

Received    Time Between Alerts
8:53 AM
10:14 AM    1:21
10:56 AM    0:42
11:37 AM    0:41
12:18 PM    0:41
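
The gaps between the notifications listed above can be recomputed from the received times. This is only a small arithmetic check; the function name `gaps_in_minutes` is made up for illustration, and the times are copied from the list above.

```python
from datetime import datetime

# Received times copied from the notification list above (all on the same day).
received = ["8:53 AM", "10:14 AM", "10:56 AM", "11:37 AM", "12:18 PM"]

def gaps_in_minutes(times):
    """Return the whole-minute gap between each pair of consecutive clock times."""
    parsed = [datetime.strptime(t, "%I:%M %p") for t in times]
    return [int((b - a).total_seconds() // 60) for a, b in zip(parsed, parsed[1:])]

gaps = gaps_in_minutes(received)
print(gaps)  # [81, 42, 41, 41]
```

The last three gaps sit near the 35m and 40m group_interval values in the posted routes, and far below the 3h repeat_interval.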


# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.)
[ group_interval: <duration> | default = 5m ]

# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
[ repeat_interval: <duration> | default = 4h ]
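
As a rough illustration of how the two settings quoted above might interact for an alert that fires continuously and never changes, here is a simplified model. It is not Alertmanager's actual code. It assumes the group is flushed once after group_wait and then every group_interval, and that an unchanged group is only re-notified at the first flush at least repeat_interval after the previous notification. The function name `notification_times` is hypothetical; the durations are taken from the pod-critical route above.

```python
from datetime import timedelta

def notification_times(group_wait, group_interval, repeat_interval, horizon):
    """Offsets (from the first alert) at which a notification would go out,
    under the simplified model described above: an unchanged, continuously
    firing alert group that is flushed every group_interval."""
    sent = []
    flush = group_wait  # first flush happens after group_wait
    while flush <= horizon:
        # Notify on the first flush, or once repeat_interval has elapsed
        # since the last notification.
        if not sent or flush - sent[-1] >= repeat_interval:
            sent.append(flush)
        flush += group_interval
    return sent

times = notification_times(
    group_wait=timedelta(seconds=30),
    group_interval=timedelta(minutes=35),
    repeat_interval=timedelta(hours=3),
    horizon=timedelta(hours=8),
)
print([str(t) for t in times])  # ['0:00:30', '3:30:30', '7:00:30']
```

Under these assumptions a steadily firing alert would repeat roughly every 3.5 hours (the first flush after 3h has passed), not every 35 to 40 minutes.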

Brian Brazil

Mar 24, 2019, 4:07:28 PM
to Jeff O'Hara, Prometheus Users
On Sun, 24 Mar 2019 at 17:40, <talk...@gmail.com> wrote:
Thanks, Brian, for reviewing. For the alerts listed below, the receiver in alertmanager.yml is configured with a 35-minute group_interval
(severity: pod-critical), and the Prometheus rule is configured with for: 2m. Going by the Prometheus documentation quoted below, group_interval
appears to be working the way I would expect repeat_interval to work, which would let me suppress duplicates for alerts that were already sent
successfully. I may be interpreting this incorrectly, so I want to determine whether this is working properly and whether there's another way to
suppress the alerts. Thanks.

Is the alert itself flapping? If it were alternating between firing and not firing every 5-ish minutes, you'd see this.

Brian
 
