repeat_interval in AlertManager not working

talk...@gmail.com

Mar 23, 2019, 2:35:27 PM

to Prometheus Users

I've configured the rules below to generate alerts for pod restart metrics. The group_wait and group_interval values work fine; however, repeat_interval does not. I've tried a number of configurations without any success. I'd appreciate any recommendations.

alertmanager:v0.14.0

Rancher 2.0.11

prometheus 2.2.1

prometheus.yml file

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alertmanager.yml file

route:
  receiver: test-email
  # All alerts that do not match the following child routes
  # will remain at the root node and be dispatched to 'test-email'.
  routes:
  # All alerts with severity=pod-restart label match this sub-route.
  # They are grouped by pod and namespace
  - receiver: pod-restart
    group_by: [alertname, pod, namespace]
    group_wait: 30s
    group_interval: 40m
    repeat_interval: 3h
  - match_re:
      severity: pod-restart
  # All alerts with severity=pod-critical label match this sub-route.
  # They are grouped by pod and namespace
  - receiver: pod-critical
    group_by: [alertname, pod, namespace]
    group_wait: 30s
    group_interval: 35m
    repeat_interval: 3h
  - match_re:
      severity: pod-critical
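A note on the routing tree: as pasted, each `match_re` starts its own list item (`- match_re:`) instead of sitting inside the same item as its `receiver`. That may only be an artifact of how the config was pasted, but if the real file is laid out that way, the `pod-restart` route has no matcher and would catch every alert. The following is a minimal sketch, assuming the intent was one route per severity; the receiver names are the ones from the post and still need to be defined under `receivers:`.

```yaml
# Sketch only: assumes each receiver/match_re pair was meant to be ONE route.
route:
  receiver: test-email
  routes:
    # Alerts labelled severity=pod-restart
    - receiver: pod-restart
      match_re:
        severity: pod-restart
      group_by: [alertname, pod, namespace]
      group_wait: 30s
      group_interval: 40m
      repeat_interval: 3h
    # Alerts labelled severity=pod-critical
    - receiver: pod-critical
      match_re:
        severity: pod-critical
      group_by: [alertname, pod, namespace]
      group_wait: 30s
      group_interval: 35m
      repeat_interval: 3h
```

The gaps of roughly 41 minutes reported later in the thread are closer to the 40m `group_interval` of the first route than to the 35m of the second, which would be consistent with this reading, though nothing in the thread confirms it.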

rules.yml file

groups:
# Monitoring for Container/Pod Restart
- name: Pod Restart
  rules:
  - alert: Pod Restart
    expr: rate(kube_pod_container_status_restarts_total[5m]) * 300 > 0
    for: 2m
    labels:
      severity: pod-restart
    annotations:
      description: 'The {{$labels.pod}} Pod running in Namespace {{$labels.namespace}} located in Container {{$labels.container}} has restarted in the previous 5 minutes.'
      summary: 'Container {{$labels.container}} in Pod {{$labels.namespace}}/{{$labels.pod}} has restarted in the previous 5 minutes.'
# ---------------------------------------
# Monitoring for Pod ErrImagePull Error
- name: Pod ErrImagePull Error
  rules:
  - alert: Pod ErrImagePull Error
    expr: kube_pod_container_status_waiting_reason{reason=~"ErrImagePull|ImagePullBackOff"} > 0
    for: 2m
    labels:
      severity: pod-critical
    annotations:
      description: 'The {{$labels.pod}} Pod running in Namespace {{$labels.namespace}} located in Container {{$labels.container}} has failed due to a {{$labels.reason}} error in the previous 5 minutes.'
      summary: 'Container {{$labels.container}} in Pod {{$labels.namespace}}/{{$labels.pod}} has failed due to a {{$labels.reason}} error in the previous 5 minutes.'

Brian Brazil

Mar 23, 2019, 4:20:35 PM

to Jeff O'Hara, Prometheus Users

On Sat, 23 Mar 2019 at 18:35, <talk...@gmail.com> wrote:

> I've configured the rules below to generate alerts for pod restart metrics. The group_wait and group_interval values work fine; however,
> repeat_interval does not. I've tried a number of configurations without any success. I'd appreciate any recommendations.

What are you seeing that makes you think it's not working?

Brian

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To post to this group, send email to promethe...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/fc604043-a2cd-464a-8142-805d00fc63f6%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Brian Brazil

www.robustperception.io

talk...@gmail.com

Mar 24, 2019, 1:40:44 PM

to Prometheus Users

Thanks, Brian, for reviewing. For the list of alerts below, the route in alertmanager.yml is configured with a group_interval of 35 minutes (severity: pod-critical), and the Prometheus rule is configured with for: 2m. Going by the Prometheus documentation quoted below, group_interval appears to be working the way I would expect repeat_interval to work, that is, suppressing duplicates for alerts that have already been sent successfully. I may be interpreting this incorrectly, so I want to determine whether this is working properly and whether there's another way to suppress the alerts. Thanks.

Subject (identical for all five notifications):
[FIRING:1] | Pod ErrImagePull Error | 10.x.x.x:8080 | The crash-test-599864b657-6t4nr Pod running in Namespace web1 located in Container crash-test has failed due to a ImagePullBackOff error in the previous 5 minutes.

Received    Time Between Alerts
8:53 AM     -
10:14 AM    1:21
10:56 AM    0:42
11:37 AM    0:41
12:18 PM    0:41

# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.)
[ group_interval: <duration> | default = 5m ]

# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
[ repeat_interval: <duration> | default = 4h ]
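To make the interaction between the three timers concrete, here is an annotated sketch of a single route. The values are the ones from the pod-critical route in this thread. The timing comments describe Alertmanager's scheduling as it is generally understood, not anything stated in the thread, so treat them as assumptions to verify against the documentation for your version.

```yaml
route:
  receiver: pod-critical            # receiver name taken from the config in this thread
  group_by: [alertname, pod, namespace]

  # A brand-new group waits this long before its first notification,
  # so alerts that start firing together arrive in one message.
  group_wait: 30s

  # After the first notification, the group is re-checked on this cadence.
  # A notification goes out at that point if the set of firing alerts
  # changed (a new alert joined, or an alert resolved and later fired again).
  group_interval: 35m

  # If nothing changed, the same notification is re-sent only once this much
  # time has passed since the last successful send. The check happens at a
  # group_interval tick, so with these values a repeat would be expected at
  # the first tick after 3h, around 3h30m (6 x 35m), rather than at exactly 3h.
  repeat_interval: 3h
```

Read this way, notifications that arrive every 35 to 40 minutes would suggest that the set of firing alerts is changing between ticks, not that repeat_interval is being ignored.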

Brian Brazil

Mar 24, 2019, 4:07:28 PM

to Jeff O'Hara, Prometheus Users

On Sun, 24 Mar 2019 at 17:40, <talk...@gmail.com> wrote:

> Thanks, Brian, for reviewing. For the list of alerts below, the route in alertmanager.yml is configured with a group_interval of 35 minutes (severity: pod-critical), and the
> Prometheus rule is configured with for: 2m. Going by the Prometheus documentation quoted below, group_interval appears to be working the
> way I would expect repeat_interval to work, that is, suppressing duplicates for alerts that have already been sent successfully. I may be interpreting this
> incorrectly, so I want to determine whether this is working properly and whether there's another way to suppress the alerts. Thanks.

Is the alert itself flapping? If it were alternating between firing and not firing every 5 minutes or so, you'd see this.

Brian
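
The thread does not confirm that flapping is the cause. If it is, one possible contributor is the `reason` label: a pod that cannot pull its image typically alternates between ErrImagePull and ImagePullBackOff, so each per-reason series may drop in and out of the `> 0` result. Below is a hedged sketch of a rule that smooths this over. The 10m window and the reworded summary are illustrative choices, not values from the thread.

```yaml
groups:
  - name: Pod ErrImagePull Error
    rules:
      - alert: Pod ErrImagePull Error
        # Assumption: the container alternates between the two waiting reasons.
        # max_over_time keeps a series above 0 for 10m after its last sample of 1,
        # and max by (...) drops the reason label so both states map to one alert.
        expr: |
          max by (namespace, pod, container) (
            max_over_time(kube_pod_container_status_waiting_reason{reason=~"ErrImagePull|ImagePullBackOff"}[10m])
          ) > 0
        for: 2m
        labels:
          severity: pod-critical
        annotations:
          # $labels.reason is no longer available once the label is aggregated away.
          summary: 'Container {{$labels.container}} in Pod {{$labels.namespace}}/{{$labels.pod}} is failing to pull its image.'
```

To check whether the original alert really flaps, graphing `ALERTS{alertname="Pod ErrImagePull Error"}` in the Prometheus expression browser over a few hours should show gaps in the firing series if it does.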

--

Brian Brazil

www.robustperception.io
