Alert when pods are in crashloopbackoff


isha girdhar

Jun 8, 2018, 5:12:58 AM
to Prometheus Users
Currently, we alert if pods are restarting too much, and we use the rate function in our alert.
We checked the standard rules provided at https://github.com/coreos/prometheus-operator/blob/master/helm/exporter-kube-state/templates/kube-state-metrics.rules.yaml, where the increase function is used instead of rate. What is the difference, and which one is better to use in this scenario?




Our current alert:

alert: K8sPodRestartingTooMuch
expr: rate(kube_pod_container_status_restarts_total[1m])
  > 1 / (5 * 60)
for: 30m
labels:
  severity: warning

The rule from the standard rules file:

alert: PodFrequentlyRestarting
expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
for: 10m
labels:
  severity: warning
annotations:
  description: Pod {{`{{$labels.namespace}}`}}/{{`{{$labels.pod}}`}} was restarted {{`{{$value}}`}}
     times within the last hour
  summary: Pod is restarting frequently

Paul from okmeter.io

Jun 8, 2018, 9:07:01 AM
to Prometheus Users
Hi


TLDR - increase is extrapolating and it's not good.

Brian Brazil

Jun 8, 2018, 10:03:23 AM
to isha girdhar, Prometheus Users
The difference is fairly minor: increase is syntactic sugar over rate, so increase(x[1h]) is exactly the same as rate(x[1h]) * 3600. Rate produces a per-second result, while increase gives the result over the time range you pass it.
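
To make the relationship concrete, here is a small Python sketch. It is not how Prometheus computes these functions: simple_rate and simple_increase are hypothetical helpers that skip extrapolation and counter-reset handling, and the sample data is made up. It only illustrates that increase is rate scaled by the window length, so the two alert thresholds are interchangeable.

```python
import math


def simple_rate(samples, window_seconds):
    """Per-second rate from (timestamp, value) counter samples.

    Simplified on purpose: no extrapolation and no counter-reset handling.
    """
    first_value = samples[0][1]
    last_value = samples[-1][1]
    return (last_value - first_value) / window_seconds


def simple_increase(samples, window_seconds):
    """increase() expressed as rate() scaled by the window length."""
    return simple_rate(samples, window_seconds) * window_seconds


# Hypothetical restart counter, scraped every 60s for one hour,
# with one restart every 10 minutes (6 restarts in total).
WINDOW = 3600
samples = [(t, t // 600) for t in range(0, WINDOW + 1, 60)]

rate_value = simple_rate(samples, WINDOW)
increase_value = simple_increase(samples, WINDOW)

# Scaling the threshold by the window gives the same alert condition.
fires_with_increase = increase_value > 5
fires_with_rate = rate_value > 5 / 3600

print(math.isclose(increase_value, rate_value * WINDOW))
print(fires_with_increase, fires_with_rate)
```

Both conditions agree on this data, which is why the choice between the two functions comes down to which threshold is easier to read.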


--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscribe@googlegroups.com.
To post to this group, send email to prometheus-users@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/7df7e427-22b7-4571-9062-61cab7077f89%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.




Brian Brazil

Jun 8, 2018, 10:11:49 AM
to isha girdhar, Prometheus Users
Apparently Ctrl-Enter sends an email, let me finish that...



Your first alert, rate(kube_pod_container_status_restarts_total[1m]) > 1 / (5 * 60), is effectively asking whether there is more than 1/300 of a restart per second. As restarts are integers, the smallest non-zero value is 1/60 restarts per second over the one-minute time range, so any single restart in the window already exceeds the threshold and the expression is a bit misleading.

This is going to be a very fragile alert: it requires at least one restart every single minute for half an hour (the "for: 30m") to trigger, so it will have many false negatives.
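
The arithmetic behind that can be checked with a short Python sketch. It assumes, as above, that the rate over a one-minute range is simply the number of restarts in the window divided by 60 (real Prometheus extrapolates, so actual values differ slightly); per_second_rate and condition_holds are hypothetical helper names.

```python
from fractions import Fraction

WINDOW_SECONDS = 60               # the [1m] range
THRESHOLD = Fraction(1, 5 * 60)   # the "> 1 / (5 * 60)" part of the expression


def per_second_rate(restarts_in_window):
    """Simplified rate: restarts seen in the window divided by its length."""
    return Fraction(restarts_in_window, WINDOW_SECONDS)


def condition_holds(restarts_in_window):
    """True when the simplified rate exceeds the alert threshold."""
    return per_second_rate(restarts_in_window) > THRESHOLD


# The threshold expressed as "restarts per one-minute window".
threshold_per_window = THRESHOLD * WINDOW_SECONDS

for restarts in (0, 1, 2):
    print(restarts, per_second_rate(restarts), condition_holds(restarts))
print("threshold in restarts per window:", threshold_per_window)
```

Under this simplification the threshold works out to a fifth of a restart per window, so the condition is really "any restart in the last minute", and the "for: 30m" then demands that this stays true at every evaluation for half an hour.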
 

alert: PodFrequentlyRestarting
expr: increase(kube_pod_container_status_restarts_total[1h]) > 5

This one says to alert if there are more than 5 restarts in an hour, and this has to be the case for 10 minutes. This should be a robust alert.


In this case the difference between increase and rate largely boils down to readability of the expression. The second expression written with rate would be:

rate(kube_pod_container_status_restarts_total[1h]) > 5 / 3600

The lesson here is mostly to avoid short time ranges in your alerts.
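
As a rough illustration of that lesson, the sketch below replays both rules against a hypothetical pod that restarts every 5 minutes. It is a simplified model, not Prometheus itself: restarts are counted directly in the window (no extrapolation), rules are assumed to be evaluated once a minute, and restarts_in_window and first_firing_time are made-up helper names.

```python
def restarts_in_window(restart_times, now, window):
    """Count restarts with now - window < t <= now (a simplified increase)."""
    return sum(1 for t in restart_times if now - window < t <= now)


def first_firing_time(restart_times, window, threshold, for_seconds,
                      horizon, step=60):
    """Return the first evaluation time at which the alert fires, or None.

    The condition must hold at every evaluation for `for_seconds` before
    the alert fires, mimicking the `for:` clause of an alerting rule.
    """
    pending_since = None
    for now in range(0, horizon + 1, step):
        if restarts_in_window(restart_times, now, window) > threshold:
            if pending_since is None:
                pending_since = now
            if now - pending_since >= for_seconds:
                return now
        else:
            pending_since = None  # condition broke, start over
    return None


HORIZON = 3 * 3600
# Hypothetical crash-looping pod: one restart every 5 minutes for 3 hours.
restart_times = list(range(300, HORIZON + 1, 300))

# rate(...[1m]) > 1 / (5 * 60) is the same as more than 0.2 restarts
# in the last minute; it must hold for 30 minutes.
short_range_fired_at = first_firing_time(
    restart_times, window=60, threshold=0.2,
    for_seconds=30 * 60, horizon=HORIZON)

# increase(...[1h]) > 5, which must hold for 10 minutes.
long_range_fired_at = first_firing_time(
    restart_times, window=3600, threshold=5,
    for_seconds=10 * 60, horizon=HORIZON)

print("1m rate rule fired at:", short_range_fired_at)
print("1h increase rule fired at:", long_range_fired_at)
```

In this model the one-minute rule never fires, because the condition is only true during the minute right after each restart, while the one-hour rule fires once the sixth restart has been in the window for 10 minutes.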

Brian
 


isha girdhar

Jun 12, 2018, 5:56:08 AM
to Prometheus Users
Thanks Brian, that was really helpful.


