Alert when pods are in CrashLoopBackOff

isha girdhar

Jun 8, 2018, 5:12:58 AM

to Prometheus Users

Currently, we alert if pods are restarting too much, and we use the rate function in our alert.
We checked the standard rules provided at https://github.com/coreos/prometheus-operator/blob/master/helm/exporter-kube-state/templates/kube-state-metrics.rules.yaml, where the increase function is used instead of rate. What is the difference, and which one is better to use in this scenario?

alert: K8sPodRestartingTooMuch
expr: rate(kube_pod_container_status_restarts_total[1m])
  > 1 / (5 * 60)
for: 30m
labels:
  severity: warning

alert: PodFrequentlyRestarting
expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
for: 10m
labels:
  severity: warning
annotations:
  description: Pod {{`{{$labels.namespace}}`}}/{{`{{$labels.pod}}`}} was restarted {{`{{$value}}`}}
     times within the last hour
  summary: Pod is restarting frequently
{{ end }}

Paul from okmeter.io

Jun 8, 2018, 9:07:01 AM

to Prometheus Users

Hi

Read this https://github.com/prometheus/prometheus/issues/3746

TL;DR: increase extrapolates, and that's not good.

Brian Brazil

Jun 8, 2018, 10:03:23 AM

to isha girdhar, Prometheus Users

On 8 June 2018 at 10:12, isha girdhar < hashag...@gmail.com > wrote:

Currently, we alert if pods are restarting too much, and we use the rate function in our alert.
We checked the standard rules provided at https://github.com/coreos/prometheus-operator/blob/master/helm/exporter-kube-state/templates/kube-state-metrics.rules.yaml, where the increase function is used instead of rate. What is the difference, and which one is better to use in this scenario?

The difference is fairly minor: increase is syntactic sugar over rate, so increase(x[1h]) is exactly the same as rate(x[1h]) * 3600. rate produces a per-second result, while increase gives the result over the whole time range you pass it.
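To make the relationship concrete, here is a minimal Python sketch of a simplified rate and increase over one range window. It is only a model: it ignores the extrapolation to the window boundaries that Prometheus applies, and the function names and sample data are hypothetical.

```python
# Simplified model of PromQL rate() and increase() over a single range window.
# Assumption: no extrapolation to the window edges (real Prometheus extrapolates).

def counter_increase(samples):
    """Sum the increase across (timestamp, value) samples, treating a drop as a counter reset."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter starts again from zero, so the new value is the increase.
        total += cur - prev if cur >= prev else cur
    return total

def simple_rate(samples, window_seconds):
    """Per-second average increase over the window."""
    return counter_increase(samples) / window_seconds

def simple_increase(samples, window_seconds):
    """increase is rate scaled by the length of the range."""
    return simple_rate(samples, window_seconds) * window_seconds

# Hypothetical restart counter, sampled every 10 minutes for one hour.
samples = [(0, 3), (600, 3), (1200, 4), (1800, 5), (2400, 5), (3000, 7), (3600, 9)]

print("rate per second:", simple_rate(samples, 3600))
print("increase over 1h:", simple_increase(samples, 3600))
```

Both numbers describe the same six restarts; only the unit differs.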

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscribe@googlegroups.com.
To post to this group, send email to prometheus-users@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/7df7e427-22b7-4571-9062-61cab7077f89%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

--

Brian Brazil

www.robustperception.io

Brian Brazil

Jun 8, 2018, 10:11:49 AM

to isha girdhar, Prometheus Users

Apparently Ctrl-Enter sends an email, let me finish that...

alert: K8sPodRestartingTooMuch
expr: rate(kube_pod_container_status_restarts_total[1m])
  > 1 / (5 * 60)

This effectively says to alert if there is more than 1/300 of a restart per second. As restarts are integers, the smallest non-zero value over the one-minute time range is 1/60 restarts per second, so this expression is a bit misleading.
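The arithmetic can be checked with a short Python sketch. It models the rate over a one-minute range as (restarts in the window) / 60 and ignores Prometheus's extrapolation; the helper name is hypothetical.

```python
from fractions import Fraction

WINDOW_SECONDS = 60                  # the [1m] range
THRESHOLD = Fraction(1, 5 * 60)      # 1/300 restarts per second, as in the rule

def rate_over_window(restarts_in_window):
    """Simplified per-second rate: whole restarts seen in the window, divided by its length."""
    return Fraction(restarts_in_window, WINDOW_SECONDS)

# Restarts come in whole numbers, so only these rates are possible in this model.
for restarts in range(4):
    rate = rate_over_window(restarts)
    print(restarts, "restart(s):", rate, "per second, above threshold:", rate > THRESHOLD)
```

Under this model, a single restart already exceeds the threshold, so the condition amounts to "at least one restart in the last minute", whatever the 1 / (5 * 60) suggests.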

This is going to be a very fragile alert: to trigger, it requires at a minimum one restart in every single minute for half an hour (the "for: 30m"), so it will have many false negatives.


for: 30m
labels:
  severity: warning

alert: PodFrequentlyRestarting
expr: increase(kube_pod_container_status_restarts_total[1h]) > 5

This one says to alert if there are more than 5 restarts in an hour, and this has to be the case for 10 minutes. This should be a robust alert.

In this case, the difference between increase and rate largely boils down to the readability of the expression. Written with rate, the second expression would be rate(kube_pod_container_status_restarts_total[1h]) > 5 / 3600

The lesson here is mostly to avoid short time ranges in your alerts.
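As a rough illustration of that lesson, the Python sketch below runs both rules against a hypothetical pod that restarts every two minutes. It is a simplified model rather than Prometheus's actual evaluation: rules are assumed to be evaluated once a minute, restarts are simply counted inside the range, extrapolation is ignored, and all names are made up for the example.

```python
def first_firing_time(restart_times, window, min_restarts, for_seconds, end, step=60):
    """Return the first evaluation time (seconds) at which the alert fires, or None.

    The condition is "more than min_restarts restarts in the last `window` seconds",
    and it must hold at every evaluation for `for_seconds` before the alert fires.
    """
    pending_since = None
    for now in range(0, end + 1, step):
        in_window = sum(1 for t in restart_times if now - window < t <= now)
        if in_window > min_restarts:
            if pending_since is None:
                pending_since = now
            if now - pending_since >= for_seconds:
                return now
        else:
            # One evaluation without the condition resets the "for" timer.
            pending_since = None
    return None

# Hypothetical pod restarting every 2 minutes for 2 hours.
restarts = list(range(30, 7200, 120))

# rate(...[1m]) > 1/300 for 30m, modeled as "more than 0 restarts in the last 60 s".
short_window = first_firing_time(restarts, window=60, min_restarts=0,
                                 for_seconds=30 * 60, end=7200)

# increase(...[1h]) > 5 for 10m.
long_window = first_firing_time(restarts, window=3600, min_restarts=5,
                                for_seconds=10 * 60, end=7200)

print("1m range, for 30m:", short_window)
print("1h range, for 10m:", long_window)
```

With the one-minute range, every other evaluation sees no restart, so the "for" timer keeps resetting and the alert never fires in this model, while the one-hour range fires once the sixth restart has been in the window for ten minutes.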

Brian

for: 10m
labels:
  severity: warning
annotations:
  description: Pod {{`{{$labels.namespace}}`}}/{{`{{$labels.pod}}`}} was restarted {{`{{$value}}`}}
     times within the last hour
  summary: Pod is restarting frequently
{{ end }}

--

Brian Brazil

www.robustperception.io

isha girdhar

Jun 12, 2018, 5:56:08 AM

to Prometheus Users

Thanks Brian, that was really helpful.
