Alert when Pod is close to Resource Limits


Brett Porter

Mar 16, 2018, 1:55:16 PM
to Prometheus Users
Hi there,

I'm trying to alert when a pod's memory usage reaches 90% of its resource limit. I have 4 replicas, and each one's memory usage varies differently.

 - alert: ExamplePodLowMemory
   expr: container_memory_usage_bytes{job=~"prod-kubernetes.*",container_name="container-name"} / kube_pod_container_resource_limits_memory_bytes{container=~"container-name"} > .9
   for: 1m
   labels:
     severity: critical
   annotations:
     description: '{{ $labels.pod_name }} is low on memory.'

If I hardcode the expression it works:
'sum by (pod_name)(container_memory_usage_bytes{job=~"prod-kubernetes.*",container_name="container-name"}) / 4294967296 > .9'

4294967296 is the value of 'kube_pod_container_resource_limits_memory_bytes{container=~"container-name"}'. But if I increase the memory limits at some point, I don't want to have to change the alert config.

Is there a way to achieve this?

brett....@axial.net

Mar 18, 2018, 10:19:00 PM
to Prometheus Users
I figured I wouldn't get a response since this is sort of an "I don't know, can you solve it for me?" kind of question. But Googling around for a few days helps, and I solved my own question. Here is the answer for anyone else having the same issue I had when getting started with Prometheus.

sum by (pod_name)(container_memory_usage_bytes{job=~"prod-kubernetes.*",container_name=~"container-name.*"}) / ignoring(pod_name) group_left max(kube_pod_container_resource_limits_memory_bytes{container="container-name"}) > .9

Cheers!
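For anyone who wants to drop that straight into a rule file, here is a minimal sketch that reuses the rule skeleton from the first post with only the expr swapped for the working expression above (the name, threshold, for duration and severity are the original placeholders, not recommendations):

 - alert: ExamplePodLowMemory
   expr: sum by (pod_name)(container_memory_usage_bytes{job=~"prod-kubernetes.*",container_name=~"container-name.*"}) / ignoring(pod_name) group_left max(kube_pod_container_resource_limits_memory_bytes{container="container-name"}) > .9
   for: 1m
   labels:
     severity: critical
   annotations:
     description: '{{ $labels.pod_name }} is low on memory.'

The ignoring(pod_name) group_left part lets the many per-pod usage series on the left match the single limit series on the right even though only the left-hand side carries a pod_name label, so the alert keeps working when the limit changes.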

gwla...@gmail.com

May 6, 2018, 12:37:44 PM
to Prometheus Users
Hi Brett,

Thanks for the help, your logic worked for me. I couldn't use the complete expression exactly as you pasted it, so I had to customize it for my environment, as below:

----- for POD memory -----

ALERT POD_MEMORY_HIGH_UTILIZATION
  IF sum(rate(container_memory_working_set_bytes{container_name!="POD",image!="",name=~"^k8s_.*"}[5m]) / 15403581) BY (container_name, pod_name) > 50
  FOR 1m
  LABELS {severity="warning"}
  ANNOTATIONS {description="pod {{$labels.pod_name}} is using high memory", summary="HIGH Memory USAGE WARNING for {{$labels.pod_name}}"}
===== for POD CPU =====


ALERT POD_CPU_HIGH_UTILIZATION
  IF sum(rate(container_cpu_usage_seconds_total{image!="",name=~"^k8s_.*"}[5m])) BY (pod_name) > 50
  FOR 1m
  LABELS {severity="warning"}
  ANNOTATIONS {description="pod {{$labels.pod_name}} is using high cpu", summary="HIGH CPU USAGE WARNING for POD {{$labels.pod_name}} on {{$labels.host}}"}

=====
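These rules use the Prometheus 1.x rule syntax. Since Prometheus 2.0, rules are defined in YAML files, so on 2.x the CPU rule above would look roughly like the sketch below (a mechanical conversion of the same expression and threshold; note that after sum ... by (pod_name) only pod_name is available to the templates, so the {{$labels.host}} reference has been dropped):

groups:
- name: pod-resources
  rules:
  - alert: POD_CPU_HIGH_UTILIZATION
    expr: sum(rate(container_cpu_usage_seconds_total{image!="",name=~"^k8s_.*"}[5m])) by (pod_name) > 50
    for: 1m
    labels:
      severity: warning
    annotations:
      description: 'pod {{ $labels.pod_name }} is using high cpu'
      summary: 'HIGH CPU USAGE WARNING for POD {{ $labels.pod_name }}'

The memory rule converts the same way.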

iviak...@gmail.com

Dec 13, 2018, 8:40:21 AM
to Prometheus Users
Hi Brett,

Thanks for replying to your own question. I came up with a solution that is generic for any container by relabelling some labels. I don't know why yours works for you, since matching needs the same labels on both sides of the division, and in your query the container_name label != the container label.


For CPU:

round(100 * label_join(label_join(sum(rate(container_cpu_usage_seconds_total{container_name != "POD", image !=""}[1m])) by (pod_name, container_name, namespace) , "pod", "", "pod_name"), "container", "", "container_name") / ignoring(container_name, pod_name) avg(kube_pod_container_resource_limits_cpu_cores) by (pod, container, namespace)) > 75


For memory:

round(100 * label_join(label_join(sum(container_memory_usage_bytes{container_name != "POD", image !=""}) by (container_name, pod_name, namespace), "pod", "", "pod_name"), "container", "", "container_name") / ignoring(container_name, pod_name) avg(kube_pod_container_resource_limits_memory_bytes) by (container, pod, namespace)) > 75


iviak...@gmail.com

Dec 13, 2018, 9:25:12 AM
to Prometheus Users
Careful with the alert I posted for memory: you need to use container_memory_working_set_bytes (which equals total memory) instead of container_memory_usage_bytes (which equals used memory in Linux terms).
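Putting that correction together with the memory expression from the previous message, a sketch of the full rule might look like this (the 75% threshold and the label handling are unchanged from that post; the alert name and the for duration are just illustrative):

- alert: ContainerMemoryCloseToLimit
  expr: round(100 * label_join(label_join(sum(container_memory_working_set_bytes{container_name != "POD", image !=""}) by (container_name, pod_name, namespace), "pod", "", "pod_name"), "container", "", "container_name") / ignoring(container_name, pod_name) avg(kube_pod_container_resource_limits_memory_bytes) by (container, pod, namespace)) > 75
  for: 5m
  labels:
    severity: warning
  annotations:
    description: '{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is above 75% of its memory limit.'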



Meier

Dec 28, 2018, 4:14:49 AM
to Prometheus Users
OpenShift Origin ships a more generic resource quota alert:

 100 * kube_resourcequota{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics", type="used"}
 / ignoring(instance, job, type)
 (kube_resourcequota{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics", type="hard"} > 0)
 > 90     


which can easily be adapted with suitable label filters and ignoring clauses.
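Wrapped into a rule, that might look roughly like the sketch below (the namespace regex and the 90% threshold come from the expression above; the alert name and for duration are made up for illustration):

- alert: ResourceQuotaAlmostFull
  expr: 100 * kube_resourcequota{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics", type="used"} / ignoring(instance, job, type) (kube_resourcequota{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics", type="hard"} > 0) > 90
  for: 15m
  labels:
    severity: warning
  annotations:
    description: 'Namespace {{ $labels.namespace }} is using more than 90% of its {{ $labels.resource }} quota.'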
