Alert when Pod is close to Resource Limits


Brett Porter

Mar 16, 2018, 1:55:16 PM
to Prometheus Users
Hi there,

I'm trying to alert when a pod's memory usage reaches 90% of its resource limit. I have 4 replicas, and each one's memory usage varies differently.

 - alert: ExamplePodLowMemory
   expr: container_memory_usage_bytes{job=~"prod-kubernetes.*",container_name="container-name"} / kube_pod_container_resource_limits_memory_bytes{container=~"container-name"} > .9
   for: 1m
   labels:
     severity: critical
   annotations:
     description: '{{ $labels.pod_name }} is low on memory.'

If I hardcode the expression it works:
'sum by (pod_name)(container_memory_usage_bytes{job=~"prod-kubernetes.*",container_name="container-name"}) / 4294967296 > .9'

4294967296 is the value of 'kube_pod_container_resource_limits_memory_bytes{container=~"container-name"}'. But if I increase the memory limits at some point, I don't want to have to change the alert config.

Is there a way to achieve this?

brett....@axial.net

Mar 18, 2018, 10:19:00 PM
to Prometheus Users
I figured I wouldn't get a response since this is sort of an "I don't know, can you solve it for me?" kind of question. But Googling around for a few days helps, and I solved my own question. Here is the answer for anyone else having the same issue I had when getting started with Prometheus.

sum by (pod_name)(container_memory_usage_bytes{job=~"prod-kubernetes.*",container_name=~"container-name.*"}) / ignoring(pod_name) group_left max(kube_pod_container_resource_limits_memory_bytes{container="container-name"}) > .9

Cheers!
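For anyone who wants to drop that straight into a rule file, here is a minimal sketch that reuses the rule skeleton from the first post with only the expr swapped for the working expression above (the name, threshold, for duration and severity are the original placeholders, not recommendations):

 - alert: ExamplePodLowMemory
   expr: sum by (pod_name)(container_memory_usage_bytes{job=~"prod-kubernetes.*",container_name=~"container-name.*"}) / ignoring(pod_name) group_left max(kube_pod_container_resource_limits_memory_bytes{container="container-name"}) > .9
   for: 1m
   labels:
     severity: critical
   annotations:
     description: '{{ $labels.pod_name }} is low on memory.'

The ignoring(pod_name) group_left part lets the many per-pod usage series on the left match the single limit series on the right even though only the left-hand side carries a pod_name label, so the alert keeps working when the limit changes.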

gwla...@gmail.com

May 6, 2018, 12:37:44 PM
to Prometheus Users
Hi Brett,

Thanks for the help, your logic worked for me. I couldn't use the complete expression exactly as you pasted it, so I had to customize it for my environment, as below:

----- for POD memory -----

ALERT POD_MEMORY_HIGH_UTILIZATION
  IF sum(rate(container_memory_working_set_bytes{container_name!="POD",image!="",name=~"^k8s_.*"}[5m]) / 15403581) BY (container_name, pod_name) > 50
  FOR 1m
  LABELS {severity="warning"}
  ANNOTATIONS {description="pod {{$labels.pod_name}} is using high memory", summary="HIGH Memory USAGE WARNING for {{$labels.pod_name}}"}
===== for POD CPU =====


ALERT POD_CPU_HIGH_UTILIZATION
  IF sum(rate(container_cpu_usage_seconds_total{image!="",name=~"^k8s_.*"}[5m])) BY (pod_name) > 50
  FOR 1m
  LABELS {severity="warning"}
  ANNOTATIONS {description="pod {{$labels.pod_name}} is using high cpu", summary="HIGH CPU USAGE WARNING for POD {{$labels.pod_name}} on {{$labels.host}}"}

=====
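These rules use the Prometheus 1.x rule syntax. Since Prometheus 2.0, rules are defined in YAML files, so on 2.x the CPU rule above would look roughly like the sketch below (a mechanical conversion of the same expression and threshold; note that after sum ... by (pod_name) only pod_name is available to the templates, so the {{$labels.host}} reference has been dropped):

groups:
- name: pod-resources
  rules:
  - alert: POD_CPU_HIGH_UTILIZATION
    expr: sum(rate(container_cpu_usage_seconds_total{image!="",name=~"^k8s_.*"}[5m])) by (pod_name) > 50
    for: 1m
    labels:
      severity: warning
    annotations:
      description: 'pod {{ $labels.pod_name }} is using high cpu'
      summary: 'HIGH CPU USAGE WARNING for POD {{ $labels.pod_name }}'

The memory rule converts the same way.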

iviak...@gmail.com

Dec 13, 2018, 8:40:21 AM
to Prometheus Users
Hi Brett,

Thanks for replying to your own question. I came up with a solution that is generic for any container by relabelling some labels. I don't know why yours works for you, since matching needs the same labels on both sides of the division, and in your query the container_name label != the container label.


For CPU:

round(100 * label_join(label_join(sum(rate(container_cpu_usage_seconds_total{container_name != "POD", image !=""}[1m])) by (pod_name, container_name, namespace) , "pod", "", "pod_name"), "container", "", "container_name") / ignoring(container_name, pod_name) avg(kube_pod_container_resource_limits_cpu_cores) by (pod, container, namespace)) > 75


For memory:

round(100 * label_join(label_join(sum(container_memory_usage_bytes{container_name != "POD", image !=""}) by (container_name, pod_name, namespace), "pod", "", "pod_name"), "container", "", "container_name") / ignoring(container_name, pod_name) avg(kube_pod_container_resource_limits_memory_bytes) by (container, pod, namespace)) > 75


iviak...@gmail.com

Dec 13, 2018, 9:25:12 AM
to Prometheus Users
Careful with the alert I posted for memory: you need to use container_memory_working_set_bytes (which equals total memory) instead of container_memory_usage_bytes (which equals used memory in Linux terms).
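Putting that correction together with the memory expression from the previous message, a sketch of the full rule might look like this (the 75% threshold and the label handling are unchanged from that post; the alert name and the for duration are just illustrative):

- alert: ContainerMemoryCloseToLimit
  expr: round(100 * label_join(label_join(sum(container_memory_working_set_bytes{container_name != "POD", image !=""}) by (container_name, pod_name, namespace), "pod", "", "pod_name"), "container", "", "container_name") / ignoring(container_name, pod_name) avg(kube_pod_container_resource_limits_memory_bytes) by (container, pod, namespace)) > 75
  for: 5m
  labels:
    severity: warning
  annotations:
    description: '{{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is above 75% of its memory limit.'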



Meier

Dec 28, 2018, 4:14:49 AM
to Prometheus Users
OpenShift Origin ships a more generic resource quota alert:

 100 * kube_resourcequota{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics", type="used"}
 / ignoring(instance, job, type)
 (kube_resourcequota{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics", type="hard"} > 0)
 > 90     


which can easily be adapted with suitable label filters and ignoring clauses.
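Wrapped into a rule, that might look roughly like the sketch below (the namespace regex and the 90% threshold come from the expression above; the alert name and for duration are made up for illustration):

- alert: ResourceQuotaAlmostFull
  expr: 100 * kube_resourcequota{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics", type="used"} / ignoring(instance, job, type) (kube_resourcequota{namespace=~"(openshift-.*|kube-.*|default|logging)",job="kube-state-metrics", type="hard"} > 0) > 90
  for: 15m
  labels:
    severity: warning
  annotations:
    description: 'Namespace {{ $labels.namespace }} is using more than 90% of its {{ $labels.resource }} quota.'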
