How to optimize Prometheus memory for best performance and prevent pods from crashing due to memory


Ravi Teja Reddy

Apr 13, 2020, 12:48:07 AM
to Prometheus Users
Hi all,

I am new to Prometheus. We started using it to monitor our kops-based Kubernetes cluster, and most of the time it runs out of memory, so we kept raising the pod memory limits, which is not the right approach. We started with a memory limit of 200Mi and have now ended up at almost 6000Mi (around 6GB), which is too much. I tried adding a few flags and it did not help. Any help in making sure my Prometheus does not run out of memory and crash would be appreciated.


My Prometheus server version: 2.4.2
Deployed in a kops Kubernetes cluster

Current configuration:

Name:         prometheus-k8s-1

Namespace:    monitoring

Priority:     0

Node:         ip-172-31-98-1.us-west-2.compute.internal/172.31.98.1

Start Time:   Mon, 13 Apr 2020 09:35:05 +0530

Labels:       app=prometheus

              controller-revision-hash=prometheus-k8s-dcbcc8f48

              prometheus=k8s

              statefulset.kubernetes.io/pod-name=prometheus-k8s-1

Annotations:  cni.projectcalico.org/podIP: 100.119.250.205/32

Status:       Running

IP:           100.119.250.205

IPs:

  IP:           100.119.250.205

Controlled By:  StatefulSet/prometheus-k8s

Containers:

  prometheus:

    Container ID:  docker://d5d29d96371596f99b04c9b8c673577612ba8ee0e55c7148efd9be9fbec88fca

    Image:         quay.io/prometheus/prometheus:v2.4.2

    Image ID:      docker-pullable://quay.io/prometheus/prometheus@sha256:8e4d8817b1eb40d793f7207fd064ef2a3d47e3dd6290738ca3c6d642489cea93

    Port:          9090/TCP

    Host Port:     0/TCP

    Args:

      --config.file=/etc/prometheus/config_out/prometheus.env.yaml

      --storage.tsdb.path=/prometheus

      --storage.tsdb.retention=90d

      --web.enable-lifecycle

      --storage.tsdb.no-lockfile

      --web.external-url=http://prometheus:9090

      --web.route-prefix=/

    State:          Running

      Started:      Mon, 13 Apr 2020 10:09:35 +0530

    Last State:     Terminated

      Reason:       OOMKilled

      Exit Code:    137

      Started:      Mon, 13 Apr 2020 10:02:33 +0530

      Finished:     Mon, 13 Apr 2020 10:08:07 +0530

    Ready:          False

    Restart Count:  5

    Limits:

      cpu:     300m

      memory:  6000Mi

    Requests:

      cpu:        200m

      memory:     5700Mi

    Liveness:     http-get http://:web/-/healthy delay=0s timeout=3s period=5s #success=1 #failure=6

    Readiness:    http-get http://:web/-/ready delay=0s timeout=3s period=5s #success=1 #failure=120

    Environment:  <none>

    Mounts:

      /etc/prometheus/config_out from config-out (ro)

      /prometheus from prometheus-k8s-db (rw,path="prometheus-db")

      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-k8s-token-44tjh (ro)

  prometheus-config-reloader:

    Container ID:  docker://1dea8054b610bc82411f84454fba3b83089fc85b138a1b5ab49a57571daff822

    Image:         quay.io/coreos/prometheus-config-reloader:v0.0.4

    Image ID:      docker-pullable://quay.io/coreos/prometheus-config-reloader@sha256:b15f35af5c3e4bd75c7e74bd27b862f1c119fc51080a838e5b3399a134c862e5

    Port:          <none>

    Host Port:     <none>

    Args:

      --reload-url=http://localhost:9090/-/reload

      --config-file=/etc/prometheus/config/prometheus.yaml

      --rule-list-file=/etc/prometheus/config/configmaps.json

      --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml

      --rule-dir=/etc/prometheus/config_out/rules

    State:          Running

      Started:      Mon, 13 Apr 2020 09:35:31 +0530

    Ready:          True

    Restart Count:  0

    Limits:

      cpu:     10m

      memory:  50Mi

    Requests:

      cpu:     10m

      memory:  50Mi

    Environment:

      POD_NAME:  prometheus-k8s-1 (v1:metadata.name)

    Mounts:

      /etc/prometheus/config from config (rw)

      /etc/prometheus/config_out from config-out (rw)

      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-k8s-token-44tjh (ro)

Conditions:

  Type              Status

  Initialized       True 

  Ready             False 

  ContainersReady   False 

  PodScheduled      True 

Volumes:

  prometheus-k8s-db:

    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)

    ClaimName:  prometheus-k8s-db-prometheus-k8s-1

    ReadOnly:   false

  config:

    Type:        Secret (a volume populated by a Secret)

    SecretName:  prometheus-k8s

    Optional:    false

  config-out:

    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)

    Medium:     

    SizeLimit:  <unset>

  prometheus-k8s-token-44tjh:

    Type:        Secret (a volume populated by a Secret)

    SecretName:  prometheus-k8s-token-44tjh

    Optional:    false

QoS Class:       Burstable

Node-Selectors:  <none>

Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s

                 node.kubernetes.io/unreachable:NoExecute for 300s

Events:

  Type     Reason                  Age                    From                                                Message

  ----     ------                  ----                   ----                                                -------

  Normal   Scheduled               <unknown>              default-scheduler                                   Successfully assigned monitoring/prometheus-k8s-1 to ip-172-31-98-1.us-west-2.compute.internal

  Normal   SuccessfulAttachVolume  39m                    attachdetach-controller                             AttachVolume.Attach succeeded for volume "pvc-ef266625-ca5e-11e9-8b8f-069b6823c94e"

  Normal   Pulled                  39m                    kubelet, ip-172-31-98-1.us-west-2.compute.internal  Container image "quay.io/prometheus/prometheus:v2.4.2" already present on machine

  Normal   Created                 39m                    kubelet, ip-172-31-98-1.us-west-2.compute.internal  Created container prometheus

  Normal   Started                 39m                    kubelet, ip-172-31-98-1.us-west-2.compute.internal  Started container prometheus

  Normal   Pulled                  39m                    kubelet, ip-172-31-98-1.us-west-2.compute.internal  Container image "quay.io/coreos/prometheus-config-reloader:v0.0.4" already present on machine

  Normal   Created                 39m                    kubelet, ip-172-31-98-1.us-west-2.compute.internal  Created container prometheus-config-reloader

  Normal   Started                 39m                    kubelet, ip-172-31-98-1.us-west-2.compute.internal  Started container prometheus-config-reloader

  Warning  Unhealthy               4m39s (x380 over 39m)  kubelet, ip-172-31-98-1.us-west-2.compute.internal  Readiness probe failed: HTTP probe failed with statuscode: 503
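
For reference, the `--storage.tsdb.retention=90d` in the Args above is a likely memory and disk driver. A hedged sketch of lowering it via the prometheus-operator `Prometheus` resource (field names from the operator's CRD; the retention and resource values are illustrative, not a recommendation):

```yaml
# Hypothetical excerpt of a prometheus-operator "Prometheus" resource.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  retention: 15d          # down from 90d: less data kept and indexed
  resources:
    requests:
      cpu: 500m
      memory: 2Gi
    limits:
      memory: 4Gi         # leave headroom above the request to avoid OOMKill
```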

Brian Candler

Apr 13, 2020, 4:33:18 AM
to Prometheus Users
On Monday, 13 April 2020 05:48:07 UTC+1, Ravi Teja Reddy wrote:
 We started with a memory limit of 200Mi and have now ended up at almost 6000Mi, which is around 6GB and is too much.

Firstly, why are you using such an old version of Prometheus? v2.4.2 was released in September 2018.

Secondly, what makes you think 6GiB is "too much" RAM for Prometheus? How many metrics are active (head stats)? How many distinct values do your highest-cardinality labels have? This information is easy to get from newer versions of Prometheus (under Status > Runtime & Build Information), but I don't think it was added until v2.14.0.
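
To put numbers on those questions, a few standard cardinality queries (assuming a Prometheus new enough to expose them; the `topk` query can itself be expensive on large servers):

```promql
# Number of series currently in the head (in-memory) block:
prometheus_tsdb_head_series

# Top 10 metric names by active series count:
topk(10, count by (__name__) ({__name__=~".+"}))

# Distinct values of one suspect label, e.g. "pod":
count(count by (pod) ({pod=~".+"}))
```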

Ravi Teja Reddy Bonthu

Apr 24, 2020, 3:37:14 PM
to Brian Candler, Prometheus Users
Thank you, Brian Candler!

I have upgraded prometheus-operator (0.38.0), Prometheus (2.17.1), Alertmanager, Grafana, node-exporter, and kube-state-metrics to their latest versions. I can now see those stats, and what you said is absolutely correct, i.e. under Status > Runtime & Build Information.

I would like to know the rule of thumb, and the flags supported by newer Prometheus versions, for better memory and storage performance.

Can you please help me with a rule of thumb for storage and memory: what calculation can I use to estimate my memory and storage needs, and the resource limits (CPU and memory) to set for the Prometheus pods in Kubernetes?

I still have memory issues; my Prometheus pod crashes with OOMKilled most of the time. Any help or recommendation would be appreciated.

Thanks

Brian Candler

Apr 24, 2020, 4:32:31 PM
to Prometheus Users
On Friday, 24 April 2020 20:37:14 UTC+1, Ravi Teja Reddy Bonthu wrote:
Can you please help me with a rule of thumb for storage and memory: what calculation can I use to estimate my memory and storage needs, and the resource limits (CPU and memory) to set for the Prometheus pods in Kubernetes?

There is a calculator here:

As I said, it depends very much on how many timeseries you are scraping.
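
The rough arithmetic behind such calculators can be sketched as follows. This is a back-of-the-envelope estimate only: the ~2 bytes/sample figure is a ballpark from Prometheus's storage documentation, and the ~8 KiB/series memory overhead is a rough community figure that varies by version and churn.

```python
# Back-of-the-envelope Prometheus sizing (rough constants, not exact):
#   disk ~ retention_seconds * samples_per_second * bytes_per_sample
#   RAM  ~ active_series * per_series_overhead

def disk_bytes(retention_days: float, samples_per_sec: float,
               bytes_per_sample: float = 2.0) -> float:
    """Estimate TSDB disk usage over the retention window."""
    return retention_days * 86400 * samples_per_sec * bytes_per_sample

def ram_bytes(active_series: int, bytes_per_series: int = 8 * 1024) -> float:
    """Very rough head-block memory estimate."""
    return active_series * bytes_per_series

# Example: 100k active series scraped every 30s, kept for 15 days.
series = 100_000
samples_per_sec = series / 30
print(f"disk ~ {disk_bytes(15, samples_per_sec) / 1e9:.1f} GB")   # disk ~ 8.6 GB
print(f"ram  ~ {ram_bytes(series) / 2**30:.1f} GiB")              # ram  ~ 0.8 GiB
```

The main inputs you need from your own server are the active series count and the scrape interval; the OOM-killed pod above suggests the real number of series is far higher than the limits assume.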