How to optimize Prometheus memory for best performance and prevent pods from crashing due to memory


Ravi Teja Reddy

Apr 13, 2020, 12:48:07 AM
to Prometheus Users
Hi all,

I am new to Prometheus. We started using it to monitor our kops-based Kubernetes cluster, and most of the time it runs out of memory, so we kept raising the pod memory limits, which is not the right approach. We started with a memory limit of 200Mi and have now ended up at almost 6000Mi (around 6GB), which is too much. I tried adding a few flags and it did not help. Any help in making sure my Prometheus does not run out of memory and crash would be appreciated.


My Prometheus server version: 2.4.2
Deployed in a kops Kubernetes cluster

Current configuration:

Name:         prometheus-k8s-1

Namespace:    monitoring

Priority:     0

Node:         ip-172-31-98-1.us-west-2.compute.internal/172.31.98.1

Start Time:   Mon, 13 Apr 2020 09:35:05 +0530

Labels:       app=prometheus

              controller-revision-hash=prometheus-k8s-dcbcc8f48

              prometheus=k8s

              statefulset.kubernetes.io/pod-name=prometheus-k8s-1

Annotations:  cni.projectcalico.org/podIP: 100.119.250.205/32

Status:       Running

IP:           100.119.250.205

IPs:

  IP:           100.119.250.205

Controlled By:  StatefulSet/prometheus-k8s

Containers:

  prometheus:

    Container ID:  docker://d5d29d96371596f99b04c9b8c673577612ba8ee0e55c7148efd9be9fbec88fca

    Image:         quay.io/prometheus/prometheus:v2.4.2

    Image ID:      docker-pullable://quay.io/prometheus/prometheus@sha256:8e4d8817b1eb40d793f7207fd064ef2a3d47e3dd6290738ca3c6d642489cea93

    Port:          9090/TCP

    Host Port:     0/TCP

    Args:

      --config.file=/etc/prometheus/config_out/prometheus.env.yaml

      --storage.tsdb.path=/prometheus

      --storage.tsdb.retention=90d

      --web.enable-lifecycle

      --storage.tsdb.no-lockfile

      --web.external-url=http://prometheus:9090

      --web.route-prefix=/

    State:          Running

      Started:      Mon, 13 Apr 2020 10:09:35 +0530

    Last State:     Terminated

      Reason:       OOMKilled

      Exit Code:    137

      Started:      Mon, 13 Apr 2020 10:02:33 +0530

      Finished:     Mon, 13 Apr 2020 10:08:07 +0530

    Ready:          False

    Restart Count:  5

    Limits:

      cpu:     300m

      memory:  6000Mi

    Requests:

      cpu:        200m

      memory:     5700Mi

    Liveness:     http-get http://:web/-/healthy delay=0s timeout=3s period=5s #success=1 #failure=6

    Readiness:    http-get http://:web/-/ready delay=0s timeout=3s period=5s #success=1 #failure=120

    Environment:  <none>

    Mounts:

      /etc/prometheus/config_out from config-out (ro)

      /prometheus from prometheus-k8s-db (rw,path="prometheus-db")

      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-k8s-token-44tjh (ro)

  prometheus-config-reloader:

    Container ID:  docker://1dea8054b610bc82411f84454fba3b83089fc85b138a1b5ab49a57571daff822

    Image:         quay.io/coreos/prometheus-config-reloader:v0.0.4

    Image ID:      docker-pullable://quay.io/coreos/prometheus-config-reloader@sha256:b15f35af5c3e4bd75c7e74bd27b862f1c119fc51080a838e5b3399a134c862e5

    Port:          <none>

    Host Port:     <none>

    Args:

      --reload-url=http://localhost:9090/-/reload

      --config-file=/etc/prometheus/config/prometheus.yaml

      --rule-list-file=/etc/prometheus/config/configmaps.json

      --config-envsubst-file=/etc/prometheus/config_out/prometheus.env.yaml

      --rule-dir=/etc/prometheus/config_out/rules

    State:          Running

      Started:      Mon, 13 Apr 2020 09:35:31 +0530

    Ready:          True

    Restart Count:  0

    Limits:

      cpu:     10m

      memory:  50Mi

    Requests:

      cpu:     10m

      memory:  50Mi

    Environment:

      POD_NAME:  prometheus-k8s-1 (v1:metadata.name)

    Mounts:

      /etc/prometheus/config from config (rw)

      /etc/prometheus/config_out from config-out (rw)

      /var/run/secrets/kubernetes.io/serviceaccount from prometheus-k8s-token-44tjh (ro)

Conditions:

  Type              Status

  Initialized       True 

  Ready             False 

  ContainersReady   False 

  PodScheduled      True 

Volumes:

  prometheus-k8s-db:

    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)

    ClaimName:  prometheus-k8s-db-prometheus-k8s-1

    ReadOnly:   false

  config:

    Type:        Secret (a volume populated by a Secret)

    SecretName:  prometheus-k8s

    Optional:    false

  config-out:

    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)

    Medium:     

    SizeLimit:  <unset>

  prometheus-k8s-token-44tjh:

    Type:        Secret (a volume populated by a Secret)

    SecretName:  prometheus-k8s-token-44tjh

    Optional:    false

QoS Class:       Burstable

Node-Selectors:  <none>

Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s

                 node.kubernetes.io/unreachable:NoExecute for 300s

Events:

  Type     Reason                  Age                    From                                                Message

  ----     ------                  ----                   ----                                                -------

  Normal   Scheduled               <unknown>              default-scheduler                                   Successfully assigned monitoring/prometheus-k8s-1 to ip-172-31-98-1.us-west-2.compute.internal

  Normal   SuccessfulAttachVolume  39m                    attachdetach-controller                             AttachVolume.Attach succeeded for volume "pvc-ef266625-ca5e-11e9-8b8f-069b6823c94e"

  Normal   Pulled                  39m                    kubelet, ip-172-31-98-1.us-west-2.compute.internal  Container image "quay.io/prometheus/prometheus:v2.4.2" already present on machine

  Normal   Created                 39m                    kubelet, ip-172-31-98-1.us-west-2.compute.internal  Created container prometheus

  Normal   Started                 39m                    kubelet, ip-172-31-98-1.us-west-2.compute.internal  Started container prometheus

  Normal   Pulled                  39m                    kubelet, ip-172-31-98-1.us-west-2.compute.internal  Container image "quay.io/coreos/prometheus-config-reloader:v0.0.4" already present on machine

  Normal   Created                 39m                    kubelet, ip-172-31-98-1.us-west-2.compute.internal  Created container prometheus-config-reloader

  Normal   Started                 39m                    kubelet, ip-172-31-98-1.us-west-2.compute.internal  Started container prometheus-config-reloader

  Warning  Unhealthy               4m39s (x380 over 39m)  kubelet, ip-172-31-98-1.us-west-2.compute.internal  Readiness probe failed: HTTP probe failed with statuscode: 503
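
For reference, the `--storage.tsdb.retention=90d` in the Args above is a likely memory and disk driver. A hedged sketch of lowering it via the prometheus-operator `Prometheus` resource (field names from the operator's CRD; the retention and resource values are illustrative, not a recommendation):

```yaml
# Hypothetical excerpt of a prometheus-operator "Prometheus" resource.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  retention: 15d          # down from 90d: less data kept and indexed
  resources:
    requests:
      cpu: 500m
      memory: 2Gi
    limits:
      memory: 4Gi         # leave headroom above the request to avoid OOMKill
```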

Brian Candler

Apr 13, 2020, 4:33:18 AM
to Prometheus Users
On Monday, 13 April 2020 05:48:07 UTC+1, Ravi Teja Reddy wrote:
 We started with a memory limit of 200Mi and have now ended up at almost 6000Mi, which is around 6GB and is too much.

Firstly, why are you using such an old version of Prometheus? v2.4.2 was released in September 2018.

Secondly, what makes you think 6GiB is "too much" RAM for Prometheus? How many metrics are active (head stats)? How many distinct values do your highest-cardinality labels have? This information is easy to get from newer versions of Prometheus (under Status > Runtime & Build Information), but I don't think it was added until v2.14.0.
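
To put numbers on those questions, a few standard cardinality queries (assuming a Prometheus new enough to expose them; the `topk` query can itself be expensive on large servers):

```promql
# Number of series currently in the head (in-memory) block:
prometheus_tsdb_head_series

# Top 10 metric names by active series count:
topk(10, count by (__name__) ({__name__=~".+"}))

# Distinct values of one suspect label, e.g. "pod":
count(count by (pod) ({pod=~".+"}))
```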

Ravi Teja Reddy Bonthu

Apr 24, 2020, 3:37:14 PM
to Brian Candler, Prometheus Users
Thank you, Brian Candler!

I have upgraded prometheus-operator (0.38.0), Prometheus (2.17.1), Alertmanager, Grafana, node-exporter, and kube-state-metrics to their latest versions. I can now see those stats, and what you said is absolutely correct, i.e. under Status > Runtime & Build Information.

I would like to know the rule of thumb, and the flags supported by newer Prometheus versions, for better memory and storage performance.

Can you please help me with a rule of thumb for storage and memory: what calculation can I use to estimate my memory and storage needs, and the resource limits (CPU and memory) to set for the Prometheus pods in Kubernetes?

I still have memory issues; my Prometheus pod crashes with OOMKilled most of the time. Any help or recommendation would be appreciated.

Thanks

Brian Candler

Apr 24, 2020, 4:32:31 PM
to Prometheus Users
On Friday, 24 April 2020 20:37:14 UTC+1, Ravi Teja Reddy Bonthu wrote:
Can you please help me with a rule of thumb for storage and memory: what calculation can I use to estimate my memory and storage needs, and the resource limits (CPU and memory) to set for the Prometheus pods in Kubernetes?

There is a calculator here:

As I said, it depends very much on how many timeseries you are scraping.
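
The rough arithmetic behind such calculators can be sketched as follows. This is a back-of-the-envelope estimate only: the ~2 bytes/sample figure is a ballpark from Prometheus's storage documentation, and the ~8 KiB/series memory overhead is a rough community figure that varies by version and churn.

```python
# Back-of-the-envelope Prometheus sizing (rough constants, not exact):
#   disk ~ retention_seconds * samples_per_second * bytes_per_sample
#   RAM  ~ active_series * per_series_overhead

def disk_bytes(retention_days: float, samples_per_sec: float,
               bytes_per_sample: float = 2.0) -> float:
    """Estimate TSDB disk usage over the retention window."""
    return retention_days * 86400 * samples_per_sec * bytes_per_sample

def ram_bytes(active_series: int, bytes_per_series: int = 8 * 1024) -> float:
    """Very rough head-block memory estimate."""
    return active_series * bytes_per_series

# Example: 100k active series scraped every 30s, kept for 15 days.
series = 100_000
samples_per_sec = series / 30
print(f"disk ~ {disk_bytes(15, samples_per_sec) / 1e9:.1f} GB")   # disk ~ 8.6 GB
print(f"ram  ~ {ram_bytes(series) / 2**30:.1f} GiB")              # ram  ~ 0.8 GiB
```

The main inputs you need from your own server are the active series count and the scrape interval; the OOM-killed pod above suggests the real number of series is far higher than the limits assume.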