Prometheus Evicted state


Oleksandr Shkovyra

Jul 3, 2020, 1:58:09 AM
to Prometheus Users
Hello everyone, I am not sure whether this is a bug or something in my configuration, so I will provide some details about my setup and the problem. We run several EKS clusters in different AWS accounts. We use the latest Prometheus Helm chart (https://github.com/helm/charts/tree/master/stable/prometheus) as well as the latest Prometheus version, and collect data via scrape configurations from 5 EKS clusters.

The problem we are facing: disk space is growing very fast, with 11 GB of disk space used in 12 hours. We use Prometheus as a data source for Grafana, and the large disk usage causes a problem with Prometheus:

The node was low on resource: memory. Container prometheus-server-configmap-reload was using 2420Ki, which exceeds its request of 0. Container prometheus-server was using 30903632Ki, which exceeds its request of 20Gi.

As a result, there are a lot of pods in the Evicted state:

opengine-backend opengine-prometheus-server-6df9449899-22f9l 0/3 Evicted  0 32m
opengine-backend opengine-prometheus-server-6df9449899-2fxdc 0/3 Evicted  0 64m
opengine-backend opengine-prometheus-server-6df9449899-2gzlb 0/3 Evicted  0 7h55m
opengine-backend opengine-prometheus-server-6df9449899-699gr 0/3 Evicted  0 131m
opengine-backend opengine-prometheus-server-6df9449899-6s9q9 0/3 Evicted  0 78m
opengine-backend opengine-prometheus-server-6df9449899-blx9q 0/3 Evicted  0 110m
opengine-backend opengine-prometheus-server-6df9449899-d5cfw 0/3 Evicted  0 120m
opengine-backend opengine-prometheus-server-6df9449899-fkjsp 0/3 Evicted  0 8h
opengine-backend opengine-prometheus-server-6df9449899-h8pmp 0/3 Evicted  0 150m
opengine-backend opengine-prometheus-server-6df9449899-nf8pb 0/3 Evicted  0 21m
opengine-backend opengine-prometheus-server-6df9449899-qgl6b 0/3 Init:0/1 0 4s
opengine-backend opengine-prometheus-server-6df9449899-sjzhd 0/3 Evicted  0 7h46m
opengine-backend opengine-prometheus-server-6df9449899-txwqx 0/3 Evicted  0 21h
opengine-backend opengine-prometheus-server-6df9449899-wh5ck 0/3 Evicted  0 100m
opengine-backend opengine-prometheus-server-6df9449899-whpfb 0/3 Evicted  0 42m
opengine-backend opengine-prometheus-server-6df9449899-ws4sq 0/3 Evicted  0 7h29m
Please, help me with this. Also, please tell if you need any additional information.

Brian Candler

Jul 3, 2020, 3:34:43 AM
to Prometheus Users
Maybe you are just collecting a lot of metrics in a single prometheus instance.  There's a tool which will give you an estimate of RAM usage here:

For disk space, I'd start with an estimate of 1.7 bytes per metric sample - so that usage depends on your scrape interval.  You say it's growing at about 900MB/hour; if you were using a 15-second scrape interval that implies about 2.2 million time series, which is quite high to be putting into one prometheus instance (the recommended maximum is 2 million).
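The arithmetic behind that estimate can be sketched as follows; the 1.7 bytes/sample figure, the ~900 MB/hour growth rate, and the 15-second scrape interval are the assumptions stated above, not measured values from this cluster:

```python
# Back-of-envelope estimate of active series from disk growth.
# Assumptions (from the discussion above): ~1.7 bytes per sample on
# disk, ~900 MB/hour of growth, and a 15-second scrape interval.
BYTES_PER_SAMPLE = 1.7
growth_bytes_per_hour = 900e6
scrape_interval_s = 15

samples_per_second = growth_bytes_per_hour / 3600 / BYTES_PER_SAMPLE
active_series = samples_per_second * scrape_interval_s
print(f"~{active_series / 1e6:.1f} million active series")  # ~2.2 million
```

With a shorter scrape interval the same disk growth would imply proportionally fewer series, which is why the interval matters for this estimate.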

So the first thing to check is how many metrics you're *actually* collecting, and also whether you have a high churn rate in time series (i.e. lots of pods starting and stopping).  You can get this info from the prometheus GUI under "status > runtime & build info".  Look especially at "Head Stats".

Your 30GB RAM usage suggests high series churn.  Beware that if you are monitoring pod-level metrics, every pod is unique, so will generate its own set of timeseries.  If you have 10 pods destroyed and created per minute, and each pod generates 10K metrics, that's 6 million new time series every hour.  At any instant not all of these will be active, but the "head" chunk typically carries the last 2 hours' worth of timeseries.  The solution is not to churn pods so much, or else filter the data collection so you're collecting much less pod-level data.
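The churn figure above works out like this; the 10 pods per minute and 10K metrics per pod are hypothetical numbers used for illustration, not measurements:

```python
# Series-churn estimate for pod-level metrics, using the hypothetical
# numbers from the post: 10 pods destroyed and recreated per minute,
# each new pod exposing 10,000 unique time series.
pods_churned_per_minute = 10
series_per_pod = 10_000

new_series_per_hour = pods_churned_per_minute * series_per_pod * 60
print(new_series_per_hour)  # 6000000 new time series per hour
```

Since the head block typically holds about 2 hours of data, that churn rate keeps millions of recently-created series resident in memory at once.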

If you are sure that the number of series you're collecting is much lower than 2m, then there may be a problem.  Please report the stats, the *exact* version of prometheus you're running, and also show any logs generated by prometheus itself.

If you are in fact collecting millions of timeseries (and wish to keep them all rather than dropping some), then as I said before this is more than is recommended for a single prometheus instance.  If you have 5 clusters then it sounds like you'd be better with a separate prometheus per cluster, especially as they are in separate AWS accounts.  You can still have a single Grafana instance, which either queries them individually, or uses something like promxy to combine them, or use federation to collect a subset of metrics into a separate prometheus for a global view, or you can look at higher-performance add-ons like Thanos.

Oleksandr Shkovyra

Jul 7, 2020, 6:00:17 AM
to Prometheus Users
Hello Brian,

Thank you for the reply.
We have now deployed a separate Prometheus into each EKS cluster and grouped them in one Grafana.
Looks good so far.
Many thanks.
We will monitor the behavior of this setup in the future.