Single Prometheus for Large Cluster


patricia lee

Sep 7, 2021, 2:34:51 AM
to promethe...@googlegroups.com
Hi everyone, I am new here.

I would like to seek some advice on the design approach we should take.
Given the problem below, how can we set up Prometheus for a large cluster in a cost-effective way?

Variables:
Installation: kube-prometheus-stack Helm chart.
Autoscale: yes
No. of Nodes: 1000 up to 1300
Mesh: Istio
Memory Usage: 50GB (Still gets OOM)
Installed: 1 Prometheus, 1 Kiali, 1 Grafana and 1 Jaeger

Issue:
1. We cannot move Prometheus to a larger node, as 60GB of memory is already expensive (the cost was not approved by management).
2. Removing unnecessary metrics is not yet advised, because we do not know which Istio, Jaeger and Kiali metrics are needed.

Tried solution:
We federated the single Prometheus instance with Thanos Receivers; however, the issue is still there because Kiali queries its data directly from Prometheus, which eventually gets OOMKilled.

Question:
We are thinking of running one Prometheus per namespace and adding a thanos-sidecar to each, with the same scrape config, since Thanos will deduplicate the duplicated metrics. This approach would solve the issue for Grafana queries, but not for Kiali.

How can we set up multiple Prometheus instances (to keep costs low) while still giving Kiali a single, cluster-wide Prometheus to query?

Appreciate any help. Thank you.





Brian Candler

Sep 7, 2021, 4:50:55 AM
to Prometheus Users
It's not clear what you mean by "No. of Nodes" - whether you mean hosts (e.g. the ones you're scraping with node_exporter), pods, or something else. But what matters is the total number of metrics, the amount of metric churn (i.e. the rate at which new timeseries are being created dynamically), and how much querying is going on.

If you go to the Prometheus web interface, Status > TSDB Status, you'll get some statistics which may help you. Consider:

- collecting fewer metrics (by changing what you scrape, and/or using metric_relabel_configs to drop some timeseries which are not of interest; see the sketch after this list)

- see if it's possible to reduce timeseries churn.  For example, if you have one application which is generating large numbers of short-lived pods then you may wish to reduce or suppress the metrics collected for those pods.

- have a look at the PromQL queries being executed, and whether any of these are using excessive amounts of RAM. The query log may help. You can also apply limits to how much memory is used by individual queries using
      --query.max-concurrency=20  # default
      --query.max-samples=50000000  # default
(although that may cause the offending queries to fail)
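
A rough sketch of the metric_relabel_configs approach (the job, target and metric names below are only placeholders - substitute whatever your own scrape configs and TSDB stats show):

    scrape_configs:
      - job_name: my-app                 # placeholder job name
        static_configs:
          - targets: ['my-app:9100']     # placeholder target
        metric_relabel_configs:
          # drop whole timeseries whose metric name matches this example regex
          - source_labels: [__name__]
            regex: 'some_noisy_histogram_bucket|another_unneeded_metric'
            action: drop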

There are also blog posts out there which you can turn up with a search.

patricia lee

Sep 7, 2021, 8:57:48 AM
to Brian Candler, Prometheus Users
Thank you Brian for the reply. Yes, I mean hosts (nodes).
For the meantime we have set the Prometheus retention time to 5 minutes (which I am not comfortable with), but our seniors advised it just so we could continue.
Thanks for the information above; I'll check it out and try it on our cluster environment.







Brian Candler

Sep 7, 2021, 9:51:53 AM
to Prometheus Users
Such a short retention is unlikely to help at all; WAL blocks have a 2 hour duration I think.

Across some systems I have here, the average number of metrics per node is 2366: this is the (expensive) query which gives it:
avg(count by (instance) ({job="node"}))

So with 1300 nodes that would be about 3 million metrics.  Quite a lot, but not extraordinarily so.  I've seen recommendations to start splitting Prometheus servers when you reach 2m.  There is a RAM calculation tool here:
With 3m series and 1m unique label pairs, it still only comes out to 8GB.  If you're needing much more than that, then you need to read and understand the stats from the TSDB status page.  You can post them here if you want help interpreting them.  And you need to understand what queries (if any) are taking place against your database, since those use RAM too.

Looking at "Top 10 series count by metric names" in the Prometheus Status page, in my case it's node_cpu_seconds_total{}.  For me it's node_cpu_seconds_total{}.  If you don't require the usage of each core individually, then you might be inclined to drop it.

You could also see if victoriametrics + vmagent works better for your use case.

Ben Kochie

Sep 7, 2021, 10:39:44 AM
to Brian Candler, Prometheus Users
I don't know if this is still the case, but there are some label configurations in the Helm chart that lead to excessive labels on Kubernetes. This can lead to index/memory bloat.

Most of the memory bloat I've seen in our production clusters lately has more to do with auto-scaling pod churn. If you're using heavy auto-scaling and lots of single-core pods, you'll end up bloating the metrics a lot.

patricia lee

Sep 8, 2021, 3:03:57 AM
to Brian Candler, Prometheus Users
Hello Brian,

After leaving Prometheus running for 16 hrs with 5 mins retention (my seniors' advice), memory was initially at 22 GB, but after 16 hrs it was already at 39 GB and might still increase.
We checked the TSDB status page and found that the label with the highest memory usage is id, and the highest series count by metric name is kubelet_runtime_operations_duration_seconds_bucket.
I'll suggest to the seniors in our team that we drop the id label and kubelet_runtime_operations_duration_seconds_bucket to see if that reduces the memory consumption of our Prometheus.

I ran promtool tsdb analyze on the Prometheus data itself; here are the results as well.

Block ID: 01FF1WDW4PH937C3XT2E9R621K
Duration: 2h0m0s
Series: 4434558
Label names: 311
Postings (unique label pairs): 122598
Postings entries (total label pairs): 47468088

Label pairs most involved in churning:
59339 service=rancher-monitoring-kubelet
59339 job=kubelet
59339 endpoint=https-metrics
52002 metrics_path=/metrics/cadvisor
51475 namespace=cluster2
32853 job=kube-state-metrics
32849 service=rancher-monitoring-kube-state-metrics
32848 endpoint=http
24840 container=POD
17944 namespace=cattle-monitoring-system
15974 container=kube-state-metrics
15249 container=node-exporter
14683 job=node-exporter
14683 endpoint=metrics
14683 service=rancher-monitoring-prometheus-node-exporter
13879 namespace=kube-system

Label names most involved in churning:
110756 __name__
109700 instance
109670 service
109670 endpoint
109670 job
107602 namespace
100450 pod
87636 container
64686 node
59339 metrics_path
51953 id
38376 image
37733 name
21466 device
10720 interface
9706 reason
6072 job_name
5418 le
4746 fstype
4746 mountpoint


Label names with highest cumulative label value length:
2690572 id
1727227 name
812271 container_id
333072 uid
298590 pod
162072 pod_uid
101609 address
67431 pod_ip
63985 replicaset
63634 device
58812 interface
57241 owner_name
54539 image
50714 __name__
45383 node
45334 nodename
45334 label_kubernetes_io_hostname
41997 image_id
41844 created_by_name
39312 provider_id

Highest cardinality labels:
26763 id
17988 name
11127 container_id
9252 uid
9249 pod
5977 address
5022 pod_ip
4502 pod_uid
4203 interface
4135 device
1987 instance
1773 owner_name
1741 replicaset
1741 label_pod_template_hash
1422 __name__
1164 created_by_name
937 node
937 host_ip
936 label_kubernetes_io_hostname
936 nodename

Highest cardinality metric names:
178836 kubelet_runtime_operations_duration_seconds_bucket
161805 container_tasks_state
142212 storage_operation_duration_seconds_bucket
129444 container_memory_failures_total
121212 kubelet_docker_operations_duration_seconds_bucket
67739 kube_pod_container_status_waiting_reason
59724 kubelet_http_requests_duration_seconds_bucket
58062 kube_pod_container_status_terminated_reason
58062 kube_pod_container_status_last_terminated_reason
50292 rest_client_request_duration_seconds_bucket
46260 kube_pod_status_phase
44709 kubelet_runtime_operations_latency_microseconds
39645 container_network_receive_packets_dropped_total
39645 container_network_transmit_bytes_total
39645 container_network_transmit_errors_total
39645 container_network_transmit_packets_total
39645 container_network_receive_packets_total
39645 container_network_transmit_packets_dropped_total
39645 container_network_receive_bytes_total
39645 container_network_receive_errors_total


Ben Kochie

Sep 8, 2021, 4:41:28 AM
to patricia lee, Prometheus Users
The things I'm currently working on:
* Disabling auto-scaling, or setting the auto-scaler minimums higher to avoid down-scaling when it's unnecessary.
* Using https://keda.sh/ to drive auto-scaling from better metrics
* Eliminating single-core pods by using worker pools for single-threaded languages like Python/Ruby/Node. Or re-writing services in Go / Java to make them multi-threaded.
* Increasing the node size to reduce the number of nodes per cluster.
* Dropping unused / duplicate container metrics from cAdvisor (I'm working on a blog post about this); a rough sketch follows below
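
For example, something like this (the job name is a placeholder, and the metrics picked here are just examples from the promtool output you posted - what is actually safe to drop depends on your dashboards and alerts):

    scrape_configs:
      - job_name: cadvisor                # placeholder job name
        metric_relabel_configs:
          # drop series for the pause container; they rarely carry useful data
          - source_labels: [container]
            regex: 'POD'
            action: drop
          # drop a couple of high-cardinality cAdvisor metrics we don't chart
          - source_labels: [__name__]
            regex: 'container_tasks_state|container_memory_failures_total'
            action: drop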

On Wed, Sep 8, 2021 at 9:20 AM patricia lee <plee...@gmail.com> wrote:
Hello Ben,

Yes, our cluster setup is heavily auto-scaling, with a lot of single-core or smaller pods (500m to 1000m CPU).
May we know what resolution you took for a heavily auto-scaled cluster with single-core pods?

Appreciate your response.


Btw, I ran promtool against our Prometheus and those are the high-churn labels (default config from kube-prometheus-stack).

patricia lee

Sep 23, 2021, 5:21:41 AM
to Ben Kochie, Prometheus Users
Thanks for the information. 

For the meantime, we are trying to drop the highest-memory-usage label in our Prometheus, so we dropped id (in our test environment).
However, even though we dropped the label on all jobs, memory usage is still at 5Gi (the same as before). Will the drop in Prometheus memory usage only be seen after a few hours? We saw the same behaviour in a different environment (UAT), where we dropped id but waited almost a day before we saw memory drop in Grafana.

Thank you.

Brian Candler

Sep 23, 2021, 7:25:58 AM
to Prometheus Users
Dropping individual labels isn't likely to make a huge difference, if you're still scraping the same set of timeseries.

The bag of labels is just what distinguishes one timeseries from another.  It does have to be kept in memory, but it's static and doesn't use much RAM.

Dropping labels might even give you a short-term *increase* in RAM usage, as the timeseries with the old label set and the timeseries with the new label set are two different timeseries.

You're likely to see a bigger difference by reducing the number of timeseries you're scraping - either by changing the exporters to expose fewer metrics, or using metric relabelling to drop metrics which aren't of interest.
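
To make that distinction concrete, a minimal sketch (the label and metric names are just examples taken from earlier in this thread):

    # inside the relevant scrape_config:
    metric_relabel_configs:
      # labeldrop only removes the label; the series itself remains, so the
      # series count (and hence most of the memory) stays roughly the same.
      # Beware: if removing a label makes two series identical, they will collide.
      - action: labeldrop
        regex: id
      # drop removes the whole timeseries, which is what actually reduces load
      - source_labels: [__name__]
        regex: 'kubelet_runtime_operations_duration_seconds_bucket'
        action: drop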