Seeking advice on label values for Prometheus Metrics in Kubernetes


Peter Nguyễn

Aug 16, 2023, 3:31:29 AM
to Prometheus Users
Hi Prometheus experts,

I have a Prometheus Pod (v2.40.7) running on our Kubernetes (k8s) cluster, scraping metrics from multiple k8s targets.

Recently, I have observed that whenever I restart a target (a k8s Pod) or perform a Helm upgrade, the memory consumption of Prometheus keeps increasing. After investigating, I discovered that each time a pod gets restarted, a new set of time series from that target is generated due to the dynamic values of `instance` and `pod_name`.

The `instance` label value we use is in the format <pod_IP>:port, and the `pod_name` label value is the pod name. Consequently, whenever a Pod is restarted, it is allocated a new IP address and a new pod name (unless it is a StatefulSet's Pod), resulting in new values for the `instance` and `pod_name` labels.

When it comes to HEAD truncation, even though the number of time series in the HEAD block goes back to its previous low value, Prometheus's memory still does not go back to the level before the target restarted. Here is the graph:

prometheus_instance_ip_port_concern.jpg

I am writing to seek advice on best practices for handling these label values, particularly `instance`. Do you have any advice on what value format those labels should use so that we get rid of the memory increase every time a pod gets restarted? Alternatively, is there a point, e.g. after retention is triggered, at which the memory would go back to its previous level?

Regards, Vu

Ben Kochie

Aug 16, 2023, 7:15:35 AM
to Peter Nguyễn, Prometheus Users
FYI, container_memory_working_set_bytes is a misleading metric. It includes page cache memory, which can be reclaimed at any time but improves query performance.

If you want to know the real memory use, I would recommend using container_memory_rss.
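To compare the two on the Prometheus pod itself, queries along these lines can help (the `pod` matcher is illustrative, adjust it to your naming):

```promql
# RSS (anonymous memory) vs. working set for the Prometheus pod:
container_memory_rss{pod=~"prometheus-.*"}
container_memory_working_set_bytes{pod=~"prometheus-.*"}

# The difference is roughly the page cache counted into the working set:
container_memory_working_set_bytes{pod=~"prometheus-.*"}
  - container_memory_rss{pod=~"prometheus-.*"}
```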

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/27961908-8362-42a7-b1ce-ab27dcece7b1n%40googlegroups.com.

Peter Nguyễn

Aug 16, 2023, 10:42:10 PM
to Prometheus Users
Thanks for your replies.

> There is nothing to handle, the instance/pod IP is required for uniqueness tracking. Different instances of the same pod need to be tracked individually. In addition, most Deployment pods are going to get newly generated pod names every time anyway.

Then if we have a deployment with a large number of active time series, say 1 million, every upgrade or rollback of the deployment would cause a significant memory increase because the number of time series doubles, to 2 million in this case, and Prometheus would get OOM-killed if we don't reserve a huge amount of memory for that scenario.
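The doubling is easy to watch in Prometheus's own TSDB metrics, for example:

```promql
# Series currently held in the head block (doubles during a rollout):
prometheus_tsdb_head_series

# Rate of new series creation, which spikes when every target gets a
# fresh instance/pod_name label set:
rate(prometheus_tsdb_head_series_created_total[5m])
```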

> Prometheus compacts memory every 2 hours, so old data is flushed out of memory.

I have re-run the test with Prometheus's latest version, v2.46.0, capturing Prometheus memory using the container_memory_rss metric. To me, it looks like the memory is not released after the HEAD is cut to a persistent block.

prometheus_instance_ip_port_concern_latest.jpg

Do you think this is expected? If yes, could you please share why the memory is not freed up for inactive time series that are no longer in the HEAD block?

Ben Kochie

Aug 16, 2023, 11:34:52 PM
to Peter Nguyễn, Prometheus Users
On Thu, Aug 17, 2023 at 4:42 AM Peter Nguyễn <win...@gmail.com> wrote:
Thanks for your replies.

> There is nothing to handle, the instance/pod IP is required for uniqueness tracking. Different instances of the same pod need to be tracked individually. In addition, most Deployment pods are going to get newly generated pod names every time anyway.

Then if we have a deployment with a large number of active time series, say 1 million, every upgrade or rollback of the deployment would cause a significant memory increase because the number of time series doubles, to 2 million in this case, and Prometheus would get OOM-killed if we don't reserve a huge amount of memory for that scenario.

2 million series is no big deal; it should only take a few extra gigabytes of memory. This is not a huge amount and is well within Prometheus's capability.

For reference, I have deployments that generate more than 10M series and can use upwards of 200GiB of memory when we go through a number of deploys quickly. After things settle down, the memory is released, but it does take a number of hours.


> Prometheus compacts memory every 2 hours, so old data is flushed out of memory.

I have re-run the test with Prometheus's latest version, v2.46.0, capturing Prometheus memory using the container_memory_rss metric. To me, it looks like the memory is not released after the HEAD is cut to a persistent block.

prometheus_instance_ip_port_concern_latest.jpg

Do you think this is expected? If yes, could you please share why the memory is not freed up for inactive time series that are no longer in the HEAD block?

It will. Prometheus is written in Go, which is a garbage-collected language. It will release RSS memory as it needs to. You can see what Go is currently using with go_memstats_alloc_bytes.
 

Peter Nguyễn

Aug 18, 2023, 4:29:46 AM
to Prometheus Users
> 2 million series is no big deal, should only take a few extra gigabytes of memory. This is not a huge amount and well within Prometheus capability.

1) I have performed another test with 1M active time series. The memory usage of Prometheus with 1M series is around 3 GB in my environment. I then restarted the target at around 18:10; the number of time series in the HEAD block jumped up to 2M, and the RAM usage was around 5 GB, a *66% increase* compared to the prior point.

prometheus_instance_ip_port_concern_latest_v3.jpg

Looking at `go_memstats_alloc_bytes`, the number of allocated bytes goes down at HEAD truncation, but Prometheus's RSS does not seem to.

2) I then left the deployment running overnight to see whether the memory would go back to the previous low point. Here is what I got:

prometheus_instance_ip_port_concern_latest_v5.jpg

a) It seems that the memory did not go back to its 3 GB level. I set the retention time to 4h, so inactive time series should be swept out. I am confused why the memory does not return to its low point. Does Prometheus keep any info related to inactive time series in memory?

b) When I restarted the target again at 09:38, the memory kept jumping up. The current value is now 6.7 GB, almost a 100% increase compared to the previous value.

3) When I restarted the target one more time, while the HEAD block had not been truncated yet, the memory jumped up to 10 GB. This is a huge memory increase for us compared to the starting point.

prometheus_instance_ip_port_concern_latest_v6.jpg

Ben Kochie

Aug 18, 2023, 4:54:07 AM
to Peter Nguyễn, Prometheus Users
And if you look, GC kicked in just after 15:20 to reduce the RSS from 10GiB to a little over 8GiB. In your 3rd example, you're running with about 3.5KiB of memory per head series. This is perfectly normal and within expected results.

Again, this is all related to Go's memory garbage collection. The Go runtime does what it does.

There are some tunables. For example, we found that in our larger environment that GOGC=50 is more appropriate for our workloads compared to the Go default of GOGC=100. This should reduce the RSS to around 1.5x the go_memstats_alloc_bytes.
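On Kubernetes, one way to set this is via the container environment (a hypothetical manifest fragment; names and image tag are illustrative):

```yaml
containers:
  - name: prometheus
    image: prom/prometheus:v2.46.0
    env:
      - name: GOGC
        value: "50"  # Go default is 100; lower makes GC more aggressive
```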

Peter Nguyễn

Aug 18, 2023, 7:18:16 AM
to Prometheus Users
Thanks, Ben, for your tips on tuning GOGC.

Regarding the question of why Prometheus memory does not go back to its initial level even after inactive time series have been swept out of the TSDB on reaching the retention time, do you have any comment?

Peter Nguyễn

Aug 22, 2023, 6:24:31 AM
to Prometheus Users
Hi,

I have tried reading the code to find the answer to my question in the previous email.

Looking at https://github.com/prometheus/prometheus/blob/main/scrape/scrape.go#L1690, it seems that Prometheus caches data for each time series during target scraping to deal with staleness.

However, the cached data for targets that have already disappeared does not seem to be cleaned up from the scrape loop cache; it keeps growing as targets get restarted.

I tried to add the following code:
prometheus_instance_ip_port_concern_latest_patch.jpg

Repeating the test, I can see a significant memory reduction. The memory drops much earlier, as soon as Prometheus receives a target update from k8s discovery.

prometheus_instance_ip_port_concern_latest_after_fixing_loop_cache.jpg

Could you please have a look and see whether this is a memory leak in Prometheus?

Ben Kochie

Aug 22, 2023, 4:05:15 PM
to Peter Nguyễn, Prometheus Users
Interesting, good work investigating that. Would you mind posting this information as a new issue?

It could also be related to this PR: https://github.com/prometheus/prometheus/pull/12726


Peter Nguyễn

Aug 22, 2023, 11:45:07 PM
to Prometheus Users
Thanks Ben. I have created a ticket for this here https://github.com/prometheus/prometheus/issues/12741