Hello,
I have a Prometheus instance A that scrapes a few Kubernetes targets and has its retention set to 3h. A second Prometheus instance B, which has persistent storage for longer retention, pulls those metrics from A via federation.
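For context, the setup is roughly equivalent to the sketch below. The hostname, job name, scrape interval, config file path, and `match[]` selector are placeholders for illustration, not our exact values; only the 3h retention and the use of federation come from the actual deployment.

```yaml
# Prometheus A: container args (fragment of a Kubernetes pod spec).
# The config file path is a placeholder; the 3h retention is the real setting.
args:
  - --config.file=/etc/prometheus/prometheus.yml
  - --storage.tsdb.retention.time=3h
---
# Prometheus B: scrape job that pulls from A's /federate endpoint.
# Job name, interval, selector, and target are illustrative assumptions.
scrape_configs:
  - job_name: federate-prometheus-a
    scrape_interval: 30s
    honor_labels: true          # keep the labels as exposed by A
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~".+"}'         # placeholder: selects every series that has a job label
    static_configs:
      - targets:
          - prometheus-a.example.internal:9090
```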
I am seeing a long-term increase in memory usage on Prometheus instance A. Usage climbs gradually over several days until it reaches the memory limit we configured, and the instance is OOM-killed. Additionally, as A approaches that limit, Prometheus B starts getting "context deadline exceeded" errors when scraping A's federation endpoint.
An instance A is deployed in each of several regions, and this only happens to the instances in some regions. The other regions look healthy: memory usage drops as expected every couple of hours, since our retention is 3h.
Comparing the heap profile of a healthy instance A with that of an unhealthy one (memory trending upward over the long term), I see an obvious difference.
Healthy: [heap profile image]
Unhealthy: [heap profile image]
In the unhealthy instance, most of the memory is attributed to tsdb.add. This confuses me: retention is set to 3h for instance A in both regions, yet the unhealthy one uses a lot of memory in TSDB-related functions.
My question is: What is causing this Prometheus instance to use so much memory on this tsdb function, and how can I prevent this?
Thanks!