Preventing Prometheus from running out of memory


l.mi...@gmail.com

Nov 25, 2022, 7:27:17 AM
to Prometheus Developers
Hello,

One of the biggest challenges we have when trying to run Prometheus with a constantly growing number of scraped services is keeping resource usage under control.
This usually means memory usage.
Cardinality is often a huge problem, and we regularly end up with services accidentally exposing labels that are risky. One silly mistake we see every now and then is putting raw errors as labels, which then leads to time series with {error="connection from $ip:$port to $ip:$port timed out"} and so on.

We've tried a lot of ways of dealing with this using vanilla Prometheus features, but none of them really works well for us.
Obviously there is sample_limit that one might use here, but the biggest problem with it is that once you hit the sample_limit threshold you lose all metrics, and that's just not acceptable for us.
If I have a service that exports 999 time series and it suddenly goes to 1001 (with sample_limit=1000), I really don't want to lose all metrics just because of that; losing all monitoring is a bigger problem than having a few extra time series in Prometheus. It's just too risky.

We're currently running Prometheus with patches from:

This gives us 2 levels of protection:
- global HEAD limit - Prometheus is not allowed to have more than M time series in the TSDB
- per-scrape sample_limit - but patched so that if you exceed sample_limit it starts rejecting time series that aren't already in the TSDB (see the sketch below)
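
For context, here is roughly what the vanilla hard limit looks like in a scrape config today (the job name, target and value are made up for illustration); the patch keeps the same setting, but instead of failing the whole scrape when the limit is exceeded it only rejects series that aren't already in the TSDB:

    scrape_configs:
      - job_name: "my-service"            # example job name
        sample_limit: 1000                # vanilla behaviour: exceeding this fails the whole scrape
        static_configs:
          - targets: ["my-service:9090"]  # example target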

This works well for us and gives us a system where:
- we have reassurance that Prometheus won't start getting OOM-killed overnight
- service owners can add new metrics without fear that a typo will cost them all their metrics

But comments on that PR suggest that it's a highly controversial feature.
I wanted to probe this community to see what the overall feeling is and how likely it is that vanilla Prometheus will get something like this.
It's a small patch so I'm happy to just maintain it for our internal deployments, but it feels like a common problem to me, so a baked-in solution would be great.

Lukasz

Ben Kochie

Nov 27, 2022, 2:25:12 PM
to l.mi...@gmail.com, Prometheus Developers
Soft / partial failure modes can be very hard problems to deal with. You have to be a lot more careful to not end up missing partial failures.

While it may seem like having a soft sample_limit is good and the hard sample_limit is bad, "fail fast" will serve you better in the long run. Most of the Prometheus monitoring design assumes fail fast. Partial results are too hard to reason about from a monitoring perspective. With fail fast you will know quickly and decisively that you've hit a problem. If you treat monitoring outages as just as bad as an actual service outage, you'll end up with a healthier system overall.

For the case of label explosions, there are some good meta metrics[0] that can help you. The "scrape_series_added" metric can allow you to soft detect label leaks.
In addition, there is a new feature flag[1] that adds additional target metrics for monitoring for targets nearing their limits.
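
For example (the alert names and thresholds below are made up, and scrape_sample_limit is only exported when Prometheus runs with --enable-feature=extra-scrape-metrics), you can alert before a target reaches its limit rather than after:

    groups:
      - name: scrape-cardinality
        rules:
          - alert: TargetNearSampleLimit
            # the ratio is only meaningful for targets that set a sample_limit,
            # hence the "> 0" filter on the denominator
            expr: scrape_samples_post_metric_relabeling / (scrape_sample_limit > 0) > 0.9
            for: 15m
          - alert: TargetChurningSeries
            # a burst of brand-new series per scrape usually means a label leak
            expr: sum_over_time(scrape_series_added[1h]) > 10000
            for: 30m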



l.mi...@gmail.com

Nov 28, 2022, 4:58:34 AM
to Prometheus Developers
On Sunday, 27 November 2022 at 19:25:12 UTC sup...@gmail.com wrote:
Soft / partial failure modes can be very hard problems to deal with. You have to be a lot more careful to not end up missing partial failures.

Sure, partial failure modes can in general be a hard problem, but with Prometheus the worst case is that some metrics will be missing and some won't, which is, at least in my opinion, not that problematic.
What do you mean by "missing partial failures"?
 
While it may seem like having a soft sample_limit is good and the hard sample_limit is bad, "fail fast" will serve you better in the long run. Most of the Prometheus monitoring design assumes fail fast. Partial results are too hard to reason about from a monitoring perspective. With fail fast you will know quickly and decisively that you've hit a problem. If you treat monitoring outages as just as bad as an actual service outage, you'll end up with a healthier system overall.

I think I've seen the statement that partial scrapes are "too hard to reason about" a number of times.
From my experience they are not. A hard fail means "all or nothing", a partial fail means "all or some". It's just a different level of granularity, which is not that hard to get your head around IMHO.
I think it's a hard problem from a Prometheus developer's point of view, because the inner logic and error handling is where the (potential) complication is. But from the end user's perspective it's not that hard at all. Most of the users I talked to asked me to implement a partial failure mode, because it makes it both safer and easier for them to make changes.
With a hard failure any mistake costs you every metric you scrape; there's no forgiveness. Treating it all like a service outage works best for microservices, where you make small deployments and it's easy to roll back or roll forward; it's harder for bigger services, especially ones worked on by multiple teams.

Most cardinality issues we see are not because someone made a bad release, although that happens often too, but because a good chunk of all exported metrics are somehow dynamically tied to the workload of each service. Although we work hard not to have any obviously explosive metrics, like using the request path as a label, most services will export a different set of time series if they receive more traffic, or different traffic, or there's an incident with a dependent service, and so on. So the number of time series is always changing simply because traffic fluctuates, and trying to set a hard limit that won't trigger just because there happens to be more traffic at 5am on a Sunday is difficult. So what teams will do is give themselves a huge buffer to allow for any spikes; repeat that for each service and the sum of all hard limits is 2-5x your maximum capacity, and it doesn't prevent any OOM kill. It will work at the individual service level, eventually, but it does nothing to stop Prometheus as a whole from running out of memory.

This is all about sample_limit, which is just one level of protection.
Part of the patch we run is also a TSDB HEAD limit that stops Prometheus from appending more time series than the configured limit.
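
The head limit itself only exists in our patch, so there's nothing to show for it in a vanilla config, but vanilla Prometheus does expose the head size via its own prometheus_tsdb_head_series metric, so as a rough sketch (the threshold is made up) you can at least alert on overall cardinality before memory becomes a problem:

    groups:
      - name: tsdb-capacity
        rules:
          - alert: TSDBHeadSeriesHigh
            # warn well before the instance gets anywhere near OOM territory
            expr: prometheus_tsdb_head_series > 8e6
            for: 30m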
 

For the case of label explosions, there are some good meta metrics[0] that can help you. The "scrape_series_added" metric can allow you to soft detect label leaks.
In addition, there is a new feature flag[1] that adds additional target metrics for monitoring for targets nearing their limits.



We have plenty of monitoring for the health of all scrapes. Unfortunately monitoring alone doesn't prevent Prometheus from running out of memory.
For a long time we had quotas for all scrapes, implemented as alerts for scrape owners. They caused a lot of noise and fatigue, and didn't do anything useful for our capacity management. A common request based on that experience was to add soft limits.

Julius Volz

Nov 28, 2022, 6:46:09 AM
to l.mi...@gmail.com, Prometheus Developers
My reaction is similar to that of Ben and Julien: if metrics within a target go partially missing, all kinds of expectations around aggregations, histograms, and other metric correlations are off in ways that you wouldn't expect in a normal Prometheus setup. Everyone expects full scrape failures, as they are common in Prometheus, and you usually have an alert on the "up" metric. Now you can say that if you explicitly choose to turn on that feature, you can expect a user to deal with the consequences of that. But the consequences are so unpredictable and can range from annoying to dangerous:

* Incorrectly firing alerts (e.g. because some series are missing in an aggregation or a histogram)
* Silently broken alerts that should be firing (same but the other way around)
* Graphs seemingly working for an instance, but showing wrong data because some series of a metric are missing
* Queries that try to correlate different sets of metrics from a target breaking in subtle ways (see the sketch below for a concrete illustration)
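
As a hypothetical illustration of the last two points (all metric and rule names here are made up): both expressions below keep returning values when some of a target's series are silently missing, they are just wrong, which is exactly what makes this hard to notice:

    groups:
      - name: partial-scrape-examples
        rules:
          - alert: HighErrorRatio
            # undercounts if some per-handler series were rejected mid-scrape
            expr: |
              sum(rate(http_requests_errors_total[5m]))
                / sum(rate(http_requests_total[5m])) > 0.05
          - record: job:request_latency_seconds:p99
            # skewed if some "le" buckets of the histogram are missing
            expr: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))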

Essentially it removes a previously strong invariant from Prometheus-based monitoring and introduces a whole new way of having to think about things like the above. Even if you have alerts to catch the underlying situation, you may get a bunch of incorrect alert notifications from the now-broken rules in the meantime.

l.mi...@gmail.com

Nov 28, 2022, 7:40:25 AM
to Prometheus Developers
I agree that this is not side-effect free, with the worst possible outcome being sending bogus alerts.
But.
We're talking about metrics, so from my PoV the consequences range not from "annoying to dangerous" but "annoying to more annoying".
There's no real danger in missing metrics. Nothing will break if they are missing.
And if someone receives a bogus alert, it's going to arrive alongside another alert saying "you're over your limit, some metrics might be missing", which does help to limit any potential confusion.

Again, the motivation is to prevent cardinality issues piling up to the point where Prometheus runs out of memory and crashes.
Maybe that's an issue that's more common in the environments I work with than in others, and so I'm more willing to make trade-offs to fix it.
I do expect us to keep running in production with the HEAD & soft limit patches (as we already do), since running without them requires too much firefighting, so I'm happy to share more experience of that later.

Ben Kochie

Nov 28, 2022, 9:39:12 AM
to l.mi...@gmail.com, Prometheus Developers
On Mon, Nov 28, 2022 at 1:40 PM l.mi...@gmail.com <l.mi...@gmail.com> wrote:
I agree that this is not side-effect free, with the worst possible outcome being sending bogus alerts.
But.
We're talking about metrics, so from my PoV the consequences range not from "annoying to dangerous" but "annoying to more annoying".

Actually, we're NOT talking about metrics. We're talking about monitoring. Prometheus is a monitoring system that also happens to be based on metrics.
 
There's no real danger in missing metrics. Nothing will break if they are missing.

Absolutely not true. We depend on Prometheus for our critical monitoring behavior. Without Prometheus we don't know if our service is down. Without Prometheus we could have problems go undetected for millions of active users. If we're not serving our users this affects our company's income.

To repeat, Prometheus is not some generic TSDB. It's a monitoring platform designed to be as robust as possible for base level alerting for everything it watches.
 

Ben Kochie

Nov 28, 2022, 9:40:56 AM
to l.mi...@gmail.com, Prometheus Developers
To add to "Nothing will break if things are missing". Prometheus alerts depend on the existence of data for alerting to work. If you're "just missing some data", Prometheus alerts will fail to fire. This is unacceptable and we make specific design trade-offs in order to avoid this situation.

l.mi...@gmail.com

Nov 28, 2022, 10:08:04 AM
to Prometheus Developers
Let's take a step back.
The fundamental problem this thread is about is trying to prevent Prometheus OOM kills. When Prometheus crashes due to running out of memory it stops scraping any metrics, and it will take some time to start again (due to WAL replay). During that time you lose all your metrics and alerting.
This is the core of the issue.
There seems to be no disagreement that losing all metrics and alerting is bad.
 
The disagreement seems to be purely about how to handle it.
At the moment Prometheus's behaviour is to fail hard and lose all monitoring.
What the PR I've linked to is about is graceful degradation. So instead of losing everything you're just losing a subset. And losing here mostly means: there's a flood of new time series and there's no space for them, so they don't get scraped. All existing time series already stored in the TSDB are unaffected.
For my use cases that's much better than "a flood of time series comes and takes Prometheus offline", especially since after the flood you end up with a huge WAL that might continue to crash Prometheus on replay until you remove all WAL files.

When I say "Nothing will break if metrics are missing" I mean that my services will run regardless if I scrape metrics or not. I might not know if they are broken if I'm missing metrics, but that's all the impact. What is acceptable to lose and what not will be different for different users, so sure, it might be too much for some.

DECQ, Jérôme

Nov 28, 2022, 6:29:06 PM
to l.mi...@gmail.com, Prometheus Developers

While I believe we should be open to different solutions to limit the damage of Prometheus going OOM, I just want to point out that metrics use cases are wider than visibility and alerting. For example, missing metrics can make you miss autoscaling, and hence be dangerous. I'd be curious whether we can be selective about partial failures and have more control over which priority metrics to privilege for success.

--

Jérôme


Julius Volz

Nov 29, 2022, 9:00:18 AM
to l.mi...@gmail.com, Prometheus Developers
Yeah, I see what you mean and I do think your arguments are reasonable from a certain point of view. I think in the end we're just making a different judgement call about the potential upsides and downsides of (optionally) removing such a core invariant as "individual scrapes will either fail or succeed atomically" from Prometheus. It introduces new failure modes, new ways you have to think about things and what you have to alert on, and new things you have to watch out for when supporting users who may or may not have this turned on and who are now asking a question that could be affected by it. Some of the resistance is also that a "no" is usually temporary in open-source, but a "yes" is more permanent (in the sense of introducing a feature or behavior change that's hard to remove in the future), so as a maintainer it often feels safer to err on the side of caution here.
