Prometheus and OOM causes

Alin Sînpălean

Jun 27, 2018, 10:13:32 AM
to Prometheus Developers
I have by now seen multiple prometheus-users threads where the common complaint was Prometheus running out of memory and getting killed, as well as quite a few TSDB bugs where Prometheus being killed due to OOM was suspected to be the trigger.

The consensus answer on the former seems to be that there are no knobs in 2.x to tweak memory utilization, so Prometheus will use as much memory as it needs to handle the series and samples it is expected to. This is all nice and good, but looking at my own puny Prometheus instance, the base memory utilization (which one can eyeball at night, when no one is issuing random queries) is about 1/5 of peak memory utilization, which happens when someone (usually out of ignorance) selects a 90-day time range on some already heavy Grafana dashboard intended for minutes-to-hours time ranges.

Inspired by @jacksontj's changes to time out slow queries (#4291 and #4300), I started thinking about whether it would be possible to do something similar about a query's memory usage. I am a relatively recent Go (and Prometheus) convert, so I don't know of an easy way of achieving this. But even doing it the hard way (i.e. instrumenting the eval logic to roughly keep track of how much memory it has allocated for the query in progress) doesn't look all that hard. Or doing whatever the benchmark implementation does to track memory allocations.

Any thoughts? I'm happy to play with any ideas you'd be willing to throw out if this looks like something worth having to you. (Where "this" is setting either a global or per-query memory budget from the command line or config file.)
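
To make that concrete, here's roughly the level of machinery I have in mind. Everything below is hypothetical (the flag, the hook, all the names are made up); it's only a sketch of the idea:

// Hypothetical sketch: a per-query memory budget enforced by the eval logic.
// Nothing here exists in Prometheus today; the flag name, the trackAlloc hook
// and the place it would be called from are all invented for illustration.
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// Would come from a hypothetical --query.max-memory-bytes flag or config option.
var maxQueryMemoryBytes int64 = 512 << 20 // made-up default of 512 MiB per query

var errQueryOverBudget = errors.New("query exceeded its memory budget")

type query struct {
	allocatedBytes int64 // rough running total, maintained by the evaluator
}

// trackAlloc would be called by the instrumented eval logic after allocating
// roughly n bytes of vectors/matrices/iterators on behalf of this query.
func (q *query) trackAlloc(n int64) error {
	if atomic.AddInt64(&q.allocatedBytes, n) > maxQueryMemoryBytes {
		return errQueryOverBudget
	}
	return nil
}

func main() {
	q := &query{}
	// Simulate the evaluator accounting for 64 MiB chunks until the budget trips.
	for i := 1; ; i++ {
		if err := q.trackAlloc(64 << 20); err != nil {
			fmt.Printf("aborting query after %d allocations: %v\n", i, err)
			return
		}
	}
}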

Cheers,
Alin.

Fabian Reinartz

Jun 27, 2018, 10:23:38 AM
to Alin Sînpălean, Prometheus Developers
Tracking how much memory has been allocated overall is easy enough. Doing it on a per-query level is a fair bit more complex, and in Go I suppose approximations are possible at best, since we cannot hook into the allocator directly.
Assuming we had that, making the query engine actually do something with it is harder still. Setting an upper bound and failing would be easy then – and probably a pretty good improvement over the current situation.
Actually reacting in some way so that we trade off latency for lower peak memory consumption is a different beast. The current evaluation model would probably not even allow for it.

So there are several steps to it, which are all unaddressed at this point. If someone wants to work on this, looking into a way to get a decent estimate on per-query memory usage would be a great start.


Brian Brazil

Jun 27, 2018, 10:31:21 AM
to Alin Sînpălean, Prometheus Developers
I'm planning on adding a returned sample limit for each node in a promql evaluation, which we can do with the new promql design and would be good enough for most purposes. I haven't gotten to it yet between book stuff and the moratorium.
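
Very roughly the shape of it (illustrative names and numbers only, not the actual design):

// Illustrative sketch only: cap the number of samples a single node may
// return during evaluation. The names and the limit value are made up.
package main

import (
	"errors"
	"fmt"
)

const maxSamplesPerNode = 1_000_000 // hypothetical limit

var errTooManySamples = errors.New("node returned more samples than the configured limit")

// series stands in for one series' worth of samples returned by a node.
type series struct {
	samples []float64
}

// checkNodeSamples totals the samples produced by one node, across all of
// its series, and fails the query once the total exceeds the limit.
func checkNodeSamples(out []series) error {
	total := 0
	for _, s := range out {
		total += len(s.samples)
		if total > maxSamplesPerNode {
			return errTooManySamples
		}
	}
	return nil
}

func main() {
	out := []series{
		{samples: make([]float64, 600_000)},
		{samples: make([]float64, 500_000)},
	}
	fmt.Println(checkNodeSamples(out)) // 1.1M samples exceeds the hypothetical limit
}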

Alin Sînpălean

Jun 28, 2018, 2:40:06 AM
to fab.re...@gmail.com, Prometheus Developers
On Wed, Jun 27, 2018 at 4:23 PM Fabian Reinartz <fab.re...@gmail.com> wrote:
Tracking how much memory has been allocated overall is easy enough. Doing it on a per-query level is a fair bit more complex, and in Go I suppose approximations are possible at best, since we cannot hook into the allocator directly.

I checked the benchmark code, and indeed it merely looks at global memory allocations (by calling runtime.ReadMemStats before and after each benchmark). It will run a benchmark's body in parallel on multiple goroutines, but still only one benchmark at a time. So this approach is not applicable to concurrent Prometheus queries.
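
For reference, the pattern boils down to the following, which is exactly why it cannot attribute anything to a single query once several are running concurrently:

// The before/after pattern Go benchmarks rely on. runtime.ReadMemStats is
// process-wide, so the delta includes every allocation made in between, by
// whatever goroutine; it cannot be attributed to one query among many.
package main

import (
	"fmt"
	"runtime"
)

var sink []byte // keeps the test allocation from being optimized away

func measureAllocs(f func()) uint64 {
	var before, after runtime.MemStats
	runtime.GC() // settle the heap so the delta is less noisy
	runtime.ReadMemStats(&before)
	f()
	runtime.ReadMemStats(&after)
	return after.TotalAlloc - before.TotalAlloc
}

func main() {
	delta := measureAllocs(func() {
		sink = make([]byte, 1<<20) // stand-in for "evaluate one query"
	})
	fmt.Printf("allocated roughly %d bytes\n", delta)
}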

So the alternative would be rolling our own memory allocation tracking: something simple like recursively querying all live iterators/matrices/vectors etc., with each of them returning the size of the struct plus any allocated slices. It will likely underestimate actual usage and ignore memory pressure from repeated allocation of iterators, labels and whatnot, but it's better than nothing. (Plus, it may help identify places where pooling objects might relieve memory pressure, if someone wanted to spend time on that.)
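
Something like the following is all I have in mind: simplified stand-in types, each reporting its struct size plus its slices, nothing cleverer than that:

// Hypothetical sketch of "roll our own" size estimation. The types are
// simplified stand-ins for promql's intermediate results; the estimate is
// just struct size plus allocated slices and will undercount real usage
// (labels, iterators and allocator overhead are all ignored).
package main

import (
	"fmt"
	"unsafe"
)

type sizer interface {
	SizeBytes() int64
}

type sample struct {
	t int64
	v float64
}

type seriesResult struct {
	samples []sample
}

func (s seriesResult) SizeBytes() int64 {
	return int64(unsafe.Sizeof(s)) + int64(cap(s.samples))*int64(unsafe.Sizeof(sample{}))
}

type matrixResult struct {
	series []seriesResult
}

func (m matrixResult) SizeBytes() int64 {
	total := int64(unsafe.Sizeof(m))
	for _, s := range m.series {
		total += s.SizeBytes()
	}
	return total
}

var _ sizer = matrixResult{} // both result types implement the interface

func main() {
	m := matrixResult{series: []seriesResult{{samples: make([]sample, 1000)}}}
	fmt.Printf("estimated size: %d bytes\n", m.SizeBytes())
}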

Assuming we had that, making the query engine actually do something with it is harder still. Setting an upper bound and failing would be easy then – and probably a pretty good improvement over the current situation.

That's exactly as far as I was thinking of going. (As mentioned, I was inspired by the recent query timeout changes.) But I can think of at least one more step beyond that, which would be (given a global query memory budget) to suspend newly issued queries when in-progress queries are close to or over that budget. I'm not sure how useful that would be: in the case of a dashboard load, for example, Prometheus sees all the queries at the same time, so it would not help in that particular scenario.
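
In code, the kind of gate I'm imagining (hypothetical, and with the caveat above that a dashboard's queries would all be admitted before any of them has allocated anything):

// Hypothetical sketch: gate new queries on a global memory budget. Running
// queries report their (estimated) usage into a shared total; newly issued
// queries block in admit() while that total is at or over the budget, and
// are woken up once a finished query returns its usage.
package main

import (
	"fmt"
	"sync"
)

type gate struct {
	mu          sync.Mutex
	cond        *sync.Cond
	budgetBytes int64
	inUseBytes  int64
}

func newGate(budget int64) *gate {
	g := &gate{budgetBytes: budget}
	g.cond = sync.NewCond(&g.mu)
	return g
}

// admit blocks a newly issued query while in-progress queries are over budget.
func (g *gate) admit() {
	g.mu.Lock()
	defer g.mu.Unlock()
	for g.inUseBytes >= g.budgetBytes {
		g.cond.Wait()
	}
}

// addUsage is called by an in-progress query as it allocates.
func (g *gate) addUsage(n int64) {
	g.mu.Lock()
	g.inUseBytes += n
	g.mu.Unlock()
}

// done returns a finished query's usage and wakes up any blocked queries.
func (g *gate) done(used int64) {
	g.mu.Lock()
	g.inUseBytes -= used
	g.mu.Unlock()
	g.cond.Broadcast()
}

func main() {
	g := newGate(1 << 30) // 1 GiB global budget, made-up number
	g.admit()
	g.addUsage(256 << 20)
	fmt.Println("query running")
	g.done(256 << 20)
}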

Actually reacting in some way so that we trade off latency for lower peak memory consumption is a different beast. The current evaluation model would probably not even allow for it.

A naive approach would be to suspend all but one in-progress query when the memory budget is reached, but seeing how a query creates an eval tree, and nodes toward the bottom of the tree are more often than not heavier than those at the top, this would probably not be particularly useful: by the time you're over budget you will already have loaded the heavy leaves, and you would only be delaying the aggregation or downsampling that would release them.

On Wed, Jun 27, 2018 at 4:31 PM Brian Brazil <brian....@robustperception.io> wrote:
I'm planning on adding a returned sample limit for each node in a promql evaluation, which we can do with the new promql design and would be good enough for most purposes. I haven't gotten to it yet between book stuff and the moratorium.

If you're talking about a total number of samples, across all series in the node, then that sounds like a good enough proxy for a maximum memory utilization per query. And a lot less work. :o)
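
(Back-of-the-envelope, with my own rough numbers: an in-memory sample is a timestamp plus a value, about 16 bytes, so a limit of, say, 50 million samples per node would cap that node at roughly 800 MB before labels and any overhead. Close enough for a budget.)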

Cheers,
Alin.