I have by now seen multiple prometheus-users threads where the common complaint was Prometheus running out of memory and being killed, as well as quite a few TSDB bugs where an OOM kill of Prometheus was the suspected trigger.
The consensus answer on the former seems to be that there are no knobs in 2.x to tweak memory utilization: Prometheus will use as much memory as it needs to handle the series and samples it is expected to handle. That is all well and good, but looking at my own puny Prometheus instance, base memory utilization (which one can eyeball at night, when no one is issuing random queries) is about 1/5 of peak memory utilization. The peak happens when someone (usually out of ignorance) selects a 90-day time range on some already heavy Grafana dashboard intended for minutes-to-hours time ranges.
Inspired by @jacksontj's changes to time out slow queries (#4291 and #4300), I started wondering whether something similar could be done about a query's memory usage. I am a relatively recent Go (and Prometheus) convert, so I don't know of an easy way to achieve this. But even doing it the hard way (i.e. instrumenting the eval logic to keep a rough running total of how much memory it has allocated for the query in progress) doesn't look all that hard. Or whatever the benchmark implementation does to track memory allocations could perhaps be reused.
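
To make "the hard way" concrete, here is a minimal sketch of what per-query accounting might look like. Every name in it (`memTracker`, `evalSelector`, the bytes-per-sample constant) is hypothetical, invented for illustration; nothing like it exists in the current PromQL engine as far as I can tell. For comparison, the benchmark approach reads process-global counters via runtime.ReadMemStats, which seems hard to attribute to an individual query, so explicit accounting looks more practical:

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// errQueryMemoryExceeded would let us abort one query instead of the
// kernel OOM-killing the whole process.
var errQueryMemoryExceeded = errors.New("query exceeded its memory budget")

// memTracker keeps a running estimate of the bytes allocated on behalf
// of a single query. Atomic updates keep it safe if evaluation ever
// runs in parallel goroutines.
type memTracker struct {
	used   int64 // bytes accounted for so far
	budget int64 // maximum allowed bytes; 0 means unlimited
}

// add records n more bytes and reports whether the budget was exceeded.
func (t *memTracker) add(n int64) error {
	if t.budget > 0 && atomic.AddInt64(&t.used, n) > t.budget {
		return errQueryMemoryExceeded
	}
	return nil
}

// evalSelector sketches the kind of hook the eval logic would need:
// account for the samples before materializing them.
func evalSelector(t *memTracker, numSamples int) error {
	const bytesPerSample = 16 // rough cost of one (timestamp, value) pair
	if err := t.add(int64(numSamples) * bytesPerSample); err != nil {
		return err
	}
	// ... actually select and decode the samples ...
	return nil
}

func main() {
	t := &memTracker{budget: 1 << 20} // say, a 1 MiB budget for this query
	// A 90-day range query touching many series blows through it quickly.
	if err := evalSelector(t, 10000000); err != nil {
		fmt.Println(err) // "query exceeded its memory budget"
	}
}
```

The size estimates would only ever be approximate, but approximate should be enough to abort a runaway query long before the kernel decides to kill the whole server.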
Any thoughts? I'm happy to play with any ideas you'd be willing to throw out if this looks like something worth having to you. (Where "this" is setting either a global or per-query memory budget from the command line or config file, e.g. something like the flag sketch below.)
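
For the configuration side, I imagine something as simple as a byte-count flag. The flag name and default below are made up for illustration, and I'm using the standard flag package even though Prometheus itself parses flags with kingpin:

```go
package main

import (
	"flag"
	"fmt"
)

// Hypothetical flag; name and default invented for illustration. The
// standard flag package stands in for kingpin, which Prometheus uses.
var maxQueryMemory = flag.Int64("query.max-memory-bytes", 0,
	"Maximum bytes a single query may allocate; 0 disables the limit.")

func main() {
	flag.Parse()
	fmt.Println("per-query memory budget:", *maxQueryMemory)
}
```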
Cheers,
Alin.