Hi,
We are running Prometheus 2.25.0.
We have been running into issues with expensive queries causing the Prometheus service to crash. We give it 64GB of RAM. We have aggressively limited the query timeout to 1m and query.max-samples to 10,000,000 (20% of the default value), which based on my reading (
https://www.robustperception.io/why-does-prometheus-use-so-much-ram) should take up to about 160MB per query at roughly 16 bytes per sample, totally reasonable to handle.
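
As a sanity check on that estimate, here is a rough back-of-the-envelope calculation. It assumes 16 bytes per sample (an 8-byte timestamp plus an 8-byte float value). That figure is my assumption, and it ignores labels, chunk data, and any other per-query overhead, so treat the result as a lower bound rather than a real measurement.

```python
# Rough estimate of the memory held by raw samples at the query.max-samples limit.
# BYTES_PER_SAMPLE is an assumption (int64 timestamp + float64 value), not a measured value.
BYTES_PER_SAMPLE = 16
DEFAULT_MAX_SAMPLES = 50_000_000  # default value of query.max-samples
MAX_SAMPLES = 10_000_000          # our configured limit


def sample_memory_mb(samples, bytes_per_sample=BYTES_PER_SAMPLE):
    """Return the approximate size of `samples` samples in megabytes (10^6 bytes)."""
    return samples * bytes_per_sample / 1_000_000


print(MAX_SAMPLES / DEFAULT_MAX_SAMPLES)  # fraction of the default limit: 0.2
print(sample_memory_mb(MAX_SAMPLES))      # 160.0
```

The limit applies per query, so several expensive queries running concurrently would multiply this figure. Even so, it is a small fraction of 64GB.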
Yet our Prometheus service still crashes. In the query log, we see a few occurrences of
> "error": "query processing would load too many samples into memory in query execution",
Then, minutes later, we see a lot of I/O operations followed by an OOM, and the Prometheus service crashes.
It doesn't seem that query.max-samples does anything to prevent Prometheus from crashing.
It is almost as if the bad queries carried on and kept loading data.
Please advise. Thanks!