OOM error for Prometheus


Nishant Ketu

Apr 13, 2020, 5:03:30 AM
to Prometheus Users
We have deployed Prometheus through Helm, and after around two months of use we get an OOM error and the pods fail to restart. We have to manually clean up /data to get the pod running again. I have used the retention flag, but it doesn't seem to affect the wal folder under /data. Any help with this would be appreciated. Thanks

Julius Volz

Apr 13, 2020, 5:33:48 AM
to Nishant Ketu, Prometheus Users
Hi,

the WAL will always need to contain all data from the last few hours (see https://www.robustperception.io/how-much-space-does-the-wal-take-up for more about WAL space usage), so indeed setting a shorter retention time will not affect it, but you can set a flag to enable WAL compression (saving disk space at the cost of more CPU).
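
As a rough sketch, with the community Helm chart that usually means something like the following in values.yaml (the exact keys depend on the chart and its version, so check its documentation):

    server:
      retention: "15d"                     # --storage.tsdb.retention.time
      extraFlags:
        - storage.tsdb.wal-compression     # trades some CPU for a smaller WAL (Prometheus 2.11+)

or, if you pass flags to the binary directly:

    prometheus --storage.tsdb.retention.time=15d --storage.tsdb.wal-compression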

However, that is about disk usage, not memory usage (which is what your OOM is about). There you can either give your Prometheus server more RAM to work with, or give it less work to do (ingest / process / query less). You can get an idea of Prometheus's memory usage from this post: https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion. Note that the post doesn't account for memory used by queries and other activity, but ingestion load, and especially the number of concurrently active time series, is one of the largest factors in how much memory Prometheus requires.
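
If you go the "more RAM" route via the chart, that typically means raising the container memory limit, along these lines (again, key names vary by chart):

    server:
      resources:
        requests:
          memory: 4Gi
        limits:
          memory: 4Gi

You can then sanity-check actual usage against process_resident_memory_bytes and prometheus_tsdb_head_series for the Prometheus job itself.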

Cheers,
Julius



Martin Man

Apr 13, 2020, 5:53:13 AM
to Nishant Ketu, Prometheus Users
Hi Nishant,

I’m also new to Prometheus and faced a similar scenario recently.

What helped me was to add a job to monitor the Prometheus instance itself, then import a Prometheus 2.0 Grafana dashboard and watch Prometheus's memory consumption and samples appended per second while defining new ServiceMonitors. In the end this helped me stabilise the memory usage as well as identify the services that generated way too many metrics and were responsible for the huge memory consumption.
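
In query terms, the panels I watched correspond to something roughly like this (assuming the self-scrape job is labelled "prometheus"; adjust to your setup, and note the last query is expensive on a large server):

    process_resident_memory_bytes{job="prometheus"}         # memory consumption of the server
    rate(prometheus_tsdb_head_samples_appended_total[5m])   # samples appended per second
    prometheus_tsdb_head_series                             # currently active time series
    topk(10, count by (job) ({__name__=~".+"}))             # which jobs contribute the most series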

HTH,
Martin

Adso Castro

Jul 16, 2020, 3:13:13 PM
to Prometheus Users
@Martin Just a ping about this issue: how did you identify which services were causing you trouble with too many metrics? I'm asking because I'm facing a similar problem at the moment.

Thank you.