Hello All,
We have the following setup and are getting hit by Prometheus OOM issues.
1. Prometheus (2.19) running on K8s with a Thanos sidecar.
2. CPU/Memory: 1 core / 60GB
3. Retention: 1w / 75GB
4. Head block with 6 to 7M active time series. Earlier we used to have 200k to 300k, but a recent change pushed us into this scenario.
Prometheus is continuously getting restarted due to OOM. So far, our findings are:
1. Compaction is not happening even after tsdb.min-block-duration (2h by default). Sometimes it fails, leaving *.tmp files behind. Changing tsdb.min-block-duration and tsdb.max-block-duration is not recommended since we are running the Thanos sidecar.
2. The WAL keeps growing because compaction is not happening.
3. Replaying the WAL takes 5 to 10 minutes due to the frequent restarts and the large accumulation of samples.
4. Queries from alerting rules running against the Prometheus TSDB are timing out. Theoretically, it looks like the memory-mapped chunks are being loaded from disk into memory, causing the OOM.
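To confirm the findings above, we have been reading the TSDB gauges and counters that Prometheus exposes about itself. This is a minimal diagnostic sketch, assuming the default /metrics endpoint on localhost:9090 (the URL and the exact set of watched metrics are our choices, not anything special):

```python
# Diagnostic sketch: scrape Prometheus's own /metrics endpoint and pull out
# the TSDB series/compaction/memory numbers relevant to the findings above.
import re
import urllib.request

WATCHED = {
    "prometheus_tsdb_head_series",               # active series in the head block
    "prometheus_tsdb_head_chunks",               # chunks currently held by the head
    "prometheus_tsdb_compactions_failed_total",  # non-zero => compaction failures
    "process_resident_memory_bytes",             # RSS of the Prometheus process
}

def parse_metrics(text, names):
    """Return {metric_name: float_value} for plain (unlabeled) samples."""
    out = {}
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        m = re.match(r"^(\w+)\s+(\S+)$", line)
        if m and m.group(1) in names:
            out[m.group(1)] = float(m.group(2))
    return out

def fetch(url="http://localhost:9090/metrics"):  # URL is an assumption
    with urllib.request.urlopen(url) as resp:
        return parse_metrics(resp.read().decode(), WATCHED)

if __name__ == "__main__":
    for name, value in sorted(fetch().items()):
        print(f"{name} = {value}")
```

Watching prometheus_tsdb_compactions_failed_total over time should tell us whether the *.tmp files line up with recorded compaction failures.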
Could you please help me with the questions below:
1. Since the data is not compacted into TSDB blocks, I believe the alerting rules are running against the head block (which holds the memory-mapped chunks) in memory. Initially we had the memory limit set to 30GB and raised it to 60GB. I don't think we have data beyond that limit, even with 6 to 7M active time series. Why would memory keep growing and cause OOM? (Consider that we are not scraping anything further once we have 6 to 7M active time series, and the WAL is stored on disk.)
2. We haven't enabled debug logging, and at least there are no traces in the log to explain why compaction is not happening while the alerting rules are timing out.
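For the next occurrence we plan to turn debug logging on so compaction failures leave a trace. A sketch of the extra arg on the Prometheus container (the flag itself is the standard one; the surrounding manifest layout is assumed):

```
# Prometheus container args (K8s manifest fragment, sketch):
- --log.level=debug
```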
The issue was resolved after we stopped the alerting rules and stopped scraping new metrics.
Please excuse my typos.
Thanks,
Amar