Can Prometheus compaction take a hit if there are long running queries on head block?

amar

Jan 21, 2023, 9:09:39 AM1/21/23
to Prometheus Users
Hello All,

We have the following situation and got hit by Prometheus OOM issues.

1. Prometheus (2.19) running on K8s with a Thanos sidecar.
2. CPU/Memory: 1 core/ 60GB
3. Retention: 1w/75GB
4. Head block with 6 to 7M active time series. Earlier we used to have 200k to 300k, but a recent change pushed us into this scenario.

Prometheus is continuously getting restarted due to OOM. So far, these are our findings:
1. Compaction is not happening even after tsdb.min-block-duration (2h by default). Sometimes it fails, leaving behind *.tmp files. Changing tsdb.min-block-duration and tsdb.max-block-duration is not recommended since we are running the Thanos sidecar.
2. The WAL kept growing because compaction was not happening.
3. Replaying the WAL takes 5 to 10 minutes due to the frequent restarts and the large accumulation of events.
4. Queries from alerting rules running against the Prometheus TSDB are timing out. Theoretically, it looks like memory-mapped chunks are being loaded from disk into memory, causing the OOM.
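As a quick way to confirm finding 1 from the outside, Prometheus exposes self-monitoring metrics such as prometheus_tsdb_compactions_total, prometheus_tsdb_compactions_failed_total, and prometheus_tsdb_head_series on its own /metrics endpoint. A minimal sketch of checking them; the SAMPLE text below is fabricated for illustration (in practice you would fetch http://localhost:9090/metrics):

```python
# Parse a Prometheus text-exposition snippet and flag failed compactions.
# The metric names are real Prometheus self-monitoring metrics; the
# sample values are made up for this example.
SAMPLE = """\
prometheus_tsdb_compactions_total 12
prometheus_tsdb_compactions_failed_total 3
prometheus_tsdb_head_series 6.8e+06
"""

def parse_metrics(text):
    """Return {metric_name: value} for simple unlabelled metric lines."""
    metrics = {}
    for line in text.splitlines():
        if line and not line.startswith("#"):
            name, value = line.rsplit(" ", 1)
            metrics[name] = float(value)
    return metrics

m = parse_metrics(SAMPLE)
if m["prometheus_tsdb_compactions_failed_total"] > 0:
    print("compactions failing:", m["prometheus_tsdb_compactions_failed_total"])
```

A rising failed-compactions counter alongside a growing head-series count would match the WAL-growth symptom described above.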

Could you please help me understand the following:
1. Since the data is not being compacted into TSDB blocks, I believe the alerting rules are running against the head block (which holds the memory-mapped chunks) in memory. Initially we had the memory limit set to 30GB and raised it to 60GB. I don't think we have data beyond that limit even with 6M to 7M active time series, so why would memory keep growing until OOM? (Consider that we are not scraping anything new once we reach 6 to 7M active time series, and the WAL is stored on disk.)
2. We haven't enabled debug logging, and there is at least no trace in the logs to explain why compaction is not happening while the alerting rules are timing out.
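On question 1, a rough back-of-envelope may help: the per-series memory cost of the head block is not fixed, but a commonly cited ballpark is a few KiB per active series (this figure is an assumption for illustration, not a measured value for any particular setup). At 7M active series that alone can approach the 60GB limit, before counting WAL replay and query buffers:

```python
# Back-of-envelope head-block memory estimate. The bytes-per-series
# values are assumed ballpark figures, not measurements.
def head_memory_gib(active_series, bytes_per_series):
    return active_series * bytes_per_series / 2**30

for bps in (3 * 1024, 8 * 1024):
    print(f"~{head_memory_gib(7_000_000, bps):.1f} GiB at {bps} bytes/series")
```

Under these assumptions, 7M series lands somewhere between roughly 20 and 53 GiB for the head alone, which would leave little headroom for alert-rule queries on a 60GB limit.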

The issue was resolved after we stopped the alerting rules and stopped scraping the new metrics.
Please excuse my typos.

Thanks,
Amar


Brian Candler

Jan 21, 2023, 10:38:09 AM1/21/23
to Prometheus Users
Not answering your question, but just pointing out that Prometheus 2.19 is from Jun 9, 2020, so it is now over two and a half years old.

Hence, before reporting a performance problem, I suggest you upgrade to something newer and see if the problem disappears. Version 2.37.x is the current long-term-support release (although it is about to reach its committed end-of-support date, so I would expect another branch to be promoted to LTS soon).
