Hello everyone!
We are running Prometheus on a larger scale - 35 million metrics. We have been at this scale for some time now (~1 year). For the last two months we started to experience problems after a large compaction. This happens from time to time, not periodically, without a time pattern. What happens is that Prometheus stops responding to API requests, /metrics endpoints doesn't work and it stops doing internal processes (further compactions don't happen, alerting rules evaluation throws errors), but seems that it continues to scrape all the targets (network usage on the machine does not drop and the WAL increases). The only way to get it out of this state is to restart the Prometheus docker container.
Image that shows it happens during (or maybe right before it finishes) the large compaction:
Log for alert execution error:
level=warn ts=2022-03-28T10:59:43.076Z caller=manager.go:603 component="rule manager" group=push_requests_submitted.alert msg="Evaluating rule failed" rule="alert: Push_Requests_Submitted expr: "expression" for: 1m " err="query timed out in query queue"