Hi,
we have a set of high-cardinality metrics and are currently designing recording rules, primarily to improve dashboard performance.
Beyond a certain cardinality, we observe group evaluation times exceeding the evaluation interval, which leads to missed iterations [1].
In these cases, we can also see that the next iteration starts at the end of the last evaluation plus the interval. So the iteration is not really skipped but rather delayed (the schedule lags behind).
What is the impact of this? Do we need to worry about iteration misses?
To be more concrete, here is one of our rule groups:
groups:
  - name: http_server_requests_seconds_bucket
    rules:
      - record: app_method_uri_status_le:http_server_requests_seconds_bucket:rate1m
        expr: sum by (app, method, uri, status, le) (rate(http_server_requests_seconds_bucket[1m]))
      - record: app_le:http_server_requests_seconds_bucket:rate1m
        expr: sum by (app, le) (app_method_uri_status_le:http_server_requests_seconds_bucket:rate1m)
The scrape interval is set to 15s, the evaluation interval to 30s.
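For completeness, the intervals come from our global config; a minimal sketch with just the values stated above (all other settings omitted):

```yaml
# Sketch of the relevant global settings only; the rest of our config is omitted.
global:
  scrape_interval: 15s
  evaluation_interval: 30s
```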
With ~3 million time series [2], we see evaluation times of ~1m.
[1] We use prometheus_rule_group_iterations_missed_total to monitor missed iterations.
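Concretely, we watch that counter with an alerting rule along these lines (the window, threshold, and alert name are our own choices, not an official recommendation):

```yaml
# Hypothetical alert on missed rule-group iterations; window and
# threshold are our assumptions, tune them to your evaluation interval.
groups:
  - name: rule-group-health
    rules:
      - alert: RuleGroupIterationsMissed
        expr: increase(prometheus_rule_group_iterations_missed_total[10m]) > 0
        for: 10m
        annotations:
          summary: "Rule group {{ $labels.rule_group }} is missing iterations"
```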
[2] We have a small test tool to simulate load on Prometheus before rolling this out. We're trying to find the limits of a single Prometheus instance before scaling horizontally (federation) or reaching for e.g. Thanos or Cortex.