I am encountering challenges configuring Prometheus and Alertmanager for my application's alerting setup. Below are the configurations I am currently using:
prometheus.yml:
  global:
    scrape_interval: 1h
rules.yml:

  groups:
    - name: recording-rule
      interval: 1h
      rules:
        - record: myRecord
          expr: expression…..  # a ratio of two metrics, compared against a threshold
    - name: alerting-rule
      interval: 4h
      rules:
        - alert: myAlert
          expr: max_over_time(myRecord[4h])
          labels:
            severity: warning
          annotations:
            summary: "summary"

alertmanager.yml:

  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

Issues:
Inconsistent Alerting: With the scrape interval and the recording rule's evaluation interval both set to 1 hour, the rule sometimes evaluates at a moment when the most recent scraped sample is already stale (instant queries ignore samples older than 5 minutes), so the recording rule produces no value and the alert fails to fire even though the condition is satisfied.
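A common mitigation, sketched under the assumption that scraping more frequently is acceptable, is to decouple the two intervals so a fresh sample always exists when the rule evaluates:

```yaml
# prometheus.yml (sketch) — scrape well inside the rule interval so the
# recording rule never evaluates against a stale or missing sample
global:
  scrape_interval: 1m      # assumption: per-minute scrapes are affordable
  evaluation_interval: 1h  # rule groups without their own interval use this
```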
Discrepancy in Firing Alerts: The number of firing alerts in Prometheus varies significantly from the number of alerts received by Alertmanager, causing inconsistency and confusion in alert handling.
Uncertainty in Alert Evaluation Timing: The alerting rule seems to be evaluated inconsistently, sometimes firing shortly after a service restart and at other times only after delays beyond the expected 4-hour interval.
Request for Assistance:
I am seeking guidance on configuring Prometheus and Alertmanager so that alerts fire reliably and the alert counts in Prometheus and Alertmanager agree.
Thanks in advance.
I believe that recording rules and alerting rules may similarly have their evaluation times fall at different offsets within their evaluation interval. This is done for the same reason: to spread the internal load of rule evaluations across time.
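If that offset mismatch is the culprit, one way around it (a sketch; it assumes the two groups can be merged) is to put the recording rule and the alerting rule in a single group, since rules within one group are evaluated sequentially in order:

```yaml
groups:
  - name: record-and-alert
    interval: 1h
    rules:
      - record: myRecord
        expr: expression…  # the original ratio expression goes here
      - alert: myAlert     # evaluated immediately after myRecord, same timestamp
        expr: max_over_time(myRecord[4h])
        labels:
          severity: warning
```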
Thank you for your prompt response and guidance on addressing the metric staleness issue.
Regarding metric staleness: I confirm that I have already implemented the range-vector approach for the recording metric and the alerting rule (e.g. max_over_time(metric[1h])). However, the main challenge persists: the number of alerts generated by Prometheus does not match the number displayed in Alertmanager.
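For concreteness, that workaround looks roughly like this (a sketch; `metric_a` and `metric_b` are hypothetical stand-ins for the two metrics in the ratio):

```yaml
groups:
  - name: recording-rule
    interval: 1h
    rules:
      - record: myRecord
        # the [1h:5m] subquery re-evaluates the ratio over the last hour,
        # so the result survives even when the latest raw sample is older
        # than the 5-minute staleness cutoff
        expr: max_over_time((metric_a / metric_b)[1h:5m])
```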
To illustrate: Prometheus may show approximately 25,000 alerts firing within a given period, but Alertmanager often reports a significantly different count, such as 10,000 or 18,000, rather than the expected 25,000.
This inconsistency poses a significant challenge for our alert management process, leading to confusion and the risk of overlooking critical alerts.
I would greatly appreciate any further insights or recommendations you may have to address this issue and ensure alignment between Prometheus and Alertmanager in terms of the number of alerts generated and displayed.
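One thing worth checking here: Alertmanager deduplicates alerts with identical label sets and displays them grouped according to `group_by`, so its counts are not expected to match Prometheus's raw firing count one-to-one. A sketch, assuming per-target visibility is what you want, is to widen `group_by` in alertmanager.yml:

```yaml
route:
  # assumption: 'instance' and 'job' exist on the alerts; adding them keeps
  # alerts from different targets from being folded into a single group
  group_by: ['alertname', 'instance', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
```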