Assistance Needed with Prometheus and Alertmanager Configuration


Trio Official

Mar 29, 2024, 3:43:05 PM3/29/24
to Prometheus Users

I am encountering challenges with configuring Prometheus and Alertmanager for my application's alarm system. Below are the configurations I am currently using:

prometheus.yml: 

  scrape_interval: 1h

rules.yml:

groups:
  - name: recording-rule
    interval: 1h
    rules:
      - record: myRecord
        expr: expression…..   # ratio of two metrics, compared against a threshold value
  - name: alerting-rule
    interval: 4h
    rules:
      - alert: myAlert
        expr: max_over_time(myRecord[4h])
        labels:
          severity: warning
        annotations:
          summary: "summary"

alertmanager.yml:

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h


Issues:

  • Inconsistent Alerting: With the scrape interval and the recording rule evaluation interval both set to 1 hour, the relative timing of scrapes and rule evaluations is unpredictable. Sometimes the recording rule is evaluated at a point where no usable sample of the source metric is available, so the rule produces no value and no alert is triggered even though the condition is satisfied.

  • Discrepancy in Firing Alerts: The number of firing alerts in Prometheus varies significantly from the number of alerts received by Alertmanager, causing inconsistency and confusion in alert handling.

  • Uncertainty in Alert Evaluation Timing: The alerting rule seems to be evaluated inconsistently, sometimes triggering alerts shortly after service restart, while other times with delays beyond the expected 4-hour interval.


Request for Assistance:

I am seeking guidance on configuring Prometheus and Alertmanager to achieve the following:

  • Ensuring the alerting expression is evaluated every 4 hours, checking for the maximum of the recorded metric over that interval.
  • Ensuring the recording rule is evaluated every hour so that alerts are triggered accurately.

I would appreciate any insights or recommendations on addressing these challenges and achieving the desired configuration for our use case.

Thanks in advance.

Chris Siebenmann

Mar 29, 2024, 6:09:18 PM3/29/24
to Trio Official, Prometheus Users, Chris Siebenmann
> I am encountering challenges with configuring Prometheus and Alertmanager
> for my application's alarm system. Below are the configurations I am
> currently using:
>
> *prometheus.yml:*
>
> Scrape Interval: 1h

This scrape interval is far too high. Although it's not well documented,
you can't set scrape_interval higher than two or three minutes without
causing seriously weird issues, where your rules may not see metrics
because Prometheus considers the metrics stale. Prometheus considers
metrics stale if the most recent sample is more than five minutes old;
this time is not adjustable as far as I know. I believe you've already
seen signs of this from your other problems, but really, as far as I
know such a configuration basically isn't supported.

(In my view this is such a problem that Prometheus should at least
require a forced 'I know what I'm doing, really I want this' command
line option to accept a scrape interval that's larger than the staleness
interval, or maybe even within ten seconds or so of it.)
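
As a rough illustration of what that staleness looks like in practice (the metric name here is just a placeholder, not anything from your config): with a 1h scrape interval an instant vector selector usually returns nothing, because the newest sample is more than five minutes old, while a range selector still sees the raw samples:

  # usually empty with a 1h scrape interval: the last sample is
  # older than the 5-minute staleness/lookback window
  some_hourly_metric

  # still returns data, because range selectors return every raw
  # sample that falls inside the window, regardless of staleness
  max_over_time(some_hourly_metric[1h])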

I believe that all rule evaluation intervals similarly need to be no
more than five minutes because of the stale metrics issue, since both
recording rules and alerting rules generate metrics (the recording rules
generate their metrics in an obvious way, the alerting rules generate
ALERTS metrics and some other ones). It's possible that alerts don't go
stale inside Prometheus despite their metrics going stale, but I
wouldn't count on this.

(Although it's possible that metrics from recording and/or alert rules
are special and are exempted from staleness, I would be surprised.)
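
If you want to see this for yourself, the ALERTS metric can be queried directly in Prometheus; something along these lines (reusing the alert name from your rules as an example) is a useful sanity check when comparing what Prometheus thinks is firing with what the Alertmanager shows:

  # one time series per alert instance that is currently pending or firing
  ALERTS{alertname="myAlert"}

  # how many of those are actually in the firing state
  count(ALERTS{alertname="myAlert", alertstate="firing"})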

Prometheus scrapes different targets at different offsets within their
scrape interval, so you can't synchronize scrapes and rule evaluations
the way you apparently want to. The time offset for any particular
scrape target is deterministic but not predictable (and it may change
between eg Prometheus releases, or even on a Prometheus restart).
Prometheus does this to spread out the load of scraping more or less
evenly across the scrape interval, rather than descending on all targets
simultaneously every X seconds or minutes.

I believe that recording rules and alerting rules similarly may have
their evaluation time happen at different offsets within their
evaluation interval. This is done for the similar reason of spreading
out the internal load of rule evaluations across time.

- cks

Brian Candler

Mar 30, 2024, 4:59:42 AM3/30/24
to Prometheus Users
On Friday 29 March 2024 at 22:09:18 UTC Chris Siebenmann wrote:
> I believe that recording rules and alerting rules similarly may have
> their evaluation time happen at different offsets within their
> evaluation interval. This is done for the similar reason of spreading
> out the internal load of rule evaluations across time.

I think it's more accurate to say that *rule groups* are spread over their evaluation interval, and rules within the same rule group are evaluated sequentially. This is how you can build rules that depend on each other, e.g. a recording rule followed by other rules that depend on its output; put them in the same rule group.
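
As a sketch of what that looks like (reusing the rule names from the original post; the 2m interval is only an example, not a recommendation for this particular alert), putting both rules in one group guarantees the recording rule has already been evaluated when the alerting rule runs in the same cycle:

groups:
  - name: my-rules
    interval: 2m            # comfortably inside the 5-minute staleness window
    rules:
      - record: myRecord
        expr: ...           # your ratio expression goes here
      - alert: myAlert      # evaluated after myRecord, in the same cycle
        expr: max_over_time(myRecord[4h])
        labels:
          severity: warning
        annotations:
          summary: "summary"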

As for scraping: you *can* change this staleness interval, using --query.lookback-delta, but it's strongly not recommended. Using the default of 5 mins, you should use a maximum scrape interval of 2 mins so that even if you miss one scrape for a random reason, you still have two points within the lookback-delta so that the timeseries does not go stale.
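
Concretely, that would look something like this in prometheus.yml (a sketch; adjust to your own setup):

global:
  scrape_interval: 2m      # keeps at least two samples inside the default 5m lookback window
  evaluation_interval: 2m  # default rule evaluation interval; a per-group 'interval' overrides it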

There's no good reason to scrape at one hour intervals:
* Prometheus is extremely efficient with its storage compression, especially when adjacent data points are equal, so scraping the same value every 2 minutes is going to use hardly any more storage than scraping it every hour.
* If you're worried about load on the exporter because responding to a scrape is slow or expensive, then you should run the exporter every hour from a local cronjob, and write its output to a persistent location (e.g. to PushGateway or statsd_exporter, or simply write it to a file which can be picked up by node_exporter textfile-collector or even a vanilla HTTP server).  You can then scrape this as often as you like.

node_exporter textfile-collector exposes an extra metric with the timestamp of each file, so you can alert if the file isn't being updated.
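
For example, assuming the cron job writes its output to a file called myjob.prom in the textfile-collector directory (the file name and the 2-hour threshold are just placeholders), an alerting rule along these lines catches the file going stale:

  - alert: TextfileMetricsStale
    expr: time() - node_textfile_mtime_seconds{file="myjob.prom"} > 2 * 3600
    labels:
      severity: warning
    annotations:
      summary: "myjob.prom has not been updated for more than two hours"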

Trio Official

Mar 30, 2024, 5:49:04 AM3/30/24
to Prometheus Users

Thank you for your prompt response and guidance on addressing the metric staleness issue.

Regarding metric staleness: I confirm that I have already adopted the approach of using range selectors (square brackets) in the recording and alerting rules (e.g. max_over_time(metric[1h])). However, the main challenge persists: the number of alerts generated by Prometheus does not match the number displayed in Alertmanager.

To illustrate: Prometheus may show approximately 25,000 alerts fired within a given period, yet when I review the corresponding alerts in Alertmanager the count often deviates significantly, showing figures such as 10,000 or 18,000 rather than the expected 25,000.

This inconsistency poses a significant challenge in our alert management process, leading to confusion and potentially overlooking critical alerts.

I would greatly appreciate any further insights or recommendations you may have to address this issue and ensure alignment between Prometheus and Alertmanager in terms of the number of alerts generated and displayed.

Brian Candler

Mar 30, 2024, 6:20:36 AM3/30/24
to Prometheus Users
Only you can determine that, by comparing the lists of alerts from both sides and seeing what differs, and looking into how they are generated and measured. There are all kinds of things which might affect this, e.g. pending/keep_firing_for alerts, group wait etc.

But you might also want to read this:

If you're generating more than a handful of alerts per day, then maybe you need to reconsider what constitutes an "alert".