Recording rule evaluations vs instant queries

l.mi...@gmail.com

Jan 19, 2022, 7:09:10 AM
to Prometheus Users
I have a scrape job for node_exporter with "scrape_interval: 1m" and ~100 targets.
Some metrics from that scrape are used to power a recording rule:
sum without(cpu) (rate(node_cpu_seconds_total[2m]))
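
For reference, a minimal sketch of how this rule is defined in the rules file (the group and record names here are placeholders, not my exact config):

  groups:
    - name: node_cpu                          # hypothetical group name
      interval: 1m                            # evaluation interval, same as the scrape interval
      rules:
        - record: instance:node_cpu:rate2m    # hypothetical record name
          expr: sum without(cpu) (rate(node_cpu_seconds_total[2m]))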

It turns out that when this rule is evaluated it produces a time series for all but one instance; at random intervals (every 20-60 minutes) that one instance does get a sample, so its graph is scattered dots instead of a line.

When I manually run sum without(cpu) (rate(node_cpu_seconds_total[2m])) I get results for all instances, including the affected one, so the issue only shows up when the recording rule is evaluated.
Rule evaluation metrics from Prometheus don't show any problems: no missed iterations, no failures, and the logs are clean.
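
(The checks I mean are along these lines, using Prometheus' own rule self-metrics on a reasonably recent version:)

  # Any failed evaluations or skipped group iterations?
  rate(prometheus_rule_evaluation_failures_total[5m])
  rate(prometheus_rule_group_iterations_missed_total[5m])
  # Is the group taking longer than its interval?
  prometheus_rule_group_last_duration_seconds
  prometheus_rule_group_interval_seconds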

Now, I know that rate() needs at least 2 samples, so rate() over a 2m range with scrape_interval: 1m only works if everything is perfectly aligned.
But if the problem is rate() not getting both samples, I'm not sure why a range query would work here. Do range queries and rule evaluations query data differently?
And how does staleness play out here? Will a rule evaluation look back 5m for samples, or does it use a more "instant" query mechanism?
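
(For context, a sketch of why the 2m window is tight at a 1m scrape interval, and the wider window that's commonly suggested; the 4m figure is an illustration, not something I've changed yet:)

  # A 2m window holds at most 2 samples at a 1m scrape interval, so scrape
  # jitter or a single missed scrape leaves rate() with fewer than the 2
  # samples it needs, and it returns nothing for that series.
  sum without(cpu) (rate(node_cpu_seconds_total[2m]))

  # Common rule of thumb: make the range at least ~4x the scrape interval.
  sum without(cpu) (rate(node_cpu_seconds_total[4m]))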

Brian Brazil

Jan 19, 2022, 12:16:28 PM
to Lukasz Mierzwa, Prometheus Users


On Wed 19 Jan 2022, 16:51 Lukasz Mierzwa, <l.mi...@gmail.com> wrote:
Thanks!

The blog post doesn't directly mention it, but it seems to confirm that rule evaluation won't look back 5m to find samples and simply grabs what's within the rate() range(?).

Correct, the 5m only applies to instant vectors, not range vectors.

Brian
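
To spell out the distinction (a sketch using the selectors from the original rule):

  # Instant vector selector: at evaluation time T, Prometheus looks back up to
  # the 5m staleness/lookback window and returns the latest sample per series.
  node_cpu_seconds_total

  # Range vector selector: only samples with timestamps inside (T-2m, T] are
  # returned; there is no extra lookback, so a late or missed scrape at a 1m
  # scrape interval can leave fewer than the 2 samples rate() needs.
  rate(node_cpu_seconds_total[2m])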



