I have a scrape job for node_exporter with "scrape_interval: 1m" and ~100 targets.
Some metrics from that scrape are used to power a recording rule:
sum without(cpu) (rate(node_cpu_seconds_total[2m]))
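For reference, a minimal sketch of the setup described above (the job name, file names, and recording rule name are my assumptions, not taken from the actual config):

```yaml
# prometheus.yml (fragment) -- job name and target addresses are assumptions
scrape_configs:
  - job_name: node
    scrape_interval: 1m
    static_configs:
      - targets: ['node1:9100', 'node2:9100']  # ~100 targets in reality

rule_files:
  - rules.yml

# rules.yml -- the record name is an assumption
groups:
  - name: node
    rules:
      - record: instance:node_cpu_seconds:rate2m
        expr: sum without(cpu) (rate(node_cpu_seconds_total[2m]))
```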
It turns out that when this rule evaluates, it generates time series for all but one instance; only occasionally (every 20-60 minutes) does that one instance get its series generated, so the graph shows scattered dots instead of continuous lines.
When I manually run sum without(cpu) (rate(node_cpu_seconds_total[2m])) I get results for all instances, including the affected one, so the issue manifests only during recording rule evaluation.
Rule evaluation metrics from Prometheus don't show any problems: no missed iterations, no failures, and the logs are clean.
Now, I know that rate() needs at least 2 samples in its range, so a [2m] range with scrape_interval: 1m only works reliably if everything is perfectly aligned.
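To illustrate the tight-window problem in isolation (this is a toy simulation, not Prometheus code; all timestamps are hypothetical), a 2m range selector over 1m scrapes normally catches 2 samples, but a single delayed scrape can leave only 1 sample in the window, and rate() then returns nothing for that series:

```python
def samples_in_window(scrape_times, eval_time, window=120):
    """Return the samples a Prometheus-style range selector would see.

    A range selector like foo[2m] evaluated at eval_time selects samples
    with timestamps in the half-open interval (eval_time - window, eval_time].
    All times are in seconds.
    """
    return [t for t in scrape_times if eval_time - window < t <= eval_time]

# Scrapes perfectly aligned every 60s: a 120s window always holds 2 samples.
aligned = [0, 60, 120, 180, 240]
print(len(samples_in_window(aligned, 250)))  # prints 2 -> rate() works

# The scrape due at t=120 landed late, at t=130. An evaluation at t=125
# now sees a window (5, 125] containing only the t=60 sample.
delayed = [0, 60, 130, 190, 250]
print(len(samples_in_window(delayed, 125)))  # prints 1 -> rate() yields nothing
```

With only one sample in the window, rate() cannot compute a per-second increase, so that instance simply drops out of the rule's output for that evaluation, which matches the dotted-line symptom.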
If the problem is rate() not getting both samples, then I'm not sure why a range query would work here. Do range queries and rule evaluations query data differently?
And how does staleness play out here? Does a rule evaluation query data using the 5m look-back, or does it use a more "instant" query mechanism?