Unexpected Spikes in data recorded using Recording Rules.

Kishore Kumar

Mar 11, 2025, 9:48:54 AM
to Prometheus Users
Hi Prometheus users,
          We have a PromQL query and a recording rule that records it, as in the example below.

- record: rest_server_recording_rule
  expr: sum(increase(example_metric[1m])) by (kubernetes_container_name)

The scrape interval and the rule evaluation interval are both 30 seconds, set in the Prometheus configuration.
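
For context, the rule sits inside a rule group configured roughly like this (the group name and layout here are a sketch, not our exact configuration):

groups:
  - name: rest-server-rules    # placeholder group name
    interval: 30s              # rule evaluation interval
    rules:
      - record: rest_server_recording_rule
        expr: sum(increase(example_metric[1m])) by (kubernetes_container_name)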

We are seeing unexpected spikes in the data recorded by the recording rule, while those spikes are not present when the source expression is queried directly, as shown in the graph below (Grafana was used for the comparison).

Can you help us understand why the recording rule is creating these spikes? We would also like to understand how a recording rule captures the data of a query.

Thanks for reading this message, have a great day.

Sum(increase) RawQuery - data produced when we query the raw expression directly.
Recording Rules - Data captured by the recording rule.
[attached graph: image-2025-3-10_18-52-48.png]


Brian Candler

Mar 11, 2025, 3:03:26 PM
to Prometheus Users
To more easily debug your issue, please take Grafana out of the equation, as it has its own foibles. Instead, formulate the query directly in the Prometheus expression browser (the web UI built into Prometheus).

Then show whether there is a difference between the results. If there is, show the exact query you're giving to Prometheus and the exact definition of the recording rule. Show both graphs, and highlight the differences.

My *guess* is it's something to do with detected counter resets, i.e. example_metric is not increasing monotonically. You can formulate queries to detect this.
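
For example, something along these lines (adjust the metric selector and range to suit) will show where counter resets are being detected:

# number of counter resets per series over the last 5 minutes
resets(example_metric[5m]) > 0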

Kishore Kumar

Mar 12, 2025, 10:01:46 AM
to Prometheus Users
Hi Brian,
        We have used the Thanos Query UI to query the graph, and we observe the same graph that we observed in Grafana. The recording rule we actually use is shown below; only the rule name has been changed and sensitive label values are masked as <hidden>.
- record: rest_server_recording_rule
  expr: sum(increase(envoy_cluster_upstream_rq{kubernetes_namespace=~".*<hidden>.*", kubernetes_pod_name=~"rest-.*", envoy_cluster_name=~"<hidden>"}[3m])/3) by (kubernetes_namespace,kubernetes_container_name,envoy_cluster_name)
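
For clarity, the division by 3 only scales the 3-minute increase down to a per-minute figure; as far as we understand, an equivalent way of writing the expression would be:

sum(rate(envoy_cluster_upstream_rq{kubernetes_namespace=~".*<hidden>.*", kubernetes_pod_name=~"rest-.*", envoy_cluster_name=~"<hidden>"}[3m]) * 60) by (kubernetes_namespace,kubernetes_container_name,envoy_cluster_name)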

The source metric here, envoy_cluster_upstream_rq, is indeed not monotonically increasing, and there are counter resets happening. The images are attached below.
envoy_cluster_upstream_rq: [attachment 2.png]

Sum of envoy_cluster_upstream_rq: [attachment 1.png]

Actual query: [attachment 3.png]

Recording rule: [attachment 4.png]

Even though the metric is not monotonically increasing, the recording rule should not be creating new spikes, since we don't see them in the result of running the query directly.

We would like to know whether we should change any parameters related to recording rules to make the two results match as closely as possible.

Thanks for the response,
Have a nice day.
Kishore

Kishore Kumar

Mar 14, 2025, 10:45:01 AM
to Prometheus Users
Hi Brian,
      I hope you are having a good day. Could you please take a look at the graphs attached above and reply when you get a chance?
Apologies and Thank You,
Kishore.

Brian Candler

Mar 14, 2025, 12:23:50 PM
to Prometheus Users
Sum of envoy_cluster_upstream_rq is not a useful query: when time series come and go, the sum jumps down and up (as you can see). You can't do anything with this.
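
If you want to confirm that series are coming and going, a query along these lines (labels chosen to match your rule) shows how many series are present at each point in time:

# number of envoy_cluster_upstream_rq series present at each evaluation
count(envoy_cluster_upstream_rq) by (kubernetes_namespace, kubernetes_container_name)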

Instead you need to sum(increase(...)), but that's what you're already doing.

If you select a time range that doesn't include a spike, do the two graphs look the same? If they do, then maybe there's some odd timing issue, e.g. your Grafana/Thanos graphs are at a resolution where you're skipping over the spikes (if that were the problem, I'd suggest refreshing the page every 10 seconds for 5 or 10 minutes and seeing whether any spikes come and go).

Otherwise, you could look separately at the graphs of
increase(envoy_cluster_upstream_rq[3m])
sum(increase(envoy_cluster_upstream_rq[3m]))

Or maybe it's something to do with Thanos and recording rules.

Sorry, I can't think of anything more than that.