Recording rule displaying different results than ad-hoc querying


Per Lundberg

Apr 23, 2020, 6:59:05 AM
to Prometheus Users
Hi,

We have been using Prometheus (2.13.1) with one of our larger customer installations for a while; thus far, it's been working great and we are very thankful for the nice piece of software that it is. (We are a software company ourselves, using Prometheus to monitor the health of both our own application and many other relevant parts of the services involved.) Because of the data volume for some of our metrics, we have a number of recording rules set up to make querying this data reasonable from e.g. Grafana.

However, today we started seeing some really strange behavior after a planned restart of one of the Tomcat-based application services we are monitoring. Some requests seem to be peaking at 60s (indicating a problem in our application backend), but the strange thing here is that our recording rules produce very different results than running the same queries directly in the Prometheus console.

Here is how the recording rule has been defined in a custom_recording_rules.yml file:

  - name: hbx_controller_action_global
    rules:
      - record: global:hbx_controller_action_seconds:histogram_quantile_50p_rate_1m
        expr: histogram_quantile(0.5, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
      - record: global:hbx_controller_action_seconds:histogram_quantile_75p_rate_1m
        expr: histogram_quantile(0.75, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
      - record: global:hbx_controller_action_seconds:histogram_quantile_95p_rate_1m
        expr: histogram_quantile(0.95, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))
      - record: global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m
        expr: histogram_quantile(0.99, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))

Querying global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m yields an output like this:

[screenshot: graph of the recording rule, with spikes up to 60s]

However, running the underlying query directly gives a completely different view of this data. Note how the 60-second peaks are completely gone in this screenshot:

[screenshot: graph of the ad-hoc histogram_quantile query, without the 60s spikes]

I don't really know what to make of this. Are we doing something fundamentally wrong here in how our recording rules are set up, or could this be a bug in Prometheus (unlikely)? Btw, we have the evaluation_interval set to 15s globally.
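
For context, the rule file is wired into our main config roughly like this (the scrape_interval shown here is just illustrative; the 15s evaluation_interval is the real value):

  global:
    scrape_interval: 15s        # illustrative only; actual scrape settings vary per job
    evaluation_interval: 15s    # the real global value mentioned above

  rule_files:
    - custom_recording_rules.yml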

Thanks in advance.

Best regards,
Per

Julius Volz

Apr 23, 2020, 8:39:13 AM
to Per Lundberg, Prometheus Users
Odd. Depending on time window alignment it can always happen that some spikes appear in one graph and not another, but such a big difference is strange. Just to make sure, what happens when you bring down the resolution on both queries to 15s (which is your rule evaluation interval) or lower?
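
One thing that can also help rule out graph-step aliasing, by the way, is wrapping both expressions in max_over_time over a subquery, roughly like this (the 5m window is just an example; subqueries need Prometheus >= 2.7):

  # highest value of the recording rule within the preceding 5 minutes,
  # catching spikes that a coarse graph step might otherwise skip:
  max_over_time(global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m[5m:15s])

  # the same, applied to the ad-hoc expression:
  max_over_time(histogram_quantile(0.99, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))[5m:15s])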


Per Lundberg

Apr 23, 2020, 9:02:58 AM
to Julius Volz, Prometheus Users

With global:hbx_controller_action_seconds:histogram_quantile_99p_rate_1m, there are more 60s spikes shown if I change to a 15s or 5s interval. With the other query (histogram_quantile(0.99, sum by (le)(rate(hbx_controller_action_seconds_bucket[1m])))), it still doesn't go above 1.2s, oddly enough.

Julius Volz

Apr 23, 2020, 9:24:42 AM
to Per Lundberg, Prometheus Users
Strange! Have you tried a more recent Prometheus version, btw? Just to rule that part out, since 2.13.1 is pretty old...

Marcin Chmiel

May 13, 2020, 5:09:43 AM
to Prometheus Users
We're facing what I believe is the exact same issue, on v2.16.0, although we also have some intermittent failures with kube-state-metrics, which generates the data for this query. I reckon the recording rule should either be empty (due to time skew) or show the same value as the query, but it shouldn't have such a dip.

Here's the query that's plotted and on which the recording rule is based:

count by (namespace) (kube_namespace_labels{label_xxx="123"})
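
For reference, the recording rule itself is defined along these lines (the group and rule names here are made up for illustration, not our actual ones):

  - name: namespace_label_xxx
    rules:
      - record: namespace:kube_namespace_labels_xxx:count   # illustrative rule name
        expr: count by (namespace) (kube_namespace_labels{label_xxx="123"})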

The absent series shows the periods where kube-state-metrics is unavailable; the orange color is where the query and the recording rule overlap.

[attachment: grafana.png]

Per Lundberg

May 15, 2020, 6:11:19 AM
to Prometheus Users
Interesting, Marcin. That could indeed be the same root cause, yes.

Julius - I tried a more recent Prometheus version, 2.18.1 (by synchronizing 200 GiB of data to my local machine ;), and I still get the same behavior.

I get the feeling that this could be a bug in Prometheus. Should I perhaps report it via the GitHub issue tracker?

Best regards,
Per

Julius Volz

May 15, 2020, 6:15:09 AM
to Per Lundberg, Prometheus Users
Yeah, that would be great, thanks!


Per Lundberg

Aug 26, 2020, 2:53:48 AM
to Prometheus Users
Hi,

For the record, I've now (finally!) created a GitHub issue about this: https://github.com/prometheus/prometheus/issues/7852
Sorry for the huge delay.

Best regards,
Per