inaccurate recording rule


Chuan Li

Apr 20, 2020, 5:11:39 AM
to Prometheus Users

I created this GitHub issue (https://github.com/prometheus/prometheus/issues/7144) as a bug report, but brian-brazil directed me here, so I guess it is intended behaviour. The original issue is reproduced below.


I don't understand how this is intended behaviour. If rate(some_metrics[1m]) works, why shouldn't rate(some_metrics/other_metrics_with_constant_value[1m])?


--- GitHub Issue ---


What did you do?

I wanted to calculate container CPU saturation. Without a recording rule I would do something like this:

rate(container_cpu_usage_seconds_total{pod!="", container!=""}[1m]) / ON (namespace, pod, container) GROUP_LEFT kube_pod_container_resource_limits_cpu_cores

Then I wanted to set up a recording rule for this query while retaining the ability to change the range selector, so I wrote the recording rule like this:

  - record: container_cpu_saturation_total
    expr: container_cpu_usage_seconds_total{pod!="", container!=""} / ON(namespace, pod, container) GROUP_LEFT kube_pod_container_resource_limits_cpu_cores

So I could do something like this:

rate(container_cpu_saturation_total[1m])
rate(container_cpu_saturation_total[5m])
rate(container_cpu_saturation_total[15m])

What did you expect to see?

I expected the original PromQL and the one using the recording rule to yield mostly identical results, with a margin of error of perhaps no more than a few percent.

What did you see instead? Under which circumstances?

This is the actual result I got:

[screenshot: https://user-images.githubusercontent.com/7102510/79717571-99b76280-830c-11ea-9308-aadc7f2d1a59.png]

The difference is rather large. With the recording rule, the returned values are always higher, sometimes going over 100%, which should not be possible.

I tested this on both v2.15.2 and v2.17.1; both versions have this issue (the screenshot is from v2.15.2).

Environment

  • System information:

Linux 3.10.0-1062.el7.x86_64 x86_64

  • Prometheus version:

v2.15.2 and v2.17.1 (both affected, as noted above).

  • Prometheus configuration file:
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 30s

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 10.200.13.200:6007
    scheme: http
    timeout: 10s
    api_version: v1

rule_files:
- /config/*.recording.yml
- /config/*.alerting.yml

scrape_configs:

(other irrelevant config)

- job_name: cadvisor
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics/cadvisor
  scheme: https
  kubernetes_sd_configs:
  - role: node
  tls_config:
    ca_file: /secrets/kubelet/ca
    cert_file: /secrets/kubelet/cert
    key_file: /secrets/kubelet/key
    insecure_skip_verify: false

- job_name: kube-state-metrics
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: service
    namespaces:
      names:
      - kube-system
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: kube-state-metrics
    replacement: $1
    action: keep


Here is another screenshot with a graph instead of an instant query:

[screenshot: graph comparison of the two queries]

Jake Utley

Apr 20, 2020, 7:38:38 PM
to Prometheus Users
Your scrape interval is 15s, your evaluation interval is 30s, and your range is 1m. This means your original query reflects 4 data samples (sensitive, but reasonable in many use cases), while the new query over the recorded expression sees only 2 data samples (incredibly sensitive, and not particularly useful).

If you want to keep using a range of 1m, I would suggest changing the recording rule frequency to 15s to match the scrape interval. You'll also likely see much more comparable results if you use a larger range, such as 5m.
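As a sketch of that suggestion (assuming the standard per-group `interval` field in the rule file), the rule group can pin its own evaluation frequency instead of inheriting the global 30s:

```yaml
groups:
  - name: container-saturation
    # Evaluate this group every 15s, matching the scrape interval,
    # rather than the global evaluation_interval of 30s.
    interval: 15s
    rules:
      - record: container_cpu_saturation_total
        expr: container_cpu_usage_seconds_total{pod!="", container!=""} / ON(namespace, pod, container) GROUP_LEFT kube_pod_container_resource_limits_cpu_cores
```

With this, the recorded series gains a sample every 15s, so a 1m range selector over it sees roughly as many samples as one over the raw series.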

Also, it's absolutely possible for the container's reported usage to go past the limit. The data isn't perfect, and these artifacts can happen, particularly when only looking at two data samples.
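The sample-count arithmetic above can be sketched as follows (a hypothetical helper, not Prometheus code), assuming samples arrive at a perfectly regular interval:

```python
def samples_in_range(range_seconds: int, sample_interval_seconds: int) -> int:
    """Roughly how many samples a range selector like [1m] covers,
    assuming one sample exactly every sample_interval_seconds."""
    return range_seconds // sample_interval_seconds

# Raw series scraped every 15s, queried with rate(...[1m]):
print(samples_in_range(60, 15))  # 4
# Recorded series evaluated every 30s, queried with rate(...[1m]):
print(samples_in_range(60, 30))  # 2 -- the bare minimum rate() can use
```

Two samples is the minimum for rate() to return anything at all, which is why the recorded variant is so much more sensitive to any single noisy sample.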

Chuan Li

Apr 21, 2020, 5:11:09 AM
to Prometheus Users
I tried changing the evaluation interval to 15s and got the same result: a large margin of error (at least 5%). A query on a container at around 95% saturation would almost certainly go over 100% with the recording rule, but never without it.

My problem isn't that it went over 100%. It's the consistently different results that don't make sense.

Julien Pivotto

Apr 21, 2020, 5:44:11 AM
to Chuan Li, Prometheus Users
You probably want to see https://www.youtube.com/watch?v=67Ulrq6DxwA to understand everything.

To make it short: with rate[1m], it is possible that Prometheus
extrapolates your data a bit too much.

Since your scrape interval is 15s, I would suggest using rate[3m]:
10% of 180s is 18s, which is higher than the scrape interval (15s),
so Prometheus will extrapolate all the way to the boundaries of the
range.

I know this is quite complex, and the video explains it.
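For the curious, here is a rough, simplified model of the extrapolation rate() performs (loosely modelled on extrapolatedRate in Prometheus's promql/functions.go; the real code has extra guards, e.g. for counter resets and to avoid extrapolating a counter below zero). The slope over the sampled span is extended toward the range boundaries, but only by half an average sample interval when the boundary gap looks larger than a normal scrape gap:

```python
def extrapolated_rate(samples, range_start, range_end):
    """Simplified sketch of Prometheus's rate() extrapolation.

    samples: sorted list of (timestamp, counter_value) inside the range.
    Returns the per-second rate, or None with fewer than two samples.
    """
    if len(samples) < 2:
        return None
    (first_t, first_v), (last_t, last_v) = samples[0], samples[-1]
    sampled = last_t - first_t
    delta = last_v - first_v
    avg_interval = sampled / (len(samples) - 1)
    # Extrapolate to a boundary only if the gap to it looks like a
    # normal scrape gap; otherwise assume the series starts/ends
    # nearby and extend by just half an average interval.
    threshold = avg_interval * 1.1
    gap_start = first_t - range_start
    gap_end = range_end - last_t
    extend = sampled
    extend += gap_start if gap_start < threshold else avg_interval / 2
    extend += gap_end if gap_end < threshold else avg_interval / 2
    return delta * extend / sampled / (range_end - range_start)

# Counter increasing 1/s, scraped every 15s, rate over a 60s range:
dense = [(0, 0), (15, 15), (30, 30), (45, 45)]
print(extrapolated_rate(dense, 0, 60))   # 1.0
# Same counter recorded every 30s: only two samples feed the rate,
# so a single noisy last sample (33 instead of 30) shifts the whole
# result, and the error is extrapolated across the full range.
sparse = [(0, 0), (30, 33)]
print(extrapolated_rate(sparse, 0, 60))  # 1.1
```

With only two samples, every bit of scrape-timing jitter or counter noise is stretched across the whole range, which matches the consistently-too-high values seen with the recording rule.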



--
(o- Julien Pivotto
//\ Open-Source Consultant
V_/_ Inuits - https://www.inuits.eu