Differences between Recording Rule and Graphing Raw Expression


Jack Neely

Mar 2, 2017, 12:14:43 PM3/2/17
to Prometheus Users
Greetings,

One of the groups I support hit me with a question that I've not been able to figure out.  I've been encouraging folks to build Grafana dashboards from recording rules.  In particular, when folks want to produce multiple quantiles, it makes a lot of sense to make a recording rule for the rate that is fed to the histogram_quantile() function.

job:feed_api_endpoint_request_latency:rate5m = 
    sum(
      rate(
        http_endpoint_latency_ms_bucket{
          job="api-service",
        }[5m]
      )
    ) without (instance, aurora_shard, status_code)

job:feed_api_endpoint_request_latency:rate5m_p99 =
  histogram_quantile(0.99, job:feed_api_endpoint_request_latency:rate5m)

During testing, a 10,000 ms peak was seen in the 0.99 quantile.  We checked the logs and all request latencies were in the 20 - 30 ms range.  Then we graphed the raw expression in Prometheus and got normal, expected values -- no large 10K peak.

histogram_quantile(0.99, sum(rate(http_endpoint_latency_ms_bucket{environment="prod", endpoint="foobar", method="GET"}

In these histograms 10,000 ms is the largest boundary, and we found that the recording rule version had a brief period where there were observations in the +Inf bucket that were missing from the 10,000 bucket, which explains the sudden peak.  But the question remains: how would a recording rule end up with different data than a graph built from the raw metric data?
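To see why a half-ingested histogram produces exactly the 10,000 ms value, here is a simplified Python sketch of histogram_quantile()'s bucket interpolation (a toy reimplementation for illustration, not Prometheus source; the bucket counts are made up):

```python
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count) pairs,
    ending with the +Inf bucket."""
    total = buckets[-1][1]           # +Inf count = total observations
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Rank falls into +Inf: return the highest finite bound.
                return prev_bound
            # Linear interpolation within the bucket.
            return prev_bound + (bound - prev_bound) * \
                (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# Consistent scrape: all 100 observations were <= 30 ms.
ok = [(20, 40), (30, 100), (10000, 100), (float("inf"), 100)]

# Racy rule evaluation: the +Inf series was already ingested from the
# new scrape, but the 10000 bucket still shows the previous count.
racy = [(20, 40), (30, 100), (10000, 100), (float("inf"), 105)]

print(histogram_quantile(0.99, ok))    # ~29.8 ms, as expected
print(histogram_quantile(0.99, racy))  # 10000: rank lands in +Inf
```

With the inconsistent bucket counts, the 0.99 rank falls into the +Inf bucket, so the function returns the largest finite boundary: exactly the 10,000 ms spike observed.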

Jack

--
Jack Neely
Operations Engineer
42 Lines, Inc.

Brian Brazil

Mar 2, 2017, 12:19:28 PM3/2/17
to Jack Neely, Prometheus Users
It's a race condition. The recording rule was being run while some but not all of the data from a histogram had been ingested. There are various other forms of this that can happen; the main issue is that scrapes are not atomic in ingestion/querying terms.
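The race can be sketched with a toy model (purely illustrative; this is not how Prometheus internals actually work -- a scrape's series are modeled as written one at a time, with a rule evaluation reading in between):

```python
# State after scrape N: both bucket series agree at 100 observations.
tsdb = {"le_10000": 100, "le_inf": 100}

# Scrape N+1 arrives with 5 new observations in every bucket.
scrape = {"le_10000": 105, "le_inf": 105}

# Ingestion is not atomic: series are appended one by one.
tsdb["le_inf"] = scrape["le_inf"]     # +Inf series written first...

snapshot = dict(tsdb)                 # ...and the rule evaluates NOW.
print(snapshot)  # {'le_10000': 100, 'le_inf': 105}: 5 "phantom"
                 # observations appear above the 10000 boundary.

tsdb["le_10000"] = scrape["le_10000"] # the write completes too late
```

An ad-hoc graph of the raw expression later queries a fully written sample set, so it never shows the inconsistency; only a rule that happened to evaluate mid-ingestion records it.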

--

Jack Neely

Mar 2, 2017, 12:52:25 PM3/2/17
to Brian Brazil, Prometheus Users
So this implies that we'd see more of this at higher volumes.  We're presently in testing ahead of future features.

Is there a plan to make this situation better, or is there anything we can do on our side to mitigate this?

Brian Brazil

Mar 2, 2017, 12:55:49 PM3/2/17
to Jack Neely, Prometheus Users
On 2 March 2017 at 17:52, Jack Neely <jjn...@42lines.net> wrote:

There's a few things on the table (atomic scrapes, ingestion-time rules, first-class metrics) that'll help but there's no timelines or certainty that we'll do any of them.

--

Matthias Rampke

Mar 3, 2017, 3:11:54 AM3/3/17
to Brian Brazil, Jack Neely, Prometheus Users

Rather than recording the aggregation of buckets, we go straight to the quantiles we are interested in (usually 0.5, 0.9, 0.99). That works well enough.
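For reference, a sketch of that approach in the same pre-2.0 rule syntax used above (rule name is illustrative, one rule per quantile):

job:feed_api_endpoint_request_latency:p99 =
  histogram_quantile(0.99,
    sum(
      rate(http_endpoint_latency_ms_bucket{job="api-service"}[5m])
    ) without (instance, aurora_shard, status_code)
  )

This records only the final quantile series, with no intermediate recorded bucket aggregation for dashboards to query.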


--
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/CAHJKeLo%3DTqv4UM0_N1BJkyF8izrCFjDqryZjUzOmFd55kSeN8Q%40mail.gmail.com.