One of the groups I support hit me with a question that I haven't been able to figure out. I've been encouraging folks to build Grafana dashboards from recording rules. In particular, when folks want to produce multiple quantiles, it makes a lot of sense to create a recording rule for the rate expression fed to the histogram_quantile() function:
job:feed_api_endpoint_request_latency:rate5m =
  sum(
    rate(
      http_endpoint_latency_ms_bucket{
        job="api-service",
      }[5m]
    )
  ) without (instance, aurora_shard, status_code)

job:feed_api_endpoint_request_latency:rate5m_p99 =
  histogram_quantile(0.99, job:feed_api_endpoint_request_latency:rate5m)
During testing we saw a 10,000 ms peak in the 0.99 quantile. We checked the logs, and all request latencies were in the 20 - 30 ms range. Then we graphed the raw expression in Prometheus and got normal, expected values with no 10,000 ms peak:
histogram_quantile(0.99, sum(rate(http_endpoint_latency_ms_bucket{environment="prod", endpoint="foobar", method="GET"}[5m])) without (instance, aurora_shard, status_code))
In these histograms 10,000 ms is the largest finite bucket boundary, and we found that the recording rule version had a brief period where there were observations in the +Inf bucket that were missing from the 10,000 ms bucket. When the quantile rank falls into the +Inf bucket, histogram_quantile() returns the highest finite boundary, which explains the sudden 10,000 ms peak. But the question remains: how would a recording rule end up with different data than a graph built from the raw metric data?
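For anyone curious why the value pins to exactly 10,000: histogram_quantile() interpolates linearly within the bucket the quantile rank lands in, and when that rank lands in the +Inf bucket it simply returns the highest finite boundary. A rough Python sketch of that logic (my simplification of Prometheus's bucketQuantile, not the real implementation) reproduces the symptom:

```python
def histogram_quantile(q, buckets):
    """buckets: list of (le, cumulative_count) pairs sorted by le;
    the last entry must be (float('inf'), total_count)."""
    total = buckets[-1][1]
    if total == 0:
        return float('nan')
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (le, count) in enumerate(buckets):
        if count >= rank:
            break
    if le == float('inf'):
        # Rank falls in the +Inf bucket: Prometheus returns the
        # highest finite bucket boundary -- our 10,000 ms spike.
        return buckets[-2][0]
    lower_le = buckets[i - 1][0] if i > 0 else 0.0
    lower_count = buckets[i - 1][1] if i > 0 else 0.0
    # Linear interpolation within the bucket.
    return lower_le + (le - lower_le) * (rank - lower_count) / (count - lower_count)

# Healthy series: all observations counted in the finite buckets.
ok = [(10, 0), (20, 50), (30, 100), (10000, 100), (float('inf'), 100)]
print(histogram_quantile(0.99, ok))        # ~29.8 ms, as expected

# Anomaly: 5 observations appear in +Inf but not in the 10,000 bucket.
bad = [(10, 0), (20, 50), (30, 100), (10000, 100), (float('inf'), 105)]
print(histogram_quantile(0.99, bad))       # 10000 -- the phantom peak
```

So even a handful of samples present in +Inf but absent from the 10,000 ms bucket is enough to pin the p99 at the top boundary.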
Jack
--
Jack Neely
Operations Engineer
42 Lines, Inc.