I have a metric with high cardinality, so when I attempt a query like this:
histogram_quantile(0.99, sum(rate(http_request_duration_bucket[15d])) by (le, slo, job, namespace))
I get the error `query processing would load too many samples into memory in query execution`
What is the appropriate way of dealing with something like this?
The label (`route`) that is causing the high cardinality is not needed for this query, so I attempted to create a recording rule that drops the unnecessary labels:
{
  expr: 'sum(http_request_duration_bucket) by (le, slo, job, namespace, pod)',
  record: 'http_request_duration_bucket:slo',
}
but the issue with this approach is that I would then be running `rate` on a `sum` when doing the quantile calculation, and I have noticed that my alerting rules are slower to react.
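For completeness, another variant I sketched (the rule and metric names here are my own placeholders) records the per-interval rate instead, so the later quantile query no longer needs to call `rate` on a `sum`:

```
{
  // Record the 1m bucket rate with the high-cardinality labels dropped;
  // histogram_quantile can then be run directly on this series.
  expr: 'sum(rate(http_request_duration_bucket[1m])) by (le, slo, job, namespace, pod)',
  record: 'http_request_duration_bucket:rate1m',
}
```

This keeps the rate semantics intact, but it still doesn't solve the long-timeframe problem below.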
Ideally I would "roll up" my aggregations with recording rules, something like:
[
  {
    expr: 'histogram_quantile(0.99, sum(rate(http_request_duration_bucket[1m])) by (le, slo, job, namespace))',
    record: 'http_request_duration:99p:1m',
  },
  {
    expr: 'quantile_over_time(0.99, http_request_duration:99p:1m[1h])',
    record: 'http_request_duration:99p:1h',
  },
  {
    expr: 'quantile_over_time(0.99, http_request_duration:99p:1h[1d])',
    record: 'http_request_duration:99p:1d',
  },
]
but this produces some funky results.
What is the proper way of calculating averages, quantiles, or any other aggregate over a large time frame? What if I wanted to see the 99th percentile over the course of a year?
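Naively, I would expect the year-long version to just be the original query with a longer range, which is exactly the kind of query that trips the sample limit (the `1y` range here is only illustrative):

```
histogram_quantile(0.99, sum(rate(http_request_duration_bucket[1y])) by (le, slo, job, namespace))
```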
Any help is much appreciated.