Aggregations on high cardinality metrics


Dimitrije M

Jul 13, 2020, 1:31:21 PM
to Prometheus Users
I have a metric with large cardinality, so when I attempt a query like this:

histogram_quantile(0.99, sum(rate(http_request_duration_bucket[15d])) by (le, slo, job, namespace))

I get the error `query processing would load too many samples into memory in query execution`
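(For reference, as far as I can tell this limit comes from the server's `--query.max-samples` flag, whose default is 50000000. Raising it is a blunt workaround that just trades memory for query reach; the value below is only illustrative:

prometheus --config.file=prometheus.yml --query.max-samples=100000000

)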

What is the appropriate way of dealing with something like this? 

The label (`route`) causing the high cardinality is not needed for this query, so I attempted to create a recording rule that drops the unnecessary labels:

{
    expr: 'sum(http_request_duration_bucket) by (le, slo, job, namespace, pod)',
    record: 'http_request_duration_bucket:slo',
}


but the issue with this approach is that I then end up running `rate` on a `sum` when I do the quantile calculation, and I have noticed that my alerting rules are slower to react.
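For comparison, the pattern I've seen recommended is the other way around: record the per-bucket rate first, then aggregate, and only apply `histogram_quantile` at query time. Something like this (the 5m window, group name, and rule name are just placeholders):

groups:
  - name: slo_rollups
    rules:
      - record: job_namespace:http_request_duration_bucket:rate5m
        expr: sum(rate(http_request_duration_bucket[5m])) by (le, slo, job, namespace)

This keeps `rate` operating on the raw counter, so it should not have the lag problem above, while still dropping the high-cardinality `route` label.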


Ideally I would "roll up" my aggregations with recording rules, something like:
[{
  expr: 'histogram_quantile(0.99, sum(rate(http_request_duration_bucket[1m])) by (le, slo, job, namespace))',
  record: 'http_request_duration:99p:1m'
},
{
  expr: 'histogram_quantile(0.99, http_request_duration:99p:1m[1h]) by (le, slo, job, namespace)',
  record: 'http_request_duration:99p:1h'
},
{
  expr: 'histogram_quantile(0.99, http_request_duration:99p:1h[1d]) by (le, slo, job, namespace)',
  record: 'http_request_duration:99p:1d'
}]


but this produces some funky results.


What is the proper way of calculating averages, quantiles, or any other aggregate over a large timeframe? What if I wanted to see the 99th percentile over the course of a year?
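For the long-range case, one approach I've read about is to keep the recording rule at the bucket level and average the recorded per-bucket rates over the long window, only computing the quantile at the end. Assuming a recorded series `job_namespace:http_request_duration_bucket:rate5m` holding `sum(rate(http_request_duration_bucket[5m])) by (le, slo, job, namespace)`, that would look roughly like:

histogram_quantile(0.99, avg_over_time(job_namespace:http_request_duration_bucket:rate5m[1y]))

Since quantiles don't compose (a quantile of quantiles is not the quantile of the underlying data), averaging the bucket counts and taking a single `histogram_quantile` at the end seems sounder than the rollup-of-quantiles above, but I'd like confirmation.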

Any help is much appreciated. 

