I have a metric with high cardinality, so when I attempt a query like this:
histogram_quantile(0.99, sum(rate(http_request_duration_bucket[15d])) by (le, slo, job, namespace))
I get the error `query processing would load too many samples into memory in query execution`
What is the appropriate way of dealing with something like this?
The label (`route`) that is causing the high cardinality is not needed for this query, so I attempted to create a recording rule that drops the unnecessary labels:
{
  expr: 'sum(http_request_duration_bucket) by (le, slo, job, namespace, pod)',
  record: 'http_request_duration_bucket:slo',
}
but the issue with this approach is that I would then be running `rate` on a `sum` when doing the quantile calculation, and I have noticed that my alerting rules are slower to react.
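For completeness, another variant I sketched (the rule and metric names here are my own placeholders) records the per-interval rate instead, so the later quantile query no longer needs to call `rate` on a `sum`:

```
{
  // Record the 1m bucket rate with the high-cardinality labels dropped;
  // histogram_quantile can then be run directly on this series.
  expr: 'sum(rate(http_request_duration_bucket[1m])) by (le, slo, job, namespace, pod)',
  record: 'http_request_duration_bucket:rate1m',
}
```

This keeps the rate semantics intact, but it still doesn't solve the long-timeframe problem below.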
Ideally I would "roll up" my aggregations with recording rules, something like:
[
  {
    expr: 'histogram_quantile(0.99, sum(rate(http_request_duration_bucket[1m])) by (le, slo, job, namespace))',
    record: 'http_request_duration:99p:1m',
  },
  {
    expr: 'quantile_over_time(0.99, http_request_duration:99p:1m[1h])',
    record: 'http_request_duration:99p:1h',
  },
  {
    expr: 'quantile_over_time(0.99, http_request_duration:99p:1h[1d])',
    record: 'http_request_duration:99p:1d',
  },
]
but this produces some funky results.
What is the proper way of calculating averages, quantiles, or any other aggregate over a large time frame? What if I wanted to see the 99th percentile over the course of a year?
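Naively, I would expect the year-long version to just be the original query with a longer range, which is exactly the kind of query that trips the sample limit (the `1y` range here is only illustrative):

```
histogram_quantile(0.99, sum(rate(http_request_duration_bucket[1y])) by (le, slo, job, namespace))
```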
Any help is much appreciated.