Hello all,
TL;DR: measuring `http_request_duration_seconds` on the query path is a bad proxy for query latency, as it does not account for data distribution or the number of samples/series touched by a query (both of which significantly affect query performance).
---
I'm exploring more granular performance metrics for Prometheus queries downstream in Thanos (inspired by this discussion from Ian Billet) and wanted to reach out to the Prometheus developer community for ideas on how people are measuring and tracking query performance systematically.
The aim is to create a new metric that captures these additional dimensions of a query, so that query-performance SLIs can be understood and quantified in terms of the number of samples/series a query touches.
The current solution I have arrived at is a crude n-dimensional histogram, where query duration is observed/bucketed with labels representing some scale (simplified to t-shirt sizes) of samples touched and series queried. This would allow me to query for query-duration quantiles over given ranges of sample/series counts (e.g. 90% of queries touching up to 1,000,000 samples and up to 10 series complete in less than 2s).
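To make the idea concrete, here is a minimal sketch of the bucketing scheme. The threshold values, label names, and helper functions below are all hypothetical illustrations, not a proposed implementation; in practice the observations would go into a labelled Prometheus histogram (e.g. a HistogramVec with `samples_scale`/`series_scale` labels) rather than the toy dict used here.

```python
import bisect
from collections import defaultdict

# Hypothetical t-shirt-size boundaries; real values would be tuned
# per deployment based on observed query workloads.
SAMPLE_BOUNDS = [1_000, 100_000, 1_000_000]  # -> s, m, l, xl
SERIES_BOUNDS = [10, 100, 1_000]

SIZES = ["s", "m", "l", "xl"]

def scale_label(count, bounds):
    """Map a raw samples/series count onto a t-shirt-size label."""
    return SIZES[bisect.bisect_right(bounds, count)]

# Toy stand-in for a labelled histogram: one list of durations per
# (samples_scale, series_scale) label pair.
observations = defaultdict(list)

def observe_query(duration_s, samples_touched, series_touched):
    """Record a query duration under its size labels."""
    key = (scale_label(samples_touched, SAMPLE_BOUNDS),
           scale_label(series_touched, SERIES_BOUNDS))
    observations[key].append(duration_s)
```

With this shape, the quantile question in the example above becomes: take the durations recorded under `("l", "s")` (up to 1,000,000 samples, up to 10 series) and check their 90th percentile against the 2s target.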
I would love to hear about other approaches members of the community have taken for capturing this level of performance granularity in a metric (as well as to stir the pot w.r.t. the Thanos proposal).
Thanks,
Moad.