Optimizing Histogram Buckets Format


Bashar Al Rawi

Jun 20, 2020, 8:44:20 PM
to Prometheus Developers
Hi all,

I would like to propose a new format for storing bucket counts that can provide substantial improvements for sparse/bi-modal metrics along with changes to histogram_quantile that allow the new format. Both the change and format aren't complex and can be done with backward compatibility. The main breaking change would be adding a new reserved label.

I put together my thoughts in this Google doc that is open for comments and a rough implementation in this commit.
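[Editor's note: for concreteness, here is what the classic cumulative exposition looks like today, next to a rough sketch of what a non-cumulative, `range`-labelled exposition could look like. The exact syntax lives in the linked doc; the `range` values below are illustrative guesses, not the proposal's actual format.]

```
# Current cumulative format: every defined bucket is exposed, even if empty.
request_seconds_bucket{le="0.1"} 0
request_seconds_bucket{le="0.5"} 3
request_seconds_bucket{le="+Inf"} 4
request_seconds_count 4
request_seconds_sum 2.7

# Hypothetical sparse, non-cumulative format: only non-empty buckets appear,
# and each count covers just its own range.
request_seconds_bucket{range="0.1-0.5"} 3
request_seconds_bucket{range="0.5-+Inf"} 1
request_seconds_count 4
request_seconds_sum 2.7
```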

Please, take a look and let me know what your thoughts are.

Bashar

Julius Volz

Jun 21, 2020, 3:23:56 AM
to Bashar Al Rawi, Prometheus Developers, Bjoern Rabenstein
+CC Björn, who has been looking into sparse histograms for a while.

--
You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-devel...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/b3cd52af-3e60-4385-8a0d-b914559ab0b0n%40googlegroups.com.

Brian Brazil

Jun 21, 2020, 5:52:21 AM
to Bashar Al Rawi, Prometheus Developers
On Sun, 21 Jun 2020 at 01:44, Bashar Al Rawi <bashar...@gmail.com> wrote:
Hi all,

I would like to propose a new format for storing bucket counts that can provide substantial improvements for sparse/bi-modal metrics along with changes to histogram_quantile that allow the new format. Both the change and format aren't complex and can be done with backward compatibility. The main breaking change would be adding a new reserved label.

I put together my thoughts in this Google doc that is open for comments and a rough implementation in this commit.

Thanks for your doc. The first challenge here is that this involves a format change. The Prometheus text format is basically fixed at this point; OpenMetrics will be replacing it, but it sticks with the same format for histograms, as OpenMetrics aims to leverage the existing installed base. The duplication in the current format isn't really a problem in practice, as compression handles it. This proposal also loses the benefits of the current format: you can no longer drop overly-granular buckets, nor can you calculate things like Apdex, both of which are important properties. So this would break important use cases and cause busy work across the ecosystem.

The really big issue here is that my understanding is that, even with something like this, you still need ~300 buckets to not have to worry about bucket choices, given the typical range of event sizes. This is well above the ~10 recommended currently when considering other labels, so something would need to be done to make the TSDB handling of histograms ~30x more efficient before this would make sense.


The way to approach this instead is to leave the format as-is, and look at how to optimise this inside the TSDB (and later PromQL) so that more buckets are handled transparently and efficiently for histogram_quantile - but otherwise look the same to PromQL as today.

In short, changing the text format - even though it would be a major breaking change that'd break key use cases - would be the easy bit, and it doesn't actually help. The hard bits are all in the TSDB, and you should try to solve those first, as without TSDB improvements there's no point.

Brian
 


Bashar Al Rawi

Jun 21, 2020, 1:44:49 PM
to Prometheus Developers
Thanks Brian for the thorough response. There are a few points that I would like to make.
>> "This proposal also loses the benefits of the current format, namely that you can no longer drop overly-granular buckets nor can you calculate things like Apdex both of which are important properties."
You can still calculate the Apdex with the new format. In fact, the formula would be exactly the same as on Wikipedia, since we're not using cumulative counts. The main difference is that you will need to sum up the counts across both the tolerating and satisfied ranges (which I agree with you is a bit annoying). I just wanted to call out that you can calculate it.
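[Editor's note: a toy sketch of that claim, assuming a hypothetical {(lo, hi): count} bucket layout, not the proposal's actual data model, and assuming bucket boundaries happen to align with the thresholds t and 4t.]

```python
# Apdex from non-cumulative buckets: satisfied requests finish within t,
# tolerating requests finish within 4t; Apdex = (satisfied + tolerating/2) / total.
def apdex(buckets, t):
    satisfied = sum(c for (lo, hi), c in buckets.items() if hi <= t)
    tolerating = sum(c for (lo, hi), c in buckets.items() if t < hi <= 4 * t)
    total = sum(buckets.values())
    return (satisfied + tolerating / 2) / total

buckets = {(0, 0.1): 60, (0.1, 0.4): 30, (0.4, 1.6): 10}
print(apdex(buckets, t=0.1))  # (60 + 30/2) / 100 = 0.75
```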
I'm not sure I understood the part about dropping overly-granular buckets.

- I agree with you that this won't solve storing 100 buckets per metric. However, the main thing this brings is that you only store the buckets you actually use, versus every single bucket you define. This is a big advantage since it reduces the number of time-series significantly. Let T(n) be the number of time-series, where n is the number of histogram metrics. In the current format, T(n) = N·n exactly (the bound is tight, i.e. Θ(N·n)), where N is the number of buckets defined per metric. In the new format, I would expect T(n) << N·n, because it's a function of the buckets used rather than the buckets defined. This is an important distinction.
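[Editor's note: with some illustrative numbers (assumptions, not measurements from the thread), the arithmetic looks like this.]

```python
# Series-count comparison between the two formats.
n_metrics = 100        # n: number of histogram metrics
buckets_defined = 20   # N: buckets defined per metric
buckets_used = 6       # average buckets actually receiving samples (assumed)

series_current = n_metrics * buckets_defined  # every defined bucket is a series
series_sparse = n_metrics * buckets_used      # only used buckets become series
print(series_current, series_sparse)  # 2000 600
```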

- I'm not saying we should drop the existing format. We can support both. There is nothing really breaking about this besides adding a new reserved label. Depending on the use-case, someone could choose to create buckets with "le" or with "range". histogram_quantile is updated to handle both cases as I did in the commit.
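[Editor's note: for readers unfamiliar with the `le` case, here is a simplified sketch of the classic cumulative-bucket interpolation that histogram_quantile performs. It ignores edge cases the real implementation handles, such as the +Inf bucket and empty histograms; names are illustrative.]

```python
# Estimate the q-quantile from cumulative buckets via linear interpolation
# within the bucket that contains the target rank.
def quantile(q, buckets):
    # buckets: ascending list of (upper_bound, cumulative_count)
    total = buckets[-1][1]
    rank = q * total
    lo, prev = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            return lo + (le - lo) * (rank - prev) / (count - prev)
        lo, prev = le, count

print(quantile(0.5, [(0.1, 10), (0.5, 90), (1.0, 100)]))  # ~0.3
```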

I would also love to learn more about any ideas that came up to optimize this in TSDB.

Brian Brazil

Jun 21, 2020, 2:41:22 PM
to Bashar Al Rawi, Prometheus Developers
On Sun, 21 Jun 2020 at 18:44, Bashar Al Rawi <bashar...@gmail.com> wrote:
Thanks Brian for the thorough response. There are a few points that I would like to make.
>> "This proposal also loses the benefits of the current format, namely that you can no longer drop overly-granular buckets nor can you calculate things like Apdex both of which are important properties."
You can still calculate the Apdex with the new format. In fact, the formula would be exactly the same as on Wikipedia, since we're not using cumulative counts. The main difference is that you will need to sum up the counts across both the tolerating and satisfied ranges (which I agree with you is a bit annoying). I just wanted to call out that you can calculate it.
 
I'm not sure I understood the part about dropping overly-granular buckets.

The advantage of cumulative buckets is that you can drop arbitrary buckets at scrape time without breaking the whole histogram. This is an important tactical tool when dealing with overly granular buckets.
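[Editor's note: a toy sketch of why this works. Because each cumulative count already includes everything below its boundary, deleting one series only coarsens the histogram rather than corrupting it.]

```python
# Cumulative counts: each le-series includes every lower bucket.
cumulative = {0.1: 5, 0.5: 20, 1.0: 30, float("inf"): 32}

# Drop the le="0.5" series (e.g. via a metric_relabel_configs rule at scrape time).
dropped = {le: c for le, c in cumulative.items() if le != 0.5}

# The remaining counts still form a valid, merely coarser, histogram:
# le="1.0" still correctly counts all 30 events <= 1.0.
assert dropped == {0.1: 5, 1.0: 30, float("inf"): 32}
```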
 

- I agree with you that this won't solve storing 100 buckets per metric. However, the main thing this brings is that you only store the buckets you actually use, versus every single bucket you define. This is a big advantage since it reduces the number of time-series significantly.
Let T(n) be the number of time-series, where n is the number of histogram metrics. In the current format, T(n) = N·n exactly (the bound is tight, i.e. Θ(N·n)), where N is the number of buckets defined per metric. In the new format, I would expect T(n) << N·n, because it's a function of the buckets used rather than the buckets defined. This is an important distinction.

This presumes that there are defined buckets which are never used; however, that's rarely the case, and it is easy to tweak in the application by making a more appropriate bucket choice. Keep in mind that, because buckets are counters, it only takes a single event in the lifetime of a process for a bucket to be present. So your proposal doesn't help with the actual problem of how we support more used buckets, as the number of buckets used will be basically the same before and after. This is at best a micro-optimisation, and it would also cause issues with series not always being exposed. We don't need a 10% improvement in histogram handling that applies only near process start; we need a 10x improvement that works at all times.
 

- I'm not saying we should drop the existing format. We can support both. There is nothing really breaking about this besides adding a new reserved label. Depending on the use-case, someone could choose to create buckets with "le" or with "range". histogram_quantile is updated to handle both cases as I did in the commit.

That's a breaking change, as old Prometheus servers and other consumers of the format cannot deal with this new bespoke format. I'm also strongly against there being two ways to represent a histogram; that's going to cause needless confusion.

Getting into format changes is a distraction from the hard TSDB-related problems of histograms, and the problem should be addressed at the TSDB first rather than making disruptive changes that don't really help with the problem at hand.

Brian
 


Bashar Al Rawi

Jun 21, 2020, 4:07:10 PM
to Prometheus Developers
On Sunday, June 21, 2020 at 11:41:22 AM UTC-7 brian....@robustperception.io wrote:
On Sun, 21 Jun 2020 at 18:44, Bashar Al Rawi <bashar...@gmail.com> wrote:
Thanks Brian for the thorough response. There are a few points that I would like to make.
>> "This proposal also loses the benefits of the current format, namely that you can no longer drop overly-granular buckets nor can you calculate things like Apdex both of which are important properties."
You can still calculate the Apdex with the new format. In fact, the formula would be exactly the same as on Wikipedia, since we're not using cumulative counts. The main difference is that you will need to sum up the counts across both the tolerating and satisfied ranges (which I agree with you is a bit annoying). I just wanted to call out that you can calculate it.
 
I'm not sure I understood the part about dropping overly-granular buckets.

The advantage of cumulative buckets is that you can drop arbitrary buckets at scrape time without breaking the whole histogram. This is an important tactical tool when dealing with overly granular buckets.
 

- I agree with you that this won't solve storing 100 buckets per metric. However, the main thing this brings is that you only store the buckets you actually use, versus every single bucket you define. This is a big advantage since it reduces the number of time-series significantly.
Let T(n) be the number of time-series, where n is the number of histogram metrics. In the current format, T(n) = N·n exactly (the bound is tight, i.e. Θ(N·n)), where N is the number of buckets defined per metric. In the new format, I would expect T(n) << N·n, because it's a function of the buckets used rather than the buckets defined. This is an important distinction.

This presumes that there are defined buckets which are never used; however, that's rarely the case, and it is easy to tweak in the application by making a more appropriate bucket choice. Keep in mind that, because buckets are counters, it only takes a single event in the lifetime of a process for a bucket to be present. So your proposal doesn't help with the actual problem of how we support more used buckets, as the number of buckets used will be basically the same before and after. This is at best a micro-optimisation, and it would also cause issues with series not always being exposed. We don't need a 10% improvement in histogram handling that applies only near process start; we need a 10x improvement that works at all times.
 
I agree this isn't a 10x optimization, but I wouldn't call it a 10% improvement or say the number of buckets is the same before and after. It's only the same in the case where people pick their bucket ranges so precisely that every bucket gets at least one sample, and it's unclear whether people actually do that and tweak every metric they have. Even if they did, that wouldn't be true if their metrics were sparse. So I wouldn't call it a micro-optimization at best (unless we're going to say that quick-sort is at best a micro-optimization since its worst case is still the same). It depends here on the ranges of the buckets and the distribution of the data. What would be beneficial is to look at real-world samples and see, on average, how much it reduces the number of time-series.

- I'm not saying we should drop the existing format. We can support both. There is nothing really breaking about this besides adding a new reserved label. Depending on the use-case, someone could choose to create buckets with "le" or with "range". histogram_quantile is updated to handle both cases as I did in the commit.

That's a breaking change, as old Prometheus servers and other consumers of the format cannot deal with this new bespoke format. I'm also strongly against there being two ways to represent a histogram; that's going to cause needless confusion.

Getting into format changes is a distraction from the hard TSDB-related problems of histograms, and the problem should be addressed at the TSDB first rather than making disruptive changes that don't really help with the problem at hand.

I do agree that we still need to solve the underlying problem with bucket cardinality at TSDB most likely. I would love to learn more what has been happening in this space (if any). Are there any ideas discussed or resources that someone can point me to?

Brian Brazil

Jun 21, 2020, 5:06:35 PM
to Bashar Al Rawi, Prometheus Developers
On Sun, 21 Jun 2020 at 21:07, Bashar Al Rawi <bashar...@gmail.com> wrote:


On Sunday, June 21, 2020 at 11:41:22 AM UTC-7 brian....@robustperception.io wrote:
On Sun, 21 Jun 2020 at 18:44, Bashar Al Rawi <bashar...@gmail.com> wrote:
Thanks Brian for the thorough response. There are a few points that I would like to make.
>> "This proposal also loses the benefits of the current format, namely that you can no longer drop overly-granular buckets nor can you calculate things like Apdex both of which are important properties."
You can still calculate the Apdex with the new format. In fact, the formula would be exactly the same as on Wikipedia, since we're not using cumulative counts. The main difference is that you will need to sum up the counts across both the tolerating and satisfied ranges (which I agree with you is a bit annoying). I just wanted to call out that you can calculate it.
 
I'm not sure I understood the part about dropping overly-granular buckets.

The advantage of cumulative buckets is that you can drop arbitrary buckets at scrape time without breaking the whole histogram. This is an important tactical tool when dealing with overly granular buckets.
 

- I agree with you that this won't solve storing 100 buckets per metric. However, the main thing this brings is that you only store the buckets you actually use, versus every single bucket you define. This is a big advantage since it reduces the number of time-series significantly.
Let T(n) be the number of time-series, where n is the number of histogram metrics. In the current format, T(n) = N·n exactly (the bound is tight, i.e. Θ(N·n)), where N is the number of buckets defined per metric. In the new format, I would expect T(n) << N·n, because it's a function of the buckets used rather than the buckets defined. This is an important distinction.

This presumes that there are defined buckets which are never used; however, that's rarely the case, and it is easy to tweak in the application by making a more appropriate bucket choice. Keep in mind that, because buckets are counters, it only takes a single event in the lifetime of a process for a bucket to be present. So your proposal doesn't help with the actual problem of how we support more used buckets, as the number of buckets used will be basically the same before and after. This is at best a micro-optimisation, and it would also cause issues with series not always being exposed. We don't need a 10% improvement in histogram handling that applies only near process start; we need a 10x improvement that works at all times.
 
I agree this isn't a 10x optimization, but I wouldn't call it a 10% improvement or say the number of buckets is the same before and after. It's only the same in the case where people pick their bucket ranges so precisely that every bucket gets at least one sample, and it's unclear whether people actually do that and tweak every metric they have.

I think it's a fairly safe bet that for the typical user this would trim maybe 1-2 buckets, which is ~10%. Outliers in the upper buckets are inevitable, but there's a chance you'd not hit the very lowest buckets if a service was a relatively slow one to start with (and had no fast paths for e.g. auth failing, 404s, or bad user requests).
 
Even if they did, that wouldn't be true if their metrics were sparse.

Other labels' effect on cardinality is orthogonal.
 
So I wouldn't call it a micro-optimization at best (unless we're going to say that quick-sort is at best a micro-optimization since its worst case is still the same). It depends here on the ranges of the buckets and the distribution of the data. What would be beneficial is to look at real-world samples and see, on average, how much it reduces the number of time-series.

Something to keep in mind is that usually it's a very small number of metrics that is responsible for the vast majority of resource usage, so the number of metrics whose buckets you'd need to keep an eye on, optimisation-wise, is accordingly quite small and manageable. As in, probably 1 or 2 metrics.
 

- I'm not saying we should drop the existing format. We can support both. There is nothing really breaking about this besides adding a new reserved label. Depending on the use-case, someone could choose to create buckets with "le" or with "range". histogram_quantile is updated to handle both cases as I did in the commit.

That's a breaking change, as old Prometheus servers and other consumers of the format cannot deal with this new bespoke format. I'm also strongly against there being two ways to represent a histogram; that's going to cause needless confusion.

Getting into format changes is a distraction from the hard TSDB-related problems of histograms, and the problem should be addressed at the TSDB first rather than making disruptive changes that don't really help with the problem at hand.

I do agree that we still need to solve the underlying problem with bucket cardinality at TSDB most likely. I would love to learn more what has been happening in this space (if any). Are there any ideas discussed or resources that someone can point me to?

Björn's talk from the last PromCon would be the main thing.

Brian
 

Bjoern Rabenstein

Jun 22, 2020, 5:09:56 PM
to Bashar Al Rawi, Prometheus Developers
Hi Bashar,

Thanks for your thoughts and ideas.

I more or less agree with all of them, namely:

- We need sparse histograms.

- Cumulative buckets, despite some tactical advantages, are
problematic for sparse histograms, while on the other hand a bit of
math can always emulate dropping buckets or allow Apdex calculation.

- We require a new type of histogram in the exposition format that is
incompatible with the existing format.

- A repetitive representation of buckets in the exposition format is
problematic, and becomes more problematic with more buckets. And no,
compression isn't solving that magically.
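[Editor's note: the "bit of math" for emulating a dropped bucket with non-cumulative counts is just merging adjacent ranges. A toy sketch with a hypothetical data shape, not any proposed implementation:]

```python
# Dropping the 0.1 boundary from non-cumulative buckets: merge the two
# adjacent ranges so every event is still counted exactly once.
buckets = {(0, 0.1): 5, (0.1, 0.5): 15, (0.5, 1.0): 10}
merged = {(0, 0.5): buckets[(0, 0.1)] + buckets[(0.1, 0.5)],
          (0.5, 1.0): buckets[(0.5, 1.0)]}
assert sum(merged.values()) == sum(buckets.values())  # total preserved
print(merged)  # {(0, 0.5): 20, (0.5, 1.0): 10}
```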

I really want histograms to be cheap enough so that they can be
partitioned at will (by status code, path, ...) while still maintaining
a high resolution.

Your approach goes several steps towards this goal.

BUT (and here comes the big "but") it will not go far enough. What we
need, even with sparse histograms, is a histogram implementation that
is efficient enough to support hundreds of buckets in a single
histogram, at a cost comparable to or even lower than what we currently
pay for our existing ~10-bucket histograms. I expect that to
require quite invasive changes, not only to the exposition format but
also to the way we store histograms in the TSDB and ultimately how we
represent and process them in PromQL.

Now you could say: why not iterate and slowly approach the goal? That
would be totally fine for experimental software, and I can only
encourage you to play with your approach in an experimental fork. But
we cannot really have those incremental changes in the mainline
Prometheus releases, as people will use them in production and then
require backward-compatible support. We cannot really have dozens of
mutually incompatible ways of dealing with histograms in the released
Prometheus components.

That's why I've been experimenting for a while. I'm currently writing
up a design doc suggesting a plan for the changes we need throughout
the stack. It will not be a precise and perfect solution, but it will
sketch out the direction along which we can then work together towards
a solution. It will take a while before things have stabilized enough
to land in the regular Prometheus releases. And that's a shame,
because in the meantime, people are left with the existing solution
for their production uses – or they can go down the path of adopting
one of the experimental half-baked solutions (of which there are more
than just yours) to solve their most pressing problems, at the price
of incompatibility with the future "proper" solution.

I'm currently very focused on getting that design doc done because it
will create the stage for further discussions and the foundation of an
informed decision which way to go.

Stay tuned, I'll publish it here on this list, hopefully very soon.
--
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email] bjo...@rabenste.in

Bashar Al Rawi

Jun 23, 2020, 12:54:05 AM
to Prometheus Developers
Thanks a lot Björn for the detailed response. I appreciate you taking the time. I also watched your talk from last year's PromCon and got a bit more context on what you have up your sleeve. In fact, I was reading the DDSketch paper last night. I can't wait to see what you will come up with and feel free to loop me in.

I do agree we should have a single solution that hopefully works for everyone, not one-off breaking changes. Meanwhile, we will likely experiment with my implementation internally and see how it goes.

Bashar
