Exemplars for _count in Summaries


Fabian Stäber

Oct 6, 2022, 5:45:29 PM
to Prometheus Developers
Hi,

Great question from the CNCF Slack: What's the reason why we don't allow Exemplars for _count in Summary metrics?

Use case: Java's Micrometer provides a Summary with just a _count and a _sum as the default metric for HTTP services. The _count has the HTTP status code as a label, so these metrics are great for request rates and error rates (and for average latencies if you have nothing better).

However, there's currently no way to have exemplars. It would be nice to have them, for example for investigating erroneous calls. If the _count was an explicit counter, and not part of a summary, Exemplars would be supported.

What do you think? Any reason why Exemplars don't work in _count in Summaries? Would that be something we could consider supporting?

Fabian

Bryan Boreham

Oct 8, 2022, 8:28:19 AM
to Prometheus Developers
I suspect the answer is that the OpenMetrics spec does not mention exemplars on Summaries.

I don't see any technical obstacle to attaching them in https://github.com/prometheus/client_golang.
Inside Prometheus, Summaries are just two metrics, so no special handling required.

Bryan

Fabian Stäber

Oct 8, 2022, 8:36:27 AM
to Bryan Boreham, Prometheus Developers
Thanks a lot Bryan.

I think if there is no technical obstacle, we should consider allowing Exemplars for counts in OpenMetrics.

Example: Counts are often used for error rates like this:

sum(rate(http_server_duration_count{http_status_code=~"5.."}[5m])) / sum(rate(http_server_duration_count[5m]))
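As a rough illustration of what the query above computes, here is a minimal Python sketch that derives the same error ratio from two scrapes of the _count series. The function name error_ratio and the sample values are made up for illustration, and the sketch ignores counter resets and the extrapolation that PromQL's rate() performs.

```python
import re


def error_ratio(start, end, window_seconds):
    """Approximate sum(rate(5xx)) / sum(rate(all)) from two counter readings.

    start and end map an HTTP status code (string) to the _count value
    observed at the beginning and at the end of the window.
    """
    # Per-second increase of each series over the window
    rates = {
        status: (end[status] - start.get(status, 0.0)) / window_seconds
        for status in end
    }
    total = sum(rates.values())
    # Same selection as the label matcher http_status_code=~"5.."
    errors = sum(r for status, r in rates.items() if re.fullmatch(r"5..", status))
    return errors / total if total else float("nan")


# Hypothetical readings taken five minutes apart
ratio = error_ratio(
    start={"200": 100.0, "500": 5.0},
    end={"200": 190.0, "500": 15.0},
    window_seconds=300,
)
```

With these made-up readings, 10 of the 100 new requests were errors, so the ratio comes out at roughly 0.1.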

If the counts came with Exemplars, we could build a feature in Grafana to visualize them on an "error rate" graph, i.e. you could click on an example of an HTTP 500 error and navigate directly to the corresponding trace, or to the logs filtered by trace ID.

Fabian


--
You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-devel...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/34b54e23-51dc-4cfb-b00a-fc36a8d8c1bdn%40googlegroups.com.

Jonatan Ivanov

Oct 11, 2022, 11:47:03 PM
to Prometheus Developers
Hi,

Thank you for the answer. Is there a chance that we can standardize this in the OpenMetrics specs so that it is well defined and documented? This would help avoid confusion and clearly state what is supported and what is not. Also, since a Counter can already have an Exemplar and _count is semantically a Counter, everything needed should already be in the specs; this is mostly about "wiring" them together.

Also, what do you think about supporting _count in Histograms (since Histogram extends Summary with buckets)?

Thanks,
Jonatan

Bryan Boreham

Oct 12, 2022, 6:57:13 AM
to Prometheus Developers
> Is there a chance that we can standardize this in the OpenMetrics specs

I recommend taking that question to an OpenMetrics list.  Whilst there is overlap between the Prometheus developers and OpenMetrics, no decision could be reached here.

(Prometheus could independently decide to go beyond what OpenMetrics says)

> Also, what do you think about supporting _count in Histograms (since Histogram extends Summary with buckets)?

Maybe your point is that exemplars are tied to the _bucket metrics and not the _count metric?
How exactly would you change this while remaining backwards-compatible for existing users?

Perhaps the upcoming "native histograms" or "sparse histograms" feature will suit what you need?

Bryan

Bryan Boreham

Oct 12, 2022, 7:13:01 AM
to Prometheus Developers

On Saturday, October 8, 2022 at 5:36:27 AM UTC-7 fab...@fstab.de wrote:

> Example: Counts are often used for error rates like this:
>
> sum(rate(http_server_duration_count{http_status_code=~"5.."}[5m])) / sum(rate(http_server_duration_count[5m]))

Small hack: http_server_duration_count is equal to max(http_server_duration_bucket), which does have exemplars.
(I haven't tried this)

> If the counts came with Exemplars, we could build a feature in Grafana to visualize them on an "error rate" graph, i.e. you could click on an example of an HTTP 500 error and navigate directly to the corresponding trace, or to the logs filtered by trace ID.

They will be in the wrong units, however, so likely clamped to the top or bottom of the graph.

Bryan 

Bjoern Rabenstein

Oct 18, 2022, 9:05:29 AM
to Fabian Stäber, Prometheus Developers
On 06.10.22 14:45, 'Fabian Stäber' via Prometheus Developers wrote:
>
> Great question from the CNCF Slack: What's the reason why we don't allow
> Exemplars for _count in Summary metrics?
>
> What do you think? Any reason why Exemplars don't work in _count in
> Summaries? Would that be something we could consider supporting?

The _count of a Summary _and_ the _count of a Histogram (both
conventional as well as the new native ones) is essentially a counter
within the larger "structured" metric of a Summary/Histogram.

From that perspective, it should have the option of attaching an
exemplar, as a regular Counter has, too.

My speculation why it doesn't in OpenMetrics:

In an OM Histogram, the +Inf bucket fulfills exactly the same function
as the _count (spec says: "The +Inf bucket counts all requests.") So
if you would like an exemplar on the _count of a Histogram, you can as
well use an exemplar on the +Inf bucket.
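
The equivalence between the +Inf bucket and _count follows from buckets being cumulative. A minimal Python sketch, with a made-up helper name cumulative_buckets and made-up observations, shows the idea:

```python
import math


def cumulative_buckets(observations, upper_bounds):
    """Return {le: count of observations <= le}, including the +Inf bucket."""
    bounds = sorted(upper_bounds) + [math.inf]
    return {le: sum(1 for v in observations if v <= le) for le in bounds}


# Hypothetical request durations in seconds
observations = [0.05, 0.2, 0.7, 3.0]
buckets = cumulative_buckets(observations, upper_bounds=[0.1, 0.5, 1.0])

# Every observation is <= +Inf, so that bucket equals the total count
inf_count = buckets[math.inf]
```

Because every bucket only ever counts a subset of what +Inf counts, the largest bucket value is also the +Inf value, which is the idea behind taking the maximum over the _bucket series.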

That obviously doesn't help in the case of a Summary, but I guess the
rationale is that Histograms are generally to be preferred over
Summaries, which therefore didn't get the thorough treatment when it
came to exemplars.


However, even if you really dislike the precalculated quantiles in
Summaries, there is still the case of a Summary without quantiles. I
think adding exemplars to such a Summary is as much needed as adding
exemplars to any regular Counter.

--
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email] bjo...@rabenste.in

Fabian Stäber

Oct 18, 2022, 9:29:27 AM
to Prometheus Developers
Side note: In Java this would be particularly useful because the popular Spring Boot framework exposes a Summary http_server_requests_seconds by default that looks like this (no quantiles, just _count and _sum):
# HELP http_server_requests_seconds  
# TYPE http_server_requests_seconds summary
http_server_requests_seconds_count{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/",} 1.0
http_server_requests_seconds_sum{exception="None",method="GET",outcome="SUCCESS",status="200",uri="/",} 1.014687278
I think this is pretty useful, you can get request rates and error rates out of it. If Prometheus / OpenMetrics had support for Exemplars on the _count, users could find example traces per HTTP status and URI.
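
To make the request concrete, here is a small Python sketch of what a _count sample with an exemplar could look like, using the exemplar syntax that OpenMetrics already defines for counters and buckets (a "#" followed by an exemplar label set and a value). The helper format_sample and the trace_id value are made up for illustration, label-value escaping is not handled, and per the discussion in this thread such a line on a Summary _count is not covered by the current OpenMetrics spec.

```python
def format_sample(name, labels, value, exemplar_labels=None, exemplar_value=None):
    """Render one sample line, optionally followed by an exemplar."""
    label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
    line = f"{name}{{{label_str}}} {value}"
    if exemplar_labels is not None:
        # Exemplar part: " # {labels} value"
        ex = ",".join(f'{k}="{v}"' for k, v in exemplar_labels.items())
        line += f" # {{{ex}}} {exemplar_value}"
    return line


# Hypothetical failing request with a made-up trace ID
line = format_sample(
    "http_server_requests_seconds_count",
    {"method": "GET", "status": "500", "uri": "/"},
    1.0,
    exemplar_labels={"trace_id": "abc123"},
    exemplar_value=1.0,
)
```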

Fabian

Jonatan Ivanov

Nov 4, 2022, 1:46:39 AM
to Prometheus Developers
Hi,

Sorry for the late reply. Let me go through these one by one; please bear with me. :)


>I recommend taking that question to an OpenMetrics list. 
Thanks, I will open a thread there too based on the discussion here.

>Prometheus could independently decide to go beyond what OpenMetrics says
Since Micrometer uses the Prometheus Java client, this would solve the issue for us, but if it makes sense, it would be great to have it standardized in OM later.

>Maybe your point is that exemplars are tied to the _bucket metrics and not the _count metric?
Yes, that's exactly what I'm saying, with the caveat that I think this would be useful for Summaries too, not only Histograms.

>How exactly would you change this while remaining backwards-compatible for existing users?
As far as I can see, adding Exemplars to _count should be a backward-compatible change (it is an addition).
Would users be broken because of this?

>Perhaps the upcoming "native histograms" or "sparse histograms" feature will suit what you need?
I'm not sure, but I'm not only talking about Histograms; I'm also talking about Summaries.

>In an OM Histogram, the +Inf bucket fulfills exactly the same function
>as the _count (spec says: "The +Inf bucket counts all requests.") So
>if you would like an exemplar on the _count of a Histogram, you can as
>well use an exemplar on the +Inf bucket.

I think I disagree with the second sentence. Let's say you have an application where processing the first request is significantly slower than the rest (lazy init, populating caches, GC, establishing connections, etc.). In this environment (I think this is true for lots of apps nowadays), it can easily happen that the +Inf bucket is populated with an Exemplar for the first request and never gets updated, because the app will never again be as slow as it was for the first request. Also, nothing guarantees that the +Inf bucket will have an Exemplar at all; maybe the processing was always faster than that. As far as I understand, Exemplars are not cumulative like the "le" counters, so incrementing a bucket does not mean updating its Exemplar (quite the opposite: maybe all of the buckets will be incremented, but only one will get a new Exemplar). The same is true for apps that get significantly faster over time (e.g. thanks to JIT). I think a solution here would be something like give_me_the_last_updated_bucket(my_histogram), or adding an Exemplar to _count.
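
The slow-first-request scenario can be sketched with a toy model in Python. This is a simplification under stated assumptions, not the behavior of any particular client library: the class ExemplarHistogram, the bucket bounds, and the trace IDs are all made up, and the model assumes an observation replaces the exemplar only in the one bucket it falls into.

```python
import bisect
import math


class ExemplarHistogram:
    """Toy model: one exemplar slot per bucket, including +Inf."""

    def __init__(self, upper_bounds):
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.exemplars = [None] * len(self.bounds)

    def observe(self, value, trace_id):
        # Find the first bucket whose upper bound is >= value;
        # only that bucket gets the new exemplar.
        idx = bisect.bisect_left(self.bounds, value)
        self.exemplars[idx] = (trace_id, value)

    def exemplar_for(self, le):
        return self.exemplars[self.bounds.index(le)]


hist = ExemplarHistogram([0.1, 0.5, 1.0])
hist.observe(4.2, "trace-cold-start")  # slow first request lands in +Inf
for i, latency in enumerate([0.05, 0.3, 0.08]):
    hist.observe(latency, f"trace-warm-{i}")  # later requests are fast

stale = hist.exemplar_for(math.inf)
```

In this model the +Inf slot still points at the cold-start request after any number of fast requests, and the le=1.0 slot never receives an exemplar at all.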

>Side note: In Java this would be particularly useful because the popular Spring Boot framework exposes a Summary http_server_requests_seconds by default that look like this (no quantiles, just _count and _sum)
This was actually one of the drivers of this request, users are asking for this from us. I also find it extremely useful: without this I need to create an additional counter which does not seem right since I already have one.

Thanks,
Jonatan

Fabian Stäber

Nov 4, 2022, 3:57:32 AM
to Jonatan Ivanov, Prometheus Developers
Thanks a lot Jonatan.

Just a quick heads-up: We have a Prometheus dev summit on Thursday next week, and I put this on the agenda: https://docs.google.com/document/d/11LC3wJcVk00l8w5P3oLQ-m3Y37iom6INAMEu2ZAGIIE/edit

Fabian


Bryan Boreham

Nov 4, 2022, 4:09:22 AM
to Prometheus Developers
>>> Also, what do you think about supporting _count in Histograms (since Histogram extends Summary with buckets)?
>> How exactly would you change this while remaining backwards-compatible for existing users?
> As far as I can see, adding Exemplars to _count should be a backward-compatible change (it is an addition).

I would like to see what you can see.  Please spell it out for me.

The current API on Histogram is ObserveWithExemplar(), and it adds the exemplar to a _bucket metric.
Would you change the behaviour of that API?  Would you add a new API?  

Bryan

Jonatan Ivanov

Nov 5, 2022, 12:21:03 AM
to Prometheus Developers
Hi,

@Fabian: Thank you very much! Is the summit open, can I join?

@Bryan:
I think I would first decide whether it makes sense for the Prometheus server to be able to process Exemplars on _count. If so, I would then look into the different clients and their APIs.

For Histograms, I don't think this should result in a client API change, I think this should only affect the behavior of the implementation. Also, I think source, binary, and behavioral compatibility can be kept. I think it does not matter where the Exemplar is coming from (directly from the user or from the sampler), the interesting part happens after the implementation got the Exemplar. In the case of Histograms, this would mean not just updating the reference of the Exemplar of the current bucket but also updating an extra reference to the latest Exemplar (that actually belongs to _count). So the histogram would hold references to N+1 exemplars where N is the number of buckets and the +1 Exemplar is the latest recorded one.
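
A minimal Python sketch of the "N+1 exemplars" idea described above, assuming a toy histogram rather than any real client library. The class and method names are hypothetical, and N here counts the +Inf bucket as one of the buckets.

```python
import bisect
import math


class HistogramWithCountExemplar:
    """Toy histogram holding N per-bucket exemplars plus one for _count."""

    def __init__(self, upper_bounds):
        self.bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_exemplars = [None] * len(self.bounds)  # the N references
        self.count_exemplar = None  # the "+1": latest recorded exemplar
        self.count = 0
        self.sum = 0.0

    def observe_with_exemplar(self, value, exemplar_labels):
        self.count += 1
        self.sum += value
        exemplar = (exemplar_labels, value)
        # Existing behavior in this model: update the matching bucket only
        idx = bisect.bisect_left(self.bounds, value)
        self.bucket_exemplars[idx] = exemplar
        # Proposed addition: always remember the latest exemplar for _count
        self.count_exemplar = exemplar


h = HistogramWithCountExemplar([0.1, 0.5])
h.observe_with_exemplar(2.0, {"trace_id": "a"})
h.observe_with_exemplar(0.05, {"trace_id": "b"})
```

The method signature stays the same in this sketch; only the internal bookkeeping grows by one reference, which is why the change can be read as backward compatible for callers.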

For Summaries, this needs a client API change (an addition), since right now Summaries do not support Exemplars. I think this should be very similar to the Exemplar support of Counters.

I would like to call out two things:
- I'm not an expert in any of the Prometheus clients (I have only used the Java and the JavaScript clients).
- I think I would not even be affected by the changes of the client APIs since Micrometer is not using these, it directly creates a Collector.MetricFamilySamples.Sample instead (that can accept Exemplars as of today). So if the Prometheus server could process Exemplars on _count, I think my use-case should be covered. But I would definitely add the support to the client APIs too so that users who use the clients can enjoy these features.
 
Thanks,
Jonatan

Bryan Boreham

Nov 9, 2022, 10:17:56 AM
to Prometheus Developers
Would I be right in thinking this is the code which tripped you up?
This is insisting that exemplars only work on metrics ending in "_bucket" or "_total".

Personally I would be fine with relaxing that, although it does seem strictly aligned with the OpenMetrics spec.
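
The kind of check described above, and what relaxing it would mean, can be sketched in Python as follows. This is a hypothetical illustration only: it is not the Prometheus code being referred to, and both function names are made up.

```python
def exemplar_allowed_strict(metric_name):
    """Accept exemplars only on names ending in _bucket or _total."""
    return metric_name.endswith(("_bucket", "_total"))


def exemplar_allowed_relaxed(metric_name):
    """Relaxed variant: accept exemplars on any series."""
    return True


name = "http_server_requests_seconds_count"
strict_result = exemplar_allowed_strict(name)    # rejected by the suffix rule
relaxed_result = exemplar_allowed_relaxed(name)  # accepted once relaxed
```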

Fabian Stäber

Nov 11, 2022, 6:57:54 AM
to Prometheus Developers
Hi,

Good news everyone: We discussed it at the Prometheus Dev Summit yesterday, and here's the result:

CONSENSUS: Prometheus will ingest Exemplars on all time series.

This includes the _count time series for Summary metrics. The next steps are:

* Create a PR in Prometheus, as currently Exemplars are discarded for these series.
* Allow client libraries to add Exemplars everywhere.

Strictly speaking the client libraries will then no longer produce compliant OpenMetrics format as long as the OpenMetrics spec isn't changed, but we can start implementing now and get back to the spec later.

Fabian



Jonatan Ivanov

Nov 15, 2022, 2:50:12 PM
to Prometheus Developers
Hi,

@Bryan: Yes, it seems so.

@Fabian: This is great news! Thank you all!
Do you think it would be worth starting a discussion with the OpenMetrics audience about this in the meantime?

Thanks,
Jonatan
