Accurate total over time?

4,989 views

kevinbla...@gmail.com

Jul 13, 2018, 8:45:56 AM
to Prometheus Users
Is there a way to get an accurate count from a counter over a week or a month?

Say we have a counter messages_received_total{sourcesystem="a"} and there are only 10 source systems. The 10 source systems might come up at different times, so over 30 days one of them might only have been running for 15 days. I know I can show this in a graph over time and the results will be very accurate, but the business says they need the number of messages received over a week and over a month, displayed in a Grafana singlestat panel. I'm sure this is a common thing people try to do, so I'm hoping to get a clear answer here. I've tried leveraging the increase function, but both Grafana and Prometheus seem to limit the results at different levels: Grafana adds an automatic step that grows with the time range, which immediately means some message stats will be missed. Doing a sum(max_over_time(...)) gets pretty close, but still doesn't line up exactly with the total, and I'm not sure why that wouldn't be 100% accurate, since it seems it should be looking at every single sample over the entire time. Maybe it doesn't actually work that way?

I'd like to understand what options exist for getting an accurate number. I get that the answer might just be "go do it in Splunk or similar, since that is built for counting events", but I'd like to know for sure that it's just not possible in Prometheus. I get the feeling that if it's not possible in a direct query, the route might be a recording rule. If that is really required, would it just be a recording rule that gives the number of messages over, say, a 5-minute window? I'm still not clear on the right way to add up all the 5-minute totals over a 30-day window.

This is likely the most common question I get from users, so any input here is GREATLY appreciated!


Brian Brazil

Jul 13, 2018, 8:48:15 AM
to kevinbla...@gmail.com, Prometheus Users
If you want 100% accurate, then a metrics-based solution like Prometheus is not going to cut it and you should look at something logs-based like Splunk. Prometheus can produce numbers whose accuracy is more than sufficient for operational purposes, but they'll never be 100% correct.

Brian
 

Alin Sînpălean

Jul 13, 2018, 9:45:04 AM
to Prometheus Users
I'm merely a user and sometimes contributor, but I happen to disagree with Brian on this particular issue.

First of all, to get this out of the way, it's true you can't get 100% accurate values from a metrics-based system. There are many reasons for that, but I'll only touch on 2:
 (1) Metrics-based systems necessarily use sampling, and you may not be able to have 2 samples that are exactly 30 days apart (there will always be a few seconds or minutes of slip). You can either interpolate/extrapolate (which is what Prometheus does) or take an exact difference between the two samples, which then doesn't cover _exactly_ 30 days.
 (2) When a monitored process gets restarted, the counter drops to zero (and you can account for that), but it's very likely that a handful more events/messages/whatever happened between the moment of the last scrape and the moment the process actually terminated. Those events/messages/whatever never happened from the point of view of Prometheus.

You could argue that the same kind of issues exist with logs, i.e. it's possible that the timestamp of log events is skewed (because the clock on one machine is off by a few seconds/minutes), so you may end up including/excluding events that happened (strictly speaking) outside/inside the exact 30 day range. Similarly, when a machine crashes and burns, it's possible that some of the events it has logged locally just before the crash are lost because they were never pushed into/ingested by the logs analytics system. Less of an effect, but still not 100% by any stretch of the imagination.

Now going back to your question: I am taking your statement that "graph [...] results will be very accurate" to mean that you/your users are perfectly happy with the accuracy of the numbers provided by Prometheus and you're simply looking for a way of getting one number for the last 30 days, accurate to the extent that Prometheus' numbers are.

The simplest way to get what you want is to do:

    sum(increase(messages_received_total[30d]))

and represent that in Grafana as a singlestat panel with "Instant" checked (you only need the last value, and it's expensive to compute on the fly anyway). There will be some minor artifacts due to the fact that Prometheus takes the values strictly within the interval and extrapolates, but that error is proportional to scrape_interval / 30d (it's much worse for very short ranges), so unless you run into performance issues with Prometheus, you're all set.
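
To put a rough number on that error (the 30-second scrape interval below is an assumption for illustration, not something stated in this thread):

    30s / 30d ≈ 0.001%   # extrapolation error over a 30-day range
    30s / 5m  = 10%      # the same error over a 5-minute range

so over a full month the result is about as accurate as the underlying samples allow.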

The other alternative, which you hint at, is to compute the increase over shorter ranges (say 5 minutes) as a recording rule, then sum those values over the last 30 days to get the total number of messages (there's a sketch of this after the list below). There are a couple of pitfalls here, though:
 (1) If one of your systems is not scraped for longer than 5 minutes (or whatever your choice of interval is) then you'll lose all increases from that system for that period.
 (2) As noted above, Prometheus takes the values falling strictly within the range and extrapolates to the whole range. This estimation error will be much more visible here (just as it is on your graphs), as it's now proportional to scrape_interval / 5m. To work around that, what I actually do is record
    increase(metric[5m + scrape_interval]) * 5m / (5m + scrape_interval)
    with all the values hardcoded. :(
 (3) You will either have to make sure the rule is evaluated exactly every 5 minutes, or, if it's evaluated more frequently, divide the final number by the ratio 5m / eval_interval (because otherwise you're counting every message multiple times).
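
To make that concrete, here is a minimal sketch of the recording rule plus the 30-day roll-up. The rule name, metric name, 5-minute window and 30-second scrape interval are all assumptions for illustration, not something from this thread:

    # rules.yml (sketch): assumes a 30s scrape interval
    groups:
      - name: messages_received_5m
        interval: 5m    # evaluate exactly once per window, see pitfall (3)
        rules:
          - record: sourcesystem:messages_received:increase5m
            # hardcoded correction from pitfall (2): 5m / (5m + 30s)
            expr: increase(messages_received_total[330s]) * 300 / 330

    # 30-day total, e.g. as an "Instant" singlestat query:
    sum(sum_over_time(sourcesystem:messages_received:increase5m[30d]))

With the rule evaluated exactly every 5 minutes the windows don't overlap, so sum_over_time simply adds up the per-window increases and no division by an evaluation ratio is needed.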

Cheers,
Alin.


Sergio Bilello

Sep 24, 2019, 9:18:30 PM
to Prometheus Users
Hello guys,

If I have the metric http_server_requests_count and I want to know how many requests happened in a time interval, what should the formula be? I don't have an http_server_requests_total that would solve the problem with the increase function.

Thanks,

Sergio

Aliaksandr Valialkin

Sep 25, 2019, 4:50:38 AM
to Sergio Bilello, Prometheus Users
Probably `sum_over_time(http_server_requests_count[interval])` would work for you if `http_server_requests_count` resets after each scrape.
 

Alin Sînpălean

Sep 25, 2019, 5:58:06 AM
to Prometheus Users

If your counter resets after each scrape, you've got bigger problems than how to count requests in a given time interval. E.g. if you later decide to go with a basic HA setup, i.e. 2 Prometheus replicas, your requests will be arbitrarily split between one Prometheus instance and the other.

Your safest bet is to go with a counter (in the service you're monitoring) and then do `increase(counter[interval])`. Anything else will either break or cause you lots of pain down the road.
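
For illustration, a hedged sketch of that approach (the counter name below is hypothetical, not one of the metrics mentioned above), giving the number of requests over the last hour:

    sum(increase(http_server_requests_total[1h]))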

Cheers,
Alin.

Sergio Bilello

Sep 30, 2019, 9:23:46 PM
to Prometheus Users
Thanks for the reply, but unfortunately it does not. Those metrics are exported with the default @Timed annotation provided by the https://docs.spring.io/spring-boot/docs/current/reference/html/production-ready-metrics.html#production-ready-metrics-spring-mvc project.

Brian Brazil

Oct 1, 2019, 5:13:25 AM
to Sergio Bilello, Prometheus Users

You want rate() in that case.
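
A hedged illustration of what that might look like (the 1-hour range is arbitrary):

    sum(rate(http_server_requests_count[1h])) * 3600

i.e. the average per-second rate over the hour scaled back up to an approximate request count, which is effectively what increase() computes for a regular counter.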

Brian
 