rate() occasionally returns zeroes on small time window change


hai...@gmail.com

Jan 3, 2018, 10:29:01 PM
to Prometheus Users

Good day,

P8s 2.0.0 here.

I have a Summary metric that tracks API call latency. I scrape it every 10 seconds. My problem is that the rate() function sporadically returns zeroes if the start/end time spec moves a bit; and rate(sum)/rate(count) is almost always zero. Here are a couple of screenshots:

Here is a proper situation - raw increasing counter and its rate per minute:


Now if I wait some time (up to a minute) and hit refresh, I'll see this:



I.e. the counter is growing the same way, but rate returns zeroes for all but the major increases.


Here are my queries:

Green: 60 * sum(rate(api_calls_duration_seconds_count[1m]))

Yellow: 60 * sum(avg_over_time(api_calls_duration_seconds_count[1m]))


Moreover, if I try to calculate latency (red) by dividing _sum by _count, it only comes up for the second major spike, even when rate() gets it right:



Here is the query for latency:

sum(increase(api_calls_duration_seconds_sum[1m]))
/
sum(increase(api_calls_duration_seconds_count[1m]))



Another way to demonstrate it:


$ curl -s 'http://localhost:9090/api/v1/query_range?query=60%20*%20sum(rate(api_calls_duration_seconds_count%5B1m%5D))&start=1515031319&end=1515034907&step=60' | grep -o '"0"' | wc -l

46

$ curl -s 'http://localhost:9090/api/v1/query_range?query=60%20*%20sum(rate(api_calls_duration_seconds_count%5B1m%5D))&start=1515031320&end=1515034907&step=60' | grep -o '"0"' | wc -l

58


I.e. I moved the start time 1 second forward and there are 12 more zero points in the result, i.e. all those small peaks from the first screenshot are gone.


Any ideas what's happening here?


Thanks,
Zaar








Simon Pasquier

Jan 4, 2018, 3:18:00 AM
to hai...@gmail.com, Prometheus Users
Have you tried to display the same queries in the Prometheus UI? Does it exhibit the same behavior?
Also what is your resolution in Grafana?
FWIW there are a couple of similar issues reported against Grafana and Prometheus:
https://github.com/grafana/grafana/issues/9705
https://github.com/prometheus/prometheus/issues/2364
Simon

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscribe@googlegroups.com.
To post to this group, send email to prometheus-users@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/fedb1ff1-c9f9-43ef-a24a-c98a6969f32b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

hai...@gmail.com

Jan 4, 2018, 4:00:23 AM
to Prometheus Users
Hi Simon,

Thanks for the links, but this is not the behavior I'm experiencing - they talk about jumpy graphs, while I'm in a situation where I don't get any data at all. I can reproduce it with raw P8s API queries (see my last curls).

Since I use a fixed 1m step and pass a fixed 1m interval to rate(), I would expect that the whole time period would be covered one way or another, i.e. rate() would return results. However, when the start value is shifted left/right a little bit, mostly zeroes are returned. I've tried rate() windows wider than the step, but the problem still happens (though a tiny bit less). I've rerun the following curl:

With start values from 1515031310 to 1515031390, and rate windows of 60, 61 and 67 seconds (step remained at 60 seconds for all 3). Here are the results ("." signifies OK data, "!" missing data, i.e. mostly zeroes):


60: ..........!!!!!!!!!!..................................................!!!!!!!!!!.

61: ...........!!!!!!!!!...................................................!!!!!!!!!.

67: .................!!!.........................................................!!!.
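The alignment effect can be sketched with a toy model in Python (all timestamps and values below are hypothetical, and the rate computation is simplified - no extrapolation, no counter-reset handling): a counter that only bumps occasionally yields rate == 0 for any window whose samples are all equal, so window alignment decides whether the bump is visible at all.

```python
# Toy model: scrapes every 10s, counter increments exactly once,
# between the scrapes at t=60 and t=70.
scrape_ts = list(range(0, 200, 10))

def value(t):
    return 1 if t >= 65 else 0

def window_rate(end, width):
    """Simplified rate(): per-second slope between the first and last
    sample inside the half-open window (end - width, end]."""
    pts = [(t, value(t)) for t in scrape_ts if end - width < t <= end]
    if len(pts) < 2:
        return None
    (t0, v0), (tn, vn) = pts[0], pts[-1]
    return (vn - v0) / (tn - t0)

# 60s windows stepped every 60s: the increase straddles two windows
# and is invisible to every window ending at 60, 120, 180...
aligned = [window_rate(e, 60) for e in (60, 120, 180)]   # all zeroes
# ...but shift the window ends by 10s and one window catches it:
shifted = [window_rate(e, 60) for e in (70, 130, 190)]   # nonzero appears
```

This is the same effect as the curl experiment: moving the start a little changes which samples land in each window, and increases that fall between the last sample of one window and the first sample of the next are never counted.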


It is also quite puzzling why my latency query returns mostly nothing.


Zaar Hai

Apr 4, 2018, 3:09:56 AM
to alin.si...@gmail.com, Prometheus Users
Hi Alin,

Thanks for the elaboration. I read all of your threads/tickets on the subject and I hope your xrate fork will see the light of day.

I'll go with the workaround you suggest, though it means using a fixed interval in Grafana (as you surely know).

Thanks again,
Zaar


On Sat, 31 Mar 2018, 17:02 <alin.si...@gmail.com> wrote:
Hey Zaar,

In case you're still looking for an answer, the one I found was to use

    increase(foo[70s]) * 60 / 70

wherever I would have wanted to use increase(foo[60s]) instead (all this assuming a 10s collection interval). The reasoning behind it is to compute an increase over 6 successive collection intervals (i.e. 7 successive data points, or a 70s window). But Prometheus will see that it's actually (on average) 60s worth of data and extrapolate it to 70s. Hence the * 60 / 70 part at the end to undo that extrapolation. It is not obvious by any stretch of the imagination, but it will produce much better numbers than increase().
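The arithmetic behind the * 60 / 70 correction can be sketched in Python (a deliberately simplified model: the real rate()/increase() implementation also handles counter resets and clamps extrapolation near window boundaries):

```python
# Simplified model of Prometheus-style extrapolation (assumption: no
# counter resets, no boundary clamping - just slope * window width).
def extrapolated_increase(samples, window):
    """samples: ascending (timestamp, value) pairs inside the window."""
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    slope = (vn - v0) / (tn - t0)   # per-second rate over the covered span
    return slope * window           # extrapolated to the full window width

# 10s scrapes, counter +1 per scrape: a 70s window holds 7 samples,
# which cover only 60s of actual data.
samples = [(10 * i, i) for i in range(7)]   # ts 0..60, values 0..6
inc = extrapolated_increase(samples, 70)    # extrapolated: ~7.0
corrected = inc * 60 / 70                   # undo extrapolation: ~6.0
```

The true increase over those 7 samples is 6, but extrapolating the 60s-wide span to the 70s window inflates it to ~7; multiplying by 60/70 recovers the real number.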

Or you can do

    foo - foo offset 1m

but that will not take into account counter resets (if you care about them at all) and will take twice as much CPU.

Cheers,
Alin.


Alin Sînpălean

Apr 4, 2018, 4:35:15 AM
to Prometheus Users
Hey Zaar,

I am using said workaround in a recording rule, not directly in a Grafana query. It is faster than on-the-fly evaluation and (because it covers 6/6 increases rather than 5/6, as rate(foo[60s]) would) it is aggregatable over time.

I.e. my Grafana query is sum_over_time(foo:increase_60s), where foo:increase_60s is a recording rule defined as increase(foo[70s]) * 60 / 70. You still have to fix the interval, but you do it in the Prometheus config, close to where you define your scrape and eval intervals.
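Spelled out as a Prometheus rules file, that setup might look like this (a sketch: `foo` and the group name are placeholders, and the 60s group interval mirrors the fixed step discussed in this thread):

```yaml
groups:
  - name: increase_rules        # placeholder group name
    interval: 60s               # fixed eval interval, set in Prometheus config
    rules:
      - record: foo:increase_60s
        expr: increase(foo[70s]) * 60 / 70
```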

One of the bones of contention regarding my xrate() proposal is that because evaluation doesn't happen exactly every 10 seconds, it is still possible for some data points collected right around the time of an eval to be included in 2 successive rate calculations (or none). So it's not a perfect solution, and thus not an improvement. (!?)

Cheers,
Alin.

Zaar Hai

Apr 16, 2019, 6:42:46 AM
to Alin Sînpălean, Prometheus Users
Whoever runs into this issue and lands here - here is the article (not mine) that outlines the issue very well (scroll to the last part): https://www.stroppykitten.com/technical/prometheus-grafana-statistics




--
Zaar

Alin Sînpălean

Apr 17, 2019, 5:51:33 AM
to Prometheus Users
On Tuesday, April 16, 2019 at 12:42:46 PM UTC+2, Zaar Hai wrote:
Whoever runs into this issue and lands here - here is the article (not mine) that outlines the issue very well (scroll to the last part): https://www.stroppykitten.com/technical/prometheus-grafana-statistics

I would beg to differ with (most of) the conclusions of that blog post. (I would comment directly on the blog post, but comments are disabled.)

It does identify one issue: calculating a rate over a fixed range with varying resolution is bad. And it describes the obvious (if slightly broken, due to Prometheus' rate() limitations) solution.

But halfway down, it describes a solution that was unfortunately implemented in Grafana: forced aligning of the ranges to the step. It does remove all jitter from Grafana graphs, while at the same time making it impossible (when using a range equal to $__interval) to ever see any increases that occur between the end of one range and the beginning of another. So it basically sweeps the problem under the rug, making it even more unlikely that there'll ever be enough pressure to fix the underlying issue in Prometheus. (Because most Prometheus + Grafana users will now never learn that some of their data, always the same, is consistently being thrown away.)

And later on in the post, the author goes on to observe that "Prometheus is doing the right thing given the exposed API of rate interval and step, and the fix needs to be in the client". I.e. Prometheus is right to throw away some data and replace it with extrapolation, and clients should be aware of the resolution of the underlying data and reverse engineer Prometheus' implementation of rate()/increase() to provide usable, intuitive data. Makes sense. </sarcasm>

Alin.