Visualisation of histogram metrics.

shk...@gmail.com
Apr 27, 2018, 8:12:23 AM
to Prometheus Users
Hello, everyone.
I'm using the official Python client library for Prometheus.
I write access logs in this format:
log-format =  - - %(user) [%(ltime)] "%(method) %(uri) %(proto)" %(status) %(size) "%(referer)" "%(uagent)" [%(msecs)]

Then I parse it and record the metrics like this:

from prometheus_client import Histogram

prom_timings = Histogram('response_time', 'handlers time response', ['handle'])

...

# counters maps each handler URL to the response times parsed from the log
for url, tm in counters.iteritems():
    for time in list(tm.elements()):
        prom_timings.labels(url).observe(float(time))

As you can see, float(time) is in milliseconds.

When I checked it in Grafana, I used this query:
sum(rate(response_time_bucket{project="news", handle="$handle"}[$interval])) 
 /
sum(rate(response_time_count{project="news", handle="$handle"}[$interval]))


The result doesn't look right. Also, I've checked the scraped metrics, and most requests for this handler are under le="0.025" or le="0.5":
response_time_bucket{handle="/regions/info/",le="0.005"} 0.0
response_time_bucket
{handle="/regions/info/",le="0.01"} 17.0
response_time_bucket
{handle="/regions/info/",le="0.025"} 36.0
response_time_bucket
{handle="/regions/info/",le="0.05"} 36.0
response_time_bucket
{handle="/regions/info/",le="0.075"} 36.0
response_time_bucket
{handle="/regions/info/",le="0.1"} 36.0
response_time_bucket
{handle="/regions/info/",le="0.25"} 36.0
response_time_bucket
{handle="/regions/info/",le="0.5"} 36.0
response_time_bucket
{handle="/regions/info/",le="0.75"} 36.0
response_time_bucket
{handle="/regions/info/",le="1.0"} 36.0
response_time_bucket
{handle="/regions/info/",le="2.5"} 36.0
response_time_bucket
{handle="/regions/info/",le="5.0"} 36.0
response_time_bucket
{handle="/regions/info/",le="7.5"} 36.0
response_time_bucket
{handle="/regions/info/",le="10.0"} 36.0
response_time_bucket
{handle="/regions/info/",le="+Inf"} 36.0
response_time_count
{handle="/regions/info/"} 36.0
response_time_sum
{handle="/regions/info/"} 0.40200000000000025

Then I found an article:

It looks like the 'le' label should contain seconds. So I changed the code to:
prom_timings.labels(url).observe(float(time) / 1000)

And got this:


Well, my questions are:
  • Why did the first picture show wrong results?
  • What value should I pass to prom_timings.labels(URL).observe(VALUE)?
Thank you!



Cody Boggs
Apr 27, 2018, 10:09:34 AM
to shk...@gmail.com, Prometheus Users
To get a good idea of what is happening here, it's worth refreshing the memory on what a histogram is doing under the hood. In essence, a histogram (prometheus or otherwise) is simply a collection of "buckets," each of which tracks a counter for the number of times an observed value fell into that bucket's "range".

From here we can see that the values in the individual buckets are without units - they're not seconds, days, volts, or moose. :-) They're just counters. Instead, the bucket name (or "le" label in the case of Prometheus) is what indicates the range of values that are handled by the counter for that bucket. So in your case, the value of the "le" label is the range of milliseconds accounted for by the bucket.
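
To make that concrete, here's a tiny, self-contained sketch using the Python client (the metric name and the observed values are made up purely for illustration). Note that the buckets are cumulative: every bucket whose upper bound is greater than or equal to the observed value gets incremented.

from prometheus_client import Histogram, generate_latest

# Uses the client's default buckets: 0.005, 0.01, 0.025, ..., 7.5, 10.0, +Inf
demo = Histogram('demo_duration', 'illustration only')

for value in (0.003, 0.02, 0.02, 0.3):
    demo.observe(value)

# Cumulative bucket counters after the observations above:
#   le="0.005" -> 1.0   (only 0.003 is <= 0.005)
#   le="0.025" -> 3.0   (0.003, 0.02, 0.02)
#   le="0.5"   -> 4.0   (all four observations)
#   le="+Inf"  -> 4.0
print(generate_latest().decode())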

We can take that knowledge and apply it to your first question - why does the first picture look "wrong"? The query you shared for the first screenshot is simply taking the sum of the rates of change of all buckets (which is a unit-less value) and dividing it by the total number of observations made. This leaves you with a value that sort of resembles an average value for all observed values, but not quite.

Instead, you can run a query like the following using histogram_quantile() to calculate a percentile. A query like this will generally give you a more helpful graph:
histogram_quantile(0.99, sum(rate(response_time_bucket{handle="/regions/info/"}[1m])) by (le))

This will give you the 99th percentile of the response time for the "/regions/info/" handle, and the result will be in milliseconds. 
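
If it helps to see what that function is doing under the hood, here's a rough Python sketch of the interpolation histogram_quantile performs over the cumulative bucket counts (simplified; the real function has extra handling for the +Inf and lowest buckets, among other edge cases):

def rough_histogram_quantile(q, buckets):
    """buckets: list of (upper_bound, cumulative_count) pairs, sorted by upper_bound."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # Assume observations are spread linearly within the bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# With the /regions/info/ numbers you posted, the 0.99 quantile falls in the le="0.025" bucket:
buckets = [(0.005, 0), (0.01, 17), (0.025, 36), (0.05, 36), (0.5, 36), (10.0, 36)]
print(rough_histogram_quantile(0.99, buckets))  # ~0.0247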

Hopefully that's helpful!

Cody Boggs | Yak Shaver (Ops)

FreshTracks.io

Intelligent Alerting for Kubernetes and Prometheus



shk...@gmail.com
Apr 27, 2018, 2:25:12 PM
to Prometheus Users
Hello, Cody! Thank you for the fast reply!
Now I see what I had missed. :)

Well, let me provide some details. I'm working on the performance of our application,
and I would like to see:
  • the average response time for every handler;
  • which bucket ranges each handler's responses fall into (the histogram),
so I can find the slowest handlers, look at their response-time distribution, and then dive deeper.

So, to get the average I just need:
sum(rate(response_time_sum{project="news", handle!~".*admin.*|.*version.*|.*robots.*"}[$interval])) by (handle)
/
sum(rate(response_time_count{project="news", handle!~".*admin.*|.*version.*|.*robots.*"}[$interval])) by (handle)
That part is fine.

About the second one I have some suspicions, because of this example from the documentation:
  sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (job)
/
  sum(rate(http_request_duration_seconds_count[5m])) by (job)

"You can approximate the well-known Apdex score in a similar way. Configure a bucket with the target request duration as the upper bound and another bucket with the tolerated request duration (usually 4 times the target request duration) as the upper bound. Example: The target request duration is 300ms"

So I guess that Prometheus's buckets are meant for durations in seconds. Also, when I observed the metrics in milliseconds and looked at what the target exposes, everything lay above 10.0 (only the +Inf bucket is non-zero):
response_time_bucket{handle="/v1/clusters/topic/all/",le="0.005"} 0.0
response_time_bucket
{handle="/v1/clusters/topic/all/",le="0.01"} 0.0
response_time_bucket
{handle="/v1/clusters/topic/all/",le="0.025"} 0.0
response_time_bucket
{handle="/v1/clusters/topic/all/",le="0.05"} 0.0
response_time_bucket
{handle="/v1/clusters/topic/all/",le="0.075"} 0.0
response_time_bucket
{handle="/v1/clusters/topic/all/",le="0.1"} 0.0
response_time_bucket
{handle="/v1/clusters/topic/all/",le="0.25"} 0.0
response_time_bucket
{handle="/v1/clusters/topic/all/",le="0.5"} 0.0
response_time_bucket
{handle="/v1/clusters/topic/all/",le="0.75"} 0.0
response_time_bucket
{handle="/v1/clusters/topic/all/",le="1.0"} 0.0
response_time_bucket
{handle="/v1/clusters/topic/all/",le="2.5"} 0.0
response_time_bucket
{handle="/v1/clusters/topic/all/",le="5.0"} 0.0
response_time_bucket
{handle="/v1/clusters/topic/all/",le="7.5"} 0.0
response_time_bucket
{handle="/v1/clusters/topic/all/",le="10.0"} 0.0
response_time_bucket
{handle="/v1/clusters/topic/all/",le="+Inf"} 1031.0

The visualization looked like this:


Something is clearly off here :(

So, as a result, I now observe the metrics in seconds (dividing the millisecond values by 1000) and convert back to milliseconds in Grafana when I need them.
Could you please explain this behavior? Is there a way to do it right?

And one more question, just to make sure I understand.
histogram_quantile(0.99, sum(rate(response_time_bucket{handle="/regions/info/"}[1m])) by (le))
This will show where 99% of requests fell, ranged by my buckets (in my case from 0.005 to 10.0), for the handler "/regions/info/" over 1m windows.
And this:
sum(rate(response_time_bucket{project="news", handle="/regions/info/"}[1m])) by (le)
will show all of the buckets in a similar way.
Am I right?

Thank you, again!

On Friday, April 27, 2018 at 17:09:34 UTC+3, Cody Boggs wrote:

Cody Boggs
May 7, 2018, 8:42:54 AM
to Maksim Timokhin, Prometheus Users
Sorry for the delay, I got pulled into a few other things for a while!

Thanks for the update, I didn't pay attention to the fact that you're trying to approximate your apdex score. My mistake!

I think the examples provided in the prometheus docs that you linked should get you a reasonable approximation, since the apdex score is usually tracked as a percentage of requests that fall within an acceptable range.
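
Just to spell out the arithmetic behind that approximation, here's a rough Python sketch with made-up cumulative bucket counts (the 0.3 and 1.2 thresholds are the target and 4x-target durations from the docs example):

# Hypothetical cumulative counts over some time window
satisfied_le_target    = 900.0   # e.g. rate(..._bucket{le="0.3"})  -- at or under the target
satisfied_le_tolerated = 980.0   # e.g. rate(..._bucket{le="1.2"})  -- at or under 4x the target
total                  = 1000.0  # e.g. rate(..._count)

# Classic Apdex: satisfied requests count fully, tolerating requests count half
tolerating = satisfied_le_tolerated - satisfied_le_target
apdex = (satisfied_le_target + tolerating / 2.0) / total
print(apdex)  # 0.94

# The query form from the docs is the same arithmetic in one expression, since buckets are cumulative:
# (bucket{le="0.3"} + bucket{le="1.2"}) / 2 / count  ->  (900 + 980) / 2 / 1000 = 0.94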

The graph screenshot you shared looks odd because you are showing the rate of change for each bucket, which doesn't translate to a very helpful graph. :-) You could possibly choose a single bucket, or a range of buckets, and track the rate of change for just those buckets. But I think the percentage approach in the linked doc makes the most sense for a visual indicator and / or alert query.

As for your last question, I think there's a small misunderstanding of the behavior of the "histogram_quantile" function. The query you shared (histogram_quantile(0.99, ...)) will actually show you a single value, which is the duration that 99% of your requests completed under. So for example, if that query gives you a value of, say, "0.105", then that means that you are seeing 99% of your requests for that handler completing in less than 105ms. 

Though, looking back through your original post, there might be a mismatch between the scale of the default histogram bucket ranges and your code's output. If your code is tracking milliseconds, then the default buckets are going to be hard to make use of, as they range from 0.005 to 10.0, which in milliseconds is pretty tiny for tracking web requests. You might try tweaking the "observe()" calls in your Python code to record the duration in seconds, so something like prom_timings.labels(url).observe(float(time) / 1000.0)

Happy to answer more questions if you have them, and I hope I stayed on-topic a bit better this time! :-)

Thanks!