[Beginner] Build query for the http request.


Sachin Maharana

Apr 28, 2020, 3:48:27 AM
to Prometheus Users
I am instrumenting my project for Prometheus and wanted to query whether 500 requests take more than an average of 3 seconds within a 5m interval.
I have this so far: avg_over_time(probe_http_duration_seconds[5m]) >
but I'm not sure how to query for 500 requests. Any hint would be helpful.

Brian Candler

Apr 28, 2020, 4:23:05 AM
to Prometheus Users
Do you mean requests with result status code 500?

This is a bit tricky.  First thing you have to be careful of is that "probe_http_duration_seconds" is not the total, it's broken down into phases, as you can see if you try the exporter with curl:

$ curl 'localhost:9115/probe?module=http_2xx_example&target=https:%2f%2fwww.google.com'
...
probe_duration_seconds 0.471663605
...
probe_http_duration_seconds{phase="connect"} 0.010641254
probe_http_duration_seconds{phase="processing"} 0.046997224
probe_http_duration_seconds{phase="resolve"} 0.001434721
probe_http_duration_seconds{phase="tls"} 0.421022725
probe_http_duration_seconds{phase="transfer"} 0.001299392
...
probe_http_status_code 200

So really you should be using probe_duration_seconds which is the total time.

Now, you can generate a filtered query like this:

    probe_duration_seconds and (probe_http_status_code == 500)

The logical operators are described here and depend on the LHS and RHS having the same set of labels, unless you start doing grouping.  This should return only LHS data points where the RHS has a data point.
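For example (hypothetically - the exact labels depend on your exporter configuration), if one side carries extra labels the other lacks, you can restrict the match to the labels the two sides share with on():

    probe_duration_seconds and on(instance, job) (probe_http_status_code == 500)

Here samples are paired purely by their instance and job labels, and any other labels are ignored for the purpose of matching.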

The trouble is, to do avg_over_time on that you'll need a subquery:

    avg_over_time( (probe_duration_seconds and (probe_http_status_code == 500))[5m:1m] ) > 3

Subqueries will resample your data - in the above example it will take 1 minute steps over 5 minutes.  So you need to align this with whatever scraping rate you are using.  It might be good enough.
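For instance, if you happen to be scraping every 15 seconds (an assumption - substitute your own interval), you could match the subquery resolution to it:

    avg_over_time( (probe_duration_seconds and (probe_http_status_code == 500))[5m:15s] ) > 3

If you omit the resolution entirely ([5m:]), Prometheus falls back to the global evaluation interval.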

In general, a 500 error means your server is failing.  If you're getting a noticeable number of 500 errors during a 5 minute period, you probably have bigger problems to worry about than the response time!  That is you should fix the error, not worry about how long the error takes to be returned.

Sachin Maharana

Apr 28, 2020, 5:12:26 AM
to Prometheus Users
Thanks for your response. I gained some idea of how to approach such queries. By 500 I meant the number of requests, not the error code (my wrong choice of wording did imply that). So I want to alert if 1000 requests take more than an average of 3 seconds within a 5 minute interval. If I take http_request_duration_seconds_sum for example,

  avg_over_time( (http_request_duration_seconds_sum and (http_request_duration_seconds_count >= 1000 ))[5m] ) > 3

Is this query right, or am I missing something?

Brian Candler

Apr 28, 2020, 5:50:19 AM
to Prometheus Users
Ah right.  Because you had "probe" in the metric name I thought you were using blackbox_exporter.  That isn't going to help you here, because each probe only makes one request.  To make 500 requests in 5 minutes, you would need to be scraping nearly twice per second!

It sounds like what you actually want to do is read your webserver log files and turn them into statistics.  You'll need to look at tools like mtail or grok_exporter to do that.
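As a very rough sketch of the mtail approach - the log line format, regex, and metric names here are assumptions, not something that will match your logs as-is - a program like this counts requests by status code, and counts slow requests separately:

    # Assumed access-log format ending in: <status> <bytes> <latency-seconds>
    counter http_requests_total by code
    counter slow_requests_total

    / (?P<code>\d{3}) \S+ (?P<latency>\d+\.\d+)$/ {
      http_requests_total[$code]++
      $latency > 3.0 {
        slow_requests_total++
      }
    }

You'd then scrape mtail itself and alert on those counters with ordinary PromQL.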

In order to answer "how many queries took more than 3 seconds to respond?" you will need to generate histogram buckets at whatever intervals are of interest, e.g.

http_request_duration_seconds_bucket{le="0.5"}
http_request_duration_seconds_bucket{le="1.0"}
http_request_duration_seconds_bucket{le="3.0"}
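With buckets in place, the number of requests slower than 3 seconds over a window is the total count minus the le="3.0" bucket (using the standard _count/_bucket histogram naming):

    increase(http_request_duration_seconds_count[5m]) - ignoring(le) increase(http_request_duration_seconds_bucket{le="3.0"}[5m])

and you can alert when that exceeds whatever threshold you care about.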
