struggling with alertmanager query

Kavan Mccanaan

Oct 21, 2020, 4:28:10 AM
to Prometheus Users
# Calculates HTTP error responses total
  - record: windows:windows_iis_worker_request_errors_total:irate5m
    expr: irate(windows_iis_worker_request_errors_total[5m])

  - alert: IIS error requests rate
    expr: sum without () (rate(windows:windows_iis_worker_request_errors_total:irate5m{status_code!="401"}[5m])) > 3
    for: 5m
    labels:
      severity: critical
      component: WindowsOS
    annotations:
      summary: "High IIS worker error rate"
      description: "IIS http responses on {{ if $labels.fqdn }}{{ $labels.fqdn }}{{ else }}{{ $labels.instance }}{{ end }} for {{ $labels.app }} has high rate of errors."
      dashboard:
      runbook:

I'm trying to do something like this to alert when people are getting errors whilst trying to connect to a webapp. The issue is that the query itself, 'windows_iis_worker_request_errors_total:irate5m', is returning non-integer values.

The idea was to evaluate over a rolling 5 minute window the number of errors.

Of course, in an ideal world I'd alert on the rate of errors by dividing by the total requests metric; however, the two metrics have a label mismatch and I am unsure how to write that query.

Would really appreciate any assistance!

edit:

Someone in the Prometheus developer group provided me with the following query, which does work:

sum by (fqdn, instance, app) (increase(windows_iis_worker_request_errors_total{status_code!="401"}[5m]))
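
As a side note on the non-integer values: per the Prometheus documentation, `increase` is `rate` multiplied by the number of seconds in the range, and both extrapolate to the window boundaries, so fractional results are expected even for a counter that only moves in whole steps. A rough sketch of that equivalence for the query above (the factor 300 is simply 5 minutes in seconds):

```promql
# Per-second error rate scaled to a 5-minute window; this should match
# the increase(...[5m]) form above and can likewise be fractional.
sum by (fqdn, instance, app) (
  rate(windows_iis_worker_request_errors_total{status_code!="401"}[5m]) * 300
)
```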

However I was wondering if someone would still know how to get a query working on the rate of errors rather than the increase in count despite the label mismatch between the IIS total requests and IIS error request metrics.

Tim Schwenke

Oct 21, 2020, 4:49:07 AM
to Prometheus Users
Hey again,

do you mean by "rate of errors" the ratio between errors and the total number of requests? If it is just the rate (as in the number of errors per second) you can just replace `increase` with `rate`. This will give you the errors per second averaged over the last 5 minutes.
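
A minimal sketch of that substitution, reusing the labels from the query earlier in the thread. It is applied to the raw counter rather than to the recorded `irate5m` series, because `rate()` expects a counter as input:

```promql
# Errors per second, averaged over the last 5 minutes, excluding 401s.
sum by (fqdn, instance, app) (
  rate(windows_iis_worker_request_errors_total{status_code!="401"}[5m])
)
```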

How does the label mismatch manifest itself? Is it just the label names or do the values differ as well? Can you post the respective labels of interest to you? 

Kavan Mccanaan

Oct 21, 2020, 5:19:17 AM
to Prometheus Users
Sorry, to clarify: by rate I mean the percentage of errors compared to total requests, i.e. if errors are more than 10% of total requests we could label it as a warning alert, and if over 30% a critical/outage (for example). So yes, the ratio of errors to total requests!
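
To make the thresholds concrete, here is a hedged sketch of the ratio as an expression. It assumes the total-requests counter is called `windows_iis_requests_total` (the thread never names it) and that both metrics share an `instance` label; all other labels are aggregated away so the two sides can match:

```promql
# Fraction of requests that ended in a non-401 error, per instance.
# Warning threshold: more than 10% of requests.
  sum by (instance) (rate(windows_iis_worker_request_errors_total{status_code!="401"}[5m]))
/
  sum by (instance) (rate(windows_iis_requests_total[5m]))
> 0.10
```

A critical rule would use the same expression with `> 0.30`.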

The label issue is, to quote my colleague: "the issue is one metric has different labels to the other. This means Prometheus can't match up the metrics as the labels don't match."

I suppose we could strip the labels, but then we lose context like the status code, for example.

Tim Schwenke

Oct 21, 2020, 6:57:48 AM
to Prometheus Users
Well I can't give you concrete tips if I don't see the labels. But generally you can use `label_join()` and `label_replace()` in PromQL to work around mismatching labels. <https://prometheus.io/docs/prometheus/latest/querying/functions/#label_join>
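
As an illustration of the `label_replace()` approach: if one metric identified the application by `app` and the other by `site`, and the values happened to coincide, the `app` value could be copied into a `site` label before dividing. This is only a sketch; the metric name `windows_iis_requests_total` and the assumption that `app` and `site` values match are hypothetical.

```promql
# Copy each series' `app` value into a new `site` label, then match on it.
# Assumption: app values equal site values, which may not hold in practice.
  sum by (instance, site) (
    label_replace(
      rate(windows_iis_worker_request_errors_total{status_code!="401"}[5m]),
      "site", "$1", "app", "(.*)"
    )
  )
/
  sum by (instance, site) (rate(windows_iis_requests_total[5m]))
```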

Kavan Mccanaan

Oct 21, 2020, 7:21:28 AM
to Prometheus Users
Sorry, the labels for the errors metric are [app, pid, status_code] whilst the labels for the total requests metric are [site, method]. Thanks very much for your help. I'm having a look at label joining now, although I'm not sure if it's going to allow me to do what I want to do!

Tim Schwenke

Oct 21, 2020, 8:04:39 AM
to Prometheus Users
With these two metrics and label combinations you can only calculate ratios against the total number of requests. For answering questions like "What is the percentage of errors for app `service` against all its requests?" you will have to improve the instrumentation. A counter for the number of requests by `app` would be helpful.

You are dealing with a many-to-one matching, which must be explicit. So use `group_left` or `group_right`. This should work after adapting the metric names:

```
sum(rate(http_request_errors_total[5m]))
/ ignoring (app, pid, status_code) group_left
sum(rate(http_requests_total[5m]))
```
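
For comparison, a sketch where `group_left` does real work: each app's error rate is divided by the total request rate of the instance it runs on, so several `app` series on the left match a single series on the right. The name `windows_iis_requests_total` and the shared `instance` label are assumptions, not something confirmed in this thread.

```promql
# Many-to-one: many (instance, app) series divided by one series per instance.
  sum by (instance, app) (rate(windows_iis_worker_request_errors_total{status_code!="401"}[5m]))
/ on (instance) group_left
  sum by (instance) (rate(windows_iis_requests_total[5m]))
```

The result keeps the `app` label, which gives each app's errors as a share of all requests on that instance.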