Expression that divides one metric by another metric with the same labels


Alex K

Dec 29, 2020, 5:36:26 PM
to Prometheus Users
I have a counter metric called response_total. It has labels source, status, and service, plus a few more, but those are the important ones for this question.

response_total{status="200", source="foo", service="bar"} is the counter for successful requests from a service or job called "foo" to a service called "bar". 
response_total{status!="200", source="foo", service="bar"} is the counter for failed requests from a service or job called "foo" to a service called "bar". 

I'm trying to define an alert that will trigger if there's a sudden increase of non-200 requests from a specific source to a specific service, relative to the increase of 200 requests for the same (source, service). E.g., if the increase of non-200 requests over the last 10 minutes is 10x greater than the increase of 200 requests, trigger an alert.

I'm a bit stuck on how to define this as an expression. So far I've converged on something along these lines:

increase(response_total{status!="200"}[10m]) / increase(response_total{status="200"}[10m]) > 10

This doesn't seem to work, and that's not particularly surprising: I'm not sure how Prometheus is supposed to "know" that it should be comparing response_total{status!="200", source="foo", service="bar"} to response_total{status="200", source="foo", service="bar"}.

I could hard-code the service up-front, but the sources are assigned by our cluster manager, so I can't enumerate them all in advance.
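For context, PromQL binary operators only pair series whose label sets match exactly, which is why the two sides above never match (their status values differ). A sketch of the matching rule using ignoring (note that ignoring(status) alone still fails for the original query, because status!="200" leaves several series per (source, service) group on the left-hand side):

```promql
# One-to-one matching after discarding the status label.
# This only works when each side has exactly one series per
# remaining label group, e.g. a single concrete error status.
increase(response_total{status="500"}[10m])
  / ignoring (status)
increase(response_total{status="200"}[10m])
```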

I appreciate any help!

Thanks,
Alex


Ben Kochie

Dec 30, 2020, 3:55:32 AM
to Alex K, Prometheus Users
Since you don't care about the status, the typical thing to do is use a sum() aggregator to remove the label.

sum without (status) (increase(response_total{status!="200"}[10m])) / sum without (status) (increase(response_total{status="200"}[10m]))
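For completeness, wiring this ratio into an alerting rule might look like the following sketch (the HighNon200Ratio alert name and severity label are hypothetical; the > 10 threshold and 10m window come from the original question; the file layout is the standard Prometheus alerting-rules format):

```yaml
groups:
  - name: error-ratio
    rules:
      - alert: HighNon200Ratio   # hypothetical name
        expr: |
          sum without (status) (increase(response_total{status!="200"}[10m]))
            /
          sum without (status) (increase(response_total{status="200"}[10m]))
            > 10
        labels:
          severity: warning      # hypothetical label
```

Because both sides are aggregated without (status), the remaining label sets (source, service, and any others) match one-to-one, so each (source, service) pair is compared against its own baseline.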


Alex K

Dec 30, 2020, 10:08:05 AM
to Prometheus Users
Hmm. I do care about the status. Maybe when I simplified the question I oversimplified the problem too much.
I got it to pretty much work by doing this:

(sum without (dst_pod) (
     route_response_total{
       direction="outbound", grpc_status!="0", grpc_status!="", rt_route!="", dst="bar"}))
/ on (rt_route, pod, workload_ns)
(sum without (dst_pod) (
     route_response_total{
       direction="outbound", grpc_status="0", rt_route!="", dst="bar"})) > 10


dst_pod denotes a specific Kubernetes pod in the "bar" service. dst denotes the service name. direction="outbound" denotes a counter for requests sent from a pod.

This gives the correct answer, but only if the denominator is present (i.e., not "absent"). So if the pod has made at least one successful request, this works. But for a pod that has never made a successful request, the denominator series is missing entirely: the Prometheus console returns "no data", and no alert can be triggered in my rule group. Dividing by zero seems like a separate problem, but I'd appreciate any input on that too.
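One common pattern for the missing-denominator case is to add an unless branch (a sketch reusing the labels from the query above; unless keeps numerator series that have no matching denominator, i.e. pods with failures but zero successes, which also sidesteps the divide-by-zero case):

```promql
(sum without (dst_pod) (
     route_response_total{
       direction="outbound", grpc_status!="0", grpc_status!="", rt_route!="", dst="bar"}))
/ on (rt_route, pod, workload_ns)
(sum without (dst_pod) (
     route_response_total{
       direction="outbound", grpc_status="0", rt_route!="", dst="bar"})) > 10
or
(sum without (dst_pod) (
     route_response_total{
       direction="outbound", grpc_status!="0", grpc_status!="", rt_route!="", dst="bar"}))
unless on (rt_route, pod, workload_ns)
(sum without (dst_pod) (
     route_response_total{
       direction="outbound", grpc_status="0", rt_route!="", dst="bar"}))
```

The first branch alerts when the failure/success ratio exceeds 10; the unless branch fires whenever there are failures but the success series is absent altogether.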

Ben Kochie

Dec 30, 2020, 10:16:31 AM
to Alex K, Prometheus Users
Well, maybe "care" is the wrong word. The status isn't relevant when you're doing math that explicitly compares different statuses, so you use aggregation operators to factor it out.
