I have a counter metric called response_total. It has labels source, status, and service, plus a few more, but those are the important ones for this question.
response_total{status="200", source="foo", service="bar"} is the counter for successful requests from a service or job called "foo" to a service called "bar".
response_total{status!="200", source="foo", service="bar"} is the counter for failed requests from a service or job called "foo" to a service called "bar".
I'm trying to define an alert that will trigger if there's a sudden increase of non-200 requests from a specific source to a specific service, relative to the increase of 200 requests for the same (source, service) pair. E.g., if the increase of non-200 requests over the last 10 minutes is 10x greater than the increase of 200 requests, trigger an alert.
I'm a bit stuck on how to define this as an expression. So far I've converged on something along these lines:
increase(response_total{status!="200"}[10m]) / increase(response_total{status="200"}[10m]) > 10
This doesn't seem to work, and that's not particularly surprising. I'm not sure how Prometheus would "know" that it should be comparing response_total{status!="200", source="foo", service="bar"} to response_total{status="200", source="foo", service="bar"}.
I could define the service up-front, but the sources are defined by our cluster manager, so I can't enumerate them all up-front.
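One direction I've been toying with (untested) is to aggregate both sides over just the labels I want to match on, so the division pairs up the left and right sides per (source, service) without me having to enumerate anything. Something like:

sum by (source, service) (increase(response_total{status!="200"}[10m]))
  /
sum by (source, service) (increase(response_total{status="200"}[10m]))
  > 10

But I'm not sure whether this is the idiomatic way to get the label matching, or what happens when the 200 counter hasn't increased at all over the window (division by zero).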
I appreciate any help!
Thanks,
Alex