Best practices for creating alerting rules for low-TPS systems

Gaurav Tyagi

Sep 22, 2019, 5:11:18 PM
to Prometheus Users
I am working on a low-traffic online order management system (min: 1 TPS, peak: 20 TPS). The senior engineers on our team have set up alerts that are quite sensitive, and most of them result in a callout.

In my personal opinion, they are a waste of time because in most cases we can't make any changes. Any change to the system requires a prod release, which cannot go out at any time of day. As far as I understand, humans should be alerted only if the system requires human intervention. In most cases, we look at the alert, check the cause, and then act based on it.

If the system stops responding to clients entirely, then the most we could do at that instant is restart the container(s) in prod.

Usually, the alerts created in the system use absolute thresholds. They look like this:
sum by(p1, p2, p3, p4, p5, p6) (
  increase(application_responses{application="xxxx",resourceName!~"(?i)(a|.*b|.*c)",status=~"5.."}[10m])
) > 10

Most of the time, this alert fires because there is a burst of 10-20 requests that all fail due to an unexpected error, and this is then followed by successful requests. Since this alert is set up as a MAJOR alert, a human is paged.

One approach I would prefer in this case is to not have this alert as a major, but as a minor alert that still creates an incident but can be investigated later.
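Roughly what I have in mind (the rule name, severity values, and receiver names below are just placeholders, and this assumes we route on a severity label in Alertmanager):

    groups:
      - name: order-management
        rules:
          - alert: High5xxResponses
            expr: sum by(p1, p2, p3, p4, p5, p6) (increase(application_responses{application="xxxx",resourceName!~"(?i)(a|.*b|.*c)",status=~"5.."}[10m])) > 10
            labels:
              severity: minor    # creates an incident, but nobody is paged
            annotations:
              summary: Burst of 5xx responses, investigate during working hours

and in the Alertmanager config, something like:

    route:
      receiver: pager
      routes:
        - match:
            severity: minor
          receiver: ticket-queue    # ticket/incident only, no callout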
Another approach is to maintain a mean error rate and alert when errors are more than 3 standard deviations above the mean. But our data does not seem to follow a normal distribution.
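A rough sketch of that second idea, assuming we record the error count with a recording rule first (the rule name and the windows are made up):

    # record the rolling 10m error count
    - record: app:responses_5xx:increase10m
      expr: sum(increase(application_responses{application="xxxx",status=~"5.."}[10m]))

    # alert when the current count is more than 3 standard deviations
    # above the mean of the last day of that series
    - alert: ErrorCountAnomaly
      expr: >
        app:responses_5xx:increase10m
          > avg_over_time(app:responses_5xx:increase10m[1d])
            + 3 * stddev_over_time(app:responses_5xx:increase10m[1d])

But since the error counts don't look normally distributed, I'm not convinced the 3-sigma threshold would actually behave any better.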

I would like to get input from someone more knowledgeable in this domain on how we can approach this. Is there a best practice for setting up alerts for these types of scenarios?

Regards
Gaurav

Dave Cadwallader

Sep 23, 2019, 3:07:19 PM
to Prometheus Users
One thought is to use a "for" clause, such as "for: 10m" in the alert.  That way, you'd only be alerted if there is a sustained increase in errors.  That would smooth over the "burst" case.  
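Something like this, keeping your existing expression (the 10m duration and the severity label are just examples):

    - alert: SustainedErrorRate
      expr: sum by(p1, p2, p3, p4, p5, p6) (increase(application_responses{application="xxxx",resourceName!~"(?i)(a|.*b|.*c)",status=~"5.."}[10m])) > 10
      for: 10m    # the expression has to stay above the threshold for 10 minutes before the alert fires
      labels:
        severity: major

A short burst that clears within that window would show up as a pending alert in Prometheus but never fire.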