Best practices for alert naming


Matt Bostock

Jul 7, 2016, 4:51:24 AM
to Prometheus Developers
Hello,

I'm interested in anecdotes or suggestions for best practices for alert naming.

Specifically, while Prometheus allows multiple alerting rules with the same name, have people found this useful and what side effects should I be aware of?

I saw some discussion on GitHub:


At this point I'm mostly interested in the practical impact of having multiple alert rules with the same name.

For example, if I have multiple alerts called 'DiskUsage', each targeting different mountpoints on different machines, the benefits are:

- alert name easier to read
- inhibition rules easier to configure in AlertManager
- I can also still distinguish the original alerting rules in the ALERTS metric using labels

I guess the problem comes if I have two alerting rules for disk usage, one targeting all nodes and another targeting a specific node, with different thresholds. What happens in that case? I guess the order in which the alerting rules appear in the rules files is significant in that case?
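
For concreteness, I mean something like this (a hypothetical pair of rules; the metric names, instance and thresholds are just illustrative):

# Hypothetical: a broad rule plus a per-node rule, both named DiskUsage
ALERT DiskUsage
  IF node_filesystem_free{job="node"} / node_filesystem_size{job="node"} < 0.10
  FOR 10m
  LABELS { severity = "warning" }

ALERT DiskUsage
  IF node_filesystem_free{job="node", instance="db1:9100"} / node_filesystem_size{job="node", instance="db1:9100"} < 0.25
  FOR 10m
  LABELS { severity = "warning" }

Here the second rule's node is also matched by the first rule, just with a different threshold.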

Thanks,
Matt

Fabian Reinartz

Jul 7, 2016, 5:14:44 AM
to Matt Bostock, Prometheus Developers


The issues you are referencing are about the evaluation dependencies between alerts, which doesn't have much to do with how you name alerts.
In your example of the disk usage alerts on all nodes vs. specific nodes: you shouldn't have the same alert for the same node twice with different thresholds. If you want different thresholds for different node sets, their intersection should be empty.
Conceptually it rarely makes sense otherwise, unless the rules have different severities or similar, which will cause them to notify different receivers. In that case you'll have a distinguishing label anyway.
The only thing to take care of is that no two alerting rules generate the exact same alert label set. That's generally prevented by assigning different static labels in the alerting rule (e.g. `severity` or `service`), or by the rules being based on time series that produce different label sets (e.g. different `job` or `instance` labels).
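
To make that concrete for the disk example (a rough sketch; the metric names, instance and thresholds are all made up):

# Hypothetical: the broad rule excludes the specially-treated node, so the node sets don't overlap
ALERT DiskUsage
  IF node_filesystem_free{job="node", instance!="db1:9100"} / node_filesystem_size{job="node", instance!="db1:9100"} < 0.10
  FOR 10m
  LABELS { severity = "warning" }

ALERT DiskUsage
  IF node_filesystem_free{job="node", instance="db1:9100"} / node_filesystem_size{job="node", instance="db1:9100"} < 0.25
  FOR 10m
  LABELS { severity = "warning" }

Because the instance selectors are disjoint, the two rules can never produce the exact same alert label set.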

This might go beyond what you were asking, but here are my thoughts with some explanation. Might be helpful for others as well:

From the Alertmanager perspective, the alert name is yet another label.
I would avoid flattening information into the alert name that can be treated as its own dimension via an (often static) label.

As always with labels, you have to find a pattern that works for your organisation; there's no one magic solution.
We used to have `SUMMARY` and `DESCRIPTION` as fixed keywords in alerting rules. Today this is solved with annotations, but those two are still a sane choice in general.

I always thought the following covers most cases very well:
- Use no label information in the summary. It should just be a text description of the alert name.
- Include label information in a more precise description (see the sketch below).
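
A minimal sketch of that split (hypothetical rule; the metric name and threshold are made up):

# Hypothetical rule to illustrate summary vs. description
ALERT LatencyHigh
  IF histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  FOR 10m
  LABELS { severity = "page" }
  ANNOTATIONS {
    summary = "99th percentile latency on HTTP endpoints is high",
    description = "99th percentile latency for {{ $labels.job }} on {{ $labels.instance }} is {{ $value }}s"
  }

The summary reads the same for every instance of the alert; the description interpolates the labels of the specific firing alert.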

Depending on the alert name, a summary might be optional (`DiskFull` is pretty clear IMO).
Even if your notification grouping is broader, this allows for comprehensible grouping along the alert name in a single notification template.

Example:
If you group notifications along an entire service, your notification could look like this:

"""
LatencyHigh (84 instances): "99th percentile latency on HTTP endpoints is high"
ErrorRateHigh (34 instances): "Error rate on HTTP endpoints above 2%"
InstanceDown (2 instances): "service instance not reachable"

...
detailed list of the above alert instances
...
"""

(We certainly could make such things easier in templating in the future.)

Keeping standard alert names, e.g. LatencyHigh, consistent across services in your org will then allow you to query (in Prometheus or Alertmanager) for specific alerts across all services to see common problems. This is not the case if you flatten the service name into your alert name, for example.
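
For instance (assuming consistent naming; LatencyHigh is just an example name), a single query in Prometheus shows where that alert is firing across all services:

ALERTS{alertname="LatencyHigh", alertstate="firing"}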



Brian Brazil

Jul 7, 2016, 7:34:01 AM
to Fabian Reinartz, Matt Bostock, Prometheus Developers
I'd expect something like LatencyHigh to have the job/service flattened into it, as each alert definition is going to be distinct with different thresholds/considerations. I'd expect this to work out similarly to our guidelines for whether to have time series in one metric with labels or in multiple metrics. The avg/sum of latency across services is meaningless; however, that's not the case for cross-machine disk usage.

Brian

 


Fabian Reinartz

Jul 7, 2016, 7:42:56 AM
to Brian Brazil, Matt Bostock, Prometheus Developers
It has to be a case-by-case decision for sure.
Sometimes the meaning is sufficiently different that flattening a limited set of labels into the name makes sense. I wouldn't consider different thresholds a sufficient reason for that most of the time; it seems more applicable to more complex filtering/side conditions to me.

When I talked about an overview of alerts with the same meaning across a fleet, that didn't imply aggregating the respective time series values.

Matthias Rampke

Jul 8, 2016, 3:15:50 AM
to Fabian Reinartz, Brian Brazil, Matt Bostock, Prometheus Developers

Another alerting pattern we (sparingly) use to deal with differences in thresholds is threshold metrics (sometimes exported by the service itself, sometimes just constant rules). These are then used in one all-encompassing alert expression.

With some generous application of `or` you can even have default thresholds.
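
A rough sketch of what I mean (every metric and rule name below is made up; the override metric is assumed to carry only a `service` label):

# Per-service override, e.g. exported by the service itself or set via a constant rule:
#   error_rate_threshold_override{service="payments"} = 0.05

# Recording rule: per-service threshold, defaulting to 2% where no override exists
# (`or` keeps the override series and only adds defaults for services without one).
service:error_rate:threshold = error_rate_threshold_override or (count by (service) (http_requests_total) * 0 + 0.02)

# One all-encompassing alert expression using the threshold metric:
ALERT ErrorRateHigh
  IF sum by (service) (rate(http_errors_total[5m])) / sum by (service) (rate(http_requests_total[5m])) > on (service) service:error_rate:threshold
  FOR 10m
  LABELS { severity = "page" }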

We usually declare warning and critical alerts with the same name, distinguished by a `severity` label. We almost always ensure a `service` label is set, either explicitly in the alert rule or through relabelling/label_replace. These two together are the basis of our alert routing.
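
As a sketch of that shape (the alert, metric and service names are purely hypothetical):

# Hypothetical: same alert name twice, distinguished by `severity`, with `service` set explicitly
ALERT QueueBacklogHigh
  IF backlog_jobs{job="payments-worker"} > 1000
  FOR 15m
  LABELS { severity = "warning", service = "payments" }

ALERT QueueBacklogHigh
  IF backlog_jobs{job="payments-worker"} > 5000
  FOR 15m
  LABELS { severity = "critical", service = "payments" }

Where `service` can't be set statically like this, label_replace in the expression (or relabelling at scrape time) can derive it from an existing label such as `job`.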

/MR
