Alertmanager: disable grouping?


Rob N

Oct 12, 2017, 7:46:37 PM
to Prometheus Users

Is there a way to configure Alertmanager to disable grouping entirely? I'm only using AM as a proxy to ship alerts from Prometheus to VictorOps, Slack and email, and its grouping behaviour is getting in the way.

Thanks,
Rob.

Brian Brazil

Oct 13, 2017, 1:31:59 AM
to Rob N, Prometheus Users
This is not possible. The goal of the alertmanager is to take alerts and reduce them down to more useful notifications, and that includes grouping.
If you don't want grouping then it's suggested to bypass the alertmanager and consume the alert stream from Prometheus directly.

Brian


Rob N ★

Oct 13, 2017, 1:39:03 AM
to Brian Brazil, Prometheus Users
Well that sucks. Removing Alertmanager from the equation means I have to reimplement a bunch of stuff that it already does that is useful to me, like actually accepting Prometheus' alert structure and routing to different providers based on labels.

I've been attempting a workaround: using a time() query to add a timestamp label to the alert generated by Prometheus, and then having Alertmanager group on that (so it will obviously never match anything). That keeps throwing a nil pointer exception in the template, but I may have the syntax wrong. I will pursue it further.
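Roughly the kind of rule I mean - the rule name, threshold and label name are made up for illustration, and the label templating is the part that's currently blowing up:

    groups:
      - name: disk-warnings
        rules:
          - alert: DiskVolume90Percent
            # The comparison keeps only volumes over 90% used; multiplying by 0
            # and adding time() turns the sample value into the evaluation
            # timestamp.
            expr: (1 - node_filesystem_avail / node_filesystem_size > 0.9) * 0 + time()
            labels:
              severity: warning
              # $value is now the timestamp, so no two alerts should ever end
              # up in the same group if Alertmanager groups on this label.
              fired_at: '{{ $value }}'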

Failing that, would you accept a patch to Alertmanager to disable grouping if I provided one?

Thanks,
Rob.

Brian Brazil

Oct 13, 2017, 1:42:40 AM
to Rob N ★, Prometheus Users
On 13 October 2017 at 06:39, Rob N ★ <ro...@fastmail.com> wrote:
Well that sucks. Removing Alertmanager from the equation means I have to reimplement a bunch of stuff that it already does that is useful to me, like actually accepting Prometheus' alert structure and routing to different providers based on labels.

Usually the people looking for this already have an alertmanager-y type thing which handles this, and layering two of them doesn't work too well.
 

I've been attempting a workaround: using a time() query to add a timestamp label to the alert generated by Prometheus, and then having Alertmanager group on that (so it will obviously never match anything). That keeps throwing a nil pointer exception in the template, but I may have the syntax wrong. I will pursue it further.

Failing that, would you accept a patch to Alertmanager to disable grouping if I provided one?

This has been discussed previously and won't be accepted. We don't want to make it easy for users to spam themselves, and it'd also cause semantic issues.


Why exactly do you not want to group?

Brian
 


Rob N ★

Oct 13, 2017, 3:16:34 AM
to Brian Brazil, Prometheus Users
On Fri, 13 Oct 2017, at 04:42 PM, Brian Brazil wrote:
Usually the people looking for this already have an alertmanager-y type thing which handles this, and layering two of them doesn't work too well.

I have VictorOps as the final receiver of alerts, and it does grouping, routing, silencing etc, so I sort of see your point.

That said, without Alertmanager I would still need to have something in between to take Prometheus' alerts and forward them on. The VO docs recommend using Alertmanager to connect Prometheus, so I just followed that advice. I should ask them about their intent here, since receiving pre-grouped alerts kind of doesn't make sense, as you say.

What VO doesn't deal with particularly well, though, is having "informational" alerts that don't actually page anyone but are useful to point out potential issues before they happen. It can be done with their routing rules and alert rewriting tool, but it's fiddly and you don't get a nice clean UI at the end of it showing these alerts. Alertmanager can do that easily, just by labeling these alerts differently (the docs describe "severity = warning") and routing them elsewhere, which I do (these go to Slack+email).
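Roughly like this, for the curious - the receiver names, channel and address are placeholders rather than my real config:

    route:
      receiver: victorops            # anything not matched below pages on-call
      routes:
        - match:
            severity: warning
          receiver: warnings         # informational only: Slack + email, no page
    receivers:
      - name: victorops
        victorops_configs:
          - routing_key: ops         # assumes the API key is set in the global config
      - name: warnings
        slack_configs:
          - channel: '#warnings'
        email_configs:
          - to: 'ops-warnings@example.com'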

I could write a thing to receive those from VO and display them nicely of course, but then VO needs to be able to call back into my monitoring infra which is slightly awkward, and Alertmanager already has all this stuff so why bother?

Why exactly do you not want to group?

Here's a specific case I'm dealing with. I have a general "warning" alert when a disk volume reaches 90%. It's usually an indication that something will become a problem soon. (It also points to a shortcoming in monitoring elsewhere, which is something we need to improve for sure, but that doesn't mean it has no value).

So here are two 90% warnings I have, from a single alert rule (using node_filesystem_avail and node_filesystem_size):

{dc="nyi",device="/dev/mapper/sdn11",fstype="ext4",instance="10.202.2.86:9100",job="prom_node_exporter",mountpoint="/mnt/i36d2t11",node="imap36"}
{dc="quadra",device="/dev/mapper/sde1",fstype="ext4",instance="10.207.2.101:9100",job="prom_node_exporter",mountpoint="/mnt/qb7backup",node="qbackup1"}

These two alerts should never be grouped, as they relate to very different subsystems (hot user data and backups respectively) and aren't related. They won't be, because my group_by config is currently ["alertname","dc","node","job"]. But what about this:

{dc="nyi",device="/dev/mapper/sdm7",fstype="ext4",instance="10.202.2.81:9100",job="prom_node_exporter",mountpoint="/mnt/i31d1t07",node="imap31"}
{dc="nyi",device="/dev/sda5",fstype="ext4",instance="10.202.2.81:9100",job="prom_node_exporter",mountpoint="/local",node="imap31"}

/local is a scratch space volume that most of our internal services use. Its reaching a usage threshold at the same time as a user data volume is almost certainly unrelated. These should not be grouped. I could add "mountpoint" to my group_by config, but then group_by starts needing all sorts of different labels for all the different things we might want to alert on.
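For reference, the relevant bit of my routing config looks something like this (the receiver name is a placeholder):

    route:
      receiver: warnings
      group_by: ['alertname', 'dc', 'node', 'job']
      # Adding 'mountpoint' would split the two alerts above into separate
      # groups, but then every other kind of alert starts needing its own
      # distinguishing label too.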

Not grouping them does mean that we might get flooded if these warning alerts did all fire at the same time for some reason, but that's not really an issue with the receivers configured as they are - a separate Slack channel and email that gets filtered to a folder. They're more low-level awareness things, not things for immediate action.

Now obviously, grouping them doesn't hurt any of this; all the alerts are still visible in the Alertmanager UI, and we do get emails and Slack posts when an alert is added or removed from the group. But it's far more useful to receive the full data of the new alert in the receiver, rather than just the common elements.


I should note that we are coming from an in-house monitoring & alerting system that is little more than "check thresholds every couple of minutes and wake someone if we go over them". We're moving away from that, but we've got 15 years of legacy to disentangle so it's going to take time. It's quite possible that I'm trying to shoehorn an old model into a new world and this is just somewhere it won't fit. On the other hand, it seems like I'm within touching distance, and the idea of being able to tell the operators "this is something you should deal with soon" without reference to anything else doesn't seem ridiculous to me.

All ideas welcome :)

Thanks,
Rob N.

Brian Brazil

Oct 13, 2017, 3:29:22 AM
to Rob N ★, Prometheus Users
On 13 October 2017 at 08:16, Rob N ★ <ro...@fastmail.com> wrote:
On Fri, 13 Oct 2017, at 04:42 PM, Brian Brazil wrote:
Usually the people looking for this already have an alertmanager-y type thing which handles this, and layering two of them doesn't work too well.

I have VictorOps as the final receiver of alerts, and it does grouping, routing, silencing etc, so I sort of see your point.

That said, without Alertmanager I would still need to have something in between to take Prometheus' alerts and forward them on. The VO docs recommend using Alertmanager to connect Prometheus, so I just followed that advice. I should ask them about their intent here, since receiving pre-grouped alerts kind of doesn't make sense, as you say.

What VO doesn't deal with particularly well, though, is having "informational" alerts that don't actually page anyone but are useful to point out potential issues before they happen. It can be done with their routing rules and alert rewriting tool, but it's fiddly and you don't get a nice clean UI at the end of it showing these alerts. Alertmanager can do that easily, just by labeling these alerts differently (the docs describe "severity = warning") and routing them elsewhere, which I do (these go to Slack+email).

I could write a thing to receive those from VO and display them nicely of course, but then VO needs to be able to call back into my monitoring infra which is slightly awkward, and Alertmanager already has all this stuff so why bother?

Why exactly do you not want to group?

Here's a specific case I'm dealing with. I have a general "warning" alert when a disk volume reaches 90%. It's usually an indication that something will become a problem soon. (It also points to a shortcoming in monitoring elsewhere, which is something we need to improve for sure, but that doesn't mean it has no value).

So here are two 90% warnings I have, from a single alert rule (using node_filesystem_avail and node_filesystem_size):

{dc="nyi",device="/dev/mapper/sdn11",fstype="ext4",instance="10.202.2.86:9100",job="prom_node_exporter",mountpoint="/mnt/i36d2t11",node="imap36"}
{dc="quadra",device="/dev/mapper/sde1",fstype="ext4",instance="10.207.2.101:9100",job="prom_node_exporter",mountpoint="/mnt/qb7backup",node="qbackup1"}

These two alerts should never be grouped, as they relate to very different subsystems (hot user data and backups respectively) and aren't related. They won't be, because my group_by config is currently ["alertname","dc","node","job"]. But what about this:

{dc="nyi",device="/dev/mapper/sdm7",fstype="ext4",instance="10.202.2.81:9100",job="prom_node_exporter",mountpoint="/mnt/i31d1t07",node="imap31"}
{dc="nyi",device="/dev/sda5",fstype="ext4",instance="10.202.2.81:9100",job="prom_node_exporter",mountpoint="/local",node="imap31"}

/local is a scratch space volume that most of our internal services use. Its reaching a usage threshold at the same time as a user data volume is almost certainly unrelated. These should not be grouped. I could add "mountpoint" to my group_by config, but then group_by starts needing all sorts of different labels for all the different things we might want to alert on.

Not grouping them does mean that we might get flooded if these warning alerts did all fire at the same time for some reason, but that's not really an issue with the receivers configured as they are - a separate Slack channel and email that gets filtered to a folder. They're more low-level awareness things, not things for immediate action.

Sending notifications into slack/email like this is what I'd consider not very useful, as the important thing about an alert/notification is that a human needs to look at it. What you want here is usually a ticketing system, so these alerts can be assigned to someone to process. If they're just "for information" I wouldn't even send them to the alertmanager.

For the more ticket-type things, the question is more about what batch size the human works with. For example, if I'm processing them once a day I'd like them all in one notification so I can handle them in batch, as they are likely related problems with a single common cause. If each requires substantial unique engineering effort to deal with on varying timescales, then I might want one ticket each - and to look at redesigning the system not to produce so much operational work.
 

Now obviously, grouping them doesn't hurt any of this; all the alerts are still visible in the Alertmanager UI, and we do get emails and Slack posts when an alert is added or removed from the group. But it's far more useful to receive the full data of the new alert in the receiver, rather than just the common elements.

The receiver gets full details of every alert in the group. If you want to show more than the defaults (we only put a summary in things like Slack to avoid spamming your channels with 10-page notifications every 5 minutes, whereas email has everything), that can be configured with notification templating.
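For example, something along these lines (the channel and label names are illustrative) lists every alert in the group rather than just the shared labels:

    receivers:
      - name: slack-warnings
        slack_configs:
          - channel: '#warnings'
            title: '{{ .CommonLabels.alertname }}: {{ len .Alerts }} alert(s)'
            # Iterate over every alert in the group so the notification shows
            # each alert's own labels, not just the common ones.
            text: >-
              {{ range .Alerts }}
              {{ .Labels.node }} {{ .Labels.mountpoint }} ({{ .Labels.instance }})
              {{ end }}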


I should note that we are coming from an in-house monitoring & alerting system that is little more than "check thresholds every couple of minutes and wake someone if we go over them". We're moving away from that, but we've got 15 years of legacy to disentangle so it's going to take time. It's quite possible that I'm trying to shoehorn an old model into a new world and this is just somewhere it won't fit. On the other hand, it seems like I'm within touching distance, and the idea of being able to tell the operators "this is something you should deal with soon" without reference to anything else doesn't seem ridiculous to me.

I'd suggest reading https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit to get an idea of how we envision alerting with Prometheus.

Brian




Rob N ★

Oct 13, 2017, 4:04:21 AM
to Brian Brazil, Prometheus Users
On Fri, 13 Oct 2017, at 06:29 PM, Brian Brazil wrote:
Sending notifications into slack/email like this is what I'd consider not very useful, as the important thing about an alert/notification is that a human needs to look at it. What you want here is usually a ticketing system, so these alerts can be assigned to someone to process.

That's more about processes and scale. I have three operations people. I have no more than two or three of these warnings on any given day. Eyeballing Slack and reading our email is more than enough to keep track of it.

If they're just "for information" I wouldn't even send them to the alertmanager.

Slight tangent, but does anything else implement the alertmanager API? Or phrased another way, what other methods exist for getting this kind of stuff out of Prometheus? Just make a dashboard/console that runs the interesting queries?

For the more ticket-type things, the question is more about what batch size the human works with. For example, if I'm processing them once a day I'd like them all in one notification so I can handle them in batch, as they are likely related problems with a single common cause. If each requires substantial unique engineering effort to deal with on varying timescales, then I might want one ticket each - and to look at redesigning the system not to produce so much operational work.

The receiver gets full details of every alert in the group. If you want to show more than the defaults (we only put a summary in things like Slack to avoid spamming your channels with 10-page notifications every 5 minutes, whereas email has everything), that can be configured with notification templating.

Yeah, I think we're definitely talking about a difference in scale. In my case I just don't have enough warnings to be overwhelming, but do have enough that I want to know about.

Also you say you want related things in one batch, but my point here is that these things look related, but aren't.

Don't get me wrong; I totally see why the grouping behaviour is useful if that's what you need, but in this case it's in my way.

Anyway, it's not a showstopper, and I have a few other ideas for how to do what I need, so I'll go and give those a try. I do appreciate the discussion; it's helped to clarify a few things in my mind. Cheers!

Rob N.

Brian Brazil

Oct 13, 2017, 4:19:22 AM
to Rob N ★, Prometheus Users


On 13 Oct 2017 09:04, "Rob N ★" <ro...@fastmail.com> wrote:
On Fri, 13 Oct 2017, at 06:29 PM, Brian Brazil wrote:
Sending notifications into slack/email like this is what I'd consider not very useful, as the important thing about an alert/notification is that a human needs to look at it. What you want here is usually a ticketing system, so these alerts can be assigned to someone to process.

That's more about processes and scale. I have three operations people. I have no more than two or three of these warnings on any given day. Eyeballing Slack and reading our email is more than enough to keep track of it.

If they're just "for information" I wouldn't even send them to the alertmanager.

Slight tangent, but does anything else implement the alertmanager API?

I'm only aware of company internal things.


Or phrased another way, what other methods exist for getting this kind of stuff out of Prometheus? Just make a dashboard/console that runs the interesting queries?

Using PromQL is one option.
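For example, Prometheus exposes its own pending and firing alerts as the synthetic ALERTS series, so a console or dashboard can run a query along the lines of (label values are illustrative):

    ALERTS{alertstate="firing", severity="warning"}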



For the more ticket-type things, the question is more about what batch size the human works with. For example, if I'm processing them once a day I'd like them all in one notification so I can handle them in batch, as they are likely related problems with a single common cause. If each requires substantial unique engineering effort to deal with on varying timescales, then I might want one ticket each - and to look at redesigning the system not to produce so much operational work.

The receiver gets full details of every alert in the group. If you want to show more than the defaults (we only put a summary in things like Slack to avoid spamming your channels with 10-page notifications every 5 minutes, whereas email has everything), that can be configured with notification templating.

Yeah, I think we're definitely talking about a difference in scale. In my case I just don't have enough warnings to be overwhelming, but do have enough that I want to know about.

Also you say you want related things in one batch, but my point here is that these things look related, but aren't.

This is what the power of routing and grouping is designed for, so you can configure business rules like these.

Sam Zhao

May 7, 2021, 3:26:33 AM
to Prometheus Users
I have also run into this situation and need to disable grouping. I want to store alerts one by one in my database when a new alert triggers. With grouping, the alerts in a group get duplicated when one of them is resolved.

Julien Pivotto

May 7, 2021, 3:35:15 AM
to Sam Zhao, Prometheus Users
On 07 May 00:26, Sam Zhao wrote:
> I have also run into this situation and need to disable grouping. I want to
> store alerts one by one in my database when a new alert triggers. With
> grouping, the alerts in a group get duplicated when one of them is resolved.

You can disable grouping with `group_by: ['...']`.
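For example (the receiver name is a placeholder):

    route:
      receiver: my-receiver
      # The special '...' value groups by all labels, so every distinct label
      # set becomes its own group - effectively no grouping.
      group_by: ['...']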


--
Julien Pivotto
@roidelapluie