Alertmanager and PagerDuty are not best friends


tsim...@digitalocean.com

Aug 29, 2018, 6:10:40 PM
to Prometheus Users
I'm writing to tell a sad story. There's nothing really to be done, and I think Alertmanager is doing the right thing, but it should be noted that PagerDuty's behavior when working with Alertmanager is not what I (at least) expected.

To describe it as simply as I can:
- You set up a PagerDuty Service and create an integration of the "Events v2 API" type, as that is the latest and greatest thing (in AM and PD).
- You configure Alertmanager to group on alertname, so you get one notification for many "InstanceDown"s, for example.
- A Prometheus alert like "InstanceDown" fires because one instance goes down.
- A PagerDuty incident is triggered.
- 2 minutes later, a second instance goes down, and another message is sent to PagerDuty.
- That message is deduplicated into the first "alert" inside the PagerDuty "incident". Unless you go very deep in the PagerDuty UI/API, you will never know that the second alert was delivered. You will not be re-paged, and you will not see it when you look at the incident.
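
To make the sequence above concrete, here is a toy model of the receiving side. It is not PagerDuty's implementation, just a sketch of the behavior described in this post, and it assumes that both notifications for the same alert group carry the same dedup key. The class name and the key string are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPagerDuty:
    """Toy model: one open alert per dedup key; repeats are folded in silently."""
    open_alerts: dict = field(default_factory=dict)
    pages_sent: int = 0

    def trigger(self, dedup_key: str, summary: str) -> str:
        if dedup_key in self.open_alerts:
            # Same key as an open alert: recorded, but nobody is paged again.
            self.open_alerts[dedup_key].append(summary)
            return "deduplicated"
        self.open_alerts[dedup_key] = [summary]
        self.pages_sent += 1
        return "paged"


pd = ToyPagerDuty()
group_key = "alertname=InstanceDown"  # hypothetical key for the alert group

first = pd.trigger(group_key, "1 instance down: web-1")
second = pd.trigger(group_key, "2 instances down: web-1, web-2")
print(first, second, pd.pages_sent)
```

The second notification carries new context (web-2), but because the key is unchanged it never produces a second page.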

What I would have expected is that, like a Slack or email alert that is redelivered when its context changes, a second alert (with new context) would become a second "alert" inside the PagerDuty "incident". This is not the case, and in fact it does not seem possible to make it so.

You can reproduce the behavior with a setup here:

The reason seems to be that PagerDuty's Events v2 API uses a "dedup_key" that is meant to deduplicate alerts (duh). However, it provides no additional mechanism (an incident key, perhaps?) for attaching additional context (via another alert) to an existing incident. They have some rudimentary features like "group all alerts within 5 minutes from an integration into one incident", or perhaps some more expensive "AI-based" grouping method you can pay for.
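
For reference, a minimal sketch of what a trigger event for the Events v2 API looks like. The field names (routing_key, event_action, dedup_key, payload.summary, payload.source, payload.severity) are the ones the API documents; the helper function, the placeholder routing key, and the sample values are assumptions for illustration. Nothing is sent over the network here.

```python
import json


def build_trigger_event(routing_key, dedup_key, summary, source, severity="critical"):
    """Build (but do not send) an Events v2 trigger event as a plain dict."""
    return {
        "routing_key": routing_key,   # integration key of the PagerDuty service
        "event_action": "trigger",
        "dedup_key": dedup_key,       # events sharing this key are deduplicated
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,     # one of: critical, error, warning, info
        },
    }


# Hypothetical values: two notifications for the same alert group.
event_1 = build_trigger_event("ROUTING_KEY_PLACEHOLDER", "instance-down-group",
                              "InstanceDown: web-1", "alertmanager")
event_2 = build_trigger_event("ROUTING_KEY_PLACEHOLDER", "instance-down-group",
                              "InstanceDown: web-1, web-2", "alertmanager")

print(json.dumps(event_2, indent=2))
```

Both events share a dedup_key, and there is no separate field in the event to say "this is a new alert for the same incident".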

I think this is not a big deal for most, because their flow for working an incident would be:
- get the page
- work it from alertmanager, or slack, or email, or whatever

But there are certainly some people with little knowledge of Prometheus/Alertmanager but familiarity with PagerDuty who would like to work that incident from PagerDuty.
Unfortunately, in this type of case I don't see how that would ever work well. Perhaps you could add specialized routing/regrouping for the alerts/severities you know will go to PagerDuty, so that N instances create N incidents. But that's not super scalable in terms of AM YAML.
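
A rough sketch of why regrouping would give N incidents for N instances, assuming each Alertmanager alert group ends up as one dedup key and therefore one PagerDuty incident. The grouping function below is a simplified stand-in for Alertmanager's group_by, not its actual code, and the label values are made up.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels listed in group_by."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple((name, labels.get(name, "")) for name in group_by)
        groups[key].append(labels)
    return dict(groups)


alerts = [
    {"alertname": "InstanceDown", "instance": "web-1"},
    {"alertname": "InstanceDown", "instance": "web-2"},
]

by_name = group_alerts(alerts, ["alertname"])
by_instance = group_alerts(alerts, ["alertname", "instance"])
print(len(by_name), len(by_instance))  # 1 2
```

Grouping on alertname alone yields one group (one incident, later alerts hidden); adding instance yields one group per instance, at the cost of more pages and a dedicated route for everything bound for PagerDuty.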

Anyway, I just wanted to share this, in case someday someone happens upon it via a Google search and they don't have to go down the same sad rabbit hole I did.

If I've got it mangled and someone can correct my mistake, I'd be thrilled to hear about it!

Thanks,
Tim Simmons
Engineer - Observability
DigitalOcean

elutf...@maestrohealth.com

Mar 7, 2019, 11:48:51 AM
to Prometheus Users
We've been hitting the same wall.

People question why we would need to use PD when there's Slack and email. I'm guessing those people don't need to be woken up at 3am to start work right away when something breaks, or they don't have a team structure that needs an on-call rotation and escalation paths.

Did you ever figure out a solution before we chase you down the rabbit hole? We don't necessarily need to work the incident from PD. It's the on-call scheduling and notification options that draw us there.

I wonder if it works better with AlertOps, OpsGenie, VictorOps, or something else? All I really care about is that I will get woken up when I need to, and optionally a ServiceNow integration that will open an INC for me. 

Simon Pasquier

Mar 8, 2019, 5:14:02 AM
to elutf...@maestrohealth.com, Prometheus Users
This looks similar to https://github.com/prometheus/alertmanager/issues/1587
PagerDuty was supposed to enhance its support for Alertmanager, so you might want to ping them.