I'm writing to tell a sad story. There's nothing really to be done, and I think Alertmanager is doing the right thing, but it should be noted that the behavior Pagerduty exhibits when working with Alertmanager is not what I (at least) expected.
To describe it as simply as I can:
- You set up a Pagerduty Service, and create an integration with the "Events v2 API" type, as that is the latest and greatest thing (in AM and PD).
- You configure Alertmanager to group on alertname, so you get one notification for many "InstanceDown"s, for example (a minimal config sketch follows this list).
- A Prometheus alert like "InstanceDown" fires because one instance goes down.
- A Pagerduty incident is triggered.
- 2 minutes later, a second instance goes down, and another event is sent to Pagerduty.
- That event is deduplicated into the first "alert" inside the Pagerduty "incident". Unless you go very deep in the Pagerduty UI/API, you will never know that second alert was delivered. You will not be re-paged, and you will not see it when you look at the incident.
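For concreteness, here's a minimal sketch of the kind of Alertmanager configuration I mean; the routing key is a placeholder, and the timing values are just reasonable defaults:

```yaml
route:
  receiver: pagerduty
  # Group firing alerts by alert name: many "InstanceDown"s become
  # one notification group.
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: pagerduty
    pagerduty_configs:
      # Using routing_key (rather than service_key) selects the Events v2 API.
      - routing_key: 'YOUR_EVENTS_V2_INTEGRATION_KEY'
```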
What I would have expected is that, like a Slack or email notification that is re-sent when the context changes, a second alert (with new context) would become a second "alert" inside the existing Pagerduty "incident". This is not the case, and in fact it does not seem possible to make it so.
You can reproduce the behavior with a setup here:
The reason seems to be that Pagerduty's Events v2 API uses a "dedup_key" that is meant to deduplicate alerts (duh). However, it provides no additional mechanism (providing an incident key?) for attaching additional context (via another alert) to an existing incident. They have some rudimentary features like "group all alerts within 5 minutes from an integration into one incident", or perhaps some more expensive "AI-based" grouping method you can pay for.
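To make the mechanics concrete: every notification Alertmanager sends for the same group carries the same dedup_key (it is derived from the group key), so from Pagerduty's side both events in my scenario look roughly like this. Shown as YAML for readability (the API itself takes JSON), and the field values are illustrative, not Alertmanager's exact payload:

```yaml
# Shape of an Events v2 "trigger" event.
routing_key: YOUR_EVENTS_V2_INTEGRATION_KEY
event_action: trigger
dedup_key: "c3a5b1e0..."   # identical for every event in the group,
                           # so event #2 dedups into alert #1
payload:
  summary: "[FIRING:2] InstanceDown"
  source: Alertmanager
  severity: critical
```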
I think this is not a big deal for most people, because their flow for working an incident would be:
- get the page
- work it from alertmanager, or slack, or email, or whatever
But there are certainly some people with little knowledge of Prometheus/Alertmanager, but familiarity with Pagerduty, who would like to work that incident from Pagerduty.
Unfortunately, in this type of case, I don't see how that would ever work well. Perhaps specialized routing/regrouping on alerts/severities you know will go to Pagerduty, to create N incidents for N instances, something like the sketch below. But that's not super scalable re: AM yaml.
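A rough sketch of that workaround, assuming a "severity: page" label is what routes to Pagerduty (the label and receiver names are just examples):

```yaml
route:
  receiver: default
  group_by: ['alertname']
  routes:
    # Regroup anything destined for Pagerduty by instance as well, so
    # each instance gets its own dedup_key and therefore its own incident.
    - match:
        severity: page
      receiver: pagerduty
      group_by: ['alertname', 'instance']
```

The obvious downside is that you have to maintain a regrouping like this for every label set you might want split out, which is exactly the yaml sprawl I'm complaining about.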
Anyway, I just wanted to share this, in case someday someone happens upon it via a Google search and doesn't have to go down the same sad rabbit hole I did.
If I've got it mangled and someone can correct my mistake, I'd be thrilled to hear about it!
Thanks,
Tim Simmons
Engineer - Observability
DigitalOcean