Customizing Alertmanager Notifications for Telegram


bashar madani

Oct 23, 2024, 11:26:30 AM
to Prometheus Users

Hi everyone! 👋

I'm setting up custom notifications in Alertmanager for a Telegram receiver, and I want to send different messages when an alert is FIRING and when it is RESOLVED.

The issue I’m facing is that Alertmanager keeps repeating the FIRING message even after the issue is resolved. I want to ensure that only the RESOLVED message is sent when the problem is fixed.

Here’s my current configuration:
global:
  resolve_timeout: 5m

route:
  receiver: telegram_receiver
  group_by: ["alertname", "Host"]
  group_wait: 1s
  group_interval: 1s
  repeat_interval: 24h

  routes:
  - receiver: 'telegram_receiver'
    matchers:
    - severity="Critical"

receivers:
- name: 'telegram_receiver'
  telegram_configs:
  - api_url: 'https://api.telegram.org'
    send_resolved: true
    bot_token: xxxxx
    chat_id: ttttttttttttt
    message: '{{ range .Alerts }}Alert: {{ printf "%s\n" .Labels.alertname }}{{ printf "%s\n" .Annotations.summary }}{{ printf "%s\n" .Annotations.description }}{{ end }}'
    parse_mode: 'HTML'
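
For what it's worth, this is roughly the shape I'm aiming for with the message. I'm assuming the template data's .Status field ("firing" or "resolved") can be used to branch, but I haven't verified that this renders the way I want:

    # Hypothetical sketch: different text depending on whether the group is firing or resolved.
    message: |-
      {{ if eq .Status "firing" }}FIRING{{ else }}RESOLVED{{ end }}
      {{ range .Alerts }}Alert: {{ .Labels.alertname }}
      {{ .Annotations.summary }}
      {{ .Annotations.description }}
      {{ end }}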



  Does anyone have examples or best practices to share?  

Brian Candler

Oct 24, 2024, 9:25:19 AM
to Prometheus Users
On Wednesday 23 October 2024 at 16:26:30 UTC+1 bashar madani wrote:
The issue I’m facing is that Alertmanager keeps repeating the FIRING message even after the issue is resolved. I want to ensure that only the RESOLVED message is sent when the problem is fixed.

If you have a group of alerts, and some of them are resolved, then you'll get a new [FIRING] message with the smaller set of alerts. That's because, clearly, at least one is still firing. You'll only get [RESOLVED] when the last alert in the group has stopped firing.

If you want, you can disable grouping entirely, and then each alert will individually generate its own notifications (firing and resolved). But that could mean many more notifications if there are lots of similar alerts which would normally be grouped together.


# To aggregate by all possible labels use the special value '...' as the sole label name, for example:
# group_by: ['...']
# This effectively disables aggregation entirely, passing through all
# alerts as-is. This is unlikely to be what you want, unless you have
# a very low alert volume or your upstream notification system performs
# its own grouping.
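
Adapted to your config, disabling grouping would look something like this (untested sketch):

route:
  receiver: telegram_receiver
  # '...' is the special value that groups by every label, which effectively
  # passes each distinct alert through on its own, firing and resolved.
  group_by: ['...']
  repeat_interval: 24h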

 
  Does anyone have examples or best practices to share?  

Personally, I'd say the best practice with resolved messages is *not to send them at all* (send_resolved: false). For an explanation see:
https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit

If something was worth alerting on then it's worth investigating: even if the alert condition is no longer present, it clearly was earlier. Just saying "oh look, it's gone away, never mind" doesn't help you understand or fix the problem (with the system and/or with the alert itself). Seriously: turning off resolved messages is great. At the very least, it reduces your notification volume by 50%.
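
In your receiver that would just be something like (keeping the rest of your settings as they are):

receivers:
- name: 'telegram_receiver'
  telegram_configs:
  - bot_token: xxxxx
    chat_id: ttttttttttttt
    # Only notify when alerts fire; stay quiet when they resolve.
    send_resolved: false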

Chris Siebenmann

Oct 24, 2024, 11:13:06 AM
to Brian Candler, Prometheus Users, Chris Siebenmann
> If something was worth alerting on then it's worth investigating: even if
> the alert condition is no longer present, it clearly was earlier. Just
> saying "oh look, it's gone away, never mind" is not helping to understand
> or fix the problem (with the system and/or with the alert itself).
> Seriously: turning off resolved messages is great. At very least, it
> reduces your notification volume by 50%.

As a counterpoint: we send resolved alerts so that we can know when a
problem stopped as well as when it started (which helps for diagnosis),
and so that we can tell whether a problem is still happening *right now*,
which would make it more urgent in our environment and change our response.

If a machine is down right now, we need to go get it back up. If a
machine went down and then came back up, we need to investigate why,
which involves a fairly different set of activities.

(But we're not a 24/7 operation where people are paged if something is
down; we're a university department running physical servers on a more
or less 8/5 basis.)

- cks

Brian Candler

Oct 24, 2024, 12:28:53 PM
to Prometheus Users
On Thursday 24 October 2024 at 16:13:06 UTC+1 Chris Siebenmann wrote:
As a counterpoint: we send resolved alerts so that we can know when a
problem stopped as well as when it started (which helps for diagnosis)

Fair enough, although I will mention that historical alert information is also available via metrics such as ALERTS and alertmanager_alerts.