Should Alertmanager be more tolerant of templating errors?

George Robinson

Feb 8, 2023, 8:56:23 AM
to Prometheus Developers
Hello!

This is somewhere between a feature request and a number of questions to help me understand some of the design decisions made in Alertmanager.

When Alertmanager cannot expand a template, for example because the operator has made a mistake in the template:

receivers:
- name: test
  email_configs:
  - to: exa...@example.com
    from: nor...@example.com
    smarthost: 127.0.0.1:8585
    require_tls: false
    text: "{{ $labels.foo }}"
route:
  receiver: test
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 1m

 
it logs an error similar to the following:

ts=2023-02-07T13:28:04.815Z caller=dispatch.go:352 level=error component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="test/email[0]: notify retry canceled due to unrecoverable error after 1 attempts: execute text template: template: :1: undefined variable \"$labels\""

I understand that, following this error, Alertmanager will begin the retry stage of the notification until the next group_interval or repeat_interval. To fix the issue, the user must correct the template and reload Alertmanager.

However, it seems to me that it's not uncommon to have quite complex templates, with for loops, if statements, and sub-templates. It can be quite difficult to verify the correctness of these templates at "compile-time", and if using amtool, you need to test all possible branches in the template.
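
To illustrate why "compile-time" checks only go so far, here is a minimal sketch using Go's text/template package directly. The alertData type and the checkTemplate helper are made up for this example and are not Alertmanager's real data model or API. An undefined variable such as $labels is rejected by the parser, but a mistake inside an if branch only shows up when that branch actually runs:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// alertData is a hypothetical, simplified stand-in for notification data;
// it is not Alertmanager's real template data type.
type alertData struct {
	Status       string
	CommonLabels map[string]string
}

// checkTemplate parses src and executes it against data, reporting which
// stage failed. The rendered output is discarded; only the error matters.
func checkTemplate(src string, data interface{}) error {
	t, err := template.New("").Parse(src)
	if err != nil {
		return fmt.Errorf("parse: %w", err)
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, data); err != nil {
		return fmt.Errorf("execute: %w", err)
	}
	return nil
}

func main() {
	firing := alertData{Status: "firing", CommonLabels: map[string]string{"foo": "bar"}}
	resolved := alertData{Status: "resolved"}

	// Undefined variable: rejected by the parser before any data is seen.
	fmt.Println(checkTemplate(`{{ $labels.foo }}`, firing))

	// Bad field inside a branch: parses fine and fails only when the branch runs.
	branchy := `{{ if eq .Status "firing" }}{{ .NoSuchField }}{{ end }}`
	fmt.Println(checkTemplate(branchy, resolved)) // branch not taken, no error
	fmt.Println(checkTemplate(branchy, firing))   // branch taken, execution error
}
```

So a template can pass a test with one set of alerts and still fail in production with another, which is the "all possible branches" problem.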

While I appreciate the responsibility of writing correct templates is on the user, I have also been considering whether Alertmanager should be more tolerant of template errors, and attempt to send some kind of notification when this happens. For example, falling back to the default template that we have high confidence of being correct.

However, before discussing the issue further, I would like to first understand whether there is a conscious design choice behind how Alertmanager operates under such failures, or whether it came to be perhaps due to ease of implementation.

Thank you, and I'm very interested to hear your opinions.

Kind regards,

George

Bjoern Rabenstein

Feb 9, 2023, 12:44:50 PM
to George Robinson, Prometheus Developers
On 07.02.23 05:57, 'George Robinson' via Prometheus Developers wrote:
>
> While I appreciate the responsibility of writing correct templates is on
> the user, I have also been considering whether Alertmanager should be more
> tolerant of template errors, and attempt to send some kind of notification
> when this happens. For example, falling back to the default template that
> we have high confidence of being correct.

I think that makes sense. The fall-back template could call out very
explicitly that the intended template failed to expand and therefore
you get a replacement, maybe even with the error message of the
attempt to expand the original template.
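
A rough sketch of that idea in Go, assuming plain text/template: the notification type, fallbackSrc, and renderWithFallback are hypothetical names for illustration, and the fallback text is not Alertmanager's actual default template.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// notification is a hypothetical, simplified stand-in for what a notifier
// would pass to a template.
type notification struct {
	Receiver string
	Alerts   []map[string]string
}

// fallbackSrc is an illustrative "safe" template that calls out the failure.
const fallbackSrc = "NOTE: the configured template failed to expand: {{ .Err }}\n" +
	"Receiver: {{ .Receiver }}\nAlerts: {{ .NumAlerts }}\n"

// render parses and executes src against data in one step.
func render(name, src string, data interface{}) (string, error) {
	t, err := template.New(name).Parse(src)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

// renderWithFallback tries the user's template first. If it fails, it renders
// the fallback instead and embeds the original error message in it.
func renderWithFallback(src string, n notification) (text string, usedFallback bool) {
	out, err := render("custom", src, n)
	if err == nil {
		return out, false
	}
	fb := struct {
		Err       string
		Receiver  string
		NumAlerts int
	}{err.Error(), n.Receiver, len(n.Alerts)}
	out, ferr := render("fallback", fallbackSrc, fb)
	if ferr != nil {
		// Last resort: never drop the notification, even if the fallback breaks.
		return fmt.Sprintf("template error: %v (fallback failed: %v)", err, ferr), true
	}
	return out, true
}

func main() {
	n := notification{Receiver: "test", Alerts: []map[string]string{{"alertname": "HighCPU"}}}
	text, used := renderWithFallback(`{{ $labels.foo }}`, n)
	fmt.Printf("used fallback: %v\n%s", used, text)
}
```

The boolean return is there so the caller could log the failure or count it in a metric while still sending something.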

But I'm not really an Alertmanager expert. And despite having a lot
of historical context about Prometheus in general, I don't remember
anything specific about error handling in alert templates.

I only remember that trying out an alert "in production" is really
hard since you need to trigger it. And if the moment you notice that
your template doesn't work is also the moment when your alert is
supposed to fire, that's really bad.

So better test tooling might help here, but even if we had that, I
think there should be a safe fall-back so that no alert is ever
swallowed because of a templating error.

--
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email] bjo...@rabenste.in

Matthias Rampke

Feb 9, 2023, 12:48:12 PM
to Bjoern Rabenstein, George Robinson, Prometheus Developers
I agree that silently sending *no* alert is the worst possible outcome. I wonder what would be "nicer" in case a template fails - send the alert with the fields that did not fail to render (possibly render the error *into* the fields that failed to make it very obvious?), or (as proposed) fall back to a "safe" template?
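
The first option could look roughly like the sketch below, again with plain text/template. renderFields and the field names are hypothetical and only meant to show the shape of the idea: each field is rendered independently, and a field that fails gets the error written into it instead of aborting the whole notification.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// renderFields renders every field template on its own. A failing field is
// replaced by a visible error marker; the other fields are unaffected.
func renderFields(fields map[string]string, data interface{}) map[string]string {
	out := make(map[string]string, len(fields))
	for name, src := range fields {
		t, err := template.New(name).Parse(src)
		if err != nil {
			out[name] = fmt.Sprintf("<error rendering %s: %v>", name, err)
			continue
		}
		var buf bytes.Buffer
		if err := t.Execute(&buf, data); err != nil {
			out[name] = fmt.Sprintf("<error rendering %s: %v>", name, err)
			continue
		}
		out[name] = buf.String()
	}
	return out
}

func main() {
	fields := map[string]string{
		"subject": `Alert {{ .Name }}`,
		"text":    `{{ $labels.foo }}`, // broken on purpose
	}
	out := renderFields(fields, map[string]string{"Name": "HighCPU"})
	fmt.Println("subject:", out["subject"])
	fmt.Println("text:   ", out["text"])
}
```

One trade-off compared with a single "safe" template: the error text lands in whatever field failed, which may have length or format limits depending on the receiver.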

/MR


George Robinson

Feb 20, 2023, 10:35:17 AM
to Prometheus Developers
I wonder if it would also be possible to hear Julian's perspective on this. I can bring the topic to the dev summit on Thursday?