Is this alerting architecture crazy?

Tony Di Nucci

Nov 20, 2021, 5:02:47 AM
to Prometheus Developers
Cross-posted from https://discuss.prometheus.io/t/is-this-alerting-architecture-crazy/610

In relation to alerting, I’m looking for a way to get strong alert delivery guarantees (and if delivery is not possible I want to know about it quickly).

Unless I’m mistaken AlertManager only offers best-effort delivery. What’s puzzled me though is that I’ve not found anyone else speaking about this, so I worry I’m missing something obvious. Am I?

Assuming I’m not mistaken I’ve been thinking of building a system with the architecture shown below.

[Attached diagram: alertmanager-alertrouting.png]

Basically, rather than having AlertManager try to push to destinations, I'd have an AlertRouter which polls AlertManager. On each polling cycle the steps would be (neglecting any optimisations):

  • All active alerts are fetched from AlertManager (a sketch of the returned alert objects follows this list).
  • The last known set of active alerts is read from the Alert Event Store.
  • The set of active alerts is compared with the last known state.
  • New alerts are added to an “active” partition in the Alert Event Store.
  • Resolved alerts are removed from the “active” partition and added to a “resolved” partition.
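
For context on the first step, each entry returned by Alertmanager's GET /api/v2/alerts looks roughly like the following. This is shown as YAML rather than JSON for readability, and it is only an approximation of the v2 API fields; the label values are invented:

# One element of the array returned by GET /api/v2/alerts (approximate shape)
- labels:
    alertname: HighErrorRate
    severity: critical
    instance: api-01
  annotations:
    summary: "Error rate above 5% for 10 minutes"
  startsAt: "2021-11-20T10:00:00.000Z"
  endsAt: "2021-11-20T10:10:00.000Z"     # pushed forward while the alert keeps firing
  updatedAt: "2021-11-20T10:05:00.000Z"
  generatorURL: "http://prometheus.example:9090/graph"
  fingerprint: "a482fefc6c571d2c"        # stable hash of the label set, usable as a store key
  status:
    state: active                        # active | suppressed | unprocessed
    silencedBy: []
    inhibitedBy: []
  receivers:
    - name: team-pager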

A secondary process within AlertRouter would (see the record sketch after this list):

  • Check for alerts in the “active” partition which do not have a state of “delivered = true”.
  • Attempt to send each of these alerts and set the “delivered” flag.
  • Check for alerts in the “resolved” partition which do not have a state of “delivered = true”.
  • Attempt to send each of these resolved alerts and set the “delivered” flag.
  • Move all alerts in the “resolved” partition where “delivered=true” to a “completed” partition.
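
To make the partitions and the "delivered" flag concrete, a record in the Alert Event Store might look something like this. It is purely illustrative; every field name here is hypothetical and just mirrors the flow described above:

# Hypothetical Alert Event Store record, keyed by the Alertmanager fingerprint
fingerprint: "a482fefc6c571d2c"
partition: active                 # active -> resolved -> completed
labels:
  alertname: HighErrorRate
  severity: critical
first_seen_at: "2021-11-20T10:00:05Z"   # when AlertRouter first observed the alert
resolved_at: null
delivered: false                  # set by the secondary process after a successful send
delivery_attempts: 3
last_delivery_error: "POST to destination timed out"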

Among other metrics, the AlertRouter would emit one called "undelivered_alert_lowest_timestamp_in_seconds", which could be used to alert me to cases where an alert could not be delivered quickly enough. Since the alert is still held in the Alert Event Store, it should be possible for me to resolve whatever issue is blocking delivery without losing the alert.
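
Assuming that metric exposes the Unix timestamp of the oldest undelivered alert (and is only present while something is pending), a meta-alert over it could be as simple as this sketch:

groups:
  - name: alertrouter-meta
    rules:
      - alert: AlertDeliveryStalled
        # Oldest undelivered alert has been sitting in the Alert Event Store
        # for more than five minutes.
        expr: time() - undelivered_alert_lowest_timestamp_in_seconds > 300
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "AlertRouter has been unable to deliver at least one alert for over 5 minutes"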

I think there are other benefits to this architecture too, e.g. similar to the way Prometheus scrapes, natural back-pressure is a property of the system.

Anyway, as mentioned I’ve not found anyone else doing something like this and this makes me wonder if there’s a very good reason not to. If anyone knows that this design is crazy I’d love to hear!

Thanks

Ben Kochie

Nov 20, 2021, 5:29:55 AM
to Tony Di Nucci, Prometheus Developers
What gives you the impression that the Alertmanager is "best effort"?

The alertmanager provides a reasonably robust HA solution (gossip clustering). The only thing best-effort here is actually deduplication. The Alertmanager design is "at least once" delivery, so it's robust against network split-brain issues. So in the event of a failure, you may get duplicate alerts, not none.

When it comes to delivery, the Alertmanager does have retries. If a connection to PagerDuty or another receiver has an issue, it will retry. There are also metrics for this, so you can alert on delivery failures via alternate channels.
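
For example, a rule along these lines (the metric names are the ones Alertmanager already exposes; the threshold and the extra routing label are only illustrative) fires when an integration starts failing and can be routed to a different receiver:

groups:
  - name: alertmanager-meta
    rules:
      - alert: AlertmanagerNotificationsFailing
        # Ratio of failed to attempted notifications per integration over 5 minutes
        expr: |
          rate(alertmanager_notifications_failed_total[5m])
            / rate(alertmanager_notifications_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
          channel: fallback        # match on this label to route via an alternate receiver
        annotations:
          summary: "Alertmanager is failing to deliver notifications via {{ $labels.integration }}"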

What you likely need is a heartbeat setup. Because services like PagerDuty and Slack do have outages, you can't guarantee delivery if they're down.

The method here is to have an end-to-end "always firing heartbeat" alert, which goes to a system/service like healthchecks.io or deadmanssnitch.com. These will trigger an alert in the absence of your heartbeat, letting you know that some part of the pipeline has failed.
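
The usual pattern is an always-firing rule like the sketch below (the alert name and labels are just conventions, e.g. kube-prometheus ships a similar "Watchdog" alert), routed to the external heartbeat service with a short repeat_interval:

groups:
  - name: heartbeat
    rules:
      - alert: Watchdog
        # Always firing; route this to healthchecks.io / Dead Man's Snitch,
        # which alerts you if the heartbeat ever stops arriving.
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "End-to-end heartbeat for the Prometheus -> Alertmanager -> receiver pipeline"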


Ben Kochie

Nov 20, 2021, 5:33:12 AM
to Tony Di Nucci, Prometheus Developers
Also, the alertmanager does have an "event store"; it's shared state between all instances.

If you're interested in changing some of the behavior of the retry mechanisms or how this works, feel free to open specific issues. You don't need to build an entirely new system, we can add new features to the existing Alertmanager clustering framework.

Tony Di Nucci

Nov 20, 2021, 6:08:58 AM
to Prometheus Developers
Thanks for the feedback.


> What gives you the impression that the Alertmanager is "best effort"?
Sorry, best-effort probably wasn't the right term to use.  I am aware that there are retries, however these could still all fail, and I suspect I wouldn't be made aware of the issue for potentially quite a long time.

My understanding is that the alertmanager_notification_requests_failed_total counter is incremented on each failed send attempt, however from this alone I can't tell the difference between a single alert that's consistently failing and a small number of alerts which are all failing.  I think this means I've got to wait until alertmanager_notifications_failed_total is incremented before considering an alert to have failed (and this can take many minutes), and then some exploration is needed to figure out which alert(s) failed.  Depending on the criticality of the alert it may be fine for it to take some minutes before we're made aware of a delivery problem; in other cases it won't be.

A couple of things I didn't really touch on originally which will also help explain where my head is:
* I have a requirement to be able to measure accurate latency per alert through the alerting pipeline, i.e. for each alert I need to know the amount of time it was known to AlertManager before it was successfully written to the destination.
* I have a requirement to be able to analyse historic alerts.

Tony Di Nucci

Nov 20, 2021, 6:28:30 AM
to Prometheus Developers
There are other things I need to do as well (alert enrichment, complex routing, etc.), which means I think some additional system is needed between AlertManager and the final destination in any case.

The main question in my mind is really: are there reasons why I should prefer to have AlertManager push to this new system over having this new system pull?

My reasons for preferring a pull based architecture are:
* Just by looking at the AlertRouter we can get a reasonable understanding of overall health.  If alerts are pushed to the router then it alone can't tell the difference between no alerts firing and it not receiving alerts that have fired.
* Backpressure is a natural property of the system.

With this extra context, what do you think?

Stuart Clark

Nov 20, 2021, 12:38:06 PM
to Tony Di Nucci, Prometheus Developers
It sounds like you are planning on creating a fairly complex system that duplicates a reasonable amount of what Alertmanager already does. I'm presuming your diagram is a simplification and that the application is itself a cluster, so each instance would be querying each instance of Alertmanager? Would your storage be part of the clustering system (similar to Alertmanager) or another cluster of something like a relational database?
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Tony Di Nucci

Nov 20, 2021, 6:42:42 PM
to Prometheus Developers
Yes, the diagram is a bit of a simplification but not hugely.

There may be multiple instances of AlertRouter, however they will share a database.  Most likely things will be kept simple (at least initially), where each instance holds no state of its own.  Each active alert in the DB will be uniquely identified by the alert fingerprint (which the AlertManager API provides, i.e. a hash of the alert's label set).  Each non-active alert will have a composite key (where one element is the alert fingerprint).

In this architecture I see AlertManager having the responsibilities of capturing, grouping, inhibiting and silencing alerts.  The AlertRouter will have the responsibilities of enriching alerts, routing based on business rules, monitoring/guaranteeing delivery, and enabling analysis of alert history.

Due to my requirements, I think I need something like the AlertRouter.  The question is really: am I better off pushing from AlertManager to AlertRouter, or having AlertRouter pull from AlertManager?  My current opinion is that pulling comes with more benefits, but since I've not seen anyone else doing this I'm concerned there could be good reasons (that I'm not aware of) for not doing it.

Stuart Clark

Nov 21, 2021, 5:28:49 PM
to Tony Di Nucci, Prometheus Developers
On 20/11/2021 23:42, Tony Di Nucci wrote:
> Yes, the diagram is a bit of a simplification but not hugely.
>
> There may be multiple instances of AlertRouter however they will share
> a database.  Most likely things will be kept simple (at least
> initially) where each instance holds no state of its own.  Each active
> alert in the DB will be uniquely identified by the alert fingerprint
> (which the AlertManager API provides, i.e. a hash of the alert groups
> labels).  Each non-active alert will have a composite key (where one
> element is the alert group fingerprint).
>
> In this architecture I see AlertManager having the responsibilities of
> capturing, grouping, inhibiting and silencing alerts.  The AlertRouter
> will have the responsibilities of; enriching alerts, routing based on
> business rules, monitoring/guaranteeing delivery and enabling analysis
> of alert history.
>
> Due to my requirements, I think I need something like the
> AlertRouter.  The question is really, am I better to push from
> AlertManager to AlertRouter, or to have AlertRouter pull from
> AlertManager.  My current opinion is that pulling comes with more
> benefits but since I've not seen anyone else doing this I'm concerned
> there could be good reasons (I'm not aware of) for not doing this.

If you really must have another system connected to Alertmanager, having
it respond to webhook notifications would be the much simpler option.
You'd still need to run multiple copies of your application behind a load
balancer (and have a clustered database) for HA, but at least you'd not
have the complexity of each instance having to discover all the
Alertmanager instances, query them and then deduplicate amongst the
different instances (again something that Alertmanager does itself already).
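
For what it's worth, the wiring on the Alertmanager side is just a webhook receiver in the config, something like this (the URL and receiver name are made up):

# Alertmanager config fragment: deliver notifications to the custom application
route:
  receiver: alertrouter
receivers:
  - name: alertrouter
    webhook_configs:
      - url: http://alertrouter.example:8080/api/v1/alerts
        send_resolved: true      # also notify when alerts resolve
        max_alerts: 0            # 0 = include all alerts of the group in one notification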

I'm still struggling to see why you need an extra system at all - it
feels very much like you'd be increasing complexity significantly, which
naturally decreases reliability (more bits to break, have bugs or act in
unexpected ways) and slows things down (as there is another "hop" for an
alert to pass through). All of the things you mention can be done
already through Alertmanager, or could be done pretty simply with a
webhook receiver (without the need for any additional state storage, etc.):

* Adding data to an alert could be done with a simple webhook receiver
that accepts an alert and then forwards it on to another API with extra
information added (no need for any state)
* Routing can be done within Alertmanager, or for more complex cases
could again be handled by a stateless webhook receiver (see the route
sketch after this list)
* With regards to "guaranteeing" delivery, I don't see how your suggestion
achieves that (I believe it would actually make delivery less likely overall
due to the added complexity and likelihood of bugs/unhandled cases).
Alertmanager already does a good job of retrying on errors (and updating
metrics if that happens), but not much can be done if the final system is
totally down for long periods of time (and for many systems, if that
happens, old alerts aren't very useful once it is back, as they may have
already resolved).
* Alertmanager and Prometheus already expose a number of useful metrics
(make sure your Prometheus is scraping itself & all the connected
Alertmanagers) which should give you lots of useful information about
alert history (with the advantage of that data being with the monitoring
system you already know [with whatever you have connected like
dashboards, alerts, etc.])
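
As a reference for the routing point above, this is the sort of thing the Alertmanager routing tree can already express (the teams, labels and receiver names are invented for the example):

# Illustrative Alertmanager routing tree: label matchers pick the receiver
route:
  receiver: default-email
  group_by: ['alertname', 'cluster']
  routes:
    - match:
        severity: critical
        team: payments
      receiver: payments-pager
    - match_re:
        severity: warning|info
      receiver: slack-notifications
receivers:
  - name: default-email
    email_configs:
      - to: oncall@example.com
  - name: payments-pager
    pagerduty_configs:
      - service_key: REPLACE_ME
  - name: slack-notifications
    slack_configs:
      - channel: '#alerts'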

--
Stuart Clark

Tony Di Nucci

Nov 22, 2021, 10:03:20 AM
to Prometheus Developers
Thanks for the feedback Stuart, I really appreciate you taking the time and you've given me reason to pause and reconsider my options.

I fully understand your concerns over having a new data store.  I'm not sure that AlertManager and Prometheus contain the state I need though and I'm not sure I should attempt to use Prometheus as the store for this state (tracking per alert latencies would end up with a metric with unbounded cardinality, each series would just contain a single data point and it would be tricky to analyse this data).

On the "guaranteeing" delivery front.  You of course have a point that the more moving parts there are the more that can go wrong.  From the sounds of things though I don't think we're debating the need for another system (since this is what a webhook receiver would be?).  

Unless I'm mistaken, to hit the following requirements there'll need to be a system external to AlertManager, and it will have to maintain some state:
* supporting complex alert enrichment (in ways that cannot be defined in alerting rules)
* support business specific alert routing rules (which are defined outside of alerting rules)
* support detailed alert analysis (which includes per alert latencies)

I think this means that the question is limited to: is it better in my case to push or pull from AlertManager?  BTW, I'm sorry for the way I worded my original post because I now realise how important it was to make explicit the requirements that (I think) necessitate the majority of the complexity.

As I still see it, the problems with the push approach (which are not present with the pull approach) are:
* It's only possible to know that an alert cannot be delivered after waiting for group_interval (typically many minutes)
* At a given moment it's not possible to determine whether a specific active alert has been delivered (at least I'm not aware of a way to determine this)
* It is possible for alerts to be dropped (e.g. https://github.com/prometheus/alertmanager/blob/b2a4cacb95dfcf1cc2622c59983de620162f360b/cluster/delegate.go#L277)

The tradeoffs for this are:
* I'd need to discover the AlertManager instances.  This is pretty straightforward in k8s.
* I may need to dedupe alert groups across AlertManager instances.  I think this would be pretty straightforward too, especially since AlertManager already populates fingerprints.

Ben Kochie

Nov 22, 2021, 10:29:08 AM
to Tony Di Nucci, Prometheus Developers
On Mon, Nov 22, 2021 at 4:03 PM Tony Di Nucci <tonyd...@gmail.com> wrote:
> Thanks for the feedback Stuart, I really appreciate you taking the time and you've given me reason to pause and reconsider my options.
>
> I fully understand your concerns over having a new data store.  I'm not sure that AlertManager and Prometheus contain the state I need though, and I'm not sure I should attempt to use Prometheus as the store for this state (tracking per alert latencies would end up with a metric with unbounded cardinality, each series would just contain a single data point and it would be tricky to analyse this data).
>
> On the "guaranteeing" delivery front, you of course have a point that the more moving parts there are the more that can go wrong.  From the sounds of things though, I don't think we're debating the need for another system (since this is what a webhook receiver would be?).
>
> Unless I'm mistaken, to hit the following requirements there'll need to be a system external to AlertManager, and it will have to maintain some state:
> * supporting complex alert enrichment (in ways that cannot be defined in alerting rules)

We actually are interested in adding this to the alertmanager; there are a few open proposals for it. Basically the idea is that you can make an enrichment call at alert time to do things like grab metrics/dashboard snapshots, other system state, etc.
 
> * support business specific alert routing rules (which are defined outside of alerting rules)

The alertmanager routing rules are pretty powerful already. Depending on what you're interested in adding, this is something we could support directly.
 
> * support detailed alert analysis (which includes per alert latencies)

This is, IMO, more of a logging problem, and I think it's something we could add: you ship the alert notifications to any kind of BI system you like, ELK, etc.

Maybe something to integrate into https://github.com/yakshaving-art/alertsnitch.
 

> I think this means that the question is limited to: is it better in my case to push or pull from AlertManager?  BTW, I'm sorry for the way I worded my original post because I now realise how important it was to make explicit the requirements that (I think) necessitate the majority of the complexity.

Honestly, most of what you want is stuff we could support in Alertmanager without a lot of trouble, and these are things that other users would want as well. Rather than build a whole new system, why not contribute improvements directly to Alertmanager?
 

Tony Di Nucci

Nov 22, 2021, 11:01:46 AM
to Prometheus Developers
> Honestly, most of what you want is stuff we could support in Alertmanager without a lot of trouble, and these are things that other users would want as well. Rather than build a whole new system, why not contribute improvements directly to Alertmanager?

That's a very good point and something I think would be great to do.  Something I will have to keep in mind though is how things may play out in the world of hosted "Prometheus" solutions - if I were to go with one of these solutions then I'd have no control over when new features would be made available.

FWIW, the custom routing that I'm talking about is very business specific and involves consulting (yet another!) system to determine the final alert severity and where it gets routed to.  I guess this could be supported in AlertManager (by having hooks or plugins); whether the maintainers of AM would like this will obviously be its own question.

I'll discuss this with my colleagues to see whether we can consider contributing to AlertManager.

Thanks for the help!