In relation to alerting, I’m looking for a way to get strong alert delivery guarantees (and if delivery is not possible I want to know about it quickly).
Unless I'm mistaken, AlertManager only offers best-effort delivery. What's puzzled me, though, is that I've not found anyone else talking about this, so I worry I'm missing something obvious. Am I?
Assuming I'm not mistaken, I've been thinking of building a system with the architecture described below.

Basically, rather than having AlertManager try to push to destinations, I'd have an AlertRouter that polls AlertManager. On each polling cycle the steps would be (neglecting any optimisations):
A secondary process within AlertRouter would:
Among other metrics, the AlertRouter would emit one called "undelivered_alert_lowest_timestamp_in_seconds", which could be used to alert me whenever any alert could not be delivered quickly enough. Since the alert would still be held in the Alert Event Store, it should be possible for me to resolve whatever issue is blocking delivery without losing the alert.
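To make that a bit more concrete, here's a rough sketch (in Go, purely illustrative) of the polling cycle and the metric I described. The only thing taken from Alertmanager itself is the GET /api/v2/alerts endpoint; the deliver() stub, the in-memory map standing in for the Alert Event Store, and the URLs/ports are all assumptions I've made up for the example:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Minimal view of an alert as returned by Alertmanager's GET /api/v2/alerts.
type amAlert struct {
	Fingerprint string            `json:"fingerprint"`
	Labels      map[string]string `json:"labels"`
	StartsAt    time.Time         `json:"startsAt"`
}

// The metric described above: Unix timestamp of the oldest alert that has
// not yet been delivered (0 when everything has been delivered).
var undeliveredOldest = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "undelivered_alert_lowest_timestamp_in_seconds",
	Help: "Unix timestamp of the oldest undelivered alert (0 when none).",
})

// delivered is an in-memory stand-in for the Alert Event Store's
// "already delivered" state; the real thing would be durable.
var delivered = map[string]bool{}

// deliver is a stub for pushing one alert to its destination.
func deliver(a amAlert) error {
	log.Printf("delivering alert %s %v", a.Fingerprint, a.Labels)
	return nil
}

// pollOnce runs one polling cycle: fetch the current alerts, try to deliver
// any we haven't delivered yet, and update the gauge for whatever is stuck.
func pollOnce(amURL string) {
	resp, err := http.Get(amURL + "/api/v2/alerts")
	if err != nil {
		log.Printf("poll failed: %v", err)
		return
	}
	defer resp.Body.Close()

	var alerts []amAlert
	if err := json.NewDecoder(resp.Body).Decode(&alerts); err != nil {
		log.Printf("decode failed: %v", err)
		return
	}

	var oldest time.Time
	for _, a := range alerts {
		if delivered[a.Fingerprint] {
			continue
		}
		if err := deliver(a); err == nil {
			delivered[a.Fingerprint] = true
			continue
		}
		// Still undelivered: remember the oldest startsAt so we can alert on it.
		if oldest.IsZero() || a.StartsAt.Before(oldest) {
			oldest = a.StartsAt
		}
	}
	if oldest.IsZero() {
		undeliveredOldest.Set(0)
	} else {
		undeliveredOldest.Set(float64(oldest.Unix()))
	}
}

func main() {
	prometheus.MustRegister(undeliveredOldest)
	http.Handle("/metrics", promhttp.Handler())
	go http.ListenAndServe(":8080", nil) // expose the router's own metrics

	for {
		pollOnce("http://localhost:9093") // assumed Alertmanager address
		time.Sleep(30 * time.Second)
	}
}
```

With that in place, an alerting rule along the lines of time() - undelivered_alert_lowest_timestamp_in_seconds > 300 (guarded so it only fires when the gauge is non-zero) would tell me when something has been stuck for too long.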
I think there are other benefits to this architecture too, e.g. much like Prometheus's pull-based scraping, natural back-pressure is a property of the system.
Anyway, as mentioned, I've not found anyone else doing something like this, which makes me wonder if there's a very good reason not to. If anyone knows this design is crazy, I'd love to hear why!
Thanks
--
Thanks for the feedback, Stuart. I really appreciate you taking the time, and you've given me reason to pause and reconsider my options.
I fully understand your concerns over introducing a new data store. I'm not sure that AlertManager and Prometheus contain the state I need, though, and I'm not sure I should attempt to use Prometheus as the store for this state (tracking per-alert latencies would produce a metric with unbounded cardinality, each series would contain just a single data point, and it would be tricky to analyse this data).

On the "guaranteeing" delivery front, you of course have a point that the more moving parts there are, the more that can go wrong. From the sounds of things, though, I don't think we're debating the need for another system (since that is what a webhook receiver would be?). Unless I'm mistaken, to hit the following requirements there will need to be a system external to AlertManager, and it will have to maintain some state (a rough sketch of the kind of per-alert record I mean follows the list):
* support complex alert enrichment (in ways that cannot be defined in alerting rules)
* support business-specific alert routing rules (which are defined outside of alerting rules)
* support detailed alert analysis (which includes per-alert latencies)
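Just to make that concrete, this is roughly the kind of per-alert record I'd expect such a system to keep; the field names are illustrative placeholders, not a settled schema:

```go
package store

import "time"

// AlertEvent is an illustrative sketch of the per-alert state I think has to
// live outside AlertManager/Prometheus; nothing here is a settled design.
type AlertEvent struct {
	Fingerprint     string            // alert identity as reported by AlertManager
	Labels          map[string]string // original labels from the alerting rule
	Enrichment      map[string]string // extra context added outside alerting rules
	Route           string            // business-specific destination chosen for this alert
	FirstSeen       time.Time         // when the router first observed the alert
	DeliveredAt     time.Time         // zero until delivery succeeds
	DeliveryLatency time.Duration     // DeliveredAt minus FirstSeen, i.e. per-alert latency
}
```

Keeping the latency on the event record, rather than as one Prometheus series per alert, is what avoids the cardinality problem I mentioned above.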
I think this means the question is limited to: is it better in my case to push or pull from AlertManager? BTW, I'm sorry for the way I worded my original post, because I now realise how important it was to make explicit the requirements that (I think) necessitate the majority of the complexity.