~Highish traffic, Widely distributed alert managers

Abel Simon

Oct 21, 2023, 5:23:26 PM

to Prometheus Users

Hi,

We are about to redesign our current observability topology, and I have just started looking into what an acceptable solution would be.

Context:

- Kubernetes
- 40+ clusters

- 50+ leaf Prometheus instances per cluster, ~2k total
- 6 root-level Prometheus instances per cluster monitoring the leaves (3x2 regional, zone redundant), 240 total
- 6 Alertmanagers per cluster, processing alerts from the leaf and root Prometheus instances (3x2 regional, zone redundant), 240 total

- Root-level instances are monitored by an HA Cortex cluster
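
As a sanity check on the totals above, here is a rough sketch of the arithmetic. It uses the lower bounds (40 clusters, 50 leaves per cluster), so real numbers will be somewhat higher; the constant and function names are just illustrative.

```python
# Lower-bound fleet arithmetic; the post says "40+" clusters and "50+" leaves.
LEAF_PROMS_PER_CLUSTER = 50
ROOT_PROMS_PER_CLUSTER = 6      # 3 x 2: regional, zone redundant
ALERTMANAGERS_PER_CLUSTER = 6   # 3 x 2: regional, zone redundant


def fleet_totals(clusters):
    """Return fleet-wide instance counts for a given number of clusters."""
    return {
        "leaf_proms": clusters * LEAF_PROMS_PER_CLUSTER,
        "root_proms": clusters * ROOT_PROMS_PER_CLUSTER,
        "alertmanagers": clusters * ALERTMANAGERS_PER_CLUSTER,
    }


if __name__ == "__main__":
    print(fleet_totals(40))
```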

----------------------------------------------------------------------------------------------------

Pros of the current setup:

- it's very robust
- easy to configure
- easy to set up

Issues with it:

- Lack of a global view
- Clusters are already in overlapping regions and there will be even more overlap, leading to a high amount of alert duplication
- Poor traceability
- Promotes monkey patching, because we have to introduce tags and software constructs for deduplication and grouping
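
To make the duplication issue concrete, here is a minimal sketch of the kind of grouping we end up doing by hand. It assumes alerts are plain label dictionaries and that a hypothetical `cluster` label is the only thing distinguishing duplicates; the label names and the helper functions are made up for illustration and are not how Alertmanager works internally.

```python
from collections import defaultdict


def dedup_key(labels, ignore=("cluster",)):
    """Build a hashable identity from all labels except the ignored ones."""
    return frozenset((k, v) for k, v in labels.items() if k not in ignore)


def group_duplicates(alerts, ignore=("cluster",)):
    """Group alerts that are identical once the ignored labels are dropped."""
    groups = defaultdict(list)
    for labels in alerts:
        groups[dedup_key(labels, ignore)].append(labels)
    return list(groups.values())


if __name__ == "__main__":
    alerts = [
        {"alertname": "HighLatency", "region": "eu-west", "cluster": "a"},
        {"alertname": "HighLatency", "region": "eu-west", "cluster": "b"},
        {"alertname": "DiskFull", "region": "eu-west", "cluster": "a"},
    ]
    for group in group_duplicates(alerts):
        print(len(group), group[0]["alertname"])
```

The two `HighLatency` alerts collapse into one group because they differ only in the ignored label, which is the behavior we currently have to emulate with extra tags.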

----------------------------------------------------------------------------------------------------

Potential solutions I was thinking of:

1. Move the Alertmanagers up to a higher, region-only layer, without them gossiping to each other
2. Create a clustered HA Alertmanager setup in 3-5 regions

The ideal solution would probably be [2], since it is both simple and robust. However, I have many unknowns here:

- Will it bear the load? We currently have around 5k alerts an hour. (I am not sure which gossip protocol Alertmanager uses; if it is one of the randomized variants, then load is probably not an issue.)

- Bandwidth pressure, etc.
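
For scale, here is a back-of-the-envelope conversion of the 5k alerts an hour into an average rate. It assumes the load is spread evenly (no bursts, no re-sends of still-firing alerts) and that every sender delivers each alert to every Alertmanager replica in the HA set; the replica count is a placeholder.

```python
def alerts_per_second(alerts_per_hour):
    """Average alert rate, assuming an even spread over the hour."""
    return alerts_per_hour / 3600.0


def total_inbound_per_second(alerts_per_hour, replicas):
    """Deliveries per second summed over all replicas, assuming each alert
    is sent to every replica rather than load balanced."""
    return alerts_per_second(alerts_per_hour) * replicas


if __name__ == "__main__":
    rate = alerts_per_second(5000)
    print(f"average: {rate:.2f} alerts/s per replica")
    print(f"summed over 5 replicas: {total_inbound_per_second(5000, 5):.2f} deliveries/s")
```

The average alone says little about bursts, which is where I would expect the real pressure to show up.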

wdyt?

Thanks
