Hi,
I'm about to redesign our current observability topology and have just started looking into what an acceptable solution would look like.
Context:
- Kubernetes
- 40+ clusters
- 50+ leaf prom instances per cluster, ~2k total
- 6 root-level prom instances per cluster monitoring the leaves (3x2 regional, zone redundant), 240 total
- 6 Alertmanagers per cluster, processing alerts from leaf and root proms (3x2 regional, zone redundant), 240 total
- Root-level instances are monitored by an HA Cortex cluster
----------------------------------------------------------------------------------------------------
Pros of the current setup:
- it's very robust
- easy to configure
- easy to set up
Issues with it:
- lack of a global view
- clusters are already in overlapping regions and there will be even more overlap, leading to a lot of alert duplication
- poor traceability
- promotes monkey patching, because we have to introduce tags and software constructs for deduplication and grouping
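To illustrate the kind of deduplication construct I mean, here is a minimal sketch in Python. It assumes duplicates differ only in a few source-specific labels; the label names `cluster` and `replica` are hypothetical, not our real ones. As far as I understand, Alertmanager identifies an alert by its label set, so alerts only collapse once those labels are dropped or made identical.

```python
from typing import Dict, Iterable, List, Tuple

Labels = Dict[str, str]

# Hypothetical label names; which labels actually differ between
# overlapping sources is an assumption for this sketch.
IGNORED_LABELS: Tuple[str, ...] = ("cluster", "replica")


def dedup_key(labels: Labels, ignored: Iterable[str] = IGNORED_LABELS) -> frozenset:
    """Build an identity for an alert from its labels, minus the ignored ones."""
    skip = set(ignored)
    return frozenset((k, v) for k, v in labels.items() if k not in skip)


def deduplicate(alerts: List[Labels], ignored: Iterable[str] = IGNORED_LABELS) -> List[Labels]:
    """Keep the first alert seen for each identity, preserving input order."""
    seen: Dict[frozenset, Labels] = {}
    for alert in alerts:
        seen.setdefault(dedup_key(alert, ignored), alert)
    return list(seen.values())


alerts = [
    {"alertname": "HighLatency", "service": "api", "cluster": "eu-1", "replica": "a"},
    {"alertname": "HighLatency", "service": "api", "cluster": "eu-2", "replica": "b"},
    {"alertname": "DiskFull", "service": "db", "cluster": "eu-1", "replica": "a"},
]
unique = deduplicate(alerts)
```

With the three sample alerts above, the two `HighLatency` alerts collapse into one, so `unique` holds two entries. Every such ignored label is something we have to maintain by hand, which is the patching I would like to get rid of.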
----------------------------------------------------------------------------------------------------
Potential solutions I was thinking of:
- move the Alertmanagers up to a higher, regional-only layer, without them gossiping to each other
- create a clustered HA Alertmanager setup in 3-5 regions
The ideal solution would probably be the second one (the clustered HA setup), because it is both simple and robust. However, I have many unknowns here:
- will it bear the load? We currently have around 5k alerts an hour. (I'm not sure which gossip protocol AM uses; if it is one of the randomized variants, load is probably not an issue.)
- bandwidth pressure, etc.
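For a sense of scale, here is a back-of-envelope calculation in Python. Only the 5k alerts per hour figure is real; the burst factor is a hypothetical placeholder, since I have not measured how spiky our alert traffic is.

```python
ALERTS_PER_HOUR = 5_000   # figure from our current setup
BURST_FACTOR = 10         # hypothetical: peak rate relative to the average


def average_rate_per_second(alerts_per_hour: float) -> float:
    """Convert an hourly alert count into an average per-second rate."""
    return alerts_per_hour / 3600.0


def peak_rate_per_second(alerts_per_hour: float, burst_factor: float) -> float:
    """Estimate a peak per-second rate from an assumed burst factor."""
    return average_rate_per_second(alerts_per_hour) * burst_factor


avg = average_rate_per_second(ALERTS_PER_HOUR)
peak = peak_rate_per_second(ALERTS_PER_HOUR, BURST_FACTOR)
print(f"average: {avg:.2f} alerts/s, assumed peak: {peak:.2f} alerts/s")
```

So the average is under 2 alerts per second across all clusters. Whether that is a problem depends on burst behaviour and on how much state the cluster members exchange per alert, which is the part I don't know.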
wdyt?
Thanks