Alertmanager race condition with inhibition rules

Auger

Mar 31, 2020, 11:26:17 AM
to Prometheus Users
Hi!

I think this is mostly a configuration issue, so I'm posting here before opening a GitHub issue to see if someone can help me.

I have a Prometheus server running in Kubernetes with two Alertmanagers in HA (1 Prometheus server, 2 Alertmanagers).
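
For context, Prometheus is pointed at both replicas with something along these lines (just a sketch; the target names below are placeholders, not our exact manifest):

================================================
# Prometheus relevant bits (sketch)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - prometheus-alertmanager-0.prometheus-alertmanager:9093
            - prometheus-alertmanager-1.prometheus-alertmanager:9093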

Alertmanager Configuration:

================================================
# Deployment relevant bits
  prometheus-alertmanager:
    Image:         prom/alertmanager:v0.19.0
    Port:          9093/TCP
    Host Port:     0/TCP
    Args:
      --config.file=/etc/config/alertmanager.yml
      --storage.path=/data
      --log.level=debug
      --cluster.settle-timeout=2m
      --cluster.listen-address=0.0.0.0:19604



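The two replicas peer with each other through the usual clustering flags; roughly (the peer address is a placeholder, not copied from our manifest):

      --cluster.peer=<other alertmanager pod>:19604
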
================================================
# Configmap relevant bits
receivers:
   (...)
route:
  group_wait: 120s
  group_interval: 5m
  receiver: default-receiver
  repeat_interval: 168h
  group_by: ['cluster', 'service', 'deployment', 'replicaset', 'alertname', 'objectid', 'alertid', 'resourceid']
  routes:
    - match:
        severity: blackhole
      receiver: blackhole
      continue: false
    - match:
        tag: "source_tag"
      receiver: blackhole
      repeat_interval: 1m
      group_interval: 1m
      continue: false
    (...)
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
- source_match:
    tag: "source_tag"
  target_match:
    tag: "target_tag"


The inhibition rules work like a charm until one of the Alertmanagers dies. If a node in the cluster dies, one of the Alertmanager pods has to be relocated and restarts. When it restarts, we can see in the log file that the alert with tag: "target_tag" is received before the one carrying the source tag, and a notification is fired.


Example:

We have an alert in Prometheus that fires between 10 AM and 12 PM. While this alert is firing, I want all alerts that match a certain label (in this case tag: target_tag) to be inhibited (a rough sketch of the two rules follows the log excerpt below). This approach works flawlessly unless the Alertmanager is restarted, in which case I can see in the logs:
level=info ts=2020-03-31T14:22:58.403Z caller=main.go:217 msg="Starting Alertmanager" version="(version=0.19.0, branch=HEAD, revision=7aa5d19fea3f58e3d27dbdeb0f2883037168914a)"
level=info ts=2020-03-31T14:22:58.403Z caller=main.go:218 build_context="(go=go1.12.8, user=root@587d0268f963, date=20190903-15:01:40)"
level=debug ts=2020-03-31T14:22:58.506Z caller=cluster.go:149 component=cluster msg="resolved peers to following addresses" peers=<peers>
(...)
level=debug ts=2020-03-31T14:22:58.702Z caller=cluster.go:306 component=cluster memberlist="2020/03/31 14:22:58 [DEBUG] memberlist: Initiating push/pull sync with: <peer IP>\n"
level=debug ts=2020-03-31T14:22:58.704Z caller=delegate.go:230 component=cluster received=NotifyJoin (...) addr=<peer IP>"
level=debug ts=2020-03-31T14:22:58.802Z caller=cluster.go:470 component=cluster msg="peer rejoined" (...)"

level=debug ts=2020-03-31T14:22:58.802Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alertid=\\\"ALERTID\\\", alertname=\\\"This is the alert i want to inhibit", tag="target_tag" "}\" receiver:<group_name:\"default-receiver\" (...)> timestamp:<seconds:1585648804 nanos:750301 > firing_alerts:3876410699172976497 > expires_at:<seconds:1586080804 nanos:750301 > "
level=debug ts=2020-03-31T14:22:58.802Z caller=nflog.go:540 component=nflog msg="gossiping new entry" entry="entry:<group_key:\"{}:{alertid=\\\"ALERTID\\\", alertname=\\\"This is the alert that fires between 10 and 12AM", tag="source_tag" "}\" receiver:<group_name:\"blackhole\" (...)> "

The alert that should stay inhibited is processed before the inhibiting alert is received from the peer, so we get a notification for something that is supposed to stay quiet.
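
For reference, the two rules look roughly like this on the Prometheus side (alert names, expressions and the exact time window below are placeholders, not our real rules):

================================================
# Prometheus rules, simplified sketch
groups:
  - name: example
    rules:
      # Source alert: only fires during the morning window (UTC) and
      # carries the label that the inhibit rule and the blackhole route match on.
      - alert: MorningWindow                  # hypothetical name
        expr: hour() >= 10 and hour() < 12
        labels:
          tag: source_tag
      # Target alert: should stay quiet while the source alert is firing.
      - alert: SomethingNoisy                 # hypothetical name
        expr: some_metric > 0                 # placeholder expression
        labels:
          severity: warning
          tag: target_tag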



Do you know if there is a way to prioritize an alert, or to wait for all gossip from the peers to finish before sending notifications? We already tried the flag --cluster.settle-timeout=2m, but it doesn't solve the problem.


Thanks a lot!

Regards,

