Prometheus: Sometimes the resolved notification from Alertmanager (webhook) is not received

赵坏蛋

Dec 23, 2020, 4:47:41 AM
to Prometheus Users
Most rules fire and resolve normally, but for some alerts the webhook receives only the firing notification and never the resolved one.
I have confirmed that these alerts are shown as resolved in both Prometheus and Alertmanager, yet the webhook still did not receive the resolved notification from Alertmanager.

Please help confirm whether this is a configuration problem or a bug. Thanks!

赵坏蛋

Dec 23, 2020, 4:53:02 AM
to Prometheus Users
Prometheus (version 2.11.0)
Prometheus rules:
“””
  - alert: ServiceQualityDecline
    expr: (min(collectd_link_e2e_score) by (hostname,env,bond,companyId,siteName,neId,deviceId,dstNe,companyName) - min(collectd_link_e2e_score{} offset 5m) by (hostname,env,bond,companyId,siteName,neId,deviceId,dstNe,companyName)) /min(collectd_link_e2e_score offset 5m) by (hostname,env,bond,companyId,siteName,neId,deviceId,dstNe,companyName) > 0.6
    for: 2m
    labels:
      severity: Emergency
    annotations:
      summary: "{{ $labels.neId }}: service quality has declined more than 60%."
      description: "{{ $labels.neId }}: E2E score of {{ $labels.link }} has declined."
  - alert: ServiceQualityDecline
    expr: (min(collectd_link_e2e_score) by (hostname,env,bond,companyId,siteName,neId,deviceId,dstNe,companyName) - min(collectd_link_e2e_score{} offset 5m) by (hostname,env,bond,companyId,siteName,neId,deviceId,dstNe,companyName)) /min(collectd_link_e2e_score offset 5m) by (hostname,env,bond,companyId,siteName,neId,deviceId,dstNe,companyName) > 0.3
    for: 2m
    labels:
      severity: Critical
    annotations:
      summary: "{{ $labels.neId }}: service quality has declined more than 30%."
      description: "{{ $labels.neId }}: E2E score of {{ $labels.link }} has declined."
“””
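To make the arithmetic behind the two expressions concrete, here is a minimal Python sketch of the same comparison. It is only an illustration, not how Prometheus evaluates rules: the function names are hypothetical, and the handling of a zero previous value is an assumption that roughly mirrors PromQL returning NaN or Inf instead of raising an error.

```python
import math
from typing import List

# Thresholds copied from the two ServiceQualityDecline rules above.
THRESHOLDS = [("Emergency", 0.6), ("Critical", 0.3)]


def relative_change(current: float, previous: float) -> float:
    """Return (current - previous) / previous, as in the rule expression."""
    if previous == 0:
        # Assumption: imitate PromQL, which yields NaN or +/-Inf here.
        diff = current - previous
        return math.nan if diff == 0 else math.copysign(math.inf, diff)
    return (current - previous) / previous


def firing_severities(current: float, previous: float) -> List[str]:
    """List the severities whose threshold the relative change exceeds."""
    change = relative_change(current, previous)
    # Comparisons with NaN are always False, so NaN matches no rule.
    return [severity for severity, limit in THRESHOLDS if change > limit]
```

With this sketch, a score going from 100 to 170 exceeds both thresholds, so both rules would fire under the same alertname with different severity labels. A drop from 100 to 40 gives a ratio of -0.6 and matches neither rule, so the expression as written fires on a rise in collectd_link_e2e_score. Whether a higher score means worse quality is not stated in this thread.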

赵坏蛋

Dec 23, 2020, 9:12:05 PM
to Prometheus Users
Alertmanager config:
"""
resolve_timeout: 2m
  routes:
    - match:
        env: st 
      receiver: 'aiwan_alert'
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 1m
      repeat_interval: 24h
      continue: true
receivers:
- name: 'aiwan_alert'
  webhook_configs:
    send_resolved: true
"""
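One thing that may be worth checking on the receiving side: with group_by: ['alertname'], alerts for different neId values share one notification group. The top-level "status" field of the webhook payload stays "firing" while any alert in the group is still firing, and resolved alerts are reported only through the per-alert "status" field. The Python sketch below counts alerts per status in one payload. The sample body and its label values are made up for illustration, and the function name is hypothetical; it is a debugging aid under those assumptions, not a diagnosis of this setup.

```python
import json
from collections import Counter
from typing import Any, Dict

# Hypothetical payload in the shape Alertmanager posts to a webhook;
# the neId values are invented for illustration.
SAMPLE_BODY = json.dumps({
    "version": "4",
    "status": "firing",
    "receiver": "aiwan_alert",
    "groupLabels": {"alertname": "ServiceQualityDecline"},
    "alerts": [
        {"status": "resolved",
         "labels": {"alertname": "ServiceQualityDecline",
                    "severity": "Emergency", "neId": "ne-1"}},
        {"status": "firing",
         "labels": {"alertname": "ServiceQualityDecline",
                    "severity": "Critical", "neId": "ne-2"}},
    ],
})


def summarize_webhook(body: str) -> Dict[str, Any]:
    """Count firing and resolved alerts inside one webhook request body."""
    payload = json.loads(body)
    alerts = payload.get("alerts", [])
    counts = Counter(alert.get("status", "unknown") for alert in alerts)
    return {
        # Group-level status: "firing" if any alert in the group is firing.
        "group_status": payload.get("status"),
        "firing": counts["firing"],
        "resolved": counts["resolved"],
        # Labels of the alerts that this notification reports as resolved.
        "resolved_labels": [alert.get("labels", {}) for alert in alerts
                            if alert.get("status") == "resolved"],
    }
```

Calling summarize_webhook(SAMPLE_BODY) reports one firing and one resolved alert even though the group status is "firing". A receiver that looks only at the top-level status would miss that resolution.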