We are using the blackbox exporter at a remote location to monitor gateway routers, hypervisors and virtual machines (router -> hypervisor -> virtual machines). We are looking for behaviour like the following.
Example 1:
If a gateway router is down and its alert is firing, Alertmanager should stop alerting on the hypervisor hosts and servers.
Example 2:
If a hypervisor is down, it should not alert on the virtual machines.
In Prometheus, we put the routers in one group, the hypervisors in another group and the virtual machines in a third group.
Example:
job_name: 'blackbox_icmp-routers'
job_name: 'blackbox_icmp-hypervisors'
job_name: 'blackbox_icmp-virtualmachines'
Alerting rules are defined per job:
- name: RouterDown
  rules:
  - alert: R-InstanceDown
    expr: probe_success{job="blackbox_icmp-routers"} == 0
    for: 1m
- name: HypervisorDown
  rules:
  - alert: H-InstanceDown
    expr: probe_success{job="blackbox_icmp-hypervisors"} == 0
    for: 1m
- name: VirtualMachinesDown
  rules:
  - alert: V-InstanceDown
    expr: probe_success{job="blackbox_icmp-virtualmachines"} == 0
    for: 1m
Alertmanager config is below:
route:
  group_by: ['alertname']
  receiver: ms-teams
  repeat_interval: 5m
receivers:
- name: ms-teams
  webhook_configs:
  - url: 'http://monitoring:2000/alertmanager'
    send_resolved: false
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
Any help is much appreciated.
Thanks
Sandosh
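
One observation on the config above: the inhibit rule keys on `severity`, and none of the three alerts sets a `severity` label, so that rule would never apply. A minimal sketch of inhibition along the router -> hypervisor -> virtual machine chain could instead match on the alert names. The `site` label is hypothetical: it stands for any label that carries the same value on a router and on everything behind it, and it would have to be attached to the targets or rules first.

```yaml
# alertmanager.yml (sketch, not a drop-in config)
inhibit_rules:
  # A firing router alert mutes hypervisor and VM alerts of the same site.
  - source_match:
      alertname: 'R-InstanceDown'
    target_match_re:
      alertname: 'H-InstanceDown|V-InstanceDown'
    equal: ['site']   # hypothetical shared label
  # A firing hypervisor alert mutes VM alerts of the same site.
  - source_match:
      alertname: 'H-InstanceDown'
    target_match:
      alertname: 'V-InstanceDown'
    equal: ['site']   # hypothetical shared label
```

Listing `instance` or `alertname` in `equal` would defeat the purpose here, because those values differ between a router and the machines behind it.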
Rules:
- name: RouterDown
  rules:
  - alert: R-InstanceDown
    expr: probe_success{job="blackbox_icmp-routers"} == 0
    for: 1m
    labels:
      Category: 'Site'
      Type: 'Router'
- name: HypervisorDown
  rules:
  - alert: H-InstanceDown
    expr: probe_success{job="blackbox_icmp-hypervisors"} == 0
    for: 1m
    labels:
      Category: 'Site'
      Type: 'Hypervisor'
- name: VirtualMachinesDown
  rules:
  - alert: V-InstanceDown
    expr: probe_success{job="blackbox_icmp-virtualmachines"} == 0
    for: 1m
    labels:
      Category: 'Site'
      Type: 'Instance'
Alertmanager conf:
route:
group_by: ['Type']
receiver: ms-teams
repeat_interval: 5m
receivers:
- name: ms-teams
webhook_configs:
- url: 'http://monitoring:2000/alertmanager'
send_resolved: false
routes:
- match:
alertname: "R-InstanceDown"
receiver: ms-teams
routes:
- match:
alertname: "H-InstanceDown"
receiver: ms-teams
- match:
alertname: "V-InstanceDown"
receiver: ms-teams
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'dev', 'instance']
Since our targets have unique names per cluster (e.g. router111, router112, hypervisor111, hypervisor112, instance111, instance112), is there a way to group them by name, so that all nodes containing 111 are grouped together, all nodes containing 112 are grouped together, and so on? Please let me know.
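
One way this kind of name-based grouping might be approached is to copy the number from the target name into its own label at scrape time. The sketch below assumes the target name starts with letters followed by the cluster number (as in router111), and that `__param_target` already holds the probed host, as is usual for blackbox exporter jobs. The label name `cluster` is made up for illustration.

```yaml
# prometheus.yml, inside each blackbox_icmp-* scrape job (sketch)
relabel_configs:
  # ...the existing blackbox relabeling for this job stays as it is...
  - source_labels: [__param_target]
    # Relabel regexes are anchored at both ends and case-sensitive.
    # Letters, then the captured digits, then anything else.
    regex: '[A-Za-z]+([0-9]+).*'
    target_label: cluster      # hypothetical label name
    replacement: '${1}'
```

If the regex does not match a target, the label is simply not set. With the label in place, `group_by: ['cluster']` in the Alertmanager route would bundle notifications per cluster, and the same label could be listed under `equal` in an inhibit rule.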
With the configuration below, we see only Router Down alerts as soon as anything is added to the Router group, and even valid alerts are suppressed. We are not sure what we are missing.
...
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']

- source_labels: [__param_target]
  target_label: SNumber
  regex: 'hyper.(.*)com'
  replacement: '${1}'
- source_labels: [__param_target]
  target_label: SNumber
  regex: 'linux.(.*)com'
  replacement: '${1}'
- source_labels: [__param_target]
  target_label: SNumber
  regex: 'win.(.*)com'
  replacement: '${1}'
- source_labels: [__param_target]
  target_label: SNumber
  regex: 'Router.(.*)com'
  replacement: '${1}'
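
Without the elided part of the config this is only a guess, but two things may be worth checking. Relabel regexes are anchored and case-sensitive, so `'Router.(.*)com'` would not match a lowercase name such as router111, and `SNumber` would stay unset on those targets. Alertmanager also treats a missing label like an empty one, so if `SNumber` is absent on both the source and the target alert, an `equal: ['SNumber']` check passes and the target is inhibited. A sketch of a rule that only applies when the label is actually set on the source, assuming `SNumber` is the label used for matching:

```yaml
# alertmanager.yml (sketch; assumes SNumber is the shared cluster label)
inhibit_rules:
  - source_match:
      alertname: 'R-InstanceDown'
    source_match_re:
      SNumber: '.+'            # skip sources where the label is empty or missing
    target_match_re:
      alertname: 'H-InstanceDown|V-InstanceDown'
    equal: ['SNumber']
```

Checking the label values on `probe_success` in the Prometheus expression browser should show whether each target really ends up with the expected `SNumber`.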