We need to generate an alert, via Prometheus snmp_exporter metrics, when fewer than 80% of the nodes on our active F5 BIG-IP load balancer are up. I think we have the percentage of up hosts, but I am not sure how to ensure that we only alert on the active F5 load balancer node. In snmp_exporter, each F5 node is exposed under a distinct instance label value.
Here are the two metrics in question:
host up metric: ltmPoolMemberMonitorState = 4
f5 node active metric: sysCmFailoverStatusId = 4
The query below counts the distinct ltmPoolMemberNodeName values containing "prod" that are up, and divides that by the total number of distinct ltmPoolMemberNodeName values containing "prod". We then appended the OR operator so that the expression returns 0 when all hosts are down (i.e. no series has ltmPoolMemberMonitorState equal to 4), because the numerator would otherwise be empty and the division would return nothing. See below:
count(count by (ltmPoolMemberNodeName) (ltmPoolMemberMonitorState{ltmPoolMemberNodeName=~".*prod.*"} == 4)) / count(count by (ltmPoolMemberNodeName) (ltmPoolMemberMonitorState{ltmPoolMemberNodeName=~".*prod.*"})) OR on() vector(0)
Now we need to ensure that the calculation is derived only from the metrics of the active F5 node instance (i.e. the instance for which sysCmFailoverStatusId is equal to 4). I tried with (instance) and on (instance) to keep the metrics on the same F5 node instance label, but haven't had any luck. Any recommendations would be greatly appreciated.
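To make the goal concrete, one way this restriction might be expressed is with the set operator `and on (instance)`, which keeps only the pool-member series whose instance label also appears on a sysCmFailoverStatusId series equal to 4. This is an untested sketch. It assumes that both metrics carry the same instance label value for a given F5. The `< 0.8` comparison and the outer parentheses are additions for the 80% condition and are not part of the query above.

```promql
(
    # "prod" pool members that are up, on the active F5 instance only
    count(
      count by (ltmPoolMemberNodeName) (
        (ltmPoolMemberMonitorState{ltmPoolMemberNodeName=~".*prod.*"} == 4)
        and on (instance) (sysCmFailoverStatusId == 4)
      )
    )
  /
    # all "prod" pool members known to the active F5 instance
    count(
      count by (ltmPoolMemberNodeName) (
        ltmPoolMemberMonitorState{ltmPoolMemberNodeName=~".*prod.*"}
        and on (instance) (sysCmFailoverStatusId == 4)
      )
    )
  # fall back to 0 when the ratio has no result (e.g. no member is up)
  or on () vector(0)
) < 0.8
```

Because `and` matches many-to-many, every pool-member series on the active instance is kept with its original labels and value. One caveat with this shape: if no instance reports sysCmFailoverStatusId equal to 4, both counts are empty, the `vector(0)` fallback applies, and the expression evaluates to 0, which is below the threshold.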