Hello all,
Let's say we have >=2 Prometheus nodes scraping the same k8s metrics, with k8s service discovery running every 5 minutes. Then, imagine an alerting rule expression such as:
absent({pod="my-cool-pod"}) == 1
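For reference, such an expression would typically live in a rule file like the following. This is only a sketch; the group name, alert name, "for" duration, and labels are my own placeholders, not from any real setup:

```yaml
# Hypothetical rule file; names and durations are illustrative.
groups:
  - name: example
    rules:
      - alert: MyCoolPodAbsent
        expr: absent({pod="my-cool-pod"}) == 1
        # A "for" duration makes each Prometheus replica wait before
        # firing, which can absorb short scrape/SD skew between replicas.
        for: 10m
        labels:
          severity: warning
```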
What happens in practice is that the alert quickly flaps firing -> resolved -> firing -> resolved. AFAICT one Prometheus node sends the alert to Alertmanager with the state "resolved", and some seconds later the 2nd node still sends it as "firing" because, from its point of view, the metric is still absent. Only once that node also sends "resolved" does the alert finally settle. Seems like the magic happens here:
https://github.com/prometheus/prometheus/blob/master/rules/alerting.go#L103-L106. I would imagine that in such a scenario we should depend on AlertManager resolving the alert automatically for us after some time to get a "consistent" state.
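On that last point: Alertmanager does have a global resolve_timeout after which an alert whose EndsAt has not been refreshed is treated as resolved. A minimal config sketch (note that Prometheus normally sets EndsAt on the alerts it sends, so whether this helps here is exactly the question):

```yaml
# Alertmanager config sketch; only resolve_timeout is relevant here.
global:
  # How long to wait before declaring an alert resolved when no
  # fresh EndsAt is received. 5m is the documented default.
  resolve_timeout: 5m
```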
Any thoughts on this or perhaps I am missing something?
BR,
Giedrius