On 10.03.21 00:31, dc3o wrote:
> A few times we had to bring our database clusters down for maintenance.
> Before doing so, we create a silence for a limited period of time. The
> silence properly catches all the alerts. The problem is that once the db
> host is down, Prometheus no longer scrapes metrics and marks the initial
> alert as resolved. No metrics, no problem. It looks like send_resolved
> skips the silencing pipeline, and we get alert fatigue from resolved events.
Yeah, in my understanding, sending resolved notifications currently has
semantics independent of silencing. That is IMHO confusing, because a
silenced alert is not repeatedly sent to the receiver as configured with
repeat_interval. (Some receivers are configured to consider an alert
resolved after a while if they do not receive any updates.)
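
For orientation, here is a minimal sketch of where the two settings mentioned
above live in an Alertmanager configuration. The receiver name, the URL, and
the interval value are placeholders chosen for illustration, not taken from
the reporter's setup. Only the `repeat_interval` and `send_resolved` fields
matter here.

```yaml
route:
  receiver: db-team              # hypothetical receiver name
  repeat_interval: 4h            # how often a still-firing alert is re-sent;
                                 # silenced alerts are not re-sent

receivers:
  - name: db-team
    webhook_configs:
      - url: http://example.invalid/hook   # placeholder endpoint
        send_resolved: true      # receiver also gets "resolved" notifications
```

Setting `send_resolved: false` on a receiver would suppress resolved
notifications entirely for that receiver, not only for silenced alerts, so it
is a blunt workaround at best.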
See the old issue
https://github.com/prometheus/alertmanager/issues/226 for some
considerations about when Alertmanager should send resolved notifications
and when it should not. I expect some movement on this front in the near
future. Reporting your use case and your expectations there might be helpful.
--
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email]
bjo...@rabenste.in