How long has this service been in this state already?
For how many consecutive checks has it returned the same result? (This could be capped at some number below 100 to avoid overflows and still be useful.)
If you feel diligent, you can add an "is it currently considered flapping?" flag too. This flag should be set by the check itself.
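To make the idea concrete, here is a rough Python sketch of the history record I have in mind. None of this is an existing Consul API; `CheckHistory`, its fields, and the cap of 99 are names and values I made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical cap ("a number below 100") so the counter can never overflow.
MAX_CONSECUTIVE = 99


@dataclass
class CheckHistory:
    """Hypothetical per-check history handed to the check on each run."""
    status: str = "passing"
    state_since: float = 0.0   # timestamp of the last state change
    consecutive: int = 0       # identical results in a row, capped
    flapping: bool = False     # meant to be set by the check itself

    def record(self, status: str, now: float) -> None:
        """Store one check result and update age and counter."""
        if status == self.status and self.consecutive > 0:
            # Same result again: bump the counter, but never past the cap.
            self.consecutive = min(self.consecutive + 1, MAX_CONSECUTIVE)
        else:
            # First result or a state change: restart age and counter.
            self.status = status
            self.state_since = now
            self.consecutive = 1

    def age(self, now: float) -> float:
        """How long the service has been in its current state, in seconds."""
        return now - self.state_since
```

A check would then read `age()` and `consecutive` on each run and decide for itself whether to set `flapping`.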
With that information, rate-based and time-based flap detection becomes possible in the checks.
Increasingly heavy countermeasures, again based on the age and count of state changes, become possible as well.
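As a sketch of what a check could do with this, assuming it keeps its own short list of state-change timestamps: the class `FlapDetector`, the function `countermeasure`, the action names, and all thresholds below are invented for illustration and would need tuning per check.

```python
from collections import deque


class FlapDetector:
    """Rate-based flap detection: too many state changes within a time window."""

    def __init__(self, window_seconds: float = 600.0, max_changes: int = 3):
        self.window_seconds = window_seconds  # hypothetical default
        self.max_changes = max_changes        # hypothetical default
        self.changes = deque()                # timestamps of recent state changes
        self.last_status = None

    def observe(self, status: str, now: float) -> bool:
        """Record one result; return True if the check counts as flapping."""
        if self.last_status is not None and status != self.last_status:
            self.changes.append(now)
        self.last_status = status
        # Forget state changes that have fallen out of the window.
        while self.changes and now - self.changes[0] > self.window_seconds:
            self.changes.popleft()
        return len(self.changes) >= self.max_changes


def countermeasure(age_seconds: float, flapping: bool) -> str:
    """Pick an increasingly heavy local action as a failure gets older."""
    if flapping:
        return "hold"      # state is unstable, do not act on it yet
    if age_seconds < 60:
        return "wait"      # might be a short blip
    if age_seconds < 300:
        return "restart"   # e.g. restart the local service
    return "escalate"      # hand over to whatever is heavier than a restart
```

The point of keeping both pieces in the check is that the thresholds and the actions differ from service to service.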
Having worked with a lot of monitoring tools, I have noticed that this behaviour needs to be controlled by the checks, because there are too many special cases and the countermeasures are local anyway.
Only notification is actually a global feature. I would recommend doing that on a few less-exposed, lightly loaded machines.
Reasons: notification often requires credentials, and from an unhealthy node it will likely not work locally anyway, or it may be heavily delayed if many checks fail. With locally queued notifications, this leads to a wave of notifications after the problem has actually been resolved; with unqueued notifications, the notifications from unhealthy nodes are simply lost. The first leads to ignorance (alert fatigue); the second defeats the point of notifying on failure.