> On 22 Dec 2014, at 2:59 am, l...@zhihu.com wrote:
>
> It seems to me that every check has to be predefined. Checks seem to be more of a Nagios thing than a notification engine thing. For example, I may use an intelligent check engine that detects anomalous metrics automatically, and it is impractical to define checks for every metric because there are too many.
I agree with your sentiment; however, checks don’t need to be predefined. In 1.x you can predefine entities, but checks are fully dynamic. You can also avoid predefining entities by means of the ALL entity hack[1].
>
> As to the question `How long has a check been failing`, I don't think that is something a notification engine should care about. It is better to let other programs decide whether there is a problem; Flapjack just needs to care about who should be notified via which media.
Flapjack 2.0 will allow the failure delay to be configured on a per-check basis, including setting it to 0 (no delay). This is likely to be backported to 1.x (see https://github.com/flapjack/flapjack/pull/748), so you’ll then be able to set this as a default in your environment if you wish.
Composability is one of the core design philosophies of Flapjack. Our primary use case was providing a pathway for replacing Nagios, which goes like this:
- Nagios for checks and alerting
- bring in Flapjack to replace the alerting
- allow other check execution systems to gradually take over from Nagios (e.g. Sensu, Icinga, Naemon)
While you’re in the second phase, you need a failure delay to take the place of the number of consecutive failures Nagios requires before alerting.
If your check execution system already takes care of the failure delay question, then you’ll be able to set those checks’ failure delay to 0 and get the behaviour you’re describing.
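To make the failure delay idea concrete, here is a minimal Python sketch. It is not Flapjack's implementation; `FailureDelayGate` and its fields are hypothetical names used only for illustration. It assumes an event carries a state string where "ok" means healthy, and it holds back notification until the check has been failing continuously for the configured delay.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailureDelayGate:
    """Hypothetical per-check gate: notify only after `delay_seconds` of continuous failure."""

    delay_seconds: float
    failing_since: Optional[float] = None  # timestamp of the first failure in the current run

    def should_notify(self, state: str, now: float) -> bool:
        if state == "ok":
            # A recovery resets the failure window.
            self.failing_since = None
            return False
        if self.failing_since is None:
            # First failing event in this run: start the clock.
            self.failing_since = now
        return now - self.failing_since >= self.delay_seconds
```

With `delay_seconds=0` the first failing event passes straight through, which corresponds to the case where the check execution system has already decided that there is a problem.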
>
> Events should contain tags; people just say which tags they are interested in, and events are routed by looking at their tags.
Yep, that works now. You can inject tags into events. Flapjack also generates ephemeral tags from the check name/description, and notification rules can reference those tags too.
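As a rough illustration of tag-based routing (a sketch only, not Flapjack's actual rule engine), the Python below assumes a rule matches when every tag it lists is present on the event. `Rule` and `route` are hypothetical names, and the contacts and tags are made up.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical notification rule: who to notify, via which media, for which tags."""

    contact: str
    media: str
    tags: FrozenSet[str]  # assumption: all of these must appear on the event


def route(event_tags: Iterable[str], rules: Iterable[Rule]) -> List[Tuple[str, str]]:
    """Return (contact, media) pairs for every rule whose tags are a subset of the event's tags."""
    present = set(event_tags)
    # Note: a rule with an empty tag set matches every event.
    return [(rule.contact, rule.media) for rule in rules if rule.tags <= present]


rules = [
    Rule("alice", "sms", frozenset({"database", "production"})),
    Rule("bob", "email", frozenset({"web"})),
]
recipients = route(["database", "production", "cpu"], rules)
```

Here the router never interprets what a tag means; it only compares sets, which is the behaviour the quoted paragraph asks for.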
>
> In my view, a notification engine just needs to listen for incoming events and send them to the right people via the right media. It should not care about checks or entities; they should all be treated as tags, which gives more flexibility. What a tag means is irrelevant to a notification engine, because it only uses the tags to decide who should receive the event via which media.
Agreed
>
> As to the summary field of the event, it is too narrow. For example, I want to include CPU usage, memory usage, iostat output, etc. in a notification message to help solve the problem. It should just be a tag on the event, for example a tag named notification_body; Flapjack should not care about its content.
Yep. We’re planning on incorporating additional information into notifications, whether that information is already looked up and included in the event flapjack receives, or whether flapjack engages an external lookup service. Watch this space.
>
> The event and check data structures look like they come from Nagios's view of the metrics world, which is outdated.
Yep. Entities no longer exist in the Flapjack 2.x data structure.
[1] http://flapjack.io/docs/1.0/usage/Howto-Dynamic-Entity-Contact-Linking/