some thoughts about the design of flapjack

l...@zhihu.com

Dec 21, 2014, 11:29:50 AM
to flapjack...@googlegroups.com
It seems to me that every check has to be predefined. Checks seem to be more of a Nagios thing than a concern of a notification engine. For example, I may use an intelligent check engine that detects anomalous metrics automatically, and defining a check for every metric is impractical because there are too many.

As to the question `How long has a check been failing`, I don't think that is something a notification engine should care about. It is better to let other programs decide whether there is a problem; Flapjack just needs to care about who should be notified via which media.

Events should contain tags, people should just say which tags they are interested in, and events should be routed by looking at their tags.

In my view, a notification engine just needs to listen for incoming events and send them to the right people via the right media. It should not care about checks or entities; they should all be treated as tags, which provides more flexibility. What a tag means is irrelevant to a notification engine, because it only uses tags to decide who should receive the event via which media.
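
A minimal sketch of the tag-only routing idea described above, in Python. The `Subscription` type, the `route` function, and the tag strings are all hypothetical and are not Flapjack's API; the point is only that the engine treats tags as opaque strings and matches them against what each contact subscribed to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subscription:
    contact: str
    medium: str        # e.g. "email" or "sms"
    tags: frozenset    # tags this contact is interested in


def route(event_tags, subscriptions):
    """Return (contact, medium) pairs for subscriptions sharing a tag with the event."""
    event_tags = set(event_tags)
    return [(s.contact, s.medium) for s in subscriptions if s.tags & event_tags]


subs = [
    Subscription("alice", "email", frozenset({"db"})),
    Subscription("bob", "sms", frozenset({"web", "cpu"})),
]

# The engine never interprets "host:db1", "db", or "disk"; it only compares them.
targets = route({"host:db1", "db", "disk"}, subs)
```

Here an event is delivered if it shares at least one tag with a subscription; other matching policies are possible.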

As to the summary field of the event, it is too narrow. For example, I want to include CPU usage, memory usage, iostat output, etc. in a notification message to help solve the problem. It should just be a tag on the event, for example a tag named notification_body, and Flapjack should not care about its content.

The event and check data structures look like they come from Nagios's view of the metrics world, which is outdated.

Jesse Reynolds

Dec 22, 2014, 12:27:19 AM
to flapjack...@googlegroups.com

> On 22 Dec 2014, at 2:59 am, l...@zhihu.com wrote:
>
> It seems to me that every check has to be predefined. Checks seem to be more of a Nagios thing than a concern of a notification engine. For example, I may use an intelligent check engine that detects anomalous metrics automatically, and defining a check for every metric is impractical because there are too many.

I agree with your sentiment; however, it’s not the case that checks need to be predefined. In 1.x you can predefine entities, but checks are fully dynamic. You can also avoid predefining entities by means of the ALL entity hack[1].

>
> As to the question `How long has a check been failing`, I don't think that is something a notification engine should care about. It is better to let other programs decide whether there is a problem; Flapjack just needs to care about who should be notified via which media.

Flapjack 2.0 will make the failure delay configurable on a per-check basis, including setting it to 0 (no delay). This is likely to be backported to 1.x (see https://github.com/flapjack/flapjack/pull/748 ) so you’ll then be able to set this as a default in your environment if you wish.

Composability is one of the core design philosophies of Flapjack. The primary use case we had was providing a pathway for replacing Nagios, and it goes like this:
- Nagios for checks and alerting
- bring in Flapjack to replace the alerting
- allow other check execution systems to gradually take over from Nagios (e.g. Sensu, Icinga, Naemon)

When you’re in the second phase there, you need a failure delay to take the place of Nagios’s number of failures in a row before alerting.

If your check execution system already takes care of the failure delay question, then you’ll be able to set those checks’ failure delay to 0 and get the behaviour you’re describing.
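
As a rough illustration of what a failure delay does, the sketch below alerts only once a check has been failing continuously for at least the delay, and a delay of 0 alerts on the first failing event. The class, method, and parameter names are hypothetical; this is not Flapjack's implementation or its configuration keys.

```python
class FailureDelayGate:
    """Suppress alerts until a check has been failing for `delay` seconds."""

    def __init__(self, delay):
        self.delay = delay
        self.failing_since = {}  # check name -> timestamp of first failure

    def should_alert(self, check, state, timestamp):
        if state == "ok":
            # A recovery resets the clock for this check.
            self.failing_since.pop(check, None)
            return False
        first = self.failing_since.setdefault(check, timestamp)
        return timestamp - first >= self.delay


gate = FailureDelayGate(delay=30)
results = [
    gate.should_alert("web1:http", "critical", 0),   # just started failing
    gate.should_alert("web1:http", "critical", 20),  # failing for 20 s
    gate.should_alert("web1:http", "critical", 30),  # failing for 30 s
]

# With a delay of 0 the very first failure is allowed through.
immediate = FailureDelayGate(delay=0).should_alert("web1:http", "critical", 0)
```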

>
> Events should contain tags, people should just say which tags they are interested in, and events should be routed by looking at their tags.

Yep, that works now. You can inject tags in the events. Flapjack also generates ephemeral tags from the check name/description that can be referenced in tags in notification rules.

>
> In my view, a notification engine just needs to listen for incoming events and send them to the right people via the right media. It should not care about checks or entities; they should all be treated as tags, which provides more flexibility. What a tag means is irrelevant to a notification engine, because it only uses tags to decide who should receive the event via which media.

Agreed

>
> As to the summary field of the event, it is too narrow. For example, I want to include CPU usage, memory usage, iostat output, etc. in a notification message to help solve the problem. It should just be a tag on the event, for example a tag named notification_body, and Flapjack should not care about its content.

Yep. We’re planning on incorporating additional information into notifications, whether that information is already looked up and included in the event flapjack receives, or whether flapjack engages an external lookup service. Watch this space.

>
> The event and check data structures look like they come from Nagios's view of the metrics world, which is outdated.

Yep. Entities no longer exist in the Flapjack 2.x data structure.

[1] http://flapjack.io/docs/1.0/usage/Howto-Dynamic-Entity-Contact-Linking/


Eric Howard

Mar 17, 2015, 8:31:54 AM
to flapjack...@googlegroups.com

>> Events should contain tags, people should just say which tags they are interested in, and events should be routed by looking at their tags.

>Yep, that works now. You can inject tags in the events. Flapjack also generates ephemeral tags from the check name/description that can be referenced in tags in notification rules.

How does this currently work? From what I've seen people can only tell which tags they are NOT interested in, instead of which tags they ARE interested in, which in our environment is a lot more work.

Jesse Reynolds

Mar 17, 2015, 3:38:24 PM
to flapjack...@googlegroups.com

Eric Howard

Mar 18, 2015, 11:26:32 AM
to flapjack...@googlegroups.com
So let's say someone is interested in only one check, but that check is on every entity. Currently there is no way for them to just subscribe to that check, correct? My point was that they would have to subscribe to the ALL entity and then filter out EVERYTHING they don't want to see, which in some cases is a lot more work.

But from what I've seen, being able to subscribe to just one certain check will be possible in 2.x, correct?

George Necula

Apr 18, 2015, 12:36:04 AM
to flapjack...@googlegroups.com
From looking at the code, it seems that (at least in 1.5), if a notification rule declares a set of tags, then it will only fire for events that include all the tags in the rule. This suggests that for every tag that a contact is interested in, you should create one notification profile that mentions that tag.
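
The matching behaviour described here can be sketched in a few lines of Python. This is an illustration of the "all tags must be present" semantics only; `rule_matches` and the tag names are hypothetical and are not taken from the Flapjack code base.

```python
def rule_matches(rule_tags, event_tags):
    """A rule that declares tags fires only if the event carries every one of them."""
    return set(rule_tags) <= set(event_tags)


event_tags = {"web", "cpu"}

# One rule listing both tags requires "cpu" AND "disk", so it does not fire.
combined = rule_matches({"cpu", "disk"}, event_tags)

# One rule per tag of interest: the contact is notified if any single rule matches.
per_tag_rules = [{"cpu"}, {"disk"}]
any_match = any(rule_matches(rule, event_tags) for rule in per_tag_rules)
```

So "cpu or disk" is expressed as two single-tag rules rather than one rule with two tags.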

Jesse Reynolds

Apr 19, 2015, 8:44:33 PM
to flapjack...@googlegroups.com

> On 18 Apr 2015, at 9:42 am, George Necula <gcne...@gmail.com> wrote:
>
> From looking at the code, it seems that (at least in 1.5), if a notification rule declares a set of tags, then it will only fire for events that include all the tags in the rule. This suggests that for every tag that a contact is interested in, you should create one notification profile that mentions that tag.

Yep, that’s right. Sorry Eric for not explaining things clearly enough. We need to improve the documentation around how notification rules work.