The rollup feature


George Necula

May 9, 2015, 12:29:01 PM
to flapjack...@googlegroups.com
Hi,

    I think that the rollup feature is very important, but I am confused about how it seems to work. (BTW, I could not find any discussion of rollup in the documentation. Did I miss it?)

    Consider the situation where the rollup is set to 3 and I have 3 checks in failed state. Two of the checks have a notification interval of 24 hours and one has an interval of 5 minutes. I expect to get 2 notifications once a day and 1 notification every 5 minutes, with no rollup kicking in unless they all happen in the same minute. Instead, every 5 minutes I get a summary notification listing all 3 failed checks. This is confusing, and makes it hard to spot from the subject line which check has changed state. Essentially, you have to compare two consecutive emails to see what is new. (We have more than 10k entity:checks, and if at least two of them are in failed state, all I get are summary emails.)

   Instead, I think that alerts should be triggered independently, and a summary should be sent only when more than 3 alerts would otherwise have been sent on the same media in the same minute. Essentially, a summary alert is just a bag holding all the alerts that would have been sent out anyway.
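
To make the behavior I am asking for concrete, here is a minimal Python sketch. It is not Flapjack code; the function name `group_notifications`, the tuple layout, and the `rollup_threshold` parameter are all made up for illustration. It assumes each check has already decided independently that a notification is due, and only then groups the due notifications by media and minute.

```python
from collections import defaultdict


def group_notifications(due_alerts, rollup_threshold=3):
    """Group already-due alerts into single or summary messages.

    due_alerts: iterable of (timestamp_seconds, media, check_name) tuples
    (a hypothetical layout, not Flapjack's own data model).
    """
    # Bucket alerts by media and by the minute in which they are due.
    buckets = defaultdict(list)
    for ts, media, check in due_alerts:
        buckets[(media, ts // 60)].append(check)

    messages = []
    for (media, minute), checks in sorted(buckets.items()):
        if len(checks) > rollup_threshold:
            # More than the threshold in one minute: send one summary.
            messages.append(("summary", media, minute, checks))
        else:
            # Otherwise each alert goes out on its own.
            for check in checks:
                messages.append(("single", media, minute, [check]))
    return messages
```

With a threshold of 3, four alerts due on email within the same minute would produce one summary, while a lone alert in the next minute would still be sent individually with its own subject line.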

Any thoughts?
Thanks,
George.

George Necula

May 24, 2015, 11:37:50 AM
to flapjack...@googlegroups.com

 Is there a plan to change the way the rollup feature works for 2.0? 

 I was thinking about how to change the implementation of the rollup, but it does not seem easy. Let me explain a proposal, and people who are more familiar with the implementation can comment on whether it is easier to do than I think.

 We keep a list of alerts per contact and media, as is already done in contact_alerting_checks:[contact]:media:[media]. When a new alert is triggered, we record it in the list, and if there have already been rollup alerts sent in the last interval, we put the media in "rollup" mode and schedule an action for "interval" seconds later. While a media is in rollup mode, we accumulate notifications (both problems and recoveries). When the rollup interval expires, we send a rollup notification that includes all the notifications that have been rolled up (and perhaps a summary of the currently alerting checks, as is done now). I do not understand the internals of Flapjack well enough to know whether it is easy to schedule an action after a time interval (or whether it is purely driven by incoming check events). This would also require a new data structure to keep all the notifications, not just the alerting checks.
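
A rough Python sketch of this proposal, for one contact and media pair, might look like the following. It is only an illustration under assumptions: the class `MediaRollup` and its methods are hypothetical and not part of Flapjack, "rollup alerts sent in the last interval" is read as "at least `threshold` individual notifications sent in the last `interval` seconds", and the scheduled action is modeled as an explicit `tick` call with a caller-supplied clock, since I do not know how Flapjack would schedule it.

```python
from collections import deque


class MediaRollup:
    """Hypothetical per-contact, per-media rollup state (not Flapjack code)."""

    def __init__(self, threshold, interval):
        self.threshold = threshold  # assumed: sends per interval before rollup
        self.interval = interval    # rollup interval in seconds
        self.sent_times = deque()   # times of recent individual sends
        self.pending = []           # notifications buffered in rollup mode
        self.flush_at = None        # None means "not in rollup mode"

    def notify(self, now, notification):
        """Handle a problem or recovery; return messages to send right now."""
        if self.flush_at is not None:
            # Already in rollup mode: just accumulate.
            self.pending.append(notification)
            return []
        # Forget individual sends older than one interval.
        while self.sent_times and self.sent_times[0] <= now - self.interval:
            self.sent_times.popleft()
        if len(self.sent_times) >= self.threshold:
            # Too many recent sends: enter rollup mode and schedule a flush.
            self.flush_at = now + self.interval
            self.pending.append(notification)
            return []
        self.sent_times.append(now)
        return [("single", [notification])]

    def tick(self, now):
        """Stand-in for the scheduled action; flush once the interval expires."""
        if self.flush_at is None or now < self.flush_at:
            return []
        batch, self.pending = self.pending, []
        self.flush_at = None
        return [("rollup", batch)]
```

The `pending` list is the new data structure mentioned above: it holds every rolled-up notification, including recoveries, rather than only the currently alerting checks. Whether something like `tick` can be driven by a timer, or would have to piggyback on incoming check events, is exactly the part I am unsure about.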

Thanks,
George.