Alert me once if x happens z times during y period of time


Eclipxs

Aug 7, 2015, 6:34:41 PM
to Riemann Users
Hello, I am new to Riemann and Clojure. 

What I am trying to accomplish is: send a PagerDuty alert if the description contains "deadlocked".

The following works great
Assume I have pd defined with my api service key

(streams
    (where (description #"deadlocked")
        (:trigger pd)))

The 2 problems with this are:
    a) if several deadlocks happen in quick succession, we get spammed by PagerDuty,
    b) we expect deadlocks to happen occasionally, but we need to know if this happens 5 or more times in 1 hour

I tried

(streams
        (where (description #"deadlocked")
            (batch 5 3600
                (:trigger pd))))

This gets really close but presents a few problems:

    a)  once 5 happens within the hour, we will be alerted 5 times
    b)  if an hour passes and we have only received 1 deadlocked match, we still get alerted.

Can anyone point me in the right direction?

Aphyr

Aug 7, 2015, 6:57:43 PM
to rieman...@googlegroups.com

Sounds like you want to take a (rate) and use (changed :state) to detect transitions.
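
A rough sketch of that idea, assuming pd is the PagerDuty stream map from the first post. The with, rate, smap, and changed streams are standard Riemann; the threshold arithmetic is only an illustration and has not been run against a live config. Note that rate emits once per interval and reports events per second, so the sketch scales the metric back up to an hourly count:

```clojure
(streams
  (where (description #"deadlocked")
    ; Count each matching event as 1 so the rate reflects the event count.
    (with :metric 1
      (rate 3600
        ; rate reports events per second; convert to events per hour.
        ; The count is integral, so anything above 4.5 means 5 or more.
        (smap (fn [e]
                (let [per-hour (* 3600 (:metric e))]
                  (assoc e :state (if (> per-hour 4.5) "critical" "ok"))))
          ; Pass an event on only when :state differs from the previous one.
          (changed :state {:init "ok"}
            (where (state "critical")
              (:trigger pd))))))))
```

Because changed only passes events whose :state differs from the previous one, a sustained burst of deadlocks would page once rather than once per event.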


Eclipxs

Aug 7, 2015, 7:12:09 PM
to Riemann Users
I am not sure how (changed) would come into play. I am not looking for a change; rather, I am looking for the exact same thing to happen 5 times in 1 hour.

I felt that batch was really close to what I wanted, except that at the end of the time interval it should not send an alert if fewer than 5 descriptions matched "deadlocked", and it should send only 1 alert if 5 or more did.


Aphyr

Aug 7, 2015, 7:37:48 PM
to rieman...@googlegroups.com

You're looking for a change in rate, from less than 5/interval to more than 5/interval.


Eclipxs

Aug 9, 2015, 12:22:53 AM
to Riemann Users
Brilliant! Thank you very much. One last question then: is there a way to reset the rate after an alert is triggered so we don't get alerts for subsequent deadlocks within that same hour?

Aphyr

Aug 9, 2015, 12:25:52 AM
to rieman...@googlegroups.com
On 08/08/2015 09:22 PM, Eclipxs wrote:
> Brilliant ! Thank you much, One last question then, is there a way to reset the
> rate after an alert is triggered so we dont get alerts for subsequent deadlocks
> within that same hour?

Throttle?

--Kyle
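
For reference, a minimal sketch of what throttling the alert could look like, assuming the same pd as in the first post. (throttle n dt & children) passes on at most n events every dt seconds:

```clojure
(streams
  (where (description #"deadlocked")
    ; Let at most one event per 3600 seconds through to PagerDuty.
    (throttle 1 3600
      (:trigger pd))))
```

On its own this only limits alert volume; it does not check that 5 deadlocks occurred, so it would sit downstream of whatever stream does the counting.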

Eclipxs

Aug 9, 2015, 12:27:45 AM
to Riemann Users
That might work. I'll let you know what I come up with and whether it works as expected. Thanks again.

Eclipxs

Aug 11, 2015, 1:00:13 PM
to Riemann Users
It looks like rate waits the full interval (in seconds) before reporting the rate. I need to know much sooner than that if these deadlocks are happening. This is what I am thinking (pseudocode):

(where (description #"deadlocked")
    (batch 5 3600
       if we have at least 5 events "in case we are here because of interval expiration"
         create new event for pager duty so it only receives one event
           (:trigger pd)))

The only thing I am worried about here is that we could have 5 or more deadlocks within an hour and not be alerted: say 4 happen in the second half of one 1-hour window and then 2 happen in the first half of the next window.

Ideally rate and changed state would work, IF rate would report immediately and not wait the full hour.

Am I missing something?  

Kyle Kingsbury

Aug 11, 2015, 1:05:32 PM
to rieman...@googlegroups.com
On 08/11/2015 10:00 AM, Eclipxs wrote:
> It looks like rate is waiting interval seconds to report the rate. I
> need to know if these deadlocks are happening much quicker than that.
> This is what I am thinking: (pseudo code)
>
> (where (description #"deadlocked")
> (batch 5 3600
> if we have at least 5 events "in case we are here because of
> interval expiration"
> create new event for pager duty so it only receives one event
> (:trigger pd)))
>
> The only thing I am worried about here, is that it is possible that we
> would have a rate >= 5 deadlocks in a hour and not be alerted, say 4
> happen in the last half of this 1 hour window and then 2 happen in the
> first half of the next window.

http://riemann.io/api/riemann.streams.html#var-moving-time-window

--Kyle

Eclipxs

Aug 11, 2015, 1:12:21 PM
to Riemann Users
So something along the lines of:


(where (description #"deadlocked")
  (moving-time-window 3600
    if we have 5 events  <-- (fn [events] (> (count events) 4)) ??
      (:trigger pd)))

???

Eclipxs

Aug 11, 2015, 1:52:09 PM
to Riemann Users
Got it!

(where (description #"deadlocked")
  (moving-time-window 3600
    ; The window stream passes a vector of events to its children.
    (fn [events]
      (when (> (count events) 4)
        ((:trigger pd) (last events))))))

From here, how would I go about resetting the time window after we have alerted pd? just throttle?

Kyle Kingsbury

Aug 11, 2015, 1:53:44 PM
to rieman...@googlegroups.com
On 08/11/2015 10:52 AM, Eclipxs wrote:
> Got it!
>
> (where (description #"deadlocked")
> (moving-time-window 3600
> (fn [events] (if (> (count events) 4) (:trigger pd)))
>
> From here, how would I go about resetting the time window after we have
> alerted pd? just throttle?

I'd take the window and pass it to (smap), emitting an event with a
:state telling whether it was over or under the threshold we expected,
then use (changed) to detect transitions.

--Kyle

Eclipxs

Aug 11, 2015, 2:22:20 PM
to Riemann Users
Looks like http://riemann.io/howto.html#alerting-when-a-certain-percentage-of-events-happe

(let [index (index)]
  ; Inbound events will be passed to these streams:
  (streams 
    (where (description #"deadlock")
      (moving-time-window 3600
        (smap (fn [events]
          (let [deadlockCount (count events)]
            (event { :service "Deadlocks"
                        :metric deadlockCount
                        :state (condp < deadlockCount
                                      "critical"
                                       "ok")
                        :description "5 or more deadlocks have occured in the last hour"}))
            (changed-state (:trigger pd)))))))

It's looking good, but how would we set the state to be critical when the count is >= 5 and ok when it is < 5?

(>= 5) "critical"
(< 5) "ok"
??

Kyle Kingsbury

Aug 11, 2015, 2:29:58 PM
to rieman...@googlegroups.com
On 08/11/2015 11:22 AM, Eclipxs wrote:
> Its looking good, but how would we set the state to be critical when it
> is >= 5 and ok if its < 5 ?

http://riemann.io/api/riemann.streams.html#var-splitp
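
One way splitp might be wired in here, as a sketch only. splitp works like condp, so (splitp <= metric 5 ...) tests (<= 5 metric), and the trailing clause is the default. The alert binding is a name made up for this example; both branches share it so that changed-state sees every event. pd is assumed to be defined as in the first post:

```clojure
(let [; Shared downstream: page only on a transition into "critical".
      alert (changed-state {:init "ok"}
              (where (state "critical")
                (:trigger pd)))]
  (streams
    (where (description #"deadlock")
      (moving-time-window 3600
        ; Collapse the window into one event whose metric is the count.
        (smap (fn [events]
                (event {:service "Deadlocks"
                        :metric  (count events)}))
          (splitp <= metric
            5 (with :state "critical" alert)
              (with :state "ok" alert)))))))
```

A plain (if (>= n 5) "critical" "ok") inside the smap function would do the same job without splitp; which one reads better is a matter of taste.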

Eclipxs

Aug 11, 2015, 3:09:48 PM
to Riemann Users
It doesn't seem to be detecting the state change

Eclipxs

Aug 19, 2015, 2:35:33 PM
to Riemann Users
Resolved! Mismatched parentheses.

slipstream-ed

Aug 20, 2015, 2:13:52 PM
to Riemann Users
 @Eclipxs
Hi,

I'm trying to achieve something similar, would you mind posting your code?

I have events coming at irregular intervals, this is what I have:

mail-alert ((rollup 2 1800(email "o...@acme.com"))

(streams
.....
(moving-time-window 60
        (smap (fn [events]
          (let [failed_logins (count events)]
            (event { :service "app1"
                        :metric failed_logins
                        (where (> metric 3)
                        (smap app_failed mail-alert) 

.......
Thanks

Quinten Marsala

Aug 20, 2015, 2:23:41 PM
to Riemann Users
Yes, I would love to. I'm at lunch; I'll send it as soon as I'm back.

Eclipxs

Aug 20, 2015, 3:02:50 PM
to Riemann Users
@slipstream-ed

This is what I came up with.

It detects database deadlocks in all of our applications, and as soon as we get 5 within 1 hour it alerts us via PagerDuty.

I then went one step further and had it keep track of every service triggering those errors, sending the list of services along in the "tags" field of the event sent to PagerDuty.

If you are new to Clojure like I am, I would suggest aphyr's "Clojure from the ground up" guide. It is helping me a lot.

(streams
  (where (description #"deadlock")
    (moving-time-window 3600
      (smap (fn [events]
              (let [deadlock-count (count events)]
                ; Summarize the events currently in the window as one event.
                (event {:service "Applications"
                        :metric (+ 1 deadlock-count)
                        :state (if (<= deadlock-count 4) "ok" "critical")
                        :description "5 or more database deadlocks have occurred in the last hour, see tags for affected apps"
                        ; Services that reported a deadlock in this window.
                        :tags (map (fn [e] (str (:service e))) events)
                        :ttl 30})))
        ; Only a change in :state reaches PagerDuty.
        (changed-state {:init "ok"} (:trigger pd))))))