Store the events for riemann in an external database


apassionatechie

Dec 26, 2021, 7:53:30 AM
to Riemann Users
Currently there is no way to scale Riemann. I wanted to know if we could store the events in a separate DB/cache and have multiple Riemann instances use it, making Riemann scalable.
That is, use a separate DB to store events, TTLs, etc. before they are sent to their destination (e.g. InfluxDB).
I haven't figured out any alternatives yet. Suggestions are welcome.
Currently, whenever Riemann goes down we get hundreds of false alerts; we would like to eliminate those if possible.

Toby McLaughlin

Dec 26, 2021, 8:32:00 AM
to Riemann Users
Hi!

Is your primary goal to scale your Riemann deployment horizontally (because one instance cannot handle all the load), or make it highly available (because downtime is problematic)?

Both of those things are quite tricky to do, and kind of need to be part of your overall architecture. Riemann itself is not a distributed application. It's not really built to share state between multiple instances, because most of the state of a running Riemann is the "stream state", which is information the stream functions accumulate over time as they process events. It's all in memory and not stored in a classical database kind of way.

> Currently whenever riemann goes down we get 100s of false alerts would want to eliminate that if possible.

This sounds like something that maybe could be improved in the logic of your config. Are the alerts being triggered by Riemann when it comes back up, or by some other system?

apassionatechie

Dec 26, 2021, 10:13:30 PM
to Riemann Users
Hi Toby,

We kind of want to do both: make it HA and scale it horizontally (an active-active scenario).
We have alerts triggered when Riemann comes back up, because of the expired events.
We do want alerts on actual expired events, just not the ones caused by Riemann going down.

Toby McLaughlin

Dec 27, 2021, 12:34:01 AM
to Riemann Users
Makes sense, thanks. :)

I don't run an active-active setup myself, but it has occurred to me that if you ran two instances of Riemann and fed them exactly the same events, they would, over time, converge to a point where they had _very similar_, but not identical, states. I send all my triggers to PagerDuty, so it might be OK to let both Riemanns trigger and resolve incidents, and PagerDuty would just deduplicate the extras away.

> We have alerts triggered when the riemann comes back up because of the expired events.

Ah! This is a classic. Old, queued events arrive in Riemann but they are so old that their age exceeds their TTL and they are thus "dead on arrival". Is that what's happening?

One way to avoid this problem is to not trigger on the simple case of an event being expired, but to only trigger when the state changes to expired from some other state (like "ok"). A "changed" stream can help with this. Then, at startup, old events start their life as expired and so don't _change_ to expired and don't trigger.

Another way is to not alert at all on incoming expired events. Instead, when "not-expired" events come in, you can put them in the index as a new service. Then, if that service expires, you know you have a problem. This also lets you aggregate expiry detection. You could have 10 incoming services that all have something in common. If you index every incoming event with ':service "common stuff"', then you can write a trigger that goes off when all 10 of those incoming services stop talking. I use this to distinguish between a monitored resource failing and a failure of the thing that is actually doing the monitoring.
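Something like this aggregated-heartbeat idea could be sketched as below. This is only a sketch, not a tested config: the service name "common stuff" and the 60-second TTL are placeholder values, and you'd adapt the `where` clauses to your own event shape:

```clojure
;; Sketch: re-index every live incoming event under one synthetic service.
(let [index (index)]
  (streams
   ;; Any still-live event refreshes the synthetic index entry.
   (where (not (expired? event))
     (with {:service "common stuff" :ttl 60}   ; placeholder name and TTL
       index))
   ;; If *all* underlying sources go quiet, nothing refreshes the entry,
   ;; it expires, and this branch fires once for the whole group.
   (expired
    (where (service "common stuff")
      (fn [event] (warn "all sources stopped reporting:" event))))))
```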

If you have hundreds of things triggering individual expiry alerts, is it worth thinking about a way to aggregate that? Instead of 100 alerts, maybe there could be one alert that says "100 things have expired".
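One hedged way to do that aggregation is Riemann's "rollup" stream, which passes the first few events through and batches the rest. A sketch, with a placeholder 10-minute interval:

```clojure
;; Sketch: collapse a burst of expirations into at most one
;; notification per 600 seconds (placeholder interval).
(streams
 (expired
  (rollup 1 600
    (fn [batch]
      ;; rollup hands children either a single event or a vector of
      ;; rolled-up events, so normalise before counting.
      (let [events (if (sequential? batch) batch [batch])]
        (warn (count events) "service(s) expired:"
              (mapv :service events)))))))
```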

A "stable" stream can also help to prevent alerts for transient problems. Hopefully, these "false expiries" don't last very long, so a "stable" could suppress them.
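As a sketch (the 60-second window is a placeholder you'd tune), "stable" only passes events once their state has held steady for the given period, so a brief "expired" blip at startup that is quickly replaced by "ok" never reaches the alerting branch:

```clojure
;; Sketch: require :state to hold for 60 seconds (placeholder) before
;; anything downstream sees it, filtering out brief "expired" blips.
(streams
 (stable 60 :state
   (where (state "expired")
     (fn [event] (warn "stably expired:" event)))))
```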

apassionatechie

Jan 4, 2022, 3:43:52 AM
to Riemann Users
Thanks Toby.


"One way to avoid this problem is to not trigger on the simple case of an event being expired, but to only trigger when the state changes to expired from some other state (like "ok"). A "changed" stream can help with this. Then, at startup, old events start their life as expired and so don't _change_ to expired and don't trigger."
Can you elaborate on this, please? How do I achieve it?

Toby McLaughlin

Jan 4, 2022, 8:18:57 AM
to Riemann Users
Sure thing. Here's a minimal config that demonstrates this method:

;; Global setup stuff.
(logging/init {:console true})
(tcp-server {})
(instrumentation {:enabled? false})
(periodically-expire 5)

;; Actual config.
(let [index (index)]
  (streams

   ;; Put events in the index so that they can expire later.
   index

   ;; Log all events at info level.
   (fn [event] (info event))

   ;; Warn on events that _change_ to the expired state.
   (changed :state
            (match :state "expired"
                   (fn [event] (warn event))))))

This config will emit an INFO log for every event received, and it will emit a WARN log when the trigger condition is met. To show the config working, here is a Ruby snippet that sends in some events:

require 'riemann/client'
c = Riemann::Client.new host: 'localhost', port: 5555, timeout: 5
20.downto(0).each do |age|
  c.tcp << {
    host: 'localhost',
    service: 'doa',
    state: 'ok',
    time: Time.now.to_i - age,
    ttl: 10,
  }
end

This will send events with ttl=10, but the first one is 20 seconds old already, so it is "dead on arrival". If we simply triggered on expired events, then this event would trigger, but we don't want that. Instead, this event's expired state sets the initial condition of the "changed" stream. The stream will now only pass events whose state has _changed_, that is, events that are _not_ expired.

The next few events are also DOA, so nothing happens. The last few events are fresh enough to be "ok", so "changed" sees this and passes on the first "ok" event, then waits for more change. Eventually, the events stop coming in and we move on to "real" expiry from the index. When the TTL of the last indexed event runs out, the index emits a new event with state=expired. The changed stream sees this and passes the expired event downstream to the match/warn combination, which represents the trigger.