Librato performance


Jonas Tehler

Mar 6, 2013, 5:35:19 AM
to rieman...@googlegroups.com

Hi

I'm running riemann-health on about a hundred servers and have a stream like this:

  (where (service #"^(cpu|memory|load|disk)")
    index
    (fn [event] (info (:host event) (:metric event) (:service event)))
    (librato :gauge))


Adding the librato line seems to make the event processing slower, and after a while I start getting errors like this:

INFO [2013-03-06 10:16:10,768] Thread-7 - riemann.config - lb-75 nil load

WARN [2013-03-06 10:16:10,769] Thread-7 - riemann.streams - riemann.streams$smap$stream__4498@94a3f4b threw
java.lang.ArithmeticException: Divide by zero
at clojure.lang.Numbers.divide(Numbers.java:156)
at clojure.lang.Numbers.divide(Numbers.java:3671)
at riemann.folds$mean.invoke(folds.clj:156)
at riemann.streams$smap$stream__4498.invoke(streams.clj:151)
at riemann.streams$fixed_time_window$stream__4684$fn__4714.invoke(streams.clj:342)
at riemann.streams$fixed_time_window$stream__4684.invoke(streams.clj:342)
at riemann.streams$by_fn$stream__5562$fn__5570.invoke(streams.clj:1176)
at riemann.streams$by_fn$stream__5562.invoke(streams.clj:1176)
at riemann.streams$adjust$stream__5536$fn__5546.invoke(streams.clj:1118)
at riemann.streams$adjust$stream__5536.invoke(streams.clj:1118)
at riemann.streams$adjust$stream__5536$fn__5546.invoke(streams.clj:1118)
at riemann.streams$adjust$stream__5536.invoke(streams.clj:1118)
at riemann.config$eval31$stream__83$fn__88.invoke(riemann.config:38)
at riemann.config$eval31$stream__83.invoke(riemann.config:38)
at riemann.core$reaper$worker__5888$fn__5903.invoke(core.clj:128)
at riemann.core$reaper$worker__5888.invoke(core.clj:121)
at riemann.service.ThreadService$thread_service_runner__4131.invoke(service.clj:55)
at clojure.lang.AFn.run(AFn.java:24)
at java.lang.Thread.run(Thread.java:722)

The metric for these events is somehow nil, which leads to the divide-by-zero error.

Other metrics with a lower number of events per second (a few events every five seconds) send to Librato without problems. I'm using the latest Riemann from GitHub.

Any idea what might be wrong?  Does anyone use Riemann to push a significant number of events to Librato Metrics?

/ Jonas

Aphyr

Mar 6, 2013, 12:06:06 PM
to rieman...@googlegroups.com
On 03/06/2013 02:35 AM, Jonas Tehler wrote:
> INFO [2013-03-06 10:16:10,768] Thread-7 - riemann.config - lb-75 nil
> load
>
> WARN [2013-03-06 10:16:10,769] Thread-7 - riemann.streams -
> riemann.streams$smap$stream__4498@94a3f4b threw
> java.lang.ArithmeticException: Divide by zero
>
> The value for the events somehow is nil which leads to the divide by
> zero error.

Good catch; this happens because there were zero events in a window, and
the mean of zero events is undefined. There's nothing that (folds/mean)
can do in this case, but I can suppress the error message. I'll add a
ticket to do that.
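Until then, one workaround is to keep expired events (which carry a nil :metric, courtesy of the reaper) out of the windowed fold entirely. A sketch, assuming a pipeline like the one in your stack trace (fixed-time-window feeding folds/mean; the window size here is made up):

```clojure
; Drop expired events before windowing, so folds/mean is never handed
; nil metrics produced by the reaper.
(where (not (expired? event))
  (fixed-time-window 5
    (smap folds/mean
      (librato :gauge))))
```

Empty windows can still occur if no events arrive at all, so this narrows the problem rather than eliminating it.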

> Other metrics with lower number of events per second sends to librato
> without problem (a few events every five seconds). I'm using the
> latest Riemann from github.

This is probably a consequence of backpressure. Riemann's TCP protocol
is synchronous; it won't return an ack for a given event until it's been
processed by all streams. The Librato Metrics stream pushes events
synchronously to Librato, which means:

1. You're *guaranteed* by the time the client receives an ack that your
event has actually gone to Librato.
2. Clients won't overload Riemann; the whole system slows down to avoid
backing up queues.
3. Your client has to wait for the round-trip latency (TCP handshake +
HTTP req) to Librato.

One option is just to add more clients, but eventually you'll back up
behind the Netty stream executor pool. The right thing to do for
improved throughput is to defer events asynchronously to a
LinkedBlockingQueue and ThreadPoolExecutor like so:

(let [librato-gauge (async-queue! :librato-gauge
                                  {:queue-size     1000
                                   :core-pool-size 10
                                   :max-pool-size  50}
                                  (librato :gauge))]
  (streams
    (librato-gauge)))

This comes with all the fun of managing queues: you'll probably want
to watch the logs to make sure messages aren't being dropped, and you
have no idea whether an event successfully made it to Librato. You may
want to wrap the async-queue! stream in an (exception-stream (email
"...")) to inform you about errors.

Later, I'm going to extend the librato-metrics streams to accept vectors
of events, and then you can use things like (fixed-event-window) to
batch your metrics together into single HTTP requests. That should
further improve performance.
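Once that lands, batching might look something like this (a sketch: it assumes the librato stream will accept a vector of events, which isn't released yet):

```clojure
; Hypothetical: collect 50 events at a time and ship each batch to
; Librato in a single HTTP request.
(fixed-event-window 50
  (librato :gauge))
```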

Warning: this feature is very new! I have not tuned it in production and
I have no idea how it will behave for your particular balance of IO,
concurrency, latency, client demand, etc. Queues are a dark art. ;-)

--Kyle

Jonas Tehler

Mar 6, 2013, 4:11:55 PM
to rieman...@googlegroups.com

On 6 mar 2013, at 18:06, Aphyr <ap...@aphyr.com> wrote:

> On 03/06/2013 02:35 AM, Jonas Tehler wrote:
>> Other metrics with lower number of events per second sends to librato
>> without problem (a few events every five seconds). I'm using the
>> latest Riemann from github.
>
> This is probably a consequence of backpressure. Riemann's TCP protocol
> is synchronous; it won't return an ack for a given event until it's been
> processed by all streams. The Librato Metrics stream pushes events
> synchronously to Librato, which means:
>
> 1. You're *guaranteed* by the time the client receives an ack that your
> event has actually gone to Librato.
> 2. Clients won't overload Riemann; the whole system slows down to avoid
> backing up queues.
> 3. Your client has to wait for the round-trip latency (TCP handshake + HTTP req) to Librato.

I see, that explains it.


> One option is just to add more clients, but eventually you'll back up behind the netty stream executor pool. The right thing to do for improved throughput is to defer events asynchronously to a linkedblockingqueue and threadpool executor like so:
>
> (let [librato-gauge (async-queue! :librato-gauge
>                                   {:queue-size     1000
>                                    :core-pool-size 10
>                                    :max-pool-size  50}
>                                   (librato :gauge))]
>   (streams
>     (librato-gauge)))
>
> This comes with all the fun of managing queues, so you'll probably want to watch the logs to make sure messages aren't being dropped; you have no idea whether an event successfully made it to librato, etc. You may want to wrap the async-queue! stream in an (exception-stream (email "...")) to inform you about errors.
>
> Later, I'm going to extend the librato-metrics streams to accept vectors of events, and then you can use things like (fixed-event-window) to batch your metrics together into single HTTP requests. That should further improve performance.
>
> Warning: this feature is very new! I have not tuned it in production and I have no idea how it will behave for your particular balance of IO, concurrency, latency, client demand, etc. Queues are a dark art. ;-)

I'm getting an error when trying to use the async-queue:

clojure.lang.ArityException: Wrong number of args (0) passed to: streams$execute-on$stream
at clojure.lang.AFn.throwArity(AFn.java:437)
at clojure.lang.AFn.invoke(AFn.java:35)
at riemann.config$eval31.invoke(riemann.config:74)
at clojure.lang.Compiler.eval(Compiler.java:6511)
at clojure.lang.Compiler.load(Compiler.java:6952)
at clojure.lang.Compiler.loadFile(Compiler.java:6912)
at clojure.lang.RT$3.invoke(RT.java:307)
at riemann.config$include.invoke(config.clj:249)
at riemann.bin$_main.doInvoke(bin.clj:51)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at riemann.bin.main(Unknown Source)

I'm new to Clojure, so I might be doing something wrong, but I think it's done as you specified. Line 74 in my riemann.config is where I call (librato-gauge).

/ Jonas

Aphyr

Mar 6, 2013, 4:14:53 PM
to rieman...@googlegroups.com
On 03/06/2013 01:11 PM, Jonas Tehler wrote:
> Line 74 in my riemann.config is where I call (librato-gauge).

Would you mind pasting your config somewhere? Feel free to drop by
#riemann if you want to talk about it interactively.
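One guess before seeing it: the ArityException says a stream was invoked with zero arguments at config load time. async-queue! returns a stream, i.e. a function of one event, so it should be passed to (streams ...) by name rather than called:

```clojure
; Pass the stream itself as a child of streams; (librato-gauge) would
; invoke it immediately with no event, which throws the arity error.
(streams
  librato-gauge)
```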

--Kyle
