Riemann crash

Tommy Atkinson

Jan 30, 2014, 9:32:50 AM
to rieman...@googlegroups.com
Riemann died on me this morning. It spammed its log with this exception a few thousand times:

WARN [2014-01-30 08:17:21,691] pool-1-thread-340 - riemann.transport.tcp - TCP handler caught
java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
    at sun.nio.ch.IOUtil.write(IOUtil.java:51)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:487)
    at org.jboss.netty.channel.socket.nio.SocketSendBufferPool$UnpooledSendBuffer.transferTo(SocketSendBufferPool.java:203)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.write0(AbstractNioWorker.java:201)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.writeFromSelectorLoop(AbstractNioWorker.java:158)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:114)
    at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
    at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

This went on until the process was killed by the OOM killer.

I'm using OpenJDK 7 on CentOS 6 with Riemann 0.2.4.

Any suggestions?


(attachment: riemann.config)

Kyle Kingsbury

Jan 30, 2014, 8:43:47 PM
to rieman...@googlegroups.com
Intriguing; I haven't seen this crash before, but we should definitely
get it fixed. This trace suggests that Riemann was trying to write a
response to a TCP client, and the connection closed during the write.
Netty should use a fixed-size threadpool for its IO workers and
executors alike, so I'm unsure where the memory consumption problem came
from. Any chance you got a core dump or a jmap heap dump out of it? Can
you reproduce the issue at all?
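For reference, a heap dump can usually be captured from a running JVM
with something along these lines, where the PID is a placeholder for
Riemann's process ID:

    jmap -dump:format=b,file=riemann-heap.bin <riemann-pid>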

--Kyle

Tommy Atkinson

Jan 31, 2014, 10:10:10 AM
to rieman...@googlegroups.com
Hi Kyle

I was able to get a heap dump using jmap - http://monitor.neosoft.ba/heap.bin.xz

Aphyr

Jan 31, 2014, 1:30:59 PM
to rieman...@googlegroups.com

Awesome. This is during the time when memory use was increasing, prior
to OOM?

--Kyle

Tommy Atkinson

Feb 1, 2014, 9:33:08 AM
to rieman...@googlegroups.com
The dump is from a different day, when I noticed the same thing happening. Memory usage rose to 500 MB RSS, and I took the dump just before the server started swapping. I have since switched the clients to UDP, and memory is stable at around 160 MB.

Kyle Kingsbury

Feb 4, 2014, 7:38:32 PM
to rieman...@googlegroups.com
I don't have a fix, per se, but here's what I can tell you.

Your heap size (256MB) is small--which is nice for me to analyze, but
probably puts Riemann under a lot more GC load than you'd really like.
I'm guessing (we'll get to that) that you're putting a lot of events
through Riemann, so a bigger heap is almost certainly a good idea and
may prevent the memory issues you saw.
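
As a rough sketch only: assuming Riemann is launched from the
standalone jar with something like the command below (packaged installs
usually set JVM options in their init or defaults script instead), the
heap can be raised with the standard -Xms/-Xmx flags:

    java -Xms1g -Xmx1g -jar riemann.jar riemann.config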

Almost all the memory in this dump is consumed by 6 TCP channels, each
of which is retaining 40MB in its writebufferqueue. That's a *huge*
backlog of acknowledgement messages that Riemann was trying to send
back to the client but the kernel couldn't flush.

This suggests to me two things.

One: your client is likely (intentionally or otherwise) not respecting
Riemann's backpressure; it sent a ton of messages in a row without
waiting for Riemann's acknowledgements. Use the synchronous TCP methods,
or try to limit the number of outstanding messages you send on the wire.
Some pipelining is desirable for performance, but you can cut memory use
dramatically by limiting in-flight requests to, I dunno, less than a
thousand.
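
To illustrate what limiting in-flight requests could look like, here's
a minimal sketch. The client interface, callback, and cap of 500 are
all hypothetical stand-ins, not any particular Riemann client's actual
API:

    import java.util.concurrent.Semaphore;

    // Hypothetical types: stand-ins for whatever client library is in use.
    interface Event {}
    interface AckCallback { void onAck(); }
    interface AsyncClient { void sendAsync(Event e, AckCallback cb); }

    // Caps the number of unacknowledged events so that neither the client's
    // queue nor Riemann's per-channel write buffer can grow without bound.
    class BoundedSender {
        private final Semaphore inFlight = new Semaphore(500); // max outstanding events
        private final AsyncClient client;

        BoundedSender(AsyncClient client) {
            this.client = client;
        }

        void send(Event e) throws InterruptedException {
            inFlight.acquire();                 // blocks once the cap is reached
            client.sendAsync(e, new AckCallback() {
                public void onAck() {
                    inFlight.release();         // free a slot when Riemann acknowledges
                }
            });
        }
    }

The exact cap matters less than having one; the semaphore simply turns
an unbounded firehose into a bounded pipeline.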

Two: something *weird* happened--possibly a network hiccup? You noticed
thousands of closed connections, which suggests to me that either the
Riemann process was really overloaded to start with, and this pushed it
over the edge--or that the network was, I dunno, delivering packets in
one direction but not the other? Or maybe this is normal behavior for
your clients? I dunno!

I'm not *exactly* sure how to address this problem. On the one hand, I
consider any crash in Riemann a bug, so we need to figure out how to
fix this. On the other hand, it's not clear how to distinguish this
case from an intentional, highly-pipelined connection except by the
depth of the write queue--or where the appropriate place would be to
add backpressure on the TCP stack via set_writable. I've got some
feelers out to the Netty channel for advice, but meanwhile you may want
to reconsider your client's use case.
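
For context, one common way to apply this kind of backpressure in
Netty 3 looks roughly like the sketch below: a handler that stops
reading from a channel whose outbound buffer has filled past its
high-water mark, and resumes once it drains. This is an illustration of
the general pattern, not a description of what Riemann does:

    import org.jboss.netty.channel.ChannelHandlerContext;
    import org.jboss.netty.channel.ChannelStateEvent;
    import org.jboss.netty.channel.SimpleChannelHandler;

    // Rough idea only: tie a channel's readability to its writability, so a
    // client that won't drain its acks stops being read (and so stops queueing
    // more responses) until the write buffer empties out again.
    public class BackpressureHandler extends SimpleChannelHandler {
        @Override
        public void channelInterestChanged(ChannelHandlerContext ctx, ChannelStateEvent e)
                throws Exception {
            // isWritable() flips as the write buffer crosses the configured
            // high/low water marks.
            ctx.getChannel().setReadable(ctx.getChannel().isWritable());
            super.channelInterestChanged(ctx, e);
        }
    }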

Hope this helps!

--Kyle

Tommy Atkinson

Mar 1, 2014, 7:34:00 AM
to rieman...@googlegroups.com
Thanks for looking into this. We switched over to using graphite-server for receiving data and this works much better.