zeromq assertion error with storm 0.8.0


Dane Hammer

Aug 28, 2012, 5:46:55 PM
to storm...@googlegroups.com
We recently upgraded to storm 0.8.0 and hit a new issue while we had classpath problems. A new version of a dependency on the classpath caused a java.lang.VerifyError in our code, and we saw the expected behavior: storm brought down the JVM that hit the error and started it again. This, however, exposed another issue: zeromq would blow up, and we kept running the disk out of space with core dump files. This is where the core dump file took us:

(gdb) where
#0  0x00000031de232885 in raise () from /lib64/libc.so.6
#1  0x00000031de234065 in abort () from /lib64/libc.so.6
#2  0x00000031de22b9fe in __assert_fail_base () from /lib64/libc.so.6
#3  0x00000031de22bac0 in __assert_fail () from /lib64/libc.so.6
#4  0x00007fa4cd69fbc2 in get_socket (env=0x7fa4d88271d0, obj=0x7fa48f1f0688, do_assert=1) at Socket.cpp:543
#5  0x00007fa4cd69fe85 in Java_org_zeromq_ZMQ_00024Socket_send (env=0x7fa4d88271d0, obj=<value optimized out>, msg=0x7fa48f1f0680, flags=2) at Socket.cpp:365

I found a few posts in zeromq-related threads online, but they were all questions, no answers. It feels like a potential race condition between the JVM shutting down and something trying to re-use that socket?

The interesting part is that once we resolved our classpath issue, we stopped seeing the crash. Thoughts?

Nathan Marz

Aug 30, 2012, 2:13:31 AM
to storm...@googlegroups.com
I have no idea. It sounds pretty obviously related to your classpath issues. I haven't ever seen anything like this. If your issue is solved, then I'm not sure what exactly you're looking for.
--
Twitter: @nathanmarz
http://nathanmarz.com

Dane Hammer

Dec 4, 2012, 4:43:50 PM
to storm...@googlegroups.com
So this issue has become far more common for us. I'm experiencing it in two different scenarios, in two different environments. One appears to happen when a topology gets redeployed; the other appears to happen when an unrelated connection is dropped.

I cannot recreate it deliberately, so what could I do to try and trap / catch / identify what code is responsible for the crash?

Yu Dongmin

Dec 5, 2012, 2:14:35 AM
to storm...@googlegroups.com
In theory, this assert can occur when ZMQ socket.send is called after the ZMQ socket instance is released.

As far as I know, Storm doesn't release the instance explicitly. I suspect JNI memory is screwed up. Try setting zmq.hwm = 1000 or so and see if the same assert occurs.
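To make the "send after release" theory concrete, and as one possible way to trap the caller in Java before the native assert aborts the JVM, here is a minimal sketch. Everything in it is an assumption for illustration: RawSocket, FakeRawSocket and GuardedSocket are hypothetical names, not part of JZMQ or Storm, and the fake socket stands in for the native one so the example runs without libzmq. The idea is only that a wrapper remembers where close() was called and turns a later send() into a Java exception carrying both stack traces.

```java
import java.util.ArrayList;
import java.util.List;

public class GuardedSocketDemo {

    /** Hypothetical stand-in for a native-backed socket; NOT the JZMQ API. */
    interface RawSocket {
        boolean send(byte[] msg, int flags);
        void close();
    }

    /** In-memory fake so the sketch runs without libzmq or JZMQ. */
    static class FakeRawSocket implements RawSocket {
        final List<byte[]> sent = new ArrayList<>();

        public boolean send(byte[] msg, int flags) {
            sent.add(msg);
            return true;
        }

        public void close() {
            // A real socket would release its native handle here.
        }
    }

    /** Wrapper that fails loudly in Java instead of reaching released native state. */
    static class GuardedSocket implements RawSocket {
        private final RawSocket delegate;
        private Throwable closedAt; // non-null once close() has run; guarded by "this"

        GuardedSocket(RawSocket delegate) {
            this.delegate = delegate;
        }

        public synchronized boolean send(byte[] msg, int flags) {
            if (closedAt != null) {
                // The exception's own trace shows the late sender;
                // its cause shows where the socket was closed.
                throw new IllegalStateException(
                        "send() after close() on thread "
                                + Thread.currentThread().getName(),
                        closedAt);
            }
            return delegate.send(msg, flags);
        }

        public synchronized void close() {
            if (closedAt == null) {
                closedAt = new Throwable(
                        "socket closed here on thread "
                                + Thread.currentThread().getName());
                delegate.close();
            }
        }
    }

    public static void main(String[] args) {
        GuardedSocket socket = new GuardedSocket(new FakeRawSocket());
        socket.send("hello".getBytes(), 0);
        socket.close();
        try {
            socket.send("too late".getBytes(), 0);
        } catch (IllegalStateException e) {
            // Prints both the sender's trace and the "closed here" cause.
            e.printStackTrace();
        }
    }
}
```

Because send() and close() are synchronized on the wrapper, a close racing with a send is serialized in Java rather than left to the native layer. Whether a wrapper like this can actually be slotted into Storm's messaging code is not something this sketch establishes.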

Thanks
Min

Dane Hammer

Dec 5, 2012, 6:11:22 PM
to storm...@googlegroups.com
I'm having a hard time figuring out where I configure that. Is that for the zeromq code? JZMQ? storm?

Dane Hammer

Dec 5, 2012, 6:22:53 PM
to storm...@googlegroups.com
Oh sorry, found it. The master branch of storm has that value in the Config. We're on storm 0.8.1, and I'm not sure how many things would be affected by trying to uplift right now. I'll see if I can reproduce what storm is doing with that value.