We recently upgraded to Storm 0.8.0 and ran into a new issue while we had classpath problems. When a new version of a dependency on the classpath caused a java.lang.VerifyError in our code, we saw the expected behavior: Storm brought down the JVM that hit the error and started it again. This, however, exposed another issue: zeromq would blow up, and we kept running the disk out of space with core dump files. This is where the core dump file took us:
(gdb) where
#0 0x00000031de232885 in raise () from /lib64/libc.so.6
#1 0x00000031de234065 in abort () from /lib64/libc.so.6
#2 0x00000031de22b9fe in __assert_fail_base () from /lib64/libc.so.6
#3 0x00000031de22bac0 in __assert_fail () from /lib64/libc.so.6
#4 0x00007fa4cd69fbc2 in get_socket (env=0x7fa4d88271d0, obj=0x7fa48f1f0688, do_assert=1) at Socket.cpp:543
#5 0x00007fa4cd69fe85 in Java_org_zeromq_ZMQ_00024Socket_send (env=0x7fa4d88271d0, obj=<value optimized out>, msg=0x7fa48f1f0680, flags=2) at Socket.cpp:365
I found a few posts in zeromq-related threads online, but they were all questions, no answers. It feels like a potential race condition between the JVM shutting down and something trying to re-use that socket?
The interesting part is that once we resolved our classpath issue, we stopped seeing the problem. Thoughts?
I have no idea. It sounds pretty obviously related to your classpath
issues. I haven't ever seen anything like this. If your issue is solved,
then I'm not sure what exactly you're looking for.
On Tue, Aug 28, 2012 at 2:46 PM, Dane Hammer <dane.molo...@gmail.com> wrote:
> We recently upgraded to storm 0.8.0 and we experience a new issue when we
> had classpath issues. With a new version of a dependency on the classpath
> causing a java.lang.VerifyError in our code, we would see the expected:
> storm brought down the JVM that experienced the error and started again.
> This however exposed another issue: zeromq would blow up and we kept
> running the disk out of space with core dump files. This was where the core
> dump file took us:
> (gdb) where
> #0 0x00000031de232885 in raise () from /lib64/libc.so.6
> #1 0x00000031de234065 in abort () from /lib64/libc.so.6
> #2 0x00000031de22b9fe in __assert_fail_base () from /lib64/libc.so.6
> #3 0x00000031de22bac0 in __assert_fail () from /lib64/libc.so.6
> #4 0x00007fa4cd69fbc2 in get_socket (env=0x7fa4d88271d0,
> obj=0x7fa48f1f0688, do_assert=1) at Socket.cpp:543
> #5 0x00007fa4cd69fe85 in Java_org_zeromq_ZMQ_00024Socket_send
> (env=0x7fa4d88271d0, obj=<value optimized out>, msg=0x7fa48f1f0680,
> flags=2) at Socket.cpp:365
> I found a few posts in zeromq related threads online, but it was all
> questions, no answers. It feels like a potential race condition with the
> JVM shutting down and something trying to re-use that socket?
> The interesting part is when we resolved our classpath issue we quit
> seeing the issue. Thoughts?
So this issue has become far more common for us. I'm experiencing it in two different scenarios, in two different environments. One appears to relate to when a topology gets redeployed, one appears to be when an unrelated connection is dropped.
I cannot recreate it deliberately, so what could I do to try and trap / catch / identify what code is responsible for the crash?
On Thursday, August 30, 2012 1:13:34 AM UTC-5, nathanmarz wrote:
> I have no idea. It sounds pretty obviously related to your classpath issues. I haven't ever seen anything like this. If your issue is solved, then I'm not sure what exactly you're looking for.
In theory, this assert can occur when ZMQ socket.send is called after the ZMQ socket instance is released.
As far as I know, Storm doesn't release the instance explicitly. I suspect JNI memory is screwed up. Try setting zmq.hwm = 1000 or so and see if the same assert occurs.
Thanks
Min
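One way to test the "send after release" theory, and to find out which code is responsible, is to route sends and closes through a thin wrapper that remembers where the socket was closed. The sketch below is only an illustration under assumptions: TracedSocket and RawSocket are hypothetical names, not part of JZMQ or Storm, and RawSocket is a stand-in interface so the example runs without the native library. If the theory is right, a late send would surface as a Java exception carrying the stack of the earlier close() as its cause, instead of reaching the native assert and dumping core.

```java
/** Hypothetical debugging wrapper; not part of JZMQ or Storm. */
final class TracedSocket {

    /** Minimal stand-in for the real socket so the sketch runs without JZMQ. */
    interface RawSocket {
        boolean send(byte[] msg, int flags);
        void close();
    }

    private final RawSocket raw;

    // Stack captured at the first close(); null while the socket is still open.
    private Throwable closedAt;

    TracedSocket(RawSocket raw) {
        this.raw = raw;
    }

    // synchronized so that a close() cannot slip in between the check and the send.
    synchronized boolean send(byte[] msg, int flags) {
        if (closedAt != null) {
            // Fail in Java, with the closer's stack attached as the cause,
            // rather than passing a released socket down to native code.
            throw new IllegalStateException(
                "send() after close() on thread " + Thread.currentThread().getName(),
                closedAt);
        }
        return raw.send(msg, flags);
    }

    synchronized void close() {
        if (closedAt == null) {
            // Record who closed the socket and from where, then close exactly once.
            closedAt = new Throwable(
                "socket closed here by thread " + Thread.currentThread().getName());
            raw.close();
        }
    }
}
```

The lock makes each send slightly more expensive, so this is meant as a temporary diagnostic rather than a permanent change.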
On Dec 5, 2012, at 6:43 AM, Dane Hammer <dane.molo...@gmail.com> wrote:
> So this issue has become far more common for us. I'm experiencing it in two different scenarios, in two different environments. One appears to relate to when a topology gets redeployed, one appears to be when an unrelated connection is dropped.
> I cannot recreate it deliberately, so what could I do to try and trap / catch / identify what code is responsible for the crash?
On Wednesday, December 5, 2012 1:14:35 AM UTC-6, Mini wrote:
> In theory, this assert can occur when ZMQ socket.send is called after the ZMQ socket instance is released.
> As far as I know, Storm doesn't release the instance explicitly. I suspect JNI memory is screwed up. Try setting zmq.hwm = 1000 or so and see if the same assert occurs.
> Thanks
> Min
Oh sorry, found it. The master branch of storm has that value in the Config. We're on storm 0.8.1, not sure how many things would be affected by trying to uplift right now. I'll see if I can reproduce what storm is doing with that value.
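For anyone else looking for where this goes: a minimal sketch of setting the value through an ordinary config map. The key string "zmq.hwm" and the value 1000 come from this thread. The class and method names (ZmqHwmConfig, withHwm) are made up for illustration, and whether a given Storm release actually reads the key is an assumption to verify; per the message above it is present on master but may not be honored on 0.8.1.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; only the "zmq.hwm" key itself comes from the thread. */
final class ZmqHwmConfig {

    // Key name as suggested in the thread. Assumption: the running Storm version reads it.
    static final String ZMQ_HWM = "zmq.hwm";

    /** Returns a copy of baseConf with zmq.hwm set; the input map is left untouched. */
    static Map<String, Object> withHwm(Map<String, Object> baseConf, int hwm) {
        if (hwm < 0) {
            throw new IllegalArgumentException("hwm must be >= 0, got " + hwm);
        }
        Map<String, Object> conf = new HashMap<>(baseConf);
        conf.put(ZMQ_HWM, hwm);
        return conf;
    }
}
```

The resulting map could then be passed as the conf argument to StormSubmitter.submitTopology(name, conf, topology). If the setting has to apply cluster-wide instead, the equivalent storm.yaml entry would be the place to try, again assuming the installed version supports the key.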
On Wednesday, December 5, 2012 5:11:22 PM UTC-6, Dane Hammer wrote:
> I'm having a hard time figuring out where I configure that. Is that for the zeromq code? JZMQ? storm?