Shutting down local topology fails intermittently

805 views
Skip to first unread message

kyrill007

unread,
Feb 22, 2012, 4:02:54 PM2/22/12
to storm-user
Nathan,

We have a situation with running Storm 0.6.1 in a local topology mode
(we're just running integration tests). It seems to fail to shut down
intermittently due to the main thread not being able to acquire a
required semaphore. Here is the thread dump for that thread.

[testng] "main" prio=10 tid=0xb6d09c00 nid=0x3dbf waiting on condition
[0xb6ee3000]
[testng] java.lang.Thread.State: WAITING (parking)
[testng] at sun.misc.Unsafe.park(Native Method)
[testng] - parking to wait for <0x859add50> (a
java.util.concurrent.Semaphore$NonfairSync)
[testng] at
java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
[testng] at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:
811)
[testng] at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:
969)
[testng] at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:
1281)
[testng] at java.util.concurrent.Semaphore.acquire(Semaphore.java:
286)
[testng] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
Method)
[testng] at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:
39)
[testng] at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:
25)
[testng] at java.lang.reflect.Method.invoke(Method.java:597)
[testng] at
clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:90)
[testng] at
clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:265)
[testng] at backtype.storm.timer$cancel_timer.invoke(timer.clj:78)
[testng] at backtype.storm.daemon.nimbus
$fn__2914$exec_fn__994__auto__$reify__2932.shutdown(nimbus.clj:849)
[testng] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
Method)
[testng] at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:
39)
[testng] at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:
25)
[testng] at java.lang.reflect.Method.invoke(Method.java:597)
[testng] at
clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:90)
[testng] at
clojure.lang.Reflector.invokeNoArgInstanceMember(Reflector.java:265)
[testng] at backtype.storm.testing
$kill_local_storm_cluster.invoke(testing.clj:143)
[testng] at backtype.storm.LocalCluster
$_shutdown.invoke(LocalCluster.clj:21)
[testng] at backtype.storm.LocalCluster.shutdown(Unknown Source)

Just FYI, no other thread appears to have a lock on 0x859add50 (I can
send you a full thread dump if you want).

The culprit is this piece of code in timer.clj

(defn cancel-timer [timer]
(check-active! timer)
(locking (:lock timer)
(reset! (:active timer) false)
(.interrupt (:timer-thread timer)))
(.acquire (:cancel-notifier timer)))

Do you have any idea why this may happen?

Thanks a lot!


Nathan Marz

unread,
Feb 22, 2012, 4:04:37 PM2/22/12
to storm...@googlegroups.com
0.6.2 fixed a concurrency problem with the timer (it could deadlock nimbus). So you should try out that version.
--
Twitter: @nathanmarz
http://nathanmarz.com

kyrill007

unread,
Feb 22, 2012, 5:06:54 PM2/22/12
to storm-user
We're actually using 0.6.2. I just made a mistake.

kyrill007

unread,
Feb 22, 2012, 7:34:55 PM2/22/12
to storm-user
Nathan,

It is getting trickier than the original deadlock. Now we're seeing
the following errors (again intermittently) when trying to shutdown
local topology:

012-02-22 23:07:47,842 INFO [main|] backtype.storm.daemon.supervisor
=> Shutting down supervisor cc01de03-c8e4-4c26-8634-46fd8450efe7

2012-02-22 23:07:47,842 INFO [Thread-11|] backtype.storm.util
=> Async loop interrupted!

2012-02-22 23:07:47,842 INFO [Thread-12|] backtype.storm.util
=> Async loop interrupted!

2012-02-22 23:07:47,843 INFO [Thread-13|] backtype.storm.util
=> Async loop interrupted!

2012-02-22 23:07:47,843 INFO [Thread-9|] backtype.storm.event
=> Event manager interrupted

2012-02-22 23:07:47,844 ERROR [Thread-10|]
backtype.storm.daemon.worker
=> Error on initialization of server mk-worker

java.lang.RuntimeException: java.lang.InterruptedException
at clojure.lang.LazySeq.sval(LazySeq.java:47)
at clojure.lang.LazySeq.seq(LazySeq.java:56)
at clojure.lang.Cons.next(Cons.java:39)
at clojure.lang.RT.next(RT.java:560)
at clojure.core$next.invoke(core.clj:61)
at clojure.core$dorun.invoke(core.clj:2451)
at clojure.core$doall.invoke(core.clj:2465)
at backtype.storm.daemon.common$storm_task_info.invoke(common.clj:63)
at backtype.storm.daemon.worker
$fn__3360$exec_fn__994__auto____3361.invoke(worker.clj:100)
at clojure.lang.AFn.applyToHelper(AFn.java:185)
at clojure.lang.AFn.applyTo(AFn.java:151)
at clojure.core$apply.invoke(core.clj:540)
at backtype.storm.daemon.worker
$fn__3360$mk_worker__3505.doInvoke(worker.clj:78)
at clojure.lang.RestFn.invoke(RestFn.java:512)
at backtype.storm.daemon.supervisor$fn__3988.invoke(supervisor.clj:
417)
at clojure.lang.MultiFn.invoke(MultiFn.java:205)
at backtype.storm.daemon.supervisor
$fn__3749$exec_fn__994__auto____3750$sync_processes__3754$iter__3811__3815$fn__3816.invoke(supervisor.clj:
234)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
at clojure.lang.LazySeq.seq(LazySeq.java:56)
at clojure.lang.RT.seq(RT.java:450)
at clojure.core$seq.invoke(core.clj:122)
at clojure.core$dorun.invoke(core.clj:2450)
at clojure.core$doall.invoke(core.clj:2465)
at backtype.storm.daemon.supervisor
$fn__3749$exec_fn__994__auto____3750$sync_processes__3754.invoke(supervisor.clj:
221)
at backtype.storm.event$event_manager
$fn__2167$fn__2168.invoke(event.clj:25)
at backtype.storm.event$event_manager$fn__2167.invoke(event.clj:22)
at clojure.lang.AFn.run(AFn.java:24)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:485)
at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:
1317)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:919)
at com.netflix.curator.framework.imps.GetDataBuilderImpl
$3.call(GetDataBuilderImpl.java:179)
at com.netflix.curator.framework.imps.GetDataBuilderImpl
$3.call(GetDataBuilderImpl.java:168)
at com.netflix.curator.RetryLoop.callWithRetry(RetryLoop.java:85)
at
com.netflix.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:
164)
at
com.netflix.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:
156)
at
com.netflix.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:
35)
at backtype.storm.zookeeper$get_data.invoke(zookeeper.clj:103)
at backtype.storm.cluster$mk_distributed_cluster_state
$reify__1619.get_data(cluster.clj:70)
at backtype.storm.cluster$mk_storm_cluster_state
$reify__2091.task_info(cluster.clj:248)
at sun.reflect.GeneratedMethodAccessor55.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:
25)
at java.lang.reflect.Method.invoke(Method.java:597)
at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:90)
at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
at backtype.storm.daemon.common$storm_task_info
$iter__959__963$fn__964.invoke(common.clj:64)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
... 27 more
2012-02-22 23:07:47,847 INFO [Thread-10|] backtype.storm.util
=> Halting process: ("Error on initialization")

2012-02-22 23:07:48,616 DEBUG [Thread-22|]
backtype.storm.daemon.worker
=> In heartbeat thread

=================

There errors are pretty deep down in Storm management code. Any
feedback would be greatly appreciated. We're trying to push Storm in a
major organization in the NorthEast.

Thanks,

-Kyrill


On Feb 22, 4:04 pm, Nathan Marz <nathan.m...@gmail.com> wrote:

Nathan Marz

unread,
Feb 22, 2012, 8:14:16 PM2/22/12
to storm...@googlegroups.com
OK, I understand why this is happening. I pushed a fix to master and this will be available in 0.7.0.

kyrill007

unread,
Feb 22, 2012, 8:49:47 PM2/22/12
to storm-user
When do you expect 0.7 to be out?

Thanks!

-K.

kyrill007

unread,
Feb 23, 2012, 1:12:27 AM2/23/12
to storm-user
Nope, Nathan, we upgraded from master branch and it is still failing:

2012-02-23 06:05:21,824 ERROR [Thread-15|]
backtype.storm.daemon.worker
=> Error on initialization of server mk-worker

java.lang.RuntimeException: java.lang.InterruptedException
at clojure.lang.LazySeq.sval(LazySeq.java:47)
at clojure.lang.LazySeq.seq(LazySeq.java:56)
at clojure.lang.Cons.next(Cons.java:39)
at clojure.lang.RT.next(RT.java:560)
at clojure.core$next.invoke(core.clj:61)
at clojure.core$dorun.invoke(core.clj:2451)
at clojure.core$doall.invoke(core.clj:2465)
at backtype.storm.daemon.common$storm_task_info.invoke(common.clj:67)
at backtype.storm.daemon.worker
$fn__3630$exec_fn__1021__auto____3631.invoke(worker.clj:89)
at clojure.lang.AFn.applyToHelper(AFn.java:185)
at clojure.lang.AFn.applyTo(AFn.java:151)
at clojure.core$apply.invoke(core.clj:540)
at backtype.storm.daemon.worker
$fn__3630$mk_worker__3775.doInvoke(worker.clj:67)
at clojure.lang.RestFn.invoke(RestFn.java:512)
at backtype.storm.daemon.supervisor$fn__4258.invoke(supervisor.clj:
418)
at clojure.lang.MultiFn.invoke(MultiFn.java:205)
at backtype.storm.daemon.supervisor
$fn__4019$exec_fn__1021__auto____4020$sync_processes__4024$iter__4081__4085$fn__4086.invoke(supervisor.clj:
234)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
at clojure.lang.LazySeq.seq(LazySeq.java:56)
at clojure.lang.RT.seq(RT.java:450)
at clojure.core$seq.invoke(core.clj:122)
at clojure.core$dorun.invoke(core.clj:2450)
at clojure.core$doall.invoke(core.clj:2465)
at backtype.storm.daemon.supervisor
$fn__4019$exec_fn__1021__auto____4020$sync_processes__4024.invoke(supervisor.clj:
221)
at backtype.storm.event$event_manager
$fn__2357$fn__2359.invoke(event.clj:24)
at backtype.storm.event$event_manager$fn__2357.invoke(event.clj:22)
at clojure.lang.AFn.run(AFn.java:24)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:485)
at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:
1317)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:919)
at com.netflix.curator.framework.imps.GetDataBuilderImpl
$3.call(GetDataBuilderImpl.java:179)
at com.netflix.curator.framework.imps.GetDataBuilderImpl
$3.call(GetDataBuilderImpl.java:168)
at com.netflix.curator.RetryLoop.callWithRetry(RetryLoop.java:85)
at
com.netflix.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:
164)
at
com.netflix.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:
156)
at
com.netflix.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:
35)
at backtype.storm.zookeeper$get_data.invoke(zookeeper.clj:92)
at backtype.storm.cluster$mk_distributed_cluster_state
$reify__1804.get_data(cluster.clj:76)
at backtype.storm.cluster$mk_storm_cluster_state
$reify__2281.task_info(cluster.clj:254)
at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:
25)
at java.lang.reflect.Method.invoke(Method.java:597)
at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:90)
at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
at backtype.storm.daemon.common$storm_task_info
$iter__986__990$fn__991.invoke(common.clj:68)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
... 27 more
2012-02-23 06:05:21,827 INFO [Thread-15|] backtype.storm.util
=> Halting process: ("Error on initialization")

=============

It is very important to somehow get to the bottom of this. Your
attention would be greatly appreciated.

Thank you!

-Kyrill

Nathan Marz

unread,
Feb 23, 2012, 1:32:14 AM2/23/12
to storm...@googlegroups.com
I'm unable to reproduce this, nor can I find a way for this to be possible in the current codebase. You'll have to send me a test case that reproduces this behavior.

kyrill007

unread,
Feb 23, 2012, 9:58:36 AM2/23/12
to storm-user
Nathan,

Maybe we're out of sync on version of dependent libraries?

Here is a sample of what we've got (no everything but some relevant
stuff I think):

bcprov-jdk16-1.46.jar* clojure-contrib-1.2.0.jar* commons-
lang-2.6.jar* json-simple-1.1.jar* libthrift7-0.7.0.jar*
zookeeper-3.3.2.jar*
carbonite-1.0.0.jar* commons-exec-1.1.jar* curator-
client-1.0.1.jar* jvyaml-1.0.0.jar* minlog-1.2.jar*
clojure-1.2.1.jar* commons-io-1.4.jar* curator-
framework-1.0.1.jar* kryo-1.04.jar* storm-0.7.0.jar*

Should we, maybe, upgrade to clojure 1.3.0, zookeeper 3.4.3?

This issue does seem to be a wicked concurrency bug somewhere...

Nathan, thanks for your attention to this.

-Kyrill
> ...
>
> read more »

kyrill007

unread,
Feb 23, 2012, 7:29:01 PM2/23/12
to storm-user
Nathan,

We were able to reproduce it now in our local dev. environment on
Windows with a somewhat different stack trace but with exactly the
same symptoms. Maybe the new stack trace would give you some ideas?

2012-02-23 19:24:17,242 INFO [main|] backtype.storm.daemon.supervisor
=> Shutting down supervisor caf34abc-db32-45fb-aaad-1bad05fd2517

2012-02-23 19:24:17,242 INFO [Thread-11|] backtype.storm.util
=> Async loop interrupted!

2012-02-23 19:24:17,243 INFO [Thread-12|] backtype.storm.util
=> Async loop interrupted!

2012-02-23 19:24:17,243 INFO [Thread-13|] backtype.storm.util
=> Async loop interrupted!

2012-02-23 19:24:17,243 INFO [Thread-9|] backtype.storm.event
=> Event manager interrupted

2012-02-23 19:24:17,246 INFO [Thread-10|]
backtype.storm.daemon.supervisor
=> Launching worker with assignment
#:backtype.storm.daemon.supervisor.LocalAssignment{:storm-id
"testETL-1-1330043033", :task-ids (34 4 40 10 46 16 52 22 28)} for
this supervisor caf34abc-db32-45fb-aaad-1bad05fd2517 on port 2 with id
7770f731-256e-430f-a21b-20f2e4c9a090

2012-02-23 19:24:17,246 INFO [Thread-10|] backtype.storm.daemon.worker
=> Launching worker for testETL-1-1330043033 on caf34abc-db32-45fb-
aaad-1bad05fd2517:2 with id 7770f731-256e-430f-a21b-20f2e4c9a090 and
conf {"topology.fall.back.on.java.serialization" true,
"zmq.linger.millis" 0, "topology.skip.missing.kryo.registrations"
true, "ui.childopts" "-Xmx768m", "storm.zookeeper.session.timeout"
20000, "nimbus.reassign" true, "nimbus.monitor.freq.secs" 10,
"java.library.path" "/usr/local/lib:/opt/local/lib:/usr/lib",
"storm.local.dir" "C:\\cygwin\\tmp\\/d813eb61-2f81-4922-b10c-
bba2078c58ed", "supervisor.worker.start.timeout.secs" 240,
"nimbus.cleanup.inbox.freq.secs" 600,
"nimbus.inbox.jar.expiration.secs" 3600, "storm.zookeeper.port" 2181,
"transactional.zookeeper.port" nil, "transactional.zookeeper.servers"
nil, "storm.zookeeper.root" "/storm", "supervisor.enable" true,
"storm.zookeeper.servers" ["localhost"],
"transactional.zookeeper.root" "/transactional",
"topology.worker.childopts" nil, "worker.childopts" "-Xmx768m",
"supervisor.heartbeat.frequency.secs" 5, "drpc.port" 3772,
"supervisor.monitor.frequency.secs" 3, "task.heartbeat.frequency.secs"
3, "topology.max.spout.pending" nil, "storm.zookeeper.retry.interval"
1000, "supervisor.slots.ports" (1 2 3), "topology.debug" false,
"nimbus.task.launch.secs" 120, "nimbus.supervisor.timeout.secs" 60,
"topology.message.timeout.secs" 30, "task.refresh.poll.secs" 10,
"topology.workers" 1, "supervisor.childopts" "-Xmx1024m",
"nimbus.thrift.port" 6627, "topology.stats.sample.rate" 0.05,
"worker.heartbeat.frequency.secs" 1, "nimbus.task.timeout.secs" 30,
"drpc.invocations.port" 3773, "zmq.threads" 1,
"storm.zookeeper.retry.times" 5,
"topology.state.synchronization.timeout.secs" 60,
"supervisor.worker.timeout.secs" 30,
"nimbus.file.copy.expiration.secs" 600, "storm.local.mode.zmq" false,
"ui.port" 8080, "nimbus.childopts" "-Xmx1024m", "topology.ackers" 1,
"storm.cluster.mode" "local", "topology.optimize" true,
"topology.max.task.parallelism" nil}

2012-02-23 19:24:17,248 ERROR [Thread-10|] backtype.storm.zookeeper
=> Unrecoverable Zookeeper error, halting process: Background
exception was not retry-able or retry gave up

java.io.IOException: Unable to establish loopback connection
at sun.nio.ch.PipeImpl$Initializer.run(PipeImpl.java:106)
at java.security.AccessController.doPrivileged(Native Method)
at sun.nio.ch.PipeImpl.<init>(PipeImpl.java:122)
at sun.nio.ch.SelectorProviderImpl.openPipe(SelectorProviderImpl.java:
27)
at java.nio.channels.Pipe.open(Pipe.java:133)
at sun.nio.ch.WindowsSelectorImpl.<init>(WindowsSelectorImpl.java:
104)
at
sun.nio.ch.WindowsSelectorProvider.openSelector(WindowsSelectorProvider.java:
26)
at java.nio.channels.Selector.open(Selector.java:209)
at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:160)
at org.apache.zookeeper.ClientCnxn.<init>(ClientCnxn.java:331)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:377)
at com.netflix.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:
72)
at com.netflix.curator.HandleHolder.getZooKeeper(HandleHolder.java:
46)
at com.netflix.curator.ConnectionState.reset(ConnectionState.java:
122)
at com.netflix.curator.ConnectionState.start(ConnectionState.java:95)
at
com.netflix.curator.CuratorZookeeperClient.start(CuratorZookeeperClient.java:
137)
at
com.netflix.curator.framework.imps.CuratorFrameworkImpl.start(CuratorFrameworkImpl.java:
167)
at backtype.storm.zookeeper$mk_client.doInvoke(zookeeper.clj:53)
at clojure.lang.RestFn.invoke(RestFn.java:445)
at backtype.storm.cluster
$mk_distributed_cluster_state.invoke(cluster.clj:24)
at backtype.storm.daemon.worker
$fn__3630$exec_fn__1021__auto____3631.invoke(worker.clj:74)
Caused by: java.nio.channels.ClosedByInterruptException
at
java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:
184)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:511)
at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
at sun.nio.ch.PipeImpl$Initializer.run(PipeImpl.java:78)
... 39 more
2012-02-23 19:24:17,250 INFO [Thread-10|] backtype.storm.util
=> Halting process: ("Unrecoverable Zookeeper error")

2012-02-23 19:24:17,250 ERROR [Thread-10|]
backtype.storm.daemon.worker
=> Error on initialization of server mk-worker

java.lang.InterruptedException: sleep interrupted
at java.lang.Thread.sleep(Native Method)
at backtype.storm.util$halt_process_BANG_.doInvoke(util.clj:157)
at clojure.lang.RestFn.invoke(RestFn.java:423)
at backtype.storm.zookeeper$mk_client
$reify__1613.unhandledError(zookeeper.clj:52)
at com.netflix.curator.framework.imps.CuratorFrameworkImpl
$4.apply(CuratorFrameworkImpl.java:430)
at com.netflix.curator.framework.imps.CuratorFrameworkImpl
$4.apply(CuratorFrameworkImpl.java:426)
at
com.netflix.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:
80)
at
com.netflix.curator.framework.imps.CuratorFrameworkImpl.logError(CuratorFrameworkImpl.java:
423)
at
com.netflix.curator.framework.imps.CuratorFrameworkImpl.handleBackgroundOperationException(CuratorFrameworkImpl.java:
512)
at
com.netflix.curator.framework.imps.CuratorFrameworkImpl.start(CuratorFrameworkImpl.java:
185)
at backtype.storm.zookeeper$mk_client.doInvoke(zookeeper.clj:53)
at clojure.lang.RestFn.invoke(RestFn.java:445)
at backtype.storm.cluster
$mk_distributed_cluster_state.invoke(cluster.clj:24)
at backtype.storm.daemon.worker
$fn__3630$exec_fn__1021__auto____3631.invoke(worker.clj:74)
2012-02-23 19:24:17,250 INFO [Thread-10|] backtype.storm.util
=> Halting process: ("Error on initialization")

=============================


Thank you so much!

-Kyrill

On Feb 23, 1:32 am, Nathan Marz <nathan.m...@gmail.com> wrote:
> ...
>
> read more »

kyrill007

unread,
Feb 24, 2012, 9:36:02 AM2/24/12
to storm-user
And another type of stack trace in exactly the same spot:

2012-02-24 00:09:02,640 ERROR [Thread-10|]
backtype.storm.daemon.worker
=> Error on initialization of server mk-worker

java.lang.InterruptedException
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:
1302)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:253)
at
com.netflix.curator.CuratorZookeeperClient.internalBlockUntilConnectedOrTimedOut(CuratorZookeeperClient.java:
235)
at com.netflix.curator.RetryLoop.callWithRetry(RetryLoop.java:83)
at
com.netflix.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:
134)
at
com.netflix.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:
125)
at
com.netflix.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:
34)
at backtype.storm.zookeeper$exists_node_QMARK_.invoke(zookeeper.clj:
70)
at backtype.storm.zookeeper$mkdirs.invoke(zookeeper.clj:77)
at backtype.storm.cluster
$mk_distributed_cluster_state.invoke(cluster.clj:25)
at backtype.storm.daemon.worker
$fn__3628$exec_fn__1021__auto____3629.invoke(worker.clj:74)
at clojure.lang.AFn.applyToHelper(AFn.java:185)
at clojure.lang.AFn.applyTo(AFn.java:151)
at clojure.core$apply.invoke(core.clj:540)
at backtype.storm.daemon.worker
$fn__3628$mk_worker__3773.doInvoke(worker.clj:67)
at clojure.lang.RestFn.invoke(RestFn.java:512)
at backtype.storm.daemon.supervisor$fn__4256.invoke(supervisor.clj:
418)
at clojure.lang.MultiFn.invoke(MultiFn.java:205)
at backtype.storm.daemon.supervisor
$fn__4017$exec_fn__1021__auto____4018$sync_processes__4022$iter__4079__4083$fn__4084.invoke(supervisor.clj:
234)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
at clojure.lang.LazySeq.seq(LazySeq.java:56)
at clojure.lang.RT.seq(RT.java:450)
at clojure.core$seq.invoke(core.clj:122)
at clojure.core$dorun.invoke(core.clj:2450)
at clojure.core$doall.invoke(core.clj:2465)
at backtype.storm.daemon.supervisor
$fn__4017$exec_fn__1021__auto____4018$sync_processes__4022.invoke(supervisor.clj:
221)
at backtype.storm.event$event_manager$fn__2357.invoke(event.clj:24)
at clojure.lang.AFn.run(AFn.java:24)
at java.lang.Thread.run(Thread.java:662)
2012-02-24 00:09:02,644 INFO [Thread-10|] backtype.storm.util
=> Halting process: ("Error on initialization")

2012-02-24 00:09:02,651 INFO [Thread-10-EventThread|]
backtype.storm.zookeeper
=> Zookeeper state update: :connected:none

2012-02-24 00:09:02,965 DEBUG [Thread-15|]
backtype.storm.daemon.supervisor
=> Worker a1319c51-1c95-449b-a44a-1cd8e0425d32 is :valid:
#:backtype.storm.daemon.common.WorkerHeartbeat{:time-secs
1330060142, :storm-id "testETL-1-1330060106", :task-ids (32 2 38 8 44
14 50 20 26), :port 4} at supervisor time-secs 1330060142

==========

Some serious concurrency issues...

-K.

On Feb 23, 7:29 pm, kyrill007 <kyrill...@gmail.com> wrote:
> Nathan,
>
> We were able to reproduce it now in ourlocaldev. environment on
> ...
>
> read more »

nathanmarz

unread,
Feb 24, 2012, 2:49:32 PM2/24/12
to storm-user
One of the stack traces makes sense (the ones that say "Unrecoverable
Zookeeper error"). I've opened up an issue for this one:
https://github.com/nathanmarz/storm/issues/141

The other ones still don't make sense, so you need to provide me with
code that reproduces the error since I have not been able to reproduce
it.

Some other questions:

1) How often does this happen? 1% of the time? Every time?
2) Why is this important to you? These errors can only happen when
doing a shutdown of local mode. These errors can never happen on an
actual cluster.
> > $fn__4019$exec_fn__1021__auto____4020$sync_processes__4024.invoke(superviso r.clj:
> > 221)
> >         at backtype.storm.event$event_manager
> > $fn__2357$fn__2359.invoke(event.clj:24)
> >         at backtype.storm.event$event_manager$fn__2357.invoke(event.clj:22)
> >         at clojure.lang.AFn.run(AFn.java:24)
> >         at java.lang.Thread.run(Thread.java:662)
> > Caused by: java.nio.channels.ClosedByInterruptException
> >         at
> > java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibl eChannel.java:
> > 184)
> >         at...
>
> read more »

kyrill007

unread,
Feb 24, 2012, 3:35:41 PM2/24/12
to storm-user
Nathan,

Let's move the discussion to the actual bug. At this point what
happens is clear: Storm local cluster begins a shutdown:

==================================================
2012-02-24 10:30:53,169 INFO [main|] backtype.storm.daemon.nimbus
=> Shutting down master

2012-02-24 10:30:53,179 INFO [main|] backtype.storm.daemon.nimbus
=> Shut down master

2012-02-24 10:30:53,186 INFO [main|] backtype.storm.daemon.supervisor
=> Shutting down 69904207-8058-4007-ac22-a0c3ac21cc4a:30fd235d-
ee9f-4efc-91f7-55ffd02f9ca9

2012-02-24 10:30:53,186 INFO [main|] backtype.storm.process-simulator
=> Killing process 548bbc23-8e90-4f65-a7fe-d9ab6932e9bc

2012-02-24 10:30:53,186 INFO [main|] backtype.storm.daemon.worker
=> Shutting down worker testETL-1-1330097413 69904207-8058-4007-ac22-
a0c3ac21cc4a 1

2012-02-24 10:30:53,186 INFO [main|] backtype.storm.daemon.task
=> Shutting down task testETL-1-1330097413:1

2012-02-24 10:30:53,187 INFO [Thread-115|] backtype.storm.util
=> Async loop interrupted!
==============================================================

The log file continues with normal shutdown, temp directories get
removed, etc. Then surprisingly, Storm begins to create what seems as
a new worker node! And it either succeeds in shutting it down later on
or fails on creation because of the shutdown process:

==============================================================
2012-02-24 10:30:53,351 ERROR [Thread-10|]
backtype.storm.daemon.worker
=> Error on initialization of server mk-worker

java.lang.RuntimeException: java.lang.InterruptedException
at clojure.lang.LazySeq.sval(LazySeq.java:47)
at clojure.lang.LazySeq.seq(LazySeq.java:56)
at clojure.lang.Cons.next(Cons.java:39)
at clojure.lang.RT.next(RT.java:560)
at clojure.core$next.invoke(core.clj:61)
at clojure.core$dorun.invoke(core.clj:2451)
at clojure.core$doall.invoke(core.clj:2465)
at backtype.storm.daemon.common$storm_task_info.invoke(common.clj:67)
at backtype.storm.daemon.worker
$fn__3628$exec_fn__1021__auto____3629.invoke(worker.clj:89)
at clojure.lang.AFn.applyToHelper(AFn.java:185)
at clojure.lang.AFn.applyTo(AFn.java:151)
at clojure.core$apply.invoke(core.clj:540)
at backtype.storm.daemon.worker
$fn__3628$mk_worker__3773.doInvoke(worker.clj:67)
at clojure.lang.RestFn.invoke(RestFn.java:512)
at backtype.storm.daemon.supervisor$fn__4256.invoke(supervisor.clj:
418)
at clojure.lang.MultiFn.invoke(MultiFn.java:205)
at backtype.storm.daemon.supervisor
$fn__4017$exec_fn__1021__auto____4018$sync_processes__4022$iter__4079__4083$fn__4084.invoke(supervisor.clj:
234)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
at clojure.lang.LazySeq.seq(LazySeq.java:56)
at clojure.lang.RT.seq(RT.java:450)
at clojure.core$seq.invoke(core.clj:122)
at clojure.core$dorun.invoke(core.clj:2450)
at clojure.core$doall.invoke(core.clj:2465)
at backtype.storm.daemon.supervisor
$fn__4017$exec_fn__1021__auto____4018$sync_processes__4022.invoke(supervisor.clj:
221)
at backtype.storm.event$event_manager$fn__2357.invoke(event.clj:24)
at clojure.lang.AFn.run(AFn.java:24)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:485)
at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:
1317)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:804)
at com.netflix.curator.framework.imps.ExistsBuilderImpl
$2.call(ExistsBuilderImpl.java:149)
at com.netflix.curator.framework.imps.ExistsBuilderImpl
$2.call(ExistsBuilderImpl.java:138)
at com.netflix.curator.RetryLoop.callWithRetry(RetryLoop.java:85)
at
com.netflix.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:
134)
at
com.netflix.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:
125)
at
com.netflix.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:
34)
at backtype.storm.zookeeper$exists_node_QMARK_.invoke(zookeeper.clj:
70)
at backtype.storm.zookeeper$get_data.invoke(zookeeper.clj:89)
at backtype.storm.cluster$mk_distributed_cluster_state
$reify__1804.get_data(cluster.clj:76)
at backtype.storm.cluster$mk_storm_cluster_state
$reify__2281.task_info(cluster.clj:254)
at sun.reflect.GeneratedMethodAccessor60.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:
25)
at java.lang.reflect.Method.invoke(Method.java:597)
at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:90)
at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
at backtype.storm.daemon.common$storm_task_info
$iter__986__990$fn__991.invoke(common.clj:68)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
... 26 more
2012-02-24 10:30:53,355 INFO [Thread-10|] backtype.storm.util
=> Halting process: ("Error on initialization")

2012-02-24 10:30:53,570 DEBUG [Thread-23|]
backtype.storm.daemon.worker
=> In heartbeat thread

2012-02-24 10:30:53,570 DEBUG [Thread-23|]
backtype.storm.daemon.worker
=> Doing heartbeat
#:backtype.storm.daemon.common.WorkerHeartbeat{:time-secs
1330097453, :storm-id "testETL-1-1330097413", :task-ids (36 6 42 12 48
18 24 30), :port 6}
=============================================================

We have been able to consistently reproduce it on all Windows and
Linux JVM, and on different versions of those JVMs.

I understand that it is strictly related to local cluster. But this is
what we use to do integration testing. So this is wired into CI on
Jenkins, and this sporadic failure occurs every 5-6 run, which freaks
out the rest of the team and really gives Storm a bad PR, especially
since it is a "new and untested" system (you see what I mean).
(It is a major customer in Northern New England. Our architect will be
at the StrataConf next week, so he'll tell you all about it).

We do need some workaround to get out of this sporadic test failures.
I will include full log file for you in the bug report. This really
needs to be addressed...

Thank you very much, and go Storm!

-Kyrill

nathanmarz

unread,
Feb 24, 2012, 4:09:34 PM2/24/12
to storm-user
OK, I was able to figure out the cause of the other kind of stack
trace. Just an oversight. I pushed a fix for both to master. Try it
out and let me know how it goes.



On Feb 24, 12:35 pm, kyrill007 <kyrill...@gmail.com> wrote:
> Nathan,
>
> $fn__4017$exec_fn__1021__auto____4018$sync_processes__4022$iter__4079__4083 $fn__4084.invoke(supervisor.clj:
> 234)
>         at clojure.lang.LazySeq.sval(LazySeq.java:42)
>         at clojure.lang.LazySeq.seq(LazySeq.java:56)
>         at clojure.lang.RT.seq(RT.java:450)
>         at clojure.core$seq.invoke(core.clj:122)
>         at clojure.core$dorun.invoke(core.clj:2450)
>         at clojure.core$doall.invoke(core.clj:2465)
>         at backtype.storm.daemon.supervisor
> $fn__4017$exec_fn__1021__auto____4018$sync_processes__4022.invoke(superviso r.clj:
> 221)
>         at backtype.storm.event$event_manager$fn__2357.invoke(event.clj:24)
>         at clojure.lang.AFn.run(AFn.java:24)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.lang.InterruptedException
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:485)
>         at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:
> 1317)
>         at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:804)
>         at com.netflix.curator.framework.imps.ExistsBuilderImpl
> $2.call(ExistsBuilderImpl.java:149)
>         at com.netflix.curator.framework.imps.ExistsBuilderImpl
> $2.call(ExistsBuilderImpl.java:138)
>         at com.netflix.curator.RetryLoop.callWithRetry(RetryLoop.java:85)
>         at
> com.netflix.curator.framework.imps.ExistsBuilderImpl.pathInForeground(Exist sBuilderImpl.java:
> 134)
>         at
> com.netflix.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderI mpl.java:
> 125)
>         at
> com.netflix.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderI mpl.java:
> 34)
>         at backtype.storm.zookeeper$exists_node_QMARK_.invoke(zookeeper.clj:
> 70)
> > > 2012-02-24 00:09:02,651 INFO [Thread-10-EventThread|]...
>
> read more »

kyrill007

unread,
Feb 24, 2012, 8:05:51 PM2/24/12
to storm-user
Still fails pretty regularly (on every 5 run or so). :-(

Exceptions have several flavors.

1. The original one

=========================================================
2012-02-24 18:50:18,401 ERROR [Thread-15|]
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:919)
at com.netflix.curator.framework.imps.GetDataBuilderImpl
$3.call(GetDataBuilderImpl.java:179)
at com.netflix.curator.framework.imps.GetDataBuilderImpl
$3.call(GetDataBuilderImpl.java:168)
at com.netflix.curator.RetryLoop.callWithRetry(RetryLoop.java:85)
at
com.netflix.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:
164)
at
com.netflix.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:
156)
at
com.netflix.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:
35)
at backtype.storm.zookeeper$get_data.invoke(zookeeper.clj:92)
at backtype.storm.cluster$mk_distributed_cluster_state
$reify__1804.get_data(cluster.clj:76)
at backtype.storm.cluster$mk_storm_cluster_state
$reify__2281.task_info(cluster.clj:254)
at sun.reflect.GeneratedMethodAccessor61.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:
25)
at java.lang.reflect.Method.invoke(Method.java:597)
at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:90)
at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
at backtype.storm.daemon.common$storm_task_info
$iter__986__990$fn__991.invoke(common.clj:68)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
... 26 more
2012-02-24 18:50:18,405 INFO [Thread-15|] backtype.storm.util
=> Halting process: ("Error on initialization")
======================================================

2. The new one

2012-02-24 18:51:25,603 ERROR [Thread-15|]
backtype.storm.daemon.worker
=> Error on initialization of server mk-worker

java.lang.InterruptedException
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:
1024)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:
1303)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:253)
at
com.netflix.curator.CuratorZookeeperClient.internalBlockUntilConnectedOrTimedOut(CuratorZookeeperClient.java:
235)
at com.netflix.curator.RetryLoop.callWithRetry(RetryLoop.java:83)
at
com.netflix.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:
134)
at
com.netflix.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:
125)
at
com.netflix.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:
34)
at backtype.storm.zookeeper$exists_node_QMARK_.invoke(zookeeper.clj:
70)
at backtype.storm.zookeeper$mkdirs.invoke(zookeeper.clj:77)
at backtype.storm.cluster$mk_distributed_cluster_state
$reify__1804.mkdirs(cluster.clj:83)
at backtype.storm.cluster$mk_storm_cluster_state.invoke(cluster.clj:
213)
at backtype.storm.daemon.worker
$fn__3628$exec_fn__1021__auto____3629.invoke(worker.clj:75)
at clojure.lang.AFn.applyToHelper(AFn.java:185)
at clojure.lang.AFn.applyTo(AFn.java:151)
at clojure.core$apply.invoke(core.clj:540)
at backtype.storm.daemon.worker
$fn__3628$mk_worker__3773.doInvoke(worker.clj:67)
at clojure.lang.RestFn.invoke(RestFn.java:512)
at backtype.storm.daemon.supervisor$fn__4256.invoke(supervisor.clj:
418)
at clojure.lang.MultiFn.invoke(MultiFn.java:205)
at backtype.storm.daemon.supervisor
$fn__4017$exec_fn__1021__auto____4018$sync_processes__4022$iter__4079__4083$fn__4084.invoke(supervisor.clj:
234)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
at clojure.lang.LazySeq.seq(LazySeq.java:56)
at clojure.lang.RT.seq(RT.java:450)
at clojure.core$seq.invoke(core.clj:122)
at clojure.core$dorun.invoke(core.clj:2450)
at clojure.core$doall.invoke(core.clj:2465)
at backtype.storm.daemon.supervisor
$fn__4017$exec_fn__1021__auto____4018$sync_processes__4022.invoke(supervisor.clj:
221)
at backtype.storm.event$event_manager$fn__2357.invoke(event.clj:24)
at clojure.lang.AFn.run(AFn.java:24)
at java.lang.Thread.run(Thread.java:662)
2012-02-24 18:51:25,607 INFO [Thread-15|] backtype.storm.util
=> Halting process: ("Error on initialization")
===================================================================

3. The similar new one:

2012-02-24 18:58:51,387 ERROR [Thread-10|]
backtype.storm.daemon.worker
=> Error on initialization of server mk-worker

java.lang.InterruptedException
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:
1302)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:253)
at
com.netflix.curator.CuratorZookeeperClient.internalBlockUntilConnectedOrTimedOut(CuratorZookeeperClient.java:
235)
at com.netflix.curator.RetryLoop.callWithRetry(RetryLoop.java:83)
at
com.netflix.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:
134)
at
com.netflix.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:
125)
at
com.netflix.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:
34)
at backtype.storm.zookeeper$exists_node_QMARK_.invoke(zookeeper.clj:
70)
at backtype.storm.zookeeper$mkdirs.invoke(zookeeper.clj:77)
at backtype.storm.cluster
$mk_distributed_cluster_state.invoke(cluster.clj:25)
at backtype.storm.daemon.worker
$fn__3628$exec_fn__1021__auto____3629.invoke(worker.clj:74)
at clojure.lang.AFn.applyToHelper(AFn.java:185)
at clojure.lang.AFn.applyTo(AFn.java:151)
at clojure.core$apply.invoke(core.clj:540)
at backtype.storm.daemon.worker
$fn__3628$mk_worker__3773.doInvoke(worker.clj:67)
at clojure.lang.RestFn.invoke(RestFn.java:512)
at backtype.storm.daemon.supervisor$fn__4256.invoke(supervisor.clj:
418)
at clojure.lang.MultiFn.invoke(MultiFn.java:205)
at backtype.storm.daemon.supervisor
$fn__4017$exec_fn__1021__auto____4018$sync_processes__4022$iter__4079__4083$fn__4084.invoke(supervisor.clj:
234)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
at clojure.lang.LazySeq.seq(LazySeq.java:56)
at clojure.lang.RT.seq(RT.java:450)
at clojure.core$seq.invoke(core.clj:122)
at clojure.core$dorun.invoke(core.clj:2450)
at clojure.core$doall.invoke(core.clj:2465)
at backtype.storm.daemon.supervisor
$fn__4017$exec_fn__1021__auto____4018$sync_processes__4022.invoke(supervisor.clj:
221)
at backtype.storm.event$event_manager$fn__2357.invoke(event.clj:24)
at clojure.lang.AFn.run(AFn.java:24)
at java.lang.Thread.run(Thread.java:662)
2012-02-24 18:58:51,389 INFO [Thread-10-EventThread|]
backtype.storm.zookeeper
=> Zookeeper state update: :connected:none

2012-02-24 18:58:51,390 INFO [Thread-10|] backtype.storm.util
=> Halting process: ("Error on initialization")

==============================================================

But the behavior is still the same: Storm tries to start 'server mk-
worker' in the middle of a local cluster shutdown.

Once again, any attention to this issue would be tremendously
appreciated. Let me know what we can do to make it easier to debug? If
we give you an ssh login to a system where the issue is easily
reproducible, would it be any help?

Thanks,

-Kyrill



kyrill007

unread,
Feb 24, 2012, 9:51:49 PM2/24/12
to storm-user
Nathan,

I think it did fix it. We probably just used the old jar. It's been a
tough week :-)

We've run the tests about 150 times without a single failure. I'll
keep it running overnight but I think it is solved.

Now, we all really really appreciate your help! Dave K. will buy you a
beer next week!

Thank you, man.

-Kyrill
Reply all
Reply to author
Forward
0 new messages