ZooKeeper ConnectionLossException when repeatedly executing a topology

Juta

Jan 10, 2012, 2:28:24 AM
to storm-user
Hi, all.

I repeatedly ran and killed a single topology, and sometimes caught
a ZooKeeper ConnectionLossException.
Does anyone know what is wrong and how to fix it?

My environment:
1 nimbus host, 5 supervisor hosts.
1 ZooKeeper instance on the nimbus host.

My topology has more than 10 bolts and one of them emits millions of
tuples.

To evaluate the topology, I repeatedly run it for 10 minutes and then
kill it.



[ERROR] Async loop died!
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /supervisors
at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:809)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:837)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:90)
at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
at backtype.storm.zookeeper$exists_node_QMARK_.invoke(zookeeper.clj:60)
at backtype.storm.zookeeper$mkdirs.invoke(zookeeper.clj:67)
at backtype.storm.cluster$mk_distributed_cluster_state$reify__1433.set_ephemeral_node(cluster.clj:53)
at backtype.storm.cluster$mk_storm_cluster_state$reify__1905.supervisor_heartbeat_BANG_(cluster.clj:290)
at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:90)
at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
at backtype.storm.daemon.supervisor$fn__3452$exec_fn__858__auto____3453$heartbeat_fn__3615.invoke(supervisor.clj:290)
at backtype.storm.daemon.supervisor$fn__3452$exec_fn__858__auto____3453$fn__3617.invoke(supervisor.clj:296)
at clojure.lang.AFn.applyToHelper(AFn.java:159)
at clojure.lang.AFn.applyTo(AFn.java:151)
at clojure.core$apply.invoke(core.clj:540)
at backtype.storm.util$async_loop$fn__443.invoke(util.clj:215)
at clojure.lang.AFn.run(AFn.java:24)
at java.lang.Thread.run(Thread.java:662)

2012-01-10 14:36:33,040 worker [ERROR] Error on initialization of server mk-worker
java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /tasks/test3-5-1326173755/17
at clojure.lang.LazySeq.sval(LazySeq.java:47)
at clojure.lang.LazySeq.seq(LazySeq.java:56)
at clojure.lang.Cons.next(Cons.java:39)
at clojure.lang.LazySeq.next(LazySeq.java:88)
at clojure.lang.RT.next(RT.java:560)
at clojure.core$next.invoke(core.clj:61)
at clojure.core$dorun.invoke(core.clj:2451)
at clojure.core$doall.invoke(core.clj:2465)
at backtype.storm.daemon.common$storm_task_info.invoke(common.clj:64)
at backtype.storm.daemon.worker$fn__3074$exec_fn__858__auto____3075.invoke(worker.clj:96)
at clojure.lang.AFn.applyToHelper(AFn.java:187)
at clojure.lang.AFn.applyTo(AFn.java:151)
at clojure.core$apply.invoke(core.clj:540)
at backtype.storm.daemon.worker$fn__3074$mk_worker__3216.doInvoke(worker.clj:78)
at clojure.lang.RestFn.invoke(RestFn.java:513)
at backtype.storm.daemon.worker$_main.invoke(worker.clj:247)
at clojure.lang.AFn.applyToHelper(AFn.java:174)
at clojure.lang.AFn.applyTo(AFn.java:151)
at backtype.storm.daemon.worker.main(Unknown Source)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /tasks/test3-5-1326173755/17
at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:809)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:837)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:90)
at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
at backtype.storm.zookeeper$get_data.invoke(zookeeper.clj:75)
at backtype.storm.cluster$mk_distributed_cluster_state$reify__1433.get_data(cluster.clj:73)
at backtype.storm.cluster$mk_storm_cluster_state$reify__1905.task_info(cluster.clj:250)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:90)
at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
at backtype.storm.daemon.common$storm_task_info$iter__823__827$fn__828.invoke(common.clj:65)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
... 18 more

2012-01-10 14:36:53,241 worker [ERROR] Error on initialization of server mk-worker
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /storm
at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:809)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:837)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:90)
at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
at backtype.storm.zookeeper$exists_node_QMARK_.invoke(zookeeper.clj:60)
at backtype.storm.zookeeper$mkdirs.invoke(zookeeper.clj:67)
at backtype.storm.cluster$mk_distributed_cluster_state.invoke(cluster.clj:23)
at backtype.storm.daemon.worker$fn__3074$exec_fn__858__auto____3075.invoke(worker.clj:82)
at clojure.lang.AFn.applyToHelper(AFn.java:187)
at clojure.lang.AFn.applyTo(AFn.java:151)
at clojure.core$apply.invoke(core.clj:540)
at backtype.storm.daemon.worker$fn__3074$mk_worker__3216.doInvoke(worker.clj:78)
at clojure.lang.RestFn.invoke(RestFn.java:513)
at backtype.storm.daemon.worker$_main.invoke(worker.clj:247)
at clojure.lang.AFn.applyToHelper(AFn.java:174)
at clojure.lang.AFn.applyTo(AFn.java:151)
at backtype.storm.daemon.worker.main(Unknown Source)

Nathan Marz

Jan 10, 2012, 2:36:29 AM
to storm...@googlegroups.com
0.6.2 is going to fix this issue by revamping the Zookeeper connection management by using the Curator client. 0.6.2-SNAPSHOT from the downloads page ( https://github.com/nathanmarz/storm/downloads ) already has these changes. 

How often are you seeing these errors?
--
Twitter: @nathanmarz
http://nathanmarz.com
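
For readers unfamiliar with what a retrying client buys you: connection loss in ZooKeeper is often transient, so a client library can retry the failed call after a short pause instead of letting the exception kill the caller. The following is a minimal sketch of that general idea only. It is not Curator's or Storm's actual code, and `ConnectionLoss`, `call_with_retry`, and the parameter defaults are hypothetical names and values chosen for illustration.

```python
import random
import time


class ConnectionLoss(Exception):
    """Hypothetical stand-in for a transient ZooKeeper connection-loss error."""


def call_with_retry(operation, max_retries=5, base_sleep=0.1, sleep=time.sleep):
    """Run operation(); on ConnectionLoss, back off and retry up to max_retries times."""
    attempt = 0
    while True:
        try:
            return operation()
        except ConnectionLoss:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the original error
            # Exponential backoff with jitter: sleep between half and all of
            # base_sleep * 2^attempt seconds before the next try.
            sleep(base_sleep * random.uniform(0.5, 1.0) * (2 ** attempt))
            attempt += 1
```

A caller would wrap each ZooKeeper operation, for example `call_with_retry(lambda: client.exists("/supervisors"))`, where `client` is whatever ZooKeeper handle the application holds (also hypothetical here). Retrying hides short outages but cannot fix a server that is persistently overloaded.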

Juta

Jan 10, 2012, 2:57:52 AM
to storm-user
Thanks, I will try 0.6.2.

I saw these errors after running a topology for 30 minutes.
Once they started, they appeared about every 10 minutes.

Nathan Marz

Jan 10, 2012, 4:08:36 AM
to storm...@googlegroups.com
Once every 10 minutes is very frequent. I see it once every few days. Let me know if 0.6.2-SNAPSHOT fixes your issue.

Juta

Jan 11, 2012, 12:55:41 AM
to storm-user
I tried 0.6.2-SNAPSHOT and it fixed these errors.
But now I get a ZooKeeper timeout error.

2012-01-11 14:13:49,630 ConnectionState [ERROR] Connection timed out
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
at com.netflix.curator.ConnectionState.getZooKeeper(ConnectionState.java:72)
at com.netflix.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:74)
at com.netflix.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:353)
at com.netflix.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:149)
at com.netflix.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:138)
at com.netflix.curator.RetryLoop.callWithRetry(RetryLoop.java:85)
at com.netflix.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:134)
at com.netflix.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:125)
at com.netflix.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:34)
at backtype.storm.zookeeper$exists_node_QMARK_.invoke(zookeeper.clj:81)
at backtype.storm.zookeeper$mkdirs.invoke(zookeeper.clj:88)
at backtype.storm.cluster$mk_distributed_cluster_state.invoke(cluster.clj:25)
at backtype.storm.daemon.worker$fn__3305$exec_fn__983__auto____3306.invoke(worker.clj:83)
at clojure.lang.AFn.applyToHelper(AFn.java:187)
at clojure.lang.AFn.applyTo(AFn.java:151)
at clojure.core$apply.invoke(core.clj:540)
at backtype.storm.daemon.worker$fn__3305$mk_worker__3446.doInvoke(worker.clj:76)
at clojure.lang.RestFn.invoke(RestFn.java:513)
at backtype.storm.daemon.worker$_main.invoke(worker.clj:265)
at clojure.lang.AFn.applyToHelper(AFn.java:174)
at clojure.lang.AFn.applyTo(AFn.java:151)
at backtype.storm.daemon.worker.main(Unknown Source)
2012-01-11 14:13:49,773 ClientCnxn [WARN] Session 0x0 for server ***.***.***.***.internal/***.***.***.***:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:198)
at sun.nio.ch.IOUtil.read(IOUtil.java:166)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243)
at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:858)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1130)


When I changed ZooKeeper's tickTime from 2000 to 4000, the error no
longer occurred.
I also got another new error:

2012-01-11 13:55:14,652 ClientCnxn [ERROR] Error while calling watcher
java.lang.IllegalStateException
at com.google.common.base.Preconditions.checkState(Preconditions.java:129)
at com.netflix.curator.framework.state.ConnectionStateManager.addStateChange(ConnectionStateManager.java:124)
at com.netflix.curator.framework.imps.CuratorFrameworkImpl.validateConnection(CuratorFrameworkImpl.java:589)
at com.netflix.curator.framework.imps.CuratorFrameworkImpl.processEvent(CuratorFrameworkImpl.java:558)
at com.netflix.curator.framework.imps.CuratorFrameworkImpl.access$000(CuratorFrameworkImpl.java:50)
at com.netflix.curator.framework.imps.CuratorFrameworkImpl$1.process(CuratorFrameworkImpl.java:112)
at com.netflix.curator.ConnectionState.process(ConnectionState.java:149)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506)


thanks.
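
One possible reading of the tickTime change above: a ZooKeeper server clamps each client's requested session timeout to a range that defaults to 2 x tickTime through 20 x tickTime (unless minSessionTimeout or maxSessionTimeout are set explicitly), so doubling tickTime also doubles the shortest session timeout the server will grant. The sketch below only illustrates that arithmetic. The 3000 ms requested timeout is a made-up value, the function name is hypothetical, and nothing in this thread confirms that this is why the timeout error went away.

```python
def negotiated_session_timeout(requested_ms, tick_time_ms, min_ms=None, max_ms=None):
    """Clamp a client's requested session timeout to the server's allowed range."""
    # Documented ZooKeeper defaults when the min/max settings are not configured.
    lower = 2 * tick_time_ms if min_ms is None else min_ms
    upper = 20 * tick_time_ms if max_ms is None else max_ms
    return max(lower, min(requested_ms, upper))


# Hypothetical client asking for a 3000 ms session under both tickTime values.
for tick_time in (2000, 4000):
    granted = negotiated_session_timeout(3000, tick_time)
    print(f"tickTime={tick_time} ms -> granted session timeout {granted} ms")
```

A larger tickTime also slows heartbeats and failure detection, so it trades responsiveness for tolerance of a slow or busy server.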

Nathan Marz

Jan 11, 2012, 5:04:52 AM
to storm...@googlegroups.com
Something is clearly wrong with your Zookeeper. What version of Zookeeper are you using?

Perhaps you should try giving it more memory? If that doesn't help I would try running Zookeeper on its own node to see if that helps.

Juta

Jan 12, 2012, 12:37:08 AM
to storm-user
I'm using Zookeeper ver. 3.3.3.

I tried giving ZooKeeper 1024 MB of memory (which looks like enough) and set
maxClientCnxns=128, but some errors still occurred.
These errors happened soon after submitting the topology, before it
started running.

My topology uses
5 supervisors, 1 zookeeper on nimbus host
34 workers, 34 tasks.


Error1:
2012-01-12 13:41:18,343 ClientCnxn [ERROR] Error while calling
Error2:
2012-01-12 13:41:36,891 ConnectionState [ERROR] Connection timed out
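
As a rough sanity check on maxClientCnxns=128: ZooKeeper applies that limit per client IP address, so what matters is how many connections a single supervisor host opens, not the cluster total. The sketch below estimates that number for the setup described above (34 workers on 5 supervisor hosts). It assumes workers are spread evenly, and the number of ZooKeeper connections per worker and per supervisor (2 and 1 here) are guesses for illustration, not values taken from Storm.

```python
import math


def peak_connections_per_host(workers, hosts, conns_per_worker, conns_per_supervisor=1):
    """Estimate ZooKeeper connections from the busiest host, assuming an even spread."""
    workers_on_busiest_host = math.ceil(workers / hosts)
    return workers_on_busiest_host * conns_per_worker + conns_per_supervisor


# 34 workers over 5 hosts; 2 connections per worker is an assumed figure.
estimate = peak_connections_per_host(workers=34, hosts=5, conns_per_worker=2)
print(estimate, "estimated connections per host, against a limit of", 128)
```

Under these assumptions each host stays far below 128, which would suggest the connection limit is not the bottleneck here. Connections left behind by killed workers until their sessions expire could push the real number higher.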

Nathan Marz

Jan 12, 2012, 4:07:07 AM
to storm...@googlegroups.com
Other people saw problems depending on how they were supervising the daemons, and were able to fix their issues by changing how they supervise the daemons. That may be relevant here -- see this thread: http://groups.google.com/group/storm-user/browse_thread/thread/fcbea2d892e34300

Otherwise, I would try running ZK on its own node and see if that helps. You should also look through http://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html#sc_singleAndDevSetup and make sure your ZK instance is set up properly.
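
One quick way to check whether a ZooKeeper instance is responsive is its "four-letter word" admin commands, sent as plain text over the client port. The sketch below sends `ruok` and returns whatever the server replies (a healthy server answers `imok`). The function name and defaults are hypothetical, and newer ZooKeeper releases may require these commands to be enabled explicitly.

```python
import socket


def four_letter_word(host, port, command=b"ruok", timeout=5.0):
    """Send a four-letter admin command to a ZooKeeper server and return its reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", "replace")
```

For example, `four_letter_word("localhost", 2181)` checks liveness, and passing `command=b"stat"` asks for server statistics and connected clients, which can help spot a server that is slow or holding many connections.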