Re: Zookeeper ConnectionLoss

Rafael Mahnovetsky

Apr 22, 2013, 8:20:58 AM
to storm...@googlegroups.com
Update with more info: when I run the supervisor without a network connection, it starts within a couple of seconds and the topology runs perfectly fine on my Mac. But if I connect the Mac to the network, the supervisor takes about 30 seconds to start. Why would it take so long to start when connected to a network?

On Sunday, April 21, 2013 9:42:08 PM UTC+10, Rafael Mahnovetsky wrote:
Hi Guys

Can anyone give me some tips on how to solve this issue?

I have a Storm cluster running on Ubuntu 12.04 and OS X (Snow Leopard). The topology runs perfectly fine on Ubuntu, but on OS X I get the stack trace below when the worker runs.

I have Nimbus and the UI running on my Mac. Both Ubuntu and OS X have a supervisor.

I have ZooKeeper set up on both machines with the following config:
tickTime=2000
dataDir=/var/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2

server.1=192.168.0.10:2888:3888
server.2=192.168.0.8:2888:3888


I'm using the Storm development version storm-0.9.0-wip16 with this config:
storm.zookeeper.servers:
     - "192.168.0.8"
     - "192.168.0.10"
#
nimbus.host: "192.168.0.8"
storm.local.dir: "/Users/rafael/Documents/rpm-workspace/storm-0.9.0-wip16"
java.library.path: "/usr/local/lib/"
supervisor.slots.ports:
      - 6700
      - 6701
      - 6702
      - 6703

I'm using zookeeper 3.3.5

2013-04-21 21:29:14 o.a.z.s.ZooKeeperServer [INFO] Server environment:user.dir=/Users/rafael/Documents/rpm-workspace/storm-0.9.0-wip16
2013-04-21 21:29:15 b.s.d.worker [INFO] Launching worker for mytopology-1-1366543682 on afc8cc4f-2fb7-4299-a258-509819d620eb:6700 with id 73c5d20f-dbe8-4dd2-8f27-ad312c7a1c1f and conf {"dev.zookeeper.path" "/tmp/dev-storm-zookeeper", "topology.tick.tuple.freq.secs" nil, "topology.builtin.metrics.bucket.size.secs" 60, "topology.fall.back.on.java.serialization" true, "topology.max.error.report.per.interval" 5, "zmq.linger.millis" 5000, "topology.skip.missing.kryo.registrations" false, "ui.childopts" "-Xmx768m", "storm.zookeeper.session.timeout" 20000, "nimbus.reassign" true, "topology.trident.batch.emit.interval.millis" 500, "nimbus.monitor.freq.secs" 10, "java.library.path" "/usr/local/lib/", "topology.executor.send.buffer.size" 1024, "storm.local.dir" "/Users/rafael/Documents/rpm-workspace/storm-0.9.0-wip16", "supervisor.worker.start.timeout.secs" 120, "topology.enable.message.timeouts" true, "nimbus.cleanup.inbox.freq.secs" 600, "nimbus.inbox.jar.expiration.secs" 3600, "drpc.worker.threads" 64, "topology.worker.shared.thread.pool.size" 4, "nimbus.host" "192.168.0.8", "storm.zookeeper.port" 2181, "transactional.zookeeper.port" nil, "topology.executor.receive.buffer.size" 1024, "transactional.zookeeper.servers" nil, "storm.zookeeper.root" "/storm", "supervisor.enable" true, "storm.zookeeper.servers" ["192.168.0.8" "192.168.0.10"], "transactional.zookeeper.root" "/transactional", "topology.acker.executors" 1, "topology.transfer.buffer.size" 1024, "topology.worker.childopts" nil, "drpc.queue.size" 128, "worker.childopts" "-Xmx768m", "supervisor.heartbeat.frequency.secs" 5, "topology.error.throttle.interval.secs" 10, "zmq.hwm" 0, "drpc.port" 3772, "supervisor.monitor.frequency.secs" 3, "topology.receiver.buffer.size" 8, "task.heartbeat.frequency.secs" 3, "topology.tasks" nil, "topology.spout.wait.strategy" "backtype.storm.spout.SleepSpoutWaitStrategy", "topology.max.spout.pending" nil, "storm.zookeeper.retry.interval" 1000, "topology.sleep.spout.wait.strategy.time.ms" 
1, "nimbus.topology.validator" "backtype.storm.nimbus.DefaultTopologyValidator", "supervisor.slots.ports" [6700 6701 6702 6703], "topology.debug" false, "nimbus.task.launch.secs" 120, "nimbus.supervisor.timeout.secs" 60, "topology.message.timeout.secs" 30, "task.refresh.poll.secs" 10, "topology.workers" 1, "supervisor.childopts" "-Xmx256m", "nimbus.thrift.port" 6627, "topology.stats.sample.rate" 0.05, "worker.heartbeat.frequency.secs" 1, "topology.tuple.serializer" "backtype.storm.serialization.types.ListDelegateSerializer", "topology.acker.tasks" nil, "topology.disruptor.wait.strategy" "com.lmax.disruptor.BlockingWaitStrategy", "nimbus.task.timeout.secs" 30, "storm.zookeeper.connection.timeout" 15000, "topology.kryo.factory" "backtype.storm.serialization.DefaultKryoFactory", "drpc.invocations.port" 3773, "zmq.threads" 1, "storm.zookeeper.retry.times" 5, "topology.state.synchronization.timeout.secs" 60, "supervisor.worker.timeout.secs" 30, "nimbus.file.copy.expiration.secs" 600, "drpc.request.timeout.secs" 600, "storm.local.mode.zmq" false, "ui.port" 8080, "nimbus.childopts" "-Xmx1024m", "storm.cluster.mode" "distributed", "topology.optimize" true, "topology.max.task.parallelism" nil}
2013-04-21 21:29:15 c.n.c.f.i.CuratorFrameworkImpl [INFO] Starting
2013-04-21 21:29:15 o.a.z.ZooKeeper [INFO] Initiating client connection, connectString=192.168.0.8:2181,192.168.0.10:2181 sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@368d41f2
2013-04-21 21:29:15 o.a.z.ClientCnxn [INFO] Opening socket connection to server /192.168.0.8:2181
2013-04-21 21:29:30 c.n.c.ConnectionState [ERROR] Connection timed out
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
at com.netflix.curator.ConnectionState.getZooKeeper(ConnectionState.java:72) ~[curator-client-1.0.1.jar:na]
at com.netflix.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:74) [curator-client-1.0.1.jar:na]
at com.netflix.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:353) [curator-framework-1.0.1.jar:na]
at com.netflix.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:149) [curator-framework-1.0.1.jar:na]
at com.netflix.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:138) [curator-framework-1.0.1.jar:na]
at com.netflix.curator.RetryLoop.callWithRetry(RetryLoop.java:85) [curator-client-1.0.1.jar:na]
at com.netflix.curator.framework.imps.ExistsBuilderImpl.pathInForeground(ExistsBuilderImpl.java:134) [curator-framework-1.0.1.jar:na]
at com.netflix.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:125) [curator-framework-1.0.1.jar:na]
at com.netflix.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:34) [curator-framework-1.0.1.jar:na]
at backtype.storm.zookeeper$exists_node_QMARK_.invoke(zookeeper.clj:78) [storm-0.9.0-wip16.jar:na]
at backtype.storm.zookeeper$mkdirs.invoke(zookeeper.clj:88) [storm-0.9.0-wip16.jar:na]
at backtype.storm.cluster$mk_distributed_cluster_state.invoke(cluster.clj:26) [storm-0.9.0-wip16.jar:na]
at backtype.storm.daemon.worker$worker_data.invoke(worker.clj:144) [storm-0.9.0-wip16.jar:na]
at backtype.storm.daemon.worker$fn__4720$exec_fn__1192__auto____4721.invoke(worker.clj:332) [storm-0.9.0-wip16.jar:na]
at clojure.lang.AFn.applyToHelper(AFn.java:185) [clojure-1.4.0.jar:na]
at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.4.0.jar:na]
at clojure.core$apply.invoke(core.clj:601) [clojure-1.4.0.jar:na]
at backtype.storm.daemon.worker$fn__4720$mk_worker__4776.doInvoke(worker.clj:323) [storm-0.9.0-wip16.jar:na]
at clojure.lang.RestFn.invoke(RestFn.java:512) [clojure-1.4.0.jar:na]
at backtype.storm.daemon.worker$_main.invoke(worker.clj:433) [storm-0.9.0-wip16.jar:na]
at clojure.lang.AFn.applyToHelper(AFn.java:172) [clojure-1.4.0.jar:na]
at clojure.lang.AFn.applyTo(AFn.java:151) [clojure-1.4.0.jar:na]
at backtype.storm.daemon.worker.main(Unknown Source) [storm-0.9.0-wip16.jar:na]
2013-04-21 21:29:45 o.a.z.ClientCnxn [INFO] Socket connection established to 192.168.0.8/192.168.0.8:2181, initiating session
2013-04-21 21:29:45 o.a.z.ClientCnxn [INFO] Session establishment complete on server 192.168.0.8/192.168.0.8:2181, sessionid = 0x23e2c5486770002, negotiated timeout = 20000
2013-04-21 21:29:45 b.s.zookeeper [INFO] Zookeeper state update: :connected:none
2013-04-21 21:29:45 o.a.z.ZooKeeper [INFO] Session: 0x23e2c5486770002 closed
2013-04-21 21:29:45 o.a.z.ClientCnxn [INFO] EventThread shut down
2013-04-21 21:29:45 c.n.c.f.i.CuratorFrameworkImpl [INFO] Starting
2013-04-21 21:29:45 o.a.z.ZooKeeper [INFO] Initiating client connection, connectString=192.168.0.8:2181,192.168.0.10:2181/storm sessionTimeout=20000 watcher=com.netflix.curator.ConnectionState@469695f
2013-04-21 21:29:45 o.a.z.ClientCnxn [INFO] Opening socket connection to server /192.168.0.10:2181
2013-04-21 21:30:00 c.n.c.ConnectionState [ERROR] Connection timed out
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
at com.netflix.curator.ConnectionState.getZooKeeper(ConnectionState.java:72) ~[curator-client-1.0.1.jar:na]
at com.netflix.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:74) [curator-client-1.0.1.jar:na]
at com.netflix.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:353) [curator-framework-1.0.1.jar:na]
at com.netflix.curator.framework.imps.ExistsBuilderImpl$2.call(ExistsBuilderImpl.java:149) [curator-framework-1.0.1.jar:na]
at com.netflix.curator.framework.imps.Ex

Viral Bajaria

Apr 22, 2013, 3:14:15 PM
to storm...@googlegroups.com
When both machines are on the network, did you verify that each one can talk to ZooKeeper? From the stack trace it looks like the connection timed out, so the ZooKeeper server might not be reachable. You can try connecting to port 2181 from each of the supervisor boxes.

Also, are there any logs on the ZooKeeper side? Maybe a firewall on your servers is blocking those ports?
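
One way to run that check from each supervisor box is a small script that times a plain TCP connect to the client port. This is only a sketch: the function name `probe` and the 5-second timeout are arbitrary choices for illustration, and the addresses are the ones from the storm.zookeeper.servers config quoted above.

```python
import socket
import time


def probe(host, port=2181, timeout=5.0):
    """Try a plain TCP connect; return (reachable, seconds_taken)."""
    start = time.monotonic()
    try:
        # create_connection raises OSError on refusal, timeout, or no route
        with socket.create_connection((host, port), timeout=timeout):
            reachable = True
    except OSError:
        reachable = False
    return reachable, time.monotonic() - start


if __name__ == "__main__":
    # Addresses taken from storm.zookeeper.servers in the original post
    for host in ("192.168.0.8", "192.168.0.10"):
        ok, secs = probe(host)
        print(f"{host}:2181 reachable={ok} after {secs:.2f}s")
```

If this connects quickly from both boxes while the Storm worker still waits around 15 seconds, a blocked port is unlikely and the delay is probably somewhere in the client itself.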

-Viral



Rafael Mahnovetsky

Apr 23, 2013, 7:48:58 AM
to storm...@googlegroups.com
I found the issue: it was a DNS problem. I'm not sure why, but if I use the router's DNS, it takes ages to connect. If I use my internet provider's DNS, it connects instantly.

Thanks for your help.

jsvachon

Oct 22, 2013, 11:26:18 PM
to storm...@googlegroups.com
Hi, 

I am having a similar issue with ZK, and I don't quite understand why your DNS was at fault, since you're using IP addresses in your config. Can you elaborate on this?

Thanks

Juno Yoon

Nov 5, 2013, 9:32:49 PM
to storm...@googlegroups.com
This is a known ZooKeeper issue. The ZooKeeper client internally tries to resolve the IP address to a hostname, whether you configure IP addresses or hostnames.
If no reverse DNS entry is registered, the client tries to resolve the IP to a hostname in various ways, and it takes about 5 seconds to give up.

If you try to read a node before the connection is made, you get the error you posted.
The solution is to register the hostname/IP pair in the /etc/hosts file, or to ask your DNS administrator to register a reverse DNS entry.
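
To see whether reverse resolution is the slow part on a given machine, you can time a reverse lookup for each ZooKeeper address. Here is a rough sketch in Python. It goes through the OS resolver via `socket.gethostbyaddr`, which is not the Java client's own code path but should consult the same hosts file and DNS servers on a typical setup. The function name `time_reverse_lookup` is made up for this example.

```python
import socket
import time


def time_reverse_lookup(ip):
    """Return (hostname_or_None, seconds_taken) for a reverse lookup of ip."""
    start = time.monotonic()
    try:
        hostname = socket.gethostbyaddr(ip)[0]
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses:
        # no PTR record, no hosts entry, or the resolver gave up
        hostname = None
    return hostname, time.monotonic() - start


if __name__ == "__main__":
    # Addresses from the ZooKeeper config earlier in this thread
    for ip in ("192.168.0.8", "192.168.0.10"):
        name, secs = time_reverse_lookup(ip)
        print(f"{ip} -> {name} ({secs:.2f}s)")
```

If a lookup takes several seconds, adding a line such as `192.168.0.8  zk1` to /etc/hosts (the hostname `zk1` is only a placeholder) should make it return almost immediately.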
