ZK server dies, some clients take a long time to reconnect to ensemble

966 views
Skip to first unread message

Eric Pederson

unread,
Oct 27, 2012, 8:47:59 PM10/27/12
to curato...@googlegroups.com
Hi all:

Here is a test scenario I have set up on my workstation.

My application has the ZK server embedded.  There are JVMs running with the app+ZK server.

The relevant settings are:
  • Connection Timeout and Session Timeout: 15 sec
  • RetryNTimes(5, 100ms)
If I kill the JVM with ZK server that is the ZK (not curator) leader then one of the clients usually reconnects right away to the ZK server that takes over as the leader.  The other client takes more time, sometimes up to 3 minutes (like in the example below).

Do my settings sound reasonable?   Is 3 minutes reasonable to expect a reconnect?

Thanks,

20:24:32.291 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:14185] WARN  o.a.zookeeper.server.NIOServerCnxn - caught end of stream exception
org.apache.zookeeper.server.ServerCnxn$EndOfStreamException: Unable to read additional data from client sessionid 0x23aa4a97cc00002, likely client has closed socket

20:24:32.291 [LearnerHandler-/127.0.0.1:51944] ERROR o.a.z.server.quorum.LearnerHandler - Unexpected exception causing shutdown while sock still open
java.io.EOFException: null

20:24:32.292 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:14185] INFO  o.a.zookeeper.server.NIOServerCnxn - Closed socket connection for client /127.0.0.1:51962 which had sessionid 0x23aa4a97cc00002

20:24:32.294 [LearnerHandler-/127.0.0.1:51944] WARN  o.a.z.server.quorum.LearnerHandler - ******* GOODBYE /127.0.0.1:51944 ********

20:24:32.392 [main-EventThread] INFO  c.n.c.f.state.ConnectionStateManager - State change: SUSPENDED
20:24:32.393 [main-EventThread] INFO  c.nyx.services.ZookeeperServiceImpl - watchedEvent: eventType=None state=Disconnected

20:24:32.774 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:14185] INFO  o.a.z.server.NIOServerCnxnFactory - Accepted socket connection from /0:0:0:0:0:0:0:1%0:52064
20:24:32.774 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:14185] INFO  o.a.zookeeper.server.ZooKeeperServer - Client attempting to renew session 0x23aa4a97cc00000 at /0:0:0:0:0:0:0:1%0:52064
20:24:32.775 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:14185] INFO  o.a.zookeeper.server.ZooKeeperServer - Established session 0x23aa4a97cc00000 with negotiated timeout 15000 for client /0:0:0:0:0:0:0:1%0:52064

20:24:46.000 [SessionTracker] INFO  o.a.zookeeper.server.ZooKeeperServer - Expiring session 0x23aa4a97cc00002, timeout of 15000ms exceeded

20:24:46.001 [ProcessThread(sid:2 cport:-1):] INFO  o.a.z.server.PrepRequestProcessor - Processed session termination for sessionid: 0x23aa4a97cc00002

20:24:47.405 [main-EventThread] ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (localhost:14181,localhost:14185,localhost:14189) and timeout (15000) / elapsed (15012)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:24:47.406 [main-EventThread] DEBUG com.netflix.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:24:47.507 [main-EventThread] DEBUG com.netflix.curator.RetryLoop - Retry policy not allowing retry

20:25:00.014 [ThreadPoolTaskScheduler-1] ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (localhost:14181,localhost:14185,localhost:14189) and timeout (15000) / elapsed (27620)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:25:00.015 [ThreadPoolTaskScheduler-1] DEBUG com.netflix.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:25:00.117 [ThreadPoolTaskScheduler-1] DEBUG com.netflix.curator.RetryLoop - Retry policy not allowing retry

20:25:02.519 [main-EventThread] ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (localhost:14181,localhost:14185,localhost:14189) and timeout (15000) / elapsed (30126)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:25:02.520 [main-EventThread] DEBUG com.netflix.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:25:02.620 [main-EventThread] DEBUG com.netflix.curator.RetryLoop - Retry policy not allowing retry

20:25:15.130 [ThreadPoolTaskScheduler-1] ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (localhost:14181,localhost:14185,localhost:14189) and timeout (15000) / elapsed (42736)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:25:15.232 [ThreadPoolTaskScheduler-1] DEBUG com.netflix.curator.RetryLoop - Retry policy not allowing retry

20:25:17.633 [main-EventThread] ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (localhost:14181,localhost:14185,localhost:14189) and timeout (15000) / elapsed (45240)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:25:17.633 [main-EventThread] DEBUG com.netflix.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:25:17.735 [main-EventThread] DEBUG com.netflix.curator.RetryLoop - Retry policy not allowing retry

20:25:30.246 [ThreadPoolTaskScheduler-1] ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (localhost:14181,localhost:14185,localhost:14189) and timeout (15000) / elapsed (57852)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:25:30.247 [ThreadPoolTaskScheduler-1] DEBUG com.netflix.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:25:30.348 [ThreadPoolTaskScheduler-1] DEBUG com.netflix.curator.RetryLoop - Retry policy not allowing retry

20:25:32.749 [main-EventThread] ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (localhost:14181,localhost:14185,localhost:14189) and timeout (15000) / elapsed (60355)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:25:32.749 [main-EventThread] DEBUG com.netflix.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:25:32.850 [main-EventThread] DEBUG com.netflix.curator.RetryLoop - Retry policy not allowing retry

20:25:45.362 [ThreadPoolTaskScheduler-1] ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (localhost:14181,localhost:14185,localhost:14189) and timeout (15000) / elapsed (72968)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:25:45.363 [ThreadPoolTaskScheduler-1] DEBUG com.netflix.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:25:45.464 [ThreadPoolTaskScheduler-1] DEBUG com.netflix.curator.RetryLoop - Retry policy not allowing retry

20:25:47.865 [main-EventThread] ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (localhost:14181,localhost:14185,localhost:14189) and timeout (15000) / elapsed (75471)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:25:47.865 [main-EventThread] DEBUG com.netflix.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:25:47.966 [main-EventThread] DEBUG com.netflix.curator.RetryLoop - Retry policy not allowing retry

20:26:00.478 [ThreadPoolTaskScheduler-1] ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (localhost:14181,localhost:14185,localhost:14189) and timeout (15000) / elapsed (88084)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:26:00.479 [ThreadPoolTaskScheduler-1] DEBUG com.netflix.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:26:00.580 [ThreadPoolTaskScheduler-1] DEBUG com.netflix.curator.RetryLoop - Retry policy not allowing retry

20:26:02.978 [main-EventThread] ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (localhost:14181,localhost:14185,localhost:14189) and timeout (15000) / elapsed (90585)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:26:02.979 [main-EventThread] DEBUG com.netflix.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:26:02.979 [main-EventThread] DEBUG com.netflix.curator.RetryLoop - Retrying operation
20:26:02.979 [main-EventThread] ERROR c.n.c.f.imps.CuratorFrameworkImpl - Watcher exception
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:26:02.979 [main-EventThread] INFO  c.n.c.f.state.ConnectionStateManager - State change: LOST
20:26:02.979 [main-EventThread] INFO  c.nyx.services.ZookeeperServiceImpl - watchedEvent: eventType=None state=Disconnected

20:26:15.593 [ThreadPoolTaskScheduler-1] ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (localhost:14181,localhost:14185,localhost:14189) and timeout (15000) / elapsed (103199)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:26:15.594 [ThreadPoolTaskScheduler-1] DEBUG com.netflix.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:26:15.594 [ThreadPoolTaskScheduler-1] DEBUG com.netflix.curator.RetryLoop - Retrying operation
20:26:15.595 [ThreadPoolTaskScheduler-1] ERROR o.s.s.s.TaskUtils$LoggingErrorHandler - Unexpected error occurred in scheduled task.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss

20:26:17.991 [main-EventThread] ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (localhost:14181,localhost:14185,localhost:14189) and timeout (15000) / elapsed (105598)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:26:17.992 [main-EventThread] DEBUG com.netflix.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:26:18.093 [main-EventThread] DEBUG com.netflix.curator.RetryLoop - Retry policy not allowing retry

20:26:30.609 [ThreadPoolTaskScheduler-1] ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (localhost:14181,localhost:14185,localhost:14189) and timeout (15000) / elapsed (118216)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:26:30.610 [ThreadPoolTaskScheduler-1] DEBUG com.netflix.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:26:30.711 [ThreadPoolTaskScheduler-1] DEBUG com.netflix.curator.RetryLoop - Retry policy not allowing retry

20:26:33.105 [main-EventThread] ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (localhost:14181,localhost:14185,localhost:14189) and timeout (15000) / elapsed (120711)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:26:33.105 [main-EventThread] DEBUG com.netflix.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:26:33.207 [main-EventThread] DEBUG com.netflix.curator.RetryLoop - Retry policy not allowing retry

20:26:45.726 [ThreadPoolTaskScheduler-1] ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (localhost:14181,localhost:14185,localhost:14189) and timeout (15000) / elapsed (133332)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:26:45.727 [ThreadPoolTaskScheduler-1] DEBUG com.netflix.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:26:45.828 [ThreadPoolTaskScheduler-1] DEBUG com.netflix.curator.RetryLoop - Retry policy not allowing retry

20:26:48.218 [main-EventThread] ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (localhost:14181,localhost:14185,localhost:14189) and timeout (15000) / elapsed (135825)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:26:48.219 [main-EventThread] DEBUG com.netflix.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:26:48.320 [main-EventThread] DEBUG com.netflix.curator.RetryLoop - Retry policy not allowing retry

20:27:00.839 [ThreadPoolTaskScheduler-1] ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (localhost:14181,localhost:14185,localhost:14189) and timeout (15000) / elapsed (148445)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:27:00.840 [ThreadPoolTaskScheduler-1] DEBUG com.netflix.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:27:00.942 [ThreadPoolTaskScheduler-1] DEBUG com.netflix.curator.RetryLoop - Retry policy not allowing retry

20:27:03.333 [main-EventThread] ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (localhost:14181,localhost:14185,localhost:14189) and timeout (15000) / elapsed (150940)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:27:03.334 [main-EventThread] DEBUG com.netflix.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:27:03.435 [main-EventThread] DEBUG com.netflix.curator.RetryLoop - Retry policy not allowing retry

20:27:15.954 [ThreadPoolTaskScheduler-1] ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (localhost:14181,localhost:14185,localhost:14189) and timeout (15000) / elapsed (163560)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:27:15.955 [ThreadPoolTaskScheduler-1] DEBUG com.netflix.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:27:16.056 [ThreadPoolTaskScheduler-1] DEBUG com.netflix.curator.RetryLoop - Retry policy not allowing retry

20:27:18.448 [main-EventThread] ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (localhost:14181,localhost:14185,localhost:14189) and timeout (15000) / elapsed (166054)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:27:18.449 [main-EventThread] DEBUG com.netflix.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:27:18.549 [main-EventThread] DEBUG com.netflix.curator.RetryLoop - Retry policy not allowing retry

20:27:31.070 [ThreadPoolTaskScheduler-1] ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (localhost:14181,localhost:14185,localhost:14189) and timeout (15000) / elapsed (178676)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:27:31.071 [ThreadPoolTaskScheduler-1] DEBUG com.netflix.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:27:31.172 [ThreadPoolTaskScheduler-1] DEBUG com.netflix.curator.RetryLoop - Retry policy not allowing retry

20:27:33.562 [main-EventThread] ERROR com.netflix.curator.ConnectionState - Connection timed out for connection string (localhost:14181,localhost:14185,localhost:14189) and timeout (15000) / elapsed (181169)
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:27:33.563 [main-EventThread] DEBUG com.netflix.curator.RetryLoop - Retry-able exception received
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
20:27:33.563 [main-EventThread] DEBUG com.netflix.curator.RetryLoop - Retrying operation
20:27:33.563 [main-EventThread] ERROR c.n.c.f.imps.CuratorFrameworkImpl - Watcher exception
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss

20:27:33.564 [main-EventThread] INFO  c.n.c.f.state.ConnectionStateManager - State change: RECONNECTED


Jordan Zimmerman

unread,
Oct 28, 2012, 11:41:21 AM10/28/12
to curato...@googlegroups.com
It would help to see the DEBUG logs for ZooKeeper. But, as a general reply, I have seen long reconnect times for ZooKeeper if the ZK data set is large. I don't see a particular Curator issue here. You might get better information from the ZK mailing list:

* How large is the ZK dataset?
* It looks like 3 instances running on the same machine. Is that right?
* What recipe/code are you using? The log shows something is continually trying to connect.

-Jordan

--
You received this message because you are subscribed to the Google Groups "curator-users" group.
To post to this group, send email to curato...@googlegroups.com.
To unsubscribe from this group, send email to curator-user...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msg/curator-users/-/3yXJ5PngJfEJ.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Eric Pederson

unread,
Oct 28, 2012, 5:10:12 PM10/28/12
to curato...@googlegroups.com
Hi Jordan - in answer to your questions:
  • The ZK dataset is very small - the only data is the Curator LeaderLatch for 3 clients.
  • Yes - three instances on one machine (for development/testing purposes)
  • I'm using Curator's LeaderLatch recipe.
After changing the log configuration to DEBUG for org.apache.zookeeper I ran another test of killing the JVM which was the ZK quorum leader.   I've attached the logs for the other 2 JVMs in the cluster.   In this test both of them took approximately 90 seconds to reconnect to the cluster.  

Thanks,

-- Eric
jvm1.log
jvm2.log

Jordan Zimmerman

unread,
Oct 28, 2012, 5:20:02 PM10/28/12
to curato...@googlegroups.com
You should probably take this up on the ZK user's list. ZK is having trouble re-establishing leadership. Also, you might try 3.4.4 over 3.4.3.

-Jordan

<jvm1.log><jvm2.log>

Eric Pederson

unread,
Oct 28, 2012, 5:21:26 PM10/28/12
to curato...@googlegroups.com
Ok - thanks!

-- Eric
Reply all
Reply to author
Forward
0 new messages