Hi,
I ran into the following case running ELS 1.6.2 with Zookeeper 3.4.6 and the els-zookeeper plugin 1.6.1.
It looks like the plugin lost its connection to Zookeeper because of a network partition, and when it tried to reconnect it got an UnknownHostException, presumably caused by the same partition.
Rather than pausing and retrying after a while, the plugin just gave up and left the cluster.
I've added the logs below and highlighted the relevant parts.
---
[2016-02-10 06:15:35,746][INFO ][org.apache.zookeeper.ClientCnxn] Client session timed out, have not heard from server in 26679ms for sessionid 0x34f3ad0a3294272, closing socket connection and attempting reconnect
[2016-02-10 06:15:35,943][INFO ][org.apache.zookeeper.ClientCnxn] Opening socket connection to server ip-10-10-1-5.eu-west-1.compute.internal/10.10.1.5:2181. Will not attempt to authenticate using SASL (unknown error)
[2016-02-10 06:15:35,944][INFO ][org.apache.zookeeper.ClientCnxn] Socket connection established to ip-10-10-1-5.eu-west-1.compute.internal/10.10.1.5:2181, initiating session
[2016-02-10 06:15:35,953][INFO ][org.apache.zookeeper.ClientCnxn] Unable to reconnect to ZooKeeper service, session 0x34f3ad0a3294272 has expired, closing socket connection
[2016-02-10 06:15:35,961][INFO ][com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService] [Omen] Restarting ZooKeeper discovery
[2016-02-10 06:15:35,961][INFO ][org.apache.zookeeper.ZooKeeper] Initiating client connection, connectString=zookeeper.service.consul:2181 sessionTimeout=60000 watcher=com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService$1@463dc8
[2016-02-10 06:15:35,999][ERROR][org.apache.zookeeper.ClientCnxn] Caught unexpected throwable
org.elasticsearch.ElasticsearchException: Cannot start ZooKeeper
at com.sonian.elasticsearch.zookeeper.client.ZooKeeperFactory.newZooKeeper(ZooKeeperFactory.java:61)
at com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService.doStart(ZooKeeperClientService.java:91)
at com.sonian.elasticsearch.zookeeper.client.ZooKeeperClientService$19.processResult(ZooKeeperClientService.java:517)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:609)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Caused by: java.net.UnknownHostException: zookeeper.service.consul: unknown error
at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:907)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1302)
at java.net.InetAddress.getAllByName0(InetAddress.java:1255)
at java.net.InetAddress.getAllByName(InetAddress.java:1171)
at java.net.InetAddress.getAllByName(InetAddress.java:1105)
at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:61)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:445)
at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:380)
at com.sonian.elasticsearch.zookeeper.client.ZooKeeperFactory.newZooKeeper(ZooKeeperFactory.java:59)
... 4 more
[2016-02-10 06:15:35,999][INFO ][org.apache.zookeeper.ClientCnxn] EventThread shut down
---
The main reason for using this plugin is to get a robust solution in the presence of network partitions,
and it seems to me that this bug negates the benefits of the tool itself.
Is there any chance of getting this issue fixed, even if only for a recent ELS version?
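
To make the behaviour I was expecting concrete, here is a minimal sketch of a retry loop with capped exponential backoff. It is not the plugin's code: the class name `RetryingConnect`, the method `withBackoff` and all of its parameters are made up for illustration, and it only uses the JDK.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of the plugin or of ZooKeeper.
public class RetryingConnect {

    /**
     * Calls connect until it returns normally or maxAttempts is reached.
     * Sleeps between failed attempts, doubling the delay each time up to maxDelayMs.
     * If every attempt fails, the exception from the last attempt is rethrown.
     */
    public static <T> T withBackoff(Callable<T> connect,
                                    int maxAttempts,
                                    long initialDelayMs,
                                    long maxDelayMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return connect.call();
            } catch (Exception e) {
                // e.g. an UnknownHostException while DNS is unreachable
                last = e;
                if (attempt == maxAttempts) {
                    break;
                }
                Thread.sleep(delay);
                delay = Math.min(delay * 2, maxDelayMs);
            }
        }
        throw last;
    }
}
```

In the plugin, the `connect` callable would presumably wrap whatever creates the new ZooKeeper client (the `ZooKeeperFactory.newZooKeeper` call in the stack trace above), so that a temporary DNS failure delays the restart of discovery instead of aborting it. That is my assumption about where it would fit, not something I have checked against the source.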
Bruno