Regular Split Brain Issue Running on 2 VMs in Windows Azure (Unicast Discovery)

Regular Split Brain Issue Running on 2 VMs in Windows Azure (Unicast Discovery) Nariman Haghighi 5/6/13 7:36 AM
We're running Elasticsearch on 2 Azure VMs using unicast discovery, with ports 9200 and 9300 open on both nodes.

The initial start is always successful, but we've started to notice a split-brain pattern emerging a few hours into each deployment.

The configuration is as follows:

{"path":{"data":"F:\\","work":"F:\\"},"cluster":{"name":"FiveAces.Coffee.Web"},"node":{"name":"FiveAces.Coffee.Web_IN_0"},"discovery":{"zen":{"ping":{"multicast":{"enabled":false},"unicast":{"hosts":["10.241.238.26","10.241.182.18"]}}}}}

And this is the pattern that recurs with almost every deployment (and subsequently leads to data loss when each side is restarted):

NODE1:

[2013-05-05 22:00:47,123][INFO ][node                     ] [FiveAces.Coffee.Web_IN_1] {0.90.0}[2164]: initializing ...
[2013-05-05 22:00:47,524][INFO ][plugins                  ] [FiveAces.Coffee.Web_IN_1] loaded [], sites [head]
[2013-05-05 22:00:53,259][INFO ][node                     ] [FiveAces.Coffee.Web_IN_1] {0.90.0}[2164]: initialized
[2013-05-05 22:00:53,264][INFO ][node                     ] [FiveAces.Coffee.Web_IN_1] {0.90.0}[2164]: starting ...
[2013-05-05 22:00:53,506][INFO ][transport                ] [FiveAces.Coffee.Web_IN_1] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.241.182.18:9300]}
[2013-05-05 22:01:00,800][INFO ][discovery.zen            ] [FiveAces.Coffee.Web_IN_1] master_left [[FiveAces.Coffee.Web_IN_0][CVJT6uiFR4OEAzzXyRL_yQ][inet[/10.241.238.26:9300]]], reason [do not exists on master, act as master failure]
[2013-05-05 22:01:00,831][INFO ][cluster.service          ] [FiveAces.Coffee.Web_IN_1] detected_master [FiveAces.Coffee.Web_IN_0][CVJT6uiFR4OEAzzXyRL_yQ][inet[/10.241.238.26:9300]], added {[FiveAces.Coffee.Web_IN_0][CVJT6uiFR4OEAzzXyRL_yQ][inet[/10.241.238.26:9300]],}, reason: zen-disco-receive(from master [[FiveAces.Coffee.Web_IN_0][CVJT6uiFR4OEAzzXyRL_yQ][inet[/10.241.238.26:9300]]])
[2013-05-05 22:01:00,911][INFO ][discovery                ] [FiveAces.Coffee.Web_IN_1] FiveAces.Coffee.Web/lH-lp_jsQ4WhwwFJO3B-kA
[2013-05-05 22:01:00,912][INFO ][cluster.service          ] [FiveAces.Coffee.Web_IN_1] master {new [FiveAces.Coffee.Web_IN_1][lH-lp_jsQ4WhwwFJO3B-kA][inet[/10.241.182.18:9300]], previous [FiveAces.Coffee.Web_IN_0][CVJT6uiFR4OEAzzXyRL_yQ][inet[/10.241.238.26:9300]]}, removed {[FiveAces.Coffee.Web_IN_0][CVJT6uiFR4OEAzzXyRL_yQ][inet[/10.241.238.26:9300]],}, reason: zen-disco-master_failed ([FiveAces.Coffee.Web_IN_0][CVJT6uiFR4OEAzzXyRL_yQ][inet[/10.241.238.26:9300]])
[2013-05-05 22:01:02,352][INFO ][http                     ] [FiveAces.Coffee.Web_IN_1] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.241.182.18:9200]}
[2013-05-05 22:01:02,354][INFO ][node                     ] [FiveAces.Coffee.Web_IN_1] {0.90.0}[2164]: started
[2013-05-05 22:01:03,912][WARN ][discovery.zen            ] [FiveAces.Coffee.Web_IN_1] received cluster state from [[FiveAces.Coffee.Web_IN_0][CVJT6uiFR4OEAzzXyRL_yQ][inet[/10.241.238.26:9300]]] which is also master but with an older cluster_state, telling [[FiveAces.Coffee.Web_IN_0][CVJT6uiFR4OEAzzXyRL_yQ][inet[/10.241.238.26:9300]]] to rejoin the cluster
[2013-05-05 22:01:03,965][WARN ][discovery.zen            ] [FiveAces.Coffee.Web_IN_1] received cluster state from [[FiveAces.Coffee.Web_IN_0][CVJT6uiFR4OEAzzXyRL_yQ][inet[/10.241.238.26:9300]]] which is also master but with an older cluster_state, telling [[FiveAces.Coffee.Web_IN_0][CVJT6uiFR4OEAzzXyRL_yQ][inet[/10.241.238.26:9300]]] to rejoin the cluster
[2013-05-05 22:01:03,966][WARN ][discovery.zen            ] [FiveAces.Coffee.Web_IN_1] received cluster state from [[FiveAces.Coffee.Web_IN_0][CVJT6uiFR4OEAzzXyRL_yQ][inet[/10.241.238.26:9300]]] which is also master but with an older cluster_state, telling [[FiveAces.Coffee.Web_IN_0][CVJT6uiFR4OEAzzXyRL_yQ][inet[/10.241.238.26:9300]]] to rejoin the cluster
[2013-05-05 22:01:03,966][WARN ][discovery.zen            ] [FiveAces.Coffee.Web_IN_1] failed to send rejoin request to [[FiveAces.Coffee.Web_IN_0][CVJT6uiFR4OEAzzXyRL_yQ][inet[/10.241.238.26:9300]]]
org.elasticsearch.transport.SendRequestTransportException: [FiveAces.Coffee.Web_IN_0][inet[/10.241.238.26:9300]][discovery/zen/rejoin]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:199)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:171)
at org.elasticsearch.discovery.zen.ZenDiscovery$7.execute(ZenDiscovery.java:527)
at org.elasticsearch.cluster.service.InternalClusterService$2.run(InternalClusterService.java:229)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:95)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [FiveAces.Coffee.Web_IN_0][inet[/10.241.238.26:9300]] Node not connected
at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:788)
at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:522)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:184)
... 7 more
[2013-05-05 22:01:04,072][WARN ][discovery.zen            ] [FiveAces.Coffee.Web_IN_1] failed to send rejoin request to [[FiveAces.Coffee.Web_IN_0][CVJT6uiFR4OEAzzXyRL_yQ][inet[/10.241.238.26:9300]]]
org.elasticsearch.transport.SendRequestTransportException: [FiveAces.Coffee.Web_IN_0][inet[/10.241.238.26:9300]][discovery/zen/rejoin]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:199)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:171)
at org.elasticsearch.discovery.zen.ZenDiscovery$7.execute(ZenDiscovery.java:527)
at org.elasticsearch.cluster.service.InternalClusterService$2.run(InternalClusterService.java:229)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:95)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [FiveAces.Coffee.Web_IN_0][inet[/10.241.238.26:9300]] Node not connected
at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:788)
at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:522)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:184)
... 7 more

NODE2:

[2013-05-05 22:00:48,930][INFO ][node                     ] [FiveAces.Coffee.Web_IN_0] {0.90.0}[2640]: initializing ...
[2013-05-05 22:00:49,133][INFO ][plugins                  ] [FiveAces.Coffee.Web_IN_0] loaded [], sites [head]
[2013-05-05 22:00:54,125][INFO ][node                     ] [FiveAces.Coffee.Web_IN_0] {0.90.0}[2640]: initialized
[2013-05-05 22:00:54,127][INFO ][node                     ] [FiveAces.Coffee.Web_IN_0] {0.90.0}[2640]: starting ...
[2013-05-05 22:00:54,310][INFO ][transport                ] [FiveAces.Coffee.Web_IN_0] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.241.238.26:9300]}
[2013-05-05 22:00:57,389][INFO ][cluster.service          ] [FiveAces.Coffee.Web_IN_0] new_master [FiveAces.Coffee.Web_IN_0][CVJT6uiFR4OEAzzXyRL_yQ][inet[/10.241.238.26:9300]], reason: zen-disco-join (elected_as_master)
[2013-05-05 22:00:57,406][INFO ][discovery                ] [FiveAces.Coffee.Web_IN_0] FiveAces.Coffee.Web/CVJT6uiFR4OEAzzXyRL_yQ
[2013-05-05 22:00:58,618][INFO ][http                     ] [FiveAces.Coffee.Web_IN_0] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.241.238.26:9200]}
[2013-05-05 22:00:58,619][INFO ][node                     ] [FiveAces.Coffee.Web_IN_0] {0.90.0}[2640]: started
[2013-05-05 22:01:00,532][INFO ][gateway                  ] [FiveAces.Coffee.Web_IN_0] recovered [1] indices into cluster_state
[2013-05-05 22:01:00,794][INFO ][cluster.service          ] [FiveAces.Coffee.Web_IN_0] added {[FiveAces.Coffee.Web_IN_1][lH-lp_jsQ4WhwwFJO3B-kA][inet[/10.241.182.18:9300]],}, reason: zen-disco-receive(join from node[[FiveAces.Coffee.Web_IN_1][lH-lp_jsQ4WhwwFJO3B-kA][inet[/10.241.182.18:9300]]])

Could this be tied to the transport module using a bind range of 9300-9400 by default? I'm wondering if I should give it a narrow range of one or two ports that I can then open up between the two VMs (see the sketch below).
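Something like this is what I have in mind, a sketch only (assuming 0.90 honours transport.tcp.port for the bind/publish port and explicit ports in the unicast host list):

transport:
  tcp:
    port: 9300                                             # bind exactly one port instead of the default 9300-9400 range
discovery:
  zen:
    ping:
      unicast:
        hosts: ["10.241.238.26:9300", "10.241.182.18:9300"]   # explicit ports, no scanning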

The data loss that results from this is unacceptable; any suggestions on how to avoid the split-brain scenario (other than moving to 3 nodes) would be appreciated.
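A rough sketch of the settings usually mentioned in this context, discovery.zen.minimum_master_nodes plus more tolerant zen fault-detection timeouts, for a 2-node cluster (the trade-off being that neither node will act as master while the other is unreachable):

discovery:
  zen:
    minimum_master_nodes: 2          # both nodes must be visible before either will act as master
    fd:
      ping_timeout: 30s              # tolerate slower pings between the Azure VMs
      ping_retries: 6                # retry longer before declaring the master dead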

Regards,
N.
Re: Regular Split Brain Issue Running on 2 VMs in Windows Azure (Unicast Discovery) Imdad Ahmed 5/6/13 9:52 PM
We use the elasticsearch-zookeeper plugin to address the split-brain issue: https://github.com/sonian/elasticsearch-zookeeper
This helps avoid split brain because the master election process is externalised to the ZooKeeper service.
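Roughly, the wiring in elasticsearch.yml looks like the sketch below; the exact setting keys and the ZooKeeper address are from memory of the plugin's README, so treat them as assumptions and check the repo:

# Point the node at the ZooKeeper ensemble (zk-host:2181 is a placeholder)
sonian.elasticsearch.zookeeper:
  client.host: zk-host:2181

# Hand discovery, and therefore master election, over to the plugin
discovery:
  type: com.sonian.elasticsearch.zookeeper.discovery.ZooKeeperDiscoveryModule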