First, I apologize for the lengthy post. I am using Hazelcast 3.4.4 and testing a cluster made up of 2 webapp nodes... one running on my local windows box (10.10.222.239), and the other running on another box in a linux vm (10.10.222.145). The (spring) hazelcast config is setup as follows:
<hz:hazelcast id="instance">
<hz:config>
<hz:spring-aware />
<hz:group name="cluster1" password="cluster1" />
<hz:network port="5701" port-auto-increment="true">
<hz:join>
<hz:multicast enabled="false"/>
<hz:tcp-ip enabled="true">
<hz:members>10.10.222.239, 10.10.222.145</hz:members>
</hz:tcp-ip>
</hz:join>
</hz:network>
....
</hz:config>
</hz:hazelcast>
When I start up my windows webapp node (with linux node down), the logs output what I would expect to see. That is, only one node, and some attempts to connect to the linux node:
Members [1] {
Member [10.10.222.239]:5701 this
}
...
Jul 15, 2015 4:53:22 PM com.hazelcast.nio.tcp.SocketConnector
Jul 15, 2015 4:53:22 PM com.hazelcast.nio.tcp.SocketConnector
When I start up the linux node, everything seems to work fine. The two connect to each other as indicated by the windows node logs:
NFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Accepting socket connection from /10.10.222.145:52127 Jul 15, 2015 4:55:09 PM com.hazelcast.nio.tcp.TcpIpConnectionManager
Jul 15, 2015 4:55:16 PM com.hazelcast.cluster.ClusterService
INFO: [10.10.222.239]:5701 [cluster1] [3.4.4]
Members [2] {
Member [10.10.222.239]:5701 this
Member [10.10.222.145]:5701
}
Starting and stopping the linux node produces the expected behavior; it joins and leaves the cluster.
My problem arises when I start the linux node first, and the windows node afterwards; things don't behave the same way. The two don't seem to see each other after the windows node starts. The logs indicate that both are independent.
// linux
Members [1] {
Member [10.10.222.145]:5701 this
}
// windows
Members [1] {
Member [10.10.222.239]:5701 this
}
In the linux node logs there was an attempt to connect to the other but it failed, which is expected I guess since the windows one was down initially. But what is unexpected is that the windows one shows failed connection attempts to the linux node:
Jul 15, 2015 5:08:31 PM com.hazelcast.nio.tcp.SocketConnector
Jul 15, 2015 5:08:31 PM com.hazelcast.nio.tcp.SocketConnector
4 minutes later the windows node shows this:
Jul 15, 2015 5:12:37 PM com.hazelcast.nio.tcp.SocketAcceptor
INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Accepting socket connection from /10.10.222.145:52800 Jul 15, 2015 5:12:37 PM com.hazelcast.nio.tcp.TcpIpConnectionManager
Jul 15, 2015 5:13:13 PM com.hazelcast.cluster.impl.TcpIpJoiner
INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Address[10.10.222.239]:5701 is merging to Address[10.10.222.145]:5701, because : node.getThisAddress().hashCode() > joinRequest.address.hashCode() , this node member count: 1
Jul 15, 2015 5:13:13 PM com.hazelcast.cluster.impl.TcpIpJoiner
WARNING: [10.10.222.239]:5701 [cluster1] [3.4.4] Address[10.10.222.239]:5701 is merging [tcp/ip] to Address[10.10.222.145]:5701
Jul 15, 2015 5:13:13 PM com.hazelcast.cluster.impl.operations.PrepareMergeOperation
WARNING: [10.10.222.239]:5701 [cluster1] [3.4.4] Preparing to merge... Waiting for merge instruction...
Jul 15, 2015 5:13:13 PM com.hazelcast.cluster.impl.operations.MergeClustersOperation
WARNING: [10.10.222.239]:5701 [cluster1] [3.4.4] Address[10.10.222.239]:5701 is merging to Address[10.10.222.145]:5701, because: instructed by master Address[10.10.222.239]:5701
Jul 15, 2015 5:13:13 PM com.hazelcast.core.LifecycleService
INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Address[10.10.222.239]:5701 is MERGING
Jul 15, 2015 5:13:13 PM com.hazelcast.nio.tcp.TcpIpConnection
INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Connection [Address[10.10.222.145]:5701] lost. Reason: Socket explicitly closed
Jul 15, 2015 5:13:14 PM com.hazelcast.nio.tcp.SocketConnector
INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Connecting to /10.10.222.145:5701, timeout: 0, bind-any: true Jul 15, 2015 5:13:31 PM com.hazelcast.cluster.impl.TcpIpJoiner
INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Address[10.10.222.145]:5702 is added to the blacklist.
Jul 15, 2015 5:13:34 PM com.hazelcast.nio.tcp.SocketConnector
INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Connecting to /10.10.222.145:5702, timeout: 0, bind-any: true Jul 15, 2015 5:13:34 PM com.hazelcast.nio.tcp.SocketConnector
INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Connecting to /10.10.222.239:5702, timeout: 0, bind-any: true Jul 15, 2015 5:13:34 PM com.hazelcast.nio.tcp.SocketConnector
INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Connecting to /10.10.222.239:5703, timeout: 0, bind-any: true Jul 15, 2015 5:13:34 PM com.hazelcast.nio.tcp.SocketConnector
INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Connecting to /10.10.222.145:5703, timeout: 0, bind-any: true Jul 15, 2015 5:13:35 PM com.hazelcast.nio.tcp.SocketConnector
Jul 15, 2015 5:13:35 PM com.hazelcast.cluster.impl.TcpIpJoiner
INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Address[10.10.222.145]:5701 is added to the blacklist.
Jul 15, 2015 5:13:35 PM com.hazelcast.nio.tcp.SocketConnector
Jul 15, 2015 5:13:35 PM com.hazelcast.cluster.impl.TcpIpJoiner
INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Address[10.10.222.239]:5703 is added to the blacklist.
Jul 15, 2015 5:13:35 PM com.hazelcast.nio.tcp.SocketConnector
Jul 15, 2015 5:13:35 PM com.hazelcast.cluster.impl.TcpIpJoiner
INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Address[10.10.222.239]:5702 is added to the blacklist.
Jul 15, 2015 5:13:39 PM com.hazelcast.cluster.impl.TcpIpJoiner
INFO: [10.10.222.239]:5701 [cluster1] [3.4.4]
Members [1] {
Member [10.10.222.239]:5701 this
}
Jul 15, 2015 5:13:39 PM com.hazelcast.core.LifecycleService
INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Address[10.10.222.239]:5701 is MERGED
Jul 15, 2015 5:13:55 PM com.hazelcast.nio.tcp.SocketConnector
Jul 15, 2015 5:13:55 PM com.hazelcast.nio.tcp.SocketConnector
It appears there was an attempt to merge the cluster nodes, but it failed. I don't quite understand why things behave differently depending on who starts up first. Is there something misconfigured or some behaviour that is implied that I am unaware of?