Cluster issues with two nodes

1,109 views
Skip to first unread message

JS

unread,
Jul 16, 2015, 1:01:21 PM7/16/15
to haze...@googlegroups.com
First, I apologize for the lengthy post. I am using Hazelcast 3.4.4 and testing a cluster made up of 2 webapp nodes... one running on my local windows box (10.10.222.239), and the other running on another box in a linux vm (10.10.222.145). The (spring) hazelcast config is setup as follows:

    <hz:hazelcast id="instance">
       <hz:config>
         <hz:spring-aware />
         <hz:group name="cluster1" password="cluster1" />
         <hz:network port="5701" port-auto-increment="true">
           <hz:join>
             <hz:multicast enabled="false"/>
             <hz:tcp-ip enabled="true">
             <hz:members>10.10.222.239, 10.10.222.145</hz:members>
           </hz:tcp-ip>
           </hz:join>
        </hz:network>
            ....
      </hz:config>
    </hz:hazelcast>


When I start up my windows webapp node (with linux node down), the logs output what I would expect to see. That is, only one node, and some attempts to connect to the linux node:

    Members [1] {
    Member [10.10.222.239]:5701 this
    }
    ...
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Could not connect to: /10.10.222.145:5702. Reason: SocketException[Connection timed out: connect to address /10.10.222.145:5702]
    Jul 15, 2015 4:53:22 PM com.hazelcast.nio.tcp.SocketConnector
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Could not connect to: /10.10.222.145:5701. Reason: SocketException[Connection timed out: connect to address /10.10.222.145:5701]
    Jul 15, 2015 4:53:22 PM com.hazelcast.nio.tcp.SocketConnector
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Could not connect to: /10.10.222.145:5703. Reason: SocketException[Connection timed out: connect to address /10.10.222.145:5703]

When I start up the linux node, everything seems to work fine. The two connect to each other as indicated by the windows node logs:

    NFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Accepting socket connection from /10.10.222.145:52127
    Jul 15, 2015 4:55:09 PM com.hazelcast.nio.tcp.TcpIpConnectionManager
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Established socket connection between /10.10.222.239:5701 and /10.10.222.145:52127
    Jul 15, 2015 4:55:16 PM com.hazelcast.cluster.ClusterService
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] 
    
    Members [2] {
    Member [10.10.222.239]:5701 this
    Member [10.10.222.145]:5701
    }

Starting and stopping the linux node produces the expected behavior; it joins and leaves the cluster.

My problem arises when I start the linux node first, and the windows node afterwards; things don't behave the same way. The two don't seem to see each other after the windows node starts. The logs indicate that both are independent.

    // linux 
    Members [1] {
            Member [10.10.222.145]:5701 this
    }
    // windows
    Members [1] {
    Member [10.10.222.239]:5701 this
    }

In the linux node logs there was an attempt to connect to the other but it failed, which is expected I guess since the windows one was down initially. But what is unexpected is that the windows one shows failed connection attempts to the linux node:

    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Could not connect to: /10.10.222.145:5701. Reason: SocketException[Connection timed out: connect to address /10.10.222.145:5701]
    Jul 15, 2015 5:08:31 PM com.hazelcast.nio.tcp.SocketConnector
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Could not connect to: /10.10.222.145:5702. Reason: SocketException[Connection timed out: connect to address /10.10.222.145:5702]
    Jul 15, 2015 5:08:31 PM com.hazelcast.nio.tcp.SocketConnector
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Could not connect to: /10.10.222.145:5703. Reason: SocketException[Connection timed out: connect to address /10.10.222.145:5703]

4 minutes later the windows node shows this:

    Jul 15, 2015 5:12:37 PM com.hazelcast.nio.tcp.SocketAcceptor
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Accepting socket connection from /10.10.222.145:52800
    Jul 15, 2015 5:12:37 PM com.hazelcast.nio.tcp.TcpIpConnectionManager
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Established socket connection between /10.10.222.239:5701 and /10.10.222.145:52800
    Jul 15, 2015 5:13:13 PM com.hazelcast.cluster.impl.TcpIpJoiner
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Address[10.10.222.239]:5701 is merging to Address[10.10.222.145]:5701, because : node.getThisAddress().hashCode() > joinRequest.address.hashCode() , this node member count: 1
    Jul 15, 2015 5:13:13 PM com.hazelcast.cluster.impl.TcpIpJoiner
    WARNING: [10.10.222.239]:5701 [cluster1] [3.4.4] Address[10.10.222.239]:5701 is merging [tcp/ip] to Address[10.10.222.145]:5701
    Jul 15, 2015 5:13:13 PM com.hazelcast.cluster.impl.operations.PrepareMergeOperation
    WARNING: [10.10.222.239]:5701 [cluster1] [3.4.4] Preparing to merge... Waiting for merge instruction...
    Jul 15, 2015 5:13:13 PM com.hazelcast.cluster.impl.operations.MergeClustersOperation
    WARNING: [10.10.222.239]:5701 [cluster1] [3.4.4] Address[10.10.222.239]:5701 is merging to Address[10.10.222.145]:5701, because: instructed by master Address[10.10.222.239]:5701
    Jul 15, 2015 5:13:13 PM com.hazelcast.core.LifecycleService
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Address[10.10.222.239]:5701 is MERGING
    Jul 15, 2015 5:13:13 PM com.hazelcast.nio.tcp.TcpIpConnection
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Connection [Address[10.10.222.145]:5701] lost. Reason: Socket explicitly closed
    Jul 15, 2015 5:13:14 PM com.hazelcast.nio.tcp.SocketConnector
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Connecting to /10.10.222.145:5701, timeout: 0, bind-any: true
    Jul 15, 2015 5:13:31 PM com.hazelcast.cluster.impl.TcpIpJoiner
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Address[10.10.222.145]:5702 is added to the blacklist.
    Jul 15, 2015 5:13:34 PM com.hazelcast.nio.tcp.SocketConnector
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Connecting to /10.10.222.145:5702, timeout: 0, bind-any: true
    Jul 15, 2015 5:13:34 PM com.hazelcast.nio.tcp.SocketConnector
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Connecting to /10.10.222.239:5702, timeout: 0, bind-any: true
    Jul 15, 2015 5:13:34 PM com.hazelcast.nio.tcp.SocketConnector
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Connecting to /10.10.222.239:5703, timeout: 0, bind-any: true
    Jul 15, 2015 5:13:34 PM com.hazelcast.nio.tcp.SocketConnector
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Connecting to /10.10.222.145:5703, timeout: 0, bind-any: true
    Jul 15, 2015 5:13:35 PM com.hazelcast.nio.tcp.SocketConnector
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Could not connect to: /10.10.222.145:5701. Reason: SocketException[Connection timed out: connect to address /10.10.222.145:5701]
    Jul 15, 2015 5:13:35 PM com.hazelcast.cluster.impl.TcpIpJoiner
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Address[10.10.222.145]:5701 is added to the blacklist.
    Jul 15, 2015 5:13:35 PM com.hazelcast.nio.tcp.SocketConnector
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Could not connect to: /10.10.222.239:5703. Reason: SocketException[Connection refused: connect to address /10.10.222.239:5703]
    Jul 15, 2015 5:13:35 PM com.hazelcast.cluster.impl.TcpIpJoiner
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Address[10.10.222.239]:5703 is added to the blacklist.
    Jul 15, 2015 5:13:35 PM com.hazelcast.nio.tcp.SocketConnector
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Could not connect to: /10.10.222.239:5702. Reason: SocketException[Connection refused: connect to address /10.10.222.239:5702]
    Jul 15, 2015 5:13:35 PM com.hazelcast.cluster.impl.TcpIpJoiner
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Address[10.10.222.239]:5702 is added to the blacklist.
    Jul 15, 2015 5:13:39 PM com.hazelcast.cluster.impl.TcpIpJoiner
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] 
    
    
    Members [1] {
    Member [10.10.222.239]:5701 this
    }
    
    Jul 15, 2015 5:13:39 PM com.hazelcast.core.LifecycleService
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Address[10.10.222.239]:5701 is MERGED
    Jul 15, 2015 5:13:55 PM com.hazelcast.nio.tcp.SocketConnector
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Could not connect to: /10.10.222.145:5702. Reason: SocketException[Connection timed out: connect to address /10.10.222.145:5702]
    Jul 15, 2015 5:13:55 PM com.hazelcast.nio.tcp.SocketConnector
    INFO: [10.10.222.239]:5701 [cluster1] [3.4.4] Could not connect to: /10.10.222.145:5703. Reason: SocketException[Connection timed out: connect to address /10.10.222.145:5703]

It appears there was an attempt to merge the cluster nodes, but it failed. I don't quite understand why things behave differently depending on who starts up first. Is there something misconfigured or some behaviour that is implied that I am unaware of?

Peter Veentjer

unread,
Jul 16, 2015, 5:19:59 PM7/16/15
to haze...@googlegroups.com
I sounds like a connectivity issue. I would suggest disabling firewall to verify that is not the cause. I would also not use windows in combination with linux. Most of the testing is done on Linux.

--
You received this message because you are subscribed to the Google Groups "Hazelcast" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hazelcast+...@googlegroups.com.
To post to this group, send email to haze...@googlegroups.com.
Visit this group at http://groups.google.com/group/hazelcast.
To view this discussion on the web visit https://groups.google.com/d/msgid/hazelcast/68865cee-a9d2-410f-95ca-8cd9da57329a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

JS

unread,
Jul 17, 2015, 4:24:44 AM7/17/15
to haze...@googlegroups.com, jste...@thinkingphones.com
Yes, turned out to be just that. I disabled the iptables service on both nodes and things began to work fine.

Peter Veentjer

unread,
Jul 17, 2015, 5:42:04 AM7/17/15
to haze...@googlegroups.com, jste...@thinkingphones.com
If I would get a million for every time someone has run into firewall issues, I would be millionaire for sure.

On Fri, Jul 17, 2015 at 4:48 AM, JS <jste...@thinkingphones.com> wrote:
Yes, turned out to be just that. I disabled the iptables service on both nodes and things began to work fine.

--
You received this message because you are subscribed to the Google Groups "Hazelcast" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hazelcast+...@googlegroups.com.
To post to this group, send email to haze...@googlegroups.com.
Visit this group at http://groups.google.com/group/hazelcast.

ecel...@gmail.com

unread,
Aug 3, 2015, 10:49:05 AM8/3/15
to Hazelcast, jste...@thinkingphones.com
Hi JS,

May the white space between the declaration of ip members could be the problem:
<hz:members>10.10.222.239, 10.10.222.145</hz:members>

Remove this unnecessary space and try again :)

Follows the .xsd Hazelcast doc:
"Comma separated IP addresses of one or more well known members."
Reply all
Reply to author
Forward
0 new messages