Hazelcast cannot reconnect to cluster

ivenhov

Oct 15, 2013, 4:03:31 PM
to haze...@googlegroups.com
Hi 

I have a 2-node cluster running Hazelcast 2.6.3.
When the nodes boot everything is OK, but when I simulate the NIC going down with
sudo ifdown eth1
and bring it up again with
sudo ifup eth1
the node never rejoins the other one.


The disconnection is visible instantly, but I never see any membership change once the NIC is back up. The same thing happens when I just unplug the Ethernet cable and plug it back in.



Oct 15, 2013 4:48:57 PM com.hazelcast.cluster.ClusterManager
WARNING: [10.173.240.1]:7008 [matrix] Added Address[10.173.240.2]:7008 to list of dead addresses because of timeout since last read
Oct 15, 2013 4:48:57 PM com.hazelcast.cluster.ClusterManager
INFO: [10.173.240.1]:7008 [matrix] Removing Address Address[10.173.240.2]:7008
Oct 15, 2013 4:48:57 PM com.hazelcast.nio.Connection
INFO: [10.173.240.1]:7008 [matrix] Connection [Address[10.173.240.2]:7008] lost. Reason: Explicit close
Oct 15, 2013 4:48:57 PM com.hazelcast.impl.PartitionManager
INFO: [10.173.240.1]:7008 [matrix] Starting to send partition replica diffs...
Oct 15, 2013 4:48:57 PM com.hazelcast.cluster.ClusterManager
INFO: [10.173.240.1]:7008 [matrix]

Members [1] {
        Member [10.173.240.1]:7008 this
}

Oct 15, 2013 4:49:01 PM com.hazelcast.impl.PartitionManager
INFO: [10.173.240.1]:7008 [matrix] Remaining migration tasks in queue => Immediate-Tasks: 2, Scheduled-Tasks: 0
Oct 15, 2013 4:49:01 PM com.hazelcast.impl.PartitionManager
INFO: [10.173.240.1]:7008 [matrix] Total 0 partition replica diffs have been processed.
Oct 15, 2013 4:49:01 PM com.hazelcast.impl.PartitionManager
INFO: [10.173.240.1]:7008 [matrix] Re-partitioning cluster data... Immediate-Tasks: 0, Scheduled-Tasks: 0
Oct 15, 2013 4:49:05 PM com.hazelcast.impl.MulticastService
WARNING: [10.173.240.1]:7008 [matrix] You probably have too long Hazelcast configuration!
java.io.IOException: Network is unreachable
        at java.net.PlainDatagramSocketImpl.send(Native Method)
        at java.net.DatagramSocket.send(DatagramSocket.java:676)
        at com.hazelcast.impl.MulticastService.send(MulticastService.java:173)
        at com.hazelcast.impl.MulticastJoiner.searchForOtherClusters(MulticastJoiner.java:112)
        at com.hazelcast.impl.SplitBrainHandler.searchForOtherClusters(SplitBrainHandler.java:57)
        at com.hazelcast.impl.SplitBrainHandler.access$000(SplitBrainHandler.java:23)
        at com.hazelcast.impl.SplitBrainHandler$1.doRun(SplitBrainHandler.java:44)
        at com.hazelcast.impl.FallThroughRunnable.run(FallThroughRunnable.java:22)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:722)
        at com.hazelcast.impl.ExecutorThreadFactory$1.run(ExecutorThreadFactory.java:45)

My config 
    <properties>
        <property name="hazelcast.socket.bind.any">false</property>
        <property name="hazelcast.prefer.ipv4.stack">true</property>
        <property name="hazelcast.version.check.enabled">false</property>
        <property name="hazelcast.initial.min.cluster.size">2</property>
        <property name="hazelcast.max.no.heartbeat.seconds">10</property>
        <property name="hazelcast.wait.seconds.before.join">1</property>
        <property name="hazelcast.max.wait.seconds.before.join">5</property>
        <property name="hazelcast.merge.first.run.delay.seconds">15</property>
        <property name="hazelcast.merge.next.run.delay.seconds">10</property>
        <property name="hazelcast.icmp.enabled">true</property>
        <property name="hazelcast.jmx">true</property>
        <property name="hazelcast.jmx.detailed">true</property>
    </properties>

I added hazelcast.icmp.enabled after reading on this list that it helps with recovery.
I appreciate this is not the default config and these settings may be quite extreme, but what I want to achieve is a quick rejoin when the node comes back.

Any help appreciated.
Daniel

ivenhov

Oct 15, 2013, 5:00:47 PM
to haze...@googlegroups.com
My network configuration
    <network>
        <port auto-increment="true">7008</port>
        <join>
            <multicast enabled="false">
                <multicast-group>227.227.227.225</multicast-group>
                <multicast-port>7007</multicast-port>
            </multicast>
            <tcp-ip enabled="true">
                <interface>10.173.240.*</interface>
            </tcp-ip>
        </join>
        <interfaces enabled="true">
            <interface>10.173.240.*</interface>
        </interfaces>
    </network>

After bringing the NIC up I see on that node:

Oct 15, 2013 8:54:26 PM com.hazelcast.impl.MulticastJoiner
FINEST: [10.173.240.1]:7008 [matrix] Address[10.173.240.2]:7008 should merge to this node , because : node.getThisAddress().hashCode() < joinInfo.address.hashCode() , this node member count: 1
Oct 15, 2013 8:54:34 PM com.hazelcast.impl.SplitBrainHandler
FINEST: [10.173.240.1]:7008 [matrix] Searching for other clusters.
Oct 15, 2013 8:54:36 PM com.hazelcast.impl.MulticastJoiner
FINEST: [10.173.240.1]:7008 [matrix] Address[10.173.240.2]:7008 should merge to this node , because : node.getThisAddress().hashCode() < joinInfo.address.hashCode() , this node member count: 1
Oct 15, 2013 8:54:44 PM com.hazelcast.impl.SplitBrainHandler
FINEST: [10.173.240.1]:7008 [matrix] Searching for other clusters.
Oct 15, 2013 8:54:46 PM com.hazelcast.impl.MulticastJoiner
FINEST: [10.173.240.1]:7008 [matrix] Address[10.173.240.2]:7008 should merge to this node , because : node.getThisAddress().hashCode() < joinInfo.address.hashCode() , this node member count: 1


The second node just keeps printing:
Oct 15, 2013 8:53:55 PM com.hazelcast.impl.PartitionManager
INFO: [10.173.240.2]:7008 [matrix] Total 0 partition replica diffs have been processed.
Oct 15, 2013 8:53:55 PM com.hazelcast.impl.PartitionManager
INFO: [10.173.240.2]:7008 [matrix] Re-partitioning cluster data... Immediate-Tasks: 0, Scheduled-Tasks: 0
Oct 15, 2013 8:54:04 PM com.hazelcast.impl.SplitBrainHandler
FINEST: [10.173.240.2]:7008 [matrix] Searching for other clusters.
Oct 15, 2013 8:54:14 PM com.hazelcast.impl.SplitBrainHandler
FINEST: [10.173.240.2]:7008 [matrix] Searching for other clusters.
Oct 15, 2013 8:54:24 PM com.hazelcast.impl.SplitBrainHandler
FINEST: [10.173.240.2]:7008 [matrix] Searching for other clusters.
Oct 15, 2013 8:54:34 PM com.hazelcast.impl.SplitBrainHandler
FINEST: [10.173.240.2]:7008 [matrix] Searching for other clusters.

ivenhov

Oct 16, 2013, 6:50:28 AM
to haze...@googlegroups.com
I tried restarting Hazelcast on one node using the JMX restart() operation, but the node still failed to join.
After restarting the whole JVM the node again failed to join, but it joined successfully after a second JVM restart.

This sounds awfully like:

Has this been fixed or is it still a problem?  This is a serious issue for us. Temporary network outages, dropped messages happen...

D.

ivenhov

Oct 17, 2013, 6:07:48 AM
to haze...@googlegroups.com
I added a special thread in my app to restart Hazelcast with

_instance.getLifecycleService().restart();

just to work around the node not being connected to the rest of the cluster, but that still does not help. The node sits on its own without joining.
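
For illustration, a watchdog thread of this kind could look like the sketch below. It is not the code from the app described above. It assumes the member count is available through getCluster().getMembers().size() and that the restart is done through the getLifecycleService().restart() call shown above; both are passed in as plain callbacks so the class itself has no Hazelcast dependency. The names ClusterWatchdog, expectedMembers and maxFailures are hypothetical, and the lambdas need Java 8.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;

/** Triggers a restart once the member count has been too low for several consecutive checks. */
final class ClusterWatchdog {
    private final IntSupplier memberCount; // assumed: () -> _instance.getCluster().getMembers().size()
    private final Runnable restart;        // assumed: () -> _instance.getLifecycleService().restart()
    private final int expectedMembers;
    private final int maxFailures;
    private int failures = 0;

    ClusterWatchdog(IntSupplier memberCount, Runnable restart,
                    int expectedMembers, int maxFailures) {
        this.memberCount = memberCount;
        this.restart = restart;
        this.expectedMembers = expectedMembers;
        this.maxFailures = maxFailures;
    }

    /** Performs one check; returns true if a restart was triggered. */
    synchronized boolean check() {
        if (memberCount.getAsInt() >= expectedMembers) {
            failures = 0;
            return false;
        }
        failures++;
        if (failures < maxFailures) {
            return false;
        }
        failures = 0;
        restart.run();
        return true;
    }

    /** Runs check() periodically on a single background thread. */
    ScheduledExecutorService start(long periodSeconds) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                check();
            } catch (RuntimeException e) {
                // An uncaught exception would silently cancel all further runs.
                e.printStackTrace();
            }
        }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return scheduler;
    }
}
```

Wired up, this would be something like new ClusterWatchdog(() -> _instance.getCluster().getMembers().size(), () -> _instance.getLifecycleService().restart(), 2, 3).start(10); the numbers are arbitrary.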

Any help would be appreciated.
I can post more details if needed.
I'm using Linux 3.5.0-34-generic #55~precise1-Ubuntu.

Regards
D.

Fuad Malikov

Oct 17, 2013, 12:18:29 PM
to Hazelcast
Hi,

Can you play with the following parameters: hazelcast.merge.first.run.delay.seconds and hazelcast.merge.next.run.delay.seconds. 

With the default configuration, the split-brain merge starts after 5 minutes. If it still doesn't work, can you send us the logs?
 


Fuad Malikov
Co-founder & Managing Partner
Hazelcast | Open source in-memory data grid
575 Middlefield Rd, Palo Alto, CA 94301



ivenhov

Oct 17, 2013, 12:58:59 PM
to haze...@googlegroups.com
Actually, my settings were
        <property name="hazelcast.merge.first.run.delay.seconds">15</property>
        <property name="hazelcast.merge.next.run.delay.seconds">10</property>

As you can see in my first post.
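
To make the expected timing concrete: assuming the merge check first runs after hazelcast.merge.first.run.delay.seconds and then repeats every hazelcast.merge.next.run.delay.seconds (a reading of the property names, not verified against the Hazelcast source), the schedule for these values can be computed as in the sketch below. MergeSchedule and attemptTimes are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class MergeSchedule {
    /** Seconds after startup at which the first `count` merge checks would run (assumed model). */
    static List<Integer> attemptTimes(int firstDelaySeconds, int nextDelaySeconds, int count) {
        List<Integer> times = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            times.add(firstDelaySeconds + i * nextDelaySeconds);
        }
        return times;
    }
}
```

MergeSchedule.attemptTimes(15, 10, 4) gives checks at 15, 25, 35 and 45 seconds, which is consistent with the 10-second spacing of the "Searching for other clusters" lines in the logs posted earlier.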
I find multicast in Hazelcast completely unreliable. Nodes cannot discover other nodes unless the JVM is restarted, and even then it does not always happen.

I don't have the logs any more; I have tested plenty of different configurations since then and cannot extract the relevant parts now.
It should be easily reproducible with the information I provided.

D.

Mustafa Sancar Koyunlu

Oct 24, 2013, 4:38:35 AM
to haze...@googlegroups.com
Hi Daniel,
I investigated your problem with 2.6.3. Since I am on a Mac, I simulated the network failure with these commands:
sudo ifconfig en0 down

sudo ifconfig en0 up 

In my environment, the nodes can find each other.

The exception "java.io.IOException: Network is unreachable" means that the network is still not up.

With my commands I did not get that exception at first. After I closed some other interfaces and then opened just en0, I managed to get the same exception. But the nodes still joined successfully.

Are you sure your way of simulating the failure is correct?  
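
One way to double-check the interface state from inside the JVM, rather than from the shell, is the standard java.net.NetworkInterface API. The sketch below lists every interface with whether it is up, whether it supports multicast, and which addresses are bound to it. The class name NicReport is hypothetical, and this only reports what the JVM sees; it says nothing about what Hazelcast itself does.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

final class NicReport {
    /** Returns one line per network interface: name, up state, multicast support, addresses. */
    static List<String> describe() throws SocketException {
        List<String> lines = new ArrayList<>();
        Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
        if (nics == null) {
            return lines; // some JVMs return null when no interface is found
        }
        for (NetworkInterface nic : Collections.list(nics)) {
            StringBuilder sb = new StringBuilder(nic.getName())
                    .append(" up=").append(nic.isUp())
                    .append(" multicast=").append(nic.supportsMulticast())
                    .append(" addresses=");
            for (InetAddress address : Collections.list(nic.getInetAddresses())) {
                sb.append(address.getHostAddress()).append(' ');
            }
            lines.add(sb.toString().trim());
        }
        return lines;
    }
}
```

Printing NicReport.describe() on each node before and after the ifdown/ifup cycle would show whether eth1 is reported as up and multicast-capable again, and whether the 10.173.240.x address is bound to it.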

A second thing: even though your configuration says that multicast is disabled, the Hazelcast logs say that the nodes are using the multicast joiner. That is not consistent. You may want to check that too.


ivenhov

Oct 24, 2013, 9:41:47 AM
to haze...@googlegroups.com
Hi Mustafa

Thanks for getting back to me.
I was doing those tests on an Ubuntu machine, so a Mac may behave slightly differently.
In my case the eth1 interface had 2 IPs configured, one of which was 10.173.240.xx and the other on my external network.

I'm sure the NIC was up properly, as I could ssh from one machine to the other on that 10.173.240.xx interface.
It looked like multicast messages were dropped or lost and the node could not join because it did not retry.

The configuration says multicast is disabled because I had started to investigate tcp-ip discovery as a workaround for those problems.
That proved to be more reliable, although I still have to restart Hazelcast.

Regards
Daniel

Mustafa Sancar Koyunlu

Oct 25, 2013, 3:01:29 AM
to haze...@googlegroups.com
I tried to reproduce the problem on Ubuntu 12.04 with kernel 3.8.
I got the exact same exception while the NIC was down. But after I brought eth1 back up with
"sudo ifup eth1", the nodes reconnected successfully. I tried it with multicast.

I am sending the logs of the two nodes so that you can see the behaviour.

node1.log
node2.log

ivenhov

Oct 25, 2013, 6:02:29 AM
to haze...@googlegroups.com
Hi 
Thanks for sharing the logs. This is exactly what I would expect to see on my system.
Could you post your Hazelcast configuration?
Are multicast messages sent periodically? How often?
I'm wondering whether the sender thread could crash or stop (due to the NIC not being available, etc.), which would prevent the nodes from discovering each other.

D.

M. Sancar Koyunlu

Oct 25, 2013, 8:18:18 AM
to haze...@googlegroups.com
I just set the following:
hazelcast.max.wait.seconds.before.join  5 
hazelcast.merge.first.run.delay.seconds 10
The others are default. But I saw that you have already played with those.
The "Searching for other clusters" lines in the second node's logs show that the responsible thread is not dead.