Hazelcast cannot reconnect to cluster

ivenhov

Oct 15, 2013, 4:03:31 PM

to haze...@googlegroups.com

Hi

I have a 2-node cluster running Hazelcast 2.6.3.

When the nodes boot everything is OK, but when I simulate the NIC going down

sudo ifdown eth1

and up again

sudo ifup eth1

the node never rejoins the other one.

The disconnection is visible instantly, but I never see membership changes once the NIC is back up. The same thing happens when I unplug the Ethernet cable and plug it back in.

Oct 15, 2013 4:48:57 PM com.hazelcast.cluster.ClusterManager

WARNING: [10.173.240.1]:7008 [matrix] Added Address[10.173.240.2]:7008 to list of dead addresses because of timeout since last read

Oct 15, 2013 4:48:57 PM com.hazelcast.cluster.ClusterManager

INFO: [10.173.240.1]:7008 [matrix] Removing Address Address[10.173.240.2]:7008

Oct 15, 2013 4:48:57 PM com.hazelcast.nio.Connection

INFO: [10.173.240.1]:7008 [matrix] Connection [Address[10.173.240.2]:7008] lost. Reason: Explicit close

Oct 15, 2013 4:48:57 PM com.hazelcast.impl.PartitionManager

INFO: [10.173.240.1]:7008 [matrix] Starting to send partition replica diffs...

Oct 15, 2013 4:48:57 PM com.hazelcast.cluster.ClusterManager

INFO: [10.173.240.1]:7008 [matrix]

Members [1] {

Member [10.173.240.1]:7008 this

}

Oct 15, 2013 4:49:01 PM com.hazelcast.impl.PartitionManager

INFO: [10.173.240.1]:7008 [matrix] Remaining migration tasks in queue => Immediate-Tasks: 2, Scheduled-Tasks: 0

Oct 15, 2013 4:49:01 PM com.hazelcast.impl.PartitionManager

INFO: [10.173.240.1]:7008 [matrix] Total 0 partition replica diffs have been processed.

Oct 15, 2013 4:49:01 PM com.hazelcast.impl.PartitionManager

INFO: [10.173.240.1]:7008 [matrix] Re-partitioning cluster data... Immediate-Tasks: 0, Scheduled-Tasks: 0

Oct 15, 2013 4:49:05 PM com.hazelcast.impl.MulticastService

WARNING: [10.173.240.1]:7008 [matrix] You probably have too long Hazelcast configuration!

java.io.IOException: Network is unreachable

at java.net.PlainDatagramSocketImpl.send(Native Method)

at java.net.DatagramSocket.send(DatagramSocket.java:676)

at com.hazelcast.impl.MulticastService.send(MulticastService.java:173)

at com.hazelcast.impl.MulticastJoiner.searchForOtherClusters(MulticastJoiner.java:112)

at com.hazelcast.impl.SplitBrainHandler.searchForOtherClusters(SplitBrainHandler.java:57)

at com.hazelcast.impl.SplitBrainHandler.access$000(SplitBrainHandler.java:23)

at com.hazelcast.impl.SplitBrainHandler$1.doRun(SplitBrainHandler.java:44)

at com.hazelcast.impl.FallThroughRunnable.run(FallThroughRunnable.java:22)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:722)

at com.hazelcast.impl.ExecutorThreadFactory$1.run(ExecutorThreadFactory.java:45)

My config

<property name="hazelcast.socket.bind.any">false</property>

<property name="hazelcast.version.check.enabled">false</property>

</properties>

I've added hazelcast.icmp.enabled after reading on this list that it helps with recovery.

I appreciate it's not the default config, and maybe those settings are quite extreme, but what I wanted to achieve is a quick rejoin when the node comes back.

Any help appreciated

Daniel

ivenhov

Oct 15, 2013, 5:00:47 PM

to haze...@googlegroups.com

My network configuration

<join>

<multicast-group>227.227.227.225</multicast-group>

<multicast-port>7007</multicast-port>

</multicast>

<tcp-ip enabled="true">

</tcp-ip>

</join>

</interfaces>

</network>

After bringing the NIC up I see on that node:

Oct 15, 2013 8:54:26 PM com.hazelcast.impl.MulticastJoiner

FINEST: [10.173.240.1]:7008 [matrix] Address[10.173.240.2]:7008 should merge to this node , because : node.getThisAddress().hashCode() < joinInfo.address.hashCode() , this node member count: 1

Oct 15, 2013 8:54:34 PM com.hazelcast.impl.SplitBrainHandler

FINEST: [10.173.240.1]:7008 [matrix] Searching for other clusters.

Oct 15, 2013 8:54:36 PM com.hazelcast.impl.MulticastJoiner

FINEST: [10.173.240.1]:7008 [matrix] Address[10.173.240.2]:7008 should merge to this node , because : node.getThisAddress().hashCode() < joinInfo.address.hashCode() , this node member count: 1

Oct 15, 2013 8:54:44 PM com.hazelcast.impl.SplitBrainHandler

FINEST: [10.173.240.1]:7008 [matrix] Searching for other clusters.

Oct 15, 2013 8:54:46 PM com.hazelcast.impl.MulticastJoiner

FINEST: [10.173.240.1]:7008 [matrix] Address[10.173.240.2]:7008 should merge to this node , because : node.getThisAddress().hashCode() < joinInfo.address.hashCode() , this node member count: 1

The second node just keeps printing:

Oct 15, 2013 8:53:55 PM com.hazelcast.impl.PartitionManager

INFO: [10.173.240.2]:7008 [matrix] Total 0 partition replica diffs have been processed.

Oct 15, 2013 8:53:55 PM com.hazelcast.impl.PartitionManager

INFO: [10.173.240.2]:7008 [matrix] Re-partitioning cluster data... Immediate-Tasks: 0, Scheduled-Tasks: 0

Oct 15, 2013 8:54:04 PM com.hazelcast.impl.SplitBrainHandler

FINEST: [10.173.240.2]:7008 [matrix] Searching for other clusters.

Oct 15, 2013 8:54:14 PM com.hazelcast.impl.SplitBrainHandler

FINEST: [10.173.240.2]:7008 [matrix] Searching for other clusters.

Oct 15, 2013 8:54:24 PM com.hazelcast.impl.SplitBrainHandler

FINEST: [10.173.240.2]:7008 [matrix] Searching for other clusters.

Oct 15, 2013 8:54:34 PM com.hazelcast.impl.SplitBrainHandler

FINEST: [10.173.240.2]:7008 [matrix] Searching for other clusters.

ivenhov

Oct 16, 2013, 6:50:28 AM

to haze...@googlegroups.com

I tried restarting Hazelcast on one node using the JMX restart() operation, but the node still failed to join.

After restarting the whole JVM, the node again failed to join, but it joined successfully after a second JVM restart.

This sounds awfully like:

https://code.google.com/p/hazelcast/issues/detail?id=644

Has this been fixed, or is it still a problem? This is a serious issue for us: temporary network outages and dropped messages do happen.

D.

ivenhov

Oct 17, 2013, 6:07:48 AM

to haze...@googlegroups.com

I added a special thread in my app to restart Hazelcast with

_instance.getLifecycleService().restart();

This was just to work around the node not being connected to the rest of the cluster, but it still does not help. The node sits on its own, not joining.
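For readers following along, the kind of restart thread described above might look roughly like the sketch below. It is only an illustration, not the poster's actual code: the class name `ClusterWatchdog`, the member-count supplier, and the "restart after N consecutive lonely checks" rule are all assumptions. It is kept free of Hazelcast types so it can be tried in isolation (Java 8 or later).

```java
import java.util.function.IntSupplier;

/**
 * Hypothetical watchdog: triggers a restart action when the observed member
 * count stays below the expected size for several consecutive checks.
 */
final class ClusterWatchdog {
    private final IntSupplier memberCount;   // assumed: returns the current cluster size
    private final Runnable restartAction;    // assumed: restarts the local instance
    private final int expectedMembers;
    private final int maxLonelyChecks;
    private int lonelyChecks = 0;

    public ClusterWatchdog(IntSupplier memberCount, Runnable restartAction,
                           int expectedMembers, int maxLonelyChecks) {
        if (expectedMembers < 1 || maxLonelyChecks < 1) {
            throw new IllegalArgumentException("thresholds must be positive");
        }
        this.memberCount = memberCount;
        this.restartAction = restartAction;
        this.expectedMembers = expectedMembers;
        this.maxLonelyChecks = maxLonelyChecks;
    }

    /** Runs one check; returns true if the restart action was invoked. */
    public synchronized boolean checkOnce() {
        if (memberCount.getAsInt() >= expectedMembers) {
            lonelyChecks = 0;                // cluster looks complete, reset the counter
            return false;
        }
        lonelyChecks++;
        if (lonelyChecks < maxLonelyChecks) {
            return false;                    // tolerate short outages before restarting
        }
        lonelyChecks = 0;                    // start counting again after the restart
        restartAction.run();
        return true;
    }
}
```

With a Hazelcast 2.x instance this could be wired as `new ClusterWatchdog(() -> _instance.getCluster().getMembers().size(), () -> _instance.getLifecycleService().restart(), 2, 3)` and `checkOnce()` called from a scheduled executor every few seconds. As the post says, a restart on its own did not make the node rejoin here, so this only automates the attempt.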

Any help would be appreciated.

I can post more details if needed

I'm using Linux 3.5.0-34-generic #55~precise1-Ubuntu

Regards

D.

Fuad Malikov

Oct 17, 2013, 12:18:29 PM

to Hazelcast

Hi,

Can you play with the following parameters: hazelcast.merge.first.run.delay.seconds and hazelcast.merge.next.run.delay.seconds.

With the default configuration, split-brain merge starts after 5 minutes. If it still doesn't work, can you send us the logs?
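As a minimal sketch of the suggestion above: as far as I know, Hazelcast 2.x also reads `hazelcast.*` properties from JVM system properties when they are not set in the XML or programmatic config, so the two merge delays can be lowered before the instance is created. The class name `MergeDelaySettings` and any values passed in are illustrative assumptions, not recommended settings.

```java
/**
 * Hypothetical helper that sets the two split-brain merge delays as JVM
 * system properties. Call it before the Hazelcast instance is created.
 */
final class MergeDelaySettings {
    // Property names as quoted in the reply above.
    public static final String FIRST_RUN = "hazelcast.merge.first.run.delay.seconds";
    public static final String NEXT_RUN = "hazelcast.merge.next.run.delay.seconds";

    private MergeDelaySettings() {
    }

    /** Stores both delays (in seconds) as system properties. */
    public static void apply(int firstRunSeconds, int nextRunSeconds) {
        if (firstRunSeconds <= 0 || nextRunSeconds <= 0) {
            throw new IllegalArgumentException("delays must be positive");
        }
        System.setProperty(FIRST_RUN, Integer.toString(firstRunSeconds));
        System.setProperty(NEXT_RUN, Integer.toString(nextRunSeconds));
    }
}
```

Calling `MergeDelaySettings.apply(10, 10)` at the top of `main` would be equivalent to passing `-Dhazelcast.merge.first.run.delay.seconds=10 -Dhazelcast.merge.next.run.delay.seconds=10` on the command line. A value set in the `<properties>` section of the XML config should take precedence over these.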

Fuad Malikov
Co-founder & Managing Partner
Hazelcast | Open source in-memory data grid


ivenhov

Oct 17, 2013, 12:58:59 PM

to haze...@googlegroups.com

Actually my settings were

As you can see in my first post.

I find multicast in Hazelcast completely unreliable. Nodes cannot discover other nodes unless the JVM is restarted, and even then it does not always happen.

I don't have the logs any more; I have tested plenty of different configurations since then and cannot extract the relevant parts now.

It should be easily reproducible with the information I provided.

D.

Mustafa Sancar Koyunlu

Oct 24, 2013, 4:38:35 AM

to haze...@googlegroups.com

Hi Daniel,

I investigated your problem with 2.6.3. Since I am using a Mac, I simulated the network failure with these commands:

sudo ifconfig en0 down

sudo ifconfig en0 up

In my environment, the nodes can find each other.

The exception "java.io.IOException: Network is unreachable" means that the network is still not up.

With my commands I did not get that exception at first. After I closed some other interfaces and then opened just en0, I managed to get the same exception. But the nodes still joined successfully.

Are you sure your way of simulating the failure is correct?

A second thing is that even though your configuration says that multicast is disabled, the Hazelcast logs say that the nodes are using the multicast joiner. That is not consistent. You may want to check that too.

ivenhov

Oct 24, 2013, 9:41:47 AM

to haze...@googlegroups.com

Hi Mustafa

Thanks for getting back to me

I was doing those tests on an Ubuntu machine, so a Mac may behave slightly differently.

In my case my eth1 interface had 2 IPs configured, one of which was 10.173.240.xx and the other on my external network.

I'm sure the NIC was up properly, as I could ssh from one machine to the other on that 10.173.240.xx interface.

It looked like multicast messages were dropped or lost and the node could not join because it did not retry.

The configuration says multicast is false because I had started to investigate tcp-ip discovery as a workaround for those problems.

That proved to be more reliable, although I still have to restart Hazelcast.
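A note on the ssh check above: it shows that unicast works on the interface, which is not the same as multicast being delivered. A small sketch like the following, using only the JDK's `java.net.NetworkInterface`, can at least report what the OS says about the interface after `ifup`. The class name `NicCheck` is made up for illustration, and the flags it prints are just the kernel's view; they do not prove that multicast datagrams actually reach the other node.

```java
import java.net.NetworkInterface;
import java.net.SocketException;

/** Hypothetical helper: reports the OS-level state of a named interface. */
final class NicCheck {
    private NicCheck() {
    }

    /** Returns a one-line summary such as "eth1: up=true, multicast=true". */
    public static String describe(String name) {
        try {
            NetworkInterface nic = NetworkInterface.getByName(name);
            if (nic == null) {
                return name + ": not found";
            }
            return name + ": up=" + nic.isUp()
                    + ", multicast=" + nic.supportsMulticast();
        } catch (SocketException e) {
            // The lookup or flag query itself failed; report it rather than guess.
            return name + ": query failed (" + e.getMessage() + ")";
        }
    }
}
```

Calling `NicCheck.describe("eth1")` on each node right after bringing the NIC back up would show whether the JVM sees the interface as up and multicast-capable at that moment.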

Regards

Daniel

Mustafa Sancar Koyunlu

Oct 25, 2013, 3:01:29 AM

to haze...@googlegroups.com

I tried to reproduce the problem on Ubuntu 12.04 with kernel 3.8.

I got the exact same exception when the NIC is down. But after I reopened eth1 with "sudo ifup eth1", the nodes reconnected successfully. I tried it with multicast.

I am sending logs of two nodes, so that you can see the behaviour.

node1.log

node2.log

ivenhov

Oct 25, 2013, 6:02:29 AM

to haze...@googlegroups.com

Hi

Thanks for sharing the logs. This is exactly what I would expect to have on my system.

Could you post your Hazelcast configuration?

Are multicast messages sent periodically? How often?

I'm wondering if the sender thread could crash or stop (because the NIC is not available, etc.), which would prevent the nodes from discovering each other.

D.

M. Sancar Koyunlu

Oct 25, 2013, 8:18:18 AM

to haze...@googlegroups.com

I just set the following:

hazelcast.max.wait.seconds.before.join 5

hazelcast.merge.first.run.delay.seconds 10

The others are default. But I saw that you have already played with those.

The log lines on the second node saying "Searching for other clusters" show that the responsible thread is not dead.
