Could not join cluster in 300000 ms. Shutting down now! Why should it ever happen?

Sergey Ripa

Jul 8, 2016, 6:45:22 AM
to Hazelcast
We're using Hazelcast 3.3.4. Very old, but still...

I have MulticastJoiner and 2 Hazelcast nodes running.
On node1 I run 'tcpkill host node2' and wait until the cluster splits.

What I expect is 2 clusters of 1 node each, even if I never re-enable the traffic.
By the way, multicast traffic still seems to be alive between the nodes, since node2 periodically logs reconnection attempts.

What I get instead, after 10-15 minutes, is that Hazelcast on node2 decides to shut down:

E 0707-1904:11,188 c.h.i.Node [172.29.49.152]:5701 [feperfSergey/16.5.0] [3.3.4] Could not join cluster in 300000 ms. Shutting down now! [hz._hzInstance_1_feperfSergey/16.5.0.cached.thread-3]
I 0707-1904:11,189 c.h.c.LifecycleService [172.29.49.152]:5701 [feperfSergey/16.5.0] [3.3.4] Address[172.29.49.152]:5701 is SHUTTING_DOWN [hz._hzInstance_1_feperfSergey/16.5.0.cached.thread-3]
I 0707-1904:12,189 c.h.n.t.TcpIpConnection [172.29.49.152]:5701 [feperfSergey/16.5.0] [3.3.4] Connection [Address[172.29.49.142]:5701] lost. Reason: java.io.IOException[Connection reset by peer] [hz._hzInstance_1_feperfSergey/16.5.0.IO.thread-out-0]
I 0707-1904:14,206 c.h.initializer [172.29.49.152]:5701 [feperfSergey/16.5.0] [3.3.4] Destroying node initializer. [hz._hzInstance_1_feperfSergey/16.5.0.cached.thread-3]
I 0707-1904:14,209 c.h.i.Node [172.29.49.152]:5701 [feperfSergey/16.5.0] [3.3.4] Hazelcast Shutdown is completed in 3020 ms. [hz._hzInstance_1_feperfSergey/16.5.0.cached.thread-3]
I 0707-1904:14,209 c.h.c.LifecycleService [172.29.49.152]:5701 [feperfSergey/16.5.0] [3.3.4] Address[172.29.49.152]:5701 is SHUTDOWN [hz._hzInstance_1_feperfSergey/16.5.0.cached.thread-3]


Can anyone clarify why this should happen at all?
Why can't the node keep working in an isolated state?

Thanks,
Sergey

Mehmet Dogan

Jul 15, 2016, 5:53:01 AM
to Hazelcast
The cluster merge process is causing that. Hazelcast runs a background process that periodically searches for possible cluster splits; once a split is discovered, a merge is initiated. The merge is simply one side of the split rejoining the other.

In your case, since you're using multicast discovery, the 2 nodes can still discover each other over UDP (multicast). Once they do, one of the nodes tries to rejoin the other, but because they can't communicate over TCP, the rejoin fails and the node decides to shut itself down. That is the "Could not join cluster in 300000 ms" error in your log.

Two properties control the split-brain search process (see http://docs.hazelcast.org/docs/3.6/manual/html-single/index.html#system-properties):
- hazelcast.merge.first.run.delay.seconds (default: 300 seconds)
- hazelcast.merge.next.run.delay.seconds (default: 120 seconds)
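
For illustration, here is a minimal sketch of setting the two properties above as JVM system properties before the Hazelcast instance is created. It assumes Hazelcast picks up `hazelcast.*` system properties at startup (the same effect as passing them as `-D` flags); the class and method names are hypothetical and not part of Hazelcast, and the values shown are just the defaults listed above, not a recommendation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the Hazelcast API.
public class MergeDelaySettings {
    static final String FIRST_RUN = "hazelcast.merge.first.run.delay.seconds";
    static final String NEXT_RUN = "hazelcast.merge.next.run.delay.seconds";

    // Builds the two property entries; both delays must be positive.
    static Map<String, String> mergeDelayProperties(int firstRunSeconds, int nextRunSeconds) {
        if (firstRunSeconds <= 0 || nextRunSeconds <= 0) {
            throw new IllegalArgumentException("delays must be positive");
        }
        Map<String, String> props = new LinkedHashMap<>();
        props.put(FIRST_RUN, Integer.toString(firstRunSeconds));
        props.put(NEXT_RUN, Integer.toString(nextRunSeconds));
        return props;
    }

    // Assumption: this is called before the Hazelcast instance is started,
    // so the values are visible when the node reads its properties.
    static void applyAsSystemProperties(Map<String, String> props) {
        props.forEach(System::setProperty);
    }

    public static void main(String[] args) {
        Map<String, String> props = mergeDelayProperties(300, 120);
        applyAsSystemProperties(props);
        // Print the equivalent command-line flags.
        props.forEach((k, v) -> System.out.println("-D" + k + "=" + v));
    }
}
```

Note that these values only change when and how often the split search runs; they do not by themselves make the rejoin succeed while TCP between the nodes is blocked.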


@mmdogan
