SplitBrainMessage deque flooding leading to OutOfMemoryError: GC overhead limit exceeded and 100% CPU utilization


IvanL

Oct 25, 2017, 8:28:14 AM10/25/17
to Hazelcast

I have a 2-member Hazelcast cluster on the same computer.

There are a few computers (about 5-6 other developers' machines) in the same network with similar clusters (but different ones, i.e. they have different group names/passwords).

This seems to lead to a situation where the SplitBrainJoinMessage deque in my Hazelcast cluster's master is flooded with messages. My master node might be contributing to this situation, resulting in some kind of chain reaction, since judging by the code of MulticastJoiner, it sends another SplitBrainJoinMessage in response.

 

I can see thousands of similar messages in my log:

2017-10-25 07:45:45,799 odeMulticastListener [ster.MulticastThread] - [10.10.6.102]:5702 [my-cluster] [3.8.2] Dropped: SplitBrainJoinMessage{packetVersion=4, buildNumber=20170518, memberVersion=3.8.2, clusterVersion=3.8, address=[10.10.6.54]:5702, uuid='5c3a81a6-d31b-42b9-8b9d-19f73324ac6e', liteMember=false, memberCount=1, dataMemberCount=1}

 

Eventually, the free heap runs out, CPU utilization goes to 100%, and "OutOfMemoryError: GC overhead limit exceeded" ensues.

 

When I use the tcp-ip join and restrict it to localhost, the problem is not reproduced.
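For reference, the tcp-ip variant I tested looked roughly like this (the member list and ports here are just my local setup):

```xml
<network>
    <join>
        <multicast enabled="false"/>
        <tcp-ip enabled="true">
            <member>127.0.0.1:5701</member>
            <member>127.0.0.1:5702</member>
        </tcp-ip>
    </join>
</network>
```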

 

We use Hazelcast 3.8.2.

 

I would be grateful if someone could shed some light on this.


My hazelcast.xml:

 

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<hazelcast xmlns="http://www.hazelcast.com/schema/config"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.hazelcast.com/schema/config https://hazelcast.com/schema/config/hazelcast-config-3.8.xsd">

    <group>
        <name>my-cluster</name>
        <password>tcRT0FUAZOtqEo5yl2aIa2biN63z9fp7</password>
    </group>

    <network>
        <join>
            <multicast enabled="true"/>
        </join>
    </network>

    <properties>
        <!-- tried playing with these, but it seems that they might only affect whether the problem occurs sooner or later
        <property name="hazelcast.merge.next.run.delay.seconds">60</property>
        <property name="hazelcast.merge.first.run.delay.seconds">10</property>
        -->
    </properties>

</hazelcast>


Alparslan Avcı

Oct 26, 2017, 6:03:39 AM10/26/17
to Hazelcast
Hi Ivan,

It is correct that the SplitBrainJoinMessage deque in your Hazelcast cluster's master is flooded with messages. This is caused by some other clusters in your network that are searching for clusters to join. The SplitBrainJoinMessage deque has no boundary, but this is not considered a bug: if we put a limit on it, some other join messages might be lost.

Indeed, when you are using multicast as the joiner, you should have a network group that is under your control, since multicast messages are sent to a group, not to a single node. If that is not possible, we recommend using the TCP/IP joiner instead.

Regards,
Alparslan

IvanL

Oct 26, 2017, 8:20:55 AM10/26/17
to Hazelcast
Hi Alparslan,

Would you be so kind as to answer a few additional questions?

1. When you say a network group, do you mean making a VLAN in the network for the Hazelcast cluster, or something else? If so, does it mean that, as a consequence of this implementation of MulticastJoiner, Hazelcast with multicast discovery can only run in a managed network where no other Hazelcast clusters are present, i.e. multicast can only be used for a single cluster in the network?
2. In the implementation of com.hazelcast.internal.cluster.impl.MulticastJoiner.searchForOtherClusters(), a SplitBrainJoinMessage is sent each time, but not more often than hazelcast.merge.next.run.delay.seconds. However, since messages accumulate in the deque and are processed by the MulticastJoiner while there are still messages in it, this interval is ignored, which in my opinion creates flooding, resulting in a chain reaction that only makes things worse. Is this a known consequence? Can it be improved somehow?

Looking forward to your reply, thank you.

Alparslan Avcı

Oct 27, 2017, 6:41:10 AM10/27/17
to haze...@googlegroups.com
Hi Ivan,

The answers are inline:

1. When you say a network group, do you mean making a VLAN in the network for the Hazelcast cluster, or something else? If so, does it mean that, as a consequence of this implementation of MulticastJoiner, Hazelcast with multicast discovery can only run in a managed network where no other Hazelcast clusters are present, i.e. multicast can only be used for a single cluster in the network?

No, you do not need to make a VLAN to run Hazelcast with multicast. By "under your control" I meant that you should be able to check whether any other clusters in the network group are sending that many SplitBrainJoinMessages. If they are, they should update their configuration, because with the correct (or default) configuration of all clusters in the multicast group, it is very hard to get an OOME with SplitBrainJoinMessages from other clusters.

Another option is to specify a multicast group in the Hazelcast configuration and use it. Please see the details of how to configure the multicast group and port here: http://docs.hazelcast.org/docs/3.9/manual/html-single/index.html#discovering-members-by-multicast
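For example, something like this (the group and port values below are only illustrative; choose values that differ from the default 224.2.2.3:54327 and from what the other teams use):

```xml
<network>
    <join>
        <multicast enabled="true">
            <multicast-group>224.2.2.4</multicast-group>
            <multicast-port>54328</multicast-port>
        </multicast>
    </join>
</network>
```

With a dedicated group/port per team, the other clusters' SplitBrainJoinMessages will not reach your members at all.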

2. In the implementation of com.hazelcast.internal.cluster.impl.MulticastJoiner.searchForOtherClusters(), a SplitBrainJoinMessage is sent each time, but not more often than hazelcast.merge.next.run.delay.seconds. However, since messages accumulate in the deque and are processed by the MulticastJoiner while there are still messages in it, this interval is ignored, which in my opinion creates flooding, resulting in a chain reaction that only makes things worse. Is this a known consequence? Can it be improved somehow?

Indeed, this behavior is expected; otherwise, it could easily end up in an OOME. When there are still messages in the deque, a Hazelcast member should try to process all of them regardless of hazelcast.merge.next.run.delay.seconds, in order to prevent the number of events in the deque from growing. In your case, your cluster members cannot keep up with the other clusters' searchForOtherClusters() runs. Most probably, some other cluster(s) are configured with a small value of hazelcast.merge.next.run.delay.seconds, and this causes the flooding in your cluster.

Regards,
Alparslan

IvanL

Oct 28, 2017, 8:29:38 AM10/28/17
to Hazelcast
Hi Alparslan,

Thank you very much for your answers. Would you bear with me a little more?

1.

with the correct (or default) configuration of all clusters in the multicast group, it is very hard to get an OOME with SplitBrainJoinMessages from other clusters
Does it mean that it is not recommended to change this configuration?

Another option is to specify a multicast group in the Hazelcast configuration and use it
The multicast group configuration is definitely an option. I knew about it, but somehow this possibility had slipped my mind. Thanks.

However, I would like to dwell a bit more on what I am struggling to convey in my second question.

2.
Most probably, some other cluster(s) are configured with a small value of hazelcast.merge.next.run.delay.seconds, and this causes the flooding in your cluster.
No, this is impossible. I've checked those configs and they are very basic; they are the same as the one I showed in my original e-mail (they do not have a <properties> element; I used it in an attempt to mitigate the problem in my experiments).

When there are still messages in the deque, a Hazelcast member should try to process all of them regardless of hazelcast.merge.next.run.delay.seconds, in order to prevent the number of events in the deque from growing.
I must have expressed myself poorly here, since I did not mean that the deque should not be drained in the while loop in the MulticastJoiner while ignoring next.run.delay. In my opinion, the problem might be in the sending of another SplitBrainJoinMessage at the end of the while loop's body, ignoring the delay. What I see is that this statement might create a positive feedback loop in the network, and given the nature of multicast, this might lead to a self-amplification effect.

Here is my assumption:
a. 5 clusters send 5 multicast messages within next.run.delay seconds.
b. At this point, there are 5 messages in the deque of each cluster.
c. Each master processes its deque 5 times, generating another 5 messages in total at the end of the while loop, resulting in 5 messages * 4 clusters = 20 additional messages in each deque.
d. While the MulticastJoiner is iterating over the deque, it ignores next.run.delay and sends another bunch of 20 messages, resulting in 80 messages in each deque (taking into account that we have 5 clusters).

Provided that some clusters have been running for a long time, I think this is what I see when I start my Hazelcast member: it is flooded with messages and OOMEs within virtually minutes.
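To make the arithmetic above concrete, here is a toy model of the growth I am assuming (my own sketch, not actual Hazelcast code): in each round, every master drains its deque and, for each drained message, multicasts one reply, which lands in the deque of each of the other (clusters - 1) masters.

```java
// Toy model of the suspected positive feedback loop (my assumption, not
// Hazelcast code): every drained message triggers one multicast reply,
// and the replies of the other (clusters - 1) masters land in our deque.
public class FloodModel {
    public static void main(String[] args) {
        int clusters = 5;
        long perDeque = clusters; // step b: 5 messages in each deque
        for (int round = 1; round <= 4; round++) {
            // each drained message produces one reply per other cluster
            perDeque *= (clusters - 1);
            System.out.println("round " + round + ": " + perDeque
                    + " messages per deque");
        }
        // prints 20, 80, 320, 1280 - exponential growth, matching
        // steps c and d above for the first two rounds
    }
}
```

Under this model the deque grows by a factor of (clusters - 1) per round, which would explain why the flooding gets worse the more teams run multicast clusters on the same network.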

I tried experimenting with a small test program at home, and while I don't see the same magnitudes, I do see an increase in the number of messages accumulating in the deque with each memory snapshot I take in one of the clusters. Given enough time, it would probably result in the same situation.

I admit that this is mainly speculative (except for the symptoms), but could you possibly consider this and think it through with me?

I am looking forward to your reply.

Best regards,
Ivan

Alparslan Avcı

Nov 21, 2017, 7:45:50 AM11/21/17
to haze...@googlegroups.com
Hi Ivan,

Sorry for the late reply on this.

Does it mean that it is not recommended to change this configuration?

No, I meant that you should configure it correctly if you change it. But it looks like the clusters in your environment have the correct config.

I must have expressed myself poorly here, since I did not mean that the deque should not be drained in the while loop in the MulticastJoiner while ignoring next.run.delay. In my opinion, the problem might be in the sending of another SplitBrainJoinMessage at the end of the while loop's body, ignoring the delay. What I see is that this statement might create a positive feedback loop in the network, and given the nature of multicast, this might lead to a self-amplification effect.

Actually, you may be right. But the resending of the SplitBrainJoinMessage is needed according to this PR: https://github.com/hazelcast/hazelcast/pull/10378 Even so, it may be a bug, so I have raised this GitHub issue to track it: https://github.com/hazelcast/hazelcast/issues/11836 Please follow it there, and add any comments if you need to.

Regards,
Alparslan Avci


Industrious

Nov 30, 2017, 7:22:59 AM11/30/17
to haze...@googlegroups.com
Hi Alparslan,

Thank you for your effort and patience.

Do I understand it right that the changes are included starting with 3.9.2, and in 3.9.1 too?

Regards,
Ivan

