Multiple masters after a split brain recovery


Joe Planisky

Jun 21, 2012, 2:27:23 PM
to haze...@googlegroups.com
I have been testing our application with Hazelcast 2.1.2 for recovery from a network outage or split brain condition. Most of the time, the cluster recovers from a network outage OK, but sometimes the nodes seem to get stuck in a state where two of them think they are masters.

Basically, in a small cluster of 3 nodes, I simulate a network outage on one of the nodes, wait a while, then restore the network. All the nodes run on one physical machine, but each is in a separate VM running Ubuntu 10.04.2. I'm simulating the network failure with firewall rules that block all incoming and outgoing TCP and ICMP traffic.

Sometimes when I restore the network, I start seeing these messages on the node that was the original master (i.e. the oldest node in the cluster, ip *.143):

2012-06-21 09:24:10,814 [hz._hzInstance_1_localtest.ServiceThread] WARN com.hazelcast.impl.PartitionManager - [172.16.191.143]:5701 [localtest] This is the master node and received a ClusterRuntimeState from Address[172.16.191.142]:5701. Ignoring incoming state!

And a similar message on the node that was isolated from the rest of the cluster (ip *.142):

2012-06-21 09:24:11,129 [hz._hzInstance_1_localtest.ServiceThread] WARN com.hazelcast.impl.PartitionManager - [172.16.191.142]:5701 [localtest] This is the master node and received a ClusterRuntimeState from Address[172.16.191.143]:5701. Ignoring incoming state!

On the other node (that wasn't master and remained in the cluster, ip *.141), I see

2012-06-21 09:24:10,816 [hz._hzInstance_1_localtest.ServiceThread] WARN com.hazelcast.impl.PartitionManager - [172.16.191.141]:5701 [localtest] Received a ClusterRuntimeState, but its sender doesn't seem master! => Sender: Address[172.16.191.142]:5701, Master: Address[172.16.191.143]:5701! (Ignore if master node has changed recently.)

These messages repeat about every 10 seconds, indefinitely.

I'm not sure if it's related, but I eventually start getting these messages every 5 seconds on the old master:

2012-06-21 09:28:25,018 [RequestHandler-1] WARN com.hazelcast.impl.ConcurrentMapManager - [172.16.191.143]:5701 [localtest] RedoLog{key=Data{partitionHash=-1555588543} size= 82, operation=CONCURRENT_MAP_GET, target=Address[172.16.191.142]:5701, targetConnected=false, redoCount=50, migrating=null
partition=Partition [34]{
0:Address[172.16.191.142]:5701
1:Address[172.16.191.143]:5701
2:Address[172.16.191.141]:5701
}
}

While this doesn't happen often, I've been able to reproduce it twice this morning. I have debug logs from all 3 nodes if it will help.

Any idea what's happening here or any way to make split brain recovery more robust?

--
Joe

Jason Clawson

Jun 21, 2012, 5:11:49 PM
to haze...@googlegroups.com
We are seeing similar issues with 2.1.2:


20:31:15,051 v.ServiceThread [WARN ] PartitionManager           - [10.10.20.240]:5701 [dev] Unknown Address[10.10.20.215]:5701 is found in received partition table from master Address[10.10.20.215]:5701. Probably it is dead. Partition: Partition [75]{
0:Address[10.10.20.58]:5701
1:Address[10.10.20.215]:5701
2:Address[10.10.20.240]:5701
3:Address[10.10.20.32]:5701
}


20:32:45,563 xer-processor-0 [WARN ] ConcurrentMapManager       - [10.10.20.215]:5701 [dev] RedoLog{key=Data{partitionHash=252064405} size= 10, operation=CONCURRENT_MAP_REMOVE, target=Address[10.10.20.240]:5701, targetConnected=false, redoCount=10470, migrating=null
partition=Partition [259]{
0:Address[10.10.20.240]:5701
1:Address[10.10.20.32]:5701
2:Address[10.10.20.58]:5701
3:Address[10.10.20.215]:5701
}
}

Earlier on 10.10.20.240:

20:31:15,048 v.ServiceThread [WARN ] PartitionManager           - [10.10.20.240]:5701 [dev] Received a ClusterRuntimeState, but its sender doesn't seem master! => Sender: Address[10.10.20.215]:5701, Master: Address[10.10.20.58]:5701! (Ignore if master node has changed recently.) 

And we have threads stuck like this:

"HazelcastReindexer-processor-1" Id=372 TIMED_WAITING
at java.lang.Thread.sleep(Native Method)
at com.hazelcast.impl.BaseManager$ResponseQueueCall.getRedoAwareResult(BaseManager.java:587)
at com.hazelcast.impl.BaseManager$ResponseQueueCall.getResult(BaseManager.java:567)
at com.hazelcast.impl.BaseManager$RequestBasedCall.getResultAsIs(BaseManager.java:438)
at com.hazelcast.impl.BaseManager$ResponseQueueCall.getResultAsIs(BaseManager.java:503)
at com.hazelcast.impl.BlockingQueueManager.takeKey(BlockingQueueManager.java:339)
at com.hazelcast.impl.BlockingQueueManager.takeKey(BlockingQueueManager.java:330)
at com.hazelcast.impl.BlockingQueueManager.poll(BlockingQueueManager.java:288)
at com.hazelcast.impl.QProxyImpl$QProxyReal.poll(QProxyImpl.java:260)
at com.hazelcast.impl.QProxyImpl$QProxyReal.drainTo(QProxyImpl.java:353)
at com.hazelcast.impl.QProxyImpl.drainTo(QProxyImpl.java:147)
        at .... our code .....

Mehmet Dogan

Jun 26, 2012, 5:41:11 AM
to haze...@googlegroups.com
Did the 2nd node (*.142) restart itself when it was joining back to the other nodes after you restored the network?

If you still have logs, can you send all three? 

@mmdogan






--
You received this message because you are subscribed to the Google Groups "Hazelcast" group.
To post to this group, send email to haze...@googlegroups.com.
To unsubscribe from this group, send email to hazelcast+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/hazelcast?hl=en.


Mehmet Dogan

Jun 28, 2012, 9:15:02 AM
to haze...@googlegroups.com
Joe, 

I am able to reproduce and fix this issue and will commit the fix soon.

Thanks for your findings.

@mmdogan



On Tue, Jun 26, 2012 at 5:01 PM, Joe Planisky <joe.pl...@temboo.com> wrote:
To answer your question, yes, it looks like *.142 restarted prior to the multi-master situation (see 142.log, attached):

2012-06-21 09:23:38,906 [hz._hzInstance_1_localtest.cached.thread-34] INFO  com.hazelcast.impl.LifecycleServiceImpl - [172.16.191.142]:5701 [localtest] Address[172.16.191.142]:5701 is RESTARTING


Attached is a zip of Hazelcast logs at DEBUG level for all 3 nodes.

Here's a timeline of network disconnections:

   08:39:00 All nodes running normally
   08:41:00 network disrupted
   08:51:00 network restored
   08:59:00 network disrupted
   09:03:00 network restored
   09:12:00 network disrupted
   09:19:00 network restored
   09:21:00 network disrupted
   09:23:00 network restored

The "multiple master" issue shows up after the reconnect at 09:23:00. (Prior disconnect/reconnects seemed to resolve successfully as far as I can tell.)
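One way to correlate lifecycle transitions like the RESTARTING event above with a disconnect/reconnect timeline is to register a LifecycleListener that timestamps every state change. This is a minimal sketch against the Hazelcast 2.x API, not part of the original test; the class name and log format are illustrative:

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.LifecycleEvent;
import com.hazelcast.core.LifecycleListener;

public class LifecycleLogger {
    public static void main(String[] args) {
        // null config => default hazelcast.xml lookup
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(null);
        hz.getLifecycleService().addLifecycleListener(new LifecycleListener() {
            public void stateChanged(LifecycleEvent event) {
                // RESTARTING/RESTARTED should show up here when a node
                // drops out of and rejoins the cluster after a split.
                System.out.println(System.currentTimeMillis()
                        + " lifecycle: " + event.getState());
            }
        });
    }
}
```

Matching these timestamps against the firewall-toggle times makes it easier to see which reconnect triggered the restart.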

"network disrupted" means the following firewall rules were executed on node 172.16.191.142 (disabling all inbound and outbound TCP and ICMP traffic):

   iptables -I INPUT -p tcp ! --dport 22 -j DROP
   iptables -I OUTPUT -p tcp ! --sport 22 -j DROP
   iptables -I INPUT -p icmp -j DROP
   iptables -I OUTPUT -p icmp -j DROP

"network restored" means the following firewall rules were executed on node 172.16.191.142 (enabling inbound and outbound TCP and ICMP traffic):

   iptables -D INPUT -p tcp ! --dport 22 -j DROP
   iptables -D OUTPUT -p tcp ! --sport 22 -j DROP
   iptables -D INPUT -p icmp -j DROP
   iptables -D OUTPUT -p icmp -j DROP

All nodes were running on Ubuntu 10.04.3 in VMWare Fusion virtual machines running on a Mac OS X 10.6.8 host.

The oldest node in the cluster (i.e. first to start) was 172.16.191.143.

The nodes were configured to use TCP-IP join (node *.140 was not running for this test.):

   <network>
       <port auto-increment="true">5701</port>
       <join>
           <multicast enabled="false">
               <multicast-group>224.2.2.3</multicast-group>
               <multicast-port>54327</multicast-port>
           </multicast>
           <tcp-ip conn-timeout-seconds="20" enabled="true" >
               <interface>172.16.191.140</interface>
               <interface>172.16.191.141</interface>
               <interface>172.16.191.142</interface>
               <interface>172.16.191.143</interface>
           </tcp-ip>
           <aws enabled="false">
               <access-key>aws_access_key</access-key>
               <secret-key>aws_secret_key</secret-key>
           </aws>
       </join>
       <interfaces enabled="true">
           <interface>172.16.191.*</interface>
       </interfaces>
   </network>

During this test, the applications on the nodes were mostly idle; there was some retrieval and updating of information in IMaps, driven by timers at 1- and 5-second intervals.

--
Joe
--
Joe Planisky . Software Engineer

[ p. (212) 274-8555 e. joe.pl...@temboo.com ]

Temboo Inc – 104 Franklin Street – New York, NY 10013

www.temboo.com




Mehmet Dogan

Jul 13, 2012, 12:11:17 PM
to haze...@googlegroups.com

Not issue 195. This is reproducible without an EC2 environment, using only a simple local network.

Can you reproduce it on your first attempt, after the first split brain? Or do you only get the error after several trials?

On Jul 13, 2012 6:30 PM, <je...@ziptr.com> wrote:
Is this issue #195?

I ask because the issue discusses EC2 auto-discovery and I can mostly reliably reproduce it between two machines in the office with no EC2 involvement.

In my case I have one physical machine and one virtual machine (VMware). I get the two nodes connected and functioning (steady state) and then go to the virtual machine and disconnect the network device. At this point I start getting a variety of messages (I can post them if you want), including member removed, etc. The trick is to wait approx. 3 minutes and then reconnect the virtual machine's network device.

At this point both machines rediscover each other and the SplitBrainHandler announces it is merging the clusters. The appropriate memberAdded events are generated and all looks correct. However, at this point I usually (but not 100% of the time) start getting the "This is the master node" messages. I've waited 1 hour and the messages continue. Interestingly enough, everything seems like it works... I'm not sure of the ramifications of ignoring the messages (not that I want to).

The important thing is this is not done in EC2, I am using local networks with TCP discovery (not multicast).

Joe - will the fix you developed also fix this issue?

Thanks,
- Jeff



Mehmet Dogan

Jul 24, 2012, 5:30:50 AM
to haze...@googlegroups.com
Hi,

Today we have released both 2.1.3 and 2.2. Both have fixes regarding join and split-brain issues.

You can download from either hazelcast.com or the Maven repo.

Release notes are here:



@mmdogan




On Sun, Jul 15, 2012 at 8:18 PM, <je...@ziptr.com> wrote:
Hi Mehmet, 

Yes, I can usually reproduce it on the first attempt. It's not 100% reproducible, but it's much more than 50%.

It's quite easy on my side: create a 2-node cluster, one node on a physical machine, one in a VM (on the same physical machine). Then, after everything is stable, I disconnect the network device on the virtual machine, which is basically like pulling the network cable. After 3 minutes I re-connect the network device and wait for the cluster to recover.

I get all of the proper member-added messages, etc., and things generally look fine except for the "This is the master node" messages that repeat in the log for both nodes. I've left it running for 2 hours and the messages continued for the entire time period. This implies it would probably happen indefinitely. Please note that I usually get a set of warnings such as:

[2012-07-13 09:24:15,289] [WARN ] [process] [ClusterRuntimeState.java:80] -> [[10.1.15.126]:5701 [MainOnline] Unknown Address[10.1.15.126]:5701 is found in received partition table from master Address[172.16.184.134]:5701. Probably it is dead. Partition: Partition [86]{
0:Address[10.1.15.126]:5701
1:Address[172.16.184.134]:5701
}]

These continue, with the partition [#] incrementing, for quite some time (i.e., it will hit 90-100) before transitioning to the "This is the master node" message.

Please note that we run with 2 Hazelcast clusters defined at a time (one for the main product, one for management of the product). Looking at the log files, I see that only the main product (default Hazelcast) cluster is represented. That said, only that cluster has regular messages flowing in it (one every 15 seconds or so). The management cluster only has messages if something actually happens to warrant it.

Also note that we start up Hazelcast providing the list of known other nodes that should be active. We are not using multicast or EC2 discovery; we simply build a config object and set the addresses via the TcpIpConfig.addAddress() method. This means we do not define a single other node; we define all nodes that should be available, including any that are having network connectivity issues.
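For reference, building that configuration programmatically looks roughly like this in the Hazelcast 2.x API. This is a sketch of the approach described, not Jeff's actual code; the addresses and class name are placeholders:

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.TcpIpConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.nio.Address;

public class TcpJoinExample {
    public static void main(String[] args) throws Exception {
        Config config = new Config();
        // Disable multicast discovery; rely on the explicit TCP-IP member list.
        config.getNetworkConfig().getJoin().getMulticastConfig().setEnabled(false);
        TcpIpConfig tcpIp = config.getNetworkConfig().getJoin().getTcpIpConfig();
        tcpIp.setEnabled(true);
        // All known nodes are listed, including any that may currently be unreachable.
        tcpIp.addAddress(new Address("10.1.15.126", 5701));
        tcpIp.addAddress(new Address("172.16.184.134", 5701));
        Hazelcast.newHazelcastInstance(config);
    }
}
```

This mirrors the XML configuration posted earlier in the thread, just built in code instead of loaded from hazelcast.xml.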

What can I do to help debug this? We're trying to transition to 2.1 because the 1.9 version is having problems with transient network errors, and one of the main 2.0 release-note items is improvements in this area.

Thanks,
- Jeff


Gökhan Memiş

Nov 27, 2012, 10:05:44 AM
to haze...@googlegroups.com
I am using Hazelcast version 2.4 and I still get these logs. What can I do?

The logs look like this:

17:00:35.093 [hz._hzInstance_4_dev.ServiceThread] WARN  com.hazelcast.impl.PartitionManager [Slf4jFactory.java:83] - [10.12.1.49]:5704 [dev] Unknown Address[10.12.1.22]:5701 is found in received partition table from master Address[10.12.1.56]:5701. Probably it is dead. Partition: Partition [90]{
    0:Address[10.12.1.9]:5701
    1:Address[10.12.1.22]:5701
    2:Address[10.12.1.56]:5701
    3:Address[10.12.1.49]:5703
}
17:00:35.093 [hz._hzInstance_2_dev.ServiceThread] WARN  com.hazelcast.impl.PartitionManager [Slf4jFactory.java:83] - [10.12.1.49]:5702 [dev] Unknown Address[10.12.1.22]:5701 is found in received partition table from master Address[10.12.1.56]:5701. Probably it is dead. Partition: Partition [198]{
    0:Address[10.12.1.13]:5701
    1:Address[10.12.1.22]:5701
    2:Address[10.12.1.47]:5701
}
17:00:35.093 [hz._hzInstance_1_dev.ServiceThread] WARN  com.hazelcast.impl.PartitionManager [Slf4jFactory.java:83] - [10.12.1.49]:5701 [dev] Unknown Address[10.12.1.13]:5701 is found in received partition table from master Address[10.12.1.56]:5701. Probably it is dead. Partition: Partition [253]{
    0:Address[10.12.1.13]:5701
    1:Address[10.12.1.42]:5701
    3:Address[10.12.1.43]:5701
}



On Thursday, June 21, 2012 at 9:27:23 PM UTC+3, JoeP wrote:

Stock

Apr 10, 2013, 4:32:53 AM
to haze...@googlegroups.com
I'm using Hazelcast 2.4 and I have the same problem. What is annoying is that I have to shut down the complete cluster and restart it...

Any suggestion?


2013 04 10 09:41:36 WARN   [hz._hzInstance_1_CLOUD-Env.ServiceThread] [192.168.127.27]:5710 [CLOUD-Env] Unknown Address[192.168.127.30]:5710 is found in received partition table from master Address[192.168.127.23]:5710. Probably it is dead. Partition: Partition [259]{
0:Address[192.168.127.27]:5712
1:Address[192.168.127.30]:5710
2:Address[192.168.127.30]:5711
3:Address[192.168.127.24]:5712
4:Address[192.168.127.30]:5712
5:Address[192.168.127.23]:5711
6:Address[192.168.127.29]:5712
}
2013 04 10 09:41:36 WARN   [hz._hzInstance_1_CLOUD-Env.ServiceThread] [192.168.127.27]:5710 [CLOUD-Env] Unknown Address[192.168.127.30]:5710 is found in received partition table from master Address[192.168.127.23]:5710. Probably it is dead. Partition: Partition [265]{
0:Address[192.168.127.29]:5713
1:Address[192.168.127.24]:5711
2:Address[192.168.127.30]:5710
3:Address[192.168.127.29]:5712
4:Address[192.168.127.27]:5712
5:Address[192.168.127.29]:5711
6:Address[192.168.127.23]:5712
}
2013 04 10 09:41:36 WARN   [hz._hzInstance_1_CLOUD-Env.ServiceThread] [192.168.127.27]:5710 [CLOUD-Env] Unknown Address[192.168.127.30]:5710 is found in received partition table from master Address[192.168.127.23]:5710. Probably it is dead. Partition: Partition [266]{
0:Address[192.168.127.29]:5711
1:Address[192.168.127.30]:5710
2:Address[192.168.127.29]:5712
3:Address[192.168.127.27]:5710
4:Address[192.168.127.24]:5711
5:Address[192.168.127.24]:5710
6:Address[192.168.127.23]:5710
}
2013 04 10 09:41:36 WARN   [hz._hzInstance_1_CLOUD-Env.ServiceThread] [192.168.127.27]:5710 [CLOUD-Env] Unknown Address[192.168.127.30]:5710 is found in received partition table from master Address[192.168.127.23]:5710. Probably it is dead. Partition: Partition [268]{
0:Address[192.168.127.23]:5712
1:Address[192.168.127.24]:5712
2:Address[192.168.127.29]:5711
3:Address[192.168.127.30]:5711
4:Address[192.168.127.29]:5713
5:Address[192.168.127.24]:5710
6:Address[192.168.127.30]:5710
}
2013 04 10 09:41:36 WARN   [hz._hzInstance_1_CLOUD-Env.ServiceThread] [192.168.127.27]:5710 [CLOUD-Env] Unknown Address[192.168.127.30]:5710 is found in received partition table from master Address[192.168.127.23]:5710. Probably it is dead. Partition: Partition [270]{
0:Address[192.168.127.30]:5710
1:Address[192.168.127.24]:5711
2:Address[192.168.127.30]:5711
3:Address[192.168.127.29]:5713
4:Address[192.168.127.27]:5712
5:Address[192.168.127.24]:5712
6:Address[192.168.127.24]:5710
}

Stock

Apr 11, 2013, 4:40:50 PM
to haze...@googlegroups.com
Hi guys,

I'm investigating, and it probably doesn't depend on Hazelcast... it could be a firewall issue...

Ciao!

Enes Akar

Apr 30, 2013, 6:31:39 AM
to haze...@googlegroups.com
Getting a timeout exception (Operation Timeout (with no response!): 0) from some operations (atomic number ops and containsKey) was a bug, which has been resolved since 2.4.

So please upgrade to 2.5.


On Tue, Apr 30, 2013 at 12:53 PM, <kbie...@gmail.com> wrote:
Any luck on the investigation?

We are using Hazelcast 2.3.1 with 6 nodes on the production environment and we are having some serious issues

under high load we are getting various exception like those:
com.hazelcast.core.OperationTimeoutException: [ATOMIC_NUMBER_GET_AND_SET] Operation Timeout (with no response!): 0
com.hazelcast.core.OperationTimeoutException: [ATOMIC_NUMBER_ADD_AND_GET] Operation Timeout (with no response!): 0
com.hazelcast.core.OperationTimeoutException: [CONCURRENT_MAP_PUT] Redo threshold[90] exceeded! Last redo cause: REDO_MEMBER_UNKNOWN

after some time one of the servers becomes unavailable and needs a restart. I have also been looking on the Hazelcast forum for information, but like here I have only found vague answers such as "migrate to version 2.5" without any explanation of what the problem is. We are afraid that migrating to a new version will cause other issues...

Actually, I must say that the stability of Hazelcast is questionable. We used version 1.9.4 for a very long time and didn't have any problems with it, but then we decided to use the Webfilter functionality to replicate the HTTP session across the servers, so that we could turn off stickiness on the load balancer. In order to do that we migrated to version 2.2, which turned out to have big problems during deployment. We found someone describing similar problems and recommending migration to 2.3.1, so we did, and now we have these problems...

Not to mention that when we tried to enable the JMX plugin for Hazelcast (2.3.1), after 2 days of working we suddenly started to have serious problems: when one of the servers got an OOM, all the other servers in the cluster stopped working and didn't respond to any requests. This happened several times, each time the number of sessions in Hazelcast went over 100,000. You can see the failures on the Nagios diagram below:




--
Enes Akar
Hazelcast | Open source in-memory data grid
Mobile: +90.505.394.1668

Stock

May 7, 2013, 5:41:47 PM
to haze...@googlegroups.com, kbie...@gmail.com
Yes, we had some luck with the investigation. I forgot to report back here, sorry.

In my case, the firewall was guilty! After changing its policies, everything was OK!




nicola...@gmail.com

Nov 29, 2013, 8:28:52 AM
to haze...@googlegroups.com, kbie...@gmail.com
What exactly was wrong with your firewall?

Stock

Dec 1, 2013, 6:36:31 PM
to haze...@googlegroups.com, kbie...@gmail.com, nicola...@gmail.com
We forgot to open the ports used by Hazelcast. Every machine had a firewall on board.

vijayan....@gmail.com

Oct 16, 2016, 2:41:26 AM
to Hazelcast
Hi Dogan, I have gone through this complete conversation. This exactly matches our hot, ongoing production issue, which has a direct customer escalation.
Yes, in my case the 2nd node restarted itself.
Node 1: This is the master node and received a ClusterRuntimeState from Address[172.16.1.163]:5701. Ignoring incoming state!
Node 2: repeated warning message: "primary node Probably it is dead"
We use Hazelcast 2.1.1 with Mule 3.4.2.
Application error: java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor did not accept within 60000 MILLISECONDS
The application stopped working on both nodes, resulting in thread starvation.

It would be really helpful if you could provide the RCA and a fix for this. It is a live, ongoing issue.

Thanks,
VJ