--
Joe
--
You received this message because you are subscribed to the Google Groups "Hazelcast" group.
To post to this group, send email to haze...@googlegroups.com.
To unsubscribe from this group, send email to hazelcast+...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/hazelcast?hl=en.
To answer your question, yes, it looks like *.142 restarted prior to the multi-master situation (see 142.log, attached):
2012-06-21 09:23:38,906 [hz._hzInstance_1_localtest.cached.thread-34] INFO com.hazelcast.impl.LifecycleServiceImpl - [172.16.191.142]:5701 [localtest] Address[172.16.191.142]:5701 is RESTARTING
Attached is a zip of Hazelcast logs at DEBUG level for all 3 nodes.
Here's a timeline of network disconnections:
08:39:00 All nodes running normally
08:41:00 network disrupted
08:51:00 network restored
08:59:00 network disrupted
09:03:00 network restored
09:12:00 network disrupted
09:19:00 network restored
09:21:00 network disrupted
09:23:00 network restored
The "multiple master" issue shows up after the reconnect at 09:23:00. (Prior disconnect/reconnects seemed to resolve successfully as far as I can tell.)
"network disrupted" means the following firewall rules were executed on node 172.16.191.142 (disabling all inbound and outbound TCP and ICMP traffic):
iptables -I INPUT -p tcp ! --dport 22 -j DROP
iptables -I OUTPUT -p tcp ! --sport 22 -j DROP
iptables -I INPUT -p icmp -j DROP
iptables -I OUTPUT -p icmp -j DROP
"network restored" means the following firewall rules were executed on node 172.16.191.142 (enabling inbound and outbound TCP and ICMP traffic):
iptables -D INPUT -p tcp ! --dport 22 -j DROP
iptables -D OUTPUT -p tcp ! --sport 22 -j DROP
iptables -D INPUT -p icmp -j DROP
iptables -D OUTPUT -p icmp -j DROP
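For reference, the disrupt/restore cycle above can be wrapped in a small shell script. This is only a sketch of the steps described in this message: the `IPT_CMD` variable and the function names are hypothetical conveniences, not part of the original test. By default the script just prints the iptables commands (actually applying them requires root); set `IPT_CMD=iptables` to execute them for real.

```shell
#!/bin/sh
# Sketch of the "network disrupted" / "network restored" steps above.
# By default this only PRINTS the iptables commands; set IPT_CMD=iptables
# (and run as root) to actually apply them.
IPT="${IPT_CMD:-echo iptables}"

disrupt() {
    # Drop all TCP except SSH (port 22) and all ICMP, in both directions.
    $IPT -I INPUT  -p tcp ! --dport 22 -j DROP
    $IPT -I OUTPUT -p tcp ! --sport 22 -j DROP
    $IPT -I INPUT  -p icmp -j DROP
    $IPT -I OUTPUT -p icmp -j DROP
}

restore() {
    # Delete the same four rules to restore traffic.
    $IPT -D INPUT  -p tcp ! --dport 22 -j DROP
    $IPT -D OUTPUT -p tcp ! --sport 22 -j DROP
    $IPT -D INPUT  -p icmp -j DROP
    $IPT -D OUTPUT -p icmp -j DROP
}

# One disruption window; in the timeline above each lasted 2-10 minutes.
disrupt
# sleep 600   # wait here when running for real
restore
```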
All nodes were running Ubuntu 10.04.3 in VMware Fusion virtual machines on a Mac OS X 10.6.8 host.
The oldest node in the cluster (i.e. first to start) was 172.16.191.143.
The nodes were configured to use TCP-IP join (node *.140 was not running for this test):
<network>
    <port auto-increment="true">5701</port>
    <join>
        <multicast enabled="false">
            <multicast-group>224.2.2.3</multicast-group>
            <multicast-port>54327</multicast-port>
        </multicast>
        <tcp-ip conn-timeout-seconds="20" enabled="true">
            <interface>172.16.191.140</interface>
            <interface>172.16.191.141</interface>
            <interface>172.16.191.142</interface>
            <interface>172.16.191.143</interface>
        </tcp-ip>
        <aws enabled="false">
            <access-key>aws_access_key</access-key>
            <secret-key>aws_secret_key</secret-key>
        </aws>
    </join>
    <interfaces enabled="true">
        <interface>172.16.191.*</interface>
    </interfaces>
</network>
During this test, the applications on the nodes were mostly idle; there was some retrieval and updating of information in IMaps, driven by timers at 1- and 5-second intervals.
--
Joe
--
Joe Planisky . Software Engineer
[ p. (212) 274-8555 e. joe.pl...@temboo.com ]
Temboo Inc – 104 Franklin Street – New York, NY 10013
www.temboo.com
Not issue 195. This is reproducible without an EC2 environment, using only a simple local network.
Can you reproduce it on your first attempt, after the first split brain? Or do you get the error only after several trials?
Is this issue #195?
I ask because that issue discusses EC2 auto-discovery, and I can fairly reliably reproduce this between two machines in the office with no EC2 involvement.

In my case I have one physical machine and one virtual machine (VMware). I get the two nodes connected and functioning (steady state), then go to the virtual machine and disconnect its network device. At this point I start getting a variety of messages (I can post them if you want), including member removed, etc. The trick is to wait approx. 3 minutes and then reconnect the virtual machine's network device.

At this point both machines rediscover each other and the SplitBrainHandler announces that it is merging the clusters. The appropriate memberAdded events are generated and all looks correct. However, at this point I usually (but not 100% of the time) start getting the "This is the master node" messages. I've waited 1 hour and the messages continue. Interestingly enough, everything seems to work; I'm just not sure of the ramifications of ignoring the messages (not that I want to).

The important thing is that this is not done in EC2; I am using local networks with TCP discovery (not multicast).

Joe - will the fix you developed also fix this issue?
Thanks,
- Jeff
On Thursday, June 28, 2012 9:15:02 AM UTC-4, Mehmet Dogan wrote:
Joe,
I am able to reproduce and fix this issue and will commit the fix soon. Thanks for your findings.
@mmdogan
Hi Mehmet,

Yes, I can usually reproduce it on the first attempt. It's not 100% reproducible, but it's much more than 50%.

It's quite easy on my side: create a 2-node cluster, one node on a physical machine and one in a VM (on the same physical machine). Then, after everything is stable, I disconnect the network device on the virtual machine, which is basically like pulling the network cable. After 3 minutes I re-connect the network device and wait for the cluster to recover.

I get all of the proper member-added messages, etc., and things generally look fine except for the "This is the master node" messages that repeat in the log on both nodes. I've left it running for 2 hours and the messages continued for the entirety of that period, which implies they would probably continue indefinitely. Please note that I usually get a set of warnings such as:

[2012-07-13 09:24:15,289] [WARN ] [process] [ClusterRuntimeState.java:80] -> [[10.1.15.126]:5701 [MainOnline] Unknown Address[10.1.15.126]:5701 is found in received partition table from master Address[172.16.184.134]:5701. Probably it is dead. Partition: Partition [86]{0:Address[10.1.15.126]:5701 1:Address[172.16.184.134]:5701}]

These continue, with the partition number incrementing, for quite some time (i.e., it will hit 90-100) before transitioning to the "This is the master node" message.

Please note that we run with 2 Hazelcast clusters defined at a time (one for the main product, one for management of the product). As I look at the log files, I see that only the main product (default Hazelcast) cluster is represented in them. That said, only that cluster has regular messages flowing in it (one every 15 seconds or so); the management cluster only logs messages if something actually happens to warrant it.

Also note that we start up Hazelcast providing the list of known other nodes that should be active. We are not using multicast or EC2 discovery; we simply build a Config object and set the addresses via the TcpIpConfig.addAddress() method. This means we do not define a single other node; we define all nodes that should be available, including any that are having network connectivity issues.

What can I do to help debug this? We're trying to transition to 2.1 because the 1.9 version is having problems with transient network errors, and one of the main 2.0 release note items is improvements in this area.

Thanks,
- Jeff
On Friday, July 13, 2012 12:11:17 PM UTC-4, Mehmet Dogan wrote:
Any luck on the investigation?

We are using Hazelcast 2.3.1 with 6 nodes in our production environment, and we are having some serious issues. Under high load we are getting various exceptions like these:

com.hazelcast.core.OperationTimeoutException: [ATOMIC_NUMBER_GET_AND_SET] Operation Timeout (with no response!): 0
com.hazelcast.core.OperationTimeoutException: [ATOMIC_NUMBER_ADD_AND_GET] Operation Timeout (with no response!): 0
com.hazelcast.core.OperationTimeoutException: [CONCURRENT_MAP_PUT] Redo threshold[90] exceeded! Last redo cause: REDO_MEMBER_UNKNOWN

After some time one of the servers becomes unavailable and needs a restart. I have also been looking on the Hazelcast forum for information, but as here, I have only found vague answers such as "migrate to version 2.5" without any explanation of what the problem is. We are afraid that migrating to a new version will cause other issues.

Actually, I must say that the stability of Hazelcast is questionable. We used version 1.9.4 for a very long time and didn't have any problems with it, but then we decided to use the WebFilter functionality to replicate HTTP sessions across the servers so that we could turn off stickiness on the load balancer. To do that we migrated to version 2.2, which turned out to have big problems during deployment. We found someone describing similar problems and recommending migration to 2.3.1, so we did, and now we have these problems.

Not to mention that when we tried to enable the JMX plugin for Hazelcast (2.3.1), after 2 days of operation we suddenly started to have serious problems: when one of the servers hit an OutOfMemoryError, all of the other servers in the cluster stopped working and didn't respond to any requests. This happened several times, each time the number of sessions in Hazelcast went over 100,000. You can see the failures on the Nagios diagram below:
On Thursday, April 11, 2013 10:40:50 PM UTC+2, Stock wrote: