Split Brain problem in Hazelcast cluster nodes

Aman Jain

25 Aug 2016, 18:21:17

to Hazelcast

Hi,

I have a two-node Hazelcast cluster running on two different machines. Sometimes we have a network issue (lasting about 2 hours) between the two nodes. During that time, we see the following exception in our application logs:

com.hazelcast.core.HazelcastInstanceNotActiveException: Hazelcast instance is not active!
	at com.hazelcast.spi.impl.proxyservice.impl.ProxyRegistry.getService(ProxyRegistry.java:65)
	at com.hazelcast.spi.impl.proxyservice.impl.ProxyRegistry.<init>(ProxyRegistry.java:53)
	at com.hazelcast.spi.impl.proxyservice.impl.ProxyServiceImpl$1.createNew(ProxyServiceImpl.java:74)
	at com.hazelcast.spi.impl.proxyservice.impl.ProxyServiceImpl$1.createNew(ProxyServiceImpl.java:72)
	at com.hazelcast.util.ConcurrencyUtil.getOrPutIfAbsent(ConcurrencyUtil.java:51)

I have a couple of questions:

1. After how much time does a cluster node remove another node from its member list because it is not reachable (due to a network issue)? I believe there must be a heartbeat-check property for each member node.
2. The above error is coming from the Hazelcast nodes (not a client node), which seems unusual. Can it happen when node2 is still present in node1's member list and, since node2 is not reachable (due to a network issue), operations fail with the above exception?

Thanks,

Aman

ih...@hazelcast.com

26 Aug 2016, 05:57:51

to Hazelcast

Hello,

1. See the system properties page: http://docs.hazelcast.org/docs/3.7/manual/html-single/index.html#system-properties

hazelcast.max.no.heartbeat.seconds (int, default 300): Maximum time in seconds without a heartbeat before a member is assumed to be dead.

hazelcast.heartbeat.interval.seconds (int, default 1): Heartbeat send interval in seconds.
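
As a rough illustration of how these two properties might be tuned, here is a minimal sketch that sets them as JVM system properties before the member starts. The class name `HeartbeatSettings`, the helper `heartbeatProperties`, and the values 60 and 1 are assumptions made for the example, not recommendations; only the two property names come from the page above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class HeartbeatSettings {

    // Property names as quoted from the system properties page in this thread.
    static final String MAX_NO_HEARTBEAT = "hazelcast.max.no.heartbeat.seconds";
    static final String HEARTBEAT_INTERVAL = "hazelcast.heartbeat.interval.seconds";

    // Hypothetical helper: builds the two settings and rejects combinations
    // where the timeout is not longer than a positive heartbeat interval.
    static Map<String, String> heartbeatProperties(int maxNoHeartbeatSeconds, int intervalSeconds) {
        if (intervalSeconds <= 0 || maxNoHeartbeatSeconds <= intervalSeconds) {
            throw new IllegalArgumentException("timeout must exceed a positive heartbeat interval");
        }
        Map<String, String> props = new LinkedHashMap<>();
        props.put(MAX_NO_HEARTBEAT, Integer.toString(maxNoHeartbeatSeconds));
        props.put(HEARTBEAT_INTERVAL, Integer.toString(intervalSeconds));
        return props;
    }

    public static void main(String[] args) {
        // Example values only (60 s timeout, 1 s interval); tune for your network.
        heartbeatProperties(60, 1).forEach(System::setProperty);
        // Start the Hazelcast member after this point so that it sees the values.
        System.out.println(MAX_NO_HEARTBEAT + "=" + System.getProperty(MAX_NO_HEARTBEAT));
    }
}
```

The same values could instead be passed as `-D` flags on the JVM command line. A shorter timeout removes an unreachable member sooner, but it also makes the cluster more sensitive to brief network hiccups.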

Aman Jain

26 Aug 2016, 10:46:49

to haze...@googlegroups.com

Thank you for your response!

So this means that, by default, it takes around 5 minutes (300 seconds) for a member node to be considered completely dead.

Could you please comment on my second question? "The error is coming from the Hazelcast nodes (not a client node), which seems unusual. Can it happen when node2 is still present in node1's member list and, since node2 is not reachable (due to a network issue), operations fail with the above exception?"

Thanks,

Aman

--
You received this message because you are subscribed to a topic in the Google Groups "Hazelcast" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/hazelcast/oxT0PvGeieI/unsubscribe.
To unsubscribe from this group and all its topics, send an email to hazelcast+unsubscribe@googlegroups.com.
To post to this group, send email to haze...@googlegroups.com.
Visit this group at https://groups.google.com/group/hazelcast.
To view this discussion on the web visit https://groups.google.com/d/msgid/hazelcast/1759f9a1-7d30-4553-a089-a350e6d92152%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

ih...@hazelcast.com

26 Aug 2016, 14:30:10

to Hazelcast

Aman,

As you described, if the failure is detected through missed heartbeats, the member will probably stay in the cluster's member list until it is marked as dead, and during that window what you describe may be possible. Keep in mind that partitions are distributed across the available nodes, and if the node that is the primary owner of a partition cannot be reached, such exceptions may occur.

ihsan
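
To make that window concrete, here is a toy model of heartbeat-timeout membership. It is a sketch for illustration only and not Hazelcast's actual implementation; the class `HeartbeatTimeoutSketch` and its methods are hypothetical, and time is modelled as plain seconds.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HeartbeatTimeoutSketch {
    private final long maxNoHeartbeatSeconds;
    // Member name -> time (in seconds) of the last heartbeat received from it.
    private final Map<String, Long> lastHeartbeat = new LinkedHashMap<>();

    HeartbeatTimeoutSketch(long maxNoHeartbeatSeconds) {
        this.maxNoHeartbeatSeconds = maxNoHeartbeatSeconds;
    }

    // Record a heartbeat received from the given member.
    void onHeartbeat(String member, long nowSeconds) {
        lastHeartbeat.put(member, nowSeconds);
    }

    // Members still listed at the given time. A silent member is dropped only
    // once the timeout has elapsed; until then it stays in the list even
    // though it may already be unreachable.
    List<String> memberList(long nowSeconds) {
        lastHeartbeat.values().removeIf(last -> nowSeconds - last > maxNoHeartbeatSeconds);
        return new ArrayList<>(lastHeartbeat.keySet());
    }

    public static void main(String[] args) {
        HeartbeatTimeoutSketch node1View = new HeartbeatTimeoutSketch(300);
        node1View.onHeartbeat("node2", 0);              // last heartbeat before the network issue
        System.out.println(node1View.memberList(120));  // prints [node2]
        System.out.println(node1View.memberList(301));  // prints []
    }
}
```

In this model, between second 0 and second 300 node1 still lists node2, so operations routed to partitions owned by node2 during that period would target a node that cannot answer.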



Aman Jain

30 Aug 2016, 01:51:12

to Hazelcast

Hi,

The Hazelcast server logs show the following. Does it provide any more data points on the root cause of this issue?

2016-08-07 20:03:11 com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[EGEDIAPP01]:5701, target:null, partitionId: 228, replicaIndex: 0, operation: com.hazelcast.map.impl.operation.SetOperation, service: hz:impl:mapService
2016-08-07 20:03:11 	at com.hazelcast.spi.impl.operationservice.impl.Invocation.initInvocationTarget(Invocation.java:288)
2016-08-07 20:03:11 	at com.hazelcast.spi.impl.operationservice.impl.Invocation.doInvoke(Invocation.java:222)
2016-08-07 20:03:11 	at com.hazelcast.spi.impl.operationservice.impl.Invocation.run(Invocation.java:262)
2016-08-07 20:03:11 	at com.hazelcast.spi.impl.operationservice.impl.PartitionInvocation.run(PartitionInvocation.java:28)

-Aman
