Hazelcast connection problems causing a node to remain inactive?


Ian Clark

Feb 19, 2014, 8:51:03 AM
to haze...@googlegroups.com
We've been experiencing problems when a backend node seemingly has connection problems

[10.0.0.195]:5701 [importio-sanfran-sfjedi-new] hz._hzInstance_1_importio-sanfran-sfjedi-new.IO.thread-1 Closing socket to endpoint Address[10.0.0.83]:5701, Cause:java.io.EOFException

and thereafter the node reports problems, like:-

java.lang.IllegalStateException: Hazelcast Instance is not active!
    at com.hazelcast.impl.FactoryImpl.initialChecks(FactoryImpl.java:728) ~[na:na]
    at com.hazelcast.impl.QProxyImpl.ensure(QProxyImpl.java:65) ~[na:na]
    at com.hazelcast.impl.QProxyImpl.poll(QProxyImpl.java:175) ~[na:na]

We're changing our application so that it terminates when it gets into this state, but the node doesn't seem to be reconnecting (we're not 100% certain of this yet). One of the problems we see is that we were relying on the defaults for nearly all of the configuration properties, and those defaults seem to be quite large.


We are testing these properties to see if we get better behaviour:-

<property name="hazelcast.heartbeat.interval.seconds">1</property>
<property name="hazelcast.max.no.heartbeat.seconds">8</property>
<property name="hazelcast.wait.seconds.before.join">1</property>
<property name="hazelcast.operation.call.timeout.millis">5000</property>
<property name="hazelcast.member.list.publish.interval.seconds">8</property>
<property name="hazelcast.master.confirmation.interval.seconds">5</property>
<property name="hazelcast.max.no.master.confirmation.seconds">8</property>

along with this programmatic change:-

tcpIpConfig.setConnectionTimeoutSeconds(8);
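
For anyone doing this programmatically rather than via XML, the whole set of changes would look roughly like this (a sketch, not our exact code; the property names and values simply mirror the XML above):

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.TcpIpConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

// Sketch: the same tuning applied through the programmatic Config API.
Config config = new Config();
config.setProperty("hazelcast.heartbeat.interval.seconds", "1");
config.setProperty("hazelcast.max.no.heartbeat.seconds", "8");
config.setProperty("hazelcast.wait.seconds.before.join", "1");
config.setProperty("hazelcast.operation.call.timeout.millis", "5000");
config.setProperty("hazelcast.member.list.publish.interval.seconds", "8");
config.setProperty("hazelcast.master.confirmation.interval.seconds", "5");
config.setProperty("hazelcast.max.no.master.confirmation.seconds", "8");

TcpIpConfig tcpIpConfig = config.getNetworkConfig().getJoin().getTcpIpConfig();
tcpIpConfig.setConnectionTimeoutSeconds(8);

HazelcastInstance instance = Hazelcast.newHazelcastInstance(config);
```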

We're currently running Hazelcast 2.5.2 in production, but we're trying 3.1.5 to see if that helps with these problems.

Do these configuration properties seem more sensible, or are they too aggressive? Is there any advice we could read?

Any help really appreciated.

IC

Peter Veentjer

Feb 19, 2014, 9:03:42 AM
to haze...@googlegroups.com
I would try the latest 2.6.7 anyway if you want to stay on the 2.x branch, because it contains many bug fixes.

I have not verified all your settings, but they don't seem particularly strange.

Can you tell us a bit more about your network? Are there any reasons to assume that there could be connectivity problems between the members? Can you guarantee that the network has no problems?

I would like to get to the bottom of this, since these out-of-the-blue network connectivity problems are a bit of a pain.


--
You received this message because you are subscribed to the Google Groups "Hazelcast" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hazelcast+...@googlegroups.com.
To post to this group, send email to haze...@googlegroups.com.
Visit this group at http://groups.google.com/group/hazelcast.
To view this discussion on the web visit https://groups.google.com/d/msgid/hazelcast/fdd6a1a3-8f05-4b97-9d3d-09f9fd0b2ee1%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Ian Clark

Feb 19, 2014, 9:51:40 AM
to haze...@googlegroups.com
Hi Peter,

Thanks. We're running all these backend nodes on Amazon EC2 in our own VPC. We'd been running for a long time without issues, but over the last two weeks we've been having problems. We did release our software during this period, but our first problem occurred before then, so we believe the cause is either environmental or load-related. We've raised AWS support calls and they haven't found any problems at the times we reported.

IC

Peter Veentjer

Feb 19, 2014, 9:54:06 AM
to haze...@googlegroups.com
You are not the first EC2 user to run into unreliable network problems.

Anyhow, we should make sure that Hazelcast is able to deal with it. I need to talk with some of the guys about how to handle it; I'm not a networking expert.

Peter.


Ian Clark

Feb 19, 2014, 11:27:59 AM
to haze...@googlegroups.com
We are testing 3.1.5 as we said, and we are simulating network problems by bringing eth0 down and up with ifdown/ifup, but we get messages like this on the interrupted node (10.0.0.115):-

2014-02-19 16:24:31.801 [hz._hzInstance_1_importio-nexus-new-v3.operation.thread-3] WARN  c.h.s.i.WaitNotifyServiceImpl$WaitingOp - log - [10.0.0.115]:5701 [importio-nexus-new-v3] Op: com.hazelcast.queue.PollOperation@621649e3, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[10.0.0.115]:5701, target:Address[10.0.0.176]:5701, partitionId: 151, replicaIndex: 0, operation: com.hazelcast.spi.impl.WaitNotifyServiceImpl$WaitingOp, service: hz:impl:queueService

It's a WARN message, but it repeats continuously. Is something going wrong?

IC

Peter Veentjer

Feb 19, 2014, 11:42:35 AM
to haze...@googlegroups.com
It should continue. What happens here is that a request is sent to the wrong machine. This is detected, the operation is rejected, the rejection is caught by the sender of the operation, and the sender retries it on the correct member. This happens while partitions are moving around.
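
In toy form, the mechanism looks something like this (a sketch of the idea only, not the actual Hazelcast code; the addresses are just the ones from the log above):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of retry-on-WrongTarget: a sender with a stale partition table
// sends to the old owner, gets rejected, refreshes its table, and retries.
public class WrongTargetRetry {

    // The cluster's authoritative partition table: partitionId -> owner address.
    static final Map<Integer, String> clusterTable = new HashMap<>();

    // The sender's possibly stale copy of the table.
    static final Map<Integer, String> senderTable = new HashMap<>();

    // A member rejects operations for partitions it does not own
    // (this rejection is what surfaces as WrongTargetException).
    static boolean execute(String member, int partitionId) {
        return member.equals(clusterTable.get(partitionId));
    }

    // Send with retry: on rejection, refresh the local table and try again.
    static String invoke(int partitionId, int maxRetries) {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            String target = senderTable.get(partitionId);
            if (execute(target, partitionId)) {
                return "ok on " + target + " after " + (attempt + 1) + " attempt(s)";
            }
            // WrongTarget path: refresh the partition table and retry.
            senderTable.putAll(clusterTable);
        }
        return "gave up";
    }

    public static void main(String[] args) {
        clusterTable.put(151, "10.0.0.176"); // partition migrated here
        senderTable.put(151, "10.0.0.115");  // sender still thinks it owns it
        System.out.println(invoke(151, 3));
    }
}
```

The point is that one refresh should normally be enough; warnings repeating forever for the same partition suggest the refresh never converges.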


Ian Clark

Feb 19, 2014, 12:11:11 PM
to haze...@googlegroups.com
But continuously? It hasn't stopped since our artificial interruption...

2014-02-19 16:58:21.385 [hz._hzInstance_1_importio-nexus-new-v3.operation.thread-3] WARN  c.h.s.i.WaitNotifyServiceImpl$WaitingOp - log - [10.0.0.115]:5701 [importio-nexus-new-v3] Op: com.hazelcast.queue.PollOperation@574dde2b, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[10.0.0.115]:5701, target:Address[10.0.0.176]:5701, partitionId: 151, replicaIndex: 0, operation: com.hazelcast.spi.impl.WaitNotifyServiceImpl$WaitingOp, service: hz:impl:queueService
2014-02-19 16:58:21.386 [hz._hzInstance_1_importio-nexus-new-v3.operation.thread-3] WARN  c.h.s.i.WaitNotifyServiceImpl$WaitingOp - log - [10.0.0.115]:5701 [importio-nexus-new-v3] Op: com.hazelcast.queue.PollOperation@393505d4, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[10.0.0.115]:5701, target:Address[10.0.0.176]:5701, partitionId: 83, replicaIndex: 0, operation: com.hazelcast.spi.impl.WaitNotifyServiceImpl$WaitingOp, service: hz:impl:queueService
2014-02-19 16:58:22.385 [hz._hzInstance_1_importio-nexus-new-v3.operation.thread-3] WARN  c.h.s.i.WaitNotifyServiceImpl$WaitingOp - log - [10.0.0.115]:5701 [importio-nexus-new-v3] Op: com.hazelcast.queue.PollOperation@574dde2b, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[10.0.0.115]:5701, target:Address[10.0.0.176]:5701, partitionId: 151, replicaIndex: 0, operation: com.hazelcast.spi.impl.WaitNotifyServiceImpl$WaitingOp, service: hz:impl:queueService
2014-02-19 16:58:22.386 [hz._hzInstance_1_importio-nexus-new-v3.operation.thread-3] WARN  c.h.s.i.WaitNotifyServiceImpl$WaitingOp - log - [10.0.0.115]:5701 [importio-nexus-new-v3] Op: com.hazelcast.queue.PollOperation@393505d4, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[10.0.0.115]:5701, target:Address[10.0.0.176]:5701, partitionId: 83, replicaIndex: 0, operation: com.hazelcast.spi.impl.WaitNotifyServiceImpl$WaitingOp, service: hz:impl:queueService
2014-02-19 16:58:23.385 [hz._hzInstance_1_importio-nexus-new-v3.operation.thread-3] WARN  c.h.s.i.WaitNotifyServiceImpl$WaitingOp - log - [10.0.0.115]:5701 [importio-nexus-new-v3] Op: com.hazelcast.queue.PollOperation@574dde2b, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[10.0.0.115]:5701, target:Address[10.0.0.176]:5701, partitionId: 151, replicaIndex: 0, operation: com.hazelcast.spi.impl.WaitNotifyServiceImpl$WaitingOp, service: hz:impl:queueService
2014-02-19 16:58:23.386 [hz._hzInstance_1_importio-nexus-new-v3.operation.thread-3] WARN  c.h.s.i.WaitNotifyServiceImpl$WaitingOp - log - [10.0.0.115]:5701 [importio-nexus-new-v3] Op: com.hazelcast.queue.PollOperation@393505d4, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[10.0.0.115]:5701, target:Address[10.0.0.176]:5701, partitionId: 83, replicaIndex: 0, operation: com.hazelcast.spi.impl.WaitNotifyServiceImpl$WaitingOp, service: hz:impl:queueService
2014-02-19 16:58:24.386 [hz._hzInstance_1_importio-nexus-new-v3.operation.thread-3] WARN  c.h.s.i.WaitNotifyServiceImpl$WaitingOp - log - [10.0.0.115]:5701 [importio-nexus-new-v3] Op: com.hazelcast.queue.PollOperation@574dde2b, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[10.0.0.115]:5701, target:Address[10.0.0.176]:5701, partitionId: 151, replicaIndex: 0, operation: com.hazelcast.spi.impl.WaitNotifyServiceImpl$WaitingOp, service: hz:impl:queueService


Basically repeating every second... for the same two partitions, forever.

IC

Peter Veentjer

Feb 19, 2014, 12:17:34 PM
to haze...@googlegroups.com
Ok. That doesn't sound healthy.



Peter Veentjer

Feb 19, 2014, 12:18:26 PM
to haze...@googlegroups.com
Can you file a bug report here:

https://github.com/hazelcast/hazelcast/issues

We want to close all bugs for the 3.2 release.

Ian Clark

Feb 19, 2014, 12:33:48 PM
to haze...@googlegroups.com
Hi Peter,

Thanks for your help. I'll raise that bug now. We tested that node and it seems to behave OK on our platform, though that's not really scientific given we don't know what data is in those partitions... do you think we can ignore these messages and use this version in production?

IC

Peter Veentjer

Feb 19, 2014, 12:42:07 PM
to haze...@googlegroups.com
It depends. Have you verified that your system is making progress?

What happens is that a queue operation isn't able to wait; the operation that is rejected is a signal from e.g. a queue taker that it wants to take something from an empty queue and therefore needs to wait.

The big questions are:
- is the sender confused, and does it keep sending to the wrong target because it doesn't have a correct partition overview?
- is the receiver confused, and does it keep rejecting because it thinks it doesn't own the partition?
- is something else going wrong, unrelated to the first two, that keeps offering a wrongly targeted operation?
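
One way to narrow down which side is confused would be to log each member's view of partition ownership and compare them, e.g. (a sketch against the 3.x API; the two hard-coded partition ids are the ones from your logs):

```java
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.Partition;

// Sketch: print this member's view of who owns the stuck partitions.
// Run on both the sender and the suspected owner and compare the output.
void logPartitionOwners(HazelcastInstance hz) {
    for (Partition partition : hz.getPartitionService().getPartitions()) {
        int id = partition.getPartitionId();
        if (id == 151 || id == 83) {
            System.out.println("partition " + id + " owner: " + partition.getOwner());
        }
    }
}
```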


Ian Clark

Feb 19, 2014, 12:49:32 PM
to haze...@googlegroups.com



Ian Clark

Feb 19, 2014, 1:45:26 PM
to haze...@googlegroups.com
Hi Peter,

Yes, sorry, you are right; I wasn't very scientific. We got rid of the errors by slowly restarting the nodes: one partition message (partitionId: 151) disappeared when one of the other nodes was restarted, and the final partition message (partitionId: 83) stopped after we restarted the node we had originally interrupted (with ifdown/ifup). That suggests to me that the nodes don't have healthy views of the partition table.

IC