Hi,
We are running on Hazelcast 3.2 and are still struggling to get our production load tests to pass.
Our current test setup:
- 3 servers, each running both
a) Tomcat (as Hazelcast clients)
b) Jetty (as Hazelcast members)
We utilize Hazelcast quite heavily (maps with listeners, locks, atomic numbers, queues, topics, etc.), but have recently started migrating data out into other services as we can't seem to keep Hazelcast stable.
We're running rolling upgrade tests for the Tomcat servers; the Jetty servers are kept running the whole time. After a few rounds we might see something like this in the member logs:
lekane/jetty/logs/jetty_wrapper_20140407001810.log:INFO | jvm 1 | 2014/04/07 10:01:37 | W|2014-04-07 10:01:37,278|[hz._hzInstance_1_dev.cached.thread-1]|[Slf4jFactory.java:90] [10.100.0.12]:5701 [dev] [3.2] Owner of partition is being removed! Possible data loss for partition[265]. PartitionReplicaChangeEvent{partitionId=265, replicaIndex=0, oldAddress=Address[10.100.0.12]:5701, newAddress=null}
lekane/jetty/logs/jetty_wrapper_20140407001810.log:INFO | jvm 1 | 2014/04/07 10:01:37 | W|2014-04-07 10:01:37,278|[hz._hzInstance_1_dev.cached.thread-1]|[Slf4jFactory.java:90] [10.100.0.12]:5701 [dev] [3.2] Owner of partition is being removed! Possible data loss for partition[266]. PartitionReplicaChangeEvent{partitionId=266, replicaIndex=0, oldAddress=Address[10.100.0.12]:5701, newAddress=null}
lekane/jetty/logs/jetty_wrapper_20140407001810.log:INFO | jvm 1 | 2014/04/07 10:01:37 | W|2014-04-07 10:01:37,278|[hz._hzInstance_1_dev.cached.thread-1]|[Slf4jFactory.java:90] [10.100.0.12]:5701 [dev] [3.2] Owner of partition is being removed! Possible data loss for partition[267]. PartitionReplicaChangeEvent{partitionId=267, replicaIndex=0, oldAddress=Address[10.100.0.12]:5701, newAddress=null}
Basically it seems we lose some or all of our data without any change to the running Hazelcast members. What could cause this?
We also frequently get the cluster stuck in a state where membership data seems to be corrupted after migration: errors like the following keep looping for hours and never seem to resolve:
INFO | jvm 1 | 2014/04/01 08:30:51 | W|2014-04-01 08:30:51,292|[hz._hzInstance_1_dev.operation.thread-10]|[Slf4jFactory.java:85] [10.100.0.11]:5701 [dev] Op: com.hazelcast.concurrent.lock.LockOperation@2c3f17b9, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[10.100.0.11]:5701, target:Address[10.100.0.12]:5701, partitionId: 110, replicaIndex: 0, operation: com.hazelcast.spi.impl.WaitNotifyServiceImpl$WaitingOp, service: hz:impl:lockService
INFO | jvm 1 | 2014/04/01 08:30:51 | W|2014-04-01 08:30:51,292|[hz._hzInstance_1_dev.operation.thread-30]|[Slf4jFactory.java:85] [10.100.0.11]:5701 [dev] Op: com.hazelcast.queue.PollOperation@52bcc4a9, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[10.100.0.11]:5701, target:Address[10.100.0.13]:5701, partitionId: 230, replicaIndex: 0, operation: com.hazelcast.spi.impl.WaitNotifyServiceImpl$WaitingOp, service: hz:impl:queueService
I'm just wondering whether we are pushing too much load onto Hazelcast (maybe up to 100-300 req/s per node, with CPU load rarely going over 35%) or whether there is something fundamentally wrong with our tests or setup.
Best regards,
Timo
--
You received this message because you are subscribed to the Google Groups "Hazelcast" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hazelcast+...@googlegroups.com.
To post to this group, send email to haze...@googlegroups.com.
Visit this group at http://groups.google.com/group/hazelcast.
To view this discussion on the web visit https://groups.google.com/d/msgid/hazelcast/8b1be67a-0084-4399-bd13-d11faa9b07e3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
<property name="hazelcast.icmp.enabled">true</property>
<property name="hazelcast.icmp.timeout">10000</property>
<property name="hazelcast.max.no.heartbeat.seconds">30</property>
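For reference, these properties live in the <properties> section of our hazelcast.xml. A trimmed sketch (everything else in our config is omitted, and the comments are just my understanding of what each property does):

```xml
<hazelcast xmlns="http://www.hazelcast.com/schema/config">
  <properties>
    <!-- Also ping members over ICMP, in addition to the normal heartbeats -->
    <property name="hazelcast.icmp.enabled">true</property>
    <!-- ICMP ping timeout, in milliseconds -->
    <property name="hazelcast.icmp.timeout">10000</property>
    <!-- Consider a member dead after 30 seconds without a heartbeat -->
    <property name="hazelcast.max.no.heartbeat.seconds">30</property>
  </properties>
</hazelcast>
```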
We are running the tests in AWS VPC on 3 x c3.xlarge (4 core) boxes with 7GB memory.
Just from a recent run... things look fairly normal (?) initially:
INFO | jvm 1 | 2014/04/07 17:38:48 | I|2014-04-07 17:38:48,573|[hz._hzInstance_1_dev.HealthMonitor]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] memory.used=52.6M, memory.free=73.3M, memory.total=125.9M, memory.max=1.9G, memory.used/total=41.80%, memory.used/max=2.64%, load.process=-1.00%, load.system=-1.00%, load.systemAverage=109.00%, thread.count=116, thread.peakCount=117, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.operation.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operation.size=0, executor.q.priorityOperation.size=0, executor.q.response.size=0, operations.remote.size=0
But then we start seeing stuff like:
INFO | jvm 1 | 2014/04/07 17:38:59 | W|2014-04-07 17:38:59,633|[hz._hzInstance_1_dev.client.thread-38]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] com.hazelcast.spi.exception.CallerNotMemberException: Not Member! caller:Address[10.100.0.11]:5701, partitionId: -1, operation: com.hazelcast.spi.impl.PartitionIteratingOperation, service: hz:impl:mapService
...
INFO | jvm 1 | 2014/04/07 17:39:47 | W|2014-04-07 17:39:47,566|[hz._hzInstance_1_dev.response]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Retrying invocation: BasicInvocation{ serviceName='hz:impl:mapService', op=com.hazelcast.map.operation.MapSizeOperation@7cbba534, partitionId=35, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=100, callTimeout=60000, target=Address[10.100.0.12]:5701}, Reason: com.hazelcast.spi.exception.CallerNotMemberException: Not Member! caller:Address[10.100.0.11]:5701, partitionId: 35, operation: com.hazelcast.map.operation.MapSizeOperation, service: hz:impl:mapService
INFO | jvm 1 | 2014/04/07 17:39:47 | W|2014-04-07 17:39:47,567|[hz._hzInstance_1_dev.response]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Retrying invocation: BasicInvocation{ serviceName='hz:impl:mapService', op=com.hazelcast.map.operation.MapSizeOperation@35b38e6, partitionId=131, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=100, callTimeout=60000, target=Address[10.100.0.12]:5701}, Reason: com.hazelcast.spi.exception.CallerNotMemberException: Not Member! caller:Address[10.100.0.11]:5701, partitionId: 131, operation: com.hazelcast.map.operation.MapSizeOperation, service: hz:impl:mapService
....
INFO | jvm 1 | 2014/04/07 17:40:35 | W|2014-04-07 17:40:35,060|[hz._hzInstance_1_dev.global-operation.thread-1]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] New join request has been received from an existing endpoint! => Member [10.100.0.12]:5701 Removing old member and processing join request...
INFO | jvm 1 | 2014/04/07 17:40:35 | I|2014-04-07 17:40:35,061|[hz._hzInstance_1_dev.global-operation.thread-1]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Removing Member [10.100.0.12]:5701
INFO | jvm 1 | 2014/04/07 17:40:35 | I|2014-04-07 17:40:35,066|[hz._hzInstance_1_dev.global-operation.thread-1]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2]
INFO | jvm 1 | 2014/04/07 17:40:35 |
INFO | jvm 1 | 2014/04/07 17:40:35 | Members [2] {
INFO | jvm 1 | 2014/04/07 17:40:35 | Member [10.100.0.11]:5701 this
INFO | jvm 1 | 2014/04/07 17:40:35 | Member [10.100.0.13]:5701
INFO | jvm 1 | 2014/04/07 17:40:35 | }
INFO | jvm 1 | 2014/04/07 17:40:35 |
INFO | jvm 1 | 2014/04/07 17:40:36 | I|2014-04-07 17:40:36,583|[hz._hzInstance_1_dev.migration]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Partition balance is ok, no need to re-partition cluster data...
INFO | jvm 1 | 2014/04/07 17:40:39 | W|2014-04-07 17:40:39,326|[hz._hzInstance_1_dev.async.thread-11]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Retrying invocation: BasicInvocation{ serviceName='hz:impl:mapService', op=com.hazelcast.spi.impl.EventServiceImpl$RegistrationOperation@48d77123, partitionId=-1, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=200, callTimeout=60000, target=Address[10.100.0.12]:5701}, Reason: com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target:Address[10.100.0.12]:5701, partitionId: -1, operation: com.hazelcast.spi.impl.EventServiceImpl$RegistrationOperation, service: hz:impl:mapService
INFO | jvm 1 | 2014/04/07 17:40:39 | W|2014-04-07 17:40:39,326|[hz._hzInstance_1_dev.async.thread-18]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Retrying invocation: BasicInvocation{ serviceName='hz:impl:mapService', op=com.hazelcast.spi.impl.EventServiceImpl$RegistrationOperation@c4dd10c, partitionId=-1, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=200, callTimeout=60000, target=Address[10.100.0.12]:5701}, Reason: com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target:Address[10.100.0.12]:5701, partitionId: -1, operation: com.hazelcast.spi.impl.EventServiceImpl$RegistrationOperation, service: hz:impl:mapService
INFO | jvm 1 | 2014/04/07 17:40:41 | I|2014-04-07 17:40:41,061|[hz._hzInstance_1_dev.global-operation.thread-2]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2]
INFO | jvm 1 | 2014/04/07 17:40:41 |
INFO | jvm 1 | 2014/04/07 17:40:41 | Members [3] {
INFO | jvm 1 | 2014/04/07 17:40:41 | Member [10.100.0.11]:5701 this
INFO | jvm 1 | 2014/04/07 17:40:41 | Member [10.100.0.13]:5701
INFO | jvm 1 | 2014/04/07 17:40:41 | Member [10.100.0.12]:5701
INFO | jvm 1 | 2014/04/07 17:40:41 | }
INFO | jvm 1 | 2014/04/07 17:40:41 |
INFO | jvm 1 | 2014/04/07 17:40:41 | I|2014-04-07 17:40:41,337|[hz._hzInstance_1_dev.migration]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Re-partitioning cluster data... Migration queue size: 90
INFO | jvm 1 | 2014/04/07 17:40:42 | I|2014-04-07 17:40:42,690|[hz._hzInstance_1_dev.migration]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] All migration tasks has been completed, queues are empty.
....
INFO | jvm 1 | 2014/04/07 17:41:00 | I|2014-04-07 17:41:00,573|[hz._hzInstance_1_dev.operation.thread-2]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Backup partition version is not matching version of the owner -> 101 -vs- 103
INFO | jvm 1 | 2014/04/07 17:41:00 | I|2014-04-07 17:41:00,573|[hz._hzInstance_1_dev.operation.thread-3]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Backup partition version is not matching version of the owner -> 135 -vs- 137
INFO | jvm 1 | 2014/04/07 17:41:00 | I|2014-04-07 17:41:00,573|[hz._hzInstance_1_dev.operation.thread-6]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Backup partition version is not matching version of the owner -> 161 -vs- 162
INFO | jvm 1 | 2014/04/07 17:41:00 | I|2014-04-07 17:41:00,573|[hz._hzInstance_1_dev.operation.thread-3]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Backup partition version is not matching version of the owner -> 152 -vs- 154
INFO | jvm 1 | 2014/04/07 17:41:00 | I|2014-04-07 17:41:00,574|[hz._hzInstance_1_dev.operation.thread-1]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Backup partition version is not matching version of the owner -> 112 -vs- 114
INFO | jvm 1 | 2014/04/07 17:41:00 | I|2014-04-07 17:41:00,574|[hz._hzInstance_1_dev.operation.thread-4]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Backup partition version is not matching version of the owner -> 81 -vs- 83
..
INFO | jvm 1 | 2014/04/07 17:52:59 | W|2014-04-07 17:52:59,884|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Address[10.100.0.11]:5701 couldn't ping Address[10.100.0.12]:5701
INFO | jvm 1 | 2014/04/07 17:52:59 | W|2014-04-07 17:52:59,884|[hz._hzInstance_1_dev.cached.thread-1]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Address[10.100.0.11]:5701 couldn't ping Address[10.100.0.13]:5701
INFO | jvm 1 | 2014/04/07 17:52:59 | I|2014-04-07 17:52:59,884|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Connection [Address[10.100.0.12]:5701] lost. Reason: Socket explicitly closed
INFO | jvm 1 | 2014/04/07 17:52:59 | I|2014-04-07 17:52:59,884|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Removing Member [10.100.0.12]:5701
INFO | jvm 1 | 2014/04/07 17:52:59 | I|2014-04-07 17:52:59,888|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2]
INFO | jvm 1 | 2014/04/07 17:52:59 |
INFO | jvm 1 | 2014/04/07 17:52:59 | Members [2] {
INFO | jvm 1 | 2014/04/07 17:52:59 | Member [10.100.0.11]:5701 this
INFO | jvm 1 | 2014/04/07 17:52:59 | Member [10.100.0.13]:5701
INFO | jvm 1 | 2014/04/07 17:52:59 | }
INFO | jvm 1 | 2014/04/07 17:52:59 |
INFO | jvm 1 | 2014/04/07 17:52:59 | I|2014-04-07 17:52:59,889|[hz._hzInstance_1_dev.cached.thread-1]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Connection [Address[10.100.0.13]:5701] lost. Reason: Socket explicitly closed
INFO | jvm 1 | 2014/04/07 17:52:59 | I|2014-04-07 17:52:59,889|[hz._hzInstance_1_dev.cached.thread-1]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Removing Member [10.100.0.13]:5701
INFO | jvm 1 | 2014/04/07 17:52:59 | I|2014-04-07 17:52:59,891|[hz._hzInstance_1_dev.cached.thread-1]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2]
INFO | jvm 1 | 2014/04/07 17:52:59 |
INFO | jvm 1 | 2014/04/07 17:52:59 | Members [1] {
INFO | jvm 1 | 2014/04/07 17:52:59 | Member [10.100.0.11]:5701 this
INFO | jvm 1 | 2014/04/07 17:52:59 | }
INFO | jvm 1 | 2014/04/07 17:52:59 |
INFO | jvm 1 | 2014/04/07 17:52:59 | I|2014-04-07 17:52:59,894|[hz._hzInstance_1_dev.IO.thread-Acceptor]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Accepting socket connection from /10.100.0.13:54577
INFO | jvm 1 | 2014/04/07 17:52:59 | I|2014-04-07 17:52:59,895|[hz._hzInstance_1_dev.IO.thread-Acceptor]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] 5701 accepted socket connection from /10.100.0.13:54577
INFO | jvm 1 | 2014/04/07 17:53:00 | I|2014-04-07 17:53:00,254|[hz._hzInstance_1_dev.IO.thread-Acceptor]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Accepting socket connection from /10.100.0.12:56674
INFO | jvm 1 | 2014/04/07 17:53:00 | I|2014-04-07 17:53:00,255|[hz._hzInstance_1_dev.IO.thread-Acceptor]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] 5701 accepted socket connection from /10.100.0.12:56674
INFO | jvm 1 | 2014/04/07 17:53:00 | W|2014-04-07 17:53:00,647|[hz._hzInstance_1_dev.global-operation.thread-4]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] MasterConfirmation has been received from Address[10.100.0.12]:5701, but it is not a member of this cluster!
INFO | jvm 1 | 2014/04/07 17:53:00 | I|2014-04-07 17:53:00,648|[hz._hzInstance_1_dev.IO.thread-in-0]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Connection [Address[10.100.0.12]:5701] lost. Reason: java.io.EOFException[Remote socket closed!]
INFO | jvm 1 | 2014/04/07 17:53:00 | W|2014-04-07 17:53:00,648|[hz._hzInstance_1_dev.IO.thread-in-0]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] hz._hzInstance_1_dev.IO.thread-in-0 Closing socket to endpoint Address[10.100.0.12]:5701, Cause:java.io.EOFException: Remote socket closed!
INFO | jvm 1 | 2014/04/07 17:53:00 | Exception in thread "hz._hzInstance_1_dev.cached.thread-1" java.lang.NullPointerException
INFO | jvm 1 | 2014/04/07 17:53:00 | at com.hazelcast.map.client.AbstractMapAddEntryListenerRequest$1.handleEvent(AbstractMapAddEntryListenerRequest.java:77)
INFO | jvm 1 | 2014/04/07 17:53:00 | at com.hazelcast.map.client.AbstractMapAddEntryListenerRequest$1.entryAdded(AbstractMapAddEntryListenerRequest.java:83)
INFO | jvm 1 | 2014/04/07 17:53:00 | at com.hazelcast.map.MapService.dispatchEvent(MapService.java:873)
INFO | jvm 1 | 2014/04/07 17:53:00 | at com.hazelcast.map.MapService.dispatchEvent(MapService.java:75)
INFO | jvm 1 | 2014/04/07 17:53:00 | at com.hazelcast.spi.impl.EventServiceImpl$EventPacketProcessor.process(EventServiceImpl.java:508)
INFO | jvm 1 | 2014/04/07 17:53:00 | at com.hazelcast.spi.impl.EventServiceImpl$RemoteEventPacketProcessor.run(EventServiceImpl.java:534)
INFO | jvm 1 | 2014/04/07 17:53:00 | at com.hazelcast.util.executor.StripedExecutor$Worker.run(StripedExecutor.java:142)
INFO | jvm 1 | 2014/04/07 17:53:00 | at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
INFO | jvm 1 | 2014/04/07 17:53:00 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
INFO | jvm 1 | 2014/04/07 17:53:00 | at java.lang.Thread.run(Thread.java:662)
INFO | jvm 1 | 2014/04/07 17:53:01 | I|2014-04-07 17:53:01,391|[hz._hzInstance_1_dev.migration]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Partition balance is ok, no need to re-partition cluster data...
INFO | jvm 1 | 2014/04/07 17:53:06 | Exception in thread "hz._hzInstance_1_dev.cached.thread-5" java.lang.NullPointerException
INFO | jvm 1 | 2014/04/07 17:53:06 | at com.hazelcast.map.client.AbstractMapAddEntryListenerRequest$1.handleEvent(AbstractMapAddEntryListenerRequest.java:77)
INFO | jvm 1 | 2014/04/07 17:53:06 | at com.hazelcast.map.client.AbstractMapAddEntryListenerRequest$1.entryEvicted(AbstractMapAddEntryListenerRequest.java:95)
INFO | jvm 1 | 2014/04/07 17:53:06 | at com.hazelcast.map.MapService.dispatchEvent(MapService.java:876)
INFO | jvm 1 | 2014/04/07 17:53:06 | at com.hazelcast.map.MapService.dispatchEvent(MapService.java:75)
INFO | jvm 1 | 2014/04/07 17:53:06 | at com.hazelcast.spi.impl.EventServiceImpl$EventPacketProcessor.process(EventServiceImpl.java:508)
INFO | jvm 1 | 2014/04/07 17:53:06 | at com.hazelcast.spi.impl.EventServiceImpl$RemoteEventPacketProcessor.run(EventServiceImpl.java:534)
INFO | jvm 1 | 2014/04/07 17:53:06 | at com.hazelcast.util.executor.StripedExecutor$Worker.run(StripedExecutor.java:142)
INFO | jvm 1 | 2014/04/07 17:53:06 | at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
INFO | jvm 1 | 2014/04/07 17:53:06 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
INFO | jvm 1 | 2014/04/07 17:53:06 | at java.lang.Thread.run(Thread.java:662)
INFO | jvm 1 | 2014/04/07 17:53:06 | at com.hazelcast.util.executor.PoolExecutorThreadFactory$ManagedThread.run(PoolExecutorThreadFactory.java:59)
INFO | jvm 1 | 2014/04/07 17:53:09 | I|2014-04-07 17:53:09,886|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Re-sending sync replica request for partition: 33, replica: 1
INFO | jvm 1 | 2014/04/07 17:53:09 | I|2014-04-07 17:53:09,886|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] This node is now owner of partition, cannot sync replica -> partitionId: 33, replicaIndex: 1, partition-info: Partition [33]{
INFO | jvm 1 | 2014/04/07 17:53:09 | 0:Address[10.100.0.11]:5701
INFO | jvm 1 | 2014/04/07 17:53:09 | }
INFO | jvm 1 | 2014/04/07 17:53:09 | I|2014-04-07 17:53:09,886|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Re-sending sync replica request for partition: 35, replica: 1
INFO | jvm 1 | 2014/04/07 17:53:09 | I|2014-04-07 17:53:09,886|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] This node is now owner of partition, cannot sync replica -> partitionId: 35, replicaIndex: 1, partition-info: Partition [35]{
INFO | jvm 1 | 2014/04/07 17:53:09 | 0:Address[10.100.0.11]:5701
INFO | jvm 1 | 2014/04/07 17:53:09 | }
INFO | jvm 1 | 2014/04/07 17:53:09 | I|2014-04-07 17:53:09,886|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Re-sending sync replica request for partition: 36, replica: 1
INFO | jvm 1 | 2014/04/07 17:53:09 | I|2014-04-07 17:53:09,886|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] This node is now owner of partition, cannot sync replica -> partitionId: 36, replicaIndex: 1, partition-info: Partition [36]{
INFO | jvm 1 | 2014/04/07 17:53:09 | 0:Address[10.100.0.11]:5701
INFO | jvm 1 | 2014/04/07 17:53:09 | }
---
INFO | jvm 1 | 2014/04/07 17:53:11 | W|2014-04-07 17:53:11,320|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] No registration found for hz:impl:mapService / 49b783fc-7458-448c-b49c-c89f5dfe06ce
INFO | jvm 1 | 2014/04/07 17:53:12 | W|2014-04-07 17:53:12,567|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] No registration found for hz:impl:mapService / 49b783fc-7458-448c-b49c-c89f5dfe06ce
..
INFO | jvm 1 | 2014/04/07 17:53:21 | W|2014-04-07 17:53:21,916|[hz._hzInstance_1_dev.global-operation.thread-4]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] MasterConfirmation has been received from Address[10.100.0.13]:5701, but it is not a member of this cluster!
INFO | jvm 1 | 2014/04/07 17:53:21 | I|2014-04-07 17:53:21,917|[hz._hzInstance_1_dev.IO.thread-in-2]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Connection [Address[10.100.0.13]:5701] lost. Reason: java.io.EOFException[Remote socket closed!]
...
INFO | jvm 1 | 2014/04/07 17:53:40 | I|2014-04-07 17:53:40,311|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] 51919 accepted socket connection from /10.100.0.13:5701
INFO | jvm 1 | 2014/04/07 17:53:41 | I|2014-04-07 17:53:41,812|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Address[10.100.0.11]:5701 is merging to Address[10.100.0.13]:5701, because : joinRequest.getMemberCount() > currentMemberCount [2 > 1]
INFO | jvm 1 | 2014/04/07 17:53:41 | W|2014-04-07 17:53:41,813|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Address[10.100.0.11]:5701 is merging [tcp/ip] to Address[10.100.0.13]:5701
INFO | jvm 1 | 2014/04/07 17:53:41 | W|2014-04-07 17:53:41,813|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Preparing to merge... Waiting for merge instruction...
INFO | jvm 1 | 2014/04/07 17:53:41 | W|2014-04-07 17:53:41,814|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Address[10.100.0.11]:5701 is merging to Address[10.100.0.13]:5701, because: instructed by master Address[10.100.0.11]:5701
INFO | jvm 1 | 2014/04/07 17:53:41 | I|2014-04-07 17:53:41,815|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Address[10.100.0.11]:5701 is MERGING
INFO | jvm 1 | 2014/04/07 17:53:41 | W|2014-04-07 17:53:41,902|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Owner of partition is being removed! Possible data loss for partition[0]. PartitionReplicaChangeEvent{partitionId=0, replicaIndex=0, oldAddress=Address[10.100.0.11]:5701, newAddress=null}
INFO | jvm 1 | 2014/04/07 17:53:41 | W|2014-04-07 17:53:41,902|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Owner of partition is being removed! Possible data loss for partition[1]. PartitionReplicaChangeEvent{partitionId=1, replicaIndex=0, oldAddress=Address[10.100.0.11]:5701, newAddress=null}
INFO | jvm 1 | 2014/04/07 17:53:41 | W|2014-04-07 17:53:41,902|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Owner of partition is being removed! Possible data loss for partition[2]. PartitionReplicaChangeEvent{partitionId=2, replicaIndex=0, oldAddress=Address[10.100.0.11]:5701, newAddress=null}
INFO | jvm 1 | 2014/04/07 17:53:41 | W|2014-04-07 17:53:41,903|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Owner of partition is being removed! Possible data loss for partition[3]. PartitionReplicaChangeEvent{partitionId=3, replicaIndex=0, oldAddress=Address[10.100.0.11]:5701, newAddress=null}
....
INFO | jvm 1 | 2014/04/07 18:15:03 | W|2014-04-07 18:15:03,041|[hz._hzInstance_1_dev.cached.thread-4]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Master node has timed out its heartbeat and will be removed
INFO | jvm 1 | 2014/04/07 18:15:04 | W|2014-04-07 18:15:04,041|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Master node has timed out its heartbeat and will be removed
I did run pings between the nodes (from the shell) at the time and didn't spot any network issues, so it would seem that the intra-member checks somehow get blocked or delayed?
I can share full logs from all nodes for some test runs if that would be helpful. Are there any AWS-specific tweaks that might help?
Best regards,
Timo
Hi Timo,
One thing we found that improves cluster stability is to use the G1GC garbage collector. This eliminated long GC pauses for us, which were occasionally causing problems. Our launch script includes these command line options (which also enable GC logging so we could see how much time was being taken by GC):
-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -verbose:gc
Note that this didn't completely cure our cluster stability problems, but it did seem to help. If you're not using G1GC, you might give it a try.
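To give a concrete picture, a simplified version of the relevant part of our launch command looks roughly like this (heap sizes, log path and jar name here are placeholders, not our actual values):

```shell
# Illustrative launch line: G1GC plus GC logging to a file
java -Xms2g -Xmx2g \
     -XX:+UseG1GC \
     -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -verbose:gc \
     -Xloggc:/var/log/app/gc.log \
     -jar app.jar
```

With the GC log in place you can check whether any pause approaches your heartbeat timeout, which is what would make other members declare the node dead.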
--
Joe
> ...
On Apr 7, 2014, at 9:49, Timo Ahokas <timo....@gmail.com> wrote:
> Hi Chris,
>
> Thanks - really appreciate any help & suggestions on this. We've been stuck on this for months, running into one error after another during our tests. It definitely looks like the member nodes lose connections between themselves, which leads to the data loss and all nodes getting disconnected.
>> ...
> On 7 April 2014 16:33, Noctarius <noctar...@googlemail.com> wrote:
> Hi Timo,
>
> I'm currently looking into your issue, and the only cause that comes to mind is a member being assumed dead because it does not respond within a certain amount of time (if I'm not mistaken, 10 seconds by default). Is it possible that under the load test the servers suffer from GC pauses or something similar that might make members think the others are dead?
>
> Do you see any log lines from the HealthMonitor indicating low memory or high queue sizes? Can you maybe provide a log file?
>
> Thanks,
> Chris
>
>
> Am 07.04.2014 um 11:39 schrieb Timo Ahokas <timo....@gmail.com>:
>
>> One additional note - we're using the default of 1 synchronous backup, so all data should be on 2 nodes (out of 3).
>> How is it possible that the cluster loses data without any of the nodes going down?
>>
>>
>> On Monday, 7 April 2014 10:35:38 UTC+3, Timo Ahokas wrote:
>> Hi,
>>
>> We are running on Hazelcast 3.2 and are still struggling to get our production load tests to pass.
>>
>> We utilize Hazelcast quite heavily (maps with listeners, locks, atomic numbers, queues, topics, etc.), but have recently started migrating data out into other services as we can't seem to keep Hazelcast stable....
INFO | jvm 1 | 2014/04/08 07:07:31 | W|2014-04-08 07:07:31,697|[hz._hzInstance_1_dev.cached.thread-17]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Address[10.100.0.11]:5701 couldn't ping Address[10.100.0.12]:5701
INFO | jvm 1 | 2014/04/08 07:07:37 | W|2014-04-08 07:07:37,163|[hz._hzInstance_1_dev.cached.thread-12]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Address[10.100.0.11]:5701 couldn't ping Address[10.100.0.12]:5701