Losing cluster data (Owner of partition is being removed) & WrongTarget!


Timo Ahokas

Apr 7, 2014, 3:35:38 AM4/7/14
to haze...@googlegroups.com

Hi,

We are running Hazelcast 3.2 and are still struggling with unsuccessful production load tests.

Our current test setup:

- 3 servers, each running both

a) Tomcat  (as Hazelcast clients)

b) Jetty (as Hazelcast members)

We utilize Hazelcast quite heavily (maps (with listeners), locks, atomic numbers, queues & topics etc), but have recently started migrating data out into other services as we can't seem to keep Hazelcast stable.


We're running rolling upgrade tests on the Tomcat servers; the Jetty servers are kept running the whole time. After a few rounds we might see something like this in the member logs:

lekane/jetty/logs/jetty_wrapper_20140407001810.log:INFO   | jvm 1    | 2014/04/07 10:01:37 | W|2014-04-07 10:01:37,278|[hz._hzInstance_1_dev.cached.thread-1]|[Slf4jFactory.java:90] [10.100.0.12]:5701 [dev] [3.2] Owner of partition is being removed! Possible data loss for partition[265]. PartitionReplicaChangeEvent{partitionId=265, replicaIndex=0, oldAddress=Address[10.100.0.12]:5701, newAddress=null} 

lekane/jetty/logs/jetty_wrapper_20140407001810.log:INFO   | jvm 1    | 2014/04/07 10:01:37 | W|2014-04-07 10:01:37,278|[hz._hzInstance_1_dev.cached.thread-1]|[Slf4jFactory.java:90] [10.100.0.12]:5701 [dev] [3.2] Owner of partition is being removed! Possible data loss for partition[266]. PartitionReplicaChangeEvent{partitionId=266, replicaIndex=0, oldAddress=Address[10.100.0.12]:5701, newAddress=null} 

lekane/jetty/logs/jetty_wrapper_20140407001810.log:INFO   | jvm 1    | 2014/04/07 10:01:37 | W|2014-04-07 10:01:37,278|[hz._hzInstance_1_dev.cached.thread-1]|[Slf4jFactory.java:90] [10.100.0.12]:5701 [dev] [3.2] Owner of partition is being removed! Possible data loss for partition[267]. PartitionReplicaChangeEvent{partitionId=267, replicaIndex=0, oldAddress=Address[10.100.0.12]:5701, newAddress=null} 

Basically it seems we lose some or all of our data without any change to the running Hazelcast members. What could cause this?


We also frequently get the cluster stuck in a state where the membership data seems to be corrupted after migration, and we get errors like the following that keep looping for hours and never seem to resolve:

INFO   | jvm 1    | 2014/04/01 08:30:51 | W|2014-04-01 08:30:51,292|[hz._hzInstance_1_dev.operation.thread-10]|[Slf4jFactory.java:85] [10.100.0.11]:5701 [dev] Op: com.hazelcast.concurrent.lock.LockOperation@2c3f17b9, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[10.100.0.11]:5701, target:Address[10.100.0.12]:5701, partitionId: 110, replicaIndex: 0, operation: com.hazelcast.spi.impl.WaitNotifyServiceImpl$WaitingOp, service: hz:impl:lockService

INFO   | jvm 1    | 2014/04/01 08:30:51 | W|2014-04-01 08:30:51,292|[hz._hzInstance_1_dev.operation.thread-30]|[Slf4jFactory.java:85] [10.100.0.11]:5701 [dev] Op: com.hazelcast.queue.PollOperation@52bcc4a9, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[10.100.0.11]:5701, target:Address[10.100.0.13]:5701, partitionId: 230, replicaIndex: 0, operation: com.hazelcast.spi.impl.WaitNotifyServiceImpl$WaitingOp, service: hz:impl:queueService


I'm just wondering whether we are putting too much load on Hazelcast (maybe up to 100-300 req/s per node, with CPU load rarely going over 35%), or whether there is something fundamentally wrong with our tests or setup.


Best regards,

Timo


Timo Ahokas

Apr 7, 2014, 4:39:13 AM4/7/14
to haze...@googlegroups.com
One additional note - we're using the default of 1 synchronous backup, so all data should be on 2 nodes (out of 3).
How is it possible for the cluster to lose data without any of the nodes going down?
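(For reference, the backup setting we rely on is just the default - roughly equivalent to the map config below. This is a sketch using the default map name, not our actual file:)

```xml
<hazelcast>
  <map name="default">
    <!-- one synchronous backup: each entry should exist on 2 of the 3 members -->
    <backup-count>1</backup-count>
    <async-backup-count>0</async-backup-count>
  </map>
</hazelcast>
```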

Noctarius

Apr 7, 2014, 9:33:52 AM4/7/14
to haze...@googlegroups.com
Hi Timo,

I’m currently looking into your issue, and the only cause that comes to mind is a member being assumed dead because it doesn’t respond within a certain amount of time (10 seconds by default, if I’m not mistaken). Is it possible that during the load test the servers suffer from GC pauses or something similar that makes members think the others are dead?

Do you see any log lines from the HealthMonitor reporting low memory or high queue sizes? Could you maybe provide a log file?

Thanks,
Chris


-- 
You received this message because you are subscribed to the Google Groups "Hazelcast" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hazelcast+...@googlegroups.com.
To post to this group, send email to haze...@googlegroups.com.
Visit this group at http://groups.google.com/group/hazelcast.
To view this discussion on the web visit https://groups.google.com/d/msgid/hazelcast/8b1be67a-0084-4399-bd13-d11faa9b07e3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Timo Ahokas

Apr 7, 2014, 12:49:35 PM4/7/14
to haze...@googlegroups.com
Hi Chris,

Thanks - we really appreciate any help and suggestions on this. We've been stuck for months, running into one error after another during our tests. But it definitely looks like the member nodes lose connections between themselves, which leads to data loss and all nodes getting disconnected.

Our current connectivity-related properties are the following:

<property name="hazelcast.icmp.enabled">true</property>

<property name="hazelcast.icmp.timeout">10000</property>

<property name="hazelcast.max.no.heartbeat.seconds">30</property>
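(For completeness, these properties live in the <properties> section of our hazelcast.xml - sketched below with the rest of the file omitted:)

```xml
<hazelcast>
  <properties>
    <!-- use ICMP pings in addition to TCP heartbeats, with a 10s ping timeout -->
    <property name="hazelcast.icmp.enabled">true</property>
    <property name="hazelcast.icmp.timeout">10000</property>
    <!-- consider a member dead after 30s without a heartbeat -->
    <property name="hazelcast.max.no.heartbeat.seconds">30</property>
  </properties>
</hazelcast>
```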


We are running the tests in an AWS VPC on 3 x c3.xlarge (4-core) boxes with 7GB of memory.

From a recent run - things look fairly normal initially:

INFO   | jvm 1    | 2014/04/07 17:38:48 | I|2014-04-07 17:38:48,573|[hz._hzInstance_1_dev.HealthMonitor]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] memory.used=52.6M, memory.free=73.3M, memory.total=125.9M, memory.max=1.9G, memory.used/total=41.80%, memory.used/max=2.64%, load.process=-1.00%, load.system=-1.00%, load.systemAverage=109.00%, thread.count=116, thread.peakCount=117, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.operation.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operation.size=0, executor.q.priorityOperation.size=0, executor.q.response.size=0, operations.remote.size=0

But then we start seeing stuff like:

INFO   | jvm 1    | 2014/04/07 17:38:59 | W|2014-04-07 17:38:59,633|[hz._hzInstance_1_dev.client.thread-38]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] com.hazelcast.spi.exception.CallerNotMemberException: Not Member! caller:Address[10.100.0.11]:5701, partitionId: -1, operation: com.hazelcast.spi.impl.PartitionIteratingOperation, service: hz:impl:mapService 

...

INFO   | jvm 1    | 2014/04/07 17:39:47 | W|2014-04-07 17:39:47,566|[hz._hzInstance_1_dev.response]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Retrying invocation: BasicInvocation{ serviceName='hz:impl:mapService', op=com.hazelcast.map.operation.MapSizeOperation@7cbba534, partitionId=35, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=100, callTimeout=60000, target=Address[10.100.0.12]:5701}, Reason: com.hazelcast.spi.exception.CallerNotMemberException: Not Member! caller:Address[10.100.0.11]:5701, partitionId: 35, operation: com.hazelcast.map.operation.MapSizeOperation, service: hz:impl:mapService 

INFO   | jvm 1    | 2014/04/07 17:39:47 | W|2014-04-07 17:39:47,567|[hz._hzInstance_1_dev.response]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Retrying invocation: BasicInvocation{ serviceName='hz:impl:mapService', op=com.hazelcast.map.operation.MapSizeOperation@35b38e6, partitionId=131, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=100, callTimeout=60000, target=Address[10.100.0.12]:5701}, Reason: com.hazelcast.spi.exception.CallerNotMemberException: Not Member! caller:Address[10.100.0.11]:5701, partitionId: 131, operation: com.hazelcast.map.operation.MapSizeOperation, service: hz:impl:mapService 

....

INFO   | jvm 1    | 2014/04/07 17:40:35 | W|2014-04-07 17:40:35,060|[hz._hzInstance_1_dev.global-operation.thread-1]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] New join request has been received from an existing endpoint! => Member [10.100.0.12]:5701 Removing old member and processing join request... 

INFO   | jvm 1    | 2014/04/07 17:40:35 | I|2014-04-07 17:40:35,061|[hz._hzInstance_1_dev.global-operation.thread-1]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Removing Member [10.100.0.12]:5701 

INFO   | jvm 1    | 2014/04/07 17:40:35 | I|2014-04-07 17:40:35,066|[hz._hzInstance_1_dev.global-operation.thread-1]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] 

INFO   | jvm 1    | 2014/04/07 17:40:35 | 

INFO   | jvm 1    | 2014/04/07 17:40:35 | Members [2] {

INFO   | jvm 1    | 2014/04/07 17:40:35 |       Member [10.100.0.11]:5701 this

INFO   | jvm 1    | 2014/04/07 17:40:35 |       Member [10.100.0.13]:5701

INFO   | jvm 1    | 2014/04/07 17:40:35 | }

INFO   | jvm 1    | 2014/04/07 17:40:35 |  

INFO   | jvm 1    | 2014/04/07 17:40:36 | I|2014-04-07 17:40:36,583|[hz._hzInstance_1_dev.migration]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Partition balance is ok, no need to re-partition cluster data...  

INFO   | jvm 1    | 2014/04/07 17:40:39 | W|2014-04-07 17:40:39,326|[hz._hzInstance_1_dev.async.thread-11]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Retrying invocation: BasicInvocation{ serviceName='hz:impl:mapService', op=com.hazelcast.spi.impl.EventServiceImpl$RegistrationOperation@48d77123, partitionId=-1, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=200, callTimeout=60000, target=Address[10.100.0.12]:5701}, Reason: com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target:Address[10.100.0.12]:5701, partitionId: -1, operation: com.hazelcast.spi.impl.EventServiceImpl$RegistrationOperation, service: hz:impl:mapService 

INFO   | jvm 1    | 2014/04/07 17:40:39 | W|2014-04-07 17:40:39,326|[hz._hzInstance_1_dev.async.thread-18]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Retrying invocation: BasicInvocation{ serviceName='hz:impl:mapService', op=com.hazelcast.spi.impl.EventServiceImpl$RegistrationOperation@c4dd10c, partitionId=-1, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=200, callTimeout=60000, target=Address[10.100.0.12]:5701}, Reason: com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target:Address[10.100.0.12]:5701, partitionId: -1, operation: com.hazelcast.spi.impl.EventServiceImpl$RegistrationOperation, service: hz:impl:mapService 

INFO   | jvm 1    | 2014/04/07 17:40:41 | I|2014-04-07 17:40:41,061|[hz._hzInstance_1_dev.global-operation.thread-2]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] 

INFO   | jvm 1    | 2014/04/07 17:40:41 | 

INFO   | jvm 1    | 2014/04/07 17:40:41 | Members [3] {

INFO   | jvm 1    | 2014/04/07 17:40:41 |       Member [10.100.0.11]:5701 this

INFO   | jvm 1    | 2014/04/07 17:40:41 |       Member [10.100.0.13]:5701

INFO   | jvm 1    | 2014/04/07 17:40:41 |       Member [10.100.0.12]:5701

INFO   | jvm 1    | 2014/04/07 17:40:41 | }

INFO   | jvm 1    | 2014/04/07 17:40:41 |  

INFO   | jvm 1    | 2014/04/07 17:40:41 | I|2014-04-07 17:40:41,337|[hz._hzInstance_1_dev.migration]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Re-partitioning cluster data... Migration queue size: 90 

INFO   | jvm 1    | 2014/04/07 17:40:42 | I|2014-04-07 17:40:42,690|[hz._hzInstance_1_dev.migration]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] All migration tasks has been completed, queues are empty. 

....

INFO   | jvm 1    | 2014/04/07 17:41:00 | I|2014-04-07 17:41:00,573|[hz._hzInstance_1_dev.operation.thread-2]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Backup partition version is not matching version of the owner -> 101 -vs- 103 

INFO   | jvm 1    | 2014/04/07 17:41:00 | I|2014-04-07 17:41:00,573|[hz._hzInstance_1_dev.operation.thread-3]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Backup partition version is not matching version of the owner -> 135 -vs- 137 

INFO   | jvm 1    | 2014/04/07 17:41:00 | I|2014-04-07 17:41:00,573|[hz._hzInstance_1_dev.operation.thread-6]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Backup partition version is not matching version of the owner -> 161 -vs- 162 

INFO   | jvm 1    | 2014/04/07 17:41:00 | I|2014-04-07 17:41:00,573|[hz._hzInstance_1_dev.operation.thread-3]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Backup partition version is not matching version of the owner -> 152 -vs- 154 

INFO   | jvm 1    | 2014/04/07 17:41:00 | I|2014-04-07 17:41:00,574|[hz._hzInstance_1_dev.operation.thread-1]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Backup partition version is not matching version of the owner -> 112 -vs- 114 

INFO   | jvm 1    | 2014/04/07 17:41:00 | I|2014-04-07 17:41:00,574|[hz._hzInstance_1_dev.operation.thread-4]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Backup partition version is not matching version of the owner -> 81 -vs- 83 

..

INFO   | jvm 1    | 2014/04/07 17:52:59 | W|2014-04-07 17:52:59,884|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Address[10.100.0.11]:5701 couldn't ping Address[10.100.0.12]:5701 

INFO   | jvm 1    | 2014/04/07 17:52:59 | W|2014-04-07 17:52:59,884|[hz._hzInstance_1_dev.cached.thread-1]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Address[10.100.0.11]:5701 couldn't ping Address[10.100.0.13]:5701 

INFO   | jvm 1    | 2014/04/07 17:52:59 | I|2014-04-07 17:52:59,884|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Connection [Address[10.100.0.12]:5701] lost. Reason: Socket explicitly closed 

INFO   | jvm 1    | 2014/04/07 17:52:59 | I|2014-04-07 17:52:59,884|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Removing Member [10.100.0.12]:5701 

INFO   | jvm 1    | 2014/04/07 17:52:59 | I|2014-04-07 17:52:59,888|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] 

INFO   | jvm 1    | 2014/04/07 17:52:59 | 

INFO   | jvm 1    | 2014/04/07 17:52:59 | Members [2] {

INFO   | jvm 1    | 2014/04/07 17:52:59 |       Member [10.100.0.11]:5701 this

INFO   | jvm 1    | 2014/04/07 17:52:59 |       Member [10.100.0.13]:5701

INFO   | jvm 1    | 2014/04/07 17:52:59 | }

INFO   | jvm 1    | 2014/04/07 17:52:59 |  

INFO   | jvm 1    | 2014/04/07 17:52:59 | I|2014-04-07 17:52:59,889|[hz._hzInstance_1_dev.cached.thread-1]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Connection [Address[10.100.0.13]:5701] lost. Reason: Socket explicitly closed 

INFO   | jvm 1    | 2014/04/07 17:52:59 | I|2014-04-07 17:52:59,889|[hz._hzInstance_1_dev.cached.thread-1]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Removing Member [10.100.0.13]:5701 

INFO   | jvm 1    | 2014/04/07 17:52:59 | I|2014-04-07 17:52:59,891|[hz._hzInstance_1_dev.cached.thread-1]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] 

INFO   | jvm 1    | 2014/04/07 17:52:59 | 

INFO   | jvm 1    | 2014/04/07 17:52:59 | Members [1] {

INFO   | jvm 1    | 2014/04/07 17:52:59 |       Member [10.100.0.11]:5701 this

INFO   | jvm 1    | 2014/04/07 17:52:59 | }

INFO   | jvm 1    | 2014/04/07 17:52:59 |  

INFO   | jvm 1    | 2014/04/07 17:52:59 | I|2014-04-07 17:52:59,894|[hz._hzInstance_1_dev.IO.thread-Acceptor]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Accepting socket connection from /10.100.0.13:54577 

INFO   | jvm 1    | 2014/04/07 17:52:59 | I|2014-04-07 17:52:59,895|[hz._hzInstance_1_dev.IO.thread-Acceptor]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] 5701 accepted socket connection from /10.100.0.13:54577 

INFO   | jvm 1    | 2014/04/07 17:53:00 | I|2014-04-07 17:53:00,254|[hz._hzInstance_1_dev.IO.thread-Acceptor]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Accepting socket connection from /10.100.0.12:56674 

INFO   | jvm 1    | 2014/04/07 17:53:00 | I|2014-04-07 17:53:00,255|[hz._hzInstance_1_dev.IO.thread-Acceptor]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] 5701 accepted socket connection from /10.100.0.12:56674 

INFO   | jvm 1    | 2014/04/07 17:53:00 | W|2014-04-07 17:53:00,647|[hz._hzInstance_1_dev.global-operation.thread-4]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] MasterConfirmation has been received from Address[10.100.0.12]:5701, but it is not a member of this cluster! 

INFO   | jvm 1    | 2014/04/07 17:53:00 | I|2014-04-07 17:53:00,648|[hz._hzInstance_1_dev.IO.thread-in-0]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Connection [Address[10.100.0.12]:5701] lost. Reason: java.io.EOFException[Remote socket closed!] 

INFO   | jvm 1    | 2014/04/07 17:53:00 | W|2014-04-07 17:53:00,648|[hz._hzInstance_1_dev.IO.thread-in-0]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] hz._hzInstance_1_dev.IO.thread-in-0 Closing socket to endpoint Address[10.100.0.12]:5701, Cause:java.io.EOFException: Remote socket closed! 

INFO   | jvm 1    | 2014/04/07 17:53:00 | Exception in thread "hz._hzInstance_1_dev.cached.thread-1" java.lang.NullPointerException

INFO   | jvm 1    | 2014/04/07 17:53:00 |       at com.hazelcast.map.client.AbstractMapAddEntryListenerRequest$1.handleEvent(AbstractMapAddEntryListenerRequest.java:77)

INFO   | jvm 1    | 2014/04/07 17:53:00 |       at com.hazelcast.map.client.AbstractMapAddEntryListenerRequest$1.entryAdded(AbstractMapAddEntryListenerRequest.java:83)

INFO   | jvm 1    | 2014/04/07 17:53:00 |       at com.hazelcast.map.MapService.dispatchEvent(MapService.java:873)

INFO   | jvm 1    | 2014/04/07 17:53:00 |       at com.hazelcast.map.MapService.dispatchEvent(MapService.java:75)

INFO   | jvm 1    | 2014/04/07 17:53:00 |       at com.hazelcast.spi.impl.EventServiceImpl$EventPacketProcessor.process(EventServiceImpl.java:508)

INFO   | jvm 1    | 2014/04/07 17:53:00 |       at com.hazelcast.spi.impl.EventServiceImpl$RemoteEventPacketProcessor.run(EventServiceImpl.java:534)

INFO   | jvm 1    | 2014/04/07 17:53:00 |       at com.hazelcast.util.executor.StripedExecutor$Worker.run(StripedExecutor.java:142)

INFO   | jvm 1    | 2014/04/07 17:53:00 |       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)

INFO   | jvm 1    | 2014/04/07 17:53:00 |       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)

INFO   | jvm 1    | 2014/04/07 17:53:00 |       at java.lang.Thread.run(Thread.java:662)

INFO   | jvm 1    | 2014/04/07 17:53:01 | I|2014-04-07 17:53:01,391|[hz._hzInstance_1_dev.migration]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Partition balance is ok, no need to re-partition cluster data...  

INFO   | jvm 1    | 2014/04/07 17:53:06 | Exception in thread "hz._hzInstance_1_dev.cached.thread-5" java.lang.NullPointerException

INFO   | jvm 1    | 2014/04/07 17:53:06 |       at com.hazelcast.map.client.AbstractMapAddEntryListenerRequest$1.handleEvent(AbstractMapAddEntryListenerRequest.java:77)

INFO   | jvm 1    | 2014/04/07 17:53:06 |       at com.hazelcast.map.client.AbstractMapAddEntryListenerRequest$1.entryEvicted(AbstractMapAddEntryListenerRequest.java:95)

INFO   | jvm 1    | 2014/04/07 17:53:06 |       at com.hazelcast.map.MapService.dispatchEvent(MapService.java:876)

INFO   | jvm 1    | 2014/04/07 17:53:06 |       at com.hazelcast.map.MapService.dispatchEvent(MapService.java:75)

INFO   | jvm 1    | 2014/04/07 17:53:06 |       at com.hazelcast.spi.impl.EventServiceImpl$EventPacketProcessor.process(EventServiceImpl.java:508)

INFO   | jvm 1    | 2014/04/07 17:53:06 |       at com.hazelcast.spi.impl.EventServiceImpl$RemoteEventPacketProcessor.run(EventServiceImpl.java:534)

INFO   | jvm 1    | 2014/04/07 17:53:06 |       at com.hazelcast.util.executor.StripedExecutor$Worker.run(StripedExecutor.java:142)

INFO   | jvm 1    | 2014/04/07 17:53:06 |       at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)

INFO   | jvm 1    | 2014/04/07 17:53:06 |       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)

INFO   | jvm 1    | 2014/04/07 17:53:06 |       at java.lang.Thread.run(Thread.java:662)

INFO   | jvm 1    | 2014/04/07 17:53:06 |       at com.hazelcast.util.executor.PoolExecutorThreadFactory$ManagedThread.run(PoolExecutorThreadFactory.java:59)

INFO   | jvm 1    | 2014/04/07 17:53:09 | I|2014-04-07 17:53:09,886|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Re-sending sync replica request for partition: 33, replica: 1 

INFO   | jvm 1    | 2014/04/07 17:53:09 | I|2014-04-07 17:53:09,886|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] This node is now owner of partition, cannot sync replica -> partitionId: 33, replicaIndex: 1, partition-info: Partition [33]{

INFO   | jvm 1    | 2014/04/07 17:53:09 |       0:Address[10.100.0.11]:5701

INFO   | jvm 1    | 2014/04/07 17:53:09 | } 

INFO   | jvm 1    | 2014/04/07 17:53:09 | I|2014-04-07 17:53:09,886|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Re-sending sync replica request for partition: 35, replica: 1 

INFO   | jvm 1    | 2014/04/07 17:53:09 | I|2014-04-07 17:53:09,886|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] This node is now owner of partition, cannot sync replica -> partitionId: 35, replicaIndex: 1, partition-info: Partition [35]{

INFO   | jvm 1    | 2014/04/07 17:53:09 |       0:Address[10.100.0.11]:5701

INFO   | jvm 1    | 2014/04/07 17:53:09 | } 

INFO   | jvm 1    | 2014/04/07 17:53:09 | I|2014-04-07 17:53:09,886|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Re-sending sync replica request for partition: 36, replica: 1 

INFO   | jvm 1    | 2014/04/07 17:53:09 | I|2014-04-07 17:53:09,886|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] This node is now owner of partition, cannot sync replica -> partitionId: 36, replicaIndex: 1, partition-info: Partition [36]{

INFO   | jvm 1    | 2014/04/07 17:53:09 |       0:Address[10.100.0.11]:5701

INFO   | jvm 1    | 2014/04/07 17:53:09 | } 

---

INFO   | jvm 1    | 2014/04/07 17:53:11 | W|2014-04-07 17:53:11,320|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] No registration found for hz:impl:mapService / 49b783fc-7458-448c-b49c-c89f5dfe06ce 

INFO   | jvm 1    | 2014/04/07 17:53:12 | W|2014-04-07 17:53:12,567|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] No registration found for hz:impl:mapService / 49b783fc-7458-448c-b49c-c89f5dfe06ce 

..

INFO   | jvm 1    | 2014/04/07 17:53:21 | W|2014-04-07 17:53:21,916|[hz._hzInstance_1_dev.global-operation.thread-4]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] MasterConfirmation has been received from Address[10.100.0.13]:5701, but it is not a member of this cluster! 

INFO   | jvm 1    | 2014/04/07 17:53:21 | I|2014-04-07 17:53:21,917|[hz._hzInstance_1_dev.IO.thread-in-2]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Connection [Address[10.100.0.13]:5701] lost. Reason: java.io.EOFException[Remote socket closed!]

...

INFO   | jvm 1    | 2014/04/07 17:53:40 | I|2014-04-07 17:53:40,311|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] 51919 accepted socket connection from /10.100.0.13:5701 

INFO   | jvm 1    | 2014/04/07 17:53:41 | I|2014-04-07 17:53:41,812|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Address[10.100.0.11]:5701 is merging to Address[10.100.0.13]:5701, because : joinRequest.getMemberCount() > currentMemberCount [2 > 1] 

INFO   | jvm 1    | 2014/04/07 17:53:41 | W|2014-04-07 17:53:41,813|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Address[10.100.0.11]:5701 is merging [tcp/ip] to Address[10.100.0.13]:5701 

INFO   | jvm 1    | 2014/04/07 17:53:41 | W|2014-04-07 17:53:41,813|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Preparing to merge... Waiting for merge instruction... 

INFO   | jvm 1    | 2014/04/07 17:53:41 | W|2014-04-07 17:53:41,814|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Address[10.100.0.11]:5701 is merging to Address[10.100.0.13]:5701, because: instructed by master Address[10.100.0.11]:5701 

INFO   | jvm 1    | 2014/04/07 17:53:41 | I|2014-04-07 17:53:41,815|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Address[10.100.0.11]:5701 is MERGING 

INFO   | jvm 1    | 2014/04/07 17:53:41 | W|2014-04-07 17:53:41,902|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Owner of partition is being removed! Possible data loss for partition[0]. PartitionReplicaChangeEvent{partitionId=0, replicaIndex=0, oldAddress=Address[10.100.0.11]:5701, newAddress=null} 

INFO   | jvm 1    | 2014/04/07 17:53:41 | W|2014-04-07 17:53:41,902|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Owner of partition is being removed! Possible data loss for partition[1]. PartitionReplicaChangeEvent{partitionId=1, replicaIndex=0, oldAddress=Address[10.100.0.11]:5701, newAddress=null} 

INFO   | jvm 1    | 2014/04/07 17:53:41 | W|2014-04-07 17:53:41,902|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Owner of partition is being removed! Possible data loss for partition[2]. PartitionReplicaChangeEvent{partitionId=2, replicaIndex=0, oldAddress=Address[10.100.0.11]:5701, newAddress=null} 

INFO   | jvm 1    | 2014/04/07 17:53:41 | W|2014-04-07 17:53:41,903|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Owner of partition is being removed! Possible data loss for partition[3]. PartitionReplicaChangeEvent{partitionId=3, replicaIndex=0, oldAddress=Address[10.100.0.11]:5701, newAddress=null} 

....

INFO   | jvm 1    | 2014/04/07 18:15:03 | W|2014-04-07 18:15:03,041|[hz._hzInstance_1_dev.cached.thread-4]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Master node has timed out its heartbeat and will be removed 

INFO   | jvm 1    | 2014/04/07 18:15:04 | W|2014-04-07 18:15:04,041|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Master node has timed out its heartbeat and will be removed 


I ran pings between the nodes (from the shell) at the time and didn't spot any network issues, so it seems the intra-member health checks somehow get blocked or delayed?

I can share full logs from all nodes for some test runs if that would help. Are there any AWS-specific tweaks that might be useful?


Best regards,

Timo




Joe Planisky

Apr 7, 2014, 1:40:16 PM4/7/14
to haze...@googlegroups.com
Hi Timo,

One thing we found that improves cluster stability is using the G1 garbage collector. This eliminated long GC pauses for us, which were occasionally causing problems. Our launch script includes these command-line options (which also enable GC logging, so we could see how much time was being spent in GC):

-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -verbose:gc

Note that this didn’t completely cure our cluster stability problems, but it seemed to help. If you’re not using G1GC, you might give it a try.
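(As a sketch of how we pass these - our actual script differs, and the jar name and log path below are placeholders:)

```
# Enable G1 and GC logging; adjust the log path and application jar for your setup
java -XX:+UseG1GC \
     -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -verbose:gc \
     -Xloggc:/var/log/app/gc.log \
     -jar your-app.jar
```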


Joe



Timo Ahokas

Apr 7, 2014, 3:46:45 PM4/7/14
to haze...@googlegroups.com
Hi Joe,

Thanks for the suggestion. We're still on Java 6, so we might need to upgrade for the test (I presume you're running Java 7?).
We've got GC logging enabled for both containers - I previously checked that there were no long pauses (we're running with -XX:+UseConcMarkSweepGC etc.), but I'll re-check just in case.

Just out of curiosity, what sort of load are you putting on your cluster (approximate operations per node per second, average CPU load, etc.)? I'm trying to find a level that you'd expect to work, as we've already moved most of our heavy operations out of Hazelcast for the test.

Best regards,
Timo


On 7 April 2014 20:40, Joe Planisky <joe.pl...@temboo.com> wrote:
Hi Timo,

One thing we found that improves cluster stability is to use the G1GC garbage collector.  This eliminated long GC pauses for us, which were occasionally causing problems.  Our launch script includes these command line options (which also enable GC logging so we could see how much time was being taken by GC):

-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -verbose:gc

Note that this didn't completely cure our cluster stability problems, but seemed to improve it.  If you're not using G1GC, you might give it a try.

--

Joe


On Apr 7, 2014, at 9:49, Timo Ahokas <timo....@gmail.com> wrote:

> HI Chris,
>
> Thanks - really appreciate any help & suggestions on this. We've been stuck on this for months as we seem to be getting different error after error during our tests. But it does definitely look like the member nodes lose connections between themselves and this leads into the data loss and all nodes get disconnected.
> ...

> On 7 April 2014 16:33, Noctarius <noctar...@googlemail.com> wrote:
> Hi Timo,
>
> I'm currently looking into your issue and the only reason that comes to my mind would be something like a "member is assumed as dead" because it does not respond for a certain amount of time (if not mistaken 10 secs by default). Is it possible that for the load test the servers suffer from GC pauses or something like that that might make other members think others are dead?
>
> Do you see any log lines from the Healthmonitor to state about low memory or high queue sizes? Can you maybe provide a log file?
>
> Thanks,
> Chris
>
>
> Am 07.04.2014 um 11:39 schrieb Timo Ahokas <timo....@gmail.com>:
>
>> One additional note - we're using the default 1 synchronous backup, so all data should be on 2 nodes (out of 3).
>> How is it possible that the cluster loses the data without any of the nodes going down?
>>
>>
>> On Monday, 7 April 2014 10:35:38 UTC+3, Timo Ahokas wrote:
>> Hi,
>>
>> We are running on Hazelcast 3.2 and still continuing with the non-successful production load tests.
>>
>> ...
>> We utilize Hazelcast quite heavily (maps (with listeners), locks, atomic numbers, queues & topics etc), but have recently started migrating data out into other services as we can't seem to keep Hazelcast stable.
...
--
You received this message because you are subscribed to the Google Groups "Hazelcast" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hazelcast+...@googlegroups.com.
To post to this group, send email to haze...@googlegroups.com.
Visit this group at http://groups.google.com/group/hazelcast.
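For reference, the one-synchronous-backup setup Timo describes corresponds to a hazelcast.xml fragment roughly like the following (a sketch against the 3.x schema; the map name is an assumption and namespace declarations are omitted):

```xml
<hazelcast>
  <map name="default">
    <!-- one synchronous backup: each partition is held on 2 of the 3 members -->
    <backup-count>1</backup-count>
    <async-backup-count>0</async-backup-count>
  </map>
</hazelcast>
```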

Gernot Pansy

unread,
Apr 7, 2014, 4:06:35 PM4/7/14
to haze...@googlegroups.com
Hi Timo,

looks a bit like my problem. I'm still trying to bring my application from 2.6 to 3.x. On my last test (with 3.2) I got similar problems:
https://github.com/hazelcast/hazelcast/issues/2105 - it looks like we have nearly the same workload.

I have seen in the git history that mdogan has already refactored the partition service after the 3.2 release. So perhaps you can give a current 3.2.1-SNAPSHOT a try.
Is it stable if you run it with only one cluster node?

gernot

Joe Planisky

unread,
Apr 7, 2014, 5:12:40 PM4/7/14
to haze...@googlegroups.com
We are indeed using Java 7.

I can’t help too much with what a reasonable load is because our systems are limited by non-HZ factors. I would guess that at peaks we may be hitting close to 100 Hazelcast ops/sec. I’m not sure what the CPU usage is at peaks, but during normal operations it’s usually negligible, < 20%.

That said, the stability problems we’ve seen don’t seem to be load related. We’ve seen nodes start bouncing in and out of the cluster when things were basically idle. It seems to be related to the AWS environment as we’ve never had stability problems when running a cluster on a local network.

--
Joe Planisky | Temboo | 104 Franklin Street NYC | www.temboo.com

Timo Ahokas

unread,
Apr 8, 2014, 3:53:50 AM4/8/14
to haze...@googlegroups.com, noctar...@googlemail.com
Hi Chris,

Looking at the logs now does show that ICMP ping (hazelcast.icmp.enabled) doesn't work for us:

INFO   | jvm 1    | 2014/04/08 07:07:31 | W|2014-04-08 07:07:31,697|[hz._hzInstance_1_dev.cached.thread-17]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Address[10.100.0.11]:5701 couldn't ping Address[10.100.0.12]:5701 

INFO   | jvm 1    | 2014/04/08 07:07:37 | W|2014-04-08 07:07:37,163|[hz._hzInstance_1_dev.cached.thread-12]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Address[10.100.0.11]:5701 couldn't ping Address[10.100.0.12]:5701

I looked a bit into this and it seems to be expected, as ping/isReachable() requires root/sudo access on Linux. We are running Tomcat under a separate tomcat user (as recommended).
How is hazelcast.icmp.enabled supposed to work on Linux when running Java as a normal user?

Best regards,
Timo

Noctarius

unread,
Apr 8, 2014, 3:59:22 AM4/8/14
to Timo Ahokas, haze...@googlegroups.com
Hi Timo,

It needs root rights, or it will fall back to TCP port 7 for the echo response - but that requires port 7 to be open and an echo server running on the other side (xinetd should have one built in, I guess).

Chris
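As a sketch of the fallback Chris describes: on many Linux distributions the xinetd-provided TCP echo service can be enabled with a config like the following (the file name and exact layout vary by distro - verify against yours - and xinetd must be restarted afterwards):

```
# /etc/xinetd.d/echo-stream (typical layout; an assumption, not from the thread)
service echo
{
        disable     = no
        type        = INTERNAL
        id          = echo-stream
        socket_type = stream
        protocol    = tcp
        user        = root
        wait        = no
}
```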

Timo Ahokas

unread,
Apr 8, 2014, 4:38:17 AM4/8/14
to haze...@googlegroups.com, Timo Ahokas, noctar...@googlemail.com
Hi Chris,

Thanks a lot... I've set up xinetd/echo and will give it a try now. Somehow I didn't even think about this earlier, as shell ping worked fine and there was no mention of it in the docs or the Hazelcast on AWS guide.
Hope this helps... I'll keep you updated.

Best regards,
Timo

Noctarius

unread,
Apr 8, 2014, 4:39:43 AM4/8/14
to Timo Ahokas, haze...@googlegroups.com
Yeah, you're right, it should have been mentioned. We recently (a few days ago) had a customer who was suffering from the same problem. We'll add a note to the documentation :)

Timo Ahokas

unread,
Apr 9, 2014, 8:30:18 AM4/9/14
to Noctarius, haze...@googlegroups.com
Hi Chris,

Our system works a lot better now (compared to when we had a broken ICMP config) - so thanks a lot for your help. Unfortunately things are still failing (internal processes start timing out as processing takes too long) when I try moving parts of the data back into Hazelcast. We have multiple implementations for some caches/queues and are using Redis on the side for some high-traffic workloads.

I'll need to run further tests to try to identify whether we're overloading the machines (CPUs still around 25-30%) or hitting some other issue. The two high volume operations that we moved back to Hazelcast and seem to bring the system down are:
- map/multimap cache for session related data (150-300 req/s with 1kb payloads (in total for 3 servers))
- queue traffic between the nodes (10-50 req/s with 1kb payloads).

Based on previous testing, this only seems to happen when we do rolling restarts of the Hazelcast clients (members not bounced) - constant high load (even 2-4 times more) is fine.

Best regards,
Timo

Noctarius

unread,
Apr 9, 2014, 9:32:30 AM4/9/14
to Timo Ahokas, haze...@googlegroups.com
Sadly, I've already heard a few reports that the cluster becomes a bit unstable on rolling restarts, and we are looking into this. We have not yet been able to find a reliable way to reproduce the problem, but it seems to happen more often if you restart the nodes too quickly.

How long are your nodes down before you start them again?

Chris

Timo Ahokas

unread,
Apr 9, 2014, 9:42:51 AM4/9/14
to Noctarius, haze...@googlegroups.com
Hi Chris,

We are not restarting members, just the Tomcat process which is the Hazelcast client (native Java client). The client is offline maybe a bit over a minute. So 3 servers, each running Tomcat (client + our app) and Jetty (member), rolling restarts for only the Tomcats. We had to split the member out due to data migrations causing even worse pauses/issues.

Noctarius

unread,
Apr 9, 2014, 9:48:03 AM4/9/14
to Timo Ahokas, haze...@googlegroups.com
Hi Timo,

I see - so far we've only had reports from the cluster side. I'll ask Ali to join the conversation, since he's the client engineer. Maybe he has a good idea / strategy for you.

Chris
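For clients that need to ride out rolling restarts, the 3.x client's reconnection behavior can be tuned; a hedged hazelcast-client.xml sketch (element names assumed from the 3.x client schema; the address and timing values are placeholders, not a recommendation from the thread):

```xml
<hazelcast-client>
  <network>
    <cluster-members>
      <address>10.100.0.11:5701</address>
    </cluster-members>
    <!-- keep retrying the cluster for roughly 30s before giving up -->
    <connection-timeout>5000</connection-timeout>
    <connection-attempt-period>3000</connection-attempt-period>
    <connection-attempt-limit>10</connection-attempt-limit>
  </network>
</hazelcast-client>
```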

Joe Planisky

unread,
Apr 9, 2014, 10:30:29 AM4/9/14
to haze...@googlegroups.com
FWIW, we are also beginning to suspect adding or removing clients (actually, lite members in our case, we’re still on 2.4.1) sometimes triggers instability. We definitely found that adding and removing full members too quickly can cause problems.

While it seems rare and not reliably reproducible, we’ve had a few cases where adding and removing lite members seemed to trigger instability.


On Apr 9, 2014, at 6:42, Timo Ahokas <timo....@gmail.com> wrote:

> Hi Chris,
>
> We are not restarting members, just the Tomcat process which is the Hazelcast client (native Java client). The client is offline maybe a bit over a minute.
> ...

chirag.m...@gmail.com

unread,
Feb 15, 2017, 5:01:21 AM2/15/17
to Hazelcast, joe.pl...@temboo.com
Hi There, 

In our production environment we are observing similar Hazelcast log statements, as follows:

12:31:23,122 INFO  [com.hazelcast.nio.Connection] (hz._hzInstance_2_STGICD_PM.IO.thread-1) [10.4.12.186]:6604 [STGICD_PM] Connection [Address[10.4.12.187]:6602] lost. Reason: java.io.EOFException[null]
12:31:23,138 WARNING [com.hazelcast.nio.ReadHandler] (hz._hzInstance_2_STGICD_PM.IO.thread-1) [10.4.12.186]:6604 [STGICD_PM] hz._hzInstance_2_STGICD_PM.IO.thread-1 Closing socket to endpoint Address[10.4.12.187]:6602, Cause:java.io.EOFException
12:31:23,138 INFO  [com.hazelcast.nio.Connection] (hz._hzInstance_1_STGICD_PM.IO.thread-1) [10.4.12.186]:6602 [STGICD_PM] Connection [Address[10.4.12.187]:6602] lost. Reason: java.io.EOFException[null]
12:31:23,138 WARNING [com.hazelcast.nio.ReadHandler] (hz._hzInstance_1_STGICD_PM.IO.thread-1) [10.4.12.186]:6602 [STGICD_PM] hz._hzInstance_1_STGICD_PM.IO.thread-1 Closing socket to endpoint Address[10.4.12.187]:6602, Cause:java.io.EOFException
12:31:23,185 INFO  [com.hazelcast.nio.Connection] (hz._hzInstance_2_STGICD_PM.IO.thread-0) [10.4.12.186]:6604 [STGICD_PM] Connection [Address[10.4.12.187]:6604] lost. Reason: java.io.EOFException[null]
12:31:23,185 WARNING [com.hazelcast.nio.ReadHandler] (hz._hzInstance_2_STGICD_PM.IO.thread-0) [10.4.12.186]:6604 [STGICD_PM] hz._hzInstance_2_STGICD_PM.IO.thread-0 Closing socket to endpoint Address[10.4.12.187]:6604, Cause:java.io.EOFException
12:31:23,185 INFO  [com.hazelcast.nio.Connection] (hz._hzInstance_1_STGICD_PM.IO.thread-0) [10.4.12.186]:6602 [STGICD_PM] Connection [Address[10.4.12.187]:6604] lost. Reason: java.io.EOFException[null]
12:31:23,185 WARNING [com.hazelcast.nio.ReadHandler] (hz._hzInstance_1_STGICD_PM.IO.thread-0) [10.4.12.186]:6602 [STGICD_PM] hz._hzInstance_1_STGICD_PM.IO.thread-0 Closing socket to endpoint Address[10.4.12.187]:6604, Cause:java.io.EOFException
12:31:23,263 INFO  [com.hazelcast.nio.ConnectionManager] (pool-86818-thread-9) [10.4.12.186]:6604 [STGICD_PM] 53371 accepted socket connection from /10.4.12.187:6604
12:31:23,263 INFO  [com.hazelcast.nio.ConnectionManager] (pool-4-thread-8) [10.4.12.186]:6602 [STGICD_PM] 53374 accepted socket connection from /10.4.12.187:6602
12:31:23,263 INFO  [com.hazelcast.nio.ConnectionManager] (pool-4-thread-9) [10.4.12.186]:6602 [STGICD_PM] 53373 accepted socket connection from /10.4.12.187:6604
12:31:23,263 INFO  [com.hazelcast.nio.ConnectionManager] (pool-86818-thread-8) [10.4.12.186]:6604 [STGICD_PM] 53372 accepted socket connection from /10.4.12.187:6602
12:31:23,310 INFO  [com.hazelcast.nio.Connection] (hz._hzInstance_2_STGICD_PM.IO.thread-1) [10.4.12.186]:6604 [STGICD_PM] Connection [Address[10.4.12.187]:6604] lost. Reason: java.io.IOException[An existing connection was forcibly closed by the remote host]
12:31:23,310 INFO  [com.hazelcast.nio.Connection] (hz._hzInstance_1_STGICD_PM.IO.thread-1) [10.4.12.186]:6602 [STGICD_PM] Connection [Address[10.4.12.187]:6604] lost. Reason: java.io.IOException[An existing connection was forcibly closed by the remote host]
12:31:23,497 INFO  [com.hazelcast.nio.Connection] (hz._hzInstance_2_STGICD_PM.IO.thread-2) [10.4.12.186]:6604 [STGICD_PM] Connection [Address[10.4.12.187]:6602] lost. Reason: java.io.IOException[An existing connection was forcibly closed by the remote host]
12:31:23,497 INFO  [com.hazelcast.nio.Connection] (hz._hzInstance_1_STGICD_PM.IO.thread-2) [10.4.12.186]:6602 [STGICD_PM] Connection [Address[10.4.12.187]:6602] lost. Reason: java.io.IOException[An existing connection was forcibly closed by the remote host]
12:31:24,917 INFO  [com.hazelcast.nio.SocketConnector] (pool-4-thread-9) [10.4.12.186]:6602 [STGICD_PM] Could not connect to: /10.4.12.187:6602. Reason: ConnectException[Connection refused: connect]
12:31:25,042 INFO  [com.hazelcast.nio.SocketConnector] (pool-4-thread-8) [10.4.12.186]:6602 [STGICD_PM] Could not connect to: /10.4.12.187:6604. Reason: ConnectException[Connection refused: connect]
12:31:25,292 INFO  [com.hazelcast.nio.SocketConnector] (pool-86818-thread-9) [10.4.12.186]:6604 [STGICD_PM] Could not connect to: /10.4.12.187:6604. Reason: ConnectException[Connection refused: connect]
12:31:25,292 INFO  [com.hazelcast.nio.SocketConnector] (pool-86818-thread-8) [10.4.12.186]:6604 [STGICD_PM] Could not connect to: /10.4.12.187:6602. Reason: ConnectException[Connection refused: connect]
12:31:26,047 INFO  [com.hazelcast.nio.SocketConnector] (pool-4-thread-9) [10.4.12.186]:6602 [STGICD_PM] Could not connect to: /10.4.12.187:6602. Reason: ConnectException[Connection refused: connect]
12:31:26,047 WARNING [com.hazelcast.nio.ConnectionMonitor] (pool-4-thread-9) [10.4.12.186]:6602 [STGICD_PM] Removing connection to endpoint Address[10.4.12.187]:6602 Cause => java.net.ConnectException {Connection refused: connect}, Error-Count: 3
12:31:26,047 INFO  [com.hazelcast.cluster.ClusterManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Removing Address Address[10.4.12.187]:6602
12:31:26,047 INFO  [stdout] (hz._hzInstance_1_STGICD_PM.cached.thread-38) 12:31:26,047 NOTICE HazelcastMembershipListener:89 - Member Removed /10.4.12.187:6602 <==> 40f7c462-d742-4cdf-932e-c268b203f024

12:31:26,047 INFO  [stdout] (hz._hzInstance_1_STGICD_PM.cached.thread-38) <12:31:26:047><NOTICE><HazelcastMembershipListener>Member Removed /10.4.12.187:6602 <==> 40f7c462-d742-4cdf-932e-c268b203f024

12:31:26,047 INFO  [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Starting to send partition replica diffs...true
12:31:26,047 INFO  [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.MigrationThread) [10.4.12.186]:6602 [STGICD_PM] Total 0 partition replica diffs have been processed.
12:31:26,047 INFO  [com.hazelcast.cluster.ClusterManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] 

Members [3] {
Member [10.4.12.186]:6602 this
Member [10.4.12.186]:6604
Member [10.4.12.187]:6604
}

12:31:26,062 INFO  [com.hazelcast.nio.SocketConnector] (pool-4-thread-8) [10.4.12.186]:6602 [STGICD_PM] Could not connect to: /10.4.12.187:6604. Reason: ConnectException[Connection refused: connect]
12:31:26,062 WARNING [com.hazelcast.nio.ConnectionMonitor] (pool-4-thread-8) [10.4.12.186]:6602 [STGICD_PM] Removing connection to endpoint Address[10.4.12.187]:6604 Cause => java.net.ConnectException {Connection refused: connect}, Error-Count: 3
12:31:26,062 INFO  [com.hazelcast.cluster.ClusterManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Removing Address Address[10.4.12.187]:6604
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[0]. PartitionReplicaChangeEvent{partitionId=0, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[1]. PartitionReplicaChangeEvent{partitionId=1, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[2]. PartitionReplicaChangeEvent{partitionId=2, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[3]. PartitionReplicaChangeEvent{partitionId=3, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[4]. PartitionReplicaChangeEvent{partitionId=4, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 INFO  [stdout] (hz._hzInstance_1_STGICD_PM.cached.thread-27) 12:31:26,062 NOTICE HazelcastMembershipListener:89 - Member Removed /10.4.12.187:6604 <==> 0e4ae4d8-85c5-4ae9-8477-6ceaaf5269cf

12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[5]. PartitionReplicaChangeEvent{partitionId=5, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[6]. PartitionReplicaChangeEvent{partitionId=6, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[7]. PartitionReplicaChangeEvent{partitionId=7, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 INFO  [stdout] (hz._hzInstance_1_STGICD_PM.cached.thread-27) <12:31:26:062><NOTICE><HazelcastMembershipListener>Member Removed /10.4.12.187:6604 <==> 0e4ae4d8-85c5-4ae9-8477-6ceaaf5269cf

12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[8]. PartitionReplicaChangeEvent{partitionId=8, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[9]. PartitionReplicaChangeEvent{partitionId=9, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[128]. PartitionReplicaChangeEvent{partitionId=128, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[129]. PartitionReplicaChangeEvent{partitionId=129, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[130]. PartitionReplicaChangeEvent{partitionId=130, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[131]. PartitionReplicaChangeEvent{partitionId=131, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[132]. PartitionReplicaChangeEvent{partitionId=132, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[133]. PartitionReplicaChangeEvent{partitionId=133, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[134]. PartitionReplicaChangeEvent{partitionId=134, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[188]. PartitionReplicaChangeEvent{partitionId=188, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[189]. PartitionReplicaChangeEvent{partitionId=189, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[190]. PartitionReplicaChangeEvent{partitionId=190, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[191]. PartitionReplicaChangeEvent{partitionId=191, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[192]. PartitionReplicaChangeEvent{partitionId=192, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[193]. PartitionReplicaChangeEvent{partitionId=193, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[194]. PartitionReplicaChangeEvent{partitionId=194, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[195]. PartitionReplicaChangeEvent{partitionId=195, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[196]. PartitionReplicaChangeEvent{partitionId=196, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[197]. PartitionReplicaChangeEvent{partitionId=197, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,062 WARNING [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Owner of partition is being removed! Possible data loss for partition[198]. PartitionReplicaChangeEvent{partitionId=198, replicaIndex=0, oldAddress=Address[10.4.12.187]:6604, newAddress=null}
12:31:26,078 INFO  [com.hazelcast.impl.PartitionManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] Starting to send partition replica diffs...true
12:31:26,078 INFO  [com.hazelcast.cluster.ClusterManager] (hz._hzInstance_1_STGICD_PM.ServiceThread) [10.4.12.186]:6602 [STGICD_PM] 

Members [2] {
Member [10.4.12.186]:6602 this
Member [10.4.12.186]:6604
}

12:31:27,312 INFO  [com.hazelcast.nio.SocketConnector] (pool-86818-thread-8) [10.4.12.186]:6604 [STGICD_PM] Could not connect to: /10.4.12.187:6602. Reason: ConnectException[Connection refused: connect]
12:31:27,312 WARNING [com.hazelcast.nio.ConnectionMonitor] (pool-86818-thread-8) [10.4.12.186]:6604 [STGICD_PM] Removing connection to endpoint Address[10.4.12.187]:6602 Cause => java.net.ConnectException {Connection refused: connect}, Error-Count: 3
12:31:27,312 INFO  [com.hazelcast.nio.SocketConnector] (pool-86818-thread-9) [10.4.12.186]:6604 [STGICD_PM] Could not connect to: /10.4.12.187:6604. Reason: ConnectException[Connection refused: connect]
12:31:27,312 INFO  [com.hazelcast.cluster.ClusterManager] (hz._hzInstance_2_STGICD_PM.ServiceThread) [10.4.12.186]:6604 [STGICD_PM] Removing Address Address[10.4.12.187]:6602
12:31:27,312 WARNING [com.hazelcast.nio.ConnectionMonitor] (pool-86818-thread-9) [10.4.12.186]:6604 [STGICD_PM] Removing connection to endpoint Address[10.4.12.187]:6604 Cause => java.net.ConnectException {Connection refused: connect}, Error-Count: 3
12:31:27,312 INFO  [com.hazelcast.cluster.ClusterManager] (hz._hzInstance_2_STGICD_PM.ServiceThread) [10.4.12.186]:6604 [STGICD_PM] 

Members [3] {
Member [10.4.12.186]:6602
Member [10.4.12.186]:6604 this
Member [10.4.12.187]:6604
}

12:31:27,312 INFO  [com.hazelcast.cluster.ClusterManager] (hz._hzInstance_2_STGICD_PM.ServiceThread) [10.4.12.186]:6604 [STGICD_PM] Removing Address Address[10.4.12.187]:6604
12:31:27,312 INFO  [com.hazelcast.cluster.ClusterManager] (hz._hzInstance_2_STGICD_PM.ServiceThread) [10.4.12.186]:6604 [STGICD_PM] 

Members [2] {
Member [10.4.12.186]:6602
Member [10.4.12.186]:6604 this
}


When it says possible data loss, does it mean the transaction is getting lost in between? Correct me if I am wrong.
Also, what could be the reason for this issue? We have 2 nodes configured in Hazelcast.

I'd appreciate your reply.