Hi,
We are running on Hazelcast 3.2 and are still struggling to get our production load tests to pass.
Our current test setup:
- 3 servers, each running both
a) Tomcat (as Hazelcast clients)
b) Jetty (as Hazelcast members)
We utilize Hazelcast quite heavily (maps with listeners, locks, atomic numbers, queues, topics, etc.), but have recently started migrating data out into other services as we can't seem to keep Hazelcast stable.
We're running rolling upgrade tests for the Tomcat servers; the Jetty servers are kept running the whole time. After a few rounds we might see something like this in the member logs:
lekane/jetty/logs/jetty_wrapper_20140407001810.log:INFO | jvm 1 | 2014/04/07 10:01:37 | W|2014-04-07 10:01:37,278|[hz._hzInstance_1_dev.cached.thread-1]|[Slf4jFactory.java:90] [10.100.0.12]:5701 [dev] [3.2] Owner of partition is being removed! Possible data loss for partition[265]. PartitionReplicaChangeEvent{partitionId=265, replicaIndex=0, oldAddress=Address[10.100.0.12]:5701, newAddress=null}
lekane/jetty/logs/jetty_wrapper_20140407001810.log:INFO | jvm 1 | 2014/04/07 10:01:37 | W|2014-04-07 10:01:37,278|[hz._hzInstance_1_dev.cached.thread-1]|[Slf4jFactory.java:90] [10.100.0.12]:5701 [dev] [3.2] Owner of partition is being removed! Possible data loss for partition[266]. PartitionReplicaChangeEvent{partitionId=266, replicaIndex=0, oldAddress=Address[10.100.0.12]:5701, newAddress=null}
lekane/jetty/logs/jetty_wrapper_20140407001810.log:INFO | jvm 1 | 2014/04/07 10:01:37 | W|2014-04-07 10:01:37,278|[hz._hzInstance_1_dev.cached.thread-1]|[Slf4jFactory.java:90] [10.100.0.12]:5701 [dev] [3.2] Owner of partition is being removed! Possible data loss for partition[267]. PartitionReplicaChangeEvent{partitionId=267, replicaIndex=0, oldAddress=Address[10.100.0.12]:5701, newAddress=null}
Basically it seems we lose some or all of our data without any change to the running Hazelcast members. What could cause this?
We also frequently get the cluster stuck in a state where membership data seems to be corrupted after migration: errors like the following keep looping for hours and never seem to resolve:
INFO | jvm 1 | 2014/04/01 08:30:51 | W|2014-04-01 08:30:51,292|[hz._hzInstance_1_dev.operation.thread-10]|[Slf4jFactory.java:85] [10.100.0.11]:5701 [dev] Op: com.hazelcast.concurrent.lock.LockOperation@2c3f17b9, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[10.100.0.11]:5701, target:Address[10.100.0.12]:5701, partitionId: 110, replicaIndex: 0, operation: com.hazelcast.spi.impl.WaitNotifyServiceImpl$WaitingOp, service: hz:impl:lockService
INFO | jvm 1 | 2014/04/01 08:30:51 | W|2014-04-01 08:30:51,292|[hz._hzInstance_1_dev.operation.thread-30]|[Slf4jFactory.java:85] [10.100.0.11]:5701 [dev] Op: com.hazelcast.queue.PollOperation@52bcc4a9, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[10.100.0.11]:5701, target:Address[10.100.0.13]:5701, partitionId: 230, replicaIndex: 0, operation: com.hazelcast.spi.impl.WaitNotifyServiceImpl$WaitingOp, service: hz:impl:queueService
I'm just wondering whether we are pushing too much load onto Hazelcast (maybe up to 100-300 req/s per node, with CPU load rarely going over 35%) or whether there is something fundamentally wrong with our tests or setup.
Best regards,
Timo
--
You received this message because you are subscribed to the Google Groups "Hazelcast" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hazelcast+...@googlegroups.com.
To post to this group, send email to haze...@googlegroups.com.
Visit this group at http://groups.google.com/group/hazelcast.
To view this discussion on the web visit https://groups.google.com/d/msgid/hazelcast/8b1be67a-0084-4399-bd13-d11faa9b07e3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
<property name="hazelcast.icmp.enabled">true</property>
<property name="hazelcast.icmp.timeout">10000</property>
<property name="hazelcast.max.no.heartbeat.seconds">30</property>
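For reference, these properties live in the <properties> section of our hazelcast.xml. A trimmed sketch (everything else in our config is omitted, and the comments are just my understanding of what each property does):

```xml
<hazelcast xmlns="http://www.hazelcast.com/schema/config">
  <properties>
    <!-- Also ping members over ICMP, in addition to the normal heartbeats -->
    <property name="hazelcast.icmp.enabled">true</property>
    <!-- ICMP ping timeout, in milliseconds -->
    <property name="hazelcast.icmp.timeout">10000</property>
    <!-- Consider a member dead after 30 seconds without a heartbeat -->
    <property name="hazelcast.max.no.heartbeat.seconds">30</property>
  </properties>
</hazelcast>
```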
We are running the tests in AWS VPC on 3 x c3.xlarge (4 core) boxes with 7GB memory.
Just from a recent run... things look fairly normal (?) initially:
INFO | jvm 1 | 2014/04/07 17:38:48 | I|2014-04-07 17:38:48,573|[hz._hzInstance_1_dev.HealthMonitor]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] memory.used=52.6M, memory.free=73.3M, memory.total=125.9M, memory.max=1.9G, memory.used/total=41.80%, memory.used/max=2.64%, load.process=-1.00%, load.system=-1.00%, load.systemAverage=109.00%, thread.count=116, thread.peakCount=117, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.operation.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operation.size=0, executor.q.priorityOperation.size=0, executor.q.response.size=0, operations.remote.size=0
But then we start seeing stuff like:
INFO | jvm 1 | 2014/04/07 17:38:59 | W|2014-04-07 17:38:59,633|[hz._hzInstance_1_dev.client.thread-38]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] com.hazelcast.spi.exception.CallerNotMemberException: Not Member! caller:Address[10.100.0.11]:5701, partitionId: -1, operation: com.hazelcast.spi.impl.PartitionIteratingOperation, service: hz:impl:mapService
...
INFO | jvm 1 | 2014/04/07 17:39:47 | W|2014-04-07 17:39:47,566|[hz._hzInstance_1_dev.response]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Retrying invocation: BasicInvocation{ serviceName='hz:impl:mapService', op=com.hazelcast.map.operation.MapSizeOperation@7cbba534, partitionId=35, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=100, callTimeout=60000, target=Address[10.100.0.12]:5701}, Reason: com.hazelcast.spi.exception.CallerNotMemberException: Not Member! caller:Address[10.100.0.11]:5701, partitionId: 35, operation: com.hazelcast.map.operation.MapSizeOperation, service: hz:impl:mapService
INFO | jvm 1 | 2014/04/07 17:39:47 | W|2014-04-07 17:39:47,567|[hz._hzInstance_1_dev.response]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Retrying invocation: BasicInvocation{ serviceName='hz:impl:mapService', op=com.hazelcast.map.operation.MapSizeOperation@35b38e6, partitionId=131, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=100, callTimeout=60000, target=Address[10.100.0.12]:5701}, Reason: com.hazelcast.spi.exception.CallerNotMemberException: Not Member! caller:Address[10.100.0.11]:5701, partitionId: 131, operation: com.hazelcast.map.operation.MapSizeOperation, service: hz:impl:mapService
....
INFO | jvm 1 | 2014/04/07 17:40:35 | W|2014-04-07 17:40:35,060|[hz._hzInstance_1_dev.global-operation.thread-1]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] New join request has been received from an existing endpoint! => Member [10.100.0.12]:5701 Removing old member and processing join request...
INFO | jvm 1 | 2014/04/07 17:40:35 | I|2014-04-07 17:40:35,061|[hz._hzInstance_1_dev.global-operation.thread-1]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Removing Member [10.100.0.12]:5701
INFO | jvm 1 | 2014/04/07 17:40:35 | I|2014-04-07 17:40:35,066|[hz._hzInstance_1_dev.global-operation.thread-1]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2]
INFO | jvm 1 | 2014/04/07 17:40:35 |
INFO | jvm 1 | 2014/04/07 17:40:35 | Members [2] {
INFO | jvm 1 | 2014/04/07 17:40:35 | Member [10.100.0.11]:5701 this
INFO | jvm 1 | 2014/04/07 17:40:35 | Member [10.100.0.13]:5701
INFO | jvm 1 | 2014/04/07 17:40:35 | }
INFO | jvm 1 | 2014/04/07 17:40:35 |
INFO | jvm 1 | 2014/04/07 17:40:36 | I|2014-04-07 17:40:36,583|[hz._hzInstance_1_dev.migration]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Partition balance is ok, no need to re-partition cluster data...
INFO | jvm 1 | 2014/04/07 17:40:39 | W|2014-04-07 17:40:39,326|[hz._hzInstance_1_dev.async.thread-11]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Retrying invocation: BasicInvocation{ serviceName='hz:impl:mapService', op=com.hazelcast.spi.impl.EventServiceImpl$RegistrationOperation@48d77123, partitionId=-1, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=200, callTimeout=60000, target=Address[10.100.0.12]:5701}, Reason: com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target:Address[10.100.0.12]:5701, partitionId: -1, operation: com.hazelcast.spi.impl.EventServiceImpl$RegistrationOperation, service: hz:impl:mapService
INFO | jvm 1 | 2014/04/07 17:40:39 | W|2014-04-07 17:40:39,326|[hz._hzInstance_1_dev.async.thread-18]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Retrying invocation: BasicInvocation{ serviceName='hz:impl:mapService', op=com.hazelcast.spi.impl.EventServiceImpl$RegistrationOperation@c4dd10c, partitionId=-1, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=200, callTimeout=60000, target=Address[10.100.0.12]:5701}, Reason: com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target:Address[10.100.0.12]:5701, partitionId: -1, operation: com.hazelcast.spi.impl.EventServiceImpl$RegistrationOperation, service: hz:impl:mapService
INFO | jvm 1 | 2014/04/07 17:40:41 | I|2014-04-07 17:40:41,061|[hz._hzInstance_1_dev.global-operation.thread-2]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2]
INFO | jvm 1 | 2014/04/07 17:40:41 |
INFO | jvm 1 | 2014/04/07 17:40:41 | Members [3] {
INFO | jvm 1 | 2014/04/07 17:40:41 | Member [10.100.0.11]:5701 this
INFO | jvm 1 | 2014/04/07 17:40:41 | Member [10.100.0.13]:5701
INFO | jvm 1 | 2014/04/07 17:40:41 | Member [10.100.0.12]:5701
INFO | jvm 1 | 2014/04/07 17:40:41 | }
INFO | jvm 1 | 2014/04/07 17:40:41 |
INFO | jvm 1 | 2014/04/07 17:40:41 | I|2014-04-07 17:40:41,337|[hz._hzInstance_1_dev.migration]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Re-partitioning cluster data... Migration queue size: 90
INFO | jvm 1 | 2014/04/07 17:40:42 | I|2014-04-07 17:40:42,690|[hz._hzInstance_1_dev.migration]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] All migration tasks has been completed, queues are empty.
....
INFO | jvm 1 | 2014/04/07 17:41:00 | I|2014-04-07 17:41:00,573|[hz._hzInstance_1_dev.operation.thread-2]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Backup partition version is not matching version of the owner -> 101 -vs- 103
INFO | jvm 1 | 2014/04/07 17:41:00 | I|2014-04-07 17:41:00,573|[hz._hzInstance_1_dev.operation.thread-3]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Backup partition version is not matching version of the owner -> 135 -vs- 137
INFO | jvm 1 | 2014/04/07 17:41:00 | I|2014-04-07 17:41:00,573|[hz._hzInstance_1_dev.operation.thread-6]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Backup partition version is not matching version of the owner -> 161 -vs- 162
INFO | jvm 1 | 2014/04/07 17:41:00 | I|2014-04-07 17:41:00,573|[hz._hzInstance_1_dev.operation.thread-3]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Backup partition version is not matching version of the owner -> 152 -vs- 154
INFO | jvm 1 | 2014/04/07 17:41:00 | I|2014-04-07 17:41:00,574|[hz._hzInstance_1_dev.operation.thread-1]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Backup partition version is not matching version of the owner -> 112 -vs- 114
INFO | jvm 1 | 2014/04/07 17:41:00 | I|2014-04-07 17:41:00,574|[hz._hzInstance_1_dev.operation.thread-4]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Backup partition version is not matching version of the owner -> 81 -vs- 83
..
INFO | jvm 1 | 2014/04/07 17:52:59 | W|2014-04-07 17:52:59,884|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Address[10.100.0.11]:5701 couldn't ping Address[10.100.0.12]:5701
INFO | jvm 1 | 2014/04/07 17:52:59 | W|2014-04-07 17:52:59,884|[hz._hzInstance_1_dev.cached.thread-1]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Address[10.100.0.11]:5701 couldn't ping Address[10.100.0.13]:5701
INFO | jvm 1 | 2014/04/07 17:52:59 | I|2014-04-07 17:52:59,884|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Connection [Address[10.100.0.12]:5701] lost. Reason: Socket explicitly closed
INFO | jvm 1 | 2014/04/07 17:52:59 | I|2014-04-07 17:52:59,884|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Removing Member [10.100.0.12]:5701
INFO | jvm 1 | 2014/04/07 17:52:59 | I|2014-04-07 17:52:59,888|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2]
INFO | jvm 1 | 2014/04/07 17:52:59 |
INFO | jvm 1 | 2014/04/07 17:52:59 | Members [2] {
INFO | jvm 1 | 2014/04/07 17:52:59 | Member [10.100.0.11]:5701 this
INFO | jvm 1 | 2014/04/07 17:52:59 | Member [10.100.0.13]:5701
INFO | jvm 1 | 2014/04/07 17:52:59 | }
INFO | jvm 1 | 2014/04/07 17:52:59 |
INFO | jvm 1 | 2014/04/07 17:52:59 | I|2014-04-07 17:52:59,889|[hz._hzInstance_1_dev.cached.thread-1]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Connection [Address[10.100.0.13]:5701] lost. Reason: Socket explicitly closed
INFO | jvm 1 | 2014/04/07 17:52:59 | I|2014-04-07 17:52:59,889|[hz._hzInstance_1_dev.cached.thread-1]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Removing Member [10.100.0.13]:5701
INFO | jvm 1 | 2014/04/07 17:52:59 | I|2014-04-07 17:52:59,891|[hz._hzInstance_1_dev.cached.thread-1]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2]
INFO | jvm 1 | 2014/04/07 17:52:59 |
INFO | jvm 1 | 2014/04/07 17:52:59 | Members [1] {
INFO | jvm 1 | 2014/04/07 17:52:59 | Member [10.100.0.11]:5701 this
INFO | jvm 1 | 2014/04/07 17:52:59 | }
INFO | jvm 1 | 2014/04/07 17:52:59 |
INFO | jvm 1 | 2014/04/07 17:52:59 | I|2014-04-07 17:52:59,894|[hz._hzInstance_1_dev.IO.thread-Acceptor]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Accepting socket connection from /10.100.0.13:54577
INFO | jvm 1 | 2014/04/07 17:52:59 | I|2014-04-07 17:52:59,895|[hz._hzInstance_1_dev.IO.thread-Acceptor]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] 5701 accepted socket connection from /10.100.0.13:54577
INFO | jvm 1 | 2014/04/07 17:53:00 | I|2014-04-07 17:53:00,254|[hz._hzInstance_1_dev.IO.thread-Acceptor]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Accepting socket connection from /10.100.0.12:56674
INFO | jvm 1 | 2014/04/07 17:53:00 | I|2014-04-07 17:53:00,255|[hz._hzInstance_1_dev.IO.thread-Acceptor]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] 5701 accepted socket connection from /10.100.0.12:56674
INFO | jvm 1 | 2014/04/07 17:53:00 | W|2014-04-07 17:53:00,647|[hz._hzInstance_1_dev.global-operation.thread-4]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] MasterConfirmation has been received from Address[10.100.0.12]:5701, but it is not a member of this cluster!
INFO | jvm 1 | 2014/04/07 17:53:00 | I|2014-04-07 17:53:00,648|[hz._hzInstance_1_dev.IO.thread-in-0]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Connection [Address[10.100.0.12]:5701] lost. Reason: java.io.EOFException[Remote socket closed!]
INFO | jvm 1 | 2014/04/07 17:53:00 | W|2014-04-07 17:53:00,648|[hz._hzInstance_1_dev.IO.thread-in-0]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] hz._hzInstance_1_dev.IO.thread-in-0 Closing socket to endpoint Address[10.100.0.12]:5701, Cause:java.io.EOFException: Remote socket closed!
INFO | jvm 1 | 2014/04/07 17:53:00 | Exception in thread "hz._hzInstance_1_dev.cached.thread-1" java.lang.NullPointerException
INFO | jvm 1 | 2014/04/07 17:53:00 | at com.hazelcast.map.client.AbstractMapAddEntryListenerRequest$1.handleEvent(AbstractMapAddEntryListenerRequest.java:77)
INFO | jvm 1 | 2014/04/07 17:53:00 | at com.hazelcast.map.client.AbstractMapAddEntryListenerRequest$1.entryAdded(AbstractMapAddEntryListenerRequest.java:83)
INFO | jvm 1 | 2014/04/07 17:53:00 | at com.hazelcast.map.MapService.dispatchEvent(MapService.java:873)
INFO | jvm 1 | 2014/04/07 17:53:00 | at com.hazelcast.map.MapService.dispatchEvent(MapService.java:75)
INFO | jvm 1 | 2014/04/07 17:53:00 | at com.hazelcast.spi.impl.EventServiceImpl$EventPacketProcessor.process(EventServiceImpl.java:508)
INFO | jvm 1 | 2014/04/07 17:53:00 | at com.hazelcast.spi.impl.EventServiceImpl$RemoteEventPacketProcessor.run(EventServiceImpl.java:534)
INFO | jvm 1 | 2014/04/07 17:53:00 | at com.hazelcast.util.executor.StripedExecutor$Worker.run(StripedExecutor.java:142)
INFO | jvm 1 | 2014/04/07 17:53:00 | at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
INFO | jvm 1 | 2014/04/07 17:53:00 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
INFO | jvm 1 | 2014/04/07 17:53:00 | at java.lang.Thread.run(Thread.java:662)
INFO | jvm 1 | 2014/04/07 17:53:01 | I|2014-04-07 17:53:01,391|[hz._hzInstance_1_dev.migration]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Partition balance is ok, no need to re-partition cluster data...
INFO | jvm 1 | 2014/04/07 17:53:06 | Exception in thread "hz._hzInstance_1_dev.cached.thread-5" java.lang.NullPointerException
INFO | jvm 1 | 2014/04/07 17:53:06 | at com.hazelcast.map.client.AbstractMapAddEntryListenerRequest$1.handleEvent(AbstractMapAddEntryListenerRequest.java:77)
INFO | jvm 1 | 2014/04/07 17:53:06 | at com.hazelcast.map.client.AbstractMapAddEntryListenerRequest$1.entryEvicted(AbstractMapAddEntryListenerRequest.java:95)
INFO | jvm 1 | 2014/04/07 17:53:06 | at com.hazelcast.map.MapService.dispatchEvent(MapService.java:876)
INFO | jvm 1 | 2014/04/07 17:53:06 | at com.hazelcast.map.MapService.dispatchEvent(MapService.java:75)
INFO | jvm 1 | 2014/04/07 17:53:06 | at com.hazelcast.spi.impl.EventServiceImpl$EventPacketProcessor.process(EventServiceImpl.java:508)
INFO | jvm 1 | 2014/04/07 17:53:06 | at com.hazelcast.spi.impl.EventServiceImpl$RemoteEventPacketProcessor.run(EventServiceImpl.java:534)
INFO | jvm 1 | 2014/04/07 17:53:06 | at com.hazelcast.util.executor.StripedExecutor$Worker.run(StripedExecutor.java:142)
INFO | jvm 1 | 2014/04/07 17:53:06 | at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
INFO | jvm 1 | 2014/04/07 17:53:06 | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
INFO | jvm 1 | 2014/04/07 17:53:06 | at java.lang.Thread.run(Thread.java:662)
INFO | jvm 1 | 2014/04/07 17:53:06 | at com.hazelcast.util.executor.PoolExecutorThreadFactory$ManagedThread.run(PoolExecutorThreadFactory.java:59)
INFO | jvm 1 | 2014/04/07 17:53:09 | I|2014-04-07 17:53:09,886|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Re-sending sync replica request for partition: 33, replica: 1
INFO | jvm 1 | 2014/04/07 17:53:09 | I|2014-04-07 17:53:09,886|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] This node is now owner of partition, cannot sync replica -> partitionId: 33, replicaIndex: 1, partition-info: Partition [33]{
INFO | jvm 1 | 2014/04/07 17:53:09 | 0:Address[10.100.0.11]:5701
INFO | jvm 1 | 2014/04/07 17:53:09 | }
INFO | jvm 1 | 2014/04/07 17:53:09 | I|2014-04-07 17:53:09,886|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Re-sending sync replica request for partition: 35, replica: 1
INFO | jvm 1 | 2014/04/07 17:53:09 | I|2014-04-07 17:53:09,886|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] This node is now owner of partition, cannot sync replica -> partitionId: 35, replicaIndex: 1, partition-info: Partition [35]{
INFO | jvm 1 | 2014/04/07 17:53:09 | 0:Address[10.100.0.11]:5701
INFO | jvm 1 | 2014/04/07 17:53:09 | }
INFO | jvm 1 | 2014/04/07 17:53:09 | I|2014-04-07 17:53:09,886|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Re-sending sync replica request for partition: 36, replica: 1
INFO | jvm 1 | 2014/04/07 17:53:09 | I|2014-04-07 17:53:09,886|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] This node is now owner of partition, cannot sync replica -> partitionId: 36, replicaIndex: 1, partition-info: Partition [36]{
INFO | jvm 1 | 2014/04/07 17:53:09 | 0:Address[10.100.0.11]:5701
INFO | jvm 1 | 2014/04/07 17:53:09 | }
---
INFO | jvm 1 | 2014/04/07 17:53:11 | W|2014-04-07 17:53:11,320|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] No registration found for hz:impl:mapService / 49b783fc-7458-448c-b49c-c89f5dfe06ce
INFO | jvm 1 | 2014/04/07 17:53:12 | W|2014-04-07 17:53:12,567|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] No registration found for hz:impl:mapService / 49b783fc-7458-448c-b49c-c89f5dfe06ce
..
INFO | jvm 1 | 2014/04/07 17:53:21 | W|2014-04-07 17:53:21,916|[hz._hzInstance_1_dev.global-operation.thread-4]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] MasterConfirmation has been received from Address[10.100.0.13]:5701, but it is not a member of this cluster!
INFO | jvm 1 | 2014/04/07 17:53:21 | I|2014-04-07 17:53:21,917|[hz._hzInstance_1_dev.IO.thread-in-2]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Connection [Address[10.100.0.13]:5701] lost. Reason: java.io.EOFException[Remote socket closed!]
...
INFO | jvm 1 | 2014/04/07 17:53:40 | I|2014-04-07 17:53:40,311|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] 51919 accepted socket connection from /10.100.0.13:5701
INFO | jvm 1 | 2014/04/07 17:53:41 | I|2014-04-07 17:53:41,812|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Address[10.100.0.11]:5701 is merging to Address[10.100.0.13]:5701, because : joinRequest.getMemberCount() > currentMemberCount [2 > 1]
INFO | jvm 1 | 2014/04/07 17:53:41 | W|2014-04-07 17:53:41,813|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Address[10.100.0.11]:5701 is merging [tcp/ip] to Address[10.100.0.13]:5701
INFO | jvm 1 | 2014/04/07 17:53:41 | W|2014-04-07 17:53:41,813|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Preparing to merge... Waiting for merge instruction...
INFO | jvm 1 | 2014/04/07 17:53:41 | W|2014-04-07 17:53:41,814|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Address[10.100.0.11]:5701 is merging to Address[10.100.0.13]:5701, because: instructed by master Address[10.100.0.11]:5701
INFO | jvm 1 | 2014/04/07 17:53:41 | I|2014-04-07 17:53:41,815|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:88] [10.100.0.11]:5701 [dev] [3.2] Address[10.100.0.11]:5701 is MERGING
INFO | jvm 1 | 2014/04/07 17:53:41 | W|2014-04-07 17:53:41,902|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Owner of partition is being removed! Possible data loss for partition[0]. PartitionReplicaChangeEvent{partitionId=0, replicaIndex=0, oldAddress=Address[10.100.0.11]:5701, newAddress=null}
INFO | jvm 1 | 2014/04/07 17:53:41 | W|2014-04-07 17:53:41,902|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Owner of partition is being removed! Possible data loss for partition[1]. PartitionReplicaChangeEvent{partitionId=1, replicaIndex=0, oldAddress=Address[10.100.0.11]:5701, newAddress=null}
INFO | jvm 1 | 2014/04/07 17:53:41 | W|2014-04-07 17:53:41,902|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Owner of partition is being removed! Possible data loss for partition[2]. PartitionReplicaChangeEvent{partitionId=2, replicaIndex=0, oldAddress=Address[10.100.0.11]:5701, newAddress=null}
INFO | jvm 1 | 2014/04/07 17:53:41 | W|2014-04-07 17:53:41,903|[hz._hzInstance_1_dev.cached.thread-8]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Owner of partition is being removed! Possible data loss for partition[3]. PartitionReplicaChangeEvent{partitionId=3, replicaIndex=0, oldAddress=Address[10.100.0.11]:5701, newAddress=null}
....
INFO | jvm 1 | 2014/04/07 18:15:03 | W|2014-04-07 18:15:03,041|[hz._hzInstance_1_dev.cached.thread-4]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Master node has timed out its heartbeat and will be removed
INFO | jvm 1 | 2014/04/07 18:15:04 | W|2014-04-07 18:15:04,041|[hz._hzInstance_1_dev.cached.thread-6]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Master node has timed out its heartbeat and will be removed
I did run pings between the nodes (from the shell) at the time and didn't spot any network issues, so it would seem that the intra-member checks somehow get blocked or delayed?
I can share full logs from all nodes for some test runs if that would be helpful. Are there any AWS-specific tweaks that might help?
Best regards,
Timo
Hi Timo,
One thing we found that improves cluster stability is to use the G1GC garbage collector. This eliminated long GC pauses for us, which were occasionally causing problems. Our launch script includes these command line options (which also enable GC logging so we could see how much time was being taken by GC):
-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -verbose:gc
Note that this didn't completely cure our cluster stability problems, but it did seem to help. If you're not using G1GC, you might give it a try.
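To give a concrete picture, a simplified version of the relevant part of our launch command looks roughly like this (heap sizes, log path and jar name here are placeholders, not our actual values):

```shell
# Illustrative launch line: G1GC plus GC logging to a file
java -Xms2g -Xmx2g \
     -XX:+UseG1GC \
     -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -verbose:gc \
     -Xloggc:/var/log/app/gc.log \
     -jar app.jar
```

With the GC log in place you can check whether any pause approaches your heartbeat timeout, which is what would make other members declare the node dead.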
--
Joe
> ...
On Apr 7, 2014, at 9:49, Timo Ahokas <timo....@gmail.com> wrote:
> Hi Chris,
>
> Thanks - really appreciate any help & suggestions on this. We've been stuck on this for months, running into one error after another during our tests. It definitely looks like the member nodes lose connections between themselves, which leads to the data loss and all nodes getting disconnected.
>> ...
> On 7 April 2014 16:33, Noctarius <noctar...@googlemail.com> wrote:
> Hi Timo,
>
> I'm currently looking into your issue, and the only cause that comes to mind is a member being assumed dead because it does not respond within a certain amount of time (if I'm not mistaken, 10 seconds by default). Is it possible that under the load test the servers suffer from GC pauses or something similar that might make members think the others are dead?
>
> Do you see any log lines from the HealthMonitor indicating low memory or high queue sizes? Can you maybe provide a log file?
>
> Thanks,
> Chris
>
>
> Am 07.04.2014 um 11:39 schrieb Timo Ahokas <timo....@gmail.com>:
>
>> One additional note - we're using the default of 1 synchronous backup, so all data should be on 2 nodes (out of 3).
>> How is it possible that the cluster loses data without any of the nodes going down?
>>
>>
>> On Monday, 7 April 2014 10:35:38 UTC+3, Timo Ahokas wrote:
>> Hi,
>>
>> We are running on Hazelcast 3.2 and are still struggling to get our production load tests to pass.
>>
>> We utilize Hazelcast quite heavily (maps with listeners, locks, atomic numbers, queues, topics, etc.), but have recently started migrating data out into other services as we can't seem to keep Hazelcast stable....
INFO | jvm 1 | 2014/04/08 07:07:31 | W|2014-04-08 07:07:31,697|[hz._hzInstance_1_dev.cached.thread-17]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Address[10.100.0.11]:5701 couldn't ping Address[10.100.0.12]:5701
INFO | jvm 1 | 2014/04/08 07:07:37 | W|2014-04-08 07:07:37,163|[hz._hzInstance_1_dev.cached.thread-12]|[Slf4jFactory.java:90] [10.100.0.11]:5701 [dev] [3.2] Address[10.100.0.11]:5701 couldn't ping Address[10.100.0.12]:5701