hazelcast cluster timeout after 45 minutes of activity

329 views
Skip to first unread message

aruld...@gmail.com

unread,
Apr 8, 2016, 6:58:53 PM4/8/16
to Hazelcast
Hello,

I have 2 nodes in my hazelcast cluster. Both nodes run just fine for 45 minutes and times out after that. Both nodes are running ubuntu and only these ports (2434, 37000-37004) are opened for hazelcast in the firewall. Has anyone seen this before? Once this happens, nothing works in the application.

Here is my hazelcast configuration:

<hazelcast
xsi:schemaLocation="http://www.hazelcast.com/schema/config hazelcast-config-3.0.xsd"
xmlns="http://www.hazelcast.com/schema/config" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<group>
<name>masterslave</name>
<password>masterslave</password>
</group>
<network>
<port auto-increment="false">2434</port>
<outbound-ports>
<!-- only allow these ports for outbound -->
<ports>37000,37001,37002,37003,37004</ports>
</outbound-ports>
<join>
<multicast enabled="false">
<multicast-group>235.1.1.1</multicast-group>
<multicast-port>2434</multicast-port>
</multicast>
<tcp-ip enabled="true">
<member>192.168.161.41:2434</member>
<member>192.168.197.124:2434</member>
</tcp-ip>
</join>
</network>
<executor-service>
<pool-size>16</pool-size>
</executor-service>
</hazelcast>


This is the exception I get in the leader node:

2016-04-08 21:20:31:710 INFO  [192.168.161.41]:2434 [masterslave] [3.5.3] processors=2, physical.memory.total=2.0G, physical.memory.free=28.9M, swap.space.total=256.0M, swap.space.free=224.4M, heap.memory.used=1.1G, heap.memory.free=260.6M, heap.memory.total=1.4G, heap.memory.max=1.4G, heap.memory.used/total=81.42%, heap.memory.used/max=81.42%, minor.gc.count=49, minor.gc.time=2115ms, major.gc.count=5, major.gc.time=2866ms, load.process=0.00%, load.system=0.00%, load.systemAverage=0.00, thread.count=69, thread.peakCount=71, cluster.timeDiff=0, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operation.size=0, executor.q.priorityOperation.size=0, executor.q.response.size=0, operations.remote.size=2, operations.running.size=0, operations.pending.invocations.count=2, operations.pending.invocations.percentage=0.00%, proxy.count=10, clientEndpoint.count=0, connection.active.count=1, client.connection.count=0, connection.count=1 [HealthMonitor]
2016-04-08 21:27:01:793 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] No response for 120000 ms. InvocationFuture{invocation=Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=14, callId=1100, invocationTime=1460150643259, waitTimeout=-1740001, callTimeout=60000}, partitionId=14, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.197.124]:2434, backupsExpected=0, backupsCompleted=0}, response=null, done=false} [Invocation]
2016-04-08 21:27:01:796 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] Asking if operation execution has been started: Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=14, callId=1100, invocationTime=1460150643259, waitTimeout=-1740001, callTimeout=60000}, partitionId=14, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.197.124]:2434, backupsExpected=0, backupsCompleted=0} [Invocation][192.168.161.41]:2434 [masterslave] [3.5.3] While asking 'is-executing': Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=14, callId=1100, invocationTime=1460150643259, waitTimeout=-1740001, callTimeout=60000}, partitionId=14, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.197.124]:2434, backupsExpected=0, backupsCompleted=0}
java.util.concurrent.TimeoutException: Call Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.spi.impl.operationservice.impl.operations.IsStillExecutingOperation{serviceName='hz:impl:queueService', partitionId=-1, callId=1103, invocationTime=1460150821795, waitTimeout=-1, callTimeout=5000}, partitionId=-1, replicaIndex=0, tryCount=0, tryPauseMillis=0, invokeCount=1, callTimeout=5000, target=Address[192.168.197.124]:2434, backupsExpected=0, backupsCompleted=0} encountered a timeout
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponse(InvocationFuture.java:366)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponseOrThrowException(InvocationFuture.java:334)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:225)
    at com.hazelcast.spi.impl.operationservice.impl.IsStillRunningService.isOperationExecuting(IsStillRunningService.java:85)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.waitForResponse(InvocationFuture.java:275)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:224)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
    at com.hazelcast.collection.impl.queue.QueueProxySupport.invokeAndGet(QueueProxySupport.java:177)
    at com.hazelcast.collection.impl.queue.QueueProxySupport.pollInternal(QueueProxySupport.java:124)
    at com.hazelcast.collection.impl.queue.QueueProxyImpl.poll(QueueProxyImpl.java:88)
    at com.hazelcast.collection.impl.queue.QueueProxyImpl.take(QueueProxyImpl.java:82)
    at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.nextMessage(ODistributedWorker.java:230)
    at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.readRequest(ODistributedWorker.java:194)
    at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.run(ODistributedWorker.java:97)

2016-04-08 21:27:06:814 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] 'is-executing': false -> Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=14, callId=1100, invocationTime=1460150643259, waitTimeout=-1740001, callTimeout=60000}, partitionId=14, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.197.124]:2434, backupsExpected=0, backupsCompleted=0} [Invocation][master] error on executing distributed request -1: -
com.hazelcast.core.OperationTimeoutException: No response for 1920000 ms. Aborting invocation! Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=14, callId=1100, invocationTime=1460150643259, waitTimeout=-1740001, callTimeout=60000}, partitionId=14, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.197.124]:2434, backupsExpected=0, backupsCompleted=0} No response has been received!  backups-expected:0 backups-completed: 0
    at com.hazelcast.spi.impl.operationservice.impl.Invocation.newOperationTimeoutException(Invocation.java:491)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.waitForResponse(InvocationFuture.java:277)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:224)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
    at com.hazelcast.collection.impl.queue.QueueProxySupport.invokeAndGet(QueueProxySupport.java:177)
    at com.hazelcast.collection.impl.queue.QueueProxySupport.pollInternal(QueueProxySupport.java:124)
    at com.hazelcast.collection.impl.queue.QueueProxyImpl.poll(QueueProxyImpl.java:88)
    at com.hazelcast.collection.impl.queue.QueueProxyImpl.take(QueueProxyImpl.java:82)
    at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.nextMessage(ODistributedWorker.java:230)
    at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.readRequest(ODistributedWorker.java:194)
    at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.run(ODistributedWorker.java:97)
    at ------ End remote and begin local stack-trace ------.(Unknown Source)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponse(InvocationFuture.java:384)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponseOrThrowException(InvocationFuture.java:334)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:225)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
    at com.hazelcast.collection.impl.queue.QueueProxySupport.invokeAndGet(QueueProxySupport.java:177)
    at com.hazelcast.collection.impl.queue.QueueProxySupport.pollInternal(QueueProxySupport.java:124)
    at com.hazelcast.collection.impl.queue.QueueProxyImpl.poll(QueueProxyImpl.java:88)
    at com.hazelcast.collection.impl.queue.QueueProxyImpl.take(QueueProxyImpl.java:82)
    at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.nextMessage(ODistributedWorker.java:230)
    at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.readRequest(ODistributedWorker.java:194)
    at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.run(ODistributedWorker.java:97)

2016-04-08 21:27:12:339 INFO  [192.168.161.41]:2434 [masterslave] [3.5.3] Handled 1 invocation timeouts and 0 backupTimeouts [OperationService]
2016-04-08 21:29:06:818 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] No response for 120001 ms. InvocationFuture{invocation=Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=14, callId=1105, invocationTime=1460150826817, waitTimeout=-1, callTimeout=60000}, partitionId=14, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.197.124]:2434, backupsExpected=0, backupsCompleted=0}, response=null, done=false} [Invocation]
2016-04-08 21:29:06:819 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] Asking if operation execution has been started: Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=14, callId=1105, invocationTime=1460150826817, waitTimeout=-1, callTimeout=60000}, partitionId=14, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.197.124]:2434, backupsExpected=0, backupsCompleted=0} [Invocation][192.168.161.41]:2434 [masterslave] [3.5.3] While asking 'is-executing': Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=14, callId=1105, invocationTime=1460150826817, waitTimeout=-1, callTimeout=60000}, partitionId=14, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.197.124]:2434, backupsExpected=0, backupsCompleted=0}
java.util.concurrent.TimeoutException: Call Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.spi.impl.operationservice.impl.operations.IsStillExecutingOperation{serviceName='hz:impl:queueService', partitionId=-1, callId=1108, invocationTime=1460150946818, waitTimeout=-1, callTimeout=5000}, partitionId=-1, replicaIndex=0, tryCount=0, tryPauseMillis=0, invokeCount=1, callTimeout=5000, target=Address[192.168.197.124]:2434, backupsExpected=0, backupsCompleted=0} encountered a timeout
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponse(InvocationFuture.java:366)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponseOrThrowException(InvocationFuture.java:334)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:225)
    at com.hazelcast.spi.impl.operationservice.impl.IsStillRunningService.isOperationExecuting(IsStillRunningService.java:85)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.waitForResponse(InvocationFuture.java:275)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:224)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
    at com.hazelcast.collection.impl.queue.QueueProxySupport.invokeAndGet(QueueProxySupport.java:177)
    at com.hazelcast.collection.impl.queue.QueueProxySupport.pollInternal(QueueProxySupport.java:124)
    at com.hazelcast.collection.impl.queue.QueueProxyImpl.poll(QueueProxyImpl.java:88)
    at com.hazelcast.collection.impl.queue.QueueProxyImpl.take(QueueProxyImpl.java:82)
    at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.nextMessage(ODistributedWorker.java:230)
    at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.readRequest(ODistributedWorker.java:194)
    at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.run(ODistributedWorker.java:97)

2016-04-08 21:29:11:821 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] 'is-executing': false -> Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=14, callId=1105, invocationTime=1460150826817, waitTimeout=-1, callTimeout=60000}, partitionId=14, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.197.124]:2434, backupsExpected=0, backupsCompleted=0} [Invocation][master] error on executing distributed request -1: -
com.hazelcast.core.OperationTimeoutException: No response for 120000 ms. Aborting invocation! Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=14, callId=1105, invocationTime=1460150826817, waitTimeout=-1, callTimeout=60000}, partitionId=14, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.197.124]:2434, backupsExpected=0, backupsCompleted=0} No response has been received!  backups-expected:0 backups-completed: 0
    at com.hazelcast.spi.impl.operationservice.impl.Invocation.newOperationTimeoutException(Invocation.java:491)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.waitForResponse(InvocationFuture.java:277)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:224)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
    at com.hazelcast.collection.impl.queue.QueueProxySupport.invokeAndGet(QueueProxySupport.java:177)
    at com.hazelcast.collection.impl.queue.QueueProxySupport.pollInternal(QueueProxySupport.java:124)
    at com.hazelcast.collection.impl.queue.QueueProxyImpl.poll(QueueProxyImpl.java:88)
    at com.hazelcast.collection.impl.queue.QueueProxyImpl.take(QueueProxyImpl.java:82)
    at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.nextMessage(ODistributedWorker.java:230)
    at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.readRequest(ODistributedWorker.java:194)
    at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.run(ODistributedWorker.java:97)
    at ------ End remote and begin local stack-trace ------.(Unknown Source)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponse(InvocationFuture.java:384)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponseOrThrowException(InvocationFuture.java:334)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:225)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
    at com.hazelcast.collection.impl.queue.QueueProxySupport.invokeAndGet(QueueProxySupport.java:177)
    at com.hazelcast.collection.impl.queue.QueueProxySupport.pollInternal(QueueProxySupport.java:124)
    at com.hazelcast.collection.impl.queue.QueueProxyImpl.poll(QueueProxyImpl.java:88)
    at com.hazelcast.collection.impl.queue.QueueProxyImpl.take(QueueProxyImpl.java:82)
    at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.nextMessage(ODistributedWorker.java:230)
    at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.readRequest(ODistributedWorker.java:194)
    at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.run(ODistributedWorker.java:97)

2016-04-08 21:29:17:361 INFO  [192.168.161.41]:2434 [masterslave] [3.5.3] Handled 1 invocation timeouts and 0 backupTimeouts [OperationService]
2016-04-08 21:29:20:470 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] Removing Member [192.168.197.124]:2434 because it has not sent any heartbeats for 300000 ms. [ClusterService]
2016-04-08 21:29:21:480 INFO  [192.168.161.41]:2434 [masterslave] [3.5.3] Connecting to /192.168.197.124:2434, timeout: 0, bind-any: true [SocketConnector]
2016-04-08 21:29:21:480 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] This node does not have a connection to Member [192.168.197.124]:2434 [ClusterService]
2016-04-08 21:29:21:481 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] Removing Member [192.168.197.124]:2434 because it has not sent any heartbeats for 300000 ms. [ClusterService]
2016-04-08 21:29:21:483 INFO  [192.168.161.41]:2434 [masterslave] [3.5.3] Established socket connection between /192.168.161.41:37001 [TcpIpConnectionManager]
2016-04-08 21:29:23:475 INFO  [192.168.161.41]:2434 [masterslave] [3.5.3] Connection [Address[192.168.197.124]:2434] lost. Reason: Socket explicitly closed [TcpIpConnection]
2016-04-08 21:29:23:475 INFO  [192.168.161.41]:2434 [masterslave] [3.5.3] Removing Member [192.168.197.124]:2434 [ClusterService]
2016-04-08 21:29:23:493 INFO  [192.168.161.41]:2434 [masterslave] [3.5.3]

Members [1] {
    Member [192.168.161.41]:2434 this
}
 [ClusterService]
2016-04-08 21:29:23:496 INFO  [192.168.161.41]:2434 [masterslave] [3.5.3] Connection [Address[192.168.197.124]:2434] lost. Reason: Socket explicitly closed [TcpIpConnection]
2016-04-08 21:29:23:509 INFO  [master] received removed status slave.orientdb=null [OHazelcastPlugin]

Then, this keeps logging indefinitely in the leader node:

2016-04-08 21:57:24:815 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] Op: com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=161, callId=0, invocationTime=1460151061852, waitTimeout=-2100001, callTimeout=60000}, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[192.168.161.41]:2434, target:Address[192.168.197.124]:2434, partitionId: 161, replicaIndex: 0, operation: com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation, service: hz:impl:queueService [WaitingOperation]
2016-04-08 21:57:25:815 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] Op: com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=161, callId=0, invocationTime=1460151061852, waitTimeout=-2100001, callTimeout=60000}, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[192.168.161.41]:2434, target:Address[192.168.197.124]:2434, partitionId: 161, replicaIndex: 0, operation: com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation, service: hz:impl:queueService [WaitingOperation]
2016-04-08 21:57:26:815 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] Op: com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=161, callId=0, invocationTime=1460151061852, waitTimeout=-2100001, callTimeout=60000}, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[192.168.161.41]:2434, target:Address[192.168.197.124]:2434, partitionId: 161, replicaIndex: 0, operation: com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation, service: hz:impl:queueService [WaitingOperation]
2016-04-08 21:57:27:815 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] Op: com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=161, callId=0, invocationTime=1460151061852, waitTimeout=-2100001, callTimeout=60000}, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[192.168.161.41]:2434, target:Address[192.168.197.124]:2434, partitionId: 161, replicaIndex: 0, operation: com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation, service: hz:impl:queueService [WaitingOperation]
2016-04-08 21:57:28:815 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] Op: com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=161, callId=0, invocationTime=1460151061852, waitTimeout=-2100001, callTimeout=60000}, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[192.168.161.41]:2434, target:Address[192.168.197.124]:2434, partitionId: 161, replicaIndex: 0, operation: com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation, service: hz:impl:queueService [WaitingOperation]
2016-04-08 21:57:29:816 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] Op: com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=161, callId=0, invocationTime=1460151061852, waitTimeout=-2100001, callTimeout=60000}, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[192.168.161.41]:2434, target:Address[192.168.197.124]:2434, partitionId: 161, replicaIndex: 0, operation: com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation, service: hz:impl:queueService [WaitingOperation]



Thank you,
Arul

Jaromir Hamala

unread,
Apr 10, 2016, 4:04:20 AM4/10/16
to Hazelcast
Hello Arul,

could you share logs from the other (=192.168.197.124) member? 

Cheers,
Jaromir

Arul Dhesiaseelan

unread,
Apr 10, 2016, 8:48:20 PM4/10/16
to haze...@googlegroups.com
Sure, here is the log from the other node.

2016-04-08 21:27:04:682 INFO  [192.168.197.124]:2434 [masterslave] [3.5.3] processors=2, physical.memory.total=2.0G, physical.memory.free=19.3M, swap.space.total=256.0M, swap.space.free=227.0M, heap.memory.used=1.1G, heap.memory.free=333.5M, heap.memory.total=1.4G, heap.memory.max=1.4G, heap.memory.used/total=76.83%, heap.memory.used/max=76.83%, minor.gc.count=44, minor.gc.time=2001ms, major.gc.count=0, major.gc.time=0ms, load.process=0.00%, load.system=0.00%, load.systemAverage=0.06, thread.count=66, thread.peakCount=67, cluster.timeDiff=-5, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operation.size=0, executor.q.priorityOperation.size=0, executor.q.response.size=0, operations.remote.size=2, operations.running.size=0, operations.pending.invocations.count=2, operations.pending.invocations.percentage=0.00%, proxy.count=10, clientEndpoint.count=0, connection.active.count=1, client.connection.count=0, connection.count=1 [HealthMonitor]
2016-04-08 21:27:34:190 WARNI [192.168.197.124]:2434 [masterslave] [3.5.3] No response for 120000 ms. InvocationFuture{invocation=Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=255, callId=597, invocationTime=1460150616293, waitTimeout=-2280001, callTimeout=60000}, partitionId=255, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.161.41]:2434, backupsExpected=0, backupsCompleted=0}, response=null, done=false} [Invocation]
2016-04-08 21:27:34:193 WARNI [192.168.197.124]:2434 [masterslave] [3.5.3] Asking if operation execution has been started: Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=255, callId=597, invocationTime=1460150616293, waitTimeout=-2280001, callTimeout=60000}, partitionId=255, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.161.41]:2434, backupsExpected=0, backupsCompleted=0} [Invocation]
2016-04-08 21:27:34:684 INFO  [192.168.197.124]:2434 [masterslave] [3.5.3] processors=2, physical.memory.total=2.0G, physical.memory.free=19.4M, swap.space.total=256.0M, swap.space.free=227.1M, heap.memory.used=1.1G, heap.memory.free=332.8M, heap.memory.total=1.4G, heap.memory.max=1.4G, heap.memory.used/total=76.87%, heap.memory.used/max=76.87%, minor.gc.count=44, minor.gc.time=2001ms, major.gc.count=0, major.gc.time=0ms, load.process=0.00%, load.system=0.00%, load.systemAverage=0.04, thread.count=66, thread.peakCount=67, cluster.timeDiff=-5, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operation.size=0, executor.q.priorityOperation.size=0, executor.q.response.size=0, operations.remote.size=3, operations.running.size=0, operations.pending.invocations.count=3, operations.pending.invocations.percentage=0.00%, proxy.count=10, clientEndpoint.count=0, connection.active.count=1, client.connection.count=0, connection.count=1 [HealthMonitor][192.168.197.124]:2434 [masterslave] [3.5.3] While asking 'is-executing': Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=255, callId=597, invocationTime=1460150616293, waitTimeout=-2280001, callTimeout=60000}, partitionId=255, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.161.41]:2434, backupsExpected=0, backupsCompleted=0}
java.util.concurrent.TimeoutException: Call Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.spi.impl.operationservice.impl.operations.IsStillExecutingOperation{serviceName='hz:impl:queueService', partitionId=-1, callId=602, invocationTime=1460150854187, waitTimeout=-1, callTimeout=5000}, partitionId=-1, replicaIndex=0, tryCount=0, tryPauseMillis=0, invokeCount=1, callTimeout=5000, target=Address[192.168.161.41]:2434, backupsExpected=0, backupsCompleted=0} encountered a timeout

        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponse(InvocationFuture.java:366)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponseOrThrowException(InvocationFuture.java:334)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:225)
        at com.hazelcast.spi.impl.operationservice.impl.IsStillRunningService.isOperationExecuting(IsStillRunningService.java:85)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.waitForResponse(InvocationFuture.java:275)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:224)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
        at com.hazelcast.collection.impl.queue.QueueProxySupport.invokeAndGet(QueueProxySupport.java:177)
        at com.hazelcast.collection.impl.queue.QueueProxySupport.pollInternal(QueueProxySupport.java:124)
        at com.hazelcast.collection.impl.queue.QueueProxyImpl.poll(QueueProxyImpl.java:88)
        at com.hazelcast.collection.impl.queue.QueueProxyImpl.take(QueueProxyImpl.java:82)
        at com.orientechnologies.orient.server.hazelcast.OHazelcastDistributedMessageService$2.run(OHazelcastDistributedMessageService.java:110)
        at java.lang.Thread.run(Thread.java:745)

2016-04-08 21:27:39:201 WARNI [192.168.197.124]:2434 [masterslave] [3.5.3] 'is-executing': false -> Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=255, callId=597, invocationTime=1460150616293, waitTimeout=-2280001, callTimeout=60000}, partitionId=255, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.161.41]:2434, backupsExpected=0, backupsCompleted=0} [Invocation][slave] error on reading distributed response
com.hazelcast.core.OperationTimeoutException: No response for 2520000 ms. Aborting invocation! Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=255, callId=597, invocationTime=1460150616293, waitTimeout=-2280001, callTimeout=60000}, partitionId=255, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.161.41]:2434, backupsExpected=0, backupsCompleted=0} No response has been received!  backups-expected:0 backups-completed: 0

        at com.hazelcast.spi.impl.operationservice.impl.Invocation.newOperationTimeoutException(Invocation.java:491)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.waitForResponse(InvocationFuture.java:277)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:224)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
        at com.hazelcast.collection.impl.queue.QueueProxySupport.invokeAndGet(QueueProxySupport.java:177)
        at com.hazelcast.collection.impl.queue.QueueProxySupport.pollInternal(QueueProxySupport.java:124)
        at com.hazelcast.collection.impl.queue.QueueProxyImpl.poll(QueueProxyImpl.java:88)
        at com.hazelcast.collection.impl.queue.QueueProxyImpl.take(QueueProxyImpl.java:82)
        at com.orientechnologies.orient.server.hazelcast.OHazelcastDistributedMessageService$2.run(OHazelcastDistributedMessageService.java:110)
        at java.lang.Thread.run(Thread.java:745)

        at ------ End remote and begin local stack-trace ------.(Unknown Source)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponse(InvocationFuture.java:384)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponseOrThrowException(InvocationFuture.java:334)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:225)
        at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
        at com.hazelcast.collection.impl.queue.QueueProxySupport.invokeAndGet(QueueProxySupport.java:177)
        at com.hazelcast.collection.impl.queue.QueueProxySupport.pollInternal(QueueProxySupport.java:124)
        at com.hazelcast.collection.impl.queue.QueueProxyImpl.poll(QueueProxyImpl.java:88)
        at com.hazelcast.collection.impl.queue.QueueProxyImpl.take(QueueProxyImpl.java:82)
        at com.orientechnologies.orient.server.hazelcast.OHazelcastDistributedMessageService$2.run(OHazelcastDistributedMessageService.java:110)
        at java.lang.Thread.run(Thread.java:745)

2016-04-08 21:27:45:129 INFO  [192.168.197.124]:2434 [masterslave] [3.5.3] Handled 1 invocation timeouts and 0 backupTimeouts [OperationService]
2016-04-08 21:28:04:686 INFO  [192.168.197.124]:2434 [masterslave] [3.5.3] processors=2, physical.memory.total=2.0G, physical.memory.free=22.1M, swap.space.total=256.0M, swap.space.free=227.1M, heap.memory.used=1.1G, heap.memory.free=328.4M, heap.memory.total=1.4G, heap.memory.max=1.4G, heap.memory.used/total=77.18%, heap.memory.used/max=77.18%, minor.gc.count=44, minor.gc.time=2001ms, major.gc.count=0, major.gc.time=0ms, load.process=0.00%, load.system=0.00%, load.systemAverage=0.02, thread.count=66, thread.peakCount=67, cluster.timeDiff=-5, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operation.size=0, executor.q.priorityOperation.size=0, executor.q.response.size=0, operations.remote.size=2, operations.running.size=0, operations.pending.invocations.count=2, operations.pending.invocations.percentage=0.00%, proxy.count=10, clientEndpoint.count=0, connection.active.count=1, client.connection.count=0, connection.count=1 [HealthMonitor]
2016-04-08 21:28:34:689 INFO  [192.168.197.124]:2434 [masterslave] [3.5.3] processors=2, physical.memory.total=2.0G, physical.memory.free=22.1M, swap.space.total=256.0M, swap.space.free=227.1M, heap.memory.used=1.1G, heap.memory.free=328.3M, heap.memory.total=1.4G, heap.memory.max=1.4G, heap.memory.used/total=77.19%, heap.memory.used/max=77.19%, minor.gc.count=44, minor.gc.time=2001ms, major.gc.count=0, major.gc.time=0ms, load.process=0.00%, load.system=0.00%, load.systemAverage=0.01, thread.count=66, thread.peakCount=67, cluster.timeDiff=-5, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operation.size=0, executor.q.priorityOperation.size=0, executor.q.response.size=0, operations.remote.size=2, operations.running.size=0, operations.pending.invocations.count=2, operations.pending.invocations.percentage=0.00%, proxy.count=10, clientEndpoint.count=0, connection.active.count=1, client.connection.count=0, connection.count=1 [HealthMonitor]
2016-04-08 21:29:04:698 INFO  [192.168.197.124]:2434 [masterslave] [3.5.3] processors=2, physical.memory.total=2.0G, physical.memory.free=22.1M, swap.space.total=256.0M, swap.space.free=227.1M, heap.memory.used=1.1G, heap.memory.free=327.7M, heap.memory.total=1.4G, heap.memory.max=1.4G, heap.memory.used/total=77.23%, heap.memory.used/max=77.23%, minor.gc.count=44, minor.gc.time=2001ms, major.gc.count=0, major.gc.time=0ms, load.process=0.00%, load.system=0.00%, load.systemAverage=0.01, thread.count=66, thread.peakCount=67, cluster.timeDiff=-5, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operation.size=0, executor.q.priorityOperation.size=0, executor.q.response.size=0, operations.remote.size=2, operations.running.size=0, operations.pending.invocations.count=2, operations.pending.invocations.percentage=0.00%, proxy.count=10, clientEndpoint.count=0, connection.active.count=1, client.connection.count=0, connection.count=1 [HealthMonitor]
2016-04-08 21:29:21:009 WARNI [192.168.197.124]:2434 [masterslave] [3.5.3] Removing Member [192.168.161.41]:2434 because it has not sent any heartbeats for 300000 ms. [ClusterService]
2016-04-08 21:29:21:010 INFO  [192.168.197.124]:2434 [masterslave] [3.5.3] Master Address[192.168.161.41]:2434 left the cluster. Assigning new master Member [192.168.197.124]:2434 this [ClusterService]
2016-04-08 21:29:21:013 INFO  [192.168.197.124]:2434 [masterslave] [3.5.3] Connection [Address[192.168.161.41]:2434] lost. Reason: Socket explicitly closed [TcpIpConnection]
2016-04-08 21:29:21:013 INFO  [192.168.197.124]:2434 [masterslave] [3.5.3] Removing Member [192.168.161.41]:2434 [ClusterService]
2016-04-08 21:29:21:040 INFO  [192.168.197.124]:2434 [masterslave] [3.5.3]

Members [1] {
        Member [192.168.197.124]:2434 this
}




--
You received this message because you are subscribed to a topic in the Google Groups "Hazelcast" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/hazelcast/f6ERyooPUbA/unsubscribe.
To unsubscribe from this group and all its topics, send an email to hazelcast+...@googlegroups.com.
To post to this group, send email to haze...@googlegroups.com.
Visit this group at https://groups.google.com/group/hazelcast.
To view this discussion on the web visit https://groups.google.com/d/msgid/hazelcast/9f855ffb-c937-41bc-bf40-e88f56862989%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Jaromir Hamala

unread,
Apr 11, 2016, 8:54:44 AM4/11/16
to Hazelcast, ar...@acm.org
thanks! Is it easy to reproduce? Is the error always after 45 minutes? Could you try it with the latest 3.5 release? (3.5.5)

Cheers,
Jaromir
...

aruld...@gmail.com

unread,
Apr 11, 2016, 12:00:57 PM4/11/16
to Hazelcast, ar...@acm.org
I am running another test to see if it times out exactly around 45 minutes. I can try with 3.5.5. Is this a known issue in 3.5.3?

Will keep you posted.

Thank you,
Arul
...

Jaromir Hamala

unread,
Apr 20, 2016, 10:34:05 AM4/20/16
to Hazelcast, ar...@acm.org
Hi,

Do you have any result you could share? I am not aware of a timeout related issue, but we do not know what's causing it in your case.

Cheers,
Jaromir
...

Arul Dhesiaseelan

unread,
Apr 20, 2016, 11:33:00 AM4/20/16
to haze...@googlegroups.com
Hi Jaromir,

Sorry for the late response. Both the servers were secured with ipsec and rekeying was configured to run every 45 minutes, which happens over UDP and those ports were blocked on the firewall. Opening those UDP ports fixed the network connectivity after rekeying and hazelcast worked correctly. Sorry for the false alarm.

Thanks,
Arul

--
You received this message because you are subscribed to a topic in the Google Groups "Hazelcast" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/hazelcast/f6ERyooPUbA/unsubscribe.
To unsubscribe from this group and all its topics, send an email to hazelcast+...@googlegroups.com.
To post to this group, send email to haze...@googlegroups.com.
Visit this group at https://groups.google.com/group/hazelcast.

Jaromir Hamala

unread,
Apr 21, 2016, 3:17:07 AM4/21/16
to Hazelcast, ar...@acm.org
Hi Arul,

that's great to hear! Thanks for heads-up.

Cheers,
Jaromir
Reply all
Reply to author
Forward
0 new messages