Hello,
I have 2 nodes in my hazelcast cluster. Both nodes run just fine for 45 minutes and times out after that. Both nodes are running ubuntu and only these ports (2434, 37000-37004) are opened for hazelcast in the firewall. Has anyone seen this before? Once this happens, nothing works in the application.
Here is my hazelcast configuration:
<hazelcast
xsi:schemaLocation="http://www.hazelcast.com/schema/config hazelcast-config-3.0.xsd"
xmlns="http://www.hazelcast.com/schema/config" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<group>
<name>masterslave</name>
<password>masterslave</password>
</group>
<network>
<port auto-increment="false">2434</port>
<outbound-ports>
<!-- only allow these ports for outbound -->
<ports>37000,37001,37002,37003,37004</ports>
</outbound-ports>
<join>
<multicast enabled="false">
<multicast-group>235.1.1.1</multicast-group>
<multicast-port>2434</multicast-port>
</multicast>
<tcp-ip enabled="true">
<member>192.168.161.41:2434</member>
<member>192.168.197.124:2434</member>
</tcp-ip>
</join>
</network>
<executor-service>
<pool-size>16</pool-size>
</executor-service>
</hazelcast>
This is the exception I get in the leader node:
2016-04-08 21:20:31:710 INFO [192.168.161.41]:2434 [masterslave] [3.5.3] processors=2, physical.memory.total=2.0G, physical.memory.free=28.9M, swap.space.total=256.0M, swap.space.free=224.4M, heap.memory.used=1.1G, heap.memory.free=260.6M, heap.memory.total=1.4G, heap.memory.max=1.4G, heap.memory.used/total=81.42%, heap.memory.used/max=81.42%, minor.gc.count=49, minor.gc.time=2115ms, major.gc.count=5, major.gc.time=2866ms, load.process=0.00%, load.system=0.00%, load.systemAverage=0.00, thread.count=69, thread.peakCount=71, cluster.timeDiff=0, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operation.size=0, executor.q.priorityOperation.size=0, executor.q.response.size=0, operations.remote.size=2, operations.running.size=0, operations.pending.invocations.count=2, operations.pending.invocations.percentage=0.00%, proxy.count=10, clientEndpoint.count=0, connection.active.count=1, client.connection.count=0, connection.count=1 [HealthMonitor]
2016-04-08 21:27:01:793 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] No response for 120000 ms. InvocationFuture{invocation=Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=14, callId=1100, invocationTime=1460150643259, waitTimeout=-1740001, callTimeout=60000}, partitionId=14, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.197.124]:2434, backupsExpected=0, backupsCompleted=0}, response=null, done=false} [Invocation]
2016-04-08 21:27:01:796 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] Asking if operation execution has been started: Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=14, callId=1100, invocationTime=1460150643259, waitTimeout=-1740001, callTimeout=60000}, partitionId=14, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.197.124]:2434, backupsExpected=0, backupsCompleted=0} [Invocation][192.168.161.41]:2434 [masterslave] [3.5.3] While asking 'is-executing': Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=14, callId=1100, invocationTime=1460150643259, waitTimeout=-1740001, callTimeout=60000}, partitionId=14, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.197.124]:2434, backupsExpected=0, backupsCompleted=0}
java.util.concurrent.TimeoutException: Call Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.spi.impl.operationservice.impl.operations.IsStillExecutingOperation{serviceName='hz:impl:queueService', partitionId=-1, callId=1103, invocationTime=1460150821795, waitTimeout=-1, callTimeout=5000}, partitionId=-1, replicaIndex=0, tryCount=0, tryPauseMillis=0, invokeCount=1, callTimeout=5000, target=Address[192.168.197.124]:2434, backupsExpected=0, backupsCompleted=0} encountered a timeout
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponse(InvocationFuture.java:366)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponseOrThrowException(InvocationFuture.java:334)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:225)
at com.hazelcast.spi.impl.operationservice.impl.IsStillRunningService.isOperationExecuting(IsStillRunningService.java:85)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.waitForResponse(InvocationFuture.java:275)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:224)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
at com.hazelcast.collection.impl.queue.QueueProxySupport.invokeAndGet(QueueProxySupport.java:177)
at com.hazelcast.collection.impl.queue.QueueProxySupport.pollInternal(QueueProxySupport.java:124)
at com.hazelcast.collection.impl.queue.QueueProxyImpl.poll(QueueProxyImpl.java:88)
at com.hazelcast.collection.impl.queue.QueueProxyImpl.take(QueueProxyImpl.java:82)
at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.nextMessage(ODistributedWorker.java:230)
at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.readRequest(ODistributedWorker.java:194)
at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.run(ODistributedWorker.java:97)
2016-04-08 21:27:06:814 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] 'is-executing': false -> Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=14, callId=1100, invocationTime=1460150643259, waitTimeout=-1740001, callTimeout=60000}, partitionId=14, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.197.124]:2434, backupsExpected=0, backupsCompleted=0} [Invocation][master] error on executing distributed request -1: -
com.hazelcast.core.OperationTimeoutException: No response for 1920000 ms. Aborting invocation! Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=14, callId=1100, invocationTime=1460150643259, waitTimeout=-1740001, callTimeout=60000}, partitionId=14, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.197.124]:2434, backupsExpected=0, backupsCompleted=0} No response has been received! backups-expected:0 backups-completed: 0
at com.hazelcast.spi.impl.operationservice.impl.Invocation.newOperationTimeoutException(Invocation.java:491)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.waitForResponse(InvocationFuture.java:277)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:224)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
at com.hazelcast.collection.impl.queue.QueueProxySupport.invokeAndGet(QueueProxySupport.java:177)
at com.hazelcast.collection.impl.queue.QueueProxySupport.pollInternal(QueueProxySupport.java:124)
at com.hazelcast.collection.impl.queue.QueueProxyImpl.poll(QueueProxyImpl.java:88)
at com.hazelcast.collection.impl.queue.QueueProxyImpl.take(QueueProxyImpl.java:82)
at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.nextMessage(ODistributedWorker.java:230)
at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.readRequest(ODistributedWorker.java:194)
at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.run(ODistributedWorker.java:97)
at ------ End remote and begin local stack-trace ------.(Unknown Source)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponse(InvocationFuture.java:384)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponseOrThrowException(InvocationFuture.java:334)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:225)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
at com.hazelcast.collection.impl.queue.QueueProxySupport.invokeAndGet(QueueProxySupport.java:177)
at com.hazelcast.collection.impl.queue.QueueProxySupport.pollInternal(QueueProxySupport.java:124)
at com.hazelcast.collection.impl.queue.QueueProxyImpl.poll(QueueProxyImpl.java:88)
at com.hazelcast.collection.impl.queue.QueueProxyImpl.take(QueueProxyImpl.java:82)
at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.nextMessage(ODistributedWorker.java:230)
at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.readRequest(ODistributedWorker.java:194)
at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.run(ODistributedWorker.java:97)
2016-04-08 21:27:12:339 INFO [192.168.161.41]:2434 [masterslave] [3.5.3] Handled 1 invocation timeouts and 0 backupTimeouts [OperationService]
2016-04-08 21:29:06:818 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] No response for 120001 ms. InvocationFuture{invocation=Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=14, callId=1105, invocationTime=1460150826817, waitTimeout=-1, callTimeout=60000}, partitionId=14, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.197.124]:2434, backupsExpected=0, backupsCompleted=0}, response=null, done=false} [Invocation]
2016-04-08 21:29:06:819 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] Asking if operation execution has been started: Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=14, callId=1105, invocationTime=1460150826817, waitTimeout=-1, callTimeout=60000}, partitionId=14, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.197.124]:2434, backupsExpected=0, backupsCompleted=0} [Invocation][192.168.161.41]:2434 [masterslave] [3.5.3] While asking 'is-executing': Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=14, callId=1105, invocationTime=1460150826817, waitTimeout=-1, callTimeout=60000}, partitionId=14, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.197.124]:2434, backupsExpected=0, backupsCompleted=0}
java.util.concurrent.TimeoutException: Call Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.spi.impl.operationservice.impl.operations.IsStillExecutingOperation{serviceName='hz:impl:queueService', partitionId=-1, callId=1108, invocationTime=1460150946818, waitTimeout=-1, callTimeout=5000}, partitionId=-1, replicaIndex=0, tryCount=0, tryPauseMillis=0, invokeCount=1, callTimeout=5000, target=Address[192.168.197.124]:2434, backupsExpected=0, backupsCompleted=0} encountered a timeout
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponse(InvocationFuture.java:366)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponseOrThrowException(InvocationFuture.java:334)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:225)
at com.hazelcast.spi.impl.operationservice.impl.IsStillRunningService.isOperationExecuting(IsStillRunningService.java:85)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.waitForResponse(InvocationFuture.java:275)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:224)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
at com.hazelcast.collection.impl.queue.QueueProxySupport.invokeAndGet(QueueProxySupport.java:177)
at com.hazelcast.collection.impl.queue.QueueProxySupport.pollInternal(QueueProxySupport.java:124)
at com.hazelcast.collection.impl.queue.QueueProxyImpl.poll(QueueProxyImpl.java:88)
at com.hazelcast.collection.impl.queue.QueueProxyImpl.take(QueueProxyImpl.java:82)
at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.nextMessage(ODistributedWorker.java:230)
at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.readRequest(ODistributedWorker.java:194)
at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.run(ODistributedWorker.java:97)
2016-04-08 21:29:11:821 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] 'is-executing': false -> Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=14, callId=1105, invocationTime=1460150826817, waitTimeout=-1, callTimeout=60000}, partitionId=14, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.197.124]:2434, backupsExpected=0, backupsCompleted=0} [Invocation][master] error on executing distributed request -1: -
com.hazelcast.core.OperationTimeoutException: No response for 120000 ms. Aborting invocation! Invocation{ serviceName='hz:impl:queueService', op=com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=14, callId=1105, invocationTime=1460150826817, waitTimeout=-1, callTimeout=60000}, partitionId=14, replicaIndex=0, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeout=60000, target=Address[192.168.197.124]:2434, backupsExpected=0, backupsCompleted=0} No response has been received! backups-expected:0 backups-completed: 0
at com.hazelcast.spi.impl.operationservice.impl.Invocation.newOperationTimeoutException(Invocation.java:491)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.waitForResponse(InvocationFuture.java:277)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:224)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
at com.hazelcast.collection.impl.queue.QueueProxySupport.invokeAndGet(QueueProxySupport.java:177)
at com.hazelcast.collection.impl.queue.QueueProxySupport.pollInternal(QueueProxySupport.java:124)
at com.hazelcast.collection.impl.queue.QueueProxyImpl.poll(QueueProxyImpl.java:88)
at com.hazelcast.collection.impl.queue.QueueProxyImpl.take(QueueProxyImpl.java:82)
at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.nextMessage(ODistributedWorker.java:230)
at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.readRequest(ODistributedWorker.java:194)
at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.run(ODistributedWorker.java:97)
at ------ End remote and begin local stack-trace ------.(Unknown Source)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponse(InvocationFuture.java:384)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveApplicationResponseOrThrowException(InvocationFuture.java:334)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:225)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.get(InvocationFuture.java:204)
at com.hazelcast.collection.impl.queue.QueueProxySupport.invokeAndGet(QueueProxySupport.java:177)
at com.hazelcast.collection.impl.queue.QueueProxySupport.pollInternal(QueueProxySupport.java:124)
at com.hazelcast.collection.impl.queue.QueueProxyImpl.poll(QueueProxyImpl.java:88)
at com.hazelcast.collection.impl.queue.QueueProxyImpl.take(QueueProxyImpl.java:82)
at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.nextMessage(ODistributedWorker.java:230)
at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.readRequest(ODistributedWorker.java:194)
at com.orientechnologies.orient.server.hazelcast.ODistributedWorker.run(ODistributedWorker.java:97)
2016-04-08 21:29:17:361 INFO [192.168.161.41]:2434 [masterslave] [3.5.3] Handled 1 invocation timeouts and 0 backupTimeouts [OperationService]
2016-04-08 21:29:20:470 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] Removing Member [192.168.197.124]:2434 because it has not sent any heartbeats for 300000 ms. [ClusterService]
2016-04-08 21:29:21:480 INFO [192.168.161.41]:2434 [masterslave] [3.5.3] Connecting to /
192.168.197.124:2434, timeout: 0, bind-any: true [SocketConnector]
2016-04-08 21:29:21:480 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] This node does not have a connection to Member [192.168.197.124]:2434 [ClusterService]
2016-04-08 21:29:21:481 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] Removing Member [192.168.197.124]:2434 because it has not sent any heartbeats for 300000 ms. [ClusterService]
2016-04-08 21:29:21:483 INFO [192.168.161.41]:2434 [masterslave] [3.5.3] Established socket connection between /
192.168.161.41:37001 [TcpIpConnectionManager]
2016-04-08 21:29:23:475 INFO [192.168.161.41]:2434 [masterslave] [3.5.3] Connection [Address[192.168.197.124]:2434] lost. Reason: Socket explicitly closed [TcpIpConnection]
2016-04-08 21:29:23:475 INFO [192.168.161.41]:2434 [masterslave] [3.5.3] Removing Member [192.168.197.124]:2434 [ClusterService]
2016-04-08 21:29:23:493 INFO [192.168.161.41]:2434 [masterslave] [3.5.3]
Members [1] {
Member [192.168.161.41]:2434 this
}
[ClusterService]
2016-04-08 21:29:23:496 INFO [192.168.161.41]:2434 [masterslave] [3.5.3] Connection [Address[192.168.197.124]:2434] lost. Reason: Socket explicitly closed [TcpIpConnection]
2016-04-08 21:29:23:509 INFO [master] received removed status slave.orientdb=null [OHazelcastPlugin]
Then, this keeps logging indefinitely in the leader node:
2016-04-08 21:57:24:815 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] Op: com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=161, callId=0, invocationTime=1460151061852, waitTimeout=-2100001, callTimeout=60000}, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[192.168.161.41]:2434, target:Address[192.168.197.124]:2434, partitionId: 161, replicaIndex: 0, operation: com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation, service: hz:impl:queueService [WaitingOperation]
2016-04-08 21:57:25:815 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] Op: com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=161, callId=0, invocationTime=1460151061852, waitTimeout=-2100001, callTimeout=60000}, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[192.168.161.41]:2434, target:Address[192.168.197.124]:2434, partitionId: 161, replicaIndex: 0, operation: com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation, service: hz:impl:queueService [WaitingOperation]
2016-04-08 21:57:26:815 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] Op: com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=161, callId=0, invocationTime=1460151061852, waitTimeout=-2100001, callTimeout=60000}, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[192.168.161.41]:2434, target:Address[192.168.197.124]:2434, partitionId: 161, replicaIndex: 0, operation: com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation, service: hz:impl:queueService [WaitingOperation]
2016-04-08 21:57:27:815 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] Op: com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=161, callId=0, invocationTime=1460151061852, waitTimeout=-2100001, callTimeout=60000}, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[192.168.161.41]:2434, target:Address[192.168.197.124]:2434, partitionId: 161, replicaIndex: 0, operation: com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation, service: hz:impl:queueService [WaitingOperation]
2016-04-08 21:57:28:815 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] Op: com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=161, callId=0, invocationTime=1460151061852, waitTimeout=-2100001, callTimeout=60000}, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[192.168.161.41]:2434, target:Address[192.168.197.124]:2434, partitionId: 161, replicaIndex: 0, operation: com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation, service: hz:impl:queueService [WaitingOperation]
2016-04-08 21:57:29:816 WARNI [192.168.161.41]:2434 [masterslave] [3.5.3] Op: com.hazelcast.collection.impl.queue.operations.PollOperation{serviceName='null', partitionId=161, callId=0, invocationTime=1460151061852, waitTimeout=-2100001, callTimeout=60000}, com.hazelcast.spi.exception.WrongTargetException: WrongTarget! this:Address[192.168.161.41]:2434, target:Address[192.168.197.124]:2434, partitionId: 161, replicaIndex: 0, operation: com.hazelcast.spi.impl.waitnotifyservice.impl.WaitingOperation, service: hz:impl:queueService [WaitingOperation]
Thank you,
Arul