I have a strange problem
Hazelcast with 3 members, version 3.9.4
As part of an upgrade procedure node 10.173.240.3 was stopped, node2 started having some problems interacting with Hazelcast.
Node 3 logs
INFO 2020-02-25 14:01:43.707 [Thread-1 ] log() [10.173.240.3]:7008 [matrix] [3.9.4] [10.173.240.3]:7008 is SHUTTING_DOWN (LifecycleService.java:65)
INFO 2020-02-25 14:01:43.708 [Thread-1 ] log() [10.173.240.3]:7008 [matrix] [3.9.4] Shutdown request of [10.173.240.3]:7008 is handled (MigrationManager.java:65)
INFO 2020-02-25 14:01:43.718 [_hzInstance_1_matrix.migration] log() [10.173.240.3]:7008 [matrix] [3.9.4] Re-partitioning cluster data... Migration queue size: 180 (MigrationManager.java:65)
INFO 2020-02-25 14:01:44.844 [Thread-1 ] log() [10.173.240.3]:7008 [matrix] [3.9.4] Shutting down connection manager... (Node.java:65)
INFO 2020-02-25 14:01:44.845 [Thread-1 ] log() [10.173.240.3]:7008 [matrix] [3.9.4] Connection[id=4, /10.173.240.3:7008->/
10.173.240.2:57241, endpoint=[10.173.240.2]:7008, alive=false, type=MEMBER] closed. Reason: TcpIpConnectionManager is stopping (TcpIpConnection.java:65)
INFO 2020-02-25 14:01:44.845 [Thread-1 ] log() [10.173.240.3]:7008 [matrix] [3.9.4] Connection[id=3, /10.173.240.3:7008->/
10.173.240.1:51229, endpoint=[10.173.240.1]:7008, alive=false, type=MEMBER] closed. Reason: TcpIpConnectionManager is stopping (TcpIpConnection.java:65)
INFO 2020-02-25 14:01:44.846 [Thread-1 ] log() [10.173.240.3]:7008 [matrix] [3.9.4] Shutting down node engine... (Node.java:65)
INFO 2020-02-25 14:01:44.920 [Thread-1 ] log() [10.173.240.3]:7008 [matrix] [3.9.4] Destroying node NodeExtension. (NodeExtension.java:65)
INFO 2020-02-25 14:01:44.921 [Thread-1 ] log() [10.173.240.3]:7008 [matrix] [3.9.4] Hazelcast Shutdown is completed in 1213 ms. (Node.java:65)
INFO 2020-02-25 14:01:44.921 [Thread-1 ] log() [10.173.240.3]:7008 [matrix] [3.9.4] [10.173.240.3]:7008 is SHUTDOWN (LifecycleService.java:65)
What I get in the log on node 10.173.240.2, this is 3 mins later
WARN 2020-02-25 14:04:23.499 [NodeListRebuilderTh ] run() Exception during background node list rebuild. Happened: 1 in the last: 1m (TeamNodesProvider.java:238)
com.hazelcast.core.OperationTimeoutException: PutOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2020-02-25 14:04:23.497. Start time: 2020-02-25 14:02:23.167. Total elapsed time: 120330 ms. Last operation heartbeat: never. Last operation heartbeat from member: 2020-02-25 14:04:08.499. Invocation{op=com.hazelcast.map.impl.operation.PutOperation{serviceName='hz:impl:mapService', identityHash=1978584929, partitionId=27, replicaIndex=0, callId=-4197, invocationTime=1582639343165 (2020-02-25 14:02:23.165), waitTimeout=-1, callTimeout=60000, name=inmemMap}, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=60000, firstInvocationTimeMs=1582639343167, firstInvocationTime='2020-02-25 14:02:23.167', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 00:00:00.000', target=[10.173.240.2]:7008, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.newOperationTimeoutException(InvocationFuture.java:164)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:106)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrowIfException(InvocationFuture.java:79)
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:155)
at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:424)
at com.hazelcast.map.impl.proxy.MapProxySupport.putInternal(MapProxySupport.java:389)
at com.hazelcast.map.impl.proxy.NearCachedMapProxyImpl.putInternal(NearCachedMapProxyImpl.java:164)
at com.hazelcast.map.impl.proxy.MapProxyImpl.put(MapProxyImpl.java:131)
at com.hazelcast.map.impl.proxy.MapProxyImpl.put(MapProxyImpl.java:122)
at com.some.package.write(Class.java:731)
This continues in he same fashion, with write failing
INFO 2020-02-25 14:14:30.497 [matrix.InvocationMonitorThread] log() [10.173.240.2]:7008 [matrix] [3.9.4] Invocations:32 timeouts:1 backup-timeouts:0 (InvocationMonitor.java:65)
WARN 2020-02-25 14:14:30.499 [NodeListRebuilderTh ] run() Exception during background node list rebuild. Happened: 1 in the last: 2m1s (TeamNodesProvider.java:238)
com.hazelcast.core.OperationTimeoutException: PutOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2020-02-25 14:14:30.497. Start time: 2020-02-25 14:12:30.499. Total elapsed time: 119998 ms. Last operation heartbeat: never. Last operation heartbeat from member: 2020-02-25 14:14:23.499. Invocation{op=com.hazelcast.map.impl.operation.PutOperation{serviceName='hz:impl:mapService', identityHash=675701052, partitionId=27, replicaIndex=0, callId=-4620, invocationTime=1582639950497 (2020-02-25 14:12:30.497), waitTimeout=-1, callTimeout=60000, name=inmemMap}, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=60000, firstInvocationTimeMs=1582639950499, firstInvocationTime='2020-02-25 14:12:30.499', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 00:00:00.000', target=[10.173.240.2]:7008, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.newOperationTimeoutException(InvocationFuture.java:164)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:106)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrowIfException(InvocationFuture.java:79)
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:155)
at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:424)
at com.hazelcast.map.impl.proxy.MapProxySupport.putInternal(MapProxySupport.java:389)
at com.hazelcast.map.impl.proxy.NearCachedMapProxyImpl.putInternal(NearCachedMapProxyImpl.java:164)
at com.hazelcast.map.impl.proxy.MapProxyImpl.put(MapProxyImpl.java:131)
at com.hazelcast.map.impl.proxy.MapProxyImpl.put(MapProxyImpl.java:122)
at com.some.package.write(Class.java:731)
If I understand correctly the target for the operation is the node itself.
There is no backing store for this particular map inmemMap.
Looks like heartbeat from the member is received, but operation heartbeat is not.
My question is why the operation timeouts?
What would be the reason and how to resolve this?
Relevant part of configuration below
<network>
<port auto-increment="false">7008</port>
<join>
<multicast enabled="false">
<multicast-group>227.227.227.225</multicast-group>
<multicast-port>7007</multicast-port>
<multicast-time-to-live>1</multicast-time-to-live>
<multicast-timeout-seconds>15</multicast-timeout-seconds>
</multicast>
<tcp-ip enabled="true" connection-timeout-seconds="15">
<interface>10.173.240.1-10</interface>
</tcp-ip>
</join>
<interfaces enabled="true">
<interface>10.173.240.*</interface>
</interfaces>
</network>
<properties>
<property name="hazelcast.jmx">true</property>
<property name="hazelcast.logging.type">slf4j</property>
<property name="hazelcast.shutdownhook.enabled">false</property>
<property name="hazelcast.health.monitoring.delay.seconds">180</property>
<property name="hazelcast.phone.home.enabled">false</property>
<property name="hazelcast.socket.bind.any">false</property>
<property name="hazelcast.prefer.ipv4.stack">true</property>
<property name="hazelcast.initial.min.cluster.size">1</property>
<property name="hazelcast.max.no.heartbeat.seconds">60</property>
<property name="hazelcast.max.operation.timeout">15000</property>
<property name="hazelcast.merge.first.run.delay.seconds">15</property>
<property name="hazelcast.merge.next.run.delay.seconds">10</property>
<property name="hazelcast.map.invalidation.batch.enabled">false</property>
</properties>
<executor-service>
<pool-size>64</pool-size>
<queue-capacity>0</queue-capacity>
</executor-service>
<map name="inmemMap">
<backup-count>2</backup-count>
<eviction-policy>LRU</eviction-policy>
<max-size policy="PER_NODE">10000</max-size>
<near-cache>
<time-to-live-seconds>180</time-to-live-seconds>
<max-idle-seconds>6000</max-idle-seconds>
<eviction eviction-policy="LRU" max-size-policy="ENTRY_COUNT" size="5000"/>
<invalidate-on-change>true</invalidate-on-change>
</near-cache>
</map>
Snipped of the log below, including detection of node3 going down
INFO 2020-02-25 14:01:44.845 [stance_1_matrix.IO.thread-in-0] log() [10.173.240.2]:7008 [matrix] [3.9.4] Connection[id=1, /10.173.240.2:57241->/
10.173.240.3:7008, endpoint=[10.173.240.3]:7008, alive=false, type=MEMBER] closed. Reason: Connection closed by the other side (TcpIpConnection.java:65)
INFO 2020-02-25 14:01:44.850 [ance_1_matrix.cached.thread-38] log() [10.173.240.2]:7008 [matrix] [3.9.4] Connecting to /
10.173.240.3:7008, timeout: 0, bind-any: false (TcpIpConnector.java:65)
INFO 2020-02-25 14:01:44.851 [ance_1_matrix.cached.thread-38] log() [10.173.240.2]:7008 [matrix] [3.9.4] Could not connect to: /
10.173.240.3:7008. Reason: SocketException[Connection refused to address /
10.173.240.3:7008] (TcpIpConnector.java:65)
INFO 2020-02-25 14:01:44.951 [ance_1_matrix.cached.thread-65] log() [10.173.240.2]:7008 [matrix] [3.9.4] Connecting to /
10.173.240.3:7008, timeout: 0, bind-any: false (TcpIpConnector.java:65)
INFO 2020-02-25 14:01:44.951 [ance_1_matrix.cached.thread-65] log() [10.173.240.2]:7008 [matrix] [3.9.4] Could not connect to: /
10.173.240.3:7008. Reason: SocketException[Connection refused to address /
10.173.240.3:7008] (TcpIpConnector.java:65)
INFO 2020-02-25 14:01:45.052 [tance_1_matrix.cached.thread-9] log() [10.173.240.2]:7008 [matrix] [3.9.4] Connecting to /
10.173.240.3:7008, timeout: 0, bind-any: false (TcpIpConnector.java:65)
INFO 2020-02-25 14:01:45.052 [tance_1_matrix.cached.thread-9] log() [10.173.240.2]:7008 [matrix] [3.9.4] Could not connect to: /
10.173.240.3:7008. Reason: SocketException[Connection refused to address /
10.173.240.3:7008] (TcpIpConnector.java:65)
INFO 2020-02-25 14:01:45.151 [tance_1_matrix.cached.thread-9] log() [10.173.240.2]:7008 [matrix] [3.9.4] Connecting to /
10.173.240.3:7008, timeout: 0, bind-any: false (TcpIpConnector.java:65)
INFO 2020-02-25 14:01:45.151 [tance_1_matrix.cached.thread-9] log() [10.173.240.2]:7008 [matrix] [3.9.4] Could not connect to: /
10.173.240.3:7008. Reason: SocketException[Connection refused to address /
10.173.240.3:7008] (TcpIpConnector.java:65)
INFO 2020-02-25 14:01:45.153 [tance_1_matrix.cached.thread-9] log() [10.173.240.2]:7008 [matrix] [3.9.4] Connecting to /
10.173.240.3:7008, timeout: 0, bind-any: false (TcpIpConnector.java:65)
INFO 2020-02-25 14:01:45.153 [tance_1_matrix.cached.thread-9] log() [10.173.240.2]:7008 [matrix] [3.9.4] Could not connect to: /
10.173.240.3:7008. Reason: SocketException[Connection refused to address /
10.173.240.3:7008] (TcpIpConnector.java:65)
WARN 2020-02-25 14:01:45.154 [tance_1_matrix.cached.thread-9] log() [10.173.240.2]:7008 [matrix] [3.9.4] Removing connection to endpoint [10.173.240.3]:7008 Cause => java.net.SocketException {Connection refused to address /
10.173.240.3:7008}, Error-Count: 5 (TcpIpConnectionErrorHandler.java:67)
WARN 2020-02-25 14:01:45.155 [tance_1_matrix.cached.thread-9] log() [10.173.240.2]:7008 [matrix] [3.9.4] Member [10.173.240.3]:7008 - eb654341-523f-4400-b657-5ecb1f2f570c is suspected to be dead for reason: No connection (MembershipManager.java:67)
WARN 2020-02-25 14:01:45.156 [ity-generic-operation.thread-0] log() [10.173.240.2]:7008 [matrix] [3.9.4] Mastership of [10.173.240.1]:7008 is accepted. Response: MembersView{version=54, members=[MemberInfo{address=[10.173.240.1]:7008, uuid=e71a0ccd-c0ac-413a-a4ed-d405230c435b, liteMember=false}, MemberInfo{address=[10.173.240.2]:7008, uuid=aae39c3e-1e88-474e-8c30-a181ed81a451, liteMember=false}]} (ClusterService.java:67)
INFO 2020-02-25 14:01:45.247 [ance_1_matrix.cached.thread-38] log() [10.173.240.2]:7008 [matrix] [3.9.4] Connecting to /
10.173.240.3:7008, timeout: 0, bind-any: false (TcpIpConnector.java:65)
INFO 2020-02-25 14:01:45.247 [ance_1_matrix.cached.thread-38] log() [10.173.240.2]:7008 [matrix] [3.9.4] Could not connect to: /
10.173.240.3:7008. Reason: SocketException[Connection refused to address /
10.173.240.3:7008] (TcpIpConnector.java:65)
INFO 2020-02-25 14:01:45.252 [ance_1_matrix.cached.thread-38] log() [10.173.240.2]:7008 [matrix] [3.9.4] Connecting to /
10.173.240.3:7008, timeout: 0, bind-any: false (TcpIpConnector.java:65)
INFO 2020-02-25 14:01:45.252 [ance_1_matrix.cached.thread-38] log() [10.173.240.2]:7008 [matrix] [3.9.4] Could not connect to: /
10.173.240.3:7008. Reason: SocketException[Connection refused to address /
10.173.240.3:7008] (TcpIpConnector.java:65)
INFO 2020-02-25 14:01:45.254 [ance_1_matrix.cached.thread-38] log() [10.173.240.2]:7008 [matrix] [3.9.4] Connecting to /
10.173.240.3:7008, timeout: 0, bind-any: false (TcpIpConnector.java:65)
INFO 2020-02-25 14:01:45.254 [ance_1_matrix.cached.thread-38] log() [10.173.240.2]:7008 [matrix] [3.9.4] Could not connect to: /
10.173.240.3:7008. Reason: SocketException[Connection refused to address /
10.173.240.3:7008] (TcpIpConnector.java:65)
INFO 2020-02-25 14:01:45.259 [ity-generic-operation.thread-0] log() [10.173.240.2]:7008 [matrix] [3.9.4]
Members {size:2, ver:55} [
Member [10.173.240.1]:7008 - e71a0ccd-c0ac-413a-a4ed-d405230c435b
Member [10.173.240.2]:7008 - aae39c3e-1e88-474e-8c30-a181ed81a451 this
]
(ClusterService.java:65)
INFO 2020-02-25 14:01:45.260 [ance_1_matrix.cached.thread-38] log() [10.173.240.2]:7008 [matrix] [3.9.4] Committing/rolling-back alive transactions of Member [10.173.240.3]:7008 - eb654341-523f-4400-b657-5ecb1f2f570c, UUID: eb654341-523f-4400-b657-5ecb1f2f570c (TransactionManagerService.java:65)
WARN 2020-02-25 14:01:58.538 [ix.SlowOperationDetectorThread] log() [10.173.240.2]:7008 [matrix] [3.9.4] Slow operation detected: com.hazelcast.map.impl.operation.DeleteOperation
Hint: You can enable the logging of stacktraces with the following system property: -Dhazelcast.slow.operation.detector.stacktrace.logging.enabled (SlowOperationDetector.java:67)
WARN 2020-02-25 14:02:32.539 [ix.SlowOperationDetectorThread] log() [10.173.240.2]:7008 [matrix] [3.9.4] Slow operation detected: com.hazelcast.map.impl.operation.DeleteOperation (2 invocations) (SlowOperationDetector.java:67)
INFO 2020-02-25 14:04:23.497 [matrix.InvocationMonitorThread] log() [10.173.240.2]:7008 [matrix] [3.9.4] Invocations:6 timeouts:1 backup-timeouts:0 (InvocationMonitor.java:65)
WARN 2020-02-25 14:04:23.499 [NodeListRebuilderTh ] run() Exception during background node list rebuild. Happened: 1 in the last: 1m (TeamNodesProvider.java:238)
com.hazelcast.core.OperationTimeoutException: PutOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2020-02-25 14:04:23.497. Start time: 2020-02-25 14:02:23.167. Total elapsed time: 120330 ms. Last operation heartbeat: never. Last operation heartbeat from member: 2020-02-25 14:04:08.499. Invocation{op=com.hazelcast.map.impl.operation.PutOperation{serviceName='hz:impl:mapService', identityHash=1978584929, partitionId=27, replicaIndex=0, callId=-4197, invocationTime=1582639343165 (2020-02-25 14:02:23.165), waitTimeout=-1, callTimeout=60000, name=inmemMap}, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=60000, firstInvocationTimeMs=1582639343167, firstInvocationTime='2020-02-25 14:02:23.167', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 00:00:00.000', target=[10.173.240.2]:7008, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.newOperationTimeoutException(InvocationFuture.java:164)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:106)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrowIfException(InvocationFuture.java:79)
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:155)
at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:424)
at com.hazelcast.map.impl.proxy.MapProxySupport.putInternal(MapProxySupport.java:389)
at com.hazelcast.map.impl.proxy.NearCachedMapProxyImpl.putInternal(NearCachedMapProxyImpl.java:164)
at com.hazelcast.map.impl.proxy.MapProxyImpl.put(MapProxyImpl.java:131)
at com.hazelcast.map.impl.proxy.MapProxyImpl.put(MapProxyImpl.java:122)
at com.some.package.write(Matrix.java:731)
INFO 2020-02-25 14:05:26.497 [matrix.InvocationMonitorThread] log() [10.173.240.2]:7008 [matrix] [3.9.4] Invocations:7 timeouts:1 backup-timeouts:0 (InvocationMonitor.java:65)