OperationTimeoutException on a local node


ivenhov

Feb 29, 2020, 6:06:18 AM
to Hazelcast
I have a strange problem.

Hazelcast with 3 members, version 3.9.4

As part of an upgrade procedure, node 10.173.240.3 was stopped; after that, node 2 started having problems interacting with Hazelcast.

Node 3 logs
INFO  2020-02-25 14:01:43.707  [Thread-1                      ] log()                          [10.173.240.3]:7008 [matrix] [3.9.4] [10.173.240.3]:7008 is SHUTTING_DOWN      (LifecycleService.java:65)
INFO  2020-02-25 14:01:43.708  [Thread-1                      ] log()                          [10.173.240.3]:7008 [matrix] [3.9.4] Shutdown request of [10.173.240.3]:7008 is handled      (MigrationManager.java:65)
INFO  2020-02-25 14:01:43.718  [_hzInstance_1_matrix.migration] log()                          [10.173.240.3]:7008 [matrix] [3.9.4] Re-partitioning cluster data... Migration queue size: 180      (MigrationManager.java:65)
INFO  2020-02-25 14:01:44.844  [Thread-1                      ] log()                          [10.173.240.3]:7008 [matrix] [3.9.4] Shutting down connection manager...      (Node.java:65)
INFO  2020-02-25 14:01:44.845  [Thread-1                      ] log()                          [10.173.240.3]:7008 [matrix] [3.9.4] Connection[id=4, /10.173.240.3:7008->/10.173.240.2:57241, endpoint=[10.173.240.2]:7008, alive=false, type=MEMBER] closed. Reason: TcpIpConnectionManager is stopping      (TcpIpConnection.java:65)
INFO  2020-02-25 14:01:44.845  [Thread-1                      ] log()                          [10.173.240.3]:7008 [matrix] [3.9.4] Connection[id=3, /10.173.240.3:7008->/10.173.240.1:51229, endpoint=[10.173.240.1]:7008, alive=false, type=MEMBER] closed. Reason: TcpIpConnectionManager is stopping      (TcpIpConnection.java:65)
INFO  2020-02-25 14:01:44.846  [Thread-1                      ] log()                          [10.173.240.3]:7008 [matrix] [3.9.4] Shutting down node engine...      (Node.java:65)
INFO  2020-02-25 14:01:44.920  [Thread-1                      ] log()                          [10.173.240.3]:7008 [matrix] [3.9.4] Destroying node NodeExtension.      (NodeExtension.java:65)
INFO  2020-02-25 14:01:44.921  [Thread-1                      ] log()                          [10.173.240.3]:7008 [matrix] [3.9.4] Hazelcast Shutdown is completed in 1213 ms.      (Node.java:65)
INFO  2020-02-25 14:01:44.921  [Thread-1                      ] log()                          [10.173.240.3]:7008 [matrix] [3.9.4] [10.173.240.3]:7008 is SHUTDOWN      (LifecycleService.java:65)


Here is what I get in the log on node 10.173.240.2, 3 minutes later:

WARN  2020-02-25 14:04:23.499  [NodeListRebuilderTh           ] run()                          Exception during background node list rebuild. Happened: 1 in the last: 1m      (TeamNodesProvider.java:238)
com.hazelcast.core.OperationTimeoutException: PutOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2020-02-25 14:04:23.497. Start time: 2020-02-25 14:02:23.167. Total elapsed time: 120330 ms. Last operation heartbeat: never. Last operation heartbeat from member: 2020-02-25 14:04:08.499. Invocation{op=com.hazelcast.map.impl.operation.PutOperation{serviceName='hz:impl:mapService', identityHash=1978584929, partitionId=27, replicaIndex=0, callId=-4197, invocationTime=1582639343165 (2020-02-25 14:02:23.165), waitTimeout=-1, callTimeout=60000, name=inmemMap}, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=60000, firstInvocationTimeMs=1582639343167, firstInvocationTime='2020-02-25 14:02:23.167', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 00:00:00.000', target=[10.173.240.2]:7008, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.newOperationTimeoutException(InvocationFuture.java:164)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:106)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrowIfException(InvocationFuture.java:79)
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:155)
at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:424)
at com.hazelcast.map.impl.proxy.MapProxySupport.putInternal(MapProxySupport.java:389)
at com.hazelcast.map.impl.proxy.NearCachedMapProxyImpl.putInternal(NearCachedMapProxyImpl.java:164)
at com.hazelcast.map.impl.proxy.MapProxyImpl.put(MapProxyImpl.java:131)
at com.hazelcast.map.impl.proxy.MapProxyImpl.put(MapProxyImpl.java:122)
at com.some.package.write(Class.java:731)
This continues in the same fashion, with writes failing:

INFO  2020-02-25 14:14:30.497  [matrix.InvocationMonitorThread] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Invocations:32 timeouts:1 backup-timeouts:0      (InvocationMonitor.java:65)
WARN  2020-02-25 14:14:30.499  [NodeListRebuilderTh           ] run()                          Exception during background node list rebuild. Happened: 1 in the last: 2m1s      (TeamNodesProvider.java:238)
com.hazelcast.core.OperationTimeoutException: PutOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2020-02-25 14:14:30.497. Start time: 2020-02-25 14:12:30.499. Total elapsed time: 119998 ms. Last operation heartbeat: never. Last operation heartbeat from member: 2020-02-25 14:14:23.499. Invocation{op=com.hazelcast.map.impl.operation.PutOperation{serviceName='hz:impl:mapService', identityHash=675701052, partitionId=27, replicaIndex=0, callId=-4620, invocationTime=1582639950497 (2020-02-25 14:12:30.497), waitTimeout=-1, callTimeout=60000, name=inmemMap}, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=60000, firstInvocationTimeMs=1582639950499, firstInvocationTime='2020-02-25 14:12:30.499', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 00:00:00.000', target=[10.173.240.2]:7008, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.newOperationTimeoutException(InvocationFuture.java:164)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:106)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrowIfException(InvocationFuture.java:79)
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:155)
at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:424)
at com.hazelcast.map.impl.proxy.MapProxySupport.putInternal(MapProxySupport.java:389)
at com.hazelcast.map.impl.proxy.NearCachedMapProxyImpl.putInternal(NearCachedMapProxyImpl.java:164)
at com.hazelcast.map.impl.proxy.MapProxyImpl.put(MapProxyImpl.java:131)
at com.hazelcast.map.impl.proxy.MapProxyImpl.put(MapProxyImpl.java:122)
at com.some.package.write(Class.java:731)

If I understand correctly, the target for the operation is the node itself.
There is no backing store for this particular map, inmemMap.
It looks like the heartbeat from the member is received, but the operation heartbeat is not.
My question is: why does the operation time out?
What could be the reason, and how can I resolve it?

The relevant part of the configuration is below:

    <network>
        <port auto-increment="false">7008</port>
        <join>
            <multicast enabled="false">
                <multicast-group>227.227.227.225</multicast-group>
                <multicast-port>7007</multicast-port>
                <multicast-time-to-live>1</multicast-time-to-live>
                <multicast-timeout-seconds>15</multicast-timeout-seconds>
            </multicast>
            <tcp-ip enabled="true" connection-timeout-seconds="15">
                <interface>10.173.240.1-10</interface>
            </tcp-ip>
        </join>
        <interfaces enabled="true">
            <interface>10.173.240.*</interface>
        </interfaces>
    </network>
<properties>
        <property name="hazelcast.jmx">true</property>
        <property name="hazelcast.logging.type">slf4j</property>
        <property name="hazelcast.shutdownhook.enabled">false</property>
        <property name="hazelcast.health.monitoring.delay.seconds">180</property>        
        <property name="hazelcast.phone.home.enabled">false</property>
        <property name="hazelcast.socket.bind.any">false</property>
        <property name="hazelcast.prefer.ipv4.stack">true</property>
        <property name="hazelcast.initial.min.cluster.size">1</property>
        <property name="hazelcast.max.no.heartbeat.seconds">60</property>
        <property name="hazelcast.max.operation.timeout">15000</property>
        <property name="hazelcast.merge.first.run.delay.seconds">15</property>
        <property name="hazelcast.merge.next.run.delay.seconds">10</property>
        <property name="hazelcast.map.invalidation.batch.enabled">false</property>
    </properties>

    <executor-service>
        <pool-size>64</pool-size>
        <queue-capacity>0</queue-capacity>
    </executor-service>
<map name="inmemMap">
        <backup-count>2</backup-count>
        <eviction-policy>LRU</eviction-policy>
        <max-size policy="PER_NODE">10000</max-size>       
        <near-cache>
            <time-to-live-seconds>180</time-to-live-seconds>
            <max-idle-seconds>6000</max-idle-seconds>
            <eviction eviction-policy="LRU" max-size-policy="ENTRY_COUNT" size="5000"/>
            <invalidate-on-change>true</invalidate-on-change>
        </near-cache>
    </map>


A snippet of the log from node 2 is below, including the detection of node 3 going down:


INFO  2020-02-25 14:01:44.845  [stance_1_matrix.IO.thread-in-0] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Connection[id=1, /10.173.240.2:57241->/10.173.240.3:7008, endpoint=[10.173.240.3]:7008, alive=false, type=MEMBER] closed. Reason: Connection closed by the other side      (TcpIpConnection.java:65)
INFO  2020-02-25 14:01:44.850  [ance_1_matrix.cached.thread-38] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Connecting to /10.173.240.3:7008, timeout: 0, bind-any: false      (TcpIpConnector.java:65)
INFO  2020-02-25 14:01:44.851  [ance_1_matrix.cached.thread-38] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Could not connect to: /10.173.240.3:7008. Reason: SocketException[Connection refused to address /10.173.240.3:7008]      (TcpIpConnector.java:65)
INFO  2020-02-25 14:01:44.951  [ance_1_matrix.cached.thread-65] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Connecting to /10.173.240.3:7008, timeout: 0, bind-any: false      (TcpIpConnector.java:65)
INFO  2020-02-25 14:01:44.951  [ance_1_matrix.cached.thread-65] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Could not connect to: /10.173.240.3:7008. Reason: SocketException[Connection refused to address /10.173.240.3:7008]      (TcpIpConnector.java:65)
INFO  2020-02-25 14:01:45.052  [tance_1_matrix.cached.thread-9] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Connecting to /10.173.240.3:7008, timeout: 0, bind-any: false      (TcpIpConnector.java:65)
INFO  2020-02-25 14:01:45.052  [tance_1_matrix.cached.thread-9] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Could not connect to: /10.173.240.3:7008. Reason: SocketException[Connection refused to address /10.173.240.3:7008]      (TcpIpConnector.java:65)
INFO  2020-02-25 14:01:45.151  [tance_1_matrix.cached.thread-9] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Connecting to /10.173.240.3:7008, timeout: 0, bind-any: false      (TcpIpConnector.java:65)
INFO  2020-02-25 14:01:45.151  [tance_1_matrix.cached.thread-9] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Could not connect to: /10.173.240.3:7008. Reason: SocketException[Connection refused to address /10.173.240.3:7008]      (TcpIpConnector.java:65)
INFO  2020-02-25 14:01:45.153  [tance_1_matrix.cached.thread-9] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Connecting to /10.173.240.3:7008, timeout: 0, bind-any: false      (TcpIpConnector.java:65)
INFO  2020-02-25 14:01:45.153  [tance_1_matrix.cached.thread-9] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Could not connect to: /10.173.240.3:7008. Reason: SocketException[Connection refused to address /10.173.240.3:7008]      (TcpIpConnector.java:65)
WARN  2020-02-25 14:01:45.154  [tance_1_matrix.cached.thread-9] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Removing connection to endpoint [10.173.240.3]:7008 Cause => java.net.SocketException {Connection refused to address /10.173.240.3:7008}, Error-Count: 5      (TcpIpConnectionErrorHandler.java:67)
WARN  2020-02-25 14:01:45.155  [tance_1_matrix.cached.thread-9] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Member [10.173.240.3]:7008 - eb654341-523f-4400-b657-5ecb1f2f570c is suspected to be dead for reason: No connection      (MembershipManager.java:67)
WARN  2020-02-25 14:01:45.156  [ity-generic-operation.thread-0] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Mastership of [10.173.240.1]:7008 is accepted. Response: MembersView{version=54, members=[MemberInfo{address=[10.173.240.1]:7008, uuid=e71a0ccd-c0ac-413a-a4ed-d405230c435b, liteMember=false}, MemberInfo{address=[10.173.240.2]:7008, uuid=aae39c3e-1e88-474e-8c30-a181ed81a451, liteMember=false}]}      (ClusterService.java:67)
INFO  2020-02-25 14:01:45.247  [ance_1_matrix.cached.thread-38] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Connecting to /10.173.240.3:7008, timeout: 0, bind-any: false      (TcpIpConnector.java:65)
INFO  2020-02-25 14:01:45.247  [ance_1_matrix.cached.thread-38] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Could not connect to: /10.173.240.3:7008. Reason: SocketException[Connection refused to address /10.173.240.3:7008]      (TcpIpConnector.java:65)
INFO  2020-02-25 14:01:45.252  [ance_1_matrix.cached.thread-38] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Connecting to /10.173.240.3:7008, timeout: 0, bind-any: false      (TcpIpConnector.java:65)
INFO  2020-02-25 14:01:45.252  [ance_1_matrix.cached.thread-38] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Could not connect to: /10.173.240.3:7008. Reason: SocketException[Connection refused to address /10.173.240.3:7008]      (TcpIpConnector.java:65)
INFO  2020-02-25 14:01:45.254  [ance_1_matrix.cached.thread-38] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Connecting to /10.173.240.3:7008, timeout: 0, bind-any: false      (TcpIpConnector.java:65)
INFO  2020-02-25 14:01:45.254  [ance_1_matrix.cached.thread-38] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Could not connect to: /10.173.240.3:7008. Reason: SocketException[Connection refused to address /10.173.240.3:7008]      (TcpIpConnector.java:65)
INFO  2020-02-25 14:01:45.259  [ity-generic-operation.thread-0] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] 

Members {size:2, ver:55} [
Member [10.173.240.1]:7008 - e71a0ccd-c0ac-413a-a4ed-d405230c435b
Member [10.173.240.2]:7008 - aae39c3e-1e88-474e-8c30-a181ed81a451 this
]
      (ClusterService.java:65)
INFO  2020-02-25 14:01:45.260  [ance_1_matrix.cached.thread-38] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Committing/rolling-back alive transactions of Member [10.173.240.3]:7008 - eb654341-523f-4400-b657-5ecb1f2f570c, UUID: eb654341-523f-4400-b657-5ecb1f2f570c      (TransactionManagerService.java:65)
WARN  2020-02-25 14:01:58.538  [ix.SlowOperationDetectorThread] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Slow operation detected: com.hazelcast.map.impl.operation.DeleteOperation
Hint: You can enable the logging of stacktraces with the following system property: -Dhazelcast.slow.operation.detector.stacktrace.logging.enabled      (SlowOperationDetector.java:67)
WARN  2020-02-25 14:02:32.539  [ix.SlowOperationDetectorThread] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Slow operation detected: com.hazelcast.map.impl.operation.DeleteOperation (2 invocations)      (SlowOperationDetector.java:67)
INFO  2020-02-25 14:04:23.497  [matrix.InvocationMonitorThread] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Invocations:6 timeouts:1 backup-timeouts:0      (InvocationMonitor.java:65)
WARN  2020-02-25 14:04:23.499  [NodeListRebuilderTh           ] run()                          Exception during background node list rebuild. Happened: 1 in the last: 1m      (TeamNodesProvider.java:238)
com.hazelcast.core.OperationTimeoutException: PutOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2020-02-25 14:04:23.497. Start time: 2020-02-25 14:02:23.167. Total elapsed time: 120330 ms. Last operation heartbeat: never. Last operation heartbeat from member: 2020-02-25 14:04:08.499. Invocation{op=com.hazelcast.map.impl.operation.PutOperation{serviceName='hz:impl:mapService', identityHash=1978584929, partitionId=27, replicaIndex=0, callId=-4197, invocationTime=1582639343165 (2020-02-25 14:02:23.165), waitTimeout=-1, callTimeout=60000, name=inmemMap}, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=60000, firstInvocationTimeMs=1582639343167, firstInvocationTime='2020-02-25 14:02:23.167', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 00:00:00.000', target=[10.173.240.2]:7008, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.newOperationTimeoutException(InvocationFuture.java:164)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:106)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrowIfException(InvocationFuture.java:79)
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:155)
at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:424)
at com.hazelcast.map.impl.proxy.MapProxySupport.putInternal(MapProxySupport.java:389)
at com.hazelcast.map.impl.proxy.NearCachedMapProxyImpl.putInternal(NearCachedMapProxyImpl.java:164)
at com.hazelcast.map.impl.proxy.MapProxyImpl.put(MapProxyImpl.java:131)
at com.hazelcast.map.impl.proxy.MapProxyImpl.put(MapProxyImpl.java:122)
at com.some.package.write(Matrix.java:731)

INFO  2020-02-25 14:05:26.497  [matrix.InvocationMonitorThread] log()                          [10.173.240.2]:7008 [matrix] [3.9.4] Invocations:7 timeouts:1 backup-timeouts:0      (InvocationMonitor.java:65)


Jaromir Hamala

Feb 29, 2020, 7:40:57 AM
to Hazelcast
Hi,

"Last operation heartbeat from member: 2020-02-25 14:14:23.499." suggests the member is sending operation heartbeats. "Last operation heartbeat: never." indicates the PUT operation is currently not running.
The fact that it's not running right now is not concerning on its own. A PUT operation is meant to be extremely fast, and normally such operations are not even picked up by the operation heartbeating mechanism. Heartbeating is meant for (relatively) long-running operations: think of queries, aggregations, etc.
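To make the timeout condition concrete, here is a toy model (plain Java, not Hazelcast's actual code) of the decision visible in the exception text: the invocation fails once neither a response nor an operation heartbeat has arrived within a window derived from callTimeout. The 2× factor is an assumption inferred from the log above (callTimeout=60000, total elapsed ~120s), and the class and method names are hypothetical:

```java
// Simplified model of the invocation timeout decision, reconstructed from the
// fields visible in the exception (callTimeout, firstInvocationTime,
// lastHeartbeat). NOT Hazelcast's real implementation, just a sketch of the
// observable behaviour.
public final class InvocationTimeoutModel {

    static final long NEVER = 0L; // matches "lastHeartbeatMillis=0" in the log

    /** True when the invocation should fail with operation-heartbeat-timeout. */
    static boolean isHeartbeatTimedOut(long nowMs,
                                       long firstInvocationMs,
                                       long lastOpHeartbeatMs,
                                       long callTimeoutMs) {
        long deadline = (lastOpHeartbeatMs == NEVER)
                ? firstInvocationMs + 2 * callTimeoutMs // never started: allow one retry window
                : lastOpHeartbeatMs + callTimeoutMs;    // running: must heartbeat within callTimeout
        return nowMs > deadline;
    }

    public static void main(String[] args) {
        long start = 0, callTimeout = 60_000;
        // 119s elapsed, no heartbeat yet: still waiting
        System.out.println(isHeartbeatTimedOut(start + 119_000, start, NEVER, callTimeout)); // false
        // 121s elapsed, still no heartbeat: timed out, like the PutOperation above
        System.out.println(isHeartbeatTimedOut(start + 121_000, start, NEVER, callTimeout)); // true
    }
}
```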

I have 2 possible explanations:
1. The operation was never executed. Either the request was lost for some reason, or it is queuing in the operation execution queue. Losing a request is unlikely; that would be a major bug.
2. The operation was executed, but the response was lost. Again, losing a response is unlikely.

The log also says "Slow operation detected: com.hazelcast.map.impl.operation.DeleteOperation". It's possible the operation thread is stuck inside the DeleteOperation and cannot execute the PutOperation, which sits in the queue until the call times out. Given that your IMap has no backing store configured, the DeleteOperation should be super quick. The fact that it's even picked up by the slow operation detector is highly suspicious. I have 2 possible explanations for this:
1. A bug in the Delete operation.
2. Something environmental is causing the slowness.

You said this behaviour happened after you shut down 1 out of 3 nodes. In theory it could be caused by memory pressure, as the same IMap was now stored on just 2 instead of 3 nodes. Do you happen to have Garbage Collector logs? If not, I would strongly recommend enabling them. They have very low overhead and can help you massively with issues like this.
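For reference, GC logging can be enabled with standard JVM flags. This is a config fragment; log paths and rotation settings are examples to adjust for your environment:

```shell
# Java 8 (the JVM generation typically running Hazelcast 3.9.x)
java -Xloggc:/var/log/app/gc.log \
     -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
     -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=10M \
     ...

# Java 9+ (unified logging replaces the flags above)
java -Xlog:gc*:file=/var/log/app/gc.log:time,uptime:filecount=5,filesize=10m ...
```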

Cheers,
Jaromir

ivenhov

Feb 29, 2020, 11:50:53 AM
to Hazelcast
Thank you very much for your valuable insight.

I have a couple of other maps defined. I did not think this would matter, but I can see now that it may be important.
Apologies for that.
Some of those maps are backed by storage. Is it possible that a transfer to another map is blocking PUT operations on the in-memory map?
Is the operation queue common to all maps, lists, etc., or is there one per collection?

What is surprising is that 10.173.240.2 times out on an operation where target=[10.173.240.2]:7008.

The "Slow operation detected: com.hazelcast.map.impl.operation.DeleteOperation" warning is actually about another map, one with a backing store.
Apologies again for creating confusion.

I think what is happening in this case is that the operation is routed to a node that is in the process of being shut down.

An example of a failing operation:

WARN  2020-02-25 14:01:39.398  Invoke failed
java.lang.RuntimeException: Failed to delete Store, key: 41114891-71ad-4a5f-baad-949a2479f637
at com.store.MapStore.delete(AbstractHazelcastMapStore.java:99)
at com.hazelcast.map.impl.MapStoreWrapper.delete(MapStoreWrapper.java:107)
at com.hazelcast.map.impl.mapstore.writethrough.WriteThroughStore.remove(WriteThroughStore.java:56)
at com.hazelcast.map.impl.mapstore.writethrough.WriteThroughStore.remove(WriteThroughStore.java:28)
at com.hazelcast.map.impl.recordstore.DefaultRecordStore.removeRecord(DefaultRecordStore.java:1040)
at com.hazelcast.map.impl.recordstore.DefaultRecordStore.delete(DefaultRecordStore.java:624)
at com.hazelcast.map.impl.operation.DeleteOperation.run(DeleteOperation.java:38)
at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:194)
at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:409)
at com.hazelcast.spi.impl.operationexecutor.impl.OperationThread.process(OperationThread.java:115)
at com.hazelcast.spi.impl.operationexecutor.impl.OperationThread.run(OperationThread.java:100)
at ------ submitted from ------.(Unknown Source)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:127)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrowIfException(InvocationFuture.java:79)
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:155)
at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:424)
at com.hazelcast.map.impl.proxy.MapProxySupport.deleteInternal(MapProxySupport.java:569)
at com.hazelcast.map.impl.proxy.NearCachedMapProxyImpl.deleteInternal(NearCachedMapProxyImpl.java:327)
at com.hazelcast.map.impl.proxy.MapProxyImpl.delete(MapProxyImpl.java:231)
at com.api.Creds.delete()
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: java.lang.NullPointerException
at com.logic.MapStore.delete(AbstractHazelcastMapStore.java:97)
at com.hazelcast.map.impl.MapStoreWrapper.delete(MapStoreWrapper.java:107)
at com.hazelcast.map.impl.mapstore.writethrough.WriteThroughStore.remove(WriteThroughStore.java:56)
at com.hazelcast.map.impl.mapstore.writethrough.WriteThroughStore.remove(WriteThroughStore.java:28)
at com.hazelcast.map.impl.recordstore.DefaultRecordStore.removeRecord(DefaultRecordStore.java:1040)
at com.hazelcast.map.impl.recordstore.DefaultRecordStore.delete(DefaultRecordStore.java:624)
at com.hazelcast.map.impl.operation.DeleteOperation.run(DeleteOperation.java:38)
at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:194)
at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:409)
at com.hazelcast.spi.impl.operationexecutor.impl.OperationThread.process(OperationThread.java:115)
at com.hazelcast.spi.impl.operationexecutor.impl.OperationThread.run(OperationThread.java:100)
Caused by: java.lang.NullPointerException: null
at (backend not configured)


Could it be that, with an operation timeout of 2 minutes and requests being routed to node 3 (which is going down), every operation pending on the queue is delayed by 2 minutes?
I've checked GC logs
node2 shows no collection till 14:09 which took
node3 shows pauses 20-30 ms until it was fully stopped at 14:01:44

What concerns me is that node 2 was still failing several minutes later and was unable to recover.

WARN  2020-02-25 14:18:33.498  [Thread-1           ] run()                          Exception during background node list rebuild
com.hazelcast.core.OperationTimeoutException: PutOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2020-02-25 14:18:33.497. Start time: 2020-02-25 14:16:33.498. Total elapsed time: 119999 ms. Last operation heartbeat: never. Last operation heartbeat from member: 2020-02-25 14:18:23.499. Invocation{op=com.hazelcast.map.impl.operation.PutOperation{serviceName='hz:impl:mapService', identityHash=126643487, partitionId=27, replicaIndex=0, callId=-4736, invocationTime=1582640193497 (2020-02-25 14:16:33.497), waitTimeout=-1, callTimeout=60000, name=inmemMap}, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=60000, firstInvocationTimeMs=1582640193498, firstInvocationTime='2020-02-25 14:16:33.498', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 00:00:00.000', target=[10.173.240.2]:7008, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.newOperationTimeoutException(InvocationFuture.java:164)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:106)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrowIfException(InvocationFuture.java:79)
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:155)
at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:424)
at com.hazelcast.map.impl.proxy.MapProxySupport.putInternal(MapProxySupport.java:389)
at com.hazelcast.map.impl.proxy.NearCachedMapProxyImpl.putInternal(NearCachedMapProxyImpl.java:164)
at com.hazelcast.map.impl.proxy.MapProxyImpl.put(MapProxyImpl.java:131)
at com.hazelcast.map.impl.proxy.MapProxyImpl.put(MapProxyImpl.java:122)


After node 3 was started again, it also failed, because requests routed to node 2 were timing out.
Here's an example

INFO  2020-02-25 14:21:06.931  [matrix.InvocationMonitorThread] log()                          [10.173.240.3]:7008 [matrix] [3.9.4] Invocations:1 timeouts:1 backup-timeouts:0      (InvocationMonitor.java:65)
ERROR 2020-02-25 14:21:06.941  [main                          ] start()                        Failed to start       (CriticalCall.java:38)
com.hazelcast.core.OperationTimeoutException: GetOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2020-02-25 14:21:06.931. Start time: 2020-02-25 14:19:05.950. Total elapsed time: 120984 ms. Last operation heartbeat: never. Last operation heartbeat from member: 2020-02-25 14:20:53.500. Invocation{op=com.hazelcast.map.impl.operation.GetOperation{serviceName='hz:impl:mapService', identityHash=133775180, partitionId=235, replicaIndex=0, callId=-147, invocationTime=1582640345946 (2020-02-25 14:19:05.946), waitTimeout=-1, callTimeout=60000, name=infoMap}, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=60000, firstInvocationTimeMs=1582640345950, firstInvocationTime='2020-02-25 14:19:05.950', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 00:00:00.000', target=[10.173.240.2]:7008, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=Connection[id=2, /10.173.240.3:36807->/10.173.240.2:7008, endpoint=[10.173.240.2]:7008, alive=true, type=MEMBER]}
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.newOperationTimeoutException(InvocationFuture.java:164)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:106)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrowIfException(InvocationFuture.java:79)
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:155)
at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:424)
at com.hazelcast.map.impl.proxy.MapProxySupport.getInternal(MapProxySupport.java:346)
at com.hazelcast.map.impl.proxy.NearCachedMapProxyImpl.getInternal(NearCachedMapProxyImpl.java:114)
at com.hazelcast.map.impl.proxy.MapProxyImpl.get(MapProxyImpl.java:117)
at com.om.mxs.server.matrix.Matrix.get(Matrix.java:1552)

This was a critical failure, and the process on node 2 exited after that.
No matter how many times node 3 was restarted, it always failed in the same place.

To resolve the problem, our service was restarted on all 3 nodes at the same time.

Ozan Kılıç

Mar 4, 2020, 4:24:30 AM
to haze...@googlegroups.com
Hi, 

Data structure operations run on a thread group called partition operation threads.
Hazelcast divides most data structures into sections called partitions.
Each partition operation thread is assigned certain primary and backup partitions in the cluster.
Map entries are mapped to partitions via consistent hashing of the key.
When you delete key "1000" on map A or put "1000" on map B, the same partition operation thread will handle both operations.
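The routing described above can be sketched as a toy model. This is illustrative only: real Hazelcast hashes the serialized key with Murmur3 rather than `hashCode()`, 271 is just the default partition count, and the helper names below are hypothetical:

```java
// Toy model of key -> partition -> operation-thread routing.
public final class PartitionRoutingModel {

    static final int PARTITION_COUNT = 271; // Hazelcast's default

    // Real Hazelcast hashes the serialized key with Murmur3; hashCode()
    // is used here only to illustrate the deterministic routing.
    static int partitionId(Object key) {
        return Math.floorMod(key.hashCode(), PARTITION_COUNT);
    }

    // Each partition is pinned to exactly one partition-operation thread.
    static int operationThread(int partitionId, int threadCount) {
        return partitionId % threadCount;
    }

    public static void main(String[] args) {
        int threads = 4;
        // The same key used on map A and map B gets the same partition, hence
        // the same operation thread: A's delete("1000") and B's put("1000")
        // queue behind each other.
        int p = partitionId("1000");
        System.out.println("key \"1000\" -> partition " + p
                + " -> opThread " + operationThread(p, threads));
    }
}
```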

If a partition operation thread is stuck on map store operations, then all the data structure operations mapped to that thread will be blocked.

We have plans to introduce offloading of map store operations, but for now you need to be careful about what runs on partition operation threads.
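The blocking effect is easy to demonstrate with plain JDK code, no Hazelcast required: a single-threaded executor stands in for one partition operation thread, a sleep stands in for a slow MapStore call, and the queued "put" is delayed by the full duration of the "delete":

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public final class PartitionThreadBlockingDemo {

    /** Returns how long the fast "put" waited behind the slow "delete", in ms. */
    static long runDemo() throws Exception {
        // One partition-operation thread draining its queue in order.
        ExecutorService partitionThread = Executors.newSingleThreadExecutor();
        try {
            long start = System.nanoTime();
            // Slow DeleteOperation: stands in for MapStore.delete() hitting a slow backend.
            partitionThread.submit(() -> {
                try { Thread.sleep(300); } catch (InterruptedException ignored) { }
            });
            // Fast PutOperation on a DIFFERENT map, but the same partition thread.
            Future<Long> put = partitionThread.submit(() -> System.nanoTime() - start);
            return put.get() / 1_000_000;
        } finally {
            partitionThread.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("put waited ~" + runDemo() + " ms"); // at least ~300 ms
    }
}
```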

Best,




--
Ozan KILIC
Solutions Architect
   hazelcast®
 
 
2 W 5th Ave, Ste 300 | San Mateo, CA 94402 | USA
+1 (650) 521-5453 | hazelcast.com


ivenhov

Mar 5, 2020, 9:26:16 AM
to Hazelcast
Thanks Ozan, this is very useful.

I will definitely give it a read, especially that chapter.
Last time I checked, the Hazelcast documentation had 500+ pages, so it's difficult to keep up sometimes.
This is not a complaint; I definitely prefer that to having no docs at all.

For clarification, regarding partition awareness, does it also mean that part of a map may use the same thread that is used for part of a list?
Also, I was under the impression that operations on a map are first routed to the primary copy and then executed, potentially on a different member.
So member 1 executing map.put(keyA) could be routed to member 2 (where the primary partition lives), the put would be executed there, and then relayed to the backup partitions.
Or would the write happen through a backup partition (if one exists on member 1) and then be routed to the other partitions (including the primary)?

For simplicity I omit near-cache here.

Regards

Ozan Kılıç

Mar 6, 2020, 4:37:43 AM
to haze...@googlegroups.com
>For clarification, regarding partition awareness, does it also mean that part of a map may use the same thread that is used for part of a list?

Yes, non-partitioned structures like List are stored on a single partition, which may host map data too. So, yes to the above question.

>Also, I was under the impression that operations on a map are first routed to the primary copy and then executed, potentially on a different member.
>So member 1 executing map.put(keyA) could be routed to member 2 (where the primary partition lives), the put would be executed there, and then relayed to the backup partitions.
>Or would the write happen through a backup partition (if one exists on member 1) and then be routed to the other partitions (including the primary)?

The put operation is routed to the member that owns the primary partition for the key.
That member executes the put operation on the primary partition.
Then it sends backup operations to the backup partitions and waits for the ACKs (if sync backups are configured).
Finally, it returns the result of the operation.

Also, you can check this new feature to improve the performance of the client originating operations: https://docs.hazelcast.org/docs/latest/manual/html-single/index.html#configuring-backup-acknowledgment 
Basically, this feature changes the operation flow from this: 
client -> primary partition -> backup partition -> primary partition -> client 

To this: 

client -> primary partition -> backup partition -> client 



