Recently,
we observed a strange phenomenon where IMap.set operations were delayed
by 123 seconds. I don’t mean they timed out after 123 seconds, it
appears they did not start for 123 seconds.
We use a hazelcast
map as a distributed cache for data stored in a NoSQL database in a web
service. We do rolling deployments to this service, specifically we take
down 25% of hosts at a time to deploy. What I am describing took place
during such a deployment.
hazelcast.max.no.heartbeat.
seconds = 120
hazelcast.operation.call.timeout.millis = 2500
In logs, I see something like this for every
write operation that failed:
07 Mar 2017 17:54:08,967 Operation begins
07 Mar 2017 17:56:11,845 Exception:
com.hazelcast.spi.exception.RetryableHazelcastException: Member [hostA]:5701 is currently shutting down! Operation: com.hazelcast.map.impl.operation.SetOperation{serviceName='hz:impl:mapService', identityHash=341010779, partitionId=217, replicaIndex=0, callId=101686155,
invocationTime=1488909371783 (2017-03-07 17:56:11.783), waitTimeout=-1, callTimeout=2500, name=mapname}
at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.checkNodeState(OperationRunnerImpl.java:221)
at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:165)
at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:396)
at com.hazelcast.spi.impl.operationexecutor.impl.OperationThread.process(OperationThread.java:117)
at com.hazelcast.spi.impl.operationexecutor.impl.OperationThread.run(OperationThread.java:102)
at ------ submitted from ------.(Unknown Source)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:111)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrow(InvocationFuture.java:74)
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:158)
at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:377)
at com.hazelcast.map.impl.proxy.MapProxySupport.setInternal(MapProxySupport.java:433)
at com.hazelcast.map.impl.proxy.MapProxyImpl.set(MapProxyImpl.java:183)
at com.hazelcast.map.impl.proxy.MapProxyImpl.set(MapProxyImpl.java:173)
...
What’s
notable is that this is an operation that typically takes 300ms
including hazelcast hops. Furthermore, the request began at 17:54:08,967
but according to the hazelcast exception message, the hazelcast set
operation was only invoked at 17:56:11,783, with a gap of about 122 or
123 seconds.
We also had
read operations time out but they timed out cleanly like so:
07 Mar 2017 17:55:03,921 Operation begins
07 Mar 2017 17:55:09,944 Exception:
com.hazelcast.core.OperationTimeoutException:
GetOperation invocation failed to complete due to
operation-heartbeat-timeout. Current time: 2017-03-07 17:55:09.944.
Total elapsed time: 5927 ms. Last operation heartbeat: never. Last
operation heartbeat from member: 2017-03-07 17:52:04.102.
Invocation{op=com.hazelcast.map.impl.operation.GetOperation{serviceName='hz:impl:mapService', identityHash=1648711352, partitionId=141, replicaIndex=0, callId=0,
invocationTime=1488909303950 (2017-03-07 17:55:03.950),
waitTimeout=-1, callTimeout=2500, name=mapname}, tryCount=250,
tryPauseMillis=500, invokeCount=1, callTimeoutMillis=2500,
firstInvocationTimeMs=1488909304017, firstInvocationTime='2017-03-07
17:55:04.017', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01
00:00:00.000', target=[hostA]:5701, pendingResponse={VOID},
backupsAcksExpected=0, backupsAcksReceived=0,
connection=Connection[id=1, /10.161.106.144:34224->hostA/
10.85.4.23:5701, endpoint=[hostA]:5701, alive=true, type=MEMBER]}
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.newOperationTimeoutException(InvocationFuture.java:150)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:98)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrow(InvocationFuture.java:74)
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:158)
at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:377)
at com.hazelcast.map.impl.proxy.MapProxySupport.getInternal(MapProxySupport.java:308)
at com.hazelcast.map.impl.proxy.MapProxyImpl.get(MapProxyImpl.java:94)
...
So
far as I can tell, every get operation that timed out did so in the
same way, at most several seconds after the request began. Also, every
set operation that failed in the above manner had the same 123 second
gap between operation invocation and exception.
I have not been
able to recreate the above symptoms and it was the first time we saw it
that bad since deploying hazelcast two months ago.
Any pointers on configuration would be greatly appreciated.
Thank you,