Map SetOperation delayed by 123 seconds


David Zhou

Mar 16, 2017, 11:11:08 AM
to Hazelcast
Hazelcast version 3.7.1

Recently, we observed a strange phenomenon where IMap.set operations were delayed by 123 seconds. I don't mean they timed out after 123 seconds; it appears they did not start for 123 seconds.

We use a Hazelcast map in a web service as a distributed cache for data stored in a NoSQL database. We do rolling deployments to this service; specifically, we take down 25% of hosts at a time to deploy. What I am describing took place during such a deployment.

We have the following properties configured:
hazelcast.max.no.heartbeat.seconds = 120
hazelcast.operation.call.timeout.millis = 2500
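
For completeness, the equivalent declarative form would look roughly like the hazelcast.xml fragment below (we actually set them as JVM system properties, so treat this as an illustrative sketch):

```xml
<hazelcast>
    <properties>
        <!-- member-side properties, as set in our deployment -->
        <property name="hazelcast.max.no.heartbeat.seconds">120</property>
        <property name="hazelcast.operation.call.timeout.millis">2500</property>
    </properties>
</hazelcast>
```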

In logs, I see something like this for every write operation that failed:

07 Mar 2017 17:54:08,967 Operation begins
07 Mar 2017 17:56:11,845 Exception:
com.hazelcast.spi.exception.RetryableHazelcastException: Member [hostA]:5701 is currently shutting down! Operation: com.hazelcast.map.impl.operation.SetOperation{serviceName='hz:impl:mapService', identityHash=341010779, partitionId=217, replicaIndex=0, callId=101686155, invocationTime=1488909371783 (2017-03-07 17:56:11.783), waitTimeout=-1, callTimeout=2500, name=mapname}
    at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.checkNodeState(OperationRunnerImpl.java:221)
    at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:165)
    at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:396)
    at com.hazelcast.spi.impl.operationexecutor.impl.OperationThread.process(OperationThread.java:117)
    at com.hazelcast.spi.impl.operationexecutor.impl.OperationThread.run(OperationThread.java:102)
    at ------ submitted from ------.(Unknown Source)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:111)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrow(InvocationFuture.java:74)
    at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:158)
    at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:377)
    at com.hazelcast.map.impl.proxy.MapProxySupport.setInternal(MapProxySupport.java:433)
    at com.hazelcast.map.impl.proxy.MapProxyImpl.set(MapProxyImpl.java:183)
    at com.hazelcast.map.impl.proxy.MapProxyImpl.set(MapProxyImpl.java:173)
    ...

What's notable is that this operation typically takes about 300 ms including Hazelcast hops. Furthermore, the request began at 17:54:08,967, but according to the Hazelcast exception message, the set operation was only invoked at 17:56:11,783, a gap of about 123 seconds.
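
For what it's worth, the gap can be confirmed directly from the two log timestamps above (a self-contained check; the class and method names are just for illustration):

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class GapCheck {
    // Log format from above: "07 Mar 2017 17:54:08,967" (comma separates milliseconds)
    static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("dd MMM yyyy HH:mm:ss,SSS", Locale.ENGLISH);

    static long gapMillis(String start, String end) {
        return Duration.between(LocalDateTime.parse(start, FMT),
                                LocalDateTime.parse(end, FMT)).toMillis();
    }

    public static void main(String[] args) {
        long gap = gapMillis("07 Mar 2017 17:54:08,967", "07 Mar 2017 17:56:11,783");
        System.out.println(gap + " ms"); // 122816 ms, i.e. just under 123 seconds
    }
}
```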

We also had read operations time out but they timed out cleanly like so:

07 Mar 2017 17:55:03,921 Operation begins
07 Mar 2017 17:55:09,944 Exception:
com.hazelcast.core.OperationTimeoutException: GetOperation invocation failed to complete due to operation-heartbeat-timeout. Current time: 2017-03-07 17:55:09.944. Total elapsed time: 5927 ms. Last operation heartbeat: never. Last operation heartbeat from member: 2017-03-07 17:52:04.102. Invocation{op=com.hazelcast.map.impl.operation.GetOperation{serviceName='hz:impl:mapService', identityHash=1648711352, partitionId=141, replicaIndex=0, callId=0, invocationTime=1488909303950 (2017-03-07 17:55:03.950), waitTimeout=-1, callTimeout=2500, name=mapname}, tryCount=250, tryPauseMillis=500, invokeCount=1, callTimeoutMillis=2500, firstInvocationTimeMs=1488909304017, firstInvocationTime='2017-03-07 17:55:04.017', lastHeartbeatMillis=0, lastHeartbeatTime='1970-01-01 00:00:00.000', target=[hostA]:5701, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=Connection[id=1, /10.161.106.144:34224->hostA/10.85.4.23:5701, endpoint=[hostA]:5701, alive=true, type=MEMBER]}
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.newOperationTimeoutException(InvocationFuture.java:150)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:98)
    at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrow(InvocationFuture.java:74)
    at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:158)
    at com.hazelcast.map.impl.proxy.MapProxySupport.invokeOperation(MapProxySupport.java:377)
    at com.hazelcast.map.impl.proxy.MapProxySupport.getInternal(MapProxySupport.java:308)
    at com.hazelcast.map.impl.proxy.MapProxyImpl.get(MapProxyImpl.java:94)
    ...

So far as I can tell, every get operation that timed out did so in the same way, at most several seconds after the request began. Also, every set operation that failed in the above manner had the same 123-second gap between the start of the request and the operation's invocation.

I have not been able to reproduce the above symptoms, and it is the first time we have seen it this bad since deploying Hazelcast two months ago.

Any pointers on configuration would be greatly appreciated.

Thank you,

Peter Veentjer

Mar 17, 2017, 2:00:47 AM
to haze...@googlegroups.com
Can you provide details about your environment (OS, virtualization, etc.) and JVM?

Regarding the first exception: I need to have a closer look at the stack trace, but it seems this exception is logged because on the executing side the system believes there is no caller, and therefore it logs the exception (to prevent the exception from getting lost).

Normally you would not see this exception, since it is fed back to the caller, which will then decide to do a retry. The fact that you see this logging would indicate that no retry is going to take place, hence the timeout. But it is unclear why a get/put operation would be seen by the system as having no caller.

--
You received this message because you are subscribed to the Google Groups "Hazelcast" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hazelcast+unsubscribe@googlegroups.com.
To post to this group, send email to haze...@googlegroups.com.
Visit this group at https://groups.google.com/group/hazelcast.
To view this discussion on the web visit https://groups.google.com/d/msgid/hazelcast/0057b026-30bc-43d7-b33d-1df3889fbaf3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Peter Veentjer

Mar 17, 2017, 7:22:41 AM
to haze...@googlegroups.com
I had a closer look at the code. At least in 3.7.1, the exception is always logged regardless of whether the operation returns a response or not, so this isn't the problem. The big questions are:

1: why the retryable exception doesn't lead to a retry

2: if the invocation sees the member going down, it should be able to trigger a retry itself. Why did this not happen?

Some additional analysis is needed.

private void handleOperationError(Operation operation, Throwable e) {
    if (e instanceof OutOfMemoryError) {
        OutOfMemoryErrorDispatcher.onOutOfMemory((OutOfMemoryError) e);
    }
    try {
        operation.onExecutionFailure(e);
    } catch (Throwable t) {
        logger.warning("While calling 'operation.onFailure(e)'... op: " + operation + ", error: " + e, t);
    }

    operation.logError(e);

    if (!operation.returnsResponse()) {
        return;
    }
    ...

David

Mar 17, 2017, 2:08:16 PM
to Hazelcast
Hi Peter,

Thank you for looking into it. Could it be that it did retry during the 123 seconds before the RetryableHazelcastException was logged? If only the last failed attempt threw the exception, would that exception report only the start time of the last failed attempt?
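
One thing that makes me suspect retries: the Invocation dump in the earlier GetOperation log shows tryCount=250 and tryPauseMillis=500. If retries did happen back to back with that pause, the worst-case retry window would be in the right ballpark (a back-of-envelope check; the class name is just for illustration):

```java
public class RetryBudget {
    // Values copied from the Invocation dump in the log above
    static final int TRY_COUNT = 250;
    static final int TRY_PAUSE_MILLIS = 500;

    static long worstCasePauseMillis() {
        // Ignores the time each attempt itself takes; pauses alone sum to this
        return (long) TRY_COUNT * TRY_PAUSE_MILLIS;
    }

    public static void main(String[] args) {
        System.out.println(worstCasePauseMillis() + " ms"); // 125000 ms, i.e. about 125 s
    }
}
```

That is only circumstantial, of course, since the 120-second heartbeat timeout is also close to the observed gap.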

The service runs across 15 AWS EC2 hosts. We have 0 async backups and 3 sync backups. The entire data set in the backing database is about 60 MB.

The OS is Amazon Linux 2012.

This is the JVM version:
$ java -version
java version "1.8.0_112"
Java(TM) SE Runtime Environment (build 1.8.0_112-b15)
Java HotSpot(TM) 64-Bit Server VM (build 25.112-b15, mixed mode)

David

Mar 17, 2017, 2:19:51 PM
to Hazelcast
One more thought, does it matter that I've set the following system properties:

hazelcast.max.no.heartbeat.seconds = 120
hazelcast.operation.call.timeout.millis = 2500

But not the following properties (with default values):

hazelcast.client.invocation.timeout.seconds = 120
hazelcast.client.max.no.heartbeat.seconds = 300

The distinction between the client properties and the generic system properties is not clear to me, since in our use case each host is both a client and a server of the distributed map. Since I am accessing my map via map.get(key), am I correct to understand that the client properties (specifically the no-heartbeat seconds) apply to the internally generated Hazelcast client that may be invoking the get request, while the system properties apply to the node's knowledge of the rest of the cluster?
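
To make the question concrete, my (possibly wrong) understanding is that the hazelcast.client.* properties would only be read from a client configuration such as a hazelcast-client.xml like the sketch below, and that an embedded member never consults them:

```xml
<hazelcast-client>
    <properties>
        <!-- defaults shown; presumably only a HazelcastClient instance reads these -->
        <property name="hazelcast.client.invocation.timeout.seconds">120</property>
        <property name="hazelcast.client.max.no.heartbeat.seconds">300</property>
    </properties>
</hazelcast-client>
```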




Peter Veentjer

Mar 17, 2017, 4:00:33 PM
to haze...@googlegroups.com
Hi David, 

Tomorrow I have some time to do further investigation. Can you send the logs of the caller and the callee? (That is, the node that threw the OperationTimeoutException and the node that logged the RetryableHazelcastException.) I'm very curious about what the caller saw, because my guess is that it failed to retry.


David

Mar 20, 2017, 10:54:32 AM
to Hazelcast
Hi Peter,

Unfortunately I can't provide those logs. Is there something in particular you are looking for in the logs?

Peter Veentjer

Mar 20, 2017, 11:52:07 PM
to haze...@googlegroups.com
Hi David,

could you share them with me via my company email? peter at hazelcast dot com

So what we do know is that on the callee side the system worked fine: we see the exception, and it should have been sent back to the caller.

However, even if the caller didn't get that exception from the callee, it should have seen that the node went offline and decided to retry on its own.

I need to get a better understanding of what happened on the caller side at the moment one of the nodes went offline. So I don't need all your logs, just the 3-4 minutes before and after the node left.
