We have a three-node Hazelcast setup running inside WebLogic 11g on SUSE Linux, as follows:
The server nodes run inside a dedicated EAR deployed across three WebLogic managed servers, giving the three-node grid.
Several business applications, packaged across many different EAR files, are also deployed to WebLogic managed servers. These connect to the grid using the Hazelcast smart client. The Hazelcast server process and client are not within the same EAR, but may be running on the same JVM. Each client is accessed by many client threads as EJB calls are made.
We have 20 or so caches, but only three are large (around 200k items each), and each client has near caches defined for them. Our issue relates to random lockups of the smart client when invoking get or getAll: a client can wait indefinitely for a response, the request/response message seemingly lost either within the server node or within the client itself. The result is a total lockup of that client instance; the application is rendered dead and we have to kill the underlying process to recover.
We have verified this on both 3.5.5 and 3.6.4. It is very hard to trace what is going on, given the multiple processes and the threaded nature of both server and client, but there seems to be a pattern: one of the underlying caches is cleared periodically, application threads then check for and repopulate fresh values into it, and this refilling of the cache can sometimes provoke the issue. We have tried to replicate this with test code but have failed every time to reproduce the lockup, and we see nothing specific in log files or standard output that helps. Even on our test systems we may see the issue once a week at most; it is very intermittent but very severe when it occurs.
Stack trace on 3.5.5:
"[ACTIVE] ExecuteThread: '66' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=5 tid=0x440043200000 nid=0x35ea [ JVM locked by VM (w/poll advisory bit) waiting on VM lock 'com.hazelcast.client.spi.impl.ClientInvocationFuture', polling bits: safep ]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x0000418dac213818> (a com.hazelcast.client.spi.impl.ClientInvocationFuture)
at com.hazelcast.client.spi.impl.ClientInvocationFuture.get(ClientInvocationFuture.java:118)
- locked <0x0000418dac213818> (a com.hazelcast.client.spi.impl.ClientInvocationFuture)
at com.hazelcast.client.spi.impl.ClientInvocationFuture.get(ClientInvocationFuture.java:103)
at com.hazelcast.client.spi.ClientProxy.invoke(ClientProxy.java:130)
at com.hazelcast.client.proxy.ClientMapProxy.get(ClientMapProxy.java:198)
Stack trace on 3.6.4:
"[STUCK] ExecuteThread: '120' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=5 tid=0x440038200000 nid=0x73fa [ JVM locked by VM (w/poll advisory bit) waiting on VM lock 'com.hazelcast.client.spi.impl.ClientInvocationFuture', polling bits: safep rstak ]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x0000418c670aafc8> (a com.hazelcast.client.spi.impl.ClientInvocationFuture)
at com.hazelcast.client.spi.impl.ClientInvocationFuture.get(ClientInvocationFuture.java:88)
- locked <0x0000418c670aafc8> (a com.hazelcast.client.spi.impl.ClientInvocationFuture)
at com.hazelcast.client.spi.impl.ClientInvocationFuture.get(ClientInvocationFuture.java:74)
at com.hazelcast.client.spi.impl.ClientInvocationFuture.get(ClientInvocationFuture.java:37)
at com.hazelcast.client.spi.ClientProxy.invokeOnPartition(ClientProxy.java:126)
at com.hazelcast.client.spi.ClientProxy.invoke(ClientProxy.java:120)
at com.hazelcast.client.proxy.ClientMapProxy.getInternal(ClientMapProxy.java:225)
at com.hazelcast.client.proxy.NearCachedClientMapProxy.getInternal(NearCachedClientMapProxy.java:107)
at com.hazelcast.client.proxy.ClientMapProxy.get(ClientMapProxy.java:220)
Would anyone have any advice to help us resolve this?
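As a stopgap we are experimenting with bounding each client read with a timeout, so a lost response surfaces as a TimeoutException instead of a permanently parked ExecuteThread. This is a sketch only: in our real code the Future would come from Hazelcast's IMap.getAsync(key) (its ICompletableFuture extends java.util.concurrent.Future); here a plain CompletableFuture stands in so the example is self-contained, and timedGet is our own illustrative helper name.

```java
import java.util.concurrent.*;

public class BoundedGet {
    // Waits at most timeoutMs for the value; returns null on timeout so the
    // caller can fall back to loading from the source of record.
    static <V> V timedGet(Future<V> f, long timeoutMs) throws Exception {
        try {
            return f.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            f.cancel(true); // give up on this invocation rather than block forever
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        Future<String> fast = CompletableFuture.completedFuture("value");
        Future<String> never = new CompletableFuture<>(); // simulates a lost response
        System.out.println(timedGet(fast, 100));  // value
        System.out.println(timedGet(never, 100)); // null
    }
}
```

This does not fix the underlying lost-message problem, but it at least keeps the application responsive when it happens.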
Hi,
Since your case is connected to cache clear invocations, it sounds similar to an issue we identified where Cache.clear caused several near-cache invalidation events to be delivered instead of just one; see https://github.com/hazelcast/hazelcast/pull/8649. This causes a large number of callbacks (equal, in fact, to the partition count) to be registered for asynchronous execution, which may result in an OOME or simply overload the client.
Having said that, do you see any "slow operation detected" messages in your logs? As general advice for monitoring Hazelcast, on 3.6.4 you should be able to get some insight into what is happening on the cluster members with the following properties (pass them at JVM startup or as Hazelcast properties; these names apply to the 3.6.x versions):
hazelcast.health.monitoring.level=NOISY
hazelcast.performance.monitoring.enabled=true
hazelcast.performance.metric.level=INFO
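If it is easier than editing the managed servers' startup arguments, these can also be set as plain system properties before the Hazelcast instance starts, since Hazelcast falls back to system properties for its configuration. A minimal sketch (the class name is illustrative):

```java
// Sets the 3.6.x diagnostic properties listed above programmatically,
// equivalent to passing -D flags on the JVM command line. Must run
// before the Hazelcast member/client is created.
public class DiagnosticsSetup {
    public static void main(String[] args) {
        System.setProperty("hazelcast.health.monitoring.level", "NOISY");
        System.setProperty("hazelcast.performance.monitoring.enabled", "true");
        System.setProperty("hazelcast.performance.metric.level", "INFO");
        System.out.println(System.getProperty("hazelcast.health.monitoring.level")); // NOISY
    }
}
```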
The most accurate information would come from running your cluster with Java Flight Recorder enabled (if you are able to do so) and obtaining a recording, to see what the client and member threads are executing at the time of the lockup.
Cheers,
Vassilis
--
You received this message because you are subscribed to the Google Groups "Hazelcast" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hazelcast+...@googlegroups.com.
To post to this group, send email to haze...@googlegroups.com.
Visit this group at https://groups.google.com/group/hazelcast.
To view this discussion on the web visit https://groups.google.com/d/msgid/hazelcast/11599ea4-6956-4407-9006-56d2cacbf79d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Hi,
Could you please test this issue using the latest snapshot? A fix related to the problem Vassilis mentioned has gone in, and it may have an effect on this issue.
The latest snapshot containing the fix is 3.7.1-SNAPSHOT.
Snapshot repository to get it:
<repositories>
  <repository>
    <id>sonatype-snapshots</id>
    <name>Sonatype Snapshot Repository</name>
    <url>https://oss.sonatype.org/content/repositories/snapshots</url>
    <releases>
      <enabled>false</enabled>
    </releases>
    <snapshots>
      <enabled>true</enabled>
    </snapshots>
  </repository>
</repositories>
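For completeness, the corresponding dependency coordinates would then look like this (assuming you use the client artifact; the member artifact id is hazelcast):

```xml
<dependency>
  <groupId>com.hazelcast</groupId>
  <artifactId>hazelcast-client</artifactId>
  <version>3.7.1-SNAPSHOT</version>
</dependency>
```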