Hazelcast smart client freeze issue


N Mac

Aug 8, 2016, 10:48:00 AM
to Hazelcast

Hi there,


We have a three-node Hazelcast setup running inside WebLogic 11g on SUSE Linux, as follows:


Server nodes run inside a dedicated EAR across three WebLogic managed servers, giving a three-node grid.


Several business applications run across many different EAR files, also deployed to WebLogic managed servers.  These connect to the grid using the Hazelcast smart client.  The Hazelcast server process and client are not within the same EAR but may run on the same JVM.  Each client is accessed by many application threads as EJB calls are made.


We have 20 or so caches, but only three are large (200k items), and each client has near caches defined.  Our issue relates to random lockups of the smart client when invoking get or getAll.  A client can wait indefinitely for a response; the request/response message is seemingly lost within the server node, or within the client itself.  The result is a total lockup of that client instance: the application is rendered dead, and we need to kill the underlying process to recover.


We have verified this with both 3.5.5 and 3.6.4.  It is really hard to trace what is going on due to the multi-process, multi-threaded nature of the server and client.  However, there seems to be a pattern: when one of the underlying caches is cleared periodically, the application threads check and re-populate fresh values into it, and this re-filling process can sometimes provoke the issue.  We have tried to replicate this with test code but fail every time to reproduce the lockup.  We also see nothing specific in log files or standard output that helps.  Even within our test systems we may only see the issue once per week at most; it is very intermittent but very severe when it occurs.
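As a stopgap while diagnosing, blocking waits like these can be bounded at the application level: IMap.getAsync returns a java.util.concurrent.Future, so a timed get lets the caller give up rather than hang forever. A minimal sketch of the pattern, using a plain CompletableFuture as a stand-in for an invocation whose response never arrives (the class name and timeout value here are hypothetical, not from Hazelcast):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedGet {
    // Wait for a result for at most `millis` ms; return null instead of blocking forever.
    static <T> T getWithTimeout(Future<T> future, long millis) throws Exception {
        try {
            return future.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // give up on the lost invocation
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        // A response that never arrives, standing in for the lost request/response.
        CompletableFuture<String> never = new CompletableFuture<>();
        System.out.println(getWithTimeout(never, 100)); // prints "null" after ~100 ms
    }
}
```

This does not fix the underlying lost invocation, but it turns an indefinite application freeze into a recoverable timeout that can be logged and retried.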

3.5.5


"[ACTIVE] ExecuteThread: '66' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=5 tid=0x440043200000 nid=0x35ea  [ JVM locked by VM (w/poll advisory bit) waiting on VM lock 'com.hazelcast.client.spi.impl.ClientInvocationFuture', polling bits: safep ]

   java.lang.Thread.State: TIMED_WAITING (on object monitor)

        at java.lang.Object.wait(Native Method)

        - waiting on <0x0000418dac213818> (a com.hazelcast.client.spi.impl.ClientInvocationFuture)

        at com.hazelcast.client.spi.impl.ClientInvocationFuture.get(ClientInvocationFuture.java:118)

        - locked <0x0000418dac213818> (a com.hazelcast.client.spi.impl.ClientInvocationFuture)

        at com.hazelcast.client.spi.impl.ClientInvocationFuture.get(ClientInvocationFuture.java:103)

        at com.hazelcast.client.spi.ClientProxy.invoke(ClientProxy.java:130)

        at com.hazelcast.client.proxy.ClientMapProxy.get(ClientMapProxy.java:198)


3.6.4


"[STUCK] ExecuteThread: '120' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=5 tid=0x440038200000 nid=0x73fa  [ JVM locked by VM (w/poll advisory bit) waiting on VM lock 'com.hazelcast.client.spi.impl.ClientInvocationFuture', polling bits: safep rstak ]

   java.lang.Thread.State: TIMED_WAITING (on object monitor)

        at java.lang.Object.wait(Native Method)

        - waiting on <0x0000418c670aafc8> (a com.hazelcast.client.spi.impl.ClientInvocationFuture)

        at com.hazelcast.client.spi.impl.ClientInvocationFuture.get(ClientInvocationFuture.java:88)

        - locked <0x0000418c670aafc8> (a com.hazelcast.client.spi.impl.ClientInvocationFuture)

        at com.hazelcast.client.spi.impl.ClientInvocationFuture.get(ClientInvocationFuture.java:74)

        at com.hazelcast.client.spi.impl.ClientInvocationFuture.get(ClientInvocationFuture.java:37)

        at com.hazelcast.client.spi.ClientProxy.invokeOnPartition(ClientProxy.java:126)

        at com.hazelcast.client.spi.ClientProxy.invoke(ClientProxy.java:120)

        at com.hazelcast.client.proxy.ClientMapProxy.getInternal(ClientMapProxy.java:225)

        at com.hazelcast.client.proxy.NearCachedClientMapProxy.getInternal(NearCachedClientMapProxy.java:107)

        at com.hazelcast.client.proxy.ClientMapProxy.get(ClientMapProxy.java:220)


Would anyone have any advice to help us resolve this?

Vassilis Bekiaris

Aug 9, 2016, 10:27:37 AM
to haze...@googlegroups.com

Hi,

Since your case is connected to cache clear invocations, it sounds similar to an issue identified with Cache.clear causing several near cache invalidation events to be delivered instead of just one; see https://github.com/hazelcast/hazelcast/pull/8649. This causes a large number of callbacks (actually equal to the partition count) to be registered for asynchronous execution, which may result in an OOME or just plain overloading.

Having said that, do you see any "slow operation detected" messages in your logs? As general advice for monitoring Hazelcast, with 3.6.4 you should be able to get some insight into what is happening on cluster members using the following properties (pass them either on JVM startup or as Hazelcast properties; these apply to 3.6.x versions):

hazelcast.health.monitoring.level=NOISY
hazelcast.performance.monitoring.enabled=true
hazelcast.performance.metric.level=INFO
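
These can equally be passed as JVM system properties at startup, for example (the application jar name below is a placeholder):

```shell
java -Dhazelcast.health.monitoring.level=NOISY \
     -Dhazelcast.performance.monitoring.enabled=true \
     -Dhazelcast.performance.metric.level=INFO \
     -jar your-app.jar
```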

The most accurate info would come from running your cluster with Flight Recorder enabled (if you are able to do so) and obtaining a recording, to see what client and member threads are executing at the time of lockup.

Cheers,

Vassilis


N Mac

Aug 9, 2016, 11:59:28 AM
to Hazelcast
Hi

Would this also affect IMap? I forgot to mention that we only use IMap.

Thanks

Vassilis Bekiaris

Aug 9, 2016, 12:08:26 PM
to haze...@googlegroups.com
Hmm, no; this issue is specific to Cache. Can you share your Hazelcast configuration?

Best,
Vassilis

N Mac

Aug 9, 2016, 12:27:49 PM
to Hazelcast
Hi,

OK, on the server side we have this:

<properties>
  <property name="hazelcast.logging.type">slf4j</property>
  <property name="hazelcast.jmx">true</property>
  <property name="hazelcast.jmx.detailed">true</property>
  <property name="hazelcast.io.thread.count">6</property>
</properties>

<map name="TEST">
  <in-memory-format>BINARY</in-memory-format>
  <statistics-enabled>false</statistics-enabled>
  <backup-count>0</backup-count>
  <async-backup-count>1</async-backup-count>
  <time-to-live-seconds>86400</time-to-live-seconds>
  <max-idle-seconds>86400</max-idle-seconds>
  <eviction-policy>LFU</eviction-policy>
  <max-size>100000</max-size>
</map>

and the corresponding client connection config is:

<properties>
  <property name="hazelcast.client.connection.timeout">10000</property>
  <property name="hazelcast.client.retry.count">200</property>
</properties>

<network>
  <cluster-members>
    <address>localhost:12701</address>
  </cluster-members>

  <smart-routing>true</smart-routing>
  <redo-operation>true</redo-operation>
  <connection-attempt-period>10000</connection-attempt-period>
  <connection-attempt-limit>720</connection-attempt-limit>
  <socket-options>
    <tcp-no-delay>true</tcp-no-delay>
    <keep-alive>true</keep-alive>
    <reuse-address>true</reuse-address>
  </socket-options>
</network>

<!-- local executor pool size -->
<executor-pool-size>40</executor-pool-size>

<near-cache name="TEST">
  <max-size>100000</max-size>

  <!-- 3 hr if not read -->
  <max-idle-seconds>10800</max-idle-seconds>

  <invalidate-on-change>true</invalidate-on-change>
  <in-memory-format>OBJECT</in-memory-format>
  <local-update-policy>INVALIDATE</local-update-policy>
</near-cache>

We only use IMap, no other features of Hazelcast at this point in time.

I will check the logs for slow operations as you kindly suggest.

Thanks again.

Ahmet Mircik

Aug 15, 2016, 9:46:05 AM
to Hazelcast

Hi,

Could you please test this issue using the latest snapshot? We sent a fix related to the problem that Vassilis mentioned; it may have an effect on this issue.

The latest snapshot with the fix is 3.7.1-SNAPSHOT.

Snapshot repository to get it:

    <repositories>
        <repository>
            <id>sonatype-snapshots</id>
            <name>Sonatype Snapshot Repository</name>
            <url>https://oss.sonatype.org/content/repositories/snapshots</url>
            <releases>
                <enabled>false</enabled>
            </releases>
            <snapshots>
                <enabled>true</enabled>
            </snapshots>
        </repository>
    </repositories>
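
With that repository in place, the snapshot can be pulled in via the usual Hazelcast Maven coordinates, for example (a sketch; hazelcast-client is only needed on the client side):

```xml
<dependency>
    <groupId>com.hazelcast</groupId>
    <artifactId>hazelcast</artifactId>
    <version>3.7.1-SNAPSHOT</version>
</dependency>
<dependency>
    <groupId>com.hazelcast</groupId>
    <artifactId>hazelcast-client</artifactId>
    <version>3.7.1-SNAPSHOT</version>
</dependency>
```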
