Client operations hang in certain circumstances


moi...@gmail.com

Feb 8, 2016, 9:06:15 AM
to Hazelcast
We have a cluster of up to 40 nodes running Hazelcast 3.6, but the problems described below occur even with only three nodes running.
Clients connect to the cluster using the Java API (there are up to about 100 clients).
On numerous occasions we see client threads stuck forever waiting for a cluster response. Usually this happens when there are changes in cluster topology, for example when we restart more than one node simultaneously. But after such a restart, even new clients that connected to the cluster after the restart can hang.

By the way, I noticed that the client implementation assumes (at least for the map.get operation) that the only condition under which there will be no response to an operation invocation is a broken connection to the cluster node: there is an infinite loop in com.hazelcast.client.spi.impl.ClientInvocationFuture.get(long, TimeUnit) that can be broken only by a failure of the invocation.isConnectionHealthy check, and that check fails only when there is no heartbeat from the cluster node. The loop is effectively infinite because the map.get() operation invokes ClientInvocationFuture.get with a Long.MAX_VALUE timeout. In our environment this has caused problems in both 3.5.4 and 3.6. We mostly use the IMap interface for caching purposes, and we need these operations to fail really fast when anything goes wrong.
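To make the failure mode concrete, here is a paraphrased sketch of the wait loop as I understand it. This is not the actual Hazelcast source, just the shape of the problem with simplified names: the waiter leaves the loop only when a response arrives or the connection-health check fails, so with a Long.MAX_VALUE timeout a lost response means waiting forever.

```java
import java.util.concurrent.TimeUnit;

// Paraphrased sketch of the client wait loop described above;
// not the real ClientInvocationFuture, names and logic simplified.
public class WaitLoopSketch {
    private volatile boolean connectionHealthy = true;
    private Object response; // guarded by this

    public synchronized Object get(long timeout, TimeUnit unit) throws InterruptedException {
        long waitMillis = unit.toMillis(timeout); // map.get() effectively passes Long.MAX_VALUE
        while (response == null) {
            if (!connectionHealthy) {
                // The only exit besides a response: a failed heartbeat check.
                throw new IllegalStateException("connection to member is not healthy");
            }
            // Re-check the health flag periodically; if the response was lost
            // and the heartbeat stays alive, this loop never terminates.
            wait(Math.min(waitMillis, 1000L));
        }
        return response;
    }

    public synchronized void complete(Object r) {
        response = r;
        notifyAll();
    }

    public void markConnectionUnhealthy() {
        connectionHealthy = false;
    }
}
```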

The cluster is running on JDK 1.8.0_60. We create the Hazelcast instance programmatically:
Config config = new XmlConfigBuilder(hazelcastConfigPathname).build();
hazelcastInstance = Hazelcast.newHazelcastInstance(config);

The following is the cluster config snippet:
<group>
<name>...</name>
<password>...</password>
</group>

    <properties>
        <property name="hazelcast.heartbeat.interval.seconds">1</property>
        <property name="hazelcast.max.no.heartbeat.seconds">15</property>
        <property name="hazelcast.max.no.master.confirmation.seconds">50</property>
        <property name="hazelcast.master.confirmation.interval.seconds">30</property>
        <property name="hazelcast.operation.call.timeout.millis">3000</property>
        <property name="hazelcast.slow.invocation.detector.threshold.millis">1000</property>
        <property name="hazelcast.health.monitoring.level">NOISY</property>
        <property name="hazelcast.health.monitoring.delay.seconds">30</property>
        <property name="hazelcast.connect.all.wait.seconds">120</property>
        <property name="hazelcast.memcache.enabled">false</property>
        <property name="hazelcast.rest.enabled">true</property>
    </properties>

<network>
<port auto-increment="true" port-count="10">${hazelcast.port}</port>
<outbound-ports>
<ports>0</ports>
</outbound-ports>
<public-address>${hazelcast.public.address}</public-address>
<join>
<multicast enabled="true" />
</join>
</network>
<partition-group enabled="false" />
<executor-service name="default">
<pool-size>16</pool-size>
<queue-capacity>0</queue-capacity>
</executor-service>

<map name="default">
        <in-memory-format>BINARY</in-memory-format>
        <backup-count>0</backup-count>
        <async-backup-count>0</async-backup-count>
        <time-to-live-seconds>300</time-to-live-seconds>
        <max-idle-seconds>0</max-idle-seconds>
        <eviction-policy>LRU</eviction-policy>
        <max-size policy="FREE_HEAP_PERCENTAGE">20</max-size>
        <eviction-percentage>3</eviction-percentage>
        <min-eviction-check-millis>100</min-eviction-check-millis>
        <merge-policy>com.hazelcast.map.merge.PutIfAbsentMapMergePolicy</merge-policy>
    </map>

[followed by specific maps configuration]

The clients run JDK 1.6.0_45 and 1.7.0_80. We create clients using HazelcastClient.newHazelcastClient() with the following configuration:

<group>
<name>...</name>
<password>...</password>
</group>
    <properties>
        <property name="hazelcast.client.heartbeat.timeout">7000</property>
        <property name="hazelcast.client.heartbeat.interval">2000</property>
        <property name="hazelcast.client.max.failed.heartbeat.count">3</property>
        <property name="hazelcast.client.request.retry.count">20</property>
        <property name="hazelcast.client.request.retry.wait.time">250</property>
        <property name="hazelcast.client.event.thread.count">5</property>
        <property name="hazelcast.client.event.queue.capacity">1000000</property>
        <property name="hazelcast.jmx">true</property>
    </properties>


<network>
<cluster-members>
<address>hazelcast node 1 hostname:5701</address>
...
<address>hazelcast node 40 hostname:5701</address>
</cluster-members>
<connection-attempt-limit>2</connection-attempt-limit>
<connection-timeout>5000</connection-timeout>
<connection-attempt-period>1000</connection-attempt-period>
<smart-routing>true</smart-routing>
<redo-operation>true</redo-operation>
<socket-interceptor enabled="false" />
<aws enabled="false" />
</network>

<executor-pool-size>8</executor-pool-size>
<listeners>
</listeners>
<serialization>
</serialization>
<proxy-factories>
</proxy-factories>
<load-balancer type="random" />
<near-cache name="default">
<max-size>512</max-size>
<time-to-live-seconds>60</time-to-live-seconds>
<max-idle-seconds>0</max-idle-seconds>
<eviction-policy>LRU</eviction-policy>
<invalidate-on-change>true</invalidate-on-change>
<cache-local-entries>false</cache-local-entries>
</near-cache>
[followed by specific near-cache configurations]


The following are stack trace samples from hung clients:

"hz.client_132_hazelcastGlobalCacheProd.internal-2" prio=10 tid=0x099aec00 nid=0x60f5 in Object.wait() [0x073f6000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x9e06e028> (a com.hazelcast.client.spi.impl.ClientInvocationFuture)
        at com.hazelcast.client.spi.impl.ClientInvocationFuture.get(ClientInvocationFuture.java:90)
        - locked <0x9e06e028> (a com.hazelcast.client.spi.impl.ClientInvocationFuture)
        at com.hazelcast.client.spi.impl.ClientInvocationFuture.get(ClientInvocationFuture.java:75)
        at com.hazelcast.client.spi.impl.listener.ClientSmartListenerService.invoke(ClientSmartListenerService.java:82)
        at com.hazelcast.client.spi.impl.listener.ClientSmartListenerService.access$300(ClientSmartListenerService.java:42)
        at com.hazelcast.client.spi.impl.listener.ClientSmartListenerService$1.run(ClientSmartListenerService.java:154)
        - locked <0xc9289268> (a java.lang.Object)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        at java.lang.Thread.run(Thread.java:662)
        at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76)
        at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:92)


"hz.client_131_hazelcastGlobalCacheProd.internal-1" prio=10 tid=0x0dc81c00 nid=0x6061 in Object.wait() [0x17e5c000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x9e15a1b8> (a com.hazelcast.client.spi.impl.ClientInvocationFuture)
        at com.hazelcast.client.spi.impl.ClientInvocationFuture.get(ClientInvocationFuture.java:90)
        - locked <0x9e15a1b8> (a com.hazelcast.client.spi.impl.ClientInvocationFuture)
        at com.hazelcast.client.spi.impl.ClientInvocationFuture.get(ClientInvocationFuture.java:75)
        at com.hazelcast.client.spi.impl.listener.ClientSmartListenerService.invoke(ClientSmartListenerService.java:82)
        at com.hazelcast.client.spi.impl.listener.ClientSmartListenerService.access$300(ClientSmartListenerService.java:42)
        at com.hazelcast.client.spi.impl.listener.ClientSmartListenerService$1.run(ClientSmartListenerService.java:154)
        - locked <0xc814a0b0> (a java.lang.Object)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        at java.lang.Thread.run(Thread.java:662)
        at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76)
        at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:92)

"[ACTIVE] ExecuteThread: '4' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=10 tid=0x22e33800 nid=0x2090 in Object.wait() [0x1e2ad000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x9ebb2a70> (a com.hazelcast.client.spi.impl.ClientInvocationFuture)
        at com.hazelcast.client.spi.impl.ClientInvocationFuture.get(ClientInvocationFuture.java:90)
        - locked <0x9ebb2a70> (a com.hazelcast.client.spi.impl.ClientInvocationFuture)
        at com.hazelcast.client.spi.impl.ClientInvocationFuture.get(ClientInvocationFuture.java:75)
        at com.hazelcast.client.spi.impl.ClientInvocationFuture.get(ClientInvocationFuture.java:38)
        at com.hazelcast.client.spi.ClientProxy.invoke(ClientProxy.java:144)
        at com.hazelcast.client.proxy.ClientMapProxy.size(ClientMapProxy.java:1339)


Any help will be much appreciated.

Best regards,
Mikolaj

M. Sancar Koyunlu

Feb 10, 2016, 9:27:29 AM
to Hazelcast
Hi, 
The infinite loop can be avoided by using async operations with a timeout on the future, as follows:

Future f = map.getAsync(1);
f.get(50, TimeUnit.MILLISECONDS);

But be aware that getting out of this loop does not cancel any operation on the server side. Things will continue, and the client will still receive the response in the background.

Your thread should be freed from this wait when one of the following happens:
1) A response (possibly an exception as the response) comes from the server.
2) The connection is closed.
3) The heartbeat times out.
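To make the bounded-wait pattern above concrete, here is a self-contained sketch using plain java.util.concurrent. A CompletableFuture (Java 8) stands in for the Hazelcast future, and the fall-back-on-timeout policy is just one possible choice; in a real cache you might instead rethrow or record a metric.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedGet {
    // Wait at most timeoutMillis for a value; fall back instead of hanging.
    // Note: timing out here does not cancel the server-side operation.
    public static <T> T getOrFallback(Future<T> future, long timeoutMillis, T fallback) {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return fallback; // treat a slow or lost response as a cache miss
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag
            return fallback;
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        }
    }

    public static void main(String[] args) {
        Future<String> never = new CompletableFuture<>(); // simulates a lost response
        // The caller is freed after ~50 ms instead of waiting forever.
        System.out.println(getOrFallback(never, 50, "miss")); // prints "miss"
    }
}
```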

There should not be a case that we do not cover. But if you are seeing an infinite wait there, that means there is a bug we need to investigate. I have tried to reproduce your problem, but so far no luck. If you can manage to send a step-by-step reproducer scenario, that would be great.

Regards

--
You received this message because you are subscribed to the Google Groups "Hazelcast" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hazelcast+...@googlegroups.com.
To post to this group, send email to haze...@googlegroups.com.
Visit this group at https://groups.google.com/group/hazelcast.
To view this discussion on the web visit https://groups.google.com/d/msgid/hazelcast/476e2961-5a03-4b4a-9f7f-55f0a3a1ce59%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
--
Sancar Koyunlu
Software Engineer, Hazelcast

moi...@gmail.com

Feb 11, 2016, 6:41:58 AM
to Hazelcast
Thanks for the reply.
It will take some time to set up a working example, but first let me describe the scenarios I discovered:
1.
I run a cluster of three nodes with multicast join (Java 1.8).
I run a client with TCP join to the cluster (Java 1.7).
The client connects to the cluster and runs:
        final IMap<Object, Object> map = hazelcast.getMap("testMap");
        for (int i = 0; i < 1000; i++) {
            map.put(System.nanoTime(), System.nanoTime());
            log.info("Map : {}, stats: {}, size: {}", map, map.getLocalMapStats(), map.size());
            // Uninterruptibles is from Guava
            Uninterruptibles.sleepUninterruptibly(1, TimeUnit.SECONDS);
        }

While this is running, I shut down two cluster nodes simultaneously. I wait until the client discovers this, then quickly start those nodes up again simultaneously. Then the client hangs, always on the map.size() operation I think.
I still need to experiment with the getLocalMapStats() and map.size() operations, but when I removed them and did only map put/get operations, it worked fine.
I use hazelcast-client, not hazelcast-client-new.

2. 
This is from our production environment, and it is probably caused by a wrong setup, but it is still not what I would expect:
The same cluster of three nodes is running.
I start a WebLogic server with two EAR applications. Both of them use hazelcast-client 3.5.4, but they try to connect (TCP join) to the 3.6 cluster. One of them always connects, and the other always hangs in com.hazelcast.client.HazelcastClient.newHazelcastClient(). Only when I shut down the whole cluster does the call return (with newHazelcastClient() throwing an exception, obviously).

The fact that my threads can hang there forever (tested for 12 hours), and that the cluster was under no other load during this test, indicates that the response from the server sometimes does not arrive. Whether there was no response at all, or it simply was not delivered to the client, I cannot tell.

By the way, would it be a very bad idea to use lite members instead of clients as a workaround?
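In case it helps frame the question, this is what I believe the member-side change would look like (assuming the 3.6 schema, where lite members were reintroduced; this is my reading of the docs, not a tested setup):

```xml
<hazelcast>
    <!-- joins the cluster as a full member, but stores no partition data -->
    <lite-member enabled="true"/>
</hazelcast>
```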

Best regards,
Mikolaj

M. Sancar Koyunlu

Feb 17, 2016, 4:14:40 AM
to Hazelcast
Hi, 
I am trying to reproduce the first scenario by following your steps, but no luck so far.
Can you get a stack trace when the client hangs?
And what do the client logs say? Do they say ` is SHUTTING_DOWN`? Or is the client trying to reconnect to the cluster?

For your second setup, 3.5.4 clients cannot work with 3.6 servers.
Before version 3.6 we only had minor-level compatibility, meaning 3.5.x clients are compatible only with 3.5.x servers.




M. Sancar Koyunlu

Apr 20, 2016, 7:08:36 AM
to Hazelcast
Hi, we have found and fixed the issue. Here are the related pull requests if you are interested.

Lee Butts

Apr 21, 2016, 3:16:46 AM
to Hazelcast
Hi,

Any ETA on when 3.6.3 will be released with a fix for this? We use listeners a lot, so this issue is a concern for us, as we are about to deploy 3.6.2 to production.

thanks

Lee

M. Sancar Koyunlu

Apr 21, 2016, 8:08:14 AM
to haze...@googlegroups.com
The first week of May seems to be the current plan for 3.6.3.




moi...@gmail.com

May 13, 2016, 4:40:58 AM
to Hazelcast
Great news about finding and fixing the issue. This has been really painful for us, as after any change in cluster topology we usually need to restart a bunch of clients.
Any updates on 3.6.3 release date?

Best regards,
Mikolaj

ih...@hazelcast.com

May 13, 2016, 7:51:10 AM
to Hazelcast, moi...@gmail.com
It will be out in a few days. Regards.