We have a cluster of up to 40 nodes running Hazelcast 3.6, but the problems described below occur even with only three nodes running.
Clients connect to the cluster using the Java API (there are up to about 100 clients).
On numerous occasions we see client threads stuck forever waiting for a response from the cluster. This usually happens when the cluster topology changes, for example when we restart more than one node simultaneously. But even new clients that connect to the cluster after such a restart can hang.
By the way, I noticed that the client implementation assumes (at least for the map.get operation) that the only situation in which an invocation gets no response is a broken connection to the cluster node. There is an infinite loop in com.hazelcast.client.spi.impl.ClientInvocationFuture.get(long, TimeUnit) that can be broken only by a failure of the invocation.isConnectionHealthy check, and that check fails only when there is no heartbeat from the cluster node. The loop is infinite because map.get() calls ClientInvocationFuture.get with a timeout of Long.MAX_VALUE. In our environment this has caused problems in both 3.5.4 and 3.6. We mostly use the IMap interface for caching, and we need these operations to fail really fast in case of any problems.
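To illustrate the fail-fast behaviour we are after, here is a minimal sketch (not Hazelcast code) that bounds the wait on any java.util.concurrent.Future and treats a timeout or failure as a cache miss. BoundedCacheRead and getOrFallback are hypothetical names. It assumes the read is issued asynchronously, for example through IMap.getAsync, and that the future returned by the client honours a finite timeout; the latter is an assumption and has not been verified on 3.6. The main method uses a plain executor as a stand-in for a cluster that never answers.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, written without Java 7+ syntax because some clients run JDK 1.6.
public final class BoundedCacheRead {

    private BoundedCacheRead() {
    }

    /**
     * Waits at most the given time for the future to complete.
     * Returns the fallback on timeout, on failure of the operation,
     * or when the calling thread is interrupted.
     */
    public static <V> V getOrFallback(Future<V> future, long timeout, TimeUnit unit, V fallback) {
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Give up on the pending operation and treat it as a cache miss.
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            // The operation itself failed; for a cache a miss is acceptable.
            return fallback;
        } catch (InterruptedException e) {
            // Preserve the interrupt status for the caller.
            Thread.currentThread().interrupt();
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Stand-in for a cluster call that does not answer in time.
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> slow = pool.submit(new Callable<String>() {
                public String call() throws InterruptedException {
                    Thread.sleep(5000);
                    return "late";
                }
            });
            // Prints MISS after roughly 100 ms instead of blocking for 5 s.
            System.out.println(getOrFallback(slow, 100, TimeUnit.MILLISECONDS, "MISS"));
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A wrapper like this would only protect application threads that call it. It would not help the internal hz.client threads that block inside ClientSmartListenerService, as in the first two stack traces further down.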
The cluster is running on jdk1.8.0_60. We create the Hazelcast instance programmatically:
Config config = new XmlConfigBuilder(hazelcastConfigPathname).build();
hazelcastInstance = Hazelcast.newHazelcastInstance(config);
The following is the cluster config snippet:
<group>
    <name>...</name>
    <password>...</password>
</group>
<properties>
    <property name="hazelcast.heartbeat.interval.seconds">1</property>
    <property name="hazelcast.max.no.heartbeat.seconds">15</property>
    <property name="hazelcast.max.no.master.confirmation.seconds">50</property>
    <property name="hazelcast.master.confirmation.interval.seconds">30</property>
    <property name="hazelcast.operation.call.timeout.millis">3000</property>
    <property name="hazelcast.slow.invocation.detector.threshold.millis">1000</property>
    <property name="hazelcast.health.monitoring.level">NOISY</property>
    <property name="hazelcast.health.monitoring.delay.seconds">30</property>
    <property name="hazelcast.connect.all.wait.seconds">120</property>
    <property name="hazelcast.memcache.enabled">false</property>
    <property name="hazelcast.rest.enabled">true</property>
</properties>
<network>
    <port auto-increment="true" port-count="10">${hazelcast.port}</port>
    <outbound-ports>
        <ports>0</ports>
    </outbound-ports>
    <public-address>${hazelcast.public.address}</public-address>
    <join>
        <multicast enabled="true" />
    </join>
</network>
<partition-group enabled="false" />
<executor-service name="default">
    <pool-size>16</pool-size>
    <queue-capacity>0</queue-capacity>
</executor-service>
<map name="default">
    <in-memory-format>BINARY</in-memory-format>
    <backup-count>0</backup-count>
    <async-backup-count>0</async-backup-count>
    <time-to-live-seconds>300</time-to-live-seconds>
    <max-idle-seconds>0</max-idle-seconds>
    <eviction-policy>LRU</eviction-policy>
    <max-size policy="FREE_HEAP_PERCENTAGE">20</max-size>
    <eviction-percentage>3</eviction-percentage>
    <min-eviction-check-millis>100</min-eviction-check-millis>
    <merge-policy>com.hazelcast.map.merge.PutIfAbsentMapMergePolicy</merge-policy>
</map>
[followed by specific maps configuration]
The clients run JDK 1.6.0_45 and 1.7.0_80. We create clients using HazelcastClient.newHazelcastClient() with the following configuration:
<group>
    <name>...</name>
    <password>...</password>
</group>
<properties>
    <property name="hazelcast.client.heartbeat.timeout">7000</property>
    <property name="hazelcast.client.heartbeat.interval">2000</property>
    <property name="hazelcast.client.max.failed.heartbeat.count">3</property>
    <property name="hazelcast.client.request.retry.count">20</property>
    <property name="hazelcast.client.request.retry.wait.time">250</property>
    <property name="hazelcast.client.event.thread.count">5</property>
    <property name="hazelcast.client.event.queue.capacity">1000000</property>
    <property name="hazelcast.jmx">true</property>
</properties>
<network>
    <cluster-members>
        <address>hazelcast node 1 hostname:5701</address>
        ...
        <address>hazelcast node 40 hostname:5701</address>
    </cluster-members>
    <connection-attempt-limit>2</connection-attempt-limit>
    <connection-timeout>5000</connection-timeout>
    <connection-attempt-period>1000</connection-attempt-period>
    <smart-routing>true</smart-routing>
    <redo-operation>true</redo-operation>
    <socket-interceptor enabled="false" />
    <aws enabled="false" />
</network>
<executor-pool-size>8</executor-pool-size>
<listeners>
</listeners>
<serialization>
</serialization>
<proxy-factories>
</proxy-factories>
<load-balancer type="random" />
<near-cache name="default">
    <max-size>512</max-size>
    <time-to-live-seconds>60</time-to-live-seconds>
    <max-idle-seconds>0</max-idle-seconds>
    <eviction-policy>LRU</eviction-policy>
    <invalidate-on-change>true</invalidate-on-change>
    <cache-local-entries>false</cache-local-entries>
</near-cache>
[followed by specific near-cache configurations]
The following are sample stack traces of hung clients:
"hz.client_132_hazelcastGlobalCacheProd.internal-2" prio=10 tid=0x099aec00 nid=0x60f5 in Object.wait() [0x073f6000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x9e06e028> (a com.hazelcast.client.spi.impl.ClientInvocationFuture)
    at com.hazelcast.client.spi.impl.ClientInvocationFuture.get(ClientInvocationFuture.java:90)
    - locked <0x9e06e028> (a com.hazelcast.client.spi.impl.ClientInvocationFuture)
    at com.hazelcast.client.spi.impl.ClientInvocationFuture.get(ClientInvocationFuture.java:75)
    at com.hazelcast.client.spi.impl.listener.ClientSmartListenerService.invoke(ClientSmartListenerService.java:82)
    at com.hazelcast.client.spi.impl.listener.ClientSmartListenerService.access$300(ClientSmartListenerService.java:42)
    at com.hazelcast.client.spi.impl.listener.ClientSmartListenerService$1.run(ClientSmartListenerService.java:154)
    - locked <0xc9289268> (a java.lang.Object)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)
    at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76)
    at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:92)

"hz.client_131_hazelcastGlobalCacheProd.internal-1" prio=10 tid=0x0dc81c00 nid=0x6061 in Object.wait() [0x17e5c000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x9e15a1b8> (a com.hazelcast.client.spi.impl.ClientInvocationFuture)
    at com.hazelcast.client.spi.impl.ClientInvocationFuture.get(ClientInvocationFuture.java:90)
    - locked <0x9e15a1b8> (a com.hazelcast.client.spi.impl.ClientInvocationFuture)
    at com.hazelcast.client.spi.impl.ClientInvocationFuture.get(ClientInvocationFuture.java:75)
    at com.hazelcast.client.spi.impl.listener.ClientSmartListenerService.invoke(ClientSmartListenerService.java:82)
    at com.hazelcast.client.spi.impl.listener.ClientSmartListenerService.access$300(ClientSmartListenerService.java:42)
    at com.hazelcast.client.spi.impl.listener.ClientSmartListenerService$1.run(ClientSmartListenerService.java:154)
    - locked <0xc814a0b0> (a java.lang.Object)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)
    at com.hazelcast.util.executor.HazelcastManagedThread.executeRun(HazelcastManagedThread.java:76)
    at com.hazelcast.util.executor.HazelcastManagedThread.run(HazelcastManagedThread.java:92)

"[ACTIVE] ExecuteThread: '4' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=10 tid=0x22e33800 nid=0x2090 in Object.wait() [0x1e2ad000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x9ebb2a70> (a com.hazelcast.client.spi.impl.ClientInvocationFuture)
    at com.hazelcast.client.spi.impl.ClientInvocationFuture.get(ClientInvocationFuture.java:90)
    - locked <0x9ebb2a70> (a com.hazelcast.client.spi.impl.ClientInvocationFuture)
    at com.hazelcast.client.spi.impl.ClientInvocationFuture.get(ClientInvocationFuture.java:75)
    at com.hazelcast.client.spi.impl.ClientInvocationFuture.get(ClientInvocationFuture.java:38)
    at com.hazelcast.client.spi.ClientProxy.invoke(ClientProxy.java:144)
    at com.hazelcast.client.proxy.ClientMapProxy.size(ClientMapProxy.java:1339)

Any help will be much appreciated.
Best regards,
Mikolaj