Sudden Slowness on Hazelcast 3.10 using the distributed map

Benjamin E.Ndugga

unread,

May 29, 2019, 7:01:59 AM5/29/19

to Hazelcast

Hi Team,

I am facing sudden slowness in processing put and get operations on a map using Hazelcast 3.10. I am running 3 RedHat Linux on a x86_64 machines with 30CPU each and there are no signs of high CPU showing in the logs, however the applications connecting to the cluster shows "Unable to connect to any address in the config! error which indicates that the cluster is not running which quite contrary as the IP and Port and listening.

What could be the cause of this and how can I add more logging inform to show internal operations. Please find my attached config file for some of the common maps I use in the code snippets below.

Here is the common operation on Hazelcast, we connect fetch the current count then disconnect , do-something then reconnect and increaset the count by 1


        ClientConfig clientConfig = new ClientConfig();
        clientConfig.setProperty("hazelcast.logging.type", "none");

        clientConfig.setGroupConfig(new GroupConfig("dev", "dev-pass"));
        ClientNetworkConfig networkConfig = clientConfig.getNetworkConfig();

        networkConfig.addAddress("192.168.0.94");
        networkConfig.addAddress("192.168.0.134");
        networkConfig.addAddress("192.168.0.137");

        networkConfig.setSmartRouting(true);
        networkConfig.setConnectionTimeout(500);
        networkConfig.setConnectionAttemptPeriod(250);
        networkConfig.setConnectionAttemptLimit(1);
        clientConfig.setNetworkConfig(networkConfig);

        HazelcastInstance client = HazelcastClient.newHazelcastClient(clientConfig);

        IMap<String, Integer> map = client.getMap("somemap");

        Integer count = map.get("key1");

After doing some other operation we then reconnect and increase the count by 1.

Note:

This operation to increase count happens in a separate thread.
The somemap map usese the default configuration on Hazelcast


        ClientConfig clientConfig = new ClientConfig();
        clientConfig.setProperty("hazelcast.logging.type", "none");

        clientConfig.setGroupConfig(new GroupConfig("dev", "dev-pass"));
        ClientNetworkConfig networkConfig = clientConfig.getNetworkConfig();

        networkConfig.addAddress("192.168.0.94");
        networkConfig.addAddress("192.168.0.134");
        networkConfig.addAddress("192.168.0.137");

        networkConfig.setSmartRouting(true);
        networkConfig.setConnectionTimeout(500);
        networkConfig.setConnectionAttemptPeriod(250);
        networkConfig.setConnectionAttemptLimit(1);
        clientConfig.setNetworkConfig(networkConfig);

        HazelcastInstance client = HazelcastClient.newHazelcastClient(clientConfig);

        IMap<String, Integer> map = client.getMap("somemap");

        Integer count = map.put("key1",count+1);

        //shut-down client 
        client.shutdown();

Regards,

Benjamin E Ndugga

hz-config-prod.xml

Josef Cacek

unread,

May 30, 2019, 8:49:10 AM5/30/19

to haze...@googlegroups.com

Hi Benjamin,

to enable logging, remove from your code lines which set the "hazelcast.logging.type" property to "none". Then the default logging should be used - java.util.logging framework.

If the INFO log level is not sufficient, then you can add following constant and the static section to your client class to enable full logging:

// hold Logger reference in a constant, the changes can be lost before the HZ is initialized otherwise.

private static final java.util.logging.Logger logger = java.util.logging.Logger.getLogger("com.hazelcast");

static {

java.util.logging.ConsoleHandler ch = new java.util.logging.ConsoleHandler();

ch.setLevel(java.util.logging.Level.ALL);

logger.addHandler(ch);

logger.setLevel(java.util.logging.Level.ALL);

}

Regards,

-- Josef

--
You received this message because you are subscribed to the Google Groups "Hazelcast" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hazelcast+...@googlegroups.com.
To post to this group, send email to haze...@googlegroups.com.
Visit this group at https://groups.google.com/group/hazelcast.
To view this discussion on the web visit https://groups.google.com/d/msgid/hazelcast/b28229a5-084b-4b75-a645-562b3f905e92%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Benjamin E.Ndugga

unread,

Jun 2, 2019, 8:14:27 AM6/2/19

to Hazelcast

Hi Josef,

Thank you for your input, I modified the code and began to see all the error messages showing up. This one below has caught my eye when I deployed the application on glassfish.

The web application [/MYWEBBAPP] created a ThreadLocal with key of type [com.hazelcast.util.UuidUtil$1] (value [com.hazelcast.util.UuidUtil$1@62be3f08]) and a value of type [java.util.Random] (value [java.util.Random@15a73607]) but failed to remove it when the web application was stopped. Threads are going to be renewed over time to try and avoid a probable memory leak.

Then from the Hazelcast Cache side there errors indicating that the GetOperations are slow

Jun 02, 2019 1:38:50 PM com.hazelcast.spi.impl.operationexecutor.slowoperationdetector.SlowOperationDetector
WARNING: [172.27.98.94]:5701 [dev] [3.10.3] Slow operation detected: com.hazelcast.map.impl.operation.GetOperation (309 invocations)
java.net.SocketInputStream.socketRead0(Native Method)
    java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    java.net.SocketInputStream.read(SocketInputStream.java:171)
    java.net.SocketInputStream.read(SocketInputStream.java:141)
    (...)

What could be the Issue here as I have added one node to this but the Issue still pops ups and that is the difference between Map.Put and Map.PutAsync , does put method wait while PutAsync run in a separate thread.

I am trying to optimise put and get operations as these are affected the effectiveness of the application. Please let me know if you need more info

Regards,

Benjamin E Ndugga

To unsubscribe from this group and stop receiving emails from it, send an email to haze...@googlegroups.com.

Jaromir Hamala

unread,

Jun 2, 2019, 9:35:50 AM6/2/19

to Hazelcast

Hi Benjanim,

are you using MapLoader/MapStore? Can you share a full stacktrace from the slow operation?

Cheers,

Jaromir

Benjamin E.Ndugga

unread,

Jun 2, 2019, 1:21:35 PM6/2/19

to Hazelcast

Hi Jaromir,

Yes I am using a MapStore with a configuration as follows:

<map name="mymap">
        <in-memory-format>BINARY</in-memory-format>
        <backup-count>1</backup-count>
        <async-backup-count>0</async-backup-count>
        <time-to-live-seconds>2592000</time-to-live-seconds>
        <max-idle-seconds>2592000</max-idle-seconds>
        <eviction-policy>LFU</eviction-policy>
        <max-size policy="PER_NODE">10000000</max-size>
        <eviction-percentage>25</eviction-percentage>
        <min-eviction-check-millis>100</min-eviction-check-millis>
        <merge-policy>com.hazelcast.map.merge.PutIfAbsentMapMergePolicy</merge-policy>
        <cache-deserialized-values>INDEX-ONLY</cache-deserialized-values>
        <map-store enabled="true" initial-mode="LAZY">
            <class-name>com.MyMapStore</class-name>
            <write-delay-seconds>30</write-delay-seconds>
            <write-batch-size>100</write-batch-size>
            <write-coalescing>false</write-coalescing>
        </map-store>
    </map>

When I start up the Hazelcast IMDG I am only able to see the logging information from the mapstore and the extra information is written to the console window. How can I redirect or write that information to an output file, that way I will be able to sahre the full stack trace.

Please see my startup options:

JAVA_OPTS="$JAVA_OPTS -XX:+UseG1GC"
JAVA_OPTS="$JAVA_OPTS -XX:MaxGCPauseMillis=300"
JAVA_OPTS="$JAVA_OPTS -XX:+UseGCLogFileRotation"
JAVA_OPTS="$JAVA_OPTS -XX:ParallelGCThreads=22"
JAVA_OPTS="$JAVA_OPTS -XX:NumberOfGCLogFiles=5"
JAVA_OPTS="$JAVA_OPTS -XX:GCLogFileSize=10M"
JAVA_OPTS="$JAVA_OPTS -XX:+PrintGCDetails"
JAVA_OPTS="$JAVA_OPTS -XX:+PrintGCDateStamps"
JAVA_OPTS="$JAVA_OPTS -Xloggc:/opt/hazelcast-logs/garbageCollector.log"
JAVA_OPTS="$JAVA_OPTS -XX:+HeapDumpOnOutOfMemoryError"
JAVA_OPTS="$JAVA_OPTS -XX:+PrintTenuringDistribution"
JAVA_OPTS="$JAVA_OPTS -XX:HeapDumpPath=/opt/hazelcast-logs/"
JAVA_OPTS="$JAVA_OPTS -Dlog4j.configuration=file:///home/Benjamin/Documents/hazelcast-3.10.3/bin/log4j.xml"

Please see my log4j file:

<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration debug="true"  xmlns:log4j='http://jakarta.apache.org/log4j/'>
  <!--
   <appender name="consoleAppender" class="org.apache.log4j.ConsoleAppender">
      <layout class="org.apache.log4j.PatternLayout">
         <param name="ConversionPattern" value="%d{dd MMM yyyy HH:mm:ss} %5p %c{1} - %m%n"/>
      </layout>
   </appender>
  -->
   <appender name="fileAppender" class="org.apache.log4j.DailyRollingFileAppender">
      <param name="append" value="true"/>
      <!--<param name="MaxFileSize" value="1KB"/>-->
      <param name="maxBackupIndex" values="100"/>
      <param name="file" value="/opt/hazelcast-logs/hazelcast.log"/>
      <param name="DatePattern" value="'-'yyyy'_'MM-dd-'T'HH'.log'"/>
      <layout class="org.apache.log4j.PatternLayout">
         <!--<param name="ConversionPattern" value="%d{ABSOLUTE} %-5p [%c{1}] %m%n"/>-->
         <param name="ConversionPattern" value="%d{dd MMM yyyy HH:mm:ss} %5p %c{1} - %m%n"/>
      </layout>
      <rollingPolicy class="org.apache.log4j.rolling.TimeBasedRollingPolicy">
         <param name ="ActiveFileName"  value="hazlecast.log"/>
         <param name ="FileNamePattern" value="hazelcast-.%d{yyyyMMdd-HHmm}.log.gz"/>
      </rollingPolicy>
   </appender>

   <root>
      <level value="DEBUG"/>
      <appender-ref ref="consoleAppender"/>
      <appender-ref ref="fileAppender"/>
   </root>

</log4j:configuration>

I need to share the stack trace from the IMDG loggin data so that we can know what happens to the processing internally

Regards,

Benjamin

Jaromir Hamala

unread,

Jun 3, 2019, 5:04:54 AM6/3/19

to haze...@googlegroups.com

the few frames from the stacktrace you shared point to network I/O. However the GetOperation does not do any network I/O on its own. Hence I am guessing it's your MapStore implementation doing network calls?

You can enable additional diagnostics by starting Hazelcast with following flags:

-Dhazelcast.diagnostics.enabled=true
-Dhazelcast.diagnostics.metric.level=info
-Dhazelcast.diagnostics.storeLatency.period.seconds=1

This will create diagnostics logs which will contain latency info abour your MapLoader implementation.

Cheers,

Jaromir

--
You received this message because you are subscribed to the Google Groups "Hazelcast" group.

To unsubscribe from this group and stop receiving emails from it, send an email to hazelcast+...@googlegroups.com.

To post to this group, send email to haze...@googlegroups.com.
Visit this group at https://groups.google.com/group/hazelcast.

To view this discussion on the web visit https://groups.google.com/d/msgid/hazelcast/ced32b12-bc9b-41ef-b667-6298fa961d81%40googlegroups.com.

orhan yilmaz

unread,

Jun 3, 2019, 6:18:39 AM6/3/19

to haze...@googlegroups.com

Hello

Upload your gc logs to gceasy.io or use censum trial account to analyze gc logs. That will tell you it is io bounded or not. If you see high kernel times check your implementation.

BR

3 Haz 2019 Pzt 12:04 tarihinde Jaromir Hamala <jar...@hazelcast.com> şunu yazdı:

To view this discussion on the web visit https://groups.google.com/d/msgid/hazelcast/CAOW%2Bd6QW%3DxH298vt6%3D3SbfuDY%2B3ngODgvJ27%3D5rkBp2cDCk0Uw%40mail.gmail.com.

Guido Medina

unread,

Jun 3, 2019, 7:15:17 AM6/3/19

to haze...@googlegroups.com

Why so many GC threads? the defaults are usually OK which is 1.5 x CPUs, unless you have 32+ CPUs that's certainly worrying, also; max pause = 300ms? the default is 200ms,

why would you want a bigger pause? if you want a consistent GC why don't you specify a heap size?

Here are some good settings -excluding the ones you already have-, experiment with them and see if that helps,

IMHO there is no point in using G1GC if your heap size is under 4G which is where it starts shinning:

-server
-Xms4g
-Xmx4g
-XX:+ParallelRefProcEnabled

-XX:+PerfDisableSharedMem - see why this option at https://www.evanjones.ca/jvm-mmap-pause.html

To view this discussion on the web visit https://groups.google.com/d/msgid/hazelcast/CAHYNpRmvvsKo%2BX4tM-7%2BBbuD4MHaC%3D962WUt5AJM3%2BsZJbQ1zw%40mail.gmail.com.

Guido Medina

unread,

Jun 3, 2019, 7:18:38 AM6/3/19

to haze...@googlegroups.com

From https://www.oracle.com/technetwork/articles/java/g1gc-1984535.html

-XX:ParallelGCThreads=n Sets the value of the STW worker threads. Sets the value of n to the number of logical processors.

The value of n is the same as the number of logical processors up to a value of 8.

Benjamin E.Ndugga

unread,

Jun 6, 2019, 12:30:11 PM6/6/19

to Hazelcast

Hi Team,

I have added the diagnostics properties and I have been able to gather some statistics that I have attached them herewith. There is a line which shows a stacktrace, like this I wonder whether I should be worried:

06-06-2019 19:03:11 1559836991468 ConnectionRemoved[
                          Connection[id=2930269, /172.27.98.137:5701->/172.27.98.135:40320, endpoint=[172.27.98.135]:40320, alive=false, type=JAVA_CLIENT]
                          type=JAVA_CLIENT
                          isAlive=false
                          closeReason=Connection closed by the other side
                          CloseCause[
                                  java.io.EOFException: Remote socket closed!
                                  com.hazelcast.internal.networking.nio.NioInboundPipeline.process(NioInboundPipeline.java:132)
                                  com.hazelcast.internal.networking.nio.NioThread.handleSelectionKey(NioThread.java:383)
                                  com.hazelcast.internal.networking.nio.NioThread.handleSelectionKeys(NioThread.java:368)
                                  com.hazelcast.internal.networking.nio.NioThread.selectLoop(NioThread.java:275)
                                  com.hazelcast.internal.networking.nio.NioThread.run(NioThread.java:230)]]

06 Jun 2019 17:58:05  INFO HealthMonitor - [172.27.98.137]:5701 [dev] [3.10.3] processors=30, physical.memory.total=102.4G, physical.memory.free=89.5G, swap.space.total=8.0G, swap.space.free=8.0G, heap.memory.used=7.8G, heap.memory.free=2.2G, heap.memory.total=10.0G, heap.memory.max=10.0G, heap.memory.used/total=77.74%, heap.memory.used/max=77.74%, minor.gc.count=356, minor.gc.time=10290ms, major.gc.count=0, major.gc.time=0ms, load.process=2.49%, load.system=1.86%, load.systemAverage=0.02, thread.count=720, thread.peakCount=725, cluster.timeDiff=-1, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operations.size=189, executor.q.priorityOperation.size=0, operations.completed.count=12772895, executor.q.mapLoad.size=0, executor.q.mapLoadAllKeys.size=0, executor.q.cluster.size=0, executor.q.response.size=0, operations.running.count=1, operations.pending.invocations.percentage=0.00%, operations.pending.invocations.count=6, proxy.count=15, clientEndpoint.count=20, connection.active.count=22, client.connection.count=20, connection.count=2
    at com.hazelcast.internal.networking.AbstractChannel.read(AbstractChannel.java:107)
    at com.hazelcast.internal.networking.nio.NioInboundPipeline.process(NioInboundPipeline.java:129)
    at com.hazelcast.internal.networking.nio.NioThread.handleSelectionKey(NioThread.java:383)
    at com.hazelcast.internal.networking.nio.NioThread.handleSelectionKeys(NioThread.java:368)
    at com.hazelcast.internal.networking.nio.NioThread.selectLoop(NioThread.java:275)
    at com.hazelcast.internal.networking.nio.NioThread.run(NioThread.java:230)
    at com.hazelcast.internal.networking.AbstractChannel.read(AbstractChannel.java:107)
    at com.hazelcast.internal.networking.nio.NioInboundPipeline.process(NioInboundPipeline.java:129)
    at com.hazelcast.internal.networking.nio.NioThread.handleSelectionKey(NioThread.java:383)
    at com.hazelcast.internal.networking.nio.NioThread.handleSelectionKeys(NioThread.java:368)
    at com.hazelcast.internal.networking.nio.NioThread.selectLoop(NioThread.java:275)
    at com.hazelcast.internal.networking.nio.NioThread.run(NioThread.java:230)

I also uploaded the gc file and the number of CPU stated in the start-up are fine. I hope to move to Java 8 and the report is fine, there are no leaks

This is my implementation on the client side:

ClientConfig clientConfig = new ClientConfig();



        clientConfig.setGroupConfig(new GroupConfig("user", "hz_pass"));
        ClientNetworkConfig networkConfig = clientConfig.getNetworkConfig();

        networkConfig.addAddress("172.27.98.137");
        networkConfig.addAddress("172.27.98.94");
        networkConfig.addAddress("172.27.98.134");



        networkConfig.setSmartRouting(true);
        networkConfig.setConnectionTimeout(500);
        networkConfig.setConnectionAttemptPeriod(250);
        networkConfig.setConnectionAttemptLimit(1);
        clientConfig.setNetworkConfig(networkConfig);

        HazelcastInstance client = HazelcastClient.newHazelcastClient(clientConfig);



        IMap<String, String> map = client.getMap("mymap");

        map.put("key1", "value1");

How can that be improved, is map.putAsync("key1", "value1"); much better since I need to execute it on a separate thread

diagnostics-172.27.98.137_5701-1559780734831-051.log.zip

Benjamin E.Ndugga

unread,

Jun 17, 2019, 10:29:38 AM6/17/19

to Hazelcast

Thanks Team, indeed there were I/O issues on the nodes that were fixed

anirudd...@gmail.com

unread,

Jul 25, 2019, 5:31:16 AM7/25/19

to Hazelcast

Hello,

We are also facing same issue with our hazelcast cluster. When idle it works fine but when loaded it creates tall spikes in res. time.

Can you please tell us what were the i/o issues and how they are fied ?

- Aniruddha K.

Benjamin E.Ndugga

unread,

Jul 30, 2019, 1:23:11 PM7/30/19

to Hazelcast

Hi Aniruddha ,

We had a database contention , we implemented a MapStore that connects to the database and fetches data that is not present in the data. That query was taking long to respond since the table had not been optimised.

I have also noticed that with this version 3.10 when you have too many update requests to the cache server they tend to slow down and even the slow detector won't pick it. So I have a plan to use the latest version hopefully it will be better.

Reply all

Reply to author

Forward