Embedded Hazelcast Heartbeat Issues


germa...@gmail.com

Aug 26, 2019, 2:35:51 PM
to Hazelcast
Hi all,

We are running a Java application that uses embedded Hazelcast to distribute tasks between its 3 nodes. Each node has an embedded Hazelcast instance.
The application has some processing-intensive peaks during the day. We are having issues because the nodes disconnect during those peaks and we end up losing notifications from the ITopic. Even during the peaks, CPU usage does not even reach 90%. I've enabled diagnostic logs and I notice constant reports of lost heartbeats, even while the application is idle.
We only use one queue and one topic.

The VMs are using 2 Intel® Xeon® Processor E7-4890 v2, that is 30 CPU cores.
Hazelcast v3.9.2
Java 1.8.0_151-b12

We already did some tuning on the network interface:
* Disabled LRO
* Increased the RX queue buffers, because ethtool reported many Out Of Buffer events (the driver is vmxnet3, version 1.4.7.0-k-NAPI)

After 3 days the same issue was triggered again.
Before the RX buffer change there were constant reports of missing heartbeats between nodes. Since the change there are fewer reports, and the longest reported time without a heartbeat is around 2 minutes.

We will increase the timeout (hazelcast.max.no.heartbeat.seconds) to 5 minutes just in case this is a problem related to GC or CPU usage, which I doubt. 
I've seen that default on version 3.9.2 for hazelcast.max.no.heartbeat.seconds is 60, but on latest version this was increased to 300.
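
For readers who want to try the same change, here is a minimal sketch of one way to raise the timeout from Java, by setting it as a JVM system property. It assumes the embedded instance is created after this code runs and that the Hazelcast configuration does not set the same property itself; the class and method names are only illustrative.

```java
public class HeartbeatTimeout {
    // Property name as quoted in the post above.
    static final String MAX_NO_HEARTBEAT = "hazelcast.max.no.heartbeat.seconds";

    /** Sets the timeout as a JVM system property and returns the value now in effect. */
    static String apply(int seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("timeout must be positive");
        }
        System.setProperty(MAX_NO_HEARTBEAT, Integer.toString(seconds));
        return System.getProperty(MAX_NO_HEARTBEAT);
    }

    public static void main(String[] args) {
        // 300 s is the 5 minutes mentioned above.
        // Assumption: this runs before the embedded Hazelcast instance is started.
        System.out.println(MAX_NO_HEARTBEAT + "=" + apply(300));
    }
}
```

Passing `-Dhazelcast.max.no.heartbeat.seconds=300` on the JVM command line should have the same effect without any code change.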

Unfortunately I cannot share the diagnostic logs. I can modify and post specific pieces.

Have you seen this kind of issue before?
I know there were issues related to LRO, which is why I disabled it on the network interface.
Is it possible to increase the priority of the threads used to process heartbeats?
What would you advise looking for in order to determine the root cause?

Best Regards

Randy May

Aug 26, 2019, 2:40:28 PM
to haze...@googlegroups.com
This sounds a lot like Java garbage collection related pauses. Have you been able to rule that out?
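
One rough way to check for pauses from inside the application is to let a thread sleep for a fixed time and measure how late it wakes up. The sketch below is only an illustration of that idea (the class and method names are made up, and the 100 ms sleep is an arbitrary choice). A large overshoot means the whole JVM or VM was stalled, but it cannot tell a GC pause apart from CPU starvation or a hypervisor stall.

```java
public class PauseDetector {
    /**
     * Sleeps sleepMs milliseconds, iterations times, and returns the worst
     * overshoot in milliseconds (how much later than requested the thread woke up).
     */
    static long maxOvershootMillis(int iterations, long sleepMs) {
        long worst = 0;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop measuring.
                Thread.currentThread().interrupt();
                break;
            }
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            worst = Math.max(worst, Math.max(0, elapsedMs - sleepMs));
        }
        return worst;
    }

    public static void main(String[] args) {
        // About 5 seconds of sampling; in practice this would run in a
        // background thread for the lifetime of the application.
        long worst = maxOvershootMillis(50, 100);
        System.out.println("worst wake-up delay: " + worst + " ms");
    }
}
```

If the worst delay stays in the low milliseconds while heartbeats are still being reported as missing, stalls of the whole process become a less likely explanation.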





Randy May
Senior Solution Architect

350 Cambridge Ave #100, Palo Alto, CA 94306 USA
Skype: randy.may | Slack: randy.may

--
You received this message because you are subscribed to the Google Groups "Hazelcast" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hazelcast+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/hazelcast/18501159-46bb-4012-93e2-a8701ddbdc70%40googlegroups.com.

germa...@gmail.com

Aug 26, 2019, 2:45:38 PM
to Hazelcast
The following statement is incorrect

I've seen that default on version 3.9.2 for hazelcast.max.no.heartbeat.seconds is 60, but on latest version this was increased to 300.

Default is still 60 seconds.

Another thing I've noticed is that there are a LOT of reports about missing heartbeats from the node. Every time the member heartbeats are reported (every 10 minutes), there are missing heartbeats from the same node:

23-08-2019 04:34:59 1566527699135 MemberHeartbeats[
                          member[XXXXXXXXX]:6701[
                                  deviation(%)=292986.71875
                                  noHeartbeat(ms)=14,654,336
                                  lastHeartbeat(ms)=1,566,513,044,799
                                  lastHeartbeat(date-time)=23-08-2019 00:30:44
                                  now(ms)=1,566,527,699,135
                                  now(date-time)=23-08-2019 04:34:59]]
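
The figures in this entry can be cross-checked with a few lines of Java. The gap is simply `now(ms)` minus `lastHeartbeat(ms)`. The deviation formula below is inferred from the numbers in the entry, not taken from the Hazelcast source, and it assumes a heartbeat interval of 5000 ms (believed to be the 3.x default).

```java
import java.time.Duration;

public class HeartbeatGap {
    /** Time since the last heartbeat, in milliseconds. */
    static long gapMillis(long nowMs, long lastHeartbeatMs) {
        return nowMs - lastHeartbeatMs;
    }

    /** Assumed formula: how far the gap exceeds the expected interval, in percent. */
    static double deviationPercent(long gapMs, long intervalMs) {
        return 100.0 * (gapMs - intervalMs) / intervalMs;
    }

    public static void main(String[] args) {
        long now = 1_566_527_699_135L;   // now(ms) from the log entry
        long last = 1_566_513_044_799L;  // lastHeartbeat(ms) from the log entry
        long gap = gapMillis(now, last);

        Duration d = Duration.ofMillis(gap);
        System.out.println("noHeartbeat(ms) = " + gap);
        System.out.println("as duration     = " + d.toHours() + " h "
                + (d.toMinutes() % 60) + " min " + (d.getSeconds() % 60) + " s");
        // 5000 ms is an assumption about the configured heartbeat interval.
        System.out.println("deviation(%)    = " + deviationPercent(gap, 5000L));
    }
}
```

The computed gap is 14,654,336 ms, about 4 h 4 min 14 s, which agrees with both `noHeartbeat(ms)` and the two date-time fields (00:30:44 to 04:34:59). With a 5000 ms interval the deviation comes out at roughly 292986.72, matching the reported value up to floating-point rounding.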

germa...@gmail.com

Aug 26, 2019, 2:47:15 PM
to Hazelcast


On Monday, 26 August 2019 15:40:28 UTC-3, Randy May wrote:
This sounds a lot like java garbage collection related pauses. Have you been able to rule that out ?



Not yet. I'm analyzing that right now.

Thanks

 

Ozan Kılıç

Aug 28, 2019, 4:50:24 AM
to haze...@googlegroups.com
You can use metric-plot to visualize your diagnostics logs: https://github.com/hazelcast/metric-plot/
Then you can check metrics like `gc.majorTime`, `gc.minorTime` to see if you suffer from GC pauses. 
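
If plotting the diagnostics logs is not an option, cumulative GC counts and times can also be read directly from the JVM. This is a minimal sketch using the standard `GarbageCollectorMXBean` API from the JDK; it is independent of metric-plot, and the class name is only illustrative. Sampling it before and after a processing peak gives a rough idea of how much time went to GC, although it reports totals rather than individual pause lengths.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTotals {
    /** Sum of the accumulated collection time of all collectors, in milliseconds. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionTime() returns -1 when the collector does not report a time.
            total += Math.max(0, gc.getCollectionTime());
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", time(ms)=" + gc.getCollectionTime());
        }
        System.out.println("total GC time (ms): " + totalGcMillis());
    }
}
```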

