Embedded Hazelcast Heartbeat Issues

germa...@gmail.com

Aug 26, 2019, 2:35:51 PM

to Hazelcast

Hi all,

We are running a Java application that uses embedded Hazelcast to distribute tasks between 3 nodes. Each node has its own embedded Hazelcast instance.

The application has some processing-intensive peaks during the day. We are having issues because the nodes disconnect during those peaks and we end up losing notifications from the ITopic. Even during those peaks the CPU is not even at 90%. I've enabled diagnostic logs and I notice constant reports of heartbeats being lost, even while the application is idle.

We only use one queue and one topic.

The VMs use 2 Intel® Xeon® E7-4890 v2 processors, that is, 30 CPU cores.

Hazelcast v3.9.2

Java 1.8.0_151-b12

We already did some tuning on the network interface:

* Disable LRO

* Increase RX queue buffers because there were many Out Of Buffer reported by ethtool (Driver is vmxnet3 version 1.4.7.0-k-NAPI)

After 3 days the same issue was triggered again.

Before the RX buffer change there were constant reports of missing heartbeats between nodes. Since the change the number has dropped, and the maximum reported time without a heartbeat is around 2 minutes.

We will increase the timeout (hazelcast.max.no.heartbeat.seconds) to 5 minutes just in case this is a problem related to GC or CPU usage, which I doubt.
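For reference, a minimal sketch of one way to apply that timeout in an embedded setup, assuming the value is passed as a JVM system property before the Hazelcast instance is created. The class and method names are hypothetical; only the property name and the 5-minute value come from this post.

```java
public class HeartbeatTimeoutConfig {
    // Property name as quoted in this post.
    static final String MAX_NO_HEARTBEAT = "hazelcast.max.no.heartbeat.seconds";

    // Hypothetical helper: sets the timeout as a JVM system property.
    // It has to run before the embedded Hazelcast instance is created,
    // because the property is read when the instance starts.
    static void raiseHeartbeatTimeout(int seconds) {
        System.setProperty(MAX_NO_HEARTBEAT, Integer.toString(seconds));
    }

    public static void main(String[] args) {
        raiseHeartbeatTimeout(300); // 5 minutes, the value proposed above
        System.out.println(MAX_NO_HEARTBEAT + "=" + System.getProperty(MAX_NO_HEARTBEAT));
        // Create the embedded Hazelcast instance only after this point.
    }
}
```

The same setting can also be given on the command line as `-Dhazelcast.max.no.heartbeat.seconds=300`, or through `Config.setProperty(...)` on the configuration object used to start the instance.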

I've seen that default on version 3.9.2 for hazelcast.max.no.heartbeat.seconds is 60, but on latest version this was increased to 300.

Unfortunately I cannot share the diagnostic logs, but I can sanitize and post specific pieces.

Have you seen this kind of issue before?

I know there were issues related to LRO, which is why I disabled it on the network interface.

Is it possible to increase the priority of the threads used to process heartbeats?

Do you advise looking for something specific in order to determine the root cause?

Best Regards

Randy May

Aug 26, 2019, 2:40:28 PM

to haze...@googlegroups.com

This sounds a lot like Java garbage-collection-related pauses. Have you been able to rule that out?

Randy May

Senior Solution Architect

Hazelcast Inc.

350 Cambridge Ave #100, Palo Alto, CA 94306 USA

Email: rand...@hazelcast.com

Phone: +1 (404) 414-5730

Skype: randy.may | Slack: randy.may

--
You received this message because you are subscribed to the Google Groups "Hazelcast" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hazelcast+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/hazelcast/18501159-46bb-4012-93e2-a8701ddbdc70%40googlegroups.com.

germa...@gmail.com

Aug 26, 2019, 2:45:38 PM

to Hazelcast

The following statement from my previous message is incorrect:

I've seen that default on version 3.9.2 for hazelcast.max.no.heartbeat.seconds is 60, but on latest version this was increased to 300.

The default is still 60 seconds.

Another thing I've noticed is that there are a LOT of reports about missing heartbeats from the node. Every time the member heartbeats are reported (every 10 minutes), there are missing heartbeats from the same node:

23-08-2019 04:34:59 1566527699135 MemberHeartbeats[

member[XXXXXXXXX]:6701[

deviation(%)=292986.71875

noHeartbeat(ms)=14,654,336

lastHeartbeat(ms)=1,566,513,044,799

lastHeartbeat(date-time)=23-08-2019 00:30:44

now(ms)=1,566,527,699,135

now(date-time)=23-08-2019 04:34:59]]
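As a sanity check on the entry above, here is a small sketch that recomputes the gap from the two timestamps. The timestamps are copied from the log. The 5000 ms heartbeat interval and the deviation formula are assumptions: they are not stated anywhere in this thread, they are simply one combination that reproduces the logged deviation value.

```java
import java.time.Duration;

public class HeartbeatGap {
    // Values copied from the MemberHeartbeats entry above.
    static final long LAST_HEARTBEAT_MS = 1_566_513_044_799L;
    static final long NOW_MS = 1_566_527_699_135L;
    // Assumption: a 5000 ms heartbeat interval (not stated in the thread).
    static final long ASSUMED_INTERVAL_MS = 5_000L;

    // Time elapsed since the last heartbeat was received.
    static long gapMs() {
        return NOW_MS - LAST_HEARTBEAT_MS;
    }

    // Renders a millisecond count as hours, minutes and whole seconds.
    static String format(long ms) {
        Duration d = Duration.ofMillis(ms);
        return d.toHours() + "h " + (d.toMinutes() % 60) + "m " + (d.getSeconds() % 60) + "s";
    }

    // Assumed formula: how far the gap exceeds the expected interval, in percent.
    static double deviationPercent(long gapMs, long intervalMs) {
        return 100.0 * (gapMs - intervalMs) / intervalMs;
    }

    public static void main(String[] args) {
        long gap = gapMs();
        System.out.println("noHeartbeat(ms)=" + gap + " (" + format(gap) + ")");
        System.out.println("deviation(%)~" + deviationPercent(gap, ASSUMED_INTERVAL_MS));
    }
}
```

The computed gap equals the logged noHeartbeat(ms) value of 14,654,336, i.e. a little over 4 hours, which matches the two date-time fields (00:30:44 to 04:34:59).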

germa...@gmail.com

Aug 26, 2019, 2:47:15 PM

to Hazelcast

On Monday, 26 August 2019 15:40:28 UTC-3, Randy May wrote:

This sounds a lot like Java garbage-collection-related pauses. Have you been able to rule that out?

Not yet. I'm analyzing that right now.

Thanks


Ozan Kılıç

Aug 28, 2019, 4:50:24 AM

to haze...@googlegroups.com

You can use metric-plot to visualize your diagnostic logs: https://github.com/hazelcast/metric-plot/

Then you can check metrics such as `gc.majorTime` and `gc.minorTime` to see whether you are suffering from GC pauses.
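If plotting the diagnostic logs is not an option, a rough cross-check can be done from inside the application with the standard JMX beans. This is only a sketch: the class name is hypothetical, and the beans report cumulative collection time per collector, not the length of individual pauses, so a sudden jump between two samples is the thing to look for.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcTimeProbe {
    // Sums the cumulative collection time (ms) over all collectors in this JVM.
    static long totalGcTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Print one line per collector; sample this periodically and compare deltas.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + " count=" + gc.getCollectionCount()
                    + " time(ms)=" + gc.getCollectionTime());
        }
        System.out.println("total GC time(ms)=" + totalGcTimeMs());
    }
}
```

If the total grows by tens of seconds within one heartbeat timeout window, GC pauses become a plausible explanation for the missed heartbeats.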

