Hi all,
We are running a Java application with an embedded Hazelcast instance used to distribute tasks between 3 nodes. Each node has its own embedded Hazelcast instance.
The application has some processing-intensive peaks during the day. During those peaks the nodes disconnect from each other and we end up losing notifications from the ITopic. Even during the peaks the CPU does not reach 90%. I've enabled diagnostic logs and I see constant reports of heartbeats being lost, even while the application is idle.
We only use one queue and one topic.
The VMs are using 2 Intel® Xeon® Processor E7-4890 v2, that is 30 CPU cores.
Hazelcast v3.9.2
Java 1.8.0_151-b12
We already did some tuning on the network interface:
* Disabled LRO
* Increased the RX queue buffers, because ethtool reported many Out Of Buffer events (driver is vmxnet3 version 1.4.7.0-k-NAPI)
After 3 days the same issue was triggered again.
Before the RX buffer change there were constant reports of missing heartbeats between nodes. Since the change there are fewer reports, and the longest reported time without a heartbeat is around 2 minutes.
We will increase the timeout (hazelcast.max.no.heartbeat.seconds) to 5 minutes just in case this is a problem related to GC or CPU usage, which I doubt.
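To check the GC/CPU hypothesis independently of Hazelcast, something like the following could run in a background thread on each node. It is only a minimal sketch using the standard library: the class name PauseDetector, the 100 ms probe interval and the iteration count are made up for illustration. It sleeps for a fixed interval and records how much longer the sleep actually took, so a long GC pause or a starved scheduler shows up as a large overshoot.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: measures how late a sleeping thread wakes up. */
final class PauseDetector {

    /**
     * Sleeps intervalMs for the given number of iterations and returns the
     * largest overshoot in milliseconds (elapsed time minus requested sleep).
     * Stops early if the thread is interrupted.
     */
    static long maxOvershootMs(long intervalMs, int iterations) {
        long worst = 0;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and report what was seen so far.
                Thread.currentThread().interrupt();
                break;
            }
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            worst = Math.max(worst, elapsedMs - intervalMs);
        }
        return worst;
    }

    public static void main(String[] args) {
        // Assumed settings: 100 ms probes for roughly 10 seconds.
        long worst = maxOvershootMs(100, 100);
        System.out.println("Largest stall observed: " + worst + " ms");
    }
}
```

If the largest stall stays in the low milliseconds while heartbeats are still reported as missing, that would point away from the JVM and back towards the network.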
I've seen that the default for hazelcast.max.no.heartbeat.seconds in version 3.9.2 is 60, but in the latest version it was increased to 300.
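For reference, this is roughly how we plan to set the value explicitly rather than rely on the version default. It is a sketch under the assumption that the property is supplied as a JVM system property before the embedded instance is created (Hazelcast also accepts it in its own configuration); the class name HeartbeatTimeout and the apply method are hypothetical.

```java
/** Hypothetical helper: sets the heartbeat timeout before Hazelcast starts. */
final class HeartbeatTimeout {

    // Property name as used in this post.
    static final String MAX_NO_HEARTBEAT = "hazelcast.max.no.heartbeat.seconds";

    /**
     * Sets the timeout as a system property and returns the previous value,
     * or null if it was not set before.
     */
    static String apply(int seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("timeout must be positive: " + seconds);
        }
        return System.setProperty(MAX_NO_HEARTBEAT, Integer.toString(seconds));
    }

    public static void main(String[] args) {
        apply(300); // 5 minutes, the value we intend to try
        System.out.println(MAX_NO_HEARTBEAT + "=" + System.getProperty(MAX_NO_HEARTBEAT));
        // The embedded Hazelcast instance would be created after this point.
    }
}
```

The same value could be passed on the command line as -Dhazelcast.max.no.heartbeat.seconds=300.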
Unfortunately I cannot share the diagnostic logs in full, but I can redact and post specific pieces.
Have you seen this kind of issue before?
I know there were issues related to LRO, which is why I disabled it on the network interface.
Is it possible to increase the priority of the threads used to process heartbeats?
Would you advise looking for anything specific in order to determine the root cause?
Best Regards