Troubleshoot Unresponsive Cluster

Michael Brizic

Jul 11, 2023, 5:50:52 PM
to Hazelcast
We're seeing an issue where terminating an EC2 instance that is a Hazelcast member causes the remaining nodes to enter what looks like a deadlocked, unresponsive state. They consume all available compute resources, which essentially takes down our system.

Our cluster usually has 3 nodes (EC2 instances) and stores less than 1 GB of data in total.

The most recent failure produced many errors of the form: "invocation failed to complete due to operation-heartbeat-timeout".

Where should I begin troubleshooting this issue?

Hazelcast 5.1
Spring Boot 2.7.6
Java 17

Ozan Kılıç

Aug 8, 2023, 3:50:20 AM
to haze...@googlegroups.com
When you terminate the instance, the OS does not have time to close its TCP connections properly. The other members therefore still see those connections as open and keep trying to send operations to the dead member for 60 seconds (the default), until the heartbeat timeout removes it from the cluster.
You should kill the Hazelcast process before terminating the instance, so that the connections are closed and the other members notice right away. Alternatively, you can use the ping failure detector, which removes dead members faster.
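
For reference, here is a minimal sketch of what enabling the ping (ICMP) failure detector could look like in programmatic Java configuration. It is an illustration under assumptions, not a tested recommendation for this cluster: the class name FailureDetectionConfig and the method build() are hypothetical, the interval, timeout, and attempt values are placeholders to tune, and hazelcast.max.no.heartbeat.seconds is assumed to be the property behind the 60-second heartbeat timeout mentioned above. ICMP also has to be permitted between the members (for example in the EC2 security groups), and the process may need OS privileges to send ICMP echo requests.

```java
import com.hazelcast.config.Config;
import com.hazelcast.config.IcmpFailureDetectorConfig;

// Hypothetical helper class; the name is not from the thread.
public final class FailureDetectionConfig {

    private FailureDetectionConfig() {
    }

    // Builds a member Config with the ICMP (ping) failure detector enabled.
    public static Config build() {
        Config config = new Config();

        // Assumed property for the heartbeat timeout discussed above.
        // It is set to 60 here only to make the value visible; lowering it
        // speeds up detection but raises the risk of false positives.
        config.setProperty("hazelcast.max.no.heartbeat.seconds", "60");

        IcmpFailureDetectorConfig icmp = new IcmpFailureDetectorConfig();
        icmp.setEnabled(true);
        // Placeholder values: ping every second, wait up to one second for a reply.
        icmp.setIntervalMilliseconds(1000);
        icmp.setTimeoutMilliseconds(1000);
        // Placeholder: number of failed pings tolerated before the member is suspected.
        icmp.setMaxAttempts(3);
        // Do not abort member startup if ICMP turns out to be unavailable.
        icmp.setFailFastOnStartup(false);

        config.getNetworkConfig().setIcmpFailureDetectorConfig(icmp);
        return config;
    }
}
```

In a Spring Boot application, returning this Config from a @Bean method should be enough for the auto-configured HazelcastInstance to pick it up. The same settings can also be expressed declaratively in hazelcast.xml or hazelcast.yaml.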


--
Ozan Kilic 
Support Manager, EMEA
Hazelcast
