Hazelcast heartbeat timeout?


Tom Poage

Jun 2, 2016, 11:14:10 AM
to CAS Community
Morning,

We started running CAS 4.2.1 w/ Hazelcast (hz.cluster.tcpip.enabled=true) on Linux VMs (RedHat variant) a couple of weeks ago, with three nodes on the same subnet. Things seemed fine initially, but a couple of days ago we started getting cluster errors: first a heartbeat timeout, then several (dis)connects and attempted repartitions, and finally a frozen cluster.

Has anyone experienced this? E.g.

> 2016-06-01 21:01:25,330 WARN [com.hazelcast.cluster.impl.ClusterHeartbeatManager] - [-------.50]:5701 [dev] [3.6] Removing Member [------.55]:5701 because it has not sent any heartbeats for 5000 ms. Last heartbeat time was Wed Jun 01 21:01:20 PDT 2016
> 2016-06-01 21:01:25,330 INFO [com.hazelcast.cluster.ClusterService] - [------.50]:5701 [dev] [3.6] Old master Address[------.55]:5701 left the cluster, assigning new master Member [128.120.39.50]:5701 this
...
> 2016-06-01 21:01:29,167 WARN [com.hazelcast.partition.InternalPartitionService] - [------.50]:5701 [dev] [3.6] This is the master node and received a PartitionRuntimeState from Address[------.55]:5701. Ignoring incoming state!
...
> 2016-06-01 21:05:16,046 INFO [com.hazelcast.cluster.impl.operations.JoinCheckOperation] - [------.50]:5701 [dev] [3.6] Ignoring join check from Address[------.55]:5701, because cluster is in FROZEN state ...

Interestingly enough, if we shut down one of the nodes (leaving two), the issue does not recur--at least in the time we've been monitoring.

The only recourse seems to be a full cluster restart.

Thanks for any advice!

Tom.


Tom Poage

Jun 2, 2016, 11:58:24 AM
to CAS Community
So it seems the default heartbeat timeout in Hazelcast is 5 minutes, but the default heartbeat timeout in CAS is 5 seconds.

Purposeful (rationale?), or a scaling error?

Thanks!
Tom.
> --
> You received this message because you are subscribed to the Google Groups "CAS Community" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to cas-user+u...@apereo.org.
> To post to this group, send email to cas-...@apereo.org.
> Visit this group at https://groups.google.com/a/apereo.org/group/cas-user/.
> To view this discussion on the web visit https://groups.google.com/a/apereo.org/d/msgid/cas-user/B8F20E5F-0BC3-44AE-B53F-BCFD1B181E3D%40ucdavis.edu.
> For more options, visit https://groups.google.com/a/apereo.org/d/optout.

Misagh Moayyed

Jun 2, 2016, 12:29:32 PM
to CAS Community
Probably too aggressive of a default, yes, but the unit of measure is seconds:

# hz.cluster.max.heartbeat.seconds=5

Uncomment that property and set it to 300.
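
To illustrate what the override changes, here is a minimal sketch, assuming the property sits in an ordinary Java properties file and falls back to 5 while the line is commented out. The class and method names are hypothetical and are not CAS code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class HeartbeatTimeoutProperty {
    // Property name as quoted in this thread.
    static final String KEY = "hz.cluster.max.heartbeat.seconds";
    // Assumption: fallback mirrors the 5-second CAS default discussed in this thread.
    static final int CAS_DEFAULT_SECONDS = 5;

    /** Parses properties text and returns the heartbeat timeout that would apply. */
    static int effectiveTimeoutSeconds(String propertiesText) {
        Properties props = new Properties();
        try {
            // Lines starting with '#' are comments and are ignored by load().
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String raw = props.getProperty(KEY, String.valueOf(CAS_DEFAULT_SECONDS));
        return Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        String commentedOut = "# hz.cluster.max.heartbeat.seconds=5\n";
        String overridden = "hz.cluster.max.heartbeat.seconds=300\n";
        System.out.println(effectiveTimeoutSeconds(commentedOut)); // default applies
        System.out.println(effectiveTimeoutSeconds(overridden));   // override applies
    }
}
```

With the line commented out the sketch yields the 5-second default; with the line uncommented and set to 300 it yields 300 seconds, which matches the Hazelcast default of 5 minutes mentioned earlier in the thread.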

Travis Schmidt

Jun 2, 2016, 12:33:38 PM
to Misagh Moayyed, CAS Community
Looked a little further into this. The Hazelcast documentation says the heartbeat interval is 1 second, but looking into the Hazelcast code we see that it actually defaults to 5 seconds:

HEARTBEAT_INTERVAL_SECONDS("hazelcast.heartbeat.interval.seconds", 5, SECONDS),
So the default configuration in CAS (a 5-second timeout against a 5-second heartbeat interval) basically sets this up as a race condition in determining whether a node is dead or alive.
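
The arithmetic behind that race can be sketched as follows. This is a simplified model, not Hazelcast's actual detection logic: it assumes a member is removed once the gap since its last heartbeat exceeds the timeout. The 330 ms delay is illustrative, loosely based on the log above, where the removal was logged at 21:01:25,330 and the last heartbeat at 21:01:20. The class and method names are hypothetical.

```java
public class HeartbeatRace {
    /** Simplified model: remove the member once the heartbeat gap exceeds the timeout. */
    static boolean wouldRemove(long observedGapMillis, long timeoutMillis) {
        return observedGapMillis > timeoutMillis;
    }

    /** How much delay a single heartbeat can absorb before the timeout is hit. */
    static long slackMillis(long intervalSeconds, long timeoutSeconds) {
        return (timeoutSeconds - intervalSeconds) * 1000L;
    }

    public static void main(String[] args) {
        long intervalSeconds = 5;   // heartbeat interval found in the Hazelcast code
        long delayMillis = 330;     // illustrative delay (GC pause, network jitter, ...)
        long gapMillis = intervalSeconds * 1000L + delayMillis;

        for (long timeoutSeconds : new long[] {5, 300}) {
            System.out.printf("timeout=%ds slack=%dms removed=%b%n",
                    timeoutSeconds,
                    slackMillis(intervalSeconds, timeoutSeconds),
                    wouldRemove(gapMillis, timeoutSeconds * 1000L));
        }
    }
}
```

With a 5-second timeout there is no slack at all, so any delay in a single heartbeat is enough to get a healthy member removed. With a 300-second timeout the same delay is harmless.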

Tom Poage

Jun 3, 2016, 4:53:08 PM
to CAS Community
Knock on wood: no additional Hazelcast errors after increasing the
heartbeat timeout to something more than the heartbeat interval. ;-)

Tom.

Misagh Moayyed

Jun 3, 2016, 7:24:02 PM
to CAS Community
Thanks to Travis, see: https://github.com/apereo/cas/pull/1819
