REST Member List stale


Mike Wiesenberg

Apr 29, 2020, 2:23:43 PM
to Hazelcast
Hi,

 Using Hazelcast 3.12.6:

 If I have a multi-node cluster (e.g. 7 nodes) and I shut down two of them, the others still show the cluster as having 7 nodes when I query them via the hazelcast/rest/cluster HTTP endpoint, even hours later. Is there a way to make this list accurate?

Thanks and Regards,

 Mike

Sharath Sahadevan

Apr 30, 2020, 9:43:16 AM
to Hazelcast
Hi Mike,

     I will need some additional information on the issue.

  1. Is the cluster state reported by the REST API consistent with what you are seeing in Management Center?
  2. How are you shutting down the member? Are the results different if you shut down the nodes from Management Center?
  3. Please share your Hazelcast configuration details and any relevant log files.
If the answer to 1 indicates an issue with the REST API, you can create an issue here and provide the config, the logs, the info from Management Center, and the response from the REST API.

Hope that helps.
Thanks,
Sharath

Mike Wiesenberg

Apr 30, 2020, 10:03:37 AM
to Hazelcast
1. I don't have Management Center installed.
2. SIGKILL.
3. I'm configuring via Java and not changing any defaults other than disabling multicast and setting the host IPs in the TcpIpConfig.

Thanks
 Mike

Sharath Sahadevan

Apr 30, 2020, 11:18:49 AM
to Hazelcast

Mike,

You can also get the cluster state from the cluster.sh script (details here) instead of Management Center.

Here are the options to shut down a Hazelcast member gracefully (which means waiting for migrations and backups to complete):

1. Call the HazelcastInstance#shutdown() API programmatically.
2. Use the JMX API's shutdown method. Something like Jolokia, which provides REST-like access to JMX, should come in handy for you.
3. Set hazelcast.shutdownhook.policy=GRACEFUL in the member configs and then shut down with the kill -15 <PID> command or the stop.sh script. The HazelcastInstance will then shut down gracefully.
4. Shut down from Management Center. I know you are not using it, but I recommend it for production.
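
Option 3 relies on a shutdown hook, and a JVM can only run shutdown hooks for signals it is able to handle, such as SIGTERM (kill -15) or SIGINT; SIGKILL (kill -9) ends the process immediately. The sketch below illustrates that general JVM behavior using only the JDK. It is not Hazelcast's implementation, and the class name ShutdownHookDemo and the "wait" argument are made up for the illustration.

```java
/** Illustrates JVM shutdown hooks in general (hypothetical demo class, not Hazelcast code). */
final class ShutdownHookDemo {

    /** Registers a JVM shutdown hook that runs the given cleanup and returns the hook thread. */
    static Thread register(Runnable cleanup) {
        Thread hook = new Thread(cleanup, "demo-shutdown-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) throws InterruptedException {
        register(() -> System.out.println("shutdown hook ran; graceful cleanup would go here"));

        if (args.length > 0 && args[0].equals("wait")) {
            // Leaves time to send a signal to this process from another terminal.
            Thread.sleep(60_000);
        }
        // The hook runs on normal exit, on kill -15 (SIGTERM) and on Ctrl-C (SIGINT).
        // After kill -9 (SIGKILL) the JVM is terminated at once and the hook never runs.
    }
}
```

Running it with the "wait" argument and then sending kill -15 or kill -9 to the process from another terminal shows the difference: the hook message is printed only in the first case.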



Thanks,

Sharath

Josef Cacek

Apr 30, 2020, 11:28:09 AM
to haze...@googlegroups.com
Hi Mike,

Can you confirm that you are talking about the member list view from the REST
call output, i.e. something like:

Members {size:7, ver:7} [
Member [172.17.0.2]:5701 - 2944e4e0-1fe9-47e3-b6b3-ecacd92c7642 this
Member [172.17.0.3]:5701 - cc37da21-984c-4326-9f67-6e88f5fe8c86
Member [172.17.0.4]:5701 - fc6008bb-1d77-4948-93b0-c5d224f80cc2
Member [172.17.0.5]:5701 - dd0917d6-414a-4905-944a-8d489d8db93d
Member [172.17.0.6]:5701 - de2d46b7-0770-4fb7-9321-8598e2f68daf
Member [172.17.0.7]:5701 - 961aa3fb-e51c-4160-bc39-235b142d14aa
Member [172.17.0.8]:5701 - 0232f64b-c796-46cc-9a7b-fc21070feec0
]

Could you share what value is returned from the REST API call to
/hazelcast/health/cluster-size? E.g.
curl http://172.17.0.2:5701/hazelcast/health/cluster-size

When you look into the logs of the still-running members, do they contain an
updated cluster view after the kill? Wait 60 seconds after the kill before
checking.
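
For anyone who wants to compare the view reported by each member programmatically, here is a minimal sketch that parses text in the format shown above. It assumes the response body contains the member list exactly in that plain-text form. MemberListView is a hypothetical helper written for this thread, not a Hazelcast class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parses the plain-text member list view (hypothetical helper, not part of Hazelcast). */
final class MemberListView {
    // Matches the header line, e.g. "Members {size:7, ver:7} ["
    private static final Pattern HEADER =
            Pattern.compile("Members \\{size:(\\d+), ver:(\\d+)\\}");
    // Matches one member line, e.g. "Member [172.17.0.2]:5701 - <uuid> this"
    private static final Pattern MEMBER =
            Pattern.compile("Member \\[([^\\]]+)\\]:(\\d+)");

    /** Returns the size stated in the header, or -1 if no header is found. */
    static int reportedSize(String view) {
        Matcher m = HEADER.matcher(view);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** Returns "host:port" for every member line, in the order listed. */
    static List<String> addresses(String view) {
        List<String> result = new ArrayList<>();
        Matcher m = MEMBER.matcher(view);
        while (m.find()) {
            result.add(m.group(1) + ":" + m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two of the member lines quoted in this thread, used as sample input.
        String view = String.join("\n",
                "Members {size:7, ver:7} [",
                "Member [172.17.0.2]:5701 - 2944e4e0-1fe9-47e3-b6b3-ecacd92c7642 this",
                "Member [172.17.0.3]:5701 - cc37da21-984c-4326-9f67-6e88f5fe8c86",
                "]");
        System.out.println("reported size: " + reportedSize(view));
        System.out.println("listed members: " + addresses(view));
    }
}
```

The extracted addresses can then be checked one by one to see which listed members are actually still reachable.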

-- Josef
> --
> You received this message because you are subscribed to the Google Groups "Hazelcast" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to hazelcast+...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/hazelcast/afffdf9a-784e-4a6c-ad25-b7547cd08091%40googlegroups.com.


Mike Wiesenberg

Apr 30, 2020, 11:43:11 AM
to Hazelcast
Hi Josef,
 Yes, confirmed, I'm speaking of that output. It shows 7 members. The cluster-size call also shows 7 members. This is several hours after they died.
 However, if I stop them using SIGINT, it updates properly.

In either case the other nodes are not printing the cluster view.

Thanks,
 Mike

Josef Cacek

Apr 30, 2020, 12:46:08 PM
to haze...@googlegroups.com
It's really suspicious.

Have you changed any Hazelcast property with "heartbeat" in its name?

Could you check whether the 2 members are really gone (i.e. the Hazelcast
port is closed on those machines)?
When I run the check for every IP:port combination listed in the
original member list view, curl returns the following for the 2 killed
members:
curl: (7) Failed to connect to 172.17.0.3 port 5701: No route to host
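
The same port check can be sketched in Java with a plain TCP connect and a timeout. This uses only the JDK. PortCheck is a hypothetical helper name, and the 2000 ms timeout is an arbitrary choice.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Checks whether a TCP port accepts connections (hypothetical helper, JDK only). */
final class PortCheck {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean isOpen(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, no route to host, unknown host, or timeout.
            return false;
        }
    }

    public static void main(String[] args) {
        // Each argument is expected in the form host:port, e.g. 172.17.0.3:5701
        for (String target : args) {
            int colon = target.lastIndexOf(':');
            String host = target.substring(0, colon);
            int port = Integer.parseInt(target.substring(colon + 1));
            System.out.println(target + " open=" + isOpen(host, port, 2000));
        }
    }
}
```

An open port only shows that something is listening there. It does not prove that the process is a healthy cluster member.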

I used the following script to try your scenario in Docker:
https://gist.github.com/kwart/290622068677ab6b01eccb8bb1cac6e1

-- Josef

Mike Wiesenberg

Apr 30, 2020, 4:55:25 PM
to Hazelcast
Hi, I haven't changed any property with heartbeat in the name. I verified the down hosts are down; there's no JVM running on the given hosts.

When you say the curl check failed, does that mean you are observing the same behavior as me, i.e. the dead members are still in the list?

Josef Cacek

May 4, 2020, 10:28:09 AM
to haze...@googlegroups.com
Hi Mike,

When I talk about the curl failure, I mean it fails to connect to the Hazelcast member port.

Assume I've killed the process on 172.17.0.4, then execution of the following command

curl -m 2 http://172.17.0.4:5701/hazelcast/rest/cluster

results in this curl error message:
curl: (7) Failed to connect to 172.17.0.4 port 5701: No route to host

Even if connections to dead members stay somehow half-open, you should see WARNING messages related to missing heartbeats in log files of other members. Something like:

May 04, 2020 2:17:14 PM com.hazelcast.internal.cluster.impl.ClusterHeartbeatManager
WARNING: [172.17.0.2]:5701 [dev] [3.12.6] Suspecting Member [172.17.0.4]:5701 - 62ba361e-1e1e-4427-8dfc-f7998687ecb7 because it has not sent any heartbeats since 2020-05-04 14:16:11.052. Now: 2020-05-04 14:17:14.006, heartbeat timeout: 60000 ms, suspicion level: 1.00
May 04, 2020 2:17:14 PM com.hazelcast.nio.tcp.TcpIpConnection
INFO: [172.17.0.2]:5701 [dev] [3.12.6] Connection[id=4, /172.17.0.2:5701->/172.17.0.4:45175, qualifier=null, endpoint=[172.17.0.4]:5701, alive=false, type=MEMBER] closed. Reason: Suspecting Member [172.17.0.4]:5701 - 62ba361e-1e1e-4427-8dfc-f7998687ecb7 because it has not sent any heartbeats since 2020-05-04 14:16:11.052. Now: 2020-05-04 14:17:14.006, heartbeat timeout: 60000 ms, suspicion level: 1.00
May 04, 2020 2:17:14 PM com.hazelcast.internal.cluster.impl.MembershipManager
INFO: [172.17.0.2]:5701 [dev] [3.12.6] Removing Member [172.17.0.4]:5701 - 62ba361e-1e1e-4427-8dfc-f7998687ecb7
May 04, 2020 2:17:14 PM com.hazelcast.internal.cluster.ClusterService
INFO: [172.17.0.2]:5701 [dev] [3.12.6]

Members {size:6, ver:8} [
    Member [172.17.0.2]:5701 - d8d5cca8-ee05-4862-b4a0-3469b49ddedd this
    Member [172.17.0.3]:5701 - d314670d-e91b-4e25-9315-d936121a7729
    Member [172.17.0.5]:5701 - 2428b2be-d2df-4ac3-b7ba-453a9cf36793
    Member [172.17.0.6]:5701 - fbdb7e29-1090-4b5c-ae4b-e1eeff98be3b
    Member [172.17.0.7]:5701 - 17893840-e3f7-46e6-9b00-e3144a3ef958
    Member [172.17.0.8]:5701 - c4ef6467-3a94-4b50-a0e9-bdc21a2d360d
]
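
To check a surviving member's log for these messages without reading it by eye, a small scan like the one below may help. It is only a sketch: it assumes the log lines contain the phrases "Suspecting Member [host]:port" and "Removing Member [host]:port" exactly as in the excerpt above. MembershipLogScan is a hypothetical helper name.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Scans log lines for membership messages (hypothetical helper, JDK only). */
final class MembershipLogScan {

    /**
     * Returns the distinct "host:port" addresses that follow "<keyword> Member"
     * in the given lines, e.g. keyword "Suspecting" or "Removing".
     */
    static Set<String> addressesFor(String keyword, List<String> logLines) {
        Pattern pattern = Pattern.compile(
                Pattern.quote(keyword) + " Member \\[([^\\]]+)\\]:(\\d+)");
        Set<String> found = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher m = pattern.matcher(line);
            while (m.find()) {
                found.add(m.group(1) + ":" + m.group(2));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Shortened versions of the log lines quoted above.
        List<String> lines = Arrays.asList(
                "WARNING: [172.17.0.2]:5701 [dev] [3.12.6] Suspecting Member [172.17.0.4]:5701 - 62ba361e-1e1e-4427-8dfc-f7998687ecb7 because it has not sent any heartbeats",
                "INFO: [172.17.0.2]:5701 [dev] [3.12.6] Removing Member [172.17.0.4]:5701 - 62ba361e-1e1e-4427-8dfc-f7998687ecb7");
        System.out.println("suspected: " + addressesFor("Suspecting", lines));
        System.out.println("removed:   " + addressesFor("Removing", lines));
    }
}
```

If both sets come back empty for a member that was killed more than a minute earlier, that matches the behavior Mike describes and points at the log configuration or at the failure detection itself.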

-- Josef



tora...@gmail.com

May 4, 2020, 12:13:18 PM
to haze...@googlegroups.com
Hi - I don't see the word 'suspect' or ClusterHeartbeatManager anywhere in my logs.

When one member goes down, the other members print 'Could not connect to x' followed by 'Removing connection to endpoint x'




Josef Cacek

May 5, 2020, 2:52:13 AM
to haze...@googlegroups.com
Could you share full Hazelcast logs from the surviving members? If you
could reproduce the behavior with the DEBUG log level, it would be
helpful.
Regards,
-- Josef


