what does "[WARN] memberlist: Refuting a suspect message" mean?


Nicolae Marasoiu

Mar 2, 2015, 7:21:27 AM3/2/15
to consu...@googlegroups.com
Hi,

I have filed the text below as a "potential issue" on 0.4.1, but I just want to ask: when would a node log such a message?

I think a "suspect message" is a message belonging to the suspicion protocol used by the underlying serf/gossip/SWIM failure-detection mechanism?

Or is it a "suspect message" unrelated to the suspicion protocol?

And what does "refuting" mean?

It seems the other node keeps suspecting, so this node's acks are not reaching it to tell it "hey, I am alive"?

I have a 3-node cluster: consul1, consul2 and consul3.
consul1 (172.18.32.130) is down.
Still, as expected, both reads and writes went OK with a quorum of consul2 and consul3 up.

Now I stop consul2 and start it back up.
As expected, consul3 enters the candidate state, but consul2 does not answer it.
On the other side, consul2 keeps "refuting a suspect message" from consul3, which may explain why it does not answer, and consul3 keeps timing out consul2.

Running `consul members` on consul2 confirms both nodes as "alive".

How can I debug this?

Log/consul3:
2015/03/02 12:07:50 [ERR] raft: Failed to make RequestVote RPC to 172.18.32.130:8300: dial tcp 172.18.32.130:8300: connection refused
2015/03/02 12:07:51 [INFO] memberlist: Suspect consul2 has failed, no acks received
2015/03/02 12:07:52 [WARN] raft: Election timeout reached, restarting election
2015/03/02 12:07:52 [INFO] raft: Node at 172.18.33.110:8300 [Candidate] entering Candidate state
2015/03/02 12:07:52 [INFO] memberlist: Suspect consul2 has failed, no acks received
2015/03/02 12:07:52 [ERR] raft: Failed to make RequestVote RPC to 172.18.32.130:8300: dial tcp 172.18.32.130:8300: connection refused
2015/03/02 12:07:53 [WARN] raft: Election timeout reached, restarting election
2015/03/02 12:07:53 [INFO] raft: Node at 172.18.33.110:8300 [Candidate] entering Candidate state
2015/03/02 12:07:53 [ERR] raft: Failed to make RequestVote RPC to 172.18.32.130:8300: dial tcp 172.18.32.130:8300: connection refused
2015/03/02 12:07:54 [INFO] memberlist: Suspect consul2 has failed, no acks received
2015/03/02 12:07:55 [WARN] raft: Election timeout reached, restarting election
2015/03/02 12:07:55 [INFO] raft: Node at 172.18.33.110:8300 [Candidate] entering Candidate state
2015/03/02 12:07:55 [ERR] raft: Failed to make RequestVote RPC to 172.18.32.130:8300: dial tcp 172.18.32.130:8300: connection refused
2015/03/02 12:07:56 [INFO] memberlist: Suspect consul2 has failed, no acks received
2015/03/02 12:07:57 [INFO] memberlist: Suspect consul2 has failed, no acks received
2015/03/02 12:07:57 [WARN] raft: Election timeout reached, restarting election
2015/03/02 12:07:57 [INFO] raft: Node at 172.18.33.110:8300 [Candidate] entering Candidate state
2015/03/02 12:07:57 [ERR] raft: Failed to make RequestVote RPC to 172.18.32.130:8300: dial tcp 172.18.32.130:8300: connection refused
2015/03/02 12:07:58 [WARN] raft: Election timeout reached, restarting election
2015/03/02 12:07:58 [INFO] raft: Node at 172.18.33.110:8300 [Candidate] entering Candidate state
2015/03/02 12:07:58 [ERR] raft: Failed to make RequestVote RPC to 172.18.32.130:8300: dial tcp 172.18.32.130:8300: connection refused
2015/03/02 12:07:58 [ERR] agent: failed to sync remote state: No cluster leader
2015/03/02 12:07:59 [INFO] memberlist: Suspect consul2 has failed, no acks received

Log/consul2:
2015/03/02 12:07:32 [WARN] memberlist: Refuting a suspect message (from: consul3)
2015/03/02 12:07:34 [WARN] memberlist: Refuting a suspect message (from: consul3)
2015/03/02 12:07:37 [WARN] memberlist: Refuting a suspect message (from: consul3)
2015/03/02 12:07:39 [WARN] memberlist: Refuting a suspect message (from: consul3)
2015/03/02 12:07:41 [WARN] memberlist: Refuting a suspect message (from: consul2)
2015/03/02 12:07:44 [WARN] memberlist: Refuting a suspect message (from: consul3)
2015/03/02 12:07:44 [ERR] agent: failed to sync remote state: No cluster leader
2015/03/02 12:07:48 [WARN] memberlist: Refuting a suspect message (from: consul3)
2015/03/02 12:07:51 [WARN] memberlist: Refuting a suspect message (from: consul3)
2015/03/02 12:07:54 [WARN] memberlist: Refuting a suspect message (from: consul3)
2015/03/02 12:07:56 [WARN] memberlist: Refuting a suspect message (from: consul3)
2015/03/02 12:07:58 [WARN] memberlist: Refuting a suspect message (from: consul3)
2015/03/02 12:08:01 [WARN] memberlist: Refuting a suspect message (from: consul3)
2015/03/02 12:08:03 [WARN] memberlist: Refuting a suspect message (from: consul3)


Armon Dadgar

Mar 2, 2015, 8:43:04 PM3/2/15
to consu...@googlegroups.com, Nicolae Marasoiu
Hey Nicolae,

You are spot on about the suspect protocol in the underlying gossip.

Basically the stages are like:

Alive -> Suspected of Failure -> Dead.

A node that is suspected of failure or marked as dead may receive this message
and actively refute that. This means it is broadcasting an “alive” message to counteract
the suspicion.
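
To make that concrete, the refutation step can be sketched in Go. This is only an illustration of the SWIM-style mechanism Armon describes, not memberlist's actual code; the type names and the `refute` helper are made up for the sketch. The key idea is the incarnation number, which only the node itself may increment: an "alive" message with a newer incarnation overrides the suspicion.

```go
package main

import "fmt"

// Message is a simplified gossip message in a SWIM-style protocol.
// Real memberlist messages carry more fields; this is just a sketch.
type Message struct {
	Kind        string // "alive", "suspect" or "dead"
	Node        string
	Incarnation int
}

// Node tracks only what matters for refutation: its name and its
// incarnation number, which it alone is allowed to increment.
type Node struct {
	Name        string
	Incarnation int
}

// refute is called when a node sees a suspect (or dead) message about
// itself. It bumps its incarnation past the suspicion's and broadcasts
// an alive message; peers accept it because the incarnation is newer.
func (n *Node) refute(suspect Message) Message {
	if suspect.Incarnation >= n.Incarnation {
		n.Incarnation = suspect.Incarnation + 1
	}
	return Message{Kind: "alive", Node: n.Name, Incarnation: n.Incarnation}
}

func main() {
	consul2 := &Node{Name: "consul2", Incarnation: 3}
	// consul3 gossips "suspect consul2" at consul2's current incarnation.
	alive := consul2.refute(Message{Kind: "suspect", Node: "consul2", Incarnation: 3})
	fmt.Printf("%+v\n", alive) // alive message with Incarnation 4
}
```

If the refutation is delivered, peers move the node back to alive; if it is lost (for example because of the UDP issues discussed below), the suspicion stands.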

In your logs you can see messages like:
"memberlist: Suspect consul2 has failed, no acks received"

This means that `consul2` did not respond to any direct or indirect ping messages, and is being
suspected of failure.

On the other side of that, you can see messages like:
"memberlist: Refuting a suspect message (from: consul3)"

This means `consul2` is refuting a suspect message originated by `consul3`.

Now, based on these logs, 99% of the time this indicates a UDP routing issue.
Nodes are unable to ping consul2 or consul2 is unable to respond (packets lost).

If you are running in Docker, there is a known ARP caching issue with Docker that
causes this symptom. Basically, if a container is restarted with the same IP within
~5 minutes, the ARP entry remains cached and causes packets to be dropped.

Hope that helps!

Best Regards,
Armon Dadgar
--
You received this message because you are subscribed to the Google Groups "Consul" group.
To unsubscribe from this group and stop receiving emails from it, send an email to consul-tool...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Nicolae Marasoiu

Mar 3, 2015, 1:18:32 AM3/3/15
to Armon Dadgar, consu...@googlegroups.com
Hi,

Indeed, we use Docker. So if supervising/restarting containers is not good, maybe we should use Mesos to pick a random machine in the LAN to run the image, making it unlikely to land on the same host? I will study the bug on the web. Thank you!

Nicolae Marasoiu

Mar 3, 2015, 3:33:14 AM3/3/15
to consu...@googlegroups.com, armon....@gmail.com
Hi again,

Let me clarify what I understand about the ARP caching issue, for which I found a sort of description at https://github.com/docker/docker/issues/4581 (which is currently closed):
My understanding is that when running multiple Docker containers, the host OS caches ARP results for many IP-MAC pairs, including the containers'.
When several containers go down and come back up, Docker reuses IPs within a timeframe, and if the new containers have different MACs, the cache is no longer correct and routing issues appear.
Does the MAC mismatch appear only when containers reuse the addresses of others? Or will restarting the same container get it a new MAC address?
I will also study these things, but I wanted to understand the problem properly first.
Thanks very much,
Nicu

Armon Dadgar

Mar 3, 2015, 9:26:15 PM3/3/15
to consu...@googlegroups.com, Nicolae Marasoiu
Hey Nicolae,

That sounds right. The issue is that the IP->MAC cache never gets invalidated,
so if the IP is re-used and the MAC address is different (which I believe Docker randomly
generates), then this issue crops up. Sounds like they have fixed the issue in newer
versions of Docker however.

Best Regards,
Armon Dadgar


James Firth

May 8, 2017, 1:23:30 PM5/8/17
to Consul, nicolae....@gmail.com
Sorry to bring up an old thread.

Did this issue get fixed?

I'm working with Amazon ECS and getting this message constantly inside the Docker container.

Thanks in advance!



ta...@oneqube.com

Oct 12, 2017, 11:01:46 PM10/12/17
to Consul
James, did you ever resolve the issue?


James Phillips

Oct 17, 2017, 7:28:12 PM10/17/17
to consu...@googlegroups.com
Hi,

I haven't seen recent reports of this with Docker, so it seems like it
may be fixed in recent Docker versions. This message can also arise if
you have network connectivity issues between agents (all agents need
full-mesh connectivity on 8301/tcp and 8301/udp), so it could also be
related to other kinds of network issues. Do you see any pattern with
respect to which nodes are involved?

-- James

Niklas Kunkel

Dec 21, 2019, 7:59:25 PM12/21/19
to Consul
I'm having the exact same issue just running Serf alone. However, unlike most other posters, I'm not using Docker.
Any idea what the issue could be here? All my nodes are standard Ubuntu Digital Ocean servers.

Hans Hasselberg

Jan 6, 2020, 3:24:03 PM1/6/20
to consu...@googlegroups.com
Hello,

what you are seeing is probably caused by network problems. The memberlist library, which Consul uses for cluster membership, will log the following message:

[WARN] memberlist: Refuting a suspect message (from: $IP)

when the following happens (from https://www.serf.io/docs/internals/gossip.html):

Failure detection is done by periodic random probing using a configurable interval. If the node fails to ack within a reasonable time (typically some multiple of RTT), then an indirect probe is attempted. An indirect probe asks a configurable number of random nodes to probe the same node, in case there are network issues causing our own node to fail the probe. If both our probe and the indirect probes fail within a reasonable time, then the node is marked "suspicious" and this knowledge is gossiped to the cluster. A suspicious node is still considered a member of cluster. If the suspect member of the cluster does not dispute the suspicion within a configurable period of time, the node is finally considered dead, and this state is then gossiped to the cluster.

What is described there as "dispute the suspicion" appears as "refuting" in the logs. It happens when a node receives a suspect message about itself, which it refutes because it is obviously still alive.

Hope it helps,
Hans


