Anti-entropy on complete loss of cluster?

Shane Riddell

Apr 1, 2015, 9:38:03 AM
to consu...@googlegroups.com
I've been experimenting with recovery scenarios in the case of a catastrophic loss of a 3-node cluster, and I'm observing anti-entropy behavior that doesn't match what I thought I understood from the documentation.

I'm using Consul 0.5.0 inside docker containers; the containers are running inside vagrant boxes with different hard-coded IPs.

The server agents are started with "--bootstrap-expect 3" and with "-retry-join" options for each of the 3 hardcoded IPs the server agents are assigned to run on. I am not using Atlas, but I am using encryption (though not the new keyrings).
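
For readers trying to reproduce this, the agent command lines described above can be sketched as a small helper. This is only an illustration: the function name, the data directory, and the encryption key placeholder are assumptions, not values from my setup. The IPs are the ones listed further down in this post.

```python
from typing import List, Optional

# Server IPs as given later in this post.
SERVER_IPS = ["172.28.128.51", "172.28.128.52", "172.28.128.53"]


def consul_agent_args(
    retry_join: List[str],
    server: bool = False,
    bootstrap_expect: Optional[int] = None,
    encrypt_key: Optional[str] = None,
    data_dir: str = "/tmp/consul",  # assumed path, not from the original setup
) -> List[str]:
    """Build an argv list for `consul agent` (hypothetical helper)."""
    args = ["consul", "agent", "-data-dir", data_dir]
    if server:
        args.append("-server")
        if bootstrap_expect is not None:
            args += ["-bootstrap-expect", str(bootstrap_expect)]
    # One -retry-join flag per address the agent should keep trying to join.
    for addr in retry_join:
        args += ["-retry-join", addr]
    if encrypt_key:
        args += ["-encrypt", encrypt_key]
    return args


# Server agents: consul_agent_args(SERVER_IPS, server=True, bootstrap_expect=3)
# Client agent:  consul_agent_args(SERVER_IPS)
```

The same list of server IPs is passed to both the servers and the client, so each agent keeps retrying the join until the servers are reachable.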

I start all 3 vagrant boxes, and the cluster comes up and elects a leader.

I start a 4th vagrant box, running the agent as a client and registering a single service, which shows up in the catalog. The client is also started with "-retry-join" options for each of the 3 hardcoded IPs of the server agents.

I then destroy all 3 server instances (vagrant destroy, ungraceful termination).

At this point, the remaining client agent shows that all 3 server instances have failed.  /v1/agent/services on the client agent still shows the registered service.

I recreate the server cluster by upping the vagrant boxes. They start and elect a leader. At this point, the logs on the client agent look like this (172.28.128.51, 172.28.128.52, and 172.28.128.53 are the server instances; the client is on 172.28.128.60):

    2015/03/31 20:50:30 [INFO] serf: attempting reconnect to node-172-28-128-53 172.28.128.53:8301
    2015/03/31 20:50:33 [ERR] http: Request /v1/health/service/dummy?index=14&passing=1&wait=60000ms, error: rpc error: No cluster leader
    2015/03/31 20:50:49 [ERR] agent: failed to sync remote state: rpc error: No cluster leader
    2015/03/31 20:51:06 [ERR] agent: failed to sync remote state: rpc error: No cluster leader
    2015/03/31 20:51:16 [INFO] serf: EventMemberJoin: node-172-28-128-53 172.28.128.53
    2015/03/31 20:51:16 [INFO] consul: adding server node-172-28-128-53 (Addr: 172.28.128.53:8300) (DC: dc1)
    2015/03/31 20:51:16 [INFO] consul: New leader elected: node-172-28-128-51
    2015/03/31 20:56:11 [INFO] agent.rpc: Accepted client: 127.0.0.1:39088
    2015/03/31 20:57:03 [INFO] agent.rpc: Accepted client: 127.0.0.1:39091
    2015/03/31 21:03:17 [INFO] agent.rpc: Accepted client: 127.0.0.1:39107

A 'consul members' on the client agent shows that all 3 server nodes are present and healthy.  Checking the local agent shows it is still aware of the local service:

curl http://localhost:8500/v1/agent/services

{"715036133e12:current-app-0:3000":{"ID":"715036133e12:current-app-0:3000","Service":"ar-hello-node","Tags":null,"Address":"","Port":3000}}


At this point, I had expected anti-entropy to add the locally registered service back into the catalog, but it never does so (I waited about 10 minutes). The Consul UI for the cluster shows the client node as healthy, but it does not show the service.
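
The mismatch can also be checked without the UI by comparing the agent's local view with the catalog entry for the node. The following is a minimal sketch, assuming the HTTP API is on localhost:8500 and that /v1/catalog/node/<node> returns a "Services" map keyed by service ID; the function names and the node name passed in are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=5.0):
    """GET a Consul HTTP API URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def missing_from_catalog(agent_services, catalog_node):
    """Return sorted IDs of services the agent knows but the catalog lacks.

    agent_services: decoded body of /v1/agent/services (ID -> definition).
    catalog_node:   decoded body of /v1/catalog/node/<node>, or None when
                    the node is not in the catalog at all.
    """
    catalog_services = (catalog_node or {}).get("Services") or {}
    return sorted(
        sid for sid in (agent_services or {}) if sid not in catalog_services
    )


def check_node(node_name, base="http://localhost:8500"):
    """Fetch both views from a running agent and report the difference."""
    local = fetch_json(base + "/v1/agent/services")
    remote = fetch_json(base + "/v1/catalog/node/" + node_name)
    return missing_from_catalog(local, remote)
```

In the state described here, `check_node` with the client's node name would be expected to return the ID of the locally registered service, since the agent still has it and the rebuilt catalog does not.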

If I do a 'consul reload' on the client agent, then the logs show

==> Caught signal: hangup
==> Reloading configuration...
==> WARNING: LAN keyring exists but -encrypt given, ignoring
    2015/03/31 21:04:51 [INFO] agent: Synced service '715036133e12:current-app-0:3000'

And the service now shows up in the cluster's catalog.
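
Until the underlying behavior is understood, the reload can be scripted as a stopgap. This is a rough sketch, assuming the `consul` binary is on the PATH of the client box and that the list of missing service IDs is computed elsewhere; the function name and the injectable `run` parameter are hypothetical.

```python
import subprocess


def reload_if_missing(missing_ids, run=subprocess.run):
    """Trigger `consul reload` when services are missing from the catalog.

    missing_ids: IDs registered on the local agent but absent from the
                 catalog (how they are computed is up to the caller).
    run:         callable used to execute the command; injectable so the
                 logic can be exercised without a real agent.
    Returns True if a reload was triggered, False otherwise.
    """
    if not missing_ids:
        return False
    # check=True raises CalledProcessError if the reload command fails.
    run(["consul", "reload"], check=True)
    return True
```

This only works around the symptom; it does not explain why the periodic sync fails to re-register the service on its own.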


Am I misunderstanding how anti-entropy works?  I thought that the local agents were authoritative, and would sync their locally registered services back to the catalog, even in the event of a complete cluster loss.  Or do I potentially have my options set incorrectly? 


Armon Dadgar

Apr 2, 2015, 1:32:42 PM
to consu...@googlegroups.com, Shane Riddell
Hey Shane,

Your understanding is correct: the agent should repopulate the service entry using anti-entropy.
Based on the small excerpt of the logs, it looks like the agent is attempting to do so,
since the "failed to sync remote state" message comes from the agent pulling in the server
state to look for any deltas.
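
As a purely conceptual illustration of that delta step (this is not Consul's actual code; the function name and data shapes are assumptions), the agent treats its local state as authoritative and works out what the catalog is missing or has wrong:

```python
def plan_sync(local, remote):
    """Compute what must change so the catalog matches the agent.

    local:  service ID -> definition, as registered on the agent
            (treated as authoritative).
    remote: service ID -> definition, as the catalog holds for this node.
    Returns (to_register, to_deregister), each a sorted list of IDs.
    """
    # Register anything the catalog lacks or holds with a stale definition.
    to_register = sorted(
        sid for sid, definition in local.items()
        if remote.get(sid) != definition
    )
    # Remove anything the catalog has that the agent does not know about.
    to_deregister = sorted(sid for sid in remote if sid not in local)
    return to_register, to_deregister
```

After a total loss of the servers the catalog side is empty, so every locally registered service should end up in the first list.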

Without more logs it’s hard to tell what exactly is happening, but it does look like the agent
is running the anti-entropy as expected.

Best Regards,
Armon Dadgar

Shane Riddell

Apr 3, 2015, 11:40:55 AM
to consu...@googlegroups.com, shaneridd...@gmail.com
Thanks for confirming my understanding of the docs.  

Are there particular logs I should be looking at that could help me debug this? It seems odd that if I have the local agent reload, it immediately resyncs without a problem.

Armon Dadgar

Apr 3, 2015, 1:13:17 PM
to consu...@googlegroups.com, Shane Riddell, shaneridd...@gmail.com
Hey Shane,

I started looking into this more, and it does appear there is a bug under these conditions.
I’ve opened a ticket for this issue here:

We will make sure this is fixed for 0.5.1. Thanks for reporting!

Shane Riddell

Apr 6, 2015, 8:57:19 AM
to consu...@googlegroups.com, shaneridd...@gmail.com
Thanks!  I was going to dig around in it a little more, but now that I know it's not on my end, I'll wait for 0.5.1.