Consul agent doesn't leave on terminate

Dmitry Molotkov

Mar 1, 2016, 1:47:54 PM
to Consul
I'm running Consul inside Docker, using the gliderlabs/consul-agent image.
Here is how I launch the agents:
(export IP44=$(wget -qO- 169.254.169.254/latest/meta-data/local-ipv4); consul agent -config-dir=/config -data-dir /opt/consul -advertise $IP44 -atlas=${ATLAS_REPO} -atlas-join -atlas-token=${ATLAS_TOKEN})
In /config there is an agent.json file with the following content (the default from that Docker image):
{
        "client_addr": "0.0.0.0",
        "data_dir": "/data",
        "leave_on_terminate": true,
        "dns_config": {
                "allow_stale": true,
                "max_stale": "1s"
        }
}

The problem is that when I terminate my instance, I still see it in the Atlas console, marked as "Agent not live or unreachable".
If I terminate a lot of instances, there will be a lot of nodes in Atlas that show up as not live with failed checks, and the whole infrastructure is marked as having critical problems because of it.
Shouldn't leave_on_terminate solve this problem? How can I solve this issue?

Joshua Garnett

Mar 1, 2016, 4:05:21 PM
to consu...@googlegroups.com
The challenge is that the process needs enough time to stop before the instance is actually terminated. In most cases, if your init system is set up properly, this should be fine, but you'll still have issues when a machine hard-crashes, as Consul never has an opportunity to leave the cluster.
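
One way to give the agent that time can be sketched as follows, assuming the agent runs in a Docker container and that this script is hooked into the host's shutdown sequence. The container name `consul-agent`, the 30-second grace period, and the `DOCKER_BIN` override are illustrative assumptions, not values from this thread. It first asks the agent to leave with `consul leave`, then stops the container with a longer grace period than Docker's 10-second default.

```shell
#!/bin/sh
# Sketch only: the container name and grace period are hypothetical.
# DOCKER_BIN can be overridden, e.g. with "echo" for a dry run.

graceful_stop() {
    container=${1:-consul-agent}
    grace=${2:-30}
    docker=${DOCKER_BIN:-docker}

    # Ask the agent inside the container to leave the cluster gracefully.
    # A failure here is ignored so the stop below still runs.
    "$docker" exec "$container" consul leave || true

    # Send SIGTERM, and SIGKILL only after the grace period has expired.
    "$docker" stop --time "$grace" "$container"
}
```

Calling `graceful_stop consul-agent 30` from a shutdown hook would be one way to use it. This cannot help with a hard crash, where nothing gets a chance to run.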

--Josh

Dmitry Molotkov

Mar 1, 2016, 4:24:37 PM
to Consul
The machine was simply terminated from the EC2 console (menu: terminate instance) or by the load balancer. Can anything be done here?

Dmitry Molotkov

Mar 1, 2016, 5:20:35 PM
to Consul
Here is what I see in the logs:
2016/03/01 22:15:51 [INFO] Graceful shutdown over 0
2016/03/01 22:15:51 [INFO] Down
2016/03/01 22:15:51 [INFO] consul: Deregistering fabio
2016/03/01 22:15:51 [INFO] consul: Health changed to #141

2016/03/01 22:16:03 [INFO] memberlist: Suspect 577111e23c93 has failed, no acks received
2016/03/01 22:16:06 [INFO] memberlist: Suspect 577111e23c93 has failed, no acks received
2016/03/01 22:16:08 [INFO] memberlist: Marking 577111e23c93 as failed, suspect timeout reached
2016/03/01 22:16:08 [INFO] serf: EventMemberFailed: 577111e23c93 10.0.1.61
2016/03/01 22:16:24 [INFO] serf: attempting reconnect to 577111e23c93 10.0.1.61:8301

Dmitry Molotkov

Mar 2, 2016, 8:01:04 AM
to Consul
I tried stopping the instance with "stop" from the EC2 console, and the result is the same: the node remains in Atlas and is marked as unhealthy.
How can I fix this situation?

Dmitry Molotkov

Mar 5, 2016, 10:34:43 AM
to Consul
So is there no way to fix this? Am I stuck with the cluster marked as critical whenever one instance gets replaced by another?

Armon Dadgar

Mar 6, 2016, 8:15:36 PM
to consu...@googlegroups.com, Dmitry Molotkov
Dmitry,

When an agent leaves the cluster gracefully, it will be in a “left” state. If it simply fails (agent is killed, machine terminates, etc.),
it is detected as “failed”. By default, Consul will not reap the node for 72 hours. You can use the “force-leave” command to force
a failed node to be set as “left”.
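
The force-leave step can be scripted. The following is a minimal sketch, assuming that `consul members` prints a header row followed by one row per node, with the node name in column 1 and the status in column 3. The function names and the `CONSUL_BIN` override are hypothetical.

```shell
#!/bin/sh
# Sketch only: assumes "consul members" output has a header row, the
# node name in column 1 and the status in column 3.

# Print the names of nodes whose status is "failed" (reads stdin).
failed_nodes() {
    awk 'NR > 1 && $3 == "failed" { print $1 }'
}

# Force every failed node into the "left" state.
# CONSUL_BIN can be overridden, e.g. with "echo" for a dry run.
force_leave_failed() {
    consul=${CONSUL_BIN:-consul}
    "$consul" members | failed_nodes | while read -r node; do
        "$consul" force-leave "$node"
    done
}
```

Running `force_leave_failed` after instances are replaced would clear them without waiting for the 72-hour reap. Note that it would also force out a node that has only failed temporarily, so it may be safer to review the output of `consul members | failed_nodes` first.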

In this case, “leave_on_terminate” probably has no effect, since I don’t believe the Docker daemon will properly
terminate the processes on machine shutdown. If the SIGTERM signal is received, the agent will do the leave; my
guess is that a forcible SIGKILL is used and the agent just dies.
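
A related possibility, stated here as an assumption rather than a diagnosis: if the launch command from the first post runs under a shell that is the container's main process, a SIGTERM sent to the container may reach that shell instead of the agent. A minimal sketch of an entrypoint function that uses `exec`, so the agent replaces the shell and receives the signal itself, is below. The flags are copied from the launch command in the first post; `start_agent` and `CONSUL_BIN` are hypothetical names.

```shell
#!/bin/sh
# Sketch only: flags come from the launch command earlier in the thread.
# CONSUL_BIN is a hypothetical override, e.g. "echo" for a dry run.

start_agent() {
    advertise_ip=$1
    # exec replaces the shell with the agent process, so a SIGTERM sent
    # to the container's main process is delivered to Consul directly.
    exec "${CONSUL_BIN:-consul}" agent \
        -config-dir=/config \
        -data-dir /opt/consul \
        -advertise "$advertise_ip" \
        -atlas="${ATLAS_REPO}" \
        -atlas-join \
        -atlas-token="${ATLAS_TOKEN}"
}
```

As the last line of an entrypoint script this could be called as `start_agent "$(wget -qO- 169.254.169.254/latest/meta-data/local-ipv4)"`. It only helps if a SIGTERM is actually sent and the agent is given time to finish before a SIGKILL follows.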

Hope that helps.

Best Regards,
Armon Dadgar