Autopilot and rolling rebuilds not working

Christopher Powell

Aug 13, 2019, 11:45:56 AM
to Nomad
Greetings everyone,

Thank you for your time and assistance!

I have a system in place where we use Packer to build AMIs for our Nomad servers and clients, and then we do a rolling rebuild of all the nodes. Autopilot is configured to clean up dead servers, but it doesn't appear to be working properly: 24 hours after the rebuild, the old servers and clients are still visible in the cluster status. I am running Nomad 0.9.4 with Raft protocol 3.
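
For reference, the autopilot piece of our server config looks roughly like this (paraphrased; the values match the `nomad operator autopilot get-config` output further down):

    autopilot {
      cleanup_dead_servers      = true
      last_contact_threshold    = "200ms"
      max_trailing_logs         = 250
      server_stabilization_time = "10s"
    }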

I was under the impression they should be cleaned up and no longer visible. After rebuilding all the nodes, should I force a GC or use the node purge API to clean them up?
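
Something like this is what I had in mind, assuming a local agent on the default port 4646 ($NODE_ID is a placeholder for a node's full UUID):

    $ # force a cluster garbage collection (PUT /v1/system/gc)
    $ curl -X PUT http://localhost:4646/v1/system/gc

    $ # or purge a single dead node by ID (POST /v1/node/:node_id/purge)
    $ curl -X POST http://localhost:4646/v1/node/$NODE_ID/purge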

I rebuild the cluster as follows, all done via the HTTP API:

Servers first:
1) pick a non-leader server
2) shut it down
3) wait for replacement node to come online
4) validate health
5) repeat for remaining non-leader servers
6) do the leader last (health checks sketched below)
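
For steps 3 and 4 I poll roughly the following, again against a local agent on the default HTTP port; the jq filters are just for readability:

    $ # who is the current leader? (GET /v1/status/leader)
    $ curl -s http://localhost:4646/v1/status/leader

    $ # gossip view of the server fleet (GET /v1/agent/members)
    $ curl -s http://localhost:4646/v1/agent/members | jq '.Members[] | {Name, Status}'

    $ # raft peer set, to confirm the replacement became a voter
    $ curl -s http://localhost:4646/v1/operator/raft/configuration | jq '.Servers[] | {Node, Voter}'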

then the clients:
1) mark client as ineligible
2) drain node
3) wait for drain to complete
4) shutdown node
5) wait for new node to come online
6) repeat for remaining nodes (eligibility/drain calls sketched below)
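
Steps 1-3 look roughly like this over the HTTP API ($NODE_ID must be the node's full UUID; the 1-hour drain deadline is given in nanoseconds):

    $ # 1) mark the client ineligible for new placements
    $ curl -X POST -d '{"Eligibility": "ineligible"}' \
        http://localhost:4646/v1/node/$NODE_ID/eligibility

    $ # 2) start a drain with a 1h deadline
    $ curl -X POST -d '{"DrainSpec": {"Deadline": 3600000000000}}' \
        http://localhost:4646/v1/node/$NODE_ID/drain

    $ # 3) poll until the drain strategy is cleared
    $ curl -s http://localhost:4646/v1/node/$NODE_ID | jq '.DrainStrategy'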

$ nomad -version
Nomad v0.9.4 (a81aa846a45fb8248551b12616287cb57c418cd6)

$ nomad server members
Name                                Address      Port  Status  Leader  Protocol  Build  Datacenter  Region
nomad-server-10-2-18-202.us-east-1  10.2.18.202  4648  alive   true    2         0.9.4  dc1         us-east-1
nomad-server-10-2-25-181.us-east-1  10.2.25.181  4648  alive   false   2         0.9.4  dc1         us-east-1
nomad-server-10-2-38-163.us-east-1  10.2.38.163  4648  left    false   2         0.9.4  dc1         us-east-1
nomad-server-10-2-41-128.us-east-1  10.2.41.128  4648  alive   false   2         0.9.4  dc1         us-east-1
nomad-server-10-2-45-83.us-east-1   10.2.45.83   4648  left    false   2         0.9.4  dc1         us-east-1
nomad-server-10-2-50-201.us-east-1  10.2.50.201  4648  alive   false   2         0.9.4  dc1         us-east-1
nomad-server-10-2-61-45.us-east-1   10.2.61.45   4648  alive   false   2         0.9.4  dc1         us-east-1

$ nomad node status
ID        DC   Name                      Class   Drain  Eligibility  Status
c7bb3ff8  dc1  nomad-client-10-2-27-57   <none>  false  eligible     ready
ff923b32  dc1  nomad-client-10-2-39-95   <none>  false  eligible     ready
afb8793c  dc1  nomad-client-10-2-53-5    <none>  false  eligible     ready
068d65ff  dc1  nomad-client-10-2-62-191  <none>  false  ineligible   down
9e906e3d  dc1  nomad-client-10-2-23-169  <none>  false  ineligible   down
5b4b4a23  dc1  nomad-client-10-2-43-42   <none>  false  ineligible   down

$ nomad operator autopilot get-config
CleanupDeadServers = true
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 10s
EnableRedundancyZones = false
DisableUpgradeMigration = false
EnableCustomUpgrades = false

$ nomad operator raft list-peers
Node                                ID                                    Address           State     Voter  RaftProtocol
nomad-server-10-2-25-181.us-east-1  3e12083e-b1d5-6580-dcfb-271cbbf61ca7  10.2.25.181:4647  follower  true   3
nomad-server-10-2-18-202.us-east-1  3c41866a-8c48-300d-2b61-8988e0167b6c  10.2.18.202:4647  leader    true   3
nomad-server-10-2-61-45.us-east-1   c8f8d6d2-d199-44e4-d0f8-ee700469fae5  10.2.61.45:4647   follower  true   3
nomad-server-10-2-50-201.us-east-1  9cf0f469-3cba-e77f-d90c-ee60480d9214  10.2.50.201:4647  follower  true   3
nomad-server-10-2-41-128.us-east-1  4e40dc46-e475-3576-fa84-68c40aa391f6  10.2.41.128:4647  follower  true   3

Christopher Powell

Aug 13, 2019, 2:15:26 PM
to Nomad
And I realized my mistake. The default for the server's `node_gc_threshold` is 24 hours, so I suppose I simply wasn't waiting long enough. Once the 24-hour threshold passed, the old nodes were removed.

https://www.nomadproject.io/docs/configuration/server.html#node_gc_threshold
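For anyone who finds this later: the threshold can be lowered in the server stanza if you don't want to wait the full day. A minimal sketch (the "1h" value is just an example):

    server {
      enabled = true

      # nodes that stay down longer than this are garbage collected
      # (default is "24h")
      node_gc_threshold = "1h"
    }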

Brian Lalor

Aug 14, 2019, 10:58:37 AM
to Christopher Powell, Nomad
On Aug 13, 2019, at 11:45 AM, Christopher Powell <powellc...@gmail.com> wrote:

Servers first:
1) pick a non-leader server
2) shut it down
3) wait for replacement node to come online
4) validate health
5) repeat for remaining non-leader servers
6) Do the leader last

Christopher, this looks problematic to me: stopping one server before starting a new one seems risky, especially if you're only running 3 servers (I see you've got 5, however). Why not fire up a new node and wait for it to join the cluster before choosing one to tear down? That seems more in line with the idea of stable server introduction.

The section just above that discusses dead server cleanup, but I believe that only applies to failed nodes, not ones that have left gracefully. As long as a node doesn't appear in the peer list, it shouldn't impact leader election or quorum health.
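
One quick way to verify that before each teardown is the autopilot health endpoint (jq filter just for readability):

    $ # overall cluster health plus how many servers you can afford to lose
    $ curl -s http://localhost:4646/v1/operator/autopilot/health | jq '{Healthy, FailureTolerance}'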