Autopilot and rolling rebuilds not working

Christopher Powell

Aug 13, 2019, 11:45:56 AM
to Nomad
Greetings everyone,

Thank you for your time and assistance!

I have a system in place where we use Packer to build AMIs for our Nomad servers and clients, and then we do a rolling rebuild of all the nodes. Autopilot is configured to clean up dead servers, but it doesn't appear to be working properly: 24 hours after the rebuild, the old servers and clients are still visible in the cluster status. I am running Nomad 0.9.4 with Raft protocol 3.
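
For reference, the autopilot piece of our server config looks roughly like this (paraphrased; the values match the `nomad operator autopilot get-config` output further down):

    autopilot {
      cleanup_dead_servers      = true
      last_contact_threshold    = "200ms"
      max_trailing_logs         = 250
      server_stabilization_time = "10s"
    }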

I was under the impression they should be cleaned up and no longer visible. After rebuilding all the nodes, should I force a GC or use the node purge API to clean them up?
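
Something like this is what I had in mind, assuming a local agent on the default port 4646 ($NODE_ID is a placeholder for a node's full UUID):

    $ # force a cluster garbage collection (PUT /v1/system/gc)
    $ curl -X PUT http://localhost:4646/v1/system/gc

    $ # or purge a single dead node by ID (POST /v1/node/:node_id/purge)
    $ curl -X POST http://localhost:4646/v1/node/$NODE_ID/purge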

I rebuild the cluster as follows, all done via the HTTP API:

Servers first:
1) pick a non-leader server
2) shut it down
3) wait for replacement node to come online
4) validate health
5) repeat for remaining non-leader servers
6) do the leader last (health checks sketched below)
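
For steps 3 and 4 I poll roughly the following, again against a local agent on the default HTTP port; the jq filters are just for readability:

    $ # who is the current leader? (GET /v1/status/leader)
    $ curl -s http://localhost:4646/v1/status/leader

    $ # gossip view of the server fleet (GET /v1/agent/members)
    $ curl -s http://localhost:4646/v1/agent/members | jq '.Members[] | {Name, Status}'

    $ # raft peer set, to confirm the replacement became a voter
    $ curl -s http://localhost:4646/v1/operator/raft/configuration | jq '.Servers[] | {Node, Voter}'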

then the clients:
1) mark client as ineligible
2) drain node
3) wait for drain to complete
4) shutdown node
5) wait for new node to come online
6) repeat for remaining nodes (eligibility/drain calls sketched below)
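
Steps 1-3 look roughly like this over the HTTP API ($NODE_ID must be the node's full UUID; the 1-hour drain deadline is given in nanoseconds):

    $ # 1) mark the client ineligible for new placements
    $ curl -X POST -d '{"Eligibility": "ineligible"}' \
        http://localhost:4646/v1/node/$NODE_ID/eligibility

    $ # 2) start a drain with a 1h deadline
    $ curl -X POST -d '{"DrainSpec": {"Deadline": 3600000000000}}' \
        http://localhost:4646/v1/node/$NODE_ID/drain

    $ # 3) poll until the drain strategy is cleared
    $ curl -s http://localhost:4646/v1/node/$NODE_ID | jq '.DrainStrategy'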

$ nomad -version
Nomad v0.9.4 (a81aa846a45fb8248551b12616287cb57c418cd6)

$ nomad server members
Name                                Address      Port  Status  Leader  Protocol  Build  Datacenter  Region
nomad-server-10-2-18-202.us-east-1  10.2.18.202  4648  alive   true    2         0.9.4  dc1         us-east-1
nomad-server-10-2-25-181.us-east-1  10.2.25.181  4648  alive   false   2         0.9.4  dc1         us-east-1
nomad-server-10-2-38-163.us-east-1  10.2.38.163  4648  left    false   2         0.9.4  dc1         us-east-1
nomad-server-10-2-41-128.us-east-1  10.2.41.128  4648  alive   false   2         0.9.4  dc1         us-east-1
nomad-server-10-2-45-83.us-east-1   10.2.45.83   4648  left    false   2         0.9.4  dc1         us-east-1
nomad-server-10-2-50-201.us-east-1  10.2.50.201  4648  alive   false   2         0.9.4  dc1         us-east-1
nomad-server-10-2-61-45.us-east-1   10.2.61.45   4648  alive   false   2         0.9.4  dc1         us-east-1

$ nomad node status
ID        DC   Name                      Class   Drain  Eligibility  Status
c7bb3ff8  dc1  nomad-client-10-2-27-57   <none>  false  eligible     ready
ff923b32  dc1  nomad-client-10-2-39-95   <none>  false  eligible     ready
afb8793c  dc1  nomad-client-10-2-53-5    <none>  false  eligible     ready
068d65ff  dc1  nomad-client-10-2-62-191  <none>  false  ineligible   down
9e906e3d  dc1  nomad-client-10-2-23-169  <none>  false  ineligible   down
5b4b4a23  dc1  nomad-client-10-2-43-42   <none>  false  ineligible   down

$ nomad operator autopilot get-config
CleanupDeadServers = true
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 10s
EnableRedundancyZones = false
DisableUpgradeMigration = false
EnableCustomUpgrades = false

$ nomad operator raft list-peers
Node                                ID                                    Address           State     Voter  RaftProtocol
nomad-server-10-2-25-181.us-east-1  3e12083e-b1d5-6580-dcfb-271cbbf61ca7  10.2.25.181:4647  follower  true   3
nomad-server-10-2-18-202.us-east-1  3c41866a-8c48-300d-2b61-8988e0167b6c  10.2.18.202:4647  leader    true   3
nomad-server-10-2-61-45.us-east-1   c8f8d6d2-d199-44e4-d0f8-ee700469fae5  10.2.61.45:4647   follower  true   3
nomad-server-10-2-50-201.us-east-1  9cf0f469-3cba-e77f-d90c-ee60480d9214  10.2.50.201:4647  follower  true   3
nomad-server-10-2-41-128.us-east-1  4e40dc46-e475-3576-fa84-68c40aa391f6  10.2.41.128:4647  follower  true   3

Christopher Powell

Aug 13, 2019, 2:15:26 PM
to Nomad
And I realized my mistake. The default for the server's `node_gc_threshold` is 24 hours, so I suppose I simply wasn't waiting long enough. Once the 24-hour threshold passed, the old nodes were removed.

https://www.nomadproject.io/docs/configuration/server.html#node_gc_threshold
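For anyone who finds this later: the threshold can be lowered in the server stanza if you don't want to wait the full day. A minimal sketch (the "1h" value is just an example):

    server {
      enabled = true

      # nodes that stay down longer than this are garbage collected
      # (default is "24h")
      node_gc_threshold = "1h"
    }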

Brian Lalor

Aug 14, 2019, 10:58:37 AM
to Christopher Powell, Nomad
On Aug 13, 2019, at 11:45 AM, Christopher Powell <powellc...@gmail.com> wrote:

Servers first:
1) pick a non-leader server
2) shut it down
3) wait for replacement node to come online
4) validate health
5) repeat for remaining non-leader servers
6) Do the leader last

Christopher, this looks problematic to me: stopping one server before starting a new one seems risky, especially if you're only running 3 servers (I see you've got 5, however). Why not fire up a new node and wait for it to join the cluster before choosing one to tear down? That seems more in line with the idea of stable server introduction.

The section just above that discusses dead server cleanup, but I believe that only applies to failed nodes, not ones that have left gracefully. As long as a node doesn't appear in the peer list, it shouldn't impact leader election or quorum health.
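
One quick way to verify that before each teardown is the autopilot health endpoint (jq filter just for readability):

    $ # overall cluster health plus how many servers you can afford to lose
    $ curl -s http://localhost:4646/v1/operator/autopilot/health | jq '{Healthy, FailureTolerance}'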