Blue/Green deployment of nomad server nodes

Gregory Durham

May 26, 2016, 5:41:03 PM
to Nomad
Hello, 
We are in the process of deploying Nomad in AWS, and are trying to figure out how to manage cluster deployment and rebuilds with zero downtime.

As I understand it, with Nomad, as long as we don't lose quorum, the cluster should continue to make progress.

Example: 
Auto scaling group A: 3-node server cluster
Auto scaling group B: new 3-node cluster, joins the existing 3-node cluster, thus making a 6-node cluster spanning 2 auto scaling groups
Some time later (1-2 mins), we slowly drain auto scaling group A, decrementing the max and min size and waiting 1 min between decrements until we hit 0 instances, and finally delete the old auto scaling group (roughly the script sketched below).
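For reference, the drain step is roughly the following. This is a sketch of what our tooling does rather than the exact script; the group name is a placeholder and it assumes a configured AWS CLI:

#!/usr/bin/env bash
# Sketch: drain the old auto scaling group one instance at a time.
# "nomad-servers-a" is a placeholder group name.
set -euo pipefail

ASG="nomad-servers-a"

# Step the group down one instance at a time, waiting 1 min between decrements.
for size in 2 1 0; do
  aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name "$ASG" \
    --min-size "$size" --max-size "$size" --desired-capacity "$size"
  sleep 60
done

# Once the group is empty, delete it.
aws autoscaling delete-auto-scaling-group --auto-scaling-group-name "$ASG"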

What we are seeing is that once all the nodes in the old ASG are killed off, no leader is elected from the new group, and the cluster is left in a leaderless state.

nomad-server-hostname nomad: 2016/05/26 21:29:02 [ERR] http: Request /v1/status/leader?region=global, error: No cluster leader
nomad-server-hostname nomad: 2016/05/26 21:29:03 [INFO] raft: Node at new_node_a:4647 [Follower] entering Follower state
nomad-server-hostname nomad: 2016/05/26 21:29:04 [WARN] raft: Heartbeat timeout reached, starting election
nomad-server-hostname nomad: 2016/05/26 21:29:04 [INFO] raft: Node at new_node_a:4647 [Candidate] entering Candidate state
nomad-server-hostname nomad: 2016/05/26 21:29:06 [INFO] raft: Node at new_node_a:4647 [Follower] entering Follower state
nomad-server-hostname nomad: 2016/05/26 21:29:07 [ERR] worker: failed to dequeue evaluation: No cluster leader
nomad-server-hostname nomad: 2016/05/26 21:29:07 [ERR] worker: failed to dequeue evaluation: No cluster leader
nomad-server-hostname nomad: 2016/05/26 21:29:07 [WARN] raft: Heartbeat timeout reached, starting election
nomad-server-hostname nomad: 2016/05/26 21:29:07 [INFO] raft: Node at new_node_a:4647 [Candidate] entering Candidate state
nomad-server-hostname nomad: 2016/05/26 21:29:07 [ERR] raft: Failed to make RequestVote RPC to old_node_a:4647: dial tcp old_node_a:4647: i/o timeout
nomad-server-hostname nomad: 2016/05/26 21:29:07 [ERR] raft: Failed to make RequestVote RPC to old_node_b:4647: dial tcp old_node_b:4647: i/o timeout
nomad-server-hostname nomad: 2016/05/26 21:29:07 [ERR] raft: Failed to make RequestVote RPC to old_node_c:4647: dial tcp old_node_c:4647: i/o timeout
nomad-server-hostname nomad: 2016/05/26 21:29:08 [WARN] raft: Election timeout reached, restarting election
nomad-server-hostname nomad: 2016/05/26 21:29:08 [INFO] raft: Node at new_node_a:4647 [Candidate] entering Candidate state
nomad-server-hostname nomad: 2016/05/26 21:29:08 [INFO] raft: Node at new_node_a:4647 [Follower] entering Follower state



Are there other steps that need to happen in order to migrate without manual intervention?


Thanks,

Greg

Armon Dadgar

May 26, 2016, 5:45:31 PM
to Gregory Durham, Nomad
Hey Gregory,

It sounds like you are losing quorum. When the new ASG boots, the server count grows
from 3 to 6, which raises the quorum size from 2 to 4 (quorum is a majority of known
servers: floor(6/2) + 1 = 4). When the old nodes are killed off, only 3 of the 6 known
servers are reachable, so a majority can never be achieved, resulting in quorum loss.

The solution is not to just kill the old servers, but to have them leave gracefully. Alternatively,
they can be forcefully killed, but the “server-force-leave” command should be used after the
nodes have entered a failed state (a live node cannot be force left).
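For example, after an old instance has been terminated, something along these lines removes it from the peer set (the node name is a placeholder; use the name reported by server-members):

# The terminated node should show up with a "failed" status:
nomad server-members

# Remove the failed node from the Raft peer set:
nomad server-force-leave old_node_a.global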

Hope that helps!

Best Regards,
Armon Dadgar


Gregory Durham

May 26, 2016, 8:43:04 PM
to Armon Dadgar, Nomad
Thank you very much for the explanation. I'll get something wired up to handle this properly.

Jason Price

Jun 2, 2016, 8:03:26 AM
to Nomad
Dumb question: how do you tell Nomad to leave the cluster gracefully?

The docs mention 'server-force-leave'... but that doesn't feel 'graceful'.

Is there an analog of 'consul leave'?

-Jason

Alex Dadgar

Jun 2, 2016, 1:32:57 PM
to Jason Price, Nomad
Hey Jason,

You can see the documentation here under "Stopping an Agent": https://www.nomadproject.io/docs/agent/index.html

Essentially you want to set `leave_on_interrupt` in the agent configuration; then, when the agent gets that signal, it will gracefully leave. Otherwise, the server can be killed and then `server-force-leave` can be used.
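As a rough sketch (the config file path is a placeholder), enabling that and then stopping the server looks like:

# Enable graceful leave on interrupt in the agent configuration
# ("/etc/nomad.d/leave.hcl" is a placeholder path):
cat > /etc/nomad.d/leave.hcl <<'EOF'
leave_on_interrupt = true
EOF

# After restarting the agent with that config loaded, a plain SIGINT
# makes the server gracefully leave instead of hard-stopping:
kill -INT "$(pidof nomad)"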

Thanks,
Alex


