What is the Best Practice for Vault/Consul HA to run in production mode


Vad

Jul 19, 2017, 11:17:37 AM
to Vault
Hello,

I would like to know the best practice for setting up Vault/Consul in HA for production: how many Vault servers and Consul nodes (with one leader) should I set up?

Thanks,

james....@made.com

Jul 25, 2017, 6:25:38 AM
to Vault
Hi,

We run 3 Consul servers, 3 Vault servers & 3 Nomad servers on the same cluster of machines. Seems to work pretty well so far. 

There are roughly 40-50 Consul clients in each environment.

James

Craig Sawyer

Jul 25, 2017, 10:04:37 AM
to Vault
Same as James here: 3 servers of each.

Though I thought I read somewhere that it's a good idea to run 5, so you can tear down and upgrade without worrying about losing consensus. I've never had that problem here. We just take down a follower, upgrade it, and bring it up; take down the next follower, upgrade it, and bring it up; and then take down the leader, upgrade it, and bring it up. You do get a new leader that way, but zero downtime.

With only 3 servers, a failure during an upgrade could cost you consensus and make it hard to get a leader election to happen. With 5 servers, you would have to lose 3 before you had to worry about that. With 3, you can only lose 1 at a time. Again, this hasn't been a problem for us, and we are not making any plans now to increase our count to 5, but we may eventually.

Regardless, definitely run only an odd number of servers: 3 is probably the norm, 5 is very safe, and more than that is overly safe, if not wasteful.
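
The numbers above follow from simple majority arithmetic. As a minimal sketch (the function names are made up for illustration and are not part of Consul or Vault), assuming every server is a voting member and a leader can only be elected while a strict majority is reachable:

```python
def quorum(servers: int) -> int:
    """Smallest strict majority of a cluster with `servers` voting members."""
    if servers < 1:
        raise ValueError("a cluster needs at least one server")
    return servers // 2 + 1


def failures_tolerated(servers: int) -> int:
    """How many servers can be lost while a majority still remains."""
    return servers - quorum(servers)


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5, 7):
        print(f"{n} servers: quorum {quorum(n)}, "
              f"tolerates {failures_tolerated(n)} failure(s)")
```

Under these assumptions, 3 servers tolerate 1 failure and 5 tolerate 2, while 4 servers tolerate no more failures than 3 do, which is one way to see why odd counts are usually suggested.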

Jeff Mitchell

Jul 25, 2017, 11:53:39 AM
to Vault
On Tue, Jul 25, 2017 at 10:04 AM, Craig Sawyer <csa...@yumaed.org> wrote:
> Same as James here: 3 servers of each.
>
> Though I thought I read somewhere that it's a good idea to run 5, so you can tear down and upgrade without worrying about losing consensus.

Usually 3 Consul servers suffice except in very large installs.
 
> I've never had that problem here. We just take down a follower, upgrade it, and bring it up; take down the next follower, upgrade it, and bring it up; and then take down the leader, upgrade it, and bring it up. You do get a new leader that way, but zero downtime.

Hopefully you are talking about how you upgrade Consul here, because that is not the recommended upgrade procedure for Vault -- for Vault you should upgrade all standby nodes first, then seal the active node so that it fails over to an upgraded standby node and the formerly active node cannot become active again.
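
As a rough illustration of that ordering only (not an official procedure), here is a small Python sketch. The node names and the `vault_upgrade_order` helper are hypothetical; the sketch just computes the sequence and leaves the actual upgrade and seal steps to the operator:

```python
from typing import Dict, List


def vault_upgrade_order(nodes: Dict[str, str]) -> List[str]:
    """Return node names in upgrade order: all standbys first, the active node last.

    `nodes` maps a (hypothetical) node name to its role, "active" or "standby".
    """
    unknown = [name for name, role in nodes.items()
               if role not in ("active", "standby")]
    if unknown:
        raise ValueError(f"unknown role for: {unknown}")

    active = [name for name, role in nodes.items() if role == "active"]
    if len(active) != 1:
        raise ValueError("expected exactly one active node")

    # Sort standbys only to make the order deterministic.
    standbys = sorted(name for name, role in nodes.items() if role == "standby")
    return standbys + active


if __name__ == "__main__":
    cluster = {"vault-a": "standby", "vault-b": "active", "vault-c": "standby"}
    for step, name in enumerate(vault_upgrade_order(cluster), start=1):
        print(f"step {step}: upgrade {name} ({cluster[name]})")
```

The active node comes last so that, by the time it is sealed, every node it can fail over to is already running the new version.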

Best,
Jeff

Vad

Jul 25, 2017, 2:12:04 PM
to Vault
Jeff,

I'm talking about setting up a new Vault in production mode with HA.
Right now I have three EC2 instances, each running both Vault and Consul.

Craig Sawyer

Jul 28, 2017, 7:02:42 PM
to Vault


> Hopefully you are talking about how you upgrade Consul here, because that is not the recommended upgrade procedure for Vault -- for Vault you should upgrade all standby nodes first, then seal the active node so that it fails over to an upgraded standby node and the formerly active node cannot become active again.


Awesome, thanks! We haven't been sealing the Vault leader before taking it down, but I will start doing that now :) I couldn't find anything in the docs about upgrading, so I put some up in https://github.com/hashicorp/vault/pull/3080. They may not be overly clear, as I didn't spend very much time on them; feel free to whack all over them.


~R

Aug 1, 2017, 4:43:48 PM
to Vault
I am wondering whether folks have similar recommendations for running on top of Kubernetes. My original configuration had 3 Consul servers + 2 Consul clients & 2 Vault servers.
Trying to switch to Kubernetes, I am tripping over the following points:
1. Multiple Vault servers don't buy me scalability, as that is a function of the Consul nodes.
2. Kubernetes can itself take care of availability of the Vault server (making sure that 1 instance is always running, assuming 1 replica).
Given that, what advantage does one get with multiple Vault servers? If anything, it just adds the overhead of additional hops (standby to active), especially if all clients talk to Vault via a load balancer.

I am wondering what people on this forum have done when deploying on Kubernetes.

Joel Thompson

Aug 3, 2017, 1:20:53 PM
to vault...@googlegroups.com
I haven't tried to run Vault on Kubernetes. But, the advantage of multiple Vault servers is you have standby nodes that are already unsealed. If you use Kubernetes to take care of the availability, you'd need to configure Kubernetes to unseal new Vault servers, which involves storing the unseal keys somewhere. And Vault is a great place to store secrets, but you need this to work without Vault.

--Joel


~R

Aug 4, 2017, 3:53:32 AM
to Vault
Joel, you bring up a great point regarding unsealing in Kubernetes, but the problem of holding onto unseal keys to unseal a new Vault server does not go away if you are running outside Kubernetes. You still need to unseal a node before it can enter standby state.

Joel Thompson

Aug 4, 2017, 11:20:46 AM
to vault...@googlegroups.com
Absolutely. It all depends on what your workflows are and what risk profile you're willing to accept. But having multiple unsealed Vaults at the same time reduces downtime in the event of failure, or even just upgrading Vault. It's frontloading the work.

For example, if you have a separate secret store besides Vault (though why would you do that?) to store your unseal keys in, and you're willing to accept the downtime that bringing up a new Vault container and unsealing it would entail, then you probably wouldn't need to run multiple Vault instances.

If you really like Vault as a secret management solution and decide to store your unseal keys in Vault itself (though I'd recommend some other escrow process in addition!), then you want standby, unsealed Vault instances to take over so you can retrieve your unseal keys from it to unseal a failed node.

If you have people holding unseal keys (or people holding GPG keys to decrypt the unseal key shares), then it introduces more downtime -- both for Vault and any apps that depend on it -- as those people all need to decide to trust the new Vault server and unseal it. And if Vault goes down at 3:00 a.m., those humans have to get woken up in the middle of the night to unseal the new Vault server (and will consequently be very grumpy in the morning). If you have unsealed, standby Vault instances, then Vault will fail over gracefully, and those humans might get to sleep soundly through the night and can just unseal a new standby Vault in the morning :) (Assuming, of course, that you're comfortable with the reduction in redundancy that a failed Vault instance introduces, e.g., if you have 3 Vault instances and one fails, you might be OK, but if you only have two instances and one fails, you might be uncomfortable with the reduced redundancy.)

There are lots of tradeoffs, and you just need to pick the set that works for your use cases and organization.

--Joel

~Rohit Koul

Aug 4, 2017, 6:18:01 PM
to Vault
 "if the vault goes down at 3:00am..." you had me here!  :-) 
Thank you for taking the time to pour in your thoughts. This makes sense to me.

--Rohit

Abhimanyu Garg

May 15, 2018, 12:16:55 PM
to Vault
This article explains Vault - Consul HA architecture with other configurations-


3 Consul servers, 2 Consul clients, and 2 Vault servers should be sufficient for an HA setup in production.

Thanks

Justin DynamicD

May 15, 2018, 5:59:14 PM
to Vault
Adding my voice to the apparently popular option of creating 3 "hashi-servers" that run Consul, Nomad, and Vault.  It runs very well.

Thanks to simple majority, the first two services can be rather aggressively upgraded without concern.  Vault is the only sensitive one.  We tend to follow the same pattern though:
  • upgrade the standby nodes
  • stop the active node and then patch it.  

In our systemd unit, the service stop step includes an operator step-down, which hands active status over to one of the other nodes. That does argue for keeping Vault separate from the other two services and going to a 5-server model, but if you're not immutable (i.e., you use Chef/Ansible/Puppet to manage configs), then co-locating reduces server count while being perfectly manageable.

One last note: we abuse the work HashiCorp did at https://registry.terraform.io/ and use their scripts to install/update all three services despite using a CM tool (Puppet in our case). This way the actual install/upgrade process is effectively Hashi-maintained and we just "git pull" the latest install script.