Don't understand Vault HA

Lars Sommer

Apr 21, 2016, 7:44:59 PM
to Vault
I've read this page a few times:

And I still can't tell whether it's talking about backends or the Vault servers themselves. I recently hit a "standby" error when trying to access Vault through an ELB.
Ctrl+f for 'standby' on the HA page and get a good laugh.

Setup:
1 ELB
3 instances, all running Vault
3 instances running Consul in a cluster

Scenario:
One of my three Vault servers is responding to requests and the other two are not. They return "Error: Vault is in standby mode"

What is standby mode? Why are they in it? How do you resolve this issue to provide HA for Vault?

Jeff Mitchell

Apr 21, 2016, 7:57:46 PM
to vault...@googlegroups.com

Hi Lars,

This page will probably be useful to you:

https://www.vaultproject.io/docs/http/sys-health.html

Best,
Jeff

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/vault/issues
IRC: #vault-tool on Freenode
---
You received this message because you are subscribed to the Google Groups "Vault" group.
To unsubscribe from this group and stop receiving emails from it, send an email to vault-tool+...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/vault-tool/01cd4806-d3ab-4f6e-b7d4-97cdb8284a09%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Lars Sommer

Apr 21, 2016, 8:13:56 PM
to Vault
Hi Jeff,

   Thanks for the reply, and that is indeed a helpful page, but maybe not quite helpful enough. For example:

  • standbyok optional A query parameter provided to indicate that being a standby should still return the active status code instead of the standby code

So does that just return the active status code, or does that mean it can actually *function* despite being in standby mode? What is standby mode? Is this like a master/slave setup where the slave is just doing heartbeats to the master to determine if it should actually handle requests?
This seems like a fairly sizable gap in documentation.

Jeff Mitchell

Apr 21, 2016, 8:33:01 PM
to vault...@googlegroups.com
Hi Lars,

On Thu, Apr 21, 2016 at 8:13 PM, Lars Sommer <lars.j...@gmail.com> wrote:
> standbyok optional A query parameter provided to indicate that being a
> standby should still return the active status code instead of the standby
> code
> So does that just return the active status code

As the page indicates, standby and active nodes normally return
different status codes (429 vs 200). standbyok tells a standby node to
return 200 instead of 429.

>, or does that mean it can
> actually *function* despite being in standby mode? What is standby mode?

As the first page you linked to explains, non-leader nodes forward
requests to the active node.

> This seems like a fairly sizable gap in documentation.

PRs are welcome!

Best,
Jeff
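
As a rough illustration of the status codes described above, a small shell helper could classify a health-check response. Only the 200 and 429 codes come from this thread; the function names, the example address, and the "not-serving" label are assumptions made for the sketch.

```shell
# Hypothetical helpers (names are illustrative, not part of Vault).

# Build the health-check URL for a node; pass "standbyok" as the second
# argument to ask a standby node to answer with the active status code.
vault_health_url() {
    base="$1"      # e.g. http://10.0.0.5:8200 (assumed address)
    mode="${2:-}"
    if [ "$mode" = "standbyok" ]; then
        printf '%s/v1/sys/health?standbyok\n' "$base"
    else
        printf '%s/v1/sys/health\n' "$base"
    fi
}

# Map the HTTP status code of /v1/sys/health to a role.
vault_role_from_status() {
    case "$1" in
        200) echo "active" ;;
        429) echo "standby" ;;
        *)   echo "not-serving" ;;   # anything else: treat as unavailable
    esac
}
```

For example, `vault_role_from_status "$(curl -s -o /dev/null -w '%{http_code}' "$(vault_health_url http://10.0.0.5:8200)")"` would print the role of that node. With `standbyok` in the URL, a standby would be reported as "active", which is why the parameter changes only the health answer and not what the node can do.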

Lars Sommer

Apr 21, 2016, 8:43:43 PM
to Vault
When I finish documenting it internally on my company's Confluence page so that it makes sense, I'll write something up.
Thanks,

Lars Sommer

Apr 21, 2016, 10:47:01 PM
to Vault
With the "VAULT_ADVERTISE_ADDR" env variable set to the private IP of the instance, I am still receiving errors that the server is in standby mode. Is this expected?

Jason Antman

Apr 22, 2016, 7:05:34 AM
to Vault
Lars,

I've also been setting up an HA Vault cluster in AWS. I'll say that the infrastructure I came up with is rather complicated, but that's largely because we want end-to-end TLS.

Vault's HA/clustering allows one active node and N standby nodes. The standby nodes cannot respond to requests; all they can ever do is redirect client requests (it's a simple HTTP 307 Temporary Redirect) to the active node. The only control that you have is "?standbyok" on the health check, which doesn't change Vault's behavior; it just changes whether the health check of a standby node returns a 200 or a 429.

So... I could go into detail about it, but if you're intending on putting this behind TLS, I'd completely dispense with the ELB. Assuming you're going with plain HTTP, I'm personally not convinced that ?standbyok is a wonderful thing, but I guess that depends on the behavior you want to see...

VAULT_ADVERTISE_ADDR simply specifies the address that clients should use to connect to this Vault instance. I.e. if instance A is active and instance B is in standby, then a client request to instance B will be HTTP 307 redirected to http://<instance A VAULT_ADVERTISE_ADDR>/<original request path>

If you're getting health check results that don't immediately make sense to you, I'd highly recommend repeating them with ``curl -i`` or something else that will let you see the same response that the ELB is seeing.

-Jason
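
To make the redirect visible, the `curl -i` output can be reduced to its status code and `Location` header. The helper below is only a sketch: `summarize_response` is a made-up name, and it assumes the response headers arrive on stdin exactly as `curl -i` prints them.

```shell
# Hypothetical helper: read raw `curl -i` output on stdin and print the
# HTTP status code, followed by the redirect target if one is present.
summarize_response() {
    awk '
        NR == 1 { code = $2 }                       # "HTTP/1.1 307 Temporary Redirect"
        tolower($1) == "location:" { target = $2 }  # header names are case-insensitive
        /^\r?$/ { exit }                            # blank line ends the headers
        END {
            sub(/\r$/, "", code)                    # drop a trailing CR, if any
            sub(/\r$/, "", target)
            if (target != "") print code " -> " target
            else print code
        }
    '
}
```

Piping something like `curl -i -s http://<standby node>/<original request path>` into `summarize_response` should show a 307 and the advertise address of the active node, while the same pipe against the health path of a standby should show a bare 429.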

Jeff Mitchell

Apr 22, 2016, 9:23:58 AM
to vault...@googlegroups.com
To add on to what Jason said, if you are adamant about keeping the
Vault nodes behind a load balancer, the advertise address on all Vault
nodes should be the same -- the address of the service on the load
balancer.

--Jeff
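
The two advertise-address choices discussed in this thread can be sketched as a small shell function. `VAULT_ADVERTISE_ADDR` is the variable named earlier in the thread; the function name, the `direct`/`lb` mode labels, and the example addresses are assumptions.

```shell
# Hypothetical helper: pick the value for VAULT_ADVERTISE_ADDR.
#   direct -> each node advertises its own address (clients follow redirects)
#   lb     -> every node advertises the same load balancer address
choose_advertise_addr() {
    mode="$1"; node_addr="$2"; lb_addr="$3"
    case "$mode" in
        direct) printf '%s\n' "$node_addr" ;;
        lb)     printf '%s\n' "$lb_addr" ;;
        *)      echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}
```

A node behind the load balancer might then run `export VAULT_ADVERTISE_ADDR="$(choose_advertise_addr lb http://10.5.100.91:8200 http://vault.example.internal:8200)"` before starting the Vault server, so that any redirect sends the client back to the load balancer rather than to a private instance address.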

Lars Sommer

Apr 22, 2016, 11:39:50 AM
to Vault
I am not adamant about keeping them behind a load balancer, but I am adamant about not designing a SPOF into my infrastructure. Do you have any suggestions that don't include a load balancer?
Thanks for the tip about advertise address, I'll try that.

Jeff Mitchell

Apr 22, 2016, 12:14:10 PM
to vault...@googlegroups.com
On Fri, Apr 22, 2016 at 11:39 AM, Lars Sommer <lars.j...@gmail.com> wrote:
> I am not adamant about keeping them behind a load balancer, but I am adamant
> about not designing in a SPOF to my infrastructure. Do you have any
> suggestions that don't include a load balancer?

Yep -- what Jason said!

Direct connections to Vault instances do two things. They allow the
standby Vault instances to redirect as soon as a new active node is
selected (rather than waiting for a load balancer to first detect the
problem and then find the new configuration), and they ensure TLS
connectivity from the client to the server without transitive trust
issues (which are not a problem if you were going to use an LB in TCP
mode anyway).

At some point Vault is going to be able to directly toggle Consul
health status as well (rather than require periodic health checks
configured externally) which will make any transition even faster, if
you're using Consul to discover the active node.

Of course, there are things an LB can give you -- especially a managed
one like ELB -- that you won't get in this approach, like resilience
to DDoS attacks. But you could always stick a temporary LB in front if
needed in response to an attack, configured with the correct active
node at the time.

Best,
Jeff

Lars Sommer

Apr 22, 2016, 12:19:36 PM
to Vault
Ahh, so do you suggest just using round-robin DNS and removing the load balancing layer altogether?

Jeff Mitchell

Apr 22, 2016, 12:43:31 PM
to vault...@googlegroups.com
Hi Lars,

Round-robin DNS is one way to go, but really I'm suggesting service
discovery. Consul for instance exposes a DNS interface, which makes it
easy to use, but the results that come back from a query reflect the
state of the backend services. So in the simple case you could make a
DNS query and the only result that comes back is the active node.

Service discovery also provides a nice way to keep an LB up to date if
you go that route, via e.g. Consul Template (although not ELB for
obvious reasons).

--Jeff
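
As a sketch of the DNS-based discovery described here: a lookup such as `dig @127.0.0.1 -p 8600 +short SRV vault.service.consul` against Consul's DNS interface returns lines of the form `priority weight port target.`, and the helper below turns one such line into an address a client could use. The service name `vault`, the default `http` scheme, and the function name are assumptions, not details from this thread.

```shell
# Hypothetical helper: convert one line of `dig +short SRV` output
# ("priority weight port target.") into a scheme://host:port address.
srv_to_addr() {
    scheme="${2:-http}"   # assumed default; pass https when using TLS
    printf '%s\n' "$1" | awk -v scheme="$scheme" '{
        sub(/\.$/, "", $4)               # strip the trailing dot of the FQDN
        print scheme "://" $4 ":" $3     # field 3 is the port, field 4 the target
    }'
}
```

If the service's health check only passes on the active node, something like `VAULT_ADDR="$(srv_to_addr "$(dig @127.0.0.1 -p 8600 +short SRV vault.service.consul | head -n 1)")"` would point a client straight at it, with no load balancer in between.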

Lars Sommer

Apr 22, 2016, 12:46:41 PM
to Vault
Oh, very slick. I don't know why that didn't occur to me. Thanks for the direction!

Lars Sommer

Apr 27, 2016, 5:26:37 PM
to Vault
Hey Jeff,

   I have my service set up properly and working well: Consul thinks I have 2/3 unhealthy nodes with 1 healthy node. This is because I am checking for a 200 from curling the health path. The problem is that when I dig the DNS name for the service, it returns all 3 nodes. Am I missing a step that is required for Consul to not return the unhealthy nodes? I was under the impression that Consul would remove unhealthy nodes from the SRV answer that gets returned.

Thank you,
-L

David Adams

Apr 27, 2016, 5:39:42 PM
to vault...@googlegroups.com
Hey Lars,
Do you have dns_config.only_passing set to false in your agent config? I think that will cause this behavior.

If you need that setting for some other reason, you can also create a prepared query for your service that specifies OnlyPassing = true.

-dave
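
A prepared query along the lines described above might be created through Consul's HTTP API by POSTing a JSON definition to `/v1/query`. The sketch below only builds that JSON body; the query name, the service name, and the function name are assumptions.

```shell
# Hypothetical helper: emit a prepared-query definition that only returns
# instances whose health checks are passing.
prepared_query_body() {
    name="$1"      # name used for DNS lookups: <name>.query.consul
    service="$2"   # Consul service the query resolves
    cat <<EOF
{
  "Name": "$name",
  "Service": {
    "Service": "$service",
    "OnlyPassing": true
  }
}
EOF
}
```

Assuming a local agent on the default HTTP port, `prepared_query_body vault vault | curl -s -X POST -d @- http://127.0.0.1:8500/v1/query` would register it, after which a lookup of `vault.query.consul` should leave out nodes in the warning state without changing `only_passing` for every other service.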


Lars Sommer

Apr 27, 2016, 5:54:53 PM
to Vault
This is what I have set on the instance that is running the docker container that is running consul server and the vault process:

[ec2-user@ip-10-5-100-91 ~]$ docker exec -it 0 cat /config/agent.json
{
  "client_addr": "0.0.0.0",
  "data_dir": "/data",
  "leave_on_terminate": true,
  "dns_config": {
    "allow_stale": true,
    "max_stale": "1s"
  }
}
[ec2-user@ip-10-5-100-91 ~]$ docker exec -it 0 cat /config/server.json
{
  "ui": true,
  "dns_config": {
    "allow_stale": false
  }
}

ja...@hashicorp.com

Apr 27, 2016, 6:07:53 PM
to Vault
Hi Lars,

It sounds like you might be returning 1 from your health check script which is putting it in the warning state instead of critical. If you return >1 it'll go critical and get excluded. As David mentioned you could also set https://www.consul.io/docs/agent/options.html#only_passing to true, but that affects all lookups (prepared queries do let you create a query that just does this for Vault).

-- James

Lars Sommer

Apr 27, 2016, 6:17:54 PM
to Vault
Hey James, yes, that's exactly what's happening: they show up in the warning state, not critical.
I am actually not sure how to write that. Do you have any examples?

It's kind of weird to me that I would write an explicit health check and then if it fails the health check, it's still returned....

-L

ja...@hashicorp.com

Apr 27, 2016, 6:50:21 PM
to Vault
Hi Lars,

That's the current default behavior of Consul - it won't exclude things in the warning state unless you configure it otherwise. The details depend on your health check script, but something along these lines should work in sh or bash:

curl --fail http://<vault server>/v1/sys/health || exit 2

-- James
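
Spelled out a little more: a Consul script check treats exit code 0 as passing, 1 as warning, and anything else as critical. A sketch that marks only the active node as passing could look like the following; the function names and the address argument are assumptions.

```shell
# Hypothetical helper: translate the HTTP status of /v1/sys/health into a
# Consul script-check exit code (0 = passing, 1 = warning, other = critical).
consul_exit_for_status() {
    case "$1" in
        200) return 0 ;;   # active node: passing
        *)   return 2 ;;   # standby (429) or anything else: critical
    esac
}

# Query a node and report its health to Consul via the exit code.
# If curl cannot connect, the status is "000", which also maps to critical.
check_vault() {
    status="$(curl -s -o /dev/null -w '%{http_code}' "$1/v1/sys/health")"
    consul_exit_for_status "$status"
}
```

A check script would then end with `check_vault "http://<vault server>"` as its last command, so the exit code reaches Consul. Never returning 1 is the point: standbys go critical rather than warning, so they drop out of DNS answers under Consul's default settings.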

Lars Sommer

Apr 28, 2016, 11:09:16 AM
to Vault
That's perfect, thanks so much. My bash-fu isn't as good as it should be, and I didn't know about the || operator. Thank you!