Hi PJ,
> 1. What are the ways to improve performance of a vault cluster?
Answers to your specific questions are below; I'm not yet sure whether
I can go into full detail, but I can say that I have pushed more than
37k requests per second through Vault with audit logging off, and more
than 24k with file-based audit logging on, across a range of
concurrent clients numbering from 100 to 1,000. So Vault is pretty
speedy. Chances are decent that the limiting factor in Vault's speed
when you run it won't be Vault itself but rather the networked
physical and logical backends (on the physical side, there is an LRU
cache that helps quite a lot).
> a. Does adding new vault affect the performance or are redundant vault
> servers just installed for fault-tolerance?
Vault is active/standby, so extra standby nodes simply provide fault tolerance.
> b. Why can't any of the three vaults answer the queries itself instead
> of redirecting to the master vault node?
One of the main reasons for this design is that it drastically
simplifies the operational model of Vault; as a result, it is easier
to have a good understanding of what might be happening inside Vault
at any given time. This is a really nice thing in a security product.
Introducing Raft or Paxos adds a lot of complexity, and even in a
leader/follower scenario like you might get with Raft or Paxos, if you
want strong consistency and no possibility of stale reads, you need to
forward queries to the master anyway.
Also, I mentioned the LRU cache on the physical side; we'd have to
have a much more complex networked cache (or forego a cache
altogether) if we had multiple Vault nodes writing. So that would
negate a lot of the potential speedups from having multiple active
nodes.
Altogether this is a simpler model (which is really nice for security)
and it's not at all clear that multiple-active would be any faster
(and it may be slower).
> 2. If a consul node is down, is the corresponding vault node marked as
> unavailable too?
I assume here you mean a local Consul agent rather than a Consul
server node? If you have Vault connecting through your local Consul
agent (which is the recommended approach), taking the Consul agent
down will affect Vault's ability to communicate, so if connectivity
isn't restored quickly then another node will take over active duty.
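For reference, pointing Vault at the local agent is just a matter of
the storage address in its configuration. A minimal sketch (the
`backend "consul"` stanza and its `address`/`path` options are real;
the values here are illustrative):

```hcl
# vault.hcl (illustrative values)
backend "consul" {
  # Talk to the Consul agent running on this same node rather than a
  # remote Consul server; the agent handles forwarding to the leader.
  address = "127.0.0.1:8500"
  path    = "vault"
}
```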
> 3. What are the possible benefits of pairing vault:consul nodes in n:1 or
> 1:n format, instead of the recommended 1:1 pairing?
I'm not sure where you saw a 1:1 pairing. Since standby nodes simply
increase fault tolerance, one or two should be enough, regardless of
how many Consul servers you have. We do recommend connecting Vault
through a local agent on the Vault node, because that way the Consul
agent handles such things as directing queries to the current Consul
leader to avoid request forwarding.
> 4. In a multi-datacenter scenario for vault-consul, if all the nodes in a
> consul cluster inside one datacenter are unavailable, how do we ensure a
> request to vault routes to the consul cluster in the other datacenter?
Vault is not multi-datacenter aware. More importantly, the K/V stores
of Consul are per-datacenter. So you wouldn't want Vault to simply
redirect to the other datacenter, because it'd be a totally different
data set.
> 5. If an active instance of Vault node fails, is it the responsibility of
> the REST API client to discover hot-standby nodes?
There are a lot of ways to skin this cat, but if you're running with
Consul, using Consul health checks and connecting to the service
address for Vault should do this automatically. We're considering
building a Consul TTL-based health check directly into Vault for users
of the Consul backend; that way any such failover should happen very
quickly. Although you can already do health checks with e.g. a
1-second TTL, this would allow a Vault node to explicitly mark itself
as available/unavailable when its status changes (e.g. shutting down,
starting up, getting unsealed, etc.).
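As one way to wire this up today, you can register Vault with the
local agent and attach an HTTP check against Vault's `/v1/sys/health`
endpoint (which is real, and returns non-200 status codes on standby
or sealed nodes); the service name, port, and interval below are
illustrative:

```json
{
  "service": {
    "name": "vault",
    "port": 8200,
    "check": {
      "http": "http://127.0.0.1:8200/v1/sys/health",
      "interval": "1s"
    }
  }
}
```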
> 6. How do hot-standby nodes understand if and when the active (master)
> vault is unavailable?
Vault uses a lock in Consul; when the lock is released, either
explicitly or due to a session failure from that node's Consul client,
one of the other nodes is able to (atomically) grab it. Managing to
grab it lets that node know that it is now the leader. You can get more
information here:
https://www.consul.io/docs/guides/leader-election.html
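The key property is that the grab is atomic: several standbys can race
for the lock, but exactly one wins. Here's a toy illustration of those
semantics in Python (a threading lock standing in for a Consul
session/lock, not the real Consul API):

```python
import threading

# Toy model of lock-based leader election: standby nodes race for a
# single lock, and whichever one grabs it becomes the new active node.
lock = threading.Lock()
leader = []  # records which node won the election

def contend(node_name):
    # A non-blocking acquire models Consul's atomic "acquire" on a key:
    # exactly one contender succeeds; the rest remain standbys.
    if lock.acquire(blocking=False):
        leader.append(node_name)

threads = [threading.Thread(target=contend, args=("vault-%d" % i,))
           for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert len(leader) == 1  # only one node ever holds the lock
```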
> 7. Does master vault, have a memory cache? For each read request, does vault
> go to its backend, to fetch the results?
Yes. There is an LRU cache for the physical store that is invalidated
on write. Some backends (e.g. transit) also have their own specialty
caches.
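To make the read path concrete, here is a minimal sketch of a
read-through LRU cache whose entries are invalidated on write, similar
in spirit to Vault's physical cache (the class and the dict standing
in for the storage backend are illustrative, not Vault's code):

```python
from collections import OrderedDict

class LRUCache:
    """Read-through LRU cache over a backend store, invalidated on write."""

    def __init__(self, backend, capacity=128):
        self.backend = backend          # e.g. a dict standing in for Consul K/V
        self.capacity = capacity
        self.cache = OrderedDict()

    def read(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)       # mark as most recently used
            return self.cache[key]
        value = self.backend[key]             # miss: go to the backend
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used
        return value

    def write(self, key, value):
        self.backend[key] = value
        self.cache.pop(key, None)             # invalidate the cached entry

backend = {"secret/foo": "v1"}
c = LRUCache(backend)
assert c.read("secret/foo") == "v1"   # first read hits the backend
c.write("secret/foo", "v2")           # write invalidates the cache
assert c.read("secret/foo") == "v2"   # next read sees the new value
```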
> 8. What are the impacts of hardware or filesystem or O.S. failure on one or
> multiple servers for vault-consul cluster? What is the recovery mechanism in
> such cases?
It depends on the underlying physical store. Vault is basically a
specialty database storing its data in some other service, so the
disaster recovery procedures for that service should be used to
restore Vault's data. At HashiCorp, we take snapshots of our Consul
cluster every five minutes; in the event of catastrophic failure of
our Consul cluster, we can simply restore one of the snapshots, and as
long as the K/V store is restored, Vault will be fine.
Best,
Jeff