Vault System Requirement Recommendations


mlap...@newrelic.com

Apr 15, 2016, 2:55:53 PM
to Vault
Hello Security Friends,

I'm in talks with my company's site engineering group about deploying our production instance of Vault. Are there recommended hardware requirements for Vault nodes? I haven't been able to find any, and I'm not sure what to tell the site engineers when they ask how much RAM and CPU Vault needs. Will Vault scale up if we dedicate more CPU cores to it?

Does anyone have suggestions or data from their own production deploys they'd be willing to share? 

Thanks!

-Matt

Jeff Mitchell

Apr 15, 2016, 3:16:28 PM
to vault...@googlegroups.com
Hi Matt,

To a large extent this depends on usage, so I'll try to give you some
insight rather than quote a number. As you might imagine, since I'm
doing dev work I'm not usually running Vault at very high scale, and
while HC has a production Vault instance, it doesn't require especially
high throughput compared to some customers I've talked to. I've done
some artificial benchmarking of the transit backend at high scale
(between two AWS c3.8xlarge instances) and didn't get anywhere
remotely close to exhausting resources, but because it was artificial,
some of the points I list below didn't apply. Hopefully others can give
you a better idea of exactly how it scales for their workloads.

* Vault will take advantage of multiple CPU cores; connections are
handled in their own goroutines so Go will scale up and down
automatically when it schedules.

* As you might expect, you'll get a lot of mileage from keeping
connections alive and pipelining requests to avoid repeated TLS
handshakes. Whether that makes sense for your workload depends on
whether it's (for instance) credential-generation-heavy vs.
transit-encryption/decryption-heavy, and how often you're actually
using these functions (once an hour per machine, once a second per
machine...). The difference in my benchmarking was very significant,
which mostly made it clear to me that Vault itself is quite fast, and
that network conditions and connections -- both to its data storage
backend and to resources it needs to contact to generate credentials --
are the bottleneck in almost all cases.

* Vault holds lease information in memory to manage expirations, so if
you are generating a very large number of leases/tokens, RAM needs will
increase. In practice, this still tends not to be very high, unless
you are doing something like having servers authenticate (generating a
new token) every time they want to read or write a value, instead of
storing a token for its lifetime. (Or, worse, authenticating and then
grabbing a fresh set of leased credentials for various resources
instead of re-using still-valid credentials, creating multiple
expiration entries at a time.) When you get to millions of tracked
leases, the RAM needs go up significantly too. Which also leads to my
next point...

* In terms of allocation, pay attention to your data store too;
consul/etcd/zk especially are not really designed to hold huge amounts
of data at once and can start to have issues (depending, often, on
RAM) when the number of entries gets super high (millions to many
millions, again depending highly on RAM) -- and this includes tokens,
token expirations, and credential leases, not just e.g. items in
'generic' backend mounts. In general, if you are finding that you need
huge amounts of data stored in Vault, you may want to think about
whether it makes sense to use transit instead to encrypt/decrypt the
data but store the encrypted data in a different data store; or, use
transit with a data key to do the encryption/decryption locally but
not store the data key in memory.
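The data-key pattern above can be sketched with stdlib crypto. In real use the 32-byte key would come back from transit's datakey endpoint (along with a wrapped copy to store beside the data); here a locally generated random key stands in for it, which is the one assumption in this sketch.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
	"io"
)

// sealLocally encrypts plaintext with a data key using AES-GCM, so the
// bulk data never transits Vault -- only the key material does.
func sealLocally(key, plaintext []byte) (nonce, ct []byte, err error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, nil, err
	}
	nonce = make([]byte, gcm.NonceSize())
	if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
		return nil, nil, err
	}
	return nonce, gcm.Seal(nil, nonce, plaintext, nil), nil
}

// openLocally reverses sealLocally with the same data key.
func openLocally(key, nonce, ct []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	key := make([]byte, 32) // stand-in for a Vault-issued data key
	if _, err := io.ReadFull(rand.Reader, key); err != nil {
		panic(err)
	}
	nonce, ct, err := sealLocally(key, []byte("customer record"))
	if err != nil {
		panic(err)
	}
	pt, err := openLocally(key, nonce, ct)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(pt))
}
```

You would store the nonce and ciphertext (plus the wrapped key) in your own data store, and ask Vault to unwrap the key only when you need to decrypt.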

Hope this helps -- feel free to ask more specifics if you like.
Hopefully others will chime in too!

--Jeff
> --
> This mailing list is governed under the HashiCorp Community Guidelines -
> https://www.hashicorp.com/community-guidelines.html. Behavior in violation
> of those guidelines may result in your removal from this mailing list.
>
> GitHub Issues: https://github.com/hashicorp/vault/issues
> IRC: #vault-tool on Freenode
> ---
> You received this message because you are subscribed to the Google Groups
> "Vault" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to vault-tool+...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/vault-tool/995cc434-c283-4622-a6c2-1016883c31ab%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Matt Button

Apr 18, 2016, 8:25:29 AM
to vault...@googlegroups.com
We're running vault in production (we're in the final stages of transitioning to it). We use it solely for encrypting/decrypting data with the transit backend, with a handful (~100) of transit keys. Pretty much all our requests can be served by vault's in-memory cache. We don't use audit logging, as storing the logs would cause too much overhead for us.

Our setup consists of 3 c3.2xlarge machines, backed by RDS MySQL for storage and consul for coordination. We tried to run it on c3.xlarge machines, but they didn't cope very well with our throughput, and we saw higher latencies and some dropped connections when vault started consuming a lot of CPU.

In terms of throughput, we have seen it process an average of 523rps over a 1 hour period at about 55% CPU utilization. Since then we've done some optimization work to remove unnecessary calls to vault. Over the past week the peak throughput on our vault server (as measured by statsite) has been around 26,000rpm, which works out at around 433rps, and utilizes ~20-30% CPU and ~30MB RAM. The `vault.core.handle_request.p99` metric claims the max p99 response time over the past week is around 50ms, with a few spikes up to around 100-200ms. We have metrics for latencies on the client side, but I won't quote them as they currently include time for retrying requests/failing over to secondary vault nodes. Unfortunately we don't have accurate metrics at a per-second resolution, and our historical readings are stored at a lower resolution.

A few weeks ago we had an outage caused by expiring vault auth tokens + naive retry logic in clients, which caused the traffic to vault to almost triple. During the outage vault was processing an average of 962rps and hitting around 97% CPU (our metrics provider has rolled up those measurements into 15 minute buckets). At that point vault seemed to have trouble processing TLS handshakes. It's worth noting that all of the requests in the peak of this graph are being rejected by vault due to the tokens being invalid, so vault isn't doing "real" transit work, as it were.

We have been a little concerned about how we're going to scale vault in the future. For now we can probably keep bumping the node type, but that will get expensive (as we have to provision 3 machines of the same type). At some point it may be easier for us to find creative ways to add additional read-only vault nodes (over 90% of traffic to our vault is for transit decryption).

One suggestion I would make is to ensure you have metrics on how many nodes are sealed/unsealed, which node is the primary, the latencies/error counts seen by clients, and the TTL of the credentials your clients use to talk to vault/the remaining time on leases from vault. We've set up a few cron jobs that curl various endpoints in vault and set gauges in statsite using netcat. Not pretty, but it does the job.
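A sketch of what one of those checks might compute, under the assumption that you've already curled `/v1/sys/seal-status` and have its JSON body in hand (the canned response below mirrors that endpoint's `sealed`/`progress` fields; `node1` is a made-up node name):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sealStatus holds the fields of vault's GET /v1/sys/seal-status
// response that matter for alerting.
type sealStatus struct {
	Sealed   bool `json:"sealed"`
	Progress int  `json:"progress"`
}

// gaugeLine renders a statsite/statsd gauge ("name:value|g") that a
// cron job could pipe to the metrics daemon with netcat.
func gaugeLine(node string, s sealStatus) string {
	v := 0
	if s.Sealed {
		v = 1
	}
	return fmt.Sprintf("vault.%s.sealed:%d|g", node, v)
}

func main() {
	// Canned body standing in for the output of
	// `curl -s $VAULT_ADDR/v1/sys/seal-status`.
	body := []byte(`{"sealed": false, "t": 3, "n": 5, "progress": 0}`)
	var s sealStatus
	if err := json.Unmarshal(body, &s); err != nil {
		panic(err)
	}
	fmt.Println(gaugeLine("node1", s))
}
```

Graphing that gauge per node makes a sealed standby (or worse, a sealed primary) visible immediately instead of at the next outage.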

Hope this helps,

Matt

Jeff Mitchell

Apr 18, 2016, 10:21:46 AM
to vault...@googlegroups.com
On Mon, Apr 18, 2016 at 8:25 AM, Matt Button <matt....@geckoboard.com> wrote:
> In terms of throughput, we have seen it process an average of 523rps over a
> 1 hour period at about 55% CPU utilization. Since then we've done some
> optimization work to remove unnecessary calls to vault. Over the past week
> the peak throughput on our vault server (as measured by statsite) has been
> around 26,000rpm, which works out at around 433rps, and utilizes ~20-30% CPU
> and ~30MB RAM.

That seems very low to me. In my benchmarks with transit (using Apache
Bench, plaintext strings of about 20 characters, between two
dedicated-tenancy c3.8xlarge machines, re-using HTTP connections to
avoid TLS overhead, no audit logging), I was hitting about 35k
requests per second. With audit logging turned on (to a file), I was
hitting about 25k requests per second (but every request that was
claimed to have been handled was indeed showing up in the audit log).

As I noted in my previous email, reusing HTTP/TLS connections helps a
lot (as it does for nearly any service). It's also going to be very
highly dependent on the size of the plaintext/ciphertext --
encrypting/decrypting 20KB blobs will be quite a lot slower than
20-character strings.

Also, Vault 0.5.0 significantly speeds up transit, and Vault 0.5.1+
builds against Go 1.6 which brings AES speed improvements. My
benchmarks were against both of those.

Best,
Jeff

Matt Button

Apr 18, 2016, 11:00:00 AM
to vault...@googlegroups.com
> That seems very low to me. In my benchmarks with transit (using Apache
> Bench, plaintext strings of about 20 characters, between two
> dedicated-tenancy c3.8xlarge machines, re-using HTTP connections to
> avoid TLS overhead, no audit logging), I was hitting about 35k
> requests per second. With audit logging turned on (to a file), I was
> hitting about 25k requests per second (but every request that was
> claimed to have been handled was indeed showing up in the audit log).

The data we're encrypting varies a bit in size, but the majority should only be a few hundred bytes long. At most, they'd be a few KB in size, but we'll have a look to validate this assumption.

> As I noted in my previous email, reusing HTTP/TLS connections helps a lot (as it does for nearly any service)

Interesting, we're using the official vault-ruby client library. I'm not sure if it's doing connection re-use behind the scenes (the ruby Net::HTTP docs are a little cryptic about this), I don't suppose you know?

> Also, Vault 0.5.0 significantly speeds up transit, and Vault 0.5.1+
> builds against Go 1.6 which brings AES speed improvements. My
> benchmarks were against both of those.

We're using version 0.5.0, though we build and package it ourselves on our CI (from when we had to apply some custom patches for the MySQL backend). I believe our current version was built with Go 1.5, but I'd have to look that up to be sure.

If you have any pointers for other things to investigate we'd be more than happy to look into them.

Matt


Jeff Mitchell

Apr 18, 2016, 11:22:56 AM
to vault...@googlegroups.com
On Mon, Apr 18, 2016 at 10:59 AM, Matt Button
<matt....@geckoboard.com> wrote:
>> That seems very low to me. In my benchmarks with transit (using Apache
>> Bench, plaintext strings of about 20 characters, between two
>> dedicated-tenancy c3.8xlarge machines, re-using HTTP connections to
>> avoid TLS overhead, no audit logging), I was hitting about 35k
>> requests per second. With audit logging turned on (to a file), I was
>> hitting about 25k requests per second (but every request that was
>> claimed to have been handled was indeed showing up in the audit log).
>
> The data we're encrypting varies a bit in size, but the majority should only
> be a few hundred bytes long. At most, they'd be a few KB in size, but we'll
> have a look to validate this assumption.

It's also worth remembering that different EC2 instance types have
different network characteristics; tenancy and VPC location also have
an impact. Basically, AWS is variable :-D Since I was testing for
raw throughput of transit I optimized for "in-datacenter" style
compute/network resources.

BTW, I forgot to mention before, but my results were across 100, 250,
and 1000 concurrent writers. This number made very little difference,
however.

> Interesting, we're using the official vault-ruby client library. I'm not
> sure if it's doing connection re-use behind the scenes (the ruby Net::HTTP
> docs are a little cryptic about this), I don't suppose you know?

Sorry -- I know almost nothing about Ruby.

> We're using version 0.5.0, though we build and package it ourselves on our
> CI (from when we had to apply some custom patches for the MySQL backend). I
> believe our current version was built with go 1.5, but I'd have to look that
> up to be sure.

Go 1.6 brings some better AES-GCM behavior (the variant used in
transit), and by better I mean *hugely* better. See slide 16 of
http://www.slideshare.net/NicholasSullivan/whats-new-in-go-crypto-gotham-go

Later Vault versions (I forget if it's starting with 0.5.1 or 0.5.2)
actually require Go 1.6, so you'll have to migrate there eventually.
:-)

--Jeff