Hi everyone,
There have been occasional questions about how many secrets you can store in Vault. I haven't seen anyone else talk about their sizing, except to ask whether Vault can handle X amount of stuff.
Our experience is that things start to slow down when we hit roughly 600,000 secrets. Without getting into the details of our use case, which is definitely
not fabulous (we are working on transitioning to the transit backend for this, since it would let us offload the storage onto end-user machines and fits Vault better), here is our config and experience. We essentially shoved an existing process into Vault to get some security, knowing it wasn't an ideal workload for Vault. Overall we are still happy with that decision while we transition to a process that fits Vault better.
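For anyone curious what the transit approach looks like, here is a rough sketch (not our actual code) using the Python hvac client's generic write call against a transit key; the Vault address, token, and key name are placeholders. The point is that Vault only does the encrypt/decrypt, so the ciphertext can live on the end-user machine instead of in Vault's storage:

```python
import base64
import hvac

# Placeholder address and token -- adjust for your environment.
client = hvac.Client(url="https://vault.example.edu:8200", token="s.xxxx")

secret = b"the user's temporary secret"

# transit/encrypt/<key> returns ciphertext; Vault never stores the data itself,
# so the ciphertext can be written out to the end-user machine.
enc = client.write(
    "transit/encrypt/APPNAME-key",  # hypothetical key name
    plaintext=base64.b64encode(secret).decode(),
)
ciphertext = enc["data"]["ciphertext"]  # e.g. "vault:v1:..."

# Later, decrypt on demand.
dec = client.write("transit/decrypt/APPNAME-key", ciphertext=ciphertext)
plaintext = base64.b64decode(dec["data"]["plaintext"])
```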
Config:
HA Vault on 3 instances with 8 GB of memory each; they are not solely dedicated to Vault (yes, we want to fix this, but public school funding has its limits...)
Consul storage backend on 3 different machines (also with at least 8 GB of memory) -- also not dedicated solely to Consul.
One path, secret/APPNAME/, had 599k entries in it (plus we have about 1k secrets in other paths).
Most of the load was on the Consul boxes when this hit. There was *some* impact to end users hitting the Vault instances: no timeouts, but 30-second delays were possible. This happened last week; we don't normally run with 600k secrets. We use the secrets backend to store temporary secrets we generate for end users, and we have code that comes along and erases them when they're done. A bug in the cleanup code was causing it not to run, and the secrets piled up on us. We normally average around 100k secrets in production without issues.
We had a noticeable slowdown, but no outage, when we hit 600k secrets in Vault. We didn't spend much time diagnosing the problem beyond fixing the cleanup code and getting it to erase the old secrets. We also learned to keep a better eye on the cleanup code (it now alerts us on failure).
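For context, the cleanup job is conceptually just "list the path, delete anything past its expiry." A minimal sketch of that idea in Python with hvac is below; the expires_at field, the notify_oncall hook, and the KV v1 secret/ layout are assumptions for illustration, not our actual code:

```python
import time
import hvac

client = hvac.Client(url="https://vault.example.edu:8200", token="s.xxxx")

def cleanup(path="secret/APPNAME"):
    """Delete temporary secrets whose (hypothetical) expires_at has passed."""
    listing = client.list(path)
    keys = listing["data"]["keys"] if listing else []
    now = time.time()
    for key in keys:
        entry = client.read(f"{path}/{key}")
        if entry and float(entry["data"].get("expires_at", 0)) < now:
            client.delete(f"{path}/{key}")

try:
    cleanup()
except Exception as exc:
    # The lesson we learned: make sure a failed run is loud.
    notify_oncall(exc)  # placeholder for whatever alerting you use
    raise
```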
CPU and memory were not maxed out for either the Vault or Consul processes; the delays seemed to be in I/O (likely network), but we didn't dig into this.
For us, with our current production config, the upper limit seems to be around 600k secrets before we have slowdown issues. For a while it was taking about half a second of wall-clock time to delete from that path. We cleaned up overnight, and everyone was back to normal the next day.
Thanks Vault for not just crashing and burning on us!