DynamoDB Backend Global Table and Active/Active Vault Servers in Different Regions


awa...@zendesk.com

Oct 30, 2018, 10:29:45 AM
to Vault
Our team has been testing out a couple of different setups (DynamoDB HA + DynamoDB storage, and Consul HA + Consul storage) in preparation for a re-architecture of our secrets management. During our testing, we came up with an architecture idea that we haven't found much information on and are looking to test.

DynamoDB Global Table Across Two Regions (Region 1 and Region 2)

1 Non-HA Vault Server in Region 1 
1 Non-HA Vault Server in Region 2

We would point both of these active, non-HA Vault servers at the same DynamoDB global table and let each server manage the local region's Vault needs. Writes would be replicated across the regions. In the event of a region/AZ/instance failure, the remaining Vault server could service all of the necessary secret operations. 

We understand that this appears to be an untested configuration. If there is a zero percent chance of this working in any way, we'd like to skip any sort of PoC work. 

Does anyone have any insight into anything that would certainly break with a setup like this? 

Joel Thompson

Oct 30, 2018, 11:15:15 AM
to vault...@googlegroups.com
Hi,

You would almost certainly be asking for trouble (and potentially data loss!) if you did this.

Vault assumes that there is only one active node writing to its data storage at once, and if you violate that assumption, it could cause problems. For example, Vault does a lot of caching of data, and if your non-HA Vault in region 1 writes something to DynamoDB, the Vault in region 2 wouldn't know about that write, and it would get out of sync and potentially overwrite data written by the Vault in region 1. This is but one example of what can happen; there are many other reasons why this would be bad.
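
To make that concrete, here's a toy Go sketch of the stale-cache failure mode (this isn't Vault's real code; the types, keys, and the map standing in for the Global Table are made up purely for illustration):

package main

import "fmt"

// cachingNode is a stand-in for one non-HA Vault server: an in-memory cache
// sitting in front of shared storage.
type cachingNode struct {
	cache map[string]string
	store map[string]string // shared map standing in for the Global Table
}

func (n *cachingNode) read(key string) string {
	if v, ok := n.cache[key]; ok {
		return v // cache hit: never notices the other region's newer write
	}
	v := n.store[key]
	n.cache[key] = v
	return v
}

func (n *cachingNode) write(key, val string) {
	n.cache[key] = val
	n.store[key] = val
}

func main() {
	table := map[string]string{"sys/policy/app": "v1"}
	region1 := &cachingNode{cache: map[string]string{}, store: table}
	region2 := &cachingNode{cache: map[string]string{}, store: table}

	region1.read("sys/policy/app")        // region 1 caches "v1"
	region2.write("sys/policy/app", "v2") // region 1's cache is now stale

	fmt.Println(region1.read("sys/policy/app")) // still "v1": a stale read
	region1.write("sys/policy/app", "v1-edited") // clobbers region 2's "v2"
	fmt.Println(table["sys/policy/app"])         // "v1-edited"
}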

You might potentially be able to get away with HA Vaults pointing to the same DDB Global Tables in different regions (Vault might actually end up defaulting to HA across the two regions if it sees there's already data in the DDB Global Table and another Vault), but then you'd be very dependent on AWS's DDB replication latency to ensure you only ever have a single master. If the replication latency grows too much, the Vault in region 2 might think the Vault in region 1 has failed and try to take over, and then you'd be back in the situation I mentioned in the previous paragraph.
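
The failure mode there looks roughly like this (a simplified Go model of a TTL-based leader lock being read from a lagging replica; it's not the actual DynamoDB HA lock implementation, and the names and TTL are invented):

package main

import (
	"fmt"
	"time"
)

// lockRecord models a leader lock in the storage backend: the active node
// periodically refreshes lastRenewed, and a standby that sees a record older
// than ttl assumes the leader is dead and takes over.
type lockRecord struct {
	holder      string
	lastRenewed time.Time
}

const ttl = 15 * time.Second

// standbyShouldTakeOver is the question region 2 effectively asks against its
// *local replica* of the lock record. If cross-region replication lags, the
// replica's lastRenewed is stale even though region 1 is alive and renewing.
func standbyShouldTakeOver(replicaCopy lockRecord, now time.Time) bool {
	return now.Sub(replicaCopy.lastRenewed) > ttl
}

func main() {
	now := time.Now()

	// Region 1 renewed its lock seconds ago, but that write hasn't
	// replicated yet; region 2's replica still shows a renewal from 20s ago.
	region2Replica := lockRecord{holder: "vault-region1", lastRenewed: now.Add(-20 * time.Second)}

	if standbyShouldTakeOver(region2Replica, now) {
		fmt.Println("region 2 grabs the lock -> two active Vaults writing at once")
	}
}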

--Joel


awa...@zendesk.com

Oct 30, 2018, 11:38:34 AM
to Vault
Thanks, Joel! This was exactly the type of response I was hoping to see. Very much appreciate the summary.

are...@zendesk.com

Oct 30, 2018, 12:43:32 PM
to Vault
Hey Joel!  I'm on Andrew's team.  Thanks for the quick reply.

> For example, Vault does a lot of caching of data, and if your non-HA
> Vault in region 1 writes something to DynamoDB, the Vault in region 2
> wouldn't know about that write, and it would get out of sync and
> potentially overwrite data written by the Vault.

Would the disable_cache option address this?  Or any other no-cache options?

> This is but one example of what can happen; there are many other reasons
> why this would be bad.

If you wouldn't mind, I'd love to hear some of these.  With more managed global
datastores like Spanner and DynamoDB popping up, this could be a really
compelling use case.

Joel Thompson

Oct 30, 2018, 5:11:01 PM
to vault...@googlegroups.com
Hello,

On Tue, Oct 30, 2018 at 12:43 PM arecker via Vault <vault...@googlegroups.com> wrote:
> Hey Joel!  I'm on Andrew's team.  Thanks for the quick reply.

>> For example, Vault does a lot of caching of data, and if your non-HA
>> Vault in region 1 writes something to DynamoDB, the Vault in region 2
>> wouldn't know about that write, and it would get out of sync and
>> potentially overwrite data written by the Vault.

> Would the disable_cache option address this?  Or any other no-cache options?


Not really. For example, there are places in the Vault code that do sequences of things like:
1. Take a write lock out
2. Read in some data from storage
3. Modify the data
4. Write it back out to storage
5. Release the write lock
This ensures that all updates to configs happen serially. The write lock is local to a Vault instance. So having multiple instances could cause data overwriting/corruption/loss due to race conditions that don't exist when using a single instance.

Additionally, DDB Global Tables follow a "last writer wins" principle and aren't strongly consistent across regions, so this could also lead to data being out of sync.
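
Put together, the sequence above plus last-writer-wins replication gives you a classic lost update. Here's a deliberately sequential Go sketch of one perfectly legal interleaving (all names made up; the map is a crude stand-in for the Global Table):

package main

import "fmt"

// table stands in for the DynamoDB Global Table. Each Vault instance wraps
// its read-modify-write in its own in-process lock, but that lock is
// invisible to the other instance, so nothing orders these steps across
// regions.
var table = map[string]string{"core/audit-config": "device=file"}

func main() {
	// Step 2 on both nodes: each reads the same starting value.
	region1Copy := table["core/audit-config"]
	region2Copy := table["core/audit-config"]

	// Step 3: each modifies its own private copy.
	region1Copy += ",device=syslog"
	region2Copy += ",device=socket"

	// Step 4: both write back. With last-writer-wins replication, whichever
	// write lands last survives; region 1's change silently disappears.
	table["core/audit-config"] = region1Copy
	table["core/audit-config"] = region2Copy

	fmt.Println(table["core/audit-config"]) // "device=file,device=socket"
}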

>> This is but one example of what can happen; there are many other reasons
>> why this would be bad.

> If you wouldn't mind, I'd love to hear some of these.  With more managed global
> datastores like Spanner and DynamoDB popping up, this could be a really
> compelling use case.

In some cases, Vault wants to have a notion of a single master/leader. For example, when a lease expires, Vault needs to revoke that lease (e.g., tear down issued credentials). When a token expires, all leases for that token need to be revoked. In your setup, which of the instances would tear down resources that are inherently global (such as AWS IAM users in the AWS Secrets engine)? Both might try, and that could cause race conditions and potentially data corruption.
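
As a toy illustration of the expiration problem (again, invented names, nothing like the real expiration manager):

package main

import "fmt"

// revocations counts how many times each backing resource gets torn down.
// With a single active node, the expiration manager revokes each lease once;
// with two independent actives, both walk the same lease list.
var revocations = map[string]int{}

// expireLeases is a hypothetical stand-in for the expiration work: revoke the
// backing credential (e.g. delete an IAM user), then delete the lease entry.
func expireLeases(nodeName string, expiredLeases []string) {
	for _, lease := range expiredLeases {
		revocations[lease]++
		fmt.Printf("%s revoked %s\n", nodeName, lease)
	}
}

func main() {
	expired := []string{"aws/creds/deploy/abc123"}

	// Both regions' Vaults believe they own expiration for this lease.
	expireLeases("vault-region1", expired)
	expireLeases("vault-region2", expired)

	// The second revocation hits a credential that no longer exists (or the
	// two interleave), and both nodes then race to update the lease entry.
	fmt.Println("times revoked:", revocations["aws/creds/deploy/abc123"]) // 2
}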

But, more fundamentally, Vault isn't designed for this; it's not tested, it's not supported, and it could easily break in the future with no warning. There are many previous discussions around this very subject on this mailing list. You might get lucky, but you're playing with fire.

Paid Vault editions do support replication, and that is the suggested and supported way of handling global availability. I know it's a bit disappointing to hear "buy the paid product," but HashiCorp employs a number of full-time engineers developing Vault, both the OSS and paid features, and it needs to make money to keep employing those engineers and delivering great features. (And no, I'm not a HashiCorp employee, just a community member and contributor, so I'm not trying to tell you to buy the product to support me personally.) There are some very thoughtful comments from HashiCorp employees at https://github.com/hashicorp/vault/issues/132#issuecomment-320778432 about which features are paid vs. OSS; I encourage people to read them.

--Joel


are...@zendesk.com

Oct 30, 2018, 5:57:24 PM
to Vault
Ah - thanks for humoring me!  Needless to say, we're probably going to heed your warning; I was just curious about the details.

iw

Nov 5, 2018, 9:51:51 AM
to Vault
Hi,

We are testing a similar arrangement:

API Gateway -> Lambda -> Vault -> DynamoDB Global Table

within two regions, with active-passive failover.

We had major concerns about race-condition behaviour in an active-active configuration (latency-based routing), and Joel's great comments reinforced them.

The next activity for us is to produce a CloudWatch Logs Audit Device for Vault.

ian