-- Sorry for the cross post, this is on Vault forums too:
Hi guys we have a cluster of Vaults running with Consul backends. Three EC2 instances, each with Vault and Consul installed on them. Finally, we have an additional instance with haproxy installed on it.
Unfortunately this is the second time that all three EC2 instances have simultaneously failed to pass the last of 2 status checks. If you're familiar with the EC2 console you can see this "1/2 checks passed" right on your EC2 instance list page.
Now the first time it happened I chalked it up to an AWS problem. But now that it's happened twice I'm starting to wonder if this is an application level thing.
How could Vault and Consul cause these ec2 instances to become unable to SSH into and fail their status checks? Anything I should look into or consider?
I don't see anything obvious like out of memory occurring.
I checked the logs and all I see is at about the time they go down Consul starts failing - unable to elect leaders and such, but this would be expected if the instances can't communicate to each other.