AWS Failed Instance Status Checks on all 3 Consul-backed Vault EC2 Instances


Sean Bollin

Apr 25, 2016, 8:53:34 PM
to Consul
-- Sorry for the cross-post; this is on the Vault forums too:

Hi guys, we have a cluster of Vault servers running with Consul backends: three EC2 instances, each with Vault and Consul installed on them. Finally, we have an additional instance with haproxy installed on it.

Unfortunately, this is the second time that all three EC2 instances have simultaneously failed one of the two EC2 status checks. If you're familiar with the EC2 console, you can see this as "1/2 checks passed" right on your EC2 instance list page.

The first time it happened I chalked it up to an AWS problem, but now that it's happened twice I'm starting to wonder if this is an application-level thing.

How could Vault and Consul cause these EC2 instances to fail their status checks and become unreachable over SSH? Anything I should look into or consider?

I don't see anything obvious like an out-of-memory condition occurring.

I checked the logs, and all I see is that at about the time they go down, Consul starts failing (unable to elect a leader and such), but this would be expected if the instances can't communicate with each other.
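A minimal way to double-check for OOM-killer activity is to scan the kernel log for the messages the Linux OOM killer writes when it kills a process. The sketch below runs that check against a hard-coded sample line; on a real instance you would feed it the output of `dmesg` instead.

```python
import re

# Patterns the Linux OOM killer writes to the kernel log when it kills a process.
OOM_RE = re.compile(r"out of memory|oom-killer", re.IGNORECASE)

def find_oom_events(kernel_log: str) -> list[str]:
    """Return kernel-log lines that indicate OOM-killer activity."""
    return [line for line in kernel_log.splitlines() if OOM_RE.search(line)]

# Sample line standing in for real `dmesg` output, just to show the match.
sample = "Out of memory: Kill process 1234 (consul) score 912 or sacrifice child"
print(find_oom_events(sample))
```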

Jeff Mitchell

Apr 27, 2016, 9:24:18 AM
to consu...@googlegroups.com
Hi Sean,

The only thing I can think of is if there is some bad state that
causes the CPU to be pegged, making the machines unresponsive. OOM on
a Consul machine could maybe cause such a thing. Do you have any
runtime metrics for the machines? I doubt you'd get a sudden spike in
memory/CPU usage but maybe if it was trending upwards over time...?

--Jeff
> --
> This mailing list is governed under the HashiCorp Community Guidelines -
> https://www.hashicorp.com/community-guidelines.html. Behavior in violation
> of those guidelines may result in your removal from this mailing list.
>
> GitHub Issues: https://github.com/hashicorp/consul/issues
> IRC: #consul on Freenode
> ---
> You received this message because you are subscribed to the Google Groups
> "Consul" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to consul-tool...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/consul-tool/ccec4927-4eec-44d6-9a69-762e002fb38f%40googlegroups.com.
>
> For more options, visit https://groups.google.com/d/optout.

Oleg Soroka

Apr 27, 2016, 10:15:14 AM
to Consul
Consul is effectively unusable on AWS.

https://www.google.com/search?q=aws+consul+flapping

Let's wait for another crutch from HashiCorp.

David Adams

Apr 27, 2016, 10:54:12 AM
to consu...@googlegroups.com
This most certainly isn't true. Darron Froese has given some great presentations on using Consul at scale:

See blog.froese.org for more.

I've been running Consul on AWS with a relatively light but increasing workload: several hundred autoscaling instances in our largest cluster, and 13 interconnected Consul datacenters (each maps to a VPC, and we operate across six regions). We've yet to run into any flapping issues, but based on those presentations, the key factor when dealing with a lot of clients is allocating enough resources. So far I've been using five c4.large instances for our biggest datacenter, and also for our ACL datacenter. Our other datacenters (some of which are very lightly used) each run on three t2.small instances.


--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/consul/issues
IRC: #consul on Freenode
---
You received this message because you are subscribed to the Google Groups "Consul" group.
To unsubscribe from this group and stop receiving emails from it, send an email to consul-tool...@googlegroups.com.

James Phillips

Apr 27, 2016, 1:43:13 PM
to consu...@googlegroups.com
Hi Sean,

I'd second Jeff's recommendation to look into CPU usage. It would also be interesting to pull Consul's telemetry (https://www.consul.io/docs/agent/telemetry.html) to look for in-process factors like memory usage and GC performance. We've seen issues with Consul servers running on under-sized AWS instance types, where CPU stealing can cause heavy periods of packet loss, leading to timeouts between the servers and to elections; moving up to an instance type with more CPU and/or better-performing networking fixes the issue. What instance type and Consul version are you using?
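Exporting that telemetry is a small config change. A sketch of an agent config fragment that streams metrics to a local statsd daemon (the address here is an assumption; point it at whatever collector you actually run):

```json
{
  "statsd_addr": "127.0.0.1:8125"
}
```

If you don't run a collector, you can also send the Consul agent a `SIGUSR1` signal to dump its current metrics to the log output for a one-off look.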

If you want to post logs and telemetry and work through this in detail, it might be good to open a GitHub issue.

-- James

Darron Froese

Apr 27, 2016, 3:20:39 PM
to consu...@googlegroups.com
Consul is most certainly usable on AWS. That's where we use it with 1000+ agent nodes and 5 server nodes.

If you're using 0.5.x or earlier, there are some well-documented scaling issues, but our nodes have been running since August without any significant problems.

Without telemetry data we can't really help diagnose this; if you can post some, maybe we can help!

Roman Rusakov

Apr 29, 2016, 8:11:37 AM
to Consul, dar...@froese.org
I wonder whether the people who see good stability on AWS are simply not using separate AZs within one datacenter?
We have two AZs in one VPC and a single Consul DC for everything, in a mixed environment of Consul 0.6.3 and 0.6.4. Here is our telemetry:



On Wednesday, April 27, 2016 at 21:20:39 UTC+2, Darron Froese wrote:

David Adams

Apr 29, 2016, 9:24:01 AM
to consu...@googlegroups.com, dar...@froese.org
We use 2-3 AZs in each region in which we operate, for what it's worth.

-dave

James Phillips

Apr 29, 2016, 11:20:24 AM
to consu...@googlegroups.com, Darron Froese
Hi all,

We've got some improvements coming in Consul 0.7 targeted at these kinds of infrequent serfHealth flaps. In AWS these seem to be caused by brief periods of heavy packet loss and/or CPU exhaustion on small instance types, which keep nodes from meeting the soft real-time requirements of Serf's probes. The change will require more independent confirmations before a node is declared dead, but in a way that won't increase detection time for real failures. There will be some PRs landing soon with more details.

-- James

Roman Rusakov

May 25, 2016, 2:35:41 PM
to Consul, dar...@froese.org
Hi James!

I don't see any AWS-related changes in https://github.com/hashicorp/consul/blob/master/CHANGELOG.md for the unreleased 0.7 :(((((
Do you know if the serfHealth improvements will really be present in 0.7? Also, could you announce an approximate release date, if possible?
It's kind of a blocker for us to use Consul in production on AWS :(

Thanks,
Roman.

On Friday, April 29, 2016 at 18:20:24 UTC+3, James Phillips wrote: