Dealing with sporadic network failures on EC2 while running playbooks


Gregory Taylor

Sep 12, 2014, 11:04:47 AM
to ansible...@googlegroups.com
We use Ansible to deploy code updates across a small fleet (~8 machines). At least a few times a week, we run into network hiccups that cause the SSH connection to a random EC2 instance to fail, which fails the entire playbook run. Sometimes this leaves us with an incomplete deploy, which is no fun. In almost all cases we can immediately re-launch the playbook and the errant instance is fine the second time around. These appear to be very short interruptions, and there's no rhyme or reason as to which instance is affected -- it's usually only one instance out of the fleet at a time, with no pattern as to which one has connectivity issues.

What kinds of strategies is everyone using to deal with this sort of sporadic SSH failure that causes a whole playbook run to fail prematurely?
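Is hardening the OpenSSH settings in ansible.cfg the usual answer here? A sketch of the sort of thing I mean (the options and values below are illustrative only, not something we've actually validated):

    # ansible.cfg -- illustrative values only
    [ssh_connection]
    # ConnectionAttempts retries the initial TCP connect (one attempt per second);
    # the ServerAlive* options send periodic keepalives so a briefly idle session
    # isn't silently dropped by something in between.
    ssh_args = -o ControlMaster=auto -o ControlPersist=60s -o ConnectionAttempts=5 -o ServerAliveInterval=15 -o ServerAliveCountMax=3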

Michael DeHaan

Sep 12, 2014, 11:45:49 AM
to ansible...@googlegroups.com
Hmm....

We run a lot of EC2 plays from our integration tests -- originating *not* in EC2 and running almost constantly -- and don't really see this. With some other providers, yes. I'd be curious if others do.

You can definitely consider running the Ansible control machine *inside* EC2, where connections will be more reliable (and also faster), which is something I usually recommend to folks.

Another thing: when spinning up new instances with the "wait_for" trick, be sure to put a sleep in after the wait_for. SSH ports can come up but not be quite ready yet, which gives the appearance of an SSH failure. I'm wondering if that might be part of it, or if you're seeing connection issues at effectively random points rather than just right after spin-up.
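Roughly this shape, if it helps -- the registered variable and the host attribute here assume you saved the ec2 module result as "ec2", so treat those names as placeholders:

    - name: wait for SSH to come up on each new instance
      wait_for: host={{ item.public_dns_name }} port=22 delay=10 timeout=320 state=started
      with_items: ec2.instances

    # the port being open doesn't mean sshd is fully ready yet
    - name: give sshd a few extra seconds after the port opens
      pause: seconds=15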


On Fri, Sep 12, 2014 at 11:04 AM, Gregory Taylor <snaggl...@gmail.com> wrote:
We use Ansible to deploy code updates across a small fleet (~8 machines). At least a few times a week, we run into network hiccups that cause the SSH connection to a random EC2 instance to fail, which fails the entire playbook run. Sometimes this leaves us with an incomplete deploy, which is no fun. In almost all cases we can immediately re-launch the playbook and the errant instance is fine the second time around. These appear to be very short interruptions, and there's no rhyme or reason as to which instance is affected -- it's usually only one instance out of the fleet at a time, with no pattern as to which one has connectivity issues.

What kinds of strategies is everyone using to deal with this sort of sporadic SSH failure that causes a whole playbook run to fail prematurely?


Gregory Taylor

Sep 12, 2014, 12:05:05 PM
to ansible...@googlegroups.com
On Fri, Sep 12, 2014 at 11:45 AM, Michael DeHaan <mic...@ansible.com> wrote:

You can definitely consider running the Ansible control machine *inside* EC2, where connections will be more reliable (and also faster), which is something I usually recommend to folks.

We run an Ansible Tower instance in EC2 that runs these tasks, and that is where we're seeing the issues. We've tried running the playbooks from a few different host machines there, but we always eventually hit the periodic SSH network failure, and a subsequent retry works fine.
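Concretely, recovery today is just launching the same playbook again. Limiting the re-run to whichever host failed would look something like the second command below (the playbook name and retry-file path are made up for illustration -- the real path is whatever Ansible reports at the end of the failed run, assuming your version writes a .retry file):

    # full re-run, which is what we do today
    ansible-playbook deploy_app.yml

    # or re-run only the hosts that failed last time
    ansible-playbook deploy_app.yml --limit @/home/deploy/deploy_app.retry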
 
Another thing: when spinning up new instances with the "wait_for" trick, be sure to put a sleep in after the wait_for. SSH ports can come up but not be quite ready yet, which gives the appearance of an SSH failure. I'm wondering if that might be part of it, or if you're seeing connection issues at effectively random points rather than just right after spin-up.

While we do use Ansible for provisioning new instances, that's not where we're seeing the issue; it's our playbooks for rolling out code updates. We're just SSH'ing into each existing app server, transferring the updated code, and restarting a process. By the time we run these playbooks, the instances could be hours, days, or months old, so port readiness is a non-factor.
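Stripped down, the deploy play is roughly this shape (the group name, paths, module choices, and service name are illustrative rather than our exact playbook):

    - hosts: app_servers
      tasks:
        # copy the new build over to the already-running instance
        - name: push the updated code to each existing app server
          synchronize: src=build/ dest=/srv/app/current/

        # bounce the process so the new code is picked up
        - name: restart the app process
          service: name=myapp state=restarted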

Most of the time the EC2 network is fast and reliable, but we deploy frequently and do run into these issues from time to time. This is consistent with the errors we've seen when our app servers are temporarily unable to reach ElastiCache instances. Failure is just one of those things we have to live with and build for in EC2.

Michael DeHaan

Sep 12, 2014, 3:16:29 PM
to ansible...@googlegroups.com
Yeah, this is most definitely not a Tower-specific thing, since it's just running Ansible underneath -- but it's not something we have been seeing.

I'd say run things periodically and avoid use of the Atlantis or Pompeii availability zones?  :)




Gregory Taylor

Sep 12, 2014, 3:30:44 PM
to ansible...@googlegroups.com
On Fri, Sep 12, 2014 at 3:16 PM, Michael DeHaan <mic...@ansible.com> wrote:
Yeah, this is most definitely not a Tower-specific thing, since it's just running Ansible underneath -- but it's not something we have been seeing.

I'd say run things periodically and avoid use of the Atlantis or Pompeii availability zones?  :)

us-east-1 definitely falls under that description. We see at least one or two small 10-15 second hiccups each week between app servers and the DB/cache instances, or even between specific ELBs and their child instances. The disruptions are usually over so fast that they're not a big deal, but that's a somewhat different case from a failure in the middle of a deploy.