We use Ansible to deploy code updates across a small fleet (~8 machines). At least a few times a week, we run into network hiccups that cause the SSH connection to a random EC2 instance to fail, which in turn fails the entire playbook run. Sometimes this leaves us with an incomplete deploy, which is no fun. In almost all cases we can immediately re-launch the playbook and the errant instance is fine the second time around. These appear to be very short interruptions, and there's no rhyme or reason as to which instance is affected. It's usually only one instance out of our fleet at a time. What kind of strategies is everyone using to deal with these sorts of sporadic SSH failures that cause the whole playbook run to fail prematurely?
--
You received this message because you are subscribed to the Google Groups "Ansible Project" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ansible-proje...@googlegroups.com.
To post to this group, send email to ansible...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ansible-project/c91a5b9d-3cf3-4efe-93ac-17c7e7f107e8%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
You can definitely consider running the Ansible control machine *inside* EC2, where connections will be more reliable (and also faster), which is something I usually recommend to folks.
Another thing: when spinning up new instances and using the "wait_for" trick, be sure to put a sleep in after the wait_for. SSH ports can come up but not be quite ready, which gives the appearance of SSH failure. I'm wondering if that might be part of it. Are you seeing connection issues at effectively random points, or only right after instances come up?
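For what it's worth, the wait_for-then-sleep pattern might look roughly like the sketch below. The group name "new_instances", the 300-second timeout, and the 30-second pause are all assumptions made up for illustration; adjust them to your own inventory and to how long your instances take to settle.

```yaml
# Sketch only: "new_instances", the timeout, and the pause length are assumed values.
- hosts: new_instances
  gather_facts: false          # facts need SSH, which may not be ready yet
  tasks:
    - name: Wait for the SSH port to accept connections
      wait_for:
        host: "{{ inventory_hostname }}"
        port: 22
        state: started
        timeout: 300
      delegate_to: localhost   # run the check from the control machine

    - name: Sleep a little after the port opens, since sshd may not be fully ready
      pause:
        seconds: 30
```

The pause is deliberately crude. It only covers the window where the port is open but sshd is not yet accepting logins, so it would not help with failures that show up at random points later in a run.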
Yeah, this is most definitely not a Tower-specific thing since it's just running Ansible underneath -- but it's not something we have been seeing. I'd say run things periodically and avoid use of the Atlantis or Pompeii availability zones? :)