Retrying failed tasks


Ian Rose

Aug 10, 2015, 3:36:30 PM
to Ansible Project
Hi all -

I've been pretty happy running Ansible for a few months now.  The one major thorn in my side is failed tasks.  Our fleet of VMs is not very large, but apparently is large enough (or our playbook is long enough) that we hit at least one spurious SSH error (e.g. "SSH Error: mux_client_hello_exchange: write packet: Broken pipe"), or, more rarely, I'll hit a spurious 500 from a third party service (e.g. adding or removing our VMs to/from load balancers via a cloud API).

What's the best practice for dealing with these kinds of transient failures?  It seems to me that something like "sleep X seconds, then retry, up to Y times" would work quite well, but it isn't obvious to me how to make that happen.
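
For reference, a minimal sketch of that "sleep X seconds, then retry, up to Y times" pattern using Ansible's task-level `until`, `retries`, and `delay` keywords. The script path and the `lb_result` variable name are hypothetical placeholders, not anything from my actual playbook. As far as I can tell this only re-runs a task whose result fails the `until` condition (e.g. a spurious 500 from an API); it probably does not help when the SSH connection itself drops.

```yaml
# Sketch only: retry a flaky task up to 5 times, sleeping 10 seconds between tries.
# /usr/local/bin/add_to_load_balancer.sh is a hypothetical script standing in for
# whatever call occasionally fails (e.g. a cloud load balancer API request).
- name: add this VM to the load balancer, retrying on transient failures
  command: /usr/local/bin/add_to_load_balancer.sh
  register: lb_result            # "until" needs a registered result to inspect
  until: lb_result.rc == 0       # stop retrying once the command exits with 0
  retries: 5                     # Y: maximum number of attempts
  delay: 10                      # X: seconds to sleep between attempts
```

If every attempt fails, the task is still reported as failed, so the play stops (or continues) exactly as it would without the retry loop.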

I'm aware of the wait_for module, but I don't think that really helps in this situation since the problem isn't that a resource is actually missing; it's just spurious failures.

Any suggestions?

Thanks!
- Ian

Brian Coca

Aug 10, 2015, 3:37:32 PM
to Ansible Project
You can use the .retry files as a --limit to rerun the plays.




Ian Rose

Aug 10, 2015, 4:11:02 PM
to Ansible Project
My understanding of retry files (which could certainly be wrong) is that they merely limit the hosts that are included in the run.  I don't think that will work for me, although perhaps this indicates that my playbook is not set up well.  Here is a simplified version of my site.yml:

- name: copy new files to all nodes
  hosts: all
  tasks:
  - include: tasks/deploy_files.yml

- name: configure and deploy backend type foo
  hosts: tag_foo
  roles:
  - foo

- name: configure and deploy backend type bar
  hosts: tag_bar
  roles:
  - bar

- name: configure and deploy backend type baz
  hosts: tag_baz
  roles:
  - baz

(etc, for 7 total backend types)

- name: clean up old deployments from all nodes
  hosts: all
  tasks:
  - include: tasks/remove_old_deployments.yml


So, given this structure, pretend that the "foo" play went fine, but then some task during one of the "bar" backend deployments failed.  Won't the retry file just contain that single host?  (assuming we are running "serial: 1" for the play that failed)  So if I reran using that file, I might get that "bar" host to deploy correctly, but I will totally miss all of the "baz" hosts and all other backends whose deployment plays appear after the "bar" play.

I suppose one option might be to break up this single site.yml into 7 different playbooks, one for each backend type, and then execute them each in order, retrying each one as necessary if any errors occur.  Would that be a better setup?  That seems to be a bit silly, but maybe I'm wrong on that...

Thanks,
Ian

Brian Coca

Aug 10, 2015, 4:39:47 PM
to Ansible Project