Behavior of ec2_elb module


Bruce Pennypacker

Feb 13, 2014, 4:26:12 PM
to ansible...@googlegroups.com
I occasionally run into an issue with the ec2_elb module that I think needs to be addressed.  This module adds and removes Amazon EC2 instances to and from an Amazon load balancer.  The Amazon API is asynchronous, so when the module adds an instance to a load balancer it issues an "insert" command and then goes into a loop where it sleeps for a second, checks the status, and repeats until the instance reaches an "In Service" state in the load balancer. There's also a check to see if the instance enters an error state, and if it does, the module immediately returns an error. Here's a bit of pseudocode to demonstrate the behavior:

while True:
    get_instance_state
    if InService:
        return success
    else if instance_error:
        return error
    sleep 1

So when an instance is added to the load balancer, the module waits until it reaches the InService state.  If at any point along the way it enters an error state, the module fails immediately.

The problem I have is that it's not terribly uncommon for an instance to enter a transient unhealthy state for a couple of seconds prior to being successfully put into service.  I have on a number of occasions had my Ansible playbook fail because the ec2_elb module throws an error and yet the EC2 instance is successfully put into service in the load balancer.  If the module had simply waited a few more seconds to check on the health of the instance then my playbook would have run successfully.

I would like to propose a change to the ec2_elb module to address these sorts of transient errors.  There really should be a timeout associated with this while loop in the module.  It should only fail if the instance is not put into service during that period of time, and any errors that occur within that time period should be ignored.  To preserve the current default behavior, it shouldn't be too difficult to add an optional timeout parameter that changes the behavior only if it is set.  So if a timeout parameter is added, the above loop would look something like this:

while not timeout_exceeded:
    get_instance_state
    if InService:
        return success
    else if instance_error AND timeout_exceeded:
        return error
    sleep 1

return timeout
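
For anyone who wants something more concrete, here is a minimal Python sketch of the timeout loop above. It is only an illustration, not the module's actual code: `wait_for_in_service`, its parameters, and the `get_instance_state` callable are hypothetical names, and the `clock` and `sleep` arguments exist only so the loop can be exercised without real waiting.

```python
import time

IN_SERVICE = "InService"


def wait_for_in_service(get_instance_state, timeout, interval=1.0,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll until the instance is InService or `timeout` seconds have passed.

    Returns (True, state) on success and (False, last_state) on timeout.
    Any other state, including an unhealthy one, is treated as transient
    until the deadline has passed.
    """
    deadline = clock() + timeout
    while True:
        state = get_instance_state()
        if state == IN_SERVICE:
            return True, state
        if clock() >= deadline:
            # Only now is a non-InService state reported as a failure.
            return False, state
        sleep(interval)
```

The caller can then decide whether the last observed state should be reported as an error or as a plain timeout.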

Any comments/suggestions about this, especially from other folks using the ec2_elb module?

-Bruce


Michael DeHaan

Feb 13, 2014, 5:07:06 PM
to ansible...@googlegroups.com
I'm ok with the above.




--
You received this message because you are subscribed to the Google Groups "Ansible Project" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ansible-proje...@googlegroups.com.
To post to this group, send email to ansible...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Scott Anderson

Feb 14, 2014, 12:35:27 PM
to ansible...@googlegroups.com
Are there other modules that behave in this fashion as well?

If so, a generic retry loop that takes a callable for checking state might be useful as a bit of refactoring.
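
A rough sketch of what such a helper might look like, assuming nothing about the existing modules: `wait_until` and `WaitTimeout` are made-up names, and `check` is whatever callable a module supplies to test its own state.

```python
import time


class WaitTimeout(Exception):
    """Raised when the check never succeeds within the allowed time."""


def wait_until(check, timeout, interval=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Call check() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds pass.
    """
    deadline = clock() + timeout
    while True:
        result = check()
        if result:
            return result
        if clock() >= deadline:
            raise WaitTimeout("condition not met after %s seconds" % timeout)
        sleep(interval)
```

Each module would then only need to provide its own state check, for example a function that returns True once a subnet or route table is visible.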

-scott

Alexander Popov

Feb 20, 2014, 1:18:59 PM
to ansible...@googlegroups.com
I observed a similar issue with ec2_group and ec2_vpc (while creating subnets, route tables, associations, etc.).

C. S.

Mar 9, 2014, 11:55:29 PM
to ansible...@googlegroups.com
We’ve had issues with this module too (1.4.x)…

The main issue is that it de-registers the instance from the ELB and doesn’t store anywhere which ELB it removed it from. So if there is a failure before the re-register, you have to manually add it back to the ELB pool (and it breaks autoscale in the meantime). I wonder if the module should update an EC2 tag on the instance first, so it knows how to put it back on subsequent runs if it couldn’t re-register it in a previous run.

Our workaround was to de-register and immediately re-register the instance with the ELB so that just the health check gets reset while we update an instance’s app. Additionally, we had to add retries to the module, since we occasionally get throttled by the AWS APIs.

Here is the algorithm that AWS recommends for their APIs:
http://docs.aws.amazon.com/general/latest/gr/api-retries.html
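
Roughly, the idea is exponential backoff: wait longer after each failed attempt. Here is a minimal Python sketch of that idea, not the code we actually run. `call_with_backoff`, `ThrottledError`, and all the default values are assumptions made for illustration, and the random jitter is an addition that may or may not match what the page recommends.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for the throttling error an AWS client raises."""


def call_with_backoff(func, max_retries=5, base_delay=0.1, max_delay=20.0,
                      retry_on=(ThrottledError,),
                      sleep=time.sleep, rng=random.random):
    """Call func(), retrying with exponentially growing delays on `retry_on`."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise  # out of retries, surface the last error
            # Delay ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Sleep a random fraction of the ceiling (jitter).
            sleep(ceiling * rng())
```

In a real module `retry_on` would be set to whatever exception class the AWS library raises for throttling.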

Scott Anderson

Mar 10, 2014, 7:26:26 AM
to ansible...@googlegroups.com
We’re going to use a different means of updating our instances:

1) Update the application AMI offline
2) Change the autoscaler to use the new AMI
3) Gracefully down and kill the old AMI instances in the ELB
4) The autoscaler will start up new AMI instances as the old ones die
5) Profit!

-scott


C. S.

Mar 15, 2014, 4:26:30 AM
to ansible...@googlegroups.com
Cool. Do you have the whole process automated with Ansible?

Michael DeHaan

Mar 15, 2014, 9:27:31 AM
to ansible...@googlegroups.com
This seems fine to me, +1


Michael DeHaan

Mar 15, 2014, 9:27:58 AM
to ansible...@googlegroups.com
Rarrgh, replies to old thread, please disregard.

