extra scaling events with ec2_asg module


ro...@pandastrike.com

Mar 25, 2015, 9:36:17 AM
to ansible...@googlegroups.com
For Ansible 1.9-devel, pull request 601 has the fix for issue 383, which affects our production ASG about every two weeks. We use the ec2_asg module to refresh our ASG instances three times a day.

I was eager to test. In doing so, I noticed that the replace_all_instances or replace_instances options cause an extra set of scaling events. Has anyone else who uses either replace_ option seen this happen? See the screenshot below, which demonstrates the behavior.

We have one instance in each of two Availability Zones, so we use a batch size of two (actually a formula based on the length of the ASG's availability_zones list; see the replace task below).

Interesting... I just tested with batch_size: 1, and the extra set of scaling events shrank accordingly: one extra instance launched and one extra instance terminated.
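
For that test, the task was the same as the full version below, just with the batch size hard-coded; a sketch of the relevant parameters:

    - name: Replace current instances one at a time
      local_action:
        module: ec2_asg
        name: "{{ asg_name }}"
        state: present
        lc_check: no
        replace_all_instances: yes
        replace_batch_size: 1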

The batch_size logic is broken. I am going to open an issue in ansible-modules-core, but I welcome others to note their experience here. I'll update this topic with a link to the issue, too. The playbook excerpt below shows how we drive the replacement:

    - name: Retrieve Auto Scaling Group properties
      local_action:
        module: ec2_asg
        name: "{{ asg_name }}"
        state: present
        health_check_type: ELB
      register: result_asg

    - name: Auto Scaling Group properties
      debug: var=result_asg

    - name: Replace current instances with fresh instances
      local_action:
        module: ec2_asg
        name: "{{ asg_name }}"
        state: present
        min_size: "{{ result_asg.min_size }}"
        max_size: "{{ result_asg.max_size }}"
        desired_capacity: "{{ result_asg.desired_capacity }}"
        health_check_type: "{{ result_asg.health_check_type }}"
        lc_check: no  # requires PR 589; forces replacement even when Launch Configs match (see my reply below)
        replace_all_instances: yes
        replace_batch_size: "{{ result_asg.availability_zones | length }}"

[screenshot of the ASG activity history]

Events 1. and 2. are expected; a. through d. are extra scaling events.

James Martin

Mar 25, 2015, 11:02:12 AM
to ansible...@googlegroups.com
Looking forward to the GitHub issue -- make sure you take a look at the Auto Scaling group and the ELB in the AWS console and see if it gives a description of why the instances were terminated.  I've seen cases where instances did not come online fast enough, so the ELB marked them as unhealthy and the ASG terminated them.
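
If you'd rather not dig through the console, something like this pulls the activity descriptions from the command line (a sketch, assuming the AWS CLI is installed and configured, and that asg_name holds your group's name):

    - name: Show recent scaling activities and why instances were terminated
      local_action: command aws autoscaling describe-scaling-activities --auto-scaling-group-name {{ asg_name }}
      register: activities

    - name: Print the activity output
      debug: var=activities.stdout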

Thanks,

James

ro...@pandastrike.com

Mar 25, 2015, 1:42:19 PM
to ansible...@googlegroups.com
Thanks James.

All of the instances were terminated because terminate_batch() marked them Unhealthy.

I am using the changes from this PR: https://github.com/ansible/ansible-modules-core/pull/589, combined with the fixes in PR 601. Rationale: I need lc_check=no to cause all instances to be replaced. As the module is currently written, lc_check only takes effect when an instance has a different Launch Config than the one assigned to the ASG. Upon further consideration, I should add a new option instead of overloading the meaning of lc_check.
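
For example, a hypothetical force_replace option (the name and behavior are purely illustrative; nothing like it exists in the module today) would leave lc_check's meaning intact:

    - name: Replace all instances regardless of Launch Config
      local_action:
        module: ec2_asg
        name: "{{ asg_name }}"
        state: present
        replace_all_instances: yes
        # hypothetical option: replace instances even when their Launch
        # Config already matches the ASG's, instead of overloading lc_check
        force_replace: yes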

James Martin

Mar 30, 2015, 4:44:25 PM
to ansible...@googlegroups.com
I spent some time reworking the algorithm that does the rolling replacement.  It is much smarter now and shouldn't cause unnecessary scaling events.  I've also merged the functionality of #589.  Would you mind giving it a whirl?
