EC2 Rolling Deploy with an ASG


Daniel Langer

Sep 7, 2014, 2:08:32 PM9/7/14
to ansible...@googlegroups.com
Hello,

I'm trying to use Ansible to do a rolling deploy against an ELB linked to an auto-scaling group (ASG), using a pre-baked AMI. My ideal process would go something like this:

1. Get the current membership of the ASG
2. Update the launch configuration for the ASG
3. For each member:
  3a. Create an instance using the new AMI
  3b. Associate the instance with the ASG
  3c. Terminate the original instance 

The other option I was considering was:

1. Get the current membership of the ASG
2. Update the launch configuration for the ASG
3. For each member:
  3a. Terminate the instance 
  3b. Wait until the ASG has noticed and launched a new instance before continuing

For the former, I don't see a way using the built-in EC2 modules to associate an instance with an ASG. For the latter, I'm not clear how I'd wait until the ASG has launched a new instance to catch up with the one I terminated.
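To make the second option concrete, the "wait until the ASG has noticed" step could look something like this generic polling loop. This is just a sketch of what I have in mind; `get_healthy_instance_ids` is a hypothetical callback standing in for whatever describe call reports the ASG's healthy members:

```python
import time

def wait_for_replacement(get_healthy_instance_ids, desired_count,
                         timeout=600, poll_interval=10):
    """Poll until the ASG is back at desired_count healthy instances.

    get_healthy_instance_ids is a hypothetical callback that would wrap
    a describe call against the ASG; here it just returns instance ids.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        healthy = get_healthy_instance_ids()
        if len(healthy) >= desired_count:
            return healthy
        time.sleep(poll_interval)
    raise TimeoutError("ASG did not reach desired capacity in time")

# Simulated: the first poll sees two healthy instances, the second sees three.
polls = iter([["i-1", "i-2"], ["i-1", "i-2", "i-3"]])
print(wait_for_replacement(lambda: next(polls), 3, poll_interval=0))
# ['i-1', 'i-2', 'i-3']
```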

Any suggestions on how to do either one, or, if that's not possible, what the best practice is for what I'm trying to do?

Thanks,

Dan

Michael DeHaan

Sep 7, 2014, 2:23:25 PM9/7/14
to ansible...@googlegroups.com, James Martin
Hi,

James Martin is working on a 2-3 part blog post on *exactly* this subject (which I believe we're going to be posting this week) that shows a couple of ways to do it.

I've included him on this mailing list thread if he wants to share some cliff-notes.

--Michael




--
You received this message because you are subscribed to the Google Groups "Ansible Project" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ansible-proje...@googlegroups.com.
To post to this group, send email to ansible...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/ansible-project/42240c16-c2a5-4dac-b6f9-a30fc6e5b8d2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

James Martin

Sep 7, 2014, 2:50:08 PM9/7/14
to Michael DeHaan, ansible...@googlegroups.com
Dan,

I've been tinkering with this process for quite a while and have made a pull request to ansible core that I believe does what you are looking for:


As Michael stated, we will be releasing a blog post that goes more in depth describing a few different ways to perform updates to ASGs that use pre-baked AMIs (this module approach being one of them).

Of course, I'd appreciate any feedback/testing you can provide on that pull request. The documentation is inline in the module source.

Thanks,


- James

--
James Martin
Solutions Architect
Ansible, Inc.

Michael DeHaan

Sep 7, 2014, 3:26:14 PM9/7/14
to James Martin, ansible...@googlegroups.com
Hi James,

Thanks!

In reading the PR examples section, I'm curious why we might show Option 1 if Option 2 is much cleaner and would be interested in details.

Also, quick question - it's replacing all instances, but what's it replacing them *with* ?

Perhaps this is something we should show as well, where we indicate how to specify what the new instance IDs would be.

Can you help me grok the additions?

Thanks again!

+## Option 2
+This does everything that Option 1 does, but is contained inside the module. It's more opaque,
+but the playbooks end up being much clearer.
+
+- ec2_asg:
+    name: myasg
+    health_check_period: 60
+    health_check_type: ELB
+    replace_all_instances: yes
+    min_size: 5
+    max_size: 5
+    desired_capacity: 5
+    region: us-east-1

Daniel Langer

Sep 7, 2014, 5:30:01 PM9/7/14
to ansible...@googlegroups.com
FWIW I like the cleanliness of Option 2 - would it still support the options like replace_batch_size?

James Martin

Sep 7, 2014, 5:48:17 PM9/7/14
to Michael DeHaan, ansible...@googlegroups.com

Michael,

The reason for having both was to spur this very discussion. :) Option 1 is a bit more complicated but more transparent; option 2 is much easier but less transparent. I'm more fond of option 2, and happy to make it the only one. BTW, are we talking about the docs or the actual feature?

As far as what the instances are being replaced with: the ASG is going to spin up new instances with the current launch configuration. With option 2, the module starts by building a list of which instances should be replaced. This list is made up of all instances that have not been launched with the current launch configuration. The module then bumps the size of the ASG by replace_batch_size. It then terminates replace_batch_size instances at a time, waits for the ASG to spin up new instances in their place and become healthy, then continues on down the list until there are no more left to replace. Then it sets the ASG size back to its original value.
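As a rough illustration (not the actual module code), that flow can be simulated in plain Python. `MockASG` is a toy stand-in for AWS, and its scale-in rule is a simplification of the default termination policy:

```python
import itertools

class MockASG:
    """Toy stand-in for an AutoScale Group. Scaling out launches instances
    with the current launch config; scaling in prefers instances on an old
    launch config (a simplification of the default termination policy)."""

    def __init__(self, instances, current_lc):
        self.instances = dict(instances)   # instance_id -> launch config name
        self.current_lc = current_lc
        self._ids = itertools.count(100)
        self.set_desired(len(self.instances))

    def set_desired(self, n):
        self.desired = n
        while len(self.instances) < n:     # scale out with the current LC
            self.instances["i-%d" % next(self._ids)] = self.current_lc
        while len(self.instances) > n:     # scale in, old-LC instances first
            victim = min(self.instances,
                         key=lambda i: self.instances[i] == self.current_lc)
            del self.instances[victim]

    def terminate(self, instance_id):
        del self.instances[instance_id]
        self.set_desired(self.desired)     # the ASG replaces the shortfall

def replace_instances(asg, batch_size):
    """Sketch of the replace flow described above."""
    to_replace = [i for i, lc in asg.instances.items() if lc != asg.current_lc]
    original = asg.desired
    asg.set_desired(original + batch_size)     # bump capacity by the batch size
    for start in range(0, len(to_replace), batch_size):
        for instance_id in to_replace[start:start + batch_size]:
            asg.terminate(instance_id)         # replacements use the new LC
    asg.set_desired(original)                  # shrink back to the original size
    return to_replace

asg = MockASG({"i-1": "lc-old", "i-2": "lc-old", "i-3": "lc-new"}, "lc-new")
print(replace_instances(asg, batch_size=1))    # ['i-1', 'i-2']
print(sorted(asg.instances.values()))          # all on 'lc-new' afterwards
```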

James

James Martin

Sep 7, 2014, 9:59:49 PM9/7/14
to ansible...@googlegroups.com
Daniel,

Yep, Option 2 is designed to work with replace_batch_size.  Currently it's mutually exclusive with "replace_instances", but I think if we decide to go with Option 2, I could make replace_instances work.  I think it might be desirable to replace all instances or just a few, but still have the playbook be nice and clean.

- James

Michael DeHaan

Sep 8, 2014, 1:21:14 PM9/8/14
to James Martin, ansible...@googlegroups.com
On Sun, Sep 7, 2014 at 5:48 PM, James Martin <jma...@ansible.com> wrote:

Michael,

The reason for having both was to spur this very discussion. :) Option 1 is a bit more complicated but more transparent; option 2 is much easier but less transparent. I'm more fond of option 2, and happy to make it the only one. BTW, are we talking about the docs or the actual feature?


I'm not sure option 1 is so much transparent as more manual/explicit? I guess if you mean "less abstracted", yes. I would prefer the one that lets me forget more about how it works :)
 

As far as what the instances are being replaced with: the ASG is going to spin up new instances with the current launch configuration. With option 2, the module starts by building a list of which instances should be replaced. This list is made up of all instances that have not been launched with the current launch configuration. The module then bumps the size of the ASG by replace_batch_size. It then terminates replace_batch_size instances at a time, waits for the ASG to spin up new instances in their place and become healthy, then continues on down the list until there are no more left to replace. Then it sets the ASG size back to its original value.


Ok, so I'm thinking *MAYBE* in the examples, we show a call to ec2_lc to show the launch config change prior to the invocation, so the user can see this in context.

Sidenote to all: our ec2 user guide in the docs is lacking, and I'm open to having it mostly rewritten. Showing a more end-to-end tutorial, maybe one simple ec2 one and another using ec2_lc/asg, would be really awesome IMHO.

James Martin

unread,
Sep 10, 2014, 11:34:01 PM9/10/14
to ansible...@googlegroups.com
Just wanted to note that this code has now been merged into ec2_asg in Ansible 1.8 devel branch, and the docs are updated with examples.

Thanks,

James

Will Thames

Sep 11, 2014, 8:31:23 AM9/11/14
to ansible...@googlegroups.com, jma...@ansible.com
I have some relatively extensive documentation on ec2 - it might be a little too over the top for the user guide.
http://willthames.github.io/2014/03/17/ansible-layered-configuration-for-aws.html

If you want me to incorporate any or all of it into the user guide, I'd be happy to do so.

I haven't done enough with ASGs to contribute much (and it seems like James's docs are pretty good to go anyway).

Will

Michael DeHaan

Sep 11, 2014, 10:31:30 AM9/11/14
to ansible...@googlegroups.com, James Martin
I definitely would like to see the ec2 guide upgraded to teach more ec2 concepts.

It's largely a holdover from the very early days, and needs to show some basics like using add_host together with ec2 (as is shown elsewhere)
but also some more idioms.

I'd quite welcome seeing it mostly rewritten, should you want to take a stab at improving it.



Scott Anderson

Sep 11, 2014, 1:50:33 PM9/11/14
to ansible...@googlegroups.com, jma...@ansible.com
Wow, I wish I'd seen this conversation earlier.

I have a module that does this, using something similar to option 1.

My module respects multi-AZ load balancers and results in a completely transparent deploy, *so long as* the code in the new AMI can run alongside the old code. There's a start of two different methods: one replaces a single instance at a time, and the other fires up all the new instances in the proper VPCs, waits for them to initialize, adds them to the ELB and ASG, then terminates the old ones once they're all stable.

You also have to set up session pinning and draining on the elb for it to function correctly. Otherwise you can end up with someone getting assets from two different code bases.

There's actually a more reliable way to do it that involves using intermediary instances, but we haven't gotten that far yet.

-scott

Scott Anderson

Sep 11, 2014, 2:04:30 PM9/11/14
to ansible...@googlegroups.com
For comparison:

https://github.com/scottanderson42/ansible/blob/ec2_vol/library/cloud/ec2_asg_cycle

Still a work in progress (as you should be able to tell from the logging statements :-), but we’ve been using it in production for several months and it’s (now) battle tested. The “Slow” method is unimplemented but is intended to be your Option 2.

-scott

James Martin

Sep 11, 2014, 2:26:44 PM9/11/14
to ansible...@googlegroups.com
Scott,

Neat to see someone else's approach. The "fast method" you have there probably could be worked into what's been merged. Another approach (maybe simpler) would just be to stand up a parallel ASG with the new AMI.

I like making the AutoScale Group do the instance provisioning, versus your approach of provisioning the instance  and then moving it to an ASG.  From what I can tell, your module doesn't seem to be idempotent -- so if it's run, it's always going to act.  The feature I added only updates instances if they have a launch config that is different from what's currently assigned to the ASG.  So it's safe to run again (or continue a run that failed for some reason), without having to cycle through all the instances again.  
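The idempotence check I'm describing can be sketched as a small pure function (a hypothetical `plan_replacements` helper, not the module's actual code):

```python
def plan_replacements(instances, current_lc):
    """Hypothetical helper: only instances whose launch config differs
    from the ASG's current one are scheduled for replacement, so a
    second run after a successful replace reports no changes."""
    to_replace = [i for i, lc in instances.items() if lc != current_lc]
    return {"changed": bool(to_replace), "instances": to_replace}

print(plan_replacements({"i-1": "lc-old", "i-2": "lc-new"}, "lc-new"))
# {'changed': True, 'instances': ['i-1']}
print(plan_replacements({"i-1": "lc-new", "i-2": "lc-new"}, "lc-new"))
# {'changed': False, 'instances': []}
```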

We will be publishing an article on some different approaches that we've worked through for doing this "immutablish" deploy stuff sometime next week.  


Scott Anderson

Sep 11, 2014, 2:51:07 PM9/11/14
to ansible...@googlegroups.com
On Sep 11, 2014, at 2:26 PM, James Martin <jma...@ansible.com> wrote:

Scott,

Neat to see someone else's approach. The "fast method" you have there probably could be worked into what's been merged. Another approach (maybe simpler) would just be to stand up a parallel ASG with the new AMI.

The general problem with this approach is that it doesn’t work well for blue-green deployments, nor if the new code can’t coexist with the currently running code. We make that decision before deploy time and put the site in maintenance mode if we determine there’s an incompatibility between the two versions.

I think we’re probably going to move to a system that uses a tier of proxies and two ELBs. That way we can update the idle ELB, change out the AMIs, and bring the updated ELB up behind an alternate domain for the blue-green testing. Then when everything checks out, switch the proxies to the updated ELB and take down the remaining, now idle ELB.

Amazon would suggest using Route53 to point to the new ELB, but there’s too great a chance of faulty DNS caching breaking a switch to a new ELB. Plus there’s a 60s TTL to start with regardless, even in the absence of caching.


I like making the AutoScale Group do the instance provisioning, versus your approach of provisioning the instance  and then moving it to an ASG.  From what I can tell, your module doesn't seem to be idempotent -- so if it's run, it's always going to act.  The feature I added only updates instances if they have a launch config that is different from what's currently assigned to the ASG.  So it's safe to run again (or continue a run that failed for some reason), without having to cycle through all the instances again.  

You may have missed the “cycle_all” parameter. If False, only instances that don’t match the new AMI are cycled.

Using the ASG to do the provisioning might be preferable if it’s reliable. At first I went that route, but I was having problems with the ASG’s provisioning being non-deterministic. Manually creating the instances seems to ensure that things happen in a particular order and with predictable speed. As mentioned, the manual method definitely works every time, although I need to add some more timeout and error checking (like what happens if I ask for 3 new instances and only get 2).

I have a separate task that cleans up the old AMIs and LCs, incidentally. I keep the most recent around as a backup for quick rollbacks.


We will be publishing an article on some different approaches that we've worked through for doing this "immutablish" deploy stuff sometime next week.  

I’m looking forward to reading it for sure.

Regards,
-scott

James Martin

Sep 11, 2014, 3:26:51 PM9/11/14
to ansible...@googlegroups.com
On Thu, Sep 11, 2014 at 2:51 PM, Scott Anderson <scottan...@gmail.com> wrote:

The general problem with this approach is that it doesn’t work well for blue-green deployments, nor if the new code can’t coexist with the currently running code. 

Yep, understood.
 
I think we’re probably going to move to a system that uses a tier of proxies and two ELBs. That way we can update the idle ELB, change out the AMIs, and bring the updated ELB up behind an alternate domain for the blue-green testing. Then when everything checks out, switch the proxies to the updated ELB and take down the remaining, now idle ELB.


Not following this exactly -- what's your tier of proxies?  You have a group of proxies (haproxy, nginx) behind a load balancer that point to your application?
 
Amazon would suggest using Route53 to point to the new ELB, but there’s too great a chance of faulty DNS caching breaking a switch to a new ELB. Plus there’s a 60s TTL to start with regardless, even in the absence of caching.

Quite right.  There are some interesting things you can do with tools you could run on the hosts that would redirect traffic from blue hosts to the green LB, socat being one.  After you notice no more traffic coming to blue, you can terminate it. 
 
You may have missed the “cycle_all” parameter. If False, only instances that don’t match the new AMI are cycled.


You're right, I did miss that. By checking the AMI, you're only updating the instance if the AMI changes. If you are checking the launch config, you are updating the instances if any component of the launch config has changed -- AMI, instance type, address type, etc.

 
Using the ASG to do the provisioning might be preferable if it’s reliable. At first I went that route, but I was having problems with the ASG’s provisioning being non-deterministic. Manually creating the instances seems to ensure that things happen in a particular order and with predictable speed. As mentioned, the manual method definitely works every time, although I need to add some more timeout and error checking (like what happens if I ask for 3 new instances and only get 2).


I didn't have any issues with the ASG doing the provisioning, but I would say nothing is predictable with AWS :).  

 
I have a separate task that cleans up the old AMIs and LCs, incidentally. I keep the most recent around as a backup for quick rollbacks.

That's cool, care to share?
 

Scott Anderson

Sep 11, 2014, 3:54:25 PM9/11/14
to ansible...@googlegroups.com
On Sep 11, 2014, at 3:26 PM, James Martin <jma...@ansible.com> wrote:

I think we’re probably going to move to a system that uses a tier of proxies and two ELBs. That way we can update the idle ELB, change out the AMIs, and bring the updated ELB up behind an alternate domain for the blue-green testing. Then when everything checks out, switch the proxies to the updated ELB and take down the remaining, now idle ELB.


Not following this exactly -- what's your tier of proxies?  You have a group of proxies (haproxy, nginx) behind a load balancer that point to your application?

Yes, nginx or some other HA-ish thing. If it’s nginx then you can maintain a brochure site even if something horrible happens to the application.

 
Amazon would suggest using Route53 to point to the new ELB, but there’s too great a chance of faulty DNS caching breaking a switch to a new ELB. Plus there’s a 60s TTL to start with regardless, even in the absence of caching.

Quite right.  There are some interesting things you can do with tools you could run on the hosts that would redirect traffic from blue hosts to the green LB, socat being one.  After you notice no more traffic coming to blue, you can terminate it.

That’s an interesting idea, but it fails if people are behind a caching DNS and they visit after you’ve terminated the blue traffic but before their caching DNS lets go of the record.

You're right, I did miss that. By checking the AMI, you're only updating the instance if the AMI changes. If you are checking the launch config, you are updating the instances if any component of the launch config has changed -- AMI, instance type, address type, etc.

That's true, but if I'm changing instance types I'll generally just cycle_all. Because of the connection draining and parallelism of the instance creation, it's just as quick to do all of them instead of just the ones that need changing. That said, it's an obvious optimization for sure.


Using the ASG to do the provisioning might be preferable if it’s reliable. At first I went that route, but I was having problems with the ASG’s provisioning being non-deterministic. Manually creating the instances seems to ensure that things happen in a particular order and with predictable speed. As mentioned, the manual method definitely works every time, although I need to add some more timeout and error checking (like what happens if I ask for 3 new instances and only get 2).


I didn't have any issues with the ASG doing the provisioning, but I would say nothing is predictable with AWS :).  

Very true. Over the past few months I’ve had several working processes just fail with no warning. The most recent is AWS sometimes refusing to return the current list of AMIs. Prior to that it was the Available status on an AMI not really meaning available. Now I check the list of returned AMIs in a loop until the one I’m looking for shows up, Available status notwithstanding. Very frustrating. Things could be worse, however: the API could be run by Facebook...


I have a separate task that cleans up the old AMIs and LCs, incidentally. I keep the most recent around as a backup for quick rollbacks.

That's cool, care to share?
 

I think I’ve posted it before, but here’s the important bit. After deleting everything but the oldest backup AMI (determined by naming convention or tags), delete any LC that doesn’t have an associated AMI:

def delete_launch_configs(asg_connection, ec2_connection, module):
    changed = False

    launch_configs = asg_connection.get_all_launch_configurations()

    for config in launch_configs:
        image_id = config.image_id
        images = ec2_connection.get_all_images(image_ids=[image_id])

        # No AMI behind this launch config any more: safe to delete it.
        if not images:
            config.delete()
            changed = True

    module.exit_json(changed=changed)


-scott

Ben Whaley

Nov 22, 2014, 5:39:28 PM11/22/14
to ansible...@googlegroups.com
Hi all,

Sorry for resurrecting an old thread, but wanted to mention my experience thus far using ec2_asg & ec2_lc for code deploys.

I'm more or less following the methods described in this helpful repo


I believe the dual_asg role is accepted as the more reliable method for deployments. If a deployment uses two ASGs, it's possible to just delete the new ASG and everything goes back to normal. This is the "Netflix" manner of releasing updates.

The thing I'm finding though is that instances become "viable" well before they're actually InService in the ELB. From the ec2_asg code and by running ansible in verbose mode it's clear that ansible considers an instance viable once AWS indicates that instances are Healthy and InService. Checking via the AWS CLI tool, I can see that the ASG shows instances as Healthy and InService, but the ELB shows OutOfService. 

The AWS docs are clear about the behavior of autoscale instances with health check type ELB: "For each call, if the Elastic Load Balancing action returns any state other than InService, the instance is marked as unhealthy." But this is not actually the case. 

Has anyone else encountered this? Any suggested workarounds or fixes?

Thanks,
Ben

James Martin

Nov 24, 2014, 10:25:58 AM11/24/14
to ansible...@googlegroups.com
Ben,

Thanks for the question.    Considering this: http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/as-add-elb-healthcheck.html,   "Auto Scaling marks an instance unhealthy if the calls to the Amazon EC2 action DescribeInstanceStatus return any state other than running, the system status shows impaired, or the calls to Elastic Load Balancing action DescribeInstanceHealth returns OutOfService in the instance state field."

For determining the instance health status, we are fetching an ASG object in boto and checking the health_status attribute for each instance in the ASG, which is equal to either "healthy" or "unhealthy". Are you using an instance grace period option for the ELB? http://docs.aws.amazon.com/AutoScaling/latest/APIReference/API_CreateAutoScalingGroup.html, see HealthCheckGracePeriod. This option is configurable with the health_check_period setting found in the ec2_asg module. By default it is 500, and that would prematurely report a healthy instance, since it marks any instance as healthy for the first 500 seconds.

- James

Ben Whaley

Nov 24, 2014, 12:44:21 PM11/24/14
to ansible...@googlegroups.com
Hi James,

Thanks for your reply.

Interesting point about the HealthCheckGracePeriod option. I wasn't aware of its role here. I am indeed using it, in fact according to the docs it is a required option for ELB health checks. I had it set to 180, and I just tried it with lower values of 10 and 1 second. In both cases the behavior is the same: the autoscale group considers the instances healthy (because of the grace period, even at the lower value) and as a result ansible moves on before the instances are InService in the ELB. Even with the HealthCheckGracePeriod at the lowest possible value of 1 second, a race exists between the module's health check and the ELB grace period.

I've worked around this for now with a script that does the following:
- Find the instances in the ASG
- Check the ELB to determine if they are healthy or not
- Exit 1 if not, 0 if yes
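The core check in those steps boils down to something like this (a sketch with a hypothetical `elb_all_in_service` helper; the ELB states are passed in as plain data here rather than fetched from AWS):

```python
def elb_all_in_service(asg_instance_ids, elb_health):
    """Hypothetical check: every instance the ASG reports must also be
    InService according to the ELB. elb_health maps instance id -> ELB
    state string ("InService", "OutOfService", ...)."""
    return all(elb_health.get(i) == "InService" for i in asg_instance_ids)

# Exit code 0 when everything is InService, 1 otherwise.
health = {"i-1": "InService", "i-2": "OutOfService"}
print(0 if elb_all_in_service(["i-1", "i-2"], health) else 1)  # 1
```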

Then I use an ansible task with an "until" loop to check the return code. The script is here:


Happy to work this in to an ansible module if you think this is useful. Or did I misunderstand the point about the health check grace period?

Thanks,
Ben

James Martin

Nov 24, 2014, 12:49:45 PM11/24/14
to ansible...@googlegroups.com
Hmm.. I wonder if we need to have the ec2_asg module wait for the HealthCheckGracePeriod to expire before checking the instance health (assuming it is an ELB health check type). Even with a health check grace period of one, there is a chance that the instance will be marked healthy in that window. Can you please open a bug on github.com/ansible/ansible-modules-core to track this?
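One way to sketch that wait (purely illustrative, not the module's code): hold off on trusting an ELB-type health check until the instance's launch time plus the grace period has passed.

```python
import time

def grace_period_remaining(instance_launch_time, grace_period, now=None):
    """Hypothetical helper: seconds to hold off before trusting an
    ELB-type health check, so the grace period can't mask an instance
    that hasn't really passed its ELB check yet. Times in epoch seconds."""
    now = time.time() if now is None else now
    return max(0.0, float(instance_launch_time + grace_period - now))

# Instance launched 100s ago with a 500s grace period: wait 400s more.
print(grace_period_remaining(instance_launch_time=1000, grace_period=500, now=1100))
# 400.0
```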

Thanks,

James

Ben Whaley

Nov 24, 2014, 12:49:48 PM11/24/14
to ansible...@googlegroups.com
Oops, please use this link for the code instead.
