"Phasing in" an Autoscaling Group?


James Carr

unread,
Sep 2, 2015, 12:20:36 AM9/2/15
to Terraform
Hi All,

Long time lurker, first time poster. 

One thing I would like to do is phase in a new ASG. For example, when I update the AMI for an autoscaling group, I want to:

* Spin up a new ASG with an updated launch config that has the new AMI
* Attach it to the existing ELB, leaving the old ASG alone and attached
* Once the minimum number of healthy hosts from the new ASG are attached, scale down and destroy the old ASG

Is there an easy way to accomplish this via Terraform? So far the behavior I've observed is that it just destroys the previous ASG and LC and creates new ones. That seems rather dangerous in a production environment, so I am sure I am missing something here. 


Thanks!
James

Paul Hinze

unread,
Sep 2, 2015, 10:34:11 AM9/2/15
to terrafo...@googlegroups.com
Hi James,

Those steps are pretty much exactly how we currently do production rollouts at HashiCorp. :)

Here's how we structure things:

resource "aws_launch_configuration" "someapp" {
  lifecycle { create_before_destroy = true }

  image_id       = "${var.ami}"
  instance_type  = "${var.instance_type}"
  key_name       = "${var.key_name}"
  security_groups = ["${var.security_group}"]
}

resource "aws_autoscaling_group" "someapp" {
  lifecycle { create_before_destroy = true }

  name                 = "someapp - ${aws_launch_configuration.someapp.name}"
  launch_configuration = "${aws_launch_configuration.someapp.name}"
  desired_capacity     = "${var.nodes}"
  min_size             = "${var.nodes}"
  max_size             = "${var.nodes}"
  min_elb_capacity     = "${var.nodes}"
  availability_zones   = ["${split(",", var.azs)}"]
  vpc_zone_identifier  = ["${split(",", var.subnet_ids)}"]
  load_balancers       = ["${aws_elb.someapp.id}"]
}


The important bits are:

 * Both LC and ASG have create_before_destroy set
 * The LC omits the "name" attribute to allow Terraform to auto-generate a random one, which prevents collisions
 * The ASG interpolates the launch configuration name into its name, so LC changes always force replacement of the ASG (and not just an ASG update).
 * The ASG sets "min_elb_capacity" which means Terraform will wait for instances in the new ASG to show up as InService in the ELB before considering the ASG successfully created. 

The behavior when "var.ami" changes is:

 (1) New "someapp" LC is created with the fresh AMI
 (2) New "someapp" ASG is created with the fresh LC
 (3) Terraform waits for the new ASG's instances to spin up and attach to the "someapp" ELB
 (4) Once all new instances are InService, Terraform begins destroy of old ASG
 (5) Once old ASG is destroyed, Terraform destroys old LC

If Terraform hits its 10m timeout during (3), the new ASG will be marked as "tainted" and the apply will halt, leaving the old ASG in service.

Hope this helps! Happy to answer any further questions you might have,

Paul


--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/terraform/issues
IRC: #terraform-tool on Freenode
---
You received this message because you are subscribed to the Google Groups "Terraform" group.
To unsubscribe from this group and stop receiving emails from it, send an email to terraform-too...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/terraform-tool/6607b727-0240-4619-a694-4fb7470b57bc%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

James Carr

unread,
Sep 4, 2015, 1:46:09 AM9/4/15
to Terraform
Hi Paul,

Works like a charm... thanks!

Martin Atkins

unread,
Sep 4, 2015, 11:13:49 AM9/4/15
to Terraform
Thanks for sharing this, Paul. We were puzzling over something very much like this at work the other day, so it's nice to see a working example!

It'd be nice to have some of these tricks included in the asg example in the repo for posterity, along with your very useful summary of why it works. :)

Bruno Bonacci

unread,
Sep 5, 2015, 1:08:47 PM9/5/15
to Terraform
Hi Paul,

can you please clarify what the behaviour would be in case the autoscaling group is not attached to a load balancer?
Would the old ASG and related instances be destroyed before the new ones are fully in service?

Regards
Bruno

Paul Hinze

unread,
Sep 8, 2015, 11:20:01 AM9/8/15
to terrafo...@googlegroups.com
> can you please clarify what the behaviour would be in case the autoscaling group is not attached to a load balancer?
> would the old ASG and related instances be destroyed before the new ones are fully in service?

Very good question, Bruno. Your observation is 100% correct.

Because AWS responds successfully for the ASG create well before the instances are actually ready to perform service, it's true that without an ELB the booting instances will end up racing the destroy, and almost certainly lose, resulting in a service outage during replacement.

The way we've worked around this today for our non-ELB services is that time-honored tradition of "adding a sleep". :)

resource "aws_autoscaling_group" "foo" {
  lifecycle { create_before_destroy = true }

  # ...

  # on replacement, gives new service time to spin up before moving on to destroy
  provisioner "local-exec" {
    command = "sleep 200"
  }
}

This obviously does not perform well if the replacement service fails to come up. A better solution would be to use a provisioner that actually checks service health. Something like:


resource "aws_autoscaling_group" "foo" {
  lifecycle { create_before_destroy = true }

  # ...

  # on replacement, poll until new ASG's instances shows up healthy
  provisioner "remote-exec" {
    connection {
      # some node in your env with scripts+access to check services
      host = "${var.health_checking_host}"
      user = "${var.health_checking_user}"
    }
    # script that performs an app-specific check on the new ASG, exits non-0 after timeout
    inline = "poll-until-in-service --service foo --asg ${self.id}"
  }
}
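A script like the hypothetical `poll-until-in-service` above ultimately reduces to a poll loop with a deadline. As a rough sketch (the function name and defaults are mine, not from the thread):

```python
import time

def poll_until(check, timeout=600, interval=15):
    """Call check() until it returns True or the timeout (seconds) expires.

    Returns True on success and False when the deadline passes; a wrapper
    script can map False to a non-zero exit status so that the remote-exec
    provisioner fails the apply when the service never becomes healthy.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```

The actual `check()` would be the app-specific part: querying the new ASG's instances and probing whatever "healthy" means for that service.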

I'm also kicking ideas around in the back of my head about how TF can better support this "resource verification" / "poll until healthy" use case first class. Any ideas on design for features like that are welcome!

Matt Black

unread,
Sep 8, 2015, 11:33:05 AM9/8/15
to terrafo...@googlegroups.com
My couple of cents on the whole "resource verification" thing..

I think you really need some custom code, somewhere, somehow, running in a loop, to determine that a resource is ready for action, or is ready for rollover and destruction. For example, I have previously used a custom script to poll RabbitMQ until its queues are empty, before rolling the server that is processing the queue.

Paul Hinze

unread,
Sep 8, 2015, 11:45:24 AM9/8/15
to terrafo...@googlegroups.com
I think you really need some custom code, somewhere, somehow, running in a loop, to determine that a resource is ready for action, or is ready for rollover and destruction.

Yep totally. So it's just a matter of how we model a Terraform feature such that it allows a flexible hook for said custom code.

Bruno Bonacci

unread,
Sep 8, 2015, 12:08:22 PM9/8/15
to Terraform
Hi Paul,

I think there are several dimensions to this problem.

There are different issues that can be addressed independently:

  •  Updating images and instance configuration in an ASG + LC setup
As discussed in https://github.com/hashicorp/terraform/issues/2183, there are constraints coming from AWS which we need to work around.
The solution you suggested seems to solve a specific case rather than being a general solution.
I think that for this specific problem it would be easier if Terraform, rather than delete + create new, used for the LC the strategy I've suggested.
The strategy consists of copying the launch configuration into a new/temporary LC, updating the ASG to use the cloned LC, then deleting the old LC;
at that point you can create a new LC with the new settings and finally update the ASG to use the new LC.

This would allow updating the LC without forcing an IMMEDIATE destruction and recreation of instances, which is very important for stateful systems such as databases,
and it would allow implementing a rolling restart with far fewer constraints.

  • Determine when an instance is READY and not just booted.

As noted by Matt, this can be application specific. For example, consider a database where, upon startup of a new instance, there is a synchronization phase in which
the data from the existing nodes must be replicated to the new node before the new DB node can be considered READY. This operation can take from a few minutes to hours
depending on the data size. An ASG update or rolling restart feature that does not consider this would have disastrous consequences (i.e. new nodes up without full data, old nodes killed).

  • Updating an ASG really means a rolling update
When updating an ASG, what the user really wants is a way to automate the update phase of the cluster (a rolling update).
Creating a new cluster all at once and killing the old one is not a suitable solution in all cases (for example for a DB, as explained above).
A rolling update must consider the time it takes an instance to join the set and replicate data, account for possible failures, and allow rolling back new instances.
Some systems, such as Zookeeper and Kafka, have unique IDs which must be preserved, so before starting a new instance with id=1 the old instance must be stopped first,
which again is different from how you would normally approach, say, Cassandra or MongoDB.
 

So as you can see, the update scenarios can be quite complex; I wouldn't use lifecycle { create_before_destroy = true }
as a general solution to build upon.

If you have ideas on how to handle these scenarios with Terraform as it is today, I would be glad to see some examples.

Thanks
Bruno

Paul Hinze

unread,
Sep 8, 2015, 12:44:24 PM9/8/15
to terrafo...@googlegroups.com
Hi Bruno,

I think that for this specific problem it would be easier if Terraform, rather than delete + create new, used for the LC the strategy I've suggested.
The strategy consists of copying the launch configuration into a new/temporary LC, updating the ASG to use the cloned LC, then deleting the old LC;
at that point you can create a new LC with the new settings and finally update the ASG to use the new LC.

You can do this today with Terraform, you'll just need to manage the rolling of your instances via a separate process.

resource "aws_launch_configuration" "foo" {
  lifecycle { create_before_destroy = true }
  # omit name so it's generated as a unique value
  image_id = "${var.ami}"
  # ...
}

resource "aws_autoscaling_group" "foo" {
  launch_configuration = "${aws_launch_configuration.foo.id}"
  
  name = "myapp"
  # ^^ do not interpolate AMI or launch config name in the name.
  #    this avoids forced ASG replacement on LC change
}

Given the above config, Terraform's behavior when `var.ami` is changed from `ami-abc123` to `ami-def456` is as follows:

 * create LC with `ami-def456`
 * update existing ASG with new LC name
 * delete LC with `ami-abc123`

At this point, any new instance launched into the ASG will use `ami-def456`. So your deployment process can choose what behavior you want. Options include:

 * scale up to 2x capacity then scale back down, which will terminate oldest instances
 * terminate existing instances one by one, allowing them to be replaced with new ones

(Note create_before_destroy on the ASG is optional here - it depends on the behavior you'd like to see if/when you do need to replace the ASG for some reason.)
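Purely as an illustration of the second option, the roll-one-by-one loop an external deploy process might run looks something like this (the `terminate` and `count_healthy` callables are hypothetical stand-ins for real AWS client calls):

```python
import time

def roll_instances(old_ids, terminate, count_healthy, desired,
                   interval=30, timeout=600):
    """Terminate old instances one at a time, waiting for the ASG to
    replace each one before moving on.

    terminate(instance_id) and count_healthy() are thin wrappers around
    whatever AWS client you use; count_healthy() should return the number
    of healthy in-service instances in the ASG.
    """
    for iid in old_ids:
        terminate(iid)
        deadline = time.monotonic() + timeout
        while count_healthy() < desired:
            if time.monotonic() > deadline:
                raise TimeoutError("replacement for %s never became healthy" % iid)
            time.sleep(interval)
```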

As noted by Matt, this can be application specific; for example, consider a database where upon startup of a new instance there is a synchronization phase 

Yep totally agreed. Terraform will need to provide hooks for delegating out to site-specific code for determining resource health. There's no one-size-fits-all solution here. Today this can be achieved to a certain extent by calling out via local-exec and remote-exec provisioners, but a more first-class primitive for expressing this behavior would be nice to add. It's just a matter of how we model it.

When updating an ASG, what the user really wants is a way to automate the update phase of the cluster (a rolling update).

Terraform does not have a mechanism for managing a rolling update today. Unfortunately, though AWS provides rolling update behavior in CloudFormation, it's done via cfn-specific, internally implemented behavior [1], and there are no externally available APIs to trigger it on a vanilla ASG.

As I described in the example above - Terraform can make the resource adjustments necessary to support rolling update scenarios, but the actual roll will need to be managed by an external system.



ja...@fpcomplete.com

unread,
Sep 9, 2015, 7:51:25 AM9/9/15
to Terraform


On Tuesday, September 8, 2015 at 11:45:24 AM UTC-4, Paul Hinze wrote:
I think you really need some custom code, somewhere, somehow, running in a loop, to determine that a resource is ready for action, or is ready for rollover and destruction.

Yep totally. So it's just a matter of how we model a Terraform feature such that it allows a flexible hook for said custom code.

I have been thinking about similar topics recently. I almost wrote a basic utility that would poll/wait for a specific port to become available / service up. Maybe Terraform could offer similar functionality (the SSH polling already does this), but for an ASG it would be: Terraform gets the list of the new IPs in the ASG, then polls SSH or similar ports until the services are up. As a basic first step, we could focus on polling SSH and HTTP until the services are available on the specified ports.
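That port-polling idea is small enough to sketch directly; assuming a plain TCP connect is an acceptable readiness signal, something like:

```python
import socket
import time

def wait_for_port(host, port, timeout=300, interval=5):
    """Poll until a TCP connection to host:port succeeds or the
    timeout (seconds) expires. Returns True on success, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError if the port is closed/unreachable
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

For SSH you would poll port 22; for HTTP you would additionally want to check the response status, not just connectivity.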

Thoughts?

Bruno Bonacci

unread,
Sep 10, 2015, 5:23:06 PM9/10/15
to Terraform


On Tuesday, September 8, 2015 at 5:44:24 PM UTC+1, Paul Hinze wrote:
Hi Bruno,

I think that for this specific problem it would be easier if Terraform, rather than delete + create new, used for the LC the strategy I've suggested.
The strategy consists of copying the launch configuration into a new/temporary LC, updating the ASG to use the cloned LC, then deleting the old LC;
at that point you can create a new LC with the new settings and finally update the ASG to use the new LC.

You can do this today with Terraform, you'll just need to manage the rolling of your instances via a separate process.

resource "aws_launch_configuration" "foo" {
  lifecycle { create_before_destroy = true }
  # omit name so it's generated as a unique value
  image_id = "${var.ami}"
  # ...
}

resource "aws_autoscaling_group" "foo" {
  launch_configuration = "${aws_launch_configuration.foo.id}"
  
  name = "myapp"
  # ^^ do not interpolate AMI or launch config name in the name.
  #    this avoids forced ASG replacement on LC change
}

Given the above config, Terraform's behavior when `var.ami` is changed from `ami-abc123` to `ami-def456` is as follows:

 * create LC with `ami-def456`
 * update existing ASG with new LC name
 * delete LC with `ami-abc123`


Hi Paul,

I've updated the stack as per your previous description and now I get a Cycle error on destroy (https://github.com/hashicorp/terraform/issues/2359#issuecomment-139382605)
any suggestions?

Bruno



Paul Hinze

unread,
Sep 10, 2015, 5:25:34 PM9/10/15
to terrafo...@googlegroups.com
Responded on the issue. :)


James Carr

unread,
Oct 5, 2015, 7:41:56 AM10/5/15
to Terraform
Hey Paul,

Thanks for all the guidance, I really appreciate it. I'm actually wondering whether there's a good way to scale in a new ASG while simultaneously scaling down the previous ASG. For example, in our situation I have an ASG with 60 servers in it that are queue workers. If we just bring in another 60, it'll blow out connection limits in places (we're working on it), so it'd be nicer if it brought instances in more gradually: launch the new ASG, bringing in X instances at a time while scaling down X in the old one. 

Would something like this even be doable? The best I can see is the previous option mentioned: update the launch_config and then terminate instances to bring up new ones. 

ja...@fpcomplete.com

unread,
Oct 6, 2015, 8:06:19 AM10/6/15
to Terraform


On Monday, October 5, 2015 at 7:41:56 AM UTC-4, James Carr wrote:
Hey Paul,

Thanks for all the guidance, I really appreciate it. I'm actually wondering whether there's a good way to scale in a new ASG while simultaneously scaling down the previous ASG. For example, in our situation I have an ASG with 60 servers in it that are queue workers. If we just bring in another 60, it'll blow out connection limits in places (we're working on it), so it'd be nicer if it brought instances in more gradually: launch the new ASG, bringing in X instances at a time while scaling down X in the old one. 

Raising the limit is not a terrible idea. You could also split up the big ASG of 60 into smaller groups of 30 or whatever, and replace one group at a time. Going one route you have extra capacity for a little while; going the other you're short. That might be a way to make the decision.

ti...@ibexlabs.com

unread,
Nov 18, 2015, 11:46:34 PM11/18/15
to Terraform
Paul,
Everything seems to be working as expected:
1 - Updating the AMI creates a new launch config
2 - The autoscaling group is updated to the new launch config name

However, new machines are not launching automatically using the newly created LC/ASG. What could be the issue here?

Paul Hinze

unread,
Dec 3, 2015, 12:06:25 PM12/3/15
to terrafo...@googlegroups.com
I'm actually wondering whether there's a good way to scale in a new ASG while simultaneously scaling down the previous ASG. 

This is an interesting and important question. If you're looking for fine grained control over a deploy like that today, I'd recommend a blue/green style deployment with two LC/ASGs that you can scale independently.

Assuming your example of 60 nodes, in a Blue/Green model you'd have steps like:

 * Begin state: Blue in service at full 60 nodes, Green cold
 * Replace Green with new LC/ASG holding fresh AMI, scale to a small number of instances
 * Scale down blue and up green in as many batches as you like.
 * End state: Blue cold, Green at 60 nodes

This can be driven by Terraform, but involves several runs to orchestrate the batches. Getting Terraform itself to manage a rolling deploy is being discussed in https://github.com/hashicorp/terraform/issues/1552 
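Purely as an illustration (the helper and batch size are mine, not from the thread), the per-batch capacity targets for the 60-node example can be computed like this, where each pair would then be applied in its own Terraform run:

```python
def batch_schedule(total, batch):
    """Return (blue_capacity, green_capacity) pairs for shifting `total`
    nodes from the blue ASG to the green ASG, `batch` nodes at a time."""
    steps = []
    moved = 0
    while moved < total:
        moved = min(moved + batch, total)  # never overshoot the total
        steps.append((total - moved, moved))
    return steps

# e.g. batch_schedule(60, 15) -> [(45, 15), (30, 30), (15, 45), (0, 60)]
```

In practice you would scale green up before scaling blue down within each batch, so capacity never dips below the target.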

However new machines are not launching automatically using the new LC/ASG created . What could be the issue here ?

This is the behavior of AWS when updating the LC. Existing instances are not touched, and new instances use the new LC.

What we do to force new instances to be created is interpolate the LC name into the ASG name - this forces the ASG to be recreated anytime the LC is replaced, which guarantees the nodes are rolled.

Paul

Yevgeniy Brikman

unread,
Feb 19, 2016, 1:45:31 PM2/19/16
to Terraform
Paul,

A question about your rolling deployment strategy: how does it work with autoscaling policies? The aws_autoscaling_policy resource docs (https://www.terraform.io/docs/providers/aws/r/autoscaling_policy.html) recommend omitting the desired_capacity attribute from the aws_autoscaling_group. Suppose my ASG has a min size of 2, a max size of 10, and the autoscaling policy has scaled it up to 8 instances based on traffic. If I roll out a new version, would the group end up back at size 2 (the min size) if I don't specify desired_capacity?

Thanks!
Jim

Paul Hinze

unread,
Feb 29, 2016, 11:28:21 AM2/29/16
to terrafo...@googlegroups.com
Jim,

In my example, Terraform is managing the scaling "manually" from the ASG's perspective, which is why desired_capacity is being used.

If you wanted to combine this sort of a strategy with scaling policies, you'd need to play around with temporarily setting and removing desired_capacity to "pre-warm" your clusters as you switch over, then either removing it or adding ignore_changes to let the scaling policy take over.

Paul

Yevgeniy Brikman

unread,
Feb 29, 2016, 11:57:14 AM2/29/16
to Terraform
Thanks Paul. It sounds a bit complicated to use this strategy with dynamically sized ASGs. And if we have to write custom scripts to make it work anyway, I wonder if it wouldn't be better to:
  1. Update the AMI version in the launch configuration.
  2. Deploy the new version with Terraform, which won't have any immediate effect on the ASG.
  3. Write a script to force the ASG to deploy instances with the new launch configuration. Anyone have experience with aws-ha-release?
  4. Perhaps run the script using a local provisioner on the launch configuration?
Jim

Paul Hinze

unread,
Feb 29, 2016, 12:13:14 PM2/29/16
to terrafo...@googlegroups.com
Yep that's definitely a valid strategy - use Terraform to _update_ the ASG and leave the instance rolling to an out-of-band process. Plenty of ways to slice the instance roll then.

-p

Yevgeniy Brikman

unread,
Mar 1, 2016, 2:31:23 PM3/1/16
to Terraform
Instead of using a hacky script to do a rolling deployment, I decided to try to leverage the rolling UpdatePolicy built into CloudFormation. It seems to be working pretty well. Details are here: https://github.com/hashicorp/terraform/issues/1552#issuecomment-190864512

Jim

jak...@thoughtworks.com

unread,
Jan 8, 2017, 4:58:33 AM1/8/17
to Terraform
Hi Paul, 

Thanks for the answer; it's clear and works well. I tried to separate my LC and ASG into different modules and use a variable to pass the LC name (still generated by Terraform) into the ASG module, but then it stops working and starts giving a cycle error. With the ASG and LC in the same module, everything works fine. Any idea why this happens? Why can't the LC and ASG be in different modules?

Lowe Schmidt

unread,
Jan 9, 2017, 4:29:45 AM1/9/17
to terrafo...@googlegroups.com

On 8 January 2017 at 10:58, <jak...@thoughtworks.com> wrote:
Thanks for the answer; it's clear and works well. I tried to separate my LC and ASG into different modules and use a variable to pass the LC name (still generated by Terraform) into the ASG module, but then it stops working and starts giving a cycle error. With the ASG and LC in the same module, everything works fine. Any idea why this happens? Why can't the LC and ASG be in different modules?

It sounds like your modules are depending on each other, have you tried to graph them and see if it loops? 
--
Lowe Schmidt | +46 723 867 157

jak...@thoughtworks.com

unread,
Jan 9, 2017, 5:32:33 AM1/9/17
to Terraform
Yes, the asg module depends on the lc module, which is expected. I tried
terraform graph -draw-cycles

and got red lines on a very complicated graph; I still couldn't figure out where the cycle is by reading it. Here is my top-level main.tf:
provider "aws" {
  region = "ap-southeast-2"
}

module "my_elb" {
  source          = "../modules/elb"
  subnets         = ["subnet-481d083f", "subnet-303cd454"]
  security_groups = ["sg-e8ac308c"]
}

module "my_lc" {
  source          = "../modules/lc"
  subnets         = ["subnet-481d083f", "subnet-303cd454"]
  security_groups = ["sg-e8ac308c"]
  snapshot_id     = "snap-00d5e8ef70d1b3e24"
}

module "my_asg" {
  source      = "../modules/asg"
  subnets     = ["subnet-481d083f", "subnet-303cd454"]
  my_asg_name = "my_asg_${module.my_lc.my_lc_name}"
  my_lc_id    = "${module.my_lc.my_lc_id}"
  my_elb_name = "${module.my_elb.my_elb_name}"
}

And here is the main.tf of lc module:
data "template_file" "userdata" {
  template = "${file("${path.module}/userdata.sh")}"

  vars {
    notify_email = "m...@email.co"
  }
}

resource "aws_launch_configuration" "my_lc" {
  lifecycle {
    create_before_destroy = true
  }
  image_id = "ami-28cff44b"
  instance_type = "t2.micro"
  security_groups = ["${var.security_groups}"]
  user_data = "${data.template_file.userdata.rendered}"
  associate_public_ip_address = false
  key_name = "sydney"

  root_block_device {
    volume_size = 20
  }

  ebs_block_device {
    device_name = "/dev/sdi"
    volume_size = 10
    snapshot_id = "${var.snapshot_id}"
  }
}



and main.tf of asg module:
resource "aws_autoscaling_group" "my_asg" {
  name = "${var.my_asg_name}"
  lifecycle {
    create_before_destroy = true
  }
  max_size = 1
  min_size = 1
  vpc_zone_identifier = ["${var.subnets}"]
  wait_for_elb_capacity = true
  wait_for_capacity_timeout = "6m"
  min_elb_capacity = 1
  launch_configuration = "${var.my_lc_id}"
  load_balancers = ["${var.my_elb_name}"]
  tag {
    key = "Role"
    value = "API"
    propagate_at_launch = true
  }
}

resource "aws_autoscaling_policy" "scale_up" {
  name = "scale_up"
  lifecycle { create_before_destroy = true }
  scaling_adjustment = 1
  adjustment_type = "ChangeInCapacity"
  cooldown = 300
  autoscaling_group_name = "${aws_autoscaling_group.my_asg.name}"
}

resource "aws_cloudwatch_metric_alarm" "scale_up_alarm" {
  alarm_name = "high_cpu"
  lifecycle  { create_before_destroy = true }
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods = "2"
  metric_name = "CPUUtilization"
  namespace = "AWS/EC2"
  period = "120"
  statistic = "Average"
  threshold = "80"
  insufficient_data_actions = []
  alarm_description = "EC2 CPU Utilization"
  alarm_actions = ["${aws_autoscaling_policy.scale_up.arn}"]
  dimensions {
    AutoScalingGroupName = "${aws_autoscaling_group.my_asg.name}"
  }
}

resource "aws_autoscaling_policy" "scale_down" {
  name = "scale_down"
  lifecycle { create_before_destroy = true }
  scaling_adjustment = -1
  adjustment_type = "ChangeInCapacity"
  cooldown = 600
  autoscaling_group_name = "${aws_autoscaling_group.my_asg.name}"
}

resource "aws_cloudwatch_metric_alarm" "scale_down_alarm" {
  alarm_name = "low_cpu"
  lifecycle  { create_before_destroy = true }
  comparison_operator = "LessThanThreshold"
  evaluation_periods = "5"
  metric_name = "CPUUtilization"
  namespace = "AWS/EC2"
  period = "120"
  statistic = "Average"
  threshold = "300"
  insufficient_data_actions = []
  alarm_description = "EC2 CPU Utilization"
  alarm_actions = ["${aws_autoscaling_policy.scale_down.arn}"]
  dimensions {
    AutoScalingGroupName = "${aws_autoscaling_group.my_asg.name}"
  }
}



Thanks very much, and very sorry for the long email, I have created the repo here: https://github.com/JakimLi/terraform-error/,  I have googled this for days, so any suggestions will be greatly appreciated.

Si Hobbs

unread,
Jun 11, 2017, 8:55:37 AM6/11/17
to Terraform
On Wednesday, 9 September 2015 01:20:01 UTC+10, Paul Hinze wrote:
> can you please clarify what the behaviour would be in case the autoscaling group is not attached to a load balancer?
> would the old ASG and related instances be destroyed before the new ones are fully in service?
... 
I'm also kicking ideas around in the back of my head about how TF can better support this "resource verification" / "poll until healthy" use case first class. Any ideas on design for features like that are welcome!

Great thread!
Did anything progress around a "poll until healthy" type feature?

Cheers
Si

David Managadze

unread,
Jul 3, 2017, 6:14:49 PM7/3/17
to Terraform
@Paul Hinze

Hi Paul,
Thanks for your comprehensive answers and examples!
 
If Terraform hits its 10m timeout during (3), the new ASG will be marked as "tainted" and the apply will halt, leaving the old ASG in service.

I am using exactly the same setup, except that I use name_prefix for naming the LC, but I hit this timeout problem every time I try to (re-)create the ASG. I am not using any user data and the AMI is a standard Ubuntu image. What could be the reason, and how could it be alleviated?

Here is a snippet:

resource "aws_autoscaling_group" "web" {
  vpc_zone_identifier = ["${var.web_subnet_id}"]
  min_size = "${var.autoscaling_min_size}"
  max_size = "${var.autoscaling_max_size}"
  wait_for_elb_capacity = false
  force_delete = true
  launch_configuration = "${aws_launch_configuration.web.id}"
  load_balancers = ["${var.web_elb_name}"]
  lifecycle { create_before_destroy = true }
  initial_lifecycle_hook {
    name                    = "ec2-web-up"
    default_result          = "CONTINUE"
    heartbeat_timeout       = 3600
    lifecycle_transition    = "autoscaling:EC2_INSTANCE_LAUNCHING"
    notification_metadata = <<EOF
{
  "myapp_autoscaling_group_by_terraform": "web"
}
EOF
  }
  tag {
    key = "Name"
    value = "myapp-${var.env}-web-server-asg"
    propagate_at_launch = "true"
  }
}

resource "aws_launch_configuration" "web" {
  lifecycle { create_before_destroy = true }
  name_prefix = "myapp-${var.env}-web-lc-"
  image_id = "ami-5e63d13e"
  instance_type = "t2.micro"
  security_groups = ["${aws_security_group.web.id}"]
  key_name = "${var.aws_keypair_name}"
  associate_public_ip_address = false
}



 

Andreas Galazis

unread,
Oct 21, 2017, 3:38:52 PM10/21/17
to Terraform
Does that also mean that aws_elb.someapp has lifecycle { create_before_destroy = true } as well? The docs clearly state: "Resources that utilize the create_before_destroy key can only depend on other resources that also include create_before_destroy. Referencing a resource that does not include create_before_destroy will result in a dependency graph cycle."
That isn't covered in your description, so I am quite puzzled.

On Wednesday, 2 September 2015 17:34:11 UTC+3, Paul Hinze wrote:
Hi James,

Those steps are pretty much exactly how we currently do production rollouts at HashiCorp. :)

Here's how we structure things:

resource "aws_launch_configuration" "someapp" {
  lifecycle { create_before_destroy = true }

If Terraform hits its 10m timeout during (3), the new ASG will be marked as "tainted" and the apply will halt, leaving the old ASG in service.

Hope this helps! Happy to answer any further questions you might have,

Paul

Asif Zayed

unread,
Jun 19, 2018, 12:50:17 AM6/19/18
to Terraform
Hi Paul Hinze,

I am using the process described here for new deployments on an Auto Scaling group with launch configurations.
This works like a charm.

Now we are planning to use an AWS launch template instead of launch configurations.

I can't tell whether this deployment strategy would still be feasible with launch templates.

Please let me know if it is still valid with launch templates.


Waiting for your reply.

Thanks,
Asif

Thomas Bernard

unread,
Aug 23, 2018, 1:17:24 PM8/23/18
to Terraform
Hi,

What is the recommended way to do a rolling update if you're using an Amazon Application loadbalancer? I tried a couple things, but it all resulted in "create new autoscaling group, don't wait for instances to be healthy, start destroying old autoscaling group", as a result every time I deploy I have a small outage.

My only (hacky) workaround that avoids an outage is to add a sleep in my autoscaling group config:
 provisioner "local-exec" {
    command = "sleep 60"
  }
But ideally I'd like to wait for healthy instances and abort deployment if something went wrong.

Ideas?
Thanks!

Matt Button

unread,
Aug 24, 2018, 6:31:47 AM8/24/18
to terrafo...@googlegroups.com
Hi Thomas,

We use the approach suggested by Paul in this thread and can confirm that it will wait for the instances to show up as healthy in the ALB before removing the old autoscaling group. You can also define lifecycle hooks using `aws_autoscaling_group`'s `initial_lifecycle_hook` option to signal to AWS when your machine has finished provisioning. AWS will not mark the new instances as "InService" until the hook has been marked as completed.

Caveats with this approach:

- Using the launch configuration name in your ASG name means small changes to the launch config (e.g. formatting of user data) will create a new ASG. This isn't necessarily a bad thing, but it's something to bear in mind if you run stateful services like Consul.
- I seem to remember that if you attach the instances to a target group and configure `min_elb_capacity`, but the target group is not attached to an ALB, the instances never receive health checks, so Terraform will wait indefinitely for them to be marked healthy, which will never happen. We ran into this a few months ago, so I may be misremembering details, and things might have changed since then.
- If the new ASG does not pass its health checks/lifecycle hooks, the terraform apply will fail and mark the new ASG as deposed. On the next run Terraform will attempt to recreate the deposed ASG and then delete it. This means it tries to create an ASG with the exact same name as the deposed one, which fails because ASG names must be unique. We work around this by tainting the launch configuration for the ASG. I wrote a small script to make it easier to taint many things at once; it automatically handles resources nested within modules.
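For reference, the taint step looks roughly like this with the Terraform 0.11-era CLI (the resource and module names are placeholders for your own):

```shell
# Taint the launch configuration so the next apply replaces it; because the
# ASG interpolates the LC name, this forces a fresh ASG too.
terraform taint aws_launch_configuration.someapp

# For a resource nested inside a module, pass -module:
terraform taint -module=someapp aws_launch_configuration.someapp
```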

An alternative approach is to grow/shrink the ASG when you want to roll instances: AWS launches new instances when the group grows and terminates the oldest ones when it shrinks. Someone in the community recently made a Lambda function that automates this.
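That grow/shrink approach can also be sketched in Terraform itself (hypothetical: `var.nodes` and the resource name are illustrative). You raise `var.nodes`, apply, wait for the new instances to become healthy, then lower it again:

```hcl
resource "aws_autoscaling_group" "someapp" {
  # ... other arguments ...

  desired_capacity     = "${var.nodes}"
  min_size             = "${var.nodes}"
  max_size             = "${var.nodes * 2}"

  # When the group shrinks back, terminate the pre-roll instances first.
  termination_policies = ["OldestInstance"]
}
```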

Matt

--
This mailing list is governed under the HashiCorp Community Guidelines - https://www.hashicorp.com/community-guidelines.html. Behavior in violation of those guidelines may result in your removal from this mailing list.
 
GitHub Issues: https://github.com/hashicorp/terraform/issues
IRC: #terraform-tool on Freenode
---
You received this message because you are subscribed to a topic in the Google Groups "Terraform" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/terraform-tool/7Gdhv1OAc80/unsubscribe.
To unsubscribe from this group and all its topics, send an email to terraform-too...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/terraform-tool/47f05b3b-90c5-476f-9da3-3f957db1d196%40googlegroups.com.

Thomas Bernard

unread,
Aug 24, 2018, 7:47:44 PM8/24/18
to Terraform
That feels like the right solution with a traditional load balancer, but I have the impression I'm missing a magic parameter or something. Here's a snippet of my configuration. Also note that when I change AMIs, only the launch configuration gets created (and the old one destroyed); my autoscaling group is merely updated in place with the new launch configuration. My workaround is to change the prefix of the autoscaling group so a new one gets created (with an outage between autoscaling groups).

### Creating Launch Configuration
resource "aws_launch_configuration" "main" {
  name_prefix     = "lc-prod_"
  image_id        = "${lookup(var.amis, var.region)}"
  instance_type   = "${var.instance_type}"
  security_groups = ["${var.security_group}"]
  key_name        = "${var.key_name}"

  lifecycle {
    create_before_destroy = true
  }
}

### Creating AutoScaling Group
resource "aws_autoscaling_group" "main" {
  name_prefix          = "asg-prod_"
  launch_configuration = "${aws_launch_configuration.main.id}"
  vpc_zone_identifier  = ["${var.subnet_id}"]
  min_size             = 1
  max_size             = 10
  desired_capacity     = 1
  min_elb_capacity     = 1
  target_group_arns    = ["${aws_alb_target_group.default.arn}"]

  lifecycle { create_before_destroy = true }
}

Further down I also have resource configs for `aws_alb_target_group` and `aws_autoscaling_attachment`.

my terraform version is: Terraform v0.11.7
and AWS provider version is: 1.31

thanks in advance for your help!

Vicki Kozel

unread,
Sep 10, 2018, 6:14:06 PM9/10/18
to Terraform
Hi Paul, thank you so much for providing this solution; it works great. However, I noticed during testing that the old ASG starts being destroyed before the instances in the new ASG are marked InService, which causes downtime for the app. Is there a way to make sure the old ASG stays up and running until the services on the new instances come up? We do have a health check for the service port in the ELB, but no health checks are defined in the ASG.

Thank you!


Vicki Kozel

unread,
Sep 12, 2018, 5:13:09 PM9/12/18
to Terraform
Figured this out :) I was missing `min_elb_capacity` in my ASG configuration, as mentioned in Paul's solution.
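For anyone else hitting this, the relevant arguments look like this (a minimal sketch; the capacity of 3 is illustrative):

```hcl
resource "aws_autoscaling_group" "someapp" {
  # ... launch_configuration, zones, load_balancers, etc. ...

  desired_capacity = 3
  min_size         = 3

  # Terraform waits until this many instances are healthy in the attached
  # ELB before the new ASG counts as created, and only then destroys the
  # old ASG (with create_before_destroy set on both LC and ASG).
  min_elb_capacity = 3
}
```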

Bharat Puri

unread,
Jun 19, 2019, 8:01:30 AM6/19/19
to Terraform
@Paul Hinze

If my automation tests fail on the newly created cluster, how can I delete or destroy it without disturbing the existing running cluster?

Bharat Puri

unread,
Jun 19, 2019, 8:02:49 AM6/19/19
to Terraform

David Jeche

unread,
Jun 19, 2019, 8:08:20 AM6/19/19
to Terraform
I am sorry to say there is no easy way to do this with automation; you are going to have to do it manually.

You can start by looking at the resource names, then perform a taint and re-apply. This should delete the mentioned resource. There is no easy way to define what to taint and what not.

You can always try splitting the Terraform into smaller pieces to reduce conflicts between the many modules and resources.

John Patton

unread,
Jun 19, 2019, 2:35:47 PM6/19/19
to Terraform
Congratulations!  You've finally made it to the point where you need to build something more resilient!  I highly recommend creating a blue/green deployment mechanism to solve exactly what you're dealing with.  Here's a good overview of the idea; I'm sure there are other articles covering this concept as well:


Cheers,

John H Patton

Nashwan Bilal

unread,
May 6, 2020, 8:54:07 AM5/6/20
to Terraform
Hi Paul Hinze 

What if we are using a Launch Template instead of a Launch Configuration? Is it going to function the same as with an LC?
On Tue, Sep 1, 2015 at 11:20 PM, James Carr <james...@gmail.com> wrote:
Hi All,

Long time lurker, first time poster. 

One thing I would like to do is phase in a new ASG. For example, when I update a new AMI for an autoscaling group, I want to 

* Spin up a new ASG with the updated launch config that has new AMI
* attach to existing ELB, leave old ASG alone and attached
* once the min healthy hosts are attached from the new ASG, scale down and destroy the old ASG

Is there any easy way to accomplish this via terraform? So far the behavior I've observed is it just destroys the previous ASG and LC and creates a new one. Seems rather dangerous in a production environment so I am sure I am missing something here. 


Thanks!
James
