AWS: deletion of subnet times out because of scaling group


Owen Smith

Nov 18, 2015, 1:30:27 PM
to Terraform
Greetings,

I've been having fun recently throwing AWS resources around in
Terraform! Generally it's working fine – at worst I've seen
cyclic dependencies in a change, which I had to break into a
couple of steps, or a dependency breakage that fixed itself on a
second run. However, I have just run into one interesting case
which the tool doesn't seem to be able to automate its way out
of.

The case is this: I have a subnet that I want to delete and
recreate in a different VPC. Yet Terraform hangs and then errors
out, like this:

* aws_subnet.foo-b: Error deleting subnet: timeout while waiting
  for state to become 'destroyed'
* aws_subnet.foo-a: Error deleting subnet: timeout while waiting
  for state to become 'destroyed'

This recurs if I rerun the deploy.

"Hm," I thought, "I wonder what happens if I just go into my AWS
console and delete it manually?" I get this error:

The following subnets contain one or more instances or network
interfaces. You cannot delete these subnets until those instances
have been terminated, and the network interfaces have been
deleted.

In my case it's instances, and the instances in question are
associated with an autoscaling group. I am managing the ASG in
Terraform, but because the instances are automatically spun up by
scaling policies, they are not managed directly in Terraform. To
unblock Terraform, I think I will have to manually delete the ASG
and its instances before changing the subnet will work.
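
For reference, the relevant pieces of my config look roughly
like this (the VPC reference, CIDRs, AZs, and launch
configuration are illustrative placeholders, not my real
values):

resource "aws_subnet" "foo-a" {
    vpc_id            = "${aws_vpc.main.id}"
    cidr_block        = "10.0.1.0/24"
    availability_zone = "us-east-1a"
}

resource "aws_subnet" "foo-b" {
    vpc_id            = "${aws_vpc.main.id}"
    cidr_block        = "10.0.2.0/24"
    availability_zone = "us-east-1b"
}

resource "aws_autoscaling_group" "app" {
    name                 = "app-asg"
    vpc_zone_identifier  = ["${aws_subnet.foo-a.id}", "${aws_subnet.foo-b.id}"]
    launch_configuration = "${aws_launch_configuration.app.name}"
    min_size             = 1
    max_size             = 2
    ...
}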

I'm not quite sure how to categorize this: crazy corner case,
expected but wonky behavior, a bug, a feature request, or what?
It seems to me that there might be a reasonable path for
automation here:

- Are any ASGs dependent on the subnet to be deleted (looking at
  vpc_zone_identifier(s))?
  - yes => Are those ASGs also being deleted?
    - yes =>
      - "find all instances associated with those ASGs and terminate
        them along with the ASG"
      - delete the subnet
    - no => block the operation because of a Terraform dependency
  - no => proceed as usual

The difficulty, I think, is the operation in quotes – is it in
fact possible to roll up the ASG and all of its instances in the
same breath? If you just delete instances before deleting the
ASG, I imagine a policy might be able to spin up another one
before you can reap the ASG. If you delete the ASG first, I don't
know if you can correctly identify the instances that are OK to
reap.

If we can't actually manage the deletion in this case, but we can
recognize the problem, perhaps we could simply have Terraform
report the dependency issue rather than timing out?

Curious to hear your thoughts! I'm brand-new to Terraform and
new-ish to AWS, so maybe it's not as hard as I think it is, or
maybe it's a known limitation.

Thanks!
-- Owen

Paul Hinze

Dec 3, 2015, 10:25:13 AM
to terrafo...@googlegroups.com
Hi Owen,

Thanks for the well explained scenario!

I've reproduced the situation you described in config:


Let me know if that maps correctly to what you're picturing.

The reason you get the unexpected behavior is that Terraform believes the ASG only needs an Update to change its subnet ID, not a Destroy/Create.

In the attached example I show a workaround that forces a -/+ on the ASG by interpolating the Subnet ID into the ASG Name (a field that cannot be updated, so any change forces replacement).
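
In case it helps, a minimal sketch of that workaround looks roughly like this (resource names, CIDR, and the launch configuration are placeholders):

resource "aws_subnet" "main" {
    vpc_id     = "${aws_vpc.new.id}"
    cidr_block = "10.0.1.0/24"
}

resource "aws_autoscaling_group" "app" {
    # Interpolating the subnet ID into the name means any replacement of
    # the subnet changes the name, which forces a -/+ of the ASG instead
    # of an in-place update.
    name                 = "app-asg-${aws_subnet.main.id}"
    vpc_zone_identifier  = ["${aws_subnet.main.id}"]
    launch_configuration = "${aws_launch_configuration.app.name}"
    min_size             = 1
    max_size             = 2
}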

With the workaround, Terraform performs the order of operations you'd want:

 A1. Old ASG Destroy (this waits for instances to come down)
 A2. Old Subnet Destroy (should be successful)
 A3. New Subnet Create
 A4. New ASG Create

If I added `lifecycle { create_before_destroy = true }` to the resources I'd get this:

 B1. New Subnet Create
 B2. New ASG Create
 B3. Old ASG Destroy (this waits for instances to come down)
 B4. Old Subnet Destroy (should be successful)

Both of those are valid ways of replacing the subnet.
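
A sketch of the create_before_destroy variant just adds the lifecycle block to the same placeholder resources:

resource "aws_subnet" "main" {
    # Create the replacement subnet before destroying the old one.
    lifecycle {
        create_before_destroy = true
    }
    vpc_id     = "${aws_vpc.new.id}"
    cidr_block = "10.0.1.0/24"
}

resource "aws_autoscaling_group" "app" {
    # Likewise bring up the new ASG before tearing down the old one.
    lifecycle {
        create_before_destroy = true
    }
    name                = "app-asg-${aws_subnet.main.id}"
    vpc_zone_identifier = ["${aws_subnet.main.id}"]
    ...
}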

Without the workaround, the attempted incorrect behavior is:

 C1. Old Subnet Destroy (fails, does not continue)
 C2. New Subnet Create
 C3. Update existing ASG to point to new Subnet

We could add create_before_destroy here, but it still does not fix the destroy in D3:

 D1. New Subnet Create
 D2. Update existing ASG to point to new Subnet
 D3. Old Subnet Destroy (still fails because ASG still has instances launched into the old subnet)

Generally, a good strategy when trying to improve Terraform's behavior in situations like this is to ask the question "what would I do manually as an operator here?"

Probably I'd do something like this:

 E1. Create the new subnet
 E2. Update the ASG to point to the new Subnet
 E3. Roll the instances in the ASG one by one so we get fresh instances in our new subnet
 E4. Delete the old subnet

So we're almost there with the D steps; we're just missing the concept of "this change should trigger a roll of instances in the ASG". This is a commonly requested feature, which you can track here:


We are also missing a first-class version of the name interpolation workaround used in A and B: essentially, a way to tell Terraform "please treat any change to this attribute as a replacing change". The relevant issue to track here is:


Phew okay I think that's all I have to say for now. Sorry for the tome! :) Feel free to follow up with any further questions.

Paul





Owen Smith

Dec 3, 2015, 2:16:59 PM
to Terraform
Sweet, thanks for the investigation!

Yes, in my case the ASG was using a static name, so it would seem to fit your test case.

As far as instance rollover goes: IIUC that means taking instances out of the old subnet one at a time and adding instances in the new subnet? That's not actually something I needed in this particular case since it's a new (pre-production) system I'm working on, so I would have been OK with the "destroy all instances then create from scratch" approach (A1-4). But I can certainly see how a controlled rollover would be more useful once the system is in production!

I think I can see from this that my mental model was wrong. I was thinking of vpc_zone_identifier as "the subnets the ASG is contained in". Instead, it sounds like it's "the subnets the ASG will launch new instances in, _not necessarily_ the subnets the ASG's instances are in." So it didn't even occur to me that this could be a simple attribute update instead of a wholesale migration of instances.

I looked at #2341... it seems to me analogous to a problem Puppet et al deal with where you have two dependency chains: one for ordering of evaluation (require/before), and one for forcing reevaluation (subscribe/notify). If we're thinking about doing it similarly to Puppet, we'd have something analogous to "depends_on", say "refresh_on", that would "force new" (i.e. destroy/recreate) the ASG instead of simply updating it. So I guess the code would look something like this:

resource "aws_autoscaling_group" "app" {
    refresh_on = ["${aws_subnet.moving.id}"]
    vpc_zone_identifier = ["${aws_subnet.moving.id}"]
    ...
}

Or (this might cause some interesting implementation challenges?):

resource "aws_autoscaling_group" "app" {
    refresh_on = ["${self.vpc_zone_identifier}"]
    vpc_zone_identifier = ["${aws_subnet.moving.id}"]
    ...
}

But I don't know that it's intuitive for users since the definition of "refresh" for an ASG seems a little nebulous to me. For example, couldn't a naïve user consider applying an update to be a "refresh" of some sort?

Just spitballing here, but what if we thought of it as "this ASG should constrain its instances to these subnets"?

resource "aws_autoscaling_group" "app" {
   launch_in_subnets = ["${aws_subnet.moving.id}"]
   run_in_subnets = ["${aws_subnet.moving.id}"]
}

Then changes to "run_in_subnets" could automatically force deletion/recreation/rollover, whereas "launch_in_subnets" (an alias for "vpc_zone_identifier") would remain a simple update. Challenges: launch_in_subnets had better be a subset of run_in_subnets... maybe raise that as a configuration error if it's not?

Thanks again for looking at this and pointing me to the relevant GitHub issues!
-- O

Paul Hinze

Dec 4, 2015, 5:13:37 PM
to terrafo...@googlegroups.com
> it seems to me analogous to a problem Puppet et al deal with where you have two dependency chains...

This is a great point, and one I'll be thinking about more. "Refresh" has a special meaning in Terraform - refreshing attributes by reading from the upstream API - so the behavior you describe would probably be called something more like replace_on, though you'll notice towards the bottom of #2341 we were leaning towards bundling "replace on" behavior into depends_on. Thinking about two separate dependency chains makes me hesitate to do so, though. More thought required!

As for a field called refresh_on that uses Terraform's definition of "refresh": we have a parallel thread going ("Refreshing aws_instance's public_dns") where that sounds like it might be just the thing we need.

> Just spitballing here, but what if we thought of it as "this ASG should constrain its instances to these subnets"?

I understand the motivation here, but I generally try to steer us away from resource-specific behavior-modification attributes as much as possible. I think a more generic solution like replace_on = ["subnet_ids"] is a better approach to pursue.
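
To sketch how that might read on your example (a hypothetical attribute, nothing implemented yet):

resource "aws_autoscaling_group" "app" {
    # Hypothetical: any change to the listed attributes forces a
    # destroy/create of this resource instead of an in-place update.
    replace_on          = ["vpc_zone_identifier"]
    vpc_zone_identifier = ["${aws_subnet.moving.id}"]
    ...
}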

Thanks a lot for your thoughts! This has been a helpful conversation.

Paul 
