I just wanted to point out that it may be more common than you think, as this is a way to
achieve blue/green deployment by using DNS as your router.
I think it's a bad way of implementing this, for a number of reasons:
1) It assumes that the DNS server doesn't override the TTL value. That happens, in real life.
2) It assumes that the local implementation honours it (nscd, for one, didn't in the past)
3) It assumes it isn't cached somewhere else
In general, I avoid relying on DNS changes: it will bite you at some point.
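To make point 1 concrete, here is a minimal sketch of a caching resolver that enforces its own minimum TTL. Everything in it is hypothetical (the `ClampingCache` class, the 300-second floor, the host name and the addresses); it only illustrates the failure mode and is not modelled on any particular resolver.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    address: str
    expires_at: float


class ClampingCache:
    """Hypothetical resolver cache that never caches for less than min_ttl."""

    def __init__(self, min_ttl: float):
        self.min_ttl = min_ttl
        self._entries = {}

    def resolve(self, name, now, upstream):
        entry = self._entries.get(name)
        if entry is not None and now < entry.expires_at:
            # Still considered fresh by this cache, whatever the record's TTL said.
            return entry.address
        address, ttl = upstream(name)
        # The override: the record's TTL is raised to the cache's own floor.
        effective_ttl = max(ttl, self.min_ttl)
        self._entries[name] = CacheEntry(address, now + effective_ttl)
        return address


# Assumed authoritative data: "blue" stack, published with a 60 second TTL.
records = {"app.example.com": ("10.0.0.1", 60)}


def authoritative(name):
    return records[name]


cache = ClampingCache(min_ttl=300)
before = cache.resolve("app.example.com", now=0, upstream=authoritative)

# Switch DNS to the "green" stack.
records["app.example.com"] = ("10.0.0.2", 60)

# 61 seconds later the published TTL has expired, but the cache still answers blue.
after_ttl = cache.resolve("app.example.com", now=61, upstream=authoritative)
# Only after the cache's own floor has passed does green show up.
after_clamp = cache.resolve("app.example.com", now=301, upstream=authoritative)
```

In this sketch a client behind such a cache keeps hitting the old stack for four minutes longer than the published TTL suggests, and nothing on the authoritative side can shorten that.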
This is where I disagree: TTL is part of the DNS spec and is pretty well respected everywhere.
The default behavior is to cache forever when a security manager is installed, and to cache for an implementation specific period of time, when a security manager is not installed.
Notice the "implementation specific period".
Of course you can override it (AWS recommends setting it to 60 seconds). I've seen multiple clients / servers that either have bugs or simply ignore / override TTL values. Not to mention applications doing their own caching.
I usually recommend either an ELB that understands resource groups (Amazon's ELB does) or nginx, which can be taught to handle this. Amazon's new ELB can even switch to a different resource group based on the request path.
But you can't simply use an ELB if your goal is to balance your traffic between 2 stacks,
Why not? Just attach 2 autoscaling groups.
or you have to have a stack that contains only this ELB and then have substacks that are your "real" stacks, but this is a complicated setup compared to having Route 53 let you select and switch the ELB. You can even distribute x% of your traffic to a stack or distribute it based on the geolocation of the request.
That is traffic to the ELBs themselves; I'm talking about the distribution behind the ELB.
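For reference, the "x% of your traffic to a stack" split mentioned above can be sketched as weighted selection, where each answer is chosen with probability weight / sum of weights. This is only a simulation under assumed names and numbers (`blue-elb`, `green-elb`, a 90/10 split, a fixed seed); it does not call any AWS API.

```python
import random


def pick_stack(weights, rng):
    """Pick one stack name with probability weight / sum(weights)."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical 90/10 split between two stacks.
weights = {"blue-elb": 90, "green-elb": 10}

# Fixed seed so the simulation is repeatable.
rng = random.Random(42)
sample = [pick_stack(weights, rng) for _ in range(10_000)]
green_share = sample.count("green-elb") / len(sample)
```

Note that when the selection happens at the DNS level, each pick is per lookup, not per request: a resolver that caches one answer sends all of its clients to that stack until it asks again.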
I understand the benefits of the approach you mention, but in practice it is painful to set up. I know that on Google Cloud you have an internal LB in front of your application that allows you to switch between versions.
That also saves you the wait for a TTL to expire in case the switch turns out bad. Or hacking a Java lib because it uses a resource pool which you have no control over.
It is not 'hacking' on something you don't have control over, it is configuring a setting which controls exactly the resource pool behaviour.
Assuming you have control over it.
Short TTL is also a core trick (scattering) used by Cloudflare to fail over to other backends during a DDoS attack.
Assuming clients honour that. Your average DDoS service / tool will probably not.
If you ask me, their biggest asset is the huge amounts of bandwidth they have.
Please read the Cloudflare blog article where they explain how they do it and say they would like to be able to set a short TTL on the glue record to do the same.
That article is about TLD glue records, and about protecting DNS servers. That still doesn't resolve the issues I describe above, which are there in real life.
Sure, in the ideal world a short TTL would be a good thing, and solve a lot of issues. But we're not living in the ideal world.
Igmar