EC2 autoscaling reusing hostnames


Bad Tux

May 24, 2014, 5:54:04 AM
to puppet...@googlegroups.com
So I'm using Amazon's amazing EC2 autoscaling service and hey, this is pretty cool. Traffic on the web site constellation goes up, Amazon slowly spawns new instances of our web application to handle the traffic and attaches them to the load balancer for our site. Puppet runs, pulls in the application from the PuppetMaster (which was designated at scaling group creation time), spins it up, load balancer asks it "hey are you there", the application says "yep", and traffic starts getting split out to the new instance. Traffic goes back down, after a while Amazon slowly spins the excess instances back down. 

So I sit there for a few weeks watching traffic yoyo up and down and watching the scaling notifications crawl across my inbox, then suddenly my Nagios alarms go off telling me that the application is offline. WTF? There's instances up there! I attach an elastic IP to the ssh gateway instance and log into a couple of the application instances via ssh and sure enough, no Tomcat is installed or running, never mind the web app that Tomcat is supposed to be running. Okay, is my puppetmaster offline? Nope, it's online and listening. So I manually run puppet on one of the instances and... "invalid certificate for this hostname".

Wha?

Then I realize: Amazon gave this instance the same IP address and hostname as a prior instance that'd been part of the constellation! That's inevitable when you're running inside a VPC (Virtual Private Cloud), because you have only a /16 to play with, which must be divided between multiple Availability Zones and multiple security zones. And the puppetmaster's SSL sez, "nope, no way, I seen you before and you had a different certificate, go away."

Uhm, okay. So I need to solve this problem so that my new instances can get deployed. The only thing I can think of is to trash the SSL directories on both the puppet master and all of the clients, and then run puppet again. Note that all the instances and the puppet master are in a "puppet" network security group that was created by CloudFormation, and instances not part of the "puppet" security group cannot connect to the puppet master. So we *know* that we're talking to the puppet master, and the puppet master *knows* we're actual hosts that can talk to it. Besides, all of these instances are inside a virtual private cloud that is inaccessible to the wider Internet except via port 8080 between the load balancer and the application instances (again enforced by the security groups mechanism), so there's no way an outsider could talk to the puppet server anyhow. But... puppet insists on validating these SSL certificates before letting the instances talk to it, even though that's a totally useless exercise given that Amazon's enforcing the ACLs at the virtual network (firewall) layer to prevent anybody unauthorized from getting anywhere near that puppet port or puppet IP address.

Am I missing a configuration option in the manual to somehow disable SSL certificate validation? Does everybody add a cron job to their puppet master to stop the puppetmaster daemon and blow away its SSL directory then restart it at exactly 12:00AM every day, and the same on the instances at exactly 12:02AM every day? Or are we the only people on the planet who actually use Amazon's auto-scaling feature *plus* use Puppet at the same time? Curious penguins are... curious!



Jakov Sosic

May 25, 2014, 6:21:20 PM
to puppet...@googlegroups.com
On 05/24/2014 07:54 AM, Bad Tux wrote:

> Am I missing a configuration option in the manual to somehow disable SSL
> certificate validation? Does everybody add a cron job to their puppet
> master to stop the puppetmaster daemon and blow away its SSL directory
> then restart it at exactly 12:00AM every day, and the same on the
> instances at exactly 12:02AM every day? Or are we the only people on the
> planet who actually use Amazon's auto-scaling feature *plus* use Puppet
> at the same time? Curious penguins are... curious!

Can you somehow get a list of active nodes from the balancer? You could
use that list in a daily cron to run a 'puppet cert clean' and remove all
other certificates?

Another, and maybe even better, solution would be a script that signals
puppet to remove the cert of an instance once the instance goes into
spindown. I don't know if that's possible; I haven't used Amazon that much...
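A rough sketch of that daily reconciliation cron, assuming the Puppet certname equals the hostname; the function names are made up for illustration, the "puppet" exclusion keeps the master's own cert safe, and getting the active-node list from the balancer is left out entirely:

```python
import re
import subprocess

def parse_cert_list(output):
    """Extract certnames from `puppet cert list --all` output; signed
    entries look like: + "web-1.example.com" (SHA256) AB:CD:..."""
    return re.findall(r'"([^"]+)"', output)

def certs_to_clean(all_certs, active_nodes, protected=("puppet",)):
    """Certificates with no matching active node, never touching the
    master's own cert."""
    keep = set(active_nodes) | set(protected)
    return [c for c in all_certs if c not in keep]

def clean_stale_certs(active_nodes):
    # Compare the signed certs against whatever active-node list you
    # managed to pull from the balancer, and clean the leftovers.
    out = subprocess.check_output(["puppet", "cert", "list", "--all"]).decode()
    for cert in certs_to_clean(parse_cert_list(out), active_nodes):
        subprocess.check_call(["puppet", "cert", "clean", cert])
```

Run from a daily cron on the master, fed with the balancer's node list.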

daddy dp

May 26, 2014, 1:35:11 PM
to puppet...@googlegroups.com
I think you need to use a masterless configuration; it is a more robust solution and more suitable for an autoscaling environment. Just keep puppet and the puppet modules on the AMI, or check them out on first boot.


Hugh Cole-Baker

May 27, 2014, 11:23:41 AM
to puppet...@googlegroups.com

Bad Tux wrote:
> Am I missing a configuration option in the manual to somehow disable SSL certificate validation? Does everybody add a cron job to their puppet master to stop the puppetmaster daemon and blow away its SSL directory then restart it at exactly 12:00AM every day, and the same on the instances at exactly 12:02AM every day? Or are we the only people on the planet who actually use Amazon's auto-scaling feature *plus* use Puppet at the same time? Curious penguins are... curious!

We have enabled the Amazon SNS notifications from Autoscaling and subscribed an SQS queue to the SNS topic. We have written a small daemon, which runs on the puppet master, consumes from that queue, and calls "puppet cert clean" when it receives messages about instances being terminated by autoscaling.

We also have it listen for instance launch messages and add their certnames into /etc/puppet/autosign.conf and call "puppet cert sign" on them, which is also useful for security (you don't have to turn on auto signing for everything that way).
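For anyone sketching the same thing: a minimal consumer along those lines might look like the following, using classic boto (current when this thread was written). The queue name, region, and the assumption that the certname is the instance ID are illustrative, not necessarily how the daemon described above works:

```python
import json
import subprocess

TERMINATE = "autoscaling:EC2_INSTANCE_TERMINATE"

def certname_from_notification(body):
    """Return the instance ID from an Auto Scaling SNS->SQS message,
    or None for anything that isn't a termination event."""
    notice = json.loads(json.loads(body)["Message"])  # SNS wraps the payload
    if notice.get("Event") != TERMINATE:
        return None
    return notice["EC2InstanceId"]

def run(queue_name="puppet-autoscale", region="us-east-1"):
    import boto.sqs  # classic boto, the AWS SDK of the day
    queue = boto.sqs.connect_to_region(region).get_queue(queue_name)
    while True:
        for msg in queue.get_messages(wait_time_seconds=20):
            instance = certname_from_notification(msg.get_body())
            if instance:  # assumes certname == instance ID
                subprocess.call(["puppet", "cert", "clean", instance])
            queue.delete_message(msg)
```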

Jeremy T. Bouse

May 27, 2014, 11:56:05 AM
to puppet...@googlegroups.com
This actually sounds like a useful tool. Is this something you have
considered (or would consider) releasing as OSS for others to make use of?

I've put my autosign script up on a GitHub gist, and at least one other
person has found it useful, so I've included a header releasing it under
the Apache 2.0 license.

Hugh Cole-Baker

May 27, 2014, 12:00:13 PM
to puppet...@googlegroups.com
On Tuesday, 27 May 2014 12:56:05 UTC+1, Jeremy wrote:

This actually sounds like a useful tool. Is this something you have
considered (or would consider) releasing as OSS for others to make use of?

I've put my autosign script up on a GitHub gist, and at least one other
person has found it useful, so I've included a header releasing it under
the Apache 2.0 license.

It's a bit complicated with code to do various other things that our infrastructure needs, and it makes some assumptions (for example we always use <group name>-<instance ID> for our hostnames, so instances in "mygroup" are always named things like "mygroup-abcd1234", which avoids us having to call the EC2 API to find the hostname), but I will see if I can separate out the useful parts and publish them.

Jeremy T. Bouse

May 27, 2014, 12:25:17 PM
to puppet...@googlegroups.com
Yeah, I can understand that. My autosign script made use of the instance
ID being embedded as an extra attribute within the CSR. It left out
anything processing-specific beyond showing how to pull the instance ID
from the CSR and validate that it was a valid running instance using Fog.
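A stripped-down policy-based autosign script along those lines might look like this. Puppet invokes the policy executable with the certname as argv[1] and the CSR PEM on stdin; the regex-over-openssl-output trick and the absence of a real EC2 lookup are simplifications for illustration, not the script described above:

```python
import re
import subprocess
import sys

# EC2 instance IDs: "i-" plus 8 (classic) to 17 (newer) hex characters.
INSTANCE_RE = re.compile(r"\bi-[0-9a-f]{8,17}\b")

def extract_instance_id(csr_text):
    """Pull the first EC2-instance-ID-shaped token out of the text dump
    of a CSR, wherever the extra attribute lands in the dump."""
    match = INSTANCE_RE.search(csr_text)
    return match.group(0) if match else None

def main():
    # Autosign policy interface: certname in argv[1], CSR PEM on stdin.
    csr_pem = sys.stdin.read()
    dump = subprocess.check_output(
        ["openssl", "req", "-noout", "-text"], input=csr_pem.encode()
    ).decode()
    instance = extract_instance_id(dump)
    # A real script would now verify the ID against the EC2 API (Fog was
    # used above); this sketch only checks that one is present at all.
    sys.exit(0 if instance else 1)

# Call main() when installing this as the autosign executable.
```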

Felipe Salum

May 27, 2014, 3:06:22 PM
to puppet...@googlegroups.com
I work around this by using a cloud-init script during the autoscale instance launch that gets the instance ID of the instance, renames the hostname, and updates /etc/hosts before running puppet.
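A sketch of such a boot script, assuming the EC2 metadata service and a made-up "webapp" group name; the actual cloud-init script described above may look quite different:

```python
import subprocess
import urllib.request

METADATA = "http://169.254.169.254/latest/meta-data/"

def make_hostname(group, instance_id):
    """('webapp', 'i-abcd1234') -> 'webapp-abcd1234'"""
    return "%s-%s" % (group, instance_id.replace("i-", "", 1))

def hosts_line(ip, hostname):
    return "%s %s\n" % (ip, hostname)

def rename_host(group="webapp"):
    # Run once at boot, before the first puppet run.
    fetch = lambda key: urllib.request.urlopen(METADATA + key).read().decode()
    hostname = make_hostname(group, fetch("instance-id"))
    subprocess.check_call(["hostname", hostname])
    with open("/etc/hosts", "a") as hosts:
        hosts.write(hosts_line(fetch("local-ipv4"), hostname))
```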

Jeremy T. Bouse

May 27, 2014, 3:39:13 PM
to puppet...@googlegroups.com
On 27.05.2014 11:06, Felipe Salum wrote:
> I work around this by using a cloud-init script during the autoscale
> instance launch that gets the instance ID of the instance, renames the
> hostname, and updates /etc/hosts before running puppet.
>

Cloud-init helps, but there are limitations. I use cloud-init to deploy
a script that generates the extra-attributes file before puppet is
deployed, to include the instance ID in the CSR, and this works if your
master can then determine how to configure the host by other means. If
the master keys off the client cert name (read: hostname), you can
override that with cloud-init, but that stops being useful when you're
using the full power of the cloud architecture with auto scaling groups.

Felipe Salum

May 27, 2014, 3:52:38 PM
to puppet...@googlegroups.com
I have prod, QA, and staging all using autoscaling, and my master uses node regexes based on the <hostgroup>-<instance-id> hostnames to apply the right role modules. I have been using it for a long time with no issues at all, using the full power of the cloud and autoscale :)





--
You received this message because you are subscribed to a topic in the Google Groups "Puppet Users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/puppet-users/m_fffsKR9aM/unsubscribe.
To unsubscribe from this group and all its topics, send an email to puppet-users+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/puppet-users/85960579c1c0eab21f9068aa33299130%40undergrid.net.
For more options, visit https://groups.google.com/d/optout.

Bad Tux

May 27, 2014, 4:51:16 PM
to puppet...@googlegroups.com
On Sunday, May 25, 2014 11:21:20 AM UTC-7, Jakov Sosic wrote:
Can you somehow get a list of active nodes from the balancer? You could
use that list in a daily cron to run a 'puppet cert clean' and remove all
other certificates?

I can get a list of active nodes in the constellation; the instances have a constant embedded in the instance name that tells which constellation they belong to. That's how my Nagios instance works, after all -- it queries AWS for the list of active nodes and reconfigures Nagios to look at them. Otherwise Nagios would be completely out of date after the first scaling event. I'm somewhat reluctant to embed AWS credentials into the puppetmaster, though.

The other thing that someone mentioned in another forum was to look for nodes reporting in the reports directory, and if a node hasn't reported for over an hour (mine are checking in every twenty minutes at a minimum, so they should have checked in by then) to do a 'puppet cert clean' on that node and then a 'puppet certificate_revocation_list destroy' just in case it comes back to life and checks in again.

The other suggestion has been to change the hostname of the instances as part of their cloud-init stage to include the instance ID as well as the IP address. That would actually work fairly well, I suspect, since the chance of both the instance ID *and* the IP address being reused for the same instance is pretty much non-existent, and it also gives me more information on my Splunk server about which event applies to which instance. But it will require a lot of time to debug on my part, because the only way to debug it is to run CloudFormation time... after... time... after... time..., creating and destroying constellations until I get it right.
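The reports-directory idea might be sketched like this, assuming one sub-directory per certname under the reports directory (Puppet's usual layout); the paths, threshold, and function names are illustrative:

```python
import os
import subprocess
import time

def stale_nodes(reports_dir, max_age_secs=3600, now=None):
    """Nodes whose newest report file is older than max_age_secs;
    assumes one sub-directory per certname under reports_dir."""
    now = time.time() if now is None else now
    stale = []
    for node in os.listdir(reports_dir):
        node_dir = os.path.join(reports_dir, node)
        reports = [os.path.join(node_dir, f) for f in os.listdir(node_dir)]
        newest = max((os.path.getmtime(r) for r in reports), default=0)
        if now - newest > max_age_secs:
            stale.append(node)
    return stale

def clean(node):
    # The CRL-destroy step mentioned above would also go here, in case
    # the name comes back to life and checks in again.
    subprocess.call(["puppet", "cert", "clean", node])
```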

I don't want to go to master-less configuration because I tweak constellations before rolling them into production, and it's easier to do that via a configuration master. For example, new constellations start out pointed at a testing database, and one of the things that happens when they're moved into production is that they get re-pointed at the production database. I might try migrating to a different configuration tool such as Chef in the future, but I have limited time to devote to this project. So right now the priority is just forcing Puppet to work the way it needs to work in the cloud, rather than the way the Puppet authors believe it should work, which is completely incompatible with cloud ops.

Peter Romfeld

May 27, 2014, 5:00:43 PM
to puppet...@googlegroups.com
hey,

If you don't run cross AWS/datacenter and only use AWS, I would recommend OpsWorks. It's Chef-based, but for AWS-only setups it's quite nice.


Bad Tux

May 27, 2014, 6:50:55 PM
to puppet...@googlegroups.com
On Tuesday, May 27, 2014 10:00:43 AM UTC-7, Peter Romfeld wrote:
hey,

If you don't run cross AWS/datacenter and only use AWS, I would recommend OpsWorks. It's Chef-based, but for AWS-only setups it's quite nice.

I trialed OpsWorks but it didn't handle some functionality that I needed. I forget what it was now, lots of water under the bridge LOL. I might take a look at it again at some point since Amazon may have added the functionality that I need in the year or so since I trialed it, but time's always the issue. 

Poil

May 28, 2014, 6:02:08 AM
to puppet...@googlegroups.com
Hi

I use this solution (one shared cert): https://gist.github.com/ahpook/1182243. I deploy the certificate via cfn-init, and I have a pre-script that sets the hostname of the server via a reverse DNS query on my AutoScale group.

Best regards

Hugh Cole-Baker

May 28, 2014, 10:05:29 AM
to puppet...@googlegroups.com
Here is the code that we use to pull Auto Scaling messages off an SQS queue and add/remove the respective nodes to autosign.conf and sign or clean up their certificates.


It's copied and pasted out of a larger application that handles various events from Auto Scaling and applies changes to systems other than Puppet, so some parts may be missing, but it should be understandable... It assumes that all the instance hostnames will be <AS group name>-<hex digits from instance ID> and the domain appended will be based on the AWS region, like 'east.internal'. You'd have to change the code in message.py to alter that behaviour.

jcbollinger

May 28, 2014, 5:27:12 PM
to puppet...@googlegroups.com


On Tuesday, May 27, 2014 6:23:41 AM UTC-5, Hugh Cole-Baker wrote:


We have enabled the Amazon SNS notifications from Autoscaling and subscribed an SQS queue to the SNS topic. We have written a small daemon, which runs on the puppet master, consumes from that queue, and calls "puppet cert clean" when it receives messages about instances being terminated by autoscaling.



+1

That, or something like it, is exactly what you ought to do, even before considering the possibility of hostname reuse.  In any Puppet environment, you should clean out the certificates of nodes that have been decommissioned.  And decommissioning is exactly what the auto-scaledown is doing: even if another node is later commissioned with the same hostname, it is a different node.

As another possible alternative, if EC2 nodes have a genuinely unique identifier (an Amazon-assigned UUID, for instance) then you can configure your clients to use that as their certificate names, instead of their hostname.  (But you still might want to set up automatic certificate cleaning to avoid Puppet's certificate stash growing out of control.)
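As an illustration, that override is a one-liner in the agent's puppet.conf; the instance ID below is a placeholder, and something at boot (cloud-init, say) would have to write the real one in before the first agent run:

```ini
# /etc/puppet/puppet.conf (agent side)
[agent]
# Placeholder -- substitute the Amazon-assigned instance ID at boot time,
# so the certificate name survives hostname/IP reuse:
certname = i-abcd1234
```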

 
We also have it listen for instance launch messages and add their certnames into /etc/puppet/autosign.conf and call "puppet cert sign" on them, which is also useful for security (you don't have to turn on auto signing for everything that way).


Nice.

There are other alternatives, but I haven't thought of any better ones.


John

Felipe Salum

May 28, 2014, 5:47:45 PM
to puppet...@googlegroups.com
I use a different approach to clean up certificates and the node on the Puppet Dashboard, but it is an ugly hack. I'm writing something in Python to read the autoscaling termination message posted to SNS->SQS, and I should have something up tonight. I will share it here for feedback; I'm planning to replace my ugly hack with this Python script.



Felipe Salum

May 29, 2014, 2:14:03 AM
to puppet...@googlegroups.com
I finished writing a script to read messages from an SQS queue subscribed to an SNS topic used by the autoscaling group notifications. On a termination event it runs 'puppet node {clean,deactivate}' and a rake task to delete the node from Puppet Dashboard.


I included autoscaling.msg to make it easy to add test messages to the SQS queue and exercise the script.

Let me know what you guys think.



Rich Burroughs

May 29, 2014, 5:32:04 PM
to puppet...@googlegroups.com
Yeah, I think masterless is probably the way to go for autoscaling. Even with autosigning on, you have to deal with the cert issue somehow.

I have a vagrant environment I hacked together for some testing and I re-use hostnames there. My solution was to have the agent execute a "puppet cert clean" on the master for its own hostname over ssh as it comes up the first time.
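That first-boot handshake can be sketched in a few lines; the root@ login and the passwordless ssh trust it relies on are assumptions that only make sense in a disposable test bed:

```python
import socket
import subprocess

def clean_cmd(master, hostname):
    """The ssh invocation that clears this host's old cert on the master."""
    return ["ssh", "root@" + master, "puppet", "cert", "clean", hostname]

def first_boot_clean(master="puppet"):
    # Needs passwordless ssh to the master -- fine for a throwaway
    # vagrant environment, not something to ship to production.
    subprocess.call(clean_cmd(master, socket.getfqdn()))
```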



Rich

On Monday, May 26, 2014, daddy dp <jaro...@gmail.com> wrote:
I think you need to use a masterless configuration; it is a more robust solution and more suitable for an autoscaling environment. Just keep puppet and the puppet modules on the AMI, or check them out on first boot.



