Jira (PUP-2958) Rapid-fire puppet runs cause race condition with SSL data

Zack Smith (JIRA)

Jun 2, 2015, 2:49:26 PM

to puppe...@googlegroups.com

Zack Smith commented on

PUP-2958

Re: Rapid-fire puppet runs cause race condition with SSL data

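I have a customer who is running into this as well. The CSR and the signed certificate share a modulus, but the private key on disk does not match them: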

[root@host ssl]# openssl req -modulus -noout -in certificate_requests/foo.pem |openssl md5

(stdin)= 9505e558ccd9868c187ac85df5a606d5

[root@host ssl]# openssl x509 -modulus -noout -in certs/foo.pem |openssl md5

(stdin)= 9505e558ccd9868c187ac85df5a606d5

[root@host ssl]# openssl rsa -modulus -noout -in private_keys/foo.pem |openssl md5

(stdin)= 6c3040b604d0898dbaa4a85383e54c16
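
The three checks above can be scripted so they are easy to repeat on other affected nodes. This is only a sketch: it assumes `openssl` is on the PATH and that the ssl directory uses the same `certificate_requests/`, `certs/`, and `private_keys/` layout as in the transcript. The helper names (`modulus_digest`, `collect_digests`, `mismatched_against_cert`) are hypothetical and not part of Puppet.

```python
import hashlib
import subprocess
from pathlib import Path

# Subdirectory of the ssl dir and the openssl subcommand that reads it,
# mirroring the three commands in the transcript above.
SOURCES = {
    "csr": ("certificate_requests", "req"),
    "cert": ("certs", "x509"),
    "key": ("private_keys", "rsa"),
}


def modulus_digest(subcommand, pem_path):
    """MD5 of the output of `openssl <subcommand> -modulus -noout -in <pem_path>`."""
    result = subprocess.run(
        ["openssl", subcommand, "-modulus", "-noout", "-in", str(pem_path)],
        check=True,
        capture_output=True,
    )
    # Hash the raw bytes, as piping into `openssl md5` does.
    return hashlib.md5(result.stdout).hexdigest()


def collect_digests(ssl_dir, certname):
    """Return {"csr": ..., "cert": ..., "key": ...} for one certname."""
    ssl_dir = Path(ssl_dir)
    return {
        name: modulus_digest(subcommand, ssl_dir / subdir / f"{certname}.pem")
        for name, (subdir, subcommand) in SOURCES.items()
    }


def mismatched_against_cert(digests):
    """Names whose modulus digest differs from the signed certificate's."""
    return sorted(name for name, d in digests.items() if d != digests["cert"])
```

Calling `mismatched_against_cert(collect_digests(".", "foo"))` from the ssl directory would list which files disagree with the signed certificate. With the digests shown above, only the private key does.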

This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)

Zack Smith (JIRA)

Jun 2, 2015, 2:53:25 PM

to puppe...@googlegroups.com

Zack Smith updated an issue

Puppet /

PUP-2958

Rapid-fire puppet runs cause race condition with SSL data

Change By:	Zack Smith
CS Priority:	Needs Priority

Owen Rodabaugh (JIRA)

Jun 4, 2015, 6:53:20 PM

to puppe...@googlegroups.com

Owen Rodabaugh updated an issue

Puppet /

PUP-2958

Rapid-fire puppet runs cause race condition with SSL data

Change By:	Owen Rodabaugh
CS Priority:	Needs Priority → Normal

Jesus Garcia (JIRA)

Mar 11, 2016, 3:45:02 PM

to puppe...@googlegroups.com

Jesus Garcia commented on

PUP-2958

Re: Rapid-fire puppet runs cause race condition with SSL data

I have one customer that has reproduced this issue, or pretty close to it, at will and in front of me via Webex.

Details:
After buildout of a Windows SQL Server VM (and now other, non-SQL VMs), the node fingerprint seems to get lost after the first agent run:

{{C:\Program Files\Puppet Labs\Puppet\bin>puppet agent --fingerprint

C:\Program Files\Puppet Labs\Puppet\bin>}}

Full Description
Issue:
After provisioning a Windows 2012r2 server, the puppet agent gets installed. A custom fact is generated, and `puppet agent -t --waitforcert 120` is issued from the node. The Puppet master signs the certificate (via a vRO workflow) and the node is classified via the custom fact. The Puppet run begins and all configuration is successfully applied. Any subsequent puppet runs result in the following message from the node:
ruby 2.1.8p440 (2015-12-16 revision 53160) [x64-mingw32]

C:\Program Files\Puppet Labs\Puppet\bin>puppet agent -t
Error: Could not request certificate: The certificate retrieved from the master does not match the agent's private key.
Certificate fingerprint: <FINGERPRINT SCRUBBED>
To fix this, remove the certificate from both the master and the agent and then
start a puppet run, which will automatically regenerate a certficate.
On the master:
puppet cert clean <HOSTNAME SCRUBBED>
On the agent:
1a. On most platforms: find C:/ProgramData/PuppetLabs/puppet/etc/ssl -name <CERTNAME SCRUBBED>.pem -delete
1b. On Windows: del "C:/ProgramData/PuppetLabs/puppet/etc/ssl/<CERTNAME SCRUBBED>.pem" /f
2. puppet agent -t

Exiting; failed to retrieve certificate and waitforcert is disabled

Node is in Puppet console:

Output from `puppet cert list --all` on master:
+ "<NODENAME SCRUBBED>" (SHA256) 91:4C:9F:82:D4:57:A1:64:C2:95:D1:9B:A3:C0:07:7F:F5:AA:F4:AA:D5:CA:24:94:BE:6F:B2:12:85:C5:7E:9D

Removing the node from master, deleting certs on the node and re-adding does fix the problem.

This is happening intermittently, but seems to be more consistent with MS SQL servers. These do have significantly longer puppet run times (often greater than 30 minutes). Not sure whether that is relevant.

Here is the node’s puppet.conf file:

PS C:\ProgramData\PuppetLabs\puppet\etc> cat .\puppet.conf
[main]
server=<HOSTNAME SCRUBBED>
pluginsync=true
autoflush=true
environment=production
runinterval = 6h

Attached is the Application log from the affected node. Let me know what other info you may need to help troubleshoot this issue.

Jesus Garcia (JIRA)

Mar 16, 2016, 5:29:03 PM

to puppe...@googlegroups.com

Jesus Garcia commented on

PUP-2958

Re: Rapid-fire puppet runs cause race condition with SSL data

Team,
The customer and I had a Webex session, and he has a successful workaround for this issue. It is confirmed to be an SSL race condition.

Workaround:
When you provision a new VM or host through an external workflow (example: vSphere integration) and/or a script-driven deployment, introduce a sleep command (value = 30) after the installation of the agent and before the first agent run. Only in this fashion can the race condition be avoided.
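
For a script-driven deployment, the ordering could look roughly like the sketch below. It assumes the sleep value of 30 is in seconds and that `puppet` is on the PATH once the agent is installed. The function name `first_agent_run` and its parameters are hypothetical, and the agent command is the one quoted earlier in this thread.

```python
import subprocess
import time


def first_agent_run(delay=30, waitforcert=120, sleep=time.sleep, run=subprocess.run):
    """Pause after agent installation, then start the first agent run.

    delay: pause before the first run (assumed to be seconds).
    waitforcert: value passed to --waitforcert, as in the thread.
    sleep / run: injectable so the ordering can be checked without
    actually waiting or invoking puppet.
    """
    # Call this only after the agent install step has finished.
    sleep(delay)
    # No check=True: with -t a non-zero exit code does not necessarily mean
    # failure (for example, 2 is used when changes were applied).
    return run(["puppet", "agent", "-t", "--waitforcert", str(waitforcert)])
```

The provisioning workflow would call `first_agent_run()` once, right after the agent install step, and inspect the returned result itself.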