Jira (PUP-2958) Rapid-fire puppet runs cause race condition with SSL data


Zack Smith (JIRA)

Jun 2, 2015, 2:49:26 PM
to puppe...@googlegroups.com
Zack Smith commented on Bug PUP-2958
 
Re: Rapid-fire puppet runs cause race condition with SSL data

I have a customer who is running into this as well. The CSR and the certificate share a modulus, but the private key on disk has a different one:

[root@host ssl]# openssl req -modulus -noout -in certificate_requests/foo.pem |openssl md5
(stdin)= 9505e558ccd9868c187ac85df5a606d5
[root@host ssl]# openssl x509 -modulus -noout -in certs/foo.pem |openssl md5
(stdin)= 9505e558ccd9868c187ac85df5a606d5
[root@host ssl]# openssl rsa -modulus -noout -in private_keys/foo.pem |openssl md5
(stdin)= 6c3040b604d0898dbaa4a85383e54c16
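
The three commands above can be wrapped into one check. A minimal sketch, assuming RSA keys and the openssl CLI on the PATH; `modulus_of` and `ssl_moduli_match` are hypothetical helper names, not part of Puppet, and the sketch compares the raw modulus strings instead of their MD5 digests:

```shell
#!/bin/sh
# Sketch: do a CSR, a certificate and an RSA private key share one modulus?
# Helper names are hypothetical; only standard openssl subcommands are used.

modulus_of() {
  # $1 = openssl subcommand (req, x509 or rsa), $2 = PEM file
  openssl "$1" -modulus -noout -in "$2" 2>/dev/null
}

ssl_moduli_match() {
  # $1 = CSR, $2 = certificate, $3 = private key
  # Returns 0 on match, 1 on mismatch, 2 if a file could not be read.
  csr_mod=$(modulus_of req "$1") || return 2
  cert_mod=$(modulus_of x509 "$2") || return 2
  key_mod=$(modulus_of rsa "$3") || return 2
  [ -n "$key_mod" ] && [ "$csr_mod" = "$cert_mod" ] && [ "$cert_mod" = "$key_mod" ]
}
```

Run from the ssl directory as `ssl_moduli_match certificate_requests/foo.pem certs/foo.pem private_keys/foo.pem`, this would return non-zero for the files above, since the key's modulus differs from the other two.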

This message was sent by Atlassian JIRA (v6.3.15#6346-sha1:dbc023d)

Zack Smith (JIRA)

Jun 2, 2015, 2:53:25 PM
to puppe...@googlegroups.com
Zack Smith updated an issue
 
Puppet / Bug PUP-2958
Change By: Zack Smith
CS Priority: Needs Priority

Owen Rodabaugh (JIRA)

Jun 4, 2015, 6:53:20 PM
to puppe...@googlegroups.com
Owen Rodabaugh updated an issue
Change By: Owen Rodabaugh
CS Priority: Needs Priority -> Normal

Jesus Garcia (JIRA)

Mar 11, 2016, 3:45:02 PM
to puppe...@googlegroups.com
Jesus Garcia commented on Bug PUP-2958
 
Re: Rapid-fire puppet runs cause race condition with SSL data

I have one customer that has reproduced this issue, or pretty close to it, at will and in front of me via Webex.

Details:
After buildout of a Windows SQL Server VM (and now other, non-SQL VMs), the node fingerprint seems to get lost after the first agent run:

C:\Program Files\Puppet Labs\Puppet\bin>puppet agent --fingerprint

C:\Program Files\Puppet Labs\Puppet\bin>

Full Description
Issue:
After provisioning a Windows 2012r2 server, the puppet agent gets installed. A custom fact is generated, and `puppet agent -t --waitforcert 120` is issued from the node. The Puppet master signs the certificate (via vRO WF) and the node is classified via the custom fact. The puppet run begins and all configuration is successfully applied. Any subsequent puppet run results in the following message from the node:
ruby 2.1.8p440 (2015-12-16 revision 53160) [x64-mingw32]

C:\Program Files\Puppet Labs\Puppet\bin>puppet agent -t
Error: Could not request certificate: The certificate retrieved from the master does not match the agent's private key.
Certificate fingerprint: <FINGERPRINT SCRUBBED>
To fix this, remove the certificate from both the master and the agent and then
start a puppet run, which will automatically regenerate a certficate.
On the master:
puppet cert clean <HOSTNAME SCRUBBED>
On the agent:
1a. On most platforms: find C:/ProgramData/PuppetLabs/puppet/etc/ssl -name <CERTNAME SCRUBBED>.pem -delete
1b. On Windows: del "C:/ProgramData/PuppetLabs/puppet/etc/ssl/<CERTNAME SCRUBBED>.pem" /f
2. puppet agent -t

Exiting; failed to retrieve certificate and waitforcert is disabled

Node is in Puppet console:

Output from `puppet cert list --all` on master:
+ "<NODENAME SCRUBBED>" (SHA256) 91:4C:9F:82:D4:57:A1:64:C2:95:D1:9B:A3:C0:07:7F:F5:AA:F4:AA:D5:CA:24:94:BE:6F:B2:12:85:C5:7E:9D

Removing the node from the master, deleting the certs on the node, and re-adding it does fix the problem.

This is happening intermittently, but seems to be more consistent with MS SQL servers. These do have significantly longer puppet run times (often greater than 30 minutes). Not sure if that is anything…

Here is the node’s puppet.conf file:

PS C:\ProgramData\PuppetLabs\puppet\etc> cat .\puppet.conf
[main]
server=<HOSTNAME SCRUBBED>
pluginsync=true
autoflush=true
environment=production
runinterval = 6h

Attached is the Application log from the affected node. Let me know what other info you may need to help troubleshoot this issue.


Jesus Garcia (JIRA)

Mar 16, 2016, 5:29:03 PM
to puppe...@googlegroups.com
Jesus Garcia commented on Bug PUP-2958

Team,
The customer and I had a Webex session, and he has a successful workaround for this issue. It is confirmed to be an SSL race condition.

Workaround:
When you provision a new VM or host and rely on an external workflow (for example, vSphere integration) and/or a script-driven deployment, a sleep command (value = 30) needs to be introduced after the installation of the agent and before the first agent run. Only in this fashion can the race condition be avoided.
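
In a script-driven deployment, the workaround comes down to one pause between the install and the first run. A sketch for a POSIX shell provisioning script, under stated assumptions: `install_puppet_agent` is a hypothetical placeholder for whatever installs the agent at your site, `FIRST_RUN_DELAY` is a made-up variable name, and the value 30 is assumed to mean seconds (the comment above gives only the number). A Windows workflow would need the equivalent pause in its own scripting language.

```shell
#!/bin/sh
# Sketch of the workaround: install, pause, then start the first agent run.
# install_puppet_agent is a hypothetical, site-specific placeholder.

FIRST_RUN_DELAY="${FIRST_RUN_DELAY:-30}"   # assumption: the 30 is in seconds

provision_agent() {
  install_puppet_agent || return 1       # placeholder for the real installer
  sleep "$FIRST_RUN_DELAY"               # the workaround: wait before the first run
  puppet agent -t --waitforcert 120      # first run, flags as in the report above
}
```

Keeping the delay in a variable makes it easy to tune per environment without editing the workflow itself.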

Eric Sorenson (JIRA)

Jan 30, 2017, 7:30:04 PM
to puppe...@googlegroups.com
Eric Sorenson updated an issue
 
Change By: Eric Sorenson
Sprint: SE Triage

Eric Sorenson (JIRA)

Jan 30, 2017, 7:30:09 PM
to puppe...@googlegroups.com

Eric Sorenson (JIRA)

Jan 30, 2017, 7:31:24 PM
to puppe...@googlegroups.com
Eric Sorenson commented on Bug PUP-2958
 
Re: Rapid-fire puppet runs cause race condition with SSL data

Bumping to Sys Eng for CA subsystem work.

Karen Van der Veer (JIRA)

Feb 22, 2017, 1:23:04 PM
to puppe...@googlegroups.com

Ruth Linehan (JIRA)

May 15, 2017, 7:37:06 PM
to puppe...@googlegroups.com
Ruth Linehan updated an issue
Change By: Ruth Linehan
Team: Systems Engineering Agent

Moses Mendoza (JIRA)

May 18, 2017, 1:54:08 PM
to puppe...@googlegroups.com
Moses Mendoza updated an issue
Change By: Moses Mendoza
Labels: customer support  triaged

Geoff Nichols (JIRA)

May 24, 2017, 1:45:06 PM
to puppe...@googlegroups.com

Karen Van der Veer (JIRA)

Aug 15, 2017, 5:24:04 PM
to puppe...@googlegroups.com
Karen Van der Veer updated an issue
Change By: Karen Van der Veer
Fix Version/s: PUP 5.2.0
Fix Version/s: PUP 5.y

Josh Cooper (JIRA)

Mar 16, 2018, 2:48:04 PM
to puppe...@googlegroups.com
Josh Cooper updated an issue
Change By: Josh Cooper
Sub-team: Coremunity

Maggie Dreyer (JIRA)

Oct 2, 2018, 11:53:06 AM
to puppe...@googlegroups.com
Maggie Dreyer commented on Bug PUP-2958
 
Re: Rapid-fire puppet runs cause race condition with SSL data

My thought on this is that we should fix it as part of stopping using the Key indirection (and as part of the larger effort to overhaul agent cert initialization), the way we're no longer using indirections for the other SSL objects in https://github.com/puppetlabs/puppet/blob/master/lib/puppet/ssl/host.rb. That way it would be much more clear that we are checking for keys on disk before generating new ones. The current code is pretty impenetrable.
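
One way to picture "checking for keys on disk before generating new ones", as a shell sketch of the idea rather than Puppet's actual Ruby code. Assumptions: the openssl CLI is available, `ensure_private_key` is a hypothetical name, and the 2048-bit key size is an arbitrary choice. The new key is written to a temporary file and published with `ln`, which refuses to replace an existing file, so a key that another process wrote in the meantime is kept.

```shell
#!/bin/sh
# Sketch: reuse the private key on disk; generate one only if none exists.
# ensure_private_key is a hypothetical helper, not part of Puppet.

ensure_private_key() {
  key_path="$1"
  # A non-empty key already on disk is always reused, never regenerated.
  [ -s "$key_path" ] && return 0
  tmp_key="$key_path.tmp.$$"
  if ! ( umask 077 && openssl genrsa -out "$tmp_key" 2048 ) >/dev/null 2>&1; then
    rm -f "$tmp_key"
    return 1
  fi
  # ln without -f fails if the target exists, so if another process published
  # its key first, that key wins and ours is discarded.
  ln "$tmp_key" "$key_path" 2>/dev/null
  rm -f "$tmp_key"
  [ -s "$key_path" ]
}
```

Calling it twice on the same path leaves the first key untouched, which is the property the rapid-fire runs in this ticket were missing.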

Josh Cooper (JIRA)

Mar 21, 2019, 6:17:03 PM
to puppe...@googlegroups.com
Josh Cooper updated an issue
Change By: Josh Cooper
Fix Version/s: PUP 5.y
Fix Version/s: PUP 6.y

Josh Cooper (JIRA)

Apr 10, 2019, 1:49:06 PM
to puppe...@googlegroups.com
Josh Cooper updated an issue
Change By: Josh Cooper
Sprint: Coremunity Grooming

Josh Cooper (JIRA)

Apr 10, 2019, 1:50:08 PM
to puppe...@googlegroups.com
Josh Cooper updated an issue
Change By: Josh Cooper
Fix Version/s: PUP 6.y
Fix Version/s: PUP 6.5.0

Josh Cooper (JIRA)

Apr 10, 2019, 2:03:08 PM
to puppe...@googlegroups.com
Josh Cooper updated an issue
Change By: Josh Cooper
Fix Version/s: PUP 6.5.0
Fix Version/s: PUP 6.4.z

Josh Cooper (JIRA)

May 2, 2019, 12:43:04 PM
to puppe...@googlegroups.com
Josh Cooper updated an issue
Change By: Josh Cooper
Fix Version/s: PUP 6.4.z
Fix Version/s: PUP 6.5.0

Josh Cooper (JIRA)

May 3, 2019, 5:19:07 PM
to puppe...@googlegroups.com
Josh Cooper updated an issue
Change By: Josh Cooper
Sprint: Coremunity Grooming -> Platform Core KANBAN

Josh Cooper (JIRA)

May 3, 2019, 5:31:04 PM
to puppe...@googlegroups.com
Josh Cooper updated an issue
Change By: Josh Cooper
Release Notes Summary: Puppet will prevent multiple puppet processes from concurrently bootstrapping its SSL keys and certs.
Release Notes: Bug Fix

Jorie Tappa (JIRA)

May 6, 2019, 12:49:03 PM
to puppe...@googlegroups.com

Kris Bosland (JIRA)

May 10, 2019, 7:51:05 PM
to puppe...@googlegroups.com

Josh Cooper (JIRA)

May 13, 2019, 4:20:04 PM
to puppe...@googlegroups.com
Josh Cooper commented on Bug PUP-2958

Reverted in 926e0e6436 because puppet's locking code doesn't support nested behavior. In this case, puppet infrastructure locks the pidlock and runs puppet apply, which tries to exec puppet ssl bootstrap. This is similar to the problem caused in PUP-5609.
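
The nesting problem can be reproduced in miniature. This is a toy illustration and not Puppet's pidlock: it assumes a lock built on `mkdir` (which fails if the directory already exists), and `LOCK_DIR`, `acquire`, `release` and `nested_demo` are hypothetical names. The outer holder takes the lock and then starts a child that needs the same lock, which is the shape of the failure described above.

```shell
#!/bin/sh
# Toy non-reentrant lock: a child of the lock holder cannot acquire it again.
# All names here are hypothetical; this is not Puppet's locking code.

LOCK_DIR="${LOCK_DIR:-${TMPDIR:-/tmp}/nested-lock-demo.$$}"

acquire() { mkdir "$LOCK_DIR" 2>/dev/null; }   # atomic: fails if already held
release() { rmdir "$LOCK_DIR"; }

nested_demo() {
  acquire || { echo "outer: lock busy"; return 1; }
  # The outer holder now runs a child process that wants the same lock.
  if ( acquire ); then
    echo "inner: acquired"
  else
    echo "inner: blocked by parent"
  fi
  release
}
```

Running `nested_demo` prints `inner: blocked by parent`: the lock has no notion of "held by my own parent", so the inner step fails even though nothing else is competing for it.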

Kris Bosland (JIRA)

May 31, 2019, 5:42:07 PM
to puppe...@googlegroups.com

Josh Cooper (JIRA)

Jun 4, 2019, 4:07:05 PM
to puppe...@googlegroups.com

Josh Cooper (JIRA)

Jun 5, 2019, 1:22:04 PM
to puppe...@googlegroups.com
Josh Cooper commented on Bug PUP-2958

Failed integration: it doesn't reclaim stale lockfiles.

[root@rllkgfa3qeth9t6 ~]# echo 23423 > /etc/puppetlabs/puppet/ssl/ssl.lock
[root@rllkgfa3qeth9t6 ~]# puppet agent -t --certname test
Error: Could not run: Another puppet instance is already running; exiting
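
For reference, reclaiming a stale lockfile could look roughly like this. It is a sketch only, not the change that was later merged. Assumptions: the lockfile holds a bare PID (as in the `echo` above), the caller runs as root so `kill -0` can probe any PID, and `acquire_pidlock` is a hypothetical name. Known gaps: PID reuse can make a dead holder look alive, and the remove-then-retry step is itself a small race between two reclaimers.

```shell
#!/bin/sh
# Sketch: take a PID lockfile, reclaiming it when the recorded PID is dead.
# acquire_pidlock is a hypothetical helper, not Puppet's implementation.

acquire_pidlock() {
  lockfile="$1"
  attempts=0
  while [ "$attempts" -lt 2 ]; do
    # With noclobber (set -C) the redirect fails if the file already exists.
    if ( set -C; echo "$$" > "$lockfile" ) 2>/dev/null; then
      return 0
    fi
    holder=$(cat "$lockfile" 2>/dev/null)
    # kill -0 sends no signal; it only reports whether the PID can be signalled.
    if [ -n "$holder" ] && kill -0 "$holder" 2>/dev/null; then
      return 1                 # holder is alive: the lock is genuinely held
    fi
    rm -f "$lockfile"          # stale: holder is gone, remove and retry once
    attempts=$((attempts + 1))
  done
  return 1
}
```

With a lockfile that names a dead process, the first pass removes it and the second pass takes the lock instead of exiting with "Another puppet instance is already running".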

Kris Bosland (JIRA)

Jun 5, 2019, 2:57:09 PM
to puppe...@googlegroups.com

Kris Bosland (JIRA)

Jun 6, 2019, 7:06:05 PM
to puppe...@googlegroups.com
Kris Bosland commented on Bug PUP-2958

Passed CI in 51f852185979d748fdaf3079d96e3b7e3614cbee.

Kris Bosland (JIRA)

Jun 6, 2019, 8:40:07 PM
to puppe...@googlegroups.com

Heston Hoffman (JIRA)

Jun 12, 2019, 4:54:04 PM
to puppe...@googlegroups.com
Heston Hoffman updated an issue
 
Change By: Heston Hoffman
Labels: customer resolved-issue-added support