Very frequent "Error: Could not request certificate: The certificate retrieved from the master does not match the agent's private key." on Windows

Fredrik Nilsson

Oct 7, 2016, 3:33:23 AM
to Puppet Users
Hi Guys,

Hopefully one of you has a splendid idea on how to solve this...

The problem is that I'm getting this error message a lot (too much is more like it):

Error: Could not request certificate: The certificate retrieved from the master does not match the agent's private key.
Certificate fingerprint: FINGERPRINT
To fix this, remove the certificate from both the master and the agent and then start a puppet run, which will automatically regenerate a certficate.
On the master:
  puppet cert clean SERVERNAME
On the agent:
  1a. On most platforms: find C:/ProgramData/PuppetLabs/puppet/etc/ssl -name SERVERNAME.pem -delete
  1b. On Windows: del "C:/ProgramData/PuppetLabs/puppet/etc/ssl/SERVERNAME" /f
  2. puppet agent -t

Some characteristics:
This is on newly provisioned hosts (provisioned from Foreman)
The machines are running various flavours of Windows Server
Puppet Agent version is 3.8.7 (upgrade to a 4 release is in the pipe)
We have two VMware clusters and this occurs on both (the checkbox for time sync with the hardware host is NOT checked)

I actually had this problem from the start, but back then it occurred so seldom, on roughly 1 in 20 machines, that I decided to live with it. Now it has escalated to the point where only about 1 in 20 machines gets a working certificate from the start. When I started banging my head against the wall again yesterday, I got two machines working after adding an extra time sync to the provisioning workflow, but that happiness was short-lived: I've built 3 more machines since then with no success.

So my first suspects were time and a change of "security context", but I think they're off the hook for the moment, as I'm pretty confident that my time is right and that, to my knowledge, I have stayed in the same security context.

To make sure that I got the time right, I have this running under the oobeSystem step in my provisioning workflow:
powershell.exe -noprofile -executionpolicy bypass -command "& {Start-Service W32Time -ErrorAction SilentlyContinue; .\w32tm.exe /resync}"

After installing chocolatey and the puppet agent the agent phones home like this (command composed from how this is done in the Linux half of our department):
powershell.exe -noprofile -executionpolicy bypass -command " & {& 'C:\Program Files\Puppet Labs\Puppet\bin\puppet.bat' agent -o --tags no_such_tag --no-daemonize}"

The user logging on and running the commands is the local administrator account. To be extra thorough, I logged on as that account and tried to run puppet agent -t after the host was built, just to be sure there was no logon-account-related stuff going on, but it made no difference.

Following the steps in the error message and generating a new certificate of course works, but we can all see the inconvenience of doing that constantly on newly provisioned hosts, right?

I think that sums things up quite well. As said, I've been banging my head against this. I'm not ruling out that something fishy is going on on the puppetmaster, which is not managed by me, but my colleagues in the Linux neighborhood don't experience this, so it seemingly has something to do with the Windows hosts.

Cheers,
Fredrik

Andrew

Oct 9, 2016, 10:30:32 PM
to Puppet Users
I recently had a similar issue, but not on Windows. To fix it, I replaced the puppet root CA with a SHA-256 cert instead of the older SHA-1 one.
This of course meant re-signing all the client certs, which for me was about 4 hours' worth of logging into every computer. My cut'n'paste fu is strong now ....

Replacing the puppet CA with the newer one fixed the errors, though. Sorry I don't have an easier fix for you :(

Andrew.

Fredrik Nilsson

Oct 12, 2016, 4:55:09 AM
to Puppet Users
Thanks for your reply, Andrew. Sadly I guess that won't be an option, as the pain of re-signing the certificate for each erroneous host is less than that of re-signing every certificate for all existing hosts. After all, we are in the process of upgrading to Puppet 4, so hopefully one of the side effects of that upgrade is that this error goes away. Thanks though, one should always train one's cut'n'paste skills ;-).

Josh Cooper

Oct 12, 2016, 12:32:14 PM
to puppet...@googlegroups.com
The --(no-)daemonize flags are actually meaningless on Windows, and a while ago I changed the default value of daemonize to false on Windows.

The reason is that services work differently on Windows than on most *nix. On *nix, the process typically forks, creates a new session, detaches from the old one, etc. On Windows, the logic is inverted. The Service Control Manager (SCM) starts the process, and the process needs to communicate back with the SCM in a specific way. Rather than add SCM-specific logic to puppet, we have a daemon.rb shim. So the SCM runs rubyw.exe daemon.rb, and that runs puppet agent every runinterval seconds.

So back to the issue above. The problem is that `puppet agent --no-daemonize` will run the agent so it connects to the puppet master every 30 minutes! That command will block until you Ctrl-C. But your powershell command is running puppet asynchronously. Process Explorer is handy for debugging that.

Later when the Service Control Manager starts the Puppet service, it is going to race with the instance you started above. Due to race conditions in puppet's SSL bootstrapping process, you can get into a situation where one instance creates a keypair and submits a CSR. And before the cert is signed, the second instance sees there's no cert, and generates a new key pair, overwriting the old one. The first instance then downloads the signed cert, which doesn't match the new key pair.
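
The race described above can be made concrete with a toy simulation. This is a minimal Python sketch, not Puppet's actual code: the shared ssl directory is a dict, key pairs are plain strings, and the master is assumed to sign only the first CSR it receives for the host. The names `simulate`, `generate_and_submit` and `fetch_cert` are made up for illustration.

```python
def simulate(racing: bool) -> bool:
    """Return True if the certificate on disk matches the private key on disk."""
    disk = {"key": None, "cert_for": None}  # the agent's shared ssl directory
    master = {"csr_key": None}              # assumption: master keeps the first CSR only

    def generate_and_submit(instance: str) -> None:
        # Each instance checks for a cert; if none exists it makes a new key
        # pair (overwriting any existing one) and submits a CSR.
        if disk["cert_for"] is None:
            disk["key"] = f"key-from-{instance}"
            if master["csr_key"] is None:
                master["csr_key"] = disk["key"]

    def fetch_cert() -> None:
        # The master signs the CSR it holds; the agent stores that cert.
        disk["cert_for"] = master["csr_key"]

    generate_and_submit("manual-run")
    if racing:
        generate_and_submit("service")  # starts before the cert has arrived
    fetch_cert()
    if not racing:
        generate_and_submit("service")  # cert already present: key untouched
    return disk["cert_for"] == disk["key"]
```

In this model `simulate(False)` returns `True` (the second instance finds a cert and leaves the key alone), while `simulate(True)` returns `False`: the cert was issued for the first key, but the second instance has already overwritten it.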

To fix the problem you'll want to run puppet using `'C:\Program Files\Puppet Labs\Puppet\bin\puppet.bat' agent -o --tags no_such_tag --onetime` and make the powershell command synchronous.
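
As a rough illustration of what a blocking one-shot run looks like, here is a minimal Python sketch. The provisioning in this thread uses powershell; Python is used here only to show the idea. `subprocess.run` waits for the child process to exit before it returns, so nothing later in the sequence can overlap with the agent. The helper names and the injectable `runner` parameter are hypothetical; the path and flags are the ones quoted in this thread.

```python
import subprocess

# Path as quoted in this thread; adjust if the agent is installed elsewhere.
PUPPET_BAT = r"C:\Program Files\Puppet Labs\Puppet\bin\puppet.bat"


def build_onetime_command(puppet=PUPPET_BAT):
    """Argument list for a single agent run that applies no tagged resources."""
    return [puppet, "agent", "--onetime", "--tags", "no_such_tag"]


def run_onetime(puppet=PUPPET_BAT, runner=subprocess.run):
    """Run the agent once and block until it exits; return its exit code."""
    completed = runner(build_onetime_command(puppet), check=False)
    return completed.returncode
```

The `runner` parameter exists only so the function can be exercised without a real agent installed; in normal use it would be left at its default.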
 
--
Josh Cooper
Developer, Puppet

Fredrik Nilsson

Oct 12, 2016, 3:06:39 PM
to Puppet Users
The possibility of a race condition between my manual execution and the Puppet service makes perfect sense; I didn't realize it existed before I read your reply above. As a matter of fact, the powershell command described in my post is run as one of a series of synchronous powershell commands before Windows restarts one last time to enter its normal state. As described briefly, the machine is still in an automated installation state when the command is executed. One of the commands before the manual puppet run, which is the issue here, is the installation of the puppet agent package. It is installed via chocolatey and supplied with the host address of the master and the puppet CA. So my guess, bearing in mind what you described above, is that either the service or the explicit powershell command creates the keypair, which is almost immediately overwritten by the other, resulting in the error message described. I can't investigate this using Process Explorer as I am still in an automatic installation stage, but first thing tomorrow I will remove the manual run altogether, as I think it is causing all the headache and is excessive; I presume that the Puppet service is already on top of things.... I'll post back with the results! Thanks Josh!

Fredrik Nilsson

Oct 13, 2016, 5:12:35 AM
to Puppet Users
I removed the excessive explicit run and, knock on wood, provisioned 3 hosts with no certificate error, so I think this did the trick. Thanks a lot for pointing me in the right direction, Josh!

John Gelnaw

Oct 17, 2016, 11:56:05 AM
to Puppet Users

On Wednesday, October 12, 2016 at 4:55:09 AM UTC-4, Fredrik Nilsson wrote:
Thanks for your reply, Andrew. Sadly I guess that won't be an option, as the pain of re-signing the certificate for each erroneous host is less than that of re-signing every certificate for all existing hosts. After all, we are in the process of upgrading to Puppet 4, so hopefully one of the side effects of that upgrade is that this error goes away. Thanks though, one should always train one's cut'n'paste skills ;-).

If you have an mcollective environment, you could turn on auto-sign, use mcollective to whack the local ca/host certs, and then use mcollective to trigger a puppet run (which would auto-request a new certificate).

Since my puppet environment is now 5 years old, I'm experiencing a rolling expiration of puppet agent certs, and I wrote a script that lives on the puppet master that checks for impending expirations, and if it finds them, it runs:

puppet cert clean <hostname>
mco puppet resource exec "rm -rf /var/lib/puppet/ssl/*" -W fqdn=<hostname>
mco puppet runonce -W fqdn=<hostname>
puppet cert sign <hostname>

... there's a bit of a tricky timing issue that (usually) doesn't matter, since we configure mcollective to actually use the puppet agent's certificate/key pair.
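
The overall shape of such a script could look like the following minimal Python sketch. It is not the script from this post: how the expiry dates are read from the CA is left out, the 30-day window is an assumed default, and the function names are hypothetical. The four commands are the ones listed above, in the same order.

```python
from datetime import datetime, timedelta


def expiring_hosts(not_after_by_host, now, within_days=30):
    """Return hostnames whose certificate expires within `within_days` of `now`.

    `not_after_by_host` maps hostname -> certificate expiry (datetime).
    The 30-day default is an assumption, not a value from the thread.
    """
    cutoff = now + timedelta(days=within_days)
    return sorted(h for h, not_after in not_after_by_host.items() if not_after <= cutoff)


def rotation_commands(hostname):
    """The command sequence quoted above, filled in for one host."""
    return [
        f"puppet cert clean {hostname}",
        f'mco puppet resource exec "rm -rf /var/lib/puppet/ssl/*" -W fqdn={hostname}',
        f"mco puppet runonce -W fqdn={hostname}",
        f"puppet cert sign {hostname}",
    ]
```

Returning the commands as strings rather than executing them keeps the sketch side-effect free; a real script would run each one in order and stop on the first failure.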

If you're doing this on Windows, the equivalent powershell-fu shouldn't be too tough.  You'd probably want to stop the puppet service on the agent, nuke the certs/keys, and then invoke a single synchronous run of puppet to request the new certificate.
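
The three steps just described (stop the service, remove the certs/keys, one blocking run) can be sketched as follows. This is a hedged Python illustration rather than the powershell a Windows admin would probably write. The service name `puppet` is an assumption, the ssl path is the one quoted in the error message earlier in the thread, and the `run` and `rmtree` parameters are hypothetical hooks that make the function testable without touching a real host.

```python
import shutil
import subprocess

SSL_DIR = r"C:\ProgramData\PuppetLabs\puppet\etc\ssl"  # path from the error message
PUPPET_BAT = r"C:\Program Files\Puppet Labs\Puppet\bin\puppet.bat"
SERVICE = "puppet"  # assumption: Windows service name of the agent


def reset_agent_cert(run=subprocess.run, rmtree=shutil.rmtree):
    """Stop the agent service, delete its ssl directory, then run puppet once."""
    # 1. Stop the service so it cannot race with the manual run below.
    run(["net", "stop", SERVICE], check=False)
    # 2. Remove the agent's keys and certificates.
    rmtree(SSL_DIR, ignore_errors=True)
    # 3. One blocking run that generates a new key pair and requests a cert.
    result = run([PUPPET_BAT, "agent", "--onetime", "--tags", "no_such_tag"], check=False)
    return result.returncode
```

Starting the service again afterwards is deliberately left out; whether and when to do that depends on how the certificate gets signed on the master.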
