Jira (PUP-10844) Agent failures with server_list when one puppetserver fails

Jarret Lavallee (Jira)

Dec 22, 2020, 11:07:03 AM

to puppe...@googlegroups.com

Jarret Lavallee created an issue

Puppet /

PUP-10844

Agent failures with server_list when one puppetserver fails

Issue Type:	Bug
Affects Versions:	PUP 6.15.0
Assignee:	Unassigned
Created:	2020/12/22 8:06 AM
Priority:	Normal
Reporter:	Jarret Lavallee

Puppet Version: 6.15.0+
Puppet Server Version: 6.x
OS Name/Version: Any

After the changes in 6.15.0, the server_list setting behaves differently. Previously, when server_list was configured and the first puppetserver in the list failed, the agent would continue its run by connecting to the next puppetserver in the list. In 6.15.0, if the primary puppetserver fails while an agent is running, the agent run fails.

Desired Behavior:
When the first puppetserver in the server_list goes offline, agents should automatically try to connect to the second puppetserver in the server_list, even in the middle of an agent run.

Actual Behavior:
The agent run fails if the first puppetserver in the server_list goes offline while the agent is in the middle of a run.

Some failures are below.

Could not evaluate: Could not retrieve file metadata for puppet:///pe_packages/2019.8.1/windows-x86_64/puppet-agent-x64.msi: Request to https://primary.example.com:8140/puppet/v3/file_metadata/pe_packages/2019.8.1/windows-x86_64/puppet-agent-x64.msi?links=manage&checksum_type=sha256lite&source_permissions=ignore&environment=windows_testing failed after 21.011 seconds: Failed to open TCP connection to primary.example.com:8140 (A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. - connect(2) for "primary.example.com" port 8140)

puppet-agent[7053]: Could not retrieve catalog from remote server: Request to https://primary.example.com:8140/puppet/v3/catalog/agent.example.com?environment=development failed after 0.004 seconds: Failed to open TCP connection to primary.example.com:8140 (Connection refused - connect(2) for "primary.example.com" port 8140)

Reproduction
1. Configure server_list with two Puppet Server hosts
2. Configure 10 agents with that server_list and a run interval of one minute
3. Shut down the puppetserver service on the first server in the server_list

At least one of the agents will likely hit the failure. It seems easier to reproduce when the catalog contains file resources.

We believe this is related to the changes in PUP-10363.

This message was sent by Atlassian Jira (v8.5.2#805002-sha1:a66f935)

zendesk.jira (Jira)

Dec 22, 2020, 11:08:03 AM

to puppe...@googlegroups.com

zendesk.jira updated an issue

Puppet /

PUP-10844

Agent failures with server_list when one puppetserver fails

Change By:	zendesk.jira
Labels:	jira_escalated

zendesk.jira (Jira)

Dec 22, 2020, 11:08:04 AM

to puppe...@googlegroups.com

zendesk.jira updated an issue

Puppet /

PUP-10844

Agent failures with server_list when one puppetserver fails

Change By:	zendesk.jira
Zendesk Ticket Count:	1
Zendesk Ticket IDs:	40535

Josh Cooper (Jira)

Jan 4, 2021, 1:29:04 PM

to puppe...@googlegroups.com

Josh Cooper updated an issue

Puppet /

PUP-10844

Agent failures with server_list when one puppetserver fails

Change By:	Josh Cooper
Labels:	jira_escalated platform_7.2

Gheorghe Popescu (Jira)

Jan 5, 2021, 10:52:03 AM

to puppe...@googlegroups.com

Gheorghe Popescu updated an issue

Puppet /

PUP-10844

Agent failures with server_list when one puppetserver fails

Change By:	Gheorghe Popescu
Sprint:	NW - 2021-01-20

Gheorghe Popescu (Jira)

Jan 5, 2021, 10:52:04 AM

to puppe...@googlegroups.com

Gheorghe Popescu updated an issue

Puppet /

PUP-10844

Agent failures with server_list when one puppetserver fails

Change By:	Gheorghe Popescu
Team:	Coremunity Night's Watch

Mihai Buzgau (Jira)

Jan 6, 2021, 4:43:03 AM

to puppe...@googlegroups.com

Mihai Buzgau updated an issue

Puppet /

PUP-10844

Agent failures with server_list when one puppetserver fails

Change By:	Mihai Buzgau
Story Points:	5

Dorin Pleava (Jira)

Jan 11, 2021, 9:45:04 AM

to puppe...@googlegroups.com

Dorin Pleava assigned an issue to Dorin Pleava

Puppet /

PUP-10844

Agent failures with server_list when one puppetserver fails

Change By:	Dorin Pleava
Assignee:	Dorin Pleava

Mihai Buzgau (Jira)

Jan 20, 2021, 5:29:04 AM

to puppe...@googlegroups.com

Mihai Buzgau updated an issue

Puppet /

PUP-10844

Agent failures with server_list when one puppetserver fails

Change By:	Mihai Buzgau
Sprint:	NW - 2021-01-20 , NW - 2021-02-03

Dorin Pleava (Jira)

Jan 26, 2021, 8:08:03 AM

to puppe...@googlegroups.com

Dorin Pleava commented on

PUP-10844

Re: Agent failures with server_list when one puppetserver fails

After some digging on versions 5.5.21, 6.13.0, 6.14.0 and the newly released 6.20.0, I found that only 6.14.0 behaved differently.

On 6.14.0, when a server failed mid-run, the step in progress would fail (for example, Retrieving pluginfacts or Retrieving facts) and the code would continue to the next step (Retrieving locales), where it would check again for an available server and choose the next functional server from server_list. I think this is not the intended functionality, as it could cause a mix of catalogs from different servers.

[root@blue-bumper ~]# puppet --version

6.14.0

[root@blue-bumper ~]# puppet agent -t --debug

...

Debug: Creating new connection for https://past-medication.delivery.puppetlabs.net:8140

Debug: Starting connection for https://past-medication.delivery.puppetlabs.net:8140

Error: Could not retrieve catalog from remote server: Request to https://past-medication.delivery.puppetlabs.net:8140/puppet/v3/catalog/blue-bumper.delivery.puppetlabs.net?environment=production failed after 0.001 seconds: Failed to open TCP connection to past-medication.delivery.puppetlabs.net:8140 (Connection refused - connect(2) for "past-medication.delivery.puppetlabs.net" port 8140)

Wrapped exception:

Failed to open TCP connection to past-medication.delivery.puppetlabs.net:8140 (Connection refused - connect(2) for "past-medication.delivery.puppetlabs.net" port 8140)

Warning: Not using cache on failed catalog

Error: Could not retrieve catalog; skipping run

Debug: Resolving service 'report' using Puppet::HTTP::Resolver::ServerList

Debug: Creating new connection for https://past-medication.delivery.puppetlabs.net:8140

Debug: Starting connection for https://past-medication.delivery.puppetlabs.net:8140

Debug: Unable to connect to server from server_list setting: Request to https://past-medication.delivery.puppetlabs.net:8140/status/v1/simple/master failed after 0.001 seconds: Failed to open TCP connection to past-medication.delivery.puppetlabs.net:8140 (Connection refused - connect(2) for "past-medication.delivery.puppetlabs.net" port 8140)

Debug: Closing connection for https://full-ink.delivery.puppetlabs.net:8140

Debug: Creating new connection for https://full-ink.delivery.puppetlabs.net:8140

Debug: Starting connection for https://full-ink.delivery.puppetlabs.net:8140

Debug: Using TLSv1.2 with cipher DHE-RSA-AES128-GCM-SHA256

Debug: HTTP GET https://full-ink.delivery.puppetlabs.net:8140/status/v1/simple/master returned 200 OK

...

I think the current implementation is OK: if part of the run fails, puppet should not try to recover within the same run.
The next run should select the next available functional server and use it for the rest of that run.
If puppet were to recover from a server connection error, I think it would be best to retry the whole run, as I saw some differences in the catalog between running puppet against the puppetserver from a PE deployment and running it against the puppetserver on a compiler node.

I would say that 6.14.0 had a bug where it mixed puppetservers from server_list when a mid-run failure occurred, and I would close this ticket, as this no longer happens on 6.15.0 and later, where the first functional server is used throughout the puppet run.

Jarret Lavallee (Jira)

Jan 26, 2021, 11:14:02 AM

to puppe...@googlegroups.com

Jarret Lavallee commented on

PUP-10844

Re: Agent failures with server_list when one puppetserver fails

Josh Cooper and Dorin Pleava, thank you for looking into this more deeply and providing some great analysis. I think you are correct about the desired behavior and we should close this ticket.

Josh Cooper (Jira)

Jan 26, 2021, 11:30:05 AM

to puppe...@googlegroups.com

Josh Cooper commented on

PUP-10844

Re: Agent failures with server_list when one puppetserver fails

Jarret Lavallee, sounds good. Also, to summarize this issue: the current 6.x behavior matches how 5.x worked. It just so happened that 6.14.0 would process server_list for every REST request, regardless of whether the previous request succeeded or not.
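
The difference between the two behaviors can be illustrated with a small simulation. This is a minimal Python sketch, not Puppet's actual code: the function names, the step names and the `is_up(server, step)` callback are all hypothetical, and the per-request model follows the summary above in simplified form.

```python
from typing import Callable, List, Tuple


class ServerUnreachable(Exception):
    """Raised when no simulated server accepts the connection."""


def resolve(server_list: List[str], up: Callable[[str], bool]) -> str:
    """Return the first reachable server in server_list (hypothetical helper)."""
    for server in server_list:
        if up(server):
            return server
    raise ServerUnreachable("no server in server_list is reachable")


def run_pinned(server_list, steps, is_up) -> List[Tuple[str, str]]:
    """Model of 5.x and 6.15.0+: resolve once, then use that server for the whole run."""
    server = resolve(server_list, lambda s: is_up(s, 0))
    used = []
    for i, step in enumerate(steps):
        if not is_up(server, i):
            # The pinned server went away mid-run, so the run fails.
            raise ServerUnreachable(f"{step}: {server} is unreachable")
        used.append((step, server))
    return used


def run_per_request(server_list, steps, is_up) -> List[Tuple[str, str]]:
    """Model of 6.14.0: server_list is processed again before every request."""
    used = []
    for i, step in enumerate(steps):
        server = resolve(server_list, lambda s, i=i: is_up(s, i))
        used.append((step, server))
    return used


if __name__ == "__main__":
    steps = ["pluginfacts", "plugin", "catalog", "report"]
    servers = ["primary", "replica"]

    # Assumption for the demo: the primary goes down before the third step.
    def is_up(server: str, step: int) -> bool:
        return server == "replica" or step < 2

    print(run_per_request(servers, steps, is_up))
    try:
        run_pinned(servers, steps, is_up)
    except ServerUnreachable as err:
        print("run failed:", err)
```

In this model the per-request variant finishes the run but talks to two different servers, which is the mixing concern raised earlier in the thread, while the pinned variant fails the run and leaves failover to the next run.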

Vadym Chepkov (Jira)

Feb 18, 2021, 11:57:05 AM

to puppe...@googlegroups.com

Vadym Chepkov commented on

PUP-10844

Re: Agent failures with server_list when one puppetserver fails

That's not my experience. I have 6.19.1 in PE 2019.8.4, and each time I have to restart the primary server for patching or other maintenance, dozens of agents fail, which defeats the purpose of HA.
When we had PE 2018.1.x, agents worked without failures.

Nick Walker (Jira)

Feb 26, 2021, 2:27:03 PM

to puppe...@googlegroups.com

Nick Walker commented on

PUP-10844

Re: Agent failures with server_list when one puppetserver fails

Vadym Chepkov reports this is still happening in his install. After he upgrades to PE 2019.8.5, he is going to report back on whether it still happens.

Vadym Chepkov (Jira)

Mar 3, 2021, 7:37:01 AM

to puppe...@googlegroups.com

Vadym Chepkov commented on

PUP-10844

Re: Agent failures with server_list when one puppetserver fails

I have upgraded the non-prod environment and non-prod nodes, and the problem persists in PE 2019.8.5 with puppet 6.21.1. The root cause given in the ticket description may not be accurate, but the method Jarret Lavallee used is still valid.

I extracted events from Splunk after I shut down pe-puppetserver on the primary:

puppet-agents.txt

Vadym Chepkov (Jira)

Mar 3, 2021, 7:37:01 AM

to puppe...@googlegroups.com

Vadym Chepkov updated an issue

Puppet /

PUP-10844

Agent failures with server_list when one puppetserver fails

Change By:	Vadym Chepkov
Attachment:	puppet-agents.txt

Vadym Chepkov (Jira)

Mar 3, 2021, 7:46:01 AM

to puppe...@googlegroups.com

Vadym Chepkov updated an issue

Puppet /

PUP-10844

Agent failures with server_list when one puppetserver fails

Change By:	Vadym Chepkov
Attachment:	puppet.txt

Vadym Chepkov (Jira)

Mar 3, 2021, 7:46:03 AM

to puppe...@googlegroups.com

Vadym Chepkov updated an issue

Puppet /

PUP-10844

Agent failures with server_list when one puppetserver fails

Change By:	Vadym Chepkov
Attachment:	puppet-agents.txt

Vadym Chepkov (Jira)

Mar 3, 2021, 8:35:01 AM

to puppe...@googlegroups.com

Vadym Chepkov commented on

PUP-10844

Re: Agent failures with server_list when one puppetserver fails

Something occurred to me. Is it possible the problem is on the 'presentation' side and not in the functionality?

I looked through puppetserver.log on the replica and I do see nodes connecting to it during the primary shutdown:

2021-03-03T07:27:53.295-05:00 INFO  [qtp1835431929-14632] [puppetserver] Puppet Not using expired facts for pubtstx-web104.example.com from cache; expired at 2020-09-01 11:55:17 -0400

2021-03-03T07:27:53.356-05:00 INFO  [qtp1835431929-14632] [puppetserver] Puppet Caching facts for pubtstx-web104.example.com

2021-03-03T07:27:55.563-05:00 ERROR [clojure-agent-send-off-pool-13553] [p.e.file-sync-errors] File Sync failure during sync or fetch phase: Couldn't connect to server (https://infdevx-puppet202.example.com:8140/file-sync/v1/latest-commits): (Connection refused).

So maybe the problem is with how the agent handles the exception? Ideally, the agent shouldn't throw an error in the log and in the submitted report if it was able to recover. In the end, the problem manifests itself as false alerts from Splunk and the report processor.

Luchian Nemes (Jira)

Apr 9, 2021, 4:14:03 AM

to puppe...@googlegroups.com

Luchian Nemes updated an issue

Puppet /

PUP-10844

Agent failures with server_list when one puppetserver fails

Change By:	Luchian Nemes
Fix Version/s:	PUP 6.22.0

Mihai Buzgau (Jira)

Apr 12, 2021, 3:23:02 AM

to puppe...@googlegroups.com

Mihai Buzgau updated an issue

Puppet /

PUP-10844

Agent failures with server_list when one puppetserver fails

Change By:	Mihai Buzgau
Fix Version/s:	PUP 6.22.0

Mihai Buzgau (Jira)

Apr 13, 2021, 9:41:04 AM

to puppe...@googlegroups.com

Mihai Buzgau updated an issue

Puppet /

PUP-10844

Agent failures with server_list when one puppetserver fails

Change By:	Mihai Buzgau
Sprint:	NW - 2021-01-20, NW - 2021-02-03 , NW-2021-04-28

Dorin Pleava (Jira)

Apr 20, 2021, 6:43:04 AM

to puppe...@googlegroups.com

Dorin Pleava commented on

PUP-10844

Re: Agent failures with server_list when one puppetserver fails

I think I now understand what the issue was:

When puppet processed server_list and tried to find a functional server, it went through each server, and if it could not connect it reported an error, even though it still moved on to the next server in server_list.

Now it only reports a warning for each server it cannot connect to, and it reports an error only if no server from server_list is functional.

[root@twin-federalism puppet]# puppet config print server_list

full-maturation.delivery.puppetlabs.net,crash-leapfrog.delivery.puppetlabs.net

[root@twin-federalism puppet]# puppet agent -t

Warning: Unable to connect to server from server_list setting: Request to https://full-maturation.delivery.puppetlabs.net:8140/status/v1/simple/server failed after 0.002 seconds: Failed to open TCP connection to full-maturation.delivery.puppetlabs.net:8140 (Connection refused - connect(2) for "full-maturation.delivery.puppetlabs.net" port 8140) Trying with next server from server_list.

Info: Using configured environment 'production'

Info: Retrieving pluginfacts

Info: Retrieving plugin

Info: Loading facts

Info: Caching catalog for twin-federalism.delivery.puppetlabs.net

Info: Applying configuration version '1618915174'

Notice: Applied catalog in 0.04 seconds
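
The selection logic described in this comment can be sketched as follows. This is an illustrative Python model under stated assumptions, not the merged Ruby change: `first_functional_server`, `NoFunctionalServer`, the `probe` callback and the text of the final error are hypothetical, and only the warning wording is taken from the output above.

```python
import logging
from typing import Callable, Iterable

log = logging.getLogger("server_list_sketch")


class NoFunctionalServer(Exception):
    """Raised only when every server in server_list is unreachable."""


def first_functional_server(server_list: Iterable[str],
                            probe: Callable[[str], None]) -> str:
    """Return the first server whose probe succeeds.

    `probe` is a hypothetical callback that raises OSError (for example
    ConnectionRefusedError) when the server cannot be reached.
    """
    for server in server_list:
        try:
            probe(server)
        except OSError as err:
            # An unreachable server is only a warning; keep trying.
            log.warning(
                "Unable to connect to server from server_list setting: %s "
                "Trying with next server from server_list.", err)
            continue
        return server
    # Illustrative message: an error is reported only when nothing answered.
    raise NoFunctionalServer("no functional server found in server_list")


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)

    def probe(server: str) -> None:
        # Assumption for the demo: only the second host is accepting connections.
        if server != "crash-leapfrog.delivery.puppetlabs.net":
            raise ConnectionRefusedError(f"Connection refused for {server}")

    servers = ["full-maturation.delivery.puppetlabs.net",
               "crash-leapfrog.delivery.puppetlabs.net"]
    print(first_functional_server(servers, probe))
```

With this shape, a run that recovers on the second server produces a warning rather than an error, which is the behavior the earlier comments about false alerts in reports were asking for.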

Josh Cooper (Jira)

Apr 23, 2021, 1:02:01 PM

to puppe...@googlegroups.com

Josh Cooper commented on

PUP-10844

Re: Agent failures with server_list when one puppetserver fails

Merged to 6.x in https://github.com/puppetlabs/puppet/commit/cd615a0abbe321549fcc85a3a723f5969c269ff2

Josh Cooper (Jira)

Apr 23, 2021, 1:02:04 PM

to puppe...@googlegroups.com

Josh Cooper updated an issue

Puppet /

PUP-10844

Agent failures with server_list when one puppetserver fails

Change By:	Josh Cooper
Fix Version/s:	PUP 7.7.0
Fix Version/s:	PUP 6.23.0

Claire Cadman (Jira)

May 18, 2021, 10:06:01 AM

to puppe...@googlegroups.com

Claire Cadman updated an issue

Puppet /

PUP-10844

Agent failures with server_list when one puppetserver fails

Change By:	Claire Cadman
Labels:	doc-reviewed jira_escalated platform_7.2

Vadym Chepkov (Jira)

Jun 26, 2021, 1:23:02 PM

to puppe...@googlegroups.com

Vadym Chepkov commented on

PUP-10844

Re: Agent failures with server_list when one puppetserver fails

I have upgraded the non-prod infrastructure to 2019.8.7 and agents to 6.23.0. RedHat Linux agents produce the following log entry, but the run succeeds:

Jun 26 13:05:38 sdltstx-jira902 puppet-agent[23647]: Unable to connect to server from server_list setting: Request to https://infdevx-puppet202.example.com:8140/status/v1/simple/master failed after 0.007 seconds: Failed to open TCP connection to infdevx-example202.bnaint.com:8140 (Connection refused - connect(2) for "infdevx-example202.bnaint.com" port 8140) Trying with next server from server_list.

Windows nodes, unfortunately, still fail:

Could not retrieve catalog from remote server: Request to https://infdevx-puppet202.example.com:8140/puppet/v3/catalog/infdevw-mdt001.example.com?environment=windows_development failed after 1.027 seconds: Failed to open TCP connection to infdevx-puppet202.example.com:8140 (No connection could be made because the target machine actively refused it. - connect(2) for "infdevx-puppet202.example.com" port 8140)

Wrapped exception:

Failed to open TCP connection to infdevx-puppet202.example.com:8140 (No connection could be made because the target machine actively refused it. - connect(2) for "infdevx-puppet202.example.com" port 8140)

Ciprian Badescu (Jira)

Jun 29, 2021, 11:27:01 AM

to puppe...@googlegroups.com

Ciprian Badescu commented on

PUP-10844

Re: Agent failures with server_list when one puppetserver fails

Vadym Chepkov, we created a new ticket to track the Windows-specific error (https://tickets.puppetlabs.com/browse/PUP-11134).
