Jira (PUP-10844) Agent failures with server_list when one puppetserver fails


Jarret Lavallee (Jira)

Dec 22, 2020, 11:07:03 AM
to puppe...@googlegroups.com
Jarret Lavallee created an issue
 
Puppet / Bug PUP-10844
Agent failures with server_list when one puppetserver fails
Issue Type: Bug
Affects Versions: PUP 6.15.0
Assignee: Unassigned
Created: 2020/12/22 8:06 AM
Priority: Normal
Reporter: Jarret Lavallee

Puppet Version: 6.15.0+
Puppet Server Version: 6.x
OS Name/Version: Any

After the changes in 6.15.0, the server_list setting behaves differently. Previously, when server_list was configured and the first puppetserver in the list failed, the agent would continue its run by connecting to the next puppetserver on the list. In 6.15.0, if the primary puppetserver fails while an agent is running, the agent run fails.

Desired Behavior:
When the first puppetserver in the server_list goes offline, the agents should automatically try to connect to the second puppetserver in the server_list, even in the middle of an agent run.

Actual Behavior:
The agent run fails if the first puppetserver in the server_list goes offline while the agent is in the middle of a run.

Some failures are below.

Could not evaluate: Could not retrieve file metadata for puppet:///pe_packages/2019.8.1/windows-x86_64/puppet-agent-x64.msi: Request to https://primary.example.com:8140/puppet/v3/file_metadata/pe_packages/2019.8.1/windows-x86_64/puppet-agent-x64.msi?links=manage&checksum_type=sha256lite&source_permissions=ignore&environment=windows_testing failed after 21.011 seconds: Failed to open TCP connection to primary.example.com:8140 (A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. - connect(2) for "primary.example.com" port 8140)

puppet-agent[7053]: Could not retrieve catalog from remote server: Request to https://primary.example.com:8140/puppet/v3/catalog/agent.example.com?environment=development failed after 0.004 seconds: Failed to open TCP connection to primary.example.com:8140 (Connection refused - connect(2) for "primary.example.com" port 8140)

Reproduction
1. Configure the server_list with two Puppet servers.
2. Configure 10 agents with that server_list and a run interval of one minute.
3. Shut down the puppetserver service on the first server in the server_list.

At least one of the agents will likely hit the failure. It seems to be more reproducible when the catalog contains file resources.
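
For reference, a minimal sketch of the agent settings that the reproduction steps describe, rendered as a puppet.conf fragment. The hostname replica.example.com, the [main] section, and the helper name render_agent_config are assumptions made for illustration; only server_list and the one-minute run interval come from the steps above.

```python
import configparser
import io


def render_agent_config(servers, run_interval="1m"):
    """Render a puppet.conf fragment with server_list and runinterval.

    servers is a list of (host, port) pairs, in failover order.
    """
    config = configparser.ConfigParser()
    config["main"] = {
        # server_list is a comma-separated list of host:port entries.
        "server_list": ",".join(f"{host}:{port}" for host, port in servers),
        # A one-minute interval, as in step 2 of the reproduction.
        "runinterval": run_interval,
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Hypothetical hostnames: the second server is not named in the ticket.
print(render_agent_config([
    ("primary.example.com", 8140),
    ("replica.example.com", 8140),
]))
```

The same fragment would be placed on each of the 10 agents before stopping the service on the first server.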

We believe this is related to the changes in PUP-10363.

This message was sent by Atlassian Jira (v8.5.2#805002-sha1:a66f935)

zendesk.jira (Jira)

Dec 22, 2020, 11:08:04 AM
to puppe...@googlegroups.com
zendesk.jira updated an issue
Change By: zendesk.jira
Zendesk Ticket Count: 1
Zendesk Ticket IDs: 40535

Josh Cooper (Jira)

Jan 4, 2021, 1:29:04 PM
to puppe...@googlegroups.com
Josh Cooper updated an issue
Change By: Josh Cooper
Labels: jira_escalated platform_7.2

Mihai Buzgau (Jira)

Jan 20, 2021, 5:29:04 AM
to puppe...@googlegroups.com
Mihai Buzgau updated an issue
Change By: Mihai Buzgau
Sprint: NW - 2021-01-20, NW - 2021-02-03

Dorin Pleava (Jira)

Jan 26, 2021, 8:08:03 AM
to puppe...@googlegroups.com
Dorin Pleava commented on Bug PUP-10844
 
Re: Agent failures with server_list when one puppetserver fails

After some digging on versions 5.5.21, 6.13.0, 6.14.0 and the newly released 6.20.0, I found that only 6.14.0 behaved differently.

On 6.14.0, when a server failed mid-run, the part that was running at the time (such as Retrieving pluginfacts or Retrieving facts) would fail, and the code would continue to the next part (Retrieving locales), where it would check again for an available server and choose the next functional server from server_list. I think this is not the intended functionality, as it could mix catalogs from different servers.
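
To make the 6.14.0 behavior described above concrete, here is a simplified model, not Puppet's actual code. It assumes a run is a sequence of named phases and that the server is re-selected from server_list before each phase; the function names and the short server names "primary" and "replica" are hypothetical.

```python
def first_reachable(server_list, down):
    """Return the first server that is not in the set of down servers."""
    for server in server_list:
        if server not in down:
            return server
    return None


def run_per_phase(server_list, phases, fails_during):
    """Model a run that re-resolves server_list before every phase.

    fails_during names the phase during which the first server goes offline.
    """
    down = set()
    results = []
    for phase in phases:
        server = first_reachable(server_list, down)
        if phase == fails_during:
            down.add(server_list[0])  # simulated outage in the middle of this phase
        ok = server is not None and server not in down
        results.append((phase, server, "ok" if ok else "failed"))
    return results


results = run_per_phase(
    ["primary", "replica"],
    ["pluginfacts", "plugin", "locales", "catalog"],
    fails_during="plugin",
)
for phase, server, status in results:
    print(f"{phase:12} {server:8} {status}")
```

In this model the "plugin" phase fails on the first server and the later phases silently move to the second one, so a single run ends up with data from two servers, which is the mixing the comment warns about.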

 

[root@blue-bumper ~]# puppet --version
6.14.0
[root@blue-bumper ~]# puppet agent -t --debug
...
Debug: Creating new connection for https://past-medication.delivery.puppetlabs.net:8140
Debug: Starting connection for https://past-medication.delivery.puppetlabs.net:8140
Error: Could not retrieve catalog from remote server: Request to https://past-medication.delivery.puppetlabs.net:8140/puppet/v3/catalog/blue-bumper.delivery.puppetlabs.net?environment=production failed after 0.001 seconds: Failed to open TCP connection to past-medication.delivery.puppetlabs.net:8140 (Connection refused - connect(2) for "past-medication.delivery.puppetlabs.net" port 8140)
Wrapped exception:
Failed to open TCP connection to past-medication.delivery.puppetlabs.net:8140 (Connection refused - connect(2) for "past-medication.delivery.puppetlabs.net" port 8140)
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
Debug: Resolving service 'report' using Puppet::HTTP::Resolver::ServerList
Debug: Creating new connection for https://past-medication.delivery.puppetlabs.net:8140
Debug: Starting connection for https://past-medication.delivery.puppetlabs.net:8140
Debug: Unable to connect to server from server_list setting: Request to https://past-medication.delivery.puppetlabs.net:8140/status/v1/simple/master failed after 0.001 seconds: Failed to open TCP connection to past-medication.delivery.puppetlabs.net:8140 (Connection refused - connect(2) for "past-medication.delivery.puppetlabs.net" port 8140)
Debug: Closing connection for https://full-ink.delivery.puppetlabs.net:8140
Debug: Creating new connection for https://full-ink.delivery.puppetlabs.net:8140
Debug: Starting connection for https://full-ink.delivery.puppetlabs.net:8140
Debug: Using TLSv1.2 with cipher DHE-RSA-AES128-GCM-SHA256
Debug: HTTP GET https://full-ink.delivery.puppetlabs.net:8140/status/v1/simple/master returned 200 OK
...

 

I think the current implementation is OK: if part of the execution fails, puppet should not try to recover within the same run.
The next run should select the next available functional server and use it for the rest of that run.
If puppet were to recover from a server connection error, I think it would be best to retry the whole run, as I saw some differences in the catalog between running puppet against the puppetserver from a PE deployment and running it against the puppetserver from a compiler node.

I would say that 6.14.0 had a bug where it mixed puppetservers from server_list when a mid-run failure occurred, and I would close this ticket, as this no longer happens on versions > 6.15.0, where the first functional server is used throughout the puppet run.
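
The alternative mentioned above, retrying the whole run instead of switching servers mid-run, could look roughly like the following. This is a sketch under stated assumptions, not something Puppet implements: the ConnectionFailed exception, the function names, the max_attempts parameter and the short server names are all hypothetical.

```python
class ConnectionFailed(Exception):
    """Raised when a request to the selected server cannot connect."""


def run_once(server_list, phases, is_up, request):
    """Pick the first functional server and use it for every phase."""
    server = next((s for s in server_list if is_up(s)), None)
    if server is None:
        raise ConnectionFailed("no functional server in server_list")
    return [request(server, phase) for phase in phases]


def run_with_whole_run_retry(server_list, phases, is_up, request, max_attempts=2):
    """Restart the run from the beginning after a connection failure."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return run_once(server_list, phases, is_up, request)
        except ConnectionFailed as error:
            last_error = error
    raise last_error


# Simulation: the first server goes offline during the catalog request.
down = set()


def is_up(server):
    return server not in down


def request(server, phase):
    if server == "primary" and phase == "catalog":
        down.add("primary")
    if server in down:
        raise ConnectionFailed(f"{phase} failed on {server}")
    return (phase, server)


result = run_with_whole_run_retry(
    ["primary", "replica"], ["pluginfacts", "catalog"], is_up, request
)
print(result)
```

Because the second attempt starts over, every phase in the final result comes from the same server, which avoids combining plugins from one server with a catalog from another.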

 

Jarret Lavallee (Jira)

Jan 26, 2021, 11:14:02 AM
to puppe...@googlegroups.com

Josh Cooper and Dorin Pleava, thank you for looking into this more deeply and providing some great analysis. I think you are correct about the desired behavior, and we should close this ticket.

Josh Cooper (Jira)

Jan 26, 2021, 11:30:05 AM
to puppe...@googlegroups.com
Josh Cooper commented on Bug PUP-10844

Jarret Lavallee, sounds good. Also, to summarize this issue: the current 6.x behavior matches how 5.x worked. It just so happened that 6.14.0 would process server_list for every REST request, regardless of whether the previous request succeeded.

Vadym Chepkov (Jira)

Feb 18, 2021, 11:57:05 AM
to puppe...@googlegroups.com

That's not my experience. I have 6.19.1 in PE 2019.8.4, and each time I have to restart the primary server for patching or other maintenance, dozens of agents fail, which defeats the purpose of HA.
When we had PE 2018.1.x, agents worked without failures.

Nick Walker (Jira)

Feb 26, 2021, 2:27:03 PM
to puppe...@googlegroups.com
Nick Walker commented on Bug PUP-10844

Vadym Chepkov reports this is still happening in his install. He will report back on whether it still happens after he upgrades to PE 2019.8.5.

Vadym Chepkov (Jira)

Mar 3, 2021, 7:37:01 AM
to puppe...@googlegroups.com

I have upgraded the non-prod environment and non-prod nodes, and the problem persists in PE 2019.8.5 with puppet 6.21.1. The root cause given in the ticket description may not be accurate, but the method Jarret Lavallee used is still valid.

I extracted events from Splunk after I shut down pe-puppetserver on the primary:

puppet-agents.txt

 

Vadym Chepkov (Jira)

Mar 3, 2021, 7:37:01 AM
to puppe...@googlegroups.com
Vadym Chepkov updated an issue
 
Change By: Vadym Chepkov
Attachment: puppet-agents.txt

Vadym Chepkov (Jira)

Mar 3, 2021, 7:46:03 AM
to puppe...@googlegroups.com
Vadym Chepkov updated an issue
Change By: Vadym Chepkov
Attachment: puppet-agents.txt

Vadym Chepkov (Jira)

Mar 3, 2021, 8:35:01 AM
to puppe...@googlegroups.com
 
Re: Agent failures with server_list when one puppetserver fails

Something occurred to me. Is it possible that the problem is on the 'presentation' side rather than in functionality?

I looked through puppetserver.log on the replica and I do see nodes connecting to it during primary shutdown:

2021-03-03T07:27:53.295-05:00 INFO  [qtp1835431929-14632] [puppetserver] Puppet Not using expired facts for pubtstx-web104.example.com from cache; expired at 2020-09-01 11:55:17 -0400
2021-03-03T07:27:53.356-05:00 INFO  [qtp1835431929-14632] [puppetserver] Puppet Caching facts for pubtstx-web104.example.com
2021-03-03T07:27:55.563-05:00 ERROR [clojure-agent-send-off-pool-13553] [p.e.file-sync-errors] File Sync failure during sync or fetch phase: Couldn't connect to server (https://infdevx-puppet202.example.com:8140/file-sync/v1/latest-commits): (Connection refused).

So, maybe the problem is with how the agent handles the exception? Ideally, the agent shouldn't throw an error into the log and the submitted report if it was able to recover. In the end, the problem manifests itself as Splunk and the report processor raising false alerts.

 

 

Luchian Nemes (Jira)

Apr 9, 2021, 4:14:03 AM
to puppe...@googlegroups.com
Luchian Nemes updated an issue
 
Change By: Luchian Nemes
Fix Version/s: PUP 6.22.0
This message was sent by Atlassian Jira (v8.13.2#813002-sha1:c495a97)

Mihai Buzgau (Jira)

Apr 13, 2021, 9:41:04 AM
to puppe...@googlegroups.com
Mihai Buzgau updated an issue
Change By: Mihai Buzgau
Sprint: NW - 2021-01-20, NW - 2021-02-03, NW-2021-04-28

Dorin Pleava (Jira)

Apr 20, 2021, 6:43:04 AM
to puppe...@googlegroups.com
Dorin Pleava commented on Bug PUP-10844
 
Re: Agent failures with server_list when one puppetserver fails

I think now I understand what the issue was:

When puppet processes server_list and tries to find a functional server, it goes through each server, and if it cannot connect it throws an error, but it still moves on to the next server in server_list.

Now it only throws a warning for each server it cannot connect to, and if no server from server_list is functional, then it throws an error.

[root@twin-federalism puppet]# puppet config print server_list
full-maturation.delivery.puppetlabs.net,crash-leapfrog.delivery.puppetlabs.net
[root@twin-federalism puppet]# puppet agent -t
Warning: Unable to connect to server from server_list setting: Request to https://full-maturation.delivery.puppetlabs.net:8140/status/v1/simple/server failed after 0.002 seconds: Failed to open TCP connection to full-maturation.delivery.puppetlabs.net:8140 (Connection refused - connect(2) for "full-maturation.delivery.puppetlabs.net" port 8140) Trying with next server from server_list.
Info: Using configured environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Info: Caching catalog for twin-federalism.delivery.puppetlabs.net
Info: Applying configuration version '1618915174'
Notice: Applied catalog in 0.04 seconds
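
The selection logic described in this comment, a warning per unreachable server and an error only when none is functional, can be sketched as follows. This is an illustration and not Puppet's implementation: resolve_server, probe and NoFunctionalServer are hypothetical names, and only the warning wording is taken from the output above.

```python
import logging

logging.basicConfig(format="%(levelname)s: %(message)s")
logger = logging.getLogger("server_list_sketch")


class NoFunctionalServer(Exception):
    """Raised when every server in server_list is unreachable."""


def resolve_server(server_list, probe):
    """Return the first server for which probe(server) does not raise.

    An unreachable server only produces a warning; an error is raised
    once the whole list has been tried without success.
    """
    for server in server_list:
        try:
            probe(server)
        except OSError as error:
            logger.warning(
                "Unable to connect to server from server_list setting: %s "
                "Trying with next server from server_list.",
                error,
            )
            continue
        return server
    raise NoFunctionalServer("no functional server found in server_list")


def probe(server):
    """Stand-in for the status check; the first server refuses connections."""
    if server.startswith("full-maturation"):
        raise ConnectionRefusedError(f"Connection refused for {server}")


selected = resolve_server(
    [
        "full-maturation.delivery.puppetlabs.net:8140",
        "crash-leapfrog.delivery.puppetlabs.net:8140",
    ],
    probe,
)
print(selected)
```

With this shape, a failover that succeeds leaves only a warning behind, so a report consumer that alerts on errors would stay quiet unless the whole list is down.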

 

Josh Cooper (Jira)

Apr 23, 2021, 1:02:04 PM
to puppe...@googlegroups.com
Josh Cooper updated an issue
 
Change By: Josh Cooper
Fix Version/s: PUP 7.7.0
Fix Version/s: PUP 6.23.0

Claire Cadman (Jira)

May 18, 2021, 10:06:01 AM
to puppe...@googlegroups.com
Claire Cadman updated an issue
Change By: Claire Cadman
Labels: doc-reviewed jira_escalated platform_7.2

Vadym Chepkov (Jira)

Jun 26, 2021, 1:23:02 PM
to puppe...@googlegroups.com
Vadym Chepkov commented on Bug PUP-10844
 
Re: Agent failures with server_list when one puppetserver fails

I have upgraded the non-prod infrastructure to 2019.8.7 and the agents to 6.23.0. Red Hat Linux agents produce the following log entry, but the run succeeds:
 

Jun 26 13:05:38 sdltstx-jira902 puppet-agent[23647]: Unable to connect to server from server_list setting: Request to https://infdevx-puppet202.example.com:8140/status/v1/simple/master failed after 0.007 seconds: Failed to open TCP connection to infdevx-example202.bnaint.com:8140 (Connection refused - connect(2) for "infdevx-example202.bnaint.com" port 8140) Trying with next server from server_list.

Windows nodes, unfortunately, still fail:

Could not retrieve catalog from remote server: Request to https://infdevx-puppet202.example.com:8140/puppet/v3/catalog/infdevw-mdt001.example.com?environment=windows_development failed after 1.027 seconds: Failed to open TCP connection to infdevx-puppet202.example.com:8140 (No connection could be made because the target machine actively refused it. - connect(2) for "infdevx-puppet202.example.com" port 8140)
 
Wrapped exception:
 
Failed to open TCP connection to infdevx-puppet202.example.com:8140 (No connection could be made because the target machine actively refused it. - connect(2) for "infdevx-puppet202.example.com" port 8140)
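
For anyone filtering these events in a log pipeline, here is a rough sketch of how the two kinds of messages quoted in this thread could be told apart. The patterns are derived only from the log lines shown here, and the labels "fallback" and "failed" and the function name classify are hypothetical. A "fallback" line means the agent moved on to the next server; on its own it does not prove that the run succeeded.

```python
import re

# The agent could not reach one server and moved on to the next one.
FALLBACK = re.compile(
    r"Unable to connect to server from server_list setting: "
    r".*Trying with next server from server_list\."
)
# A request inside the run itself failed.
FAILED = re.compile(r"Could not retrieve (catalog|file metadata)")


def classify(line):
    """Label a puppet-agent log line as 'fallback', 'failed' or 'other'."""
    if FALLBACK.search(line):
        return "fallback"
    if FAILED.search(line):
        return "failed"
    return "other"


samples = [
    "puppet-agent[23647]: Unable to connect to server from server_list setting: "
    "Request to https://infdevx-puppet202.example.com:8140/status/v1/simple/master "
    "failed after 0.007 seconds. Trying with next server from server_list.",
    "Could not retrieve catalog from remote server: Request to "
    "https://infdevx-puppet202.example.com:8140/puppet/v3/catalog/"
    "infdevw-mdt001.example.com?environment=windows_development failed",
    "Applied catalog in 0.04 seconds",
]
labels = [classify(line) for line in samples]
print(labels)
```

An alert rule could then ignore "fallback" lines and fire only on "failed" ones.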

 
