Puppet Agent Hang when PuppetServer Crashes...


Matt Wise

Jan 1, 2018, 5:52:10 PM
to puppet...@googlegroups.com
Puppet Agent: 5.3.2
Puppet Server: 5.1.4 - Packaged in Docker, running on Amazon ECS

We've recently started rolling over from our ancient Puppet 3.x system to a new Puppet 5.x service. The new service consists of a PuppetServer Docker image (5.1.4) running in Amazon ECS, with hosts booting up and running Puppet Agent 5.3.2. At this point in the migration, we're running ~150-200 hosts on the new Puppet 5 system and replace ~30-80 of them daily.

We are currently tracking down a problem with our PuppetServers and their memory usage, which is causing the containers to be OOM'd a few times a day (~10 OOMs a day across ~20 containers). While we know that we need to fix this, we've seen a scary behavior on the Puppet Agent side that we could use some advice with.

At least a few times a day we now get a server hung in its boot process: the `puppet agent -t ...` run just stalls midway through. These hangs appear to coincide with the underlying PuppetServer backend the agent was connected to getting OOM-killed and going away. Obviously the OOM is a problem, but frankly I'm more concerned about the Puppet Agent getting wedged for hours without making any progress.

When this failure happens, the puppet agent never times out, fails, or throws an error. It just hangs. We've had these hangs last upwards of 4-5 hours before our systems are automatically terminated.

We've enabled debug logging, but haven't caught one of these failures yet with debug mode turned on. In the meantime, are there any known regressions, or configuration tweaks we should make, to help Puppet Agent 5.x fail faster or be more resilient in this case? I could obviously build a wrapper around Puppet to catch this behavior, but I'm hoping there are just some settings we need to tweak.

Any thoughts?

R.I.Pienaar

Jan 1, 2018, 5:54:34 PM
to puppet...@googlegroups.com


On Mon, 1 Jan 2018, at 23:51, Matt Wise wrote:
> *Puppet Agent: 5.3.2*
> *Puppet Server: 5.1.4 - Packaged in Docker, running on Amazon ECS*
I see this often with other kinds of interruption too, such as transient network issues.

I do recall a number of bugs around making this more robust; you might want to try searching the Puppet Jira.


--
R.I.Pienaar / www.devco.net / @ripienaar

John Gelnaw

Jan 2, 2018, 12:50:52 AM
to Puppet Users
On Monday, January 1, 2018 at 5:52:10 PM UTC-5, Matt Wise wrote:
Puppet Agent: 5.3.2
Puppet Server: 5.1.4 - Packaged in Docker, running on Amazon ECS

I'm running a docker-compose based puppet setup and had the same problem. The short version: increase the Java heap size for puppetserver's JRuby instances.

Using the docker-compose.yml, I added:

    environment:
      - PUPPETSERVER_JAVA_ARGS=-Xmx1024m

to the puppet stanza, which gets passed to the puppetserver init script.

We also increased the number of JRuby instances to 7, but that might be overkill (roughly 200-250 nodes).  That also means 8 gigs of memory on the docker host.
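
For reference, a rough sketch of how those two knobs fit together; the service name, paths, and values below are illustrative, not recommendations:

    # docker-compose.yml (sketch)
    services:
      puppet:
        image: puppet/puppetserver
        environment:
          # JVM heap for puppetserver, shared by all JRuby instances
          - PUPPETSERVER_JAVA_ARGS=-Xmx1024m

    # /etc/puppetlabs/puppetserver/conf.d/puppetserver.conf (sketch)
    jruby-puppet: {
        # each JRuby instance compiles one catalog at a time and keeps
        # its own copy of the loaded Ruby code in the heap
        max-active-instances: 7
    }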

The agents would eventually time out, but I seem to recall it was on the order of hours for the timeout.

Matt Wise

Jan 2, 2018, 1:08:51 AM
to puppet...@googlegroups.com
We're still tuning, but I ended up dropping our PuppetServer JRuby instance count down to 2, and I have the -Xmx setting set to 4 GB (!!). I think we have a few libraries loaded that are causing some major bloat, but we haven't had time to track that down yet.

The big concern I have is not the crashing of the servers... we can handle that. The main issue is that the Puppet Agents get into a hung state and never recover. That's not a behavior we ever saw on the older Puppet 3.x clients.


Josh Cooper

Jan 5, 2018, 2:53:25 PM
to puppet...@googlegroups.com


In Puppet 4 we added settings for configuring HTTP connect and read timeouts independently[1]. Previously they were both controlled by the configtimeout setting. The default read timeout is unlimited, so the hung agent may be stuck in a socket read. You might want to strace the stuck agent to see what it's up to.
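
For example (the PID here is made up):

    # attach to the stuck agent and watch its syscalls; a hung read
    # will typically be parked in a read()/recvfrom() on the server socket
    strace -f -p 12345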

In our upcoming 4.10.x/5.3.x releases, we've added a watchdog to kill a stuck run[2].
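
For anyone who wants to tighten this up before the watchdog ships, the relevant agent settings look roughly like this; the values are examples, not defaults or recommendations:

    # /etc/puppetlabs/puppet/puppet.conf (sketch)
    [agent]
      # fail quickly if the TCP connection can't be established
      http_connect_timeout = 30s
      # give up on a socket that stays silent; the shipped default is unlimited
      http_read_timeout = 10m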

Josh

[1] https://tickets.puppetlabs.com/browse/PUP-3666
[2] https://tickets.puppetlabs.com/browse/PUP-7517


--
Josh Cooper | Software Engineer

John Sellens

Jan 5, 2018, 3:32:06 PM
to puppet...@googlegroups.com, Josh Cooper
Hi Josh - thanks for the info.

Can I make an assertion that having the default read timeout be unlimited
is a mistake? In practical terms, anything over 60 seconds means
something is broken.

Could I suggest (without having to go and update the bug because I'm a
bad bad lazy person) that along with the watchdog you change the default
timeout to, say, 5 minutes? That's effectively infinite, but would
likely keep things from getting stuck.

(I wrote some tools back in the early puppet 3 days to run puppet the
way I wanted, and of course I included a timeout on the total run time.
There were some interesting failure modes back in the olden days.)
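
A cap like that doesn't need much machinery; a minimal sketch, with arbitrary numbers:

    # wrapper sketch: hard-cap the whole agent run at 30 minutes,
    # then SIGKILL it 60 seconds later if it still hasn't exited
    timeout --kill-after=60 30m /opt/puppetlabs/bin/puppet agent -t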

Thanks - cheers!

John

Josh Cooper

Feb 21, 2018, 1:41:08 AM
to John Sellens, puppet...@googlegroups.com
On Fri, Jan 5, 2018 at 12:31 PM, John Sellens <jsel...@syonex.com> wrote:
Hi Josh - thanks for the info.

Can I make an assertion that having the default read timeout be unlimited
is a mistake?  In practical terms, anything over 60 seconds means
something is broken.

Timeouts are hard. How does the client know 60 seconds is long enough? Compile times of ~1 min are not unthinkable, and maybe the server is just under heavy load. And if the timeout is reached, what should the client do? Retry (a bad idea, since it makes the situation worse)? Fail (a bad idea if the server was actually making progress)?


Could I suggest (without having to go and update the bug because I'm a
bad bad lazy person) that along with the watchdog you change the default
timeout to, say, 5 minutes?  That's effectively infinite, but would
likely keep things from getting stuck.

It's definitely possible for the connection to be lost between the time that the server responds and when the agent would normally receive the response. In this half-open scenario, the agent may wait indefinitely, so I agree having a timeout "less than infinite" makes sense. I'm thinking it should be strictly less than runinterval, otherwise you could have agent runs stacking up, and contending for the agent lock.
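
Concretely, that would mean keeping the two settings in a relationship like this (values illustrative):

    [agent]
      runinterval       = 30m
      # strictly less than runinterval, so a hung read can't overlap the next run
      http_read_timeout = 25m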

(I wrote some tools back in the early puppet 3 days to run puppet the
way I wanted, and of course I included a timeout on the total run time.
There were some interesting failure modes back in the olden days.)

Yeah, "interesting" is one way to put it :) Puppet 2/3 conflated TCP connect and read timeouts. And it required that the entire pluginsync operation take less than Puppet[:configtimeout] (which defaulted to 2 minutes), otherwise the agent would abort the pluginsync operation, even though it could be making progress downloading individual files (see PUP-2885)!


Thanks - cheers!

John



On Fri, 2018/01/05 11:53:12AM -0800, Josh Cooper <jo...@puppet.com> wrote:
| In Puppet 4 we added settings for configuring http connect and read
| timeouts independently[1]. Previously they were both controlled by the
| configtimeout. The default read timeout is unlimited, so the hung agent
| may be stuck in a socket read. You might want to strace the stuck agent to
| see what it's up to.
|
| In our upcoming 4.10.x/5.3.x releases, we've added a watchdog to kill a
| stuck run[2].
|
| Josh
|
| [1] https://tickets.puppetlabs.com/browse/PUP-3666
| [2] https://tickets.puppetlabs.com/browse/PUP-7517
|
| --
| Josh Cooper | Software Engineer
| jo...@puppet.com | @coopjn
|