Puppetserver 6.0.2 timeouts in the puppetserver log and on the agent side

Mike Sharpton

Feb 11, 2019, 8:42:01 AM
to Puppet Users
Hey all,

We have recently upgraded our environment from Puppetserver 4.2.2 to Puppetserver 6.0.2.  We are running a mix of Puppet 4 and Puppet 6 agents until we can get them all upgraded to 6.  We have around 6000 nodes; we had 4 Puppetservers, but we added two more due to capacity issues with Puppet 6.  The load is MUCH higher with Puppet 6.  On to the question: I am seeing longer and longer agent run times after about two days of the services running.  The only error in the logs that seems to have any relation to this is this one:

2019-02-11T04:32:28.409-06:00 ERROR [qtp1148783071-4075] [p.r.core] Internal Server Error: java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 30001/30000 ms


After I restart the puppetserver service, this goes away for about two days.  I think Puppetserver is dying a slow death under this load (load average of around 5-6).  We are running Puppetserver on VMs sized at 10 vCPUs x 8GB, with 6 JRuby workers per Puppetserver and a 4GB heap.  I have not seen any OOM exceptions and the process never crashes.  Has anyone else seen anything like this?  I did some Googling and didn't find a ton of relevant stuff.  Perhaps we need to upgrade to the latest version to see if this helps?  Even more capacity?  Seems silly.  Thanks in advance!
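
For context, here is roughly where that sizing lives on our servers (paths are the defaults from the puppetserver packages; adjust if your layout differs):

# /etc/puppetlabs/puppetserver/conf.d/puppetserver.conf
jruby-puppet: {
    max-active-instances: 6    # JRuby workers per Puppetserver
}

# /etc/sysconfig/puppetserver (RHEL) or /etc/default/puppetserver (Debian)
JAVA_ARGS="-Xms4g -Xmx4g"      # 4GB heap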

Mike

Henrik Lindberg

Feb 11, 2019, 3:06:11 PM
to puppet...@googlegroups.com
There may be a slow memory leak that over time makes the server busy
with non-productive work (scanning for garbage on an ever-increasing
heap). If you were to increase capacity, you would risk only stretching
the two days to a couple more, not actually solving the issue.

Try to look at server memory usage over the two days.
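
For example, something as simple as jstat against the puppetserver JVM
will show heap occupancy and GC activity over that period (find the PID
via systemctl status puppetserver; the interval syntax here assumes a
reasonably recent JDK):

# sample heap and GC utilization every 10 seconds
jstat -gcutil <puppetserver-java-pid> 10s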

Also, naturally, upgrade to latest and make sure modules are updated as
well.

Do you by any chance have many environments with different versions of
Ruby code? If that is the case, the environment isolation command
"puppet generate types" may help, as loaded Ruby resource types become
sticky in memory.
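
If so, a rough sketch of what that looks like, run on the server once
per long-lived environment (the environment names here are just
examples):

puppet generate types --environment production
puppet generate types --environment preprod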

- henrik

--

Visit my Blog "Puppet on the Edge"
http://puppet-on-the-edge.blogspot.se/

Mike Sharpton

Feb 11, 2019, 3:59:50 PM
to Puppet Users
Hello Henrik,

4GB is as high as I would raise the heap; as you say, GC becomes costly with big heaps.  The process memory ramps up well above the configured max heap within minutes, reaching about 5.8GB, as we manage many resources on many nodes.  We do not have many environments: we normally have a production branch, and only use a preprod branch to move changes up our environment with a module that manages puppet.conf on our nodes.  We will keep looking, but I don't see a smoking gun.  Anyone else have any ideas?  Puppet 4 was able to handle this load with only 4 JRuby workers and 4 Puppet servers.  Thanks for your help,

Mike

Justin Stoller

Feb 11, 2019, 6:21:06 PM
to puppet...@googlegroups.com
Off the top of my head:
1. Have you tried lowering the JRuby worker to JVM heap ratio? (I would try 1GB per worker to see if it really is worker performance.)
2. That error is most likely from Jetty (it can be tuned with idle-timeout-milliseconds[1]; see the sketch after this list). Are agent runs failing with a 500 from the server when that happens? Are clients failing to post their facts or reports in a timely manner? Is Puppet Server failing its connections to PuppetDB?
3. Are you managing any other server settings? A low max-requests-per-instance is problematic for newer servers (they compile/optimize the Ruby code a worker loads much more aggressively, so with shorter worker lifetimes the server does a bunch of work only to throw it away and start over, and that can cause much more load).
4. What version of Java are you using, and do you have any custom Java tuning that maybe doesn't work well with newer servers? Puppet Server 5+ only supports Java 8 and will use more non-heap memory (code cache) for those new optimizations mentioned above.
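
For reference, a rough sketch of where 2. and 3. get tuned, assuming the default conf.d layout (the values below are only illustrative):

# /etc/puppetlabs/puppetserver/conf.d/webserver.conf
webserver: {
    idle-timeout-milliseconds: 60000   # the 30001/30000 in your error suggests the current limit is 30s
}

# /etc/puppetlabs/puppetserver/conf.d/puppetserver.conf
jruby-puppet: {
    max-requests-per-instance: 0       # 0 disables worker recycling; a low value causes the churn described in 3.
}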

HTH,
Justin





Mike Sharpton

Feb 12, 2019, 7:59:46 AM
to Puppet Users
Hello Justin,

We were thinking the same thing about the JRuby workers.  Perhaps lowering them back to 4 and the heap back to 3GB, which worked fine before, makes sense now that we have added two more Puppet servers.  The behavior we see is failing Puppet runs like this on random modules:

Could not evaluate: Could not retrieve file metadata for puppet:///modules/modulename/resourcename: SSL_connect returned=6 errno=0 state=unknown state

Our guess is that something took far too long to answer.  Reports are fine, PuppetDB is fine; things always make it there, and we see the failures in the reports.  It is likely we have a thundering herd of agents that sometimes makes the situation worse, but if that were the whole story this should happen every 30 minutes, and it doesn't.  I can't think of any other server settings we are managing besides JRuby instances, heap size, and a tmp dir for Java to work with (/tmp is noexec here).  We are using this JVM, and we do not have any custom tuning:

openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)

We will try experimenting with the JRuby/heap ratio now that we have more Puppetservers.  We consistently see all JRuby instances being utilized even when set to 6.  Another thing we may consider is doing Puppet runs every 45 minutes instead of 30, which would lower load as well (see the sketch below).  Thanks for your thoughts,
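
If we go the 45-minute route, that is just the agent-side runinterval in puppet.conf, which the module that manages puppet.conf on our nodes could roll out (sketch only):

# puppet.conf on the agents
[agent]
runinterval = 45m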

Mike

Mike Sharpton

Feb 12, 2019, 8:24:21 AM
to Puppet Users
I just found this as well, in the Puppet Server docs:

If you're working outside of lab environment, increase ReservedCodeCache to 512m under normal load. If you're working with 6-12 JRuby instances (or a max-requests-per-instance value significantly less than 100k), run with a ReservedCodeCache of 1G. Twelve or more JRuby instances in a single server might require 2G or more.

We need to look into this setting.
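
My understanding is that this goes on the JAVA_ARGS line in the service defaults file, something like the sketch below (keep whatever flags are already there; the path is /etc/default/puppetserver on Debian-family systems), followed by a restart of the puppetserver service:

# /etc/sysconfig/puppetserver
JAVA_ARGS="-Xms4g -Xmx4g -XX:ReservedCodeCacheSize=512m"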

Henrik Lindberg

Feb 12, 2019, 1:08:45 PM
to puppet...@googlegroups.com

I suppose you have read this:
https://puppet.com/docs/puppetserver/6.0/tuning_guide.html

I read your description again. I think it would be of value to look at
the stack trace you get when the timeout occurs, to figure out what it
is that is timing out.

Best,
- henrik

Mike Sharpton

Feb 15, 2019, 11:06:38 AM
to Puppet Users
Hey all,

Update to this thread in case others have a similar issue.  It appears it IS the setting below.  This is listed as a "potential" tuning setting, but it is basically required for any real workload.  I saw this after pulling out some of the last hairs I have left on my head (the warning was in /var/log/messages, which I was not even focused on. Doh!).

Feb 15 08:38:33  puppetserver: OpenJDK 64-Bit Server VM warning: CodeCache is full. Compiler has been disabled.
Feb 15 08:38:33  puppetserver: OpenJDK 64-Bit Server VM warning: Try increasing the code cache size using -XX:ReservedCodeCacheSize=
Feb 15 08:38:33  puppetserver: CodeCache: size=245760Kb used=207748Kb max_used=242118Kb free=38011Kb
Feb 15 08:38:33  puppetserver: bounds [0x00007f4394be4000, 0x00007f43a3be4000, 0x00007f43a3be4000]
Feb 15 08:38:33  puppetserver: total_blobs=29412 nmethods=28684 adapters=635
Feb 15 08:38:33  puppetserver: compilation: disabled (not enough contiguous free space left)

Boom goes the dynamite.  This explains a great many things.  I have just adjusted the settings and now need to wait and see if it is better.  I will drop a final post to finish this thread if it works out.

Mike

Mike Sharpton

Feb 18, 2019, 9:18:59 AM
to Puppet Users
Hey all,

Final reply.  Thanks to all who tossed me a few bones.  The issue is directly related to the new JDK (new to us).  After giving the JVM more code cache via ReservedCodeCacheSize, we are now performing very well and have not blown up since.  The "potential" JVM args from the tuning guide are effectively required in a prod environment with any kind of load.  Until next time,

Mike