Puppetserver performance plummeting a few hours after startup

Martijn Grendelman

unread,
Feb 6, 2020, 10:51:42 AM2/6/20
to Puppet Users
Hi,

A question about Puppetserver performance.

For quite a while now, our primary Puppet server has been suffering from severe slowness and high CPU usage. We have tried to tweak its settings, giving it more memory (Xmx = 6 GB at the moment) and toying with the 'max-active-instances' setting, to no avail. The server has 8 virtual cores and 12 GB memory in total, to run Puppetserver, PuppetDB and PostgreSQL.

Notably, after a restart, performance is acceptable for a while (several hours, up to almost a day), but then it plummets again.

We figured that the server was just unable to cope with the load (we had over 270 nodes talking to it at 30-minute intervals), so we added a second master that now takes more than half of that load (150 nodes). That did not make any difference at all for the primary server. The secondary server, however, has no trouble at all dealing with the load we gave it.

In the graph below, which displays catalog compilation times for both servers, you can see the new master in green. It has very constant, high performance. The old master is in yellow. After a restart, the compile times are good (not great) for a while. The first dip represents ca. 4 hours, the second dip was 18 hours. At some point, the catalog compilation times skyrocket, as does the server load. 10 seconds in the graph corresponds to a server load of around 2, while 40 seconds corresponds to a server load of around 5. It's the Puppetserver process using the CPU.

The second server, the green line, has a consistent server load of around 1, with 4 GB memory (2 GB for the Puppetserver JVM) and 2 cores (it's an EC2 t3.medium).



If I have 110 nodes, each doing two runs per hour that take 30 seconds, I would still have an average concurrency of less than 2 (110 x 2 x 30 s = 6,600 agent-seconds per hour, i.e. roughly 1.8 concurrent runs), so Puppet causing a consistent load of 5 seems strange. My first thought would be that it's garbage collection or something like that, but the server has plenty of memory (the OS cache has 2 GB).

Any ideas on what makes Puppetserver start using so much CPU? What can we try to keep it down?

Thanks,
Martijn Grendelman

Yvan Broccard

unread,
Feb 6, 2020, 1:14:16 PM2/6/20
to puppet...@googlegroups.com
Hi,

I had the same issue a couple of months ago. Catalog compilation on the Puppet server was taking more and more time for no apparent reason. I added more Puppet servers to take the load, but the compilation times on that original "first" server never came back down.

After some troubleshooting, I noticed that there was one file growing and growing, and it was eating all the resources of the server:

/opt/puppetlabs/puppet/cache/state/state.yaml 

This file was getting huuuuuuuge, filled with report file entries.
I truncated this file and then voilà, everything was back in good shape.

I then added this to my puppetserver manifest:

  # let's remove growing state.yaml if it becomes too big
  tidy {'/opt/puppetlabs/puppet/cache/state/state.yaml':
    size => '10m',
  }  

Have a look and see if you're suffering from the same issue...

Cheers

Yvan B

Matthias Baur

unread,
Feb 6, 2020, 2:24:21 PM2/6/20
to Puppet Users
Hey Martijn,

this sounds similar to what we experienced after upgrading to Puppetserver 6. Are you running Puppetserver 6? If so, please check the CodeCache settings.
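
If the code cache is the culprit, raising it is done via the JVM arguments. A minimal sketch, assuming a default package install where JAVA_ARGS lives in /etc/sysconfig/puppetserver (RHEL family) or /etc/default/puppetserver (Debian/Ubuntu):

  # keep your existing heap flags and add the code cache flag, e.g.:
  JAVA_ARGS="-Xms6g -Xmx6g -XX:ReservedCodeCacheSize=512m"

and then restart the puppetserver service.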

Regards,
Matthias

KevinR

unread,
Feb 6, 2020, 4:43:12 PM2/6/20
to Puppet Users
Hi Martijn,

it sounds like you have a sub-optimal combination of:
  • The number of JRubies
  • The total amount of Java heap memory for puppetserver
  • The size of your code base
This typically causes the kind of problems you're experiencing. What's happening, in a nutshell, is that Puppet is loading so much code into memory that it starts running out of it and starts performing garbage collection more and more aggressively. In the end, 95% of all CPU cycles are spent on garbage collection and you don't have any CPU cycles left over to actually do work, like compiling catalogs...

To understand how Puppet loads code into memory:

Your code base is:  ( [ size of your control-repo ] + [ size of all the modules from the Puppetfile ] )  x  [ the number of Puppet code environments ]
So let's say:
  • your control repo is 5MB in size
  • all modules together are 95MB in size
  • you have 4 code environments: development, testing, acceptance and production
That's 100MB of code to load in memory, per environment. For 4 environments, that's 400MB.
A different way to get this number directly is to run du -h /etc/puppetlabs/code/environments on the puppet master and look at the total size reported for /etc/puppetlabs/code/environments.

Now every JRuby will load that entire code base into memory. So if you have 4 JRubies, that's 1600MB of java heap memory that's actually needed. You can imagine what problems will happen if there isn't this much heap memory configured...
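
A rough sketch of that check and the resulting arithmetic, assuming a default install with code under /etc/puppetlabs/code/environments:

  # total size of all code environments (control repo + modules, per environment):
  du -sh /etc/puppetlabs/code/environments
  # e.g. 400MB of code x 4 JRubies = ~1600MB of heap needed just for loaded code,
  # so -Xmx must be comfortably above that.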

If you're using the defaults, Puppet will create the same number of JRubies as the number of CPU cores on your master, minus 1, with a maximum of 4 JRubies for the system.
If you override the defaults, you can specify any number of JRubies you want with the max-active-instances setting.

So by default a 2-cpu puppet master will create 1 JRuby, a 4-cpu puppet master will create 3 JRubies, an 8-cpu puppet master will create 4 JRubies.
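
If you do override it, max-active-instances lives in the jruby-puppet section of /etc/puppetlabs/puppetserver/conf.d/puppetserver.conf; a minimal sketch:

  jruby-puppet: {
      # cap the number of JRuby instances (and thus the copies of code held in heap)
      max-active-instances: 4
  }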

So now you know how to determine the amount of Java heap memory you need to configure, which you can do by setting the -Xmx and -Xms options in the JAVA_ARGS section of the puppetserver startup configuration.
Then, finally, make sure the host has enough physical memory available to provide this increased amount of Java heap memory.
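
For the example above, a sketch of what that might look like in the puppetserver defaults file (the same JAVA_ARGS line mentioned earlier):

  # 4 JRubies x 400MB of code is ~1.6GB, plus headroom for compilation work:
  JAVA_ARGS="-Xms2g -Xmx2g"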

Once enough java heap memory is provided, you'll see the cpu usage stay stable.

Kind regards,

Kevin Reeuwijk

Principal Sales Engineer @ Puppet

Justin Stoller

unread,
Feb 6, 2020, 5:04:33 PM2/6/20
to puppet...@googlegroups.com
Yvan, your issue sounds like https://tickets.puppetlabs.com/browse/PUP-3647; do you know if that is fixed now, or has it regressed since then?

Your issue does sound like a CodeCache or Metaspace issue.

One tunable you didn't mention was "max-active-instances". I've found a bunch of folks who turned that very low to combat leaky code in 5.x or 4.x, despite it causing Puppet & the Ruby runtime to be reloaded frequently. In 6.x that loading became much more expensive, so small values of "max-active-instances" can be very detrimental to performance (and contribute to excessive Metaspace/CodeCache usage).

This is also assuming that your servers are both on 6.x and both at the same version. Can you confirm that? There are recent improvements in Server performance that could contribute to (though probably not completely explain) the difference in performance you're seeing, if your new Server is on the latest version and your old server hasn't been upgraded in a few months.

HTH,
Justin

Josh Cooper

unread,
Feb 6, 2020, 6:41:34 PM2/6/20
to puppet...@googlegroups.com
On Thu, Feb 6, 2020 at 9:04 AM Justin Stoller <jus...@puppet.com> wrote:
Yvan your issue sounds like https://tickets.puppetlabs.com/browse/PUP-3647, do you know if that is fixed now, or has regressed since then?

To add to what Justin said, the state file is only supposed to be loaded/updated by the puppet agent process. It's possible the file is accidentally getting loaded in puppetserver, or maybe your puppet agent is an older version and doesn't have the fix?

As part of the fix for PUP-3647, the statettl puppet setting controls how long to keep entries in the state file. It defaults to 32 days. You may want to set that to a lower value like "1d", though see the configuration reference page for how that affects scheduled resources. Also note that setting it to 0 disables pruning entirely, so the file can grow unbounded, which honestly seems wrong; I'd expect 0 to never cache...
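
For example, a sketch of lowering it (statettl is an agent-side setting, so it belongs in the agent's puppet.conf or can be set via the CLI):

  puppet config set statettl 1d --section agent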

Josh 

--
Josh Cooper | Software Engineer

Yvan Broccard

unread,
Feb 7, 2020, 10:11:51 AM2/7/20
to puppet...@googlegroups.com
Hi,
About the growing state.yaml :
I have upgraded to version 6.7 in the meantime. I've seen in the code that the issue with :statettl was fixed.
As the default setting is 32 days, I still need to wait a couple of days to be sure it works, as I still had a "tidy" statement for cleaning up Puppet reports until recently.
I still have a huge state.yaml file, containing 200'000 lines and referencing 60'000 reports that have already been removed, but they are only reports generated during the last month, with the oldest entries having a ":checked:" date of 2020-01-06... So I guess it works!

Thank you for clarifying this issue !

Yvan

Andy Hall

unread,
Feb 7, 2020, 2:47:57 PM2/7/20
to Puppet Users
We saw something similar: with only the default maximum of 4 puppetserver JRuby instances, we would often get 5 or 6 clients connecting at once, which in turn led to blocking and then a queue building up as more clients connected. We would check port 8140 and often see over 80 established connections.
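
One way to count those connections (a quick sketch; netstat works just as well):

  # established connections to the puppetserver port, minus the header line:
  ss -tn state established '( sport = :8140 )' | tail -n +2 | wc -l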

Now that we have doubled max-active-instances to 8 and increased the JVM heap size to 4 GB, the concurrent connections can be handled and a queue no longer builds, so puppet runs are much quicker and the server does not get so bogged down.

I hope this helps.

Josh Cooper

unread,
Feb 7, 2020, 6:55:38 PM2/7/20
to puppet...@googlegroups.com
On Fri, Feb 7, 2020 at 2:11 AM Yvan Broccard <yvan.b...@gmail.com> wrote:
Hi,
About the growing state.yaml :
I have upgraded to version 6.7 in the mean time. I've seen in the code that the issue with :statettl was fixed.
As the default setting is 32days, I still need to wait a couple of days before being sure it works, as I still had a "tidy" statement for cleanup up Puppet reports until recently.
I still have a huge state.yaml file, containing 200'000 lines, and referencing 60'000 reports that have already been removed, but they are only reports generated during the last month, with the oldest entries with a ":checked:" date of  2020-01-06

That's odd. Are those YAML reports that puppetserver received and stored locally due to the default puppet setting "reports=store"? Puppetserver should be storing reports in /opt/puppetlabs/server/data/puppetserver/reports/<client>/*.yaml, so those should not appear in state.yaml. Can you confirm whether puppetserver is loading/updating state.yaml, or only the local puppet agent running on the server? For example, to confirm puppetserver isn't accessing that file:

$ sudo strace -p <puppetserver java pid> -P /opt/puppetlabs/puppet/cache/state/state.yaml

Only the agent should ever access the state file and I wouldn't expect reports to be in there. But it's possible a puppetserver extension (function, report processor, hiera backend, ...) could be calling Puppet::Util::Storage.load/store erroneously.
 

Yvan Broccard

unread,
Feb 10, 2020, 7:26:30 AM2/10/20
to puppet...@googlegroups.com
Hello,

Actually, I had this in my puppetserver manifest:

  tidy {'/opt/puppetlabs/server/data/puppetserver/reports':
    age     => '4d',
    matches => '*.yaml',
    recurse => true,
    rmdirs  => false,
    type    => ctime,
  }
And this is the reason why all the generated report files were being referenced in state.yaml: the tidy resource generates a file resource for every report it matches, and the agent records each resource it checks in state.yaml.

Now I've removed this code, and I have also changed the way the reports are stored. I don't need the YAML report files, as I already use PuppetDB + Foreman. So now I can see the state.yaml file slowly shrinking (I need to wait 32 days until all the old entries are cleaned out).
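
For anyone wondering, the change roughly amounts to dropping the default store report processor from the server-side reports setting in puppet.conf; a sketch, assuming the PuppetDB and Foreman report processors are the ones in use:

  [master]
  reports = puppetdb,foreman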

Thank you

Cheers

Yvan

Martijn Grendelman

unread,
Feb 10, 2020, 9:44:12 AM2/10/20
to puppet...@googlegroups.com
Hi Kevin and others who have responded,

Thanks all for your tips. Unfortunately, no breakthroughs yet.

The current state is this:
  • Both Puppetservers typically run at the latest version, currently both 6.8.0.
  • The primary server has 8 virtual cores and 12 GB of physical (virtualized) RAM, Java is running with -Xms6g -Xmx6g.
  • Max-active-instances is currently set to 7.
  • This morning, I added -XX:ReservedCodeCacheSize=1g to the JVM startup config.
  • The size of our 'environments' directory is 131 MB. We currently have 3 environments.
I've been looking at JVM stats with 'jstat', and the server doesn't appear to spend any significant amount of time doing GC (seems to be about 1%).

After a server restart, compilation times typically drop to 9 seconds on average (on the secondary server, it's 5 seconds consistently), but after a while, they go back to 30 or 40 seconds.

As I noted in my first post, our server has an average of less than 2 concurrent agents talking to it, so I can't imagine this happening due to lack of resources. The fact that our secondary server handles a bigger load than the primary, with a third of the memory and only 2 cores, seems to confirm this.

So:
- enough CPU power (I would think)
- enough memory
- no significant garbage collection
- Puppetserver causing a load of 5

Any more tips? Would it make sense to run PuppetDB and PostgreSQL on a different VM?

Thanks,
Martijn Grendelman.

On 6-2-2020 at 17:43, KevinR wrote:

Justin Stoller

unread,
Feb 10, 2020, 6:03:04 PM2/10/20
to puppet...@googlegroups.com
On Mon, Feb 10, 2020 at 1:44 AM Martijn Grendelman <mar...@grendelman.net> wrote:
Hi Kevin and others who have responded,

Thanks all for your tips. Unfortunately, no breakthroughs yet.

The current state is this:
  • Both Puppetservers typically run at the latest version, currently both 6.8.0.
  • The primary server has 8 virtual cores and 12 GB of physical (virtualized) RAM, Java is running with -Xms6g -Xmx6g.
  • Max-active-instances is currently set to 7.
  • This morning, I added -XX:ReservedCodeCacheSize=1g to the JVM startup config.
  • The size of our 'environments' directory is 131 MB. We currently have 3 environments.
I've been looking at JVM stats with 'jstat', and the server doesn't appear to spend any significant amount of time doing GC (seems to be about 1%).

FWIW, gceasy.io and its family of JVM analysis reports can be helpful if you have the GC logs, etc. available. You might want to check whether you're capping Metaspace (e.g. with -XX:MaxMetaspaceSize=1g). Most likely you're not, and if you were having issues there I'd expect full GCs, so it's not likely the problem, but it's a thing to check. If you do have to manage it, it should have a similar value to the CodeCache.
 

After a server restart, compilation times typically drop to 9 seconds on average (on the secondary server, it's 5 seconds consistently), but after a while, they go back to 30 or 40 seconds. 

As I noted in my first post, our server has an average  of less than 2 concurrent agents talking to it, so I can't imagine this happening due to lack of resources. The fact that our secondary server handles a bigger load than the primary, with a third of the memory and only 2 cores, seems to confirm this.

One thing that gets folks is that each worker instance is pretty heavy-weight (heap, non-heap, and CPU, even when relatively idle). If you only need 2 or 3 instances, you should try lowering max-active-instances to that number and see what happens.

So:
- enough CPU power (I would think)
- enough memory
- no significant garbage collection
- Puppetserver causing a load of 5

I also asked about max-requests-per-instance; ideally it should be 0 (i.e. off) or some very high number (like 1000000).
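
Like max-active-instances, that setting goes in the jruby-puppet section of /etc/puppetlabs/puppetserver/conf.d/puppetserver.conf; a sketch:

  jruby-puppet: {
      # 0 disables instance recycling, so JRubies are never flushed and reloaded
      max-requests-per-instance: 0
  }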


Any more tips? Would it make sense to run PuppetDB and PostgreSQL on a different VM?

If that's the biggest difference, you might want to go in that direction. I've seen PostgreSQL tuned to the point where its different child processes would consume way more memory than intended. You'd probably want to confirm that with top et al.

Kevin Reeuwijk

unread,
Feb 10, 2020, 6:58:05 PM2/10/20
to puppet...@googlegroups.com
Hi Martijn,

for the small number of nodes you're managing, you shouldn't need a separate DB server.
I'd change max-active-instances to 4; that should be plenty for your environment.

Since your problems clearly are not coming from too much code in memory, I'd look at other possible causes:
  • Check if you're managing very large numbers of file resources (especially if you use recurse => true anywhere); this can be a cause of issues
  • Check if you're managing the content of a file resource in a bad way somewhere. For example, I've seen this really bad idea, which will cause the entire file content to end up in the catalog:
    file { 'huge_file.exe':
      content => file('my_module/huge_file.exe'),
      ensure  => present,
    }
  • Check if you're using the puppet server as a fileserver for large files; this will cause excessive CPU usage for calculating the MD5 hash of the file on each puppet run:
    file { 'huge_file.exe':
      source => 'puppet:///modules/my_module/huge_file.exe',
      ensure => present,
    }
Kind regards,
Kevin Reeuwijk
Principal Sales Engineer, NEMEA
mobile: +31 6 272 33 55 1
kev...@puppet.com | @KevinReeuwijk
Puppet. The shortest path to better software.


Martijn Grendelman

unread,
Feb 12, 2020, 9:28:13 AM2/12/20
to puppet...@googlegroups.com
Hi all,

It seems that setting ReservedCodeCacheSize has in fact had a positive impact on performance. Since Monday, catalog compilation times have been steady at around 10 seconds on average. Still not as good as the secondary server, but acceptable!

For the record, we are not managing metaspace size or max-requests-per-instance.

I'll try to lower the number of workers and see if that has a positive effect on memory/CPU usage while keeping the compilation times down.

Thanks again for all your suggestions!!

Best regards,
Martijn.



On 10-2-2020 at 19:02, Justin Stoller wrote: