General performance questions in Puppetserver 2.8


Ramin K

Feb 10, 2018, 8:39:39 PM
to puppet...@googlegroups.com
We recently switched a large codebase with a lot of history over to
Puppet 4.10. Performance of the new servers is significantly slower
than the old Apache/Passenger system, roughly half. I have a list of
possible causes and general cleanup to do, but was hoping the
community could help me order that list in a way that gets me to
better performance soonest. Or point out something I've missed.

We serve a lot of files. Whatever you've seen, this is likely worse:
some cases are 10k files per server. We've farmed out tickets to teams
to clean up, templatize, etc. their code. I could do the heavy lifting
and drop it significantly with a week of work.

50% of agents are on CentOS 6 running Puppet 3.8.x with Ruby 1.8.7,
so no HTTP connection reuse. 50% of agents are on CentOS 7 with
Puppet 3.8.x and Ruby 2.0, which does reuse connections, at least
against Puppet 3 servers also on CentOS 7. I have not checked against
Puppetserver 2.8, but assume this still works. Moving all CentOS 6
agents to puppet-agent 1.10.10 should reduce the SSL/HTTP connection
overhead.

Puppetserver tuning. Not much documentation online. Running one JRuby
instance per core (24) with an 18GB heap; the process runs around
23GB consistently on a 32GB machine. environment_timeout of 30s.
Java 8. On-disk files are on a ramdisk; could move ./cache/ to a
ramdisk as well. Other than memory, is anyone tuning much?
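
For reference, the knobs involved look roughly like this, assuming
the stock CentOS package layout (paths and file names may differ on
other platforms):

    # /etc/puppetlabs/puppetserver/conf.d/puppetserver.conf (HOCON)
    jruby-puppet: {
        # one JRuby per core; each instance compiles one catalog at a time
        max-active-instances: 24
    }

    # /etc/sysconfig/puppetserver
    JAVA_ARGS="-Xms18g -Xmx18g"

    # <environment>/environment.conf
    environment_timeout = 30s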

Stdlib validation deprecation messages. We have not moved over to the
new type system and validation. Other than these messages the logs
are generally pretty clean.
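
For anyone hitting the same messages, the stdlib validate_* calls map
fairly mechanically onto Puppet 4 parameter types. A small sketch
(the class name is made up):

    # Old stdlib style, logs a deprecation warning on each compile:
    class profile::demo ($port, $enabled) {
      validate_integer($port)
      validate_bool($enabled)
    }

    # Puppet 4 data types, checked by the compiler itself:
    class profile::demo (Integer $port, Boolean $enabled) {
    }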

content => vs source =>. We've been slowly moving to content for
files that are on a significant percentage of the fleet. Is that
still the way to go, assuming we're on a 4.x or better agent?
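
My understanding of the trade-off: source => has the agent make
fileserver requests per file at apply time, while content => inlines
the bytes into the catalog at compile time. A sketch, module name
made up:

    # source: agent requests metadata (and content on change) per file
    file { '/etc/motd':
      source => 'puppet:///modules/mymodule/motd',
    }

    # content: bytes ship inside the catalog, no extra round trips
    file { '/etc/motd':
      content => file('mymodule/motd'),
    }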

That's the list so far. I'm guessing fewer files and agent upgrades
are the most likely to get us back to parity. I was surprised that
catalog generation wasn't significantly faster when comparing
unloaded 3 and 4 servers to each other. I suspect cleaning up
deprecations might help some, but also suspect there is something odd
about our code here.

Ramin

Poil

Feb 11, 2018, 8:32:10 AM
to puppet...@googlegroups.com, Ramin K
Here at Claranet France, when we switched to Puppet 4 we made several
mistakes.
First, the environment cache wasn't enabled, and performance with
Puppet 4 is very bad when it's not.
Second, when we reached 3000 nodes (we have 4 master nodes, 48 cores,
64GB, behind 2 HAProxy instances) catalog application (not
compilation) became slower and slower. We tried to increase the
number of JRuby instances from 4 (the max in auto mode) to 12, but we
had very high load and crashes. The problem is that the JVM heap is
shared between all JRuby instances, so simply increasing Xmx/Xms to
1GB per JRuby instance seems to have resolved all our performance
problems.
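
In other words we now size the heap from the instance count, roughly
like this (the exact headroom is a judgment call; the path depends on
the distribution):

    # /etc/sysconfig/puppetserver or /etc/default/puppetserver
    # 12 JRuby instances x 1GB each, plus headroom for the server
    JAVA_ARGS="-Xms16g -Xmx16g"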

Now our catalog application time is about 50% faster than on our old
Puppet 3 infrastructure (we've also switched all our agents to
Puppet 4).

Best regards,

Ramin K

Feb 11, 2018, 9:53:09 PM
to puppet...@googlegroups.com
Thanks for the response. Nice to get some new data points.

We're unfortunately unable to lengthen the environment cache past 30s
till we get r10k wired up and able to clear the cache automatically.
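
The eventual plan is for r10k's postrun to flush the cache through
the admin API so the timeout can go up. Something like the following,
assuming default cert locations and that the client cert is
whitelisted for puppet-admin-api:

    curl -i -X DELETE \
      --cert   $(puppet config print hostcert) \
      --key    $(puppet config print hostprivkey) \
      --cacert $(puppet config print localcacert) \
      https://localhost:8140/puppet-admin-api/v1/environment-cache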

We're running 32 instances for 32 cores, which seems fine, but I'm
not clear on whether I should run fewer. I set JAVA_ARGS="-Xms18g
-Xmx18g -XX:+UseG1GC", which seems fine for this number of instances
so far. We have not had crashing problems. CPU was running 90-100%
until I set the heaviest-hitting role to run once an hour while we
make changes.

One interesting data point: I was using an ancient custom function as
a replacement for fqdn_rand(60, $seed) to keep crons from moving
around during the transition. Removing the function from the majority
of cases dropped compile times by 1-3 seconds and smoothed out the
CPU curve. I suspect this function was running poorly and might be
the cause of the Ruby 1.9.3 deprecation notices I see in the logs. I
suspect the more Ruby/stdlib I remove from the manifests, the better
performance will be.
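
For the cases we've converted, it's just the built-in again. Roughly
(resource name made up):

    # fqdn_rand gives a stable per-host value; the seed string keeps
    # different cron resources on one host from landing on the same
    # minute.
    cron { 'heavy_job':
      command => '/usr/local/bin/heavy_job.sh',
      minute  => fqdn_rand(60, 'heavy_job'),
    }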

In regards to agents, I ran a few tests on our CentOS 6 hosts. I
definitely see an improvement in apply times, particularly in roles
that have a lot of file resources. One of these roles dropped from
45s to 30s, which should be HTTP connection reuse, though I still
need to verify that it's because we're using something other than
1.8.7 as the runtime.

Ramin