I've been troubleshooting performance issues with Puppet when we sync new code (and update the cache via the API).

Puppet Server 5.3.1. Tested with JRuby-9k and "normal" JRuby, as well as compile mode jit and off, with no real difference, other than 9k seeming about 30% slower overall.

Environment: 18,000 agents, 18 Puppet masters configured as below, 1-hour check-in interval, 32 environments with approximately 1,100 classes per environment.

Puppet Server switches/args tested:

Configuration 1:

/usr/bin/java -Xms45G -Xmx45G -XX:+UseTransparentHugePages -XX:+UseLargePagesInMetaspace -XX:+AlwaysPreTouch -Xloggc:/var/log/puppetlabs/puppetserver/puppetjvmgarbagecollect.log -verbose:gc -XX:ReservedCodeCacheSize=768m -XX:MetaspaceSize=4096m -XX:MaxMetaspaceSize=4096m -XX:+UseConcMarkSweepGC -XX:G1HeapRegionSize=8m -Dappdynamics.agent.applicationName=Puppet -Dappdynamics.agent.nodeName=fmnpmprh1.paychex.com -Dappdynamics.agent.tierName=PuppetMaster -Dappdynamics.controller.hostName=appdcontroller.paychex.com -Dappdynamics.controller.port=9998 -Dappdynamics.controller.ssl.enabled=false -Dappdynamics.agent.disable.retransformation=true -Dappdynamics.agent.accountName=customer1 -Dappdynamics.agent.accountAccessKey=SJ5b2m7d1$354 -Dappdynamics.agent.force.agent.registration=true -Dappdynamics.agent.agentRuntimeDir=/opt/product/appdynamics-agent/AppServerAgent -javaagent:/opt/product/appdynamics-agent/AppServerAgent/javaagent.jar -Djava.security.egd=/dev/urandom -XX:OnOutOfMemoryError=kill -9 %p -cp /opt/puppetlabs/server/apps/puppetserver/puppet-server-release.jar:/opt/puppetlabs/server/apps/puppetserver/jruby-1_7.jar:/opt/puppetlabs/server/data/puppetserver/jars/* clojure.main -m puppetlabs.trapperkeeper.main --config /etc/puppetlabs/puppetserver/conf.d --bootstrap-config /etc/puppetlabs/puppetserver/services.d/,/opt/puppetlabs/server/apps/puppetserver/config/services.d/ --restart-file /opt/puppetlabs/server/data/puppetserver/restartcounter

Configuration 2:

/usr/bin/java -Xms62720m -Xmx62720m
-Xloggc:/var/log/puppetlabs/puppetserver/puppetjvmgarbagecollect.log -verbose:gc -Dappdynamics.agent.applicationName=Puppet -Dappdynamics.agent.nodeName=fmnpmprh2.paychex.com -Dappdynamics.agent.tierName=PuppetMaster -Dappdynamics.controller.hostName=appdcontroller.paychex.com -Dappdynamics.controller.port=9998 -Dappdynamics.controller.ssl.enabled=false -javaagent:/opt/product/appdynamics-agent/AppServerAgent/javaagent.jar -Djava.security.egd=/dev/urandom -XX:OnOutOfMemoryError=kill -9 %p -cp /opt/puppetlabs/server/apps/puppetserver/puppet-server-release.jar:/opt/puppetlabs/server/apps/puppetserver/jruby-1_7.jar:/opt/puppetlabs/server/data/puppetserver/jars/* clojure.main -m puppetlabs.trapperkeeper.main --config /etc/puppetlabs/puppetserver/conf.d --bootstrap-config /etc/puppetlabs/puppetserver/services.d/,/opt/puppetlabs/server/apps/puppetserver/config/services.d/ --restart-file /opt/puppetlabs/server/data/puppetserver/restartcounter

Java version:

[jlang1@fmnpmprh2 ~]$ /usr/bin/java -version
openjdk version "1.8.0_232"
OpenJDK Runtime Environment (build 1.8.0_232-b09)
OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)

My issue is that when code updates, we hit the API endpoint to refresh the environment cache. This spawns 32 requests (one per environment) hitting /puppet/v3/environment_classes. My Puppet masters have 20 JRuby instances each (we cannot go larger because JVM heap requirements balloon out of control), so all 20 instances are consumed by environment_classes calls and the remaining 12 requests queue up. These environment_classes calls take 300-450 seconds each, during which all Puppet masters are effectively paused and their normal requests queue up. This causes the queue to top out, Puppet runs to hang, etc., for 5-10 minutes before everything catches back up.
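For reference, a single one of those scans can be reproduced (and timed) from the command line. This is a minimal sketch, not our actual sync tooling: the hostname and certificate paths are placeholders, and the If-None-Match note assumes environment-class-cache-enabled is set in puppetserver.conf so the endpoint returns an Etag that later requests can replay to get a cheap 304 when nothing changed:

```shell
#!/bin/sh
# Hypothetical master hostname and client cert paths -- substitute your own.
MASTER="puppetmaster.example.com"

classes_url() {
  # Build the environment_classes URL for one environment.
  printf 'https://%s:8140/puppet/v3/environment_classes?environment=%s' \
    "$MASTER" "$1"
}

# Time one scan. With environment-class-cache-enabled, the response includes
# an Etag header; sending it back as If-None-Match on the next request skips
# the re-parse and returns 304 if the environment's classes are unchanged.
time curl -si --connect-timeout 2 \
  --cert   /etc/puppetlabs/puppet/ssl/certs/client.pem \
  --key    /etc/puppetlabs/puppet/ssl/private_keys/client.pem \
  --cacert /etc/puppetlabs/puppet/ssl/certs/ca.pem \
  "$(classes_url production)" || true
```

Running this per environment while watching jruby-metrics makes it easy to see whether one particular environment accounts for most of the 300-450 seconds.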
This is all viewed from the following endpoint: /status/v1/services/jruby-metrics?level=debug

I've tried tons of different combinations (see above) of switches/args, JRuby versions, and compile mode settings, with no real change: scanning my environment classes for changes still takes forever. How do I troubleshoot this further, and possibly correct/optimize it? I'm wondering whether it's expected to take 300-450 seconds to update the cache when it's this large, or whether I maybe have a "bad" class or something, but I'm not really sure how to check or dig in further.

At the OS level I have free CPU, my I/O is almost non-existent (iotop shows <10% as the highest spike for the duration), I have free memory, and heap usage is maybe 60% during the environment scans.

I am already exploring triggering my environment refreshes per environment rather than globally (which appears to have been introduced in Puppet Server 5.3.x), but in some cases we still update 14-20+ environments at once, which will continue to gum up the works.
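The per-environment refresh can be scripted so that only the environments the code sync actually touched get flushed (and therefore re-scanned), instead of all 32 at once. A sketch under stated assumptions: the hostname, cert paths, and environment list are placeholders, and the client cert must belong to a host whitelisted for the puppet-admin endpoints in puppetserver.conf:

```shell
#!/bin/sh
# Hypothetical hostname/cert paths; CHANGED_ENVS would come from whatever the
# code-sync tooling reports as actually modified.
MASTER="puppetmaster.example.com"
CHANGED_ENVS="production staging"   # placeholder list

flush_url() {
  # Per-environment flush via the puppet-admin-api environment-cache endpoint.
  printf 'https://%s:8140/puppet-admin-api/v1/environment-cache?environment=%s' \
    "$MASTER" "$1"
}

for env in $CHANGED_ENVS; do
  # A DELETE with no ?environment= parameter flushes every environment;
  # scoping it leaves the other environments' caches intact.
  curl -s -X DELETE --connect-timeout 2 \
    --cert   /etc/puppetlabs/puppet/ssl/certs/client.pem \
    --key    /etc/puppetlabs/puppet/ssl/private_keys/client.pem \
    --cacert /etc/puppetlabs/puppet/ssl/certs/ca.pem \
    "$(flush_url "$env")" || true
done
```

Even with 14-20+ changed environments, staggering these DELETEs (rather than firing them all simultaneously) would keep some JRuby instances free for normal agent traffic.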