I've been troubleshooting performance issues with Puppet when we sync new code (and update the cache via the API).

Puppet Server 5.3.1. Tested with JRuby-9k and "normal" JRuby, as well as compile mode jit and off, with no real difference, other than 9k seeming about 30% slower overall.

Environment: 18,000 agents, 18 Puppet masters configured as below, 1-hour check-in interval, 32 environments with approximately 1,100 classes per environment.

Puppet Server switches/args tested:

Configuration 1:

/usr/bin/java -Xms45G -Xmx45G -XX:+UseTransparentHugePages -XX:+UseLargePagesInMetaspace -XX:+AlwaysPreTouch -Xloggc:/var/log/puppetlabs/puppetserver/puppetjvmgarbagecollect.log -verbose:gc -XX:ReservedCodeCacheSize=768m -XX:MetaspaceSize=4096m -XX:MaxMetaspaceSize=4096m -XX:+UseConcMarkSweepGC -XX:G1HeapRegionSize=8m -Dappdynamics.agent.applicationName=Puppet -Dappdynamics.agent.nodeName=fmnpmprh1.paychex.com -Dappdynamics.agent.tierName=PuppetMaster -Dappdynamics.controller.hostName=appdcontroller.paychex.com -Dappdynamics.controller.port=9998 -Dappdynamics.controller.ssl.enabled=false -Dappdynamics.agent.disable.retransformation=true -Dappdynamics.agent.accountName=customer1 -Dappdynamics.agent.accountAccessKey=SJ5b2m7d1$354 -Dappdynamics.agent.force.agent.registration=true -Dappdynamics.agent.agentRuntimeDir=/opt/product/appdynamics-agent/AppServerAgent -javaagent:/opt/product/appdynamics-agent/AppServerAgent/javaagent.jar -Djava.security.egd=/dev/urandom -XX:OnOutOfMemoryError=kill -9 %p -cp /opt/puppetlabs/server/apps/puppetserver/puppet-server-release.jar:/opt/puppetlabs/server/apps/puppetserver/jruby-1_7.jar:/opt/puppetlabs/server/data/puppetserver/jars/* clojure.main -m puppetlabs.trapperkeeper.main --config /etc/puppetlabs/puppetserver/conf.d --bootstrap-config /etc/puppetlabs/puppetserver/services.d/,/opt/puppetlabs/server/apps/puppetserver/config/services.d/ --restart-file /opt/puppetlabs/server/data/puppetserver/restartcounter

Configuration 2:

/usr/bin/java -Xms62720m -Xmx62720m
-Xloggc:/var/log/puppetlabs/puppetserver/puppetjvmgarbagecollect.log -verbose:gc -Dappdynamics.agent.applicationName=Puppet -Dappdynamics.agent.nodeName=fmnpmprh2.paychex.com -Dappdynamics.agent.tierName=PuppetMaster -Dappdynamics.controller.hostName=appdcontroller.paychex.com -Dappdynamics.controller.port=9998 -Dappdynamics.controller.ssl.enabled=false -javaagent:/opt/product/appdynamics-agent/AppServerAgent/javaagent.jar -Djava.security.egd=/dev/urandom -XX:OnOutOfMemoryError=kill -9 %p -cp /opt/puppetlabs/server/apps/puppetserver/puppet-server-release.jar:/opt/puppetlabs/server/apps/puppetserver/jruby-1_7.jar:/opt/puppetlabs/server/data/puppetserver/jars/* clojure.main -m puppetlabs.trapperkeeper.main --config /etc/puppetlabs/puppetserver/conf.d --bootstrap-config /etc/puppetlabs/puppetserver/services.d/,/opt/puppetlabs/server/apps/puppetserver/config/services.d/ --restart-file /opt/puppetlabs/server/data/puppetserver/restartcounter

Java version:

[jlang1@fmnpmprh2 ~]$ /usr/bin/java -version
openjdk version "1.8.0_232"
OpenJDK Runtime Environment (build 1.8.0_232-b09)
OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)

My issue is that when code updates, we hit the API endpoint to refresh the environment cache. This spawns 32 requests (one per environment) hitting /puppet/v3/environment_classes. My Puppet masters have 20 JRuby instances each (we cannot go larger because JVM heap requirements balloon out of control), so all 20 instances are consumed by environment_classes calls and the remaining 12 requests queue up. These environment_classes calls take 300-450 seconds each, during which all Puppet masters are effectively paused and their normal requests queue up. This causes the queue to top out, Puppet runs to hang, etc., for 5-10 minutes before everything catches back up.
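For reference, a single one of those scans can be reproduced (and timed) from the command line. This is a minimal sketch, not our actual sync tooling: the hostname and certificate paths are placeholders, and the If-None-Match note assumes environment-class-cache-enabled is set in puppetserver.conf so the endpoint returns an Etag that later requests can replay to get a cheap 304 when nothing changed:

```shell
#!/bin/sh
# Hypothetical master hostname and client cert paths -- substitute your own.
MASTER="puppetmaster.example.com"

classes_url() {
  # Build the environment_classes URL for one environment.
  printf 'https://%s:8140/puppet/v3/environment_classes?environment=%s' \
    "$MASTER" "$1"
}

# Time one scan. With environment-class-cache-enabled, the response includes
# an Etag header; sending it back as If-None-Match on the next request skips
# the re-parse and returns 304 if the environment's classes are unchanged.
time curl -si --connect-timeout 2 \
  --cert   /etc/puppetlabs/puppet/ssl/certs/client.pem \
  --key    /etc/puppetlabs/puppet/ssl/private_keys/client.pem \
  --cacert /etc/puppetlabs/puppet/ssl/certs/ca.pem \
  "$(classes_url production)" || true
```

Running this per environment while watching jruby-metrics makes it easy to see whether one particular environment accounts for most of the 300-450 seconds.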
This is all viewed from the following endpoint: /status/v1/services/jruby-metrics?level=debug

I've tried tons of different combinations (see above) of switches/args, JRuby versions, and compile mode settings, with no real change: scanning my environment classes for changes still takes forever. How do I troubleshoot this further, and possibly correct/optimize it? I'm wondering whether it's expected to take 300-450 seconds to update the cache when it's this large, or whether I maybe have a "bad" class or something, but I'm not really sure how to check or dig in further.

At the OS level I have free CPU, my I/O is almost non-existent (iotop shows <10% as the highest spike for the duration), I have free memory, and heap usage is maybe 60% during the environment scans.

I am already exploring triggering my environment refreshes per environment rather than globally (which appears to have been introduced in Puppet Server 5.3.x), but in some cases we still update 14-20+ environments at once, which will continue to gum up the works.
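The per-environment refresh can be scripted so that only the environments the code sync actually touched get flushed (and therefore re-scanned), instead of all 32 at once. A sketch under stated assumptions: the hostname, cert paths, and environment list are placeholders, and the client cert must belong to a host whitelisted for the puppet-admin endpoints in puppetserver.conf:

```shell
#!/bin/sh
# Hypothetical hostname/cert paths; CHANGED_ENVS would come from whatever the
# code-sync tooling reports as actually modified.
MASTER="puppetmaster.example.com"
CHANGED_ENVS="production staging"   # placeholder list

flush_url() {
  # Per-environment flush via the puppet-admin-api environment-cache endpoint.
  printf 'https://%s:8140/puppet-admin-api/v1/environment-cache?environment=%s' \
    "$MASTER" "$1"
}

for env in $CHANGED_ENVS; do
  # A DELETE with no ?environment= parameter flushes every environment;
  # scoping it leaves the other environments' caches intact.
  curl -s -X DELETE --connect-timeout 2 \
    --cert   /etc/puppetlabs/puppet/ssl/certs/client.pem \
    --key    /etc/puppetlabs/puppet/ssl/private_keys/client.pem \
    --cacert /etc/puppetlabs/puppet/ssl/certs/ca.pem \
    "$(flush_url "$env")" || true
done
```

Even with 14-20+ changed environments, staggering these DELETEs (rather than firing them all simultaneously) would keep some JRuby instances free for normal agent traffic.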