Jira (PUP-10233) environment_classes API endpoint extremely slow


Jason V Lang (JIRA)

Jan 14, 2020, 9:08:17 AM
to puppe...@googlegroups.com
Jason V Lang created an issue
 
Puppet / Bug PUP-10233
environment_classes API endpoint extremely slow
Issue Type: Bug
Assignee: Unassigned
Created: 2020/01/14 6:07 AM
Priority: Minor
Reporter: Jason V Lang

I've been troubleshooting performance issues with Puppet that occur when we sync new code (and update the environment cache via the API).

 

PuppetServer 5.3.1

Tested with JRuby 9k and "normal" JRuby, as well as compile mode set to JIT and off, with no real difference, other than 9k seeming roughly 30% slower overall.

Environment: 18,000 agents, 18 Puppet masters configured as below, 1-hour check-in interval.

32 Environments with approx. 1100 classes per environment

PuppetServer Switches/Args Tested:

Configuration 1: /usr/bin/java -Xms45G -Xmx45G -XX:+UseTransparentHugePages -XX:+UseLargePagesInMetaspace -XX:+AlwaysPreTouch -Xloggc:/var/log/puppetlabs/puppetserver/puppetjvmgarbagecollect.log -verbose:gc -XX:ReservedCodeCacheSize=768m -XX:MetaspaceSize=4096m -XX:MaxMetaspaceSize=4096m -XX:+UseConcMarkSweepGC -XX:G1HeapRegionSize=8m -Dappdynamics.agent.applicationName=Puppet -Dappdynamics.agent.nodeName=fmnpmprh1.paychex.com -Dappdynamics.agent.tierName=PuppetMaster -Dappdynamics.controller.hostName=appdcontroller.paychex.com -Dappdynamics.controller.port=9998 -Dappdynamics.controller.ssl.enabled=false -Dappdynamics.agent.disable.retransformation=true -Dappdynamics.agent.accountName=customer1 -Dappdynamics.agent.accountAccessKey=SJ5b2m7d1$354 -Dappdynamics.agent.force.agent.registration=true -Dappdynamics.agent.agentRuntimeDir=/opt/product/appdynamics-agent/AppServerAgent -javaagent:/opt/product/appdynamics-agent/AppServerAgent/javaagent.jar -Djava.security.egd=/dev/urandom -XX:OnOutOfMemoryError=kill -9 %p -cp /opt/puppetlabs/server/apps/puppetserver/puppet-server-release.jar:/opt/puppetlabs/server/apps/puppetserver/jruby-1_7.jar:/opt/puppetlabs/server/data/puppetserver/jars/* clojure.main -m puppetlabs.trapperkeeper.main --config /etc/puppetlabs/puppetserver/conf.d --bootstrap-config /etc/puppetlabs/puppetserver/services.d/,/opt/puppetlabs/server/apps/puppetserver/config/services.d/ --restart-file /opt/puppetlabs/server/data/puppetserver/restartcounter

Configuration 2: /usr/bin/java -Xms62720m -Xmx62720m -Xloggc:/var/log/puppetlabs/puppetserver/puppetjvmgarbagecollect.log -verbose:gc -Dappdynamics.agent.applicationName=Puppet -Dappdynamics.agent.nodeName=fmnpmprh2.paychex.com -Dappdynamics.agent.tierName=PuppetMaster -Dappdynamics.controller.hostName=appdcontroller.paychex.com -Dappdynamics.controller.port=9998 -Dappdynamics.controller.ssl.enabled=false -javaagent:/opt/product/appdynamics-agent/AppServerAgent/javaagent.jar -Djava.security.egd=/dev/urandom -XX:OnOutOfMemoryError=kill -9 %p -cp /opt/puppetlabs/server/apps/puppetserver/puppet-server-release.jar:/opt/puppetlabs/server/apps/puppetserver/jruby-1_7.jar:/opt/puppetlabs/server/data/puppetserver/jars/* clojure.main -m puppetlabs.trapperkeeper.main --config /etc/puppetlabs/puppetserver/conf.d --bootstrap-config /etc/puppetlabs/puppetserver/services.d/,/opt/puppetlabs/server/apps/puppetserver/config/services.d/ --restart-file /opt/puppetlabs/server/data/puppetserver/restartcounter

 

Java Version:

[jlang1@fmnpmprh2 ~]$ /usr/bin/java -version
openjdk version "1.8.0_232"
OpenJDK Runtime Environment (build 1.8.0_232-b09)
OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)

 

 

My issue is that when code updates, we hit the API endpoint to refresh the environment cache. This spawns 32 jruby processes hitting /puppet/v3/environment_classes.

 

My Puppet masters have 20 JRuby instances each (we cannot go larger because JVM heap requirements balloon out of control), so all 20 instances are consumed by environment_classes calls and the remaining 12 calls queue up. These environment_classes calls take 300-450 seconds each, during which all Puppet masters are effectively paused and queuing their normal requests. This causes the queue to top out and puppet runs to hang for 5-10 minutes before everything catches back up.
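As a rough sanity check on those numbers (illustrative back-of-envelope arithmetic only, using the figures above, not a measurement):

```python
import math

# Figures reported above; illustrative only.
jruby_instances = 20         # JRuby instances per master
refresh_calls = 32           # one environment_classes call per environment
service_time_s = (300, 450)  # observed duration per call (low, high)

# With 32 calls and only 20 workers, the calls drain in two "waves",
# so the pool stays saturated for roughly two service times.
waves = math.ceil(refresh_calls / jruby_instances)
stall_s = tuple(waves * t for t in service_time_s)
print(waves, stall_s)
```

That puts the worst-case saturation window at roughly 600-900 seconds, in the same ballpark as the reported 5-10 minute hang (some capacity frees up after the first wave, since only 12 calls remain for 20 instances).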

 

This is all viewed from the following endpoint: /status/v1/services/jruby-metrics?level=debug

 

I've tried many different combinations (see above) of switches/args, JRuby versions, and compile mode settings, with no real change: scanning my environment classes for changes takes forever.

 

How do I troubleshoot this further, and possibly correct/optimize it? I'm wondering whether it's expected to take 300-450 seconds to update a cache this large, or whether I have a "bad" class somewhere, but I'm not sure how to check or dig in further.
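One way to check for a single "bad" environment or class is to time the environment_classes call per environment and rank the results. A minimal sketch: the timing helper below is generic, and the fetch callable you pass in (e.g. curl or an HTTP client issuing GET /puppet/v3/environment_classes?environment=<env> with the master's client certificates) is left up to you; the function names here are hypothetical, not part of any Puppet API.

```python
import time

def rank_by_duration(environments, fetch):
    """Call fetch(env) for each environment, time it, and return
    (seconds, env) pairs sorted slowest-first, to spot an outlier."""
    timings = []
    for env in environments:
        start = time.monotonic()
        fetch(env)  # e.g. GET /puppet/v3/environment_classes?environment=<env>
        timings.append((time.monotonic() - start, env))
    return sorted(timings, reverse=True)
```

If one environment dominates the ranking, that narrows the search to its modules; if all 32 take comparable time, the cost is proportional to the ~1100 classes per environment rather than one pathological file.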

At the OS level, I have free CPU, I/O is almost nonexistent (iotop shows <10% as the highest spike for the duration), memory is free, and heap usage is maybe 60% during the environment scans.

 

I am already exploring triggering my environment refreshes per environment rather than globally (per-environment refresh appears to have been introduced in Puppet Server 5.3.x), but in some cases we still update 14-20+ environments at once, which will continue to gum up the works.
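For reference, the per-environment refresh works by adding an environment query parameter to the admin API's environment-cache DELETE endpoint, so one environment can be invalidated without touching the other 31. A sketch of building those request URLs (hostname and port are placeholders; the endpoint path is from the Puppet Server admin API docs):

```python
from urllib.parse import urlencode

def environment_cache_url(master, environment=None, port=8140):
    """URL for Puppet Server's DELETE /puppet-admin-api/v1/environment-cache.
    With no environment given, the whole cache is flushed; with one,
    only that environment's cache is invalidated."""
    base = f"https://{master}:{port}/puppet-admin-api/v1/environment-cache"
    if environment is None:
        return base
    return f"{base}?{urlencode({'environment': environment})}"
```

The request would be issued as an HTTP DELETE, authenticated with a certificate permitted by the puppet-admin-api client-whitelist in puppetserver.conf.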

This message was sent by Atlassian JIRA (v7.7.1#77002-sha1:e75ca93)

Josh Cooper (JIRA)

Jan 16, 2020, 6:32:03 PM
to puppe...@googlegroups.com

Thomas Hallgren (JIRA)

Jan 19, 2020, 2:40:04 PM
to puppe...@googlegroups.com
Thomas Hallgren commented on Bug PUP-10233
 
Re: environment_classes API endpoint extremely slow

One option here could be to use a Go puppet parser. It has way better performance than the Ruby parser and would be pretty easy to also make multi-threaded.

Charlie Sharpsteen (Jira)

Sep 11, 2020, 3:38:04 PM
to puppe...@googlegroups.com

I'm a bit confused by:

This spawns 32 jruby processes hitting /puppet/v3/environment_classes

How exactly is that activity showing up? As far as I know, that API endpoint is only used by external services like the PE Console. Unless I'm missing something, having 32 requests roll in for it seems like bad behavior in some external process.


Maggie Dreyer (Jira)

Nov 5, 2020, 6:28:02 PM
to puppe...@googlegroups.com

Are you still having performance issues around this?
