Puppet 3.6.0... and scaling?

Tristan Smith

May 22, 2014, 12:59:25 PM
to puppet...@googlegroups.com
After much hacking to get directory environments settled and the manifest directory in place, I rolled Puppet 3.6 to our puppetmasters last night.

One of our puppetmasters has around 1000 clients, runs Passenger under Apache 2.2 (Ruby 1.8.7, sadly; thanks, CentOS), and normally barely notices Puppet running - it peaks at about 20% CPU usage.

Under 3.6, even after doubling the Passenger worker count, it couldn't keep up with the load - I started running out of Apache processes because they were all stuck waiting for a Passenger worker to free up, and CPU usage on the system maxed out.

strace -c on the Passenger workers showed them spending 30% of their time in clone() and 60% in wait4(), FWIW.
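
For the curious, that was just a syscall-time summary against a worker pid, roughly (PID illustrative):

  # attach to one Passenger worker, follow forks, print a syscall time summary on exit
  strace -c -f -p <worker pid>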

I'm going to be digging to figure out what in hell changed to cause this, but has anyone else experienced a significant change in performance under 3.6?

--Triss

Ellison Marks

May 22, 2014, 1:17:24 PM
to puppet...@googlegroups.com
Hey, I think I remember another thread mentioning some performance issues with directory environments. Basically, the next 3.6 release will add a caching option that mostly alleviated the problem for the OP of that thread.

https://groups.google.com/forum/#!topic/puppet-users/wzy8NPWauu4

Tristan Smith

May 22, 2014, 1:26:47 PM
to puppet...@googlegroups.com
Dang. That does look an awful lot like my issue. I am in fact using directory environments (mostly because of the screaming deprecation warnings telling me I was a bad man if I didn't).

:/ Shame on me for using a .0 release.

Daniele Sluijters

May 22, 2014, 3:21:37 PM
to puppet...@googlegroups.com
The environment caching is already there; use the environment_timeout setting. Mine is set to unlimited and I reload at deploy time by touching tmp/restart.txt. So far this seems to work really well.
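
Concretely, that's just something like this (the rack path depends on your layout):

  # puppet.conf on the master
  [main]
    environment_timeout = unlimited

  # at deploy time, tell Passenger to recycle the app
  touch /etc/puppet/rack/tmp/restart.txt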

Ellison Marks

May 22, 2014, 3:30:09 PM
to puppet...@googlegroups.com
Ah, whoops. Shame on *me* for not remembering what version they were talking about in that thread.

Chuck

May 23, 2014, 1:23:14 PM
to puppet...@googlegroups.com
I am noticing increased CPU and memory requirements compared to Puppet 3.4.3.

I had to set the following Passenger config so that my server would not run out of memory:
PassengerMaxRequests 10000
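
That directive sits in the Apache vhost for the puppetmaster, roughly (placement illustrative):

  # Passenger vhost for the puppetmaster
  <VirtualHost *:8140>
    PassengerMaxRequests 10000
    # ... rest of the usual Passenger/SSL setup ...
  </VirtualHost>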

Puppet 3.4.3: CPU idle 70% - 80%, memory utilization 52%
Puppet 3.6.1: CPU idle 60% - 73%, memory utilization 85%

Tristan Smith

May 23, 2014, 3:20:33 PM
to puppet...@googlegroups.com
Whuf. I'm already at 10000, so that's not it for me.

Seems like the doubled run time was just murdering me in the face. Trying to run 1000 clients every 30 minutes was just too much. I guess I could splay them over an hour (rough sketch below), but the overall change in resource requirements is large enough that I'm probably just going to opt out of directory environments for a couple of releases and wait for them to tune up a bit.
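
Something like this on the agents would spread the runs over an hour (values illustrative; I haven't tried it yet):

  # puppet.conf on each agent
  [agent]
    runinterval = 3600
    splay = true
    splaylimit = 3600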





Josh Partlow

May 23, 2014, 4:24:43 PM
to puppet...@googlegroups.com
Hi Tristan, Chuck,

What environment_timeout do you have set currently?

Chuck

May 23, 2014, 8:25:02 PM
to puppet...@googlegroups.com
I had it set to 3 minutes for this test.

I am going to work on trying some different settings.

So far the best results have been with the following settings:

PassengerMaxRequests 5000
environment_timeout 15 minutes
"service httpd graceful" whenever my svn update script detects a change (rough sketch below).

I will be working on getting more information.
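
For reference, the update script boils down to roughly this (paths illustrative):

  #!/bin/sh
  # sketch of the svn-update-and-reload step
  cd /etc/puppet/environments || exit 1
  before=$(svnversion)
  svn update -q
  after=$(svnversion)
  if [ "$before" != "$after" ]; then
      # manifests changed; recycle the Passenger workers
      service httpd graceful
  fi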

Tristan Smith

May 25, 2014, 5:53:47 PM
to puppet...@googlegroups.com
I've been using the default (which, if I read correctly, is 5s?); seems likely that a good answer for me would be to push it out to $something_long and use restart.txt or some such to pick up changes. I may try that this week.

My MaxRequests was already 10k. I'm guessing that 5 seconds of cache was approximately equivalent to jack and squat, and I'm not sure my attempt to tune by adding more workers would have done me any good.


Konrad Scherer

May 27, 2014, 8:53:29 AM
to puppet...@googlegroups.com
On 14-05-22 03:21 PM, Daniele Sluijters wrote:
> The environment caching is already there, use the environment_timeout
> setting. Mine is set to unlimited and I reload at deploy time by
> touching tmp/restart.txt. This so far seems to work really well.

Thanks for the suggestion. I have also been dealing with high CPU load
on my puppet masters since 3.5.0. Triggering the puppet master restart
makes a lot of sense. I am using a git post commit hook to reload the
puppet configs on my three puppet masters and I have added the code to
restart the puppet rack app after changes have been detected. I will
report back once I have had some time to analyze the results.
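
The hook ends up doing roughly this (host names and paths are illustrative):

  #!/bin/sh
  # post-commit hook sketch: refresh each master's checkout, then restart the rack app
  for master in puppet1 puppet2 puppet3; do
      ssh "$master" 'cd /etc/puppet/environments/production && git pull --ff-only && touch /etc/puppet/rack/tmp/restart.txt'
  done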

This seems like a major change from previous puppet versions. I have
been using Puppet since 2.6 and any changes to puppet configs on the
master were always picked up immediately. Is this because the puppet
master was not doing any caching or is the puppet master watching the
puppet configs for changes? Has this behavior now changed? Will changes
to puppet manifests on the master only be detected after the
environment_timeout has expired?

Thank you in advance for any insight.

--
Konrad Scherer, MTS, Linux Products Group, Wind River

Andy Parker

May 27, 2014, 5:39:15 PM
to puppet...@googlegroups.com
On Tuesday, May 27, 2014 5:53:29 AM UTC-7, Konrad Scherer wrote:
> On 14-05-22 03:21 PM, Daniele Sluijters wrote:
> > The environment caching is already there, use the environment_timeout
> > setting. Mine is set to unlimited and I reload at deploy time by
> > touching tmp/restart.txt. This so far seems to work really well.
>
> Thanks for the suggestion. I have also been dealing with high CPU load
> on my puppet masters since 3.5.0. Triggering the puppet master restart
> makes a lot of sense. I am using a git post commit hook to reload the
> puppet configs on my three puppet masters and I have added the code to
> restart the puppet rack app after changes have been detected. I will
> report back once I have had some time to analyze the results.


By "puppet configs" do you mean the puppet manifest files? Under Rack the puppet master neither watches nor reloads the puppet.conf file.
 
> This seems like a major change from previous puppet versions. I have
> been using Puppet since 2.6 and any changes to puppet configs on the
> master were always picked up immediately. Is this because the puppet
> master was not doing any caching or is the puppet master watching the
> puppet configs for changes? Has this behavior now changed? Will changes
> to puppet manifests on the master only be detected after the
> environment_timeout has expired?


The caching behavior for directory environments is a bit different from the previous system. I've been working on a blog post about this, but haven't finished it yet :(

First off, what is being cached? When we talk about caching environments we are talking (mostly) about caching the parsed and validated form of the manifest files. This saves the cost of disk access (stat to find files, reads to list directory contents, reads to fetch manifest file contents) as well as a certain amount of CPU use (lexing, parsing, building an AST, validating the AST). This is what has been part of the cache for quite a while now.

What has changed is the cache eviction mechanism. The directory environments employ a different eviction and caching system than the "legacy" environments. The legacy environments were tracked by singleton instances, one per environment, that the master never got rid of. Those environments held references to the AST objects as well as to WatchedFile objects, which were used to track changes to the mtime of the manifest files. The WatchedFile instances would stat the file they were supposed to watch, but limit the stat calls to happen no more often than the filetimeout setting specified.

Before Puppet 3.4 (? 3.5? I lose track of which version had which change) the WatchedFile instances would get interrogated throughout the compilation process. In fact, every time it asked whether one file had changed it ended up asking whether *any* file had changed. That had a lot of side effects, but I won't derail the conversation by going into them. In 3.4 (or was it 3.5) the legacy environment system was changed to only check whether files had changed at the beginning of a compile. That, however, meant that in the worst case it would still issue a stat call for every manifest file; in the best case (depending on your viewpoint) it would issue no stat calls because the filetimeout had not expired; or it would land somewhere in between. The in-between case is possible because each WatchedFile instance had its own timer for the filetimeout, so they could drift apart over time, which allowed changes to some files to be detected but not others.

For the directory environments we chose a different system for managing the caches. The watchword here was KISS. Under the new system there isn't any file watching involved (right now, that is; there is a PR open to introduce a 'manual' environment_timeout option). Instead, once an environment has loaded a file it simply holds onto the result. All of the caching now comes down to holding onto the environment instance, and cache eviction is just about when puppet should throw away that instance and re-create it. There are a few options here (a rough puppet.conf sketch follows the list):

  * environment_timeout = 0 : Good for a development setup where you are editing manifests and running an agent to see what happens. Nothing is cached, so the full lex, parse, and validate overhead is incurred on every agent catalog request.
  * environment_timeout = <some number> : For "spiky" agent requests - for instance, you don't run agents continually and instead only trigger them as needed with mco. If you know that it takes 20 minutes from the first agent checking in to the last, and you do that kind of on-demand deploy once a day, just set the timeout to 30m (20 minutes plus some slack for variance). The cache will last through the whole run but will have expired by the next time you run.
  * environment_timeout = unlimited : When agents are checking in all the time, there is no "down time" when the cache can reasonably go away. Anything less than unlimited here will cause periodic spikes in CPU usage as each Passenger worker reparses everything. Even with unlimited you'll get some reparsing, simply because the Passenger workers are periodically killed, but that is out of my control. With unlimited, the question becomes: when *should* the manifests be reloaded? Whenever you deploy new manifests - any more often than that is just a waste (in fact, that is really the answer for all of these cases). This is where graceful restarts come in: whenever a manifest change is deployed, trigger a graceful restart, which causes the environment caches to be lost (because the process dies) and recreated on the next request.
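
As a rough illustration (values are just examples), that might look like this in puppet.conf on the master:

  [main]
    environmentpath = $confdir/environments

    # 0         - reparse on every catalog request (development workflow)
    # 30m       - cache lasts through one on-demand run, then expires
    # unlimited - never evict; clear the cache with a graceful restart at deploy time
    environment_timeout = unlimited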

The reason we went with this system for cache eviction is that it puts much more control over resource utilization in your hands. When puppet was watching files it would end up doing stat calls (which can be very slow) even when nothing was changing. Since manifests only change when something actively changes them (either a person editing them or a deploy process laying down new versions), this moves the decision about when to incur the manifest reload cost back to the user.

There has also been a question once or twice about why we didn't just go with inotify or a similar system. Mostly it comes down to complexity and portability. This system works anywhere that puppet can run and is immensely simpler.

> Thank you in advance for any insight.


I hope what I said above is of help.

Konrad Scherer

May 30, 2014, 2:49:57 PM
to puppet...@googlegroups.com
On 05/27/2014 05:39 PM, Andy Parker wrote:
> On Tuesday, May 27, 2014 5:53:29 AM UTC-7, Konrad Scherer wrote:
>
> On 14-05-22 03:21 PM, Daniele Sluijters wrote:
> > The environment caching is already there, use the environment_timeout
> > setting. Mine is set to unlimited and I reload at deploy time by
> > touching tmp/restart.txt. This so far seems to work really well.
>
> Thanks for the suggestion. I have also been dealing with high CPU load
> on my puppet masters since 3.5.0. Triggering the puppet master restart
> makes a lot of sense. I am using a git post commit hook to reload the
> puppet configs on my three puppet masters and I have added the code to
> restart the puppet rack app after changes have been detected. I will
> report back once I have had some time to analyze the results.
>
> By "puppet configs" do you mean the puppet manifest files? Under Rack the puppet
> master neither watches nor reloads the puppet.conf file.

That wasn't clear, sorry. I mean the puppet manifest *.pp files, not the conf files.

Yes, thank you for taking the time to explain this. It helps me understand the
behavior I was seeing and the change in behavior starting with 3.5.0. I agree
that moving the manifest reload cost decision to the user is a good idea,
especially in my environment, where I can go days without making changes to the
manifests. I am testing now with environment_timeout=unlimited and a 'touch
/etc/puppet/rack/tmp/restart.txt' in the script that gets notified of changes to
my git repo of puppet manifests. I will report back when I have some data.

Tristan Smith

May 31, 2014, 1:43:44 PM
to puppet...@googlegroups.com
A followup for closure:

I've now turned environment_timeout to 'unlimited' and am doing a graceful restart of httpd to reset passenger. This seems to work.

The resultant performance is _much_ more sane; I'm still seeing maybe a 20% increase in server CPU requirements, but that's quite manageable. Overall run time is palpably better (about 10%); I haven't identified where the savings come from, though since I'm in parser=future land that may be a fair chunk of it.

Upshot: oh god, the 5s default timeout is nightmarishly bad for large systems; it made it look like Puppet suddenly didn't know how to scale. In reality it has minimal effect once you tune it, and I'm now rocking 3.6 in the wild.

Brian Wilkins

May 31, 2014, 2:41:07 PM
to puppet...@googlegroups.com
What about staggering your runs? It seems trivial, but at least it would reduce your load, I think.

Tristan Smith

Jun 1, 2014, 12:04:14 AM
to puppet...@googlegroups.com
I've got plenty of possible mechanisms to experiment with for lessening the load, but I honestly don't _have_ to now that I've fixed the caching problem. 20% higher load means the system peaks out at 40% idle now, so I've got some headroom. With the 5s cache it was bringing the server to its knees; that's no longer an issue.


On Sat, May 31, 2014 at 11:41 AM, Brian Wilkins <bwil...@gmail.com> wrote:
> What about staggering your runs? It seems trivial, but at least it would reduce your load, I think.