The source of the massive slowdown with directory environments


Andy Parker

May 1, 2014, 7:21:23 PM
to puppe...@googlegroups.com
This is an explanation of the slowdowns that some people saw with 3.5.0/1 when directory environments were enabled. It also explains why we didn't notice them, and describes some of the pains we've had to work through on the caching system that we put into place for 3.6.0 (which has been under ongoing discussion here).

---------------------------------

In order to get the directory environments system in, we added an environment loading system. It can be found in lib/puppet/environments.rb (Puppet::Environments). The current set of configured loaders is accessible via the new context system (a more controlled way of tracking global information). The "current environment" is also accessible in the context system (Puppet::Context, accessed using Puppet.lookup(name)). Note: the context system is being used for several things in puppet now, but there is a possibility that it will be replaced with another system in puppet 4.0, so be careful about using it outside of the puppet codebase.

In the old system the current environment was available as Puppet::Node::Environment.current. That was changed in order to start moving the code toward a more decoupled design. Rather than everything sharing this single global and having to set and reset it, the current environment (as well as other things) is tracked in a system (Puppet::Context) that automatically tracks setting and unsetting. It also enforces that the boundaries for when something needs to change are well known.
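To make the "automatically tracks setting and unsetting" idea concrete, here is a minimal sketch of a scoped context. ToyContext and its methods are hypothetical names made up for illustration; this is not how Puppet::Context is actually implemented, only a toy model of the behaviour described above.

```ruby
# Toy model of a scoped context. ToyContext is a hypothetical name,
# not Puppet::Context itself.
class ToyContext
  def initialize
    @stack = [{}] # the innermost set of bindings is last
  end

  # Bind values for the duration of the block, then restore the
  # previous bindings even if the block raises.
  def override(bindings)
    @stack.push(@stack.last.merge(bindings))
    yield
  ensure
    @stack.pop
  end

  # Look up a binding in the innermost scope.
  def lookup(name)
    @stack.last.fetch(name) { raise KeyError, "no binding for #{name}" }
  end
end

context = ToyContext.new
seen = []
context.override(current_environment: "production") do
  seen << context.lookup(:current_environment)
  context.override(current_environment: "testing") do
    seen << context.lookup(:current_environment)
  end
  # The inner override has been undone without any explicit reset.
  seen << context.lookup(:current_environment)
end
```

The point of the block form is that nobody has to remember to put the old value back, and the places where the environment can change are visible in the code.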

In the old system, the way of getting an environment (any environment) was to simply call Puppet::Node::Environment.new. The code was littered with those calls. There was an optional argument to the new() method that would give you a particular environment. If nothing was given, then you got back the "configured environment", which might or might not be the current environment. Various parts of the code were getting the environment from another object, or assuming that the current value of the environment setting was the name of the environment they needed to use, or using no name at all, or taking any number of other paths.
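A rough sketch of why that pattern was error-prone. LegacyEnvironment and configured_name are hypothetical stand-ins invented for this example, not the real Puppet::Node::Environment; the only thing modelled is the silent fallback when no name is passed.

```ruby
# Hypothetical model of the old pattern; LegacyEnvironment is not
# Puppet's class.
class LegacyEnvironment
  class << self
    # Stands in for the value of the `environment` setting.
    attr_accessor :configured_name
  end

  attr_reader :name

  def initialize(name = nil)
    # No name given: silently fall back to the configured environment.
    @name = name || self.class.configured_name
  end
end

LegacyEnvironment.configured_name = "production"

# A caller that still knows which environment it is working in.
explicit = LegacyEnvironment.new("testing")

# A caller deeper in the code that has lost track of the environment.
# It gets an environment back, just not necessarily the right one.
implicit = LegacyEnvironment.new
```

Nothing fails in the second call, which is exactly why this kind of mistake went unnoticed.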

We have encountered the problems caused by all of this before. It has shown up as code being loaded from the wrong environment or environments being checked for no reason at all. The number of times we've encountered this "lost environment" problem is pretty large.

The new code tried to add some guardrails. Environments only come from a loader. The current environment is tracked in the context. This opened up interesting problems as we modified the code. There were a lot of areas that had no clear environment! Sometimes, depending on when the call occurred, the environment would change over time.

The 20x slowdown that was reported stemmed from one of these locations.

In order for puppet to find a resource in the catalog, it has to load the definition of the resource's type. To find the type, it needs to know the environment. However, by that point in the code it had lost track of what environment to use. It ended up falling back to the "configured" environment (the one in the environment setting). I believe that we've even had reports of strange issues where some resource references might not end up working, or odd code gets loaded.

For legacy environments, the environment had already been loaded and parsed, and was ready to go (because it was likely cached). In directory environments, without caching, it had to load up the environment fresh and search for the type's definition. This lookup and reparse could happen hundreds of times during a manifest run.
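The difference between the two cases can be sketched with a counter standing in for the cost of loading and parsing. All of the class names here (ToyEnvironment, UncachedLoader, CachedLoader) are hypothetical and are not the loaders in lib/puppet/environments.rb; the 300 lookups are an arbitrary figure chosen to match "hundreds of times".

```ruby
# Hypothetical stand-ins; these are illustrative only.
class ToyEnvironment
  attr_reader :name

  def initialize(name, types)
    @name = name
    @types = types
  end

  def find_type(type_name)
    @types.fetch(type_name)
  end
end

# Builds a fresh environment on every request.
class UncachedLoader
  attr_reader :parse_count

  def initialize(definitions)
    @definitions = definitions
    @parse_count = 0
  end

  def get(name)
    @parse_count += 1 # stands in for reading and parsing from disk
    ToyEnvironment.new(name, @definitions.fetch(name))
  end
end

# Wraps another loader and remembers what it has already built.
class CachedLoader
  def initialize(loader)
    @loader = loader
    @cache = {}
  end

  def get(name)
    @cache[name] ||= @loader.get(name)
  end
end

definitions = { "production" => { "file" => :file_type } }

uncached = UncachedLoader.new(definitions)
300.times { uncached.get("production").find_type("file") }

backing = UncachedLoader.new(definitions)
cached = CachedLoader.new(backing)
300.times { cached.get("production").find_type("file") }
```

In this toy model the uncached loader does its expensive step once per lookup, while the cached one does it once in total.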

The benchmarks didn't catch this because their configured environment was the default "production". In order for any bootstrapping to occur in puppet, it turns out that all sorts of things assume that the configured environment exists. Almost all of this is entirely by accident, and isn't needed. However, in order to not have to try to untangle all of that as well, I added a static environment definition for the configured environment. This static definition has no modules, no manifests, nothing.

When the benchmarks ran, all of the resource lookups would go through this stage of forgetting what environment was in effect. They would grab the configured environment, which turned out to be the static, null-object environment rather than a real one. Since that environment didn't have anything in it, there wasn't a large cost, and so this incorrect environment usage wasn't noticed.

Users who tried turning on directory environments had likely left the configured environment as the default "production", and actually had a "production" environment. So when the configured environment was looked up, it ended up loading a real environment, and searching that for the type definition took a significant amount of time.

The recent work to get caching in place took us to that exact place in the code again. The more we poked it, the more it became apparent that there were many, many places in the code that completely lost track of what environment should be used. Sometimes this didn't matter; sometimes it was unclear how the code ever worked. The fix for this was to start threading the right environment through. In places where there wasn't an environment available we started using a Puppet::Node::Environment::NONE null-object environment to signal that.
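As a rough illustration of "threading through" plus a null object, assuming invented names throughout: ToyEnv, ToyEnv::NONE and resolve_type are hypothetical and only mirror the idea of Puppet::Node::Environment::NONE, not its actual code.

```ruby
# Hypothetical sketch; ToyEnv and resolve_type are invented names.
class ToyEnv
  attr_reader :name

  def initialize(name, types = {})
    @name = name
    @types = types
  end

  # Returns nil when this environment does not define the type.
  def find_type(type_name)
    @types[type_name]
  end

  # An empty environment that says "no environment applies here".
  # It has nothing to load, so consulting it is cheap.
  NONE = new("none").freeze
end

# The environment is an explicit argument, so this lookup has no
# global or configured environment to fall back on.
def resolve_type(environment, type_name)
  environment.find_type(type_name)
end

production = ToyEnv.new("production", { "file" => :file_type })

found = resolve_type(production, "file")
missing = resolve_type(ToyEnv::NONE, "file")
```

The design choice is that a caller without an environment has to say so explicitly, instead of quietly picking up whatever the configured environment happens to be.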

--
Andrew Parker
Freenode: zaphod42
Twitter: @aparker42
Software Developer

Join us at PuppetConf 2014, September 22-24 in San Francisco
Register by May 30th to take advantage of the Early Adopter discount save $349!