Hi Andy,
On Thu, 2014-02-06 at 15:38 -0800, Andy Parker wrote:
> On Thu, Feb 6, 2014 at 1:50 PM, Brice Figureau
> <
brice-...@daysofwonder.com> wrote:
> Hi,
>
> During (the awesome) cfgmgmtcamp Puppet Contributor Summit
> earlier this
> week, Peter Meier and I were chasing (in fact he did all the
> hard work)
> a couple of strange bugs, and we found while stracing the
> agent what
> might be a performance issue.
>
>
>
> That was an absolutely incredible write up of the issue that you put
> in the ticket! Many thanks to both of you for tracking that down. Erik
> Dalen had investigated something that looks like the same problem and
> posted the info to PUP-751.
Yes, this looks the same, but Erik was finding the issue on the master,
whereas Peter and I focused mostly on the agent.
> Everything is explained in detail in PUP-1592 [1], but the
> tl;dr version
> is we're stat(2)ing a lot of inexistent files during the
> transaction for
> every instance of a given defined types (that might well get
> in the
> order of 100s of stat(2) per instance of defined types).
>
>
>
> I think your idea of changing the order that puppet searches makes
> sense. It is much more likely that the requested name is a defined
> type than a custom type and so looking for defined types first short
> circuits the whole process more quickly.
That was my gut feeling too.
> There is maybe another issue at play, though. As Erik points out it
> seems like puppet will look up the same things again and again. We've
> done a lot of hacking around to try to put in caches and all sorts of
> other things, but those all end up being very fragile and we regress
> on the fix. The problem is that known_resource_types nearly impossible
> for us to manage in the system in a sane matter. I think the full fix
> for this is going to need to address the problem of keeping track of
> what is loaded, what isn't, where it was loaded from, and make a much
> cleaner step for when the cache is invalidated.
All the problem resolves to negative caching. The problem is that Puppet
tries to validate existence of given files to load them, but doesn't
keep an eye on those that don't exist. So next lookup of the same files
will also trigger a swarm of stat call, because Puppet doesn't remember
what it already didn't find.
I do agree that negative caching (and caching in general) is very
difficult. In the agent case, the transaction now happens isolated in
its own process and is very short-lived, which means we could cache in a
global singleton for the duration of the run. The problem on the master
is a bit more complex due to the global nature of the
known_resource_types (and the autoloader).
> I recently created the start of a benchmarking system in puppet
> (
https://github.com/puppetlabs/puppet/commit/e48157e42360aaf0c67e2ef3866b091f0a113ce8) and the profile of the one scenario added already shows an absurd number of calls to determine if any files have changed. Thankfully in this case the caching of the stat call is working, but it is still spending a huge amount of time just deciding if the filetimeout has expired.
Yes, I've already seen this issue.
> I'd like to get the input of all the devs here on the proper
> way of
> fixing the bug before sending a PR :)
>
>
>
> I think that just swapping the search order is a good fix that should
> be less likely for us to regress. If that shows a performance
> improvement, then that should definitely go in.
This will show an improvement on the master for sure, but as I said in
the ticket that doesn't really solve the agent issue (see the second
part of the explanation). That's where I'm unsure about the proper fix.
I'd really like your (and the others) input on the problem.
> I think the more proper fix will require a larger redesign of how
> puppet loads things and tracks what it has loaded.
It's even worst than that. We need to track things that couldn't be
loaded to not try to reload them again.
In this specific bug, I'm wondering if we really need to have those
defined type as resources in the RAL catalog (I believe it might be used
for relationship or event propagation).
If we really do, we might be able to flag those as "defined type" and
then skip compound parsing title (because compound titles can only
happen on ruby/builtins types and is the root cause of trying to find
the 'defined type' type).
This would allow us to prevent calling P::R#resource_type in the agent,
at the expense of having a new flag in the resource (which also means
more information in the transferred catalog and incompatibilities and so
on).
Any more ideas on how to solve this issue?
Anyway thanks for listening :)