Rethinking file recursion

Luke Kanies

Mar 19, 2009, 5:50:45 PM

to puppe...@googlegroups.com

Hi all,

I've been thinking a lot about file recursion and why it's so darn
complicated, and I think one reason is the recursion happening in the
same resource type doing the managing. As a result, I've been
thinking of moving the file recursion into a Fileset resource type.

Currently, the file type generates new file resources during
recursion; this basic model would be the same, except that the fileset
resource type would be generating files.

This would also make file globbing essentially trivial. You currently
can't do it, because the glob file itself shouldn't be managing
anything; it should just generate resources that match the glob and
thus match something. Since there's no such thing as an abstract file
(i.e., one that doesn't do anything), this doesn't work.

With the fileset, though, it would never do anything other than
generate other resources.
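
A minimal sketch of that idea, assuming a hypothetical Fileset class and a
stand-in FileResource struct (neither is Puppet's real API). The fileset
manages nothing itself; it only turns a glob into file resources:

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical stand-in for a managed file resource; not Puppet's real class.
FileResource = Struct.new(:path, :mode)

# Hypothetical fileset: it never manages anything itself and only
# generates one file resource per path that matches its glob.
class Fileset
  def initialize(pattern, mode:)
    @pattern = pattern
    @mode = mode
  end

  def generate
    Dir.glob(@pattern).sort.map { |path| FileResource.new(path, @mode) }
  end
end

Dir.mktmpdir do |dir|
  %w[a.conf b.conf c.txt].each { |name| FileUtils.touch(File.join(dir, name)) }
  fileset = Fileset.new(File.join(dir, "*.conf"), mode: 0o644)
  fileset.generate.each do |res|
    puts format("%s %o", File.basename(res.path), res.mode)
  end
end
```

Only the two .conf files produce resources here; c.txt is never touched,
which is the point of keeping the glob out of the file type.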

I almost think we could build a simple 'set' resource type that knew
how to manage any kind of resource, and it would just generate other
resources via a specific interface - probably a 'find' method on the
class, I guess.
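
Roughly what that interface might look like, as a sketch only. The type
classes, their hard-coded data, and the ResourceSet name are all
hypothetical; the only thing taken from the description above is a
class-level 'find' that the set delegates to:

```ruby
# Hypothetical generic resource record.
Resource = Struct.new(:type, :name)

# Hypothetical resource types; each exposes a class-level `find`
# that returns the resources matching a query.
class UserType
  KNOWN = %w[alice bob root].freeze

  def self.find(pattern)
    KNOWN.grep(pattern).map { |name| Resource.new(:user, name) }
  end
end

class PackageType
  KNOWN = %w[ruby rubygems vim].freeze

  def self.find(pattern)
    KNOWN.grep(pattern).map { |name| Resource.new(:package, name) }
  end
end

# A single generic 'set' that can generate resources of any type,
# as long as that type implements `find`.
class ResourceSet
  def initialize(type, query)
    @type = type
    @query = query
  end

  def generate
    @type.find(@query)
  end
end

ResourceSet.new(PackageType, /\Aruby/).generate.each do |res|
  puts "#{res.type}[#{res.name}]"
end
```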

I don't know if it makes more sense to have a single 'set' resource
type, or a set type for each resource type (e.g., userset, fileset,
etc.). The latter seems silly, and would require metaprogramming, but
is the only real option until we get multiple primary keys.

This 'set' type would replace the existing 'resources' resource type.

Any comments?

--
Now and then an innocent man is sent to the legislature.
--Kin Hubbard
---------------------------------------------------------------------
Luke Kanies | http://reductivelabs.com | http://madstop.com

Brice Figureau

Mar 19, 2009, 6:09:37 PM

to puppe...@googlegroups.com

On 19/03/09 22:50, Luke Kanies wrote:
> Hi all,
>
> I've been thinking a lot about file recursion and why it's so darn
> complicated, and I think one reason is the recursion happening in the
> same resource type doing the managing. As a result, I've been
> thinking of moving the file recursion into a Fileset resource type.
>
> Currently, the file type generates new file resources during
> recursion; this basic model would be the same, except that the fileset
> resource type would be generating files.

I've also been thinking a lot about local file recursion lately, but for
performance reasons.

I understand your idea and the benefits of your proposal in terms of
clarity, code concision and such.

Right now, the main performance issue with local recursive file
resources is creating one newchild file resource per managed sub file,
which in turn will be managed by the system.
Ruby seems particularly slow at creating tons of objects, and it uses
memory for something that is really transient.

My idea on the subject (but I didn't research if that's doable) was that
we don't really need to create those objects, if we consider a recursive
resource as an "opaque" system which manages its own sub-resources
itself. This behavior could be supported by the puppet ral system by
defining a kind of recursive manager system that could offer
programmatic resource management instead of being object based.
I'm not sure I'm perfectly clear; it's late here and my brain needs some
rest :-)

I think this violates the current puppet contract, but I'm sure we could
implement the recursive behavior outside of the file resource while
still being able to manage sub-resources procedurally instead of having
to generate them.

Maybe that's what you are proposing (still late here).
If not, please try to think about it and see if that could make sense.
--
Brice Figureau
Days of Wonder
http://www.daysofwonder.com

Mark Plaksin

Mar 19, 2009, 8:19:13 PM

to puppe...@googlegroups.com

Brice Figureau <brice-...@daysofwonder.com> writes:

> On 19/03/09 22:50, Luke Kanies wrote:
>
>> Hi all,
>>
>> I've been thinking a lot about file recursion and why it's so darn
>> complicated, and I think one reason is the recursion happening in the
>> same resource type doing the managing. As a result, I've been
>> thinking of moving the file recursion into a Fileset resource type.
>>
>> Currently, the file type generates new file resources during
>> recursion; this basic model would be the same, except that the fileset
>> resource type would be generating files.
>
> I've also been thinking a lot about local file recursion lately, but for
> performance reasons.
>
> I understand your idea and the benefits of your proposal in terms of
> clarity, code concision and such.
>
> Right now, the main performance issue with local recursive file
> resources is creating one newchild file resource per managed sub file,
> which in turn will be managed by the system.
> Ruby seems particularly slow at creating tons of objects, and it uses
> memory for something that is really transient.

I'm outta my league as far as the code goes but this *sounds* like what
caused the support case about tidy that I submitted today. We tried to
tidy our puppet reports directory which had 350k files using 1.5G.
Puppet used more and more RAM until it segfaulted when it hit 2G (I
assume it was exactly 2G--I wasn't watching that closely but our stats
say used RAM went up about 2G while puppet was running).

Luke Kanies

Mar 19, 2009, 8:28:56 PM

to puppe...@googlegroups.com

It's not actually what I'm proposing - I'd say it's a parallel and
possibly competing proposal.

I've been thinking about something similar. I think there are at
least three ways one could do what you're asking (in inverse order of
overhead). I'm describing them here with simple names so it's easier
to refer back to them; the names aren't perfect, but hopefully they'll
do.

1) Transient resources: Continue to create the resources but create
and destroy them one at a time

2) Set resources: Use a single recursive operation that somehow
manages to retain transactional integrity

3) Set operations: Perform a recursive operation that loses
transactional integrity

I think you're essentially proposing something like #3.

I'll provide some more detail on each, but there are a couple of
points of complexity that are worth noting. In particular, the choice
here has a significant effect on logging and events. You can actually
think of logs and events as isomorphic, and they're only going to get
more so: I hope by 0.26 or so all transaction logs are actually
generated by events.

Obviously logging is critical so you know what's happening on your
system. Events are critical so that you can react to those changes.
E.g., if you need to restart a service if any file in a fileset
changes, then an event from a file deep in the hierarchy needs to be
routed appropriately and then it needs to be able to be logged as
being from that location.

We solve that right now using proxy resources - the recursing resource
is the proxy for the event-generating resource. This will likely work
for any of the other solutions, too, but it's worth thinking about
here because I've found it to be the major source of complexity, and I
think that would continue.
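
To make the proxy idea concrete, here is a sketch under assumptions: the
Event struct, the RecursingResource class and its methods are hypothetical
and not the transaction code. The event is routed through the declared
(recursing) resource but keeps the deep file's path for logging:

```ruby
# Hypothetical event: records where the change really happened (source)
# and which declared resource stands in for it (proxy).
Event = Struct.new(:name, :source, :proxy)

class RecursingResource
  attr_reader :title

  def initialize(title)
    @title = title
    @subscribers = []
  end

  # Register something that reacts to events, e.g. a service restart.
  def subscribe(&callback)
    @subscribers << callback
  end

  # Called when a generated child changes. Subscribers of the recursing
  # resource are notified, and the event still names the child's path.
  def child_changed(child_path, name)
    event = Event.new(name, child_path, title)
    @subscribers.each { |callback| callback.call(event) }
    event
  end
end

log = []
conf = RecursingResource.new("File[/etc/app]")
conf.subscribe do |event|
  log << "restart: #{event.name} at #{event.source} (via #{event.proxy})"
end
conf.child_changed("/etc/app/conf.d/deep/site.conf", :file_changed)
puts log
```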

So, on to descriptions:

Transient resources:

Currently, we create the whole list of resources, add them to the
graph, and then iterate over them. Instead, we could essentially
process them one at a time and then discard them. Our current method
(loosely) is:

file.eval_generate.each do |resource|
  add_to_catalog(resource)
  eval_resource(resource)
end

Instead, it would become:

file.eval_generate { |resource| eval_resource(resource) }

Ridiculous pseudo-code, obviously, but hopefully you get the idea.
The optimization here is that 1) we aren't adding to the catalog and
2) we aren't building a list at all.
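
A slightly less ridiculous sketch of the two shapes, still with
hypothetical names (Child, RecursiveFile, and the plain arrays standing in
for the catalog are assumptions, not the real classes). Without a block the
whole list is built; with a block each child is yielded and then dropped:

```ruby
# Hypothetical generated child resource.
Child = Struct.new(:path)

class RecursiveFile
  def initialize(paths)
    @paths = paths
  end

  # With no block: build and return the whole list (the current shape).
  # With a block: yield one child at a time and keep no reference to it.
  def eval_generate
    return @paths.map { |path| Child.new(path) } unless block_given?

    @paths.each { |path| yield Child.new(path) }
    nil
  end
end

file = RecursiveFile.new(%w[/srv/a /srv/b /srv/c])

# Current shape: every child is built up front and added to the catalog.
catalog = []
file.eval_generate.each do |resource|
  catalog << resource
end

# Transient shape: each child is evaluated and then becomes garbage.
evaluated = []
file.eval_generate { |resource| evaluated << resource.path }

puts catalog.size
puts evaluated.inspect
```

The second call never holds more than one child at a time, which is where
the memory saving would come from.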

Set resources:

We could have some kind of resource that didn't use instance variables
for any of the values or comparisons, such that a single resource
instance could be used to do all of the operations. The main thing is
that it kicks out change instances for each of the things that needs
to be done, with all of the appropriate information for logging and
events.

We're still paying a per-change overhead, but I don't think you can
get away from that and retain transactional integrity.
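
As a sketch of what "no instance variables" could mean in practice: the
ModeChecker class and the Change struct below are hypothetical, and the
current state is faked with a hash. Both 'is' and 'should' are passed in,
so one instance serves every file and only changes are allocated:

```ruby
# Hypothetical change record carrying what logging and events need.
Change = Struct.new(:path, :property, :is, :should)

# Hypothetical stateless checker: it holds no per-file state, so a
# single instance can be reused for every file in the set.
class ModeChecker
  def changes(path, is, should)
    return [] if is[:mode] == should[:mode]

    [Change.new(path, :mode, is[:mode], should[:mode])]
  end
end

checker = ModeChecker.new
should = { mode: 0o644 }
current = {
  "/srv/www/index.html" => { mode: 0o644 },
  "/srv/www/secret.key" => { mode: 0o666 }
}

all_changes = current.flat_map { |path, is| checker.changes(path, is, should) }
all_changes.each do |change|
  puts format("%s: %s %o -> %o", change.path, change.property, change.is, change.should)
end
```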

Set operations:

This is pretty much just a big, painful chmod -R. This would be a bit
more difficult because we'd have to skip any resources that are
managed anywhere else.
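
For comparison, a sketch of the bulk version. The chmod_r_unmanaged helper
and its managed_elsewhere list are hypothetical, and matching on exact
paths only is a simplifying assumption. It emits no per-file changes,
which is exactly where the transactional integrity goes:

```ruby
require "find"
require "set"
require "tmpdir"
require "fileutils"

# Hypothetical bulk operation: chmod every regular file under root except
# paths that are managed elsewhere. Nothing is recorded per file.
def chmod_r_unmanaged(root, mode, managed_elsewhere)
  skip = managed_elsewhere.to_set
  Find.find(root) do |path|
    next if skip.include?(path)

    File.chmod(mode, path) if File.file?(path)
  end
end

Dir.mktmpdir do |dir|
  loose = File.join(dir, "loose.log")
  owned = File.join(dir, "owned.conf")
  [loose, owned].each do |path|
    FileUtils.touch(path)
    File.chmod(0o600, path)
  end

  chmod_r_unmanaged(dir, 0o644, [owned])

  [loose, owned].each do |path|
    puts format("%s %o", File.basename(path), File.stat(path).mode & 0o777)
  end
end
```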

In writing this description, I think the second option is the best,
even though I was leaning toward the first, initially. I think it's
doable - the big limitation right now is the use of instance variables
in files. If 'should' weren't set anywhere, then you'd be passing it
in each time, and if you're passing it in, then you could use the same
instance for every file you needed to operate on.

One of the big changes we made (but no one noticed) in the fall of '07
was that we switched 'is' from being an instance variable to being
transient, only maintained by the transaction. I've always wanted to
switch 'should', too, but I haven't known how.

If we were to combine this with my goal of splitting resource types
into a Resource class (which already exists in 0.25) and a
ResourceType class (whose instances will model individual resource
types), then these resource types could be written to operate with no
instance variables, essentially. This would, I think, enable the set
resources pretty easily. (I suppose I should open this as a ticket,
so people know wtf I'm talking about.)

Well, 'easily' once you refactored the RAL entirely and broke backward
compatibility.

--
Learning is not attained by chance, it must be sought for with ardor and
attended to with diligence. -- Abigail Adams

Luke Kanies

Mar 19, 2009, 8:29:15 PM

to puppe...@googlegroups.com

Indeed, that was likely the problem. I was just about to close your
support ticket. :)

--
It is dangerous to be right when the government is wrong.
-- Voltaire

Brice Figureau

Apr 25, 2009, 11:25:19 AM

to puppe...@googlegroups.com

Hi,

For whatever reason, it appears I never followed up on this interesting
conversation.

I was about to resurrect a patch I submitted around this timeframe to
"compress" file paths in File resources (I'll repost it soon, as I'd like
it to be part of the 0.25 beta cycle if possible), and remembered this topic.

Hmm, I wasn't really clear in my previous e-mail, but in fact I was
thinking more about something like your #2, i.e. the File resource could
get support from a recursive meta-resource to actually "perform" the
recursion.

> I'll provide some more detail on each, but there are a couple of
> points of complexity that are worth noting. In particular, the choice
> here has a significant effect on logging and events. You can actually
> think of logs and events as isomorphic, and they're only going to get
> more so: I hope by 0.26 or so all transaction logs are actually
> generated by events.
>
> Obviously logging is critical so you know what's happening on your
> system. Events are critical so that you can react to those changes.
> E.g., if you need to restart a service if any file in a fileset
> changes, then an event from a file deep in the hierarchy needs to be
> routed appropriately and then it needs to be able to be logged as
> being from that location.

Of course, it makes perfect sense.

> We solve that right now using proxy resources - the recursing resource
> is the proxy for the event-generating resource. This will likely work
> for any of the other solutions, too, but it's worth thinking about
> here because I've found it to be the major source of complexity, and I
> think that would continue.

OK.

> So, on to descriptions:
>
> Transient resources:
>
> Currently, we create the whole list of resources, add them to the
> graph, and then iterate over them. Instead, we could essentially
> process them one at a time and then discard them. Our current method
> (loosely) is:
>
> file.eval_generate.each { |resource| add_to_catalog(resource) and
> eval_resource(resource) }
>
> Instead, it would become:
>
> file.eval_generate { |resource| eval_resource(resource) }
>
> Ridiculous pseudo-code, obviously, but hopefully you get the idea.
> The optimization here is that 1) we aren't adding to the catalog and
> 2) we aren't building a list at all.

If that's the low-hanging fruit, which would give the lowest resource
consumption, then I'm all for it for 0.25.
The new, more generic mechanism (i.e. #2) can still be implemented for 0.26
(or any other subsequent release).

> Set resources:
>
> We could have some kind of resource that didn't use instance variables
> for any of the values or comparisons, such that a single resource
> instance could be used to do all of the operations. The main thing is
> that it kicks out change instances for each of the things that needs
> to be done, with all of the appropriate information for logging and
> events.
>
> We're still paying a per-change overhead, but I don't think you can
> get away from that and retain transactional integrity.
>
> Set operations:
>
> This is pretty much just a big, painful chmod -R. This would be a bit
> more difficult because we'd have to skip any resources that are
> managed anywhere else.
>
> In writing this description, I think the second option is the best,
> even though I was leaning toward the first, initially. I think it's
> doable - the big limitation right now is the use of instance variables
> in files. If 'should' weren't set anywhere, then you'd be passing it
> in each time, and if you're passing it in, then you could use the same
> instance for every file you needed to operate on.

That's plain true. I vote for this!

> One of the big changes we did (but no one noticed) in the fall of '07
> was that we switched 'is' from being an instance variable to being
> transient, only maintained by the transaction. I've always wanted to
> switch 'should', too, but I haven't known how.
>
> If we were to combine this with my goal of splitting resource types
> into a Resource class (which already exists in 0.25) and a
> ResourceType class (whose instances will model individual resource
> types), then these resource types could be written to operate with no
> instance variables, essentially. This would, I think, enable the set
> resources pretty easily. (I suppose I should open this as a ticket,
> so people know wtf I'm talking about.)
>
> Well, 'easily' once you refactored the RAL entirely and broke backward
> compatibility.

Yes, that's the difficult task, I do agree.

Now, if we can move forward to at least have #1 in 0.25, that'd be über
great, because it'd solve one of the biggest performance (i.e. memory)
issues in puppetd while managing files (an issue I tried to overcome with
#1469 or the soon-to-be-resurrected path compression stuff).
People, myself included, find it natural to use recursive file resources on
large trees just to chmod/chown, and we all expect those to work as fast
as chmod/chown, so it is hard to imagine puppetd failing at this
simple task. And you can be sure that's almost the first thing a puppet
newcomer will use, so imagine her reaction on seeing the tool not succeed...

/me preparing the pathcomp patch for a new review.
--
Brice Figureau
http://www.masterzen.fr/

Luke Kanies

Apr 25, 2009, 8:24:29 PM

to puppe...@googlegroups.com

Ok.

I'm basically unwilling to add any new functionality to 0.25, unless
the code is already ready and largely trivial. I can't think of
anything that's so critical that it's worth delaying the release.

Plus, if we've got multiple ideas for performance gains, let's
separate them into different releases so people see multiple stages
instead of one big gain. :)

Like I said, I really don't want to try to fit this into 0.25. I
believe James is essentially ready to announce 0.25b1, with the only
major known issue being the lack of rack support. There are open
tickets, but none look too difficult to fix.


--
The hypothalamus is one of the most important parts of the brain,
involved in many kinds of motivation, among other functions. The
hypothalamus controls the "Four F's": 1. fighting; 2. fleeing;
3. feeding; and 4. mating.
-- Psychology professor in neuropsychology intro course
