File optimizations

Bostjan Skufca

Oct 22, 2012, 3:09:45 PM

to puppet...@googlegroups.com

Hi there,

I'm running into slow catalog runs because of many files that are managed. I was thinking about some optimizations of this functionality.

1: On puppetmaster:

For files with "source => 'puppet:///modules...'", puppetmaster should already calculate md5 and send it with the catalog.

2: On managed node:

As md5s for files are already there once catalog is received, there is no need for x https calls (x is the number of files managed with source=> parameter)

3. Puppetmaster md5 cache

This would of course put some strain on puppetmaster, which would then benefit from some sort of file md5 cache:

- when md5 is calculated, put it into cache, key is filename. Also add file mtime and time of cache insert.

- on each catalog request, for each file in the catalog check if mtime has changed, and if so, recalculate md5 hash, else just retrieve md5 hash from cache

- some sort of stale cache entries removal, based on cache insert time, maybe at the end of each puppet catalog compilation, maybe controlled with probability 1:100 or something
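The cache described in the three points above can be sketched roughly as follows. This is only an illustration in Python, not Puppet code: the class name `Md5Cache`, its parameters (`max_age_seconds`, `prune_probability`) and the `hash_count` counter are all hypothetical names made up for the sketch.

```python
import hashlib
import os
import random
import time


class Md5Cache:
    """Hypothetical in-memory cache: path -> (mtime, md5 hex digest, insert time)."""

    def __init__(self, max_age_seconds=3600.0, prune_probability=0.01):
        self.entries = {}
        self.max_age_seconds = max_age_seconds
        # Assumed stand-in for the "probability 1:100" idea from the post.
        self.prune_probability = prune_probability
        # Counts real md5 computations, only so the saving is observable.
        self.hash_count = 0

    def md5_for(self, path):
        """Return the md5 of path, rehashing only if its mtime has changed."""
        mtime = os.stat(path).st_mtime_ns
        cached = self.entries.get(path)
        if cached is not None and cached[0] == mtime:
            return cached[1]
        digest = self._hash_file(path)
        self.entries[path] = (mtime, digest, time.time())
        return digest

    def _hash_file(self, path):
        """Hash the file in chunks so large files are not read in one piece."""
        self.hash_count += 1
        md5 = hashlib.md5()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(65536), b""):
                md5.update(chunk)
        return md5.hexdigest()

    def prune(self, now=None):
        """Drop entries inserted more than max_age_seconds ago; return the count."""
        now = time.time() if now is None else now
        stale = [path for path, (_, _, inserted) in self.entries.items()
                 if now - inserted > self.max_age_seconds]
        for path in stale:
            del self.entries[path]
        return len(stale)

    def maybe_prune(self):
        """Prune with the configured probability, e.g. after each compilation."""
        if random.random() < self.prune_probability:
            return self.prune()
        return 0
```

One caveat with this approach: a file rewritten with different content but the same mtime would not be noticed, so comparing the size as well may be worth considering.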

Do you have any comments about these optimizations? They will be greatly appreciated... really :)

b.

Stephen Gran

Oct 22, 2012, 3:27:51 PM

to puppet...@googlegroups.com

Hi,

On Mon, 2012-10-22 at 12:09 -0700, Bostjan Skufca wrote:
> Hi there,
>
>
> I'm running into slow catalog runs because of many files that are
> managed. I was thinking about some optimizations of this
> functionality.

Your suggestions look reasonable to me, but I'm not a puppetlabs person,
so can't make an official comment.

Turn the question around for a moment: why do you have so many file
resources?

Cheers,
--
Stephen Gran
Senior Systems Integrator - guardian.co.uk

Nikola Petrov

Oct 22, 2012, 7:39:53 PM

to puppet...@googlegroups.com

Hi,

When using puppet, I found that it is a far better idea to serve files
with something else, such as sftp or ssh. My conclusion came from the
fact that we were trying to import a big dump (2GB seems OK to me, but
who knows) and puppet just died because it does not stream the file;
the file is *fully* loaded into memory.

Apart from that, these ideas seem reasonable, but I am not a committer.

Best, Nikola

Brice Figureau

Oct 22, 2012, 5:17:52 PM

to puppet...@googlegroups.com

The assertion that files are fully loaded into memory has not been true since at least 2.6.0.
Also, for big files you can activate http compression on the client,
this might help (or not, YMMV).
--
Brice Figureau
My Blog: http://www.masterzen.fr/

Brice Figureau

Oct 22, 2012, 5:27:41 PM

to puppet...@googlegroups.com

Hi,

For development questions, feel free to post in puppet-dev :)

You're not the first to be irritated by those md5 computations taking time.
That's something I've wanted to optimize for a loooong time.
That's simply quite difficult.

On 22/10/12 21:09, Bostjan Skufca wrote:
> Hi there,
>
> I'm running into slow catalog runs because of many files that are
> managed. I was thinking about some optimizations of this functionality.
>
> 1: On puppetmaster:
> For files with "source => 'puppet:///modules...'", puppetmaster should
> already calculate md5 and send it with the catalog.

That's what the static compiler does, if I'm not mistaken. The static
compiler is part of puppet since 2.7.

> 2: On managed node:
> As md5s for files are already there once catalog is received, there is
> no need for x https calls (x is the number of files managed with
> source=> parameter)
>
> 3. Puppetmaster md5 cache
> This would of course put some strain on puppetmaster, which would then
> benefit from some sort of file md5 cache:
> - when md5 is calculated, put it into cache, key is filename. Also add
> file mtime and time of cache insert.
> - on each catalog request, for each file in the catalog check if mtime
> has changed, and if so, recalculate md5 hash, else just retrieve md5
> hash from cache
> - some sort of stale cache entries removal, based on cache insert time,
> maybe at the end of each puppet catalog compilation, maybe controlled
> with probability 1:100 or something

Actually, checking the mtime/size prior to doing any md5 computation
could be a big win.

But that's not all: in fact, there are 3 md5 computations per file
taking place during a puppet run:
* one by the master when computing file metadata
* one by the agent on the existing file (this helps to know if the file
changed)
* and finally one after writing the change to the file, to make sure we
wrote it correctly.

A potential solution would be to implement a different checksum type
(maybe less powerful than an md5, but faster).
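To give an idea of what a "less powerful but faster" checksum could look like, here is a rough Python sketch that hashes only the first bytes of a file. As far as I know, this is the same trade-off the `md5lite` value of the file type's `checksum` parameter makes, but the function names and the 512-byte default below are assumptions for illustration, not Puppet's implementation.

```python
import hashlib

CHUNK_SIZE = 65536


def full_md5(path):
    """md5 over the whole file, read in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(CHUNK_SIZE), b""):
            md5.update(chunk)
    return md5.hexdigest()


def lite_md5(path, prefix_bytes=512):
    """md5 over only the first prefix_bytes of the file (hypothetical helper).

    Much cheaper on big files, but blind to any change after the prefix.
    """
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read(prefix_bytes)).hexdigest()
```

The cost is obvious: two files that differ only after the first 512 bytes get the same "lite" checksum, so this only suits files where a change is likely to show up near the start.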

> Do you have any comments about these optimizations? They will be greatly
> appreciated... really :)

Well, I believe we're (at least I am) very aware of those issues. The
reason it never got fixed (except by the static compiler) is that
it's complex stuff. Last time I tried to fiddle with the checksumming,
I never quite got anywhere :)

As I said in the preamble, feel free to chime in on puppet-dev to talk
about this, and check the various redmine tickets regarding those issues.

Nikola Petrov

Oct 22, 2012, 8:37:21 PM

to puppet...@googlegroups.com

Nice to hear that. It is true that the version with which I tested this
was somewhat old.

Best, Nikola

Bostjan Skufca

Oct 22, 2012, 6:20:46 PM

to puppet...@googlegroups.com, brice-...@daysofwonder.com

Inline

On Monday, 22 October 2012 23:28:03 UTC+2, Brice Figureau wrote:

> Hi,
>
> For development questions, feel free to post in puppet-dev :)
>
> You're not the first to be irritated by those md5 computations taking time.
> That's something I've wanted to optimize for a loooong time.
> That's simply quite difficult.

Damn it, two Google Groups tabs open, for this topic it was intended to go to puppet-dev :/

> 1: On puppetmaster:
> For files with "source => 'puppet:///modules...'", puppetmaster should
> already calculate md5 and send it with the catalog.

> That's what the static compiler does, if I'm not mistaken. The static
> compiler is part of puppet since 2.7.

Documentation is scarce at best. From what I could find, it maintains another copy of each file in the bucket, which is not the best option for my use case (though if someone wants an even speedier approach, this is the way to go).

b.

Bostjan Skufca

unread,

Oct 22, 2012, 6:35:20 PM10/22/12

to puppet...@googlegroups.com, stephe...@guardian.co.uk

Hi Stephen,

On Monday, 22 October 2012 21:28:23 UTC+2, Stephen Gran wrote:

> Turn the question around for a moment: why do you have so many file
> resources?

These systems are puppet-controlled from /etc/inittab through the whole boot process, including each and every service startup file definition, along with their packages, which are compiled (unfortunately, this requires at least one file resource: the install script), and the service configuration files.

Why? Let's put it this way: if you have a cluster of redundant machines, you can do a rolling upgrade to newer OSes etc. If not, then uptime must not be disturbed, and this is the only way we can run recent/fresh software on quite old distributions (install a bare! distro, no libs, then compile everything in a controlled manner).

I hope I have given you a decent answer, because Borat tends to disagree with me:)

http://twitter.com/DEVOPS_BORAT/status/209720453881798656

b.
