Hi Georgi,
The catalog compilation time is how long it takes to compile the catalog
*on the master*. You can find it on CentOS with `grep Compile
/var/log/messages`. The amount of time it takes for your agent to run is
not at all tied to how long it takes to compile the catalog. Your
puppet agents are not talking to the puppet master once they have
received the catalog, except for file requests[1] and to submit a report.
If you are solving for long agent runs, check out the logs, which include
timing information.
By chance are you serving any large binary files with Puppet?
Actually, sometime last night something happened and puppet stopped processing requests altogether. Stopping and starting httpd fixed this, but this could be just some bug in one of the new versions of software I upgraded to. I'll keep monitoring.
On Friday, October 31, 2014 9:50:41 AM UTC-4, Georgi Todorov wrote:
> Actually, sometime last night something happened and puppet stopped processing requests altogether. Stopping and starting httpd fixed this, but this could be just some bug in one of the new versions of software I upgraded to. I'll keep monitoring.

So, unfortunately the issue is not fixed :(. For whatever reason, everything ran great for a day. Catalog compiles were taking around 7 seconds and client runs finished in about 20 seconds - happy days. Then overnight, the catalog compile times jumped to 20-30 seconds and client runs were taking 200+ seconds. A few hours later there were no more requests arriving at the puppet master at all. Is my HTTP server flaking out?

Running some --trace --evaltrace and strace, it looks like most of the time is spent stat-ing:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 83.01    5.743474           9    673606    612864 stat
  7.72    0.534393           7     72102     71510 lstat
  6.76    0.467930       77988         6           wait4

That's a pretty poor "hit" rate (only about 61k of those 674k stat calls succeed)...

I've increased the check time to 1 hour on all clients, and the master seems to be keeping up for now - catalog compile avg 8 seconds, client run avg 15 seconds, queue size 0.

Here is what a client run looks like when the server is keeping up:

Notice: Finished catalog run in 11.93 seconds
File: 1.72
Config retrieval: 13.35
And when the server is just about dead:
File: 128.94
Config retrieval: 26.80
Probably 500 of the "Resources" are autofs maps using https://github.com/pdxcat/puppet-module-autofs/commits/master
So there is definitely some bottleneck on the system; the problem is I can't figure out what it is. Is it disk I/O (iostat doesn't seem to think so), is it CPU (top looks fine), is it memory (ditto), is the httpd/Passenger combo not up to the task, is the postgres server not keeping up? There are so many components that it is hard for me to do a proper profile to find where the bottleneck is. Any ideas?

So far I've timed the ENC script that pulls the classes for a node - it takes less than 1 second. From messages, the catalog compile ranges from 7 seconds to 25 seconds (worst case, on the overloaded server).

Anyway, figured I'd share that; unfortunately ruby was not the issue. Back to poking around and testing.
By far the biggest difference is File retrieval time. This will be for File resources where you specify content via a 'source' parameter rather than via a 'content' property. The agent must make a separate request to the master for each such file, and those requests are collectively taking a long time. Most likely they are backing up behind a bottleneck, so that much of the time consumed per node is actually spent waiting for service.
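Roughly, the difference looks like this (the paths and module names below are purely illustrative, not taken from your manifests):

```puppet
# 'source': the file data is NOT in the catalog; the agent makes an extra
# request to the master's file server for every such resource.
file { '/etc/auto.data':
  ensure => file,
  source => 'puppet:///modules/autofs/auto.data',
}

# 'content': the data travels inside the compiled catalog, so applying the
# resource needs no additional round trip to the master.
file { '/etc/auto.apps':
  ensure  => file,
  content => template('autofs/auto.apps.erb'),
}
```

With several hundred 'source'd resources per node, those extra requests add up quickly.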
If the CPU is not overloaded and you have free physical RAM, then it seems to me that the system service (i.e. httpd) and the I/O subsystem are your remaining candidates for the locus of the issue. As you attempt to identify the bottleneck, do not ignore the total number of transactions serviced. That could be as important as -- or even more important than -- the volume of data exchanged.
On Tuesday, November 4, 2014 9:50:15 AM UTC-5, jcbollinger wrote:
> By far the biggest difference is File retrieval time. This will be for File resources where you specify content via a 'source' parameter rather than via a 'content' property. [...]

I agree, there is something odd about File retrieval. We are creating a total of 10 autofs maps, however when I run trace it looks like every separate entry creates a concat::fragment (puppetlabs-concat) which is later combined into a single file. I'm not too familiar with what concat::fragment does, but a quick scan of the code shows a lot of "source" references.
We might have to just go to template files, instead of using the autofs module. That might take care of all our issues.
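Something along these lines is what I have in mind (the map name and entries below are made up, not our real data):

```puppet
# Build the whole map from one ERB template so its contents ride along in
# the catalog instead of being fetched through separate file requests.
$auto_home_mounts = {
  'data' => 'nfs01:/export/data',
  'apps' => 'nfs01:/export/apps',
}

file { '/etc/auto.home':
  ensure  => file,
  content => template('site/auto.home.erb'),
  notify  => Service['autofs'],  # assumes an autofs service is declared elsewhere
}
```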
On Tuesday, November 4, 2014 9:32:32 AM UTC-6, Georgi Todorov wrote:
> I agree, there is something odd about File retrieval. We are creating a total of 10 autofs maps, however when I run trace it looks like every separate entry creates a concat::fragment (puppetlabs-concat) which is later combined into a single file. [...]
The Concat module provides a mechanism for managing a single file as a collection of separately-managed pieces, or 'fragments'. This has a number of advantages, mostly deriving from the ability to spread management of a single target file across multiple manifests, modules, or even separate nodes. As a practical matter, though, the implementation manages each fragment as a separate File, and combines them on the client.
Those fragment files, one per Concat::Fragment resource, can use either 'content' or 'source', as specified for that fragment. I find it a bit odd that an autofs module would choose the 'source' option (with a 'puppet://*' URL as its value), but if it's indeed doing so then perhaps that is generating large numbers of tiny, 'source'd files.
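In other words, the two options look roughly like this (the titles and target path here are invented for illustration):

```puppet
concat { '/etc/auto.home': }

# A 'content' fragment is delivered inside the catalog...
concat::fragment { 'auto.home data entry':
  target  => '/etc/auto.home',
  content => "data -rw nfs01:/export/data\n",
  order   => '10',
}

# ...whereas a 'source' fragment costs the agent one more file request.
concat::fragment { 'auto.home apps entry':
  target  => '/etc/auto.home',
  source  => 'puppet:///modules/autofs/auto.home.apps',
  order   => '20',
}
```

If the autofs module emits hundreds of 'source'd fragments per node, that alone could account for the File retrieval times you are seeing.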
On Nov 4, 2014 7:32 AM, "Georgi Todorov" <ter...@gmail.com> wrote:
> I agree, there is something odd about File retrieval. [...] We might have to just go to template files, instead of using the autofs module. That might take care of all our issues.
If you're generating the same map for a large subset of hosts, why not do that on the master and put the resulting file in a place the master's file service can serve it from?
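For example, something like this (the module name and paths are only placeholders) would leave each agent with a single file request per map:

```puppet
file { '/etc/auto.home':
  ensure => file,
  # auto.home is pre-generated on the master and dropped into this module's
  # files/ directory (or a custom fileserver mount point).
  source => 'puppet:///modules/generated_maps/auto.home',
  notify => Service['autofs'],
}
```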
Wil