puppetmaster 100%cpu usage on 2.6 (not on 0.24)


Chris

Dec 14, 2010, 3:24:55 AM
to Puppet Users
Hi

I recently upgraded my puppet masters (and clients) from 0.24.8 to
2.6.4

Previously, my busiest puppet master would hover around a load average
of about 0.9; after the upgrade, its load hovers around 5.

I am running Passenger and MySQL-based stored configs.

Checking my running processes, ruby (puppetmasterd) shoots up to 99%
cpu load and stays there for a few seconds before dropping again.
Often there are 4 of these running simultaneously, pegging each core
at 99% cpu.

It seems that there has been a serious performance regression between
0.24 and 2.6 for my configuration.

I hope the following can help work out where...

I ran puppetmasterd through a profiler to find the root cause of this
(http://boojum.homelinux.org/profile.svg). The main problem appears
to be in /usr/lib/ruby/site_ruby/1.8/puppet/parser/ast/resource.rb, in
the evaluate function.
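
(The profiler used above isn't named; as an illustrative sketch only, a comparable profile can be produced with the ruby-prof gem by wrapping a suspect code path -- this is an assumption about tooling, not the setup actually used:)
---
require 'rubygems'
require 'ruby-prof'   # assumes the ruby-prof gem is installed

RubyProf.start
# ... run the code under suspicion, e.g. a single catalog compilation ...
result = RubyProf.stop

# print a flat, time-sorted profile to stdout
RubyProf::FlatPrinter.new(result).print(STDOUT)
---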

I added a few timing commands around various sections of that function
to get a breakdown of the time spent inside it; the two most intensive
calls are
---
paramobjects = parameters.collect { |param|
  param.safeevaluate(scope)
}
---

and
---
resource_titles.flatten.collect { |resource_title|
  exceptwrap :type => Puppet::ParseError do
    resource = Puppet::Parser::Resource.new(
      fully_qualified_type, resource_title,
      :parameters => paramobjects,
      :file => self.file,
      :line => self.line,
      :exported => self.exported,
      :virtual => virt,
      :source => scope.source,
      :scope => scope,
      :strict => true
    )

    if resource.resource_type.is_a? Puppet::Resource::Type
      resource.resource_type.instantiate_resource(scope, resource)
    end
    scope.compiler.add_resource(scope, resource)
    scope.compiler.evaluate_classes([resource_title], scope, false) if fully_qualified_type == 'class'
    resource
  end
}.reject { |resource| resource.nil? }
---
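
(A simplified, self-contained sketch of this kind of per-section timing, using Ruby's standard Benchmark module; the section body below is just a stand-in, not the actual Puppet code:)
---
require 'benchmark'

# Wrap a suspect section in Benchmark.realtime and report how long it took.
# Inside the master the result could go to Puppet.notice instead of stderr.
elapsed = Benchmark.realtime do
  # section under test -- here just a stand-in CPU-bound loop
  100_000.times { |i| Math.sqrt(i) }
end

$stderr.puts "section took %.4f seconds" % elapsed
---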


Unfortunately, that is about the limit of my current ruby skills.
What else can be looked at to speed 2.6 back up to the performance of
0.24?



Ken Barber

Dec 14, 2010, 6:40:22 PM
to puppet...@googlegroups.com
Hi Chris,

Sorry - I can't say I'm seeing this performance issue myself with my setup :-(.

I'm not an expert on that part of the code. Having said that, it's probably DSL parsing related (possibly a recursion somewhere) ... I'd be focusing on your content, not just the Ruby, to see what part of the Puppet DSL is causing it. Strip your content right back and add bits back in slowly. I think this would make your report very useful if it turns out to be a bug, and perhaps you can find a workaround that way as well.

That's just my 2c. Good luck :-).

ken.

Nigel Kersten

Dec 14, 2010, 7:48:45 PM
to puppet...@googlegroups.com
On Tue, Dec 14, 2010 at 12:24 AM, Chris <iwouldrather...@gmail.com> wrote:
Hi

I recently upgraded my puppet masters (and clients) from 0.24.8 to
2.6.4

Previously, my most busy puppet master would hover around about 0.9
load  average, after the upgrade, its load hovers around 5

I am running passenger and mysql based stored configs.

Checking my running processes, ruby (puppetmasterd) shoots up to 99%
cpu load and stays there for a few seconds before dropping again.
Often there are 4 of these running simultaneously, pegging each core
at 99% cpu.

It seems that there has been a serious performance regression between
0.24 and 2.6 for my configuration

Some useful info would be:

OS
OS version
Ruby version
Apache version/worker model
Passenger version







--
Nigel Kersten - Puppet Labs -  http://www.puppetlabs.com


Chris

Dec 15, 2010, 2:10:27 AM
to Puppet Users
>
> Some useful info would be:
>
> OS
> OS version
> Ruby version
> Apache version/worker model
> Passenger version
>
CentOS 5.2
ruby-1.8.5-5.el5_3.7
httpd-2.2.3-31.el5.centos.2
rubygem-passenger-2.2.11-2el5.ecn
rubygem-rails-2.1.1-2.el5
rubygem-rack-1.1.0-1el5

Brice Figureau

Dec 15, 2010, 5:42:13 AM
to puppet...@googlegroups.com
On Tue, 2010-12-14 at 00:24 -0800, Chris wrote:
> Hi
>
> I recently upgraded my puppet masters (and clients) from 0.24.8 to
> 2.6.4
>
> Previously, my most busy puppet master would hover around about 0.9
> load average, after the upgrade, its load hovers around 5
>
> I am running passenger and mysql based stored configs.
>
> Checking my running processes, ruby (puppetmasterd) shoots up to 99%
> cpu load and stays there for a few seconds before dropping again.
> Often there are 4 of these running simultaneously, pegging each core
> at 99% cpu.

I would say it is perfectly normal. Compiling the catalog is a hard and
complex problem and requires CPU.

The difference between 0.24.8 and 2.6 (or 0.25, for that matter) is that
some performance issues have been fixed. Those issues made the master
more I/O bound under 0.24; it is now mostly CPU bound in later versions.

Now compare the compilation time under 0.24.8 and 2.6 and you should see
that it has dropped drastically (allowing more compilations to fit in the
same amount of time). The other side of the coin is that your master now
shows transient high CPU usage.

I don't really get what the issue is with using 100% of the CPU.

You're paying about the same price when your CPU is used and when it's
idle, so that shouldn't make a difference :)

If that's an issue, reduce the concurrency of your setup (run fewer
compilations in parallel, implement splay time, etc.).

> It seems that there has been a serious performance regression between
> 0.24 and 2.6 for my configuration

I think it's the reverse that happened.

Yes, this is what the compiler is doing during compilation: evaluating
resources and parameters. The more resources you use, the more time and
CPU the compilation will take.
--
Brice Figureau
Follow the latest Puppet Community evolutions on www.planetpuppet.org!

Chris

Dec 15, 2010, 8:28:40 AM
to Puppet Users


On Dec 15, 12:42 pm, Brice Figureau <brice-pup...@daysofwonder.com>
wrote:
> On Tue, 2010-12-14 at 00:24 -0800, Chris wrote:
> > Hi
>
> > I recently upgraded my puppet masters (and clients) from 0.24.8 to
> > 2.6.4
>
> > Previously, my most busy puppet master would hover around about 0.9
> > load  average, after the upgrade, its load hovers around 5
>
> > I am running passenger and mysql based stored configs.
>
> > Checking my running processes, ruby (puppetmasterd) shoots up to 99%
> > cpu load and stays there for a few seconds before dropping again.
> > Often there are 4 of these running simultaneously, pegging each core
> > at 99% cpu.
>
> I would say it is perfectly normal. Compiling the catalog is a hard and
> complex problem and requires CPU.
>
> The difference between 0.24.8 and 2.6 (or 0.25 for what matters) is that
> some performance issues have been fixed. Those issues made the master be
> more I/O bound under 0.24, but now mostly CPU bound in later versions.

If we were talking about only CPU usage, I would agree with you. But
in this case, the load average of the machine has gone up over 5x.
Since a high load average indicates processes not getting enough
runtime, this is an indication to me that 2.6 is performing worse than
0.24: previously, on average, all processes got enough runtime and did
not have to wait for system resources; now processes are sitting in the
run queue, waiting for a chance to run.

>
> I don't really get what is the issue about using 100% of CPU?
That's not the issue, just an indication of what is causing it.

>
> You're paying about the same price when your CPU is used and when it's
> idle, so that shouldn't make a difference :)
Generally true, but this is on a VM which is also running some of my
RADIUS and proxy instances, amongst others.

>
> If that's an issue, reduce the concurrency of your setup (run less
> compilation in parallel, implement splay time, etc...).
splay has been enabled since 0.24

My apache maxclients is set to 15 to limit concurrency.

Trevor Vaughan

Dec 15, 2010, 9:12:16 AM
to puppet...@googlegroups.com
What is your CPU to Puppetmaster instance ratio?

I've had decent luck with 1 CPU to 2 PM, but not much above that.

If you need dedicated resources for other tasks, you may want to
ensure that you don't have as many masters spawning as you have
processors.

Trevor


--
Trevor Vaughan
Vice President, Onyx Point, Inc
(410) 541-6699
tvau...@onyxpoint.com

-- This account not approved for unencrypted proprietary information --

Brice Figureau

Dec 15, 2010, 12:15:15 PM
to puppet...@googlegroups.com

Load is not necessarily an indication of an issue. It can also mean some
tasks are waiting for I/O, not only the CPU.
The only real issue under load is if the service time goes beyond an
admissible value; otherwise you can't say whether it's bad or not.
If you see some hosts reporting timeouts, then that's an indication that
service time is not good :)

BTW, do you run your mysql storedconfig instance on the same server?
You can activate thin_storeconfigs to reduce the load on the mysql db.

> >
> > I don't really get what is the issue about using 100% of CPU?
> Thats not the issue, just an indication of what is causing it
>
> >
> > You're paying about the same price when your CPU is used and when it's
> > idle, so that shouldn't make a difference :)
> Generally true, but this is a on VM which is also running some of my
> radius and proxy instances, amongst others.
>
> >
> > If that's an issue, reduce the concurrency of your setup (run less
> > compilation in parallel, implement splay time, etc...).
> splay has been enabled since 0.24
>
> My apache maxclients is set to 15 to limit concurrency.

I think this is too many unless you have 8 cores. As Trevor said in
another e-mail in this thread, 2 PM per core is best.

Now it all depends on your number of nodes and sleep time. I suggest you
use ext/puppet-load to find your setup's real concurrency.

Ashley Penney

Dec 15, 2010, 1:27:29 PM
to puppet...@googlegroups.com
This issue is definitely a problem.  I have a support ticket in with Puppet Labs about the same thing.  My CPU remains at 100% almost constantly and it slows things down significantly.  If you strace it you can see that very little appears to be going on.  This is absolutely not normal behavior.  Even when I had 1 client checking in I had all cores fully used.


Disconnect

Dec 15, 2010, 1:35:11 PM
to puppet...@googlegroups.com
"me too". All the logs show nice quick compilations but the actual wall clock to get anything done is HUGE.

Dec 15 13:10:29 puppet puppet-master[31406]: Compiled catalog for puppet.foo.com in environment production in 21.52 seconds
Dec 15 13:10:51 puppet puppet-agent[8251]: Caching catalog for puppet.foo.com

That was almost 30 minutes ago. Since then, it has sat there doing nothing...
$ sudo strace -p 8251
Process 8251 attached - interrupt to quit
select(7, [6], [], [], {866, 578560}

lsof shows:
puppetd 8251 root    6u  IPv4           11016045      0t0      TCP puppet.foo.com:33065->puppet.foo.com:8140 (ESTABLISHED)

Disconnect

Dec 15, 2010, 1:38:40 PM
to puppet...@googlegroups.com
One addendum - the box is absolutely not io or cpu bound:
Cpu(s): 83.0%us, 13.1%sy,  0.0%ni,  2.5%id,  0.0%wa,  0.1%hi,  1.3%si,  0.0%st

(64bit kvm vm w/ 6 3.5ghz amd64 cpus, on an lvm partition - raw disk - with 5G ram, but only using 3 gigs. PLENTY of power, and monitoring supports that..)

Brice Figureau

Dec 15, 2010, 2:14:07 PM
to puppet...@googlegroups.com
On 15/12/10 19:35, Disconnect wrote:
> "me too". All the logs show nice quick compilations but the actual wall
> clock to get anything done is HUGE.
>
> Dec 15 13:10:29 puppet puppet-master[31406]: Compiled catalog for
> puppet.foo.com in environment production in
> 21.52 seconds

This looks long.

> Dec 15 13:10:51 puppet puppet-agent[8251]: Caching catalog for

> puppet.foo.com


>
> That was almost 30 minutes ago. Since then, it has sat there doing
> nothing...
> $ sudo strace -p 8251
> Process 8251 attached - interrupt to quit
> select(7, [6], [], [], {866, 578560}
>
> lsof shows:
> puppetd 8251 root 6u IPv4 11016045 0t0 TCP

> puppet.foo.com:33065->puppet.foo.com:8140
> (ESTABLISHED)

Note: we were talking about the puppet master taking 100% CPU, but
you're apparently looking at the puppet agent, which is a different story.

--
Brice Figureau
My Blog: http://www.masterzen.fr/

Brice Figureau

Dec 15, 2010, 2:15:24 PM
to puppet...@googlegroups.com
On 15/12/10 19:27, Ashley Penney wrote:
> This issue is definitely a problem. I have a support ticket in with
> Puppet Labs about the same thing. My CPU remains at 100% almost
> constantly and it slows things down significantly. If you strace it you
> can see that very little appears to be going on. This is absolutely not
> normal behavior. Even when I had 1 client checking in I had all cores
> fully used.

I do agree that it's not the correct behavior. I suggest you strace it
or use other Ruby introspection techniques to find what part of the
master is taking the CPU.
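
(One illustrative sketch of such a technique -- not from any Puppet documentation -- is a signal handler that dumps the current Ruby backtrace of the master process on demand, e.g. via "kill -USR2 <pid>"; the log path is arbitrary:)
---
# Dumps the backtrace of the thread that handles the signal (the main
# thread on MRI 1.8) whenever the process receives SIGUSR2.
Signal.trap("USR2") do
  File.open("/tmp/master-backtrace.#{Process.pid}", "a") do |f|
    f.puts "=== #{Time.now} ==="
    f.puts caller.join("\n")
  end
end
---
Sending the signal a few times while the CPU is pegged gives a crude sampling of where the master is spending its time.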

Disconnect

Dec 15, 2010, 2:24:24 PM
to puppet...@googlegroups.com
On Wed, Dec 15, 2010 at 2:14 PM, Brice Figureau <brice-...@daysofwonder.com> wrote:
>
> Note: we were talking about the puppet master taking 100% CPU, but
> you're apparently looking to the puppet agent, which is a different story.
>

The agent isn't taking CPU; it is hanging, waiting for the master to do anything. (The run I quoted earlier eventually ended with a timeout.) The master has pegged the CPUs, and it seems to be related to file resources:

$ ps auxw|grep master
puppet   31392 74.4  4.7 361720 244348 ?       R    10:42 162:06 Rack: /usr/share/puppet/rack/puppetmasterd                                                                                                   
puppet   31396 70.0  4.9 369524 250200 ?       R    10:42 152:32 Rack: /usr/share/puppet/rack/puppetmasterd                                                                                                   
puppet   31398 66.2  3.9 318828 199472 ?       R    10:42 144:10 Rack: /usr/share/puppet/rack/puppetmasterd                                                                                                   
puppet   31400 66.6  4.9 369992 250588 ?       R    10:42 145:04 Rack: /usr/share/puppet/rack/puppetmasterd                                                                                                   
puppet   31406 68.6  3.9 318292 200992 ?       R    10:42 149:31 Rack: /usr/share/puppet/rack/puppetmasterd                                                                                                   
puppet   31414 67.0  2.4 243800 124476 ?       R    10:42 146:00 Rack: /usr/share/puppet/rack/puppetmasterd                                              

Dec 15 13:42:23 puppet puppet-master[31406]: Compiled catalog for puppet.foo.com in environment production in 30.83 seconds
Dec 15 13:42:49 puppet puppet-agent[10515]: Caching catalog for puppet.foo.com
Dec 15 14:00:18 puppet puppet-agent[10515]: Applying configuration version '1292438512'
...
Dec 15 14:14:56 puppet puppet-agent[10515]: Finished catalog run in 882.43 seconds
Changes:
            Total: 6
Events:
          Success: 6
            Total: 6
Resources:
          Changed: 6
      Out of sync: 6
            Total: 287
Time:
   Config retrieval: 72.20
             Cron: 0.05
             Exec: 32.42
             File: 752.33
       Filebucket: 0.00
            Mount: 0.98
          Package: 6.13
         Schedule: 0.02
          Service: 9.09
   Ssh authorized key: 0.07
           Sysctl: 0.00

real    34m56.066s
user    1m6.030s
sys    0m26.590s


Brice Figureau

Dec 15, 2010, 2:43:24 PM
to puppet...@googlegroups.com
On 15/12/10 20:24, Disconnect wrote:
> On Wed, Dec 15, 2010 at 2:14 PM, Brice Figureau
> <brice-...@daysofwonder.com>

> wrote:
>>
>> Note: we were talking about the puppet master taking 100% CPU, but
>> you're apparently looking to the puppet agent, which is a different story.
>>
>
> The agent isn't taking cpu, it is hanging waiting for the master to do
> anything. (The run I quoted earlier eventually ended with a timeout..)
> The master has pegged the cpus, and it seems to be related to file
> resources:

Oh, I see.

> $ ps auxw|grep master
> puppet 31392 74.4 4.7 361720 244348 ? R 10:42 162:06 Rack:
> /usr/share/puppet/rack/puppetmasterd
>
> puppet 31396 70.0 4.9 369524 250200 ? R 10:42 152:32 Rack:
> /usr/share/puppet/rack/puppetmasterd
>
> puppet 31398 66.2 3.9 318828 199472 ? R 10:42 144:10 Rack:
> /usr/share/puppet/rack/puppetmasterd
>
> puppet 31400 66.6 4.9 369992 250588 ? R 10:42 145:04 Rack:
> /usr/share/puppet/rack/puppetmasterd
>
> puppet 31406 68.6 3.9 318292 200992 ? R 10:42 149:31 Rack:
> /usr/share/puppet/rack/puppetmasterd
>
> puppet 31414 67.0 2.4 243800 124476 ? R 10:42 146:00 Rack:
> /usr/share/puppet/rack/puppetmasterd

Note that they're all running. That means there is none left to serve
file content if they are all busy for several seconds (in our case
around 20) while compiling catalogs.

> Dec 15 13:42:23 puppet puppet-master[31406]: Compiled catalog for

> puppet.foo.com in environment production in


> 30.83 seconds
> Dec 15 13:42:49 puppet puppet-agent[10515]: Caching catalog for

> puppet.foo.com


> Dec 15 14:00:18 puppet puppet-agent[10515]: Applying configuration
> version '1292438512'
> ...
> Dec 15 14:14:56 puppet puppet-agent[10515]: Finished catalog run in
> 882.43 seconds
> Changes:
> Total: 6
> Events:
> Success: 6
> Total: 6
> Resources:
> Changed: 6
> Out of sync: 6
> Total: 287

That's not a big number.

> Time:
> Config retrieval: 72.20

This is also suspect.

> Cron: 0.05
> Exec: 32.42
> File: 752.33

Indeed.

> Filebucket: 0.00
> Mount: 0.98
> Package: 6.13
> Schedule: 0.02
> Service: 9.09
> Ssh authorized key: 0.07
> Sysctl: 0.00
>
> real 34m56.066s
> user 1m6.030s
> sys 0m26.590s
>

That just means your masters are so busy serving catalogs that they
barely have the time to serve files. One possibility is to use file
content offloading (see one of my blog posts about this:
http://www.masterzen.fr/2010/03/21/more-puppet-offloading/).

How many nodes are you compiling at the same time? Apparently you have 6
master processes running at high CPU usage.

As I said earlier, I really advise people to try puppet-load (which can
be found in the ext/ directory of the source tarball since puppet 2.6)
to exercise load against a master. This will help you find your actual
concurrency.

But, if it's a bug, then could this be an issue with passenger?

Disconnect

Dec 15, 2010, 3:10:39 PM
to puppet...@googlegroups.com
As a datapoint, this exact config (with mongrel_cluster) was working great under 0.25.x. With fewer, slower cpus, slower storage (vm image files) and 2G of ram...

I gave puppet-load a try, but it is throwing errors that I don't have time to dig into today:
debug: reading facts from: puppet.foo.com.yaml
/var/lib/gems/1.8/gems/em-http-request-0.2.15/lib/em-http/request.rb:72:in `send_request': uninitialized constant EventMachine::ConnectionError (NameError)
    from /var/lib/gems/1.8/gems/em-http-request-0.2.15/lib/em-http/request.rb:59:in `setup_request'
    from /var/lib/gems/1.8/gems/em-http-request-0.2.15/lib/em-http/request.rb:49:in `get'
    from ./puppet-load.rb:272:in `spawn_request'
    from ./puppet-load.rb:334:in `spawn'

Running about 250 nodes, every 30 minutes.

Brice Figureau

Dec 15, 2010, 4:45:08 PM
to puppet...@googlegroups.com
On 15/12/10 21:10, Disconnect wrote:
> As a datapoint, this exact config (with mongrel_cluster) was working
> great under 0.25.x. With fewer, slower cpus, slower storage (vm image
> files) and 2G of ram...

So I ask it again: could it be a problem with passenger more than an
issue with puppet itself?

It would really be interesting to use some ruby introspection[1] to find
exactly where the cpu is spent in those masters.

Could it be that with passenger it reparses everything instead of just
compiling? (I simply don't know, just throwing out some ideas.)

I myself use nginx + mongrel, but have only a dozen nodes, so I don't
really qualify.

> I gave puppet-load a try, but it is throwing errors that I don't have
> time to dig into today:
> debug: reading facts from: puppet.foo.com.yaml
> /var/lib/gems/1.8/gems/em-http-request-0.2.15/lib/em-http/request.rb:72:in
> `send_request': uninitialized constant EventMachine::ConnectionError
> (NameError)
> from
> /var/lib/gems/1.8/gems/em-http-request-0.2.15/lib/em-http/request.rb:59:in
> `setup_request'
> from
> /var/lib/gems/1.8/gems/em-http-request-0.2.15/lib/em-http/request.rb:49:in
> `get'
> from ./puppet-load.rb:272:in `spawn_request'
> from ./puppet-load.rb:334:in `spawn'

Could it be that you're missing EventMachine?

> Running about 250 nodes, every 30 minutes.

Did you try to use mongrel?
Do you use splay time?

Just some math (which might be totally wrong), to give an idea of how I
think we can compute our optimal scaling case:
So with 250 nodes and a sleep time of 30 minutes, we need to fit 250
compiles into every 30-minute span. If we assume a concurrency of 2 and
all nodes evenly spaced (in time), that means each process must compile
125 nodes in 30 minutes. If each compilation takes about 10s, that means
it'll take 1250s, which is about 20 minutes, so you have some room for
growth :)
Now during those 20min your 2 master processes will consume 100% CPU.
Since we're consuming the CPU for only 66% of the 30-minute span,
you'll globally consume 66% of all your available CPU...
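
(The same back-of-the-envelope calculation as a small Ruby sketch, using the numbers from this example; purely illustrative:)
---
# Can the given number of master processes compile catalogs for all
# agents within one run interval?
nodes        = 250       # agents checking in
interval     = 30 * 60   # run interval, in seconds
compile_time = 10        # seconds per catalog compilation
concurrency  = 2         # catalogs compiled in parallel

busy = nodes * compile_time / concurrency.to_f
puts "%.0f s of compilation needed per %d s interval" % [busy, interval]
puts(busy <= interval ? "fits, with some headroom" : "does not fit: compilations will pile up")
---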

Hope that helps,

[1]: http://projects.puppetlabs.com/projects/1/wiki/Puppet_Introspection

Ashley Penney

Dec 15, 2010, 7:47:19 PM
to puppet...@googlegroups.com
Just to reply to this - like I said earlier I can get this problem with 1 node checking in against puppetmaster.  All the puppetmasterd processes use maximum CPU.  It's not a scaling issue considering serving one node is certainly not going to max out a newish physical server.

On Wed, Dec 15, 2010 at 4:45 PM, Brice Figureau <brice-...@daysofwonder.com> wrote:

Just some math (which might be totally wrong), to give an idea of how I
think we can compute our optimal scaling case:
So with 250 nodes and a sleep time of 30 minutes, we need to overcome
250 compiles in every 30 minute time spans. If we assume a concurrency
of 2 and all nodes evenly spaced (in time), that means we must compile
125 nodes in 30 minutes. If each compilation takes about 10s, then that
means it'll take 1250s, which means 20 minutes so you have some room for
growth :)
Now during those 20min your 2 master processes will consume 100% CPU.
Since we're consuming the CPU for only 66% of the 30 minute span,
you'll globally consume 66% of all your CPU available...

Hope that helps,

[1]: http://projects.puppetlabs.com/projects/1/wiki/Puppet_Introspection
--
Brice Figureau
My Blog: http://www.masterzen.fr/

Nigel Kersten

Dec 15, 2010, 8:25:13 PM
to puppet...@googlegroups.com
On Wed, Dec 15, 2010 at 4:47 PM, Ashley Penney <ape...@gmail.com> wrote:
Just to reply to this - like I said earlier I can get this problem with 1 node checking in against puppetmaster.  All the puppetmasterd processes use maximum CPU.  It's not a scaling issue considering serving one node is certainly not going to max out a newish physical server.

That is definitely a problem.

Does this happen as soon as a node checks in? Or as soon as you start the passenger processes?

Can you post a sanitized strace somewhere?

Brice Figureau

Dec 16, 2010, 4:25:13 AM
to puppet...@googlegroups.com
On Wed, 2010-12-15 at 19:47 -0500, Ashley Penney wrote:
> Just to reply to this - like I said earlier I can get this problem
> with 1 node checking in against puppetmaster. All the puppetmasterd
> processes use maximum CPU. It's not a scaling issue considering
> serving one node is certainly not going to max out a newish physical
> server.

This looks like a bug to me.

Do your manifests use many file sources?
And/or recursive file resources?
It's possible that those masters are spending their time checksumming
files.
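
(To get a rough feel for that cost, here is a small, purely illustrative Ruby sketch that times MD5 checksumming of a directory tree -- roughly the kind of work a big recursive file resource implies on the master; the path is a placeholder:)
---
require 'find'
require 'digest/md5'
require 'benchmark'

# Time how long it takes to MD5 every file under a directory.
dir = ARGV[0] || '/etc/puppet/files'   # placeholder path, adjust as needed
count = 0

elapsed = Benchmark.realtime do
  Find.find(dir) do |path|
    next unless File.file?(path)
    Digest::MD5.hexdigest(File.read(path))
    count += 1
  end
end

puts "checksummed #{count} files under #{dir} in %.2f s" % elapsed
---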

Like I said earlier in the thread, the only real way to know is to use
Puppet introspection:
http://projects.puppetlabs.com/projects/1/wiki/Puppet_Introspection

--
Brice Figureau

Leonid Batizhevsky

Dec 16, 2010, 7:49:08 AM
to puppet...@googlegroups.com
I have the same issue when running puppetmaster and puppetd on the same
host. When I updated Ruby to 1.8.7 Enterprise, it resolved the problem
for me.
Leonid S. Batizhevsky

Nigel Kersten

Dec 16, 2010, 7:03:38 PM
to puppet...@googlegroups.com
On Thu, Dec 16, 2010 at 4:49 AM, Leonid Batizhevsky
<the.l...@gmail.com> wrote:
> I have  same issuer with running puppentmaster and puppted in same
> host. When I updated ruby to 1.8.7 enterprise It resolve problem for
> me.
> Leonid S. Batizhevsky

For the sake of the archives, what version did you upgrade *from*, Leonid?


Leonid Batizhevsky

Dec 17, 2010, 11:27:37 AM
to puppet...@googlegroups.com
Ruby or Puppet?
I started with 0.25.x (from the EPEL repo, and not long ago) and Ruby 1.8.5.
Then I updated to 2.6.0 and saw memory problems.
I started to google and found:
http://projects.puppetlabs.com/projects/1/wiki/Puppet_Red_Hat_Centos
"The 1.8.5 branch of Ruby shipped with RHEL5 can exhibit memory leaks."
I upgraded to Ruby Enterprise 1.8.7 and it solved my problems!

Leonid S. Batizhevsky

Ashley Penney

Dec 17, 2010, 11:39:20 AM
to puppet...@googlegroups.com
As a datapoint, I experience this problem on RHEL6:

ruby-1.8.7.299-4.el6.x86_64

Gems:

passenger (3.0.0)
rack (1.2.1)
rack-mount (0.6.13)
rack-test (0.5.6)
rails (3.0.3)

Leonid Batizhevsky

Jan 8, 2011, 5:51:43 PM
to puppet...@googlegroups.com
No, I have not; maybe try playing with the passenger worker time-to-live?
Leonid S. Batizhevsky

Micah Anderson

Jan 25, 2011, 5:11:10 PM
to puppet...@googlegroups.com
Brice Figureau <brice-...@daysofwonder.com> writes:

I'm having a similar problem with 2.6.3. At this point, I can't get
reliable puppet runs, and I'm not sure what to do.

What seems to happen is that things are working fine at the
beginning. Catalog compiles peg the CPU for the puppet process that is
doing them and take anywhere from 20 to 75 seconds. Then things get
drastically worse after 4 compiles (note: I have four mongrels too,
coincidence?): catalog compiles shoot up to 115, 165, 209, 268, 273,
341, 418, 546, 692, 774, 822, then 1149 seconds... then things are
really hosed. Sometimes hosts will fail outright and complain about
weird things, like:

Jan 25 14:04:34 puppetmaster puppet-master[30294]: Host is missing hostname and/or domain: gull.example.com
Jan 25 14:04:55 puppetmaster puppet-master[30294]: Failed to parse template site-apt/local.list: Could not find value for 'lsbdistcodename' at /etc/puppet/modules/site-apt/manifests/init.pp:4 on node gull.example.com

All four of my mongrels are constantly pegged, doing 40-50% of the CPU
each, occupying all available CPUs. They never settle down. I've got 74
nodes checking in now; it doesn't seem like that many, but perhaps
I've reached a tipping point with my puppetmaster (it's a dual 1GHz
machine with 2GB of RAM)?

I've tried a large number of different things to attempt to work around
this:


0. reduced my node check-in times to be once an hour (and splayed
randomly)

1. turn on puppetqd/stomp queuing

This didn't seem to make a difference, its off now

2. turn on thin stored configs

This sort of helped a little, but not enough

3. tried to upgrade rails from 2.3.5 (the debian version) to 2.3.10

I didn't see any appreciable difference here. I ended up going back to
2.3.5 because that was the packaged version.

4. tried to offload file content via nginx[1]

This maybe helped a little, but its clear that the problem isn't the
fileserving, it seems to be something in the catalog compilation.

5. tried to cache catalogs through adding a http front-end cache and
expiring that cache when manifests are updated[1]

I'm not sure this works at all.

6. set 'fair' queuing in my nginx.conf[3]

This seemed to help for a few days, but then things got bad again.

7. set --http_compression

I'm not sure if this actually hurts the master or not (because it has
to now occupy the CPU compressing catalogs?)

8. tried to follow the introspection technique[2]

This wasn't so easy to do; I had to operate really fast, because if I
was too slow the thread would exit, or it would get hung up on:

[Thread 0xb6194b70 (LWP 25770) exited]
[New Thread 0xb6194b70 (LWP 25806)]

Eventually I did manage to get somewhere:

0xb74f1b16 in memcpy () from /lib/i686/cmov/libc.so.6
(gdb) session-ruby
(gdb) redirect_stdout
$1 = 2
(gdb)
$2 = 2
(gdb) eval "caller"
$3 = 3
(gdb) rb_object_counts
Cannot get thread event message: debugger service failed
An error occurred while in a function called from GDB.
Evaluation of the expression containing the function
(rb_eval_string_protect) will be abandoned.
When the function is done executing, GDB will silently stop.
(gdb) eval "total = \[\[ObjectSpace\]\].each_object(Array)\{\|x\| puts '---'; puts x.inspect \}; puts \\"---\\nTotal Arrays: \#{total}\\""
Invalid character '\' in expression.

... then nothing.
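
(For reference, stripped of the gdb-level escaping, the Ruby that the expression above appears to be attempting is roughly the following -- a reconstruction, not the wiki's exact text:)
---
# ObjectSpace.each_object returns the number of objects yielded, so this
# dumps every live Array and then prints the total count.
total = ObjectSpace.each_object(Array) { |x| puts '---'; puts x.inspect }
puts "---\nTotal Arrays: #{total}"
---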

In the tail:

root@puppetmaster:/tmp# tail -f ruby-debug.28724
207 Puppet::Util::LoadedFile["/usr/lib/ruby/1.8/active_record/base.rb:2746:in `attributes='", "/usr/lib/ruby/1.8/active_record/base.rb:2742:in `each'", "/usr/lib/ruby/1.8/active_record/base.rb:2742:in `attributes='", "/usr/lib/ruby/1.8/active_record/base.rb:2438:in `initialize'", "/usr/lib/ruby/1.8/active_record/reflection.rb:162:in `new'", "/usr/lib/ruby/1.8/active_record/reflection.rb:162:in `build_association'", "/usr/lib/ruby/1.8/active_record/associations/association_collection.rb:423:in `build_record'", "/usr/lib/ruby/1.8/active_record/associations/association_collection.rb:102:in `build'", "/usr/lib/ruby/1.8/puppet/rails/host.rb:145:in `merge_facts'", "/usr/lib/ruby/1.8/puppet/rails/host.rb:144:in `each'", "/usr/lib/ruby/1.8/puppet/rails/host.rb:144:in `merge_facts'", "/usr/lib/ruby/1.8/puppet/rails/host.rb:140:in `each'", "/usr/lib/ruby/1.8/puppet/rails/host.rb:140:in `merge_facts'", "/usr/lib/ruby/1.8/puppet/indirector/facts/active_record.rb:32:in `save'", "/usr/lib/ruby/1.8/puppet/indirector/indirection.rb:256:in `save'", "/usr/lib/ruby/1.8/puppet/node/facts.rb:15:in `save'", "/usr/lib/ruby/1.8/puppet/indirector.rb:64:in `save'", "/usr/lib/ruby/1.8/puppet/indirector/catalog/compiler.rb:25:in `extract_facts_from_request'", "/usr/lib/ruby/1.8/puppet/indirector/catalog/compiler.rb:30:in `find'", "/usr/lib/ruby/1.8/puppet/indirector/indirection.rb:193:in `find'", "/usr/lib/ruby/1.8/puppet/indirector.rb:50:in `find'", "/usr/lib/ruby/1.8/puppet/network/http/handler.rb:101:in `do_find'", "/usr/lib/ruby/1.8/puppet/network/http/handler.rb:68:in `send'", "/usr/lib/ruby/1.8/puppet/network/http/handler.rb:68:in `process'", "/usr/lib/ruby/1.8/mongrel.rb:159:in `process_client'", "/usr/lib/ruby/1.8/mongrel.rb:158:in `each'", "/usr/lib/ruby/1.8/mongrel.rb:158:in `process_client'", "/usr/lib/ruby/1.8/mongrel.rb:285:in `run'", "/usr/lib/ruby/1.8/mongrel.rb:285:in `initialize'", "/usr/lib/ruby/1.8/mongrel.rb:285:in `new'", "/usr/lib/ruby/1.8/mongrel.rb:285:in `run'", "/usr/lib/ruby/1.8/mongrel.rb:268:in `initialize'", "/usr/lib/ruby/1.8/mongrel.rb:268:in `new'", "/usr/lib/ruby/1.8/mongrel.rb:268:in `run'", "/usr/lib/ruby/1.8/puppet/network/http/mongrel.rb:22:in `listen'", "/usr/lib/ruby/1.8/puppet/network/server.rb:127:in `listen'", "/usr/lib/ruby/1.8/puppet/network/server.rb:142:in `start'", "/usr/lib/ruby/1.8/puppet/daemon.rb:124:in `start'", "/usr/lib/ruby/1.8/puppet/application/master.rb:114:in `main'", "/usr/lib/ruby/1.8/puppet/application/master.rb:46:in `run_command'", "/usr/lib/ruby/1.8/puppet/application.rb:287:in `run'", "/usr/lib/ruby/1.8/puppet/application.rb:393:in `exit_on_fail'", "/usr/lib/ruby/1.8/puppet/application.rb:287:in `run'", "/usr/lib/ruby/1.8/puppet/util/command_line.rb:55:in `execute'", "/usr/bin/puppet:4"]

190 Puppet::Parser::AST::CaseStatement
181 ZAML::Label
170 Puppet::Parser::AST::Default
152 ActiveRecord::DynamicFinderMatch
152 ActiveRecord::DynamicScopeMatch
150 ActiveSupport::OrderedHash
148 OptionParser::Switch::RequiredArgument
138 YAML::Syck::Node
125 Range
124 Puppet::Parser::AST::IfStatement
117 ActiveRecord::Errors
115 Puppet::Provider::Confine::Exists
109 Puppet::Parser::AST::Selector
108 UnboundMethod
107 File::Stat
99 Puppet::Parameter::Value
90 Bignum
86 OptionParser::Switch::NoArgument
85 Puppet::Util::Settings::Setting
80 Puppet::Indirector::Request
75 Puppet::Parser::AST::ComparisonOperator
74 Puppet::Parser::Lexer::Token
73 Puppet::Parser::AST::ResourceOverride
70 ActiveRecord::ConnectionAdapters::MysqlColumn
66 Sync
65 StringIO
64 Binding
62 ActiveSupport::Callbacks::Callback
61 Puppet::Util::Settings::FileSetting
58 Puppet::Provider::ConfineCollection
56 Mysql::Result
52 Puppet::Module
47 Puppet::Network::AuthStore::Declaration
46 IPAddr
39 Puppet::Util::Settings::BooleanSetting
38 Thread
36 Puppet::Util::Autoload
35 Mysql
35 ActiveRecord::ConnectionAdapters::MysqlAdapter
34 Puppet::Parser::AST::Not
28 Puppet::Type::MetaParamLoglevel
28 Puppet::Type::File
28 Puppet::Type::File::ParameterPurge
28 Puppet::Type::File::ParameterLinks
28 Puppet::Type::File::Ensure
28 Puppet::Type::File::ParameterBackup
28 Puppet::Type::File::ParameterReplace
28 Puppet::Type::File::ParameterProvider
28 Puppet::Type::File::ParameterPath
28 Puppet::Type::File::ProviderPosix
28 Puppet::Type::File::ParameterChecksum

but then it seemed to stop logging entirely...

I'm available on IRC to try more advanced debugging, just ping me
(hacim). I'd really like things to function again!

micah


1. http://www.masterzen.fr/2010/03/21/more-puppet-offloading/
2. http://projects.puppetlabs.com/projects/1/wiki/Puppet_Introspection
3. http://www.mail-archive.com/puppet...@googlegroups.com/msg13692.html

Felix Frank

Jan 26, 2011, 4:21:03 AM
to puppet...@googlegroups.com
> What seems to happen is things are working fine at the
> beginning. Catalog compiles peg the CPU for the puppet process that is
> doing them and take anywhere from between 20 seconds and 75
> seconds. Then things get drastically worse after 4 compiles (note: I
> have four mongrels too, coincidence?), catalog compiles shoot up to 115,
> 165, 209, 268, 273, 341, 418, 546, 692, 774, 822, then 1149
> seconds... then things are really hosed. Sometimes hosts will fail
> outright and complain about weird things, like:
>
> Jan 25 14:04:34 puppetmaster puppet-master[30294]: Host is missing hostname and/or domain: gull.example.com
> Jan 25 14:04:55 puppetmaster puppet-master[30294]: Failed to parse template site-apt/local.list: Could not find value for 'lsbdistcodename' at /etc/puppet/modules/site-apt/manifests/init.pp:4 on node gull.example.com
>
> All four of my mongrels are constantly pegged, doing 40-50% of the CPU
> each, occupying all available CPUs. They never settle down. I've got 74
> nodes checking in now, it doesn't seem like its that many, but perhaps
> i've reached a tipping point with my puppetmaster (its a dual 1ghz,
> 2gigs of ram machine)?

Hmm, some quick math:

You have 74 nodes that (I assume) check in at least once every 1800
seconds. Each compile takes over 40 seconds on average, so all compiles
(if run serially) take some 3000 seconds. Of course, seeing as you have
two cores on that machine, you can take advantage of some concurrency,
but in the ideal case you're still down to only 1500 seconds, which
leaves you with little room to breathe (and in real life, concurrency
will not even be that efficient).

I propose you need to restructure your manifest so that it compiles
faster (if at all possible) or scale up your master. What you're
watching is probably just overload and resource thrashing.

Do you have any idea why each individual compilation takes that long? I
see a 15-20 second compile now and again, but most compiles are under 3
seconds in my case (but then, that's with four 2.4GHz cores, so it
doesn't necessarily compare).

Regards,
Felix

Brice Figureau

Jan 26, 2011, 5:13:23 AM
to puppet...@googlegroups.com

The puppetmaster is mostly CPU bound. Since you have only 2 CPUs, you
shouldn't try to achieve a concurrency of 4 (which your mongrels are
trying to do), otherwise more than one request will be accepted by one
mongrel process and each thread will contend for the CPU. The bad news
is that the ruby MRI uses green threading, so the second thread will
only run when the first one either sleeps, does I/O, or relinquishes
the CPU voluntarily. In a word, it will only run when the first thread
has finished its compilation.
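
(A tiny, self-contained Ruby illustration of that point -- on MRI, two CPU-bound threads take about as long as doing the same work serially:)
---
require 'benchmark'

def burn
  x = 0
  2_000_000.times { x += 1 }
  x
end

serial   = Benchmark.realtime { burn; burn }
threaded = Benchmark.realtime do
  t1 = Thread.new { burn }
  t2 = Thread.new { burn }
  t1.join
  t2.join
end

puts "serial:   %.2f s" % serial
puts "threaded: %.2f s (about the same: CPU-bound threads do not overlap)" % threaded
---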

Now you have 74 nodes with a worst-case compilation time of 75s (which
is a lot); that translates to 74*75 = 5550s of compilation time.
With a concurrency of 2, that's still 2775s of compilation time per
round of <insert here your default sleep time>. With the default 30min
of sleep time and assuming perfect scheduling, that's still longer than
one round of sleep time, which means that you won't ever finish
compiling all the nodes before the first node asks for a catalog again.

And I'm talking only about compilation. If your manifests use file
sourcing, you must also add this to the equation.

Another explanation of the issue is swapping. You mention your server
has 2GiB of RAM. Are you sure your 4 mongrel processes still fit in
physical RAM after a while (along with the other things running on the
server)?
Maybe your server is constantly swapping.

So you can do several thing to get better performances:
* reduce the number of nodes that check in at a single time (ie increase
sleep time)

* reduce the time it takes to compile a catalog:
+ which includes not using storeconfigs (or using puppetqd or
thin_storeconfig instead).
+ Check the server is not swapping.
+ Reduce the number of mongrel instances, to artificially reduce the
concurrency (this is counter-intuitive, I know)
+ use a "better" ruby interpreter like Ruby Enterprise Edition (for
several reasons: this one has a better GC and a better memory footprint).
+ Cache compiled catalogs in nginx
+ offload file content serving in nginx
+ Use passenger instead of mongrel

Note: you can use puppet-load (in the 2.6 source distribution) to
simulate concurrent nodes asking for catalogs. This is really helpful to
size a puppetmaster and check the real concurrency a stack/hardware can
give.

> I've tried a large number of different things to attempt to work around
> this:
>
>
> 0. reduced my node check-in times to be once an hour (and splayed
> randomly)
>
> 1. turn on puppetqd/stomp queuing
>
> This didn't seem to make a difference, its off now
>
> 2. turn on thin stored configs
>
> This sort of helped a little, but not enough
>
> 3. tried to upgrade rails from 2.3.5 (the debian version) to 2.3.10
>
> I didn't see any appreciable difference here. I ended up going back to
> 2.3.5 because that was the packaged version.

Since you seem to use Debian, make sure you use either the latest ruby
lenny backports (or REE) as they fixed an issue with pthreads and CPU
consumption:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=579229

> 4. tried to offload file content via nginx[1]
>
> This maybe helped a little, but its clear that the problem isn't the
> fileserving, it seems to be something in the catalog compilation.

Actually, offloading only helps when the puppet agent needs the file
content, which happens only when the file changes or the node doesn't
have it yet. In practice this helps only for new nodes.

> 5. tried to cache catalogs through adding a http front-end cache and
> expiring that cache when manifests are updated[1]
>
> I'm not sure this works at all.

This should have helped, because it would prevent the puppetmaster from
even being called. You might check your nginx configuration then.

> 6. set 'fair' queuing in my nginx.conf[3]
>
> This seemed to help for a few days, but then things got bad again.
>
> 7. set --http_compression
>
> I'm not sure if this actually hurts the master or not (because it has
> to now occupy the CPU compressing catalogs?)

This is a client option, and you need the collaboration of nginx for it
to work. This will certainly add more burden on your master CPU, because
nginx now has to gzip everything you're sending.

> 8. tried to follow the introspection technique[2]
>
> this wasn't so easy to do, I had to operate really fast, because if I
> was too slow the thread would exit, or it would get hung up on:
>
> [Thread 0xb6194b70 (LWP 25770) exited]
> [New Thread 0xb6194b70 (LWP 25806)]

When you attach gdb, how many threads are running?

This is just the objects used at a given time. What is more interesting
is where the CPU time is spent (ie getting a stacktrace would be
helpful, but not easy).

> but then it seemed to stop logging entirely...
>
> I'm available on IRC to try more advanced debugging, just ping me
> (hacim). I'd really like things to function again!

I'll ping you, but I'm just really busy for the next couple of
days :(

Micah Anderson

Jan 26, 2011, 9:44:45 AM
to puppet...@googlegroups.com
Felix Frank <felix...@alumni.tu-berlin.de> writes:

> I propose you need to restructure your manifest so that it compiles
> faster (if at all possible) or scale up your master. What you're
> watching is probably just overload and resource thrashing.

I'm interested in ideas for what are good steps for restructuring
manifests so they can compile faster, or at least methods for
identifying problematic areas in manifests.

> Do you have any idea why each individual compilation takes that long?

It wasn't before. Before things start spinning, compilation times are
between 9 seconds and 60 seconds, usually averaging just shy of 30
seconds.

micah

Felix Frank

Jan 26, 2011, 9:46:54 AM
to puppet...@googlegroups.com
On 01/26/2011 03:44 PM, Micah Anderson wrote:
> Felix Frank <felix...@alumni.tu-berlin.de> writes:
>
>> I propose you need to restructure your manifest so that it compiles
>> faster (if at all possible) or scale up your master. What you're
>> watching is probably just overload and resource thrashing.
>
> I'm interested in ideas for what are good steps for restructuring
> manifests so they can compile faster, or at least methods for
> identifying problematic areas in manifests.

Are there many templates or use of the file() function?

Do you make heavy use of modules and the autoloader?

>> Do you have any idea why each individual compilation takes that long?
>
> It wasn't before. Before things start spinning, compilation times are
> between 9 seconds and 60 seconds, usually averaging just shy of 30
> seconds.

That's still quite considerable IMO.

Regards,
Felix

Micah Anderson

Jan 26, 2011, 10:11:18 AM
to puppet...@googlegroups.com
Brice Figureau <brice-...@daysofwonder.com> writes:

> On Tue, 2011-01-25 at 17:11 -0500, Micah Anderson wrote:
>> Brice Figureau <brice-...@daysofwonder.com> writes:
>>
>> All four of my mongrels are constantly pegged, doing 40-50% of the CPU
>> each, occupying all available CPUs. They never settle down. I've got 74
>> nodes checking in now, it doesn't seem like its that many, but perhaps
>> i've reached a tipping point with my puppetmaster (its a dual 1ghz,
>> 2gigs of ram machine)?
>
> The puppetmaster is mostly CPU bound. Since you have only 2 CPUs, you
> shouldn't try to achieve a concurrency of 4 (which your mongrel are
> trying to do), otherwise what will happen is that more than one request
> will be accepted by one mongrel process and each thread will contend for
> the CPU. The bad news is that the ruby MRI uses green threading, so the
> second thread will only run when the first one will either sleep, do I/O
> or relinquish the CPU voluntary. In a word, it will only run when the
> first thread will finish its compilation.

Ok, that is a good thing to know. I wasn't aware that ruby was not able
to do that.

> Now you have 74 nodes, with the worst compilation time of 75s (which is
> a lot), that translates to 74*75 = 5550s of compilation time.
> With a concurrency of 2, that's still 2775s of compilation time per
> round of <insert here your default sleep time>. With the default 30min
> of sleep time and assuming a perfect scheduling, that's still larger
> than a round of sleep time, which means that you won't ever finish
> compiling nodes, when the first node will ask again for a catalog.

I'm doing 60 minutes of sleep time, which is 3600 seconds an hour; with
a concurrency of 2 giving me 2775s of compile time per hour, that does
keep me under the 3600 seconds... assuming scheduling is perfect, which
it very likely is not.

> And I'm talking only about compilation. If your manifests use file
> sourcing, you must also add this to the equation.

As explained, I set up your nginx method for offloading file sourcing.

> Another explanation of the issue is swapping. You mention your server
> has 2GiB of RAM. Are you sure your 4 mongrel processes after some times
> still fit in the physical RAM (along with the other thing running on the
> server)?
> Maybe your server is constantly swapping.

I'm actually doing fine on memory, not dipping into swap. I've watched
i/o to see if I could identify either a swap or disk problem, but didn't
notice very much happening there. The CPU usage of the mongrel processes
is pretty much where everything is spending its time.

I've been wondering if I have some loop in a manifest or something that
is causing them to just spin.

> So you can do several thing to get better performances:
> * reduce the number of nodes that check in at a single time (ie increase
> sleep time)

I've already reduced to once per hour, but I could consider reducing it
more.

> * reduce the time it takes to compile a catalog:
> + which includes not using storeconfigs (or using puppetqd or
> thin_storeconfig instead).

I need to use storeconfigs, and as detailed in my original message, I've
tried puppetqd and it didn't do much for me. thin_storeconfigs did help,
and I'm still using it, so this one has already been done too.

> + Check the server is not swapping.

Not swapping.

> + Reduce the number of mongrel instances, to artifically reduce the
> concurrency (this is counter-intuitive I know)

Ok, I'm backing off to two mongrels to see how well that works.

> + use a "better" ruby interpreter like Ruby Enterprise Edition (for
> several reasons this ones has better GC, better memory footprint).

I'm pretty sure my problem isn't memory, so I'm not sure if these will
help much.

> + Cache compiled catalogs in nginx

Doing this.

> + offload file content serving in nginx

Doing this

> + Use passenger instead of mongrel

I tried to switch to passenger, and things were much worse. Actually,
passenger worked fine with 0.25, but when I upgraded I couldn't get it
to function anymore. I actually had to go back to nginx to get things
functioning again.

>> 3. tried to upgrade rails from 2.3.5 (the debian version) to 2.3.10
>>
>> I didn't see any appreciable difference here. I ended up going back to
>> 2.3.5 because that was the packaged version.
>
> Since you seem to use Debian, make sure you use either the latest ruby
> lenny backports (or REE) as they fixed an issue with pthreads and CPU
> consumption:
> http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=579229

I'm using Debian Squeeze, which has the same version you are mentioning
from lenny backports (2.3.5).

>> 5. tried to cache catalogs through adding a http front-end cache and
>> expiring that cache when manifests are updated[1]
>>
>> I'm not sure this works at all.
>
> This should have helped because this would prevent the puppetmaster to
> even be called. You might check your nginx configuration then.

Hmm. According to jamesturnbull, the rest terminus shouldn't allow you
to request any node's catalog, so I'm not sure how this can work at
all... but in case I've got something screwed up in my nginx.conf, I'd
really be happy if you could have a look at it; it's possible that I
misunderstood something from your blog post! Here it is:

user www-data;
worker_processes 2;

error_log /var/log/nginx/error.log;
pid /var/run/nginx.pid;

events {
  # In a reverse proxy situation, max_clients becomes
  # max_clients = worker_processes * worker_connections/4
  worker_connections 2048;
}

http {
  default_type application/octet-stream;

  sendfile on;
  tcp_nopush on;
  tcp_nodelay on;

  large_client_header_buffers 1024 2048k;
  client_max_body_size 150m;
  proxy_buffers 128 4k;

  keepalive_timeout 65;

  gzip on;
  gzip_min_length 1000;
  gzip_types text/plain;

  ssl on;
  ssl_certificate /var/lib/puppet/ssl/certs/puppetmaster.pem;
  ssl_certificate_key /var/lib/puppet/ssl/private_keys/puppetmaster.pem;
  ssl_client_certificate /var/lib/puppet/ssl/ca/ca_crt.pem;
  ssl_ciphers SSLv2:-LOW:-EXPORT:RC4+RSA;
  ssl_session_cache shared:SSL:8m;
  ssl_session_timeout 5m;

  proxy_read_timeout 600;

  upstream puppet_mongrel {
    fair;
    server 127.0.0.1:18140;
    server 127.0.0.1:18141;
    server 127.0.0.1:18142;
    server 127.0.0.1:18143;
  }

  log_format noip '0.0.0.0 - $remote_user [$time_local] '
                  '"$request" $status $body_bytes_sent '
                  '"$http_referer" "$http_user_agent"';

  proxy_cache_path /var/cache/nginx/cache levels=1:2 keys_zone=puppetcache:10m;

  server {
    listen 8140;
    access_log /var/log/nginx/access.log noip;
    ssl_verify_client required;

    root /etc/puppet;

    # make sure we serve everything
    # as raw
    types { }
    default_type application/x-raw;

    # serve static file for the [files] mountpoint
    location /production/file_content/files/ {
      allow 172.16.0.0/16;
      allow 10.0.1.0/8;
      allow 127.0.0.1/8;
      deny all;

      alias /etc/puppet/files/;
    }

    # serve modules files sections
    location ~ /production/file_content/[^/]+/files/ {
      # it is advisable to have some access rules here
      allow 172.16.0.0/16;
      allow 10.0.1.0/8;
      allow 127.0.0.1/8;
      deny all;

      root /etc/puppet/modules;

      # rewrite /production/file_content/module/files/file.txt
      # to /module/file.text
      rewrite ^/production/file_content/([^/]+)/files/(.+)$ $1/$2 break;
    }

    # Variables
    # $ssl_cipher returns the line of those utilized it is cipher for established SSL-connection
    # $ssl_client_serial returns the series number of client certificate for established SSL-connection
    # $ssl_client_s_dn returns line subject DN of client certificate for established SSL-connection
    # $ssl_client_i_dn returns line issuer DN of client certificate for established SSL-connection
    # $ssl_protocol returns the protocol of established SSL-connection

    location / {
      proxy_pass http://puppet_mongrel;
      proxy_redirect off;
      proxy_set_header Host $host;
      proxy_set_header X-Real-IP $remote_addr;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header X-Client-Verify SUCCESS;
      proxy_set_header X-SSL-Subject $ssl_client_s_dn;
      proxy_set_header X-SSL-Issuer $ssl_client_i_dn;
      proxy_buffer_size 16k;
      proxy_buffers 8 32k;
      proxy_busy_buffers_size 64k;
      proxy_temp_file_write_size 64k;
      proxy_read_timeout 540;

      # we handle catalog differently
      # because we want to cache them
      location /production/catalog {
        proxy_pass http://puppet_mongrel;
        proxy_redirect off;

        # it is a good thing to actually restrict who
        # can ask for a catalog (especially for cached
        # catalogs)
        allow 172.16.0.0/16;
        allow 10.0.1.0/8;
        allow 127.0.0.1/8;
        deny all;

        # where to cache contents
        proxy_cache puppetcache;

        # we cache content by catalog host
        # we could also use $args to take into account request
        # facts, but those change too often (ie uptime or memory)
        # to be really usefull
        proxy_cache_key $uri;

        # define how long to cache response

        # normal catalogs will be cached 2 weeks
        proxy_cache_valid 200 302 301 2w;

        # errors are not cached long
        proxy_cache_valid 500 403 1m;

        # the rest is cached a little bit
        proxy_cache_valid any 30m;
      }

      # catch all location for other terminii
      location / {
        proxy_pass http://puppet_mongrel;
        proxy_redirect off;
      }
    }
  }

  server {
    listen 8141;
    ssl_verify_client off;
    root /var/empty;
    access_log /var/log/nginx/access.log noip;

    location / {
      proxy_pass http://puppet_mongrel;
      proxy_redirect off;
      proxy_set_header Host $host;
      proxy_set_header X-Real-IP $remote_addr;
      proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      proxy_set_header X-Client-Verify FAILURE;
      proxy_set_header X-SSL-Subject $ssl_client_s_dn;
      proxy_set_header X-SSL-Issuer $ssl_client_i_dn;
    }
  }
}


>> 7. set --http_compression
>>
>> I'm not sure if this actually hurts the master or not (because it has
>> to now occupy the CPU compressing catalogs?)
>
> This is a client option, and you need the collaboration of nginx for it
> to work. This will certainly add more burden on your master CPU, because
> nginx now has to gzip everything you're sending.

Yeah, I have gzip compression turned on in nginx, but I don't really
need it, and my master could use the break.

>> 8. tried to follow the introspection technique[2]
>>
>> this wasn't so easy to do, I had to operate really fast, because if I
>> was too slow the thread would exit, or it would get hung up on:
>>
>> [Thread 0xb6194b70 (LWP 25770) exited]
>> [New Thread 0xb6194b70 (LWP 25806)]
>
> When you attach gdb, how many threads are running?

I'm not sure; how can I determine that? I just had the existing 4
mongrel processes.


>> (gdb) eval "total = \[\[ObjectSpace\]\].each_object(Array)\{\|x\| puts '---'; puts x.inspect \}; puts \\"---\\nTotal Arrays: \#{total}\\""
>> Invalid character '\' in expression.

The above seems to be a problem with the expression on the wiki page;
does anyone know what it should be so gdb doesn't have a problem with
it?

>> I'm available on IRC to try more advanced debugging, just ping me
>> (hacim). I'd really like things to function again!
>
> I'll ping you, but I'm just really busy for this very next couple of
> days :(

Thanks for any help or ideas, I'm out of ideas myself so anything helps!

micah

Brice Figureau

Jan 26, 2011, 10:35:44 AM
to puppet...@googlegroups.com

Do you use an External Node Classifier?

Brice Figureau

Jan 26, 2011, 11:23:11 AM
to puppet...@googlegroups.com

I don't think that's the problem. There could be some Ruby internals
issues at play here, but I doubt something in your manifest creates a
loop.

What is strange is that you mentioned that the very first catalog
compilations were fine, but then the compilation time increases.

> > So you can do several thing to get better performances:
> > * reduce the number of nodes that check in at a single time (ie increase
> > sleep time)
>
> I've already reduced to once per hour, but I could consider reducing it
> more.

That would be interesting. This would help us know whether the problem
is too much load/concurrency from your clients or a problem in the
master itself.

BTW, what's the load on the server?

> > * reduce the time it takes to compile a catalog:
> > + which includes not using storeconfigs (or using puppetqd or
> > thin_storeconfig instead).
>
> I need to use storeconfigs, and as detailed in my original message, I've
> tried puppetqd and it didn't do much for me. thin_storeconfigs did help,
> and I'm still using it, so this one has already been done too.
>
> > + Check the server is not swapping.
>
> Not swapping.

OK, good.

> > + Reduce the number of mongrel instances, to artifically reduce the
> > concurrency (this is counter-intuitive I know)
>
> Ok, I'm backing off to two mongrels to see how well that works.

Let me know if that changes something.

> > + use a "better" ruby interpreter like Ruby Enterprise Edition (for
> > several reasons this ones has better GC, better memory footprint).
>
> I'm pretty sure my problem isn't memory, so I'm not sure if these will
> help much.

Well, having a better GC means that the ruby interpreter becomes faster
at allocating and recycling objects. In the end the overall memory
footprint can be better, but it also means the interpreter spends much
less time on garbage collection (i.e. the CPU goes to your code rather
than to tidying up).

> > + Cache compiled catalogs in nginx
>
> Doing this.
>
> > + offload file content serving in nginx
>
> Doing this
>
> > + Use passenger instead of mongrel
>
> I tried to switch to passenger, and things were much worse. Actually,
> passenger worked fine with 0.25, but when I upgraded I couldn't get it
> to function anymore. I actually had to go back to nginx to get things
> functioning again.
>
> >> 3. tried to upgrade rails from 2.3.5 (the debian version) to 2.3.10
> >>
> >> I didn't see any appreciable difference here. I ended up going back to
> >> 2.3.5 because that was the packaged version.
> >
> > Since you seem to use Debian, make sure you use either the latest ruby
> > lenny backports (or REE) as they fixed an issue with pthreads and CPU
> > consumption:
> > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=579229
>
> I'm using Debian Squeeze, which has the same version you are mentioning
> from lenny backports (2.3.5).

I was talking about the ruby1.8 package, not rails. Make sure you use
the squeeze version or the lenny-backports one.
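
A quick way to double-check which one you actually have installed, for
example:

dpkg -l ruby1.8 libruby1.8
apt-cache policy ruby1.8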

> >> 5. tried to cache catalogs through adding a http front-end cache and
> >> expiring that cache when manifests are updated[1]
> >>
> >> I'm not sure this works at all.
> >
> > This should have helped because this would prevent the puppetmaster to
> > even be called. You might check your nginx configuration then.
>
Hmm. According to jamesturnbull, the rest terminus shouldn't allow you
to request any node's catalog, so I'm not sure how this can work at
all... but in case I've got something screwed up in my nginx.conf, I'd
really be happy if you could have a look at it; it's possible that I
misunderstood something from your blog post! Here it is:

When a client asks for a catalog, nginx checks whether it has already
cached it; if it has and the cache is still fresh, it serves the cached
copy, otherwise it asks a puppetmaster for the same REST URL and then
caches what the master returns.

It's easy to check if nginx is caching the catalog:
Have a look into /var/cache/nginx/cache and see if there are some files
containing some of your catalogs.
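
For example, something like:

find /var/cache/nginx/cache -type f | wc -l

should come back non-zero once catalogs start being cached.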

Puppet doesn't send the necessary caching headers right now, and I'm not
sure how nginx deals with that. I hope it will still cache (by virtue of
proxy_cache_valid).

What version of nginx are you using?


> server {
> listen 8140;
> access_log /var/log/nginx/access.log noip;
> ssl_verify_client required;

Make that:
ssl_verify_client optional;

And remove the second server{} block, and make sure your clients do not
use a different ca_port. But only if you use nginx >= 0.7.64

If you used ssl_verify_client as I explained above, this should be:
proxy_set_header X-Client-Verify $ssl_client_verify;

> proxy_set_header X-SSL-Subject $ssl_client_s_dn;
> proxy_set_header X-SSL-Issuer $ssl_client_i_dn;
> proxy_buffer_size 16k;
> proxy_buffers 8 32k;
> proxy_busy_buffers_size 64k;
> proxy_temp_file_write_size 64k;
> proxy_read_timeout 540;
>
> # we handle catalog differently
> # because we want to cache them
> location /production/catalog {

Warning: this ^^ will work only if your nodes are in the "production"
environment. Adjust for your environments.

You already have a location '/' above.
Are you sure nginx is correctly using this configuration?
Try:
nginx -t
it will check your configuration

> proxy_pass http://puppet_mongrel;
> proxy_redirect off;
> }
> }
> }
> server {
> listen 8141;
> ssl_verify_client off;
> root /var/empty;
> access_log /var/log/nginx/access.log noip;
>
> location / {
> proxy_pass http://puppet_mongrel;
> proxy_redirect off;
> proxy_set_header Host $host;
> proxy_set_header X-Real-IP $remote_addr;
> proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
> proxy_set_header X-Client-Verify FAILURE;
> proxy_set_header X-SSL-Subject $ssl_client_s_dn;
> proxy_set_header X-SSL-Issuer $ssl_client_i_dn;
> }
> }
> }

This server{} wouldn't be needed if you use the ssl_verify_client as
explained above.

>
> >> 7. set --http_compression
> >>
> >> I'm not sure if this actually hurts the master or not (because it has
> >> to now occupy the CPU compressing catalogs?)
> >
> > This is a client option, and you need the collaboration of nginx for it
> > to work. This will certainly add more burden on your master CPU, because
> > nginx now has to gzip everything you're sending.
>
> Yeah, I have the gzip compression turned on in nginx, but I dont really
> need it and my master could use the break.

Actually your nginx is only compressing text/plain documents, so it
won't compress your catalogs.

> >> 8. tried to follow the introspection technique[2]
> >>
> >> this wasn't so easy to do, I had to operate really fast, because if I
> >> was too slow the thread would exit, or it would get hung up on:
> >>
> >> [Thread 0xb6194b70 (LWP 25770) exited]
> >> [New Thread 0xb6194b70 (LWP 25806)]
> >
> > When you attach gdb, how many threads are running?
>
> I'm not sure, how can I determine that? I just had the existing 4
> mongrel processes.

Maybe you can first try to display the full C trace for all threads:
thread apply all bt

Then resume everything, and 2 to 5 seconds later take another snapshot
with the command above. Comparing the two traces might help us understand
what the process is doing.
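
Something along these lines (a rough sketch; substitute the PID of
whichever mongrel is spinning, and the timing is only indicative):

gdb -p <pid of the busy mongrel>
(gdb) set pagination off
(gdb) thread apply all bt
(gdb) continue
# wait 2-5 seconds, hit Ctrl-C, then take the second snapshot
(gdb) thread apply all bt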

HTH,

Brice Figureau

unread,
Jan 26, 2011, 11:30:15 AM1/26/11
to puppet...@googlegroups.com

make this:
proxy_cache_path /var/cache/nginx/cache levels=1:2 keys_zone=puppetcache:50m inactive=300m

The default inactive is 10 minutes, which is too low for a sleep time of
60 minutes, so it is possible for the cached catalog to be evicted.
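
Roughly, the caching pieces should end up looking like this (just a
sketch; the zone name, upstream and durations are only illustrative):

# in the http{} block: define the cache zone, keep inactive entries for 5 hours
proxy_cache_path /var/cache/nginx/cache levels=1:2 keys_zone=puppetcache:50m inactive=300m;

# in the catalog location{}: use that zone, keep 200 responses fresh longer than the agent sleep time
location /production/catalog {
    proxy_pass        http://puppet_mongrel;
    proxy_redirect    off;
    proxy_cache       puppetcache;
    proxy_cache_valid 200 120m;
}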

Micah Anderson

unread,
Jan 26, 2011, 2:47:43 PM1/26/11
to puppet...@googlegroups.com
Felix Frank <felix...@alumni.tu-berlin.de> writes:

> On 01/26/2011 03:44 PM, Micah Anderson wrote:
>> Felix Frank <felix...@alumni.tu-berlin.de> writes:
>>
>>> I propose you need to restructure your manifest so that it compiles
>>> faster (if at all possible) or scale up your master. What you're
>>> watching is probably just overload and resource thrashing.
>>
>> I'm interested in ideas for what are good steps for restructuring
>> manifests so they can compile faster, or at least methods for
>> identifying problematic areas in manifests.
>
> Are there many templates or use of the file() function?

Yes, there are quite a few. I'm not really sure the best way to count
them. I have 288 'source => "$fileserver"' lines in my
manifests. Another ~160 of them in various modules. As far as templates
go, I have ~77 "content => template(...)" lines in my manifests and
another 55 in modules.

> Do you make heavy use of modules and the autoloader?

I do make heavy use of modules, I have about 50 of them. I'm importing
18 of them in my manifests/modules.pp. I think, if they are set up
right, I only need to import one of those, and I've been slowly paring
those down. I presume that by 'the autoloader' you mean those
modules which aren't explicitly included somewhere?

>>> Do you have any idea why each individual compilation takes that long?
>>
>> It wasn't before. Before things start spinning, compilation times are
>> between 9 seconds and 60 seconds, usually averaging just shy of 30
>> seconds.
>
> That's still quite considerable IMO.

Actually, looking at my logs, compile time was averaging around
15 seconds each, some taking very little time at all. When things go
bad, it's more or less a thundering herd and the times start going up and
up.

micah

Micah Anderson

unread,
Jan 26, 2011, 2:48:04 PM1/26/11
to puppet...@googlegroups.com
Brice Figureau <brice-...@daysofwonder.com> writes:

> On Wed, 2011-01-26 at 09:44 -0500, Micah Anderson wrote:
>> Felix Frank <felix...@alumni.tu-berlin.de> writes:
>>
>> > I propose you need to restructure your manifest so that it compiles
>> > faster (if at all possible) or scale up your master. What you're
>> > watching is probably just overload and resource thrashing.
>>
>> I'm interested in ideas for what are good steps for restructuring
>> manifests so they can compile faster, or at least methods for
>> identifying problematic areas in manifests.
>>
>> > Do you have any idea why each individual compilation takes that long?
>>
>> It wasn't before. Before things start spinning, compilation times are
>> between 9 seconds and 60 seconds, usually averaging just shy of 30
>> seconds.
>
> Do you use a External Node Classifier?

I do not.

micah

Micah Anderson

unread,
Jan 26, 2011, 3:40:45 PM1/26/11
to puppet...@googlegroups.com
Brice Figureau <brice-...@daysofwonder.com> writes:

Yes, and it increases quite rapidly. Interestingly, the first few compile
times are basically within the range of what I was experiencing before
things started to tip over (the last few days). I'm struggling to think
of anything I could have changed, but so far I haven't come up with
anything.

>> > So you can do several thing to get better performances:
>> > * reduce the number of nodes that check in at a single time (ie increase
>> > sleep time)
>>
>> I've already reduced to once per hour, but I could consider reducing it
>> more.
>
> That would be interesting. This would help us know if the problem is too
> many load/concurrency for your clients or a problem in the master
> itself.

I'll need to set up mcollective to do that, I believe.

Right now I'm setting up a cronjob like this:

"<%= scope.function_fqdn_rand(['59']) %> * * * *"

which results in a cronjob (on one host):
6 * * * * root /usr/sbin/puppetd --onetime --no-daemonize --config=/etc/puppet/puppet.conf --color false | grep -E '(^err:|^alert:|^emerg:|^crit:)'
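
The native cron resource equivalent would be roughly the following; just
a sketch: the resource title is arbitrary, I'm assuming the stock
fqdn_rand function, and I've dropped the grep filter for brevity.

cron { 'puppet-run':
  command => '/usr/sbin/puppetd --onetime --no-daemonize --config=/etc/puppet/puppet.conf --color false',
  user    => 'root',
  minute  => fqdn_rand(60),
}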

> BTW, what's the load on the server?

The server is dedicated to puppetmaster. When I had four mongrels
running it was basically at 4 constantly. Now that I've backed it down
to 2 mongrels, its:

11:57:41 up 58 days, 21:20, 2 users, load average: 2.31, 1.97, 2.02

>> Not swapping.
>
> OK, good.

Just to confirm this: vmstat shows no si/so happening, and
very high numbers in the CPU user column. Very little bi/bo, and low sys
values. Context switches are a bit high... this clearly points to the
process eating CPU, not to any disk/memory/swap scenario.
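
(That's from watching something like:

vmstat 5

and reading off the si/so, bi/bo and us/sy columns.)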

>> > + Reduce the number of mongrel instances, to artifically reduce the
>> > concurrency (this is counter-intuitive I know)
>>
>> Ok, I'm backing off to two mongrels to see how well that works.
>
> Let me know if that changes something.

Doesn't seem to help. Compiles start out low, and are inching up
(started at 27, and now they are at 120 seconds).

>> > + use a "better" ruby interpreter like Ruby Enterprise Edition (for
>> > several reasons this ones has better GC, better memory footprint).
>>
>> I'm pretty sure my problem isn't memory, so I'm not sure if these will
>> help much.
>
> Well, having a better GC means that the ruby interpreter will become
> faster at allocating stuff or recycling object. That in the end means
> the overall memory footprint can be better, but that also means it will
> spend much less time doing garbage stuff (ie better use the CPU for your
> code and not for tidying stuff).

That could be interesting. I haven't tried REE or JRuby on Debian
before; I suppose it's worth a try.

>> >> 3. tried to upgrade rails from 2.3.5 (the debian version) to 2.3.10
>> >>
>> >> I didn't see any appreciable difference here. I ended up going back to
>> >> 2.3.5 because that was the packaged version.
>> >
>> > Since you seem to use Debian, make sure you use either the latest ruby
>> > lenny backports (or REE) as they fixed an issue with pthreads and CPU
>> > consumption:
>> > http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=579229
>>
>> I'm using Debian Squeeze, which has the same version you are mentioning
>> from lenny backports (2.3.5).
>
> I was talking about the ruby1.8 package, not rails. Make sure you use
> the squeeze version or the lenny-backports one.

Yep, I'm using the squeeze ruby1.8, which is 1.8.7.302-2

>> >> 5. tried to cache catalogs through adding a http front-end cache and
>> >> expiring that cache when manifests are updated[1]
>> >>
>> >> I'm not sure this works at all.
>> >
>> > This should have helped because this would prevent the puppetmaster to
>> > even be called. You might check your nginx configuration then.

It wasn't really caching before, because of the nginx parameter you
pointed out in a previous message. But now it seems like it is:

find /var/cache/nginx/cache -type f |wc
29 29 1769

> What version of nginx are you using?

0.7.67-3

> Make that:
> ssl_verify_client optional;
>
> And remove the second server{} block, and make sure your clients do not
> use a different ca_port. But only if you use nginx >= 0.7.64

Ok, that second server block was for the cert request... but it sounds like
if I tweak the verify to optional, I don't need it. I'm sure the
clients aren't using a different ca_port (except for the initial node
bootstrap). I've changed that and removed the block.

> If you used ssl_verify_client as I explained above, this should be:
> proxy_set_header X-Client-Verify $ssl_client_verify

Changed.

>> # we handle catalog differently
>> # because we want to cache them
>> location /production/catalog {
>
> Warning: this ^^ will work only if your nodes are in the "production"
> environment. Adjust for your environments.

/etc/puppet/puppet.conf has:

environment = production

I do occasionally use development environments, but rarely enough that
not having caching is ok.

> You already have a location '/' above.
> Are you sure nginx is correctly using this configuration?
> Try:
> nginx -t
> it will check your configuration

Hm, good catch. nginx -t seems ok with it, but I've removed the extra
location '/' just in case.

> This server{} wouldn't be needed if you use the ssl_verify_client as
> explained above.

Removed.

>> >> 7. set --http_compression
>> >>
>> >> I'm not sure if this actually hurts the master or not (because it has
>> >> to now occupy the CPU compressing catalogs?)
>> >
>> > This is a client option, and you need the collaboration of nginx for it
>> > to work. This will certainly add more burden on your master CPU, because
>> > nginx now has to gzip everything you're sending.
>>
>> Yeah, I have the gzip compression turned on in nginx, but I dont really
>> need it and my master could use the break.
>
> Actually your nginx are only compressing text/plain documents, so it
> won't compress your catalogs.

Ah, interesting! Well, again... I'm turning it off on the nodes, it's not
needed.

>> >> 8. tried to follow the introspection technique[2]
>> >>
>> >> this wasn't so easy to do, I had to operate really fast, because if I
>> >> was too slow the thread would exit, or it would get hung up on:
>> >>
>> >> [Thread 0xb6194b70 (LWP 25770) exited]
>> >> [New Thread 0xb6194b70 (LWP 25806)]
>> >
>> > When you attach gdb, how many threads are running?
>>
>> I'm not sure, how can I determine that? I just had the existing 4
>> mongrel processes.
>
> Maybe you can first try to display the full C trace for all threads:
> thread apply all bt
>
> Then, resume everything, and 2 to 5s take another snapshot with the
> command above. Comparing the two trace might help us understand what the
> process is doing.

Now that I've fixed up the nginx.conf and caching is actually happening,
I've noticed that catalog compiles are 10s, 14s, 19s, 10s, 25s, 8s and
things haven't fallen over yet, so it's much better right now. I'm going
to let this run for an hour or two and if things are still bad, I'll
look at the thread traces.

m

Felix Frank

unread,
Jan 27, 2011, 4:57:59 AM1/27/11
to puppet...@googlegroups.com
>> Are there many templates or use of the file() function?
>
> Yes, there are quite a few. I'm not really sure the best way to count
> them. I have 288 'source => "$fileserver"' lines in my

Those don't hurt compilation.

> manifests. Another ~160 of them in various modules. As far as templates
> go, I have ~77 "content => template(...)" lines in my manifests and
> another 55 in modules.

If in doubt, try removing a good portion of them to see if compile
times are affected. But that won't be very practical.

>> Do you make heavy use of modules and the autoloader?
>
> I do make heavy use of modules, I have about 50 of them. I'm importing
> 18 of them in my manifests/modules.pp. I think, if they are set up
> right, I only need to import one of those, and I've been slowly paring
> those down. I presume that by 'the autoloader' you mean those
> modules which aren't explicitly included somewhere?

Yes. If your structure leads to all modules being eventually included on
all nodes, you're possibly wasting CPU cycles during compilation.

I recently had a script rename all my classes to
<module_name>::classname (including all references, i.e. includes etc.)
and got rid of all import statements. I didn't determine any changes in
compilation time, but the manifests are now generally less messy, so
there really are no downsides for me.

>>>> Do you have any idea why each individual compilation takes that long?
>>>
>>> It wasn't before. Before things start spinning, compilation times are
>>> between 9 seconds and 60 seconds, usually averaging just shy of 30
>>> seconds.
>>
>> That's still quite considerable IMO.
>
> Actually looking at my logs, compile time actually was averaging around
> 15 seconds each. some taking very little time at all. When things go
> bad, its more or less a thundering herd and the times start going up and
> up.

Reminiscent of what I saw before moving away from Webrick.

Seeing as you noticed in the other branch that tuning nginx (was it?)
helped you, I still think you were just overloaded before.

Regards,
Felix

Udo Waechter

unread,
Jan 31, 2011, 1:11:28 PM1/31/11
to puppet...@googlegroups.com
Hi.

I am just reading this thread, and it strikes me that we have the same problems with 2.6.3.

Since upgrading from 2.6.2 to .3, puppetmaster shows the behaviour described in this thread.
We have about 160 clients, and the puppetmaster is now an 8-core, 8 GB RAM KVM instance. We had the same problem with 4 cores and 4 GB RAM; "doublesizing" the VM did not change a thing!

We use passenger 2.2.11debian-2 and apache 2.2.16-3, ruby1.8 from squeeze.

Puppetmaster works fine after restart, then after about 2-3 hours it becomes pretty unresponsive; catalog runs go up to 120 seconds and more (the baseline being around 10 seconds).

I need to restart apache/puppetmaster about once a day. When I do that I need to:

* stop apache
* kill the (still running) puppetmasters (with SIGKILL!); some are always left running with "CPU 100%"
* start apache

Something is very weird there, and there were no fundamental changes to the manifests/modules.

The only thing that really changed is the VM itself. It was XEN (for years), we switched to KVM with kernel 2.6.35

Another strange thing:

puppet-clients do run a lot longer nowadays. A machine usually took about 40-50 seconds for one run. When puppetmaster goes crazy it now takes ages (500 seconds and even more).

Something is weird there...
--udo,

Nan Liu

unread,
Jan 31, 2011, 1:16:16 PM1/31/11
to puppet...@googlegroups.com

When it takes longer, is the agent simply spending more time on
config_retrieval? You can find this metric in the stored reports. I would
not focus on the agent if the delays are caused by compilation delays
on the master.
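
To eyeball that metric, something like the following works (this assumes
the default reportdir on the master; the node name is just a placeholder):

grep -r config_retrieval /var/lib/puppet/reports/somenode.example.com/ | tail -n 5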

Thanks,

Nan

Brice Figureau

unread,
Jan 31, 2011, 4:43:16 PM1/31/11
to puppet...@googlegroups.com
On 31/01/11 19:11, Udo Waechter wrote:
> Hi.
>
> I am just reading this thread, and it strikes me that we have the
> same problems with 2.6.3.
>
> Since upgrading from 2.6.2 to .3, puppetmaster shows the behaviour
> described in this thread. We have about 160 clients, and puppetmaster
> (is now) 8 core, 8Gb RAM kvm instance. Had this with 4 cores and 4
> gigs RAM, "doublesizing" the VM did not change a thing!
>
> We use passenger 2.2.11debian-2 and apache 2.2.16-3, ruby1.8 from
> squeeze.

I see a pattern here. It seems Micah (see a couple of mails above in
this thread) has about the same setup, except he's using mongrels.

It would be great to try a non-debian ruby (hint: Ruby Enterprise
Edition for instance) to see if that's any better.

Do you use storeconfigs?

> Puppetmaster works fine after restart, then after about 2-3 hours it
> becomes pretty unresponsive, catalog runs go up to 120 seconds and
> more (the baseline being something about 10 seconds).

With 160 hosts, a 30 min sleeptime, and a compilation of 10s, that means
you need 1600 cpu seconds to build catalogs for all your fleet.
With a concurrency of 8 cores (assuming you use a pool of 8 passenger
apps), that's 200s per core, which is way less than the max of 1800s you can
accommodate in a 30 min time-frame. Of course this assumes an evenly
distributed load and perfect concurrency, but still you have plenty of
available resources. So I conclude this is not normal.

> I need to restart apache/puppetmaster about once a day. When I do
> that I need to:
>
> * stop apache
> * kill the (still running) puppetmasters (with SIGKILL!), some are always left running with "CPU 100%"
> * start apache

Does stracing/ltracing the process show something useful?

> Something is very weird there, and there were no fundamental changes
> to the manifests/modules.
>
> The only thing that really changed is the VM itself. It was XEN (for
> years), we switched to KVM with kernel 2.6.35
>
> Another strange thing:
>
> puppet-clients do run a lot longer nowadays. A machine usually took
> about 40-50 seconds for one run. When puppetmaster goes crazy it now
> takes ages (500 seconds and even more).

If your master are busy, there are great chances your clients have to
wait more to get served either catalogs or sourced files (or file
metadata). This can dramatically increase the run time.

> Something is weird there... --udo,

Indeed.

John Warburton

unread,
Jan 31, 2011, 4:50:32 PM1/31/11
to puppet...@googlegroups.com
On 1 February 2011 08:43, Brice Figureau <brice-...@daysofwonder.com> wrote:
On 31/01/11 19:11, Udo Waechter wrote:
 
Do you use storeconfigs?

Speaking of resource hogs, do you run the Puppet Labs dashboard on the same host? I had a similar setup (on crusty old Sun kit, mind), and found a big performance hit from the clients writing reports to the puppet master and the master then writing those reports to the dashboard. Everything calmed down once I moved the dashboard to another host.

John

Waechter Udo

unread,
Feb 1, 2011, 5:30:12 AM2/1/11
to puppet...@googlegroups.com
Hi,

On 31.01.2011, at 22:43, Brice Figureau wrote:

> On 31/01/11 19:11, Udo Waechter wrote:

>> [..]


>> We use passenger 2.2.11debian-2 and apache 2.2.16-3, ruby1.8 from
>> squeeze.
>
> I see a pattern here. It seems Micah (see a couple of mails above in
> this thread) has about the same setup, except he's using mongrels.
>
> It would be great to try a non-debian ruby (hint: Ruby Enterprise
> Edition for instance) to see if that's any better.
>

Well, since this behavior turned up with 2.6.3, I did not think about blaming it on another tool, like Ruby. I will try Ruby EE though.

> Do you use storeconfigs?
Yes, a lot! Nowadays with stompserver and puppetqd. I did switch it off already and that did not change a (performance) thing.

>
>> Puppetmaster works fine after restart, then after about 2-3 hours it
>> becomes pretty unresponsive, catalog runs go upt do 120 seconds and
>> more (the baseline being something about 10 seconds).
>
> With 160 hosts, a 30 min sleeptime, and a compilation of 10s, that means
> you need 1600 cpu seconds to build catalogs for all your fleet.
> With a concurrency of 8 cores (assuming you use a pool of 8 passenger
> apps), that's 200s per core, which is way less than the max of 1800s you can
> accommodate in a 30 min time-frame. Of course this assumes an evenly
> distributed load and perfect concurrency, but still you have plenty of
> available resources. So I conclude this is not normal.

Nope, like I said, we had puppetmaster running as a VM with 4 cores and 4 GB of RAM. This worked fine since 0.22.x; now it's twice as big (if this comparison holds) and performance is worse than ever.

Also, we do not do 30-minute puppet runs. We run every hour for workstations and every 2 hours for servers, each with a random sleep of up to half that interval. The load on the server is pretty evenly distributed. Once or twice a day there are some peaks, but those are not critical at all.


Thanks,
udo.

Waechter Udo

unread,
Feb 1, 2011, 5:31:28 AM2/1/11
to puppet...@googlegroups.com
Hi,

Yes I do, but I always did... Even if this is not a good idea, performance was acceptable until 2.6.3. Something must have changed there.
--udo.

Ashley Penney

unread,
Feb 1, 2011, 10:30:02 AM2/1/11
to puppet...@googlegroups.com
This is the crux of the situation for me too - Puppetlabs blame it on a Ruby bug that hasn't been resolved with RHEL6 (in my situation), but this wasn't an issue until .3 for me either. I feel the fact that many of us have had this problem since upgrading means it can be fixed within Puppet, rather than Ruby, because it was fine before.

Brice Figureau

unread,
Feb 1, 2011, 12:14:03 PM2/1/11
to puppet...@googlegroups.com
On Tue, 2011-02-01 at 10:30 -0500, Ashley Penney wrote:
> This is the crux of the situation for me too - Puppetlabs blame it on
> a Ruby bug that hasn't been resolved with RHEL6 (in my situation) but
> this wasn't an issue until .3 for me too. I feel that fact that many
> of us have this problem since upgrading means it can be fixed within
> Puppet, rather than Ruby, because it was fine before.

Do you mean puppet 2.6.2 wasn't exhibiting this problem?

Ashley Penney

unread,
Feb 1, 2011, 2:35:40 PM2/1/11
to puppet...@googlegroups.com
Yes, it didn't happen with the earlier versions of 2.6.

--

Brice Figureau

unread,
Feb 1, 2011, 2:45:47 PM2/1/11
to puppet...@googlegroups.com
On 01/02/11 20:35, Ashley Penney wrote:
> Yes, it didn't happen with the earlier versions of 2.6.

If it's easy for you to reproduce the issue you really should git bisect
it and tell puppetlabs which commit is the root cause (the
differences between 2.6.2 and 2.6.3 are not that big).
This way, they'll certainly be able to fix it.
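
Roughly (a sketch; I'm assuming the release tags in the puppet git repo
are named 2.6.2 and 2.6.3, adjust if they differ):

git clone git://github.com/puppetlabs/puppet.git && cd puppet
git bisect start
git bisect bad 2.6.3     # known bad release
git bisect good 2.6.2    # known good release
# run your master against this checkout, watch the CPU, then mark it:
git bisect good          # or: git bisect bad
# repeat until git points at the offending commit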

Do we have a redmine ticket to track this issue?

Udo Waechter

unread,
Feb 1, 2011, 3:17:14 PM2/1/11
to puppet...@googlegroups.com

On 01.02.2011, at 18:14, Brice Figureau wrote:

> On Tue, 2011-02-01 at 10:30 -0500, Ashley Penney wrote:
>> This is the crux of the situation for me too - Puppetlabs blame it on
>> a Ruby bug that hasn't been resolved with RHEL6 (in my situation) but
>> this wasn't an issue until .3 for me too. I feel that fact that many
>> of us have this problem since upgrading means it can be fixed within
>> Puppet, rather than Ruby, because it was fine before.
>
> Do you mean puppet 2.6.2 wasn't exhibiting this problem?

Yes for me.
--udo.

--
:: udo waechter - ro...@zoide.net :: N 52º16'30.5" E 8º3'10.1"
:: genuine input for your ears: http://auriculabovinari.de
:: your eyes: http://ezag.zoide.net
:: your brain: http://zoide.net


Ashley Penney

unread,
Feb 7, 2011, 11:23:22 AM2/7/11
to puppet...@googlegroups.com
Because I like to live dangerously I upgraded to 2.6.5 and it seems like this has resolved the CPU problem completely for me.

Brice Figureau

unread,
Feb 7, 2011, 1:56:40 PM2/7/11
to puppet...@googlegroups.com
On 07/02/11 17:23, Ashley Penney wrote:
> Because I like to live dangerously I upgraded to 2.6.5 and it seems like
> this has resolved the CPU problem completely for me.

Did you upgrade the master or the master and all the nodes?

I had a discussion about this issue with Nigel during the weekend, and
he said something really interesting I hadn't thought about:
it might be possible that the reports generated by 2.6.3 are larger
than those from previous versions.

It is then possible that the CPU time taken to unserialize and process
those larger reports is the root cause of the high CPU usage.

It would be great if one of the people having this problem could disable
reports to see if that's the culprit.
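
On the agent side that's just the report setting; a sketch, assuming a
2.6-style puppet.conf (--no-report on the command line does the same):

# puppet.conf on the agents
[agent]
    report = false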

And if this is the case, we should at least log how long it takes to
process a report on the master.

Ashley Penney

unread,
Feb 7, 2011, 2:15:52 PM2/7/11
to puppet...@googlegroups.com
I just upgraded the master, I was too lazy to do the nodes yet.

--

Udo Waechter

unread,
Feb 10, 2011, 9:55:02 AM2/10/11
to puppet...@googlegroups.com
Hello,
I am one of those who have this problem. Some people suggested using Ruby Enterprise. I looked at its installation; it seemed a little time-consuming, so I did not try it out.
I upgraded to debian squeeze (of course), and the problem persists.

Thus I did some tests:

1. got ruby from "Ubuntu Meercat":
libruby1.8 1.8.7.299-2
ruby1.8 1.8.7.299-2
ruby1.8-dev 1.8.7.299-2

Same problem (Debian is 1.8.7.302 I think); with ruby from Ubuntu Lucid (1.8.7.249) the problem is the same. I guess we can rule out Debian's ruby here.

2. I reported that after stopping apache, stray master processes remain at 100% CPU. I did an strace on those processes and they do this (whatever that means):

$ strace -p 1231
Process 1231 attached - interrupt to quit
brk(0xa49a000) = 0xa49a000
brk(0xbf51000) = 0xbf51000
brk(0xda09000) = 0xda09000
brk(0xa49a000) = 0xa49a000
brk(0xbf52000) = 0xbf52000
brk(0xda09000) = 0xda09000
brk(0xa49a000) = 0xa49a000
brk(0xbf52000) = 0xbf52000
brk(0xda09000) = 0xda09000
^CProcess 1231 detached

3. I have now disabled reports, let's see what happens.

Thanks for the effort and have a nice day.
udo.

Brice Figureau

unread,
Feb 10, 2011, 10:22:33 AM2/10/11
to puppet...@googlegroups.com
On Thu, 2011-02-10 at 15:55 +0100, Udo Waechter wrote:
> Hello,
> I am one of those who have this problem. Some people suggested using Ruby Enterprise. I looked at its installation, it looked a little bit time-consuming, so I did not try that one out.
> I upgraded to debian squeeze (of course), and the problem persists.
>
> Thus I did some tests:
>
> 1. got ruby from "Ubuntu Meercat":
> libruby1.8 1.8.7.299-2
> ruby1.8 1.8.7.299-2
> ruby1.8-dev 1.8.7.299-2
>
> Same Problem (debian is 1.8.7.302 I think), with ruby from ubuntu lucid (1.8.7.249) the problem is the same. I guess we can rule out debian's ruby here.
>
> 2. I reported that after stopping apache, stray master process remain and do 100% cpu. I did an strace on those processes and they do this (whatever that means):
>
> $ strace -p 1231
> Process 1231 attached - interrupt to quit
> brk(0xa49a000) = 0xa49a000
> brk(0xbf51000) = 0xbf51000
> brk(0xda09000) = 0xda09000
> brk(0xa49a000) = 0xa49a000
> brk(0xbf52000) = 0xbf52000
> brk(0xda09000) = 0xda09000
> brk(0xa49a000) = 0xa49a000
> brk(0xbf52000) = 0xbf52000
> brk(0xda09000) = 0xda09000
> ^CProcess 1231 detached

This process is allocating memory like crazy :) (brk() grows the process heap, so a tight loop of brk calls means heavy allocation).

> 3. I have now disabled reports, lets see what happens.
>
> Thanks for the effort and have a nice day.
> udo.

Are you still on puppet 2.6.3?
Can you upgrade to 2.6.5 to see if that's better as reported by one
other user?


Patrick

unread,
Feb 10, 2011, 2:40:09 PM2/10/11
to puppet...@googlegroups.com

On Feb 10, 2011, at 6:55 AM, Udo Waechter wrote:

Hello,
I am one of those who have this problem. Some people suggested using Ruby Enterprise. I looked at its installation, it looked a little bit time-consuming, so I did not try that one out.

Well, I find it takes about 30 min at the most, saves on RAM, and causes puppet to use a little more CPU.  Here's what I did.  This method requires a compiler.  You can also do everything up to (but not including) step 5 without affecting puppet.  It's also easy to reverse.

1) Changed /usr/share/puppet/rack/puppetmasterd/config.ru to use an absolute path to the folder.
Need this line:
$:.unshift('/usr/lib/ruby/1.8/')

2) Install the dependencies for the compile:
package { "libssl-dev": ensure => present }
package { "libsqlite3-dev": ensure => present }
package { 'libmysql++-dev': ensure => present }
package { 'libpq-dev': ensure => present }
package { 'apache2-prefork-dev': ensure => present }
package { 'libapr1-dev': ensure => present }
package { 'libaprutil1-dev': ensure => present }

3) Installed RubyEE from their universal package.

4) Added a passengerEE mod to /etc/apache2/mods-available/

/etc/apache2/mods-available/passengeree.load:
LoadModule passenger_module /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/passenger-2.2.15/ext/apache2/mod_passenger.so
PassengerRoot /opt/ruby-enterprise-1.8.7-2010.02/lib/ruby/gems/1.8/gems/passenger-2.2.15
PassengerRuby /opt/ruby-enterprise-1.8.7-2010.02/bin/ruby

5) Disable the old passenger and enable the new one
a2dismod passenger
a2enmod passengeree
service apache2 restart

If things don't work do this to enable your old passenger:
a2enmod passenger
a2dismod passengeree
service apache2 restart

John Warburton

unread,
Feb 10, 2011, 6:55:22 PM2/10/11
to puppet...@googlegroups.com
On 8 February 2011 06:15, Ashley Penney <ape...@gmail.com> wrote:
I just upgraded the master, I was too lazy to do the nodes yet.


On Mon, Feb 7, 2011 at 1:56 PM, Brice Figureau <brice-...@daysofwonder.com> wrote:
On 07/02/11 17:23, Ashley Penney wrote:
> Because I like to live dangerously I upgraded to 2.6.5 and it seems like
> this has resolved the CPU problem completely for me.

Did you upgrade the master or the master and all the nodes?

Was that upgrade to 2.6.5rc2? It seems there has been a nice patch to speed up large HTTP POSTs and PUTs. Since 2.6.x reports can be large (I have some approaching 1 MB), this might be where the problem was.

https://projects.puppetlabs.com/projects/puppet/wiki/Release_Notes#2.6.5
https://projects.puppetlabs.com/issues/6257

John