How can I diagnose hopeless performance?


Robert Inder

unread,
Mar 19, 2015, 6:40:21 PM3/19/15
to puppet...@googlegroups.com

We have about a dozen CentOS 6 servers being managed by puppet 3.7.4, using puppet apply with locally stored modules and node definition.

Recently, performance on one of them has collapsed to hopeless levels, and I really don't know how to diagnose the problem.

Puppet used to take 5 - 10 minutes to run, but processing essentially the same manifest is now taking around 40 minutes.

Often, though not always, many things are being (inappropriately) skipped because "trivial" operations are timing out.  Specifically:

exec { $shell_exists: command => "/usr/bin/test -f '${shell}'" }
is being logged as

Error: Command exceeded timeout
Wrapped exception:
execution expired
Error: /Stage[main]/Users::Ii::Virtual/Users::Developer[robert]/Users::Account[robert]/Exec[check /bin/bash exists for robert]/returns: change from notrun to 0 failed: Command exceeded timeout

even though ${shell} (/bin/bash) exists, and other invocations of exactly the same command in the same run of puppet are confirming this.
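
I realise I could just raise the timeout on that exec (the default is 300 seconds), along the lines of the sketch below, but that would only mask whatever is making the machine so slow:

exec { $shell_exists:
  command => "/usr/bin/test -f '${shell}'",
  timeout => 600,   # example value only; hides the symptom, not the cause
}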

In struggling to diagnose this, I have tried cutting down the content of the node by commenting things out until I arrived at the empty node definition...

node 'hostname' {
# Lines
# Commented
# Out
}
On most of our servers, processing the empty node definition takes 4 or 5 seconds.  On the problem node, it is taking between 3 minutes (when the server is otherwise idle) and over 15 minutes when there are other things going on.  Below, you can see the complete log, as produced by puppet apply --evaltrace with an empty node definition.  It is spectacularly unhelpful, as is the yaml report.

The server is currently a standby machine, so apart from Puppet (and me running "top", "tail" and the like) the server is either completely idle, or possibly receiving an rsync or running an occasional (short) cron job.   Yet "top" reports CPU utilisation at close to 100%, and puppet is the only process it shows doing any significant processing.  Throughout the puppet run, it responds like a heavily loaded machine, and there is nothing in system logs.

Can anyone suggest how I can go about diagnosing what is going wrong?

Robert.


Info: Loading facts
Info: Loading facts
Notice: Compiled catalog for standby1.interactive.co.uk in environment production in 6.71 seconds
Info: Applying configuration version '1426787754'
Info: /Schedule[daily]: Starting to evaluate the resource
Info: /Schedule[daily]: Evaluated in 0.00 seconds
Info: /Schedule[monthly]: Starting to evaluate the resource
Info: /Schedule[monthly]: Evaluated in 0.00 seconds
Info: /Schedule[hourly]: Starting to evaluate the resource
Info: /Schedule[hourly]: Evaluated in 0.00 seconds
Info: Stage[main]: Starting to evaluate the resource
Info: Stage[main]: Evaluated in 0.00 seconds
Info: Class[Main]: Starting to evaluate the resource
Info: Class[Main]: Evaluated in 0.00 seconds
Info: Node[standby1]: Starting to evaluate the resource
Info: Node[standby1]: Evaluated in 0.00 seconds
Info: Class[Settings]: Starting to evaluate the resource
Info: Class[Settings]: Evaluated in 0.00 seconds
Info: Class[Settings]: Starting to evaluate the resource
Info: Class[Settings]: Evaluated in 0.00 seconds
Info: Node[standby1]: Starting to evaluate the resource
Info: Node[standby1]: Evaluated in 0.00 seconds
Info: /Schedule[never]: Starting to evaluate the resource
Info: /Schedule[never]: Evaluated in 0.00 seconds
Info: /Filebucket[puppet]: Starting to evaluate the resource
Info: /Filebucket[puppet]: Evaluated in 0.00 seconds
Info: /Schedule[weekly]: Starting to evaluate the resource
Info: /Schedule[weekly]: Evaluated in 0.00 seconds
Info: /Schedule[puppet]: Starting to evaluate the resource
Info: /Schedule[puppet]: Evaluated in 0.00 seconds
Info: Class[Main]: Starting to evaluate the resource
Info: Class[Main]: Evaluated in 0.00 seconds
Info: Stage[main]: Starting to evaluate the resource
Info: Stage[main]: Evaluated in 0.00 seconds
Notice: Finished catalog run in 1119.24 seconds


jamese

unread,
Mar 20, 2015, 4:18:15 AM3/20/15
to puppet...@googlegroups.com
Try running puppet with "--debug" and "--evaltrace" to see where it's taking the time.
I'd be looking at DNS as that is often the culprit for unexplained things.
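
Something along these lines (the manifest path is just an example -- use whatever you normally pass to puppet apply):

puppet apply --debug --evaltrace /etc/puppet/manifests/site.pp 2>&1 | tee /tmp/puppet-debug.log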

jcbollinger

unread,
Mar 20, 2015, 9:46:19 AM3/20/15
to puppet...@googlegroups.com


On Friday, March 20, 2015 at 3:18:15 AM UTC-5, jamese wrote:
Try running puppet with "--debug" and "--evaltrace" to see where it's taking the time.
I'd be looking at DNS as that is often the culprit for unexplained things.


DNS issues can indeed cause long runtimes when requests time out, but that would not explain the full CPU utilization the OP reports.  If the affected machine exhibits full CPU utilization and yet much longer runtimes during Puppet runs than other machines, it follows that either the machine is doing more work during the time of catalog application than the other machines do (but not necessarily via Puppet), or the work capacity of the affected machine's hardware is less than the others'.

If the affected machine is a VM, then it may be that the specs of its (virtual) hardware have been reduced, or that it is suffering from heavy contention with one or more other VMs hosted on the same physical machine.  Whether physical or virtual, it might be suffering from hardware [near-]failure such as many I/O failures/retries due to a failing disk, or many memory errors detected on a marginal ECC memory module.  If the physical machine is running hot for whatever reason then it may be that CPU speed has been throttled back.  The information obtained from running puppet with --evaltrace might shed some light on this, as might the system log.
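
A few generic checks along these lines (illustrative commands, not specific to your setup) might help distinguish those cases:

dmesg | grep -iE 'error|fail'    # I/O errors or ECC complaints
grep MHz /proc/cpuinfo           # current clock speed, to spot thermal throttling
vmstat 5                         # CPU vs. I/O wait while a puppet run is in progress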

If the problem cannot be traced back to hardware, then the question becomes what the affected machine is doing that the others are not doing.  If 'top' and 'puppet --debug apply' do not illuminate this, however, then perhaps the easiest solution would be to just rebuild the machine.  Since it is currently a standby anyway, that should be viable.  (And since it is Puppet-managed, that should be easy, too :-)).


John

Peter Berghold

unread,
Mar 20, 2015, 9:58:44 AM3/20/15
to puppet...@googlegroups.com
On Fri, Mar 20, 2015 at 4:18 AM jamese <james.e...@gmail.com> wrote:
Try running puppet with "--debug" and "--evaltrace" to see where it's taking the time.
I'd be looking at DNS as that is often the culprit for unexplained things.


A tale I tell is one where Puppet was taking three or more minutes to even begin to do anything.   This was on an RHEL platform, and it turned out my support guys had RHN configured to look for an RPM repository that was unreachable due to network restrictions.

I removed the RHN plugin to yum and Puppet began to function normally. 

Just throwing that out there...
 

Robert Inder

unread,
Mar 21, 2015, 3:08:07 PM3/21/15
to puppet...@googlegroups.com


On Friday, 20 March 2015 08:18:15 UTC, jamese wrote:
Try running puppet with "--debug" and "--evaltrace" to see where it's taking the time.

Ah!  As before, the evaltrace for the empty node definition lists a handful of things
being evaluated in negligible times.

But that output is bracketed by debug info.  It is preceded by...

    Debug: Loaded state in 40.54 seconds
    Debug: Loaded state in 90.30 seconds

And followed by

   Debug: Stored state in 130.67 seconds
   Notice: Finished catalog run in 224.24 seconds

So the time appears to be going on loading and saving "State". 

So I look.  And I find the answer.

Our server is running applications written in Perl, using Catalyst.
Those applications track user sessions in files created below a "sessions" directory.
We have configured Puppet to Tidy the sessions directories, by removing files over 3 days old.
Currently, we have something like 1200 files in the "sessions" directory tree.  A few hundred sessions per day.  Plausible.
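
The tidy resource in question looks roughly like this (the path here is illustrative, not our real one):

tidy { '/var/www/myapp/sessions':
  age     => '3d',     # remove session files more than three days old
  recurse => true,
}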

/var/lib/puppet/state/state.yaml has four hundred thousand lines related to these session files.
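
(Easy enough to confirm with something like the following -- adjust the grep pattern to your own paths:)

wc -l /var/lib/puppet/state/state.yaml
grep -c sessions /var/lib/puppet/state/state.yaml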

So in its state file, Puppet is keeping a record of every session file it has ever deleted. 
It looks like every time it starts, it reads them all in.  And every time it finishes, it writes them all back out again,
even though they don't exist.

If I remove the "state" directory, puppet processes the empty node in "no time at all".

FFS!

This is not quite the behaviour I want.

But how can I stop it?




Christopher Wood

unread,
Mar 21, 2015, 6:08:34 PM3/21/15
to puppet...@googlegroups.com
In your place I'd probably just purge stale sessions with a cron job. Any particular reason why puppet's tidy is better than find->rm?
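
Something like this in a daily cron script would do it (the path is obviously a placeholder):

#!/bin/sh
# e.g. /etc/cron.daily/purge-sessions
find /var/www/myapp/sessions -type f -mtime +3 -delete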


Robert Inder

unread,
Mar 22, 2015, 10:36:24 AM3/22/15
to puppet...@googlegroups.com, christop...@pobox.com


On Saturday, 21 March 2015 22:08:34 UTC, Christopher Wood wrote:
On Sat, Mar 21, 2015 at 12:08:07PM -0700, Robert Inder wrote:
    :
    :
>    This is not quite the behaviour I want.
>
>    But how can I stop it?

In your place I'd probably just purge stale sessions with a cron job. Any particular reason why puppet's tidy is better than find->rm?


 Thanks for replying.

I did not set up the system, so I can't tell you for sure, but I guess it was originally done by puppet because it looked like a natural thing to do -- like exactly the
kind of task that "Tidy" is intended for!

But having now found the unexpected consequences of doing the obvious thing, doing it with cron does seem like the way to go.  So I'll switch to that, and manually remove all the "state.yaml" files to get things to a "clean" state.
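
Probably as a cron resource managed from Puppet, so it stays consistent across the servers -- something like this (path and timing are illustrative):

cron { 'purge-catalyst-sessions':
  command => '/usr/bin/find /var/www/myapp/sessions -type f -mtime +3 -delete',
  user    => 'root',
  hour    => 4,
  minute  => 20,
}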

