On Jan 7, 9:40 pm, Andreas N <d...@pseudoterminal.org> wrote:
> On Friday, January 6, 2012 5:31:34 PM UTC+1, jcbollinger wrote:
>
> > Nothing in your log suggests that the Puppet agent is doing any work
> > when it fails. It appears to apply a catalog successfully, then
> > create a report successfully, then nothing else. That doesn't seem
> > like a problem in a module. Nevertheless, you could try removing
> > classes from the affected node's configuration and testing whether
> > Puppet still freezes.
>
> John, thanks for your reply. I'll deploy a node that includes no
> modules at all and see whether a zombie process appears again.
>
> > You said the agent runs for several hours before it hangs. Does it
> > perform multiple successful runs during that time? That, too, would
> > tend to argue against a problem in your manifests.
>
> Yes, the agents perform several runs (with no changes to the catalog) and
> then simply freeze up, waiting for the defunct sh process to return.
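For what it's worth, you can confirm the parent/child relationship and
the zombie state with something along these lines (just a rough check;
exact output will vary):

    # 'Z' in the STAT column, with a PPID equal to the agent's PID, would
    # confirm that it is the agent itself that has not wait()ed on the child:
    ps -eo pid,ppid,stat,etime,cmd | grep -E 'puppet|defunct' | grep -v grep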
>
> > I'm suspicious that something else on your systems is interfering with
> > the Puppet process; some kind of service manager, for example. You'll
> > have to say whether that's a reasonable guess. Alternatively, you may
> > have a system-level bug; there have been a few Ruby bugs and kernel
> > regressions that interfered with Puppet operation.
>
> Those are all pretty plain Ubuntu 10.04.3 server installations (both i386
> and x86_64), especially the ones I deployed this week, which aren't in
> production yet. What kind of service manager could there even be that
> interferes?
I was thinking along the lines of an intrusion detection system, or
perhaps a monitoring / management tool such as Nagios. That's not to
say that I suspect Nagios in particular -- a lot of people seem to use
it together with Puppet with great success. It sounds like nothing of
that sort is in the picture on your systems, however.
> > You could try using strace to determine where the failure happens,
> > though that's not as simple as it may sound.
>
> Simply trying to strace the zombie process only results in an "Operation
> not permitted". The agent process shows these lines repeatedly:
>
> Process 3741 attached - interrupt to quit
> select(8, [7], NULL, NULL, {1, 723393}) = 0 (Timeout)
> sigprocmask(SIG_BLOCK, NULL, []) = 0
> sigprocmask(SIG_BLOCK, NULL, []) = 0
> select(8, [7], NULL, NULL, {2, 0}) = 0 (Timeout)
> sigprocmask(SIG_BLOCK, NULL, []) = 0
> sigprocmask(SIG_BLOCK, NULL, []) = 0
> ...
>
> That doesn't tell me anything other than that the puppet agent is blocking
> on select() with a timeout of two seconds.
What I had in mind was tracing a new agent process from the start, so
as to catch whatever happens when it transitions into the
non-functional state. Nevertheless, the trace you have does yield a
bit of information. In particular, it shows that the agent is not
fully blocked. Given that, the fact that it has a defunct child
process it has not yet reaped makes me suspect a Ruby bug all the
more. I am also a bit curious what the open FD 7 that Puppet is
selecting on might be, but I don't think that is directly related to
your issue.
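In case it is useful, this is roughly the kind of tracing I had in
mind. It is only a sketch: the paths are arbitrary, <agent_pid> is a
placeholder, and the trace file can grow quite large if the agent runs
for hours before hanging.

    # Stop the daemonized agent, then trace a foreground agent from the
    # start, following forked children (-f) and timestamping calls (-tt):
    sudo /etc/init.d/puppet stop
    sudo strace -f -tt -o /tmp/puppet-agent.trace \
        puppet agent --no-daemonize --verbose

    # Alternatively, attach to the running daemon before it hangs and let
    # strace follow any children it forks from that point on:
    sudo strace -f -tt -p <agent_pid> -o /tmp/puppet-agent.trace

    # To see what the FD 7 it keeps selecting on actually refers to:
    sudo ls -l /proc/<agent_pid>/fd/7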
I suggest you compare the Ruby and kernel versions installed on the
affected nodes to those installed on unaffected nodes. It may also be
useful to compare the Puppet configuration (/etc/puppet/puppet.conf)
on failing nodes to that on non-failing nodes to see whether any
options are set differently. I am especially curious whether the
'listen' option might be enabled when it does not need to be (or does
it?), but there could be other significant differences.
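Concretely, running something like the following on one failing node
and one non-failing node and then diffing the output should make the
relevant differences obvious. This is only a sketch; adjust it to
whatever you actually have installed.

    # Interpreter and kernel versions:
    ruby --version
    uname -r

    # Installed Puppet and Ruby package versions:
    dpkg -l | grep -E 'puppet|ruby'

    # Effective (non-comment) puppet.conf settings; check 'listen' in particular.
    grep -Ev '^[[:space:]]*(#|$)' /etc/puppet/puppet.conf
    # If your Puppet version supports --configprint, this shows the resolved value:
    puppet agent --configprint listen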
John