Puppet agent hangs after running for a few hours, defunct sh process


Andreas N

Jan 4, 2012, 7:23:29 PM
to puppet...@googlegroups.com
Hi,

On a node running Puppet 2.7.9 from apt.puppetlabs.com on Ubuntu 10.04.3, the agent hangs after a few hours of operation. I have to kill -9 it; nothing else helps. Obviously, this is unfortunate.

Looking at ps -ef I see this:

root      4842  4594  0 Jan04 pts/0    00:00:55 /usr/bin/ruby1.8 /usr/bin/puppet agent --verbose --no-daemonize --debug
root      9803  4842  0 Jan04 pts/0    00:00:00 [sh] <defunct>

It seems a defunct sh process is responsible. This has happened before on that node, so I started the agent with the command-line arguments you see above. Unfortunately, the resulting debug logs don't look any different from those on a node where I haven't observed this behavior. The logs from the last run can nonetheless be found here: http://pastie.org/3128200

The problem happens regularly on that particular node, but I looked around the other nodes we have running and it seems to affect a few of them as well. These nodes don't have anything in common (not even the puppet master), but they do have a few common modules applied. Could this be caused by one of those modules? How would I go about debugging it? Or does anyone already know what's going on here?

Thanks,

Andreas

Andreas N

Jan 6, 2012, 2:06:36 AM
to puppet...@googlegroups.com
Wow, it took quite a while for my post to reach this group. No idea why; is it moderated?

Anyway, this problem also seems to happen with agents running Puppet 2.7.6, although apparently less frequently. I'm almost positive it has something to do with a module, but I wouldn't know how or where to begin debugging.

Does anyone have any ideas?

Thanks,

Andreas

jcbollinger

Jan 6, 2012, 11:31:34 AM
to Puppet Users
Nothing in your log suggests that the Puppet agent is doing any work
when it fails. It appears to apply a catalog successfully, then
create a report successfully, then nothing else. That doesn't seem
like a problem in a module. Nevertheless, you could try removing
classes from the affected node's configuration and testing whether
Puppet still freezes.

You said the agent runs for several hours before it hangs. Does it
perform multiple successful runs during that time? That also would
tend to counterindicate a problem in your manifests.

I'm suspicious that something else on your systems is interfering with
the Puppet process; some kind of service manager, for example. You'll
have to say whether that's a reasonable guess. Alternatively, you may
have a system-level bug; there have been a few Ruby bugs and kernel
regressions that interfered with Puppet operation.

You could try using strace to determine where the failure happens,
though that's not as simple as it may sound.
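
For example (just a rough sketch; the path and pid are the ones from
your ps output, adjust as needed), you could start a fresh agent
under strace, follow its forked children, and write the trace to a
file so you can examine the tail once it wedges:

  # trace a freshly started agent, following forks, with timestamps
  strace -f -tt -o /tmp/puppet-agent.trace \
      /usr/bin/puppet agent --verbose --no-daemonize --debug

  # or attach to an already-running agent by pid
  strace -f -tt -p 4842 -o /tmp/puppet-agent.trace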

You could also try just sidestepping the problem by using cron to
launch puppetd --runonce at your desired intervals, instead of leaving
puppetd running in daemon mode. A fair number of people seem to run
Puppet that way, and it has some advantages.
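
Something along these lines, e.g. in /etc/cron.d/puppet (only a
sketch -- the interval and log destination are placeholders, and I'm
using the "puppet agent" form from your ps output, which is the same
thing as puppetd in 2.7):

  # run the agent once every 30 minutes instead of daemonizing it
  */30 * * * * root /usr/bin/puppet agent --onetime --no-daemonize --logdest syslog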


John

Andreas N

Jan 7, 2012, 10:40:29 PM
to puppet...@googlegroups.com
On Friday, January 6, 2012 5:31:34 PM UTC+1, jcbollinger wrote:

> Nothing in your log suggests that the Puppet agent is doing any work
> when it fails.  It appears to apply a catalog successfully, then
> create a report successfully, then nothing else.  That doesn't seem
> like a problem in a module.  Nevertheless, you could try removing
> classes from the affected node's configuration and testing whether
> Puppet still freezes.

John, thanks for your reply. I'll be deploying a node that includes no modules at all and see if a zombie process appears again.
 
> You said the agent runs for several hours before it hangs.  Does it
> perform multiple successful runs during that time?  That also would
> tend to counterindicate a problem in your manifests.

Yes, the agents perform several runs (with no changes to the catalog) and then simply freeze up, waiting for the defunct sh process to return.
 
> I'm suspicious that something else on your systems is interfering with
> the Puppet process; some kind of service manager, for example.  You'll
> have to say whether that's a reasonable guess.  Alternatively, you may
> have a system-level bug; there have been a few Ruby bugs and kernel
> regressions that interfered with Puppet operation.

Those are all pretty plain Ubuntu 10.04.3 server installations (both i386 and x86_64), especially the ones I deployed this week, which aren't in production yet. What kind of service manager could there even be that interferes? 
 
> You could try using strace to determine where the failure happens,
> though that's not as simple as it may sound.

Simply trying to strace the zombie process only results in an "Operation not permitted". The agent process shows these lines repeatedly:

Process 3741 attached - interrupt to quit
select(8, [7], NULL, NULL, {1, 723393}) = 0 (Timeout)
sigprocmask(SIG_BLOCK, NULL, [])        = 0
sigprocmask(SIG_BLOCK, NULL, [])        = 0
select(8, [7], NULL, NULL, {2, 0})      = 0 (Timeout)
sigprocmask(SIG_BLOCK, NULL, [])        = 0
sigprocmask(SIG_BLOCK, NULL, [])        = 0
...

That doesn't tell me anything other than that the puppet agent is blocking on select() with a timeout of two seconds.

> You could also try just sidestepping the problem by using cron to
> launch puppetd --runonce at your desired intervals, instead of leaving
> puppetd running in daemon mode.  A fair number of people seem to run
> Puppet that way, and it has some advantages.

Thanks, that's a good idea; I'll probably have to resort to it if the problem doesn't go away.

Andreas

Nigel Kersten

Jan 7, 2012, 11:26:50 PM
to puppet...@googlegroups.com
On Thu, Jan 5, 2012 at 11:06 PM, Andreas N <da...@pseudoterminal.org> wrote:
> Wow, it took quite a while for my post to reach this group. No idea why; is it moderated?


We moderate the first post from everyone to stop spam getting through.

This sucks, but it sucks less than the other alternatives of moderating every post, or approving membership manually. 

Andreas N

Jan 7, 2012, 11:45:02 PM
to puppet...@googlegroups.com
On Sunday, January 8, 2012 5:26:50 AM UTC+1, Nigel Kersten wrote:
> We moderate the first post from everyone to stop spam getting through.
>
> This sucks, but it sucks less than the other alternatives of moderating every post, or approving membership manually.

Nigel, good to know, thanks!

Andreas

jcbollinger

Jan 9, 2012, 9:56:20 AM
to Puppet Users


On Jan 7, 9:40 pm, Andreas N <d...@pseudoterminal.org> wrote:
> On Friday, January 6, 2012 5:31:34 PM UTC+1, jcbollinger wrote:
>
> > Nothing in your log suggests that the Puppet agent is doing any work
> > when it fails.  It appears to apply a catalog successfully, then
> > create a report successfully, then nothing else.  That doesn't seem
> > like a problem in a module.  Nevertheless, you could try removing
> > classes from the affected node's configuration and testing whether
> > Puppet still freezes.
>
> John, thanks for your reply. I'll be deploying a node that includes no
> modules at all and see if a zombie process appears again.
>
> > You said the agent runs for several hours before it hangs.  Does it
> > perform multiple successful runs during that time?  That also would
> > tend to counterindicate a problem in your manifests.
>
> Yes, the agents perform several runs (with no changes to the catalog) and
> then simply freeze up, waiting for the defunct sh process to return.
>
> > I'm suspicious that something else on your systems is interfering with
> > the Puppet process; some kind of service manager, for example.  You'll
> > have to say whether that's a reasonable guess.  Alternatively, you may
> > have a system-level bug; there have been a few Ruby bugs and kernel
> > regressions that interfered with Puppet operation.
>
> Those are all pretty plain Ubuntu 10.04.3 server installations (both i386
> and x86_64), especially the ones I deployed this week, which aren't in
> production yet. What kind of service manager could there even be that
> interferes?


I was thinking along the lines of an intrusion detection system, or
perhaps a monitoring / management tool such as Nagios. That's not to
say that I suspect Nagios in particular -- a lot of people seem to use
it together with Puppet with great success. It sounds like such a
thing is not in your picture, however.


> > You could try using strace to determine where the failure happens,
> > though that's not as simple as it may sound.
>
> Simply trying to strace the zombie process only results in an "Operation
> not permitted". The agent process shows these lines repeatedly:
>
> Process 3741 attached - interrupt to quit
> select(8, [7], NULL, NULL, {1, 723393}) = 0 (Timeout)
> sigprocmask(SIG_BLOCK, NULL, [])        = 0
> sigprocmask(SIG_BLOCK, NULL, [])        = 0
> select(8, [7], NULL, NULL, {2, 0})      = 0 (Timeout)
> sigprocmask(SIG_BLOCK, NULL, [])        = 0
> sigprocmask(SIG_BLOCK, NULL, [])        = 0
> ...
>
> That doesn't tell me anything other than that the puppet agent is blocking
> on select() with a timeout of two seconds.


I kinda meant tracing a new agent process, so as to catch whatever
happens when it transitions to the non-functional state.
Nevertheless, the trace does yield a bit of information. In
particular, it shows that the agent is not fully blocked. Given
that, the fact that it has a defunct child process that it has never
collected makes me suspect a Ruby bug even more. I am also a bit
curious what the open FD 7 that Puppet is selecting on might be, but
I don't think that's directly related to your issue.
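
If you do want to satisfy that curiosity, the kernel will tell you
what the descriptor points at (pid assumed to be the agent's, 4842
in your earlier ps output):

  ls -l /proc/4842/fd/7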

I suggest you compare the Ruby and kernel versions installed on the
affected nodes to those installed on unaffected nodes. It may also be
useful to compare the Puppet configuration (/etc/puppet/puppet.conf)
on failing nodes to that on non-failing nodes, to see whether any
options are set differently. I am especially curious as to whether
the 'listen' option might be enabled when it does not need to be (or
does it?), but there might be other significant differences.
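
Concretely, something like the following on a failing and a
non-failing node, and then diff the results (these commands are
just a suggestion, not an exhaustive list):

  uname -r
  ruby1.8 --version
  dpkg -l puppet ruby1.8 | grep '^ii'
  puppet agent --configprint listen
  md5sum /etc/puppet/puppet.conf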


John

Jo Rhett

Jan 9, 2012, 12:40:42 PM
to puppet...@googlegroups.com
On Jan 7, 2012, at 7:40 PM, Andreas N wrote:
> That doesn't tell me anything other than that the puppet agent is blocking on select() with a timeout of two seconds.

Sounds like #10418.  Check your kernel version.
  https://projects.puppetlabs.com/issues/10418

-- 
Jo Rhett
Net Consonance : consonant endings by net philanthropy, open source and other randomness

jcbollinger

Jan 10, 2012, 11:41:44 AM
to Puppet Users


On Jan 9, 11:40 am, Jo Rhett <jrh...@netconsonance.com> wrote:
> On Jan 7, 2012, at 7:40 PM, Andreas N wrote:
>
> > That doesn't tell me anything other than that the puppet agent is blocking on select() with a timeout of two seconds.
>
> Sounds like #10418.  Check your kernel version.
>  https://projects.puppetlabs.com/issues/10418

It sounds similar, but 10418 is specific to a particular RedHat /
CentOS kernel, and the OP is observing his problem on Ubuntu. My
awareness of that issue is one of the reasons I advised the OP to look
at kernel versions, however.


John

Jo Rhett

Jan 10, 2012, 4:01:44 PM
to puppet...@googlegroups.com
The comments in the Red Hat bug indicated that this breakage came from upstream, as did the fix. So it's entirely possible that this bug appeared in some Debian kernels, but I don't know which.

