defunct process hangs agent

Gregory Matthews

Jun 14, 2016, 4:44:22 AM
to help-cfengine
I am trying to promise a service like this:

services:
  om_installed::
    "openmanage"
      service_policy => "start",
      service_method => dls;

where the service_method body is defined:

body service_method dls
{
  service_bundle => dls_services("$(this.promiser)","$(this.service_policy)");
}

and the service bundle basically follows the template provided by the
documentation[1]. The service is started in a "processes" promise by
running the start command, which is defined as:

"startcommand[openmanage]" string =>
  "/opt/dell/srvadmin/sbin/srvadmin-services.sh start";

The problem is that this leaves a "defunct" process:

root 21732 21731 0 17:03 ? 00:00:02 /var/cfengine/bin/cf-agent -KI
root 24179 21732 0 17:03 ? 00:00:00 [srvadmin-servic] <defunct>

which leaves the agent hanging.

I've tried setting a timeout on the command, I've tried putting it into
a shell, I've tried running it as a background process but nothing gets
around the problem.

Anyone have any pointers?

This is using v3.5.3, soon to be upgraded to 3.7.x

GREG


[1] https://auth.cfengine.com/docs/3.5/examples-policy-ensure-service-is-enabled-and-running.html
--
Greg Matthews 01235 778658
Scientific Computing Group Leader
Diamond Light Source Ltd. OXON UK

Gregory Matthews

Jun 17, 2016, 6:36:49 AM
to help-c...@googlegroups.com
Is there no workaround for this?

Marco Marongiu

Jun 17, 2016, 6:52:12 AM
to help-c...@googlegroups.com
On 17/06/16 12:36, Gregory Matthews wrote:
> Is there no workaround for this?

Have you tried setting a reasonable value for agent_expireafter in
body executor control? E.g.: if you expect each agent run to finish by,
say, 3 minutes, you could set it to 18 (6x3) minutes so that:

- no more than 6 agents can accumulate and cross-lock each other
- if an agent has been hanging for more than 18 minutes, it will be
killed
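
A minimal sketch of the setting described above, using the 18-minute figure from the example. The value of agent_expireafter is given in minutes and applies to agents launched by cf-execd; the number itself is only the illustration from this message, not a recommendation for every site.

```cf3
body executor control
{
      # Kill any cf-agent started by cf-execd that is still running
      # after 18 minutes (6 runs x 3 minutes, as in the example above).
      agent_expireafter => "18";
}
```

If a site already defines body executor control, the attribute would be added to that existing body rather than declared in a second one.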

Ciao
-- bronto

Aleksey Tsalolikhin

Jun 17, 2016, 10:53:01 AM
to Marco Marongiu, help-cfengine
Where do you get 6 from, Bronto? I mean, why 6 and not 5 or 7? Just curious if there is any special significance to it...

Ciao,
Aleksey

Marco Marongiu

Jun 17, 2016, 11:13:44 AM
to Aleksey Tsalolikhin, help-cfengine
On 17/06/16 16:52, Aleksey Tsalolikhin wrote:
> Where do you get 6 from, Bronto? I mean, why 6 and not 5 or 7? Just
> curious if there is any special significance to it...

On our systems we have

agent_expireafter => "30" ;

because on CFEngine's standard schedule (5 minutes) we allow a maximum
of 6 agents to run concurrently (6 x 5 = 30).

Why 6, you ask. We expect an agent to be able to apply our policies in
much less than the canonical 5 minutes in normal conditions.

If an agent is around after, say, 10 minutes, it could still be normal.
For example, it may be waiting for apt to download a package from a slow
repository. Thus, I definitely don't want to kill an agent if it's not
done after the canonical interval.

How many agents is it reasonable to leave around then?

- having one agent around, or none, is the expected normal situation
- having two agents around can still happen (see above)
- having three would be strange already

So I leave myself some buffer and double that: if I have double the
number of agents that I think is normal, then I have a problem and I
must start taking them down.

Say that something is going really wrong on the system, e.g.: a
filesystem corruption that messes up the CFEngine agent. The agents may
start piling up, adding one problem on top of another: not only do you have a
filesystem corruption and the system is malfunctioning, now you're also
filling up the process table and, possibly, eating memory and CPU
cycles. It's reasonable to stop the madness before it hurts: if 2 agents
running are still OK and 3 are strange, then two times 3 is madness and
we must stop it there.

Ciao!
-- bronto

Aleksey Tsalolikhin

Jun 17, 2016, 12:17:19 PM
to Marco Marongiu, help-cfengine
Thanks very much for explaining your reasoning, Bronto.  :)
--
Aleksey Tsalolikhin
Founder and Chief Trainer

Gregory Matthews

Jun 28, 2016, 5:50:27 AM
to Aleksey Tsalolikhin, Marco Marongiu, help-cfengine
sorry guys, just back from a week off.

I can try the agent_expireafter option but I assume that this will mean
anything that is promised after this will not be fulfilled.

I can't work out why these Dell scripts can be run on the command line
ok but when run from cf-agent leave these <defunct> processes in the
process list (and hang the cf-agent).

Very annoying!

GREG

Neil Watson

Jun 28, 2016, 9:01:19 AM
to help-cfengine
The command or shell environment of any config management tool can be
unexpected. Yesterday I had a command that worked fine in all shell tests,
local and remote, but failed as an Ansible task. I have seen the same thing
happen in CFEngine too. My best practice is to use a shell wrapper that
sets up the environment and have CFEngine run that instead.
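
A rough sketch of how the CFEngine side of the wrapper approach might look. The wrapper path, the bundle name and the body name are hypothetical and do not come from this thread; the wrapper itself would set PATH and any other environment the Dell script expects before calling srvadmin-services.sh. Discarding output with no_output could help only if the hang comes from the started daemons holding the agent's output pipe open, which is an assumption and not something confirmed here.

```cf3
bundle agent start_openmanage
{
  commands:
    # om_installed is the class from the original post. A real policy
    # would also guard this with the restart class set by the
    # "processes" promise, so it does not run on every agent pass.
    om_installed::
      # Hypothetical wrapper script, not a path from this thread.
      "/usr/local/sbin/start-openmanage-wrapper.sh"
        comment => "Start OpenManage through an environment-setting wrapper",
        contain => wrapper_in_shell;
}

body contain wrapper_in_shell
{
      # Run the command through a shell.
      useshell  => "true";
      # Discard stdout and stderr instead of reading them back into cf-agent.
      no_output => "true";
}
```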

For long-running jobs I suggest delegating them to cron or even your own
daemon. You can have the job leave flags for CFEngine if it needs to
know the job state.
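
One way the flag idea could be sketched in policy, assuming the cron job creates a marker file when it finishes. The file path, class name and bundle name are hypothetical, chosen only for illustration.

```cf3
bundle agent check_long_job
{
  classes:
      # Hypothetical flag file written by the cron job on completion.
      "long_job_done" expression => fileexists("/var/run/long_job.done");

  reports:
    long_job_done::
      "Long-running job has completed.";
    !long_job_done::
      "Long-running job has not completed yet.";
}
```

Other promises can then be guarded by the long_job_done class, so cf-agent never has to wait on the job itself.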

--
Neil H Watson
CFEngine reporting: https://github.com/neilhwatson/delta_reporting
CFEngine policy: https://github.com/neilhwatson/evolve_cfengine_freelib
CFEngine and vim: https://github.com/neilhwatson/vim_cf3