cf-agent/cf-execd acts weird after killed

Todd Erwin

Apr 5, 2018, 10:48:06 AM
to help-cfengine
I searched to see if I could find any bugs or issues related to this and didn't come up with anything but wanted to see if anyone could help shed some light on a situation.

I'm running cfengine only once per hour (instead of every 5 minutes), and a typical run takes ~1 minute from start to finish. I have over 10k hosts and many policy servers (load balancers) that answer the call when a client needs updating. On each run the client builds an array of hosts (the same ones unless new ones are added) to get the policy data from. However, on occasion these policy servers become OVERLOADED (reach the max number of connections): a server goes down or is overloaded, its clients start hitting another policy server, which in turn overloads that one, and then the flood happens.

Example: 2000 hosts per policy server, and there are 10 policy servers for 20k hosts.
One goes down. Now 2000 hosts have to spread out to the other 9, but they fail against their primary every time they need policy.

What happens is that the cf-agent process starts to take a LONG LONG time to complete, because now instead of getting policy on the first attempt it has to fail against 10 servers before it moves on. SO monitoring comes along, sees that cf-agent has been running longer than 90 minutes and KILLS it. This is where things get bad.

From all the debugging that I could do: because it is killed at 90 minutes, there are actually 2 processes running (it runs every hour) when the FIRST one is killed. Somehow the second one now has only HARD classes and none of the DYNAMIC classes. As you can imagine this is bad, because now it makes assumptions about policies that are not true, resulting in undesired configurations.

QUESTION: are all of the dynamic classes stored in memory, or stored back to cf-execd? I'm guessing the latter, because then cf-execd calling the second agent hands it all the bad info.

How would a guy write a policy that allows only a SINGLE run of cf-agent, and does not SPAWN a second one if one is already running?

Thanks.

Nick Anderson

Apr 6, 2018, 12:14:49 PM
to Todd Erwin, help-cfengine

Hi Todd,
This indeed is a bad condition. Curious what version of the agent
you're running. I think there were some performance related patches
in 3.10.
How many files are being considered during policy updates?


> QUESTION.. is all of the dynamic classes stored in memory or stored back to
> the cf-execd? Im guessing the latter because then cf-execd calling the
> second agent hands it all the bad info..

During an agent run all non persistent classes are stored in
memory. Persistent classes are stored so that they can be defined
at the start of the next agent run. Promise executions will result
in locks that are stored, but promise locks aren't classes.

> How would a guy right a policy that would enable only a SINGLE run of
> cf-agent and if one is already running to not SPAWN a second one?

There is =agent_expireafter= in =body executor control= which is
the number of minutes that an agent should be allowed to execute
without sending some data back to =cf-execd=. That won't provide a
hard limit to a single agent though. If you want a strict limit on
the number of agent processes I think a good option is to replace
the =exec_command= with a custom wrapper. From the custom wrapper
you can check to see if there is an existing agent process and
then you can choose to kill it or skip the execution (because
there is an agent already at work).
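
A minimal sketch of such a wrapper, written in Python purely for illustration. The lock file path and the `run_exclusive` helper are hypothetical names, not anything shipped with CFEngine, and the wrapped command is whatever you would otherwise put in =exec_command=. The wrapper takes a non-blocking exclusive `flock`; if another wrapper already holds it, the new run is skipped instead of starting a second agent.

```python
#!/usr/bin/env python3
"""Hypothetical single-instance wrapper for the agent command (a sketch only)."""
import fcntl
import subprocess
import sys

# Assumed lock location; any path writable by the user running cf-execd will do.
LOCK_PATH = "/var/run/cf-agent-wrapper.lock"


def run_exclusive(lock_path, command):
    """Run command only if no other wrapper currently holds the lock.

    Returns the command's exit code, or None when the run was skipped
    because another instance is still at work.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fails immediately if already held.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None
        # The kernel drops the lock when lock_file is closed or this process
        # dies, so killing the wrapper cannot leave a stale lock behind.
        return subprocess.call(command)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Everything after the script name is the real agent command line.
    exit_code = run_exclusive(LOCK_PATH, sys.argv[1:])
    sys.exit(0 if exit_code is None else exit_code)
```

The lock belongs to the wrapper, not to the agent it starts, so this only guards against overlapping wrapper invocations. If you would rather kill the old agent than skip the new run, that decision would go in the `except` branch.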

> agent_expireafter in body executor control is the number of minutes an agent
> can execute without sending some data back to cf-execd. With the default run
> interval of 5 minutes, a setting of 20 should limit the number of concurrent
> agents to ~4, as you say.
--
Nick Anderson
Doer of things, CFEngine

Aleksey Tsalolikhin

Apr 6, 2018, 12:23:07 PM
to Nick Anderson, Todd Erwin, help-cfengine
Todd, what is your splay time?

Did you increase the default setting when you moved to hourly runs?

--
You received this message because you are subscribed to the Google Groups "help-cfengine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to help-cfengin...@googlegroups.com.
To post to this group, send email to help-c...@googlegroups.com.
Visit this group at https://groups.google.com/group/help-cfengine.
For more options, visit https://groups.google.com/d/optout.

Aleksey Tsalolikhin

Apr 6, 2018, 12:28:27 PM
to Nick Anderson, Todd Erwin, help-cfengine

Todd Erwin

Apr 6, 2018, 2:23:33 PM
to help-cfengine
        splaytime               => "30";
        schedule                => { "Min00" };

Yep, we randomize over the top half of the hour, and then give a quiet period over the last half to enable debug/testing time without affecting production runs.

Aleksey Tsalolikhin

Apr 6, 2018, 2:33:43 PM
to Todd Erwin, help-cfengine
Excellent. 

Now, what version are you running on the policy servers?

I did some work with cfengine support and they found and fixed some hot spots in cf-serverd.

Can you update to latest stable on your policy servers?

Todd Erwin

Apr 6, 2018, 2:35:13 PM
to help-c...@googlegroups.com

> This indeed is a bad condition. Curious what version of the agent
> your running. I think there were some performance related patches
> in 3.10.
> How many files are being considered during policy updates?

Currently we are running 3.6.5, trying to get over the TLS hump and move on to 3.10 while trying to incorporate the new library structure alongside our libraries.
Assuming they are ALL bad, I believe the file count is around 20-30 depending on the host.

> agent_expireafter in body executor control is the number of minutes an agent
> can execute without sending some data back to cf-execd. With the default run
> interval of 5 minutes as setting of 20 should limit the number of concurrent
> agents to ~ 4 as you say.

I was looking at that. The problem is that we run once per hour, so I would think the configuration would be 1 if I wanted to use agent_expire? Or a wrapper, as you say, might be another option.

Outside sources killing CF-AGENT are definitely not the right answer. I'm also thinking about limiting the number of servers it "CHOOSES" to only 4. I also thought about changing the default timeout to 10 instead of the standard 30, so that it would fail through the servers faster: 4x10x20 still gets me under an hour. ~40 seconds per copy request when it's bad.
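
As a rough check of that arithmetic, here is a small Python sketch. It assumes the worst case, where every file copy times out against every server in the list before giving up, and it reads the 20 as the number of files. Both are assumptions for illustration, and the function name is made up.

```python
def worst_case_seconds(servers, timeout_s, files):
    """Upper bound on time spent waiting if every copy times out on every server."""
    return servers * timeout_s * files


# Current setup: 10 servers in the list, 30 s timeout, about 20 files.
current = worst_case_seconds(servers=10, timeout_s=30, files=20)
# Proposed setup: 4 servers, 10 s timeout, same file count.
proposed = worst_case_seconds(servers=4, timeout_s=10, files=20)

print(f"current:  {current} s = {current / 60:.1f} min")
print(f"proposed: {proposed} s = {proposed / 60:.1f} min")
```

Under these assumptions the current settings can wait up to 6000 seconds (100 minutes), which is past the 90 minute kill, while the proposed ones top out at 800 seconds (about 13 minutes).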
--

Aleksey Tsalolikhin

Apr 6, 2018, 2:36:16 PM
to Todd Erwin, help-cfengine

Aleksey Tsalolikhin

Apr 6, 2018, 2:38:36 PM
to Todd Erwin, help-cfengine
Todd, the hot spots were definitely found and fixed between 3.6.5 and 3.10.3.

Moreover, 3.6 is no longer supported. 

What do you need to do in order to upgrade?

I take it you know how to upgrade since we have discussed that recently on the list. 

Aleksey Tsalolikhin

Apr 6, 2018, 2:39:42 PM
to Todd Erwin, help-cfengine
And by no longer supported I didn't mean we wouldn't help you. Just that the vendor no longer offers commercial support for it.

Todd Erwin

Apr 6, 2018, 2:40:56 PM
to help-cfengine
20000

Todd Erwin

Apr 6, 2018, 2:42:59 PM
to help-cfengine
We have our OWN libraries that have a "SMALL" overlap with the new ones, so we stopped updating the new libraries while still getting newer versions. We want to use the libraries in the newer versions instead of the 2 flat file libraries that we have. We just have to figure out the overlap and then "CAREFULLY" remove it.

Todd Erwin

Apr 6, 2018, 2:44:15 PM
to help-cfengine
The location where we saw the issue unfortunately still has 3.6.5 on the policy servers, we do have quite a few policy servers already on 3.7.5.. Do you think that 3.7.5 would be enough?



Nick Anderson

Apr 6, 2018, 2:56:03 PM
to help-cfengine
I am pretty sure there were some hot spots addressed in early 3.7 releases. I would expect to see an improvement moving from 3.6.5 to 3.7.5.

Aleksey Tsalolikhin

Apr 6, 2018, 2:57:47 PM
to Todd Erwin, help-cfengine
Yes sir 3.7.5 is recent enough that that should be fine. Please do and let us know how it goes. 

Got it on your maxconnects and custom libraries -- would be curious to hear more about the latter if you can talk about it? :-)

P.S. 3.7.x is commercially supported until August. 

--

Aleksey Tsalolikhin

Apr 6, 2018, 2:58:16 PM
to Nick Anderson, help-cfengine
Yep, those were the ones.

--

Todd Erwin

Apr 6, 2018, 3:31:56 PM
to help-cfengine
Sure. Is there something specific you are looking for?
We put prune dir and logrotate in there because they were "NEW" and we wanted to use them but couldn't take all the other stuff. That is a good example of why upgrading is difficult, because then you get duplicated policy.

We used to do package management terribly :( so we have agent bundles that check if an RPM/DPKG is installed and, if not, COPY it from a central repo locally and then install it, manage classes and stuff like that. There are 15 or so bundles directly related to that alone.
Really cruddy way of doing it, because we didn't have access to a repo. Now we are getting away from that and using the built-in packages promises.

And then we have some CUSTOM copy bodies that deal directly with how we land things on policy servers and clients.

Nothing too fancy :)

Nick Anderson

Apr 6, 2018, 3:37:04 PM
to Todd Erwin, help-cfengine
Have you tried symlinking the lastseen database to a ramdisk on your hubs?


Todd Erwin

Apr 6, 2018, 4:31:34 PM
to help-cfengine
Wait for all the AIR to be let out of this room......

We are running MOST of our policy servers on VMs. Some share the same back-end I/O, so there is a bottleneck. We thought about a ramdisk, however most of these have a small memory footprint, 2-4GB each. I/O is only a problem when we start PUSHING the number of connections.

However the issue is not about Policy servers at all really.. but how the cf-execd handles a cf-agent run being killed while another cf-agent run is running.

I get that we are trying to figure out "WHY" the client can't update in an hour and I know the answer to that question which comes from a snowball effect.  However I wouldn't have had many clients mis-configured if the agent/execd would've handled termination in a more graceful manner..

Nick Anderson

Apr 6, 2018, 4:45:07 PM
to help-cfengine

Well, it's pretty common to run hubs in vms. No real surprises there. I don't think lastseen is big, so I would really consider that.

So, what's the deal with cf-execd and multiple agent runs?


> when it kills the FIRST one, somehow the Second one now has only HARD classes and none of the DYNAMIC classes.. As you can imagine this is bad because now it makes assumptions about policies that are not true resulting in undesired configurations..

By "it" you're talking about cf-execd in reference to agent_expireafter in body executor control, right?
After agent_expireafter minutes of no feedback from the agent to cf-execd, the apparently dead cf-agent is terminated by cf-execd.

This should have nothing to do with a second run missing classes. How are those classes set? The first thing that comes to my mind is promise locking, but if the launch times of the two agents are far apart that seems really unlikely. And I would only expect that locking to affect something like commands promises that are being run as modules.

Todd Erwin

Apr 6, 2018, 5:22:00 PM
to help-c...@googlegroups.com

> So, what's the deal with cf-execd and mutlipel agent runs?
>
> > when it kills the FIRST one, somehow the Second one now has only HARD classes and none of the DYNAMIC classes.. As you can imagine this is bad because now it makes assumptions about policies that are not true resulting in undesired configurations..
>
> By "it" your talking about cf-execd in reference to agent_expireafter in body executor control right?
> After agent_expireafter minutes of no feedback from the agent to cf-execd, the apparently dead cf-agent is terminated by cf-execd.

Sorry, should've been more clear. We have a system monitor that watches for processes that run too long. Each process can be configured separately for time; for cf-agent we say it should not run longer than 90 minutes (that probably should be lowered to 45 and we would never have the issue).

So: the first one runs, then 60 minutes pass and cf-execd kicks off cf-agent number 2. Another 30 minutes pass and the monitor says "hey, the 1st cf-agent is running too long", so the MONITOR (not cfengine) does a kill -9 PID_OF_CF_AGENT. The 2nd cf-agent now for some reason no longer has any discovered classes and only has hard classes.

> This should have nothing to do with a second run missing classes. How are those classes set? The first thing that comes to my mind is promise locking, but if the launch times of the two agents are far apart that seems really unlikely. And I would only expect that locking to affect something like commands promises who are being run as modules.

This is where it gets tricky to explain. I'll take a stab and try to do it justice.

Inside of promises.cf we have:

bundle agent modules
{
  commands:
    "$(sys.workdir)/modules/perl_script_2_set_classes  || /bin/echo"
      contain => in_shell_and_silent,
      module => "true",
      action => if_elapsed(0);
}
Simple, right? Wrong. perl_script_2_set_classes (not what it's really named) parses a group of directories for MORE perl scripts and then runs each of those perl scripts (it has 60 seconds to run each perl script and print classes). Each perl script outputs, via a print, the NAME of the desired classes. EXAMPLE:      print "+Todds_Best_Class\n";

CFEngine then ACCEPTS/ASSIGNS/MAGICALLY makes these available; they can be seen with allclassesreport and of course by all other promises.

The perl scripts get information about the host from NIS/LDAP/Hypervisor/%DiskFree so cfengine doesn't puke when there is not enough space/vmbuild specifics/BETA/PRODUCTION.

A good example: one script looks to see if there is a /var/cfengine.beta, and if so that client is a BETA client and pulls all BETA configurations and not production (a good way to test before prod).
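
For readers unfamiliar with the module protocol, here is a minimal sketch of one such script, written in Python rather than Perl purely for illustration. The function name and the class names `beta_client` and `production_client` are made up; only the /var/cfengine.beta marker comes from the description above. A line starting with `+` on standard output asks the agent that ran the module to define that class.

```python
import os


def module_lines(beta_marker="/var/cfengine.beta"):
    """Return module-protocol lines for this host.

    beta_marker is the path whose presence marks a BETA client.
    """
    lines = []
    if os.path.exists(beta_marker):
        # "+name" asks the calling agent to define class "name".
        lines.append("+beta_client")
    else:
        lines.append("+production_client")
    return lines


if __name__ == "__main__":
    # The agent reads these lines from the command's standard output.
    for line in module_lines():
        print(line)
```

If the script is killed or produces no output, no `+` lines reach the agent, and that run would see none of these classes.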

My guess is these are stored in MEMORY. However, they are invoked WITH the agent because they come from promises.cf, and should stay WITH the agent. I think these classes live back at the cf-execd level, and when cf-agent is killed it wipes them out from cf-execd, so the next time agent #2 checks policy against classes they are not there because agent 1 wiped them out on kill.

Just my swag..

Aleksey Tsalolikhin

Apr 6, 2018, 6:32:07 PM
to Todd Erwin, help-cfengine
Thank you for explaining about your libraries Todd. I misunderstood I thought they were C libraries. I'm tracking now.

We follow Nick's suggestion to move the last seen database to RAM to keep our hubs from getting overloaded.

So what causes your agent runs to take more than an hour when they take more than an hour? Is one of these external scripts hanging? If so, do you know which one?

Todd Erwin

Apr 6, 2018, 6:49:17 PM
to help-cfengine


On Friday, April 6, 2018 at 4:32:07 PM UTC-6, Aleksey Tsalolikhin wrote:
> Thank you for explaining about your libraries Todd. I misunderstood I thought they were C libraries. I'm tracking now.
>
> We follow Nick's suggestion to move the last seen database to RAM to keep out hubs from getting overloaded.
>
> So what causes your agent runs to take more than an hour when they take more than an hour? Is one of these external scripts hanging? If so, do you know which one?

They try to copy files from an overloaded server, and that server doesn't answer in 30 seconds, so they go to server number 2, which can't answer either because it is overloaded. It basically happens because of an ENVIRONMENT problem that has every server having to talk to another server, so you have 10k hosts that never get an answer. Once it starts it's very hard to stop without the kill in place. It happens because of too many changes being pushed all at once to 10k or more hosts, or because some service (DNS) takes forever to respond. WE know why it does it and we keep trying to put safeguards in place.

Aleksey Tsalolikhin

Apr 6, 2018, 6:56:45 PM
to Todd Erwin, help-cfengine
Right! You covered that earlier. Sorry.

I would try to solve this at the most basic level possible -- which is not having overloaded policy servers. Hence the recommendations to upgrade to 3.7 and to move the last seen database to RAM.

Then you don't even have to worry about killing the agents because they will complete their runs in time. No?

--

Nick Anderson

Apr 6, 2018, 9:11:09 PM
to Todd Erwin, help-cfengine
Are you confident about that || /bin/echo?

What's it echoing? 

What if you remove it?

What do you have agent_expireafter set to?

Todd Erwin

Apr 8, 2018, 9:34:14 PM
to help-cfengine
If the script doesn't execute it echoes nothing. Not real confident about it to be honest; I think it was inherited from previous admins/versions. Never tried to remove it, I'll do some test runs Monday.

I am not setting agent_expireafter to anything, so my guess is it's the default. I'm assuming you mean expire after; there is no agent_expireafter that I'm aware of.

Aleksey Tsalolikhin

Apr 8, 2018, 9:39:03 PM
to Todd Erwin, help-cfengine

Todd Erwin

Apr 8, 2018, 10:00:49 PM
to help-cfengine

Yeah we talked about this earlier in the thread.

Aleksey Tsalolikhin

Apr 8, 2018, 10:04:08 PM
to Todd Erwin, help-cfengine
Right on.  I still feel if you upgrade and that fixes overloaded policy servers, you won't have long-running agents any more. I don't believe I heard back from you on that? :-)

On Sun, Apr 8, 2018, 10:00 PM Todd Erwin <toddw...@gmail.com> wrote:

> Yeah we talked about this earlier in the thread.
> > I was looking at that the problem is that we run 1x per hour.. so I would think that then the configuration would be 1 if I wanted to use agent_expire?  Or a wrapper as you say might be another option..

Todd Erwin

Apr 8, 2018, 10:38:38 PM
to help-cfengine


On Sunday, April 8, 2018 at 8:04:08 PM UTC-6, Aleksey Tsalolikhin wrote:
> Right on.  I still feel if you upgrade and that fixes overloaded policy servers, you won't have long-running agents any more. I don't believe I heard back from you on that? :-)

I think we will work on a couple of approaches. Given the environment, simply upgrading will not fix it.
1. Get policy servers to 3.7.5 or 3.7.7. Work on getting policy servers to have the last seen DB file on a RAM disk (just the one file, right?)
2. Lower the threshold for killing from 90 to 50 minutes (this ensures the kill happens 10 minutes before the next run, preventing duplicate runs)

We will also investigate the echo, lowering the number of round robin policy servers, and possibly lowering the default timeout from 30 to 10 for copies.