cf-serverd CPU utilization

dan...@gmail.com

Sep 11, 2013, 1:25:03 PM
to help-c...@googlegroups.com
The cf-serverd process is using anywhere from 5% to 12% of the CPU on my systems, which range from a P4 up to an i7-3970X running at 4.50GHz. That seems kind of high for a system at idle; on an otherwise idle system it's always at the top of the list in CPU usage. What exactly is cf-serverd doing that causes it to use this much CPU?

Neil Watson

Sep 11, 2013, 1:32:23 PM
to help-c...@googlegroups.com
What version of CFEngine?
Is this a policy server?
How many clients connect to this host?

--
Neil H Watson
http://evolvethinking.com/evolve-thinkings-free-cfengine-library/
Hardening with Cfengine http://evolvethinking.com/products/
VIM and Cfengine https://github.com/neilhwatson/vim_cf3

Brian Bennett

Sep 11, 2013, 1:33:17 PM
to dan...@gmail.com, help-c...@googlegroups.com
"Using" is a relative term when talking about CPU.

In what way is it using the CPU? System time? User time? I/O wait? Steal time?
Are they bare metal physical instances? VMs? Containers?
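
On Linux, one way to answer these questions is to sample the aggregate `cpu` line of `/proc/stat` twice and compare the two readings. The following is only a sketch: it assumes a Linux kernel with the usual field order (user, nice, system, idle, iowait, irq, softirq, steal), it reports system-wide figures rather than per-process ones, and the function names are made up for illustration.

```python
import time

# Field order of the aggregate "cpu" line in /proc/stat on Linux.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Turn the 'cpu ...' line of /proc/stat into a {field: ticks} dict."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_breakdown(before, after):
    """Percentage of elapsed ticks spent in each state between two samples."""
    deltas = {k: after[k] - before[k] for k in before}
    total = sum(deltas.values())
    if total == 0:
        return {k: 0.0 for k in deltas}
    return {k: 100.0 * d / total for k, d in deltas.items()}


def sample(interval=5.0, path="/proc/stat"):
    """Read /proc/stat twice, `interval` seconds apart, and return the split."""
    with open(path) as f:
        first = parse_cpu_line(f.readline())
    time.sleep(interval)
    with open(path) as f:
        second = parse_cpu_line(f.readline())
    return cpu_breakdown(first, second)
```

Calling `sample()` on the affected host while cf-serverd is busy shows whether the time is going to user, system, iowait or steal.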

You could try running it in debug and no fork mode. It will tell you what it's doing.

-- 
Brian

--
You received this message because you are subscribed to the Google Groups "help-cfengine" group.
To unsubscribe from this group and stop receiving emails from it, send an email to help-cfengin...@googlegroups.com.
To post to this group, send email to help-c...@googlegroups.com.
Visit this group at http://groups.google.com/group/help-cfengine.
For more options, visit https://groups.google.com/groups/opt_out.

dan...@gmail.com

Sep 11, 2013, 5:20:07 PM
to help-c...@googlegroups.com, cfen...@watson-wilson.ca
I'm using version 3.5.2. The systems in question are all clients.

On the policy server, cf-serverd shows about the same utilization as on the clients (around 10%), and about 30 clients use it.

By "CPU utilization" I am referring to the values shown in the "top" program in the column "%CPU".

cf-execd seems to start cf-serverd automatically, but I wondered if it has to be running at all on the client machines.

Laurent Raufaste

Sep 14, 2013, 6:38:04 PM
to help-c...@googlegroups.com
I'm using 3.5.2 with ~400 clients and the CPU usage on the cfhub is becoming a problem too.

As you can see on this graph, our dedicated cfhub is now reaching a load of 0.4.
The big spike at the end of July is due to the addition of 200 clients in one day.

It's becoming a problem because, more and more often, we see the cfhub giving up: cf-serverd sometimes locks up, stops answering requests, and after some time gets restarted.
Here's one of the daily error emails we get from random CFEngine clients:
2013-09-14T21:17:07+0000    error: Unable to establish any connection with server.
2013-09-14T21:17:07+0000    error: Unable to establish any connection with server.
2013-09-14T21:17:07+0000    error: Unable to establish any connection with server.
2013-09-14T21:17:07+0000    error: Unable to establish any connection with server.
2013-09-14T21:17:07+0000    error: Unable to establish any connection with server.

Our cfhub was a t1.micro; it's now an m1.small, but as we are getting more of those, I plan to upgrade it again to a larger AWS instance.
Is this the usual behavior, or is there an underlying problem?

Do you guys with ~500+ servers allocate a larger server to the cfhub?

Thanks

Brian Bennett

Sep 14, 2013, 9:13:53 PM
to Laurent Raufaste, help-c...@googlegroups.com
So you have two main consumers of CPU. User and steal.

The user time is possibly tcdb overhead. Some people have seen extra CPU burn as the tcdb gets larger. You can try removing it to see if that helps.

Secondly, the steal time. This is time that the vCPU had a runnable task but the physical CPU(s) were servicing other instances. If there were no runnable task, this time would be reported as idle. You should not account for steal time when determining whether or not something is running hot. Steal time by definition means a task was not running.

I'm also interested: what does your memory look like? In particular, is MemUsable low? What's your cache percentage? What's your cache/hit ratio? (See here, http://serverfault.com/q/157612/64706 but be warned that SystemTap is not entirely production safe and may crash the system.)

Are you seeing any client side errors? I work with high density systems backed by EC2 so I have a bit of experience in this area. I have some systems that run under a consistent load of about 75% and the load average is around 3x cores. The CPU time is almost entirely computational (i.e., very little waiting/blocking and high thread counts). So for a well utilized system a high load is not unreasonable.

-- 
Brian

Laurent Raufaste

Sep 14, 2013, 10:11:39 PM
to help-c...@googlegroups.com, Laurent Raufaste
The memory usage is super stable and not a problem, see below:

I don't see any TCDB problem on my side, either on the hub or on the clients (cf-agent --self-diagnostics shows no coherence problem).

Looking at iostat when cf-serverd is super active, I don't see any illegitimate I/O, but cf-serverd is crunching a lot, as you can see:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                   
25887 root      20   0  598m 8252 1900 S 99.7  0.5   1624:26 cf-serverd                                                 
26224 root      20   0 17324 1240  940 R  0.9  0.1   0:00.14 top                                                        

Do you guys see this behavior on your clusters? (An m1.medium cfhub not being able to handle more than 500 clients?)

Brian Bennett

Sep 14, 2013, 10:18:23 PM
to Laurent Raufaste, help-c...@googlegroups.com, Laurent Raufaste
What do you mean by "not being able to handle more than 500 clients"?

What kinds of errors are you getting? Connection failures? Connections dropped? Reset? How many open files? What's the limit?
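
For the open-files question, Linux exposes both numbers under `/proc`. Here is a rough sketch, assuming Linux, enough privileges to read another process's `/proc` entries, and that the cf-serverd PID is already known; the helper names are hypothetical.

```python
import os


def count_open_fds(pid):
    """Count the file descriptors a process currently holds (Linux /proc)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def parse_nofile_limit(limits_text):
    """Extract (soft, hard) 'Max open files' from /proc/<pid>/limits text.

    Values are returned as strings because a limit may read 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            fields = line.split()
            # Line layout: Max open files <soft> <hard> files
            return fields[3], fields[4]
    raise ValueError("no 'Max open files' line found")


def open_file_report(pid):
    """Return the open descriptor count next to the soft and hard limits."""
    with open(f"/proc/{pid}/limits") as f:
        soft, hard = parse_nofile_limit(f.read())
    return {"open": count_open_fds(pid), "soft_limit": soft, "hard_limit": hard}
```

Running `open_file_report(<pid of cf-serverd>)` during one of the connection spikes would show how close the process is to its descriptor limit.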

If you can qualify exactly what the problem is we might be able to help you better.

By the way, LinkedIn uses CFEngine to manage thousands of servers (and their admin is active on this list) so the problem you're experiencing is not inherent to CFEngine.

-- 
Brian

Laurent Raufaste

Sep 15, 2013, 2:28:05 PM
to help-c...@googlegroups.com
Regarding the steal CPU, it's an AWS thing and I'm aware of it: on an m1.small it prevents you from using more than 0.5 of the core, hence the steal on the graph mirroring the load.
It's not a problem IMHO; it just means that the load of the instance is capped at 0.5 in the case of an m1.small instance (and around 0.1 on a t1.micro).

On Saturday, September 14, 2013 10:18:23 PM UTC-4, bahamat wrote:
> What do you mean by "not being able to handle more than 500 clients"?

My cfhub was on a t1.micro before, and started to fall behind the clients when I had around 50.
Now my cfhub is on an m1.small instance and is starting to fail to answer client requests at around 400 clients.
If you look at the graph, you can see that the load is linear with the number of clients on the cluster, and at this rate, with ~500 clients, the cfhub will hit the 0.5 load and won't be able to serve clients in an efficient way.
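
The ~500 figure is a straight proportional extrapolation from the numbers quoted in this thread (about 0.4 load at about 400 clients, with a cap of 0.5). A back-of-the-envelope sketch, assuming the load really does grow linearly from zero with the client count; the function name is made up for illustration:

```python
def clients_at_cap(observed_clients, observed_load, load_cap):
    """Extrapolate linearly (through the origin) to the client count at the cap."""
    load_per_client = observed_load / observed_clients
    return load_cap / load_per_client


# Figures quoted in the thread: ~400 clients at ~0.4 load, capped at 0.5.
estimate = round(clients_at_cap(400, 0.4, 0.5))
```

If per-client cost is not constant (for example, if slow connections pile up), the real limit could arrive earlier than this estimate.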
 
> What kinds of errors are you getting? Connection failures? Connections dropped? Reset? How many open files? What's the limit?

From time to time, on random clients, I get the errors I mentioned earlier:
2013-09-15T04:24:18+0000    error: Unable to establish any connection with server.
2013-09-15T04:24:28+0000    error: Unable to establish any connection with server.
Those errors did not happen as often before, and have become more frequent since my last addition of 8 clients.

At the same time, I can see that the server is struggling, with cf-serverd at 100% CPU on the hub.
 
> If you can qualify exactly what the problem is we might be able to help you better.

Like my "broken syntax variable" error (https://cfengine.com/dev/issues/3047), it's hard to investigate: it does not happen often on any single instance, so it's hard to catch, but at the scale of the cluster it happens a lot.
 
> By the way, LinkedIn uses CFEngine to manage thousands of servers (and their admin is active on this list) so the problem you're experiencing is not inherent to CFEngine.

Let's not rule out CFEngine just because someone else is managing thousands of servers; they might be using uber CPUs on bare metal. I signed up for vCPUs in the cloud.

My question is really:
- On AWS I reached the cfhub limit with 50 clients on a t1.micro
- On AWS I'll reach the cfhub limit with 500 clients on a m1.small

Is this something that is usually seen? (In that case I'll just upgrade my cfhub; that's fine, really.)
Or is the cfhub supposed to handle way more than this, in which case I need to look at what's not optimal in our configuration?

I'm wondering because the cfhub's job should really be to tell the clients that their files are up to date and, if not, just send a bunch of text files. An rsync job, really. But I might be missing something.

Thanks

Laurent Raufaste

Sep 15, 2013, 2:34:01 PM
to help-c...@googlegroups.com
To add more info, I also get a more detailed error message from clients sometimes:
2013-09-15T13:49:51+0000    error: Bad protocol reply 'BAD: Server is currently too busy -- increase maxconnections or splaytime?'
2013-09-15T13:49:51+0000    error: Authentication dialogue with '10.161.53.136' failed
2013-09-15T13:49:51+0000    error: Unable to establish any connection with server.
2013-09-15T13:49:51+0000    error: Unable to establish any connection with server.

my maxconnections is 100 and my splaytime is 5
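
The server's hint about maxconnections and splaytime can be put into rough numbers with Little's law (average concurrency = arrival rate x time each connection stays open). This is only a sketch under loud assumptions: that all clients check in once per run, that splaytime spreads them evenly over that many minutes, and that maxconnections limits simultaneous connections. The connection duration is a hypothetical input, not something measured in this thread.

```python
def mean_concurrent_connections(clients, splay_minutes, seconds_per_connection):
    """Little's law: concurrency = arrival rate * time each connection stays open."""
    arrivals_per_second = clients / (splay_minutes * 60.0)
    return arrivals_per_second * seconds_per_connection


def seconds_to_hit_limit(clients, splay_minutes, max_connections):
    """Mean connection duration at which average concurrency reaches the limit."""
    arrivals_per_second = clients / (splay_minutes * 60.0)
    return max_connections / arrivals_per_second


# With the figures from this thread (about 400 clients, splaytime 5,
# maxconnections 100), the average only reaches the limit if a connection
# stays open for about 75 seconds.
threshold = seconds_to_hit_limit(400, 5, 100)
```

Under these assumptions, hitting the limit points at connections that stay open far longer than expected, or at clients that are not spread evenly, rather than at the raw number of clients.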

Brian Bennett

Sep 15, 2013, 2:44:29 PM
to Laurent Raufaste, help-c...@googlegroups.com
Steal time isn't an AWS thing, it's a virtualization thing. And it's not a function of capping exactly, but it is related.

With capping, the instance is allocated CPU time in a unit called jiffies. To make the numbers simple, let's say one CPU can service 1000 jiffies. 0.5 would allocate your instance 500. If you use 500 then your CPU time is reported as 100%. If you want to use 500 but only 400 are available then your CPU time is reported as 80% and steal time is 20%.

If you want to use 200 but only 100 are available your CPU time will be 20% and 20% will be steal time.
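
The arithmetic in these examples can be written out as a small function. This is a sketch of the simplified model described above (percentages relative to the instance's allocation), not of how any particular hypervisor accounts for time; the function and parameter names are made up for illustration.

```python
def cpu_and_steal(allocated, wanted, available):
    """Return (cpu_pct, steal_pct) relative to the instance's allocation.

    allocated: jiffies the instance is entitled to in the interval
    wanted:    jiffies the instance had runnable work for
    available: jiffies the physical CPU could actually give it
    """
    wanted = min(wanted, allocated)   # cannot use more than the cap
    ran = min(wanted, available)      # what actually executed
    stolen = wanted - ran             # runnable but not serviced
    return 100.0 * ran / allocated, 100.0 * stolen / allocated


# The three cases from the explanation above, with an allocation of 500:
full = cpu_and_steal(500, 500, 500)      # (100.0, 0.0)
squeezed = cpu_and_steal(500, 500, 400)  # (80.0, 20.0)
light = cpu_and_steal(500, 200, 100)     # (20.0, 20.0)
```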

-- 
Brian

Brian Bennett

Sep 15, 2013, 2:47:00 PM
to Laurent Raufaste, help-c...@googlegroups.com
Ok this is a bit more helpful. If you can, run your server in verbose mode and see what it says about the connection failures. It should have an accurate reason as to why it's refusing (if it's refusing). You could also try packet capturing and looking for resets.

Secondly, what happens if you increase the max connections? How high can you go, and what happens when you get too high?

-- 
Brian

Laurent Raufaste

Sep 15, 2013, 3:12:02 PM
to help-c...@googlegroups.com
I guess I'll need to increase the maxconnections setting to 100+

Here's the graph of the threads on the cfhub:

The thread spikes happen exactly when I get the error emails about clients failing to connect.
I'll increase maxconnections and see how it goes.
I'm wondering why, at some times of the day, more clients pile up to talk to the server; I usually see 1 to 20 when I look at it. Bad luck?

But I'm still wondering about the CPU usage. Is anybody running the cfhub on AWS seeing the same usage on t1.micro/m1.small?

Gregory Matthews

Sep 16, 2013, 5:51:53 AM
to help-c...@googlegroups.com
We are seeing these errors on what is a fairly beefy server with around
1500 clients.

I've posted about this before (see previous thread about splaytime) but
there was no solution found. Interestingly, on upgrade to 3.5.1 the
emails stopped coming - perhaps not seen as a critical error anymore -
but I'm pretty sure the problem persists.

I've tried increasing the thread count but in fact the server never
seems to have more than around 55 instances of the cf-serverd process.

Also, is it just me or are all the images stripped from your emails?

GREG

--
Greg Matthews 01235 778658
Scientific Computing Group Leader
Diamond Light Source Ltd. OXON UK


Nicolas Charles

Sep 16, 2013, 12:14:49 PM
to help-c...@googlegroups.com
Just a question: do you have a fast DNS, or do you have the hostnames in
/etc/hosts on the server?
I've seen cases where the reverse lookup (in CFEngine) could be a bit
slow, even if the DNS was responding fast.

Nicolas
Nicolas CHARLES

Laurent Raufaste

Sep 16, 2013, 12:20:40 PM
to Nicolas Charles, help-c...@googlegroups.com
That's a good point.
I'm using AWS's DNS, and I'm also seeing a lot of these errors sent by CFEngine clients:
2013-09-15T02:40:00+0000    error: During agent identification for '12.12.12.12'. (getnameinfo: Temporary failure in name resolution)
2013-09-15T02:40:00+0000    error: Id-authentication for 'some.host.name.tld' failed
2013-09-15T02:40:00+0000    error: Unable to establish any connection with server.

Does it mean that the server tries to do a reverse lookup for every client?
In that case, would it make sense for me to add a caching DNS forwarder on our cluster? Or locally?
/etc/hosts is not an option as we have dynamic names.
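
Whether reverse lookups are slow or failing from the hub can be measured directly before adding a cache. Below is a small sketch using Python's standard `socket.getnameinfo`, which wraps the same resolver call named in the error above; the helper name is hypothetical and the addresses to test would be real client IPs.

```python
import socket
import time


def time_reverse_lookup(ip):
    """Return (name_or_None, seconds) for one reverse lookup of `ip`."""
    start = time.perf_counter()
    try:
        # NI_NAMEREQD makes the call fail instead of echoing the IP back.
        name, _ = socket.getnameinfo((ip, 0), socket.NI_NAMEREQD)
    except OSError:
        name = None
    return name, time.perf_counter() - start
```

Calling `time_reverse_lookup()` in a loop over a sample of client addresses, run on the hub itself, gives a rough idea of how much time each connection could spend waiting on DNS.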

-- 
Laurent Raufaste


Nicolas Charles

Sep 17, 2013, 5:55:48 AM
to help-c...@googlegroups.com
Yeah, for every client the server does a reverse lookup to check its hostname.

You could add a DNS cache, or maybe try to add skipverify ( https://cfengine.com/archive/manuals/cf3-Reference#skipverify-in-server ) or skipidentify ( https://cfengine.com/archive/manuals/cf3-Reference#skipidentify-in-agent ) in your promises

Nicolas

Laurent Raufaste

Sep 17, 2013, 1:00:06 PM
to help-c...@googlegroups.com
As for the original problem, I have had no connection errors since I increased maxconnections from 100 to 200.
The load on the hub is still high, but at least I don't get those errors anymore for now.
Thanks for your help all.

On Tuesday, September 17, 2013 5:55:48 AM UTC-4, Nicolas Charles wrote:
> Yeah, for every client the server reverse lookup it to check its hostname.
>
> You could add a DNS cache, or maybe try to add skipverify ( https://cfengine.com/archive/manuals/cf3-Reference#skipverify-in-server ) or skipidentify ( https://cfengine.com/archive/manuals/cf3-Reference#skipidentify-in-agent ) in your promises
 

That's interesting.
I don't have skipidentify set to true.
I'm using:
skipverify            => { ".*\.$(def.domain)", "127.0.0.1" , "::1", @(def.acl) };

def.acl contains my CIDR block, so it should skip verification.
Could it be that the first ".*\.$(def.domain)" entry forces the server to do a reverse lookup to check the hostname of clients? Or is it checking whatever the client sends as a hostname?
The doc is not clear:
"Server side decision to ignore requirements of DNS identity confirmation."
Huh ? ;)

Shreyas Parikh

Dec 22, 2016, 6:16:48 AM
to help-c...@googlegroups.com, cfen...@watson-wilson.ca
Hi,

I am using the following CFEngine configuration on our production system:

cfengine-community 3.6.2-1 on both the server and client side
CFEngine design-center (cf-sketch version 3.6.0)

A single CFEngine server and its 102 clients are running.

We generate the configuration for 103 hosts and their related sketches using the design-center DCAPIs, which produces a 2.6M cf-sketch-runfile.cf file.
But somehow, when cf-agent runs on the client side, it uses up to 99.5% of the CPU. Any idea why its CPU utilization is so high?

For example:
21890 root      20   0   80148  38528   5956 R 100.0  0.9   1:56.00 cf-agent

Settings:

controls/cf_execd.cf

schedule => { "Min00","Min15","Min30","Min45" };
splaytime  => "5";


CPU specification on client/server:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    1
Core(s) per socket:    1