cf-serverd load testing

Danny Sauer

Oct 14, 2014, 6:10:03 PM
to help-c...@googlegroups.com
I'm testing out some different mechanisms to distribute policy across several CFEngine servers, and want to validate how they make cf-serverd behave under load.  It's inconvenient to point a whole pile of in-use servers at a cf-serverd instance to test it out, and predictions are only as accurate as my sketchy estimates; has anyone already written a convenient application which can simulate network load on a CFEngine server?  Ideally, I could take a few test systems and say "pretend to be 5,000 servers running with a 3-minute splaytime, fetching this list of files."  I'm not above writing it myself, but it'd be nice if something were already done, or mostly done, for me. :)
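If nothing exists, the sort of thing I have in mind is roughly the sketch below: fork a pile of workers that each sleep a random splay and then run cf-agent against a throwaway policy full of copy promises aimed at the server under test. This is purely hypothetical (the paths, numbers, and the /tmp/loadtest.cf policy are all invented), and concurrent agents on one box will fight over locks and state under /var/cfengine, so in practice you'd probably want containers or chroots per worker.

# Hypothetical cf-serverd load generator -- a sketch, not a finished tool.
# Assumes /tmp/loadtest.cf is a standalone policy whose copy_from promises
# point at the cf-serverd instance under test.
import random
import subprocess
import time
from multiprocessing import Pool

WORKERS = 200   # simulated hosts per load-generating box
SPLAY = 180     # seconds, i.e. the "3 minute splaytime"
RUNS = 20       # agent runs per simulated host

def worker(n):
    for _ in range(RUNS):
        time.sleep(random.uniform(0, SPLAY))
        subprocess.call(
            ["/var/cfengine/bin/cf-agent", "--no-lock",
             "--file", "/tmp/loadtest.cf"],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

if __name__ == "__main__":
    Pool(WORKERS).map(worker, range(WORKERS))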

Thanks.
--Danny

Mike Svoboda

Oct 15, 2014, 9:06:41 AM
to Danny Sauer, help-c...@googlegroups.com
Hey Danny

I'm not sure you really need to stress test cf-serverd; it performs quite well under load. We did hit a scaling problem with CFEngine file transfers, though, and ended up having to move to a more decentralized model.

The "cf-serverd supports 5,000 clients" number is kind of a ballpark figure just thrown out there. If there's only a single policy file, then sure, 5,000 file transfers/comparisons over 5 minutes is a snap. But the work scales with clients times files: each file you add to the policy tree means another 5,000 stat/compare requests per cycle. With, say, 2,000 files and 5,000 clients on a 5-minute run interval, that's 10 million requests every 5 minutes, over 30,000 per second.


Anyway, here's the discussion about how we moved to this alternative file transfer mechanism. We still use CFEngine's native file transfer (cf-serverd), but it's a bit more intelligent than the default behavior:




Danny Sauer

Oct 15, 2014, 11:06:27 AM
to help-c...@googlegroups.com, dannysa...@gmail.com
On Wednesday, October 15, 2014 8:06:41 AM UTC-5, Mike Svoboda wrote:
Hey Danny

I'm not sure you really need to stress test cf-serverd; it performs quite well under load.

I'm not really testing cf-serverd itself, but rather testing how it behaves on top of a couple of different filesystems.  The details beyond that aren't really important to my question, but might be useful to someone else.  So:

Initially, I was just using a file copy promise on the secondaries to keep in sync with a single master.  That didn't scale well as the number of files increased, so I switched to a locally checked-out copy of the SVN repo.  But large svn working directories are like horses - they look pretty, but everything seems to kill them.  Ongoing maintenance of the svn thing got too tedious, and it was slow to do an svn update across ~100K files (and the associated .svn directories).  A while back, I moved to Gluster as the means of synchronizing several secondaries with a fail-over pair of primary servers, with a post-commit hook to update the GlusterFS on change.  My policy consists of 30-ish files which go to all systems, and another seven or so files which are specific to each system.  It's the system-specific set that causes the problem (with pretty much everything); I have tens of thousands of systems, and those files are very small.
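(For reference, that post-commit hook is conceptually no more than the sketch below; the real one does more error handling, and the paths here are invented. It assumes a working copy of the policy repo is already checked out on the Gluster mount.)

#!/usr/bin/env python
# Hypothetical SVN post-commit hook: refresh the checkout living on the
# replicated Gluster volume whenever the policy repo changes.
# SVN invokes hooks as: post-commit REPOS-PATH REV
import subprocess
import sys

REPOS, REV = sys.argv[1], sys.argv[2]
GLUSTER_CHECKOUT = "/mnt/gluster/masterfiles"  # invented mount point

subprocess.check_call(
    ["svn", "update", "--quiet", "--revision", REV, GLUSTER_CHECKOUT])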

The first iteration of the Gluster solution was a replicated volume across two bricks, with the CFEngine secondaries serving files directly from a gluster mount.  The gluster client is implemented using FUSE, so it has all the drawbacks of being a userspace filesystem, including really bad performance with lots of small files.  In addition, with a replicated gluster volume, a client doing a stat() call needs to contact every server holding a replica to ensure that the data it gets back is accurate.  So, with two replicas, the client first essentially has to ask "hey, youz guys, I need to know about this file" on two systems and wait for a response from both - which is slow on the same network, but particularly bad across a slow WAN link between geographically distributed sites.  While it performed OK with a couple of systems, the combined impact of both issues left cf-serverd unable to keep up when I ramped the client load up to the full ~3-5K per site.  CPU usage was low, but load average went through the roof because the system was constantly waiting on I/O, and clients would time out before cf-serverd could respond.

Since then, I've moved to using Gluster replication, as Gluster 3.4 can replicate from a volume to a local filesystem.  The replication is neat: since the filesystem knows what has changed, it can basically do a smart rsync of just what's changed recently (or since the last sync, if a node is temporarily down).  So I replicate from the central Gluster filesystem to a local XFS filesystem on each secondary.  However, the amount of activity and the number of small files cause memory issues in the Gluster client, which eventually makes the cf-serverd process keel over when the system starts running low on memory and connections back up.

Also, Gluster 3.4 has poor status tracking on replication.  Gluster 3.5 can show the replication status of replicated volumes with more granularity, and is supposed to have a couple of tweaks which should fix the memory issue I'm seeing, but it can't replicate to a local filesystem any more.

So, the reason for wanting to generate load on cf-serverd is to see how the underlying synchronization mechanism handles what it will be subjected to.  The options I'm weighing are:

- implement a Gluster plugin to export from a replicated volume to a local filesystem;
- replicate to a volume with a brick local to each CFEngine secondary, and use an NFS loopback mount to each local replicated volume (this is the easiest solution and removes the FUSE drawbacks, so I hope it works);
- implement a FUSE filesystem which provides the per-host configuration files from an SQL view rather than "real" files (I implemented this a while back and it seems awesome, but I haven't validated it under load; see the sketch below); or
- go with some other solution if none of those behave like I want.

But to determine whether they behave like I want, I either have to temporarily shift production load over to a test system or generate a synthetic load.  I'd rather generate a synthetic load. :)
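For the curious, the SQL-backed FUSE idea is conceptually something like this. It's a minimal illustration rather than what I actually run: fusepy and SQLite (with a host_files table standing in for the real SQL view) are stand-ins, and all of the names are invented.

#!/usr/bin/env python
# Minimal read-only FUSE filesystem serving one virtual file per
# (host, name) row in a database, via fusepy (pip install fusepy).
import errno
import sqlite3
import stat
import sys
import time

from fuse import FUSE, FuseOSError, Operations

class SQLViewFS(Operations):
    def __init__(self, db_path):
        self.db = sqlite3.connect(db_path)
        self.start = time.time()

    def _lookup(self, path):
        # Paths look like /<hostname>/<filename>
        parts = path.strip("/").split("/")
        if len(parts) != 2:
            return None
        row = self.db.execute(
            "SELECT content FROM host_files WHERE host = ? AND name = ?",
            parts).fetchone()
        return row[0].encode() if row else None

    def getattr(self, path, fh=None):
        times = dict(st_ctime=self.start, st_mtime=self.start,
                     st_atime=self.start)
        # Treat / and any single path component as a host directory,
        # to keep the sketch short.
        if len(path.strip("/").split("/")) < 2:
            return dict(st_mode=stat.S_IFDIR | 0o555, st_nlink=2, **times)
        data = self._lookup(path)
        if data is None:
            raise FuseOSError(errno.ENOENT)
        return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1,
                    st_size=len(data), **times)

    def readdir(self, path, fh):
        if path == "/":
            rows = self.db.execute("SELECT DISTINCT host FROM host_files")
        else:
            rows = self.db.execute(
                "SELECT name FROM host_files WHERE host = ?",
                (path.strip("/"),))
        return [".", ".."] + [r[0] for r in rows]

    def read(self, path, size, offset, fh):
        data = self._lookup(path)
        if data is None:
            raise FuseOSError(errno.ENOENT)
        return data[offset:offset + size]

if __name__ == "__main__":
    # Single-threaded so the sqlite connection stays on one thread.
    FUSE(SQLViewFS(sys.argv[1]), sys.argv[2], foreground=True,
         nothreads=True, ro=True)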
 
   We did hit a scaling problem with CFEngine file transfers, though, and ended up having to move to a more decentralized model.

Amusingly enough, the "how might you solve this problem of distributing config files to a bunch of machines" question comes up in your SRE phone screening process, leading to a minor ethical dilemma for someone who happens to follow this mailing list and be familiar with the BitTorrent solution that the phone screener is probably most familiar with. :)

--Danny

Mike Svoboda

Oct 15, 2014, 11:24:27 AM
to Danny Sauer, help-c...@googlegroups.com
HA!  Too bad we couldn't recruit you!  Seriously though, an SRE role is like a meat grinder.  Being on the hook for some other engineer's code means late-night / early-morning pages and a decrease in quality of life.

One question I have: it seems like you must be doing a lot of "straight file copies" using md5 comparison if you have such large filesystem demands.  Maybe instead of each host having its own set of unique files, those policies should use variable-based templates?  If you need to compute something per host, like IP addresses or hostnames, maybe rely on DNS queries via execresult to populate a variable, and then expand that templated file?  I think whatever you're doing that's host-specific and causing such a huge filesystem structure might be better accomplished with a different policy-based approach.  And if you ever have to make a change to one of these host-specific files, that would be a massive 10k+ file commit...
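To illustrate what I mean (Python here only because it's compact; in actual policy you'd populate variables with execresult and expand a template file), the shape of it is: derive the per-host values at runtime and keep a single shared template in the repo.  Everything below, names and template contents, is invented for the example.

# Python illustration only, not CFEngine syntax: compute per-host
# values at runtime, then expand one shared template instead of
# keeping a unique file per host in the repo.
import socket
import string

def host_values():
    host = socket.getfqdn()
    return {
        "hostname": host,
        "ip": socket.gethostbyname(host),  # stand-in for a DNS query
    }

TEMPLATE = string.Template(
    "# generated, do not edit\n"
    "ListenAddress $ip\n"
    "HostKeyAlias $hostname\n")

with open("/tmp/sshd_config.snippet", "w") as out:
    out.write(TEMPLATE.substitute(host_values()))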

> are like horses - they look pretty, but everything seems to kill them.

lulz   



Danny Sauer

Oct 15, 2014, 12:26:34 PM
to help-c...@googlegroups.com, dannysa...@gmail.com
The host-based files *are* the variables which adjust the policy (I posted something yesterday about how moving those to JSON has made things somewhat easier for me).  They define things like which users should be on a system, what the users' passwords should be, which ssh keys they should trust, what sudo rules should be available, and that sort of thing.  It's done this way so that the sensitive parts of any given host's configuration are only available on the host itself, reducing the useful information available should an individual host be compromised in some way.  While it would definitely be more efficient to collapse that into a single file, that would be compromising security principles to fit a product rather than adjusting the product to fit our desired security outcomes.  Given that we're subject to pretty much every kind of regulation known to the US (payment card processing, health information handling, various financial regulations from banks and investment funds, etc.), it behooves us to err on the side of security when possible.  If I weren't reading about a new high-profile breach every other day, perhaps I'd have a higher tolerance for weakening security in favor of saving myself a whole bunch of time... :)

Oh, and for the record, not working for LinkedIn does not free me from supporting other people's junk code (or sometimes my own junk code) at all hours of the night... ;)

--Danny

Eystein Måløy Stenberg

Oct 15, 2014, 1:42:47 PM
to help-c...@googlegroups.com
Just a related note: cf-serverd 3.5 and newer perform much better under
CPU-intensive load than past versions, because they make much better use
of multiple cores.


--

Eystein