Part 1 - History and Present Day File Transfer Architecture - Hacking Cfengine's File Transfer

127 views

Skip to first unread message

Mike Svoboda

unread,

Feb 24, 2014, 8:24:30 AM2/24/14

to help-c...@googlegroups.com, ma...@cfengine.com

Introduction

If Cfengine's execution against promises.cf (apply automation policies to a machine) is the brain of LinkedIn's Operations, then Cfengine's file transfer infrastructure is the backbone of our company. Our entire Operation depends on being able to deliver updated data to tens of thousands of machines with a very low update resolution.
Everything we do as a company depends on Cfengine's core file transfer mechanism:
- We install or remove user accounts by shipping manifest.txt, a LDAP database dump and then performing local /etc/passwd user additions or removals.
- The tools team pushes code to /usr/local/linkedin via Bittorrent. Cfengine acts as the tracker and uses this file transfer to distribute .torrent metadata to clients.
- Our code deployment system, LID and GLU, and underlying Salt system are monitored for their health, bits continuously updated, etc.
- Java and Python are installed, maintained, and upgraded on a continuous basis. These two programming virtual machines make up the bulk of our production platform.
- Policy updates, which is how we administrate our systems is pushed 10 to 15 times a day.
- Range, our lookup system, has its maps distributed among the Cfengine infrastructure. Range has become as critical of a lookup service for Linkedin as DNS.
- Endless other examples of extremely critical parts of infrastructure, tied together with the glue from Cfengine automation.
Without a functioning and healthy Cfengine file transfer mechanism, LinkedIn's Production Operations would become paralyzed.
- In the past 24 hours:
  - 47 unique files were distributed to a single host
  - 60 changes were applied to a single host
  - 11,768 promises are kept on a single host in a 5 minute update interval, 16,945,920 promises are verified over 24 hours on a single host.
- Scaling this upwards of all infrastructure1,141,254 files were distributed across all of production
  - 1,456,920 changes were executed across all of production
  - 411,480,829,440 promises were kept across all of production. Each promise verifies that our infrastructure does not experience configuration drift. Machines converge according to our policy set. They apply automation defined in promises.cf.
- This level of change is required just as a daily maintenance of production, not modifying it. Without Cfengine's ability to enforce automation updates, we would lose our ability to administrate production. At our scale, it would be impossible for the same level of maintenance to be serviced by a team of hundreds of system administrators.

How classical Cfengine (cf-serverd, the server process to cf-agent, the client) file transfers operate

Every 5 minutes, every machine "wakes up" according to our defined schedule and performs a Cfengine file transfer. Once the file transfer is complete and the machine has downloaded new configuration files, it executes them. This is how we push automation changes to production.
The client machine running the cf-agent process performs MD5 digest comparisons with the server running the cf-serverd daemon.
The client looks at the data it has on disk, generates the MD5 for a specific file, and compares that MD5 sum with cf-serverd. If cf-serverd reports a different digest than what the client has, the client pulls the updated file.

Client side, verbose execution of the MD5 digest comparison

cf3>     .........................................................
cf3>      Promise's handle: 'daily_full_filesystem_copy'
cf3>      Promise made by: '/var/cfengine/inputs'
cf3>     .........................................................
cf3>
cf3> Handling file existence constraints on '/var/cfengine/inputs'
cf3> Copy file '/var/cfengine/inputs' from '/var/cfengine/masterfiles/generic_cf-agent_policies' check
cf3> GetIdleConnectionToServer: no existing connection to '172.18.41.113' is established...
cf3> Set cfengine port number to '5308' = 5308
cf3> Set connection timeout to 30
cf3> Connect to 'dc5-cfe-test.corp.cfengine.com' = '172.18.41.113' on port '5308'
cf3> .....................[.h.a.i.l.].................................
cf3> Strong authentication of server 'dc5-cfe-test.corp.cfengine.com' connection confirmed
cf3> Public key identity of host '172.18.41.113' is 'MD5=2798e2bf3ff8182ef92b75ea5f835843'
cf3> Destination purging enabled
cf3> Entering directory '/var/cfengine/masterfiles/generic_cf-agent_policies'
cf3> Destination file '/var/cfengine/inputs/cfengine_stdlib.cf' already exists
cf3> Additional promise info: handle 'daily_full_filesystem_copy' source path '/var/cfengine/inputs/scale_cfengine_data_transfers.cf' at line 262
cf3> File permissions on '/var/cfengine/inputs/cfengine_stdlib.cf' as promised
cf3> Additional promise info: handle 'daily_full_filesystem_copy' source path '/var/cfengine/inputs/scale_cfengine_data_transfers.cf' at line 262
cf3> File '/var/cfengine/inputs/cfengine_stdlib.cf' is an up to date copy of source
cf3> Destination file '/var/cfengine/inputs/mps_yum_servers.cf' already exists
cf3> Additional promise info: handle 'daily_full_filesystem_copy' source path '/var/cfengine/inputs/scale_cfengine_data_transfers.cf' at line 262
cf3> File permissions on '/var/cfengine/inputs/mps_yum_servers.cf' as promised
cf3> Additional promise info: handle 'daily_full_filesystem_copy' source path '/var/cfengine/inputs/scale_cfengine_data_transfers.cf' at line 262
cf3> File '/var/cfengine/inputs/mps_yum_servers.cf' is an up to date copy of source
cf3> Destination file '/var/cfengine/inputs/check_splunk_installed.cf' already exists
cf3> Additional promise info: handle 'daily_full_filesystem_copy' source path '/var/cfengine/inputs/scale_cfengine_data_transfers.cf' at line 262
cf3> File permissions on '/var/cfengine/inputs/check_splunk_installed.cf' as promised
cf3> Additional promise info: handle 'daily_full_filesystem_copy' source path '/var/cfengine/inputs/scale_cfengine_data_transfers.cf' at line 262
cf3> File '/var/cfengine/inputs/check_splunk_installed.cf' is an up to date copy of source
cf3> Destination file '/var/cfengine/inputs/manage_root_crontab_entries.cf' already exists
cf3> Additional promise info: handle 'daily_full_filesystem_copy' source path '/var/cfengine/inputs/scale_cfengine_data_transfers.cf' at line 262
cf3> File permissions on '/var/cfengine/inputs/manage_root_crontab_entries.cf' as promised
cf3> Additional promise info: handle 'daily_full_filesystem_copy' source path '/var/cfengine/inputs/scale_cfengine_data_transfers.cf' at line 262
cf3> File '/var/cfengine/inputs/manage_root_crontab_entries.cf' is an up to date copy of source
cf3> Skipping matched excluded directory '/var/cfengine/masterfiles/generic_cf-agent_policies/.svn'
cf3> Destination file '/var/cfengine/inputs/mps_bittorrent_tracker.cf' already exists
cf3> Additional promise info: handle 'daily_full_filesystem_copy' source path '/var/cfengine/inputs/scale_cfengine_data_transfers.cf' at line 262
cf3> File permissions on '/var/cfengine/inputs/mps_bittorrent_tracker.cf' as promised
cf3> Additional promise info: handle 'daily_full_filesystem_copy' source path '/var/cfengine/inputs/scale_cfengine_data_transfers.cf' at line 262
cf3> File '/var/cfengine/inputs/mps_bittorrent_tracker.cf' is an up to date copy of source
cf3> Destination file '/var/cfengine/inputs/garbage_collection.cf' already exists
...
...
this continues for 1700+ more entries..

The cost of cf-serverd to service a client

When our clients execute, they traverse 4 directories to perform downloads:
- /var/cfengine/inputs, where the bulk of our policies and configurations reside. 1481 files
- /var/cfengine/inputs_site_specific: 44 files
- /var/cfengine/modules: 203 files
- /var/cfengine/torrents: 2 files
Total = 1730 files per client, per execution.
Clients execute every 5 minutes. This means that every 5 minutes, cf-serverd is having to perform 1730 MD5 sum comparisons over the network to support a single machine. This happens even when zero files have changed between the client and server.

CPU load of cf-serverd on a busy Cfengine policy server

On modern hardware, performing this many MD5 digest comparisons is an extreme amount of CPU resources.

Although we have redundant policy servers for failover, in this state, the Cfengine servers are being pushed to the limits of what the hardware of the machine can physically support.
Thousands of client machines are connecting to cf-serverd making thousands of requests.
In the event that we lose one or two policy servers, the remaining infrastructure would have difficulty providing the capacity to service production.

Running in a non-degraded state, we're slamming our infrastructure

As bad as the above sounds, this is as good as things can be in an optimal state.

We use software load balancing via select_class to perform file transfers

We use software load balancing via Cfengine's select_class to distribute clients over 4x Cfengine policy servers in each cage.
- https://cfengine.com/docs/master/reference-promise-types-classes.html#select_class
For details about how select_class works, I presented the below webinar with Cfengine last year detailing our use of shared_global_environment. Start watching at 16:38, with the introduction of shared_global_environment and the use of select_class.

https://www.youtube.com/watch?feature=player_embedded&v=zYSLBbFWlT8

Behavior of select_class in our largest production core in DC1

In DC1 core15, we have 4547 clients and 4x Cfengine policy servers.
Our 5 minute client execution schedule contains a 4 minute splaytime.
- https://cfengine.com/docs/3.5/reference-components-cfexecd.html#splaytime
- From a 10,00 view, splaytime is a "random back-off period" that every client computes. By generating a hash using the hostname of the machine and IP address, clients evenly balance themselves out over a given period. This allows traffic to be distributed to Cfengine cf-serverd process over a given scheduled interval.
This means that clients are actively executing Cfengine in minutes 1 through 4 during a 5 minute period. The 5th minute is a "rest" period, where there is no Cfengine network file transfer activity.

The 4547 machines using select_class evenly balance themselves out between the 4x Cfengine policy servers.
We can view the number of clients per MPS by querying sysops-api. If you are unfamiliar with sysops-api, it is a system that allows us to crawl arbitrary files or remotely executed commands across tends of thousands of machines in seconds.

[msvoboda@dc5-infra01 ~]$ extract_sysops_cache.py --site dc1 --search cm.conf --contents | grep PRIMARY_MPS | sort | uniq -c | grep dc1-core15
   1121 PRIMARY_MPS:dc1-core15-mps01.prod.cfengine.com
   1172 PRIMARY_MPS:dc1-core15-mps02.prod.cfengine.com
   1141 PRIMARY_MPS:dc1-core15-mps03.prod.cfengine.com
   1113 PRIMARY_MPS:dc1-core15-mps04.prod.cfengine.com

So, we've divided the 4547 machines up across the 4x Cfengine policy servers pretty evenly with select_class. Software load balancing works.
On a host, you can look at /etc/cm.conf to determine how select_class has applied software load balancing.
- /usr/local/linkedin/bin/whatami, the Python library implemented by sre-infra, relies on /etc/cm.conf to provide a randomized range server. This object is imported by several Python utilities to determine fabric and perform range lookups.
- This means that outside of Cfengine execution, we utilize the software load balancing to perform range lookups. Any other infrastructure services that we stand up on the Cfengine MPS could take advantage of the load balancing that select_class provides.

[msvoboda@dc1-app9010 ~]$ cat /etc/cm.conf
/etc/cm.conf regnereated at Mon Feb 17 14:42:35 2014
PRIMARY_MPS:dc1-core17-mps02.prod.cfengine.com
SECONDARY_MPS:dc1-core17-mps03.prod.cfengine.com
THIRD_MPS:dc1-core17-mps04.prod.cfengine.com
FORTH_MPS:dc1-core17-mps01.prod.cfengine.com
ENV_SITE:PROD@dc1
MACHINE_TYPE:APP_SERVER
ACCT_TYPE:app_acct
RHEL_LI_RELEASE_VERSION:rh6_release_x86_64_r5
LINUX_HARDWARE_PLATFORM:UCSC_C220_M3L



[msvoboda@dc1-app9010 ~]$ whatami
FABRIC_NAME=prod-dc1
FQDN=dc1-app9010.prod.cfengine.com
SITE=dc1
USING_DEFAULT_FABRIC=1
RANGE_SERVER=dc1-core17-mps02.prod.cfengine.com

Behavior of clients connecting to the 4x Cfengine policy servers in dc1 P2 over our 5 minute schedule

Schedule = 5 minutes

Splaytime = 4 minutes

Minute 1: dc1-core15-mps01.prod = 1121 clients connect to cf-serverd, each client performing 1730 MD5 sum comparisons = 1,939,330 MD5 comparisons in 60 seconds = 32,322 MD5 comparisons a second.

dc1-core15-mps02.prod is idle

dc1-core15-mps03.prod is idle

dc1-core15-mps04.prod is idle

Minute 2: dc1-core15-mps02.prod = 1172 clients connect to cf-serverd, each client performing 1730 MD5 sum comparisons = 2,027,560 MD5 comparisons in 60 seconds = 33,792 MD5 comparisons a second

dc1-core15-mps01.prod finishes requests from minute 1

dc1-core15-mps03.prod is idle

dc1-core15-mps04.prod is idle

Minute 3: dc1-core15-mps03.prod = 1141 clients connect to cf-serverd, each client performing 1730 MD5 sum comparisons = 1,973,930 MD5 comparisons in 60 seconds = 32,898 MD5 comparisons a second

dc1-core15-mps01.prod is idle

dc1-core15-mps02.prod finishes requests from minute 2

dc1-core15-mps04.prod is idle

Minute 4: dc1-core15-mps04.prod = 1113 clients connect to cf-serverd, each client performing 1730 MD5 sum comparisons = 1,925,490 MD5 comparisons in 60 seconds = 32,091 MD5 comparisons a second

dc1-core15-mps01.prod is idle

dc1-core15-mps02.prod is idle

dc1-core15-mps03.prod finishes requests from minute 3

Minute 5:

dc1-core15-mpa01.prod is idle

dc1-core15-mps02.prod is idle

dc1-core15-mps03.prod is idle

dc1-core15-mps04.prod finishes requests from minute 4

Losing a Cfengine server would degrade production infrastructure services to an unsupportable state

Other than servicing Cfengine file transfer requests, the Cfengine servers perform as major pieces of production infrastructure, offering the following various services:
- Bittorrent Trackers
- Bittorrent Seeders of /usr/local/linkedin media
- Range Servers
- Redis Servers supporting SYSOPS-API
- YUM Servers for RPM software installation across production
  - Servicing RHEL / EPEL YUM repos for O/S upgrades, installations, server builds, package updates
  - Serving SRE YUM repos for production software deployments.
If one of the 4x Cfengine servers above went offline, its client load would balance over the 3x remaining servers.
The additional MD5 digest comparison load would degrade the network and CPU utilization on these machines to where we would DDoS our own production infrastructure.

Cfengine responds with cf_promises_validated

On December 9th 2010, Cfengine released version 3.1.2 of the community edition product. This new version included a new feature called cf_promises_validated:
- http://www.blogcompiler.com/2010/12/29/cfengine-3-1-2-extended-change-log/
- http://www.cmdln.org/2012/10/24/cfengine-3-policy-update-or-how-cf_promises_validated-works/

In short, Cfengine created a "flag" that determined when policies had updated. If this flag updated, then clients would then enter this state with cf-serverd of performing millions of MD5 comparisons to check for new data to download. The idea behind cf_promises_validated is that the majority of file transfers (there are 1440 in a 24 hour day), cf-serverd would only enter this state a few times a day instead of always being in this state.
Quoting Mark Burgess, founder and CTO of Cfengine:

https://groups.google.com/d/msg/help-cfengine/91PCP090ZZw/2mPOBrXP2_cJ
Hi David,

the original idea behind the promises_validated marker was to avoid the need to perform a lengthy server-intensive search for changed
policy files in a large policy file tree.

Suppose you are an organization with thousands of hosts, and possibly hundreds of policy files. If you checked every file for every client on the server every five minutes, that would be computationally very time consuming, as each check would require a contentious server-side search. The result is a scaling bottleneck.

The idea of the validation file was as a server-side certification that there was something worth searching for. By having a single file with known location and name, the search is reduced to a trivial time-stamp "stat" which is hiundreds of times cheaper. That scales easily to thousands of hosts every five mins. Only if the validation "certificate" was changed would the agent bother to perform an update. This can only work if a certain discipline is maintained of course.

https://cfengine.com/archive/manuals/st-scale#Scalable-policy-strategy

This mechanism is what allows CFEngine to roll out changes in under five minues on average in a massive environment without trying to "push". Does this make sense?

In the future, this file could actually contain a list of files that differ. Somehow, this never moved forward,

M

Why cf_promises_validated was a good idea in intention, but ended up being a flawed implementation

The problem with this approach is that it required cf_promises_validated to update when media had actually updated, or to not update when media had not updated. Unfortunately, this wasn't always the case.
In Cfengine 3.3.0, cf_promises_validated was modified again.
- http://cfengine.com/blog/cfengine-330-release-notes
In short, cf_promises_validated has been a buggy experience for customers using it since its inception. Customers have been reporting about it always being updated, therefore, cf-serverd would always enter this "bad" state, or it wouldn't have been updated when media on disk had changed, but wasn't visible to cf-promises.

LinkedIn could never have used cf_promises_validated

There are two problems with cf_promises_validated that caused LinkedIn from being able to implement it.

The majority of data that we deliver to clients aren't Cfengine policy files

For cf-promises to execute and update cf_promises_validated, then cf-promises actually has to be able to detect change. This only happens during Cfengine policy file updates.
Out of the 1730 files that we deliver to clients, only 105 of them are Cfengine policy files. This only accounts for 6% of data distributed to clients.
- This means that if we update any of the 1625 files that aren't Cfengine policy files, cf_promises_validated would not update. Therefore, clients would not detect the change and would not download new data.
Our Cfengine policy servers execute several RSYNC commands to pull data outside our SVN repository.
- We download the TOOLS team tarballs, which contain the code for /usr/local/linkedin
- The sudoers USERS/GROUPS mapping, which determines what users belong in which groups. This data is scraped from LDAP
- Several other examples where we need to deliver data reliably from single sources of truth to tens of thousands of machines to be consumed by other pieces of automation.

This bad state isn't avoided. It just doesn't happen all of the time.

Even if cf_promises_validated worked for us as intended, it doesn't solve the problem of cf-serverd being requested to perform millions of MD5 digest comparisons in a short amount of time. It just makes this event happen a few times a day.
CPU load on the Cfengine policy servers would spike during automation changes as they were attempting to push out a single policy file.
- Just changing a single file causes cf-serverd to enter this bad state.
I would rather always run in worst state, and know from a capacity planning standpoint what we had to work with.
My main issue with cf_promises_validated is that it attempts to cover up the root problem. The root problem is that cf-serverd has to perform millions of MD5 digest comparisons even when only one file has updated.

Reply all

Reply to author

Forward

0 new messages