So, here's an idea; I think you're on the right track with a two-stage
file transfer.
* cf-serverd continuously performs a filesystem scan under
$(sys.masterfiles) to update its hashtable. If it detects an md5sum
change for an object, it updates an in-memory variable (a uuid?)
saying something new is available.
* I personally would avoid multicast. Several shops run firewalls that
will block this traffic. Also, I'm not sure what the code complexity
of implementing multicast is here, but it may not be trivial.
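A minimal sketch of that server-side scan, in Python purely for
illustration; scan_masterfiles, ServerState, and using a uuid4 as the
generation id are my inventions, not cf-serverd internals:

```python
# Sketch: walk a masterfiles tree, md5 every file, and bump a
# generation id (a uuid4 here) whenever any digest changes. Clients
# only need to compare this one id to know "something changed".
import hashlib
import os
import uuid

def scan_masterfiles(root):
    """Return {relative_path: md5hex} for every file under root."""
    table = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                digest = hashlib.md5(f.read()).hexdigest()
            table[os.path.relpath(path, root)] = digest
    return table

class ServerState:
    """In-memory state the server would keep: hashtable + generation uuid."""
    def __init__(self, root):
        self.root = root
        self.table = {}
        self.generation = uuid.uuid4().hex

    def rescan(self):
        new = scan_masterfiles(self.root)
        if new != self.table:                    # something changed on disk
            self.table = new
            self.generation = uuid.uuid4().hex   # clients will see a new uuid
        return self.generation
```

In a real daemon this loop would run continuously (or be driven by
filesystem notifications, as discussed below) rather than on demand.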
When a client looks to perform an update, this could be the workflow:
1. cf-agent wakes up, sees that its previous file transfer with the
policy server had uuid 12345.
2. cf-agent queries cf-serverd for the current uuid. cf-serverd offers
uuid 45678 <-- the agent realizes that something changed in
$(sys.masterfiles) on the policy server.
3. cf-agent downloads the hashtable map from cf-serverd for
$(sys.masterfiles)/inputs.
4a. cf-agent performs a recursive search on $(sys.inputs), compares it
with the hashtable it grabbed from cf-serverd, and determines that
file XYZ needs updating.
4b. Or: cf-agent performs the recursive search, compares with the
hashtable, but realizes nothing has changed. Maybe
$(sys.masterfiles)/modules had an updated file?
5. If 4a, cf-agent requests file XYZ from cf-serverd.
6. cf-agent saves uuid 45678 as the last known good state of
cf-serverd.
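The six steps above could be sketched like this (Python for
illustration; fetch_generation, fetch_hashtable, and fetch_file are
hypothetical stand-ins for whatever protocol cf-agent and cf-serverd
would actually speak):

```python
# Sketch of the client workflow: one cheap generation check first,
# then a hashtable diff, then per-file transfers only for what changed.
import hashlib
import os

def local_hashtable(root):
    """md5 every file under the local inputs directory."""
    table = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                table[os.path.relpath(path, root)] = \
                    hashlib.md5(f.read()).hexdigest()
    return table

def agent_update(last_seen, server, inputs_dir):
    """Steps 1-6: return (new uuid, list of files that were refreshed)."""
    current = server.fetch_generation()     # step 2: one tiny query
    if current == last_seen:
        return last_seen, []                # nothing changed, bail early
    remote = server.fetch_hashtable()       # step 3
    local = local_hashtable(inputs_dir)     # step 4: recursive local scan
    stale = [p for p, h in remote.items() if local.get(p) != h]
    for path in stale:                      # step 5: only changed files
        server.fetch_file(path, os.path.join(inputs_dir, path))
    return current, stale                   # step 6: save the new uuid
```

The point of the sketch is that the expensive work (steps 3-5) only
happens after the one-packet uuid comparison in step 2 says it must.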
It's funny you mentioned BitTorrent. Facebook also uses BitTorrent to
push code. It does have value if you need to push a massive amount of
data (gigabytes) to several thousand machines. We are actually
evaluating BitTorrent as well in a new project to distribute data to
end nodes.
For the most part, cf-serverd doesn't serve large objects. It just has
thousands of small policy files, configuration files, and other tiny
objects that it has to move around. Unless you need to move a massive
amount of data, I don't know if this has much value.
On 10/8/13 11:04 AM, "Ted Zlatanov" <t...@lifelogs.com> wrote:
>On Tue, 8 Oct 2013 14:14:51 +0000 Mike Svoboda <msvo...@linkedin.com> wrote:
>
>MS> So, instead of cf_promises_validated which is file based, you want
>MS> something memory based? What problem does this solve that
>MS> cf_promises_validated doesn't already provide?
>
>It scales better than transferring a disk file and can be updated in
>memory by another thread without locking a disk file.
>
>You can also have multiple "client profiles" and have a separate
>timestamp for each one.
>
>Finally, I don't like depending on magic files, personally.
>
>MS> With cf_promises_validated and how it's currently designed, you
>MS> assume this file gets updated when cf-promises detects policy
>MS> updates. This isn't the case. We have several examples where our
>MS> policy servers grab data from external sources via WGET or RSYNC,
>MS> drop it into /var/cfengine/masterfiles, and assume clients are
>MS> going to pick up that data as soon as it's available. cf-promises
>MS> isn't going to detect that something has changed.
>
>Right; my idea is to have cf-serverd see filesystem notifications.
>Running cf-promises is a separate task and could be done only for some
>"client profile" as, essentially, a pre-commit hook.
>
>MS> Also, once your in-memory variable is multicasted out or whatever
>MS> -- then clients have to perform a full recursive MD5 comparison
>MS> with cf-serverd? Yes, maybe offering a small payload to tell
>MS> clients "Hey! I have new things!" is more efficient, but then
>MS> clients have to perform a full recursive MD5 sum crawl with
>MS> cf-serverd!
>
>The secondary transfer can use rsync, a differential map as you
>described, or the current protocol. I'm talking about the primary
>transfer, the one where 100K clients are hitting you to check for
>updates. That code path should be optimized first and be most
>scalable. My claim is that if you do the primary transfer in one
>packet, that lets the server breathe a bit on the secondary transfers.
>
>MS> If you move towards the model of maintaining a map of md5 objects,
>MS> clients can selectively say -- cf-serverd, I'm grabbing file
>MS> object 12345. That's all I need. I don't need to have you perform
>MS> an md5sum comparison of the other 10,000 objects under
>MS> $(masterfiles).
>
>MS> Yes, cf-serverd has to offer the md5sum map out to every client at
>MS> request time -- it's actually far less network overhead when an
>MS> update needs to happen. It just has to offer one object; it
>MS> doesn't have to satisfy a full filesystem scan and the network
>MS> overheads associated with that.
>
>I think that's a fine way to do the secondary transfer, but don't know
>if it's better than pure rsyncd or the current mechanism; I'd test it.
>An integrated solution is probably simplest and keeping track of a
>secondary file index is probably a bit of extra work, but I really have
>no strong opinion here.
>
>Another direction for cf-serverd is to use something like
>
>https://blog.twitter.com/2010/murder-fast-datacenter-code-deploys-using-bittorrent
>
>A truly distributed strategy would work very well in geographically
>distributed environments, but you can't predict the load on individual
>clients as well.
>
>Ted