Here are some random thoughts, both on your design and Google's internal system.
* Inside Google, we have a single system for caching and execution.
The implementation is heavily based on Linux and Google's production
environment.
* The system is built around content-addressable storage (CAS):
files, stdout/stderr values, and also the action itself (command line,
environment, inputs, output names) are keyed by their checksums. For
structured data, the keying is done over the protobuf serialization.
The checksum used (as in Bazel) is MD5. Given MD5's security record,
this is unfortunate, but changing it requires a concerted effort
between the build system, the version control system, and all storage
systems involved.
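To make the keying concrete, here is a minimal sketch; the canonical
string format and field names are invented for illustration (the real
system hashes the protobuf serialization):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Illustrative only: a CAS-style action key is the MD5 of a canonical
// serialization of the action (command line, environment, input digests,
// output names). The string layout below is a stand-in for protobuf.
public class ActionKey {
  static String md5Hex(byte[] data) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    StringBuilder sb = new StringBuilder();
    for (byte b : md.digest(data)) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }

  public static void main(String[] args) throws Exception {
    String canonicalAction = String.join("\n",
        "cmd=gcc -c f.c -o f.o",
        "env=PATH=/usr/bin",
        "input=f.c:0123456789abcdef0123456789abcdef",
        "output=f.o");
    System.out.println(md5Hex(canonicalAction.getBytes(StandardCharsets.UTF_8)));
  }
}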
* For performance reasons, all of the data in the execution system is
kept in memory. The execution machines make very clever use of memory:
the inputs are stored on a RAM-backed tmpfs, and the command then runs
inside the same file system (but as a different user). The execution
machines also function as a distributed cache, which halves the RAM
requirements of the system. It does make the complete system more
complex, though, because it needs a separate component to keep track
of which file is stored on which machine.
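Conceptually, that location-tracking component can be as simple as a
map from content digest to the machines currently holding the blob;
this is just a sketch of the idea, not how the internal component
actually works:

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a blob location index: machines report what they store (and
// evict), and lookups return candidate hosts to fetch a digest from.
public class BlobLocationIndex {
  private final Map<String, Set<String>> locations = new ConcurrentHashMap<>();

  public void reportStored(String digest, String machine) {
    locations.computeIfAbsent(digest, d -> ConcurrentHashMap.newKeySet()).add(machine);
  }

  public void reportEvicted(String digest, String machine) {
    Set<String> hosts = locations.get(digest);
    if (hosts != null) {
      hosts.remove(machine);
    }
  }

  public Set<String> machinesWith(String digest) {
    return locations.getOrDefault(digest, Set.of());
  }
}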
* There is a frontend that Bazel connects to in order to ask for a
result. IIRC, Bazel checksums the whole request into a single MD5 and
sends that. If the result is in cache, the (cached) reply is sent
back. The reply typically does not contain stdout/stderr inline; they
are sent back as additional output files (useful for test logs, which
can be very large and are not always interesting).
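In interface terms, the lookup might look roughly like this (the type
and method names are invented for illustration); note that stdout and
stderr are referenced by digest like any other output, not inlined:

import java.util.List;
import java.util.Optional;

// Illustrative shapes only, not the real protocol: the cached result refers
// to outputs, including stdout/stderr, by digest instead of carrying them inline.
public class FrontendSketch {
  record OutputRef(String path, String digest) {}
  record CachedResult(int exitCode, List<OutputRef> outputs,
                      String stdoutDigest, String stderrDigest) {}

  interface ActionCache {
    // The key is the MD5 of the whole serialized request.
    Optional<CachedResult> lookup(String requestDigest);
  }

  static CachedResult executeRemotely(String requestDigest) {
    // Placeholder for the "not cached" path: run the action, store the result.
    throw new UnsupportedOperationException("not part of this sketch");
  }

  static CachedResult getResult(ActionCache cache, String requestDigest) {
    return cache.lookup(requestDigest).orElseGet(() -> executeRemotely(requestDigest));
  }
}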
* There may have been a time when compilers, the JDK, etc. were
installed on the execution machines, but that makes compiler upgrades
difficult to manage. Nowadays, we check the compiler into our version
control system. This is something that Bazel supports too, and I
recommend it for a setup that uses remote execution. Alternatively, if
you run the executor in a (Docker) container or similar, you have to
add the checksum or name of that container to the cache key.
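Mixing the container identity into the key could look something like
this; the image name/digest below is made up, and the point is only
that upgrading the image must invalidate results built with the old
toolchain:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch: fold the execution container's identity into the cache key so that
// a toolchain/image upgrade changes the key.
public class ContainerKey {
  public static void main(String[] args) throws Exception {
    byte[] serializedAction = "gcc -c f.c -o f.o".getBytes(StandardCharsets.UTF_8);
    String containerImage = "builder-image@sha256:...";  // hypothetical image name
    MessageDigest md = MessageDigest.getInstance("MD5");
    md.update(serializedAction);
    md.update(containerImage.getBytes(StandardCharsets.UTF_8));
    StringBuilder key = new StringBuilder();
    for (byte b : md.digest()) {
      key.append(String.format("%02x", b));
    }
    System.out.println(key);
  }
}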
* Inside Google, binaries, especially C++ ones, tend to be very large.
This makes remote execution expensive in terms of bandwidth: sending
the outputs back from the compile farm is limited by network
bandwidth, and in smaller offices with "limited" bandwidth, remote
compilation could easily swamp the network. This was solved by
adopting a FUSE file system that lazily loads build artifacts. See also
http://google-engtools.blogspot.de/2011/10/build-in-cloud-distributing-build.html.
I guess this could be done in open source too, but it adds considerable
complexity, since you have to write a FUSE daemon, store artifacts,
and then garbage collect them. If you don't do this, you might want to
consider a mechanism so that you don't download output files that
didn't change relative to the last run.
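A minimal version of that mechanism: compare the digest the server
reports for each output with the digest of the file already on disk,
and only fetch on a mismatch. fetchFromCas() here is a placeholder for
whatever transfer you use:

import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

// Sketch: skip downloading an output whose content digest matches what is
// already on disk from the previous run.
public class LazyDownload {
  static String md5Hex(byte[] data) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    StringBuilder sb = new StringBuilder();
    for (byte b : md.digest(data)) sb.append(String.format("%02x", b));
    return sb.toString();
  }

  static void maybeDownload(Path localPath, String remoteDigest) throws Exception {
    if (Files.exists(localPath)
        && md5Hex(Files.readAllBytes(localPath)).equals(remoteDigest)) {
      return;  // unchanged since the last run; nothing to transfer
    }
    byte[] content = fetchFromCas(remoteDigest);  // hypothetical CAS fetch
    Files.write(localPath, content);
  }

  static byte[] fetchFromCas(String digest) {
    throw new UnsupportedOperationException("placeholder for the actual transfer");
  }
}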
* Our compile cluster runs in our data center. This makes sense
because 1) computation is cheaper in our production environment, and
2) the environment is fully managed, so tasks get restarted if
machines break or tasks are preempted. The downside is that there is
considerable ping time between the client machine running Bazel and
the data center running the compilation. For batch compilation, that
may not be an issue, but for interactive edit/compile cycles, it adds
some overhead, and the structure of the protocol makes this worse,
e.g.:
bazel: execute action with checksum "abc"
remote: "abc" unknown; send the entire request
bazel: execute "gcc" input "f.c", md5: "123"
remote: "123" unknown; send me the file
bazel: uploads file with checksum "123"
remote: .. (waits for an execution machine to become available)
remote: oops, entry "123" was evicted from the cache, send it again.
etc. You can speculatively upload things to avoid roundtrips, but it
makes the code more complicated.
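One way to cut those round trips is to probe for and upload the input
blobs concurrently, before or alongside the execute call, instead of
waiting for the server to ask for each missing digest. A rough sketch,
with an invented CasClient interface:

import java.util.List;
import java.util.concurrent.CompletableFuture;

// Sketch: speculatively upload input blobs in parallel instead of waiting for
// the server to request each missing digest one round trip at a time.
public class SpeculativeUpload {
  interface CasClient {
    boolean contains(String digest);          // one round trip
    void upload(String digest, byte[] data);  // one round trip
  }

  record Input(String digest, byte[] data) {}

  static void uploadMissing(CasClient cas, List<Input> inputs) {
    // Fire all probes/uploads concurrently; join before sending the execute request.
    CompletableFuture.allOf(inputs.stream()
        .map(in -> CompletableFuture.runAsync(() -> {
          if (!cas.contains(in.digest())) {
            cas.upload(in.digest(), in.data());
          }
        }))
        .toArray(CompletableFuture[]::new)).join();
  }
}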
* In light of the last two items, it may be useful to consider whether
you want to design for a service running locally (0 ms ping time,
effectively infinite bandwidth, but you have to manage machines and
tasks yourself: what if a machine in the build farm runs out of disk
space?) or remotely (network limitations, but Google Cloud / AWS /
Kubernetes to manage deployments). I would suggest the former, as I
suspect it will be easier to implement.
* Is OSX a requirement? OSX is a bit of a pain in the behind: there
are no containers or namespaces, and the file systems are often
case-folding.
* When you have a remote execution system and it becomes popular,
people find all kinds of ways to abuse it: tasks that take large
amounts of CPU, tasks that run out of RAM, tasks that produce too much
output, tests that fail intermittently, tests that connect to the
internet, etc.
* A remote execution system should use our namespace sandbox (or a
moral equivalent), and it would be good if we could add resource
limits to our namespace sandbox, so it can impose limits on RAM/CPU
usage.
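On Linux, one rough way to bolt such limits onto a sandboxed command
is to wrap it in prlimit(1) (cgroups would be the more thorough
option); the limits below are examples only:

import java.util.ArrayList;
import java.util.List;

// Sketch: wrap the sandboxed command in prlimit(1) to cap address space and
// CPU time. Assumes Linux with util-linux installed; values are illustrative.
public class LimitedSandbox {
  static Process runWithLimits(List<String> command, long maxRamBytes, long maxCpuSeconds)
      throws Exception {
    List<String> wrapped = new ArrayList<>();
    wrapped.add("prlimit");
    wrapped.add("--as=" + maxRamBytes);    // virtual memory limit, in bytes
    wrapped.add("--cpu=" + maxCpuSeconds); // CPU time limit, in seconds
    wrapped.addAll(command);
    return new ProcessBuilder(wrapped).inheritIO().start();
  }

  public static void main(String[] args) throws Exception {
    Process p = runWithLimits(List.of("gcc", "-c", "f.c"), 2L * 1024 * 1024 * 1024, 300);
    System.out.println("exit: " + p.waitFor());
  }
}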
* There are a bunch of places where Bazel's caching must jibe with the
remote execution system. For example, if you use
--cache_test_results=no, then the remote execution system must be
suitably instructed to skip the cache.
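In protocol terms that can just be a bit on the request (names are
made up here):

// Sketch: the execute request carries a flag telling the service not to
// consult its action cache, so disabling test result caching locally has the
// same effect remotely.
public class ExecuteRequestSketch {
  record ExecuteRequest(String actionDigest, boolean skipCacheLookup) {}

  static ExecuteRequest forTest(String actionDigest, boolean cacheTestResults) {
    return new ExecuteRequest(actionDigest, /* skipCacheLookup= */ !cacheTestResults);
  }
}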
* One complaint I heard from the designers about our current protocol
is that you can't tell the file size from the checksum. This makes it
hard for the execution service to predict how much memory an action
requires, which makes scheduling more difficult.
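One fix is to make the "checksum" a pair of hash and size, so the
scheduler can estimate an action's footprint without fetching
anything; a minimal sketch:

// Sketch: a digest that carries the byte size alongside the hash, so the
// scheduler can size up an action's inputs without touching the CAS.
public class SizedDigest {
  record Digest(String hash, long sizeBytes) {}

  record InputFile(String path, Digest digest) {}

  static long estimatedInputBytes(java.util.List<InputFile> inputs) {
    return inputs.stream().mapToLong(f -> f.digest().sizeBytes()).sum();
  }
}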
* Another complaint is that the request sends a flat list of files. A
compile action includes the compiler as an input, and for some
compilers (checked in!) these lists are large (say, 5k files). For each
request, we have to build this list afresh. This causes a large amount
of memory churn (we have to build the request in Java, serialize it to
bytes, then run the raw bytes through MD5). All of this means that it
still takes Bazel considerable time to build a binary, even if it is
completely cached remotely. This could be fixed by allowing entire
subtrees to be described by a single 'magical' file entry, by having
first-class support for directories, or by supplying the compiler on
the worker machine.
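First-class directories could look roughly like this: a directory node
lists its children and is itself content-addressed, so a 5k-file
toolchain collapses into a single digest in the request. The shapes
and names here are illustrative only:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;

// Sketch of first-class directory support: a directory is a content-addressed
// node listing its children, so an entire checked-in toolchain can appear in a
// request as one digest instead of thousands of file entries.
public class DirectoryTree {
  record FileEntry(String name, String digest) {}
  record DirEntry(String name, String digest) {}

  record Directory(List<FileEntry> files, List<DirEntry> subdirs) {
    String digest() throws Exception {
      MessageDigest md = MessageDigest.getInstance("MD5");
      for (FileEntry f : files) {
        md.update(("file:" + f.name() + ":" + f.digest()).getBytes(StandardCharsets.UTF_8));
      }
      for (DirEntry d : subdirs) {
        md.update(("dir:" + d.name() + ":" + d.digest()).getBytes(StandardCharsets.UTF_8));
      }
      StringBuilder sb = new StringBuilder();
      for (byte b : md.digest()) sb.append(String.format("%02x", b));
      return sb.toString();
    }
  }
}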
* I notice you want to use JSON (?) for sending data back and forth.
Is that wise? How expensive is (de)serialization? Protobuf is already
a little problematic because Java strings are UTF-16 and protobuf uses
UTF-8, and we have lots of strings (file names).
* On the Bazel side, remote execution/caching requires some
cooperation with local resource management. In our case, the build
cluster is very far away, so each cache lookup typically takes 100 ms.
This means that we have to run lots of them (IIRC, the default is 200)
in parallel to make it worthwhile. This in turn means that we have to
fiddle with resource management: by default we run only NUM-CPU tasks,
so we pretend remote tasks don't take local resources, which lets us
run 200 of them in parallel. However, when a task falls back to local
execution for some reason, you have to make it require resources
again, so the local machine won't get overwhelmed. We have code that
does this, but it's quite arcane and needs some serious attention
before it can be open sourced. OTOH, if you are just caching and can
assume that the cache replies instantly, you might be able to get away
with the existing resource management.
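The bookkeeping can be expressed with a counting semaphore: remote
actions take zero local permits, and a local fallback acquires real
CPU permits first. Very much a sketch, not Bazel's actual code:

import java.util.concurrent.Semaphore;

// Sketch: remote actions are treated as free so hundreds can be in flight,
// but a local fallback must acquire real CPU slots before running here.
public class LocalResourceManager {
  private final Semaphore localCpus = new Semaphore(Runtime.getRuntime().availableProcessors());

  void runRemotely(Runnable remoteCall) {
    remoteCall.run();  // no local permits; parallelism limited elsewhere (e.g. 200 in flight)
  }

  void runLocally(Runnable localAction) throws InterruptedException {
    localCpus.acquire();  // take a real CPU slot so the machine is not overwhelmed
    try {
      localAction.run();
    } finally {
      localCpus.release();
    }
  }

  void runWithFallback(Runnable remoteCall, Runnable localAction) throws InterruptedException {
    try {
      runRemotely(remoteCall);
    } catch (RuntimeException e) {
      runLocally(localAction);  // remote failed: re-acquire resources and run here
    }
  }
}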
I hope this gives some perspective on the trade-offs. I think we could
have a better discussion about your design if you could specify what
environment you want to use it for:
* what languages (C++, Java, others?)
* what operating systems (OSX? Linux?)
* what deployment model/network characteristics (cloud? on premise?)
* just cache? or cache + execution?
I personally think it would be easiest to start with a Linux-only
cache that runs on-premise and forgo remote execution for now.
--
Han-Wen Nienhuys
Google Munich
han...@google.com