Prototype for distributed caching and remote execution


alpha....@gmail.com

Jan 11, 2016, 10:27:57 PM
to bazel-discuss
Hi all!

I created a fork of Bazel on GitHub that adds distributed caching and remote execution: https://github.com/hhclam/bazel.

I also drafted a design document for the implementation of the above features: https://docs.google.com/document/d/1CvEw3uu9mUszK-ukmSWp4Dmy43KDlHAjW75Gf17bUY8/edit#

The prototype is based on Hazelcast, which provides a JCache API for a distributed cache. Remote execution is built on top of the distributed cache plus a REST build worker based on Jetty. The build worker acts both as a Hazelcast node and as a Jetty server that listens for REST build requests. I successfully used it to build //src:bazel in a distributed way.
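
To give an idea of the JCache usage, here is a minimal sketch of a digest-keyed cache (simplified and illustrative, not the exact code in the fork):

import javax.cache.Cache;
import javax.cache.CacheManager;
import javax.cache.Caching;
import javax.cache.configuration.MutableConfiguration;

public class HazelcastActionCache {
  private final Cache<String, byte[]> cache;

  public HazelcastActionCache() {
    // Caching.getCachingProvider() picks up the Hazelcast JCache provider if it
    // is on the classpath; "bazel-cache" is an arbitrary cache name.
    CacheManager manager = Caching.getCachingProvider().getCacheManager();
    cache = manager.createCache("bazel-cache", new MutableConfiguration<String, byte[]>());
  }

  // Look up a previously stored blob (action result or file) by its digest.
  public byte[] get(String digest) {
    return cache.get(digest);
  }

  // Store a blob under its digest so other nodes in the cluster can fetch it.
  public void put(String digest, byte[] contents) {
    cache.put(digest, contents);
  }
}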

Please let me know what you think and I welcome your comments.

Alpha

lcid...@gmail.com

Jan 12, 2016, 8:53:55 AM
to bazel-discuss
It would be especially interesting to hear which approach Google takes internally for distributed caching.

Han-Wen Nienhuys

Jan 12, 2016, 3:10:07 PM
to lcid...@gmail.com, bazel-discuss
Here are some random thoughts, both on your design and Google's internal system.

* Inside Google, we have a single system for caching and execution.
The implementation is heavily based on Linux and Google's production
environment.

* The system is heavily based on content-addressable storage (CAS):
files, stdout and stderr contents, but also the action itself (command
line, environment, inputs, output names) are keyed by their checksums.
For structured data, the key is computed over the protobuf
serialization. The checksum used (as in Bazel) is md5. Given md5's
security record, this is unfortunate, but changing it requires a
concerted effort across the build system, the version control system,
and all storage systems involved.
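
To make the keying concrete, here is a rough sketch (not our internal code; the class and method names are illustrative): the action's protobuf serialization is hashed, and the hex digest becomes the cache key.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class ContentKey {
  // serializedAction is assumed to be the protobuf wire bytes of the action
  // message (command line, environment, inputs, output names).
  public static String keyFor(byte[] serializedAction) {
    try {
      byte[] digest = MessageDigest.getInstance("MD5").digest(serializedAction);
      StringBuilder hex = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("MD5 not available", e);  // never happens on a standard JRE
    }
  }
}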

* For performance reasons, all of the data in the execution system is
kept in memory. The execution machines make very clever use of memory:
the inputs are stored on a RAM-backed tmpfs, and the command then runs
inside the same file system (but as a different user). The execution
machines also function as a distributed cache. This halves the RAM
requirements of the system. It does make the complete system more
complex, because it needs a separate component to keep track of which
file is stored on which machine.

* There is a frontend that Bazel connects to in order to ask for a
result. IIRC, Bazel checksums the whole request into a single md5 and
sends that. If the result is in cache, the (cached) reply is sent
back. The reply typically does not contain stderr/stdout inline; they
are sent back as another output file (useful for test logs, which can
be very large and not always interesting).

* There may have been a time when compilers, the JDK, etc. were
installed on the execution machines, but this makes it difficult to
manage compiler upgrades. Nowadays, we check the compiler into our
version control. This is something that Bazel supports too, and I
recommend it for a setup that uses remote execution. Alternatively, if
you run the executor in a (Docker) container or similar, you would
have to add the checksum or name of that container to the cache key.

* Inside Google, binaries, especially C++ ones, tend to be very large.
This makes remote execution expensive in terms of bandwidth: sending
back the outputs from the compile farm is limited by network
bandwidth, and in smaller offices with "limited" bandwidth, remote
compilation could easily swamp the network. This has been solved by
adopting a FUSE file system to lazily load build artifacts. See also
http://google-engtools.blogspot.de/2011/10/build-in-cloud-distributing-build.html.
I guess it can be done in open source, but it adds considerable
complexity, since you have to write a FUSE daemon, store artifacts,
and then garbage collect them. If you don't do this, you might want to
consider a mechanism so you don't download output files that didn't
change relative to the last run.

* Our compile cluster runs in our data center. This makes sense
because 1) computation is cheaper in our production environment, and
2) the environment is fully managed, so tasks get restarted if
machines break or tasks are preempted. The downside is that there is
considerable ping time between the client machine running Bazel and
the datacenter running the compilation. For batch compilation, that
may not be an issue, but for interactive edit/compile cycles it adds
some overhead, and the structure of the protocol makes this worse,
e.g.:

bazel: execute action with checksum "abc"
remote: "abc" unknown; send the entire request
bazel: execute "gcc" input "f.c", md5: "123"
remote: "123" unknown; send me the file
bazel: uploads file with checksum "123"
remote: .. (waits for an execution machine to become available)
remote: oops, entry "123" was evicted from the cache, send it again.

etc. You can speculatively upload things to avoid roundtrips, but it
makes the code more complicated.
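
As an illustration of the speculative-upload idea, a minimal sketch (the collaborator interfaces are hypothetical, not Bazel's real classes): push all input blobs to the cache up front instead of waiting for the server to report each missing digest.

import java.util.List;
import java.util.concurrent.CompletableFuture;

public class SpeculativeUploader {
  // Hypothetical collaborators, for the sketch only.
  interface RemoteCache { void put(String digest, byte[] bytes); }
  interface RemoteExecutor { void execute(String actionDigest); }
  interface Blob { String digest(); byte[] bytes(); }

  public void execute(RemoteCache cache, RemoteExecutor executor,
                      List<Blob> inputs, String actionDigest) {
    // Upload every input without first asking which ones the server is missing.
    CompletableFuture<?>[] uploads = inputs.stream()
        .map(b -> CompletableFuture.runAsync(() -> cache.put(b.digest(), b.bytes())))
        .toArray(CompletableFuture[]::new);
    CompletableFuture.allOf(uploads).join();  // all inputs are present before executing
    executor.execute(actionDigest);
  }
}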

* In light of the last two items, it may be useful to consider
whether you want to design for a service running locally (0ms ping
time, infinite bandwidth, but you need to manage machines/tasks: what
if a machine in the build farm runs out of disk space?) or remotely
(network limitations, but Google Cloud / AWS / Kubernetes to manage
deployments). I would suggest the former, as I suspect it will be
easier to implement.

* Is OSX a requirement? OSX is a bit of a pain in the behind: there
are no containers or namespaces, and filesystems are often case-folding.

* When you have a remote execution system and it becomes popular,
people find all kinds of ways to abuse it: tasks that take large
amounts of CPU, tasks that run out of RAM, tasks that produce too much
output, tests that fail sometimes, tests that connect to the internet,
etc.

* A remote execution system should use our namespace sandbox (or a
moral equivalent), and it would be good if we could add resource
limits to our namespace sandbox, so it can impose limits on RAM/CPU
usage.

* There are a bunch of places where Bazel's caching must jibe with
the remote execution system. For example, if you use
--test_cache_results=no, then the remote execution must be suitably
instructed to skip the cache.

* One complaint I heard from the designers about our current protocol
is that you can't tell the file size from the checksum. This means
that it's hard for the execution service to predict how much memory an
action requires, making scheduling more difficult.

* Another complaint is that the request sends a flat list of files. A
compile action includes the compiler as an input, and for some
compilers (checked in!) these lists are large (say, 5k files). For
each request, we have to build this list afresh. This causes a large
amount of memory churn (we have to build the request in Java,
serialize it as bytes, then run the raw bytes through md5). All of
this means that it still takes considerable time for Bazel to build a
binary, even if it is completely cached remotely. This could be fixed
by allowing entire subtrees to be described by a single 'magical' file
entry, by having first-class support for directories, or by supplying
the compiler on the worker machine.

* I notice you want to use JSON (?) for sending data back and forth.
Is that wise? How expensive is (de)serialization? Protobuf is already
a little problematic because Java strings are UTF-16 and protobuf uses
UTF-8, and we have lots of strings (filenames).

* On the Bazel side, remote execution/caching requires some
cooperation with local resource management. In our case, the build
cluster is very far away, so each cache lookup typically takes 100ms.
This means that we have to run lots of them (IIRC, the default is 200)
in parallel to make it worth it. This in turn means that we have to
fiddle with resource management: by default we run only NUM_CPU tasks,
so we pretend remote tasks don't take resources. This lets us run 200
of them in parallel. However, when a task falls back to local
execution for some reason, you have to make it require resources
again, so the local machine won't get overwhelmed. We have code that
does this, but it's quite arcane, and it needs some serious attention
before it can be open sourced. OTOH, if you are just caching, and can
assume that the cache replies instantly, you might be able to get away
with the existing resource management.
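
The gist of the fallback problem, as a rough sketch (not our actual scheduler code; names are illustrative): remote actions are treated as free so hundreds can be in flight, but an action that falls back to local execution must first claim a slot from a pool sized to the number of local CPUs.

import java.util.concurrent.Semaphore;

public class LocalFallbackGate {
  // One slot per local CPU; remote runs never touch this pool.
  private final Semaphore localSlots =
      new Semaphore(Runtime.getRuntime().availableProcessors());

  public void run(Runnable remote, Runnable local) throws InterruptedException {
    try {
      remote.run();                   // no local resources reserved for remote runs
    } catch (RuntimeException e) {    // e.g. remote worker unavailable
      localSlots.acquire();           // reserve a local CPU slot before falling back
      try {
        local.run();
      } finally {
        localSlots.release();
      }
    }
  }
}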

I hope this gives some perspective on the trade-offs. I think we could
have a better discussion about your design if you could specify what
environment you want to use it for:

* what languages (C++, Java, others?)
* what operating systems (OSX? Linux?)
* what deployment model/network characteristics (cloud? on premise?)
* just cache? or cache + execution?

I personally think it would be easiest to start with a Linux-only
cache that runs on-premise and forgo remote execution for now.



--
Han-Wen Nienhuys
Google Munich
han...@google.com

Alpha Lam

Jan 12, 2016, 5:33:22 PM
to Han-Wen Nienhuys, bazel-discuss
Thanks for your very thoughtful comments.

Let me start by answering your questions.

* what languages (C++, Java, others?)

Our organization has mostly Java code, but also C++, Python, Clojure and other languages. We have lots of Java code and many sub-components. We would like to use Bazel to build everything we're building now, and to do it more efficiently than our current system.

* what operating systems (OSX? Linux?)

Linux only.

* what deployment model/network characteristics (cloud? on premise?)

Mostly on premise (one site) although we would like to expand into AWS as well.

* just cache? or cache + execution?

Initially just cache, but eventually execution as well. We already have a distributed cache + execution system; I would like to replace it with a Bazel-based setup that provides the same (and more) capabilities.

The prototype I posted has a simple implementation of remote execution, but it's optional. For Bazel, I would really like to see distributed caching and remote execution become core features. Remote execution could be turned on optionally, depending on the organization's infrastructure, but it should be included in Bazel as a reference implementation. I'm very happy to contribute and optimize this component (for Linux only). It would be very beneficial to other large organizations as well.

Some additional comments to your reply.

* For Bazel, I think it's best if the distributed cache and remote execution are separable.

* Is there a chance to open source the part where an action is serialized and checksummed? I would also like to replace the md5 digest with something stronger, like SHA-1 or SHA-256. Would it be easier to implement in Bazel as a separate FileSystem?

* Regarding the use of tmpfs in the execution system: an alternative is to implement a FUSE mount where both input files and output files are read and written through the FUSE mount, with the mount backed by a JCache instance. This way the JCache backend handles cache lookup (i.e., knowing which machine has the file stored), and the execution worker is a standalone application that operates only on memory.

* Good suggestion about sending test logs separately. I think this means that the build worker doesn't have to upload everything before Bazel proceeds, only the output files that are needed by subsequent actions.

* In our environment we check the JDK into the repository, but not the C++ toolchain. It's an interesting idea to do that for remote execution, though. We're kicking around the idea of using conda to provide the build environment for external libraries and the toolchain.

* The idea of using FUSE is very interesting and is what we have implemented internally. It would be very interesting if Bazel included a reference implementation of a FUSE mount as the execution root to unlock the performance gains you mentioned. It would also greatly speed up checksum computation (as checksums could be computed ahead of time). I think this is something we can contribute to Bazel as well. Regarding large object files, we have been kicking around the idea of using tools that compute binary diffs, such as Courgette.

* To avoid roundtrips, I think it's best if Bazel can upload files speculatively, since it's already a daemon, for example during the phase of parsing BUILD files and globbing.

* We're looking to implement the build farm on premise instead of using the public cloud.

* We have a Linux-only environment and a strong interest in optimizing the Linux use case, i.e. FUSE, third-party library management, etc.

* Sandboxing the remote worker is a great idea and something we can help with too.

* I can expand the protobuf definition for the remote work request to include the file size. Good point; it's a good indicator of memory usage.

* I have not benchmarked the conversion between the build request and JSON. I chose JSON because it is more friendly than the binary format of serialized protobuf. Because we use the system compiler, I didn't observe the problem of large input lists for C++, but I did for Java: the Java toolchain includes even image files. I think it can be trimmed down significantly.

* Good point about falling back to local execution. I am hoping that Bazel can have a --local_jobs option that specifies the maximum number of jobs to run locally. I'm interested in helping with this part as well.

Alpha

Lukács T. Berki

Jan 13, 2016, 3:54:05 AM
to Alpha Lam, Han-Wen Nienhuys, bazel-discuss
@Han-Wen: good summary. There are only a few things I'd like to add:
  1. Checking every compiler into the source tree, as is done at Google, is probably not feasible for most of the people out there. I think it's easier to get something up and running if you just assume that certain binaries are present at certain paths on the remote executors (note that we try not to rely on this at Google, but we still depend on a few base libraries and binaries being present; the obvious one is /bin/bash).
  2. You'll eventually want a way to select which remote executors a particular action should run on, if not for the issues mentioned in (1), then simply for hardware compatibility (Linux binaries should be run on a Linux machine).
  3. In theory, you can choose to ignore most of the lessons Han-Wen mentioned, because SpawnStrategy is a pretty good abstraction and you can stick anything behind it. However, these were lessons learned the hard way and are thus valuable.
  4. I'm not sure whether we want to integrate one particular kind of remote execution into Bazel. If not, we'll have to come up with an interface so that you can plug this in. BlazeModule is one possible idea, but it's a wide and hairy interface and we don't really want to commit to never changing it again.





--
Lukács T. Berki | Software Engineer | lbe...@google.com | 

Google Germany GmbH | Maximillianstr. 11-15 | 80539 München | Germany | Geschäftsführer: Matthew Scott Sucherman, Paul Terence Manicle | Registergericht und -nummer: Hamburg, HRB 86891

Han-Wen Nienhuys

Jan 13, 2016, 4:23:30 AM
to Lukács T. Berki, Alpha Lam, bazel-discuss
On a complete tangent: while writing all of this down, I realized
that it might be kind of neat if we could somehow mount tar files
"directly" into the namespace sandbox. (Maybe the sandboxed strategy
could untar the files locally into a tmp dir and then give the root of
the tmp dir to our sandboxing binary.) This would let us get 90% of
the benefit of Docker without its complexity: people could get the
tarfile for the target system from a WORKSPACE http_file, and there
would be no overhead for mounting or checksumming the system in Bazel.

Lukács T. Berki

Jan 13, 2016, 4:28:40 AM
to Han-Wen Nienhuys, Alpha Lam, bazel-discuss
On Wed, Jan 13, 2016 at 10:23 AM, Han-Wen Nienhuys <han...@google.com> wrote:
> On a complete tangent: while writing all of this down, I realized that
> it might be kind of neat if we can somehow mount tar files "directly"
> into the namespace sandbox. (Maybe sandboxed strategy could untar the
> files locally into a tmp dir and then give the root of the tmp dir to
> our sandboxing binary). This would let us get 90% of the benefit of
> docker without its complexity: people could get the tarfile for the
> target system from a WORKSPACE http_file, and there would be no
> overhead for mounting or checksumming the system in Bazel.
Why reinvent Docker if it already exists? You can just reference a Docker image instead of a tarfile and benefit from all the Docker images people have already built.

Han-Wen Nienhuys

Jan 13, 2016, 4:36:03 AM
to Lukács T. Berki, Alpha Lam, bazel-discuss
On Wed, Jan 13, 2016 at 10:28 AM, Lukács T. Berki <lbe...@google.com> wrote:
>> On a complete tangent: while writing all of this down, I realized that
>> it might be kind of neat if we can somehow mount tar files "directly"
>> into the namespace sandbox. (Maybe sandboxed strategy could untar the
>> files locally into a tmp dir and then give the root of the tmp dir to
>> our sandboxing binary). This would let us get 90% of the benefit of
>> docker without its complexity: people could get the tarfile for the
>> target system from a WORKSPACE http_file, and there would be no
>> overhead for mounting or checksumming the system in Bazel.
>
> Why reinvent Docker if it already exists? You can just reference a Docker
> image instead of a tarfile and benefit from all the Docker images people
> have already built.

Sure, that sounds good too; you'd need to duplicate some logic to
understand the format, though. (Docker images are tar files, right?)

Han-Wen Nienhuys

Jan 19, 2016, 9:12:22 AM
to Alpha Lam, bazel-discuss
On Tue, Jan 12, 2016 at 11:33 PM, Alpha Lam <alpha....@gmail.com> wrote:
> * just cache? or cache + execution?
>
> Initially just cache. But eventually execution as well. We already have a
> distributed cache + execution system. I would like to replace it with Bazel
> that provides the same (and more) capabilities.
>
> The prototype I posted has a simple implementation for remote execution but
> it's optional. For Bazel I would really like to see distributed caching and
> remote execution be the core features. Remote execution can be turned on
> optionally depending on the infrastructure the organization, but included in
> Bazel as a reference implementation. I'm very happy to contribute and
> optimize this component (for linux only). It will be very beneficial to
> other large organizations as well.

Right. The moment you do remote execution, tasks can take a long time
to complete, and the problem becomes similar to having a cache at a
distance, in that you'll need to run lots of jobs in parallel for it
to be a win.

> * Is there a chance to open source the part where an action is serialized
> and checksummed? I also would like to replace the md5 digest with something
> stronger like sha1 or sha256.

Serialization is just protobuf serialization, and checksumming is md5.
The message structure is roughly

message exec {
  // Field numbers and some field names here are illustrative.
  optional bytes bin = 1;
  repeated bytes argv = 2;
  repeated bytes env = 3;
}

message input {
  optional bytes path = 1;
  optional bytes digest = 2;
}

message action {
  optional exec exec = 1;
  repeated input inputs = 2;
  repeated bytes output_path = 3;
}

message request {
  optional action action = 1;
  optional bytes action_digest = 2;

  // metadata, cachable
  // retry info etc.
}

You could replace md5 with something stronger, but you'd have to
change the checksumming throughout; in particular the file
checksumming is still hardcoded to md5.

If you want to venture in this direction (honestly, I would be more
worried about securing the RPC connection infrastructure), you should
make the digest function pluggable. It will be a while before we can
make the Google version switch over to something else. If you're
looking for a good checksum, may I suggest

https://eprint.iacr.org/2014/170.pdf

It's a fast (parallelizable) scheme.
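
A rough sketch of what a pluggable digest function could look like (the interface and names are illustrative, not Bazel's actual API), so md5 could later be swapped for SHA-256 or another algorithm without touching callers:

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public interface DigestFunction {
  String name();
  byte[] digest(byte[] data);

  // JDK-backed implementations; "MD5" and "SHA-256" are standard algorithm names.
  static DigestFunction forAlgorithm(String algorithm) {
    return new DigestFunction() {
      @Override public String name() { return algorithm; }
      @Override public byte[] digest(byte[] data) {
        try {
          return MessageDigest.getInstance(algorithm).digest(data);
        } catch (NoSuchAlgorithmException e) {
          throw new IllegalStateException("Unknown digest algorithm: " + algorithm, e);
        }
      }
    };
  }
}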

> Would it be easier to implement in Bazel as a
> separate FileSystem?

There is support for providing the checksum as an extended attribute;
see Path#getFastDigest().
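
For illustration, with plain JDK APIs reading a precomputed digest from a user extended attribute looks roughly like this (the attribute name "user.checksum" is an assumption; Bazel's Path#getFastDigest() goes through its own VFS rather than this code):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.UserDefinedFileAttributeView;

public final class FastDigest {
  public static byte[] read(Path file) throws IOException {
    UserDefinedFileAttributeView view =
        Files.getFileAttributeView(file, UserDefinedFileAttributeView.class);
    // The view names user attributes without the "user." prefix.
    if (view == null || !view.list().contains("checksum")) {
      return null;  // no precomputed digest; fall back to hashing the file contents
    }
    ByteBuffer buf = ByteBuffer.allocate(view.size("checksum"));
    view.read("checksum", buf);
    buf.flip();
    byte[] digest = new byte[buf.remaining()];
    buf.get(digest);
    return digest;
  }
}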

> * Regarding use of tmpfs in execution system. An alternative is one could
> implement a FUSE mount where both input files and output files are read and
> written to the FUSE mount. The FUSE mount is backed by a JCache instance.
> This way the JCache backend will handle cache lookup (i.e. knowing which
> machine has the file stored) and the execution worker is a standalone
> application that operates only on memory.

I have extensive experience with FUSE, and I suggest avoiding it if
you can. It has performance problems, and making something that is
either performant or has 100% fidelity to POSIX is possible, but
tricky, let alone both at the same time, and you need both performance
(to beat local execution) and fidelity (to appease tools that do weird
tricks with mmap, IPC over open files, etc.).

> * Good suggestion about sending test logs separately. I think this means
> that the build worker doesn't have to upload everything before Bazel
> proceed, only output files that are needed by subsequently actions.
>
> * In our environment we check in jdk in the repository but not the c++
> toolchain. But it's an interesting idea to do for remote execution. We're
> kicking around the idea of use conda to provide the build environment for
> external libraries and toolchain.
>
> * The idea of using FUSE is very interesting and is what we have implemented
> internally.

You have implemented it? For source code or for objects?

> It'll be very interesting if Bazel includes a reference
> implementation of FUSE mount as the execution root to unlock the performance
> gains you mentioned.

It's not the execution root that's FUSE, but the output tree. The
execution root includes symlinks to source code too.

> It will also speed up the computation of checksum
> greatly as well (as it could be computed ahead of time). I think this is
> something we can contribute to Bazel as well. Regarding large object files
> we have been kicking the idea of using tools to compute binary diffs such as
> courgette.
>
> * To avoid roundtrips I think it's best that Bazel can upload files
> speculatively since it's already a daemon. For example in the phase of
> parsing BUILD files and globbing.

That's way too early: you'd likely thrash the (limited-size) memory
cache for your fellow users.

> * We're looking to implement the build farm on premise instead of using the
> public cloud.
>
> * We have a Linux only environment and we have strong interest for
> optimizing the Linux use case, i.e. FUSE, third party libraries management,
> etc.
>
> * Sandboxing the remote worker is a great idea and something we can help
> too.
>
> * I can expand the protobuf definition for remote work request to include
> file size. Good point it's a good indicator for memory usage.

I think it's best appended to the checksum, so you can retain a single
identifier for a piece of content.
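
For illustration, a minimal sketch of what such a combined identifier could look like (the "<hex digest>/<size>" format is my own illustration, not a defined protocol):

public final class SizedDigest {
  private final String hexDigest;
  private final long sizeBytes;

  public SizedDigest(String hexDigest, long sizeBytes) {
    this.hexDigest = hexDigest;
    this.sizeBytes = sizeBytes;
  }

  // Single identifier for a piece of content, e.g. "9e107d9d372bb6826bd81d3542a419d6/5120".
  @Override public String toString() {
    return hexDigest + "/" + sizeBytes;
  }

  // Recover both parts, so the scheduler can read the size without fetching the content.
  public static SizedDigest parse(String id) {
    int slash = id.lastIndexOf('/');
    return new SizedDigest(id.substring(0, slash), Long.parseLong(id.substring(slash + 1)));
  }
}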

pa...@lucidchart.com

Jun 8, 2016, 9:52:21 AM
to bazel-discuss
(1) I see that this is on master (https://github.com/bazelbuild/bazel/commit/79adf59e2973754c8c0415fcab45cd58c7c34697). Congrats!

Is this indeed the default implementation for distributed caching for the foreseeable future?

(2) "Distributed execution of actions" is on the roadmap (https://github.com/bazelbuild/bazel/blob/master/site/roadmap.md) for next month. Is it likely that this will be the Hazelcast implementation as well?

Damien Martin-guillerez

Jun 8, 2016, 2:16:59 PM
to pa...@lucidchart.com, bazel-discuss
On Wed, Jun 8, 2016 at 3:52 PM <pa...@lucidchart.com> wrote:

(1) I see that this is on master (https://github.com/bazelbuild/bazel/commit/79adf59e2973754c8c0415fcab45cd58c7c34697). Congrats!

Is this indeed the default implementation for distributed caching for the foreseeable future?

Yes, although it is still experimental, so the foreseeable future is not that far off.
 

(2) "Distributed execution of actions" is on the roadmap (https://github.com/bazelbuild/bazel/blob/master/site/roadmap.md) for next month. Is it likely that this will be the Hazelcast implementation as well?

I don't know. The gRPC-based prototype that Alpha contributed (https://bazel-review.googlesource.com/#/c/3110/) is likely to have many more improvements.
 


Alpha Lam

Jun 8, 2016, 3:18:56 PM
to Damien Martin-guillerez, pa...@lucidchart.com, bazel-discuss, Eric Burnett
Like Damien said, for the near future the code you're looking at is still the default implementation.

Eric Burnett and his team are working on defining a gRPC proto for caching. I would like to hold off until that's done. Then I will update the spawn implementation to use gRPC for caching, and I want to move the server implementation for caching and execution out into a separate project. The default will probably still use Hazelcast, but it should be easy to add other backends.

 
