REAPI Open Source Servers


Sadaf Matinkhoo

Sep 2, 2022, 10:55:36 AM
to Remote Execution APIs Working Group
Hello REAPI Community,

I am looking for some information on the various open source server implementations of the RE API. Some that are listed in the bazel remote-apis repository are Buildfarm, Buildbarn, BuildGrid, and Scoot.

I'm looking for things like:

- How's the onboarding experience (effort, community responsiveness)?
- How's the debugging experience (metrics, tools)?
- How much have they been tested?
- What are their limitations (platforms, clients, scalability, etc.)?
- Which features do they support (autoscaling, heterogeneous workers, build without the bytes, etc.)?

If you have experience with any of these servers, I'd appreciate it if you could share some feedback here.

Thanks,
Sadaf

william...@gmail.com

Sep 2, 2022, 1:36:52 PM
to Remote Execution APIs Working Group
Hi Sadaf,

There are a bunch of options in the space for both cache and remote execution.

I'm obviously biased because I work on it, but one you didn't name is BuildBuddy :)

- Open source (and we offer a cloud version if you don't want to host)
- Good support (via a slack community) and documentation, actively developed
- Debugging and metrics are easy with the UI, API, and exported metrics
- Tested extensively and used by hundreds of projects both open and closed source
- Supports Linux and macOS, autoscaling, heterogeneous worker pools, Build without the Bytes, Firecracker isolation, etc.

Happy to answer any questions you have, or help you get started,

Tyler

Sadaf Matinkhoo

Sep 2, 2022, 2:09:27 PM
to william...@gmail.com, Remote Execution APIs Working Group
Hi Tyler,

Thank you for the information. I really appreciate it. :)
I didn't mention BuildBuddy only because I am already somewhat familiar with it through its comprehensive documentation, and I have heard the success stories. But I can't find some of the info I'm looking for on the other open source server implementations. If you have any insights on those, please let me know. I'll take your bias into account! :D

Thanks,
Sadaf




Nathan Bruer

Sep 2, 2022, 2:49:04 PM
to Sadaf Matinkhoo, Remote Execution APIs Working Group
Hi Sadaf,

I am the creator of Turbo Cache. I have not yet licensed it, and I am not yet advertising it as "ready", because I don't want to tell people to use it before it has been tested thoroughly.

> How's the onboarding experience (effort, community responsiveness)?
This is one of the big reasons I ended up writing my own tool. In 2019 I tested out pretty much every open source project out there that implemented the BRE API, and they all either flat out did not work or were extremely difficult and required a lot of hacking to get working. We ended up going with Buildfarm, but this also required a ton of hacks. I always like a "just run this script" kind of deployment strategy that lets developers learn and test the software, which is why I decided to include a terraform script for a reference implementation.

> How's the debugging experience (metrics, tools)?
As a developer working against the BRE API, I find this quite frustrating. Luckily I used to work for Google and I'm used to the "Google way" of self-documenting code, so when it was not clear how a part of the API works, I could use codesearch and easily traverse the bazel code to figure it out.

I will say it is extremely difficult to try to propagate errors efficiently, but I blame this not on BRE but rather on the language or libraries the server implementation is built upon.

> How much have they been tested?
We found quite a few bugs in Buildbarn when we deployed it to production and ended up making a bunch of hacks to "retry on failure". For example, there were some race conditions that would cause an error stating that a file already exists when downloading artifacts, even though there was only one job per instance. So we had the scheduler sniff out the error code and message Buildbarn would return, and if it matched certain criteria, the scheduler would terminate the instance and reschedule the job. A sketch of that workaround is below.
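
For illustration, here is a minimal Go sketch of that kind of workaround. The error text, the AlreadyExists code, and the scheduler hooks (terminateInstance, rescheduleJob) are all hypothetical stand-ins, not Buildfarm or Buildbarn APIs:

```go
// Hypothetical sketch of the "retry on failure" hack described above.
package main

import (
	"errors"
	"regexp"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// Matches the (hypothetical) message produced when the race is hit.
var fileExistsPattern = regexp.MustCompile(`file .* already exists`)

// isRetryableRace reports whether err looks like the known race
// condition rather than a genuine build failure.
func isRetryableRace(err error) bool {
	s, ok := status.FromError(err)
	if !ok {
		return false
	}
	return s.Code() == codes.AlreadyExists &&
		fileExistsPattern.MatchString(s.Message())
}

// handleJobError tears down the wedged worker and reruns the job
// elsewhere when the error pattern-matches; anything else is treated
// as a real failure.
func handleJobError(instanceID, jobID string, err error) error {
	if isRetryableRace(err) {
		terminateInstance(instanceID)
		return rescheduleJob(jobID)
	}
	return err
}

// Stubs standing in for the real scheduler plumbing.
func terminateInstance(id string)   {}
func rescheduleJob(id string) error { return errors.New("not implemented") }
```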

As for Turbo Cache, I try to test every module extensively, and any time a bug is caught a regression test is required. I will say something that Google/the Bazel team could help with is an integration test suite that can be run against any BRE API implementation (if such suites exist, I don't know about them, and they should probably be documented in the BRE API repo). I ended up writing a terrible integration test here.

> What are their limitations (platforms, clients, scalability, etc.)?
I find that the biggest limitation comes down to the scheduler. It is quite difficult to write a system with no single point of failure with the API as it's written (or maybe as most clients are written). It is possible, and I have theory-crafted a few ideas on how to allow multiple schedulers, but it's extremely complicated, so it will probably never be done.

I have also been toying with the idea of a FUSE filesystem that can be mapped to a CAS, which in theory would significantly reduce total execution time, since a job would only need to download the files (or the parts of files) that it actually uses. The other major reason I want to do this is that it would allow collecting metrics on which files are not used by a job, which I could store in a database somewhere. (Yes, I'm aware of the unused_files.zip file, but it's not very useful as is.) A rough sketch of the lazy-fetch idea follows.
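
To make that concrete, here is a rough Go sketch of lazy, chunked reads backed by a CAS, assuming a hypothetical casFetch helper in place of a real ByteStream.Read call:

```go
// Only the chunks a job actually touches are fetched, and each file
// records whether it was read at all, which feeds "unused input"
// metrics. casFetch is a hypothetical stand-in for ByteStream.Read.
package main

import "sync"

const chunkSize = 1 << 20 // fetch 1 MiB at a time

type lazyCASFile struct {
	digest  string           // CAS digest identifying the blob
	size    int64            // total blob size
	mu      sync.Mutex
	chunks  map[int64][]byte // chunk base offset -> fetched bytes
	touched bool             // true once the job has read any byte
}

// ReadAt implements io.ReaderAt, fetching only the chunks that the
// read overlaps.
func (f *lazyCASFile) ReadAt(p []byte, off int64) (int, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.touched = true
	n := 0
	for n < len(p) && off+int64(n) < f.size {
		pos := off + int64(n)
		base := (pos / chunkSize) * chunkSize
		chunk, ok := f.chunks[base]
		if !ok {
			chunk = casFetch(f.digest, base, chunkSize) // network read
			f.chunks[base] = chunk
		}
		end := base + int64(len(chunk))
		if end > f.size {
			end = f.size
		}
		n += copy(p[n:], chunk[pos-base:end-base])
	}
	return n, nil
}

// casFetch is hypothetical; a real implementation would issue a
// ByteStream.Read against the CAS server.
func casFetch(digest string, offset, length int64) []byte {
	return make([]byte, length)
}
```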

> Which features do they support (autoscaling, heterogeneous workers, build without the bytes, etc.)?
Buildbarn and Turbo Cache support all of the above.

Last notes:
Another thing I've been wishing for is the ability to get a job label that is unique to the kind of job being run. This is purely client specific; in bazel's case it could be something like the build target (eg: `//foo/bar:baz.o`), passed to the server somehow. The idea is that it would allow the server to do optimizations like the following (a sketch appears after this list):
* Look up the label in a multimap
* Each entry is an Action
* Check whether any Action is a positive hit, i.e. a previously executed job had the same inputs when considering only the digests it actually read
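
Here is a hypothetical Go sketch of that lookup. The REAPI has no job-label field today, so every type and name below is illustrative only:

```go
package main

type Digest string

// pastRun records one previously executed action: the digests of the
// inputs it actually read, plus the cached result.
type pastRun struct {
	readInputs map[string]Digest // input path -> digest (reads only)
	result     string            // stand-in for an ActionResult
}

// runsByLabel is the multimap: job label (eg "//foo/bar:baz.o") to
// all recorded runs for that label.
var runsByLabel = map[string][]pastRun{}

// lookup returns a cached result if some previous run of this label
// read only files whose digests are unchanged in the new inputs,
// even when the overall Action hash differs.
func lookup(label string, inputs map[string]Digest) (string, bool) {
	for _, run := range runsByLabel[label] {
		match := true
		for path, d := range run.readInputs {
			if inputs[path] != d { // a file the job read has changed
				match = false
				break
			}
		}
		if match {
			return run.result, true // only unread files changed
		}
	}
	return "", false
}
```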

The idea comes from something I noticed at my last company, where simulation tests (our most expensive tests) frequently did not need to be executed, because the files that were modified were not used by those tests. Another, more advanced example: if we used FUSE, we would only read the `wdo` file when a program crashed (since we only printed stack traces on failures). Thus, if a user modified the code because of code review and the binary output was the same, the test would not need to run, even though the Action hash technically differed.

-Blaise



Ed Schouten

Sep 2, 2022, 4:13:27 PM
to Nathan Bruer, Sadaf Matinkhoo, Remote Execution APIs Working Group
Hi there!

On Fri, Sep 2, 2022 at 20:49, Nathan Bruer <natha...@gmail.com> wrote:
> We found quite a few bugs in Buildbarn when we deployed it to production and ended up making a bunch of hacks to "retry on failure". For example, there were some race conditions that would cause an error stating that a file already exists when downloading artifacts, even though there was only one job per instance. So we had the scheduler sniff out the error code and message Buildbarn would return, and if it matched certain criteria, the scheduler would terminate the instance and reschedule the job.

Did you ever end up reporting this issue? I would have loved to get
those issues fixed. Considering that you started the TurboCache
project, I guess it's unlikely you still have information on those
issues at hand.

On Fri, Sep 2, 2022 at 16:55, 'Sadaf Matinkhoo' via Remote Execution APIs Working Group <remote-exe...@googlegroups.com> wrote:
> - How's the onboarding experience (effort, community responsiveness)?

Buildbarn has a repository named bb-deployments that contains an
example deployment, which acts as a starting point for setting up a
cluster. That said, Buildbarn is a bit of a Swiss army knife. For
example, there is a storage layer that can be configured in many, many
different ways to do sharding, mirroring, failover, fallback, etc.
People setting up Buildbarn clusters are expected to take some time
to read up on all of the configuration options and decide what is
best for their environment.
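
As a generic illustration of the sharding idea (this is not Buildbarn's actual code or configuration schema): a deterministic digest-to-shard mapping is what lets every frontend agree on where a given blob lives.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
)

// shardFor maps a blob's digest hash to one of n storage backends.
// Hashing the digest again gives a uniform, deterministic spread.
func shardFor(digestHash string, n int) int {
	sum := sha256.Sum256([]byte(digestHash))
	return int(binary.BigEndian.Uint64(sum[:8]) % uint64(n))
}
```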

Buildbarn is thus not a good fit for people who want to set up a
cluster *quickly*, but in my opinion it is good for setting up a
cluster *well*. The initial learning curve is steep, but as soon
as you've got the hang of it, you can set it up any way you like.
There is a large degree of uniformity in the way services are
configured.

In terms of community responsiveness, there is a Slack channel on
buildteamworld.slack.com named #buildbarn where many of its users hang
out. Some other implementations also have channels there; Buildbarn's
is the most active in terms of posts. About half of the threads
have at least three people commenting, meaning that engagement rates
are pretty high.

> - How's the debugging experience (metrics, tools)?

I think the debugging experience is pretty good. There is a very
large number of Prometheus metrics that give insight into essentially
all of the services' subsystems. The bb-deployments repository also
has some Grafana dashboards that you can use.

We have pretty decent support for OpenTelemetry. There's middleware
included that can attach arbitrary gRPC request and response fields to
trace spans, so you can make the traces as simple/complex as you want.
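
As a sketch of what such middleware can look like (a generic illustration, not Buildbarn's actual middleware), a gRPC unary interceptor can copy request details onto the active OpenTelemetry span:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
	"google.golang.org/grpc"
)

// annotateSpan records the full RPC method name on the current span;
// a configurable version could also extract selected request fields.
func annotateSpan(
	ctx context.Context,
	req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (interface{}, error) {
	span := trace.SpanFromContext(ctx)
	span.SetAttributes(attribute.String("rpc.full_method", info.FullMethod))
	return handler(ctx, req)
}

// Registered with, e.g.:
//   grpc.NewServer(grpc.ChainUnaryInterceptor(annotateSpan))
```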

Another nifty feature is that Buildbarn always attaches URLs to
bb-browser for any action it executes. bb-browser can be used to
inspect actions and their outcomes. Bazel prints these URLs when
builds fail, meaning that inspecting failures in more detail is just a
single mouse click away. This feature has been a life-saver for people
who design their own build rules, and have a hard time figuring out
why their actions fail.

I often get remarks that Buildbarn is bad at logging, because it logs
so little. This is intentional, as we're investing heavily in metrics,
tracing, and properly propagating errors up the stack.

> - How much have they been tested?

There is a large amount of unit testing going on. Integration/system
tests are not provided on GitHub, but you may assume that this is done
extensively by its main developers.

> - What are their limitations (platforms, clients, scalability, etc.)?

Platforms: Linux, macOS, FreeBSD. There is also some experimental
Windows support, but I only rarely hear stories of people actually
using it. I guess demand isn't that high.

Scalability: petabytes of storage, ~100k worker threads.

> - Which features do they support (autoscaling, heterogeneous workers, build without the bytes, etc.)?

Commenting on the features you brought up yourself: yes, yes, yes.
There are also some other features worth mentioning.

Buildbarn's scheduler has support for multiple worker sizes. You can
create multiple pools of workers that all run the same OS/container,
but have different sizes (CPUs, memory). The scheduler is capable of
measuring execution times of actions on different worker sizes, and is
capable of making smarter decisions in the future on which worker size
to use (read: it's 'self-learning'). This prevents the need to
manually annotate actions with their required resources. This feature
has led to significant cost reductions for some of Buildbarn's users,
as manual annotations are more often wrong than right. Tech talk:
https://www.youtube.com/watch?v=3eKVBwlAHsk
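
As a generic sketch of the idea (not Buildbarn's actual algorithm; see the talk above for that), size selection can be driven by per-size timing statistics for each class of action:

```go
package main

type sizeStats struct {
	costPerSecond float64 // relative price of this worker size
	meanSeconds   float64 // observed mean execution time on it
	samples       int     // number of observations
}

// pickSize returns the index of the cheapest size by expected cost
// (price x time). With no data at all, it falls back to the largest
// size (assumed to be the last entry) so the action is sure to fit.
func pickSize(stats []sizeStats) int {
	best, bestCost := len(stats)-1, -1.0
	for i, s := range stats {
		if s.samples == 0 {
			continue // never measured on this size
		}
		cost := s.costPerSecond * s.meanSeconds
		if bestCost < 0 || cost < bestCost {
			best, bestCost = i, cost
		}
	}
	return best
}
```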

Like Nathan mentioned for TurboCache, we have invested heavily in
writing a FUSE file system to provide lazy loading of input roots.
This implementation was released back in 2020 and is rock-solid by
now; it has literally run billions of build actions/tests at this
point. Initially we only supported FUSE. Though this was fine for
Linux, OSXFUSE on macOS isn't that great, which is why we've recently
generalised this into a true virtual file system that also supports
NFSv4. So on macOS, the worker can, on startup, do a 'mount -t nfs
localhost:/' against an integrated NFSv4 server and do builds inside
of that.

Buildbarn also provides a daemon named bb-clientd that people can
install on their personal systems. It allows you to do Bazel builds in
such a way that you have 'Builds without the Bytes', but are still
able to access the outputs afterwards. bazel-out/ is essentially
replaced by a similar lazy-loading file system. bb-clientd also
provides basic facilities for replaying arbitrary actions stored in the
CAS, making it easy for people to collaborate on investigating complex
build failures.

Be sure to reach out if you have any further questions!

Best regards,
--
Ed Schouten <e...@nuxi.nl>

Fredrik Medley

Sep 2, 2022, 4:31:12 PM
to Remote Execution APIs Working Group
Hi,

We at Meroton support customers in setting up and maintaining Buildbarn clusters, in combination with porting existing build systems to Bazel (I'm not the author, but still biased). In our experience, when Buildbarn is working well, most CPU time is spent checking out code and running Bazel builds that hit the cache nearly 100% of the time.

As Ed happened to answer a few minutes before me, I've erased most of my answer...

> How much have they been tested?
I know several multinational companies using it. bb-deployments now runs a basic integration test for the docker-compose and bare deployments.

> What are their limitations (platforms, clients, scalability, etc.)?
See also https://remote-apis-testing.gitlab.io/remote-apis-testing/ for server-client support matrix.

The modular design of the code base makes it easy to patch and adapt to your own internal needs, e.g. special authorization or your own custom hardware workers. If your support needs go beyond the Slack channel, have a look at https://github.com/buildbarn/bb-storage#commercial-support

Best regards,
Fredrik Medley

Sadaf Matinkhoo

Sep 8, 2022, 10:55:09 AM
to Fredrik Medley, Remote Execution APIs Working Group
Thanks everyone for your responses. This thread has been very helpful to me. I appreciate the time you took to share your experience and feedback.
