TurboCache - A new remote execution / cache server

Nathan Bruer

Apr 14, 2022, 11:04:02 PM
to bazel-discuss

Hi team,

TL;DR

There's a new bazel remote execution / cache project: https://github.com/allada/turbo-cache

I thought I'd put this out there and see if there is any early feedback on a new project I've been working on on-and-off for the last couple of years; I now have lots of time to bring it to completion. It is currently unlicensed because I have not yet chosen what to do with it, but it is open source, and I will likely apply an LGPL or even more permissive license in the near future.

I'd love any feedback any of you have.

My Background

A few years ago at my previous job I built out our remote execution farm (BuildBarn) that serviced about:

  • ~1.2 million unit tests per day (seconds to run)

  • ~20k integration tests per day (median test duration ~8 mins)

  • ~300k build jobs per day

  • ~1 petabyte of cache per month

We spent a huge amount of time trying to keep things stable and to keep infra-related issues to a minimum. Over winter break of 2020 I had some free time, so I decided to start a new project from scratch and build a remote execution / cache server myself, with all the hindsight available to me.

I decided to write the entire thing in Rust to help with stability (and I wanted to try the then-brand-new async/await, which is awesome btw). I also wanted to implement some cool features that I thought this space was lacking.

Current State

There are two main parts of the project: Remote Cache and Remote Execution.

Remote Cache

Remote cache is in the alpha stage. It currently supports:

  • Memory store - Data lives in the machine's memory (with eviction policies)

  • S3 store - Serves objects that live in any service supporting the S3 API

  • Compression store - Compresses data (lz4) and then forwards it on to another store

  • Dedup (de-duplication) store - Uses a rolling-hash algorithm to find the parts of files that are the same, and only processes and stores the parts that have changed (a similar algorithm to the one rsync & bup use). Very efficient for large files with only a few changes in them (see the chunking sketch after this list).

  • FastSlow store - Tries the fast store first; if the object is not found, tries the slow store and then populates the fast store (see the composition sketch after this list)

  • Filesystem store - Stores objects on disk

  • SizePartitioning store - Chooses a store to place objects based on the size field of the digest.

  • Retry logic - For some stores (like S3), operations can be retried on error. Retry & recovery are supported in these cases (transparently to the client).

  • Heavily tested - over 100 unit tests so far. Any bug detected always gets a regression test.

  • Extremely small memory footprint & no garbage collection

  • gRPC-only endpoint
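
Since the Dedup store is the least familiar item above, here is a rough illustration of the idea behind it: content-defined chunking with a rolling hash. This is a minimal, self-contained sketch with made-up constants and a toy hash, not turbo-cache's actual code; real implementations use buzhash/gear hashing so the oldest byte can be removed from the window in O(1).

```rust
// Chunk boundaries depend only on nearby bytes, so editing one region of a
// file shifts only a few chunk boundaries instead of all of them. Each chunk
// can then be stored by its own digest, so unchanged chunks are stored once.

const MIN_CHUNK: usize = 2048;   // never cut chunks smaller than this
const MASK: u32 = (1 << 12) - 1; // cut on average every ~4 KiB

fn chunk_boundaries(data: &[u8]) -> Vec<usize> {
    let mut boundaries = Vec::new();
    let (mut hash, mut start) = (0u32, 0usize);
    for (i, &b) in data.iter().enumerate() {
        hash = hash.rotate_left(1) ^ u32::from(b);
        if i - start >= MIN_CHUNK && hash & MASK == 0 {
            boundaries.push(i + 1); // declare a chunk boundary here
            start = i + 1;
            hash = 0;
        }
    }
    if boundaries.last() != Some(&data.len()) {
        boundaries.push(data.len()); // final partial chunk
    }
    boundaries
}

fn main() {
    // Deterministic pseudo-random data keeps the example self-contained.
    let mut x: u32 = 0xdead_beef;
    let data: Vec<u8> = (0..1 << 20)
        .map(|_| {
            x = x.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
            (x >> 24) as u8
        })
        .collect();
    println!("{} chunks", chunk_boundaries(&data).len());
}
```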

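The stores above compose naturally if every backend implements one small interface. Here is a minimal, illustrative sketch of that pattern (hypothetical trait and names, not turbo-cache's actual API), showing a FastSlow wrapper around two in-memory stores:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// One small trait every backend implements, so stores can wrap each other
// (FastSlow, Compression, Dedup, Retry, ...). Hypothetical API.
trait Store: Send + Sync {
    fn get(&self, key: &str) -> Option<Vec<u8>>;
    fn put(&self, key: &str, data: Vec<u8>);
}

struct MemoryStore(Mutex<HashMap<String, Vec<u8>>>);

impl Store for MemoryStore {
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.0.lock().unwrap().get(key).cloned()
    }
    fn put(&self, key: &str, data: Vec<u8>) {
        self.0.lock().unwrap().insert(key.to_string(), data);
    }
}

// Tries `fast` first; on a miss, reads from `slow` and back-fills `fast`.
struct FastSlowStore {
    fast: Arc<dyn Store>,
    slow: Arc<dyn Store>,
}

impl Store for FastSlowStore {
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        if let Some(data) = self.fast.get(key) {
            return Some(data);
        }
        let data = self.slow.get(key)?;
        self.fast.put(key, data.clone()); // populate the fast store
        Some(data)
    }
    fn put(&self, key: &str, data: Vec<u8>) {
        // Writes go to both layers so later reads hit the fast path.
        self.fast.put(key, data.clone());
        self.slow.put(key, data);
    }
}

fn main() {
    let store = FastSlowStore {
        fast: Arc::new(MemoryStore(Mutex::new(HashMap::new()))),
        slow: Arc::new(MemoryStore(Mutex::new(HashMap::new()))),
    };
    store.put("digest-abc", b"action result".to_vec());
    assert!(store.get("digest-abc").is_some());
}
```
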
Remote Execution

Remote execution is still a work in progress; I estimate it will reach alpha sometime in May/June. Currently, Bazel properly talks to the scheduler & CAS, the scheduler appears to schedule jobs with priorities correctly (see the toy sketch below), and the worker API for interacting with the scheduler is fully implemented. The next stage is to implement the workers.
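
To give a rough idea of what priority scheduling means here, a toy sketch (hypothetical names; not the actual scheduler code) using a max-heap keyed on priority, with FIFO order among equal priorities:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Derived Ord compares fields in declaration order: priority first, then
// insertion order. In this toy example a larger `priority` runs sooner,
// and `Reverse(seq)` makes earlier submissions win ties (FIFO).
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct QueuedAction {
    priority: i32,
    seq: Reverse<u64>,
    action_digest: String, // stand-in for an action digest
}

fn main() {
    let mut queue = BinaryHeap::new();
    queue.push(QueuedAction { priority: 0, seq: Reverse(1), action_digest: "slow-test".into() });
    queue.push(QueuedAction { priority: 5, seq: Reverse(2), action_digest: "urgent-build".into() });
    queue.push(QueuedAction { priority: 0, seq: Reverse(3), action_digest: "another-test".into() });

    // A worker asking for work pops the highest-priority action first.
    while let Some(a) = queue.pop() {
        println!("dispatch {} (priority {})", a.action_digest, a.priority);
    }
}
```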

Fredrik Medley

Apr 24, 2022, 4:58:04 PM
to bazel-discuss
Can you mention, at a high level, some specific pain points in the other remote systems, thinking specifically about Bazel Remote, Buildbarn, and Buildfarm? I'm curious because I don't want to make the same mistakes myself.

Also, don't forget to add TurboCache to the remote-apis-testing repository: https://remote-apis-testing.gitlab.io/remote-apis-testing/

Best regards,
Fredrik

Steven Bergsieker

May 12, 2022, 2:27:50 PM
to Fredrik Medley, Remote Execution APIs Working Group, bazel-discuss
Hi Nathan,

Thanks for posting this! You may want to add TurboCache to the list of known server implementations of the Remote Execution API: https://github.com/bazelbuild/remote-apis#servers

You might also be interested in joining remote-exe...@googlegroups.com and attending our monthly sync for all things remote execution-related. (Joining the group should invite you to the meeting automatically.)

Thanks,
Steven

Nathan Bruer

May 16, 2022, 9:23:35 PM
to bazel-discuss
> Can you mention, on high level, some specific pain points in the other remote systems, thinking specifically about Bazel Remote, Buildbarn and Buildfarm? I'm curious because I don't want to do the same mistakes myself.
I am only going to give the point of view of the specific use case at my previous employer. We made a proof of concept and tried Buildbarn, Scoot, and Buildfarm, and could only get Buildbarn to actually work for us (late 2019). Then in spring 2020 we decided to start the project. Our requirements were to be able to run on Spot instances (AWS), to build-without-bytes, and to run all builds and tests (GPU and CPU tests).

In setting up Buildbarn we wanted to run every job in a container, but Buildbarn (at least at the time) did not support any launch wrappers, so we ended up hacking Buildbarn to always launch the program under a custom shell script that would create temporary AWS credentials, then launch & configure Docker with them injected. Since Buildbarn uses a `-9` (SIGKILL) signal to kill timed-out tasks, this made things incredibly difficult: under Docker the jobs don't run under the same parent process, so we had to use lots of tricks to ensure that the next job run on the instance would terminate the previous Docker container if it was still running.

We also found ourselves sending a `Directory` that contained something like 6 million files. Buildbarn's code (at least in the version we used) does not parallelize downloading a directory tree very efficiently. I submitted a patch to Buildbarn to fix this, but it stalled and was never committed. Without this patch, almost every test took an additional 4+ minutes to iterate and download the files (I think around 200Mb). A sketch of the kind of parallel fetching involved follows below.

Lastly, we tried using a remote cache instance (pure CAS), but quickly found it could not handle the load, so we set up a very complicated cluster (which was hell to manage) and eventually moved to S3 (since S3 now guarantees consistency). This simplified our problems greatly: our downloading averaged over 100GB/s, 24 hours a day (and we even had an on-disk cache on every runner), and S3 can handle this load easily. But Buildbarn decided to remove S3 support from the code, so we got stuck with an old version of the code.
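
For context on that parallelization issue, the general shape of the fix is to fetch a tree's blobs with bounded parallelism instead of serially. A hedged, self-contained sketch (every name here is made up; this is neither Buildbarn's nor turbo-cache's code):

```rust
use std::thread;

// Stand-in for a CAS read over gRPC; a real call would stream bytes.
fn fetch_blob(digest: &str) -> Vec<u8> {
    digest.as_bytes().to_vec()
}

// Fetch many blobs using up to `workers` threads instead of one at a time.
// With millions of small files, serial round-trips dominate wall time.
fn fetch_all(digests: Vec<String>, workers: usize) -> Vec<(String, Vec<u8>)> {
    let per_thread = (digests.len() + workers - 1) / workers.max(1);
    let mut handles = Vec::new();
    for part in digests.chunks(per_thread.max(1)) {
        let part = part.to_vec();
        handles.push(thread::spawn(move || {
            part.into_iter()
                .map(|d| {
                    let bytes = fetch_blob(&d);
                    (d, bytes)
                })
                .collect::<Vec<_>>()
        }));
    }
    handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
}

fn main() {
    let digests: Vec<String> = (0..1000).map(|i| format!("blob-{i}")).collect();
    let results = fetch_all(digests, 32);
    println!("fetched {} blobs", results.len());
}
```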

Our use case had to handle an enormous amount of data and had to be extremely reliable, and we found that the only system we could get to work for our use case could barely handle it and required a lot of hacks.

This is why I decided to build one from scratch with scaling and reliability as the primary focus.

> Thanks for posting this! You may want to add TurboCache to the list of known server implementations of the Remote Execution API: https://github.com/bazelbuild/remote-apis#servers
Thanks, I probably will once I get the remote execution part stable. Right now it "barely" works, and I'm not willing to ask people to use it yet. The caching part is stable, but I think there are plenty of solutions out there for users who only need caching. (i.e. I don't want to advertise a product like this to non-Bazel experts when it is not stable yet.)

> You might also be interested in joining remote-exe...@googlegroups.com and attending our monthly sync for all things remote execution-related. (Joining the group should invite you to the meeting automatically.)
I'll probably check it out. Thanks!