Hi team,
TL;DR
There's a new bazel remote execution / cache project: https://github.com/allada/turbo-cache
I thought I'd put this out there and see if there is any early feedback on a new project I've been working on on-and-off for the last couple of years and now actually have lots of time to bring to completion. Currently it is unlicensed because I have not yet decided what to do with it, but the source is public and I will likely apply at least an LGPL or an even more permissive license in the near future.
I'd love any feedback any of you have.
My Background
A few years ago, at my previous job, I built out our remote execution farm (BuildBarn), which serviced about:
~1.2 million unit tests per day (seconds to run)
~20k integration tests per day (median test duration was ~8 mins)
~300k build jobs per day.
~1 petabyte of cache / month
We spent a huge amount of time trying to keep things stable and to keep infra-related issues to a minimum. Over the 2020 winter break I had some free time, so I decided to start a new project from scratch and build the remote execution / cache server myself, with all the hindsight now available to me.
I decided to write the entire thing in Rust to help with stability (and I wanted to try the brand new Async/Await [which is awesome btw]). I also wanted to implement some cool features that I thought this space was lacking.
Current State
There are two main parts of the project: Remote Cache and Remote Execution.
Remote Cache
The remote cache is in the alpha stage. It currently supports:
Memory store - Data lives in the local machine's memory (with eviction policies)
S3 store - Serves objects that live in any service that supports the S3 API
Compression store - Compresses data (lz4) and then forwards it on to another store
Dedup (de-duplication) store - Uses a rolling-hash algorithm to find the parts of files that are identical and only processes and stores the parts that have changed (a similar algorithm to what rsync & bup use). Very efficient when large files have only a few changes in them (see the chunking sketch after this list).
FastSlow store - Tries the fast store first; if the object is not found, falls back to the slow store (and then populates the fast store). A sketch of this pattern is below the list.
Filesystem store - Stores objects on disk
SizePartitioning store - Chooses a store to place objects based on the size field of the digest.
Retry logic - With some stores (like S3) it can make sense to retry on an error. Retry & recovery are supported in these cases (without the client knowing).
Heavily tested - over 100 unit tests so far. Any bug detected always gets a regression test.
gRPC-only endpoint
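
To make the FastSlow layering above concrete, here is a minimal sketch of how such a composed store could look. This is illustration only: the Store trait, the method names, and the synchronous whole-buffer signatures are invented for the example and are not the project's actual API.

```rust
// Hypothetical sketch only: trait and type names are invented for illustration.

/// A content-addressed blob store keyed by digest.
trait Store {
    fn get(&self, digest: &str) -> Option<Vec<u8>>;
    fn put(&self, digest: &str, data: &[u8]);
}

/// Composes two stores: read from the fast one first, fall back to the slow
/// one, and backfill the fast store on a slow-store hit.
struct FastSlowStore<F: Store, S: Store> {
    fast: F,
    slow: S,
}

impl<F: Store, S: Store> Store for FastSlowStore<F, S> {
    fn get(&self, digest: &str) -> Option<Vec<u8>> {
        if let Some(data) = self.fast.get(digest) {
            return Some(data);
        }
        // Miss in the fast store: try the slow store and populate the fast one.
        let data = self.slow.get(digest)?;
        self.fast.put(digest, &data);
        Some(data)
    }

    fn put(&self, digest: &str, data: &[u8]) {
        // Write through both layers so later reads can be served from the fast store.
        self.fast.put(digest, data);
        self.slow.put(digest, data);
    }
}
```

Because every store exposes the same interface, the other stores (compression, dedup, size partitioning, etc.) can be stacked the same way, each one forwarding to whatever store sits beneath it.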
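For the dedup store, the rolling-hash idea is the same content-defined chunking trick rsync and bup are built on: chunk boundaries are derived from the bytes themselves, so an edit in the middle of a large file only changes the chunks around the edit. The sketch below is a rough, hypothetical illustration of that technique; the window size, mask, and hash function are made up and are not what the project actually uses.

```rust
// Rough sketch of content-defined chunking with a rolling hash.
// Constants and the hash function are illustrative only.

const WINDOW: usize = 64;          // bytes covered by the rolling hash
const MASK: u32 = (1 << 13) - 1;   // targets an average chunk size of ~8 KiB

/// Returns the end offsets of chunks. Boundaries depend only on content,
/// so identical regions of two files chunk the same way.
fn chunk_boundaries(data: &[u8]) -> Vec<usize> {
    // Precompute 31^WINDOW so the byte leaving the window can be rolled out.
    let pow = 31u32.wrapping_pow(WINDOW as u32);
    let mut boundaries = Vec::new();
    let mut hash: u32 = 0;
    for (i, &b) in data.iter().enumerate() {
        // Roll the new byte in...
        hash = hash.wrapping_mul(31).wrapping_add(b as u32);
        // ...and roll the byte that just left the window out.
        if i >= WINDOW {
            hash = hash.wrapping_sub((data[i - WINDOW] as u32).wrapping_mul(pow));
        }
        // Cut a chunk wherever the low bits of the hash hit a fixed pattern.
        if (hash & MASK) == MASK {
            boundaries.push(i + 1);
        }
    }
    if boundaries.last() != Some(&data.len()) {
        boundaries.push(data.len());
    }
    boundaries
}
```

Chunks can then be stored once, keyed by their own digest, and a large file becomes a small index of chunk digests; two mostly-identical files end up sharing almost all of their chunks.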
Remote Execution
Remote execution is still a work in progress; I estimate it will be in the alpha stage sometime in May/June. Currently it has bazel properly talking to the scheduler & CAS, the scheduler appears to properly schedule jobs w/ priorities, and the worker API used to interact with the scheduler is all implemented. The next stage is to implement the workers.
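
Since the scheduler is the part taking shape right now, here is a purely illustrative sketch of what "schedule jobs w/ priorities" can boil down to: a priority-ordered queue of pending actions that workers pull from. The struct, field names, and priority convention below are invented for the example and do not reflect the project's real scheduler.

```rust
// Purely illustrative sketch of a priority-ordered action queue; names and
// fields are invented and do not reflect the project's real scheduler.
use std::collections::BinaryHeap;

#[derive(PartialEq, Eq)]
struct QueuedAction {
    priority: i32,         // higher value = run sooner (convention chosen for this sketch)
    insert_seq: u64,       // tie-breaker so equal priorities stay FIFO
    action_digest: String, // identifies the action to execute
}

impl Ord for QueuedAction {
    fn cmp(&self, other: &Self) -> std::cmp::Ordering {
        // Higher priority first; among equals, the earlier-enqueued action first.
        self.priority
            .cmp(&other.priority)
            .then(other.insert_seq.cmp(&self.insert_seq))
    }
}

impl PartialOrd for QueuedAction {
    fn partial_cmp(&self, other: &Self) -> Option<std::cmp::Ordering> {
        Some(self.cmp(other))
    }
}

struct Scheduler {
    queue: BinaryHeap<QueuedAction>,
    next_seq: u64,
}

impl Scheduler {
    fn enqueue(&mut self, action_digest: String, priority: i32) {
        self.queue.push(QueuedAction {
            priority,
            insert_seq: self.next_seq,
            action_digest,
        });
        self.next_seq += 1;
    }

    /// Called when a worker asks for work: hand out the highest-priority action.
    fn next_action(&mut self) -> Option<QueuedAction> {
        self.queue.pop()
    }
}
```

A real scheduler additionally has to track in-flight actions, worker capabilities, timeouts, and retries, which this sketch leaves out.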