Hi there!
On Fri, 2 Sep 2022 at 20:49, Nathan Bruer <natha...@gmail.com> wrote:
> We found quite a few bugs with Buildbarn when we deployed it to production and ended up making a bunch of hacks to "retry on failure". For example, there was some race conditions that would cause an error stating a file already exists on downloading the artifacts even though there's only 1 job per instance, so we'd have the scheduler sniff out the error code and message buildbarn would return and if it pattern matches a criteria, the scheduler would terminate the instance and reschedule the job.
Did you ever end up reporting this issue? I would have loved to get
those issues fixed. Considering that you started the TurboCache
project, I guess it's unlikely you still have information on those
issues at hand.
On Fri, 2 Sep 2022 at 16:55, 'Sadaf Matinkhoo' via Remote
Execution APIs Working Group <remote-exe...@googlegroups.com> wrote:
> - How's the onboarding experience (effort, community responsiveness)?
Buildbarn has a repository named bb-deployments that contains an
example deployment, which acts as a starting point for setting up a
cluster. That said, Buildbarn is a bit of a Swiss army knife. For
example, there is a storage layer that can be configured in many, many
different ways to do sharding, mirroring, failover, fallback, etc. It
is expected that people setting up Buildbarn clusters do take some
time to read up on all of the configuration options and decide what is
best for their environment.
Buildbarn is thus not a good fit for people who want to set up a
cluster *quickly*, but in my opinion it is good for setting up a
cluster *well*. There is an initial phase that is steep, but as soon
as you've got the hang of it, you can set it up any way you like.
There is a large degree of uniformity in the way services are
configured.
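To give a feel for what "configured in many different ways" means, here is a minimal Python sketch of how sharding and mirroring layers can be stacked on top of each other. It is not Buildbarn's code or configuration format: the class names (MemoryStore, ShardingStore, MirroringStore) and the read-repair behaviour are assumptions made purely for illustration.

```python
import hashlib
from typing import Dict, List, Optional


class MemoryStore:
    """Hypothetical in-memory backend standing in for a real storage node."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def get(self, digest: str) -> Optional[bytes]:
        return self.blobs.get(digest)

    def put(self, digest: str, data: bytes) -> None:
        self.blobs[digest] = data


class MirroringStore:
    """Writes to both replicas; a read that misses on the first replica
    falls back to the second and copies the blob back (assumed behaviour)."""

    def __init__(self, a, b) -> None:
        self.a = a
        self.b = b

    def get(self, digest: str) -> Optional[bytes]:
        data = self.a.get(digest)
        if data is None:
            data = self.b.get(digest)
            if data is not None:
                self.a.put(digest, data)  # repair the replica that lost the blob
        return data

    def put(self, digest: str, data: bytes) -> None:
        self.a.put(digest, data)
        self.b.put(digest, data)


class ShardingStore:
    """Routes each digest to exactly one backend, chosen by hashing the digest."""

    def __init__(self, shards: List) -> None:
        self.shards = shards

    def _pick(self, digest: str):
        value = int(hashlib.sha256(digest.encode()).hexdigest(), 16)
        return self.shards[value % len(self.shards)]

    def get(self, digest: str) -> Optional[bytes]:
        return self._pick(digest).get(digest)

    def put(self, digest: str, data: bytes) -> None:
        self._pick(digest).put(digest, data)


# Two shards, each of which is a mirrored pair of backends.
store = ShardingStore(
    [MirroringStore(MemoryStore(), MemoryStore()) for _ in range(2)]
)
store.put("example-digest", b"payload")
```

Because every layer exposes the same get/put interface, layers can be nested in any order, which is the property that makes this kind of storage stack flexible but also something you need to read up on before deploying.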
In terms of community responsiveness, there is a Slack channel on
buildteamworld.slack.com named #buildbarn where many of its users hang
out. Some other implementations also have channels there. Buildbarn's
channel is the most active in terms of postings. About half of the threads
have at least three people commenting, meaning that engagement rates
are pretty high.
> - How's the debugging experience (metrics, tools)?
I think the debugging experience is pretty good. There is a very
large number of Prometheus metrics that give insight into all
of the services' subsystems. The bb-deployments repository also has
some Grafana dashboards that you can use.
We have pretty decent support for OpenTelemetry. There's middleware
included that can attach arbitrary gRPC request and response fields to
trace spans, so you can make the traces as simple/complex as you want.
Another nifty feature is that Buildbarn always attaches URLs to
bb-browser for any action it executes. bb-browser can be used to
inspect actions and their outcomes. Bazel prints these URLs when
builds fail, meaning that inspecting failures in more detail is just a
single mouse click away. This feature has been a life-saver for people
who design their own build rules, and have a hard time figuring out
why their actions fail.
I often get remarks that Buildbarn is bad at logging, because it logs
so little. This is intentional, as we're investing heavily in metrics,
tracing, and properly propagating errors up the stack.
> - How much have they been tested?
There is a large amount of unit testing going on. Integration/system
tests are not provided on GitHub, but you may assume that this is done
extensively by its main developers.
> - What are their limitations (platforms, clients, scalability, etc.)?
Platforms: Linux, macOS, FreeBSD. There is also some experimental
Windows support, but I only rarely hear stories of people actually
using it. I guess demand isn't that high.
Scalability: petabytes of storage, ~100k worker threads.
> - Which features do they support (autoscaling, heterogeneous workers, build without the bytes, etc.)?
Commenting on the features you brought up yourself: yes, yes, yes.
There are also some other features worth mentioning.
Buildbarn's scheduler has support for multiple worker sizes. You can
create multiple pools of workers that all run the same OS/container,
but have different sizes (CPUs, memory). The scheduler is capable of
measuring execution times of actions on different worker sizes, and is
capable of making smarter decisions in the future on which worker size
to use (read: it's 'self-learning'). This prevents the need to
manually annotate actions with their required resources. This feature
has led to significant cost reductions for some of Buildbarn's users,
as manual annotations are more often wrong than right. Tech talk:
https://www.youtube.com/watch?v=3eKVBwlAHsk
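As a rough illustration of the idea, here is a minimal Python sketch of a picker that learns from observed execution times. It is not Buildbarn's actual algorithm (the talk above covers that). The size class names, the relative costs, the use of the median, and the SizeClassPicker class itself are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import median
from typing import Dict

# Hypothetical size classes: name -> relative cost per second of execution.
SIZE_CLASSES: Dict[str, float] = {"small": 1.0, "large": 4.0}


class SizeClassPicker:
    """Picks a worker size class per action based on past execution times."""

    def __init__(self, size_classes: Dict[str, float], timeout_seconds: float) -> None:
        self.size_classes = size_classes
        self.timeout = timeout_seconds
        # history[action_key][size_class] -> list of observed durations
        self.history = defaultdict(lambda: defaultdict(list))

    def record(self, action_key: str, size_class: str, seconds: float) -> None:
        self.history[action_key][size_class].append(seconds)

    def pick(self, action_key: str) -> str:
        seen = self.history.get(action_key, {})
        # Explore: try every size class once, cheapest first.
        for name in sorted(self.size_classes, key=self.size_classes.get):
            if not seen.get(name):
                return name
        # Exploit: among classes that finish within the timeout, pick the
        # one with the lowest expected cost (median duration * cost rate).
        candidates = [
            name for name in self.size_classes
            if median(seen[name]) <= self.timeout
        ]
        if not candidates:
            # Nothing finishes in time; fall back to the largest class.
            return max(self.size_classes, key=self.size_classes.get)
        return min(
            candidates,
            key=lambda name: median(seen[name]) * self.size_classes[name],
        )


picker = SizeClassPicker(SIZE_CLASSES, timeout_seconds=60.0)
picker.record("compile", "small", 10.0)
picker.record("compile", "large", 8.0)
choice = picker.pick("compile")  # "small": barely slower, four times cheaper
```

The point of the sketch is that no action needs a manual resource annotation: the scheduler only needs a stable key per kind of action and a record of how long it took on each size.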
Like Nathan mentioned for TurboCache, we have invested heavily in
writing a FUSE file system to provide lazy-loading of input roots.
This implementation was released back in 2020, and is rock-solid by
now. It has literally run billions of build actions/tests at this
point. Initially we only supported FUSE. Though this was fine for
Linux, OSXFUSE on macOS isn't that great. This is why we've recently
generalised this into a true virtual file system that also supports
NFSv4. So on macOS, the worker can on startup do a 'mount -t nfs
localhost:/' against an integrated NFSv4 server, and do builds inside
of that.
Buildbarn also provides a daemon named bb-clientd that people can
install on their personal systems. It allows you to do Bazel builds in
such a way that you have 'Builds without the Bytes', but are still
able to access the outputs afterwards. bazel-out/ is essentially
replaced by a similar lazy-loading file system. bb-clientd also
provides basic facilities for replaying arbitrary actions stored in the
CAS, making it easy for people to collaborate on investigating complex
build failures.
Be sure to reach out if you have any further questions!
Best regards,
--
Ed Schouten <e...@nuxi.nl>