A remote output service for Bazel ("bazel-out/ on FUSE"): working towards a standard protocol that we can keep in remote-apis?


Ed Schouten

Dec 28, 2020, 10:57:40 AM
to Remote Execution APIs Working Group
Hello everyone,

TL;DR: I want to add a new protocol to the remote-apis repository that Bazel can use to talk to a FUSE helper daemon. Any feedback, objections, ...?

As some of us have experienced, Bazel remote execution may generate lots of network traffic. The "Builds without the Bytes" effort has addressed this, but comes with the downside that outputs of builds can no longer be accessed. In fact, I don't think you even get insight into which outputs the build would have yielded. For typical software development workflows, this solution is impractical. You often want to access build artifacts without really knowing which ones you're going to access up front.

Failed attempts

To see whether we can get the best of both worlds (fast builds, while still getting access to all outputs), I've been experimenting with letting bazel-out/ be backed by a virtual file system (FUSE). Below are three solutions I've worked on over the last couple of months that I think we should NOT pursue:

1. A FUSE file system that's directly integrated into Bazel. I eventually abandoned this approach for a couple of reasons. First and foremost, integrating FUSE into Bazel makes it a lot harder to run Bazel in unprivileged environments (Docker/Kubernetes containers). Lots of people tend to do this, e.g. for CI. Second, I remember reading a "Vision on Bazel storage" document about a year ago that stated that adding a direct dependency on FUSE was undesirable. Furthermore, FUSE is not cross platform. For systems like macOS it may be smarter to run a user space NFS/9p server, as that doesn't require kernel extensions.

2. A separate FUSE daemon that exposes the entire CAS under a single directory, where files can be accessed under the name "${hash}-${sizeBytes}". Bazel would then no longer download files from the CAS, but simply emit symbolic links pointing into the FUSE mount. PRs: #1 #2. This approach eventually worked okayish from the Bazel side, but tends to confuse build actions that call realpath() on input files. This effectively breaks dynamic linkage with rpath, Python module loading, etc.

3. A FUSE daemon like the one above, that in addition to a CAS directory also offers a tmpfs-like scratch space directory. By storing the bazel-out/ directory inside the scratch space directory, Bazel can emit hardlinks. This keeps build actions happy. The downside of this approach is that it's relatively slow due to the high number of FUSE operations. Operations that need to modify many files, such as runfiles link creation and 'bazel clean', take ages.

Successful attempt

After three unsuccessful attempts, I've now ended up with a solution that I think works well. Namely, I have created a daemon that runs both a FUSE file system and a gRPC server. Initially the FUSE mount is empty and immutable. Every time Bazel starts a new build, it notifies the FUSE daemon over gRPC to request a new build directory. Basically an mkdir(), except that all sorts of additional metadata is exchanged. Bazel then uses this FUSE-backed directory to store all of its outputs. Every time Bazel needs to do something that is inefficient to do over FUSE, it calls into the daemon over gRPC. Examples of such operations include:

- Batch-creating hundreds of symlinks when creating .runfiles directories,
- Creating lazy-loading CAS-backed files and directories, based on digests stored in ActionResult messages,
- Computing digests of files (the daemon can return these instantly for lazy-loading CAS-backed files),
- ...

Once the build is finished, Bazel performs a final gRPC call against the daemon to finalize the results. My implementation doesn't do anything fancy with that right now, but it could use that occasion to snapshot/archive the output directory. That would allow a user to time travel between bazel-out/ directories and compare their results.
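
To give a rough idea of its shape, the service boils down to a handful of RPCs. The sketch below is illustrative only (the RPC names and especially the request/response message names are my shorthand here); the .proto in my fork is the authoritative definition:

// Sketch only; see the schema in my fork for the real definitions.
syntax = "proto3";

import "google/protobuf/empty.proto";

service RemoteOutputService {
  // Called at the start of every build. Hands the daemon the workspace
  // identifier and invocation ID, and returns the directory inside the
  // virtual file system where this build's outputs must be stored.
  rpc StartBuild(StartBuildRequest) returns (StartBuildResponse);

  // Batch operations that are too slow to perform through FUSE itself:
  // creating many symlinks, placing lazy-loading CAS-backed files and
  // directories based on ActionResult digests, etc.
  rpc BatchCreate(BatchCreateRequest) returns (google.protobuf.Empty);

  // Obtain file metadata (including digests) without faulting files in
  // through the kernel.
  rpc BatchStat(BatchStatRequest) returns (BatchStatResponse);

  // Called once the build is done, so that the daemon can finalize
  // (and potentially snapshot/archive) the output directory.
  rpc FinalizeBuild(FinalizeBuildRequest) returns (google.protobuf.Empty);
}

// Fields elided for brevity; see the schema in my fork.
message StartBuildRequest {}
message StartBuildResponse {}
message BatchCreateRequest {}
message BatchStatRequest {}
message BatchStatResponse {}
message FinalizeBuildRequest {}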

The changes I made to Bazel can be found in a branch in my fork on GitHub. The most notable change is the addition of GrpcRemoteOutputService.

The gRPC protocol schema

The schema that my copy of Bazel uses to communicate with my FUSE daemon can for the time being be found in my fork of Bazel. As you can see, it's a relatively simple protocol. It depends on REv2, but not on anything specific to Bazel. My suspicion is that this protocol could also be used by other build clients, such as Pants. I'd love to hear from the maintainers of such tools whether they agree. Because of that, I would like to see if we could upstream this into the remote-apis repository, as opposed to keeping it in the Bazel tree. Thoughts on this?

During the last remote execution monthly meeting Ed Baunton and/or Sander Striker asked how this protocol is different from BuildGrid's local_cas protocol. The answer to that is simple: there doesn't seem to be any overlap whatsoever. The local_cas protocol is oblivious of builds; there is no scoping/context. Furthermore, it can only be used to stage trees based on Directory objects stored in the CAS, while we need to stage files, symlinks and Tree objects (contained in OutputFile, OutputSymlink and OutputDirectory messages). In summary, the local_cas protocol seems to be designed for use on workers, while the remote output service protocol that I've designed is for use on clients.

My plan for now is to continue implementing this. Eventually I will send out PRs against Bazel to add support for this protocol. Furthermore, I will release the source code of my FUSE daemon (which is largely built on top of Buildbarn's frameworks) at github.com/buildbarn/bb-clientd.

Best regards,
--
Ed Schouten <e...@nuxi.nl>

Eric Burnett

Jan 5, 2021, 11:36:53 AM
to Ed Schouten, Remote Execution APIs Working Group
Heya Ed (and I hope you had a happy new year!),

First, thanks for sharing all of this, and for prototyping so many things. I've added some specific comments below, but I don't have as much experience as I'd like with fuse filesystems, so I've also circulated this thread amongst some folk at Google who do. Hopefully they'll chime in with anything relevant :). But to my eyes, I really like the direction you're going here.

I haven't yet looked at your linked code, but one place I imagine there may be some tricky nuance is in the negotiation of where to place the files. The grpc service is presumably constrained in what paths it has control over and is capable of virtualizing, and bazel is presumably opinionated about where the output tree should live. Would you mind summarizing how this gets reconciled?

Thanks!
--Eric

Failed attempts

To see whether we can get the best of both worlds (fast builds, while still getting access to all outputs), I've been experimenting with letting bazel-out/ be backed by a virtual file system (FUSE). Below are three solutions I've worked on over the last couple of months that I think we should NOT pursue:

1. A FUSE file system that's directly integrated into Bazel. I eventually abandoned this approach for a couple of reasons. First and foremost, integrating FUSE into Bazel makes it a lot harder to run Bazel in unprivileged environments (Docker/Kubernetes containers). Lots of people tend to do this, e.g. for CI. Second, I remember reading a "Vision on Bazel storage" document about a year ago that stated that adding a direct dependency on FUSE was undesirable. Furthermore, FUSE is not cross platform. For systems like macOS it may be smarter to run a user space NFS/9p server, as that doesn't require kernel extensions.

The document you're referencing is probably this one. I didn't go quite as far as to say FUSE support shouldn't be linked into bazel, just that it needs to remain optional so that builds can always be done without it. But that said, I do like the separation of concerns of getting it out of bazel, especially when thinking about needing to link in os-specific implementations with potentially divergent logic. That'd require a standard abstraction like your gRPC server anyways, at which point, breaking them into two has lots of secondary benefits (like not having to implement it in java :)).
 

2. A separate FUSE daemon that exposes the entire CAS under a single directory, where files can be accessed under the name "${hash}-${sizeBytes}". Bazel would then no longer download files from the CAS, but simply emit symbolic links pointing into the FUSE mount. PRs: #1 #2. This approach eventually worked okayish from the Bazel side, but tends to confuse build actions that call realpath() on input files. This effectively breaks dynamic linkage with rpath, Python module loading, etc.

Makes sense. We avoided symlinks on the execution side for the same reason - too many tools mishandle them, and we had no expectation we could clean enough of them up to make a symlinking story viable. IIRC on the bazel side it does use symlinks in some cases (the runfiles tree), but the execution sandbox does something else (bind mounts?) to avoid it.

3. A FUSE daemon like the one above, that in addition to a CAS directory also offers a tmpfs-like scratch space directory. By storing the bazel-out/ directory inside the scratch space directory, Bazel can emit hardlinks. This keeps build actions happy. The downside of this approach is that it's relatively slow due to the high number of FUSE operations. Operations that need to modify many files, such as runfiles link creation and 'bazel clean', take ages.

I was recently evaluating hardlinks myself, and have one observation: performance is equivalent to symlinks for "small" numbers of files, but it starts to degrade nonlinearly above ~10k, and is additionally impacted by interleaving link creation with file creation in the source directory. (E.g. populating a directory of 500k files and then creating 500k hardlinks out of it takes <1m; mixing creating and deleting takes ~2m. On ext4, on GCE.)

I'd imagine that for most builds if you used hardlinks only for the "real" outputs and continued to have bazel build runfiles directories as symlinks to those (as it does today iirc), you'd probably find it viable even for fully cached builds until somewhere north of 100k total outputs. Though that's probably best left as an implementation detail for your daemon to capitalize on or not - I'm not sure if it's necessary once you've handed everything off to a daemon anyways.




Ed Schouten

Jan 5, 2021, 4:34:19 PM
to Eric Burnett, Remote Execution APIs Working Group
Hi Eric,

Op di 5 jan. 2021 om 17:36 schreef Eric Burnett <ericb...@google.com>:
> Heya Ed (and I hope you had a happy new year!),

Happy new year to you as well! \o/

> First, thanks for sharing all of this, and for prototyping so many things. I've added some specific comments below, but I don't have as much experience as I'd like with fuse filesystems, so I've also circulated this thread amongst some folk at Google who do. Hopefully they'll chime in with anything relevant :). But to my eyes, I really like the direction you're going here.

Thanks!

> I haven't yet looked at your linked code, but one place I imagine there may be some tricky nuance is in the negotiation of where to place the files. The grpc service is presumably constrained in what paths it has control over and is capable of virtualizing, and bazel is presumably opinionated about where the output tree should live. Would you mind summarizing how this gets reconciled?

That's a good question! The FUSE daemon I'm working on is built in
such a way that it doesn't need any access to the file system from
Bazel's perspective. It just manages the data stored in a single
virtual file system. That path is configured on both the FUSE daemon
side (through a config option) and Bazel (through a command line
flag). Those may be different in case of containerization. At the
start of every build, Bazel sends md5(workspace_dir) and the
invocation ID to the FUSE daemon. Based on that information, the FUSE
daemon creates a directory in its virtual file system and communicates
its name back to Bazel (output_path_suffix), so that Bazel knows where
its bazel-out/ symlink needs to point.

Except for BatchStat(), none of the RPCs care about the state of the
system outside the virtual file system. Those RPCs only deal with
relative paths underneath the output path. The one RPC that is harder
to get right is BatchStat(). This RPC can be used by Bazel to get the
digests corresponding to a list of files. This is tricky, because the
output path can contain symlinks with absolute paths, which may or may
not resolve to a location inside the output path. To solve that, Bazel
will send two pieces of information to the FUSE daemon at the start of
every build:

- output_path_prefix: The absolute path of the FUSE mount from Bazel's
perspective,
- output_path_aliases: A map<absolute path, relative path> of paths
that during the build will resolve to locations inside the output path
stored on FUSE.

Here's an example of what this could look like on my system:

- output_path_prefix: "/home/ed/bb_clientd/builds"
- output_path_aliases:
{"/home/ed/.cache/bazel/${md5sum}/execroot/bazel-out": "."}

So the FUSE daemon knows that during the build,
"/home/ed/bb_clientd/builds/${output_path_suffix}" is where build
outputs are stored, and that
"/home/ed/.cache/bazel/${md5sum}/execroot/bazel-out" will be a symlink
pointing to it. The FUSE daemon then has a UNIX-style pathname
resolver that can take these parameters into account. If the path
resolves to a location inside the output path, the FUSE daemon will
respond with a digest. If the FUSE daemon can't tell, it will send a
partially resolved path back to Bazel, which Bazel can use to compute
the digest manually.
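
In sketch form, the relevant part of the StartBuild exchange then
looks something like this (field names follow the description above;
the exact definitions live in my fork and may differ):

// Illustrative sketch; the .proto in my fork is authoritative.
message StartBuildRequest {
  // Identifies the workspace, e.g. md5(workspace_dir).
  string output_base_id = 1;   // field name is a guess
  // The invocation ID of the build that is about to start.
  string build_id = 2;         // field name is a guess
  // Absolute path of the FUSE mount, from Bazel's perspective.
  string output_path_prefix = 3;
  // Paths that will resolve to locations inside the output path
  // during the build, e.g. {".../execroot/bazel-out": "."}.
  map<string, string> output_path_aliases = 4;
}

message StartBuildResponse {
  // Name of the directory the daemon created inside its virtual file
  // system; Bazel points its bazel-out/ symlink at
  // ${output_path_prefix}/${output_path_suffix}.
  string output_path_suffix = 1;
}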

> > I remember reading a "Vision on Bazel storage" document about a year ago that stated that adding a direct dependency on FUSE was undesirable.
>
> The document you're referencing is probably this one. I didn't go quite as far as to say FUSE support shouldn't be linked into bazel, just that it needs to remain optional so that builds can always be done without it. But that said, I do like the separation of concerns of getting it out of bazel, especially when thinking about needing to link in os-specific implementations with potentially divergent logic. That'd require a standard abstraction like your gRPC server anyways, at which point, breaking them into two has lots of secondary benefits (like not having to implement it in java :)).

Exactly! \o/

>> 3. A FUSE daemon like the one above, that in addition to a CAS directory also offers a tmpfs-like scratch space directory. By storing the bazel-out/ directory inside the scratch space directory, Bazel can emit hardlinks. This keeps build actions happy. The downside of this approach is that it's relatively slow due to the high number of FUSE operations. Operations that need to modify many files, such as runfiles link creation and 'bazel clean', take ages.
>
> I was recently evaluating hardlinks myself, and have one observation: performance is equivalent to symlinks for "small" numbers of files, but it starts to degrade nonlinearly above ~10k, and is additionally impacted by interleaving link creation with file creation in the source directory. (E.g. populating a directory of 500k files and then creating 500k hardlinks out of it takes <1m; mixing creating and deleting takes ~2m. On ext4, on GCE.)
>
> I'd imagine that for most builds if you used hardlinks only for the "real" outputs and continued to have bazel build runfiles directories as symlinks to those (as it does today iirc), you'd probably find it viable even for fully cached builds until somewhere north of 100k total outputs. Though that's probably best left as an implementation detail for your daemon to capitalize on or not - I'm not sure if it's necessary once you've handed everything off to a daemon anyways.

Interesting! For now my plan was indeed to stick to the same layout as
Bazel, even though hardlinks would be better in my case (more compact
to store). Still, it's worth investigating how runfiles in general may
be optimized going forward. Right now I observe that runfiles links
creation basically causes a quadratic explosion (number of files in a
common dependency * number of tests). Maybe we can eventually figure
out ways to bring this closer to linear...

Eric Burnett

Jan 5, 2021, 5:53:17 PM
to Ed Schouten, Remote Execution APIs Working Group
Gotcha, I think I follow. Configuring both independently and then building a mapping from one ID to another makes sense. I could imagine cases where it matters that the absolute path linked under a workspace is unstable from build to build (e.g. incremental debug builds where absolute paths sneak in), but those generally already have trouble due to local/remote path discrepancies anyways.

More questions, as I come up with them :D:
1. How does this work for local execution, where bazel would normally write files into bazel-out directly (iirc)? I don't see anything in the proto relating to adding files from the local filesystem, but I'm not sure if that means hybrid execution isn't supported yet, or that the fuse mount is readwrite and so bazel can just write files directly.

2. Why do the symlink resolving under BatchStat (with all the complexity that entails), rather than having bazel resolve symlinks into absolute paths first via the filesystem and then ask questions about concrete files only? Or even read the hashes from the filesystem by way of extended attributes? (That's what I've imagined for bazel running on top of virtualized input filesystems, though I haven't gone particularly deep.) I'm guessing performance, but I'm not sure why.

3. Would the virtual directory get recreated for each incremental build, or is it possible to "reopen" it and add/remove only the deltas? The API makes me think it needs to be recreated anew each time, but that may also just be that you haven't gotten to worrying about incremental builds yet :). 

Interesting! For now my plan was indeed to stick to the same layout as
Bazel, even though hardlinks would be better in my case (more compact
to store). Still, it's worth investigating how runfiles in general may
be optimized going forward. Right now I observe that runfiles links
creation basically causes a quadratic explosion (number of files in a
common dependency * number of tests). Maybe we can eventually figure
out ways to bring this closer to linear...

Yeah, I wasn't suggesting you change how runfiles are laid out, just that iirc today they're already symlinks, and so the number of real files (potential hardlinks) is comparatively small.

Runfiles being quadratic IIRC is a known pain point. On Windows IIRC they're handled by having a "runfiles manifest" instead of symlinks, where the list of files is written out to a single file instead. Beyond that, the main optimizations I remember seeing were changes to create runfiles trees in fewer and fewer scenarios (--nobuild_runfile_links, --nobuild_runfiles_manifest, --nolegacy_external_runfiles, etc), with a desire to maybe turn them off by default? Since they're only actually needed for executables you want to run locally, which is presumably the exception rather than the rule and could maybe be handled similar to --experimental_remote_download_outputs=toplevel. But that's speculative on my part.

Rupert Shuttleworth

Jan 6, 2021, 1:10:40 AM
to Remote Execution APIs Working Group
Hi there,

Firstly, this is very exciting and cool to see.

My 2c as a Googler who has done some Bazel + FUSE things recently:

- Agree that having arbitrary, lazy access to remote outputs is useful, especially for debugging. For CI purposes you probably don't need this, but it can be useful for interactive builds. One alternative to FUSE would be to just redo failed builds in "download all" mode, though. Not a great solution but it may be acceptable for most people, especially for smaller builds where the bandwidth and disk space needed to download everything is not prohibitive.

- FUSE support on Macs is currently problematic, both because of the unknown future of kernel extensions from the Apple side, and because of the recently closed-source nature of osxfuse/macfuse. Maybe you don't care about this, but a lot of Bazel users do use Macs, so something to keep in mind...

- It is useful for Bazel to be able to write to the output folder (e.g. for local-only actions or for dynamic execution) even if most of the other outputs are lazy references to remote files.

- Be careful that you don't slow down builds (especially incremental builds). The overhead of accessing files using FUSE may mean that smaller, incremental builds are slower, especially if they use local-only actions or dynamic execution. Maybe not a deal breaker, but something to keep in mind. 

- What happens if you do a clean build and then want to do an incremental build a few days later, or want to access some remote outputs a few days later? Will the remote outputs still be there to lazily download? Do you need an API to tell the remote cache to persist files / keep files alive while the local workspace exists?

- If any remote files are downloaded, can they be shared/re-used locally across different Bazel builds / workspaces? (do you have some local cache of downloaded outputs?). 

Well, I guess that's all I can think of for now. Sorry if you already addressed these, I haven't had time to read your code yet...

Cheers
Rupert

Ed Schouten

Jan 6, 2021, 4:20:39 AM
to Eric Burnett, rup...@google.com, Remote Execution APIs Working Group
Hi Eric, Rupert,

Thank you both for your input. I really appreciate it! Let me respond
to both your messages at once.

Op di 5 jan. 2021 om 23:53 schreef Eric Burnett <ericb...@google.com>:
> 1. How does this work for local execution, where bazel would normally write files into bazel-out directly (iirc)? I don't see anything in the proto relating to adding files from the local filesystem, but I'm not sure if that means hybrid execution isn't supported yet, or that the fuse mount is readwrite and so bazel can just write files directly.

It's the latter. Locally executing actions just write into the FUSE
file system directly.

> 2. Why do the symlink resolving under BatchStat (with all the complexity that entails), rather than having bazel resolve symlinks into absolute paths first via the filesystem and then ask questions about concrete files only? Or even read the hashes from the filesystem by way of extended attributes? (That's what I've imagined for bazel running on top of virtualized input filesystems, though I haven't gone particularly deep.) I'm guessing performance, but I'm not sure why.

Yes, it's performance related. I've noticed that letting Bazel stat()
a large number of files through FUSE is quite expensive. Not
necessarily in raw CPU time, but it's the context switching between
Bazel, the kernel and the FUSE daemon that's pretty bad. Depending on
the OS, memory size, etc., there's also a lot of thrashing in the VFS,
where the kernel starts to exhaust its inodes relatively quickly. By
letting the FUSE daemon do that symlink expansion on its own data
structures, we bypass the kernel VFS entirely, which seemed to reduce
system time significantly.
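
At the protocol level, that means each BatchStat result can take one
of a few shapes. Roughly (message and field names are illustrative,
continuing the sketch from my first message; the REv2 Digest type is
assumed to be imported from remote_execution.proto):

message BatchStatRequest {
  // Paths, relative to the output path, whose status is requested.
  repeated string paths = 1;
  // Whether the daemon should follow symlinks while resolving paths.
  bool follow_symlinks = 2;   // field name is a guess
}

message StatResult {          // message name is a guess
  oneof result {
    // The path resolved to a file inside the output path; for
    // lazy-loading CAS-backed files the digest is returned instantly.
    build.bazel.remote.execution.v2.Digest digest = 1;
    // The daemon couldn't fully resolve the path (e.g. an absolute
    // symlink pointing outside the output path); Bazel receives the
    // partially resolved path and computes the digest itself.
    string partially_resolved_path = 2;
  }
}

message BatchStatResponse {
  // One entry per requested path, in the same order.
  repeated StatResult responses = 1;
}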

> 3. Would the virtual directory get recreated for each incremental build, or is it possible to "reopen" it and add/remove only the deltas? The API makes me think it needs to be recreated anew each time, but that may also just be that you haven't gotten to worrying about incremental builds yet :).

Both are possible. When Bazel calls StartBuild() over gRPC, the FUSE
daemon returns the output path suffix (as discussed previously), but
also returns some information about the initial contents of that
directory. It can say: "The contents haven't changed since build
${previous_invocation_id}". If Bazel sees that this matches up with
its own previous invocation ID, it can skip scanning the file system.
On the Bazel side I'm creating a ModifiedFileSet based on that
information.
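
Extending the StartBuildResponse sketch from earlier (names are,
again, illustrative rather than the real ones from my fork):

message StartBuildResponse {
  string output_path_suffix = 1;
  // If set, the output path still holds the outputs of the given
  // earlier build, so Bazel can skip rescanning the file system and
  // construct a ModifiedFileSet instead of starting from scratch.
  InitialOutputPathContents initial_output_path_contents = 2;
}

message InitialOutputPathContents {   // message name is a guess
  // Invocation ID of the previous build whose outputs are intact.
  string build_id = 1;
}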

> Runfiles being quadratic IIRC is a known pain point. On Windows IIRC they're handled by having a "runfiles manifest" instead of symlinks, where the list of files is written out to a single file instead. Beyond that, the main optimizations I remember seeing were changes to create runfiles trees in fewer and fewer scenarios (--nobuild_runfile_links, --nobuild_runfiles_manifest, --nolegacy_external_runfiles, etc), with a desire to maybe turn them off by default? Since they're only actually needed for executables you want to run locally, which is presumably the exception rather than the rule and could maybe be handled similar to --experimental_remote_download_outputs=toplevel. But that's speculative on my part.

Yeah. My plan was to set --nobuild_runfile_links when this all gets rolled out.

Op wo 6 jan. 2021 om 07:10 schreef 'Rupert Shuttleworth' via Remote
Execution APIs Working Group <remote-exe...@googlegroups.com>:
> - FUSE support on Macs is currently problematic, both because of the unknown future of kernel extensions from the Apple side, and because of the recently closed-source nature of osxfuse/macfuse. Maybe you don't care about this, but a lot of Bazel users do use Macs, so something to keep in mind...

Yes. Both Linux and macOS are important to me. Right now the file
system code I have in Buildbarn is coupled against go-fuse, but my
hope is that I could at some point add a layer in between, so that I
could expose the file system through other protocols as well. For
macOS it would make sense to implement a userspace NFSv4 server. One
prerequisite for that is that I'll need to write an XDR -> Go
compiler/generator first:

https://tools.ietf.org/html/rfc7863

> - Be careful that you don't slow down builds (especially incremental builds). The overhead of accessing files using FUSE may mean that smaller, incremental builds are slower, especially if they use local-only actions or dynamic execution. Maybe not a deal breaker, but something to keep in mind.

Yes. My plan here was to make proper use of ModifiedFileSet. The
BatchStat() RPC that I have also helps a lot here.

> - What happens if you do a clean build and then want to do an incremental build a few days later, or want to access some remote outputs a few days later? Will the remote outputs still be there to lazily download? Do you need an API to tell the remote cache to persist files / keep files alive while the local workspace exists?

I haven't implemented this part yet, but my idea here is to let
BatchStat() do FindMissingBlobs() calls under the hood (and
temporarily memoize the results). This means that the file system will
automatically start to hide files that have gone absent remotely from
Bazel's point of view.

> - If any remote files are downloaded, can they be shared/re-used locally across different Bazel builds / workspaces? (do you have some local cache of downloaded outputs?).

Yes! The FUSE daemon I'm writing is built on top of the same storage &
transport layer as used by Buildbarn. This means that the local cache
is just another instance of the disk-based cache that you can also use
for centralized storage, worker-level caches, office caches, etc.
It's all content addressed and independent of the contents of the
FUSE filesystem, meaning that its contents are shared between
workspaces. It'll even be retained if you run 'bazel clean'.

Be sure to reach out if you have any further questions. In the
meantime, I'm going to continue working on this. My plan is to send
out a PR for the protocol definition before the next monthly meeting
(which is already next week).

--
Ed Schouten <e...@nuxi.nl>

Eric Burnett

Jan 6, 2021, 10:22:49 AM
to Ed Schouten, Rupert Shuttleworth, Remote Execution APIs Working Group
> - What happens if you do a clean build and then want to do an incremental build a few days later, or want to access some remote outputs a few days later? Will the remote outputs still be there to lazily download? Do you need an API to tell the remote cache to persist files / keep files alive while the local workspace exists?

I haven't implemented this part yet, but my idea here is to let
BatchStat() do FindMissingBlobs() calls under the hood (and
temporarily memoize the results). This means that the file system will
automatically start to hide files that have gone absent remotely from
Bazel's point of view.

Might be better to couple this to StartBuild? For any outputs older than X, do FindMissingBlobs on them before responding to the StartBuild. That way clients can be fully informed upfront what (if anything) has changed vs its expectations. Otherwise they would only learn mid-build that expected files are missing, which I don't think they know how to handle gracefully (e.g. bazel doesn't have rewinding).

Assuming bazel doesn't itself call BatchStat on all files at the beginning of an incremental build, I guess. If it does, effectively equivalent?

Rupert: 
> Do you need an API to tell the remote cache to persist files / keep files alive while the local workspace exists?

The FindMissingBlobs call has "lifetime extension" semantics, so no API change is needed on the cache. If the FUSE daemon wants to call it periodically on the live file set, it could keep idle workspaces alive. ("Periodically" depends on the implementation in question, but how many hours/days to keep things alive could presumably be part of the daemon's configuration.)


Ed Schouten

Jan 6, 2021, 11:34:46 AM
to Eric Burnett, Rupert Shuttleworth, Remote Execution APIs Working Group
Hi Eric,

Op wo 6 jan. 2021 om 16:22 schreef Eric Burnett <ericb...@google.com>:
> Might be better to couple this to StartBuild? For any outputs older than X, do FindMissingBlobs on them before responding to the StartBuild. That way clients can be fully informed upfront what (if anything) has changed vs its expectations. Otherwise they would only learn mid-build that expected files are missing, which I don't think they know how to handle gracefully (e.g. bazel doesn't have rewinding).

Sounds like a good idea! I'll adjust the .proto documentation to
require that. In the meantime I've sent a PR against remote-apis to
get the remote output service protocol added:

https://github.com/bazelbuild/remote-apis/pull/184

--
Ed Schouten <e...@nuxi.nl>

Eric Burnett

Jan 6, 2021, 2:46:26 PM
to Ed Schouten, Rupert Shuttleworth, Remote Execution APIs Working Group
Sounds good! I'll add comments there as well. But just to set expectations, I wouldn't expect to land that PR right away - current best practice is to aim for two implementations of a given API before we ratify it, so I'd expect this PR to remain open through your prototyping at least.
 



Sander Striker

Jan 7, 2021, 7:12:00 AM
to Ed Schouten, Remote Execution APIs Working Group
Hi Ed,

First of all, thanks for this write up and all the experimentation.

On Mon, Dec 28, 2020 at 4:57 PM Ed Schouten <e...@nuxi.nl> wrote:

During the last remote execution monthly meeting Ed Baunton and/or Sander Striker asked how this protocol is different from BuildGrid's local_cas protocol. The answer to that is simple: there doesn't seem to be any overlap whatsoever.

I don't think that our point has come across.  I mentioned the LocalCAS protocol because in buildbox-casd we are already doing a subset of what you are looking to achieve.  Using LocalCAS as a basis and expanding on it seemed like a sensible approach to take.
 
The local_cas protocol is oblivious of builds; there is no scoping/context. Furthermore, it can only be used to stage trees based on Directory objects stored in the CAS, while we need to stage files, symlinks and Tree objects (contained in OutputFile, OutputSymlink and OutputDirectory messages). In summary, the local_cas protocol seems to be designed for use on workers, while the remote output service protocol that I've designed is for use on clients.

FWIW, it wasn't designed to be used on workers exclusively (iirc it was actually dreamt up in the early days of recc).  The LocalCAS.Capture calls can be used by clients as well to get content into the CAS and make it addressable as such.  The intent is to have recc use this, just as BuildStream already does.  BuildStream leverages buildbox-casd to stage content for the build via FUSE as well as to capture the output into CAS.  The buildbox-fuse backend doesn't have lazy loading at this time; that was dropped to give buildbox-casd exclusive control over on-disk storage.  However, that functionality can be re-introduced.

In other words, LocalCAS could definitely be expanded to fit more needs when the use cases are clear.  And while LocalCAS lives in the BuildGrid org, I think there would be support for moving it to remote-apis.  The only reason it isn't there already is that we were uncertain it fit, being a protocol that doesn't actually have to travel over the network.
 
My plan for now is to continue implementing this. Eventually I will send out PRs against Bazel to add support for this protocol. Furthermore, I will release the source code of my FUSE daemon (which is largely built on top of Buildbarn's frameworks) at github.com/buildbarn/bb-clientd.

Much appreciated.  I'll read up on the rest of the thread as well.

Thanks!

Cheers,

Sander
 

Jürg Billeter

Jan 12, 2021, 9:35:41 AM
to Ed Schouten, Remote Execution APIs Working Group, Sander Striker
Hi Ed,

On Mon, 2020-12-28 at 16:57 +0100, Ed Schouten wrote:
The gRPC protocol schema

The schema that my copy of Bazel uses to communicate with my FUSE daemon can for the time being be found in my fork of Bazel. As you can see, it's a relatively simple protocol. It depends on REv2, but not on anything specific to Bazel. My suspicion is that this protocol could also be used by other build clients, such as Pants. I'd love to hear from the maintainers of such tools whether they agree. Because of that, I would like to see if we could upstream this into the remote-apis repository, as opposed to keeping it in the Bazel tree. Thoughts on this?

During the last remote execution monthly meeting Ed Baunton and/or Sander Striker asked how this protocol is different from BuildGrid's local_cas protocol. The answer to that is simple: there doesn't seem to be any overlap whatsoever.

While there are clearly significant differences between your proposal and the LocalCAS protocol, there is an overlap. And, as Sander has mentioned, LocalCAS is not exclusive to workers, it's heavily used on the client side by BuildStream. It would be great if we could come up with a unified protocol that covers the use cases of Bazel output, BuildStream and workers.

The local_cas protocol is oblivious of builds; there is no scoping/context.

Correct, there is currently no scoping in LocalCAS except for the implicit scope of the `StageTree()` stream. I don't see an issue with the introduction of explicit scoping. My main concern would be around the lifetime of these scopes. For BuildStream and workers we'd want temporary scopes where cleanup would be desired on client crash. And cleanup of stale directories may also be worth discussing for the Bazel use case. Nit: I'd prefer a more generic name such as `Session` instead of `Build`, but that's not a real issue.

Furthermore, it can only be used to stage trees based on Directory objects stored in the CAS, while we need to stage files, symlinks and Tree objects (contained in OutputFile, OutputSymlink and OutputDirectory messages).

Ideally, we could eliminate the difference between `Directory` and `Tree` as suggested in a comment in issue #140, however, that probably won't happen until REAPI v3 (if at all).

Do you see an issue of simply supporting both in the protocol? I.e. we could add a `repeated DirectoryNode` field to the `BatchCreateRequest` message such that clients could use the most suitable variant. I'd like to avoid being forced to create a `Tree` structure as this could be relatively expensive for large hierarchies.

As far as I can tell, a major aspect that your proposal is missing compared to LocalCAS is recursive `CaptureTree`. `BatchStat` supports returning the digest of files, however, it doesn't support returning the digest of complete directories. Do you see an issue with such an extension? This allows optimizations for local daemons with a cache of CAS objects.
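
For illustration only, reusing the message names sketched earlier in this thread (field numbers and the surrounding message shapes are placeholders), the two extensions could look roughly like this:

message BatchCreateRequest {
  // ... existing fields for files, symlinks and Tree-backed directories ...

  // Proposed addition: directories described by plain CAS Directory
  // objects, so that clients are not forced to construct a Tree
  // message for large hierarchies.
  repeated build.bazel.remote.execution.v2.DirectoryNode directories = 5;
}

message StatResult {
  oneof result {
    // ... existing file digest / partially resolved path cases ...

    // Proposed addition: the digest of a complete directory, captured
    // recursively (comparable to CaptureTree in LocalCAS).
    build.bazel.remote.execution.v2.Digest directory_digest = 3;
  }
}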

The LocalCAS protocol also includes Fetch and Upload methods. I think we can ignore them for this discussion as they are not tightly coupled with the Stage/Capture use case.

There are certainly further details to discuss but it may make sense to first see whether we can come to an agreement to consolidate these protocols (except Fetch/Upload).

Best regards,
Jürg

Jürg Billeter

Feb 9, 2021, 9:24:29 AM
to remote-exe...@googlegroups.com
Hi everyone,

The LocalCAS API is a local gRPC API implemented by buildbox-casd to
provide CAS access via the local filesystem API. It is currently used
by buildbox-run (on clients and RE workers) and BuildStream (on
clients). Based on the recent discussion about standardizing a local
gRPC API for Bazel's output directory, I've been working on
improvements to the LocalCAS API in an attempt to cover Bazel's use
case as well.

The new/updated methods most relevant to this discussion are
OpenDirectory, CloseDirectory, CleanDirectory, Stage and Stat.
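
In outline (the request/response message names below are
placeholders; the draft linked below has the actual definitions):

service LocalContentAddressableStorage {
  // ... existing fetch/upload/capture methods ...

  rpc OpenDirectory(OpenDirectoryRequest) returns (OpenDirectoryResponse);
  rpc CloseDirectory(CloseDirectoryRequest) returns (CloseDirectoryResponse);
  rpc CleanDirectory(CleanDirectoryRequest) returns (CleanDirectoryResponse);
  rpc Stage(StageRequest) returns (StageResponse);
  rpc Stat(StatRequest) returns (StatResponse);
}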

This is still a work in progress; however, I'd like to invite other
members of the REAPI community to join this effort to create a standard
API that covers a variety of use cases. An early draft:
https://docs.google.com/document/d/1SYialcjncU-hEWMoaxvEoNHZezUf5jG9DR-i67ZhWVw/edit#

Cheers,
Jürg
