Single output_base for multiple workspaces - is it really a bad idea?


Konstantin

Feb 2, 2021, 11:47:47 PM
to bazel-discuss
With Git, devs usually have a single folder designated as the local copy of the remote repo and switch between branches in that folder as necessary.
In my company, a large number of devs use Perforce for source control and often keep multiple versions of the sources in different locations on their local file systems. There are reasons for that behavior, but they are out of scope here. The fact of life is that many devs may have, say, 5 copies of the sources in different folders.

I work on converting our build to Bazel. Bazel has a machine-wide repository cache for all the archives it downloads from the network, and this is good. Unfortunately, for reasons I don't understand, those archives are unpacked into subfolders of output_base/external. By default, output_base is derived from a hash of the path to the workspace folder, so when devs have many copies of the sources and build from all those locations, they end up with multiple copies of the unarchived external packages. Now imagine that the external packages include toolchains (compilers, libs, tools) and some huge external libraries, and you can easily see how people end up with many (like 5) copies of exactly identical external data (like 50 GB each).
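To illustrate why each checkout gets its own external tree, here is a small sketch of how the default output_base is derived. The paths and user name are made up; the scheme (an MD5 hash of the workspace path, placed under the output user root) follows Bazel's documented default output directory layout:

```python
import hashlib

def default_output_base(output_user_root: str, workspace: str) -> str:
    """Sketch of Bazel's default output_base: an MD5 hash of the
    workspace path, placed under the output user root."""
    digest = hashlib.md5(workspace.encode("utf-8")).hexdigest()
    return f"{output_user_root}/{digest}"

# Hypothetical Linux layout: ~/.cache/bazel/_bazel_<user>/<md5-of-workspace>
root = "/home/dev/.cache/bazel/_bazel_dev"

# Two checkouts of byte-identical sources get unrelated output bases,
# so every external repository is unpacked once per checkout.
a = default_output_base(root, "/home/dev/src/copy1")
b = default_output_base(root, "/home/dev/src/copy2")
print(a)
print(b)
assert a != b
```

Different workspace paths hash to different directories, so the 50 GB of unpacked externals is duplicated per copy.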

Consequently, I have been asked to somehow share the toolchains and other big external packages, in their unpacked form, between multiple workspaces.

One idea that comes to mind is that --output_base allows setting the output_base location explicitly. I have seen this feature used to run more than one build from a single workspace by giving different builds different output_bases. I wonder what happens if I do the opposite and point multiple workspaces at the same folder as their output_base. The motivation is to make all those workspaces share what is expanded into the output_base/external folder.
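For concreteness, a sketch of what that would look like. The shared path is hypothetical; --output_base is a startup option, so it would go in each workspace's .bazelrc (or before the command name on the command line):

```
startup --output_base=/big-disk/shared-bazel-output-base
```

On the command line the equivalent would be `bazel --output_base=/big-disk/shared-bazel-output-base build //...`.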

I have never seen anybody do this and suspect it may have some nasty (or even catastrophic) consequences, but... maybe it is somehow OK?

If not, then what other ideas could I try to de-duplicate unpacked externals between several workspaces?

Thank you!
Konstantin

Philipp Schrader

Feb 4, 2021, 12:16:58 PM
to bazel-discuss
There is --experimental_repository_cache_hardlinks, which may help a little. I haven't done any experiments to see exactly what it does, but it sounds like it could address at least part of your concern? Apologies if I misunderstood.
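If the flag does what its name suggests, it would presumably be combined with an explicit repository cache shared by all workspaces. A sketch of a user-wide bazelrc (the cache path is hypothetical):

```
build --repository_cache=/home/dev/.cache/bazel-repo-cache
build --experimental_repository_cache_hardlinks
```

Note that hardlinks can only be created within a single filesystem, so the repository cache and the output bases would have to live on the same volume for this to help.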
Phil

Konstantin

Feb 4, 2021, 4:07:40 PM
to bazel-discuss
Phil, thank you for the hint! Indeed, the volume of Bazel command-line switches is overwhelming, and it is often hard to find the one you need even if you know it is there.

I am looking at --experimental_repository_cache_hardlinks and so far don't understand what it does. The description reads: "If set, the repository cache will hardlink the file in case of a cache hit, rather than copying. This is intended to save disk space." Unfortunately, that does not help much. In my experience, external stuff (say, an http_archive) gets downloaded to the repository cache and then unarchived into a subfolder of output_base/external. I don't see repository cache entries copied anywhere, and why would they be? My attempts to read the source code in BazelRepositoryModule did not help much. Mystery! I will keep digging around, though.
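As I understand it, the repository cache itself is already content-addressed and shared machine-wide; only the extracted trees under output_base/external are duplicated. A sketch of the cache layout as I understand it (the root path is hypothetical, and the layout details are my reading, not authoritative):

```python
import hashlib

def repo_cache_entry(cache_root: str, archive_bytes: bytes) -> str:
    """Sketch of the content-addressed repository cache layout:
    one directory per download, keyed by the archive's SHA-256,
    holding a single file named 'file'."""
    digest = hashlib.sha256(archive_bytes).hexdigest()
    return f"{cache_root}/content_addressable/sha256/{digest}/file"

path = repo_cache_entry("/home/dev/.cache/bazel-repo-cache", b"fake archive bytes")
print(path)
```

So even with a warm cache, each workspace still pays the full cost of its own extracted copy, which is the duplication I am trying to avoid.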

Michael Hordijk

Feb 8, 2021, 4:59:25 PM
to bazel-discuss
On Tuesday, February 2, 2021 at 9:47:47 PM UTC-7 kon...@ermank.com wrote:
With Git, devs usually have a single folder designated as the local copy of the remote repo and switch between branches in that folder as necessary.
In my company, a large number of devs use Perforce for source control and often keep multiple versions of the sources in different locations on their local file systems. There are reasons for that behavior, but they are out of scope here. The fact of life is that many devs may have, say, 5 copies of the sources in different folders.

Generally, Bazel puts everything it needs in the output_base. That makes it easy to remove all the artifacts related to a given workspace instance (e.g. bazel clean --expunge).

However, the Perforce-style workflow you describe (which I used quite a bit when I had to use Perforce) is not limited to Perforce. I use git worktree for a similar workflow: when I have a long-running development cycle on one branch, I may want to work on another branch simultaneously. You could do this without git worktree, but git worktree lets a single Git repository back multiple working trees simultaneously.

Bazel isn't currently optimized for this workflow, as each workspace has a dedicated outputBase. A large portion of each outputBase (e.g. action_cache, external) could reasonably be shared between outputBases. I think the content-addressed nature of the artifacts could lend itself to sharing them.

The big hurdle would be managing those shared caches. That is, how do you safely determine when to clean up stale action cache entries? How would a bazel clean know what to clean? Still, for a growing subset of users, I think the fallout of manually trimming an action cache shared across multiple workspaces would be manageable. The space savings would be significant for my workflow.

I'm guessing it's slightly more complicated than creating a single action cache, similar to how --disk_cache doesn't work with remote execution (https://github.com/bazelbuild/bazel/pull/10233), but I for one would leverage an action_cache shared across outputBases quite a bit.

-michael 

Konstantin

Feb 8, 2021, 5:16:44 PM
to bazel-discuss
-- how do you safely determine when to clean up stale action cache entries?

At this time, Bazel does not offer any help in cleaning the caches anyway. They just grow indefinitely until the user manually wipes the whole thing and lets the next build regenerate it. So I don't consider it an issue; at least it is not a new issue.

And a shared action cache is an advanced topic anyway. I would be happy just to get rid of the "external" duplication. But can I point different workspaces at the same single output_base and expect everything to keep working (apart from the ever-growing caches)?
