FR: alternative repository_cache drivers (--gcs_repository_cache)


o...@wix.com

Aug 29, 2018, 3:43:55 AM
to bazel-discuss
We love the --repository_cache flag, as it saves a lot of time that would otherwise be wasted re-downloading third-party Maven jars we have already fetched.
The problem is that when running on stateless build servers (like GCB) we lose the contents of that folder.

Currently that option only works with a local path on disk.

Do you think it could easily be implemented with a different driver? Say, a Google Cloud Storage bucket?

Given that Bazel already has the authorization protocol built in (--google_credentials) and GCS has the required read/write API, it should not be too hard to do, should it?
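For illustration only, a GCS-backed get/put keyed by content hash could be quite small with the google-cloud-storage Java client. The class and bucket below are made up, and this is just a sketch of the idea, not an existing Bazel feature:

```java
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageException;
import com.google.cloud.storage.StorageOptions;

// Hypothetical GCS-backed repository cache: objects are stored under their content hash.
final class GcsRepositoryCache {
  private final Storage storage = StorageOptions.getDefaultInstance().getService();
  private final String bucket;  // e.g. "my-bazel-repo-cache" (made-up name)

  GcsRepositoryCache(String bucket) {
    this.bucket = bucket;
  }

  /** Stores a downloaded artifact under its content hash. */
  void put(String sha256, byte[] contents) {
    storage.create(BlobInfo.newBuilder(BlobId.of(bucket, sha256)).build(), contents);
  }

  /** Returns the cached artifact for the given hash, or null on a cache miss. */
  byte[] get(String sha256) {
    try {
      return storage.readAllBytes(BlobId.of(bucket, sha256));
    } catch (StorageException e) {
      return null;  // treat a missing object as a cache miss
    }
  }
}
```

The client above picks up application default credentials from the environment; a native implementation in Bazel would presumably reuse the credentials already configured via --google_credentials.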

This could have a major impact on builds on GCB (or any other stateless build flow).

The bucket, just like the local folder, would need to be maintained and periodically cleaned.

WDYT?

markus....@ecosia.org

Aug 29, 2018, 5:07:07 AM
to bazel-discuss
Yep, something like that would be great. Similar to how we have disk and remote caching for actions, it would make sense to have that for external repositories as well. There is a way around this; it might cost more time than direct support in Bazel, but it should still save time overall: you can just tar up the repository_cache directory and upload it to a GCS bucket. That is what CircleCI does under the hood if one uses their caching mechanisms (well, they use S3). Furthermore, you could checksum your WORKSPACE file and all other files that specify external repositories, and only do the upload when any of these files has changed.
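(Purely as a sketch of the cache-key part of this workaround: something along these lines could compute a combined digest of the repository-defining files, which can then name the tarball object in the bucket, so the upload is skipped if an object with that name already exists. The class is hypothetical, not part of Bazel.)

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

final class RepoCacheKey {
  /** Combined SHA-256 over the files that define external repositories (WORKSPACE etc.). */
  static String of(List<Path> repoDefiningFiles) throws IOException, NoSuchAlgorithmException {
    MessageDigest digest = MessageDigest.getInstance("SHA-256");
    for (Path file : repoDefiningFiles) {
      digest.update(Files.readAllBytes(file));
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : digest.digest()) {
      hex.append(String.format("%02x", b & 0xff));  // lower-case hex, two digits per byte
    }
    return hex.toString();  // e.g. used to name gs://<bucket>/repo-cache-<key>.tar
  }
}
```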

o...@wix.com

Aug 29, 2018, 5:14:23 AM
to bazel-discuss
Yes... that option also appears in GCB (https://cloud.google.com/cloud-build/docs/speeding-up-builds#caching_directories_with_google_cloud_storage)

But we must agree that copying the entire folder in and out on every build adds a lot of overhead.
It also means maintaining different folders per WORKSPACE file SHA-1 (we're using a virtual mono-repo, so our WORKSPACE file changes on the fly all the time).

But thanks for the comment; we might go in that direction anyway.

markus....@ecosia.org

Aug 29, 2018, 5:23:26 AM
to bazel-discuss
Yes, I agree, a native implementation would be much preferable to this workaround. Another way toward a simple first implementation could be for Bazel to just upload to a remote server without any specific authentication etc., and then one could set up a local proxy to handle the actual upload, similar to how https://github.com/Asana/bazels3cache and https://github.com/notnoopci/bazel-remote-proxy currently work with the remote action cache. That way the initial implementation would be backend-agnostic.

Klaus Aehlig

Aug 29, 2018, 5:30:11 AM
to o...@wix.com, bazel-discuss
On Wed, Aug 29, 2018 at 02:14:23AM -0700, ors via bazel-discuss wrote:
> Yes... that option also appears in GCB
> (https://cloud.google.com/cloud-build/docs/speeding-up-builds#caching_directories_with_google_cloud_storage)
>
> But we must agree that copying the entire folder in and out on every build
> adds a lot of overhead.
> It also means maintaining different folders per WORKSPACE file SHA-1 (we're
> using a virtual mono-repo, so our WORKSPACE file changes on the fly all the
> time).

The default location of the repository_cache is independent of the WORKSPACE
(it is "cache/repos/v1" in the output user root, which defaults to
"~/.cache/bazel/_bazel_${USER}").

Also, you might have a look at the --distdir option, which lets you specify
additional places to look for files.

But, I do see the need for the general feature request of looking up distfiles
outside mounted file systems.

Thanks,
Klaus

--
Klaus Aehlig
Google Germany GmbH, Erika-Mann-Str. 33, 80636 Muenchen

Philipp Wollermann

Aug 29, 2018, 10:01:11 AM
to Klaus Aehlig, Or Shachar, bazel-discuss
Klaus, do you think the current disk-based repository cache would work when accessed from multiple processes?

I wonder if we could use a network filesystem or something like GCS FUSE and share it among all VMs.



--
Philipp Wollermann
Software Engineer
phi...@google.com

Google Germany GmbH
Erika-Mann-Straße 33
80636 München


Klaus Aehlig

Aug 29, 2018, 10:32:33 AM
to Philipp Wollermann, Or Shachar, bazel-discuss
> Klaus, do you think the current disk-based repository cache would work when
> accessed from multiple processes?

The read-only distdir would of course work.

For the repository cache, I would say that at the moment it has a write-read
conflict that might end in readers erroring out, but not in wrong builds:
if one process is looking at the cache while another process is adding
a file to the cache, the first process might see a file in the cache that
(as the file is not yet completely written) has the wrong hash, and so it
errors out.

This can easily be fixed (and I think we should do it) by replacing the
FileSystemUtils.copyFile() calls in lines 149 and 174 of RepositoryCache.java
with the usual copy-fsync-rename dance.

If we use storage systems other than a POSIX file system, we would again
need to take the appropriate measures to ensure that the file appears
atomically in its final destination (and not through an intermediate state
with incomplete contents).
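
(For concreteness, a minimal sketch of that copy-fsync-rename dance with java.nio.file; this is illustrative only, not the actual RepositoryCache code.)

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

final class AtomicCacheCopy {
  /**
   * Copies src to dst so that dst only ever becomes visible with complete
   * contents: copy to a temp file in the same directory, fsync it, then
   * rename it into place (rename within one directory is atomic on POSIX).
   */
  static void copyAtomically(Path src, Path dst) throws IOException {
    Path tmp = Files.createTempFile(dst.getParent(), dst.getFileName().toString(), ".tmp");
    try {
      Files.copy(src, tmp, StandardCopyOption.REPLACE_EXISTING);
      // Force the temp file's contents to disk before making it visible under dst.
      try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
        ch.force(true);
      }
      Files.move(tmp, dst, StandardCopyOption.ATOMIC_MOVE);
    } finally {
      Files.deleteIfExists(tmp);  // no-op if the move already succeeded
    }
  }
}
```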

Tyler Rockwood

Jan 18, 2022, 11:39:11 AM
to bazel-discuss
Sorry for reviving an old thread. I'm interested in being able to use GCS for --repository_cache; is adding `http://` or `https://` prefixes to the --repository_cache flag still a valid approach? It's unclear to me whether something like bzlmod is going to make this flag obsolete.

Cheers,

-Tyler