Does Bazel remote cache fetch files based the SHA-256 of the file content alone?


duane....@gmail.com

unread,
Mar 10, 2018, 12:08:47 PM3/10/18
to bazel-discuss
I thought I read somewhere that, during the build process, Bazel hashes file contents but also includes other things like timestamps (I forget exactly what).

Does that apply to the remote cache as well? I'd like to fetch files myself out of the cache, so I'd need to know how to create the same hash. Would that just be a SHA-256 of the file contents alone? Or do I need to also hash a timestamp and other things to get the correct hash?

robin...@improbable.io

unread,
Mar 12, 2018, 1:34:37 PM3/12/18
to bazel-discuss
"The cache key consists of the env variables, the command line, and the (relative paths of the) input files".
See: https://github.com/bazelbuild/bazel/issues/2574

With the caveat that, from what my testing shows, the active_env is /not/ consumed by all rules which means that the env variables aren't always included.

Hope this helps!
Robin
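To make the quoted description concrete, here is a toy sketch of how such a key could be derived from env variables, the command line, and the (relative paths and digests of the) input files. This is an illustration of the idea only, not Bazel's actual key format or digest scheme:

```python
import hashlib

def action_key(env: dict, cmdline: list, inputs: dict) -> str:
    """Toy action-cache key: one SHA-256 over the env vars, the
    command line, and (relative path, content digest) per input.
    `inputs` maps relative path -> file content bytes."""
    h = hashlib.sha256()
    for name in sorted(env):
        h.update(f"env:{name}={env[name]}\n".encode())
    for arg in cmdline:
        h.update(f"arg:{arg}\n".encode())
    for path in sorted(inputs):
        digest = hashlib.sha256(inputs[path]).hexdigest()
        h.update(f"input:{path}:{digest}\n".encode())
    return h.hexdigest()
```

The point of the sketch: changing any input's *content* (not just its path) changes the key, which is why a timestamp never needs to be part of it.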

Benjamin Elder

unread,
Mar 12, 2018, 1:49:25 PM3/12/18
to robin...@improbable.io, bazel-discuss
> With the caveat that, from what my testing shows, the active_env is /not/ consumed by all rules which means that the env variables aren't always included.

I've also noticed this, there is an issue at: https://github.com/bazelbuild/bazel/issues/3320



--
You received this message because you are subscribed to the Google Groups "bazel-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bazel-discus...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bazel-discuss/37e4f5ee-e1a4-4267-b94a-80a406165bef%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

duane....@gmail.com

unread,
Mar 12, 2018, 2:09:58 PM3/12/18
to bazel-discuss
Either I had a fundamental misunderstanding of how this works, or my question was worded poorly.

I thought that once Bazel builds something, it hashes the resulting binary (or whatever) and stores it in the remote cache under that hash, not the hashes of the inputs it used to create the binary. Is that wrong?

John Cater

unread,
Mar 12, 2018, 2:12:02 PM3/12/18
to duane....@gmail.com, bazel-discuss
That's incorrect. In fact, it can't work: if the hash is based on the output, Bazel would need to re-create the output to get the hash, to see if the output is in the cache.

By hashing the action inputs, we can efficiently check whether the result of an action is in the cache, without actually re-executing the action.



duane....@gmail.com

unread,
Mar 12, 2018, 2:20:10 PM3/12/18
to bazel-discuss
I figured you guys somehow also stored that a certain input hash produced a given output hash.


Maybe that was wishful thinking on my part. I was hoping to write a FUSE filesystem that retrieved files (both source and binaries) via hash. I figured I could hijack the Bazel remote cache for the source files right alongside the binaries. For that I was assuming the hash was a hash of the content itself.

Is there a way for me to get the hash from a given binary? That way I can remember that hash and retrieve the content later with it?

Benjamin Elder

unread,
Mar 12, 2018, 2:26:21 PM3/12/18
to duane....@gmail.com, bazel-discuss
Sorry, I meant to also reply to this above: my understanding is that it's *both*.

There's a hash of the action mapping to some action metadata (the "Action Cache"), and then there's also, at a different location, the "CAS", which stores content-addressed outputs [1].
I haven't tried to get anything out of the cache without using Bazel, but for the HTTP remote cache you should be able to fetch results via their hash.
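A minimal sketch of that fetch, assuming the `/cas/<hash>` URL layout of Bazel's HTTP remote cache (the cache host below is hypothetical):

```python
import hashlib
import urllib.request

def content_digest(data: bytes) -> str:
    # CAS entries are addressed by the SHA-256 of the file content alone.
    return hashlib.sha256(data).hexdigest()

def cas_url(cache_host: str, digest: str) -> str:
    # URL layout used by Bazel's HTTP remote cache backends.
    return f"{cache_host}/cas/{digest}"

def fetch_blob(cache_host: str, digest: str) -> bytes:
    # GET the blob back out of the cache by its content digest.
    with urllib.request.urlopen(cas_url(cache_host, digest)) as resp:
        return resp.read()
```

Usage would be e.g. `fetch_blob("http://cache.example.com:8080", digest)`, where the host is whatever you pass to Bazel's remote-cache flag.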



duane....@gmail.com

unread,
Mar 12, 2018, 2:44:26 PM3/12/18
to bazel-discuss
So I was thinking that the FUSE would store a list of hashes and retrieve files from the remote cache by hash on demand. If somebody modifies a file, it would store a local copy of it. However, if somebody wrote a file and the hash for that file is already in the remote cache, I wanted to delete the local copy and just store it as a hash again (for example, if they check out another branch). That means that I will need to be able to hash the content (without knowing how it was created) and use an HTTP HEAD to see if that hash already exists. But if the hashes are totally different, then I can't do it that way.

Benjamin Elder

unread,
Mar 12, 2018, 2:55:42 PM3/12/18
to duane....@gmail.com, bazel-discuss
> That means that I will need to be able to hash the content (without knowing how it was created) and use an HTTP HEAD to see if that hash already exists.

This should work by hitting <cache-host>/cas/<hash> but your caching backend would need to support HEAD.

I'm not sure about the viability of the overall plan however.
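A sketch of that existence check: hash the local content, issue a HEAD against `/cas/<hash>`, and only treat the local copy as reclaimable on a 2xx. The cache host is hypothetical, and as noted above the backend must actually answer HEAD:

```python
import hashlib
import urllib.error
import urllib.request

def local_digest(data: bytes) -> str:
    # Same content-only SHA-256 that addresses CAS entries.
    return hashlib.sha256(data).hexdigest()

def blob_exists(cache_host: str, data: bytes) -> bool:
    """HEAD <cache-host>/cas/<sha256(data)> and report whether it's there."""
    url = f"{cache_host}/cas/{local_digest(data)}"
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req) as resp:
            return 200 <= resp.status < 300
    except urllib.error.HTTPError:
        # 404 or other error: not safe to delete the local copy.
        return False
```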




Jakob Buchgraber

unread,
Mar 12, 2018, 3:01:59 PM3/12/18
to duane....@gmail.com, bazel-discuss


This generally sounds like it would work for output artifacts, but what is it that you are trying to achieve? For example, files created locally are likely to be used again soon as inputs to other actions, so I am not sure deleting them after creation, even if they already exist in the remote cache, is a good idea.

Best,
Jakob
 
--

Jakob Buchgraber

Software Engineer

Google Germany GmbH


duane....@gmail.com

unread,
Mar 12, 2018, 3:03:51 PM3/12/18
to bazel-discuss
I'd appreciate any comments on the plan. What do you see as an issue? Speed? I was concerned about that myself.

duane....@gmail.com

unread,
Mar 12, 2018, 3:11:55 PM3/12/18
to bazel-discuss
I should have said that it would reserve the right to delete the local copy. If I had no way to hash the content and compare it to the cache, then I would have no way of doing it ever. You are right that it would be faster to keep files local that get accessed often.

I was thinking mostly of source files (I was hoping to use the cache for both source and binaries). If a user checked out one branch and then switched to another branch and so on, I didn't want the number of local files to necessarily grow in perpetuity and never get "reclaimed".

Jakob Buchgraber

unread,
Mar 12, 2018, 3:13:53 PM3/12/18
to duane....@gmail.com, bazel-discuss
Hi Duane,

The remote cache doesn't contain source files at all. I am not sure I can help you without knowing what you are trying to build and what goals you are trying to achieve - a bit more context would be useful :-).

Best,
Jakob



duane....@gmail.com

unread,
Mar 13, 2018, 12:11:45 PM3/13/18
to bazel-discuss
I messed up and didn't hit reply-all on my last two responses, so nobody else saw them. I cut and pasted the exchange here for everybody to see:


I looked into GVFS and the problem with that is that it is currently only available for Windows, and my project does most of our development in Linux. Microsoft says that they will support Mac and then Linux at some point, but there is no timeframe for that.

I was thinking of using the remote cache to provide file content for our FUSE, but it sounds like you are encouraging me to not use remote caching at all and to use a FUSE instead. How would that work? Wouldn't Bazel, not knowing the files are on FUSE, access every file to see what has been modified? If so, wouldn't that be slower than using a native FS?

Another problem I have is that even if I were to get Bazel seamlessly integrated with my FUSE, there are other apps we use such as git. If it takes an hour for git to clone/checkout the repo (as it tries to write a gazillion files to my FUSE), that would defeat half the purpose of a mono-repository.

So I was thinking of writing the FUSE in a way where it tracks modified files and provides that info to anybody who wants it. Any app, including Bazel and Git, could then use that information to optimize its tasks (if it had some future --useFuse flag activated).

(BTW, I'm aware of inotify, but I'm not sure how resource intensive that is if events were registered for so many files. Maybe I should look into this more.)

The way I was thinking of doing it was exposing a virtual .fuse directory within my file system. Within that there would be a changeLog directory. Git would create an empty "git.log" file in that directory after every commit, and Bazel would create an empty "bazel.log" file after every build (any app could do the same). Every time a file was modified, the FUSE would append the path of the modified file to every log file within that directory. Any app that wants can then know what files changed since the time it created its file.
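The log protocol described above can be sketched with plain files, outside of FUSE; the directory layout and log names below are just the ones proposed in this post, nothing standard:

```python
import os
import tempfile

class ChangeLogDir:
    """Toy model of the proposed .fuse/changeLog protocol."""

    def __init__(self, root: str):
        self.root = root

    def start_log(self, app: str) -> None:
        # An app (git, bazel, ...) creates an empty <app>.log to mark "now".
        open(os.path.join(self.root, f"{app}.log"), "w").close()

    def file_modified(self, path: str) -> None:
        # The FUSE layer appends the modified path to every app's log.
        for name in os.listdir(self.root):
            if name.endswith(".log"):
                with open(os.path.join(self.root, name), "a") as f:
                    f.write(path + "\n")

    def changes_since(self, app: str) -> list:
        # Everything modified since the app last created its log.
        with open(os.path.join(self.root, f"{app}.log")) as f:
            return f.read().splitlines()
```

Each app only sees modifications made after its own marker, which is the property git and Bazel would rely on to skip rescanning the whole tree.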

Ideally (maybe fantasyland) I would convince my company to allow me to open source the whole thing and a bunch of apps would adopt a --useFuse flag and become "fuse-aware" on their own. But what will probably happen is that I will merely modify these apps myself and try to submit patches.

Obviously, before I start any of this, I need to make sure the idea is sound and not idiotic. I expected a FUSE implementation for this would exist already, but GVFS is the only one I can find, and we need Linux...

On Mar 13, 2018 2:06 AM, "Jakob Buchgraber" <buc...@google.com> wrote:
Hi Duane,

Thanks for the details. Here's a link to how Google does this [1]. There's also [2], about which I have heard good things. I know of Bazel users who implemented their own FUSE for remote caching. The big advantage of using a FUSE file system instead of remote caching via HTTP is that Bazel will only fetch what it needs, lazily. We have been thinking about baking this lazy-fetch behavior into Bazel's filesystem abstraction, and I think it will happen in Bazel eventually as it can lead to huge performance gains for remote execution, but it's just not a priority at the moment.

I am happy to answer any specific questions about remote caching.

Best,
Jakob

[1] https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext
[2] https://github.com/Microsoft/GVFS

On Mon, Mar 12, 2018 at 8:37 PM Duane Knesek <duane....@gmail.com> wrote:
Okay... I appreciate your time...

In short, I'm investigating what it would take to move my team to a mono-repository. I was thinking of prototyping a FUSE implementation that allows us to work in a large working directory without using a ton of space: most "files" would be virtual, and only the ones in common use would be stored locally.

I wanted to hijack the remote cache to store all files (including source), not just binaries produced by Bazel. I also wanted my FUSE to provide info on what files have been changed, and possibly modify git & Bazel to use that information to speed up commits, builds, and so on, rather than reading every file and hashing it every time.

Is that enough context? I could provide more if you like.
