How to debug remote cache misses

Reuben D Netto

May 15, 2018, 3:38:58 AM
to bazel-discuss
Hi,
I'm investigating using Bazel at work, as remote caching holds a lot of promise for speeding up our dev loop. Unfortunately, when I build the same target on two different computers, its cache key differs, and I can't figure out why. Both computers are running the same versions of Bazel, Java and OS X (10.13.4), have the same PATH variables, and are running under the same username. bazel info shows the same values on both for everything but PID and heap/GC info. When I run bazel dump --action_graph :atlassian-core --action_graph:include_cmdline=true and view the output using protoc --decode_raw, I even see that the action keys for the genrules match up.

The command I'm using to build it is:
bazel build --experimental_local_disk_cache --experimental_local_disk_cache_path=/tmp/bazel_cache --experimental_strict_action_env --action_env=JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_172.jdk/Contents/Home :atlassian-core
Following the build, the filenames in the cache directory differ.

(Note that I'm using a local disk cache instead of the HTTP cache for testing, since it's easier to flush.)
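Comparing the two cache directories by hand gets tedious, so here is a minimal sketch of how the comparison could be scripted once both directories have been copied onto one machine. The function names `list_entries` and `diff_cache_entries` are made up for illustration, and the sketch assumes nothing about the cache layout beyond "each entry is a file".

```python
import os


def list_entries(cache_dir):
    """Return the set of file paths under cache_dir, relative to cache_dir."""
    entries = set()
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            full = os.path.join(root, name)
            entries.add(os.path.relpath(full, cache_dir))
    return entries


def diff_cache_entries(dir_a, dir_b):
    """Return (only_in_a, only_in_b, in_both) as sorted lists of entry names."""
    a = list_entries(dir_a)
    b = list_entries(dir_b)
    return sorted(a - b), sorted(b - a), sorted(a & b)


# Hypothetical usage, with one cache copied from each machine:
#   only_a, only_b, both = diff_cache_entries("/tmp/cache_mac1", "/tmp/cache_mac2")
#   print(len(both), "shared;", len(only_a), "only on A;", len(only_b), "only on B")
```

Entries that appear on only one side are the ones whose keys diverged. That narrows down how many actions are affected, though not why.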

When I run the same command on two Linux boxes (using Nix to provide suitable values for PATH and JAVA_HOME), the hash of the entry in the caches is the same. (The output artefacts still differ, but that's a result of our rules including timestamps in them and won't affect our hit rate.)

Bazel version is 0.13.0-homebrew on OS X (installed via brew), and 0.13.0 on Linux (installed via Nix).

I'd really appreciate it if anyone has any suggestions on how to figure out why the cache keys are different, as I'm completely at a loss.

Kind regards,
Reuben

Ian O'Connell

May 15, 2018, 7:53:04 AM
to Reuben D Netto, bazel-discuss
One thing we've run into is that the system JDKs can lead to things not being reproducible -- https://github.com/bazelbuild/bazel/issues/4769

If you use the embedded JDK inside Bazel instead, it's been stable.

--
You received this message because you are subscribed to the Google Groups "bazel-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bazel-discuss+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bazel-discuss/4b022dcd-9909-4c90-a567-d6739412d159%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Greg Estren

May 15, 2018, 8:26:50 AM
to Ian O'Connell, Reuben D Netto, bazel-discuss
You can also see some back and forth at https://github.com/bazelbuild/bazel/issues/4714 on investigating cache misses.

Reuben D'Netto

May 16, 2018, 12:02:21 AM
to bazel-discuss
> One thing we've run into is that the system JDKs can lead to things not being reproducible -- https://github.com/bazelbuild/bazel/issues/4769

Thanks for the link. The JDKs appeared to be completely identical when hashed though, so that wasn't the problem.

> You can also see some back and forth at https://github.com/bazelbuild/bazel/issues/4714 on investigating cache misses.

Thanks for that - although that didn't lead me to the answer, it's incredibly relevant to me and definitely something I'll be keeping an eye on.

For the wisdom of the ancients: I was able to identify the differences using the custom version of Bazel linked from this thread (https://github.com/werkt/bazel.git:remote-cache-logging). It turns out it was user error. One of my globs was including files under the target directory (we don't use completely out-of-tree builds for legacy reasons), and since one of my checkouts wasn't clean, the act of building it was causing its cache key to change. *facepalm*
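For anyone who hits the same thing, a quick check for stray inputs like these is to hash the package directory in both checkouts and list the paths that differ. This is only a rough sketch and not something Bazel provides: `hash_tree` and `diff_trees` are names invented for illustration, and the sketch walks every regular file rather than evaluating the actual glob patterns.

```python
import hashlib
import os


def hash_tree(root):
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            if not os.path.isfile(full):
                continue  # skip broken symlinks and other non-regular entries
            with open(full, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            digests[os.path.relpath(full, root)] = digest
    return digests


def diff_trees(root_a, root_b):
    """Return sorted relative paths that are missing on one side or differ."""
    a = hash_tree(root_a)
    b = hash_tree(root_b)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))


# Hypothetical usage: any path under an output directory (such as target/)
# that shows up here is a candidate for a glob exclude.
#   for path in diff_trees("/checkout_a/pkg", "/checkout_b/pkg"):
#       print(path)
```

If a path under the build output directory shows up in the result, a glob in that package may be picking it up as an input.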

Thanks again for the help.

George Gensure

May 16, 2018, 12:46:23 AM
to Reuben D'Netto, bazel-discuss
Glad that helped. The Bazel team has released a combination of tools that accomplish similar things and track their releases a little better:

  --experimental_remote_grpc_log (a string; default: "")
    If specified, a path to a file to log gRPC call related details. This log
    consists of a sequence of serialized com.google.devtools.build.lib.remote.
    logging.RemoteExecutionLog.LogEntry protobufs with each message prefixed by
    a varint denoting the size of the following serialized protobuf message, as
    performed by the method LogEntry.writeDelimitedTo(OutputStream).

These logs can be read with https://github.com/cdlee2/tools_remote-client/, which also provides features for exploring the file tree.
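To illustrate the framing described in the flag's help text, here is a minimal sketch that splits such a log into its individual serialized messages. It handles only the varint length prefix (the standard protobuf base-128 encoding). Decoding each message would still need the LogEntry proto definition from the Bazel source, which is not shown here. The function names are made up for illustration.

```python
def read_varint(data, pos):
    """Decode a base-128 varint starting at data[pos]; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(data):
            raise ValueError("truncated varint")
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        if not byte & 0x80:               # high bit clear marks the last byte
            return result, pos
        shift += 7


def split_delimited(data):
    """Yield the raw bytes of each length-prefixed message in data."""
    pos = 0
    while pos < len(data):
        size, pos = read_varint(data, pos)
        end = pos + size
        if end > len(data):
            raise ValueError("truncated message")
        yield data[pos:end]
        pos = end


# Hypothetical usage, where grpc.log is the path passed to the flag:
#   with open("grpc.log", "rb") as f:
#       messages = list(split_delimited(f.read()))
#   print(len(messages), "log entries")
```

Counting or diffing the raw entries from two machines can show where the two builds first diverge, even before the messages are decoded.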

-George
