Bazel remote caching performance with a large directory tree as an input

Chris Kuehl

Feb 11, 2020, 9:00:56 PM
to bazel-...@googlegroups.com
Hi bazel-discuss,

I'm migrating a large Python codebase to use Bazel, and am hoping for some advice on using a large directory tree as an input to a remotely-cached action.

Our project has a large virtualenv (directory of installed Python package sources) which is an input of many build targets; it contains about 600 installed Python packages (~50,000 files, ~1.5 GB). When using this directory as an input to an action, I'm observing a large slowdown when enabling remote caching. I think this is because a lot of time is spent hashing the inputs (constructing the Merkle tree?) to build the action's cache key.

An example build step in our codebase compiles an individual source file by calling a program inside the virtualenv; this step runs about 5,000 times, taking about 0.5 seconds each. With 32x concurrency and no remote caching, the build takes about 100 seconds, as expected. If I enable remote caching (e.g. --disk_cache /tmp/cache), it slows down to 210 seconds with a cold cache (0% hit rate) and improves only to 90 seconds with a warm cache (100% hit rate). During the warm-cache run, Bazel itself uses ~100% of all 32 CPU cores on the box for the entire 90 seconds.

When I profile a warm cache build it appears that virtually all of the time is spent constructing Merkle trees (see attachment warm.png); on a cold cache, it looks like about half of the time is Merkle trees (attached cold.png).

It's understandable that hashing a large directory takes a lot of CPU time, but I wonder if anybody has any ideas on how to reduce this cost so that we can still benefit from remote caching? Or am I misunderstanding the problem entirely?

Some approaches we've considered already:
  1. Use a sentinel file that represents the state of the large directory as the only input. The idea is to generate a file describing the contents of the directory (e.g. a directory listing with hashes) and use it as the sole input file, while the action ignores the sentinel and reaches into the virtualenv as it otherwise would. This is very fast, since only the sentinel file needs to be hashed by Bazel, but it means we have to opt out of the Bazel sandbox and reach into the source tree directly (since the virtualenv is not a declared input). We'd very much like to keep sandboxing if possible, since it helps ensure build actions declare their dependencies correctly.
  2. Somehow patch Bazel to only compute the input hash/tree once? This seems possible given that the input Merkle tree is identical for all of these actions, but it's unknown how difficult this is to do correctly.
  3. Batch all of the small compilations into a single action, so that we only pay the input hashing cost once. This works but means that a change to a single file requires a large re-compilation of all files, negating any real benefit.
  4. Don't use remote caching for small compilations, only for large targets. This is possible but not ideal, as the cumulative build time spent on small files is still significant.
These all have downsides which we're hoping to avoid. Has anyone faced a similar problem or have other suggestions?
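Approach 1 could be sketched roughly like this; the write_sentinel helper, the SHA-256 choice, and the manifest format are all hypothetical illustrations, not anything Bazel provides:

```python
import hashlib
import os

def write_sentinel(venv_dir, sentinel_path):
    """Write a single small file whose contents change whenever any file
    under venv_dir changes, by hashing (relative path, content hash)
    pairs in a deterministic order."""
    digest = hashlib.sha256()
    for root, dirs, files in os.walk(venv_dir):
        dirs.sort()  # make traversal order deterministic
        for name in sorted(files):
            path = os.path.join(root, name)
            digest.update(os.path.relpath(path, venv_dir).encode())
            with open(path, "rb") as f:
                digest.update(hashlib.sha256(f.read()).digest())
    with open(sentinel_path, "w") as f:
        f.write(digest.hexdigest() + "\n")
```

Bazel then only has to hash this one tiny file per action, at the cost of the action reading files outside its declared inputs.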

In case it helps, I've appended a minimal reproduction below this email.

Thanks!

======


You'll need to create a large directory tree "venv". Any large tree should work, but you can create one with "virtualenv venv && venv/bin/pip install plone" if desired. (Nothing specific about Plone; it's just a large Python package.) Reproduce with "bazel clean && bazel build compile-templates".
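The attached reproduction itself is not preserved in this archive copy; based on the target names quoted later in the thread ("N.compiled", "compile-46.compiled", genrule(..., srcs=["venv"])), it presumably looked something like this sketch (the cmd and the rule count are placeholders):

```python
# Hypothetical reconstruction of the reproduction BUILD file.
[genrule(
    name = "compile-%d" % i,
    srcs = ["venv"],  # the whole directory tree as one input (the problematic pattern)
    outs = ["%d.compiled" % i],
    cmd = "venv/bin/python -c pass > $@",  # stand-in for the real compile step
) for i in range(100)]

filegroup(
    name = "compile-templates",
    srcs = ["%d.compiled" % i for i in range(100)],
)
```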
[Attachments: cold.png, warm.png]

László Csomor

Feb 12, 2020, 4:38:04 AM
to Chris Kuehl, bazel-discuss
Hi Chris,

Thanks for the details in your email!

> When using this directory as an input to an action [...]

This isn't supported well. For example, genrule(..., srcs=["subdir/"])
won't traverse "subdir/", nor will it stage the contents in the
sandbox. Bazel won't keep track of dependencies on the files under
"subdir/", so modifying the directory's contents won't trigger
rebuilds. Bazel warns about this:

WARNING: [redacted]/BUILD:4:5: input 'venv' to //:compile-46.compiled
is a directory; dependency checking of directories is unsound

Your example BUILD file does the same: genrule(..., srcs=["venv"]).
Are you sure it works correctly?

I'm wondering what the inputs of the Merkle tree computation are.
Could you attach a debugger here (assuming you run Bazel 2.1.0):
https://github.com/bazelbuild/bazel/blob/2.1.0/src/main/java/com/google/devtools/build/lib/remote/merkletree/MerkleTree.java#L122

Debugging Bazel itself: https://bazel.build/contributing.html#debugging-bazel


Cheers,
László

--
László Csomor | Software Engineer | laszlo...@google.com

Google Germany GmbH | Erika-Mann-Str. 33 | 80636 München | Germany
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Paul Manicle, Halimah DeLaine Prado

Jakob Buchgraber

Feb 12, 2020, 4:47:21 AM
to László Csomor, Chris Kuehl, bazel-discuss
In addition to what Laszlo wrote, consider taking a JSON profile of the build [1]. The Merkle tree building shows up there as a separate node, and it's quite possible that it is the problem. As Laszlo wrote, it's strongly recommended not to specify directories but to use a glob() instead. With a directory input, the MerkleTree construction has to recursively traverse the directory by actually calling readdir(). Besides the correctness concerns, this traversal isn't cached between actions, so every action repeats the readdir() and the hashing. With a glob() you pay this cost only once, at the beginning of the build.
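Concretely, the suggested change on the reproduction's genrule would look something like this (target name per the thread; the cmd is a placeholder):

```python
genrule(
    name = "compile-1",
    # Before: srcs = ["venv"] -- one opaque directory, re-walked and
    # re-hashed by the Merkle tree code for every single action.
    # After: every file enumerated up front, hashed once at load time.
    srcs = glob(["venv/**"]),
    outs = ["1.compiled"],
    cmd = "venv/bin/python -c pass > $@",  # stand-in for the real tool
)
```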

Best,
Jakob

[1] https://docs.bazel.build/versions/master/skylark/performance.html#json-profile

Jakob Buchgraber

Software Engineer


This e-mail is confidential. If you received this communication by mistake, please don't forward it to anyone else, please erase all copies and attachments, and please let me know that it has gone to the wrong person.




Jakob Buchgraber

Feb 12, 2020, 5:01:21 AM
to Chris Kuehl, bazel-discuss, László Csomor
Chris, please excuse my ignorance. I read too quickly and missed that you had already taken the JSON profile. So change it to a glob and performance should improve!

Best,
Jakob



Jared Neil

Feb 13, 2020, 2:09:53 PM
to bazel-discuss
For what it's worth, we ran into the same problem, but with rules_nodejs. Those rules generate BUILD files that list every file in each node_modules package explicitly, so I doubt using globs will solve the problem entirely. The only workaround we found was to limit the action inputs as much as possible, by depending only on the specific subsets of the node_modules directory that each action needs.

I feel like Bazel could save a lot of time by calculating a separate Merkle tree for each tool input to the action, then combining those with the tree computed from the action's direct non-tool inputs.
For example, an action with one input file and one output file that uses a tool depending on 50,000 files shouldn't need to calculate a new 50,001-file Merkle tree; it should reuse the 50,000-file tool tree across actions and combine it with the one new file to compute the final action key.
Currently, the time spent in the Merkle tree code appears to be a significant portion of the total time spent on every action, even when only one file differs between two actions.
I haven't looked at the Merkle tree code, so maybe it already works this way, but based on the speedup we saw in profiles after trimming our node_modules, it doesn't appear to.
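The reuse described above could be sketched like this (illustrative Python only; this is not how Bazel's MerkleTree code is actually structured, and the flat {path: bytes} model glosses over real directory nesting):

```python
import hashlib

# Cache of digests for input groups that repeat across actions,
# e.g. a tool's 50,000 runfiles: computed once, reused thereafter.
_subtree_cache = {}

def subtree_digest(files):
    """Digest of one input group, given {path: content_bytes}, memoized."""
    key = frozenset(files.items())
    if key not in _subtree_cache:
        h = hashlib.sha256()
        for path in sorted(files):
            h.update(path.encode())
            h.update(files[path])
        _subtree_cache[key] = h.digest()
    return _subtree_cache[key]

def action_key(tool_files, direct_files):
    """Combine the (cached) tool subtree digest with the digest of the
    action's few direct inputs to form the final cache key."""
    h = hashlib.sha256()
    h.update(subtree_digest(tool_files))    # cache hit for every action after the first
    h.update(subtree_digest(direct_files))  # small set, cheap even uncached
    return h.hexdigest()
```

With this shape, 5,000 actions sharing one 50,000-file tool pay the big hashing cost once instead of 5,000 times.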

Chris Kuehl

Feb 13, 2020, 2:28:41 PM
to László Csomor, Jakob Buchgraber, bazel-...@googlegroups.com
Hi László and Jakob,

Thank you for the detailed responses!

On Wed, Feb 12, 2020 at 1:38 AM László Csomor <laszlo...@google.com> wrote:
> This isn't supported well. For example, genrule(..., srcs=["subdir/"])
> won't traverse "subdir/", nor will it stage the contents in the
> sandbox. Bazel won't keep track of dependencies on the files under
> "subdir/", so modifying the directory's contents won't trigger
> rebuilds. Bazel warns about this:
>
>   WARNING: [redacted]/BUILD:4:5: input 'venv' to //:compile-46.compiled
>   is a directory; dependency checking of directories is unsound
>
> Your example BUILD file does the same: genrule(..., srcs=["venv"]).
> Are you sure it works correctly?

Ah, you're right, Bazel does not notice when I make changes inside the venv directory and as a result does not rebuild the files that depend on it. So it sounds like my current approach of just specifying the directory as an input won't really work even ignoring the perf concerns with remote caching, because it won't rebuild files when needed.

Interestingly, remote caching does seem to notice differences inside the directory. Using the same BUILD file as my original post, I run this command:
bazel clean && bazel build 1.compiled --disk_cache /tmp/my-cache

Without changing the venv directory, I get remote cache hits ("INFO: 1 process: 1 remote cache hit."), but a change to any file inside the venv directory causes a cache miss and a rebuild.
 
> I'm wondering what the inputs of the Merkle tree computation are.
> Could you attach a debugger here (assuming you run Bazel 2.1.0):
> https://github.com/bazelbuild/bazel/blob/2.1.0/src/main/java/com/google/devtools/build/lib/remote/merkletree/MerkleTree.java#L122

The inputs at that point contain just a reference to the top-level directory (not yet expanded to every file in the tree):

skyframe-evaluator 31[1] print inputs
 inputs = "{external/bazel_tools/tools/genrule/genrule-setup.sh=File:[/nail/home/ckuehl/.cache/bazel/_bazel_ckuehl/16434d5facc8f66ae384214580769024[source]]external/bazel_tools/tools/genrule/genrule-setup.sh, venv=File:[/nail/home/ckuehl/pg/yelp-main[source]]venv}"

 
If I print out the tree a couple of lines later (line 124), I can see that it now does include every single file under the venv tree (output heavily abbreviated here; the total count is 43,617, which roughly matches the number of files in my venv).

On Wed, Feb 12, 2020 at 1:47 AM Jakob Buchgraber <buc...@google.com> wrote:
> In addition to what Laszlo wrote, consider taking a JSON profile of the build [1]. The Merkle tree building shows up there as a separate node, and it's quite possible that it is the problem. As Laszlo wrote, it's strongly recommended not to specify directories but to use a glob() instead. With a directory input, the MerkleTree construction has to recursively traverse the directory by actually calling readdir(). Besides the correctness concerns, this traversal isn't cached between actions, so every action repeats the readdir() and the hashing. With a glob() you pay this cost only once, at the beginning of the build.

Thanks for this info. When switching to specifying the inputs as glob(["venv/**"]) instead of just "venv", I'm seeing a couple of new problems:
  • The (I think) loading phase now takes ~240 seconds instead of being nearly instant; a profile taken with bazel build --nobuild shows a lot of time spent in evaluateTargetPatterns. I've attached a screenshot, although I'm not sure how useful it is. This might just be expected given the number of files?
  • Inside the sandboxes Bazel creates to run each action, each file inside the venv is symlinked individually (i.e. Bazel creates ~43,000 symlinks per action) instead of just the top-level directory. This makes sense, but setting up a sandbox per action is so expensive that it causes a huge performance hit on top of the loading-phase slowdown above; the full build would take many hours (I stopped it early; only about 10% of the actions had finished after an hour).
I can't really think of an option that can preserve performance, sandboxing, and correctness for this large input. The options I can think of are:
  • Use a glob() as recommended above. This is the most correct and works well with remote caching but makes the loading phase very slow, and requires us to disable sandboxing for performance reasons (or find a way to prevent the sandboxing from symlinking each file individually).
  • Use some kind of sentinel files to represent the large directory (similar to the description in my first message). This seems like it can be correct (assuming the sentinel files correctly describe the state of the directory) but requires us to reach outside of the sandbox, sacrificing a big benefit of Bazel (sandboxing helps guarantee correctness).
  • Specify only the directory as an input. This doesn't really work because of incorrect rebuild semantics and causes poor performance with remote caching.
Sentinel files seem the most promising given our constraints, but I'm kind of sad that we would have to sacrifice sandboxing (since we see this as such a big benefit of using Bazel).

Thanks again for the help!
[Attachment: Screen Shot 2020-02-13 at 11.14.00 AM.png]

Konstantin Erman

Feb 22, 2020, 9:22:09 AM
to bazel-discuss
Chris, from your example it seems that you generate 5,000 copies of the rule, with each copy depending on 50,000 source files. I wonder if you could introduce a single intermediate rule (py_library comes to mind) which would depend on your 50,000 input files, while the 5,000 other rules would depend only on that single intermediate rule. That intermediate rule may play the role of the sentinel you were considering. The checksumming and symlinking of 50,000 files would stay, but I hope it would only be done once instead of 5,000 times.
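As a sketch of that, with a filegroup standing in for the intermediate rule (names and cmd are illustrative):

```python
# One intermediate target owns the 50,000 venv files...
filegroup(
    name = "venv_files",
    srcs = glob(["venv/**"]),
)

# ...and each of the ~5,000 small rules depends on it instead of
# re-declaring the file set, so Bazel analyzes the set once.
genrule(
    name = "compile-1",
    srcs = [":venv_files"],
    outs = ["1.compiled"],
    cmd = "venv/bin/python -c pass > $@",  # stand-in for the real step
)
```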

Konstantin

Chris Kuehl

Feb 25, 2020, 4:49:25 PM
to Konstantin Erman, bazel-discuss
Hey Konstantin,

You're right -- if I group all of these 50k input files into one rule (e.g. via filegroup) and reference that instead, it is indeed significantly faster during the loading (?) phase. This is really helpful, thanks for the tip!

Unfortunately the symlinking required to set up a separate sandbox for each of the 5k rules still ends up being prohibitively slow for my use-case (as far as I can tell, each sandbox is only used once).

On Sat, Feb 22, 2020 at 6:22 AM Konstantin Erman <kon...@ermank.com> wrote:
> Chris, from your example it seems that you generate 5,000 copies of the rule, with each copy depending on 50,000 source files. I wonder if you could introduce a single intermediate rule (py_library comes to mind) which would depend on your 50,000 input files, while the 5,000 other rules would depend only on that single intermediate rule. That intermediate rule may play the role of the sentinel you were considering. The checksumming and symlinking of 50,000 files would stay, but I hope it would only be done once instead of 5,000 times.
>
> Konstantin


Austin Schuh

Feb 25, 2020, 7:11:13 PM
to Chris Kuehl, Konstantin Erman, bazel-discuss
For the slow symlinking problem, try:

--experimental_sandbox_base=/dev/shm

This puts the symlinks in a tmpfs, which is significantly faster.
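In .bazelrc form (assuming a Linux host where /dev/shm is a tmpfs mount; the flag was still experimental as of Bazel 2.x):

```shell
# .bazelrc: create per-action sandbox directories on tmpfs instead of disk
build --experimental_sandbox_base=/dev/shm
```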

Austin
> To view this discussion on the web visit https://groups.google.com/d/msgid/bazel-discuss/CAJi78qx2amL_H1uS7ZoN7-i1XC%2BDgfhgGG%2B9Zfosp5X4CyCX1Q%40mail.gmail.com.

Fredrik Medley

Mar 26, 2020, 7:27:01 PM
to bazel-discuss
I've written a comment on https://github.com/bazelbuild/bazel/issues/10875 about how to completely avoid building the Merkle tree during the initial cache-hit check.

/Fredrik

