Why does Bazel consume so much CPU during execution?


Konstantin

Nov 26, 2022, 2:32:45 PM
to bazel-discuss
For our huge C++ build from a "clean" state (no caches populated), Bazel is consistently and substantially slower than CMake. One thing I could not help noticing is that during the execution phase CMake consumes very few resources, leaving most of them to compilers and such, while Bazel continuously pegs about half of the available CPU cores for itself, which means compilation goes twice as slowly! I wonder what is causing it and whether something can be done to improve it.

Thank you!
Konstantin

Moshe Pfeffer

Nov 27, 2022, 6:49:36 AM
to moshe....@mobileye.com, bazel-discuss
--
You received this message because you are subscribed to the Google Groups "bazel-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bazel-discus...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bazel-discuss/17497b25-ab34-40ac-8832-6cd5c8c7cb78n%40googlegroups.com.

Zhuo Chen

Dec 2, 2022, 7:43:09 AM
to bazel-discuss
Can you paste your build metrics, including the build times of both CMake and Bazel? Your observation about CPU utilization is consistent with my understanding; the result, however, is not.
Thanks to its parallelization algorithm, Bazel is able to fully utilize your CPU cores. By contrast, CMake's parallelism can only submit build jobs "in batch" without considering their dependencies. So although CPU utilization is high when using Bazel, I believe the build time should be shorter than with CMake; in my experience Bazel usually takes only a third of the time CMake does.

Konstantin

Dec 2, 2022, 1:04:02 PM
to bazel-discuss
Our current build system is not pure CMake - it is CMake + Ninja, and I believe Ninja's primary purpose is precisely maximum utilization of the cores. It just consumes far less CPU for itself than Bazel does. My observations are on Windows, and it could be that on Windows the Bazel scheduler is not as resource-efficient as Ninja.

David Turner

Dec 4, 2022, 12:03:47 PM
to Konstantin, bazel-discuss
Bazel does a number of things, like setting up a separate sandbox for each build command, or hashing the content of build inputs and outputs, that take a non-trivial amount of CPU and I/O.
None of this has to be performed with CMake + Ninja, so a "clean" Bazel build will always be significantly slower in the absence of prebuilt artifacts in the cache.

Now, half of your CPU cores seems really high, but that may depend on your project. For example, in our experience, when using a custom Python distribution (i.e. interpreter + module files, about 5000 files or 120 MiB), setting up the sandbox for a command that invokes a single py_binary() takes several hundred milliseconds, which is considerably slower than loading and running the script itself. None of that happens when using the system Python, which we avoid for hermeticity / reproducibility reasons.



Konstantin

Dec 4, 2022, 4:16:10 PM
to bazel-discuss
David, thank you for your response!

Although our build supports multiple platforms (Mac, Linux, Windows, Emscripten), my observations about Bazel's high CPU consumption were on Windows, where to my understanding the sandbox is not supported (and therefore not created/destroyed) regardless of any settings. So I guess the high CPU is not related to sandboxing.

I wonder if there is a way to make an educated guess about what it is doing by looking at the build profile or some other kind of instrumentation. I'd love to find out that the cause is some kind of inefficiency in our Starlark code and make it better.
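For example, something along these lines, assuming Bazel's profiling flags behave as documented (exact flag names may vary between versions):

```
# Write a profile of the build to a file...
bazel build //... --profile=/tmp/bazel_profile.gz

# ...then summarize where the time went (phases, critical path, slowest actions).
bazel analyze-profile /tmp/bazel_profile.gz
```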

We also use Python and care about its hermeticity, mostly because devs may have random versions and configurations of Python on their machines and we don't want the build to be affected by those variations. So we also have our own Python package (for each platform), which we download with http_archive and then point a Python toolchain at. Unfortunately, we learned that at least on some platforms (like Linux) the standard rules_python still uses the system Python to unpack the executable, which IMHO goes against the goal of Python hermeticity. For that reason we use our own very limited Python ruleset, which does not require creating py_binary packages and just runs build scripts with the provided Python toolchain. Again, we'd love to switch to the standard rules_python, but the system Python requirement is a real bummer.
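The setup roughly looks like the following WORKSPACE sketch (the names, URLs, and labels here are placeholders for illustration, not our actual configuration):

```python
# WORKSPACE (sketch) -- download a self-contained Python build and
# register it as the toolchain, instead of relying on the system Python.
load("@bazel_tools//tools/build_defs/repo:http.bzl", "http_archive")

http_archive(
    name = "hermetic_python",  # placeholder name
    urls = ["https://example.com/python-3.10-linux-x86_64.tar.gz"],
    sha256 = "...",  # pin the archive's digest
    build_file = "//third_party:python.BUILD",
)

register_toolchains("//build/python:hermetic_py_toolchain")  # placeholder label
```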

Konstantin

David Turner

Dec 5, 2022, 4:47:28 AM
to Konstantin, bazel-discuss
On Sun, Dec 4, 2022 at 10:16 PM Konstantin <kon...@ermank.com> wrote:
David, thank you for your response!

Although our build supports multiple platforms (Mac, Linux, Windows, Emscripten) my observations about Bazel high CPU consumption was on Windows where to my understanding sandbox is not supported (and therefore not created/destroyed) regardless of any settings. So I guess that high CPU is not related to sandboxing.

It's true that it doesn't set up a sandbox, but it will still scan output directories to remove files that are not listed in a build action's manifest. So it is probably still taking some extra time on each build command, even on Windows :-/
My first guess would be that hashing all command inputs / outputs takes more time on Windows, which is known for its distinctly poor I/O performance when querying metadata in the usual Posix-specific way, which seems to be what Bazel does on all platforms.
I have looked into the Bazel sources a little, and it seems that Bazel does not use a WatchFs Java service on Windows unless --experimental_windows_watchfs is used. Instead it will rescan all inputs by default, which is probably very costly (I didn't find anything that caches the digests of files in the sources, but my exploration was pretty light).
I would suggest trying this flag to see if it improves performance.
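For anyone who wants to try, the flags would go into a .bazelrc roughly like this (a sketch; both flags are experimental and the exact names and categories depend on the Bazel version):

```
# .bazelrc (sketch)
startup --experimental_windows_watchfs   # let Bazel use a filesystem watcher on Windows
build --watchfs                          # rely on the watcher instead of rescanning inputs
```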

And there is also the presence of anti-virus programs: do not forget to add the hidden-by-default Bazel output user root / output base, which lies outside of the project's directory, to their exclusion lists.

I wonder if there is a way to make an educated guess regarding what it is doing by looking at the build profile or some other kind of instrumentation. I'd love to find out that the cause is some kind of inefficiency in our Starlark code and make it better. 

We also use Python and care about its hermeticity, mostly because devs may have any random versions and configurations of Python on their machines and we don't want the build to be affected by those variations. So we also have our own Python package (for each platform) which we download with http_archive and then establish Python toolchain pointing at that downloaded package. Unfortunately we learned that at least on some platforms (like Linux) standard rules_python still use system Python to unpack the executable which IMHO goes against the goal of Python hermeticity. For that reason we use our own very limited Python ruleset which does not require creation of py_binary packages and just runs build scripts with the Python toolchain provided. Again we'd love to switch to the standard rules_python but the system Python requirement is a real bummer.

Funny that you mention that because we had the same issue a few weeks ago (i.e. Bazel builds failing on remote builders with no system python3 installed). We found a really ugly workaround to force Bazel to use the prebuilt interpreter instead, but it only works on Unix (i.e. there is no equivalent solution for Windows, as far as I know). The associated bug has all the history if you are interested.
 

Konstantin

Dec 5, 2022, 6:13:40 PM
to bazel-discuss
> it will still scan output directories to remove files that are not listed in a build action's manifest.

That part confused me a lot! To the best of my knowledge (and my experience confirms it), Bazel does not clean up a package's output folder from whatever garbage could be left there by previous builds. Your link points at code which seemingly does just that, so-o... what actually happens? I will try to research it, but I'd appreciate some leads.
A package may have many targets, and targets may have many actions... How would Bazel even know that everything we wanted to build inside the package is done and it is time to "prune the tree"?

Also, the link to the "really ugly workaround" seems to be Google-internal, but I very much want to know what it is.
I understand the workaround is not suitable for Windows, but on the other hand, funnily enough, there seems to be no problem with the system Python on Windows, i.e. in my experiments py_binary on Windows unpacks and works without a system Python!

Konstantin 

David Turner

Dec 6, 2022, 5:03:32 AM
to Konstantin, bazel-discuss
On Tue, Dec 6, 2022 at 12:13 AM Konstantin <kon...@ermank.com> wrote:
> it will still scan output directories to remove files that are not listed in a build action's manifest.

That part confused me a lot! To the best of my knowledge (and my experience confirms it) Bazel does not clean up package output folder from whatever garbage could be there since previous builds. Your link points at the code which seemingly does just that so-o... what actually happens? I will try to research it, but I'd appreciate some leads.
Package may have many targets, targets may have many actions... How Bazel would even know that everything we wanted to build inside the package is done and it is time to "prune the tree"?

It looks like the code I pointed to is related to runfiles, so maybe this step is not performed during all build actions?

Also the link to the "really ugly workaround" seems to be Google internal, but I very much want to know what it is.

Sorry about that, my mistake. The public link is https://fuchsia-review.googlesource.com/c/fuchsia/+/760704, but I'll explain the solution here for the sake of searchability. The trick is to use the stub_shebang attribute when calling py_runtime() for the prebuilt Python runtime, with a carefully crafted value:

stub_shebang = '#!/usr/bin/env -S /bin/bash -c \'"$0".runfiles/main/%s "$0" "$@"\'' % _python3_interpreter_path,

This replaces the default value (which is "#!/usr/bin/env python3" and thus expects a python3 available in the system PATH) and ensures that the Python interpreter located in the target's runfiles directory is used instead.
This relies on the fact that "$0.runfiles/" is the directory containing the runtime files for the script when it is launched, i.e. totally a Bazel implementation detail that could break in a future release. So far this works for us, but note that we do not use --build_python_zip, i.e. this solution might not work with that flag.
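In context, the py_runtime() call would look roughly like this sketch (all labels and paths here are placeholders, not our actual build files; see the linked change for the real definition):

```python
# BUILD.bazel (sketch; placeholder labels)
py_runtime(
    name = "hermetic_py3_runtime",
    interpreter = "//build/python:bin/python3",  # the prebuilt interpreter
    python_version = "PY3",
    # Route the stub back to the interpreter bundled in the runfiles tree
    # instead of whatever `python3` happens to be on PATH.
    stub_shebang = "#!/usr/bin/env -S /bin/bash -c '\"$0\".runfiles/main/bin/python3 \"$0\" \"$@\"'",
)
```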

If you want more history, https://fxbug.dev/115164 contains more details.

I understand the workaround is not suitable for Windows, but on the other hand funny enough there seems to be no problem with the system Python on Windows, i.e. in my experiments py_binary on Windows unpacks and works without system Python!

Thanks, that's good to know. It looks like, in the case of Windows, Bazel creates a launcher Win32 executable for each py_binary() script it needs to launch. That's... a little nuts, but it seems to work.

Konstantin

Dec 6, 2022, 2:58:53 PM
to bazel-discuss
Wow! I got it! Thank you, David!

Yes, on Windows it is nuts, but... I'll take "nuts" over "not working" any day! :-)

Jared Neil

Dec 12, 2022, 5:44:09 PM
to bazel-discuss
> from the "clean" state (no caches populated)

I've noticed that builds with an empty disk or remote cache can be much slower than builds with caching completely disabled. I believe this is because Bazel is calculating the ActionCache key for every action (lots of CPU) in order to do a cache lookup (remote latency), but none of that work helps skip anything when the cache is empty.
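The idea can be illustrated with a toy sketch (this is not Bazel's actual implementation, just the shape of the work): the key for an action is a digest over its command line and the digests of all its input files, so every input has to be read and hashed even when the lookup is guaranteed to miss.

```python
import hashlib

def file_digest(content: bytes) -> str:
    # Bazel digests the content of every input file; on a big tree
    # this is where a lot of the CPU time goes.
    return hashlib.sha256(content).hexdigest()

def action_key(command: list[str], inputs: dict[str, bytes]) -> str:
    # Toy cache key: a digest over the command line plus each input's
    # path and content digest. Any change to either produces a new key.
    h = hashlib.sha256()
    for arg in command:
        h.update(arg.encode())
    for path in sorted(inputs):
        h.update(path.encode())
        h.update(file_digest(inputs[path]).encode())
    return h.hexdigest()

key1 = action_key(["cc", "-c", "a.c"], {"a.c": b"int main(){}"})
key2 = action_key(["cc", "-c", "a.c"], {"a.c": b"int main(){return 1;}"})
assert key1 != key2  # a changed input means a different key, i.e. a cache miss
```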
Are you seeing high CPU with caching completely disabled, or with an empty cache?

Konstantin

Dec 12, 2022, 8:06:11 PM
to bazel-discuss
Jared, please clarify what you mean by "caching completely disabled". Which specific command-line flags do you mean?
We have "--disk_cache=" explicitly set to an empty string, but beyond that I am not aware of other ways to disable Bazel caching.

lar...@google.com

Dec 14, 2022, 5:01:03 AM
to bazel-discuss
On Sunday, 4 December 2022 at 22:16:10 UTC+1 kon...@ermank.com wrote:
David, thank you for your response!

Although our build supports multiple platforms (Mac, Linux, Windows, Emscripten) my observations about Bazel high CPU consumption was on Windows where to my understanding sandbox is not supported (and therefore not created/destroyed) regardless of any settings. So I guess that high CPU is not related to sandboxing.

That understanding is not correct; sandboxing is supported on Windows, see src/main/java/com/google/devtools/build/lib/sandbox/WindowsSandboxedSpawnRunner.java. Even without that, the default ProcessWrapperSandbox should work. You should be able to see what it does sandbox-wise by passing `--sandbox_debug`. If you don't already, you should definitely pass `--experimental_reuse_sandbox_directories` (`--reuse_sandbox_directories` in 6.0).
 
-Lars

David Turner

Dec 14, 2022, 6:39:02 AM
to lar...@google.com, bazel-discuss
This is puzzling, because ProcessWrapperSandbox seems to require a Posix subsystem.
 
Also, the code you mention is never enabled by default; it requires --experimental_use_windows_sandbox, as well as passing the path to a host tool using --experimental_windows_sandbox_path.
Which tool to use is ambiguous: there are comments referencing sandboxfs (but then why not use --experimental_sandboxfs_path instead?). Another comment talks about BuildXL Sandbox APIs, which are only deployed as DLLs, as far as I understand. So it looks like something is still missing.

Also, there is actually no mention of this in the official Bazel documentation, nor in the open GitHub issue for this feature, but the issue *does* have a link to a document describing a (sadly failed or incomplete) GSoC attempt at implementing the feature in 2019, which seems to relate to the Bazel sources linked above.

So in the end, Windows sandboxing still doesn't seem to be usable at all today.
 


lar...@google.com

Dec 14, 2022, 8:37:02 AM
to bazel-discuss
OK, that's not as useful as it looked at first glance. We don't have a Windows box at hand, so we haven't tried it in a while.

-Lars
