Status on --experimental_spawn_scheduler


Mark Sulkowski

Jun 23, 2020, 10:06:22 AM
to bazel-discuss
I'm interested to know about any bugs or issues that may pertain to the use of --experimental_spawn_scheduler.

How mature is this feature, and what problems might I expect in using it?

Julio Merino

Jun 23, 2020, 10:55:57 AM
to Mark Sulkowski, bazel-discuss
On 6/23/20 10:06 AM, 'Mark Sulkowski' via bazel-discuss wrote:
> I'm interested to know about any bugs or issues that may pertain to the use of --experimental_spawn_scheduler.
>
> How mature is this feature, and what problems might I expect in using it?

Hello Mark,

I'd say that the feature is pretty mature. We use it regularly at Google and don't experience random bugs. That isn't to say it's perfect. There are plenty of things to improve (you can query our open issues under team-Local-Exec), of course.

The big caveat here is that we use the dynamic scheduler against our internal client of remote execution... which is unfortunately very different to the external one. So it's possible that some issues remain in the latter that haven't been exposed due to reduced usage.

The things you have to be aware of if you want to try this out are:

* Ensuring that the remote and local environments are equivalent. This is critical. Otherwise, mixing outputs between them could lead to strange build problems.

* Watching out for which workers you enable. Bazel doesn't have good resource tracking for workers, so if you end up enabling too many worker types during a build, you'll likely render your system unusable.

* Tuning --local_cpu_resources. We have found that 0.75*HOST_CPUS works best for us, but that will depend on your build behavior and whether you use workers or not.
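
As a minimal sketch of that last point, assuming the flags live in a .bazelrc file and using a config name ("dynamic") that is chosen here only for illustration, the 0.75 multiplier can be written with Bazel's HOST_CPUS keyword:

```
# Hypothetical config name "dynamic"; enable it with --config=dynamic.
# Caps local execution at three quarters of the host's CPUs.
build:dynamic --local_cpu_resources=HOST_CPUS*0.75
```

The multiplier is a starting point rather than a recommendation for every build; it is worth re-measuring if workers are enabled.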

I wrote a series of blog posts detailing what our work has been in this area, starting here: https://jmmv.dev/2019/12/bazel-dynamic-execution-introduction.html and a recent post on the upcoming version of the scheduler here: https://jmmv.dev/2020/06/shipping-bazel-new-dynamic-scheduler.html

Cheers

David Sanderson

Sep 29, 2020, 7:27:55 PM
to bazel-discuss
On Tuesday, June 23, 2020 at 10:55:57 AM UTC-4 JMMV wrote:
> I wrote a series of blog posts detailing what our work has been in this area, starting here: https://jmmv.dev/2019/12/bazel-dynamic-execution-introduction.html and a recent post on the upcoming version of the scheduler here: https://jmmv.dev/2020/06/shipping-bazel-new-dynamic-scheduler.html

Has the upcoming version of the scheduler described in https://jmmv.dev/2020/06/shipping-bazel-new-dynamic-scheduler.html made its way into a public bazel release?  If so, which one?

Julio Merino

Sep 29, 2020, 8:58:42 PM
to David Sanderson, bazel-discuss
It has been out there for a while. On mobile now so I cannot check easily. But it’s still disabled by default.

Cheers

--


You received this message because you are subscribed to the Google Groups "bazel-discuss" group.


To unsubscribe from this group and stop receiving emails from it, send an email to bazel-discus...@googlegroups.com.


To view this discussion on the web visit https://groups.google.com/d/msgid/bazel-discuss/b5fdf0ca-9ccd-41c0-85d8-132a921c3f4bn%40googlegroups.com.


--

David Sanderson

Sep 30, 2020, 12:31:25 AM
to bazel-discuss
On Tuesday, September 29, 2020 at 8:58:42 PM UTC-4 JMMV wrote:
> It has been out there for a while. On mobile now so I cannot check easily. But it’s still disabled by default.

Ah, thank you!  I delved into the code a bit.  If I'm understanding correctly, the basic way to try out the new
dynamic execution mechanism is to use:

    --internal_spawn_scheduler
    --spawn_strategy=dynamic
    --legacy_spawn_scheduler=false

with the possible additions of

    --experimental_local_lockfree_output
    --experimental_local_execution_delay=1000

I tried this out with bazel 3.5.0 with buildfarm as the remote executor. Sadly, every attempt so far has
failed due to missing directories -- apparently race conditions of some sort.

The old dynamic execution mechanism (plain --experimental_spawn_scheduler) also exhibits these
races in bazel 3.5.0.

For what it's worth, the most recent version of bazel where we've been able to use
--experimental_spawn_scheduler with buildfarm is bazel 3.1.0.

Lars Clausen

Sep 30, 2020, 4:09:11 AM
to David Sanderson, bazel-discuss
Hi David,

I'm working on this area, and would like to hear more about how it fails. Could you share the failure messages and what your setup is like?

Thanks,
-Lars


David Sanderson

Sep 30, 2020, 3:55:39 PM
to bazel-discuss
On Wednesday, September 30, 2020 at 4:09:11 AM UTC-4 Lars wrote:
> Hi David,
>
> I'm working on this area, and would like to hear more about how it fails. Could you share the failure messages and what your setup is like?

I'll be happy to!
We are using Ubuntu 18.04 on 64-bit Intel hardware.
We are currently using buildfarm 1.2.0 (https://github.com/bazelbuild/bazel-buildfarm/releases/tag/1.2.0).
When executing the following script with "bash -x":

    bazelrc=bazel.rc
    cat $bazelrc
    startup="--bazelrc=$bazelrc"
    ./bazel.3.5.0 $startup clean
    ./bazel.3.5.0 $startup build --verbose_failures --config=remote --config=dynamic-execution1 //atg/hello_world/cc:hello_world_test

we see output similar to the following:

    + bazelrc=bazel.rc
    + cat bazel.rc
    build --extra_toolchains=//tools/toolchain:cc-toolchain-linux_k8_clang
    build --compiler=clang
    build --host_compiler=clang
    build --crosstool_top=//tools/toolchain:cc-toolchain
    build --host_crosstool_top=//tools/toolchain:host-cc-toolchain
    build --platforms=//tools/toolchain:target_platform
    build --host_platform=//tools/toolchain:host_platform
    build --action_env=BAZEL_DO_NOT_DETECT_CPP_TOOLCHAIN=1
    build --features=-static_libgcc
    build --incompatible_strict_action_env
    
    build --curses=no
    build --color=no
    
    build:remote --remote_executor=grpc://buildfarm-prd-1.aws.uberatc.net
    build:remote --jobs=128
    
    # Dynamic execution
    #
    # --config=dynamic-execution0   --experimental_spawn_scheduler with --local_cpu_resources=HOST_CPUS*0.75 --local_ram_resources=HOST_RAM*0.75
    # --config=dynamic-execution1   Use the new dynamic scheduler
    # --config=dynamic-execution2   Use the new dynamic scheduler and --experimental_local_lockfree_output
    # --config=dynamic-execution3   Use the new dynamic scheduler and --experimental_local_lockfree_output and --experimental_local_execution_delay=1000
    #
    # Note that units of --experimental_local_execution_delay are milliseconds.
    
    build:dynamic-execution0 --local_cpu_resources=HOST_CPUS*0.75
    build:dynamic-execution0 --local_ram_resources=HOST_RAM*0.75
    build:dynamic-execution0 --internal_spawn_scheduler
    build:dynamic-execution0 --spawn_strategy=dynamic
    
    build:dynamic-execution1 --config=dynamic-execution0
    build:dynamic-execution1 --legacy_spawn_scheduler=false
    
    build:dynamic-execution2 --config=dynamic-execution1
    build:dynamic-execution2 --experimental_local_lockfree_output
    
    build:dynamic-execution3 --config=dynamic-execution2
    build:dynamic-execution3 --experimental_local_execution_delay=1000
    + startup=--bazelrc=bazel.rc
    + ./bazel.3.5.0 --bazelrc=bazel.rc clean
    INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
    + ./bazel.3.5.0 --bazelrc=bazel.rc build --verbose_failures --config=remote --config=dynamic-execution1 //atg/hello_world/cc:hello_world_test
    INFO: Invocation ID: 2b7f70c8-860d-4a12-9cc9-de9e35f371d7
    Loading: 
    Loading: 0 packages loaded
    Analyzing: target //atg/hello_world/cc:hello_world_test (1 packages loaded, 0 targets configured)
    INFO: Analyzed target //atg/hello_world/cc:hello_world_test (37 packages loaded, 4313 targets configured).
    INFO: Found 1 target...
    [0 / 1] [Prepa] BazelWorkspaceStatusAction stable-status.txt
    WARNING: Reading from Remote Cache:
    /home/dws/.cache/bazel/_bazel_dws/71d2c1360d0922685900f5826e324a56/execroot/__main__/bazel-out/k8-fastbuild/bin/external/gperftools/_objs/tcmalloc/malloc_extension.pic.d.tmp (No such file or directory)
    WARNING: Reading from Remote Cache:
    /home/dws/.cache/bazel/_bazel_dws/71d2c1360d0922685900f5826e324a56/execroot/__main__/bazel-out/k8-fastbuild/bin/external/gperftools/_objs/tcmalloc/thread_cache.pic.d.tmp (No such file or directory)
    WARNING: Reading from Remote Cache:
    /home/dws/.cache/bazel/_bazel_dws/71d2c1360d0922685900f5826e324a56/execroot/__main__/bazel-out/k8-fastbuild/bin/external/gperftools/_objs/tcmalloc/raw_printer.pic.d.tmp (No such file or directory)
    WARNING: Reading from Remote Cache:
    /home/dws/.cache/bazel/_bazel_dws/71d2c1360d0922685900f5826e324a56/execroot/__main__/bazel-out/k8-fastbuild/bin/external/gperftools/_objs/tcmalloc/symbolize.pic.d.tmp (No such file or directory)
    WARNING: Reading from Remote Cache:
    /home/dws/.cache/bazel/_bazel_dws/71d2c1360d0922685900f5826e324a56/execroot/__main__/bazel-out/k8-fastbuild/bin/external/gperftools/_objs/tcmalloc/central_freelist.pic.d.tmp (No such file or directory)
    WARNING: Reading from Remote Cache:
    /home/dws/.cache/bazel/_bazel_dws/71d2c1360d0922685900f5826e324a56/execroot/__main__/bazel-out/k8-fastbuild/bin/external/com_google_googletest/_objs/gtest/gtest-port.pic.d.tmp (No such file or directory)
    ERROR: /home/dws/.cache/bazel/_bazel_dws/71d2c1360d0922685900f5826e324a56/external/gperftools/BUILD.bazel:10:11: C++ compilation of rule '@gperftools//:tcmalloc' failed (Exit 34): java.io.FileNotFoundException: /home/dws/.cache/bazel/_bazel_dws/71d2c1360d0922685900f5826e324a56/execroot/__main__/bazel-out/k8-fastbuild/bin/external/gperftools/_objs/tcmalloc/sysinfo.pic.d.tmp (No such file or directory)
            at com.google.devtools.build.lib.unix.NativePosixFiles.lstat(Native Method)
            at com.google.devtools.build.lib.unix.UnixFileSystem.statInternal(UnixFileSystem.java:186)
            at com.google.devtools.build.lib.unix.UnixFileSystem.stat(UnixFileSystem.java:176)
            at com.google.devtools.build.lib.vfs.Path.stat(Path.java:418)
            at com.google.devtools.build.lib.vfs.FileSystemUtils.moveFile(FileSystemUtils.java:454)
            at com.google.devtools.build.lib.remote.RemoteCache.moveOutputsToFinalLocation(RemoteCache.java:415)
            at com.google.devtools.build.lib.remote.RemoteCache.download(RemoteCache.java:374)
            at com.google.devtools.build.lib.remote.RemoteSpawnRunner.downloadAndFinalizeSpawnResult(RemoteSpawnRunner.java:440)
            at com.google.devtools.build.lib.remote.RemoteSpawnRunner.exec(RemoteSpawnRunner.java:266)
            at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:240)
            at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:132)
            at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.runRemotely(DynamicSpawnStrategy.java:397)
            at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.access$400(DynamicSpawnStrategy.java:64)
            at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$2.callImpl(DynamicSpawnStrategy.java:298)
            at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$Branch.call(DynamicSpawnStrategy.java:471)
            at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$Branch.call(DynamicSpawnStrategy.java:408)
            at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
            at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
            at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
            at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
            at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
            at java.base/java.lang.Thread.run(Unknown Source)
    . Note: Remote connection/protocol failed with: execution failed java.io.FileNotFoundException: /home/dws/.cache/bazel/_bazel_dws/71d2c1360d0922685900f5826e324a56/execroot/__main__/bazel-out/k8-fastbuild/bin/external/gperftools/_objs/tcmalloc/sysinfo.pic.d.tmp (No such file or directory)
            at com.google.devtools.build.lib.unix.NativePosixFiles.lstat(Native Method)
            at com.google.devtools.build.lib.unix.UnixFileSystem.statInternal(UnixFileSystem.java:186)
            at com.google.devtools.build.lib.unix.UnixFileSystem.stat(UnixFileSystem.java:176)
            at com.google.devtools.build.lib.vfs.Path.stat(Path.java:418)
            at com.google.devtools.build.lib.vfs.FileSystemUtils.moveFile(FileSystemUtils.java:454)
            at com.google.devtools.build.lib.remote.RemoteCache.moveOutputsToFinalLocation(RemoteCache.java:415)
            at com.google.devtools.build.lib.remote.RemoteCache.download(RemoteCache.java:374)
            at com.google.devtools.build.lib.remote.RemoteSpawnRunner.downloadAndFinalizeSpawnResult(RemoteSpawnRunner.java:440)
            at com.google.devtools.build.lib.remote.RemoteSpawnRunner.exec(RemoteSpawnRunner.java:266)
            at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:240)
            at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:132)
            at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.runRemotely(DynamicSpawnStrategy.java:397)
            at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.access$400(DynamicSpawnStrategy.java:64)
            at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$2.callImpl(DynamicSpawnStrategy.java:298)
            at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$Branch.call(DynamicSpawnStrategy.java:471)
            at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$Branch.call(DynamicSpawnStrategy.java:408)
            at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
            at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
            at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
            at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
            at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
            at java.base/java.lang.Thread.run(Unknown Source)
    WARNING: Reading from Remote Cache:
    /home/dws/.cache/bazel/_bazel_dws/71d2c1360d0922685900f5826e324a56/execroot/__main__/bazel-out/k8-fastbuild/bin/external/gperftools/_objs/tcmalloc/fake_stacktrace_scope.pic.d.tmp (No such file or directory)
    Target //atg/hello_world/cc:hello_world_test failed to build
    INFO: Elapsed time: 2.172s, Critical Path: 0.85s
    INFO: 2 processes: 2 remote cache hit.
    FAILED: Build did NOT complete successfully
    WARNING: Failed to delete contents of sandbox /home/dws/.cache/bazel/_bazel_dws/71d2c1360d0922685900f5826e324a56/sandbox: java.io.FileNotFoundException: unlinkat(/home/dws/.cache/bazel/_bazel_dws/71d2c1360d0922685900f5826e324a56/sandbox/linux-sandbox/570/execroot/__main__/external/sysroot/usr/include/linux/netfilter_arp) (No such file or directory)
    FAILED: Build did NOT complete successfully

I will be happy to work on creating a more complete reproduction example with
source code and buildfarm setup instructions, but that will take some time.
In the meantime, I'll be happy to run whatever experiments you might want to
suggest that would give you more information.  I'll also be happy to file an issue
on bazel's github if you'd like to move the discussion there.

Brian Silverman

Sep 30, 2020, 11:29:59 PM
to David Sanderson, bazel-discuss
For my remote execution builds, I'm using --experimental_inmemory_dotd_files. I haven't actually tried without it, but given that your error is with .d files maybe that would fix it?

--experimental_inmemory_jdeps_files looks like it is along the same lines too.


David Sanderson

Oct 1, 2020, 12:06:23 PM
to Brian Silverman, bazel-discuss
On Wed, Sep 30, 2020 at 11:29 PM Brian Silverman <bsilve...@gmail.com> wrote:
> For my remote execution builds, I'm using --experimental_inmemory_dotd_files. I haven't actually tried without it, but given that your error is with .d files maybe that would fix it?
>
> --experimental_inmemory_jdeps_files looks like more along the same lines too.

Thank you for the suggestion! That option was new to me.

I tried it out by adding it to the dynamic-execution1 config, but unfortunately it does not seem to help:

    build:dynamic-execution1 --experimental_inmemory_dotd_files

    build:dynamic-execution1 --legacy_spawn_scheduler=false
   
    build:dynamic-execution2 --config=dynamic-execution1
    build:dynamic-execution2 --experimental_local_lockfree_output
   
    build:dynamic-execution3 --config=dynamic-execution2
    build:dynamic-execution3 --experimental_local_execution_delay=1000
    + startup=--bazelrc=bazel.rc
    + ./bazel.3.5.0 --bazelrc=bazel.rc clean
    INFO: Starting clean (this may take a while). Consider using --async if the clean takes more than several minutes.
    + ./bazel.3.5.0 --bazelrc=bazel.rc build --verbose_failures --config=remote --config=dynamic-execution1 //atg/hello_world/cc:hello_world_test
    INFO: Invocation ID: 2866a98b-20af-4199-9937-2956d6d22ede

    Loading:
    Loading: 0 packages loaded
    Loading: 0 packages loaded
    Analyzing: target //atg/hello_world/cc:hello_world_test (1 packages loaded, 0 targets configured)
    INFO: Analyzed target //atg/hello_world/cc:hello_world_test (37 packages loaded, 4313 targets configured).
    INFO: Found 1 target...
    [0 / 2] [Prepa] BazelWorkspaceStatusAction stable-status.txt
    [9 / 65] [Prepa] BazelWorkspaceStatusAction stable-status.txt ... (87 actions, 86 running)

    WARNING: Reading from Remote Cache:
    /home/dws/.cache/bazel/_bazel_dws/71d2c1360d0922685900f5826e324a56/execroot/__main__/bazel-out/k8-fastbuild/bin/external/gperftools/_objs/tcmalloc/logging.pic.d.tmp (No such file or directory)

    WARNING: Reading from Remote Cache:
    /home/dws/.cache/bazel/_bazel_dws/71d2c1360d0922685900f5826e324a56/execroot/__main__/bazel-out/k8-fastbuild/bin/external/gperftools/_objs/tcmalloc/symbolize.pic.d.tmp (No such file or directory)
    WARNING: Reading from Remote Cache:
    /home/dws/.cache/bazel/_bazel_dws/71d2c1360d0922685900f5826e324a56/execroot/__main__/bazel-out/k8-fastbuild/bin/external/gperftools/_objs/tcmalloc/internal_logging.pic.d.tmp (No such file or directory)
    ERROR: /home/dws/.cache/bazel/_bazel_dws/71d2c1360d0922685900f5826e324a56/external/gperftools/BUILD.bazel:10:11: C++ compilation of rule '@gperftools//:tcmalloc' failed (Exit 34): java.io.FileNotFoundException: /home/dws/.cache/bazel/_bazel_dws/71d2c1360d0922685900f5826e324a56/execroot/__main__/bazel-out/k8-fastbuild/bin/external/gperftools/_objs/tcmalloc/raw_printer.pic.d.tmp (No such file or directory)

            at com.google.devtools.build.lib.unix.NativePosixFiles.lstat(Native Method)
            at com.google.devtools.build.lib.unix.UnixFileSystem.statInternal(UnixFileSystem.java:186)
            at com.google.devtools.build.lib.unix.UnixFileSystem.stat(UnixFileSystem.java:176)
            at com.google.devtools.build.lib.vfs.Path.stat(Path.java:418)
            at com.google.devtools.build.lib.vfs.FileSystemUtils.moveFile(FileSystemUtils.java:454)
            at com.google.devtools.build.lib.remote.RemoteCache.moveOutputsToFinalLocation(RemoteCache.java:415)
            at com.google.devtools.build.lib.remote.RemoteCache.download(RemoteCache.java:374)
            at com.google.devtools.build.lib.remote.RemoteSpawnRunner.downloadAndFinalizeSpawnResult(RemoteSpawnRunner.java:440)
            at com.google.devtools.build.lib.remote.RemoteSpawnRunner.exec(RemoteSpawnRunner.java:266)
            at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:240)
            at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:132)
            at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.runRemotely(DynamicSpawnStrategy.java:397)
            at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.access$400(DynamicSpawnStrategy.java:64)
            at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$2.callImpl(DynamicSpawnStrategy.java:298)
            at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$Branch.call(DynamicSpawnStrategy.java:471)
            at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$Branch.call(DynamicSpawnStrategy.java:408)
            at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
            at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
            at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
            at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
            at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
            at java.base/java.lang.Thread.run(Unknown Source)
    . Note: Remote connection/protocol failed with: execution failed java.io.FileNotFoundException: /home/dws/.cache/bazel/_bazel_dws/71d2c1360d0922685900f5826e324a56/execroot/__main__/bazel-out/k8-fastbuild/bin/external/gperftools/_objs/tcmalloc/raw_printer.pic.d.tmp (No such file or directory)

            at com.google.devtools.build.lib.unix.NativePosixFiles.lstat(Native Method)
            at com.google.devtools.build.lib.unix.UnixFileSystem.statInternal(UnixFileSystem.java:186)
            at com.google.devtools.build.lib.unix.UnixFileSystem.stat(UnixFileSystem.java:176)
            at com.google.devtools.build.lib.vfs.Path.stat(Path.java:418)
            at com.google.devtools.build.lib.vfs.FileSystemUtils.moveFile(FileSystemUtils.java:454)
            at com.google.devtools.build.lib.remote.RemoteCache.moveOutputsToFinalLocation(RemoteCache.java:415)
            at com.google.devtools.build.lib.remote.RemoteCache.download(RemoteCache.java:374)
            at com.google.devtools.build.lib.remote.RemoteSpawnRunner.downloadAndFinalizeSpawnResult(RemoteSpawnRunner.java:440)
            at com.google.devtools.build.lib.remote.RemoteSpawnRunner.exec(RemoteSpawnRunner.java:266)
            at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:240)
            at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:132)
            at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.runRemotely(DynamicSpawnStrategy.java:397)
            at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy.access$400(DynamicSpawnStrategy.java:64)
            at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$2.callImpl(DynamicSpawnStrategy.java:298)
            at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$Branch.call(DynamicSpawnStrategy.java:471)
            at com.google.devtools.build.lib.dynamic.DynamicSpawnStrategy$Branch.call(DynamicSpawnStrategy.java:408)
            at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
            at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
            at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
            at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
            at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
            at java.base/java.lang.Thread.run(Unknown Source)
    Target //atg/hello_world/cc:hello_world_test failed to build
    INFO: Elapsed time: 3.054s, Critical Path: 1.15s
    INFO: 10 processes: 10 remote cache hit.

    FAILED: Build did NOT complete successfully
    WARNING: Failed to delete contents of sandbox /home/dws/.cache/bazel/_bazel_dws/71d2c1360d0922685900f5826e324a56/sandbox: java.io.IOException: unlinkat(/home/dws/.cache/bazel/_bazel_dws/71d2c1360d0922685900f5826e324a56/sandbox/linux-sandbox/187/execroot/__main__/external/gperftools/src/base) (Directory not empty)

Julio Merino

Oct 1, 2020, 1:19:36 PM
to David Sanderson, Brian Silverman, bazel-discuss
Interesting. Does this only fail with dynamic execution? Do builds work with remote only?

The failure that you show seems to be in the "download artifacts to temporary location" step (see https://jmmv.dev/2019/12/bazel-dynamic-execution-download-times.html for details). I implemented that feature to make dynamic scheduling resilient to networking issues, but I only did so in the Google-internal variant of the remote execution path. I think Jakob was the one that replicated this same idea in the open-source version, but I doubt it has ever gone through extensive testing with dynamic execution yet...

David Sanderson

Oct 1, 2020, 2:16:30 PM
to Julio Merino, Brian Silverman, bazel-discuss
On Thu, Oct 1, 2020 at 1:19 PM Julio Merino <jm...@google.com> wrote:
> Interesting. Does this only fail with dynamic execution? Do builds work with remote only?
>
> The failure that you show seems to be in the "download artifacts to temporary location" step (see https://jmmv.dev/2019/12/bazel-dynamic-execution-download-times.html for details). I implemented that feature to make dynamic scheduling resilient to networking issues, but I only did so in the Google-internal variant of the remote execution path. I think Jakob was the one that replicated this same idea in the open-source version, but I doubt it has ever gone through extensive testing with dynamic execution yet...

Correct. It only fails with dynamic execution. The builds work fine with remote only, and with local only.

In bazel 3.5.0, we've observed the failures with both the old and the new dynamic execution.

We do not see the failures with the old dynamic execution in bazel 3.1.0.
We first observed the failures with the old dynamic execution in bazel 3.2.0.
So it would appear that the regression (or at least a regression) was introduced during that interval.

Gregg Donovan

Oct 1, 2020, 8:08:54 PM
to bazel-discuss
We (Etsy) are also starting to test dynamic execution using 3.5.0 and have already had some promising test runs. We're hoping to get it into production.

The set of flags is a bit intimidating, though. What set of flags would you recommend we start testing with? We're happy to trade off RAM for speed and reliability and we want to be able to support our WFH and/or IntelliJ users who are often a few timezones from our remote build farm. 

Julio Merino

Oct 2, 2020, 9:31:09 AM
to Gregg Donovan, bazel-discuss
On Oct 1, 2020, at 20:08, 'Gregg Donovan' via bazel-discuss <bazel-...@googlegroups.com> wrote:
>
> We (Etsy) are also starting to test dynamic execution using 3.5.0 and have already had some promising test runs. We're hoping to get it into production.
>
> The set of flags is a bit intimidating, though. What set of flags would you recommend we start testing with? We're happy to trade off RAM for speed and reliability and we want to be able to support our WFH and/or IntelliJ users who are often a few timezones from our remote build farm.

I'd suggest starting with --internal_spawn_scheduler and then using --strategy=SomeMnemonic=dynamic to manually select which mnemonics use the dynamic scheduler. Going "all in" at once can be difficult because things might break.
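
A minimal sketch of that per-mnemonic opt-in as a .bazelrc fragment follows. The mnemonic CppCompile is only an example picked because this thread deals with C++ compilation; substitute whichever mnemonics matter in your build.

```
# Make the dynamic strategy available without applying it to everything.
build --internal_spawn_scheduler
# Opt in a single mnemonic; CppCompile is an illustrative choice.
build --strategy=CppCompile=dynamic
```

Because --spawn_strategy is left untouched, every other mnemonic keeps its existing strategy, which makes it easier to attribute any new failure to the one mnemonic that was switched.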

Once that works, then the question is whether you are using either workers or sandboxing for any of those mnemonics.

If you are using workers, dynamic scheduling won't be able to achieve the best results because Bazel is currently unable to interrupt actions executed by a worker. So if we score a cache hit while the worker is running the action, we'll have to complete the worker action anyway.

If you are using sandboxing, then the old vs. the new dynamic scheduler shouldn't make much of a difference. But if you are _not_ using sandboxing, then you should definitely play with --nolegacy_spawn_scheduler and --experimental_local_lockfree_output to permit interrupting local standalone actions once there is a cache hit.

We've found that --local_cpu_resources=HOST_CPUS*0.75 works best when the dynamic scheduler runs for a significant part of the build, but you'll have to experiment with that.

I think someone also measured --experimental_local_execution_delay=N and found that 250 ms is the best value here. Ideally it'd be zero, but that causes too much local process churn and makes builds worse. Note that this is kind of a hack: we need a better solution than a blind delay, but we haven't designed or implemented one yet.
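
Put together, the tuning suggestions above could look roughly like the following .bazelrc fragment. This is a sketch under two assumptions: local actions are not sandboxed, and the config name "dynamic-tuned" is hypothetical. The values are the starting points mentioned above, not measured optima.

```
# Hypothetical config name; enable it with --config=dynamic-tuned.
build:dynamic-tuned --internal_spawn_scheduler
# Use the new scheduler and allow local standalone actions to be interrupted.
build:dynamic-tuned --nolegacy_spawn_scheduler
build:dynamic-tuned --experimental_local_lockfree_output
# Leave a quarter of the host CPUs free.
build:dynamic-tuned --local_cpu_resources=HOST_CPUS*0.75
# The delay is given in milliseconds.
build:dynamic-tuned --experimental_local_execution_delay=250
```

Mnemonics would still be selected individually with --strategy=SomeMnemonic=dynamic, so this config can be layered on top of a per-mnemonic rollout.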

David Sanderson

Oct 27, 2020, 5:46:11 PM
to bazel-discuss
I now have a small example that reproduces the problems we've been seeing with dynamic execution, in particular the _new_ dynamic execution (--internal_spawn_scheduler --spawn_strategy=dynamic --legacy_spawn_scheduler=false), in conjunction with buildfarm. I verified that it still reproduces in bazel 3.7.0. I filed a ticket at https://github.com/bazelbuild/bazel/issues/12364 ("bazel dynamic execution failures with (at least) buildfarm").

Since all the details are in the issue, I'll not repeat them here.