Action rewinding and Bazel 7.0


Lukács T. Berki

Nov 21, 2022, 6:29:29 AM
to Daniel Wagner-Hall, Mark Schaller, Justin Horvitz, Chi Wang, bazel-dev
Hey folks,

I'm trying to figure out whether it's feasible to promise action rewinding for Bazel 7.0.

It looks like Daniel (and @k1nkreet on GitHub but I don't know their e-mail) did some work on action rewinding (see #14126 and #16470) but it's quite a delicate change. In addition, I don't think it's a good idea to have a public feature that requires a large number of preconditions to work, which makes it even more difficult.

We at Google also don't have much use for this extended version of action rewinding, so it will be difficult for us to find time to implement it. If this is to happen, it would probably have to be implemented by Daniel with guidance from Justin and Mark. Would that work?

It looks like we collectively know what limitations need to be lifted, so there are no theoretical obstacles, and the framework for testing it (RewindingTest) has already been open sourced; it "just" needs to be coded up and tested.

The limitations that I know of that need to be lifted are:
  1. It should work with --track_incremental_state (which means keeping track of Skyframe reverse deps)
  2. Outputs lost should be evicted from the local action cache (otherwise re-running an action would just be a local action cache hit)
  3. It should work in --nokeep_going mode
  4. It should work without ActionFS (but I assume that a simple integration test would take care of this)

--
Lukács T. Berki | Software Engineer | lbe...@google.com | 

Google Germany GmbH | Erika-Mann-Str. 33  | 80636 München | Germany | Geschäftsführer: Paul Manicle, Halimah DeLaine Prado | Registergericht und -nummer: Hamburg, HRB 86891

Ed Schouten

Nov 21, 2022, 7:31:53 AM
to Lukács T. Berki, Daniel Wagner-Hall, Mark Schaller, Justin Horvitz, Chi Wang, bazel-dev
Hi Lukács,

Long time no see!

On Mon, Nov 21, 2022 at 12:29, 'Lukács T. Berki' via bazel-dev
<baze...@googlegroups.com> wrote:
> I'm trying to figure out whether it's feasible to promise action rewinding for Bazel 7.0 .

I'm pretty sure I have already brought this up in a couple of
different places, but to me it's still not clear why we need the
action rewinding. As far as I know, there are two scenarios to
consider:

1. Your remote caching/execution service loses CAS objects that were
generated by an action that ran at an earlier point within the same
build.

This may happen if your remote caching setup is:

- Too small, which can easily be fixed by increasing its storage
capacity. Rewinding is not a good solution here, as it may cause your
build to get stuck indefinitely.
- Not reliable. In that case the right fix is to improve the
reliability of your remote caching setup.

I think it's not an unreasonable requirement for remote caching setups
to guarantee preservation of CAS objects for the duration of a single
build.

2. Your Bazel server process holds on to state of which blobs exist
between invocations, causing it to reference objects that may have
been garbage collected in the meantime.

My question here is why Bazel can't just scan this state at the start
of a build, call FindMissingBlobs() and selectively purge the state
corresponding to objects that no longer exist remotely? Because
FindMissingBlobs() is capable of doing this for large batches, the
number of RPCs needed is fairly small, even for large repositories.
This is exactly what bb_clientd does if you make use of its Remote
Output Service feature:

- https://github.com/bazelbuild/bazel/pull/12823 <- Bazel PR to add a
client for this
- https://github.com/buildbarn/bb-clientd/blob/master/pkg/filesystem/virtual/remote_output_service_directory.go#L342-L350
<- Code in bb_clientd that does the filtering.
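
For concreteness, a rough sketch of what such a start-of-build scan could look like on the Bazel side, assuming the standard generated REv2 Java stubs; the class name, the batch size and the notion of a "remembered digests" list are illustrative, not actual Bazel or bb_clientd code:

    import build.bazel.remote.execution.v2.ContentAddressableStorageGrpc.ContentAddressableStorageBlockingStub;
    import build.bazel.remote.execution.v2.Digest;
    import build.bazel.remote.execution.v2.FindMissingBlobsRequest;
    import build.bazel.remote.execution.v2.FindMissingBlobsResponse;
    import com.google.common.collect.Lists;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    /** Sketch: at the start of a build, find the remembered blobs the CAS no longer has. */
    final class StaleBlobScanner {
      static Set<Digest> findLostBlobs(
          ContentAddressableStorageBlockingStub cas, String instanceName, List<Digest> remembered) {
        Set<Digest> lost = new HashSet<>();
        // FindMissingBlobs accepts large batches, so even big repositories need few RPCs.
        for (List<Digest> batch : Lists.partition(remembered, 1000)) {
          FindMissingBlobsResponse response =
              cas.findMissingBlobs(
                  FindMissingBlobsRequest.newBuilder()
                      .setInstanceName(instanceName)
                      .addAllBlobDigests(batch)
                      .build());
          lost.addAll(response.getMissingBlobDigestsList());
        }
        return lost;
      }
    }

The caller would then purge the remembered metadata (and any local cache entries) corresponding to the returned digests before analysis starts.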

I'm pretty sure that we could add similar logic to Bazel's
src/main/java/com/google/devtools/build/lib/remote/RemoteActionFileSystem.java
to do something similar. As far as I know, this approach is at least
as efficient as doing full action rewinding. Action rewinding may, for
example, be triggered if the remote execution service's Execute() call
fails with FAILED_PRECONDITION. The REv2 spec does not guarantee that
these errors are returned *prior* to actually queueing/executing said
actions. This means that valuable amounts of worker time may be
wasted, just to come to the conclusion that rewinding needs to be
performed.
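
For reference, the REv2 spec surfaces missing inputs from Execute() as a FAILED_PRECONDITION status carrying a google.rpc.PreconditionFailure whose violations have type "MISSING" and a subject like "blobs/{hash}/{size}". A sketch of extracting those digests from the gRPC error (the class and method names are illustrative):

    import build.bazel.remote.execution.v2.Digest;
    import com.google.protobuf.Any;
    import com.google.protobuf.InvalidProtocolBufferException;
    import com.google.rpc.PreconditionFailure;
    import com.google.rpc.PreconditionFailure.Violation;
    import com.google.rpc.Status;
    import io.grpc.protobuf.StatusProto;
    import java.util.ArrayList;
    import java.util.List;

    /** Sketch: pull the digests of missing blobs out of a FAILED_PRECONDITION from Execute(). */
    final class MissingInputParser {
      static List<Digest> missingBlobs(Throwable executeError) throws InvalidProtocolBufferException {
        List<Digest> missing = new ArrayList<>();
        Status status = StatusProto.fromThrowable(executeError);
        if (status == null) {
          return missing;
        }
        for (Any detail : status.getDetailsList()) {
          if (!detail.is(PreconditionFailure.class)) {
            continue;
          }
          for (Violation violation : detail.unpack(PreconditionFailure.class).getViolationsList()) {
            // REv2 encodes a missing CAS entry as a "MISSING" violation whose subject is
            // "blobs/{hash}/{size}" (newer spec revisions may also prefix a digest function).
            if (!"MISSING".equals(violation.getType())) {
              continue;
            }
            String[] parts = violation.getSubject().split("/");
            if (parts.length == 3 && "blobs".equals(parts[0])) {
              missing.add(
                  Digest.newBuilder().setHash(parts[1]).setSizeBytes(Long.parseLong(parts[2])).build());
            }
          }
        }
        return missing;
      }
    }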

I would like to hear from the authors of both PRs why the approach I
laid out above is not sufficient for their use case.

Best regards,
--
Ed Schouten <e...@nuxi.nl>

Chi Wang

Nov 21, 2022, 7:47:56 AM
to Ed Schouten, Lukács T. Berki, Daniel Wagner-Hall, Mark Schaller, Justin Horvitz, bazel-dev
I have a WIP PR to address #2. The title and description are not up to date, though. It essentially adds a lease service which calls FindMissingBlobs() at the start of the build, as well as during the build, to renew the leases for blobs that are referenced by Bazel.
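
To make the renewal idea concrete, a rough sketch of such a lease renewer; the class name, interval handling and single-batch request are illustrative and not necessarily how the PR implements it:

    import build.bazel.remote.execution.v2.ContentAddressableStorageGrpc.ContentAddressableStorageBlockingStub;
    import build.bazel.remote.execution.v2.Digest;
    import build.bazel.remote.execution.v2.FindMissingBlobsRequest;
    import java.time.Duration;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    /** Sketch: periodically "touch" every remotely stored blob the Bazel server still references. */
    final class LeaseRenewer {
      private final Set<Digest> referenced = ConcurrentHashMap.newKeySet();
      private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

      void track(Digest digest) {
        referenced.add(digest);
      }

      // Renewal piggybacks on FindMissingBlobs: asking about a blob counts as an access on most
      // servers, and the response also tells us which leases the server has already dropped.
      void start(ContentAddressableStorageBlockingStub cas, String instanceName, Duration interval) {
        scheduler.scheduleAtFixedRate(
            () -> {
              Set<Digest> missing =
                  new HashSet<>(
                      cas.findMissingBlobs(
                              FindMissingBlobsRequest.newBuilder()
                                  .setInstanceName(instanceName)
                                  .addAllBlobDigests(referenced)
                                  .build())
                          .getMissingBlobDigestsList());
              referenced.removeAll(missing);
              // A real implementation would batch large digest sets and would also invalidate the
              // Skyframe, action cache and disk cache entries that referenced the missing digests.
            },
            interval.toSeconds(),
            interval.toSeconds(),
            TimeUnit.SECONDS);
      }
    }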

Lukács T. Berki

Nov 21, 2022, 8:24:54 AM
to Ed Schouten, Daniel Wagner-Hall, Mark Schaller, Justin Horvitz, Chi Wang, bazel-dev
On Mon, Nov 21, 2022 at 1:31 PM Ed Schouten <e...@nuxi.nl> wrote:
Hi Lukács,

Long time no see!

On Mon, Nov 21, 2022 at 12:29, 'Lukács T. Berki' via bazel-dev
<baze...@googlegroups.com> wrote:
> I'm trying to figure out whether it's feasible to promise action rewinding for Bazel 7.0 .

I'm pretty sure I have already brought this up in a couple of
different places, but to me it's still not clear why we need the
action rewinding. As far as I know, there are two scenarios to
consider:
I would be interested in Daniel's answer to these questions!

To be honest, I took the necessity of action rewinding at face value because Google needs it. My best understanding is that it exists at Google because such an event is unlikely (we try to size our caches such that (1) is not a problem, even though we can't put an upper bound on how long a single build takes), but at our scale, "unlikely" still happens a lot, and when it does, it costs resources.

 

1. Your remote caching/execution service loses CAS objects that were
generated by an action that ran at an earlier point within the same
build.

This may happen if your remote caching setup is:

- Too small, which can easily be fixed by increasing its storage
capacity. Rewinding is not a good solution here, as it may cause your
build to get stuck indefinitely.
- Not reliable. In that case the right fix is to improve the
reliability of your remote caching setup.

I think it's not an unreasonable requirement for remote caching setups
to guarantee preservation of CAS objects for the duration of a single
build.

2. Your Bazel server process holds on to state of which blobs exist
between invocations, causing it to reference objects that may have
been garbage collected in the meantime.

My question here is why Bazel can't just scan this state at the start
of a build, call FindMissingBlobs() and selectively purge the state
corresponding to objects that no longer exist remotely? Because
FindMissingBlobs() is capable of doing this for large batches, the
number of RPCs needed is fairly small, even for large repositories.
This is exactly what bb_clientd does if you make use of its Remote
Output Service feature:
We have a "remote data was lost, please clean your workspace" error message and so far, it has been enough so I personally haven't given a lot of thought about this particular scenario. Chi would probably be able to tell more since he thought a lot about the interaction of the local action cache and remote builds.
 

- https://github.com/bazelbuild/bazel/pull/12823 <- Bazel PR to add a
client for this
- https://github.com/buildbarn/bb-clientd/blob/master/pkg/filesystem/virtual/remote_output_service_directory.go#L342-L350
<- Code in bb_clientd that does the filtering.

I'm pretty sure that we could add similar logic to Bazel's
src/main/java/com/google/devtools/build/lib/remote/RemoteActionFileSystem.java
to do something similar. As far as I know, this approach is at least
as efficient as doing full action rewinding. Action rewinding may, for
example, be triggered if the remote execution service's Execute() call
fails with FAILED_PRECONDITION. The REv2 spec does not guarantee that
these errors are returned *prior* to actually queueing/executing said
actions. This means that valuable amounts of worker time may be
wasted, just to come to the conclusion that rewinding needs to be
performed.

I would like to hear from the authors of both PRs why the approach I
laid out above is not sufficient for their use case.

Best regards,
--
Ed Schouten <e...@nuxi.nl>


Daniel Wagner-Hall

Nov 21, 2022, 1:52:56 PM
to Lukács T. Berki, Ed Schouten, Mark Schaller, Justin Horvitz, Chi Wang, bazel-dev
Thanks for kicking off this thread, Lukács!

On Mon, 21 Nov 2022 at 13:24, Lukács T. Berki <lbe...@google.com> wrote:
On Mon, Nov 21, 2022 at 1:31 PM Ed Schouten <e...@nuxi.nl> wrote:
1. Your remote caching/execution service loses CAS objects that were
generated by an action that ran at an earlier point within the same
build.

This may happen if your remote caching setup is:

- Too small, which can easily be fixed by increasing its storage
capacity. Rewinding is not a good solution here, as it may cause your
build to get stuck indefinitely.
- Not reliable. In that case the right fix is to improve the
reliability of your remote caching setup.

I think it's not an unreasonable requirement for remote caching setups
to guarantee preservation of CAS objects for the duration of a single
build.

I agree that within a build, a remote execution system should handle these lifetimes.

2. Your Bazel server process holds on to state of which blobs exist
between invocations, causing it to reference objects that may have
been garbage collected in the meantime.

My question here is why Bazel can't just scan this state at the start
of a build, call FindMissingBlobs() and selectively purge the state
corresponding to objects that no longer exist remotely? Because
FindMissingBlobs() is capable of doing this for large batches, the
number of RPCs needed is fairly small, even for large repositories.
This is exactly what bb_clientd does if you make use of its Remote
Output Service feature:
We have a "remote data was lost, please clean your workspace" error message and so far, it has been enough so I personally haven't given a lot of thought about this particular scenario. Chi would probably be able to tell more since he thought a lot about the interaction of the local action cache and remote builds.

This is the case I'm interested in. In particular, if you have persistent Bazel daemons in a CI environment, having to manually inject a `bazel shutdown` when you hit this issue is a sub-par user experience.
 
I'm pretty sure that we could add similar logic to Bazel's
src/main/java/com/google/devtools/build/lib/remote/RemoteActionFileSystem.java
to do something similar.

Interesting... I didn't know that Skyframe actions would properly invalidate if you just remove stuff from RemoteActionFileSystem... How confident are you that this works through the stack (including for incremental and disk-cache-using builds)? If so, that sounds like a great, simpler substitute for full rewinding. If you're reasonably confident in it, I can try to put together a test case showing it hopefully working next week!

Lukács T. Berki

Nov 23, 2022, 8:10:41 AM
to Daniel Wagner-Hall, Ed Schouten, Mark Schaller, Justin Horvitz, Chi Wang, bazel-dev
On Mon, Nov 21, 2022 at 7:52 PM Daniel Wagner-Hall <dawa...@gmail.com> wrote:
Thanks for kicking off this thread, Lukács!

On Mon, 21 Nov 2022 at 13:24, Lukács T. Berki <lbe...@google.com> wrote:
On Mon, Nov 21, 2022 at 1:31 PM Ed Schouten <e...@nuxi.nl> wrote:
1. Your remote caching/execution service loses CAS objects that were
generated by an action that ran at an earlier point within the same
build.

This may happen if your remote caching setup is:

- Too small, which can easily be fixed by increasing its storage
capacity. Rewinding is not a good solution here, as it may cause your
build to get stuck indefinitely.
- Not reliable. In that case the right fix is to improve the
reliability of your remote caching setup.

I think it's not an unreasonable requirement for remote caching setups
to guarantee preservation of CAS objects for the duration of a single
build.

I agree that within a build, a remote execution system should handle these lifetimes.
Chi, mind reading this and telling me if this is correct?

IIRC the RBE folks thought about this, and the limiting factor is that there is no maximum length of time a Bazel build can run. It's possible that Bazel executes an action very early and another that depends on one of its outputs very late. To make that scenario work, Bazel would need to either set the lifetime of that output to some value that guarantees it will still be live at the end of the build (but there is no amount of time that would guarantee that) or periodically refresh output artifacts during the build (but we don't do that; maybe we should?).


2. Your Bazel server process holds on to state of which blobs exist
between invocations, causing it to reference objects that may have
been garbage collected in the meantime.

My question here is why Bazel can't just scan this state at the start
of a build, call FindMissingBlobs() and selectively purge the state
corresponding to objects that no longer exist remotely? Because
FindMissingBlobs() is capable of doing this for large batches, the
number of RPCs needed is fairly small, even for large repositories.
This is exactly what bb_clientd does if you make use of its Remote
Output Service feature:
We have a "remote data was lost, please clean your workspace" error message and so far, it has been enough so I personally haven't given a lot of thought about this particular scenario. Chi would probably be able to tell more since he thought a lot about the interaction of the local action cache and remote builds.

This is the case I'm interested in - in particular e.g. if you have persistent Bazel daemons in a CI environment, having to manually inject a `bazel shutdown` when you hit this issue is a sub-par user experience.
Yeah, I can relate to that; I'd again be interested in what Chi thinks about the most desirable solution; there are all sorts of caches at play (Skyframe, the local action cache, the local disk cache) and there is also FUSEless mode which complicates things.
 
I'm pretty sure that we could add similar logic to Bazel's
src/main/java/com/google/devtools/build/lib/remote/RemoteActionFileSystem.java
to do something similar.

Interesting... I didn't know that skyframe actions would properly invalidate if you just remove stuff from RemoteActionFileSystem... How confident are you that works through the stack (including for incremental and disk-cache-using builds)? If so, that sounds like a great simpler substitute for full rewinding. If you're reasonably confident in it, I can try to put together a test-case showing it hopefully working next week!
AFAICT this won't work since a given RemoteActionFileSystem instance is only used for the inputs/outputs of a given action and it is created right before the action is executed. Thus, detecting a missing artifact in RemoteActionFileSystem would still require action rewinding.

Chi Wang

Nov 23, 2022, 10:35:16 AM
to Lukács T. Berki, Daniel Wagner-Hall, Ed Schouten, Mark Schaller, Justin Horvitz, bazel-dev
On Wed, Nov 23, 2022 at 2:10 PM Lukács T. Berki <lbe...@google.com> wrote:


On Mon, Nov 21, 2022 at 7:52 PM Daniel Wagner-Hall <dawa...@gmail.com> wrote:
Thanks for kicking off this thread, Lukács!

On Mon, 21 Nov 2022 at 13:24, Lukács T. Berki <lbe...@google.com> wrote:
On Mon, Nov 21, 2022 at 1:31 PM Ed Schouten <e...@nuxi.nl> wrote:
1. Your remote caching/execution service loses CAS objects that were
generated by an action that ran at an earlier point within the same
build.

This may happen if your remote caching setup is:

- Too small, which can easily be fixed by increasing its storage
capacity. Rewinding is not a good solution here, as it may cause your
build to get stuck indefinitely.
- Not reliable. In that case the right fix is to improve the
reliability of your remote caching setup.

I think it's not an unreasonable requirement for remote caching setups
to guarantee preservation of CAS objects for the duration of a single
build.

I agree that within a build, a remote execution system should handle these lifetimes.
Chi, mind reading this and telling me if this is correct?

IIRC the RBE folks thought about this and the limiting factor is that there is no maximum length of time a Bazel build can run. And it's possible that Bazel executes an action very early and another that depends on one of its outputs very late and in order to make that scenario work, Bazel would need to either set the lifetime of that output to some value that guarantees that it will be live by the end of the build (but there is no amount of time that would guarantee that) or to periodically refresh output artifacts during the build (but we don't do that; maybe we should?)


I agree with Ed on #1, i.e. the remote server should at least have enough storage to hold all outputs from one invocation and be reasonably reliable. Otherwise, action rewinding can't help.

The lease service I am working on at https://github.com/bazelbuild/bazel/pull/16660 will periodically renew remote metadata so it's not necessary to tell the remote server the maximum length of a Bazel build.
 

2. Your Bazel server process holds on to state of which blobs exist
between invocations, causing it to reference objects that may have
been garbage collected in the meantime.

My question here is why Bazel can't just scan this state at the start
of a build, call FindMissingBlobs() and selectively purge the state
corresponding to objects that no longer exist remotely? Because
FindMissingBlobs() is capable of doing this for large batches, the
number of RPCs needed is fairly small, even for large repositories.
This is exactly what bb_clientd does if you make use of its Remote
Output Service feature:
We have a "remote data was lost, please clean your workspace" error message and so far, it has been enough so I personally haven't given a lot of thought about this particular scenario. Chi would probably be able to tell more since he thought a lot about the interaction of the local action cache and remote builds.

This is the case I'm interested in - in particular e.g. if you have persistent Bazel daemons in a CI environment, having to manually inject a `bazel shutdown` when you hit this issue is a sub-par user experience.
Yeah, I can relate to that; I'd again be interested in what Chi thinks about the most desirable solution; there are all sorts of caches at play (Skyframe, the local action cache, the local disk cache) and there is also FUSEless mode which complicates things.
 

Again, with https://github.com/bazelbuild/bazel/pull/16660, I am able to sync the Skyframe cache, local action cache and local disk cache with the leases between invocations, i.e. in case of lease expiration, the generating actions will be correctly re-run. So if the remote server can promise the leases within one invocation, action rewinding is not required. The remote server is free to remove blobs between invocations; Bazel will detect that and invalidate all the mentioned caches.

For the rare cases when blobs are removed within one invocation, Bazel will still complain that "remote data was lost", but a following `bazel build` will clean the stale state and allow the build to move forward; no `bazel clean` is required with the PR.

Based on that, we could let Bazel automatically re-run `bazel build` in case of lost inputs, or rewind on a larger scale.
 
I'm pretty sure that we could add similar logic to Bazel's
src/main/java/com/google/devtools/build/lib/remote/RemoteActionFileSystem.java
to do something similar.

Interesting... I didn't know that skyframe actions would properly invalidate if you just remove stuff from RemoteActionFileSystem... How confident are you that works through the stack (including for incremental and disk-cache-using builds)? If so, that sounds like a great simpler substitute for full rewinding. If you're reasonably confident in it, I can try to put together a test-case showing it hopefully working next week!
AFAICT this won't work since a given RemoteActionFileSystem instance is only used for the inputs/outputs of a given action and it is created right before the action is executed. Thus, detecting a missing artifact in RemoteActionFileSystem would still require action rewinding.

Yes. The lifetime for the lease service is bound to the Bazel server.

Ulrik Falklöf

Nov 24, 2022, 3:41:54 AM
to bazel-dev

Thank you for working on this! 

I imagine several scenarios where rewinding would be valuable:

  • Reconfigured sharding for remote caches.
  • Remote cache down and replaced with other non-fully synchronized instance.
  • Temporary network issue resulting in caches being unreachable.
  • Remote RAM cache restarted.
  • Local --disk_cache storage cleared.

Handling these rare scenarios by some kind of rewinding/retrying seems like the most reliable and versatile solution to me.

Rewinding arbitrarily far back might not be necessary. Retrying the whole build could be good enough, as long as it happens automatically, so that the user can still trust Bazel to take care of it.

I try to teach our users to trust Bazel and drop their old habit of 'make clean'. Situations requiring a manual 'bazel shutdown' or 'bazel clean' would ruin that trust and are therefore blocking my organization from using BwtB.

Regarding how many times to rewind/retry and the risk of getting stuck indefinitely: one option would be to automatically disable BwtB during the rewind/retry.

BR,
Ulrik Falklöf, Ericsson

Lukács T. Berki

Nov 24, 2022, 9:10:05 AM
to Ulrik Falklöf, bazel-dev, Chi Wang, Justin Horvitz, Ed Schouten, Daniel Wagner-Hall
On Thu, Nov 24, 2022 at 9:41 AM Ulrik Falklöf <ulrik.fal...@gmail.com> wrote:

Thank you for working on this! 

I imagine several scenarios where rewinding would be valuable:

  • Reconfigured sharding for remote caches.
  • Remote cache down and replaced with other non-fully synchronized instance.
  • Temporary network issue resulting in caches being unreachable.
  • Remote RAM cache restarted.
  • Local --disk_cache storage cleared.
Do I understand correctly that Chi's lease service would solve all these use cases? If so, the argument for action rewinding becomes much weaker. In fact, I can't think of any case where it would still be necessary, except maybe the "remote cache loses inputs within a single build" use case, but it looks like the lease service could be extended to also cover that case?

I don't think that in the presence of remote execution, Bazel can guarantee that a build succeeds (e.g. when someone pulls out the network cable), so as long as we can make sure that Bazel doesn't get stuck in a bad state, we are fine. 

Handling these rare scenarios by some kind of rewinding/retrying seems as the most reliable and versatile solution to me.

Rewinding arbitrary far back might not be necessary. Retrying the whole build could be good enough, as long as it happens automatically, so that the user can still trust bazel to take care of it.

I try to teach our users to trust bazel and avoid their old habits of ‘make clean’. Having situations requiring manual 'bazel shutdown' or ‘bazel clean’ would ruin that trust and is therefore blocking my organization from using BwtB.

Regarding how many times to rewind/retry and risk of getting stuck indefinitely: it could be considered to automatically disable BwtB during the rewind/retry.

BR,
Ulrik Falklöf, Ericsson


On Monday, November 21, 2022 at 12:29:29 PM UTC+1 lbe...@google.com wrote:
Hey folks,

I'm trying to figure out whether it's feasible to promise action rewinding for Bazel 7.0 .

It looks like Daniel (and @k1nkreet on GitHub but I don't know their e-mail) did some work on action rewinding (see #14126 and #16470) but it's quite a delicate change. In addition, I don't think it's a good idea to have a public feature that requires a large number of preconditions to work, which makes it even more difficult.

We at Google also don't have a lot of use for this extended version of action rewinding so it'll be difficult to find time for us to implement it, so if this is to happen, it would probably have to be implemented by Daniel with guidance from Justin and Mark. Would that work?

It looks like we collectively know what limitations need to be lifted, so there are no theoretical obstacles and the framework for testing it (RewindingTest) has already been open sourced, it "just" needs to be coded up and tested.

The limitations that I know of that need to be lifted are:
  1. It should work with --track_incremental_state (which means keeping track of Skyframe reverse deps)
  2. Outputs lost should be evicted from the local action cache (otherwise re-running an action would just be a local action cache hit)
  3. It should work in --nokeep_going mode
  4. It should work without ActionFS (but I assume that a simple integration test would take care of this)

--
Lukács T. Berki | Software Engineer | lbe...@google.com | 

Google Germany GmbH | Erika-Mann-Str. 33  | 80636 München | Germany | Geschäftsführer: Paul Manicle, Halimah DeLaine Prado | Registergericht und -nummer: Hamburg, HRB 86891


Chi Wang

Nov 29, 2022, 9:59:30 AM
to Ulrik Falklöf, Lukács T. Berki, bazel-dev, Justin Horvitz, Ed Schouten, Daniel Wagner-Hall
I believe option A, i.e. the lease service plus automatically re-running the whole build in case of lost inputs, can solve all the mentioned issues.



On Tue, Nov 29, 2022 at 12:16 PM Ulrik Falklöf <ulrik.fal...@gmail.com> wrote:
On Thu, Nov 24, 2022 at 3:10 PM Lukács T. Berki <lbe...@google.com> wrote:
On Thu, Nov 24, 2022 at 9:41 AM Ulrik Falklöf <ulrik.fal...@gmail.com> wrote:

I imagine several scenarios where rewinding would be valuable:

  • Reconfigured sharding for remote caches.
  • Remote cache down and replaced with other non-fully synchronized instance.
  • Temporary network issue resulting in caches being unreachable.
  • Remote RAM cache restarted.
  • Local --disk_cache storage cleared.
Do I understand correctly that Chi's lease service would solve all these use cases? If so, the argument for action rewinding becomes much weaker. In fact, I can't think of any case where it would still be necessary, except maybe the "remote cache loses inputs within a single build" use case, but it looks like the lease service could be extended to also cover that case?

I got the same impression based on Chi's message. And I assume the extension would be needed also for HTTP caches not supporting leases via findMissingBlobs.

I assume #10880 would be fixed (for lost inputs within and between builds) by any of:

 

A:  Lease service. + Extension to automatically re-run the whole build in case of lost inputs within single build.

B:  Lease service. + Action Rewinding.

C:  Action Rewinding

D:  Automatically re-run the whole build in case of lost inputs. + Automatic clearing of some stale states before re-run.

 

I would be happy with any of A, B, C, D! However, if it the lease service also become complex or increase memory usage, then D appears most attractive to me. 
  
I don't think that in the presence of remote execution, Bazel can guarantee that a build succeeds (e.g. when someone pulls out the network cable), so as long as we can make sure that Bazel doesn't get stuck in a bad state, we are fine. 

But even in the presence of remote execution, BwtB with lost inputs should not prevent using --remote_retries and --remote_local_fallback to try to make the build succeed.

Lukács T. Berki

Nov 29, 2022, 10:22:53 AM
to Chi Wang, Ulrik Falklöf, bazel-dev, Justin Horvitz, Ed Schouten, Daniel Wagner-Hall
On Tue, Nov 29, 2022 at 3:59 PM Chi Wang <chi...@google.com> wrote:
I believe option A, i.e. lease service plus automatically re-run the whole build in case of lost inputs can solve all the mentioned issues.
Excellent. Then we don't need to make action rewinding work, which would be quite a delicate change. There are some decisions to be made (what if one changes the RBE server during the lifetime of the Bazel server, what happens if a cache entry that Bazel thinks exists doesn't exist on the remote server anymore, etc.), but even with all those, it will be less work and will result in a simpler system than if we went with action rewinding.

Daniel, Ulrik, Ed: do you have a use case which Chi's proposal doesn't cover?

Ulrik Falklöf

Nov 29, 2022, 10:35:26 AM
to Lukács T. Berki, bazel-dev, Chi Wang, Justin Horvitz, Ed Schouten, Daniel Wagner-Hall
On Thu, Nov 24, 2022 at 3:10 PM Lukács T. Berki <lbe...@google.com> wrote:
On Thu, Nov 24, 2022 at 9:41 AM Ulrik Falklöf <ulrik.fal...@gmail.com> wrote:

I imagine several scenarios where rewinding would be valuable:

  • Reconfigured sharding for remote caches.
  • Remote cache down and replaced with other non-fully synchronized instance.
  • Temporary network issue resulting in caches being unreachable.
  • Remote RAM cache restarted.
  • Local --disk_cache storage cleared.
Do I understand correctly that Chi's lease service would solve all these use cases? If so, the argument for action rewinding becomes much weaker. In fact, I can't think of any case where it would still be necessary, except maybe the "remote cache loses inputs within a single build" use case, but it looks like the lease service could be extended to also cover that case?

I got the same impression based on Chi's message. And I assume the extension would also be needed for HTTP caches that don't support leases via FindMissingBlobs.

I assume #10880 would be fixed (for lost inputs within and between builds) by any of:

 

A:  Lease service. + Extension to automatically re-run the whole build in case of lost inputs within single build.

B:  Lease service. + Action Rewinding.

C:  Action Rewinding

D:  Automatically re-run the whole build in case of lost inputs. + Automatic clearing of some stale states before re-run.

 

I would be happy with any of A, B, C, D! However, if the lease service also becomes complex or increases memory usage, then D appears most attractive to me.
 
I don't think that in the presence of remote execution, Bazel can guarantee that a build succeeds (e.g. when someone pulls out the network cable), so as long as we can make sure that Bazel doesn't get stuck in a bad state, we are fine. 

Yannic Bonenberger

Nov 29, 2022, 11:37:43 AM
to Ulrik Falklöf, Lukács T. Berki, bazel-dev, Chi Wang, Justin Horvitz, Ed Schouten, Daniel Wagner-Hall
I'm not convinced the lease service will be sufficient: our remote execution system has a feature where we retain blobs for at least x time after access. Access is either blob upload, blob download, find missing blobs, or action cache lookup for all referenced blobs (at least outputs, not sure about inputs there actually). IIUC, that's equivalent behavior to the lease service, and works pretty well in the happy case actually. The problem with this approach in general is that (a) it's only a pinky swear by the remote execution/caching system to not lose the blob, which does happen in practice regardless of this promise (e.g., because the CAS is on disk and auto-scaled, and the machine that had the blob was removed from the cluster due to low load or something, and the cluster runs without a second layer of caching in some more permanent storage like S3 or GCS), and (b) storing many blobs even just for a couple of hours can mean a lot of data, much of which is basically never accessed again because a dependency of that action changed in the next build, which makes storage costs explode.

I also don’t think re-running the whole build is generally a good idea as it increases the load on the remote execution or caching system (esp. for very large Bazel builds that take longer and are more likely to suffer from failures because of lost blobs).

Overall, I’m not convinced there’s any good solution short of action rewinding.

Thanks,
Yannic


Lukács T. Berki

Nov 30, 2022, 5:24:23 AM
to Yannic Bonenberger, Ulrik Falklöf, bazel-dev, Chi Wang, Justin Horvitz, Ed Schouten, Daniel Wagner-Hall
On Tue, Nov 29, 2022 at 5:37 PM Yannic Bonenberger <yan...@yannic-bonenberger.com> wrote:
I’m not convinced the lease service will be sufficient: our remote execution has a feature where we retain blobs for at least x time after access. Access is either blob upload, blob download, find missing blobs, or action cache lookup for all referenced blobs (at least outputs, not sure about inputs there actually). IIUC, that's equivalent behavior to the lease service, and works pretty well in the happy case actually. The problem with this approach in general is that (a) it’s only a pinky swear by the remote execution/caching system to not loose the blob, which does happen in practice regardless of this promise (e.g., because the CAS is on disk and auto-scaled and the machine that had the blob was removed from the cluster due to low load or something and the cluster runs without a second layer of caching in some more permanent storage like S3 or GCS), and (b) that storing many blobs even just for a couple of hours can mean a lot of data, a lot of which is basically never accessed again because a dependency of that action changed in the next build, which lets cost for storage explode.

I also don’t think re-running the whole build is generally a good idea as it increases the load on the remote execution or caching system (esp. for very large Bazel builds that take longer and are more likely to suffer from failures because of lost blobs).

Overall, I’m not convinced there’s any good solution short of action rewinding.
I'd say there isn't any perfect solution short of action rewinding. But that is quite expensive: it would take a lot of dev time, it would add substantial complexity to the Bazel codebase, and it would be an excellent place for bugs to lurk (complex, arcane code that's not exercised frequently). So the question is, is the difference between what can be achieved with the lease renewal idea and the perfect solution worth all these costs?

On a more theoretical note, if your server promises to keep leased blobs available for X time, maybe it should actually do that instead of having Bazel work around its inability to do so? At least with a probability high enough that Bazel returning an error in the case where it doesn't work out can paper over it...

Chi Wang

Nov 30, 2022, 5:48:55 AM
to Lukács T. Berki, Yannic Bonenberger, Ulrik Falklöf, bazel-dev, Justin Horvitz, Ed Schouten, Daniel Wagner-Hall
On Wed, Nov 30, 2022 at 11:24 AM Lukács T. Berki <lbe...@google.com> wrote:


On Tue, Nov 29, 2022 at 5:37 PM Yannic Bonenberger <yan...@yannic-bonenberger.com> wrote:
I’m not convinced the lease service will be sufficient: our remote execution has a feature where we retain blobs for at least x time after access. Access is either blob upload, blob download, find missing blobs, or action cache lookup for all referenced blobs (at least outputs, not sure about inputs there actually). IIUC, that's equivalent behavior to the lease service, and works pretty well in the happy case actually. The problem with this approach in general is that (a) it’s only a pinky swear by the remote execution/caching system to not loose the blob, which does happen in practice regardless of this promise (e.g., because the CAS is on disk and auto-scaled and the machine that had the blob was removed from the cluster due to low load or something and the cluster runs without a second layer of caching in some more permanent storage like S3 or GCS), and (b) that storing many blobs even just for a couple of hours can mean a lot of data, a lot of which is basically never accessed again because a dependency of that action changed in the next build, which lets cost for storage explode.


The server can and should retain blobs for at least x time after access. Your definition of "access" also sounds good to me. But there are some important differences from the lease service:

1. The "at least x time" can be set to a smaller value. With today's code, the server has to guess the maximum invocation time for a project and use that for the lifetime of just accessed blobs. With lease service, Bazel and the server can agree on a small value. I believe this will reduce the load. With action rewinding, I believe the server still has to store many blobs for a couple of hours because you probably don't want Bazel rewinds too much.
2. Bazel is able to clean the stale state with the help of the lease service. So even if the server cannot honor the leases, Bazel can move forward in the following build.

  
I also don’t think re-running the whole build is generally a good idea as it increases the load on the remote execution or caching system (esp. for very large Bazel builds that take longer and are more likely to suffer from failures because of lost blobs).


I believe the load on the remote server is the same between rerunning the whole build and action rewinding. Rerunning the whole build doesn't mean Bazel will check the remote cache for all actions; it still has the local action cache. Only actions whose outputs are missing are rerun. In theory, the number of actions that need to be checked against the remote cache/remote execution is the same as with action rewinding.

Based on that, if Bazel can automatically rerun the whole build, I don't see why action rewinding is better than the lease service.

Lukács T. Berki

Nov 30, 2022, 6:33:22 AM
to Chi Wang, Yannic Bonenberger, Ulrik Falklöf, bazel-dev, Justin Horvitz, Ed Schouten, Daniel Wagner-Hall
On Wed, Nov 30, 2022 at 11:48 AM Chi Wang <chi...@google.com> wrote:


On Wed, Nov 30, 2022 at 11:24 AM Lukács T. Berki <lbe...@google.com> wrote:


On Tue, Nov 29, 2022 at 5:37 PM Yannic Bonenberger <yan...@yannic-bonenberger.com> wrote:
I’m not convinced the lease service will be sufficient: our remote execution has a feature where we retain blobs for at least x time after access. Access is either blob upload, blob download, find missing blobs, or action cache lookup for all referenced blobs (at least outputs, not sure about inputs there actually). IIUC, that's equivalent behavior to the lease service, and works pretty well in the happy case actually. The problem with this approach in general is that (a) it’s only a pinky swear by the remote execution/caching system to not loose the blob, which does happen in practice regardless of this promise (e.g., because the CAS is on disk and auto-scaled and the machine that had the blob was removed from the cluster due to low load or something and the cluster runs without a second layer of caching in some more permanent storage like S3 or GCS), and (b) that storing many blobs even just for a couple of hours can mean a lot of data, a lot of which is basically never accessed again because a dependency of that action changed in the next build, which lets cost for storage explode.


Server can and should retain blobs for at least x time after access. Your definition for "access" also sounds good to me. But there are some important difference with lease service:

1. The "at least x time" can be set to a smaller value. With today's code, the server has to guess the maximum invocation time for a project and use that for the lifetime of just accessed blobs. With lease service, Bazel and the server can agree on a small value. I believe this will reduce the load. With action rewinding, I believe the server still has to store many blobs for a couple of hours because you probably don't want Bazel rewinds too much.
+1
 
2. Bazel is able to clean the stale state with the help of lease service. So even if the server cannot promise the leases, Bazel can move forward in the following build.
+1
 

  
I also don’t think re-running the whole build is generally a good idea as it increases the load on the remote execution or caching system (esp. for very large Bazel builds that take longer and are more likely to suffer from failures because of lost blobs).


I believe the load for the remote server is the same between rerun the whole build and action rewinding. Rerun the whole build doesn't mean Bazel will check remote cache for all actions, it still has local action cache. Only actions whose outputs are missing are rerun. In theory, the num of actions need to be checked against remote cache/remote execution is the same with action rewinding,
+1

I think there are two main arguments for action rewinding:
  1. The build might fail over to an entirely different remote execution cluster while it's running, which causes loss of state unless some clever (and maybe costly) replication is done.
  2. No service can guarantee that it keeps 100.0000% of the blobs for which it has leases outstanding, because it ultimately runs on physical and thus fallible computers.
And these have to be weighed against its development and maintenance costs. I am not absolutely against action rewinding in Bazel; it's just that my perspective is that these costs would be way too high a price for avoiding the failure of a negligible percentage of all builds. Also, making things more resilient on the remote execution side seems to be easier because it deals with flat data structures (the blob + action caches) instead of graphs.

If someone says that proper action rewinding is easier than I think it is or that the "negligible percentage" is not that negligible and can only be made so with disproportionate investment on the remote execution side, I'd be happy to reconsider.

Ulrik Falklöf

Nov 30, 2022, 7:45:01 AM
to Lukács T. Berki, Chi Wang, Yannic Bonenberger, bazel-dev, Justin Horvitz, Ed Schouten, Daniel Wagner-Hall

I agree with Lukács about the high cost of action rewinding, but also with Yannic about the complexity and cost of making the server side resilient enough. To me it seems both of those alternatives are too complex and costly.


Like Chi, I also believe the load on the remote server is the same between rerunning the whole build and action rewinding. Therefore, I think automatically rerunning the whole build makes the most sense.

 

Alexjski raised concerns about the complexity of the lease service in the #16660 review. For my use cases, not even a lease service would be required. What use cases do you have that require a lease service in addition to automatically rerunning the whole build?

Yannic Bonenberger

Nov 30, 2022, 9:33:08 AM
to Lukács T. Berki, Chi Wang, Ulrik Falklöf, bazel-dev, Justin Horvitz, Ed Schouten, Daniel Wagner-Hall

Am 30.11.2022 um 12:33 schrieb Lukács T. Berki <lbe...@google.com>:



On Wed, Nov 30, 2022 at 11:48 AM Chi Wang <chi...@google.com> wrote:


On Wed, Nov 30, 2022 at 11:24 AM Lukács T. Berki <lbe...@google.com> wrote:


On Tue, Nov 29, 2022 at 5:37 PM Yannic Bonenberger <yan...@yannic-bonenberger.com> wrote:
I’m not convinced the lease service will be sufficient: our remote execution has a feature where we retain blobs for at least x time after access. Access is either blob upload, blob download, find missing blobs, or action cache lookup for all referenced blobs (at least outputs, not sure about inputs there actually). IIUC, that's equivalent behavior to the lease service, and works pretty well in the happy case actually. The problem with this approach in general is that (a) it’s only a pinky swear by the remote execution/caching system to not loose the blob, which does happen in practice regardless of this promise (e.g., because the CAS is on disk and auto-scaled and the machine that had the blob was removed from the cluster due to low load or something and the cluster runs without a second layer of caching in some more permanent storage like S3 or GCS), and (b) that storing many blobs even just for a couple of hours can mean a lot of data, a lot of which is basically never accessed again because a dependency of that action changed in the next build, which lets cost for storage explode.


Server can and should retain blobs for at least x time after access. Your definition for "access" also sounds good to me. But there are some important difference with lease service:

1. The "at least x time" can be set to a smaller value. With today's code, the server has to guess the maximum invocation time for a project and use that for the lifetime of just accessed blobs. With lease service, Bazel and the server can agree on a small value. I believe this will reduce the load. With action rewinding, I believe the server still has to store many blobs for a couple of hours because you probably don't want Bazel rewinds too much.
+1

Fair enough. But the smaller the value, the more RPCs need to happen, which will increase the load.

2. Bazel is able to clean the stale state with the help of lease service. So even if the server cannot promise the leases, Bazel can move forward in the following build.
+1
 

  
I also don’t think re-running the whole build is generally a good idea as it increases the load on the remote execution or caching system (esp. for very large Bazel builds that take longer and are more likely to suffer from failures because of lost blobs).


I believe the load for the remote server is the same between rerun the whole build and action rewinding. Rerun the whole build doesn't mean Bazel will check remote cache for all actions, it still has local action cache. Only actions whose outputs are missing are rerun. In theory, the num of actions need to be checked against remote cache/remote execution is the same with action rewinding,
+1

OK, maybe we have different perspectives on what re-running means, then. From my understanding, it means that build A fails (as in, Bazel returns exit_code != 0) and is re-tried at a higher level. I agree that this still doesn't imply that Bazel cannot use the local cache. The problem is that, in practice, we see that lots of people aren't re-using the Bazel server on CI at all and always spin up a new CI runner (e.g., a container), and then Bazel would re-do all these RPCs (FWIW, some people already do that for any failures anyway). If your definition of re-running means that Bazel internally catches the error and restarts the build from the beginning, things would definitely look different.

I think there are two main arguments for action rewinding:
  1. The build might fail over to an entirely different remote execution cluster while it's running and that causes loss of state unless some clever (and maybe costly) replication is done.
  2. No service can guarantee that it keeps 100.0000% of the blobs for which it has leases outstanding because it ultimately runs on physical and thus fallible computers
And these have to be weighed against its development and maintenance costs. I am not absolutely against action rewinding in Bazel, it's just that my perspective is that these costs would be way too high for not failing a negligible percentage of all builds. Also, making things more resilient on the remote execution side seems to be easier because it deals with flat data structures (the blob + action caches) instead of graphs. 

What about tree artifacts? Would the lease service only check the root node of the Merkle tree, or download everything and extend the lifetime of all blobs and tree nodes? If it's the former, that wouldn't help, and if it's the latter, that would either mean lots of downloads every time we extend the lifetime, or higher memory consumption because Bazel needs to hold all referenced digests in memory. Both options sound problematic to me.

Also, the shorter the lease time, the more RPCs happen to extend the lifetime of blobs. That will significantly increase the overall number of RPCs during a build, increase the load on the remote system (as keeping track of when it's OK to delete a blob isn't free either), and probably add to the overall latency of a build.

If someone says that proper action rewinding is easier than I think it is or that the "negligible percentage" is not that negligible and can only be made so with disproportionate investment on the remote execution side, I'd be happy to reconsider.

What's the negligible percentage? 1 in 100,000? 1 in 1,000,000? That's not that many builds, so larger organizations might still see the failure every day.

I absolutely see that action rewinding is a complex thing to do. If it was easy, we would have had it for a long time already. But I’m not convinced the other proposals are a good enough improvement over what we have today to justify their cost.

Implementation-wise, the cheapest is probably to only invalidate the blobs that return errors when we try to download them in the file system, so that Bazel will re-run the producing action. If that isn't good enough to get to the "negligible percentage", it probably means action rewinding. The lease service produces many RPCs and doesn't guarantee anything. AFAIK, all remote cache implementations already either extend the lifetime of the blob or move it to the front of the LRU cache on access.

Chi Wang

Nov 30, 2022, 10:03:00 AM
to Yannic Bonenberger, Lukács T. Berki, Ulrik Falklöf, bazel-dev, Justin Horvitz, Ed Schouten, Daniel Wagner-Hall
On Wed, Nov 30, 2022 at 3:33 PM Yannic Bonenberger <yan...@yannic-bonenberger.com> wrote:


Am 30.11.2022 um 12:33 schrieb Lukács T. Berki <lbe...@google.com>:



On Wed, Nov 30, 2022 at 11:48 AM Chi Wang <chi...@google.com> wrote:


On Wed, Nov 30, 2022 at 11:24 AM Lukács T. Berki <lbe...@google.com> wrote:


On Tue, Nov 29, 2022 at 5:37 PM Yannic Bonenberger <yan...@yannic-bonenberger.com> wrote:
I’m not convinced the lease service will be sufficient: our remote execution has a feature where we retain blobs for at least x time after access. Access is either blob upload, blob download, find missing blobs, or action cache lookup for all referenced blobs (at least outputs, not sure about inputs there actually). IIUC, that's equivalent behavior to the lease service, and works pretty well in the happy case actually. The problem with this approach in general is that (a) it’s only a pinky swear by the remote execution/caching system to not loose the blob, which does happen in practice regardless of this promise (e.g., because the CAS is on disk and auto-scaled and the machine that had the blob was removed from the cluster due to low load or something and the cluster runs without a second layer of caching in some more permanent storage like S3 or GCS), and (b) that storing many blobs even just for a couple of hours can mean a lot of data, a lot of which is basically never accessed again because a dependency of that action changed in the next build, which lets cost for storage explode.


Server can and should retain blobs for at least x time after access. Your definition for "access" also sounds good to me. But there are some important difference with lease service:

1. The "at least x time" can be set to a smaller value. With today's code, the server has to guess the maximum invocation time for a project and use that for the lifetime of just accessed blobs. With lease service, Bazel and the server can agree on a small value. I believe this will reduce the load. With action rewinding, I believe the server still has to store many blobs for a couple of hours because you probably don't want Bazel rewinds too much.
+1

Fair enough. But the smaller the value, the more RPCs need to happen, which will increase the load.

It uses FindMissingBlobs (one RPC) to periodically renew the lease. I believe "one RPC per hundred seconds" is small compared to what Bazel has to do for remote cache/execution.
 
2. Bazel is able to clean the stale state with the help of lease service. So even if the server cannot promise the leases, Bazel can move forward in the following build.
+1
 

  
I also don’t think re-running the whole build is generally a good idea as it increases the load on the remote execution or caching system (esp. for very large Bazel builds that take longer and are more likely to suffer from failures because of lost blobs).


I believe the load for the remote server is the same between rerun the whole build and action rewinding. Rerun the whole build doesn't mean Bazel will check remote cache for all actions, it still has local action cache. Only actions whose outputs are missing are rerun. In theory, the num of actions need to be checked against remote cache/remote execution is the same with action rewinding,
+1

Ok, maybe we have different perspectives of what re-running means then. From my understanding, it means that build A fails (as in, Bazel returns exit_code != 0) and is re-tried at a higher level. I agree that this still doesn’t imply that Bazel cannot use the local cache. The problem is that, in practice, we see that lots of people aren’t re-using Bazel server on CI at all and always spin up a new CI runner (e.g., container), and then Bazel would re-do all these RPCs (FWIW, some people already do that for any failures anyways). If your definition of re-running means that Bazel internally catches the error and restarts the build from the beginning, things would definitely look differently.


Yes, my definition of re-running is reusing the Bazel server. Either Bazel can catch the error and rerun the equivalent `bazel build`, or a wrapper could catch the exit code and rerun.
I think there are two main arguments for action rewinding:
  1. The build might fail over to an entirely different remote execution cluster while it's running and that causes loss of state unless some clever (and maybe costly) replication is done.
  2. No service can guarantee that it keeps 100.0000% of the blobs for which it has leases outstanding because it ultimately runs on physical and thus fallible computers
And these have to be weighed against its development and maintenance costs. I am not absolutely against action rewinding in Bazel, it's just that my perspective is that these costs would be way too high for not failing a negligible percentage of all builds. Also, making things more resilient on the remote execution side seems to be easier because it deals with flat data structures (the blob + action caches) instead of graphs. 

What about tree artifacts? Would the lease service only check the root node of the merkle tree or download everything and extend the lifetime of all blobs and tree nodes?

Bazel has to parse the Tree immediately after action execution to know all the files inside it, regardless of the output mode. So it is already tracking the files inside tree artifacts.
 
If it’s the former, that wouldn’t help, and if it’s the latter, that’d either mean lots of downloads for every time we extend the lifetime or higher memory consumption because Bazel needs to hold all referenced digests in memory. Both options sound problematic to me.


It only downloads the Tree digests once after the action is completed. To renew, it only renews the files inside the tree.
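
For reference, a sketch of flattening an REv2 Tree message into the file digests a renewal pass would touch; the class and method names are just illustrative:

    import build.bazel.remote.execution.v2.Digest;
    import build.bazel.remote.execution.v2.Directory;
    import build.bazel.remote.execution.v2.FileNode;
    import build.bazel.remote.execution.v2.Tree;
    import java.util.ArrayList;
    import java.util.List;

    /** Sketch: flatten an REv2 Tree into the file digests whose leases need renewing. */
    final class TreeDigests {
      static List<Digest> fileDigests(Tree tree) {
        List<Digest> digests = new ArrayList<>();
        // The Tree message carries the root Directory plus every transitive child Directory,
        // so a single pass over root + children sees every file in the tree artifact.
        collect(tree.getRoot(), digests);
        for (Directory child : tree.getChildrenList()) {
          collect(child, digests);
        }
        return digests;
      }

      private static void collect(Directory dir, List<Digest> digests) {
        for (FileNode file : dir.getFilesList()) {
          digests.add(file.getDigest());
        }
      }
    }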
 
Also, the shorter the lease time, the more RPCs are happening to extend the lifetime of blobs. That will significantly increase the overall number of RPCs during a build, increase the load on the remote system as keeping track of when it’s ok to delete a blob isn’t free either, and probably adds to the overall latency of a build.


We use FindMissingBlobs to extend the lifetime (one RPC per hundred seconds). I believe the cost should be relatively low. 

Lukács T. Berki

Dec 1, 2022, 3:23:52 AM
to Yannic Bonenberger, Chi Wang, Ulrik Falklöf, bazel-dev, Justin Horvitz, Ed Schouten, Daniel Wagner-Hall
On Wed, Nov 30, 2022 at 3:33 PM Yannic Bonenberger <yan...@yannic-bonenberger.com> wrote:


On Nov 30, 2022, at 12:33, Lukács T. Berki <lbe...@google.com> wrote:



On Wed, Nov 30, 2022 at 11:48 AM Chi Wang <chi...@google.com> wrote:


On Wed, Nov 30, 2022 at 11:24 AM Lukács T. Berki <lbe...@google.com> wrote:


On Tue, Nov 29, 2022 at 5:37 PM Yannic Bonenberger <yan...@yannic-bonenberger.com> wrote:
I’m not convinced the lease service will be sufficient: our remote execution has a feature where we retain blobs for at least x time after access. Access is either blob upload, blob download, find missing blobs, or action cache lookup for all referenced blobs (at least outputs, not sure about inputs there actually). IIUC, that's equivalent behavior to the lease service, and works pretty well in the happy case actually. The problem with this approach in general is that (a) it’s only a pinky swear by the remote execution/caching system to not lose the blob, which does happen in practice regardless of this promise (e.g., because the CAS is on disk and auto-scaled, and the machine that had the blob was removed from the cluster due to low load or something, and the cluster runs without a second layer of caching in more permanent storage like S3 or GCS), and (b) storing many blobs even just for a couple of hours can mean a lot of data, much of which is basically never accessed again because a dependency of that action changed in the next build, which makes storage costs explode.


The server can and should retain blobs for at least x time after access. Your definition of "access" also sounds good to me. But there are some important differences compared with the lease service:

1. The "at least x time" can be set to a smaller value. With today's code, the server has to guess the maximum invocation time for a project and use that as the lifetime of just-accessed blobs. With the lease service, Bazel and the server can agree on a small value. I believe this will reduce the load. With action rewinding, I believe the server still has to store many blobs for a couple of hours because you probably don't want Bazel to rewind too much.
+1

Fair enough. But the smaller the value, the more RPCs need to happen, which will increase the load.

2. Bazel is able to clean up the stale state with the help of the lease service. So even if the server cannot keep its lease promises, Bazel can move forward in the following build.
+1
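
A minimal sketch of what that cleanup could look like on the client side (every name below, such as RemoteOutputIndex and LocalActionCache, is hypothetical bookkeeping invented for illustration; only the Digest type comes from the remote execution protos):

import build.bazel.remote.execution.v2.Digest;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical bookkeeping: maps each remotely stored output digest to the
 * action cache keys that rely on it. When the lease service reports a digest
 * as missing, the owning entries are evicted so the next build re-executes
 * those actions instead of assuming the blob still exists remotely.
 */
final class RemoteOutputIndex {
  /** Stand-in for whatever local action cache abstraction is in use. */
  interface LocalActionCache {
    void evict(String actionCacheKey);
  }

  private final Map<Digest, Set<String>> owners = new HashMap<>();

  void record(Digest outputDigest, String actionCacheKey) {
    owners.computeIfAbsent(outputDigest, d -> new HashSet<>()).add(actionCacheKey);
  }

  /** Called with the digests that FindMissingBlobs reported as gone. */
  void evictStale(Set<Digest> missing, LocalActionCache actionCache) {
    for (Digest digest : missing) {
      Set<String> keys = owners.remove(digest);
      if (keys == null) {
        continue;
      }
      for (String key : keys) {
        actionCache.evict(key); // forces the action to re-run in the next build
      }
    }
  }
}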
 

  
I also don’t think re-running the whole build is generally a good idea as it increases the load on the remote execution or caching system (esp. for very large Bazel builds that take longer and are more likely to suffer from failures because of lost blobs).


I believe the load on the remote server is the same whether you rerun the whole build or use action rewinding. Rerunning the whole build doesn't mean Bazel will check the remote cache for all actions; it still has the local action cache, so only actions whose outputs are missing are rerun. In theory, the number of actions that need to be checked against the remote cache/remote execution is the same as with action rewinding.
+1

Ok, maybe we have different perspectives on what re-running means then. From my understanding, it means that build A fails (as in, Bazel returns exit_code != 0) and is retried at a higher level. I agree that this still doesn't imply that Bazel cannot use the local cache. The problem is that, in practice, we see that lots of people aren't reusing the Bazel server on CI at all and always spin up a new CI runner (e.g., a container), and then Bazel would redo all these RPCs (FWIW, some people already do that for any failure anyway). If your definition of re-running means that Bazel internally catches the error and restarts the build from the beginning, things would definitely look different.

I think there are two main arguments for action rewinding:
  1. The build might fail over to an entirely different remote execution cluster while it's running and that causes loss of state unless some clever (and maybe costly) replication is done.
  2. No service can guarantee that it keeps 100.0000% of the blobs for which it has leases outstanding, because it ultimately runs on physical and thus fallible computers.
And these have to be weighed against its development and maintenance costs. I am not absolutely against action rewinding in Bazel, it's just that my perspective is that these costs would be way too high for not failing a negligible percentage of all builds. Also, making things more resilient on the remote execution side seems to be easier because it deals with flat data structures (the blob + action caches) instead of graphs. 

What about tree artifacts? Would the lease service only check the root node of the Merkle tree or download everything and extend the lifetime of all blobs and tree nodes? If it’s the former, that wouldn’t help, and if it’s the latter, that’d either mean lots of downloads every time we extend the lifetime or higher memory consumption because Bazel needs to hold all referenced digests in memory. Both options sound problematic to me.

Also, the shorter the lease time, the more RPCs happen to extend the lifetime of blobs. That will significantly increase the overall number of RPCs during a build, increase the load on the remote system (keeping track of when it's ok to delete a blob isn't free either), and probably add to the overall latency of a build.

If someone says that proper action rewinding is easier than I think it is or that the "negligible percentage" is not that negligible and can only be made so with disproportionate investment on the remote execution side, I'd be happy to reconsider.

What’s the negligible percentage? 1 in 100,000? 1 in 1,000,000? That’s not that many builds, so we might still see the failure every day in larger organizations.
Data point: Google definitely counts as a "larger organization" in this sense and we saw the need to implement action rewinding, but not to make it incrementally correct. One could conclude two things from this: either that incrementally correct action rewinding is so difficult that not even Google found it economical to implement it or that action rewinding is in fact necessary in a large enough org.

My understanding is that we need the lease renewal functionality anyway to deal with cases where a single build is too long (and thus a long time can pass between the creation and the use of an artifact) and where a supposedly remotely cached artifact lingers in the local cache for a long time. I hope that it will be good enough to obviate most of the need for action rewinding. If not, we can revisit action rewinding again.

To reiterate: I am not against action rewinding per se; it's just that I don't think it's a good idea for Google in particular to implement it and I find it not impossible that the maintenance cost of any implementation is high enough that it's not feasible for us to maintain it. I'd be elated to be proven wrong on the latter! AFAIU there is promising work going on at https://github.com/bazelbuild/bazel/pull/16470 and https://github.com/bazelbuild/bazel/pull/14126 .

Daniel Wagner-Hall

unread,
Dec 1, 2022, 9:26:09 AM12/1/22
to Lukács T. Berki, Chi Wang, Ulrik Falklöf, bazel-dev, Justin Horvitz, Ed Schouten
On Tue, 29 Nov 2022 at 15:22, Lukács T. Berki <lbe...@google.com> wrote:
On Tue, Nov 29, 2022 at 3:59 PM Chi Wang <chi...@google.com> wrote:
I believe option A, i.e. the lease service plus automatically re-running the whole build in case of lost inputs, can solve all the mentioned issues.
Excellent. Then we don't need to make action rewinding work, which would be quite a delicate change. There are some decisions to be made (what if one changes the RBE server during the lifetime of the Bazel server, what happens if a cache entry that Bazel thinks exists doesn't exist on the remote server anymore, etc.), but even with all those, it will be less work and will result in a simpler system than if we went with action rewinding.

Daniel, Ulrik, Ed: do you have a use case which Chi's proposal doesn't cover?

I had a look at the code, and experimented with using it in a few situations. AFAICT Chi's proposal covers everything I care about - thanks for driving this discussion!

Chi: Do you have an expected timeline to getting your PR merged (and back-ported to Bazel 6)? Are there any blockers you need support with?

Chi Wang

unread,
Dec 5, 2022, 5:45:25 AM12/5/22
to Daniel Wagner-Hall, Lukács T. Berki, Ulrik Falklöf, bazel-dev, Justin Horvitz, Ed Schouten
On Thu, Dec 1, 2022 at 3:26 PM Daniel Wagner-Hall <dawa...@gmail.com> wrote:
On Tue, 29 Nov 2022 at 15:22, Lukács T. Berki <lbe...@google.com> wrote:
On Tue, Nov 29, 2022 at 3:59 PM Chi Wang <chi...@google.com> wrote:
I believe option A, i.e. the lease service plus automatically re-running the whole build in case of lost inputs, can solve all the mentioned issues.
Excellent. Then we don't need to make action rewinding work, which would be quite a delicate change. There are some decisions to be made (what if one changes the RBE server during the lifetime of the Bazel server, what happens if a cache entry that Bazel thinks exists doesn't exist on the remote server anymore, etc.), but even with all those, it will be less work and will result in a simpler system than if we went with action rewinding.

Daniel, Ulrik, Ed: do you have a use case which Chi's proposal doesn't cover?

I had a look at the code, and experimented with using it in a few situations. AFAICT Chi's proposal covers everything I care about - thanks for driving this discussion!


Thanks for confirming lease service works for your cases.
 
Chi: Do you have an expected timeline to getting your PR merged (and back-ported to Bazel 6)? Are there any blockers you need support with?


No blockers. I just need to clean up the code and fix the tests before sending it for review. Probably one or two weeks. We probably won't be able to make it into Bazel 6, but I can backport it after it is merged.


Lukács T. Berki

unread,
Dec 5, 2022, 3:18:03 PM12/5/22
to Daniel Wagner-Hall, Chi Wang, Ulrik Falklöf, bazel-dev, Justin Horvitz, Ed Schouten
On Thu, Dec 1, 2022 at 3:26 PM Daniel Wagner-Hall <dawa...@gmail.com> wrote:
On Tue, 29 Nov 2022 at 15:22, Lukács T. Berki <lbe...@google.com> wrote:
On Tue, Nov 29, 2022 at 3:59 PM Chi Wang <chi...@google.com> wrote:
I believe option A, i.e. the lease service plus automatically re-running the whole build in case of lost inputs, can solve all the mentioned issues.
Excellent. Then we don't need to make action rewinding work, which would be quite a delicate change. There are some decisions to be made (what if one changes the RBE server during the lifetime of the Bazel server, what happens if a cache entry that Bazel thinks exists doesn't exist on the remote server anymore, etc.), but even with all those, it will be less work and will result in a simpler system than if we went with action rewinding.

Daniel, Ulrik, Ed: do you have a use case which Chi's proposal doesn't cover?

I had a look at the code, and experimented with using it in a few situations. AFAICT Chi's proposal covers everything I care about - thanks for driving this discussion!
Let's shelve action rewinding for the time being. 

To be clear: I'd love this to be a feature of Bazel, eventually, but given its marginal utility, high development cost and probably large complexity cost, I don't think it's a good investment of the time of the Bazel team to work on this. I'd consider accepting a pull request if it is reasonably simple, both in terms of complexity of code and in terms of new behaviors added to Bazel.


Chi: Do you have an expected timeline to getting your PR merged (and back-ported to Bazel 6)? Are there any blockers you need support with?
