Hi Lukács,
Long time no see!
Op ma 21 nov. 2022 om 12:29 schreef 'Lukács T. Berki' via bazel-dev
<baze...@googlegroups.com>:
> I'm trying to figure out whether it's feasible to promise action rewinding for Bazel 7.0.
I'm pretty sure I have already brought this up in a couple of
different places, but to me it's still not clear why we need
action rewinding. As far as I know, there are two scenarios to
consider:
1. Your remote caching/execution service loses CAS objects that were
generated by an action that ran at an earlier point within the same
build.
This may happen if your remote caching setup is:
- Too small, which can easily be fixed by increasing its storage
capacity. Rewinding is not a good solution here, as it may cause your
build to get stuck indefinitely.
- Not reliable. In that case the right fix is to improve the
reliability of your remote caching setup.
I think it's not an unreasonable requirement for remote caching setups
to guarantee preservation of CAS objects for the duration of a single
build.
2. Your Bazel server process holds on to state of which blobs exist
between invocations, causing it to reference objects that may have
been garbage collected in the meantime.
My question here is why Bazel can't just scan this state at the start
of a build, call FindMissingBlobs() and selectively purge the state
corresponding to objects that no longer exist remotely? Because
FindMissingBlobs() is capable of doing this for large batches, the
number of RPCs needed is fairly small, even for large repositories.
This is exactly what bb_clientd does if you make use of its Remote
Output Service feature:
- https://github.com/bazelbuild/bazel/pull/12823 <- Bazel PR to add a
client for this
- https://github.com/buildbarn/bb-clientd/blob/master/pkg/filesystem/virtual/remote_output_service_directory.go#L342-L350
<- Code in bb_clientd that does the filtering.
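To make this concrete, the scan could look roughly like the sketch below, written against the generated REv2 Java stubs. This is only an illustration of the idea, not Bazel's actual code; StaleStateScanner and rememberedOutputs (digest -> whatever blob-existence metadata the server process keeps between invocations) are made-up names.

import build.bazel.remote.execution.v2.ContentAddressableStorageGrpc.ContentAddressableStorageBlockingStub;
import build.bazel.remote.execution.v2.Digest;
import build.bazel.remote.execution.v2.FindMissingBlobsRequest;
import com.google.common.collect.Lists;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class StaleStateScanner {
  // Drops every remembered output whose blob no longer exists in the remote CAS,
  // so the producing actions get re-run (or re-fetched) instead of being assumed valid.
  static void purgeMissingOutputs(
      ContentAddressableStorageBlockingStub cas,
      String instanceName,
      Map<Digest, ?> rememberedOutputs) {
    // FindMissingBlobs accepts large batches, so even a big repository needs only a
    // handful of RPCs; chunking just keeps individual request messages reasonably sized.
    for (List<Digest> batch :
        Lists.partition(new ArrayList<>(rememberedOutputs.keySet()), 10_000)) {
      FindMissingBlobsRequest request =
          FindMissingBlobsRequest.newBuilder()
              .setInstanceName(instanceName)
              .addAllBlobDigests(batch)
              .build();
      for (Digest missing : cas.findMissingBlobs(request).getMissingBlobDigestsList()) {
        rememberedOutputs.remove(missing);
      }
    }
  }
}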
I'm pretty sure that we could add similar logic to Bazel's
src/main/java/com/google/devtools/build/lib/remote/RemoteActionFileSystem.java
to do something similar. As far as I know, this approach is at least
as efficient as doing full action rewinding. Action rewinding may, for
example, be triggered if the remote execution service's Execute() call
fails with FAILED_PRECONDITION. The REv2 spec does not guarantee that
these errors are returned *prior* to actually queueing/executing said
actions. This means that valuable amounts of worker time may be
wasted, just to come to the conclusion that rewinding needs to be
performed.
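For reference, this is the failure mode being referred to: the REv2 spec describes missing inputs being reported as a FAILED_PRECONDITION status whose details carry a google.rpc.PreconditionFailure with violations of type "MISSING" and subjects of the form "blobs/{hash}/{size}". A rough sketch of how a client could recognize that case (MissingBlobDetector is a made-up name, and this only handles the common subject format):

import build.bazel.remote.execution.v2.Digest;
import com.google.protobuf.Any;
import com.google.protobuf.InvalidProtocolBufferException;
import com.google.rpc.PreconditionFailure;
import com.google.rpc.Status;
import io.grpc.StatusRuntimeException;
import io.grpc.protobuf.StatusProto;
import java.util.ArrayList;
import java.util.List;

final class MissingBlobDetector {
  // Returns the digests of blobs the server reported as missing, or an empty list
  // if the error was not a missing-blob FAILED_PRECONDITION.
  static List<Digest> missingBlobs(StatusRuntimeException e)
      throws InvalidProtocolBufferException {
    List<Digest> missing = new ArrayList<>();
    Status status = StatusProto.fromThrowable(e);
    if (status == null
        || status.getCode() != io.grpc.Status.Code.FAILED_PRECONDITION.value()) {
      return missing;
    }
    for (Any detail : status.getDetailsList()) {
      if (!detail.is(PreconditionFailure.class)) {
        continue;
      }
      for (PreconditionFailure.Violation violation :
          detail.unpack(PreconditionFailure.class).getViolationsList()) {
        if (!"MISSING".equals(violation.getType())) {
          continue;
        }
        String[] parts = violation.getSubject().split("/"); // "blobs/{hash}/{size}"
        if (parts.length == 3 && parts[0].equals("blobs")) {
          missing.add(
              Digest.newBuilder()
                  .setHash(parts[1])
                  .setSizeBytes(Long.parseLong(parts[2]))
                  .build());
        }
      }
    }
    return missing;
  }
}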
I would like to hear from the authors of both PRs why the approach I
laid out above is not sufficient for their use case.
Best regards,
--
Ed Schouten <e...@nuxi.nl>
On Mon, Nov 21, 2022 at 7:52 PM Daniel Wagner-Hall <dawa...@gmail.com> wrote:
> Thanks for kicking off this thread, Lukács!
>
> On Mon, 21 Nov 2022 at 13:24, Lukács T. Berki <lbe...@google.com> wrote:
> > On Mon, Nov 21, 2022 at 1:31 PM Ed Schouten <e...@nuxi.nl> wrote:
> > > 1. Your remote caching/execution service loses CAS objects that were
> > > generated by an action that ran at an earlier point within the same
> > > build. [...] I think it's not an unreasonable requirement for remote
> > > caching setups to guarantee preservation of CAS objects for the
> > > duration of a single build.
> > I agree that within a build, a remote execution system should handle these lifetimes.

Chi, mind reading this and telling me if this is correct? IIRC the RBE folks thought about this, and the limiting factor is that there is no maximum length of time a Bazel build can run. It's possible that Bazel executes an action very early and another action that depends on one of its outputs very late. To make that scenario work, Bazel would need to either set the lifetime of that output to some value that guarantees it will still be live by the end of the build (but there is no amount of time that would guarantee that) or periodically refresh output artifacts during the build (but we don't do that; maybe we should?).

> > > 2. Your Bazel server process holds on to state of which blobs exist
> > > between invocations, causing it to reference objects that may have
> > > been garbage collected in the meantime. [...] This is exactly what
> > > bb_clientd does if you make use of its Remote Output Service feature.
> > We have a "remote data was lost, please clean your workspace" error message and so far, it has been enough, so I personally haven't given a lot of thought to this particular scenario. Chi would probably be able to tell more since he thought a lot about the interaction of the local action cache and remote builds.
> This is the case I'm interested in - in particular, e.g. if you have persistent Bazel daemons in a CI environment, having to manually inject a `bazel shutdown` when you hit this issue is a sub-par user experience.

Yeah, I can relate to that; I'd again be interested in what Chi thinks about the most desirable solution; there are all sorts of caches at play (Skyframe, the local action cache, the local disk cache) and there is also FUSEless mode, which complicates things.

> > > I'm pretty sure that we could add similar logic to Bazel's
> > > src/main/java/com/google/devtools/build/lib/remote/RemoteActionFileSystem.java
> > > to do something similar.
> Interesting... I didn't know that skyframe actions would properly invalidate if you just remove stuff from RemoteActionFileSystem... How confident are you that this works through the stack (including for incremental and disk-cache-using builds)? If so, that sounds like a great, simpler substitute for full rewinding. If you're reasonably confident in it, I can try to put together a test case showing it hopefully working next week!

AFAICT this won't work, since a given RemoteActionFileSystem instance is only used for the inputs/outputs of a given action and it is created right before the action is executed. Thus, detecting a missing artifact in RemoteActionFileSystem would still require action rewinding.
Thank you for working on this!
I imagine several scenarios where rewinding would be valuable:
- Reconfigured sharding for remote caches.
- Remote cache down and replaced with other non-fully synchronized instance.
- Temporary network issue resulting in caches being unreachable.
- Remote RAM cache restarted.
- Local --disk_cache storage cleared.
Handling these rare scenarios by some kind of rewinding/retrying seems like the most reliable and versatile solution to me.
Rewinding arbitrarily far back might not be necessary. Retrying the whole build could be good enough, as long as it happens automatically, so that users can still trust Bazel to take care of it.
I try to teach our users to trust Bazel and avoid their old habits of 'make clean'. Having situations that require a manual 'bazel shutdown' or 'bazel clean' would ruin that trust and is therefore blocking my organization from using BwtB.
Regarding how many times to rewind/retry and the risk of getting stuck indefinitely: one option would be to automatically disable BwtB during the rewind/retry.
BR,
Ulrik Falklöf, Ericsson
On Monday, November 21, 2022 at 12:29:29 PM UTC+1 lbe...@google.com wrote:

Hey folks,

I'm trying to figure out whether it's feasible to promise action rewinding for Bazel 7.0.

It looks like Daniel (and @k1nkreet on GitHub, but I don't know their e-mail) did some work on action rewinding (see #14126 and #16470), but it's quite a delicate change. In addition, I don't think it's a good idea to have a public feature that requires a large number of preconditions to work, which makes it even more difficult.

We at Google also don't have a lot of use for this extended version of action rewinding, so it'll be difficult for us to find time to implement it; if this is to happen, it would probably have to be implemented by Daniel with guidance from Justin and Mark. Would that work?

It looks like we collectively know what limitations need to be lifted, so there are no theoretical obstacles, and the framework for testing it (RewindingTest) has already been open sourced; it "just" needs to be coded up and tested.

The limitations that I know of that need to be lifted are:
- It should work with --track_incremental_state (which means keeping track of Skyframe reverse deps)
- Outputs lost should be evicted from the local action cache (otherwise re-running an action would just be a local action cache hit)
- It should work in --nokeep_going mode
- It should work without ActionFS (but I assume that a simple integration test would take care of this)
Lukács T. Berki | Software Engineer | lbe...@google.com |Google Germany GmbH | Erika-Mann-Str. 33 | 80636 München | Germany | Geschäftsführer: Paul Manicle, Halimah DeLaine Prado | Registergericht und -nummer: Hamburg, HRB 86891
On Thu, Nov 24, 2022 at 3:10 PM Lukács T. Berki <lbe...@google.com> wrote:
> On Thu, Nov 24, 2022 at 9:41 AM Ulrik Falklöf <ulrik.fal...@gmail.com> wrote:
> > I imagine several scenarios where rewinding would be valuable:
> > - Reconfigured sharding for remote caches.
> > - Remote cache down and replaced with other non-fully synchronized instance.
> > - Temporary network issue resulting in caches being unreachable.
> > - Remote RAM cache restarted.
> > - Local --disk_cache storage cleared.
> Do I understand correctly that Chi's lease service would solve all these use cases? If so, the argument for action rewinding becomes much weaker. In fact, I can't think of any case where it would still be necessary, except maybe the "remote cache loses inputs within a single build" use case, but it looks like the lease service could be extended to also cover that case?

I got the same impression based on Chi's message. And I assume the extension would also be needed for HTTP caches that don't support leases via findMissingBlobs.

I assume #10880 would be fixed (for lost inputs within and between builds) by any of:
A: Lease service. + Extension to automatically re-run the whole build in case of lost inputs within a single build.
B: Lease service. + Action Rewinding.
C: Action Rewinding.
D: Automatically re-run the whole build in case of lost inputs. + Automatic clearing of some stale state before the re-run.
I would be happy with any of A, B, C or D! However, if the lease service also becomes complex or increases memory usage, then D appears most attractive to me.

> I don't think that in the presence of remote execution, Bazel can guarantee that a build succeeds (e.g. when someone pulls out the network cable), so as long as we can make sure that Bazel doesn't get stuck in a bad state, we are fine.

But even in the presence of remote execution, BwtB with lost inputs should not prevent using --remote_retries and --remote_local_fallback to try to make the build succeed.
I'm not convinced the lease service will be sufficient: our remote execution has a feature where we retain blobs for at least x time after access. Access is either blob upload, blob download, find missing blobs, or an action cache lookup for all referenced blobs (at least outputs; not sure about inputs there, actually). IIUC, that's equivalent behavior to the lease service, and it works pretty well in the happy case. The problem with this approach in general is that (a) it's only a pinky swear by the remote execution/caching system to not lose the blob, which does happen in practice regardless of this promise (e.g., because the CAS is on disk and auto-scaled, the machine that had the blob was removed from the cluster due to low load or similar, and the cluster runs without a second layer of caching in more permanent storage like S3 or GCS), and (b) storing many blobs even just for a couple of hours can mean a lot of data, much of which is basically never accessed again because a dependency of that action changed in the next build, which makes storage costs explode.
I also don’t think re-running the whole build is generally a good idea as it increases the load on the remote execution or caching system (esp. for very large Bazel builds that take longer and are more likely to suffer from failures because of lost blobs).
Overall, I’m not convinced there’s any good solution short of action rewinding.
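As an aside, the retention policy described here amounts to roughly the following server-side bookkeeping (a toy sketch, not any particular implementation; all names are made up):

import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class RetentionIndex {
  private final Map<String, Instant> deadlines = new ConcurrentHashMap<>(); // key: "hash/size"
  private final Duration retainAfterAccess;

  RetentionIndex(Duration retainAfterAccess) {
    this.retainAfterAccess = retainAfterAccess;
  }

  // Called on upload, download, FindMissingBlobs hits and action cache lookups:
  // every access pushes the blob's eviction deadline out by retainAfterAccess.
  void recordAccess(String blobKey) {
    deadlines.put(blobKey, Instant.now().plus(retainAfterAccess));
  }

  // Garbage collection may evict a blob only once its deadline has passed
  // (which is still a promise, not a guarantee, as noted above).
  boolean isEvictable(String blobKey) {
    Instant deadline = deadlines.get(blobKey);
    return deadline == null || Instant.now().isAfter(deadline);
  }
}

Concern (b) above is then just the observation that retainAfterAccess multiplied by the rate at which new blobs are written bounds how much mostly-dead data the CAS has to keep around.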
On Wed, Nov 30, 2022 at 11:24 AM Lukács T. Berki <lbe...@google.com> wrote:
> On Tue, Nov 29, 2022 at 5:37 PM Yannic Bonenberger <yan...@yannic-bonenberger.com> wrote:
> > I'm not convinced the lease service will be sufficient: our remote execution has a feature where we retain blobs for at least x time after access. [...]

The server can and should retain blobs for at least x time after access. Your definition of "access" also sounds good to me. But there are some important differences with the lease service:

1. The "at least x time" can be set to a smaller value. With today's code, the server has to guess the maximum invocation time for a project and use that as the lifetime of just-accessed blobs. With the lease service, Bazel and the server can agree on a small value. I believe this will reduce the load. With action rewinding, I believe the server still has to store many blobs for a couple of hours, because you probably don't want Bazel to rewind too much.

2. Bazel is able to clean the stale state with the help of the lease service. So even if the server cannot honor the leases, Bazel can move forward in the following build.

> > I also don't think re-running the whole build is generally a good idea as it increases the load on the remote execution or caching system (esp. for very large Bazel builds that take longer and are more likely to suffer from failures because of lost blobs).

I believe the load on the remote server is the same between re-running the whole build and action rewinding. Re-running the whole build doesn't mean Bazel will check the remote cache for all actions; it still has the local action cache. Only actions whose outputs are missing are re-run. In theory, the number of actions that need to be checked against the remote cache/remote execution is the same as with action rewinding.
I agree with Lukács about the high cost of action rewinding, but also with Yannic about the complexity and cost of making the server side resilient enough. To me it seems both of those alternatives are too complex and costly.
And like Chi, I also believe the load on the remote server is the same between rerunning the whole build and action rewinding. Therefore, I think automatically rerunning the whole build makes the most sense.
Alexjski raised concerns about the complexity of the lease service in the #16660 review. For my use cases, not even a lease service would be required. What use cases do you have that require a lease service in addition to automatically rerunning the whole build?
Am 30.11.2022 um 12:33 schrieb Lukács T. Berki <lbe...@google.com>:
> On Wed, Nov 30, 2022 at 11:48 AM Chi Wang <chi...@google.com> wrote:
> > 1. The "at least x time" can be set to a smaller value. [...] With the lease service, Bazel and the server can agree on a small value. I believe this will reduce the load. [...]
> +1

Fair enough. But the smaller the value, the more RPCs need to happen, which will increase the load.

> > 2. Bazel is able to clean the stale state with the help of the lease service. So even if the server cannot honor the leases, Bazel can move forward in the following build.
> +1
> > I believe the load on the remote server is the same between re-running the whole build and action rewinding. [...]
> +1

Ok, maybe we have different perspectives of what re-running means, then. From my understanding, it means that build A fails (as in, Bazel returns exit_code != 0) and is re-tried at a higher level. I agree that this still doesn't imply that Bazel cannot use the local cache. The problem is that, in practice, we see that lots of people aren't re-using the Bazel server on CI at all and always spin up a new CI runner (e.g., a container), and then Bazel would re-do all these RPCs (FWIW, some people already do that for any failure anyway). If your definition of re-running means that Bazel internally catches the error and restarts the build from the beginning, things would definitely look different.

> I think there are two main arguments for action rewinding:
> - The build might fail over to an entirely different remote execution cluster while it's running, and that causes loss of state unless some clever (and maybe costly) replication is done.
> - No service can guarantee that it keeps 100.0000% of the blobs for which it has leases outstanding, because it ultimately runs on physical and thus fallible computers.
> And these have to be weighed against its development and maintenance costs. I am not absolutely against action rewinding in Bazel; it's just that my perspective is that these costs would be way too high for not failing a negligible percentage of all builds. Also, making things more resilient on the remote execution side seems to be easier because it deals with flat data structures (the blob + action caches) instead of graphs.

What about tree artifacts? Would the lease service only check the root node of the merkle tree, or download everything and extend the lifetime of all blobs and tree nodes? If it's the former, that wouldn't help, and if it's the latter, that'd either mean lots of downloads every time we extend the lifetime or higher memory consumption because Bazel needs to hold all referenced digests in memory. Both options sound problematic to me.

Also, the shorter the lease time, the more RPCs are happening to extend the lifetime of blobs. That will significantly increase the overall number of RPCs during a build, increase the load on the remote system (as keeping track of when it's ok to delete a blob isn't free either), and probably add to the overall latency of a build.

> If someone says that proper action rewinding is easier than I think it is, or that the "negligible percentage" is not that negligible and can only be made so with disproportionate investment on the remote execution side, I'd be happy to reconsider.

What's the negligible percentage? 1 in 100,000? 1 in 1,000,000? That's not that many builds, so we might still see the failure every day in larger organizations.
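To make the tree-artifact question concrete: "extending the lifetime of all blobs and tree nodes" would mean re-announcing roughly the following set of digests, which is only possible if the client keeps the Tree proto (or at least all of its digests) in memory. A rough sketch, with a hypothetical helper:

import build.bazel.remote.execution.v2.Digest;
import build.bazel.remote.execution.v2.Directory;
import build.bazel.remote.execution.v2.FileNode;
import build.bazel.remote.execution.v2.Tree;
import java.util.ArrayList;
import java.util.List;

final class TreeArtifactDigests {
  // Collects every CAS digest that keeping this tree artifact alive would require:
  // the serialized Tree message itself plus every file the tree references.
  static List<Digest> digestsReferencedByTree(Digest treeBlobDigest, Tree tree) {
    List<Digest> digests = new ArrayList<>();
    digests.add(treeBlobDigest);
    List<Directory> allDirs = new ArrayList<>();
    allDirs.add(tree.getRoot());
    allDirs.addAll(tree.getChildrenList());
    for (Directory dir : allDirs) {
      for (FileNode file : dir.getFilesList()) {
        digests.add(file.getDigest());
      }
      // DirectoryNode digests point at the child Directory messages. They are already
      // inlined in Tree.children, but a cache that also stores directories as separate
      // blobs would need these digests kept alive as well.
      dir.getDirectoriesList().forEach(node -> digests.add(node.getDigest()));
    }
    return digests;
  }
}

That enumeration is exactly the memory-vs-downloads trade-off raised in the question above.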
On Tue, Nov 29, 2022 at 3:59 PM Chi Wang <chi...@google.com> wrote:
> I believe option A, i.e. lease service plus automatically re-run the whole build in case of lost inputs, can solve all the mentioned issues.

Excellent. Then we don't need to make action rewinding work, which would be quite a delicate change. There are some decisions to be made (what if one changes the RBE server during the lifetime of the Bazel server, what happens if a cache entry that Bazel thinks exists doesn't exist on the remote server anymore, etc.), but even with all those, it will be less work and will result in a simpler system than if we went with action rewinding.

Daniel, Ulrik, Ed: do you have a use case which Chi's proposal doesn't cover?
On Tue, 29 Nov 2022 at 15:22, Lukács T. Berki <lbe...@google.com> wrote:
> Daniel, Ulrik, Ed: do you have a use case which Chi's proposal doesn't cover?

I had a look at the code, and experimented with using it in a few situations. AFAICT Chi's proposal covers everything I care about - thanks for driving this discussion!

Chi: Do you have an expected timeline for getting your PR merged (and back-ported to Bazel 6)? Are there any blockers you need support with?