Retrying RESOURCE_EXHAUSTED errors


Eric Burnett

May 6, 2020, 7:12:21 PM
to Remote Execution APIs Working Group, Yinfu Chen
Heya API folk,

I want to solicit your opinion: what should clients do upon receiving RESOURCE_EXHAUSTED errors from REAPI?

(I should note that whatever we decide here will probably be applied to Bazel and Goma at minimum, so this is not simply an academic exercise).

The specific scenario of interest to me is an API that returns RESOURCE_EXHAUSTED when some global rate limit is exceeded. In our case, we want to be able to return this error when a large client sends us too many requests in total, as that risks overwhelming our service entirely if left unchecked. But we want this to induce graceful backoff, not a state change from "fine" to "exploding".

The problem here is that global overload is rarely "slight" enough that we can rely on the usual retry semantics to mask it. Bazel, for example, retries RPCs 5 times by default. But if the overload spiked to, say, "15% over quota", every RPC attempt would have a baseline ~13% chance of failing, and at 5 attempts you'd expect an RPC to exhaust its retries (i.e. a build to fail outright) every 23k RPCs or so. Given large enough clients, you're now measuring user-visible build failures in QPS, which is nowhere close to "graceful degradation" in my book. (Users should ~never see unnecessary build failures, period.)
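
(A quick back-of-the-envelope check of that arithmetic - assuming rejections are independent and uniformly random, and counting 5 attempts per RPC - lands in the same tens-of-thousands-of-RPCs-per-failure range:)

```python
# Back-of-the-envelope: chance that one RPC exhausts all of its attempts
# when the service is running 15% over its rate quota.
over_quota = 0.15
p_reject = over_quota / (1 + over_quota)  # ~13% of attempts get shed
attempts = 5                              # Bazel's default retry budget

p_rpc_exhausted = p_reject ** attempts    # every attempt rejected
print(f"per-attempt rejection rate: {p_reject:.3f}")
print(f"per-RPC failure rate:       {p_rpc_exhausted:.2e}")
print(f"expected RPCs per failure:  {1 / p_rpc_exhausted:,.0f}")  # tens of thousands
```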

And that's assuming the overload is both statistically random and mild. If there's any skew - by location, client, the particular server you've happened to connect to, whatever - the odds of user-visible build failures go way up even at 15% overload. And turning up a new workload without first increasing quota could easily put you at 200% of quota instead of 115%, at which point every single build is failing across both new and old workloads. Ouch.

I'd like to say "clients should retry RESOURCE_EXHAUSTED indefinitely". That is, if you manage to put yourself into overload conditions, your builds will struggle along until they hit some higher-level timeout, hopefully giving you enough time to page someone before "slow" turns into "failing". Note that in this case, failing builds and having them retried at a higher level would make things actively worse.

This has two complications:
  • Discoverability: it's not entirely obvious from the client side why builds are then slow. (Though clients could presumably log something to stdout to make it obvious).
  • gRPC: since this couldn't be too simple, gRPC also injects RESOURCE_EXHAUSTED errors for client memory exhaustion and for RPCs exhausting size limits.
    • I've never personally seen the former happen, but want to call it out;
    • In the latter case this presumably impacts debugging, but hopefully it wouldn't first manifest on real-world workloads in the wild?
Even so, I'd like to proceed as above. Right now the status quo is that servers cannot safely return RESOURCE_EXHAUSTED without expecting to induce some level of client outage, which is a tough tradeoff when you want to gracefully protect your service from (probably short) traffic spikes. Conversely, the gRPC-injected scenarios above at worst turn a build that was going to fail anyway into one that seems hung until you check the logs - and even that seems pretty unlikely in practice.
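
Concretely, the client-side behavior I'm imagining is roughly the following (a Python sketch - the backoff parameters and the choice to special-case only RESOURCE_EXHAUSTED and UNAVAILABLE are illustrative, not what Bazel currently does):

```python
import logging
import random
import time

import grpc

def call_with_backoff(do_rpc, max_retries=5, base=0.1, cap=30.0):
    """Bounded retries for ordinary transient errors; unbounded (but noisy)
    retries for RESOURCE_EXHAUSTED, so overload degrades to slowness rather
    than build failure."""
    attempt = 0
    while True:
        try:
            return do_rpc()
        except grpc.RpcError as err:
            attempt += 1
            if err.code() == grpc.StatusCode.RESOURCE_EXHAUSTED:
                logging.warning(
                    "Server is over capacity (RESOURCE_EXHAUSTED, attempt %d); "
                    "backing off and retrying indefinitely.", attempt)
            elif err.code() == grpc.StatusCode.UNAVAILABLE and attempt <= max_retries:
                pass  # ordinary transient error, still within the retry budget
            else:
                raise
            # Exponential backoff with jitter, capped so indefinite retries
            # don't wait arbitrarily long between attempts.
            time.sleep(min(cap, base * 2 ** min(attempt, 10)) * random.uniform(0.5, 1.0))
```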

Any opinions? Absent objection, I'll add it as a clarification to the API and attempt to update bazel to retry indefinitely (w/ warning) in practice. If you don't care either way, that's fine by me too :).

Thanks,
Eric

John Millikin

May 6, 2020, 7:22:27 PM
to Eric Burnett, Remote Execution APIs Working Group, Yinfu Chen
Clients retrying RESOURCE_EXHAUSTED indefinitely would be surprising, because that code covers both transient and non-transient failure conditions. For example, an action that exceeds the maximum working tree size would fail no matter how many times it's re-attempted.

The way I've seen this handled in the past is for the gRPC status to contain additional metadata that clients can inspect -- if this metadata is present, and indicates a transient failure, the client might adjust its retry thresholds.
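
For example, a client could check the rich status details for the standard google.rpc types (a Python sketch using the grpcio-status helpers; the classification policy is illustrative):

```python
import grpc
from google.rpc import error_details_pb2
from grpc_status import rpc_status

def is_transient_exhaustion(rpc_error: grpc.RpcError) -> bool:
    """True if a RESOURCE_EXHAUSTED error carries metadata marking it as a
    temporary quota/rate condition rather than a hard failure."""
    if rpc_error.code() != grpc.StatusCode.RESOURCE_EXHAUSTED:
        return False
    status = rpc_status.from_call(rpc_error)  # None if no rich status attached
    if status is None:
        return False
    return any(
        detail.Is(error_details_pb2.QuotaFailure.DESCRIPTOR)
        or detail.Is(error_details_pb2.RetryInfo.DESCRIPTOR)
        for detail in status.details)
```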

Stripe's build executor handles temporary resource over-limit by keeping actions in QUEUED state until either capacity is available, or the action times out. This seems to work fairly well.


Eric Burnett

May 6, 2020, 8:04:31 PM
to John Millikin, Remote Execution APIs Working Group, Yinfu Chen
Thanks for the quick feedback John!

On Wed, May 6, 2020 at 7:22 PM John Millikin <jmil...@stripe.com> wrote:
Clients retrying RESOURCE_EXHAUSTED indefinitely would be surprising, because that code covers both transient and non-transient failure conditions. For example, an action that exceeds the maximum working tree size would fail no matter how many times it's re-attempted.

Shouldn't an overly large working tree be returned as INVALID_ARGUMENT? It sounds like that condition is going to be deterministic, and so should not be mapped to a retriable error.
 

The way I've seen this handled in the past is for the gRPC status to contain additional metadata that clients can inspect -- if this metadata is present, and indicates a transient failure, the client might adjust its retry thresholds.

I looked at the metadata options available - I should have called that out. One option is simply to search the error string for "quota", which is hacky but probably effective, and might actually be the best choice. Otherwise the standard options appear to be google.rpc.QuotaFailure, which is...TWO strings to grep...or google.rpc.RetryInfo, which is simply a backoff delay with no additional semantics. (I also technically don't have an easy way to inject these errors, but I think that's my problem :) ).
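
For reference, when you do control the server, attaching e.g. RetryInfo to a RESOURCE_EXHAUSTED status is straightforward (a Python sketch using grpcio-status; the message text and delay are illustrative):

```python
from google.protobuf import any_pb2, duration_pb2
from google.rpc import code_pb2, error_details_pb2, status_pb2
from grpc_status import rpc_status

def rate_limited_status(retry_after_seconds: int):
    """Build a RESOURCE_EXHAUSTED status carrying a RetryInfo detail."""
    retry_info = error_details_pb2.RetryInfo(
        retry_delay=duration_pb2.Duration(seconds=retry_after_seconds))
    detail = any_pb2.Any()
    detail.Pack(retry_info)
    status = status_pb2.Status(
        code=code_pb2.RESOURCE_EXHAUSTED,
        message="rate quota exceeded; please back off",
        details=[detail])
    return rpc_status.to_status(status)

# In a servicer method:
#   context.abort_with_status(rate_limited_status(10))
```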
 

Stripe's build executor handles temporary resource over-limit by keeping actions in QUEUED state until either capacity is available, or the action times out. This seems to work fairly well.

The queue is a natural place to hold execution requests, yep! I'm less concerned about those, though - we only get a few thousand QPS of execution requests, which are unlikely to destabilize the system on their own, whereas we get a couple of orders of magnitude more download requests (ByteStream.Read especially), and at much higher CPU cost each. This is especially notable for cache-only builds - holding Execute calls would slow everything down for a remote execution build, but for cache-only builds the only RPCs available for influencing build behaviour are GetActionResult and ByteStream.Read.

That said, an alternative to returning explicit RESOURCE_EXHAUSTED errors when backoff needs to be induced is to artificially delay RPCs before processing them. (I think this tactic has some name - 'honey' something? - don't exactly remember). But that seems to be less discoverable since clients get no explicit signal in that case, so I'd consider it only if we decide there's no viable way to indicate the same through explicit errors.
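
For illustration, that "artificially delay" approach could be as simple as a token bucket that sleeps callers instead of rejecting them (a toy Python sketch, not how our frontend actually works):

```python
import threading
import time

class RateSmoother:
    """Toy token bucket that sleeps callers until a token is free, instead of
    rejecting them, so overload shows up as latency rather than errors."""

    def __init__(self, rate_per_sec: float, burst: int):
        self._rate = rate_per_sec
        self._capacity = float(burst)
        self._tokens = float(burst)
        self._last = time.monotonic()
        self._lock = threading.Lock()

    def wait_for_token(self):
        while True:
            with self._lock:
                now = time.monotonic()
                self._tokens = min(self._capacity,
                                   self._tokens + (now - self._last) * self._rate)
                self._last = now
                if self._tokens >= 1.0:
                    self._tokens -= 1.0
                    return
                needed = (1.0 - self._tokens) / self._rate
            time.sleep(needed)  # delay the RPC instead of failing it

# e.g. at the top of a request handler: smoother.wait_for_token()
```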

John Millikin

May 6, 2020, 8:33:39 PM
to Eric Burnett, Remote Execution APIs Working Group, Yinfu Chen
On Thu, May 7, 2020 at 9:04 AM Eric Burnett <ericb...@google.com> wrote:
Thanks for the quick feedback John!

On Wed, May 6, 2020 at 7:22 PM John Millikin <jmil...@stripe.com> wrote:
Clients retrying RESOURCE_EXHAUSTED indefinitely would be surprising, because that code covers both transient and non-transient failure conditions. For example, an action that exceeds the maximum working tree size would fail no matter how many times it's re-attempted.

Shouldn't an overly large working tree be returned as INVALID_ARGUMENT? It sounds like that condition is going to be deterministic, and so should not be mapped to a retriable error.

INVALID_ARGUMENT is for request values that are inherently invalid -- like a negative timeout, or an empty filename, or something like that. If it depends on system state (such as a configured disk quota) then the correct error code would be RESOURCE_EXHAUSTED, which is logically a sort of subset of FAILED_PRECONDITION.
 
 

The way I've seen this handled in the past is for the gRPC status to contain additional metadata that clients can inspect -- if this metadata is present, and indicates a transient failure, the client might adjust its retry thresholds.

I looked at the metadata options available - I should have called that out. One option is simply to search the error string for "quota", which is hacky but probably effective, and might actually be the best choice. Otherwise the standard options appear to be google.rpc.QuotaFailure, which is...TWO strings to grep...or google.rpc.RetryInfo, which is simply a backoff delay with no additional semantics. (I also technically don't have an easy way to inject these errors, but I think that's my problem :) ).

You would want to define a new message type, with structured data containing details about the out-of-quota condition. This type should live in the Bazel Remote APIs repository.

The gRPC error status can contain any protobuf message -- the details field is just a `repeated Any` -- so you don't need to restrict yourself to the types in the `google.rpc.*` package namespace. The client library can handle attaching metadata to the response.
 
 

Stripe's build executor handles temporary resource over-limit by keeping actions in QUEUED state until either capacity is available, or the action times out. This seems to work fairly well.

The queue is a natural place to hold execution requests, yep! I'm less concerned about those, though - we only get a few thousand QPS of execution requests, which are unlikely to destabilize the system on their own, whereas we get a couple of orders of magnitude more download requests (ByteStream.Read especially), and at much higher CPU cost each. This is especially notable for cache-only builds - holding Execute calls would slow everything down for a remote execution build, but for cache-only builds the only RPCs available for influencing build behaviour are GetActionResult and ByteStream.Read.

That said, an alternative to returning explicit RESOURCE_EXHAUSTED errors when backoff needs to be induced is to artificially delay RPCs before processing them. (I think this tactic has some name - 'honey' something? - don't exactly remember). But that seems to be less discoverable since clients get no explicit signal in that case, so I'd consider it only if we decide there's no viable way to indicate the same through explicit errors.

Rate limiting is different from quota because it doesn't only depend on the behavior of the rejected client, but on the behavior of all clients.

For rate-limit rejections, the server should return RESOURCE_EXHAUSTED with (optionally) a `RetryInfo`. Clients should obey the RetryInfo. This is all pretty standardized within gRPC; IMO it would be best for Bazel to avoid defining its own semantics for how rate ACLs work.
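
For completeness, "obey the RetryInfo" on the client side might look roughly like this (a Python sketch using the grpcio-status helpers; the fallback behavior is illustrative):

```python
import grpc
from google.rpc import error_details_pb2
from grpc_status import rpc_status

def retry_delay_seconds(rpc_error: grpc.RpcError, fallback: float) -> float:
    """Return the server-supplied RetryInfo delay, or a local fallback."""
    status = rpc_status.from_call(rpc_error)
    if status is not None:
        for detail in status.details:
            if detail.Is(error_details_pb2.RetryInfo.DESCRIPTOR):
                info = error_details_pb2.RetryInfo()
                detail.Unpack(info)
                return info.retry_delay.ToTimedelta().total_seconds()
    return fallback
```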

Eric Burnett

May 7, 2020, 11:17:41 AM
to John Millikin, Remote Execution APIs Working Group, Yinfu Chen
On Wed, May 6, 2020 at 8:33 PM John Millikin <jmil...@stripe.com> wrote:
INVALID_ARGUMENT is for request values that are inherently invalid -- like a negative timeout, or an empty filename, or something like that. If it depends on system state (such as a configured disk quota) then the correct error code would be RESOURCE_EXHAUSTED, which is logically a sort of subset of FAILED_PRECONDITION.

I realize we're conflating two different types of errors in our discussion, which I think is confusing the issue. I'm only interested in RESOURCE_EXHAUSTED as an RPC-level error and how it may arise; the scenario you're describing would, I think, be an action-result error (communicated via ExecuteResponse.status).

So, can you confirm whether you use RESOURCE_EXHAUSTED for RPC-level errors, and would be negatively impacted if those were retried indefinitely?

W.r.t. error semantics, I'd say:
  • A request containing a tree that will (deterministically) expand too large is either
    • an RPC-level INVALID_ARGUMENT (if it exceeds some service-wide threshold the client cannot change) or
    • an RPC-level FAILED_PRECONDITION (if the client needs to update their configuration to have larger disks before retrying similar requests).
  • The service trying to run the action on a worker that happens to be out of disk is either
    • an action-level INTERNAL (if the service should have prevented this case from happening but failed to for whatever reason) or
    • RESOURCE_EXHAUSTED. Though I don't really like RESOURCE_EXHAUSTED in that case, as it's sort of equivalent to the server saying "not my problem; retry I guess". If the server gave a deterministic amount of space and the action ran out of disk, then there is no error - OK with a non-zero exit code would be expected, since the action ran to completion by hitting a legitimate failure state. If the server didn't give a deterministic amount of space and the action flakily ran out through no fault of the caller, INTERNAL seems a better fit, as it's on the server to fix that corner case. RESOURCE_EXHAUSTED carries the implication that the caller did something wrong (either on that call or in aggregate), which is not the case in this scenario.
  • If there's a fixed pool of resources that has been used up, so there's no room to accommodate the request, that would be RESOURCE_EXHAUSTED - but I don't think that applies to most RPCs within the REAPI. (Returning it on Execute if the number of queued actions hits a ceiling and no more are currently being accepted would fit, but that would be another good example where indefinite retries would be beneficial - otherwise overload again becomes catastrophic failure.)

The gRPC error status can contain any protobuf message -- the details field is just a `repeated Any` -- so you don't need to restrict yourself to the types in the `google.rpc.*` package namespace. The client library can handle attaching metadata to the response.

In a perfect world, yes, I'd define some sort of ServiceOverloaded message and return that. Unfortunately I don't have a clean way to inject that - rate limit errors are enforced upstream of my actual server. Perhaps the conclusion here is "too bad, that's the only viable option", but for now I'll continue exploring options that don't require defining and injecting custom metadata to control this :).
 
Rate limiting is different from quota because it doesn't only depend on the behavior of the rejected client, but on the behavior of all clients.

For rate-limit rejections, the server should return RESOURCE_EXHAUSTED with (optionally) a `RetryInfo`. Clients should obey the RetryInfo. This is all pretty standardized within gRPC; IMO it would be best for Bazel to avoid defining its own semantics for how rate ACLs work.

I think we're disagreeing on terminology - I'm talking about rate limits here. (Internally we call these "Rate Quotas" in this context, as opposed to "Allocation Quotas", but they're equivalent to rate limits.)

RetryInfo does not suffice here, specifically because it has no way to indicate that retries should not be exhausted - all it contains is retry_delay, which says how long to wait before the next attempt. But the server can't predict the future and doesn't know how long the client should wait, so if it returned e.g. a flat "10s", then within a minute of hitting overload conditions some RPCs would still exhaust their allowed retries and in turn cause their builds to fail.

(In practice, with Bazel's default exponential-backoff logic, I can tell you that hitting the "queue-length upper bound" causes an instantaneous failure of ~all concurrent builds sending actions to that queue, and overload causing "RPC rate limit" errors causes somewhere between a few and all currently-running builds to fail, depending on the degree of overload. Hence my not wanting a limit on retries - I want ways to inform customers/clients of these conditions that are better than them learning of it through a major and indiscriminate outage.)

John Millikin

May 7, 2020, 8:18:06 PM
to Eric Burnett, Remote Execution APIs Working Group, Yinfu Chen
Trying to respond inline made this message too long + hard to process, so I'll reduce the first part to general points:
  • We wouldn't consider out-of-quota to be an action execution error, because the action was never scheduled and execution never started.
  • We don't consider RateACL denials to always be retriable, because they indicate a bug in the client (lack of request throttling). It seems correct that clients that send abusive amounts of requests should be given hard errors.
  • Client applications that make it difficult to throttle outgoing requests should be changed to implement that functionality.
  • While FAILED_PRECONDITION could be used here -- all resource exhaustions are failed preconditions -- using the more specific status code is better. It simplifies monitoring (dashboards, alerts, etc.) and lets clients provide better error reporting to the user.
  • We would be negatively impacted if the Bazel remote APIs tried to define different semantics for that status code vs the standard gRPC definitions in https://github.com/googleapis/googleapis/blob/master/google/rpc/code.proto.
 
The gRPC error status can contain any protobuf message -- the details field is just a `repeated Any` -- so you don't need to restrict yourself to the types in the `google.rpc.*` package namespace. The client library can handle attaching metadata to the response.

In a perfect world, yes, I'd define some sort of ServiceOverloaded message and return that. Unfortunately I don't have a clean way to inject that - rate limit errors are enforced upstream of my actual server. Perhaps the conclusion here is "too bad, that's the only viable option", but for now I'll continue exploring options that don't require defining and injecting custom metadata to control this :).

To clarify, when you say "upstream", do you mean that there's an intermediate proxy service of some sort that's rejecting the requests before they even reach your process? Or does your gRPC implementation use a rate limiter that rejects requests before your handlers are called? If your executor must be behind a proxy layer and has no control over that proxy's handling of load shedding, then that sounds like the core of the problem.

It feels like what you want is an intermediate service that knows about specific categories of errors coming out of your backend, and can decide which are presumed transient (= indefinitely retriable) and which are not. Since this logic is specific to the executor, and not to the client, I don't think it makes sense to put it into Bazel.

If I'm running Bazel and it's talking to a service that sends it several RateACL errors in a row, I would expect the build to fail. If the server doesn't want the client to go away, it should be more gentle -- e.g. by queuing requests, or having finer-grained rate limits on specific expensive RPCs.
 
I think we're disagreeing on terminology - I'm talking about rate limits here. (Internally we call these "Rate Quotas" in this context, as opposed to "Allocation Quotas", but they're equivalent to rate limits.)

RetryInfo does not suffice here, specifically because it has no way to indicate that retries should not be exhausted - all it contains is retry_delay, which says how long to wait before the next attempt. But the server can't predict the future and doesn't know how long the client should wait, so if it returned e.g. a flat "10s", then within a minute of hitting overload conditions some RPCs would still exhaust their allowed retries and in turn cause their builds to fail.

(In practice, with Bazel's default exponential-backoff logic, I can tell you that hitting the "queue-length upper bound" causes an instantaneous failure of ~all concurrent builds sending actions to that queue, and overload causing "RPC rate limit" errors causes somewhere between a few and all currently-running builds to fail, depending on the degree of overload. Hence my not wanting a limit on retries - I want ways to inform customers/clients of these conditions that are better than them learning of it through a major and indiscriminate outage.)
 
This definitely sounds more like a problem on the executor side (lack of separate rate-limits between clients, lack of graceful queuing on incoming requests).

IMO it's not up to the server to decide whether retries should be attempted. The client (Bazel) is in charge of deciding how long it should wait for capacity to become available, and if the service is shedding load, then it seems reasonable for the build to fail. If there are cases where this is not desired (you want a build that just keeps trying despite errors), that seems like something akin to Bazel's `--keep_going` flag.

I believe there's already a flag to control the retry count on remote execution RPCs; maybe it would make sense to make that capability more flexible, such as by allowing RESOURCE_EXHAUSTED to have a different number (including unlimited) of retries?
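
For instance, a per-status-code retry budget could be expressed with something like the following (purely hypothetical flag syntax and parsing, not an existing Bazel option):

```python
# Hypothetical flag value: "RESOURCE_EXHAUSTED=unlimited,UNAVAILABLE=5"
def parse_retry_budgets(flag: str, default: int = 5) -> dict:
    budgets = {}
    for entry in flag.split(","):
        code, _, count = entry.partition("=")
        budgets[code.strip()] = (
            float("inf") if count.strip() == "unlimited" else int(count))
    return budgets

# parse_retry_budgets("RESOURCE_EXHAUSTED=unlimited")["RESOURCE_EXHAUSTED"] -> inf
```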