Heya API folk,
I want to solicit your opinion: what should clients do upon receiving RESOURCE_EXHAUSTED errors from REAPI?
(I should note that whatever we decide here will probably be applied to bazel and goma at minimum, so this is not simply an academic exercise).
The specific scenario of interest to me is an API that returns RESOURCE_EXHAUSTED when some global rate limit is exceeded. In our case, we want to be able to return this error when a large client sends us too many requests in total, as that risks overwhelming our service entirely if left unchecked. But we want this to induce graceful backoff, not a state change from "fine" to "exploding".
The problem here is that global overload is rarely "slight" enough that we can rely on the usual retry semantics to mask it. Bazel, for example, retries RPCs 5 times by default. But if the overload spiked to, say, 15% over quota, every RPC attempt would have a baseline ~13% chance of being rejected (you have to shed 15/115 of the traffic), and at 5 attempts you'd expect an RPC to exhaust its retries (i.e. a build to fail outright) roughly once every 26k RPCs (0.13^5). Given large enough clients, you're now measuring user-visible build failures in QPS, which is nowhere close to "graceful degradation" in my book. (Users should ~never see unnecessary build failures, period.)
And that's assuming the overload is both statistically random and mild. If there's any skew - by location, by client, by which server you happened to connect to, whatever - the odds of user-visible build failures go way up even at 15% overload. And turning up a new workload without first increasing quota could easily put you at 200% of quota instead of 115%, at which point essentially every build is failing across both new and old workloads. Ouch.
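To make the arithmetic concrete, here's a quick back-of-the-envelope calculation. The uniform-random-rejection assumption is mine; 5 attempts is Bazel's default:

package main

// Back-of-the-envelope numbers for the above. Assumes the server sheds the
// excess uniformly at random (my assumption), so at `load` times quota each
// attempt is rejected with probability (load-1)/load, and an RPC fails
// outright only if all of its attempts are rejected.
import (
	"fmt"
	"math"
)

func main() {
	const attempts = 5 // Bazel's default retry budget
	for _, load := range []float64{1.15, 1.5, 2.0} { // load as a multiple of quota
		p := (load - 1) / load           // per-attempt rejection probability
		allFail := math.Pow(p, attempts) // every attempt rejected -> build failure
		fmt.Printf("%.0f%% of quota: reject %.1f%% per attempt, RPC exhausts retries ~1 in %.0f\n",
			load*100, p*100, 1/allFail)
	}
}

At 200% of quota that works out to roughly 1 dead RPC in 32, which is why essentially every build dies: any build issuing more than a few dozen RPCs is very likely to hit it.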
I'd like to say "clients should retry RESOURCE_EXHAUSTED indefinitely". I.e. if you manage to put yourself into overload conditions, your builds struggle along until they hit some higher-level timeout, hopefully giving you enough time to page someone before "slow" turns into "failing". Note that the alternative - failing builds and having them retried at a higher level - only makes things worse, since each retried build re-sends all of its RPCs into a service that's already overloaded. (There's a rough sketch of what I mean below, after the caveats.)
This has two complications:
- Discoverability: it won't be obvious from the client side why builds are suddenly slow. (Though clients could presumably log something to stdout to make it obvious.)
- gRPC: since this couldn't be too simple, gRPC also injects RESOURCE_EXHAUSTED errors of its own, for client memory exhaustion and for RPCs that exceed message size limits.
  - I've never personally seen the former happen, but I want to call it out;
  - the latter would presumably complicate debugging, but hopefully it wouldn't first manifest on real-world workloads in the wild?
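For concreteness, the behavior I have in mind on the client side is roughly the following. This is a gRPC-Go-flavored sketch only, not what Bazel or goma actually do today; the package/helper names and the backoff numbers are invented:

// Sketch only: not Bazel's or goma's actual retry code. Package and helper
// names (remoteclient, callWithBackoff) and the backoff constants are made up.
package remoteclient

import (
	"context"
	"log"
	"math/rand"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// callWithBackoff retries `call` indefinitely on RESOURCE_EXHAUSTED with
// capped, jittered exponential backoff and a loud warning, and hands every
// other outcome straight back to the normal, bounded retry logic. The only
// thing that ultimately stops it is the higher-level deadline on ctx.
func callWithBackoff(ctx context.Context, call func(context.Context) error) error {
	backoff := 100 * time.Millisecond
	const maxBackoff = 10 * time.Second
	for {
		err := call(ctx)
		if status.Code(err) != codes.ResourceExhausted {
			return err // success, or an error for the usual bounded retries
		}
		log.Printf("WARNING: server over quota (RESOURCE_EXHAUSTED); backing off %v", backoff)
		select {
		case <-ctx.Done():
			return ctx.Err() // the higher-level timeout wins in the end
		case <-time.After(backoff + time.Duration(rand.Int63n(int64(backoff)))):
		}
		backoff *= 2
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
}

The important properties are just that RESOURCE_EXHAUSTED never consumes the bounded retry budget, that it's loud in the logs, and that the thing which finally gives up is a higher-level timeout rather than the per-RPC retry count.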
Even so, I'd like to proceed as above. The status quo is that servers cannot safely return RESOURCE_EXHAUSTED without expecting to induce some level of client outage, which is a tough tradeoff when you want to gracefully protect your service from (probably short) traffic spikes. Conversely, the complications above at worst turn a build that was going to fail anyway into one that seems hung until you check the logs, and even that seems pretty unlikely in practice.
Any opinions? Absent objection, I'll add this as a clarification to the API and attempt to update Bazel to retry indefinitely (w/ warning) in practice. If you don't care either way, that's fine by me too :).
Thanks,
Eric