In mid-September 2020 we completed a migration to the currently recommended Google RPC client for all RPCs between our frontends and backends. This enabled a set of best-practice configuration options, including client-side request throttling.
This was rolled out in a phased manner and no problems were observed, though it now seems likely that there was an occasional small impact that went unnoticed at the time.
On October 15 (according to external reports), and then definitely on October 19–20, we saw a higher-than-usual number of 429 responses returned for read requests. Elevated 429 rates are in themselves entirely normal: we regularly receive large bursts of requests, usually of short duration, as was the case here.
However, the client-side throttling reacted by treating backends reporting RESOURCE_EXHAUSTED as globally overloaded and unable to handle more traffic.
As a result, the frontend RPC client returned a status code mapped to 429 without the request ever reaching the backend server, and without regard for whether the quota bucket the request would have drawn from was actually out of tokens. This made it unpredictable which requests would be incorrectly rejected.
The mitigation was to disable client-side throttling while a longer-term solution is prepared.
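The report does not spell out the throttling algorithm the RPC client uses. As a sketch only, the adaptive client-side throttling approach Google has described publicly (in the SRE book) illustrates the failure mode: the client tracks recent requests and backend accepts, and locally rejects a growing fraction of traffic as the accept rate falls. All names and parameters below (`AdaptiveThrottle`, `k`, `window`) are illustrative, not taken from the report.

```python
import random
from collections import deque

class AdaptiveThrottle:
    """Sketch of adaptive client-side throttling: locally reject a fraction
    of requests when the backend appears overloaded, based on the ratio of
    recent requests to recent backend accepts."""

    def __init__(self, k=2.0, window=1000):
        self.k = k                            # aggressiveness multiplier
        self.history = deque(maxlen=window)   # True = accepted by backend

    @property
    def requests(self):
        return len(self.history)

    @property
    def accepts(self):
        return sum(self.history)

    def reject_probability(self):
        # SRE-book formula: max(0, (requests - k * accepts) / (requests + 1))
        return max(0.0, (self.requests - self.k * self.accepts) / (self.requests + 1))

    def should_reject_locally(self):
        return random.random() < self.reject_probability()

    def record(self, accepted):
        self.history.append(accepted)

throttle = AdaptiveThrottle(k=2.0)
# A burst of legitimately rate-limited requests: the backend returns
# RESOURCE_EXHAUSTED, which the throttle counts as "not accepted".
for _ in range(500):
    throttle.record(accepted=False)
# The throttle now rejects most traffic locally -- including requests whose
# quota buckets still have tokens, because it has no per-bucket view.
```

The key point for this incident is that such a throttle is keyed per backend, not per quota bucket, so RESOURCE_EXHAUSTED responses caused by one heavily rate-limited caller drive up the local rejection probability for all callers of that backend.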
2020-10-19 (All times in US/Pacific)
02:35 PT (based on external report from Let's Encrypt) <OUTAGE BEGINS>
02:35 PT requests intermittently receive 429 errors <IMPACT BEGINS>
2020-10-20 (All times in US/Pacific)
00:41 PT overnight email (in UK time) is read <DETECTION TIME>
00:41 PT investigation begins <ESCALATION TIME>
04:58 PT Code change submitted disabling client-side throttling via a flags change
05:10 PT Second external report of impact received
05:11 PT Team reviews timelines and options for deploying the mitigation
08:39 PT Deployment of mitigation to CI environment begins
09:30 PT Deployment of mitigation to Staging environment begins
10:12 PT Deployment of mitigation to EU production region begins
10:13 PT Deployment scheduled for US production region (for 4 hours later)
11:03 PT EU push completed
15:05 PT US push completed <IMPACT ENDS>
15:05 PT Impact is over, issue is resolved <OUTAGE ENDS>
Read requests have been excluded from the analysis, as it was not easily possible to distinguish requests rejected because of this issue from those that would have been denied for quota reasons anyway.
The figures below will not be completely accurate but should be a reasonable guide to the impact of the issue. For example, some of the affected requests might have been rejected for other reasons had they reached the server, e.g. failure to chain to an accepted root.
We will implement a longer-term solution that prevents this from happening again while still allowing client-side throttling to be used. This requires code changes and testing/QA before it can be deployed.
While we do not believe that general alerts on client errors (4XX status) are worthwhile, a specific alert has been added to our monitoring that fires if the RPC client reports problems sending RPCs to our backends.
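The report does not name the monitoring stack or the alert's exact condition. As a hypothetical illustration only, the distinction being drawn is between backend-returned 4XX responses and failures reported by the RPC client itself (requests that never reached a backend); the function name and threshold below are invented for the sketch.

```python
def should_alert(client_send_failures, total_rpcs, threshold=0.01):
    """Fire when the RPC client itself fails to send a meaningful fraction
    of RPCs to backends (e.g. local throttling), independent of the general
    backend 4XX rate. Names and threshold are illustrative, not from the
    report."""
    if total_rpcs == 0:
        return False
    return client_send_failures / total_rpcs >= threshold
```

Alerting on client-reported send failures rather than on overall 4XX volume would have distinguished this incident (requests rejected locally) from ordinary bursts of legitimate quota rejections.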