Incident report for excess 429 responses from the Google Logs in October

Kat Joyce

Nov 5, 2020, 7:51:01 AM
to Certificate Transparency Policy, certificate-...@googlegroups.com
Hi everyone,

Please find below our incident report for the higher-than-usual number of 429 responses that were seen from the Google Logs at specific points in October.

We welcome any questions and comments you may have.

Kind regards,
Kat and the CT Team at Google


CT Log Servers 429 Issue - Incident Report

Background and Root Cause

In mid-September 2020 we completed a migration to the currently recommended Google RPC client for all RPCs between our frontends and backends. This enabled a set of best-practice configuration options, including client-side request throttling.

This was rolled out in a phased manner and no problems were observed, though it now seems likely that there was an occasional small impact that went unnoticed at the time.

On October 15 (according to reports), and then definitely on October 19th/20th, we saw a higher-than-usual number of 429 responses returned for read requests. In itself this is an entirely normal occurrence, as we often receive large bursts of requests, usually for a short duration, as was the case here.

However, the client-side throttling treated backends that reported RESOURCE_EXHAUSTED as globally overloaded and unable to handle more traffic.

This meant that the frontend RPC client returned a status code that was mapped to 429, without the request reaching the backend server, and without regard for whether the quota bucket it would have accessed was really out of tokens. This created an incident where it was unpredictable which requests would be rejected incorrectly.
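For illustration only, the Go sketch below shows how an adaptive client-side throttle of the kind described in the SRE literature can fail in exactly this way. The names (adaptiveThrottle, k) and the accept-ratio formula are our assumptions about how such a throttle typically works, not the actual client implementation: because every RESOURCE_EXHAUSTED response is counted as "not accepted", a single exhausted quota bucket drags down the accept ratio and causes unrelated requests to be rejected locally before they ever reach the backend.

    // Package throttle is a minimal sketch of adaptive client-side throttling,
    // assuming a mechanism resembling max(0, (requests - k*accepts)/(requests+1)).
    package throttle

    import (
    	"math/rand"
    	"sync"
    )

    type adaptiveThrottle struct {
    	mu       sync.Mutex
    	requests float64 // all attempts in the window, including locally rejected ones
    	accepts  float64 // attempts the backend handled without RESOURCE_EXHAUSTED
    	k        float64 // headroom multiplier, typically around 2
    }

    // allow decides, before the RPC is sent, whether to reject it locally.
    // A locally rejected request is what the frontend surfaced as an HTTP 429.
    func (t *adaptiveThrottle) allow() bool {
    	t.mu.Lock()
    	defer t.mu.Unlock()
    	pReject := (t.requests - t.k*t.accepts) / (t.requests + 1)
    	if pReject < 0 {
    		pReject = 0
    	}
    	return rand.Float64() >= pReject
    }

    // record is called after each attempt. The failure mode described above:
    // a RESOURCE_EXHAUSTED caused by one empty quota bucket is counted the same
    // as a globally overloaded backend, so accepts stops growing and allow()
    // starts rejecting unrelated requests.
    func (t *adaptiveThrottle) record(resourceExhausted bool) {
    	t.mu.Lock()
    	defer t.mu.Unlock()
    	t.requests++
    	if !resourceExhausted {
    		t.accepts++
    	}
    }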

The mitigation was to disable client-side throttling while a longer-term solution is prepared.

Summary Timeline


2020-10-19 (All times in US/Pacific)

02:35 PT (based on external report from Let's Encrypt) <OUTAGE BEGINS>

02:35 PT requests intermittently receive 429 errors <IMPACT BEGINS>


2020-10-20 (All times in US/Pacific)

00:41 PT overnight email (in UK time) is read <DETECTION TIME>

00:41 PT investigation begins <ESCALATION TIME>

04:58 Code change submitted disabling client-side throttling via flags change

05:10 Second external report of impact received

05:11 Team reviews timelines and discusses options for deploying the mitigation

08:39 Deployment of mitigation to CI environment begins

09:30 Deployment of mitigation to Staging environment begins

10:12 Deployment of mitigation to EU production region begins

10:13 Deployment scheduled for US production region (for 4 hours later)

11:03 EU push completed

15:05 US push completed <IMPACT ENDS>

15:05 PT Impact is over, issue is resolved <OUTAGE ENDS>

Impact

October 19th Write Requests

Read requests have been excluded from this analysis, as it was not easily possible to distinguish requests that were rejected because of this issue from those that would have been denied for quota reasons.

The figures below will not be completely accurate, but should be a reasonable guide to the impact of the issue. For example, some of the affected requests might have been rejected for other reasons had they reached the server, e.g. failure to chain to an accepted root.


Endpoint | Total Requests | Affected Requests
/HTTP.argon2020.AddChain | 843 | 0
/HTTP.argon2020.AddPreChain | 38407 | 12
/HTTP.argon2021.AddChain | 1587419 | 1531
/HTTP.argon2021.AddPreChain | 1130623 | 2065
/HTTP.argon2022.AddChain | 167 | 0
/HTTP.argon2022.AddPreChain | 1132 | 0
/HTTP.argon2023.AddChain | 5 | 0
/HTTP.argon2023.AddPreChain | 254 | 0
/HTTP.icarus.AddChain | 223 | 0
/HTTP.icarus.AddPreChain | 142 | 1
/HTTP.pilot.AddChain | 1963 | 1
/HTTP.pilot.AddPreChain | 337 | 0
/HTTP.rocketeer.AddChain | 2289 | 1
/HTTP.rocketeer.AddPreChain | 9045 | 175
/HTTP.skydiver.AddChain | 101 | 1
/HTTP.skydiver.AddPreChain | 9045 | 0
/HTTP.xenon2020.AddChain | 293628 | 9
/HTTP.xenon2020.AddPreChain | 36490 | 600
/HTTP.xenon2021.AddChain | 2206649 | 25404
/HTTP.xenon2021.AddPreChain | 1658507 | 20958
/HTTP.xenon2022.AddChain | 238 | 2
/HTTP.xenon2022.AddPreChain | 1158 | 0
/HTTP.xenon2023.AddChain | 4 | 0
/HTTP.xenon2023.AddPreChain | 306 | 0
Total | 6978975 | 50760
Percent Affected | | 0.73%


October 20th Write Requests



Endpoint | Total Requests | Affected Requests
/HTTP.argon2018.AddChain | 2 | 0
/HTTP.argon2019.AddChain | 1 | 0
/HTTP.argon2020.AddChain | 830 | 0
/HTTP.argon2020.AddPreChain | 34183 | 1
/HTTP.argon2021.AddChain | 1565035 | 0
/HTTP.argon2021.AddPreChain | 1317147 | 1
/HTTP.argon2022.AddChain | 167 | 0
/HTTP.argon2022.AddPreChain | 1359 | 0
/HTTP.argon2023.AddChain | 11 | 0
/HTTP.argon2023.AddPreChain | 191 | 0
/HTTP.icarus.AddChain | 247 | 0
/HTTP.icarus.AddPreChain | 138 | 0
/HTTP.pilot.AddChain | 1216 | 0
/HTTP.pilot.AddPreChain | 332 | 0
/HTTP.rocketeer.AddChain | 128 | 0
/HTTP.rocketeer.AddPreChain | 10034 | 52
/HTTP.skydiver.AddChain | 101 | 0
/HTTP.skydiver.AddPreChain | 9695 | 0
/HTTP.xenon2020.AddChain | 295488 | 2
/HTTP.xenon2020.AddPreChain | 35638 | 164
/HTTP.xenon2021.AddChain | 1930465 | 8209
/HTTP.xenon2021.AddPreChain | 1574050 | 6160
/HTTP.xenon2022.AddChain | 237 | 0
/HTTP.xenon2022.AddPreChain | 1425 | 0
/HTTP.xenon2023.AddChain | 4 | 0
/HTTP.xenon2023.AddPreChain | 234 | 0
Total | 6778358 | 14589
Percent Affected | | 0.22%


Followup Actions

We will be implementing a longer-term solution to prevent this from happening again while still being able to use client-side throttling. This requires code changes and testing/QA before it can be deployed.
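The design of that fix is not described here, but one possible shape of it, sketched below purely as an assumption, is to keep the throttle while only counting responses that indicate genuine backend overload against it, so that per-quota-bucket RESOURCE_EXHAUSTED denials no longer trip it. The quotaDenial parameter is hypothetical and stands in for whatever signal (error details, a dedicated code, a response field) a real fix would use; the gRPC-style status API is also an assumption for illustration.

    // Package throttlefix sketches one possible longer-term approach (an
    // assumption, not the team's published design): feed the client-side
    // throttle only errors that signal true backend overload.
    package throttlefix

    import (
    	"google.golang.org/grpc/codes"
    	"google.golang.org/grpc/status"
    )

    // countsTowardsThrottle reports whether an RPC error should be treated as
    // a sign of backend overload. A quota-bucket denial is treated as a
    // normally handled response: the backend did its job, so it should not
    // push the client toward rejecting unrelated traffic.
    func countsTowardsThrottle(err error, quotaDenial bool) bool {
    	if err == nil || status.Code(err) != codes.ResourceExhausted {
    		return false
    	}
    	return !quotaDenial
    }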

While we do not believe that general alerting on client errors (4XX status) is worthwhile, a specific alert has been added to our monitoring for cases where the RPC client reports issues sending RPCs to our backends.
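As a rough illustration of the kind of signal such an alert can be built on (assuming a Prometheus-style metrics stack, which the report does not specify), RPCs rejected locally by the client can be counted separately from backend-issued 429s and alerted on independently:

    // Package monitoring is a minimal sketch of a metric that makes
    // client-side RPC rejections visible separately from backend quota 429s.
    package monitoring

    import "github.com/prometheus/client_golang/prometheus"

    var clientSideRejections = prometheus.NewCounterVec(
    	prometheus.CounterOpts{
    		Name: "frontend_rpc_client_rejections_total",
    		Help: "RPCs rejected by the frontend RPC client before reaching a backend.",
    	},
    	[]string{"backend", "reason"},
    )

    func init() {
    	prometheus.MustRegister(clientSideRejections)
    }

    // RecordClientRejection is called whenever the RPC client refuses to send
    // a request (e.g. its throttle trips), so locally caused 429s show up in
    // monitoring and can trigger an alert.
    func RecordClientRejection(backend, reason string) {
    	clientSideRejections.WithLabelValues(backend, reason).Inc()
    }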

