## Problem Summary
Duration of problem: 61 minutes
% of product affected: Blocked 182 IPs (27%) / 3,288,342 requests (85%)
Impact: Five organizations, including Google, experienced failures in calls to addprechain for some of their requests for a period of 1 hour.
SLO violation duration: 0:00:00 (Log uptime > 99%)
## Background
Traffic intended for Google's services is terminated at the Google Front End (GFE), the GFE performs a number of operations before forwarding traffic to the corresponding service backends. One of the operations the GFEs perform is Denial of Service (DoS) protection.
Like many other large organizations, Google uses automated systems to protect against traffic overload and Denial of Service attacks. The protective action that these systems take are designed to reduce the impact of such attacks on legitimate users and the service backends themselves. These systems are generally heuristically activated based on characteristics of observed traffic. See https://cloud.google.com/security/infrastructure/design/ for more details on Google's GFE and DoS protection infratstructure.
Until recently, Google maintained Certificate Transparency logs using two different backend technologies: one C++ based (cert-log) which held the full data structure in memory, and one golang based (Trillian) which holds the data structure in [Spanner](https://cloud.google.com/spanner/).
As the rate of certificate issuance in the TLS ecosystem accelerated, the cert-log based logs became less viable. As a result, the team decided to migrate these legacy C++ based logs onto our new, more scalable, Trillian infrastructure to mitigate the chance of failure. Once the associated data had been migrated, we began to migrate the traffic, and the associated traffic labels from the old services to the new Trillian service in preparation for a full turn down of the C++ infrastructure.
As a result, this unified traffic label carried the corresponding increase in traffic which exceeded configured thresholds and which the automated DoS detection system interpreted as an attack. The DoS system subsequently blocked some users of the Trillian based logs until a manual intervention overrode the DoS system.
## Impact
Over the course of one hour, we served 3,474,280 `503 Service Unavailable` errors to 182 IPs. This accounted for 85% of requests and 27% of IPs respectively. 211,528 of the `503'`s were for `add-pre-chain` requests and 478,222 were for `add-chain` requests.
It should be noted that it is not possible to measure the exact number of distinctly affected pre-certificate logging requests since CT clients will attempt to log the same pre-certificate to multiple logs with multiple retries.
The `503`'s reduced the rate at which affected organizations could issue TLS certificates with embedded SCTs because of the Chrome requirement for TLS certificates to be included in at least one Google log.
Clients which did not implement backoff-on-failure retried rapidly, both multiplying failure counts and increasing their DoS attack blockage rate.
Due to the compliance monitor's steady traffic characteristics, its requests were not blocked by the DoS service and therefore the observed availability metric remained unaffected.
## Root Causes and Trigger
The traffic routing labels for Skydiver, Pilot, Rocketer, Icarus, and Daedalus were redirected to a common Trillian label without an accompanying change in the allowed DoS threshold.
At 2pm GMT on Friday, Nov 30, the GFE deployment containing all these changes started rolling out. Since changes to the GFE can impact many services, such changes are staged and rolled out over several weeks to various geographic locations.
At 5pm, three daily cron jobs launched, each of which can generate 1000qps of `add-chain` requests populating the logs with the certificates that the Google crawler has encountered on the open internet. These jobs had been moved to end-of-day months earlier in order to leave room for production maintenance during work-hours.
At 6pm, the traffic from these jobs, which was previously spread across several separate traffic labels, together produced a combined 2.8k qps under the single Trillian traffic label, exceeding the configured DoS threshold, and consequently put the service into DoS prevention mode.
Much of the `add-[pre-]chain` traffic had characteristics which are considered suspicious by DoS server when in DoS prevention mode. We observed that steady, higher volume clients that implemented backoff fared better than others who were low-volume and/or didn't implement client backoff.
## Lessons Learned
* Don't launch large jobs at 5pm on a Friday, even "safe" cron jobs.
* Load test services whenever significant service structure changes are made, to ensure DoS thresholds remain realistic.
* Certificate Transparency clients should implement exponential backoff and retry to reduce chances of accidental DDoS traffic limitations in their use of CT Logs.
### Things that went well:
* Traffic generators were disabled within 30-40mins after detection time.
* DoS defense mode was manually deactivated 40 min early.
## Timeline:
2018-11-27
13:00 PST GFE configuration starts rollout redirecting traffic from the Skydiver, Pilot, Rocketeer, Icarus, and Daedalus traffic labels to the Trillian one.
2018-11-30
06:40 PST GFE configuration rollout completed.
09:00 PST Borgcron starts the preloaders for Pilot, Rocketeer, and Submariner.
10:00 PST Preloader traffic ramps up past the DoS threshold
10:03 PST DoS Server enters attack-mode <OUTAGE BEGINS>
10:03 PST Oncall paged for cert-log-trillian-us in DoS attack mode <DETECTION TIME>
10:07 PST Oncall backup advises oncall that the DOS attack is "normal", due to the GFE rollout having completed and most of the traffic now going to the single Trillian traffic label, and that the DoS configuration probably needs to be tweaked accordingly (4 preloaders were running, for Argon, Pilot, Rocketeer, and Submariner, and they're set for 1000 qps each).
10:31 PST Submariner preloader killed
10:41 PST Org 1 sends email and reaches out via Slack <ESCALATION TIME>
10:43 PST Pilot preloader killed
10:44 PST Digicert sends email
10:45 PST Rocketeer preload killed
10:47 PST Argon preloader killed
10:54 PST Oncall asks for help ending attack mode.
11:02 PST DoS Server exits attack mode <OUTAGE ENDS>
11:04 PST DoS capacity change is submitted
11:05 PST Preloader jobs for Pilot, Rocketeer & Submariner are disabled
11:53 PST Email to ct-policy explaining the outage <PUBLIC COMMUNICATION>
## Blocked Requests
The following organizations received `503`'s for `add-pre-chain:`
| org | endpoint | 503s | % blocked |
|-------|-------------|--------|-----------|
| Org 1 | addprechain | 231154 | 84% |
| Org 2 | addprechain | 1091 | 18% |
| Org 3 | addprechain | 525 | 8% |
| Org 4 | addprechain | 74 | 61% |
| Org 5 | addprechain | 2 | 0% |
The following organizations received `503`'s for `add-pre-chain`, broken down by log:
| org | log | endpoint | 503s |
|-------|-----------|-------------|-------|
| Org 1 | argon2019 | addprechain | 47949 |
| Org 1 | icarus | addprechain | 45624 |
| Org 1 | skydiver | addprechain | 45034 |
| Org 1 | pilot | addprechain | 45025 |
| Org 1 | rocketeer | addprechain | 44962 |
| Org 1 | testtube | addprechain | 2303 |
| Org 1 | argon2020 | addprechain | 1205 |
| Org 2 | argon2018 | addprechain | 551 |
| Org 2 | skydiver | addprechain | 548 |
| Org 3 | skydiver | addprechain | 525 |
| Org 1 | argon2021 | addprechain | 122 |
| Org 1 | argon2018 | addprechain | 65 |
| Org 4 | argon2019 | addprechain | 8 |
| Org 4 | skydiver | addprechain | 8 |
| Org 4 | rocketeer | addprechain | 8 |
| Org 4 | icarus | addprechain | 8 |
| Org 4 | pilot | addprechain | 7 |
| Org 4 | argon2020 | addprechain | 7 |
| Org 4 | argon2018 | addprechain | 7 |
| Org 4 | argon2021 | addprechain | 5 |
| Org 4 | xenon2019 | addprechain | 5 |
| Org 4 | xenon2021 | addprechain | 5 |
| Org 4 | xenon2020 | addprechain | 5 |
| Org 4 | xenon2018 | addprechain | 4 |
| Org 2 | testtube | addprechain | 4 |
| Org 4 | xenon2022 | addprechain | 1 |
| Org 5 | rocketeer | addprechain | 1 |
| Org 5 | pilot | addprechain | 1 |
The following logs served `503`'s:
| log | endpoint | 503s | % blocked |
|------------|----------------|--------|-----------|
| skydiver | addchain | 233838 | 71% |
| argon2019 | addchain | 206221 | 39% |
| argon2018 | addchain | 20722 | 32% |
| testtube | addchain | 11313 | 92% |
| argon2020 | addchain | 6458 | 26% |
| argon2021 | addchain | 631 | 32% |
| rocketeer | addchain | 8 | 0% |
| pilot | addchain | 7 | 0% |
| xenon2019 | addchain | 6 | 67% |
| xenon2020 | addchain | 6 | 67% |
| xenon2021 | addchain | 5 | 56% |
| xenon2018 | addchain | 4 | 67% |
| icarus | addchain | 3 | 30% |
| argon2019 | addprechain | 46893 | 73% |
| skydiver | addprechain | 45379 | 79% |
| icarus | addprechain | 44498 | 73% |
| pilot | addprechain | 44318 | 90% |
| rocketeer | addprechain | 44290 | 92% |
| testtube | addprechain | 2744 | 83% |
| argon2020 | addprechain | 1263 | 72% |
| argon2018 | addprechain | 595 | 19% |
| argon2021 | addprechain | 121 | 72% |
| xenon2019 | addprechain | 5 | 45% |
| xenon2020 | addprechain | 5 | 56% |
| xenon2021 | addprechain | 4 | 57% |
| xenon2018 | addprechain | 3 | 50% |
| xenon2022 | addprechain | 1 | 50% |
| argon2019 | getentries | 556713 | 89% |
| rocketeer | getentries | 485584 | 92% |
| pilot | getentries | 421160 | 87% |
| argon2018 | getentries | 280597 | 92% |
| icarus | getentries | 261129 | 91% |
| skydiver | getentries | 254990 | 93% |
| argon2021 | getentries | 4766 | 64% |
| aviator | getentries | 4712 | 62% |
| argon2020 | getentries | 82 | 3% |
| xenon2021 | getentries | 53 | 34% |
| xenon2018 | getentries | 43 | 3% |
| xenon2020 | getentries | 43 | 27% |
| testtube | getentries | 33 | 2% |
| xenon2019 | getentries | 30 | 7% |
| xenon2022 | getentries | 2 | 5% |
| skydiver | getproofbyhash | 9107 | 86% |
| pilot | getproofbyhash | 1477 | 84% |
| rocketeer | getproofbyhash | 1454 | 87% |
| argon2019 | getproofbyhash | 96 | 89% |
| icarus | getproofbyhash | 3 | 30% |
| argon2018 | getroots | 36 | 48% |
| argon2019 | getroots | 33 | 48% |
| xenon2021 | getroots | 30 | 48% |
| pilot | getroots | 21 | 36% |
| skydiver | getroots | 21 | 37% |
| xenon2019 | getroots | 21 | 39% |
| argon2017 | getroots | 20 | 38% |
| aviator | getroots | 20 | 36% |
| icarus | getroots | 20 | 36% |
| xenon2020 | getroots | 20 | 38% |
| xenon2022 | getroots | 20 | 38% |
| argon2020 | getroots | 19 | 34% |
| argon2021 | getroots | 19 | 35% |
| rocketeer | getroots | 19 | 35% |
| xenon2018 | getroots | 19 | 37% |
| argon2022 | getroots | 3 | 38% |
| daedalus | getroots | 3 | 30% |
| testtube | getroots | 3 | 5% |
| argon2019 | getsth | 1088 | 13% |
| icarus | getsth | 1059 | 12% |
| pilot | getsth | 1032 | 8% |
| argon2018 | getsth | 1016 | 11% |
| argon2020 | getsth | 1001 | 15% |
| rocketeer | getsth | 992 | 10% |
| skydiver | getsth | 991 | 11% |
| argon2021 | getsth | 967 | 13% |
| aviator | getsth | 923 | 13% |
| argon2017 | getsth | 675 | 7% |
| xenon2019 | getsth | 673 | 11% |
| xenon2020 | getsth | 671 | 16% |
| xenon2018 | getsth | 669 | 7% |
| xenon2022 | getsth | 661 | 14% |
| daedalus | getsth | 658 | 22% |
| xenon2021 | getsth | 629 | 17% |
| argon2022 | getsth | 583 | 13% |
| testtube | getsth | 557 | 11% |
| solera2022 | getsth | 547 | 13% |
| solera2021 | getsth | 545 | 13% |
| solera2019 | getsth | 544 | 13% |
| solera2020 | getsth | 535 | 13% |
| crucible | getsth | 534 | 12% |
| solera2018 | getsth | 529 | 12% |
| icarus2019 | getsth | 2 | 100% |