Temporary outage of Google CT Logs


Gary Belvin

Nov 30, 2018, 2:53:56 PM
to ct-p...@chromium.org, certificate-...@googlegroups.com
At 10:03 PST some services went into a DoS protection mode, resulting in 503's being served to some clients until 11:02 PST when the issue was resolved. 

We will be following up with a more detailed post-mortem next week. 

Best, 
Gary Belvin

Affected logs:
argon*
icarus
pilot
rocketeer
skydiver
submariner
solera*
xenon*
crucible
testtube


--
Gary Belvin | Software Engineer | g...@google.com | Security Team

Ryan Sleevi

Dec 3, 2018, 7:19:03 AM
to g...@google.com, ct-p...@chromium.org
Gary,

Do you have further updates on this? On the scope of the issue, the time to resolution, the scale of impact? Separating out post-mortem from incident response seems useful here, and it's unclear if the incident has been resolved.

--
You received this message because you are subscribed to the Google Groups "Certificate Transparency Policy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ct-policy+...@chromium.org.
To post to this group, send email to ct-p...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/ct-policy/CADgN-wqhQA6zku2a2_8gh%2BM%2BdfsJ-9LNe6H3tNFanA1HY%3Dj7rg%40mail.gmail.com.

Al Cutter

Dec 3, 2018, 7:49:14 AM
to rsl...@chromium.org, Gary Belvin, ct-p...@chromium.org
Hi Ryan,

Gary's original message called out that the incident had been resolved:
> ... until 11:02 PST when the issue was resolved.
I'm not sure if you're saying that some folks are still experiencing issues or if you just missed that?

Anyway, the team continues to work on gathering information for the internal post-mortem, and we're hoping to publish the external version by the end of the week.

Cheers,
Al.

Gary Belvin

Dec 10, 2018, 10:08:49 AM
to Al Cutter, rsl...@chromium.org, ct-p...@chromium.org
Dear ct-policy readers,  

Please find the post-mortem for the Nov 30 incident below. 

Sincerely,
Gary Belvin
Google CT Team 


# Postmortem: Preloader Induced DoS Defense Mode

## Problem Summary
Duration of problem:  61 minutes 
% of product affected:  Blocked 182 IPs (27%) / 3,288,342 requests (85%)
Impact: Five organizations, including Google, experienced failures in calls to `add-pre-chain` for some of their requests over a period of one hour.
SLO violation duration: 0:00:00 (Log uptime > 99%)

## Background
Traffic intended for Google's services is terminated at the Google Front End (GFE), which performs a number of operations before forwarding traffic to the corresponding service backends.  One of these operations is Denial of Service (DoS) protection.

Like many other large organizations, Google uses automated systems to protect against traffic overload and Denial of Service attacks.  The protective actions these systems take are designed to reduce the impact of such attacks on legitimate users and on the service backends themselves. These systems are generally activated heuristically based on characteristics of observed traffic. See https://cloud.google.com/security/infrastructure/design/ for more details on Google's GFE and DoS protection infrastructure.

Until recently, Google maintained Certificate Transparency logs using two different backend technologies: one C++ based (cert-log) which held the full data structure in memory, and one golang based (Trillian) which holds the data structure in [Spanner](https://cloud.google.com/spanner/).

As the rate of certificate issuance in the TLS ecosystem accelerated, the cert-log based logs became less viable. As a result, the team decided to migrate these legacy C++ based logs onto our new, more scalable, Trillian infrastructure to mitigate the chance of failure.  Once the associated data had been migrated, we began to migrate the traffic, and the associated traffic labels from the old services to the new Trillian service in preparation for a full turn down of the C++ infrastructure. 

As a result, this unified traffic label carried the combined traffic of the old labels, which exceeded the configured thresholds and which the automated DoS detection system interpreted as an attack. The DoS system subsequently blocked some users of the Trillian-based logs until a manual intervention overrode it.
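The interaction between label consolidation and a per-label threshold can be sketched with a toy model (the threshold value, label names, and detection logic here are purely illustrative; Google's real DoS systems are internal and far more sophisticated):

```python
THRESHOLD_QPS = 2000  # hypothetical per-label limit, not a real Google value

def attack_mode_labels(requests_per_second):
    """Given {traffic_label: qps}, return the labels that would trip
    a simple per-label DoS threshold."""
    return {label for label, qps in requests_per_second.items()
            if qps > THRESHOLD_QPS}

# Before the migration: the same traffic split across per-log labels,
# so no single label exceeds the threshold.
before = {"pilot": 1000, "rocketeer": 1000, "submariner": 800}

# After the migration: the combined 2.8k qps lands under one label
# and crosses the threshold, triggering DoS defense mode.
after = {"trillian": sum(before.values())}
```

The per-label totals are unchanged in aggregate; only the grouping differs, which is why the consolidation alone was enough to trip detection.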

## Impact
Over the course of one hour, we served 3,474,280 `503 Service Unavailable` errors to 182 IPs. This accounted for 85% of requests and 27% of IPs, respectively.  211,528 of the `503`s were for `add-pre-chain` requests and 478,222 were for `add-chain` requests.

It should be noted that it is not possible to measure the exact number of distinctly affected pre-certificate logging requests since CT clients will attempt to log the same pre-certificate to multiple logs with multiple retries. 

The `503`s reduced the rate at which affected organizations could issue TLS certificates with embedded SCTs, because Chrome requires TLS certificates to be included in at least one Google log.

Clients which did not implement backoff-on-failure retried rapidly, both multiplying failure counts and increasing their DoS attack blockage rate. 

Due to the compliance monitor's steady traffic characteristics, its requests were not blocked by the DoS service and therefore the observed availability metric remained unaffected. 


## Root Causes and Trigger
The traffic routing labels for Skydiver, Pilot, Rocketeer, Icarus, and Daedalus were redirected to a common Trillian label without an accompanying change to the allowed DoS threshold.

At 2pm GMT on Friday, Nov 30, the GFE deployment containing all these changes started rolling out. Since changes to the GFE can impact many services, such changes are staged and rolled out over several weeks to various geographic locations.

At 5pm, three daily cron jobs launched, each of which can generate 1,000 qps of `add-chain` requests, populating the logs with the certificates that the Google crawler has encountered on the open internet. These jobs had been moved to end-of-day months earlier in order to leave room for production maintenance during working hours.

At 6pm, the traffic from these jobs, which was previously spread across several separate traffic labels, produced a combined 2.8k qps under the single Trillian traffic label, exceeding the configured DoS threshold and putting the service into DoS prevention mode.

Much of the `add-[pre-]chain` traffic had characteristics that the DoS server considers suspicious when in DoS prevention mode.  We observed that steady, higher-volume clients that implemented backoff fared better than clients that were low-volume and/or did not implement backoff.


## Lessons Learned
*   Don't launch large jobs at 5pm on a Friday, even "safe" cron jobs.
*   Load test services whenever significant service structure changes are made, to ensure DoS thresholds remain realistic. 
*   Certificate Transparency clients should implement exponential backoff and retry to reduce chances of accidental DDoS traffic limitations in their use of CT Logs.
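The last lesson can be illustrated with a minimal client-side sketch. This is not the submission logic of any particular CT client; `submit` is a stand-in for whatever function performs the `add-pre-chain` POST, and `OSError` stands in for whatever retryable error it raises:

```python
import random
import time

def backoff_delays(max_retries=5, base=1.0, cap=60.0):
    """Yield capped exponential backoff delays in seconds, one per attempt."""
    for attempt in range(max_retries):
        yield min(cap, base * (2 ** attempt))

def submit_with_backoff(submit, max_retries=5, sleep=time.sleep):
    """Call submit() until it succeeds, sleeping with jittered backoff.

    submit() should return the SCT on success and raise a retryable
    error (e.g. on an HTTP 503) otherwise.
    """
    last_err = None
    for delay in backoff_delays(max_retries):
        try:
            return submit()
        except OSError as err:
            last_err = err
            # Jitter spreads retries out so blocked clients don't
            # all hammer the log in lockstep (a thundering herd).
            sleep(delay * random.uniform(0.5, 1.5))
    raise last_err
```

A client structured this way would have retried the blocked submissions over minutes rather than multiplying its own failure count, which in this incident also raised a client's chance of being classified as attack traffic.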

### Things that went well:
*   Traffic generators were disabled within 30-40 minutes of detection.
*   DoS defense mode was manually deactivated 40 min early.


## Timeline:

2018-11-27
    13:00 PST GFE configuration starts rollout redirecting traffic from the Skydiver, Pilot, Rocketeer, Icarus, and Daedalus traffic labels to the Trillian one.

2018-11-30
    06:40 PST GFE configuration rollout completed.
    09:00 PST Borgcron starts the preloaders for Pilot, Rocketeer, and Submariner.
    10:00 PST Preloader traffic ramps up past the DoS threshold
    10:03 PST DoS Server enters attack-mode <OUTAGE BEGINS>
    10:03 PST Oncall paged for cert-log-trillian-us in DoS attack mode <DETECTION TIME>
    10:07 PST Oncall backup advises oncall that the DOS attack is "normal", due to the GFE rollout having completed and most of the traffic now going to the single Trillian traffic label, and that the DoS configuration probably needs to be tweaked accordingly (4 preloaders were running, for Argon, Pilot, Rocketeer, and Submariner, and they're set for 1000 qps each).
    10:31 PST Submariner preloader killed
    10:41 PST Org 1 sends email and reaches out via Slack <ESCALATION TIME>
    10:43 PST Pilot preloader killed
    10:44 PST Digicert sends email
    10:45 PST Rocketeer preloader killed
    10:47 PST Argon preloader killed
    10:54 PST Oncall asks for help ending attack mode.
    11:02 PST DoS Server exits attack mode <OUTAGE ENDS>
    11:04 PST DoS capacity change is submitted
    11:05 PST Preloader jobs for Pilot, Rocketeer & Submariner are disabled
    11:53 PST Email to ct-policy explaining the outage <PUBLIC COMMUNICATION>

## Blocked Requests

The following organizations received `503`s for `add-pre-chain`:

|  org  |  endpoint   |  503s  | % blocked |
|-------|-------------|--------|-----------|
| Org 1 | addprechain | 231154 | 84%       |
| Org 2 | addprechain |   1091 | 18%       |
| Org 3 | addprechain |    525 | 8%        |
| Org 4 | addprechain |     74 | 61%       |
| Org 5 | addprechain |      2 | 0%        |

The following organizations received `503`s for `add-pre-chain`, broken down by log:

|  org  |    log    |  endpoint   | 503s  |
|-------|-----------|-------------|-------|
| Org 1 | argon2019 | addprechain | 47949 |
| Org 1 | icarus    | addprechain | 45624 |
| Org 1 | skydiver  | addprechain | 45034 |
| Org 1 | pilot     | addprechain | 45025 |
| Org 1 | rocketeer | addprechain | 44962 |
| Org 1 | testtube  | addprechain |  2303 |
| Org 1 | argon2020 | addprechain |  1205 |
| Org 2 | argon2018 | addprechain |   551 |
| Org 2 | skydiver  | addprechain |   548 |
| Org 3 | skydiver  | addprechain |   525 |
| Org 1 | argon2021 | addprechain |   122 |
| Org 1 | argon2018 | addprechain |    65 |
| Org 4 | argon2019 | addprechain |     8 |
| Org 4 | skydiver  | addprechain |     8 |
| Org 4 | rocketeer | addprechain |     8 |
| Org 4 | icarus    | addprechain |     8 |
| Org 4 | pilot     | addprechain |     7 |
| Org 4 | argon2020 | addprechain |     7 |
| Org 4 | argon2018 | addprechain |     7 |
| Org 4 | argon2021 | addprechain |     5 |
| Org 4 | xenon2019 | addprechain |     5 |
| Org 4 | xenon2021 | addprechain |     5 |
| Org 4 | xenon2020 | addprechain |     5 |
| Org 4 | xenon2018 | addprechain |     4 |
| Org 2 | testtube  | addprechain |     4 |
| Org 4 | xenon2022 | addprechain |     1 |
| Org 5 | rocketeer | addprechain |     1 |
| Org 5 | pilot     | addprechain |     1 |

The following logs served `503`s:

|    log     |    endpoint    |  503s  | % blocked |
|------------|----------------|--------|-----------|
| skydiver   | addchain       | 233838 | 71%       |
| argon2019  | addchain       | 206221 | 39%       |
| argon2018  | addchain       |  20722 | 32%       |
| testtube   | addchain       |  11313 | 92%       |
| argon2020  | addchain       |   6458 | 26%       |
| argon2021  | addchain       |    631 | 32%       |
| rocketeer  | addchain       |      8 | 0%        |
| pilot      | addchain       |      7 | 0%        |
| xenon2019  | addchain       |      6 | 67%       |
| xenon2020  | addchain       |      6 | 67%       |
| xenon2021  | addchain       |      5 | 56%       |
| xenon2018  | addchain       |      4 | 67%       |
| icarus     | addchain       |      3 | 30%       |
| argon2019  | addprechain    |  46893 | 73%       |
| skydiver   | addprechain    |  45379 | 79%       |
| icarus     | addprechain    |  44498 | 73%       |
| pilot      | addprechain    |  44318 | 90%       |
| rocketeer  | addprechain    |  44290 | 92%       |
| testtube   | addprechain    |   2744 | 83%       |
| argon2020  | addprechain    |   1263 | 72%       |
| argon2018  | addprechain    |    595 | 19%       |
| argon2021  | addprechain    |    121 | 72%       |
| xenon2019  | addprechain    |      5 | 45%       |
| xenon2020  | addprechain    |      5 | 56%       |
| xenon2021  | addprechain    |      4 | 57%       |
| xenon2018  | addprechain    |      3 | 50%       |
| xenon2022  | addprechain    |      1 | 50%       |
| argon2019  | getentries     | 556713 | 89%       |
| rocketeer  | getentries     | 485584 | 92%       |
| pilot      | getentries     | 421160 | 87%       |
| argon2018  | getentries     | 280597 | 92%       |
| icarus     | getentries     | 261129 | 91%       |
| skydiver   | getentries     | 254990 | 93%       |
| argon2021  | getentries     |   4766 | 64%       |
| aviator    | getentries     |   4712 | 62%       |
| argon2020  | getentries     |     82 | 3%        |
| xenon2021  | getentries     |     53 | 34%       |
| xenon2018  | getentries     |     43 | 3%        |
| xenon2020  | getentries     |     43 | 27%       |
| testtube   | getentries     |     33 | 2%        |
| xenon2019  | getentries     |     30 | 7%        |
| xenon2022  | getentries     |      2 | 5%        |
| skydiver   | getproofbyhash |   9107 | 86%       |
| pilot      | getproofbyhash |   1477 | 84%       |
| rocketeer  | getproofbyhash |   1454 | 87%       |
| argon2019  | getproofbyhash |     96 | 89%       |
| icarus     | getproofbyhash |      3 | 30%       |
| argon2018  | getroots       |     36 | 48%       |
| argon2019  | getroots       |     33 | 48%       |
| xenon2021  | getroots       |     30 | 48%       |
| pilot      | getroots       |     21 | 36%       |
| skydiver   | getroots       |     21 | 37%       |
| xenon2019  | getroots       |     21 | 39%       |
| argon2017  | getroots       |     20 | 38%       |
| aviator    | getroots       |     20 | 36%       |
| icarus     | getroots       |     20 | 36%       |
| xenon2020  | getroots       |     20 | 38%       |
| xenon2022  | getroots       |     20 | 38%       |
| argon2020  | getroots       |     19 | 34%       |
| argon2021  | getroots       |     19 | 35%       |
| rocketeer  | getroots       |     19 | 35%       |
| xenon2018  | getroots       |     19 | 37%       |
| argon2022  | getroots       |      3 | 38%       |
| daedalus   | getroots       |      3 | 30%       |
| testtube   | getroots       |      3 | 5%        |
| argon2019  | getsth         |   1088 | 13%       |
| icarus     | getsth         |   1059 | 12%       |
| pilot      | getsth         |   1032 | 8%        |
| argon2018  | getsth         |   1016 | 11%       |
| argon2020  | getsth         |   1001 | 15%       |
| rocketeer  | getsth         |    992 | 10%       |
| skydiver   | getsth         |    991 | 11%       |
| argon2021  | getsth         |    967 | 13%       |
| aviator    | getsth         |    923 | 13%       |
| argon2017  | getsth         |    675 | 7%        |
| xenon2019  | getsth         |    673 | 11%       |
| xenon2020  | getsth         |    671 | 16%       |
| xenon2018  | getsth         |    669 | 7%        |
| xenon2022  | getsth         |    661 | 14%       |
| daedalus   | getsth         |    658 | 22%       |
| xenon2021  | getsth         |    629 | 17%       |
| argon2022  | getsth         |    583 | 13%       |
| testtube   | getsth         |    557 | 11%       |
| solera2022 | getsth         |    547 | 13%       |
| solera2021 | getsth         |    545 | 13%       |
| solera2019 | getsth         |    544 | 13%       |
| solera2020 | getsth         |    535 | 13%       |
| crucible   | getsth         |    534 | 12%       |
| solera2018 | getsth         |    529 | 12%       |
| icarus2019 | getsth         |      2 | 100%      |

Gary Belvin

Dec 10, 2018, 10:15:32 AM
to Al Cutter, rsl...@chromium.org, ct-p...@chromium.org
Correction:
Add % blocked column for table 2. 

|  org  |    log    |  endpoint   | 503s  | % blocked |
|-------|-----------|-------------|-------|-----------|
| Org 1 | argon2019 | addprechain | 47949 | 73%       |
| Org 1 | icarus    | addprechain | 45624 | 73%       |
| Org 1 | skydiver  | addprechain | 45034 | 94%       |
| Org 1 | pilot     | addprechain | 45025 | 94%       |
| Org 1 | rocketeer | addprechain | 44962 | 94%       |
| Org 1 | testtube  | addprechain |  2303 | 81%       |
| Org 1 | argon2020 | addprechain |  1205 | 71%       |
| Org 2 | argon2018 | addprechain |   551 | 18%       |
| Org 2 | skydiver  | addprechain |   548 | 18%       |
| Org 3 | skydiver  | addprechain |   525 | 8%        |
| Org 1 | argon2021 | addprechain |   122 | 74%       |
| Org 1 | argon2018 | addprechain |    65 | 81%       |
| Org 4 | argon2019 | addprechain |     8 | 67%       |
| Org 4 | skydiver  | addprechain |     8 | 67%       |
| Org 4 | rocketeer | addprechain |     8 | 57%       |
| Org 4 | icarus    | addprechain |     8 | 57%       |
| Org 4 | pilot     | addprechain |     7 | 54%       |
| Org 4 | argon2020 | addprechain |     7 | 64%       |
| Org 4 | argon2018 | addprechain |     7 | 64%       |
| Org 4 | argon2021 | addprechain |     5 | 71%       |
| Org 4 | xenon2019 | addprechain |     5 | 50%       |
| Org 4 | xenon2021 | addprechain |     5 | 71%       |
| Org 4 | xenon2020 | addprechain |     5 | 63%       |
| Org 4 | xenon2018 | addprechain |     4 | 67%       |
| Org 2 | testtube  | addprechain |     4 | 17%       |
| Org 4 | xenon2022 | addprechain |     1 | 100%      |
| Org 5 | rocketeer | addprechain |     1 | 0%        |
| Org 5 | pilot     | addprechain |     1 | 0%        |

Jacob Hoffman-Andrews

Dec 18, 2018, 3:55:34 PM
to Certificate Transparency Policy, a...@google.com, rsl...@chromium.org
I've posted a postmortem for Let's Encrypt's side of this outage at https://community.letsencrypt.org/t/2018-11-30-production-google-ct-log-submission-failures/80461. I believe we are probably the organization listed as the "Org 1" that sent email, though the "Org 1" in the charts does not match up with Let's Encrypt, since we only submit to Icarus and Argon2019.

One of the main things that leaps out at me as a cross-organization improvement: Both the Google CT log team and the Let's Encrypt team received alerts at approximately the same time, and both were working on the problem at the same time. It took about half an hour for Let's Encrypt to be confident that this was a Google outage and reach out. It sounds like it also took about half an hour for the Google CT team to realize this was a real public-facing outage and that additional work was needed besides disabling the preloaders.

I think we could have improved time to remediation in two ways:
 - If Google had a public status page for its logs and was very aggressive about updating it shortly after an alert is confirmed, Let's Encrypt would have known to reach out much sooner (though, conversely, we might not have reached out if we thought the problem was being handled).
 - Let's Encrypt should be more aggressive about reaching out early even if we're not sure the outage is on Google's end.

There's a parallel discussion about this outage going on at https://groups.google.com/a/chromium.org/forum/#!topic/ct-policy/_csiMYrwsxc (thread is originally related to a different outage). I'm going to reply to part of it here:


On Tuesday, December 11, 2018 at 7:06:41 AM UTC-8, Ryan Sleevi wrote:
> On Tue, Dec 11, 2018 at 9:42 AM Doug Beattie <douglas...@gmail.com> wrote:
>> I ask because Google has plans [1] to take their older non-sharded CT logs down in the May-August timeframe next year. Having all logs based on Trillian and being managed within the same infrastructure, release process, DNS management, DoS protections, etc. can result in a higher probability of an outage across all Google CT logs. While any other CT log operator can go down with little ecosystem impact, this is not the case for Google CT logs (CAs are obligated to include at least one Google SCT). Has this risk been adequately addressed?
>
> I think one common thread of these post-mortems is that CAs can be taking steps to reduce any impact, and that CAs that have taken such steps have seen minimal impact. This has also been a recurring theme of CT Policy Days.
>
>> The more recent outage [2] due to "Preloader Induced DoS Defense Mode" makes me even more concerned about a successful DoS which results in disabling global SSL issuance. Perhaps it's time to consider changing the Google CT policy to permit issuance of certificates without a Google SCT?
>
> From that post-mortem, it also appears that CAs which took steps to diversify their logging saw limited impact (perhaps none, in some cases), while others exhibited certain pessimistic behaviours that have been called out as problematic in past discussions.
>
> Could you share what data - whether from the post-mortem or from CA operations - you think leads to that conclusion? It seems like having actionable and concrete data, which these post-mortems ensure, allows a bit more discussion and evaluation.
Ryan, can you clarify what steps you think CAs can be taking to reduce impact or diversify logging? Let's Encrypt diversifies logging across two Google logs, but they are both Trillian-based, and both were affected by this outage simultaneously. As a result, we were heavily impacted. As Doug says, it seems likely that Google logs will continue to have correlated outages, since they will all be behind the same frontend infrastructure. 
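The structural constraint described above can be modeled in a few lines. The log names, operator grouping, and policy check below are all illustrative simplifications, not the actual Chrome CT policy or a real log list:

```python
# Hypothetical log list; a real client would load this from the
# published CT log list rather than hard-coding it.
LOGS = [
    {"name": "icarus",    "operator": "Google"},
    {"name": "argon2019", "operator": "Google"},
    {"name": "nimbus",    "operator": "Cloudflare"},
    {"name": "yeti2019",  "operator": "DigiCert"},
]

def collect_scts(submit, logs=LOGS):
    """Submit to every log; return SCTs grouped by operator.

    submit(log) returns an SCT dict on success, or None on failure.
    """
    scts = {}
    for log in logs:
        sct = submit(log)
        if sct is not None:
            scts.setdefault(log["operator"], []).append(sct)
    return scts

def satisfies_chrome_policy(scts):
    """Simplified stand-in for the policy: at least one SCT from a
    Google-operated log, and at least two SCTs overall."""
    total = sum(len(v) for v in scts.values())
    return bool(scts.get("Google")) and total >= 2
```

Under this model, diversifying across non-Google operators cannot help during a correlated Google outage: if every Google log is unreachable at once, `satisfies_chrome_policy` fails regardless of how many other SCTs were collected, which is the single point of failure being discussed.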