Google CT Log Outage Postmortem For Oct 24 2018


Martin Smith

Nov 13, 2018, 8:41:49 AM
to google-...@googlegroups.com, certificate-...@googlegroups.com

Firstly, apologies for the delay in sending this out. We decided to do a more thorough impact analysis / breakdown and that took up some extra time.


On October 24th 2018, all Trillian-based Google CT logs experienced an approximately 40-minute impact on availability. This occurred from roughly 04:22 to 05:03 Pacific Time (11:22 to 12:03 GMT), and affected the argon20XX and xenon20XX logs, as well as the solera20XX and crucible test logs.


Within this time there was a shorter 18-minute window during which 3653 requests in total (from 157 unique IPv4/v6 addresses) received 502 HTTP response codes. Successful requests, totalling 35153 during this 18-minute window, may have experienced much higher and more variable latency than normal. Over the full ~40-minute impact, 131240 requests (from 224 unique IP addresses) were subject to potentially higher latencies.


The root cause was an unexpected behavioural change in a network library that we depend on for routing external requests to our servers, which caused the servers to begin rejecting all inbound traffic. Automatic checks on the new release binary gave inconsistent results across several runs; we believe this was due to differing traffic patterns at the time, and because internal traffic bypassed the failure. As a result, the new release was briefly set live before being manually rolled back to the previous version.


Summary timeline of events:


04:22 PT OUTAGE BEGINS - Rollout of new release begins

04:28 PT DETECTION TIME - First warning bug received for elevated error rates

04:33 PT ESCALATION TIME - First page for elevated error rates

04:34 PT Rollout of new release reaches approx. 90% complete

04:34 PT Rollout aborted

04:36 PT (approx.) On-call requests immediate rollback

04:39 PT Rollback process is initiated to restore previous release

04:45 PT Rollback begins to take effect

05:03 PT OUTAGE ENDS - Rollback complete


As of November 5th 2018, the lowest 90-day availability among the argon20XX logs is 99.9907%, for argon2018.


The following list shows the number of 502s we served for each of the affected log endpoints. A few malformed requests have been omitted.


Endpoint                             502s Returned
-------------------------------------------------
/logs/argon2017/ct/v1                            1
/logs/argon2018/ct/v1/get-roots                  1
/logs/argon2019/ct/v1/add-chain                  1
/logs/argon2021/ct/v1/add-chain                  1
/logs/argon2021/ct/v1/get-roots                  1
/logs/solera2018/ct/v1/get-entries               1
/logs/solera2019/ct/v1/get-entries               1
/logs/solera2021/ct/v1/get-entries               1
/logs/xenon2019/ct/v1/add-pre-chain              1
/logs/xenon2020/ct/v1/get-roots                  1
/logs/xenon2021/ct/v1/add-pre-chain              1
/logs/xenon2022/ct/v1/get-entries                1
/logs/xenon2018/ct/v1/get-entries                2
/logs/xenon2019/ct/v1/get-entries                2
/logs/xenon2021/ct/v1/get-entries                3
/logs/xenon2020/ct/v1/get-entries                4
/logs/argon2018/ct/v1/add-pre-chain              5
/logs/argon2021/ct/v1/get-entries                8
/logs/argon2019/ct/v1/add-pre-chain             22
/logs/argon2021/ct/v1/add-pre-chain             25
/logs/argon2020/ct/v1/get-entries               36
/logs/solera2021/ct/v1/get-sth                  68
/logs/solera2018/ct/v1/get-sth                  69
/logs/solera2019/ct/v1/get-sth                  69
/logs/solera2020/ct/v1/get-sth                  70
/logs/solera2022/ct/v1/get-sth                  73
/logs/xenon2021/ct/v1/get-sth                   87
/logs/xenon2020/ct/v1/get-sth                   88
/logs/argon2022/ct/v1/get-sth                   89
/logs/crucible/ct/v1/get-sth                    91
/logs/xenon2019/ct/v1/get-sth                   98
/logs/xenon2018/ct/v1/get-sth                  113
/logs/argon2017/ct/v1/get-sth                  115
/logs/xenon2022/ct/v1/get-sth                  115
/logs/argon2019/ct/v1/get-entries              137
/logs/argon2019/ct/v1/get-sth                  207
/logs/argon2018/ct/v1/get-sth                  212
/logs/argon2020/ct/v1/get-sth                  216
/logs/argon2021/ct/v1/get-sth                  231
/logs/argon2020/ct/v1/add-pre-chain            394
/logs/argon2018/ct/v1/get-entries              523
-------------------------------------------------
Total                                         3184
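For context, the endpoints above are the standard RFC 6962 CT v1 API (get-sth, get-entries, add-chain, etc.). A client seeing transient 502s like these would typically retry with backoff. Below is a minimal sketch of that retry pattern; the `fetch` callable is injected so the example is self-contained, and is not drawn from any real CT client:

```python
import time

def get_sth_with_retry(fetch, max_attempts=4, base_delay=0.5):
    """Fetch a log's signed tree head, retrying transient 5xx errors.

    `fetch` is any callable returning an (http_status, body) tuple,
    e.g. a wrapper around an HTTP GET of /logs/<log>/ct/v1/get-sth.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        status, body = fetch()
        if status == 200:
            return body
        if 500 <= status < 600 and attempt < max_attempts:
            time.sleep(delay)  # exponential backoff between attempts
            delay *= 2
            continue
        raise RuntimeError(f"get-sth failed with HTTP {status}")

# Simulate a server that returns 502 twice, then recovers.
responses = iter([(502, ""), (502, ""), (200, '{"tree_size": 1000}')])
sth = get_sth_with_retry(lambda: next(responses), base_delay=0)
print(sth)  # body from the successful third attempt
```

Clients that retried in this way would have seen elevated latency rather than hard failures during the shorter 502 window.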



We apologize for this interruption to service and will be introducing additional deployment checks and monitoring to guard against a recurrence.


Martin
Google CT Team

Al Cutter

Nov 13, 2018, 8:54:59 AM
to certificate-...@googlegroups.com, ct-p...@chromium.org, google-...@googlegroups.com
[+ct-policy]

--
You received this message because you are subscribed to the Google Groups "certificate-transparency" group.
To unsubscribe from this group and stop receiving emails from it, send an email to certificate-transp...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/certificate-transparency/CAK76_KVJO4U6y2ax1_F4tzkPGOD8nN%2B1oqZkKUsYRd1RX7u%2B%2BQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Doug Beattie

Nov 14, 2018, 5:48:05 PM
to rsl...@chromium.org, a...@google.com, certificate-...@googlegroups.com, ct-p...@chromium.org, google-...@googlegroups.com
I would also be interested in a more detailed postmortem. Having all of the Google Trillian-based logs down at the same time could have resulted in a DoS on issuance, because all CAs are required to get SCTs from at least one Google log. GlobalSign currently still uses some of the older Google logs, so we were not adversely impacted this time. This is the second time all of the Google logs went down (last time was a DNS issue).

Will there be procedures put in place that prevent updates to all of the Google logs at the same time in the future?

Doug


On Tue, Nov 13, 2018 at 11:38 AM Ryan Sleevi <rsl...@chromium.org> wrote:
This provides a useful breakdown and impact analysis, but does not appear to rise to the level of a postmortem that helps the community understand root causes and mitigations.

During the recent Apple-hosted CT Policy Days, Martin gave an excellent presentation that arguably could serve as a model postmortem in its level of detail in analyzing the root cause, its explanation of the architectural considerations, the steps being taken to mitigate those issues, and sufficient context to provide insight into other potential issues. I'm curious if there are plans to share that more broadly, either as a result of the minutes from CT Policy Days, or as a further follow-up to this incident report.


Martin Smith

unread,
Nov 22, 2018, 8:24:22 AM11/22/18
to rsl...@chromium.org, Al Cutter, certificate-...@googlegroups.com, ct-p...@chromium.org, google-...@googlegroups.com
OK, in addition, below is a summary of what I presented at the Policy Days. We can't go into much more detail, as the problems occurred in code that's not open source.

Martin

More Details

We share common networking infrastructure with most Google services. This is managed for us and contains a lot of moving parts. We normally don’t worry much about it. Requests arrive from the Internet and are routed through this via internal networks to our servers.

Our release process is fully automated and consists of multiple stages. Continuously generated release candidates must progress through the stages to become live releases. At each stage a combination of tests and evaluations of server behaviour is run, including comparisons to previous versions. If any test or evaluation fails, the candidate is blocked and not released.
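As a rough illustration (the stage names and checks below are hypothetical, not our actual tooling), the staged gating described above amounts to advancing a candidate only while every gate passes:

```python
def run_release_pipeline(candidate, stages):
    """Advance a release candidate through gated stages.

    `stages` is an ordered list of (name, check) pairs; each check
    returns True (pass) or False (block). The candidate goes live
    only if every stage passes.
    """
    for name, check in stages:
        if not check(candidate):
            return f"blocked at {name}"
    return "released"

# Hypothetical gates for a candidate binary: the canary gate
# compares observed error ratio against a baseline with headroom.
stages = [
    ("unit-tests", lambda c: c["unit_tests_pass"]),
    ("integration-tests", lambda c: c["integration_tests_pass"]),
    ("canary-evaluation",
     lambda c: c["error_ratio"] <= c["error_ratio_baseline"] * 1.1),
]

good = {"unit_tests_pass": True, "integration_tests_pass": True,
        "error_ratio": 0.001, "error_ratio_baseline": 0.001}
print(run_release_pipeline(good, stages))  # released
```

The incident below came down to the canary-style gate measuring the wrong traffic, not to any gate being skipped.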

Servers are typically built in layers. In our case requests pass through an Application Framework layer (shared code), then our interceptor that performs rate limiting and other common functionality before making it through to our actual HTTP request handler.

A bug was introduced into the Application Framework library that made all external requests arriving at our servers incorrectly fail internal ACL checks. These requests never reached our handler or interceptor code, so they did not appear in our error metrics. This caused no unit test failures, as it was outside their scope. Integration tests also did not see the problem, as the traffic involved was internal and did not trigger the ACL failure, which was only triggered via interactions with other networking components.
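A toy model of that layering (all names hypothetical) shows why the failures were invisible to us: requests rejected by the framework's ACL check never pass through the interceptor, so the interceptor's own counters never record them:

```python
class Metrics:
    """Stand-in for the interceptor's request counters."""
    def __init__(self):
        self.handled = 0

def framework_layer(request, acl_ok, next_layer, metrics):
    """Shared framework layer: the buggy ACL check runs here,
    *before* any of our own code or metrics."""
    if not acl_ok(request):
        return 502  # rejected at the framework; our code never sees it
    return next_layer(request, metrics)

def interceptor(request, metrics):
    """Our layer: rate limiting, metrics, then the real handler."""
    metrics.handled += 1
    return handler(request)

def handler(request):
    return 200

# Buggy ACL: all external requests incorrectly fail the check.
buggy_acl = lambda req: req["source"] == "internal"

metrics = Metrics()
for req in [{"source": "external"}] * 5 + [{"source": "internal"}] * 5:
    framework_layer(req, buggy_acl, interceptor, metrics)

# Half the traffic got 502s, yet our own metrics recorded only
# the 5 internal requests that made it through.
print(metrics.handled)  # 5
```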

The nature of this failure prevented the errors from being visible to the release evaluation, because they were not recorded in metrics. Additionally, other requests were being submitted directly to the servers from our internal systems, and these all succeeded. This meant that if the release evaluation occurred at a time when a large number of internal requests were happening, everything seemed good and the evaluation passed.
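The masking effect is simple arithmetic: when the evaluation window is dominated by internal traffic that bypasses the faulty path, the measured error ratio can stay under a gating threshold even if every external request is failing. A sketch with made-up numbers (the threshold is hypothetical):

```python
def error_ratio(external_reqs, internal_reqs, external_fail_rate=1.0):
    """Observed error ratio when external requests fail at the given
    rate but internal requests (bypassing the fault) all succeed."""
    errors = external_reqs * external_fail_rate
    return errors / (external_reqs + internal_reqs)

THRESHOLD = 0.05  # hypothetical "block the release" gate

# Sparse external traffic drowned out by internal requests: passes.
print(error_ratio(40, 2000) <= THRESHOLD)   # True - evaluation passes

# Same total failure with external-heavy traffic: clearly blocked.
print(error_ratio(400, 200) <= THRESHOLD)   # False - would be blocked
```

This is why an evaluation that assesses live external traffic (rather than whatever mix happens to be present) closes the gap.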

Consequently, the release briefly made it live in production. Once it was deployed, probers began accumulating errors and the edge network -> server error ratio began to increase as the faulty binary rolled out to more locations. This led to a number of alerts being triggered. The on-call was able to rapidly correlate the error onset with the beginning of the rollout and requested an immediate rollback.

Once the rollback was complete and the previous version was redeployed everywhere, the error metrics returned to normal and the observed impact reduced to zero.

Our primary follow-up actions will be to ensure that our canary environment is tested via the external request processing path, and to improve the release evaluation process at the canary stage so that it assesses live traffic.



Doug Beattie

Dec 11, 2018, 9:42:16 AM
to rsl...@chromium.org, Al Cutter, certificate-...@googlegroups.com, Certificate Transparency Policy, google-...@googlegroups.com

Maybe I missed it, but was there a more detailed postmortem so the community can understand the root causes and mitigations? I ask because Google has plans [1] to take its older non-sharded CT logs down in the May-August timeframe next year. Having all logs based on Trillian and managed within the same infrastructure, release process, DNS management, DoS protections, etc. can result in a higher probability of an outage across all Google CT logs. While any other CT log operator can go down with little ecosystem impact, this is not the case for the Google CT logs (CAs are obligated to include at least one Google SCT). Has this risk been adequately addressed?

The more recent outage [2], due to "Preloader Induced DoS Defense Mode", makes me even more concerned about a successful DoS resulting in the disabling of global SSL issuance. Perhaps it's time to consider changing the Google CT policy to permit issuance of certificates without a Google SCT?

by li

Jan 14, 2019, 3:06:28 AM
to certificate-transparency
Hello,

I would like to ask a question about the transparency report website (https://transparencyreport.google.com/https/certificates/) provided by Google. Its function is similar to a CT monitor: it provides users with a certificate query service. I have two questions:

1. What is the list of logs it monitors? There is no specific disclosure on this point. In September 2018 we found that, compared with other third-party monitors, no records from Argon2019 were returned by Google, but in the October 2018 results some Argon2019 records were returned. I would like to confirm when Google added Argon2019 to its monitor list.

2. A large number of precertificates are missing. We find that if a certificate is recorded in the logs only as a precertificate, it is probably not returned by Google. Does Google handle precertificates specially?

Thank you very much; I look forward to your prompt reply.

On Tuesday, November 13, 2018 at 9:41:49 PM UTC+8, Martin Smith wrote: