Re: Google CT Log Outage Postmortem For Oct 24 2018


Al Cutter

Nov 13, 2018, 8:54:59 AM
to certificate-...@googlegroups.com, ct-p...@chromium.org, google-...@googlegroups.com
[+ct-policy]

On Tue, Nov 13, 2018 at 1:41 PM 'Martin Smith' via certificate-transparency <certificate-...@googlegroups.com> wrote:

Firstly, apologies for the delay in sending this out. We decided to do a more thorough impact analysis / breakdown and that took up some extra time.


On October 24th 2018, all Trillian-based Google CT logs experienced an approximately 40-minute impact on availability, from roughly 04:22 to 05:03 Pacific Time (11:22 to 12:03 GMT). This affected the argon20XX and xenon20XX logs, as well as the solera20XX and crucible test logs.


Within this time there was a shorter 18-minute window during which 3653 requests in total (from 157 unique IPv4/IPv6 addresses) received 502 HTTP response codes. Successful requests during this 18-minute window, totalling 35153, may have experienced much higher and more variable latency than normal. Over the full ~40-minute impact, 131240 requests (from 224 unique IP addresses) were subject to potentially higher latencies.


The root cause was an unexpected behavioural change in a network library that we depend on for routing external requests to our servers, which caused the servers to begin rejecting all inbound traffic. Automatic checks on the new release binary gave inconsistent results across several runs; we believe this was due to differing traffic patterns at the time, and because internal traffic bypassed the failure. As a result, the new release was briefly set live before being manually rolled back to the previous version.


Summary timeline of events:


04:22 PT OUTAGE BEGINS - Rollout of new release begins

04:28 PT DETECTION TIME - First warning bug received for raised error rates

04:33 PT ESCALATION TIME - First page for raised error rates

04:34 PT Rollout of new release reaches approx. 90% complete

04:34 PT Rollout aborted

04:36 PT (approx.) On-call requests immediate rollback

04:39 PT Rollback process is initiated to restore previous release

04:45 PT Rollback begins to take effect

05:03 PT OUTAGE ENDS - Rollback complete


As of November 5th 2018, the lowest 90-day availability across the argon20XX logs is 99.9907%, for argon2018.


The following list shows the number of 502s we served for each of the affected log endpoints. A few malformed requests have been omitted.


Endpoint                                502s Returned
/logs/argon2017/ct/v1                               1
/logs/argon2018/ct/v1/get-roots                     1
/logs/argon2019/ct/v1/add-chain                     1
/logs/argon2021/ct/v1/add-chain                     1
/logs/argon2021/ct/v1/get-roots                     1
/logs/solera2018/ct/v1/get-entries                  1
/logs/solera2019/ct/v1/get-entries                  1
/logs/solera2021/ct/v1/get-entries                  1
/logs/xenon2019/ct/v1/add-pre-chain                 1
/logs/xenon2020/ct/v1/get-roots                     1
/logs/xenon2021/ct/v1/add-pre-chain                 1
/logs/xenon2022/ct/v1/get-entries                   1
/logs/xenon2018/ct/v1/get-entries                   2
/logs/xenon2019/ct/v1/get-entries                   2
/logs/xenon2021/ct/v1/get-entries                   3
/logs/xenon2020/ct/v1/get-entries                   4
/logs/argon2018/ct/v1/add-pre-chain                 5
/logs/argon2021/ct/v1/get-entries                   8
/logs/argon2019/ct/v1/add-pre-chain                22
/logs/argon2021/ct/v1/add-pre-chain                25
/logs/argon2020/ct/v1/get-entries                  36
/logs/solera2021/ct/v1/get-sth                     68
/logs/solera2018/ct/v1/get-sth                     69
/logs/solera2019/ct/v1/get-sth                     69
/logs/solera2020/ct/v1/get-sth                     70
/logs/solera2022/ct/v1/get-sth                     73
/logs/xenon2021/ct/v1/get-sth                      87
/logs/xenon2020/ct/v1/get-sth                      88
/logs/argon2022/ct/v1/get-sth                      89
/logs/crucible/ct/v1/get-sth                       91
/logs/xenon2019/ct/v1/get-sth                      98
/logs/xenon2018/ct/v1/get-sth                     113
/logs/argon2017/ct/v1/get-sth                     115
/logs/xenon2022/ct/v1/get-sth                     115
/logs/argon2019/ct/v1/get-entries                 137
/logs/argon2019/ct/v1/get-sth                     207
/logs/argon2018/ct/v1/get-sth                     212
/logs/argon2020/ct/v1/get-sth                     216
/logs/argon2021/ct/v1/get-sth                     231
/logs/argon2020/ct/v1/add-pre-chain               394
/logs/argon2018/ct/v1/get-entries                 523
Total                                            3184



We apologize for this interruption to serving and will be introducing additional deployment checks and monitoring to guard against a future recurrence.


Martin
Google CT Team

--
You received this message because you are subscribed to the Google Groups "certificate-transparency" group.
To unsubscribe from this group and stop receiving emails from it, send an email to certificate-transp...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/certificate-transparency/CAK76_KVJO4U6y2ax1_F4tzkPGOD8nN%2B1oqZkKUsYRd1RX7u%2B%2BQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Ryan Sleevi

Nov 13, 2018, 11:38:01 AM
to Al Cutter, certificate-...@googlegroups.com, ct-p...@chromium.org, google-...@googlegroups.com
This provides a useful breakdown and impact analysis, but does not appear to rise to the level of a postmortem that helps the community understand root causes and mitigations.

During the recent Apple-hosted CT Policy Days, Martin gave an excellent presentation that arguably could serve as a model post-mortem in its level of detail in analyzing the root cause, its explanation of the architectural considerations, the steps being taken to mitigate those issues, and the context it provided for insight into other potential issues. I'm curious whether there are plans to share that more broadly, either as part of the minutes from CT Policy Days or as a further follow-up to this incident report.


Doug Beattie

Nov 14, 2018, 5:48:05 PM
to rsl...@chromium.org, a...@google.com, certificate-...@googlegroups.com, ct-p...@chromium.org, google-...@googlegroups.com
I would also be interested in a more detailed post mortem. Having all of the Google Trillian-based logs down at the same time could have resulted in a DoS for issuance, because all CAs are required to get SCTs from at least one Google log. Currently GlobalSign still uses some of the older Google logs, so we were not adversely impacted this time. This is the second time all Google logs went down (last time was a DNS issue).

Will there be procedures put in place that prevent updates to all of the Google logs at the same time in the future?

Doug


Doug Beattie

Dec 11, 2018, 9:42:16 AM
to rsl...@chromium.org, Al Cutter, certificate-...@googlegroups.com, Certificate Transparency Policy, google-...@googlegroups.com

Maybe I missed it, but was there a more detailed postmortem so the community can understand the root causes and mitigation?  I ask because Google has plans [1] to take their older non-sharded CT logs down in the May-August timeframe next year.  Having all logs based on Trillian and being managed within the same infrastructure, release process, DNS management, DoS protections, etc. can result in a higher probability of an outage across all Google CT logs.  While any other CT log operator can go down with little ecosystem impact, this is not the case for Google CT logs (CAs are obligated to include at least one Google SCT).  Has this risk been adequately addressed?

The more recent outage [2] due to "Preloader Induced DoS Defense Mode" makes me even more concerned about successful DoS which results in disabling global SSL issuance.  Perhaps it's time to consider changing the Google CT policy to permit issuance of certificates without a Google SCT?

Ryan Sleevi

Dec 11, 2018, 10:06:41 AM
to Doug Beattie, Ryan Sleevi, Al Cutter, Certificate Transparency Policy, Martin Smith
On Tue, Dec 11, 2018 at 9:42 AM Doug Beattie <douglas...@gmail.com> wrote:

Maybe I missed it, but was there a more detailed Postmortem so the community can understand the root causes and mitigation? 

There was, actually, but it looks like Martin wasn't a member of ct-policy@ and thus it wasn't archived. This is, in general, an indictment against cross-posting.
 
I ask because Google has plans [1] to take their older non-sharded CT logs down in the May-August timeframe next year.  Having all logs based on Trillian and being managed within the same infrastructure, release process, DNS management, DoS protections, etc. can result in a higher probability of an outage across all Google CT logs.  While any other CT log operator can go down with little ecosystem impact, this is not the case for Google CT logs (CAs are obligated to include at least one Google SCT).  Has this risk been adequately addressed?

I think one common thread of these post-mortems is that CAs can be taking steps to reduce any impact, and that CAs that have taken such steps have seen minimal impact. This has also been a recurring theme of CT Policy Days. 
 
The more recent outage [2] due to "Preloader Induced DoS Defense Mode" makes me even more concerned about successful DoS which results in disabling global SSL issuance.  Perhaps it's time to consider changing the Google CT policy to permit issuance of certificates without a Google SCT?

From that post-mortem, it also appears that CAs which took steps to diversify their logging saw limited impact (perhaps none, in some cases), while others exhibited certain pessimistic behaviours that have been called out as problematic in past discussions.

Could you share what data - whether from the post-mortem or from CA operations - leads you to that conclusion? It seems like having actionable and concrete data, which these post-mortems ensure, allows a bit more discussion and evaluation.

Ryan Sleevi

Dec 11, 2018, 10:07:06 AM
to Martin Smith, Ryan Sleevi, Al Cutter, Certificate Transparency Policy
Reposting this on Martin's behalf.

On Thu, Nov 22, 2018 at 8:24 AM Martin Smith <m...@google.com> wrote:
OK, in addition below is a summary of what I presented at the Policy Days. We can't go into much more detail as the problems occurred in code that's not open source.

Martin

More Details

We share common networking infrastructure with most Google services. This is managed for us and contains a lot of moving parts. We normally don’t worry much about it. Requests arrive from the Internet and are routed through this via internal networks to our servers.

Our release process is fully automated and consists of multiple stages. Continuously generated release candidates must progress through the stages to become live releases. At each stage a combination of tests is run, together with evaluations of the behaviour of the servers, including comparisons to previous versions. If any test or evaluation fails, the candidate is blocked and not released.
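A hedged sketch of the kind of per-stage evaluation described above (all names and thresholds are illustrative assumptions, not the actual release tooling):

```python
# Hypothetical sketch of a stage gate (names and thresholds are assumptions,
# not the actual pipeline): a candidate is promoted only if its error ratio
# does not regress past the live release's by more than a tolerance.
def gate_release(candidate_errors: int, candidate_total: int,
                 baseline_errors: int, baseline_total: int,
                 tolerance: float = 0.001) -> bool:
    """Return True if the candidate may proceed to the next stage."""
    cand_ratio = candidate_errors / candidate_total
    base_ratio = baseline_errors / baseline_total
    return cand_ratio <= base_ratio + tolerance

# An evaluation window dominated by healthy traffic passes the gate...
assert gate_release(0, 10_000, 5, 10_000) is True
# ...while one dominated by failing traffic would block the release.
assert gate_release(500, 10_000, 5, 10_000) is False
```

Such a comparison is only as good as the traffic it observes, which is what made the failure below invisible to it.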

Servers are typically built in layers. In our case requests pass through an Application Framework layer (shared code), then our interceptor that performs rate limiting and other common functionality before making it through to our actual HTTP request handler.

A bug was introduced into the Application Framework library that made all external requests arriving at our servers incorrectly fail internal ACL checks. These requests never reached our handler or interceptor code, so they did not appear in our error metrics. This caused no unit test failures, as it was outside their scope. Integration tests also did not catch the problem, as the traffic involved was internal and did not trigger the ACL failure, which required interactions with other networking components.

The nature of this failure prevented the errors from being visible to the release evaluation, because they were not recorded in metrics. Additionally, other requests were being submitted directly to the servers from our internal systems, all of which succeeded. This meant that if the release evaluation occurred at a time when a large number of internal requests were happening, everything seemed good and the evaluation passed.
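The layering failure described above can be sketched as follows (an illustrative toy, not the actual Application Framework): a rejection in the outermost layer returns before the application's metrics interceptor ever runs, so the application's own error counters stay flat.

```python
# Illustrative toy (not the actual framework): layered request handling where
# a framework-level ACL check runs *outside* the application's metrics
# interceptor. Requests it rejects never reach the interceptor, so the
# application's error counters never move.
error_count = 0  # application-level error metric

def handler(request):
    return 200  # normal request handling

def metrics_interceptor(next_layer):
    def wrapped(request):
        global error_count
        status = next_layer(request)
        if status >= 400:
            error_count += 1
        return status
    return wrapped

def framework_acl(next_layer, acl_allows):
    def wrapped(request):
        if not acl_allows(request):
            return 403  # rejected before the app's interceptor ever runs
        return next_layer(request)
    return wrapped

# The (buggy) ACL fails every external request but passes internal ones.
app = framework_acl(metrics_interceptor(handler),
                    acl_allows=lambda r: r["internal"])

assert app({"internal": False}) == 403  # externally visible failure...
assert app({"internal": True}) == 200   # ...while internal traffic succeeds
assert error_count == 0  # ...and the app's own metrics see nothing
```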

Consequently, the release briefly made it live in production. Once it was deployed, probers began accumulating errors and the edge network -> server error ratio began to increase as the faulty binary rolled out to more locations. This triggered a number of alerts. The on-call was able to rapidly correlate the error onset with the beginning of the rollout and requested an immediate rollback.

Once the rollback was complete, and the previous version was redeployed everywhere, the error metrics returned to normal and the observed impact reduced to zero.

Our primary follow-up actions will be to ensure that our canary environment is tested via the external request processing path, and improve the release evaluation process at the canary stage so it assesses live traffic.


Doug Beattie

Dec 14, 2018, 8:24:24 AM
to Certificate Transparency Policy, Ryan Sleevi
This time copying the list.

---------- Forwarded message ---------
From: Doug Beattie <douglas...@gmail.com>
Date: Fri, Dec 14, 2018 at 8:21 AM
Subject: Re: [ct-policy] Re: Google CT Log Outage Postmortem For Oct 24 2018
To: <rsl...@chromium.org>


Ryan,

I think you missed the point.  Mandating that all SSL certificates use SCTs from at least one Google log is an unacceptable risk, imo, especially when they are managed within the same infrastructure.  I'd prefer to see a requirement for more SCTs without any from Google over the current Google CT policy that requires at least one Google SCT.  For example:
< 15 months
Current: 2 (one Google and one non-Google)
Proposed:   2 (one Google and one non-Google), or 3 (from non-Google CT logs from at least 2 different operators)

>= 15, <= 27 months
Current: 3 (at least one Google and one non-Google)
Proposed: 3 (at least one Google and one non-Google), or 4 (from non-Google logs from at least 2 different operators)

Is anyone else concerned about this single point of failure?

See more responses below.

On Tue, Dec 11, 2018 at 10:06 AM Ryan Sleevi <rsl...@chromium.org> wrote:


On Tue, Dec 11, 2018 at 9:42 AM Doug Beattie <douglas...@gmail.com> wrote:

Maybe I missed it, but was there a more detailed Postmortem so the community can understand the root causes and mitigation? 

There was, actually, but it looks like Martin wasn't a member of ct-policy@ and thus it wasn't archived. This is, in general, an indictment against cross-posting.
 
I ask because Google has plans [1] to take their older non-sharded CT logs down in the May-August timeframe next year.  Having all logs based on Trillian and being managed within the same infrastructure, release process, DNS management, DoS protections, etc. can result in a higher probability of an outage across all Google CT logs.  While any other CT log operator can go down with little ecosystem impact, this is not the case for Google CT logs (CAs are obligated to include at least one Google SCT).  Has this risk been adequately addressed?

I think one common thread of these post-mortems is that CAs can be taking steps to reduce any impact, and that CAs that have taken such steps have seen minimal impact. This has also been a recurring theme of CT Policy Days. 
 
The issue is that CAs can't take any steps to reduce impact when all the Google Logs go down.  Yes, we've spread out SCT fetching to cover many CT logs and have no issues when multiple logs are down or slow, but we can't get around this if we're required to use at least one Google log.

There are 2 major things that concern me:

1) Please correct me if I'm wrong, but there will be 2 Google CT logs (Argon and Xenon), down from the current 5 (Pilot, Rocketeer, Skydiver, Argon, Icarus), by the middle of next year. Maybe I missed an announcement of a new log.
2) The Google logs share the same infrastructure, management/release process and same code base (once all are transitioned to Trillian). The postmortem confirmed this.
* A single mistake anywhere in the process takes down both logs and global SSL certificate issuance.  
* A single attack on the Google logs takes down issuance.  
* Internet security depends on the availability of 2 Google CT logs

Even Google has outages from time to time (3 so far that impacted some or all of their logs), so isn't requiring Google SCTs and having only 2 CT logs an unacceptably high risk, especially with the growing number of certificates?

Argon2019 has a backlog of more than 1000 when I checked just now (https://crt.sh/monitored-logs). Will the 2 Google logs have the capacity and bandwidth to support the growing certificate issuance requirements by providing SCTs in a timely manner?
 
The more recent outage [2] due to "Preloader Induced DoS Defense Mode" makes me even more concerned about successful DoS which results in disabling global SSL issuance.  Perhaps it's time to consider changing the Google CT policy to permit issuance of certificates without a Google SCT?

From that post-mortem, it also appears that CAs which took steps to diversify their logging saw limited impact (perhaps none, in some cases), while others exhibited certain pessimistic behaviours that have been called out as problematic in past discussions.

GlobalSign didn't have any direct impact because we're still using some of the older Google logs, but if Google is planning to EOL those logs and have just 2 remaining, then diversifying log use would not have helped.
 
Could you share what data - whether from the post-mortem or from CA operations - leads you to that conclusion? It seems like having actionable and concrete data, which these post-mortems ensure, allows a bit more discussion and evaluation.

What are Google's plans for avoiding outages like the ones we've seen recently?  
How can we justify the continued requirement for least one Google SCT given the recent outages and planned move to just 2 Google CT logs managed within the same infrastructure?

Wayne Thayer

Dec 14, 2018, 11:59:54 AM
to Doug Beattie (Globalsign), Certificate Transparency Policy, Ryan Sleevi
On Fri, Dec 14, 2018 at 6:24 AM Doug Beattie <douglas...@gmail.com> wrote:
This time copying the list.

---------- Forwarded message ---------
From: Doug Beattie <douglas...@gmail.com>
Date: Fri, Dec 14, 2018 at 8:21 AM
Subject: Re: [ct-policy] Re: Google CT Log Outage Postmortem For Oct 24 2018
To: <rsl...@chromium.org>


Ryan,

I think you missed the point.  Mandating that all SSL certificates use SCTs from at least one Google log is an unacceptable risk, imo, especially when they are managed within the same infrastructure.  I'd prefer to see a requirement for more SCTs without any from Google over the current Google CT policy that requires at least one Google SCT.  For example:
< 15 months
Current: 2 (one Google and one non-Google)
Proposed:   2 (one Google and one non-Google), or 3 (from non-Google CT logs from at least 2 different operators)

>= 15, <= 27 months
Current: 3 (at least one Google and one non-Google)
Proposed: 3 (at least one Google and one non-Google), or 4 (from non-Google logs from at least 2 different operators)

Is anyone else concerned about this single point of failure?

Yes, and I think we should be concerned about more than accidental outages: in a world where CAs won't issue publicly-trusted certificates unless they can log to Google logs, and HTTPS is required to operate a website, Google is not only a SPOF but becomes a gatekeeper of who can run a website.


Alex Cohn

Dec 17, 2018, 11:56:58 AM
to Certificate Transparency Policy, sle...@google.com
On Friday, December 14, 2018 at 7:24:24 AM UTC-6, Doug Beattie (Globalsign) wrote:

Argon2019 has a backlog of more than 1000 when I checked just now (https://crt.sh/monitored-logs). Will the 2 Google logs have the capacity and bandwidth to support the growing certificate issuance requirements by providing SCTs in a timely manner?
 

I think this is a misunderstanding of what crt.sh's backlog represents - Rob Stradling can certainly provide more detail than I can, but I believe this is simply the difference between the tree_size of the latest SCT crt.sh has retrieved from a log and the latest log entry ID it has processed and added to its database. In other words, a high backlog is not indicative of a problem with a log, but rather that the crt.sh monitor has not been able to keep up with the log's recent growth. 

I believe the only external indicator of a log's inability to keep up with its load would be an increasing average merge delay, but don't know of any monitor that publishes that. I seem to recall that Google and Cloudflare are each tracking this internally, though?
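A minimal sketch of the backlog arithmetic being described (assuming it is simply this difference; the monitor's exact computation may differ):

```python
# Hedged sketch: a monitor's "backlog" as the gap between the log's reported
# tree size and the number of entries the monitor has ingested. A large
# value means the monitor is behind, not that the log is unhealthy.
def monitor_backlog(reported_tree_size: int, entries_ingested: int) -> int:
    return max(0, reported_tree_size - entries_ingested)

# A log that grew by 1000 entries since the monitor's last crawl:
assert monitor_backlog(reported_tree_size=500_000,
                       entries_ingested=499_000) == 1_000
```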

Alex

Rob Stradling

Dec 18, 2018, 6:14:40 AM
to Alex Cohn, Certificate Transparency Policy, sle...@google.com
On 17/12/2018 16:56, 'Alex Cohn' via Certificate Transparency Policy wrote:
> On Friday, December 14, 2018 at 7:24:24 AM UTC-6, Doug Beattie
> (Globalsign) wrote:
>
>
> Argon2019 has a backlog of more than 1000 when I checked just now,
> https://crt.sh/monitored-logs <https://crt.sh/monitored-logs>  Will
> the 2 Google logs have the capacity and bandwidth to support the
> growing certificate issuance requirements by providing SCT in a
> timely manner?
>
>
> I think this is a misunderstanding of what crt.sh's backlog represents -
> Rob Stradling can certainly provide more detail than I can, but I
> believe this is simply the difference between the tree_size of the
> latest SCT crt.sh has retrieved from a log and the latest log entry ID
> it has processed and added to its database. In other words, a high
> backlog is not indicative of a problem with a log, but rather that the
> crt.sh monitor has not been able to keep up with the log's recent growth.

That's correct.

(Nit: tree_size of the latest STH, not SCT)

--
Rob Stradling
Senior Research & Development Scientist
Sectigo Limited

Pierre Phaneuf

Dec 18, 2018, 6:38:38 AM
to Rob Stradling, Alex Cohn, Certificate Transparency Policy, sle...@google.com
A heavily loaded log might be slow in giving replies to get-entries
requests, or turn them down entirely in the worst case, so a backlog
on crt.sh doesn't necessarily mean that the log is fine either.

Ryan Sleevi

Dec 19, 2018, 3:59:09 PM
to Doug Beattie, Certificate Transparency Policy
Splitting this off from https://groups.google.com/a/chromium.org/d/msg/ct-policy/_csiMYrwsxc/9wTwCwtiBQAJ because even though the questions are short, the answers are long and complicated :) Bottom-posting, so that folks who didn't follow that thread/message have sufficient context.

Doug,


There’s a lot going on in your message, so I’m going to try to reframe it a little, so we don’t get lost in the replies. I’ll try to work from the ‘easiest’ stuff and then onto the more nuanced stuff.


I’m glad to see other members of the community have already highlighted the misunderstanding about what crt.sh was reflecting. To recap:

  1. Those metrics have nothing to do with the delivery of SCTs, but with monitored entries. Only the entities logging can provide data about their experiences obtaining SCTs, using either the /ct/v1/add-chain API or the /ct/v1/add-pre-chain API. In my previous message, I tried to capture this by explicitly asking GlobalSign to share any data that it can, relevant to this discussion.

  2. Crt.sh doesn’t allow inferring bandwidth or capacity from those measurements; if anything, the greater bandwidth and capacity of those logs causes monitors, such as crt.sh, to fall behind quicker, because of how rapidly they are able to integrate entries.


The offer still stands: if there is data that GlobalSign has regarding challenges with Logs, Google or otherwise, then we really would love to have that data made available, and that’s exactly the kind of discussion we’d love to see more of on ct-policy@. As we’ve shared before, so much of the policy and approach rests on the sharing of information to help inform and craft better policies. We do our best with the data we have, holistically, but GlobalSign can make a real impact by sharing such data. For example, discussions about the logs you use, the latency you see, the error rates, etc, all help both the community and Log Operators better set expectations.


Next, on the topic of log diversity, I’m having a bit of trouble. From looking at https://ct.cloudflare.com/ , it looks like GlobalSign is perhaps leading the pack of CAs in terms of Log diversity and distribution. From your reply, it sounds like neither issue caused any discernible or meaningful disruption for GlobalSign, is that correct? If so, that highlights the point I was trying to capture - that some CAs were impacted by these issues is no doubt tied to decisions that those CAs made with regards to where and how they log, and as a consequence, faced greater disruption. While it’s certainly a goal to make sure no Log has disruption, I don’t think we can lay the blame for any and all CA disruptions at the feet of the Logs or Log Operators given the current state here. At the same time, it’s somewhat telling that it appears there’s no use of Argon in the mix, or any Trillian-based log. It may be an artifact of Cloudflare’s dashboard - I admit, I haven’t run the hard numbers myself for GlobalSign - but it seems odd to be concerned about Google Log outages or performance when it appears, on cursory glance, that steps aren’t being taken to mitigate those risks through diversity.


This is key to understanding what, I think, is the most nuanced and complicated of your points: the assessment of risk. It seems your focus on risk is, quite understandably, the risk on the issuance side of the pipeline, rather than on the relying party side. Broadly speaking, it seems the risk you’re most concerned about is a global, simultaneous, persistent Google Log outage. Without wanting to put words in your mouth, the impression I got is that you see that risk manifesting in several ways:

  1. Infrastructure homogeneity, such as using common Google infrastructure (front-ends, DoS capabilities) or networking

  2. Codebase homogeneity, such as the alignment on Trillian

  3. The raw number of Google logs, independent of those first two concerns

  4. Operational issues, such as the release management process and deployment


I hope that’s a fair presentation/restatement of the concerns, and that it’s not unreasonable to suggest that your primary concern is regarding the issuance of certificates.


You both propose and ask about a possible solution, such as substituting the Google Log requirement for something else. Given the set of concerns, it’s not unreasonable to see that as a possible solution. Unfortunately, I think it overlooks a number of practical limitations that have been previously discussed, while also overlooking some of the other risk factors that are a part of the calculus.


To be clear and up-front: Our goal is not to keep the Google Log requirement indefinitely. Since the beginning - quite literally in those first public versions of policy - we’ve been wanting to move to an ecosystem that is wholly independent of Google. But as I hope to show, there are real and practical challenges with that, which are still in the process of being addressed, and that the value being provided far, far exceeds the risk, even in consideration of these sorts of incidents.


Since it’s the oldest issue, the one I’ll tackle first is the question about independence. You’ve posed a question about “operators”, but if you recall, early versions of the policy included similar language, focused on “infrastructure or administrative access”. You may recall, from that thread, our disagreement about whether or not diverse logs was a “security” requirement, and I tried to explain and document why it was a fundamentally critical requirement, and Ben explained the security risk introduced by SCTs in the first place, which was a design concession to CA concerns.


Over time, this evolved into the current requirement for One Google Log, as captured in this thread from May 2015. Hopefully, that thread captures some of the reasons. There’s also this thread, from February 2017 following CT Policy Days, that captures more of the challenges and risks in quantifying some of the diversity requirements. We’re not alone in facing these challenges - you can see Gerv struggling to pin down a good solution for Mozilla, knowing these challenges.


These risks aren’t purely theory and armchair quarterbacking. We’ve seen them play out in the ecosystem already. The question of infrastructure independence has come up with Log Operators, both in the context of deploying to cloud providers as well as outages caused by both cloud providers and local infrastructure. We’ve seen a greater coalescing around implementations onto Trillian - which is positive for the ecosystem in some ways, particularly scalability, but understandably has the negatives of single-system risk, whether Google-operated or otherwise. As we’ve moved to require CT for all certificates, we know that there are real benefits to CAs otherwise ‘hiding’ certificates by colluding with Logs. We’ve seen multiple Log operators combine - sometimes publicly, as was the case with DigiCert and Symantec, sometimes privately, as was the case with StartCom and WoTrus/WoSign. We’ve seen Log operators issue SCTs and fail to incorporate them. The point is that all of these are meant to capture that the risks you are seemingly concerned about, with the One Google requirement, are not meaningfully addressed by sprinkling in more SCTs or trying to pin down diversity.


This opens up the bigger issue, though, which is the question about “Why One Google in the first place?”. It’s not just that defining diversity is hard, and it’s not about purely best practice either. As I alluded to earlier, the risks being mitigated here are not only those risks to CA issuance.


One aspect of this policy is about risk management for Chrome users and certificate subscribers. If Chrome is going to require CT for certificates, as it does, then it’s important to take reasonable steps to mitigate the risk that such a requirement would cause certificates to stop working for site operators and users. If all of the SCTs within a certificate are from Logs that are disqualified, that once-working certificate will no longer work. Some of that risk is mitigated by the number of SCTs required, but the long lifetime of certificates, and the unfortunate and wholly avoidable challenges with replacing certificates, means that risk very much has to be considered. By not only operating a Log, but requiring the Log, we are better able to ensure that certificate holders will not find their certificate rendered unusable for Chrome users. This assumption rests on the belief that Google Logs are more resilient and scalable, and less likely to experience critical, DQ-worthy failure. To be clear, it’s not that we would not disqualify a Google Log if necessary, but it’s a variable that we can control and invest in - and have, rather significantly - in the furtherance of greater transparency.


I mentioned this in the very first thread we had on the matter, when similarly talking about the “Too Big to Fail” scenario. While it is counter-intuitive to suggest that a critical requirement on a Google Log mitigates a single point of failure, when you consider the threat model that users and site operators face, rather than that of CAs, the requirement for a Google Log prevents a third party from becoming critical to Chrome users’ security and to the reliability of sites for Chrome users. For example, imagine a Log serving a known-hostile, split view of certificates. The clear action to take is to disqualify the Log. In the world you proposed, such a Log could be a “load bearing” Log, and disqualifying it could render millions of sites inoperable. As I called out several years ago, that’s a very similar story to where we find ourselves with CAs today, and as recent challenges have shown, doing the right thing isn’t always the easy thing; we should avoid introducing such issues in new systems.


However, the single largest reason for the ongoing one-Google requirement is the lack of a deployed SCT checking mechanism. As I mentioned, and Ben captured in some of those threads, SCTs were introduced as a concession for CAs concerned about performance and time-to-issuance. As reasonable as those decisions may have been, given the facts that were available, they introduced a new challenge: a need for clients to check SCTs as part of ensuring that a Log is behaving correctly. If SCTs are not checked, then a Log can provide split views or hide certificates, and as a consequence, become highly trusted and critical to Internet security. To be clear: in a world of SCTs, the choices are either to treat Logs as Trusted (much in the way that CAs are) or to deploy consistent verification of SCTs, either during or post-validation, to ensure meaningful detection of Log malfeasance.


CT’s key advance has been that it IS possible to cryptographically verify, detect, and prove Log shenanigans, and that’s a significant advancement over the Web PKI’s hierarchical model of trust. Unfortunately, one of the key challenges has been balancing the privacy needs of users against the operational challenges of deploying at Internet scale, and that takes time and is critical to “get right”. You can see we’ve been exploring this in the context of the DNS-based proof delivery mechanism, you can see explorations of this in the IETF TRANS WG’s work on gossip, and you can see this consideration factoring heavily into some of the changes of RFC 6962-bis.
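To make the “cryptographically verifiable” point concrete, here is a minimal sketch of the RFC 6962 Merkle inclusion-proof check - the primitive that lets a client confirm, without trusting the Log, that an entry really is covered by a signed tree head. The function names are mine, and a real client would additionally verify the STH signature and the SCT itself; this only shows the hash recomputation.

```python
import hashlib

def leaf_hash(entry: bytes) -> bytes:
    # RFC 6962: leaf hashes are domain-separated with a 0x00 prefix.
    return hashlib.sha256(b"\x00" + entry).digest()

def node_hash(left: bytes, right: bytes) -> bytes:
    # Interior nodes use a 0x01 prefix, preventing leaf/node confusion.
    return hashlib.sha256(b"\x01" + left + right).digest()

def verify_inclusion(leaf_index: int, tree_size: int, leaf: bytes,
                     path: list[bytes], root: bytes) -> bool:
    """Fold an audit path up to the root (RFC 6962 / RFC 9162 style)."""
    fn, sn = leaf_index, tree_size - 1
    r = leaf
    for p in path:
        if sn == 0:
            return False  # path is longer than the tree allows
        if fn & 1 or fn == sn:
            r = node_hash(p, r)            # sibling is on our left
            if not fn & 1:
                while fn and not fn & 1:   # skip levels where we were the
                    fn >>= 1               # rightmost, unpaired node
                    sn >>= 1
        else:
            r = node_hash(r, p)            # sibling is on our right
        fn >>= 1
        sn >>= 1
    return sn == 0 and r == root
```

For a four-leaf tree, the audit path for leaf 2 is its sibling leaf hash plus the hash of the left subtree; recomputing upward from those values must reproduce the root in the signed tree head, or the Log’s promise is provably broken.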


Absent that, however, the choices are either to introduce a Trusted Log or to trust all logs equally. Trusting all logs equally is not an easy decision - beyond all of the concerns elaborated above, it also introduces a host of new questions, such as “Why do I trust this Log?”, and the perennial favorite question of my WebTrust colleagues, “Should Logs be audited?” All of the cryptographic verifiability would be unnecessary in such a system, while simultaneously, real concerns about “Should Logs be operated by CAs?” would be introduced. Two Logs, for example, could collude or be compromised in such a way as to introduce significant risk to the ecosystem, and that’s both a real concern and one that has, unfortunately, historically been validated as legitimate within the CA ecosystem.


The alternative is what we’ve pursued, for this interim period: a Trusted Log. In this regard, the Logs that Google operates serve a critical security purpose for Chrome users and site operators: they ensure that any certificate that will be accepted by Chrome will be shared, by Google, through its Logs. Users and relying parties can inspect Google Logs to see what Chrome trusts. If you do not trust Google Logs to be honest, then you similarly should not trust Chrome to verify SCTs or certificates, to select which CAs to trust, or to run native code, since all of this is rooted in Google. This is certainly an imperfect solution, but it’s one that is intended to be temporary, as both client implementations and Log operators improve and grow.


While we’ve continued to make progress on developing the necessary tools and solutions to address this SCT-checking challenge, I can understand if there’s some frustration that it’s not here now. You’re not alone, if that is the case. Our focus and priority has been on improving the Log compliance and monitoring side, because we fully understand that it is an area that can very much impact CAs on the issuance side, and that is where CAs are feeling the most pain right now.


This is part of the holistic risk calculus we’re applying - considering users and site operators. We understand and acknowledge that there is a risk that if all the Google Logs encountered a simultaneous, global, consistent outage, CAs would face challenges in issuing certs. On the whole, however, Google can take steps to mitigate those risks, and similarly encourage CAs to take appropriate steps to do the same. Sharing data along with the concerns can help improve the dialog and highlight if our calculus is off, and we’re always open to understanding better. That said, the risks to users are of paramount importance to us, and to a lesser extent, but still greater than that of CAs, the risks to site operators. We want to make sure that we’re meaningfully mitigating those risks as best as possible, while the ecosystem continues to grow and improve.


It may be that the Log risk mitigation takes more time than expected, and we’ve discussed what that may mean in past CT Policy Days events. As we’ve shared in the past, other solutions in this space might mean formalizing the notion of Trusted Logs. As mentioned, the notion of a Trusted Log is about ensuring that relying parties and site operators have the risks of CA collusion mitigated, so there would understandably be significant challenges in identifying those criteria.


I realize this is a very long email, which is a product of how long this discussion has gone on, and of our not really having put it out there holistically in written form. I realize this doesn’t enumerate a specific and concrete timeline and set of steps for reducing the Google Log requirement, but rather principles - getting to those discrete steps is something Devon and I continue to work on. As you can see from the concerns, though, it’s very much a fluid thing - pulling in one area has an impact in another, so we want to make sure we’re thoughtfully balancing things and developing concrete, actionable, and meaningful solutions. Getting rid of the One Google requirement is not merely aspirational, it is a concrete goal of ours, but we’re balancing the steps to get there with those steps necessary to make sure the ecosystem is robust and growing.


That said, I wanted to acknowledge one more part of your message: the risk of both Google Trillian Logs going down at the same time. It sounds like you’re unhappy with the postmortems they’ve provided, and may see additional architectural risks not being addressed or acknowledged. On that front, I want to encourage you to push them for more details and ask the questions that can help you build that confidence. Despite the above “Trusted Log” discussion, we on the Chrome side intentionally and deliberately try to keep a wall between us and the CT team, to make sure that we’re holding all Logs to the same set of standards and expectations. As long as the “Trusted” status remains, it’s not unreasonable to hold the Google Logs to an even higher standard.



Kurt Roeckx

Dec 19, 2018, 5:21:46 PM12/19/18
to Ryan Sleevi, Doug Beattie, Certificate Transparency Policy
On Wed, Dec 19, 2018 at 03:58:30PM -0500, Ryan Sleevi wrote:
>
> Next, on the topic of log diversity, I’m having a bit of trouble. From
> looking at https://ct.cloudflare.com/ , it looks like GlobalSign is perhaps
> leading the pack of CAs in terms of Log diversity and distribution. From
> your reply, it sounds like neither issue caused any discernible or
> meaningful disruption for GlobalSign, is that correct? If so, that
> highlights the point I was trying to capture - that some CAs were impacted
> by these issues is no doubt tied to decisions that those CAs made with
> regards to where and how they log, and as a consequence, faced greater
> disruption.

As long as a Google log is required, no amount of diversity helps.
If all Google logs are down, no certificates can be issued.

What the Cloudflare page shows is that all CAs mentioned there
spread their precertificates over at least 2 Google logs. Sectigo
seems to be the only one that clearly favours one Google log over
the others.

But what that doesn't show is whether this spread happens all the
time, or whether they just switched from one log to another. It's
about all the certificates they ever issued, not just those from
the past month or so.

Note that spreading the load over multiple Google logs can also be
a problem if one of the logs has an issue. They might switch to
only using the other log, and the shift in load might trigger the
DoS protection.


I think many of the other points you're talking about are really
about trusting the logs. Logs are not supposed to be trusted; we
should have technical measures to make sure they are working
properly. But this all still needs to be implemented. I think if
we ever get to a situation where trust in a log is no longer a
requirement, the requirement for a Google log can also go away.


Kurt

Ryan Sleevi

Dec 19, 2018, 6:59:04 PM12/19/18
to Kurt Roeckx, Ryan Sleevi, Doug Beattie, Certificate Transparency Policy
On Wed, Dec 19, 2018 at 5:21 PM Kurt Roeckx <ku...@roeckx.be> wrote:
On Wed, Dec 19, 2018 at 03:58:30PM -0500, Ryan Sleevi wrote:
>
> Next, on the topic of log diversity, I’m having a bit of trouble. From
> looking at https://ct.cloudflare.com/ , it looks like GlobalSign is perhaps
> leading the pack of CAs in terms of Log diversity and distribution. From
> your reply, it sounds like neither issue caused any discernible or
> meaningful disruption for GlobalSign, is that correct? If so, that
> highlights the point I was trying to capture - that some CAs were impacted
> by these issues is no doubt tied to decisions that those CAs made with
> regards to where and how they log, and as a consequence, faced greater
> disruption.

> As long as a Google log is required, no amount of diversity helps.
> If all Google logs are down, no certificates can be issued.

I agree, but I think that's overlooking a very important point in these postmortems: if you're using only one Google Log, for example, it doesn't matter whether it's all Logs or that one Log that goes down - you're just as exposed. Understanding that, a big mitigation step for CAs is to actually use multiple, diverse Logs - that's literally why they exist.
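To illustrate what "use multiple, diverse Logs" might look like operationally, here's a hypothetical sketch of a CA-side submission loop that shuffles its configured Log list and keeps going past individual failures until it has enough SCTs. The log URLs and the `submit` callback are purely illustrative, not a real CA or Log API.

```python
import random

def collect_scts(precert: bytes, logs: list[str], needed: int, submit) -> dict[str, bytes]:
    """Gather `needed` SCTs from a diverse set of logs.

    `submit(log_url, precert)` is an assumed helper that returns an SCT
    (bytes) or raises on failure - e.g. an outage or rate limiting.
    """
    scts: dict[str, bytes] = {}
    # Randomize order so load spreads across logs rather than hammering one.
    for log in random.sample(logs, len(logs)):
        if len(scts) >= needed:
            break
        try:
            scts[log] = submit(log, precert)
        except Exception:
            continue  # this log is unavailable; fall through to the next
    if len(scts) < needed:
        raise RuntimeError(f"only {len(scts)} of {needed} SCTs collected")
    return scts
```

The design point is simply that a single unreachable Log degrades into a retry against the next one, instead of a failed issuance - which is the exposure difference Kurt and I are discussing above.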

> What the Cloudflare page shows is that all CAs mentioned there
> spread their precertificates over at least 2 Google logs. Sectigo
> seems to be the only one that clearly favours one Google log over
> the others.
>
> But what that doesn't show is whether this spread happens all the
> time, or whether they just switched from one log to another. It's
> about all the certificates they ever issued, not just those from
> the past month or so.
>
> Note that spreading the load over multiple Google logs can also be
> a problem if one of the logs has an issue. They might switch to
> only using the other log, and the shift in load might trigger the
> DoS protection.

Yes, exactly. And this is something CAs can take steps to address, and something we've been encouraging for some time. While the possibility of all Google Logs going down simultaneously is important and concerning, it's not as concerning as a single Log's issues being able to cause problems for CAs - and that's where we want to make sure that best practices are being developed, discussed, and followed.
 
> I think many of the other points you're talking about are really
> about trusting the logs. Logs are not supposed to be trusted; we
> should have technical measures to make sure they are working
> properly. But this all still needs to be implemented. I think if
> we ever get to a situation where trust in a log is no longer a
> requirement, the requirement for a Google log can also go away.

Yes. That is exactly the goal. These are challenges to solve. At the same time, we want to make sure the pursuit of solving the trust challenge, which is important, doesn't cause us to ignore other challenges. And the reason we still have the One Google requirement is that some of those other challenges have been more pressing, and more risky, for CAs.

The biggest challenge is third-party Logs operating reliably and at scale. I'm not trying to gloss over the issues with Google Logs, but given that third-party Logs also play a significant part in the ecosystem, we've wanted to make sure they have the policies, tools, and technologies to succeed, and that the community has the tools, processes, and techniques to be successful when the One Google Log requirement is removed.