Google Trust Services - Minor SCT issue disclosure

Andy Warner

Aug 23, 2018, 8:50:18 AM
to mozilla-dev-s...@lists.mozilla.org
Please note that Google wrote this report for internal use immediately after the issue. We intended to post it to m.d.s.p at that time, but securing internal approvals took a while and the posting ended up on the back burner for a bit. It was a minor issue, but we want the community to be aware of it.

Summary:

May 21st 2018, a new tool for issuing certificates within Google was made available to internal customers. Within hours we started to receive reports that Chrome Canary (v67) with Certificate Transparency checks enabled was showing warnings. A coding error led to the new tool providing Signed Certificate Timestamps (SCTs) from 2 Google CT logs instead of one Google and one non-Google log.

* NOTE: Affected certs were logged at issuance to at least 2 Google CT logs and 2 non-Google CT logs. The embedded SCTs for affected certs only provided proofs from Google logs instead of Google and non-Google logs as required by Chrome.

* NOTE: The bug was due to an 'if/else' chain fall through. The code in question has been refactored to be simpler and more readable.

The issue was mitigated within 4 hours and fully resolved ~14 hours after initial notification. Triage and code fixes happened within 11 hours, and it took ~3 hours to deploy the fixed code and confirm the fixed behavior in production. The new code was running in relatively few locations, so deployment was quick compared to some changes in our infrastructure.

Most affected customers responded quickly to communications that they should replace their certificates and revoke the old ones before a given deadline. All certificates issued with an SCT set that was not fully compliant were revoked on 2018-06-19 if the customer had not already revoked them.

Timeline:

2018-03-22 Bug introduced to codebase.
2018-05-21 Push including bug became available to clients.
2018-05-22 08:05 UTC First user reports that Chrome Canary presents a CT warning for a certificate.
2018-05-22 09:25 UTC Bug filed with initial assessment.
2018-05-22 12:01 UTC Frontend jobs with the bug are taken offline following standard CA procedures.
2018-05-22 15:59 UTC Issue conclusively identified.
2018-05-22 19:07 UTC Fix is submitted.
2018-05-22 21:48 UTC Fix starts to be rolled out.
2018-05-22 22:16 UTC Fix fully deployed and tested on test instances followed by deployment to production. Access to frontends restored.
2018-05-24 Customer communication sent to affected users to ask them to renew their certificates and revoke the old ones.
2018-06-19 The final handful of certificates that had not already been revoked and replaced by users were revoked by the CA.

Findings:

* The operational plan to halt issuance worked as expected and was implemented quickly.
* The problem was quickly found, fully understood and easy to remedy.
* Tests existed, but did not cover this failure case.

Remediation Plan:
* Completed
** Message of the Day (MOTD) functionality was added or improved for all issuance systems to make it easier to communicate issues to users when issuance is intentionally paused.
** Test coverage was expanded to ensure that both the quantity and type of SCTs are checked.
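
A check on both the quantity and the type of embedded SCTs could look roughly like the sketch below. This is an illustration only, not Google's test code: the `Sct` record, its `google_operated` flag, the `check_embedded_scts` name, and the default minimum of 2 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sct:
    log_name: str          # hypothetical label for the CT log that issued the SCT
    google_operated: bool  # hypothetical flag; real code would consult a log list


def check_embedded_scts(scts, min_count=2):
    """Return a list of problems; an empty list means the set passes this check."""
    problems = []
    # Quantity: enough SCTs are embedded (min_count is an assumed policy value).
    if len(scts) < min_count:
        problems.append(f"expected at least {min_count} SCTs, got {len(scts)}")
    # Type: at least one SCT from a Google log and one from a non-Google log.
    if not any(s.google_operated for s in scts):
        problems.append("no SCT from a Google-operated log")
    if not any(not s.google_operated for s in scts):
        problems.append("no SCT from a non-Google log")
    return problems
```

A set holding two SCTs from Google logs satisfies the quantity check but fails the type check, which is the case described in the summary.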

Alex Gaynor

Aug 23, 2018, 8:58:00 AM
to awa...@google.com, mozilla-dev-s...@lists.mozilla.org
Hi Andy,

Just so I follow, this is something you're proactively sharing, right? As
far as I can tell, there's no violation of any Mozilla Root Program rules
here, just an issue that caused interstitials in Chrome.

Either way, I appreciate your sharing.

You mentioned the issue was due to some overly complex control flow. In
order to help other CAs out, do you think there are testing methodologies
that could have helped catch this earlier?

Alex

Andy Warner

Aug 23, 2018, 9:06:38 AM
to aga...@mozilla.com, mozilla-dev-s...@lists.mozilla.org
Correct, we do not believe there was a policy violation; we're proactively
sharing in the interest of transparency and knowledge sharing.

I believe there is additional information we could share about how we've
modified testing to ensure compliance with Chrome and Safari's SCT
inclusion rules and have more flexible tests. I want to discuss this with
the engineer who implemented the changes to ensure they agree with how I
would summarize the changes. Update to follow.

Nick Lamb

Aug 23, 2018, 10:55:18 AM
to dev-secur...@lists.mozilla.org, Andy Warner
On Thu, 23 Aug 2018 05:50:05 -0700 (PDT)
Andy Warner via dev-security-policy
<dev-secur...@lists.mozilla.org> wrote:

> May 21st 2018, a new tool for issuing certificates within Google was
> made available to internal customers. Within hours we started to
> receive reports that Chrome Canary (v67) with Certificate
> Transparency checks enabled was showing warnings. A coding error led
> to the new tool providing Signed Certificate Timestamps (SCTs) from 2
> Google CT logs instead of one Google and one non-Google log.

Feel free to jump in anywhere I've made a mistake, this might totally
invalidate some of my questions.

Presumably, since you eventually "fixed" this by asking Subscribers to
re-issue, the SCTs are baked into a signed certificate, rather than
provided separately so that the Subscriber can use them with e.g.
Stapling technologies?

Which means that this "new tool" also involved a Google controlled
subCA signing these certificates with, as it turns out, the wrong SCTs
in them. It's not clear to me if the tool and CA are operationally one
and the same.

Q1: Could a more significant "coding error" in this tool have resulted
in certificates being mis-issued (for example with SANs that don't
belong to Google, or lacking mandatory X.509 fields, or without being
CT logged)? If not please explain why the tool couldn't cause this.

Q2: If this error hadn't caused a negative end-user experience, what
mechanisms if any do you believe would have brought it to your
attention and how soon? e.g. does a team sample resulting certificates
from this tool at some interval? If it samples pre-certificates that
would not have detected this error, but is worth mentioning.

Q3: Such mistakes are of course inevitable in software development. But
they could also be introduced maliciously. Were you able to confidently
identify which specific individual(s) made the relevant change? (I don't
want names). Are you confident you'd be able to do this even if somehow
the production tool turned out not to match your revision control
systems?

Thanks as always for satisfying my curiosity

Nick.

Andy Warner

Aug 23, 2018, 11:19:24 AM
to n...@tlrmx.org, dev-secur...@lists.mozilla.org
Google provides SCTs via embedding and during SSL handshaking, depending on
the certificate and how it is served. In this case, all of the affected
certs used embedded SCTs. The issue was the selection of which SCTs to
include: we submit to more CT logs than required, but only embed the
required number of SCTs to keep cert sizes as small as possible. These
certs were submitted to 4 CT logs (2 Google, 2 non-Google), but the embedded
SCTs were only from the 2 Google logs, not one Google and one non-Google.
The CA signed 4 correct SCTs and all 4 were submitted to CT logs; the
problem was the embedding logic for the SCTs.

In response to Q1, the logic involved was specific to selection and
embedding of SCTs, not part of validation logic, so a related error would
not affect validation. An unrelated error in validation logic could of
course affect validation, but all CAs have that risk and like other CAs we
have multiple layers of safeguards on validation logic.

For Q2, we sample certs regularly and make use of proven external linting
libraries and our own linting and audit logic. In this case, because the
issue was not something normally checked by external tools and the behavior
was perfectly fine until the Chrome deadline in April, I can only posit
that we would have discovered it fairly quickly via other means. We now
have specific checks for this issue and for other similar problems we could
foresee.

For Q3, we could account for the initial submission fully and understand
exactly what happened. Google has rigorous version control and enforcement
systems to ensure only properly reviewed and built code can enter
production and to reconcile running code against the cut point for an
approved release. Our CA systems have additional safeguards on top of those
standard tools to further ensure that we have strong knowledge of the
pedigree of all code and how it was built and deployed.

Ryan Sleevi

Aug 23, 2018, 12:40:20 PM
to Andy Warner, mozilla-dev-security-policy
On Thu, Aug 23, 2018 at 8:50 AM, Andy Warner via dev-security-policy <
dev-secur...@lists.mozilla.org> wrote:
>
> * NOTE: The bug was due to an 'if/else' chain fall through. The code in
> question has been refactored to be simpler and more readable.
>

Andy,

It might be good for the community if you could describe the processes
before and after the change, so that other CAs can help prevent similar
issues with their own embedding systems.

Andy Warner

Aug 24, 2018, 4:13:53 AM
to ry...@sleevi.com, mozilla-dev-s...@lists.mozilla.org
The code at issue evolved as CT requirements changed. What started off as a
very simple conditional grew into a more complex if / else if block with
somewhat complicated logic and inline checks. As part of the fix, we
simplified the conditionals and refactored the inline checks to make use of
nice clear IsExternallyOperated() and IsGoogleOperated() functions. The end
result is a much more readable and clear set of logic that is easier to
test, and we expanded test coverage as well. I think the big lesson for the
community is that it would have been better to have refactored earlier
rather than evolving the code to the point it became more complicated than
it needed to be.
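
The actual code was not posted to the list, so the following Python sketch is purely illustrative of the kind of change described. The `Sct` record, the operator strings, and both selection functions are hypothetical; only the predicate names mirror the IsGoogleOperated() and IsExternallyOperated() functions mentioned above. The first function shows how a catch-all `else` at the end of an if / else if chain can accept a second Google SCT, and the second shows the same choice written with named predicates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sct:
    log_name: str  # hypothetical log label
    operator: str  # hypothetical operator name, e.g. "Google"


def is_google_operated(sct):
    return sct.operator == "Google"


def is_externally_operated(sct):
    return not is_google_operated(sct)


def select_with_fall_through(scts, required=2):
    """Hypothetical buggy chain: the final else accepts any SCT."""
    chosen = []
    for sct in scts:
        if len(chosen) >= required:
            break
        elif sct.operator == "Google" and not any(c.operator == "Google" for c in chosen):
            chosen.append(sct)
        elif sct.operator != "Google" and not any(c.operator != "Google" for c in chosen):
            chosen.append(sct)
        else:
            chosen.append(sct)  # fall-through: a second Google SCT lands here
    return chosen


def select_for_embedding(scts):
    """Pick one Google-operated and one externally operated SCT, or fail loudly."""
    google = next((s for s in scts if is_google_operated(s)), None)
    external = next((s for s in scts if is_externally_operated(s)), None)
    if google is None or external is None:
        raise ValueError("need one Google-operated and one externally operated SCT")
    return [google, external]
```

With four SCTs ordered as two Google logs followed by two non-Google logs, the chained version returns both Google SCTs, while the predicate version returns one of each and raises an error when that is not possible.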