
Google Aviator incident under investigation


Paul Hadfield

Oct 17, 2016, 7:43:15 AM
to Certificate Transparency Policy
On Sunday, October 16th, 2016, our automated monitoring of the Google Aviator Log revealed some operational issues.

We are investigating and will circulate our findings in a report as soon as they are ready.


The Certificate Transparency Team

Paul Hadfield

Oct 21, 2016, 7:19:52 AM
to Certificate Transparency Policy
Hello Chromium ct-policy,

As promised on Monday October 17, here are the findings of our investigation of the recent Google Aviator incident.

regards,
Google Certificate Transparency Team.

--

SUMMARY:

On 16 October 2016 the Google 'Aviator' log exceeded its stated Maximum Merge Delay (MMD). The apparent merge delay rose over the course of three days from its usual level of about 1.5 hours to 26.2 hours. This serious operational issue arose because of an unusually large backlog of certificates submitted to the log over the preceding day; Aviator was not able to incorporate those submissions into new Signed Tree Heads quickly enough.


Exceeding the stated Maximum Merge Delay is a violation of RFC 6962, section 3:


The log MUST incorporate a certificate in its Merkle Tree
within the Maximum Merge Delay period after the issuance of the SCT.


It is also a violation of the Chromium Certificate Transparency policy:


Log Operators must ... incorporate a certificate for which an SCT has been issued by the Log within the MMD.


IMPACT:

Five consecutive runs of Aviator's signer failed to incorporate recently submitted chains within the MMD. They are as follows (an illustrative compliance check is sketched after the list):


Submitted chains in index range [35910127, 35919332) were not incorporated within MMD for STH signed at timestamp 1476653510423 for tree size 35936627 (34.7% of entries sequenced for that STH).


Submitted chains in index range [35936627, 35962877) were not incorporated within MMD for STH signed at timestamp 1476657100609 for tree size 35962877 (100% of entries sequenced for that STH).


Submitted chains in index range [35962877, 35982377) were not incorporated within MMD for STH signed at timestamp 1476663459670 for tree size 35982377 (100% of entries sequenced for that STH).


Submitted chains in index range [35982377, 36081668) were not incorporated within MMD for STH signed at timestamp 1476671114710 for tree size 36084877 (96.9% of entries sequenced for that STH).


Submitted chains in index range [36084877, 36113179) were not incorporated within MMD for STH signed at timestamp 1476675980176 for tree size 36255877 (16.6% of entries sequenced for that STH).
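

For illustration, a minimal Python sketch of the check a monitor can apply (this is not the CT team's monitoring code; the SCT timestamp below is a hypothetical value chosen to reproduce the 26.2-hour delay described above):

# Sketch only: flag an entry whose incorporation exceeded the log's MMD.
# Timestamps are milliseconds since the Unix epoch, as in RFC 6962.

MMD_MS = 24 * 60 * 60 * 1000  # Aviator's stated MMD: 24 hours

def merge_delay_ms(sct_timestamp_ms, incorporating_sth_timestamp_ms):
    """Apparent merge delay: time from SCT issuance to the first STH
    whose tree size covers the corresponding entry."""
    return incorporating_sth_timestamp_ms - sct_timestamp_ms

sct_ts = 1476559190423  # hypothetical SCT timestamp, not taken from the report
sth_ts = 1476653510423  # STH timestamp quoted in the first impact item above

delay = merge_delay_ms(sct_ts, sth_ts)
if delay > MMD_MS:
    print("MMD exceeded by %.1f hours" % ((delay - MMD_MS) / 3600000.0))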



ROOT CAUSE:

A large backlog of submitted certificates built up over a six-hour period during the early hours (PDT) of Sunday 16 October, caused by the (non-malicious) actions of some high-volume clients. Aviator's signer could not sequence the submitted certificate chains quickly enough to clear the backlog, and the problem was exacerbated by the fact that our protection against flooding did not activate when expected.


REMEDIATION AND PREVENTION:

During the impact period, Google's engineers worked to bring down the size of Aviator's backlog of submitted certificates in an attempt to avoid a policy violation. As part of that effort the /ct/v1/add-chain and /ct/v1/add-pre-chain endpoints were made temporarily unavailable; this was not announced.


/add-chain was unavailable from 2016-10-16 16:42 PDT to 2016-10-16 21:30 PDT (4.75 hrs);

/add-pre-chain was unavailable from 2016-10-16 16:42 PDT to 2016-10-16 17:42 PDT (1 hr).


Google are using the lessons learned from this incident to improve operational practices for the Pilot, Rocketeer, Submariner, Icarus and Skydiver logs; in particular the sequencing operation has been tuned, as have protections for the logs against flooding.  Monitoring has been revised to provide earlier warning of similar events in the future.


Tom Ritter

Oct 21, 2016, 10:37:48 AM
to Paul Hadfield, Certificate Transparency Policy
On 21 October 2016 at 06:19, 'Paul Hadfield' via Certificate
Transparency Policy <ct-p...@chromium.org> wrote:
> It is also a violation of the Chromium Certificate Transparency policy:
>
>
> Log Operators must ... incorporate a certificate for which an SCT has been
> issued by the Log within the MMD.

I don't see any mention of Aviator's status with Chrome. Is this still
being decided, or is no news to be taken as no change will be made?

-tom

Paul Hadfield

Oct 21, 2016, 10:39:19 AM
to Tom Ritter, Certificate Transparency Policy
I think Chrome are yet to state their position w.r.t. Aviator.

Paul

Ryan Hurst

Oct 21, 2016, 10:40:51 AM
to Tom Ritter, Paul Hadfield, Certificate Transparency Policy
Tom,

Something that is not obvious to those on the outside (and we need to do a better job of making sure it is) is that the "CT Team" is not part of Chrome.

This is why the log inclusion policy is called the Chrome CT Policy.

Now that the CT Team has published the incident report, Chrome has enough information to decide what they think the right response should be. I think this will take them a few days to work out.

Ryan



Ryan Sleevi

Oct 21, 2016, 4:16:53 PM
to Tom Ritter, Paul Hadfield, Certificate Transparency Policy
On Fri, Oct 21, 2016 at 7:37 AM, Tom Ritter <t...@ritter.vg> wrote:
I don't see any mention of Aviator's status with Chrome. Is this still
being decided, or is no news to be taken as no change will be made?

In general, we (Chrome) try to work with log operators to notify and understand the issues, and discuss the incident.

As far as remediation steps to take, I'm mixed, and would greatly appreciate feedback from the broader community of CAs, Log Operators, Auditors, and interested parties.

On the one hand, I can see a very compelling argument to remove the Aviator log (or more aptly, treat it as no longer trusted within the next few weeks). This is consistent with a strict zero-tolerance policy, which, despite all the negatives that zero tolerance involves, has the benefit that it hopefully rises beyond even the slightest suggestion of treating Google logs differently.

On the other hand, there is some material difference between this and some of the other incidents we've seen, whether with respect to uptime (Certly) or with respect to violating the append-only property (Izenpe). I had been planning a more thorough write-up of these concerns, but then I remembered I provided much of this back in June - https://groups.google.com/a/chromium.org/d/msg/ct-policy/AH9JHYDljpU/f4I9vQLACwAJ

When we examine the nature of this specific failure - the failure to integrate the SCTs within the MMD - we have to think about the impact. Unlike an uptime issue, so long as the STH is eventually consistent, there's no ability to hide misissued certificates longer than the window of non-compliance (which, as I understand it, is 2.2 hours). As the STH eventually integrated all of these SCTs, the first and foremost concern - the ability for the log to mask misissuance - seems to have been limited, and that window is significantly less than the window during which a stapled OCSP response could be used to assert 'goodness' (roughly 3.5 days).

That said, I wouldn't want to see a scenario in which logs routinely blow MMDs - that creates a system in which the reliability of the system becomes suspect, and increases the window of detection of misissuance.

I'm curious how the broader community feels, and, as I indicated above, I can see arguments for both. I think the ecosystem can support either action as well, so this is largely about understanding the threat model, community norms, and whether enforcing a zero tolerance policy provides the greatest benefit to the ecosystem, which it may very well do.

Thoughts?

Tom Ritter

Oct 21, 2016, 4:44:42 PM
to Ryan Sleevi, Paul Hadfield, Certificate Transparency Policy
I've never been a big fan of zero-tolerance policies. I mean, if your
private key gets compromised, or is significantly mishandled (Izenpe),
then yes; but poor uptime or missing an MMD seems like the type of
thing you give someone two, maybe three strikes at before you're out.

(Depending, of course, on the type of MMD miss. Missing an MMD because
you're behind and then you catch up a half-day later - fix your error
and let's move on. Do it again, and you're probably out. Maybe
there'd be a second exception if someone went out of their way to try
and DOS you and it took you an MMD to recover. But miss it because of
a bug that does not include SCTs into the log at all until someone
noticed (somehow!) - that would be a one-strike-you're-out situation.)

So while it puts you in an odd position of not wanting to appear
'soft' on your own company - I think it's reasonable to consider this
a warning shot and to not distrust the log.

In general, I think it is advantageous to the community to allow some
wiggle room for mistakes or circumstances that can be proven (or
reasonably inferred) to have little or no security impact. And we
expect the operators to actively address their shortcomings. But
zero-tolerance policies would likely discourage people from operating
logs. And I would like to see the general Transparency initiative
flourish (in both the certificate direction, and others.)

-tom

Linus Nordberg

Oct 21, 2016, 5:05:33 PM
to Tom Ritter, Certificate Transparency Policy
Tom Ritter <t...@ritter.vg> wrote
Fri, 21 Oct 2016 15:44:21 -0500:

> So while it puts you in an odd position of not wanting to appear
> 'soft' on your own company - I think it's reasonable to consider this
> a warning shot and to not distrust the log.

+1.

Walter Goulet

Oct 21, 2016, 5:14:12 PM
to Tom Ritter, Ryan Sleevi, Paul Hadfield, Certificate Transparency Policy
Hi Ryan,

As a log operator (Venafi), we are working actively to ensure that we can
properly scale up to handle higher volumes of certificate submissions from
high volume CAs in the future.

For situations like this, where a log operator is overwhelmed by a backlog of
certificate submissions, I would hope that the log operator is afforded an
opportunity to learn from the incident so they can re-engineer their
solution to scale properly. I agree with Tom's point that if the issue
happens repeatedly, demonstrating that the log operator is not learning
from their mistakes, then it is fair to consider distrusting them.

Thanks,
Walter

Ryan Sleevi

Oct 21, 2016, 5:17:14 PM
to Linus Nordberg, Tom Ritter, Certificate Transparency Policy
Note that I linked to a previous post in which Pilot and Aviator had 'incidents'. Consider both https://groups.google.com/a/chromium.org/d/msg/ct-policy/Itoq0YUZTlA/24hkszkVBAAJ and https://groups.google.com/a/chromium.org/d/msg/ct-policy/dqoW-QMdKr8/8kt_ghhmCAAJ


If we accept this view - that it's within the realm of 'minor' - then how best should we (Chrome) keep track of incidents in a way that the community can reasonably evaluate the 'overall' performance of the log? How long should that performance be evaluated - against the lifetime of the log? Against the past N months?

I don't have good answers, and I'm not presenting these questions to disagree, but rather to highlight the challenges :)

Ryan Sleevi

Oct 21, 2016, 5:18:21 PM
to Ryan Sleevi, Linus Nordberg, Tom Ritter, Certificate Transparency Policy
Oh, and I do want to stress that I don't think "Log distrusting" should be a significant/serious event, particularly with respect to reputation. That is, the system was designed to accommodate logs coming and going; it's mostly a question of how frequently they're coming and going that becomes an issue :)

Peter Bowen

Oct 21, 2016, 5:26:30 PM
to Ryan Sleevi, Linus Nordberg, Tom Ritter, Certificate Transparency Policy
Would Chrome reconsider the portion of the CT policy that says "SCT
from a log qualified at the time of check is presented"? If there is
a known good STH up to a certain point, what is the risk of simply
accepting SCTs included in the log as of that STH? The advantage of
this is a cert with, say, Aviator and Izenpe embedded SCTs would still
be trusted without having to find a server that does SCT delivery via
TLS extension or OCSP stapling.

Thanks,
Peter

Ryan Sleevi

Oct 21, 2016, 5:30:37 PM
to Peter Bowen, Ryan Sleevi, Linus Nordberg, Tom Ritter, Certificate Transparency Policy
On Fri, Oct 21, 2016 at 2:26 PM, Peter Bowen <pzb...@gmail.com> wrote:
Would Chrome reconsider the portion of the CT policy that says "SCT
from a log qualified at the time of check is presented"?  If there is
a known good STH up to a certain point, what is the risk of simply
accepting SCTs included in the log as of that STH?  The advantage of
this is a cert with, say, Aviator and Izenpe embedded SCTs would still
be trusted without having to find a server that does SCT delivery via
TLS extension or OCSP stapling.

I would think the risk here is that, if a log were disqualified, would the community still be watching the log's compliance? For example, could it backdate?

The requirement that at least one current SCT be included exists because it provides a verifiable timestamp (among other things) against which to evaluate the rest of the SCTs, and so long as the log is qualified, we can presume that timestamp is not forged.

Adopting the policy you suggest seems like it would make things fundamentally insecure, but it's also entirely possible (and likely, given I haven't had my coffee yet) that I'm missing something obvious.

Rob Stradling

Oct 21, 2016, 5:47:06 PM
to sle...@chromium.org, Certificate Transparency Policy
I think Aviator should remain trusted in Chrome. This was a relatively
minor transgression that occurred in unusual circumstances. The log
operator has been transparent about what happened and has acted fast to
take corrective measures.

We want to encourage other organizations to run logs (especially after
Google's announcement this week at CABForum!). I fear that killing
Aviator would make other organizations think twice about standing up
logs for inclusion in Chrome.

If Google can't comply with the CT Policy, who can?

--
Rob Stradling
Senior Research & Development Scientist
COMODO - Creating Trust Online

Ben Laurie

Oct 21, 2016, 5:51:47 PM
to Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
The important thing for me is that the policy as currently stated is impossible to comply with: if a source of new, valid certificates logs them as rapidly as it can, then either we have to get behind in sequencing, or we have to throttle the input (i.e. become unresponsive).

Both of these actions are policy violations.

I don't see any other choice.


Peter Bowen

Oct 21, 2016, 6:02:12 PM
to Ryan Sleevi, Linus Nordberg, Tom Ritter, Certificate Transparency Policy
I think we may be approaching this from the wrong end. Chrome
published a policy but has never published the risk (e.g. threat
model) that the policy is trying to mitigate. Given the current log
ecosystem has multiple operators with more than one log, we really
should be talking about both the operator and the log in these
discussions. Is the non-compliance with CT policy a log issue or a
log operator issue? What is the purpose of disqualifying a log (even
for downtime/availability) if the operator is allowed to immediately
resubmit? Assuming the problem was a fluke outside the operator's
control (e.g. DDOS attack on their DNS provider), is the answer "don't
change anything, submit again, and good to go in 90 days?" What does
this accomplish?

Thanks,
Peter

Ryan Sleevi

Oct 21, 2016, 6:14:37 PM
to Ben Laurie, Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
On Fri, Oct 21, 2016 at 2:51 PM, 'Ben Laurie' via Certificate Transparency Policy <ct-p...@chromium.org> wrote:
The important thing for me is that the policy as currently stated is impossible to comply with: if a source of new, valid certificates logs them as rapidly as it can, then either we have to get behind in sequencing, or we have to throttle the input (i.e. become unresponsive).

Both of these actions are policy violations.

I don't see any other choice.


I don't think that's a fair statement, and I'm surprised to hear you state it.

You can throttle input, which is effectively an 'outage', as provided for in https://www.chromium.org/Home/chromium-security/certificate-transparency/log-policy . So long as the MMD is maintained, and the overall outage does not regress past the 99% uptime, this is still compliant.
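
To make that trade-off concrete, here is a small sketch of the two checks being described; the 90-day measurement window is my assumption, since the policy text doesn't pin one down here:

MMD_HOURS = 24.0
UPTIME_TARGET = 0.99
WINDOW_HOURS = 90 * 24  # assumed measurement window, not taken from the policy

def still_compliant(outage_hours):
    """Throttling 'outages' are acceptable only if no single outage exceeds
    the MMD and total downtime stays within the uptime target."""
    total_down = sum(outage_hours)
    uptime = 1.0 - total_down / WINDOW_HOURS
    longest = max(outage_hours) if outage_hours else 0.0
    return longest <= MMD_HOURS and uptime >= UPTIME_TARGET

# Example: the two endpoint outages from the report (4.75 h and 1 h) use only
# a small part of the ~21.6 h of downtime a 99% target allows over 90 days.
print(still_compliant([4.75, 1.0]))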

I suspect you're more specifically thinking of "What happens when a single certificate is presented, and the option is either blow MMD or blow 99% uptime", which is a possible situation, but one would have hoped that the Log Operator took appropriate steps to avoid that situation, since a variety of options exist - up to and including no longer including CAs as accepted by the Log until the Log Operator is able to scale.

But certainly I want to find solutions; I'm just not sure I agree with statements that it's "impossible" to comply.

Ryan Sleevi

Oct 21, 2016, 6:16:12 PM
to Peter Bowen, Ryan Sleevi, Linus Nordberg, Tom Ritter, Certificate Transparency Policy
On Fri, Oct 21, 2016 at 3:02 PM, Peter Bowen <pzb...@gmail.com> wrote:
I think we may be approaching this from the wrong end.  Chrome
published a policy but has never published the risk (e.g. threat
model) that the policy is trying to mitigate. 

We've discussed various threats on this list several times, including in one set of discussions I already linked to in this thread.
 
Given the current log
ecosystem has multiple operators with more than one log, we really
should be talking about both the operator and the log in these
discussions.  Is the non-compliance with CT policy a log issue or a
log operator issue?  What is the purpose of disqualifying a log (even
for downtime/availability) if the operator is allowed to immediately
resubmit?

That presumes submissions that are technically compliant are automatically accepted. That's never been stated.
 
  Assuming the problem was a fluke outside the operator's
control (e.g. DDOS attack on their DNS provider), is the answer "don't
change anything,"

That's also not the policy. At a minimum, the key must change - so that any doubts about the previous log are avoided.

Tom Ritter

Oct 21, 2016, 6:43:36 PM
to Ryan Sleevi, Peter Bowen, Linus Nordberg, Certificate Transparency Policy
On 21 October 2016 at 16:16, Ryan Sleevi <rsl...@chromium.org> wrote:
> Note that I linked to a previous post in which Pilot and Aviator had
> 'incidents'.
>
> Consider both
> https://groups.google.com/a/chromium.org/d/msg/ct-policy/Itoq0YUZTlA/24hkszkVBAAJ
> and
> https://groups.google.com/a/chromium.org/d/msg/ct-policy/dqoW-QMdKr8/8kt_ghhmCAAJ
>
> If we accept this view - that it's within the realm of 'minor' - then how
> best should we (Chrome) keep track of incidents in a way that the community
> can reasonably evaluate the 'overall' performance of the log? How long
> should that performance be evaluated - against the lifetime of the log?
> Against the past N months?
>
> I don't have good answers, and I'm not presenting them to disagree, but to
> moreso highlight the challenges :)


Make a wiki page next to the 'known-logs' page that links to incident
reports posted in this forum.

Log performance/operation will be measured based on the totality of
the situation. Blowing an MMD because of low capacity or a log having
a big piece of downtime twice in 2 months would be considered
differently from twice in two years. In other words, handwave about
the problem and try not to overspecify what actions you might take
about future unspecified events with unspecified details. =) Not so
different from CA inclusion!



On 21 October 2016 at 16:17, Ryan Sleevi <rsl...@chromium.org> wrote:
> Oh, and I do want to stress that I don't think "Log distrusting" should be a
> significant/serious event, particularly with respect to reputation. That is,
> the system was designed to accommodate logs coming and going, it's mostly a
> question of how frequently they're coming and going that becomes an issue :)

I don't want it to be a serious event from the point of view of
clients trusting things, no. The system shouldn't break because of it
=)

But it is a big deal for an organization running the log. An
investment they made was deemed 'not good enough' by the community.
What incentive do they have to try again? And how many similar
organizations will be discouraged from starting a log in the first
place? So I want log distrusting to be done judiciously, with an eye
towards non-security failures being treated with understanding.



On 21 October 2016 at 16:29, Ryan Sleevi <rsl...@chromium.org> wrote:
>
>
> On Fri, Oct 21, 2016 at 2:26 PM, Peter Bowen <pzb...@gmail.com> wrote:
>>
>> Would Chrome reconsider the portion of the CT policy that says "SCT
>> from a log qualified at the time of check is presented"? If there is
>> a known good STH up to a certain point, what is the risk of simply
>> accepting SCTs included in the log as of that STH? The advantage of
>> this is a cert with, say, Aviator and Izenpe embedded SCTs would still
>> be trusted without having to find a server that does SCT delivery via
>> TLS extension or OCSP stapling.
>
>
> I would think the risk here would be that, if a log was disqualified, should
> the community be watching the logs compliance? For example, could it
> backdate?
>
> The requirement that at least one current SCT be included is that it
> provides a verifiable timestamp (among other things) for which to evaluate
> the rest of the SCTs against, and that so long as the log is qualified, we
> can presume that it is not forged.
>
> Adopting the policy you suggest seems like it would make things
> fundamentally insecure, but it's also entirely possible (and likely, given I
> haven't had my coffee yet) that I'm missing something obvious.

Without giving it too much thought, it seems such a policy would need
to resolve SCTs to that known-good STH via an inclusion proof before
trusting them. That would prevent backdating. And there are several
reasons that's a difficult (or impossible) path to go down.
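
For reference, a minimal sketch of the RFC 6962-style inclusion-proof check such a policy would have to rely on (hash ordering per the RFC; inputs are assumed to be raw byte strings, and this is a sketch rather than production code):

import hashlib

def node_hash(left, right):
    """Interior node hash per RFC 6962: SHA-256(0x01 || left || right)."""
    return hashlib.sha256(b"\x01" + left + right).digest()

def verify_inclusion(leaf_hash, leaf_index, tree_size, audit_path, root_hash):
    """Check that leaf_hash sits at leaf_index in a tree of tree_size whose
    root hash is root_hash, using the supplied audit path."""
    if leaf_index >= tree_size:
        return False
    fn, sn, r = leaf_index, tree_size - 1, leaf_hash
    for p in audit_path:
        if sn == 0:
            return False
        if fn % 2 == 1 or fn == sn:
            r = node_hash(p, r)
            if fn % 2 == 0:
                # Right-most node at this level: skip levels with no sibling.
                while fn % 2 == 0 and fn != 0:
                    fn >>= 1
                    sn >>= 1
        else:
            r = node_hash(r, p)
        fn >>= 1
        sn >>= 1
    return sn == 0 and r == root_hash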



On 21 October 2016 at 16:47, Rob Stradling <rob.st...@comodo.com> wrote:
> (especially after
> Google's announcement this week at CABForum!)

Can't wait to hear about it ;)

-tom

Ben Laurie

Oct 21, 2016, 7:17:55 PM
to Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
I admit that this is a possibility that had not occurred to me. It doesn't really feel in keeping with the intent of CT, but I agree that the log technically could temporarily have a reduced or empty CA list in order to throttle input (note, though, that the required notification would have to be made [the policy doesn't state how timely that has to be, btw]).

Is that what you would recommend, or do you have other ideas?


But certainly, I want to find solutions, but I'm not sure I agree with statements that it's "impossible" to comply with.

My preference would be for logs to be permitted under (demonstrable, at least for new certs!) high load to return a "come back later" response.

Require an incident report if the frequency gets too high.

Richard Salz

Oct 21, 2016, 7:22:18 PM
to Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
At first I was strongly in favor of zero tolerance, and Caesar's wife and all that, but after reading everything so far I changed my mind. Tweak the policy to allow "infrequent" MMD misses, record them, and move on.

Ben Laurie

Oct 21, 2016, 7:30:07 PM
to Richard Salz, Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
On 22 October 2016 at 00:22, Richard Salz <rich...@gmail.com> wrote:
At first I was strongly in favor of zero tolerance, and Caesar's wife and all that, but after reading everything so far I changed my mind. Tweak the policy to allow "infrequent" MMD misses, record them, and move on.

It is not as simple as that. The log has to take action to make the MMD misses infrequent when under such load. The question is: what action?
 


Richard Salz

Oct 21, 2016, 7:31:32 PM
to Ben Laurie, Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
From Chrome's viewpoint, it is as simple as that. From the log developer's viewpoint, guidance would be helpful, but "throw hardware at it" is one possibility.

Ryan Sleevi

Oct 21, 2016, 7:33:32 PM
to Ben Laurie, Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
On Fri, Oct 21, 2016 at 4:17 PM, Ben Laurie <be...@google.com> wrote:
I admit that this is a possibility that had not occurred to me. It doesn't really feel in keeping with the intent of CT, but I agree that the log technically could temporarily have a reduced or empty CA list in order to throttle input (note, though, that the required notification would have to be made [the policy doesn't state how timely that has to be, btw]).

Is that what you would recommend, or do you have other ideas?

Right, I see there being two sources of load for a Log Operator, at least w/r/t MMD-impacting events:
- New submissions
- "Backfill" submissions

The issue is that if you decide to accept a given root, then you are, in effect, agreeing to log everything that root ever issued (unless, of course, your policy makes some statement otherwise, such as examining the validity dates, or only N # of certs, etc). We haven't had any logs do so, but it's not impossible to think of them wanting to do so.

This also comes up with things like cross-signs. For example, if you agree to log Identrust or Symantec certs, then you also potentially agree to log the entirety of the US FPKI - which is quite a few certs! Now, if your log implementation checks revocation status, it could decide to reject such certs (not complying with policy), but then we get into that gray area of how willing a log should be to log everything.

For "new" submissions - that is, new certificates being issued - it seems unlikely in general that a CA will cause serious pressure; even a CA like Let's Encrypt. If it does/is, then that's something that should be discussed, and is precisely the thing that is meaningful to the community to solve. But my gut and sense from discussions with log operators (including Google) is that even CAs such as Let's Encrypt do not place unmanagable load on reasonably developed logs, nor would they be anticipated to.

From what I understand from Ryan's post, it's most likely that this was a 'backfill' sort of operation. At that point, the submitter has full control over the QPS at which they log, and the upper bound on how many QPS may come in is, presuming a distributed enough submitter, equivalent to the totality of the WebPKI that the log operator accepts. That'd be huge!

My suggestion of 'removing a CA' was more about the 'new' submissions case, and an example of how you could mitigate some of the challenge if no other solution existed. For addressing the 'backfill' case, the answer would have to be some form of (D)DoS mitigation, which seems to fit within the reasonable bounds of mitigation, and is distinct from a 'total' outage. So even if a submitter tried to log N million certificates in one second, you could reject requests once you exceeded the QPS budget that ensures you hit your MMD budget.
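
A minimal sketch of that kind of QPS budget (a token bucket; the numbers are made up, and a real log would derive them from its measured sequencing throughput):

import time

class QpsBudget:
    """Accept add-chain calls only while the sequencer can still meet its
    MMD at the current intake rate; otherwise tell callers to come back."""

    def __init__(self, sustained_qps, burst):
        self.rate = sustained_qps      # assumed sustainable sequencing rate
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def try_accept(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller would return an HTTP 429/503 "come back later"

budget = QpsBudget(sustained_qps=200.0, burst=2000)  # hypothetical figures
if not budget.try_accept():
    pass  # reject the add-chain call rather than risk blowing the MMD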

A log operator could also seek to mitigate this issue with acceptance policies (as mentioned above), or by 'pre' backfilling the log contents, such that it started from a known state. Of course, as the PKI grows, I expect that the former will be more popular than the latter, but I suspect both fit within a spectrum of option and degree, such that it's not either/or.

 
My preference would be for logs to be permitted under (demonstrable, at least for new certs!) high load to return a "come back later" response.

Given that we've seen CAs backdating certificates, how do you define 'new' certs? :)
 
Require an incident report if the frequency gets too high.

Who reports? :) With respect to adding precerts, only CAs trusted by the log can do that. With respect to MMD measurements, only those who successfully obtain an SCT can quantify that, and it may be that multiple parties are logging, and they aren't aware that they're all getting "come back laters" (e.g. 10 parties are seeing an 'outage'). Should they report unconditionally every failure? 

Brian Smith

Oct 21, 2016, 7:34:45 PM
to Ben Laurie, Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
Ben Laurie wrote:


On 21 October 2016 at 23:13, Ryan Sleevi <rsl...@chromium.org> wrote:


On Fri, Oct 21, 2016 at 2:51 PM, 'Ben Laurie' via Certificate Transparency Policy <ct-p...@chromium.org> wrote:
The important thing for me is that the policy as currently stated is impossible to comply with: if a source of new, valid certificates logs them as rapidly as it can, then either we have to get behind in sequencing, or we have to throttle the input (i.e. become unresponsive).

Both of these actions are policy violations.

I don't see any other choice.


I don't think that's a fair statement, and I'm surprised to hear you state it.

You can throttle input, which is effectively an 'outage', as provided for in https://www.chromium.org/Home/chromium-security/certificate-transparency/log-policy . So long as the MMD is maintained, and the overall outage does not regress past the 99% uptime, this is still compliant.

I suspect you're more specifically thinking of "What happens when a single certificate is presented, and the option is either blow MMD or blow 99% uptime", which is a possible situation, but one would have hoped that the Log Operator took appropriate steps to avoid that situation, since a variety of options exist - up to and including no longer including CAs as accepted by the Log until the Log Operator is able to scale.

I admit that this is a possibility that had not occurred to me. It doesn't really feel in keeping with the intent of CT,

Hi Ben,

Please forgive me for brainstorming in public. What do you think of this logic?

I think the primary intent of CT is to make sure the certs get publicly logged. The secondary goal is to ensure the logging process doesn't reduce availability too much. Therefore, a log can (temporarily) refuse to give out an SCT even if it will eventually log the cert chain. That is, a log could estimate how likely it is that it will meet the MMD for a cert (based on its load and other factors) and decide whether or not to return the SCT to the submitter. This will risk reducing the availability for the website (assuming SCTs are required), but that's a secondary concern, and it's unlikely anyway as long as there is a plurality of trusted logs available for the website to use instead.
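
A rough sketch of the estimate described above, with made-up names and figures; a real decision would need the log's actual backlog and throughput numbers:

def should_issue_sct(backlog_entries, sequencing_rate_per_s,
                     mmd_seconds=24 * 3600, safety_margin=0.5):
    """Only hand out an SCT if the current backlog can plausibly be sequenced
    well inside the MMD; otherwise tell the submitter to come back later."""
    projected_merge_s = backlog_entries / sequencing_rate_per_s
    return projected_merge_s <= mmd_seconds * safety_margin

# Example: 300k chains queued at ~50 chains/sec is ~1.7 hours, so accept.
print(should_issue_sct(backlog_entries=300000, sequencing_rate_per_s=50.0))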

The window between when somebody receives the SCT for a cert chain and the time that the cert becomes publicly available in the log is the most critical window of vulnerability in CT, right? Therefore, once the log hands out an SCT it is really important that it meet the MMD. In fact, a 24-hour MMD is already a much larger window of vulnerability than we ultimately want, right?
 
My preference would be for logs to be permitted under (demonstrable, at least for new certs!) high load to return a "come back later" response.

I agree that this seems like a reasonable thing to do.

Cheers,
Brian

Ben Laurie

Oct 21, 2016, 7:40:21 PM
to Richard Salz, Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
On 22 October 2016 at 00:31, Richard Salz <rich...@gmail.com> wrote:
From Chrome's viewpoint, it is as simple as that.  From the log developer's viewpoint, guidance would be helpful but "throw hardware at it" is one possibiliy.

Sadly, it isn't. The core problem is that providing a reliable, available, consistent service has limits on performance - most services fudge around the edges of that, but CT is not permitted to.

I can see ways around it: for example, allow logs to shard - permit a "log" to be a series of sublogs, each with their own entry policy. The requirement is that at least one sublog should accept your valid cert. Then you can scale practically indefinitely.

I see Ryan's next post hints at that. More later.

Ryan Sleevi

Oct 21, 2016, 7:46:57 PM
to Ben Laurie, Richard Salz, Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
I should add that one area of concern is the role that name-constrained subordinate CAs will play. Setting aside the 6962-bis notion of "All these certificates are yours, except name-constrained sub-CAs. Attempt no logging there" - if someone with an NC SubCA _wanted_ to log all the certs they issued, they could potentially introduce the 'spam' problem that is trying to be mitigated with respect to rejecting self-signed certs.

So it's another potential source of load, potentially malicious - a holder of an NC SubCA minting a ton of certs beforehand, then attempting to DoS a log into submission by throwing them all at a log at once. We need the flexibility for the log to handle that, since infinite scaling is hard, but at the same time, we need the reliability to assure that new certificates issued by CAs (potentially misissued) can be accepted.

You could imagine a log policy that they would reject certs from a name-constrained subCA (except, from the Chrome policy side, I think that's the opposite of what we want, and I'll be posting on TRANS to this effect), or you could imagine they decide to not log certs for a domain name (which potentially means misissuance wouldn't be detected), or they could decide to reject from that particular NC SubCA, but all of these are with tradeoffs and impact.

So to the general matter of "How should a log operator balance load", I'm not 100% sure, but perhaps this is an area where log operators could, within the IETF TRANS WG, work to write up some threat models and operational guidance suggestions, based on their experiences of the past several years, so that when evaluating questions of policy failures, we can look at them through that lens.

Ben Laurie

Oct 21, 2016, 7:59:00 PM
to Brian Smith, Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
I agree. In practice, the merges are usually much faster than that, so really the MMD is about the available repair window, not the expected merge time.

Perhaps it would be better expressed as a distribution?

There are also, btw, privacy concerns around excessively fast merges. :-)

Ben Laurie

Oct 21, 2016, 8:08:00 PM
to Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
On 22 October 2016 at 00:32, Ryan Sleevi <rsl...@chromium.org> wrote:


On Fri, Oct 21, 2016 at 4:17 PM, Ben Laurie <be...@google.com> wrote:
I admit that this is a possibility that had not occurred to me. It doesn't really feel in keeping with the intent of CT, but I agree that the log technically could temporarily have a reduced or empty CA list in order to throttle input (note, though, that the required notification would have to be made [the policy doesn't state how timely that has to be, btw]).

Is that what you would recommend, or do you have other ideas?

Right, I see there being two sources of load for a Log Operator, at least w/r/t MMD-impacting events:
- New submissions
- "Backfill" submissions

The issue is that if you decide to accept a given root, then you are, in effect, agreeing to log everything that root ever issued (unless, of course, your policy makes some statement otherwise, such as examining the validity dates, or only N # of certs, etc). We haven't had any logs do so, but it's not impossible to think of them wanting to do so.

This also comes up with things like cross-signs. For example, if you agree to log Identrust or Symantec certs, then you also potentially agree to log the entirety of the US FPKI - which is quite a few certs! Now, if your log implementation checks revocation status, it could decide to reject such certs (not complying with policy), but then we get into that gray area of how willing a log should be to log everything.

For "new" submissions - that is, new certificates being issued - it seems unlikely in general that a CA will cause serious pressure; even a CA like Let's Encrypt. If it does/is, then that's something that should be discussed, and is precisely the thing that is meaningful to the community to solve. But my gut and sense from discussions with log operators (including Google) is that even CAs such as Let's Encrypt do not place unmanagable load on reasonably developed logs, nor would they be anticipated to.

From what I understand from Ryan's post, it's most likely that this was a 'backfill' sort of operation. At that point, the submitter has full control over the QPS at which they log, and the upper bound on how many QPS may come in is, presuming a distributed enough submitter, equivalent to the totality of the WebPKI that the log operator accepts. That'd be huge!

This is exactly what happened.
 

My suggestion of 'removing a CA' was more about the 'new' submissions case, and an example of how you could mitigate some of the challenge if no other solution existed. For addressing the 'backfill' case, the answer would have to be some form of (D)DoS mitigation, which seems to fit within the reasonable bounds of mitigation, and is distinct from a 'total' outage. So even if a submitter tried to log N million certificates in one second, you could reject requests once you exceeded the QPS budget that ensures you hit your MMD budget.

A log operator could also seek to mitigate this issue with acceptance policies (as mentioned above), or by 'pre' backfilling the log contents, such that it started from a known state. Of course, as the PKI grows, I expect that the former will be more popular than the latter, but I suspect both fit within a spectrum of option and degree, such that it's not either/or.

So could we fix this right now by setting an acceptance policy that says "unless load is too high"?
 

 
My preference would be for logs to be permitted under (demonstrable, at least for new certs!) high load to return a "come back later" response.

Given that we've seen CAs backdating certificates, how do you define 'new' certs? :)

I mean new to the log.
 
 
Require an incident report if the frequency gets too high.

Who reports? :) With respect to adding precerts, only CAs trusted by the log can do that. With respect to MMD measurements, only those who successfully obtain an SCT can quantify that, and it may be that multiple parties are logging, and they aren't aware that they're all getting "come back laters" (e.g. 10 parties are seeing an 'outage'). Should they report unconditionally every failure? 

Ultimately you can judge the log by whether it has actually included certs: i.e. if it has failed to respond to you (for a new cert inclusion), can you see that it has, in fact, responded to others by including new certs?

Whether it responds to queries about its contents is judged by the various analysis services.
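
One way to make that outside check, sketched against the public RFC 6962 v1 API (the log URL below is a placeholder, and the ten-minute window is arbitrary):

import json
import time
import urllib.request

LOG_URL = "https://ct.example.com"  # placeholder; substitute a real log URL

def get_sth():
    """Fetch the log's current Signed Tree Head via /ct/v1/get-sth."""
    with urllib.request.urlopen(LOG_URL + "/ct/v1/get-sth", timeout=10) as resp:
        return json.load(resp)

# If our own add-chain calls are being refused, is the tree still growing?
before = get_sth()
time.sleep(600)
after = get_sth()

if after["tree_size"] > before["tree_size"]:
    print("The log is still incorporating entries from someone.")
else:
    print("Tree size is static; the log may be stalled for everyone.")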
 


Brian Smith

Oct 21, 2016, 9:43:26 PM
to Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
Ryan Sleevi <rsl...@chromium.org> wrote:
On the one hand, I can see a very compelling argument to remove the Aviator log (or more aptly, treat it as no longer trusted within the next few weeks). This is consistent with a strict zero-tolerance policy, which, despite all the negatives that zero tolerance involves, has the benefit that it hopefully rises beyond even the slightest suggestion of treating Google logs differently.

I agree. I think there's another benefit to distrusting the Aviator log now. Currently, CT is relatively unimportant. However, especially in 2017 it looks like it will be Really Important. If Chrome were to distrust the Aviator log now, the ecosystem would get operational experience with what happens when a Google log is removed. In particular, think about all the people that have certs with only one Aviator SCT and one non-Google SCT embedded; they would (depending on how the distrust of Aviator would be done) not on their own be sufficiently logged according to Chrome's CT policy. It would be very interesting to see how the ecosystem would cope with these certificates suddenly becoming non-compliant with a browser's CT log inclusion policy. And, it would be better to learn this before CT becomes Really Important.

I had been planning a more thorough write-up of these concerns, but then I remembered I provided much of this back in June - https://groups.google.com/a/chromium.org/d/msg/ct-policy/AH9JHYDljpU/f4I9vQLACwAJ

That previous decision couldn't have helped people's perception of whether Google would give its own logs special treatment or not. The problem with these perception issues is that they will make enforcement for non-Google logs harder when they do similar, but perhaps worse, things.

So, since I think we can tolerate the (temporary?) loss of the Aviator log, it is better to err on the side of a literal interpretation of the policy. If we can't tolerate the loss of the Aviator log, then it kind of means it has become "too big to fail," which would be an indication of a serious problem too. In the abstract, this would be an excellent test of CT's and the Chrome CT Policy's resilience against "too big to fail". Better sooner than later.

When we examine the nature of this specific failure - the failure to integrate the SCTs within the MMD - we have to think about the impact.

AFAICT, the only thing worse than handing out an SCT and not logging the cert within the MMD is handing out an SCT and never logging it at all. An MMD of 24 hours is already *very* generous and is supposed to be sized to accommodate major operational problems. That is, when choosing the MMD it was already decided that 24 hours was the most that could be tolerated. Being more than 9% over that maximum tolerance is non-trivial when we consider that.
 
Unlike an uptime issue,

Some uptime issues are more critical than others, AFAICT. For example, uptime of serving the log contents for certificates whose SCTs have already been handed out seems more important than the uptime of certificate acceptance.
 
so long as the STH is eventually consistent, there's no ability to hide misissued certificates longer than the window of non-compliance (which, as I understand, is 2.2 hours). As the STH eventually integrated all of these SCTs, the first and foremost concern - the ability for the log to mask misissuance - seems to have been limited, and that window is significantly less than the window that a stapled OCSP response could be used to assert 'goodness' (roughly 3.5 days)

Agreed.
 
That said, I wouldn't want to see a scenario in which logs routinely blow MMDs - that creates a system in which the reliability of the system becomes suspect, and increases the window of detection of misissuance.

Right! The MMD is already a huge security-for-availability trade-off. Again, it seems like it was already decided that 24 hours was the most that could reasonably be tolerated. If that's not the case, then let's raise the maximum MMDs allowed. 

Interestingly, the Chrome CT policy doesn't state a maximum MMD. So, couldn't a log simply define its MMD to be 10 years to ensure it is always in compliance? There is a part of the policy that says outages may not exceed "an MMD of more than 24 hours", but there isn't anything that limits what the log's actual MMD should be, IIUC.
 
I'm curious how the broader community feels, and, as I indicated above, I can see arguments for both. I think the ecosystem can support either action as well, so this is largely about understanding the threat model, community norms, and whether enforcing a zero tolerance policy provides the greatest benefit to the ecosystem, which it may very well do.

As other messages in the thread indicated, especially those by Ben Laurie, we should also consider whether the current policy is reasonable w.r.t. availability requirements. If it is determined that the current policy isn't reasonable then it should be changed. In that case, I think it makes sense to re-evaluate all log distrust decisions, including this one, based on the new criteria. All logs that were distrusted under the old rules but which would be trusted under the new rules should remain trusted (or be re-trusted, if they were already distrusted).

OTOH, if the policy doesn't change, then it seems reasonable to interpret the policy exactly as written.

Regardless, at a minimum, the policy should be changed to clearly state a maximum allowed MMD.

Cheers,
Brian

Eric Mill

Oct 21, 2016, 9:58:27 PM
to Brian Smith, Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
On that note, I haven't seen anyone discuss on-thread whether a 24 hour MMD is perhaps just unreasonable at scale. While there are clearly security ramifications to extending the allowable window before a certificate is logged, maybe a 36-hour MMD is better for security than weakening the other guarantees of a log, and/or better than allowing the MMD requirement to get more "blurry" as small incidents are allowed?

-- Eric


Todd Johnson

Oct 21, 2016, 10:15:51 PM
to Eric Mill, Brian Smith, Certificate Transparency Policy, Paul Hadfield, Ryan Sleevi, Tom Ritter
Awesome conversation!  

Unfortunate circumstances do happen, but that should not keep the operator(s) from being able to redeem themselves. So long as there is *transparency*, the operator(s) should have some path to do so.

Otherwise, would partitioning logs with fewer trust anchors help?

From an enterprise perspective, it is much easier to accept the risk of longer grace periods for receiving bad news - such as a revoked certificate on a CRL with a 30-day publication schedule, or discovering a misissued (or maliciously issued) certificate. What is "appropriate" for *public trust*?


Ryan Sleevi

Oct 22, 2016, 9:03:34 AM10/22/16