Google Aviator incident under investigation


Paul Hadfield

Oct 17, 2016, 7:43:15 AM
to Certificate Transparency Policy
On Sunday October 16th, 2016 our automated monitoring of the Google Aviator Log revealed some operational issues.

We are investigating and will circulate our findings in a report as soon as they are ready.


The Certificate Transparency Team

Paul Hadfield

Oct 21, 2016, 7:19:52 AM
to Certificate Transparency Policy
Hello Chromium ct-policy,

As promised on Monday October 17, here are the findings of our investigation of the recent Google Aviator incident.

regards,
Google Certificate Transparency Team.

--

SUMMARY:

On 16 October 2016 the Google 'Aviator' log exceeded its stated Maximum Merge Delay (MMD). The apparent merge delay rose over the course of three days from its usual level of about 1.5 hours to 26.2 hours. This serious operational issue arose because of an unusually large backlog of certificates that had been submitted to the log over the preceding day. Aviator was not able to incorporate those submissions into new Signed Tree Heads quickly enough.


Exceeding the stated Maximum Merge Delay is a violation of RFC 6962, section 3:


The log MUST incorporate a certificate in its Merkle Tree
within the Maximum Merge Delay period after the issuance of the SCT.


It is also a violation of the Chromium Certificate Transparency policy:


Log Operators must ... incorporate a certificate for which an SCT has been issued by the Log within the MMD.


IMPACT:

Five consecutive runs by Aviator's signer failed to incorporate recently submitted chains within MMD.  These are as follows:


Submitted chains in index range [35910127, 35919332) were not incorporated within MMD for STH signed at timestamp 1476653510423 for tree size 35936627 (34.7% of entries sequenced for that STH).


Submitted chains in index range [35936627, 35962877) were not incorporated within MMD for STH signed at timestamp 1476657100609 for tree size 35962877 (100% of entries sequenced for that STH).


Submitted chains in index range [35962877, 35982377) were not incorporated within MMD for STH signed at timestamp 1476663459670 for tree size 35982377 (100% of entries sequenced for that STH).


Submitted chains in index range [35982377, 36081668) were not incorporated within MMD for STH signed at timestamp 1476671114710 for tree size 36084877 (96.9% of entries sequenced for that STH).


Submitted chains in index range [36084877, 36113179) were not incorporated within MMD for STH signed at timestamp 1476675980176 for tree size 36255877 (16.6% of entries sequenced for that STH).
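
As a rough cross-check of the figures above (assuming that the entries sequenced for each STH run from the previous STH's tree size, which here coincides with the start of each index range, up to the new tree size), the percentages can be recomputed with a few lines of Python:

    # Recompute "% of entries sequenced for that STH" from the data above.
    # Columns: (range_start, range_end, new_tree_size); the previous tree
    # size is assumed to equal range_start.
    sths = [
        (35910127, 35919332, 35936627),
        (35936627, 35962877, 35962877),
        (35962877, 35982377, 35982377),
        (35982377, 36081668, 36084877),
        (36084877, 36113179, 36255877),
    ]
    for start, end, tree_size in sths:
        late = end - start              # chains that missed the MMD
        sequenced = tree_size - start   # chains sequenced for this STH
        print(f"[{start}, {end}): {late}/{sequenced} late "
              f"({100.0 * late / sequenced:.1f}%)")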



ROOT CAUSE:

A large backlog of submitted certificates was generated over a six-hour period during the early hours (PDT) of Sunday 16 October, caused by the (non-malicious) actions of some high-volume clients. Aviator's signer could not sequence the submitted certificate chains quickly enough to clear the backlog, and the problem was exacerbated by the fact that our protection against flooding did not activate when expected.


REMEDIATION AND PREVENTION:

During the impact period, Google's engineers worked to bring down the size of Aviator's backlog of submitted certs, in an attempt to avoid a policy violation. As part of that effort the /ct/v1/add-chain and /ct/v1/add-pre-chain endpoints were made temporarily unavailable; this was not announced.


/add-chain was unavailable from 2016-10-16 16:42 PDT to 2016-10-16 21:30 PDT (4.75hrs);

/add-pre-chain was unavailable from 2016-10-16 16:42 PDT to 2016-10-16 17:42 PDT (1hr)


Google are using the lessons learned from this incident to improve operational practices for the Pilot, Rocketeer, Submariner, Icarus and Skydiver logs; in particular, the sequencing operation has been tuned, as have the logs' protections against flooding. Monitoring has been revised to provide earlier warning of similar events in the future.


Tom Ritter

Oct 21, 2016, 10:37:48 AM
to Paul Hadfield, Certificate Transparency Policy
On 21 October 2016 at 06:19, 'Paul Hadfield' via Certificate
Transparency Policy <ct-p...@chromium.org> wrote:
> It is also a violation of the Chromium Certificate Transparency policy:
>
>
> Log Operators must ... incorporate a certificate for which an SCT has been
> issued by the Log within the MMD.

I don't see any mention of Aviator's status with Chrome. Is this still
being decided, or should no news be taken to mean that no change will be made?

-tom

Paul Hadfield

Oct 21, 2016, 10:39:19 AM
to Tom Ritter, Certificate Transparency Policy
I think Chrome are yet to state their position w.r.t. Aviator.

Paul

Ryan Hurst

Oct 21, 2016, 10:40:51 AM
to Tom Ritter, Paul Hadfield, Certificate Transparency Policy
Tom,

Something that is not obvious to those on the outside (and we need to do a better job of making sure it is) is that the "CT Team" is not part of Chrome.

This is why the log inclusion policy is called the Chrome CT Policy.

Now that the CT Team has published the incident report, Chrome has enough information to decide what they think the right response would be. I think this will take them a few days to work out.

Ryan



Ryan Sleevi

Oct 21, 2016, 4:16:53 PM
to Tom Ritter, Paul Hadfield, Certificate Transparency Policy
On Fri, Oct 21, 2016 at 7:37 AM, Tom Ritter <t...@ritter.vg> wrote:
I don't see any mention of Aviator's status with Chrome. Is this still
being decided, or is no news to be taken as no change will be made?

In general, we (Chrome) try to work with log operators to notify and understand the issues, and discuss the incident.

As for what remediation steps to take, I'm of two minds, and would greatly appreciate feedback from the broader community of CAs, Log Operators, Auditors, and interested parties.

On the one hand, I can see a very compelling argument to remove the Aviator log (or more aptly, treat it as no longer trusted within the next few weeks). This is consistent with a strict zero tolerance policy, which, despite all the negatives that zero tolerance involves, has the benefit that it hopefully rises beyond even the slightest suggestion of treating Google logs differently.

On the other hand, there is some material difference between this and some of the other incidents we've seen, with respect to uptime (Certly) or with respect to violating the append-only property (Izenpe). I had been planning a more thorough write-up of these concerns, but then I remembered I provided much of this back in June - https://groups.google.com/a/chromium.org/d/msg/ct-policy/AH9JHYDljpU/f4I9vQLACwAJ

When we examine the nature of this specific failure - the failure to integrate the SCTs within the MMD - we have to think about the impact. Unlike an uptime issue, so long as the STH is eventually consistent, there's no ability to hide misissued certificates longer than the window of non-compliance (which, as I understand it, is 2.2 hours). As the STH eventually integrated all of these SCTs, the first and foremost concern - the ability for the log to mask misissuance - seems to have been limited, and that window is significantly less than the window for which a stapled OCSP response could be used to assert 'goodness' (roughly 3.5 days).

That said, I wouldn't want to see a scenario in which logs routinely blow MMDs - that creates a system in which the reliability of the system becomes suspect, and increases the window of detection of misissuance.

I'm curious how the broader community feels, and, as I indicated above, I can see arguments for both. I think the ecosystem can support either action as well, so this is largely about understanding the threat model, community norms, and whether enforcing a zero tolerance policy provides the greatest benefit to the ecosystem, which it may very well do.

Thoughts?

Tom Ritter

Oct 21, 2016, 4:44:42 PM
to Ryan Sleevi, Paul Hadfield, Certificate Transparency Policy
I've never been a big fan of zero-tolerance policies. I mean, if your
private key gets compromised, or is significantly mishandled (Izenpe),
then yes, but poor uptime or missing an MMD seems like the type of
thing you give someone two, maybe three strikes at before you're out.

(Depending, of course, on the type of MMD miss. Missing an MMD because
you're behind and then you catch up a half-day later - fix your error
and let's move on. Do it again, and you're probably out. Maybe
there'd be a second exception if someone went out of their way to try
and DoS you and it took you an MMD to recover. But miss it because of
a bug that keeps SCTs from ever being included in the log at all until
someone notices (somehow!) - that would be a one-strike-you're-out
situation.)

So while it puts you in an odd position of not wanting to appear
'soft' on your own company - I think it's reasonable to consider this
a warning shot and to not distrust the log.

In general, I think it is advantageous to the community to allow some
wiggle room for mistakes or circumstances that can be proven (or
reasonably inferred) to have little or no security impact. And we
expect the operators to actively address their shortcomings. But
zero-tolerance policies would likely discourage people from operating
logs. And I would like to see the general Transparency initiative
flourish (in both the certificate direction, and others.)

-tom

Linus Nordberg

Oct 21, 2016, 5:05:33 PM
to Tom Ritter, Certificate Transparency Policy
Tom Ritter <t...@ritter.vg> wrote
Fri, 21 Oct 2016 15:44:21 -0500:

> So while it puts you in an odd position of not wanting to appear
> 'soft' on your own company - I think it's reasonable to consider this
> a warning shot and to not distrust the log.

+1.

Walter Goulet

Oct 21, 2016, 5:14:12 PM
to Tom Ritter, Ryan Sleevi, Paul Hadfield, Certificate Transparency Policy
Hi Ryan,

As a log operator (Venafi), we are working actively to ensure that we can
properly scale up to handle higher volumes of certificate submissions from
high volume CAs in the future.

For situations like this, where a log operator is overwhelmed by a backlog of
certificate submissions, I would hope that the log operator is afforded an
opportunity to learn from the incident so they can re-engineer their
solution to scale properly. I agree with Tom's point that if the issue
happens repeatedly, demonstrating that the log operator is not learning
from their mistakes, then it is fair to consider distrusting them.

Thanks,
Walter

Ryan Sleevi

Oct 21, 2016, 5:17:14 PM
to Linus Nordberg, Tom Ritter, Certificate Transparency Policy
Note that I linked to a previous post in which Pilot and Aviator had 'incidents'. Consider both https://groups.google.com/a/chromium.org/d/msg/ct-policy/Itoq0YUZTlA/24hkszkVBAAJ and https://groups.google.com/a/chromium.org/d/msg/ct-policy/dqoW-QMdKr8/8kt_ghhmCAAJ

If we accept this view - that it's within the realm of 'minor' - then how best should we (Chrome) keep track of incidents in a way that the community can reasonably evaluate the 'overall' performance of the log? How long should that performance be evaluated - against the lifetime of the log? Against the past N months?

I don't have good answers, and I'm not presenting them to disagree, but more to highlight the challenges :)

Ryan Sleevi

Oct 21, 2016, 5:18:21 PM
to Ryan Sleevi, Linus Nordberg, Tom Ritter, Certificate Transparency Policy
Oh, and I do want to stress that I don't think "Log distrusting" should be a significant/serious event, particularly with respect to reputation. That is, the system was designed to accommodate logs coming and going; it's mostly a question of how frequently they're coming and going that becomes an issue :)

Peter Bowen

Oct 21, 2016, 5:26:30 PM
to Ryan Sleevi, Linus Nordberg, Tom Ritter, Certificate Transparency Policy
Would Chrome reconsider the portion of the CT policy that says "SCT
from a log qualified at the time of check is presented"? If there is
a known good STH up to a certain point, what is the risk of simply
accepting SCTs included in the log as of that STH? The advantage of
this is a cert with, say, Aviator and Izenpe embedded SCTs would still
be trusted without having to find a server that does SCT delivery via
TLS extension or OCSP stapling.

Thanks,
Peter

Ryan Sleevi

Oct 21, 2016, 5:30:37 PM
to Peter Bowen, Ryan Sleevi, Linus Nordberg, Tom Ritter, Certificate Transparency Policy
On Fri, Oct 21, 2016 at 2:26 PM, Peter Bowen <pzb...@gmail.com> wrote:
Would Chrome reconsider the portion of the CT policy that says "SCT
from a log qualified at the time of check is presented"?  If there is
a known good STH up to a certain point, what is the risk of simply
accepting SCTs included in the log as of that STH?  The advantage of
this is a cert with, say, Aviator and Izenpe embedded SCTs would still
be trusted without having to find a server that does SCT delivery via
TLS extension or OCSP stapling.

I would think the risk here would be that, if a log was disqualified, should the community be watching the log's compliance? For example, could it backdate?

The requirement that at least one current SCT be included exists because it provides a verifiable timestamp (among other things) against which to evaluate the rest of the SCTs, and because, so long as the log is qualified, we can presume that it is not forged.

Adopting the policy you suggest seems like it would make things fundamentally insecure, but it's also entirely possible (and likely, given I haven't had my coffee yet) that I'm missing something obvious. 

Rob Stradling

Oct 21, 2016, 5:47:06 PM
to sle...@chromium.org, Certificate Transparency Policy
I think Aviator should remain trusted in Chrome. This was a relatively
minor transgression that occurred in unusual circumstances. The log
operator has been transparent about what happened and has acted fast to
take corrective measures.

We want to encourage other organizations to run logs (especially after
Google's announcement this week at CABForum!). I fear that killing
Aviator would make other organizations think twice about standing up
logs for inclusion in Chrome.

If Google can't comply with the CT Policy, who can?

--
Rob Stradling
Senior Research & Development Scientist
COMODO - Creating Trust Online
Office Tel: +44.(0)1274.730505
Office Fax: +44.(0)1274.730909
www.comodo.com

COMODO CA Limited, Registered in England No. 04058690
Registered Office:
3rd Floor, 26 Office Village, Exchange Quay,
Trafford Road, Salford, Manchester M5 3EQ


Ben Laurie

Oct 21, 2016, 5:51:47 PM
to Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
The important thing for me is that the policy as currently stated is impossible to comply with: if a source of new, valid certificates logs them as rapidly as it can, then either we have to get behind in sequencing, or we have to throttle the input (i.e. become unresponsive).

Both of these actions are policy violations.

I don't see any other choice.


Peter Bowen

Oct 21, 2016, 6:02:12 PM
to Ryan Sleevi, Linus Nordberg, Tom Ritter, Certificate Transparency Policy
I think we may be approaching this from the wrong end. Chrome
published a policy but has never published the risk (e.g. threat
model) that the policy is trying to mitigate. Given the current log
ecosystem has multiple operators with more than one log, we really
should be talking about both the operator and the log in these
discussions. Is the non-compliance with CT policy a log issue or a
log operator issue? What is the purpose of disqualifying a log (even
for downtime/availability) if the operator is allowed to immediately
resubmit? Assuming the problem was a fluke outside the operator's
control (e.g. DDOS attack on their DNS provider), is the answer "don't
change anything, submit again, and good to go in 90 days?" What does
this accomplish?

Thanks,
Peter

Ryan Sleevi

Oct 21, 2016, 6:14:37 PM
to Ben Laurie, Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
On Fri, Oct 21, 2016 at 2:51 PM, 'Ben Laurie' via Certificate Transparency Policy <ct-p...@chromium.org> wrote:
The important thing for me is that the policy as currently stated is impossible to comply with: if a source of new, valid certificates logs them as rapidly as it can, then either we have to get behind in sequencing, or we have to throttle the input (i.e. become unresponsive).

Both of these actions are policy violations.

I don't see any other choice.


I don't think that's a fair statement, and I'm surprised to hear you state it.

You can throttle input, which is effectively an 'outage', as provided for in https://www.chromium.org/Home/chromium-security/certificate-transparency/log-policy . So long as the MMD is maintained, and the overall outage does not regress past the 99% uptime, this is still compliant.

I suspect you're more specifically thinking of "What happens when a single certificate is presented, and the option is either blow MMD or blow 99% uptime", which is a possible situation, but one would have hoped that the Log Operator took appropriate steps to avoid that situation, since a variety of options exist - up to and including no longer including CAs as accepted by the Log until the Log Operator is able to scale.

But certainly I want to find solutions; I'm just not sure I agree with statements that it's "impossible" to comply with.

Ryan Sleevi

Oct 21, 2016, 6:16:12 PM
to Peter Bowen, Ryan Sleevi, Linus Nordberg, Tom Ritter, Certificate Transparency Policy
On Fri, Oct 21, 2016 at 3:02 PM, Peter Bowen <pzb...@gmail.com> wrote:
I think we may be approaching this from the wrong end.  Chrome
published a policy but has never published the risk (e.g. threat
model) that the policy is trying to mitigate. 

We've posted to this list several times about various threats, including one set of discussions I already linked to in this thread.
 
Given the current log
ecosystem has multiple operators with more than one log, we really
should be talking about both the operator and the log in these
discussions.  Is the non-compliance with CT policy a log issue or an
log operator issue?  What is the purpose of disqualifying a log (even
for downtime/availability) if the operator is allowed to immediately
resubmit?

That presumes submissions that are technically compliant are automatically accepted. That's never been stated.
 
  Assuming the problem was a fluke outside the operator's
control (e.g. DDOS attack on their DNS provider), is the answer "don't
change anything,"

That's also not the policy. At a minimum, the key must change - so that any doubts about the previous log are avoided.

Tom Ritter

Oct 21, 2016, 6:43:36 PM
to Ryan Sleevi, Peter Bowen, Linus Nordberg, Certificate Transparency Policy
On 21 October 2016 at 16:16, Ryan Sleevi <rsl...@chromium.org> wrote:
> Note that I linked to a previous post in which Pilot and Aviator had
> 'incidents'.
>
> Consider both
> https://groups.google.com/a/chromium.org/d/msg/ct-policy/Itoq0YUZTlA/24hkszkVBAAJ
> and
> https://groups.google.com/a/chromium.org/d/msg/ct-policy/dqoW-QMdKr8/8kt_ghhmCAAJ
>
> If we accept this view - that it's within the realm of 'minor' - then how
> best should we (Chrome) keep track of incidents in a way that the community
> can reasonably evaluate the 'overall' performance of the log? How long
> should that performance be evaluated - against the lifetime of the log?
> Against the past N months?
>
> I don't have good answers, and I'm not presenting them to disagree, but to
> moreso highlight the challenges :)


Make a wiki page next to the 'known-logs' page that links to incident
reports posted in this forum.

Log performance/operation will be measured based on the totality of
the situation. Blowing an MMD because of low capacity or a log having
a big piece of downtime twice in 2 months would be considered
differently from twice in two years. In other words, handwave about
the problem and try not to overspecify what actions you might take
about future unspecified events with unspecified details. =) Not so
different from CA inclusion!



On 21 October 2016 at 16:17, Ryan Sleevi <rsl...@chromium.org> wrote:
> Oh, and I do want to stress that I don't think "Log distrusting" should be a
> significant/serious event, particularly with respect to reputation. That is,
> the system was designed to accomodate logs coming and going, it's mostly a
> question of how frequently they're coming and going that becomes an issue :)

I don't want it to be a serious event from the point of view of
clients trusting things, no. The system shouldn't break because of it
=)

But it is a big deal for an organization running the log. An
investment they made was deemed 'not good enough' by the community.
What incentive do they have to try again? And how many similar
organizations will be discouraged from starting a log in the first
place? So I want log distrusting to be done judiciously, with an eye
towards non-security failures being treated with understanding.



On 21 October 2016 at 16:29, Ryan Sleevi <rsl...@chromium.org> wrote:
>
>
> On Fri, Oct 21, 2016 at 2:26 PM, Peter Bowen <pzb...@gmail.com> wrote:
>>
>> Would Chrome reconsider the portion of the CT policy that says "SCT
>> from a log qualified at the time of check is presented"? If there is
>> a known good STH up to a certain point, what is the risk of simply
>> accepting SCTs included in the log as of that STH? The advantage of
>> this is a cert with, say, Aviator and Izenpe embedded SCTs would still
>> be trusted without having to find a server that does SCT delivery via
>> TLS extension or OCSP stapling.
>
>
> I would think the risk here would be that, if a log was disqualified, should
> the community be watching the logs compliance? For example, could it
> backdate?
>
> The requirement that at least one current SCT be included is that it
> provides a verifiable timestamp (among other things) for which to evaluate
> the rest of the SCTs against, and that so long as the log is qualified, we
> can presume that it is not forged.
>
> Adopting the policy you suggest seems like it would make things
> fundamentally insecure, but it's also entirely possible (and likely, given I
> haven't had my coffee yet) that I'm missing something obvious.

Without giving it too much thought, it seems such a policy would need
to resolve SCTs to that known-good STH via an inclusion proof before
trusting them. That would prevent backdating. And there are several
reasons that's a difficult (or impossible) path to go down.
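
For concreteness, the check itself boils down to verifying an RFC 6962
Merkle audit path against the root hash of a known-good STH. A minimal
sketch (illustrative only, not taken from any existing client; the
function names are mine):

    import hashlib

    def leaf_hash(leaf_input):
        # RFC 6962 leaf hashes are prefixed with a 0x00 byte.
        return hashlib.sha256(b"\x00" + leaf_input).digest()

    def verify_inclusion(leaf_input, leaf_index, tree_size, audit_path, root_hash):
        """Check a Merkle audit path (as returned by get-proof-by-hash)
        against the root hash of a known-good STH."""
        if leaf_index >= tree_size:
            return False
        fn, sn = leaf_index, tree_size - 1
        r = leaf_hash(leaf_input)
        for p in audit_path:
            if sn == 0:
                return False                                  # path too long
            if fn % 2 == 1 or fn == sn:
                r = hashlib.sha256(b"\x01" + p + r).digest()  # we are the right child
                if fn % 2 == 0:
                    while fn % 2 == 0 and fn != 0:
                        fn >>= 1
                        sn >>= 1
            else:
                r = hashlib.sha256(b"\x01" + r + p).digest()  # we are the left child
            fn >>= 1
            sn >>= 1
        return sn == 0 and r == root_hash

The hard part is less this computation than reliably obtaining the leaf
data, index and audit path for an arbitrary SCT once the log is
disqualified and no longer obliged to serve them.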



On 21 October 2016 at 16:47, Rob Stradling <rob.st...@comodo.com> wrote:
> (especially after
> Google's announcement this week at CABForum!)

Can't wait to hear about it ;)

-tom

Ben Laurie

Oct 21, 2016, 7:17:55 PM
to Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
I admit that this is a possibility that had not occurred to me. It doesn't really feel in keeping with the intent of CT, but I agree that the log technically could temporarily have a reduced or empty CA list in order to throttle input (note, though, that the required notification would have to be made [the policy doesn't state how timely that has to be, btw]).

Is that what you would recommend, or do you have other ideas?


But certainly, I want to find solutions, but I'm not sure I agree with statements that it's "impossible" to comply with.

My preference would be for logs to be permitted under (demonstrable, at least for new certs!) high load to return a "come back later" response.

Require an incident report if the frequency gets too high.

Richard Salz

Oct 21, 2016, 7:22:18 PM
to Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
At first I was strongly in favor of zero tolerance, and Caesar's wife and all that, but after reading everything so far I changed my mind. Tweak the policy to allow "infrequent" MMD misses, record them, and move on.

Ben Laurie

Oct 21, 2016, 7:30:07 PM
to Richard Salz, Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
On 22 October 2016 at 00:22, Richard Salz <rich...@gmail.com> wrote:
At first I was strongly in favor of zero tolerance, and Caesar's wife and all that, but after reading everything so far I changed my mind. Tweak the policy to allow "infrequent" MMD misses, record them, and move on.

It is not as simple as that. The log has to take action to make the MMD misses infrequent when under such load. The question is: what action?
 


Richard Salz

Oct 21, 2016, 7:31:32 PM
to Ben Laurie, Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
From Chrome's viewpoint, it is as simple as that. From the log developer's viewpoint, guidance would be helpful, but "throw hardware at it" is one possibility.

Ryan Sleevi

Oct 21, 2016, 7:33:32 PM
to Ben Laurie, Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
On Fri, Oct 21, 2016 at 4:17 PM, Ben Laurie <be...@google.com> wrote:
I admit that this is a possibility that had not occurred to me. It doesn't really feel in keeping with the intent of CT, but I agree that the log technically could temporarily have a reduced or empty CA list in order to throttle input (note, though, that the required notification would have to be made [the policy doesn't state how timely that has to be, btw]).

Is that what you would recommend, or do you have other ideas?

Right, I see there being two sources of load for a Log Operator, at least w/r/t MMD-impacting events:
- New submissions
- "Backfill" submissions

The issue is that if you decide to accept a given root, then you are, in effect, agreeing to log everything that root ever issued (unless, of course, your policy makes some statement otherwise, such as examining the validity dates, or only N # of certs, etc). We haven't had any logs do so, but it's not impossible to think of them wanting to do so.

This also comes up with things like cross-signs. For example, if you agree to log Identrust or Symantec certs, then you also potentially agree to log the entirety of the US FPKI - which is quite a few certs! Now, if your log implementation checks revocation status, it could decide to reject such certs (not complying with policy), but now we get into that gray area of how much or how willing should a log be to log everything.

For "new" submissions - that is, new certificates being issued - it seems unlikely in general that a CA will cause serious pressure; even a CA like Let's Encrypt. If it does/is, then that's something that should be discussed, and is precisely the thing that is meaningful to the community to solve. But my gut and sense from discussions with log operators (including Google) is that even CAs such as Let's Encrypt do not place unmanagable load on reasonably developed logs, nor would they be anticipated to.

From what I understand from Ryan's post, it's most likely that this was a 'backfill' sort of operation. At that point, the submitter has the full ability to affect the QPS at which they log, and the upper scale of how many QPS may come in is, presuming a distributed enough submitter, equivalent to the totality of the WebPKI that the log operator accepts. That'd be huge!

My suggestion of 'removing a CA' was more with respect to thinking about the 'new' submissions case, and an example of how you could mitigate some of the challenge, if no other solution existed. For addressing the 'backfill' case, the answer would have to be some form of (D)DoS mitigation, which seems to fit within the reasonable bounds of mitigation, and is distinct from a 'total' outage. So even if a submitter tried to log N million certificates in 1 second, you could reject once they exceeded the QPS budget that ensured you hit your MMD budget.
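
As a sketch of what that kind of QPS budgeting could look like (purely illustrative; the class and the parameters are made up, not a description of how any existing log front end works), a token bucket sized to the sequencer's sustainable rate would do:

    import time

    class SubmissionThrottle:
        """Refuse add-chain calls once the sustained rate would exceed what
        the sequencer can absorb while still meeting the MMD (illustrative)."""

        def __init__(self, qps_budget, burst):
            self.rate = qps_budget   # e.g. sequencer throughput minus headroom
            self.capacity = burst
            self.tokens = float(burst)
            self.last = time.monotonic()

        def try_accept(self):
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False  # caller answers with a "come back later" style error

A submission refused this way never receives an SCT, so the MMD clock for it never starts.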

A log operator could also seek to mitigate this issue with acceptance policies (as mentioned above), or by 'pre' backfilling the log contents, such that it started from a known state. Of course, as the PKI grows, I expect that the former will be more popular than the latter, but I suspect both fit within a spectrum of options and degrees, such that it's not either/or.

 
My preference would be for logs to be permitted under (demonstrable, at least for new certs!) high load to return a "come back later" response.

Given that we've seen CAs backdating certificates, how do you define 'new' certs? :)
 
Require an incident report if the frequency gets too high.

Who reports? :) With respect to adding precerts, only CAs trusted by the log can do that. With respect to MMD measurements, only those who successfully obtain an SCT can quantify that, and it may be that multiple parties are logging, and they aren't aware that they're all getting "come back laters" (e.g. 10 parties are seeing an 'outage'). Should they report unconditionally every failure? 

Brian Smith

Oct 21, 2016, 7:34:45 PM
to Ben Laurie, Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
Ben Laurie wrote:


On 21 October 2016 at 23:13, Ryan Sleevi <rsl...@chromium.org> wrote:


On Fri, Oct 21, 2016 at 2:51 PM, 'Ben Laurie' via Certificate Transparency Policy <ct-p...@chromium.org> wrote:
The important thing for me is that the policy as currently stated is impossible to comply with: if a source of new, valid certificates logs them as rapidly as it can, then either we have to get behind in sequencing, or we have to throttle the input (i.e. become unresponsive).

Both of these actions are policy violations.

I don't see any other choice.


I don't think that's a fair statement, and I'm surprised to hear you state it.

You can throttle input, which is effectively an 'outage', as provided for in https://www.chromium.org/Home/chromium-security/certificate-transparency/log-policy . So long as the MMD is maintained, and the overall outage does not regress past the 99% uptime, this is still compliant.

I suspect you're more specifically thinking of "What happens when a single certificate is presented, and the option is either blow MMD or blow 99% uptime", which is a possible situation, but one would have hoped that the Log Operator took appropriate steps to avoid that situation, since a variety of options exist - up to and including no longer including CAs as accepted by the Log until the Log Operator is able to scale.

I admit that this is a possibility that had not occurred to me. It doesn't really feel in keeping with the intent of CT,

Hi Ben,

Please forgive me for brainstorming in public. What do you think of this logic?

I think the primary intent of CT is to make sure the certs get publicly logged. The secondary goal is to ensure the logging process doesn't reduce availability too much. Therefore, a log can (temporarily) refuse to give out a SCT even if it will eventually log the cert chain. That is, a log could estimate how likely it is that it will meet the MMD for a cert (based on its load and other factors) and decide whether or not to return the SCT to the submitter. This will risk reducing the availability for the website (assuming SCTs are required), but that's a secondary concern, and it's also unlikely as long as there is a plurality of trusted logs available for the website to use instead.
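
A minimal sketch of that decision (my own illustration of the idea; the inputs and the safety margin are invented, not a real log's logic):

    def should_issue_sct(backlog_entries, sequencing_rate_per_s,
                         mmd_seconds=24 * 3600, safety_margin=0.5):
        """Estimate whether a newly accepted chain can be merged comfortably
        within the MMD; if not, decline to hand out an SCT (the chain can
        still be queued for eventual inclusion)."""
        if sequencing_rate_per_s <= 0:
            return False
        estimated_merge_seconds = backlog_entries / sequencing_rate_per_s
        return estimated_merge_seconds <= mmd_seconds * safety_margin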

The window between when somebody receives the SCT for a cert chain and the time that the cert becomes publicly available in the log is the most critical window of vulnerability in CT, right? Therefore, once the log hands out a SCT it is really important that it meet the MMD. In fact, a 24 hour MMD is already a much larger window of vulnerability than we ultimately want, right?
 
My preference would be for logs to be permitted under (demonstrable, at least for new certs!) high load to return a "come back later" response.

I agree that this seems like a reasonable thing to do.

Cheers,
Brian

Ben Laurie

Oct 21, 2016, 7:40:21 PM
to Richard Salz, Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
On 22 October 2016 at 00:31, Richard Salz <rich...@gmail.com> wrote:
From Chrome's viewpoint, it is as simple as that.  From the log developer's viewpoint, guidance would be helpful but "throw hardware at it" is one possibiliy.

Sadly, it isn't. The core problem is that providing a reliable, available, consistent service has limits on performance - most services fudge around the edges of that, but CT is not permitted to.

I can see ways around it: for example, allow logs to shard - permit a "log" to be a series of sublogs, each with their own entry policy. The requirement is that at least one sublog should accept your valid cert. Then you can scale practically indefinitely.

I see Ryan's next post hints at that. More later.
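
(For what it's worth, one way the sublog idea could be organized - this is just a sketch; partitioning by expiry year and the URLs shown are assumptions, not a concrete design - is to give each sublog an entry policy over certificate notAfter dates:)

    # Route each submission to the sublog whose entry policy covers it.
    SUBLOGS = {
        2016: "https://log.example/2016/ct/v1/add-chain",  # hypothetical URLs
        2017: "https://log.example/2017/ct/v1/add-chain",
        2018: "https://log.example/2018/ct/v1/add-chain",
    }

    def route_submission(not_after_year):
        """Pick the sublog whose entry policy accepts this certificate."""
        url = SUBLOGS.get(not_after_year)
        if url is None:
            raise ValueError("no sublog accepts certificates expiring in %d"
                             % not_after_year)
        return url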

Ryan Sleevi

Oct 21, 2016, 7:46:57 PM
to Ben Laurie, Richard Salz, Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
I should add that one area of concern is the role that name-constrained subordinate CAs will play. Setting aside the 6962-bis notion of "All these certificates are yours, except name constrained sub-CAs. Attempt no logging there" - if someone with an NC SubCA _wanted_ to log all the certs they issued, they could potentially introduce the 'spam' problem that is trying to be mitigated with respect to rejecting self-signed certs.

So it's another potential source of load, potentially malicious - a holder of an NC SubCA minting a ton of certs beforehand, then attempting to DoS a log into submission by throwing them all at a log at once. We need the flexibility for the log to handle that, since infinite scaling is hard, but at the same time, we need the reliability to assure that new certificates issued by CAs (potentially misissued) can be accepted.

You could imagine a log policy that rejects certs from a name-constrained subCA (except, from the Chrome policy side, I think that's the opposite of what we want, and I'll be posting on TRANS to this effect), or you could imagine they decide not to log certs for a domain name (which potentially means misissuance wouldn't be detected), or they could decide to reject from that particular NC SubCA, but all of these come with tradeoffs and impact.

So to the general matter of "How should a log operator balance load", I'm not 100% sure, but perhaps this is an area where log operators could, within the IETF TRANS WG, work to write up some threat models and operational guidance suggestions, based on their experiences of the past several years, so that when evaluating questions of policy failures, we can look at them through that lens.

Ben Laurie

Oct 21, 2016, 7:59:00 PM
to Brian Smith, Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
I agree. In practice, the merges are usually much faster than that, so really the MMD is about the available repair window, not the expected merge time.

Perhaps it would be better expressed as a distribution?

There are also, btw, privacy concerns around excessively fast merges. :-)

Ben Laurie

Oct 21, 2016, 8:08:00 PM
to Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
On 22 October 2016 at 00:32, Ryan Sleevi <rsl...@chromium.org> wrote:


On Fri, Oct 21, 2016 at 4:17 PM, Ben Laurie <be...@google.com> wrote:
I admit that this is a possibility that had not occurred to me. It doesn't really feel in keeping with the intent of CT, but I agree that the log technically could temporarily have a reduced or empty CA list in order to throttle input (note, though, that the required notification would have to be made [the policy doesn't state how timely that has to be, btw]).

Is that what you would recommend, or do you have other ideas?

Right, I see there being two sources of load for a Log Operator, at least w/r/t MMD-impacting events:
- New submissions
- "Backfill" submissions

The issue is that if you decide to accept a given root, then you are, in effect, agreeing to log everything that root ever issued (unless, of course, your policy makes some statement otherwise, such as examining the validity dates, or only N # of certs, etc). We haven't had any logs do so, but it's not impossible to think of them wanting to do so.

This also comes up with things like cross-signs. For example, if you agree to log Identrust or Symantec certs, then you also potentially agree to log the entirety of the US FPKI - which is quite a few certs! Now, if your log implementation checks revocation status, it could decide to reject such certs (not complying with policy), but now we get into that gray area of how much or how willing should a log be to log everything.

For "new" submissions - that is, new certificates being issued - it seems unlikely in general that a CA will cause serious pressure; even a CA like Let's Encrypt. If it does/is, then that's something that should be discussed, and is precisely the thing that is meaningful to the community to solve. But my gut and sense from discussions with log operators (including Google) is that even CAs such as Let's Encrypt do not place unmanagable load on reasonably developed logs, nor would they be anticipated to.

From what I understand from Ryan's post, it's most likely that this was a 'backfill' sort of operation. At that point, the submitter has the full ability to affect the QPS of which they log, and the upper scale of how many QPS may come in is, presuming a distributed enough submitter, equivalent to the totality of the WebPKI that the log operator accepts. That'd be huge!

This is exactly what happened.
 

My suggestion of 'removing a CA' was moreso with respect to thinking about the 'new' submissions case, and an example of how you could mitigate some of the challenge, if no other solution existed. For addressing the 'backfill' case, the answer would have to be some form of D(DoS) mitigation, which seems to fit within the reasonable bounds of mitigation, and is distinct from a 'total' outage. So even if submitted tried to log N million certificates in 1 second, you could reject once you exceeded your QPS budget that ensured you hit your MMD budget.

A log operator could also seek to mitigate this issue with acceptance policies (as mentioned above), or by 'pre' backfilling the log contents, such that it started from a known state. Of course, as the PKI grows, I expect that the former will be more popular than the latter, but I suspect both fit within a spectrum of option and degree, such that it's not either/or.

So could we fix this right now by setting an acceptance policy that says "unless load is too high"?
 

 
My preference would be for logs to be permitted under (demonstrable, at least for new certs!) high load to return a "come back later" response.

Given that we've seen CAs backdating certificates, how do you define 'new' certs? :)

I mean new to the log.
 
 
Require an incident report if the frequency gets too high.

Who reports? :) With respect to adding precerts, only CAs trusted by the log can do that. With respect to MMD measurements, only those who successfully obtain an SCT can quantify that, and it may be that multiple parties are logging, and they aren't aware that they're all getting "come back laters" (e.g. 10 parties are seeing an 'outage'). Should they report unconditionally every failure? 

Ultimately you can judge the log by whether it has actually included certs: i.e. if it has failed to respond to you (for a new cert inclusion), can you see that it has, in fact, responded to others by including new certs?

Whether it responds to queries about its contents is judged by the various analysis services.
 


Brian Smith

Oct 21, 2016, 9:43:26 PM
to Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
Ryan Sleevi <rsl...@chromium.org> wrote:
On the one hand, I can see a very compelling argument to remove the Aviator log (or more aptly, treat it as no longer trusted within the next few weeks). This is consistent with a strict zero tolerance policy, which, despite all the negatives that zero tolerance involves, has the benefit that it hopefully rises beyond even the slightest suggestion of treating Google logs different.

I agree. I think there's another benefit to distrusting the Aviator log now. Currently, CT is relatively unimportant. However, especially in 2017 it looks like it will be Really Important. If Chrome were to distrust the Aviator log now, the ecosystem would get operational experience with what happens when a Google log is removed. In particular, think about all the people that have certs with only one Aviator SCT and one non-Google SCT embedded; they would (depending on how the distrust of Aviator would be done) not on their own be sufficiently logged according to Chrome's CT policy. It would be very interesting to see how the ecosystem would cope with these certificates suddenly becoming non-compliant with a browser's CT log inclusion policy. And, it would be better to learn this before CT becomes Really Important.

I had been planning a more thorough write-up of these concerns, but then I remembered I provided much of this back in June - https://groups.google.com/a/chromium.org/d/msg/ct-policy/AH9JHYDljpU/f4I9vQLACwAJ

That previous decision couldn't have helped people's perception of whether Google would give its own logs special treatment or not. The problem with these perception issues is that they will make enforcement for non-Google logs harder when they do similar, but perhaps worse, things.

So, I think that we can tolerate the (temporary?) loss of the Aviator log, and it is better to err on the side of a literal interpretation of the policy. If we can't tolerate the loss of the Aviator log then it kind of means that it has become "too big to fail," which would be an indication of a serious problem too. In the abstract, this would be an excellent test of CT's and the Chrome CT Policy's resilience against "too big to fail". Better sooner than later.

When we examine the nature of this specific failure - the failure to integrate the SCTs within the MMD - we have to think about the impact.

AFAICT, the only thing worse than handing out a SCT and not logging the cert within the MMD is handing out a SCT and never logging it at all. An MMD of 24 hours is already *very* generous and is supposed to be sized to already accommodate major operational problems. That is, when choosing the MMD it was already decided that 24 hours was the most that could be tolerated. Exceeding that maximum tolerance by over 9% is non-trivial when we consider that.
 
Unlike an uptime issue,

Some uptime issues are more critical than others, AFAICT. For example, the uptime of serving the log contents for certificates for which SCTs have been issued seems more important than the uptime of certificate acceptance.
 
so long as the STH is eventually consistent, there's no ability to hide misissued certificates longer than the window of non-compliance (which, as I understand, is 2.2 hours). As the STH eventually integrated all of these SCTs, the first and foremost concern - the ability for the log to mask misissuance - seems to have been limited, and that window is significantly less than the window that a stapled OCSP response could be used to assert 'goodness' (roughly 3.5 days)

Agreed.
 
That said, I wouldn't want to see a scenario in which logs routinely blow MMDs - that creates a system in which the reliability of the system becomes suspect, and increases the window of detection of misissuance.

Right! The MMD is already a huge security-for-availability trade-off. Again, it seems like it was already decided that 24 hours was the most that could reasonably be tolerated. If that's not the case, then let's raise the maximum MMDs allowed. 

Interestingly, the Chrome CT policy doesn't state a maximum MMD. So, couldn't a log simply define their MMD to be 10 years to ensure they are always in compliance? There is a part of the policy that says outages may not exceed "an MMD of more than 24 hours", but there isn't anything that limits what the log's actual MMD should be, IIUC.
 
I'm curious how the broader community feels, and, as I indicated above, I can see arguments for both. I think the ecosystem can support either action as well, so this is largely about understanding the threat model, community norms, and whether enforcing a zero tolerance policy provides the greatest benefit to the ecosystem, which it may very well do.

As other messages in the thread indicated, especially those by Ben Laurie, we should also consider whether the current policy is reasonable w.r.t. availability requirements. If it is determined that the current policy isn't reasonable then it should be changed. In that case, I think it makes sense to re-evaluate all log distrust decisions, including this one, based on the new criteria. All logs that were distrusted under the old rules but which would be trusted under the new rules should remain trusted (or be re-trusted, if they were already distrusted).

OTOH, if the policy doesn't change, then it seems reasonable to interpret the policy exactly as written.

Regardless, at a minimum, the policy should be changed to clearly state a maximum allowed MMD.

Cheers,
Brian

Eric Mill

Oct 21, 2016, 9:58:27 PM
to Brian Smith, Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
On that note, I haven't seen anyone discuss on-thread whether a 24 hour MMD is perhaps just unreasonable at scale. While there are clearly security ramifications to extending the allowable window before a certificate is logged, maybe a 36-hour MMD is better for security than weakening the other guarantees of a log, and/or better than allowing the MMD requirement to get more "blurry" as small incidents are allowed?

-- Eric


Todd Johnson

Oct 21, 2016, 10:15:51 PM
to Eric Mill, Brian Smith, Certificate Transparency Policy, Paul Hadfield, Ryan Sleevi, Tom Ritter
Awesome conversation!  

Unfortunate circumstances do happen, but that should not keep the operator(s) from being able to redeem themselves. So long as there is *transparency*, the operator(s) should have some path to redemption.

Otherwise, would partitioning logs with fewer trust anchors help?

From an enterprise perspective, it is much easier to accept the risk of longer grace periods for receiving bad news... such as a revoked certificate on a CRL with a 30-day publication schedule, or discovering a misissued (or malicious) certificate. What is "appropriate" for *public trust*?


Ryan Sleevi

Oct 22, 2016, 9:03:34 AM
to Eric Mill, Brian Smith, Ryan Sleevi, Tom Ritter, Paul Hadfield, Certificate Transparency Policy
On Fri, Oct 21, 2016 at 6:57 PM, Eric Mill <er...@konklone.com> wrote:
On that note, I haven't seen anyone discuss on-thread whether a 24 hour MMD is perhaps just unreasonable at scale. While there are clearly security ramifications to extending the allowable window before a certificate is logged, maybe a 36-hour MMD is better for security than weakening the other guarantees of a log, and/or better than allowing the MMD requirement to get more "blurry" as small incidents are allowed?

I'm not really sure it does though. All of the concerns we're discussing apply at 36 hours as well. Indeed, it was generally seen that 24 hours was itself quite long, and that logs should only need 12 hours.

Whether or not this is true, of course, is largely a factor of log operators providing feedback to the ecosystem about the challenges, but I'm otherwise disinclined to just change it to "see if it helps," without understanding the first principles about what challenges log operators are running into, and why.

Paul Hadfield

Oct 22, 2016, 11:33:24 AM
to Ryan Sleevi, Tom Ritter, Eric Mill, Certificate Transparency Policy, Brian Smith

Would it be helpful to share some data on merge times for the logs Google operates?

Or are you thinking of a more qualitative description of the factors that make sequencing and signing take the time they do?



Ryan Sleevi

Oct 23, 2016, 4:51:34 PM
to Paul Hadfield, Ryan Sleevi, Tom Ritter, Eric Mill, Certificate Transparency Policy, Brian Smith
On Sat, Oct 22, 2016 at 8:33 AM, 'Paul Hadfield' via Certificate Transparency Policy <ct-p...@chromium.org> wrote:

On 22 Oct 2016 14:03, "Ryan Sleevi" <rsl...@chromium.org> wrote:
> Whether or not this is true, of course, is largely a factor of log operators providing feedback to the ecosystem about the challenges, but I'm otherwise disinclined to just change it to "see if it helps," without understanding the first principles about what challenges log operators are running into, and why.

Would it be helpful to share some data on merge times for the logs Google operates?

Or are you thinking of a more qualitative description of the factors that make sequencing and signing take the time they do?


Well, I think a big concern for potential log operators is how to best meet the constraints set forth. Lessons about best practices, things that don't work and things that do, how to avoid DoS, those are things that may be potentially helpful to developing a robust ecosystem.

I suspect this is more a question for implementors ( https://groups.google.com/forum/?fromgroups#!forum/certificate-transparency ), but the concern is ensuring there's robust feedback from implementors about the policy, rather than treating the policy as set in stone.

Andrew Ayer

Oct 25, 2016, 1:19:38 PM
to rsl...@chromium.org, Certificate Transparency Policy
My view is that this incident is akin to an uptime issue and should be
treated as such.

Consider if the following had happened: exactly 24 hours after issuing
an SCT, Aviator incorporated the certificate and then immediately went
down for 2.2 hours. Auditors would have had to wait an additional 2.2
hours to demand an inclusion proof, and monitors would not have seen
the certificate for another 2.2 hours. This would have counted against
Aviator's uptime requirement, and as long as Aviator remained above 99%
uptime, it would not have been distrusted.

Aviator missing its 24 hour MMD by 2.2 hours does not strike me as any
more severe than this hypothetical downtime. In both cases, the
security impact is the same: a several hour delay before an SCT could
be audited or a certificate detected by monitors.

Therefore, I propose that every distinct non-overlapping period of time
during which Aviator was unable to provide an inclusion proof that it
should have been able to provide be counted against Aviator's uptime
requirement. If Aviator drops below 99%, kick it out. Otherwise,
keep it in. The log policy should be updated to codify this.
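
(Back-of-the-envelope, and assuming a 90-day measurement window, which
the policy does not actually specify: the 2.2-hour overrun would consume
only a small fraction of a 1% downtime budget.)

    # Rough illustration of the accounting; the 90-day window is an assumption.
    window_hours = 90 * 24                    # 2160 hours
    allowed_downtime = 0.01 * window_hours    # 21.6 hours at 99% uptime
    mmd_overrun = 26.2 - 24.0                 # 2.2 hours past the 24h MMD
    print("overrun %.1fh of %.1fh allowed (%.0f%% of budget)"
          % (mmd_overrun, allowed_downtime, 100 * mmd_overrun / allowed_downtime))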

That said, the maximum MMD ought to be reduced (consider that Aviator's
normal merge delay is only about 1.5 hours) and log operators should be encouraged
to throttle/disable their submission endpoint rather than let their
MMD skyrocket. The reason is that refusal to accept a misissued
certificate means it won't be accepted by TLS clients (fail closed),
whereas accepting a misissued certificate but delaying its incorporation
deprives domain owners of critical time to respond, during which time
TLS clients will accept it (fail open). To that end, perhaps Chrome
should have a separate, laxer uptime requirement for submission
availability. Submission availability strikes me as an issue primarily
between CAs and the operators of the logs which they depend on for
issuance. As a practical solution to the issuance availability
problem, logs could prioritize submissions from the IP addresses of
trusted CAs so that an influx of submissions from anonymous sources
doesn't affect the timely issuance of new certificates.

Regards,
Andrew

Ryan Sleevi

Oct 25, 2016, 1:29:57 PM
to Andrew Ayer, Ryan Sleevi, Certificate Transparency Policy
Andrew,

Thanks very much for putting this together, and concrete suggestions to go with it. I think all of this sounds very reasonable, and I'm curious if I've missed anything, or if others find it reasonable too. If so, I think we can go and make an attempt to update the policy and circulate it for feedback, if there's agreement that this could help address some of the concerns people have had with running logs or consuming logs (either as CAs or monitors).

Brian Smith

Oct 25, 2016, 3:03:48 PM
to Andrew Ayer, Ryan Sleevi, Certificate Transparency Policy
Therefore, I propose that every distinct non-overlapping period of time
during which Aviator was unable to provide an inclusion proof that it
should have been able to provide be counted against Aviator's uptime
requirement.  If Aviator drops below 99%, kick it out.  Otherwise,
keep it in.  The log policy should be updated to codify this.

Just to be clear, the Chromium policy is "Have 99% uptime, with no outage lasting longer than the MMD (as measured by Google)," not just 99% uptime.

Do I understand your proposal correctly, that we should let logs hide certificates that they've given SCTs for, for up to 2*MMD? That would be 48 hours in this case. That seems like way too long to me.

That said, the maximum MMD ought to be reduced (consider that Aviator's
normal MMD is only 1.5 hours) and log operators should be encouraged
to throttle/disable their submission endpoint rather than let their
MMD skyrocket.

I agree that this seems better.
 
The reason is that refusal to accept a misissued
certificate means it won't be accepted by TLS clients (fail closed),
whereas accepting a misissued certificate but delaying its incorporation
deprives domain owners of critical time to respond, during which time
TLS clients will accept it (fail open).

Agreed.
 
To that end, perhaps Chrome
should have a separate, laxer uptime requirement for submission
availability. Submission availability strikes me as an issue primarily
between CAs and the operators of the logs which they depend on for
issuance.  As a practical solution to the issuance availability
problem, logs could prioritize submissions from the IP addresses of
trusted CAs so that an influx of submissions from anonymous sources
doesn't affect the timely issuance of new certificates.

It would be bad to have a policy where a CA could monopolize the submission bandwidth of a CT log, preventing third parties from logging certificates, so I don't think there should be a bias towards CAs. Remember, the purpose of CT is to keep CAs in check, and with that in mind the bias should be towards ensuring third parties can get their certificates logged, not towards CAs.

Also, there are two parts to certificate acceptance: saving the cert chain to incorporate into the log, and issuing the SCT to the submitter. The main function that needs to be limited to ensure the MMD is met is the issuing of the SCT to the submitter. Logs should still try really hard to log all the cert chains they are given, even if they can't guarantee that they'll merge those certs into the log within the MMD; in these cases, they should just not return an SCT to the submitter but otherwise carry on.

It may also be good for logs to offer an alternative submission point for backfilling and other bulk operations that doesn't return SCTs to the submitter and so isn't subject to the MMD.

Cheers,
Brian
--

Rob Stradling

unread,
Oct 25, 2016, 3:08:09 PM10/25/16
to Brian Smith, Andrew Ayer, Ryan Sleevi, Certificate Transparency Policy
On 25/10/16 20:03, Brian Smith wrote:
<snip>
> It may also be good for logs to offer an alternative submission point
> for backfilling and other bulk operations that doesn't return SCTs to
> the submitter and so isn't subject to the MMD.

Interesting idea.

Adding an optional "no_sct_required" boolean input to the add-chain API
would be one way to accomplish that. (My list of things I wish we'd
added to 6962-bis before Last Call keeps growing...)
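
To make that concrete, a submission using such a flag might look roughly like this (purely a hypothetical sketch - the "no_sct_required" field is not part of RFC 6962 or the 6962-bis draft, and the log URL is a placeholder):

import base64
import requests

def submit_without_sct(log_url, der_chain):
    """der_chain: list of DER-encoded certificates, leaf first."""
    payload = {
        "chain": [base64.b64encode(cert).decode("ascii") for cert in der_chain],
        "no_sct_required": True,  # proposed, non-standard field
    }
    resp = requests.post(log_url.rstrip("/") + "/ct/v1/add-chain", json=payload)
    resp.raise_for_status()
    # A log honouring the flag might omit the SCT or return an empty body.
    return resp.json() if resp.content else None

# e.g. submit_without_sct("https://log.example.com", chain_bytes_list)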

Andrew Ayer

unread,
Oct 25, 2016, 4:11:46 PM10/25/16
to Brian Smith, Ryan Sleevi, Certificate Transparency Policy
On Tue, 25 Oct 2016 09:03:46 -1000
Brian Smith <br...@briansmith.org> wrote:

> >
> > Therefore, I propose that every distinct non-overlapping period of
> > time during which Aviator was unable to provide an inclusion proof
> > that it should have been able to provide be counted against
> > Aviator's uptime requirement. If Aviator drops below 99%, kick it
> > out. Otherwise, keep it in. The log policy should be updated to
> > codify this.
> >
>
> Just to be clear, the Chromium policy is "Have 99% uptime, with no
> outage lasting longer than the MMD (as measured by Google)," not just
> 99% uptime.
>
> Do I understand your proposal is that we should let logs hide
> certificates that they've given SCTs for for up to 2*MMD? That would
> be 48 hours in this case. That seems like way too long for me.

That would be counted as 24 hours of downtime, which, assuming the
uptime is measured over a month, would put the log below 99% uptime
(which allows ~7 hours of downtime a month). I think even 31 hours is
too long, but that can be solved by requiring shorter MMDs.

> > To that end, perhaps Chrome
> > should have a separate, laxer uptime requirement for submission
> > availability. Submission availability strikes me as an issue
> > primarily between CAs and the operators of the logs which they
> > depend on for issuance. As a practical solution to the issuance
> > availability problem, logs could prioritize submissions from the IP
> > addresses of trusted CAs so that an influx of submissions from
> > anonymous sources doesn't affect the timely issuance of new
> > certificates.
>
> It would be bad to have a policy where a CA could monopolize the
> submission bandwidth of a CT log, preventing third parties from
> logging third-party logs, so I don't think there should be a bias
> towards CAs. Remember the purpose of CT is to keep CAs in check, and
> with that in mind the bias should be towards ensuring third parties
> can get their certificates logged, not towards CAs.

I see your point, but the full value of CT is not realized until CAs
log all certificates themselves, at issuance time, and TLS clients
reject certificates without SCTs. This is the best check on CAs that CT
can provide. Once this becomes a reality (for Chrome at least) one
year from now, do you think the third party submission case will be as
important?

Also, we should consider the upsides of allowing more flexible
submission policies. It might help incentivize more companies to
operate logs if they could guarantee submission availability to CAs in
exchange for a fee without having to worry about a sudden influx of
third party submissions. I think having a more diverse set of log
operators would be a major boon to the ecosystem.

> Also, there are two parts of certificate acceptance: Saving the cert
> chain to incorporate into the log, and issuing the SCT to the
> submitter. The main function that needs to be limited to ensure MMD
> is the issuing of the SCT to the submitter. Logs should still try
> really hard to log all the cert chains they are given, even if they
> can't guarantee that they'll merge those certs into the log within
> the MMD; in these cases, they should just not return a SCT to the
> submitter but otherwise carry on.

Agreed, though keep in mind this behavior would be impossible to audit.

> It may also be good for logs to offer an alternative submission point
> for backfilling and other bulk operations that doesn't return SCTs to
> the submitter and so isn't subject to the MMD.

That would also be nice.

Regards,
Andrew

Brian Smith

unread,
Oct 25, 2016, 8:13:57 PM10/25/16
to Andrew Ayer, Ryan Sleevi, Certificate Transparency Policy
Andrew Ayer <ag...@andrewayer.name> wrote:
> > Aviator's uptime requirement.  If Aviator drops below 99%, kick it
> > out.  Otherwise, keep it in.  The log policy should be updated to
> > codify this.
> >
>
> Just to be clear, the Chromium policy is "Have 99% uptime, with no
> outage lasting longer than the MMD (as measured by Google)," not just
> 99% uptime.
>
> Do I understand your proposal is that we should let logs hide
> certificates that they've given SCTs for for up to 2*MMD? That would
> be 48 hours in this case. That seems like way too long for me.

That would be counted as 24 hours of downtime, which, assuming the
uptime is measured over a month, would put the log below 99% uptime
(which allows ~7 hours of downtime a month).  I think even 31 hours is
too long, but that can be solved by requiring shorter MMDs.

I also think the policy should be clarified about what 99% uptime means. My understanding was that it was over a period of time longer than a month, but I'm not sure why I thought that.
 
> It would be bad to have a policy where a CA could monopolize the
> submission bandwidth of a CT log, preventing third parties from
> logging third-party logs, so I don't think there should be a bias
> towards CAs. Remember the purpose of CT is to keep CAs in check, and
> with that in mind the bias should be towards ensuring third parties
> can get their certificates logged, not towards CAs.

I see your point, but the full value of CT is not realized until CAs
log all certificates themselves, at issuance time, and TLS clients
reject certificates without SCTs. This is the best check on CAs that CT
can provide.  Once this becomes a reality (for Chrome at least) one
year from now, do you think the third party submission case will be as
important?

Yes. Chrome doesn't require CT yet, so it would be premature to create availability policies that could result in third-party submissions being blocked indefinitely by a small number of actors. Also, it will be a long time before every relevant software platform requires CT, and third-party submissions will matter a lot for protecting any such platform.

OTOH, one could decide that protecting platforms for which CT isn't required is out of scope for CT and/or Chromium's policy, in which case I think what you suggest makes a lot of sense. In fact, the logs could even require strong authentication during certificate submission in that case, and flat-out refuse to accept any submissions except those coming directly from their customers (CAs).

Cheers,
Brian
--

Andrew Ayer

unread,
Oct 25, 2016, 11:06:41 PM10/25/16
to Brian Smith, Ryan Sleevi, Certificate Transparency Policy
On Tue, 25 Oct 2016 14:13:55 -1000
Brian Smith <br...@briansmith.org> wrote:

> Andrew Ayer <ag...@andrewayer.name> wrote:
>
> > > > Aviator's uptime requirement. If Aviator drops below 99%, kick
> > > > it out. Otherwise, keep it in. The log policy should be
> > > > updated to codify this.
> > > >
> > >
> > > Just to be clear, the Chromium policy is "Have 99% uptime, with no
> > > outage lasting longer than the MMD (as measured by Google)," not
> > > just 99% uptime.
> > >
> > > Do I understand your proposal is that we should let logs hide
> > > certificates that they've given SCTs for for up to 2*MMD? That
> > > would be 48 hours in this case. That seems like way too long for
> > > me.
> >
> > That would be counted as 24 hours of downtime, which, assuming the
> > uptime is measured over a month, would put the log below 99% uptime
> > (which allows ~7 hours of downtime a month). I think even 31 hours
> > is too long, but that can be solved by requiring shorter MMDs.
> >
>
> I also think the policy should be clarified about what 99% uptime
> means. My understanding was that it was over a period of time longer
> than a month, but I'm not sure why I thought that.

You're correct - it's a 90 day rolling window according to
this email:
https://groups.google.com/a/chromium.org/forum/#!msg/ct-policy/ccfVGhPR6g0/ZQJRLIVLBAAJ

I don't see this written down in the policy, which ought to be
rectified.

~22 hours is an awful long time to delay incorporating a certificate for
which a log has issued an SCT, but a log could only do this once per 90
days, and per the hypothetical in my first email, it doesn't create any
security issue that doesn't already exist by allowing downtime. If
22 hours is too long, then the uptime policy should be changed.

> > > It would be bad to have a policy where a CA could monopolize the
> > > submission bandwidth of a CT log, preventing third parties from
> > > logging third-party logs, so I don't think there should be a bias
> > > towards CAs. Remember the purpose of CT is to keep CAs in check,
> > > and with that in mind the bias should be towards ensuring third
> > > parties can get their certificates logged, not towards CAs.
> >
> > I see your point, but the full value of CT is not realized until CAs
> > log all certificates themselves, at issuance time, and TLS clients
> > reject certificates without SCTs. This is the best check on CAs
> > that CT can provide. Once this becomes a reality (for Chrome at
> > least) one year from now, do you think the third party submission
> > case will be as important?
> >
>
> Yes. Chrome doesn't require CT yet, so it would be premature to create
> availability policies that could result in third-party submissions
> being blocked indefinitely by a small number of actors.

Of course, CAs are already in a position to DoS logs by issuing and
logging infinite certificates.

> Also, it will be a long time before every relevant software platform
> requires CT, third-party submissions will matter a lot for protecting
> any such platform.
>
> OTOH, one could decide that protecting platforms for which CT isn't
> required are out of scope for CT and/or Chromium's policy, in which
> case I think what you suggest makes a lot of sense.

I think scope is the key question. What do people think about this?

Regards,
Andrew

Ryan Sleevi

unread,
Oct 26, 2016, 7:41:32 PM10/26/16
to Andrew Ayer, Brian Smith, Ryan Sleevi, Certificate Transparency Policy
On Tue, Oct 25, 2016 at 8:06 PM, Andrew Ayer <ag...@andrewayer.name> wrote:
You're correct - it's a 90 day rolling window according to
this email:
https://groups.google.com/a/chromium.org/forum/#!msg/ct-policy/ccfVGhPR6g0/ZQJRLIVLBAAJ

I don't see this written down in the policy, which ought to be
rectified.

Right, this wasn't documented as part of the policy, somewhat intentionally, and that's my fault. We want to provide information to log operators about our view of uptime, but we're also concerned that the more we state explicitly, the easier the criteria become to game.

However, given this discussion, I agree it at least is beneficial to be more explicit about this.
 
Of course, CAs are already in a position to DoS logs by issuing and
logging infinite certificates.

Perhaps more relevant to the discussion is the behaviour of technically-constrained sub-CAs, who perhaps have greater ability to do this, without the CA themselves having to risk censure or removal.
 
I think scope is the key question.  What do people think about this?

I'm a little uncomfortable with suggestions that restrict or segment off third-party submissions. The announced policy certainly doesn't require pre-certs or the CA to log themselves - even if that's the best way for the CA to ensure their users' needs are satisfied in the general case - so I wouldn't want to discourage those who wanted to submit on their own.

Also, our policies have, to date, tried to be very open and to consider other platforms that may adopt CT, and have tried to reduce any critical dependencies on Google, other than as scaffolding towards more liberal policies. That is, the goal is to continually make the CT Policy more liberal, rather than more restrictive, and similarly, to encourage log operators to be more liberal and permissive.

Sanjay Modi

unread,
Oct 27, 2016, 7:05:08 PM10/27/16
to Certificate Transparency Policy, t...@ritter.vg, hadf...@google.com, rsl...@chromium.org
If Aviator is disqualified, there is a direct impact on the set of qualified log servers available to satisfy Chrome's SCT diversity policy, which requires at least one SCT from a Google log server.
Though a number of qualified log servers are available today, there are still only a few Google-operated log servers.
In light of this incident and the recently announced Chrome CT policy (https://groups.google.com/a/chromium.org/forum/?utm_medium=email&utm_source=footer#!topic/ct-policy/78N3SMcqUGw), it would be good for the Chrome team to revisit the SCT diversity policy so that certificate issuance adhering to the Chrome CT policy is not exposed to a systemic risk. SCT diversity can be achieved without tying the requirement to a specific log operator.

Ryan Sleevi

unread,
Oct 27, 2016, 7:17:13 PM10/27/16
to Sanjay Modi, Certificate Transparency Policy, Tom Ritter, Paul Hadfield, Ryan Sleevi
On Thu, Oct 27, 2016 at 4:05 PM, Sanjay Modi <sanja...@symantec.com> wrote:
If Aviator is disqualified, there is a direct impact on the set of qualified log servers available to satisfy Chrome's SCT diversity policy, which requires at least one SCT from a Google log server.

Given that there are two other log servers available, this is not accurate, or, at best, entirely overstating the matter.

This is especially true since there are two more logs that are pending inclusion and may be qualified within the next few weeks, which would also comply with the policy. If that happens, a full third of the logs will be operated by Google. If you view that as unacceptable, it would be useful to know what you consider a necessary number.
 
Though a number of qualified log servers are available today, there are still only a few Google-operated log servers.

This is a statement that cannot be objectively evaluated, because it's unclear what number of Google-operated log servers you believe would be sufficient, if not 30%. Could you expand on this, and explain why?
 
In light of this incident and the recently announced Chrome CT policy (https://groups.google.com/a/chromium.org/forum/?utm_medium=email&utm_source=footer#!topic/ct-policy/78N3SMcqUGw), it would be good for the Chrome team to revisit the SCT diversity policy so that certificate issuance adhering to the Chrome CT policy is not exposed to a systemic risk. SCT diversity can be achieved without tying the requirement to a specific log operator.

While I have repeatedly stated that we're exploring ways to diversify the SCT policy, chief among the concerns is ensuring that log operators are honest and acting in the public interest. We've seen issues where logs were (unintentionally) dishonest, but we also have logs from a number of CAs who have had serious operational issues within their organizations. I can think of four logs that meet those criteria.

Although CT is designed to prevent the damage any one of these organizations can do, it relies on a fully functioning ecosystem of gossip and accountability. I've repeatedly made clear that we're committed to moving towards that system, but I don't think it would be wise to create a false sense of urgency and suggest it be relaxed. I say this because relaxing, prior to that robustness, would particularly benefit organizations who may not be able to ensure their employees follow proper procedures, or which may not keep up to date with changes in the log policy, as that could allow for misissuance to happen without detection, or through coercion.

Again, we're very much committed to the long term and to exploring ways to relax the policy, but at present, Chrome feels that while it's reasonable (for the short term) to trust Google to be honest, it's not reasonable to trust all logs to do so. Which is why the policy exists :)

Rob Stradling

unread,
Oct 28, 2016, 10:37:58 AM10/28/16
to Brian Smith, Andrew Ayer, Ryan Sleevi, Certificate Transparency Policy, Eran Messeri
On 25/10/16 20:08, Rob Stradling wrote:
> On 25/10/16 20:03, Brian Smith wrote:
> <snip>
>> It may also be good for logs to offer an alternative submission point
>> for backfilling and other bulk operations that doesn't return SCTs to
>> the submitter and so isn't subject to the MMD.
>
> Interesting idea.
>
> Adding an optional "no_sct_required" boolean input to the add-chain API
> would be one way to accomplish that. (My list of things I wish we'd
> added to 6962-bis before Last Call keeps growing...)

I discussed this with Eran. He objected on the grounds that:
"it further complicates the log implementation for little benefit. For
a log to properly support this feature, it would have a separate queue
of entries to be incorporated into the tree where it can make up a
timestamp for the SCT it issues (even if it does not return it). Because
even that queue on the log's side can't be infinite, it does not fully
relieve submitters of their duty to handle log throttling - so we don't
gain much. So for this reason I don't think the no_sct_required
parameter should be added - if the ultimate goal is enabling throttling
/ dealing with load, why don't we add an error code explicitly for that
or designate an http error code to indicate that?"

So, inspired by OCSP, I propose that we add a "tryLater" error code for
add-chain and add-pre-chain.

Rob Stradling

unread,
Oct 28, 2016, 5:44:31 PM10/28/16
to Brian Smith, Andrew Ayer, Ryan Sleevi, Certificate Transparency Policy, Eran Messeri
It was pointed out to me that this isn't needed, because 6962-bis
already permits a 503 response with optional "Retry-After" header.

https://github.com/google/certificate-transparency-rfcs/blob/master/draft-ietf-trans-rfc6962-bis-19.txt#L1281-L1294
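
For illustration, a submitter honouring that might look something like this (a sketch only; it assumes the Retry-After value is given in seconds rather than as an HTTP date, and the retry cap is arbitrary):

import time
import requests

def add_chain_with_retry(log_url, payload, max_attempts=5):
    for attempt in range(max_attempts):
        resp = requests.post(log_url.rstrip("/") + "/ct/v1/add-chain", json=payload)
        if resp.status_code == 503:
            # The log is asking us to back off; honour its suggested delay.
            delay = int(resp.headers.get("Retry-After", "30"))
            time.sleep(delay)
            continue
        resp.raise_for_status()
        return resp.json()  # the SCT
    raise RuntimeError("log still throttling after %d attempts" % max_attempts)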

Brian Smith

unread,
Oct 28, 2016, 6:03:02 PM10/28/16
to Rob Stradling, Andrew Ayer, Ryan Sleevi, Certificate Transparency Policy, Eran Messeri
Rob Stradling <rob.st...@comodo.com> wrote:
On 25/10/16 20:08, Rob Stradling wrote:
> On 25/10/16 20:03, Brian Smith wrote:
> <snip>
>> It may also be good for logs to offer an alternative submission point
>> for backfilling and other bulk operations that doesn't return SCTs to
>> the submitter and so isn't subject to the MMD.
>
> Interesting idea.
>
> Adding an optional "no_sct_required" boolean input to the add-chain API
> would be one way to accomplish that.  (My list of things I wish we'd
> added to 6962-bis before Last Call keeps growing...)

I discussed this with Eran.  He objected on the grounds that:

<snip>
 
Because
even that queue on the log's side can't be infinite, it does not fully
relieve submitters of their duty to handle log throttling - so we don't
gain much. So for this reason I don't think the no_sct_required
parameter should be added - if the ultimate goal is enabling throttling
/ dealing with load, why don't we add an error code explicitly for that
or designate an http error code to indicate that?"

So, inspired by OCSP, I propose that we add a "tryLater" error code for
add-chain and add-pre-chain.

The "no_sct_required" feature still helps with bounded queues. If the server is going to drop submissions with a "try later" then it should prefer dropping the "no_sct_required" before dropping any non-"no_sct_required" submissions. Perhaps it would be more clear if "no_sct_required" was instead "low_priority."

The maximum size of the queue for "no_sct_required" would likely be much larger than the maximum size of the queue that can be guaranteed to meet the MMD, I think. In the case where the low-priority queue isn't full, the log should return tryLater/Retry-After, but still enqueue the chain. In particular, it is very suboptimal to completely ignore and never log a submission except in the most dire circumstances. This is especially the case when the submission isn't explicitly marked as being low-priority by some means. This would be true at least as long as the policy wants to support the case where a relying party doesn't require CT for every certificate.
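
A minimal sketch of that prioritisation, assuming invented queue sizes and a toy in-memory model (not how any existing log is implemented):

from collections import deque

SCT_QUEUE_MAX = 10000          # sized so everything here can merge within the MMD
LOW_PRIORITY_QUEUE_MAX = 1000000

sct_queue = deque()
low_priority_queue = deque()

def submit(chain, needs_sct):
    if needs_sct:
        if len(sct_queue) < SCT_QUEUE_MAX:
            sct_queue.append(chain)
            return "sct"            # issue an SCT; the MMD clock starts now
        if len(low_priority_queue) < LOW_PRIORITY_QUEUE_MAX:
            low_priority_queue.append(chain)
            return "try_later"      # no SCT, but still enqueued for logging
        return "try_later"          # dire case: nothing enqueued at all
    if len(low_priority_queue) < LOW_PRIORITY_QUEUE_MAX:
        low_priority_queue.append(chain)
        return "accepted_no_sct"
    return "try_later"              # low-priority submissions are dropped first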

An alternative would be to do backfilling/bulk operations on a separate (untrusted by relying parties) log that is associated with some trusted log, and then merge the untrusted log to trusted logs using some internal mechanism.

Cheers,
Brian

Gervase Markham

unread,
Nov 1, 2016, 4:57:34 AM11/1/16
to Certificate Transparency Policy
On Friday, 21 October 2016 12:19:52 UTC+1, Paul Hadfield wrote:
As promised on Monday October 17, here are the findings of our investigation of the recent Google Aviator incident.
 

Reading all that has been said here, it seems to me that the correct approach is to demonstrate that everyone's treated equally by doing the following:

* un-trusting Aviator
* re-starting the same codebase and infra with a new key and re-applying for inclusion (as the issue is now fixed, so why not)
* making policy changes as necessary

Exactly what the policy change is, is tricky to decide. I agree that 24 hours is an absolute worst case figure, and "it was only 2 hours over" is not much of an argument. But I also think that logs should have legal means of dealing with having their infra overwhelmed. After all, there are N million certs in the largest logs, and a few thousand in the smallest - could any of those be knocked out of trust by an attacker who simply submitted all from the former to the latter at high speed, forcing the log to either blow its uptime requirement by refusing submissions, or blow its MMD by accepting them?

Gerv

Ben Laurie

unread,
Nov 1, 2016, 5:40:27 AM11/1/16
to Gervase Markham, Certificate Transparency Policy
On 1 November 2016 at 08:57, Gervase Markham <ge...@mozilla.com> wrote:
On Friday, 21 October 2016 12:19:52 UTC+1, Paul Hadfield wrote:
As promised on Monday October 17, here are the findings of our investigation of the recent Google Aviator incident.
 

Reading all that has been said here, it seems to me that the correct approach is to demonstrate that everyone's treated equally by doing the following:

* un-trusting Aviator
* re-starting the same codebase and infra with a new key and re-applying for inclusion (as the issue is now fixed, so why not)
* making policy changes as necessary

That makes no sense to me - if policy changes are necessary, then Aviator should be judged by the revised policy.
 

Exactly what the policy change is, is tricky to decide. I agree that 24 hours is an absolute worst case figure, and "it was only 2 hours over" is not much of an argument. But I also think that logs should have legal means of dealing with having their infra overwhelmed. After all, there are N million certs in the largest logs, and a few thousand in the smallest - could any of those be knocked out of trust by an attacker who simply submitted all from the former to the latter at high speed, forcing the log to either blow its uptime requirement by refusing submissions, or blow its MMD by accepting them?

Gerv


Gervase Markham

unread,
Nov 1, 2016, 6:09:14 AM11/1/16
to Ben Laurie, Certificate Transparency Policy
On 01/11/16 09:40, Ben Laurie wrote:
> That makes no sense to me - if policy changes are necessary, then
> Aviator should be judged by the revised policy.

IMO, this is precisely what shouldn't happen, because it leaves one open
to the charge of changing the rules to suit oneself in a post-hoc
fashion. Later, when other logs violate the policy and are removed, it
would be hard to avoid the accusation of "one rule for Google and one
for everyone else". As the main driver of CT, and the one who lays down
these stringent requirements for logs which have already seen several
logs either not included or disqualified, it seems to me that Google
needs to be seen to be above reproach and scrupulously fair when judging
its own logs by the standard it has set.

If a law is found to be producing unjust results, it gets changed. But
it's rare that everyone tried under that law gets hauled back for a retrial.

CT is designed to be resilient to the failure of logs; if un-trusting
Aviator would have significant ecosystem repercussions, we have a
problem, and we need to look at what those are and see how they can be
made not to happen in future when other logs are untrusted. But if there
are not significant ecosystem repercussions, and the system is working
as designed, then there should be no problem with re-starting and
re-qualifying the log.

Gerv

Ben Laurie

unread,
Nov 1, 2016, 6:22:49 AM11/1/16
to Gervase Markham, Certificate Transparency Policy
On 1 November 2016 at 10:08, Gervase Markham <ge...@mozilla.org> wrote:
On 01/11/16 09:40, Ben Laurie wrote:
> That makes no sense to me - if policy changes are necessary, then
> Aviator should be judged by the revised policy.

IMO, this is precisely what shouldn't happen, because it leaves one open
to the charge of changing the rules to suit oneself in a post-hoc
fashion. Later, when other logs violate the policy and are removed, it
would be hard to avoid the accusation of "one rule for Google and one
for everyone else". As the main driver of CT, and the one who lays down
these stringent requirements for logs which have already seen several
logs either not included or disqualified, it seems to me that Google
needs to be seen to be above reproach and scrupulously fair when judging
its own logs by the standard it has set.

The rule followed, it seems to me, is that if your policy is impossible to comply with, then you should not execute people (err, logs) for their failure to do the impossible.
 

If a law is found to be producing unjust results, it gets changed. But
it's rare that everyone tried under that law gets hauled back for a retrial.

CT is designed to be resilient to the failure of logs; if un-trusting
Aviator would have significant ecosystem repercussions, we have a
problem, and we need to look at what those are and see how they can be
made not to happen in future when other logs are untrusted. But if there
are not significant ecosystem repercussions, and the system is working
as designed, then there should be no problem with re-starting and
re-qualifying the log.

This is beside the point. BTW, "no problem" != "no effort".

Gervase Markham

unread,
Nov 1, 2016, 6:38:30 AM11/1/16
to Ben Laurie, Certificate Transparency Policy
On 01/11/16 10:22, Ben Laurie wrote:
> The rule followed, it seems to me, is that if your policy is impossible
> to comply with, then you should not execute people (err, logs) for their
> failure to do the impossible.

I think it's a stretch to say that there was no way that Aviator could
have dealt with the situation. Ryan H mentioned that monitoring and DDoS
protection failures/inadequacies meant that the problem was not
addressed by throttling.

But the fact that it wasn't impossible to cope with this situation
doesn't mean that we shouldn't take this opportunity and what we've
learned and use it to make the policy more flexible and forgiving if we
think that's appropriate.

Perhaps the difference of opinion is partly rooted in the fact that
those outside Google see the Chrome people and the CT people as
"Google", whereas those inside see two independent teams. So an external
view might be that "Google made these rules that they've held everyone
to until they fall foul of them themselves, and then they suggest
changing them", and the internal view is "the CT team is just another
set of logs to the Chromium team, treated no differently to anyone else
- why are we not allowed to make a case that the rules are unreasonable?"

Gerv

Ryan Sleevi

unread,
Nov 1, 2016, 6:41:40 AM11/1/16
to Gervase Markham, Ben Laurie, Certificate Transparency Policy
On Tue, Nov 1, 2016 at 3:08 AM, Gervase Markham <ge...@mozilla.org> wrote:
On 01/11/16 09:40, Ben Laurie wrote:
> That makes no sense to me - if policy changes are necessary, then
> Aviator should be judged by the revised policy.

IMO, this is precisely what shouldn't happen, because it leaves one open
to the charge of changing the rules to suit oneself in a post-hoc
fashion. Later, when other logs violate the policy and are removed, it
would be hard to avoid the accusation of "one rule for Google and one
for everyone else". As the main driver of CT, and the one who lays down
these stringent requirements for logs which have already seen several
logs either not included or disqualified, it seems to me that Google
needs to be seen to be above reproach and scrupulously fair when judging
its own logs by the standard it has set.

If a law is found to be producing unjust results, it gets changed. But
it's rare that everyone tried under that law gets hauled back for a retrial.

The arguments put forward here, by non-Google participants *and* by other log operators, suggest otherwise. That is, there's a distinction between murder and, say, jaywalking, and the policy doesn't make it clear.

Most importantly, however, the policy itself does not declare itself to be black and white, as you suggest. Did you happen to review the previous discussion, as it related to Izenpe, in which this was raised and addressed? It was discussed in https://groups.google.com/a/chromium.org/d/msg/ct-policy/ZZf3iryLgCo/UL6keHE_CAAJ
 
CT is designed to be resilient to the failure of logs; if un-trusting
Aviator would have significant ecosystem repercussions, we have a
problem, and we need to look at what those are and see how they can be
made not to happen in future when other logs are untrusted. But if there
are not significant ecosystem repercussions, and the system is working
as designed, then there should be no problem with re-starting and
re-qualifying the log.

As noted, there are ecosystem repercussions with untrusting and retrusting logs, such that it should be the last resort, not the first resort.

I'm hoping you can advance your argument further, since, so far, you've been the only person to comment to suggest distrusting. That's not to say it's off the table, but there have been a number of compelling arguments as to why that's not desirable - and the policy itself states "Log Operators that fail to meet these requirements will be in violation of the Log Inclusion Policy, which may result in removal of the Log from the Chromium projects."

The keyword here being 'may' 

Ryan Sleevi

unread,
Nov 1, 2016, 6:43:11 AM11/1/16
to Gervase Markham, Ben Laurie, Certificate Transparency Policy
On Tue, Nov 1, 2016 at 3:38 AM, Gervase Markham <ge...@mozilla.org> wrote:
Perhaps the difference of opinion is partly rooted in the fact that
those outside Google see the Chrome people and the CT people as
"Google", whereas those inside see two independent teams. So an external
view might be that "Google made these rules that they've held everyone
to until they fall foul of them themselves, and then they suggest
changing them", and the internal view is "the CT team is just another
set of logs to the Chromium team, treated no differently to anyone else
- why are we not allowed to make a case that the rules are unreasonable?"

While I am entirely sympathetic to this argument - and that's why I wanted to ensure robust public discussion - I do want to point out the examples I mentioned earlier of other logs running afoul of the policy and not being removed. Would you view those similarly as candidates for removal, ex post facto?  

Peter Bowen

unread,
Nov 1, 2016, 6:46:51 AM11/1/16
to Ben Laurie, Gervase Markham, Certificate Transparency Policy
On Tue, Nov 1, 2016 at 3:22 AM, 'Ben Laurie' via Certificate
Transparency Policy <ct-p...@chromium.org> wrote:
>
>
> On 1 November 2016 at 10:08, Gervase Markham <ge...@mozilla.org> wrote:
>>
>> On 01/11/16 09:40, Ben Laurie wrote:
>> > That makes no sense to me - if policy changes are necessary, then
>> > Aviator should be judged by the revised policy.
>>
>> IMO, this is precisely what shouldn't happen, because it leaves one open
>> to the charge of changing the rules to suit oneself in a post-hoc
>> fashion. Later, when other logs violate the policy and are removed, it
>> would be hard to avoid the accusation of "one rule for Google and one
>> for everyone else". As the main driver of CT, and the one who lays down
>> these stringent requirements for logs which have already seen several
>> logs either not included or disqualified, it seems to me that Google
>> needs to be seen to be above reproach and scrupulously fair when judging
>> its own logs by the standard it has set.
>
>
> The rule followed, it seems to me, is that if your policy is impossible to
> comply with, then you should not execute people (err, logs) for their
> failure to do the impossible.

If I were to pick a part of the policy to complain about, it sure would
not be this part. As has been pointed out already, the policy allows
for massive amounts of downtime (1%). A log could easily go into
"downtime" for a few minutes to allow itself to get synchronized if it
detects that it is getting close to blowing the MMD (or even exceeding
an internal threshold way below MMD).
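
A minimal sketch of that kind of safety valve, with invented threshold and throughput numbers:

MMD_SECONDS = 24 * 3600
SAFETY_FRACTION = 0.5        # start throttling at 50% of the MMD
MERGE_RATE_PER_SECOND = 200  # assumed sequencing throughput

def should_throttle(pending_entries):
    estimated_merge_seconds = pending_entries / MERGE_RATE_PER_SECOND
    return estimated_merge_seconds > MMD_SECONDS * SAFETY_FRACTION

def handle_add_chain(pending_entries):
    if should_throttle(pending_entries):
        return 503, {"Retry-After": "300"}  # brief, deliberate "downtime"
    return 200, {}                          # accept and issue an SCT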

Thanks,
Peter

Peter Bowen

unread,
Nov 1, 2016, 7:03:45 AM11/1/16
to Ryan Sleevi, Gervase Markham, Ben Laurie, Certificate Transparency Policy
Ryan,

I finally got the opportunity to go back and read the threads you
linked. What really stood out to me was
https://groups.google.com/a/chromium.org/d/msg/ct-policy/Itoq0YUZTlA/abf6cmjyCwAJ
and subsequent messages in that thread where you repeatedly bring up
blowing the MMD in a manner that makes it clear you believed that
doing so was a clear ground for distrust. Quoting you:

"As explained on the other thread behind the reasoning, uptime has
security impact:
A significant downtime event can cause an MMD to be blown."

"For example, consider if you find a vulnerability that allows an SCT
to be issued that isn't incorporated in the MMD? That's discoverable
within 24 hours - and is reasonably serious enough to be grounds for
disqualifying the log (as we've seen)"

While I can't know all the private discussions you have had with log
operators, the MMD has been held up as a critical part of CT log
trust. As pointed out elsewhere in this discussion, the standard
merge delay is an order of magnitude shorter than the MMD.

While I agree removing a log should not happen for trivial things,
Google has repeatedly stated blowing MMD is not trivial.

Thanks,
Peter

Gervase Markham

unread,
Nov 1, 2016, 7:12:56 AM11/1/16
to Certificate Transparency Policy
On 01/11/16 10:40, Ryan Sleevi wrote:
> The argument put forward here, by non-Google participants *and* by other
> log operators, suggest otherwise. That is, there's a distinction between
> murder and, say, jaywalking, and the policy doesn't make it clear.

I am not arguing that the policy shouldn't make it clear. :-)

I can see why other log operators would prefer the policy remain
flexible so they can argue for mercy when something goes wrong. And I'm
not saying it shouldn't be flexible. But I think it should be _least_
flexible for Google, who not only make the rules, but should hold
themselves to the highest standards, and are inevitably and unavoidably
open to perceptions of conflict of interest.

> Most importantly, however, is that the policy itself does not declare
> itself black and white, as you suggest.

That is true. Chrome certainly has discretion to do whatever it likes -
as you note, the word "may" is deployed. But on the other hand, the
point of writing a policy is to make it clear what sort of behaviours
are considered unacceptable. If the MMD isn't the "maximum merge delay",
but is in fact the "merge delay after which the Chrome team will frown
at you a little bit and suggest you don't do that again", then the name
is a bit misleading. :-)

> Did you happen to review the
> previous discussion, as it related to Izenpe, in which this was
> previously discussed and addressed? This was discussed
> in https://groups.google.com/a/chromium.org/d/msg/ct-policy/ZZf3iryLgCo/UL6keHE_CAAJ

I think Izenpe is a different situation. AIUI they put themselves in a
position where there were STHs in circulation which were validly signed by their
production log's key but not produced by their production log. That's
cryptographically indistinguishable from presenting two views, which is
serious misbehaviour. But perhaps you agree with that. Are you drawing
my attention to that discussion to reinforce your point that the policy
isn't black and white? If so, I agree this is the case, above.

> As noted, there are ecosystem repercussions with untrusting and
> retrusting logs, such that it should be the last resort, not the first
> resort.

Well, OK, but if we want to be a bit utilitarian, the alternative
argument would be "CT needs to be able to cope with a commonly-used log
going away; now - before we make CT mandatory for everyone - is the best
time to see what happens in practice when we do that. After all, the log
has definitely violated the policy, so is a good choice for such a
test". Heck, the Chaos Monkey principle suggests that if none of the 3
oldest Google logs had violated the policy by now, you should kill
(well, un-trust and restart under a new key) one anyway, to see what
happens.

If it turns out that un-trusting Aviator breaks a lot of stuff, I
suspect valuable lessons will be learned from the process. If it
doesn't, that's a good validation of CT's design and implementation.

Gerv

Brian Smith

unread,
Nov 1, 2016, 12:02:49 PM11/1/16
to Gervase Markham, Certificate Transparency Policy
Gervase Markham <ge...@mozilla.org> wrote:
I can see why other log operators would prefer the policy remain
flexible so they can argue for mercy when something goes wrong. And I'm
not saying it shouldn't be flexible. But I think it should be _least_
flexible for Google, who not only make the rules, but should hold
themselves to the highest standards, and are inevitably and unavoidably
open to perceptions of conflict of interest.

There are three parts to this, which are in conflict:

1. The purpose of Chrome's implementation of CT is to protect Chrome users, so every decision that Chrome makes should be in the short-term and long-term interest of Chrome users, not by applying a policy by rote. The argument that Google trusts itself and so it never makes sense to remove Google's logs from Chrome's CT policy makes sense when considering this. However...

2. By applying the CT log inclusion policy literally and strictly to its own logs, Google maximizes its defense against being legally compelled to tamper with its own logs. If Google's logs don't have to follow the rules then they're not trustworthy.

3. Imagine that every browser requires one SCT from their own log(s) in every certificate: Microsoft, Mozilla, Google, 360, UC Browser, Opera, etc. Then we'd have a terrible situation where every certificate and/or every OCSP response would grow by ~1KB and it would be completely impractical for non-browser-owned logs to be used. We need to avoid this by avoiding treating any logs specially.
 
> As noted, there are ecosystem repercussions with untrusting and
> retrusting logs, such that it should be the last resort, not the first
> resort.

Well, OK, but if we want to be a bit utilitarian, the alternative
argument would be "CT needs to be able to cope with a commonly-used log
going away; now - before we make CT mandatory for everyone - is the best
time to see what happens in practice when we do that. After all, the log
has definitely violated the policy, so is a good choice for such a
test". Heck, the Chaos Monkey principle suggests that if none of the 3
oldest Google logs had violated the policy by now, you should kill
(well, un-trust and restart under a new key) one anyway, to see what
happens.

If it turns out that un-trusting Aviator breaks a lot of stuff, I
suspect valuable lessons will be learned from the process. If it
doesn't, that's a good validation of CT's design and implementation.

I very much agree with this.

However, I also agree with Ryan Sleevi and Ben Laurie that if the Chrome CT log inclusion policy needs to be improved based on this experience, then it makes sense to make the distrust/re-inclusion decision for all to-be-removed/removed logs based on the new policy.

Matt Palmer

unread,
Nov 1, 2016, 8:47:42 PM11/1/16
to Certificate Transparency Policy
On Tue, Nov 01, 2016 at 03:40:59AM -0700, Ryan Sleevi wrote:
> I'm hoping you can advance your argument further, since, so far, you've
> been the only person to comment to suggest distrusting.

Just in case Gerv is feeling a little lonely: I, too, support distrusting
Aviator, for much the same reasons Gerv has already stated. The optics
alone of (one part of) Google being seen to treat logs operated by (another
part of) Google with favour, regardless of the ability to find reasons that
this *particular* incident "ain't no thing", are quite damaging to the wider
vision of CT as anything more than "a Google thing".

- Matt

Ryan Sleevi

unread,
Nov 2, 2016, 12:06:28 PM11/2/16
to Peter Bowen, Ryan Sleevi, Gervase Markham, Ben Laurie, Certificate Transparency Policy
On Tue, Nov 1, 2016 at 4:03 AM, Peter Bowen <pzb...@gmail.com> wrote:
Ryan,

I finally got the opportunity to go back and read the threads you
linked.  What really stood out to me was
https://groups.google.com/a/chromium.org/d/msg/ct-policy/Itoq0YUZTlA/abf6cmjyCwAJ
and subsequent messages in that thread where you repeatedly bring up
blowing the MMD in a manner that makes it clear you believed that
doing so was a clear ground for distrust.  Quoting you:

"As explained on the other thread behind the reasoning, uptime has
security impact:
A significant downtime event can cause an MMD to be blown."

"For example, consider if you find a vulnerability that allows an SCT
to be issued that isn't incorporated in the MMD? That's discoverable
within 24 hours - and is reasonably serious enough to be grounds for
disqualifying the log (as we've seen)"

While I can't know all the private discussions you have had with log
operators, it MMD has been held up as a critical portion of CT log
trust.  As pointed out elsewhere in this discussion, the standard
merge delay is an order of magnitude shorter than the MMD. 

While I agree removing a log should not happen for trivial things,
Google has repeatedly stated blowing MMD is not trivial.

s/Google/Chrome/, to help those optics ;)


which have helped provide additional context to the hardline remarks :)

Ryan Sleevi

unread,
Nov 2, 2016, 12:09:27 PM11/2/16
to Gervase Markham, Certificate Transparency Policy
On Tue, Nov 1, 2016 at 4:12 AM, Gervase Markham <ge...@mozilla.org> wrote:
That is true. Chrome certainly has discretion to do whatever it likes -
as you note, the word "may" is deployed. But on the other hand, the
point of writing a policy is to make it clear what sort of behaviours
are considered unacceptable. If the MMD isn't the "maximum merge delay",
but is in fact the "merge delay after which the Chrome team will frown
at you a little bit and suggest you don't do that again", then the name
is a bit misleading. :-)

I think there's a distinction here, which I know you're familiar with, which is that unacceptable behaviours are not all equal; some, as have been mentioned in this thread, are unacceptable if occurring more than once or twice.

Consider Mozilla's policy, with which you're deeply familiar - are you suggesting Mozilla also intends to adopt zero tolerance? :) Or does that policy describe not "unacceptable" behaviours, just "things Mozilla would think twice about"?
 
Well, OK, but if we want to be a bit utilitarian, the alternative
argument would be "CT needs to be able to cope with a commonly-used log
going away; now - before we make CT mandatory for everyone - is the best
time to see what happens in practice when we do that. After all, the log
has definitely violated the policy, so is a good choice for such a
test". Heck, the Chaos Monkey principle suggests that if none of the 3
oldest Google logs had violated the policy by now, you should kill
(well, un-trust and restart under a new key) one anyway, to see what
happens.

If it turns out that un-trusting Aviator breaks a lot of stuff, I
suspect valuable lessons will be learned from the process. If it
doesn't, that's a good validation of CT's design and implementation.

These are all great and compelling arguments, and may help resolve some of the concern around optics and zero tolerance, as expressed in https://groups.google.com/a/chromium.org/d/msg/ct-policy/ZZf3iryLgCo/UL6keHE_CAAJ originally 

Ben Laurie

unread,
Nov 3, 2016, 12:35:09 PM11/3/16
to Ryan Sleevi, Gervase Markham, Certificate Transparency Policy
I strongly object to the argument that a log should be forcibly
distrusted because it will be an interesting experiment.

That said, if people want to see what happens when we turn a log off,
then that's something we should consider doing independently of this
policy question.

Ryan Sleevi

unread,
Nov 7, 2016, 8:25:55 PM11/7/16
to Certificate Transparency Policy
Before we decide what action to take, are there any further thoughts that people feel haven't been captured yet?

In favor of removal:
  • Chrome has removed other logs for failing to comply with other aspects of the policy. Chrome should treat all policy violations the same, and thus remove Aviator.
  • Chrome should remove Aviator for no other reason than to see what happens to the ecosystem when a popular log is removed.
  • If Chrome does not remove Aviator, it will give the appearance that Google logs are special from other logs.
    • Treating Google logs as special negatively affects the perception of trust, even when people agree it's a reasonable position.
    • If Chrome treats Google logs as more special than other logs, it will encourage other browsers to stand up special logs, and since the 'effective' CT policy is the union of all browsers' CT policies, this will negatively affect the ecosystem by increasing the number of SCTs required.
  • If Chrome does not remove Aviator, Google runs the risk of later being legally compelled to violate the policy more egregiously, and then not taking action.
Against removal:
  • The policy, as presently written, allows for logs to be DoS'd by CAs or by the public, by forcing a log to choose between blowing the MMD or blowing the uptime requirement. Therefore, the policy is unreasonable and should not be enforced for a single violation.
    • [This was mentioned in TRANS w/r/t RFC 6962-bis's policies] Log operators are expected to accept all certificates - even from intermediates that may be technically constrained or revoked. Therefore, there is no upper bound on the number of submissions a log may be asked to accept, and no way to reject spam from a holder of an intermediate - which may be an end user.
  • Removing Aviator for a single violation would discourage other log operators from operating logs, because it offers them no flexibility to learn and improve implementations, instead requiring a perfect implementation the first time, with a number of unknown risks.
  • Removing Aviator for a single violation would signal that the policy is zero tolerance, removing flexibility for Chrome to respond to incidents in the interest of Chrome users and the CT ecosystem.
  • If the log wished to hide its blown MMD, it could have gone offline instead, which would not have violated the policy so long as its overall uptime stayed above 99%. Therefore, the policy itself is inconsistent, and a blown MMD is itself not as significant as the overall uptime issue.
There have been a number of suggestions for policy improvements, which I think we'll want to summarize on a new thread (and thanks to all that have been involved in them), but I wanted to make sure I accurately and appropriately captured the arguments for and against. Is there anything people feel has not been captured or summarized in the above?

Gervase Markham

unread,
Nov 8, 2016, 4:31:11 AM11/8/16
to Ryan Sleevi, Certificate Transparency Policy
Hi Ryan,

On 08/11/16 01:25, Ryan Sleevi wrote:
> In favor of removal:
>
> * Chrome has removed other logs for failing to comply with other
> aspects of the policy. Chrome should treat all policy violations the
> same, and thus remove Aviator.

I don't think this quite captures it. An MMD blowout has been said in
the past to be a serious thing; it is not required that one believes
that "all policy violations should be treated the same" in order to
believe that Aviator should be removed in this case.

> * Chrome should remove Aviator for no other reason than to see what
> happens to the ecosystem when a popular log is removed.

"For no other reason" also doesn't capture it; I don't think I would be
arguing for this to happen to Aviator specifically if nothing had gone
wrong with it. I would instead put this as:

* It would be good for the ecosystem to see what happens when a popular
log is removed; Aviator has violated the policy and so it's the obvious
choice.

> Against removal:
>
> * The policy, as presently written, allows for logs to be DoS'd by CAs
> or by the public, by forcing a log to choose between blowing the MMD
> or blowing the uptime requirement. Therefore, the policy is
> unreasonable and should not be enforced for a single violation.

Your summary here implies the following consequential logic:

The policy is problematic and can be improved ->
The policy should not be enforced in the way it was written at the time
of the incident.

I don't believe that A implies B in this way. It is possible to believe
both that the policy can be improved, and that it should be enforced as
written at the time of the incident.

> * Removing Aviator for a single violation would discourage other log
> operators from operating logs, because it offers them no flexibility
> to learn and improve implementations, instead requiring a perfect
> implementation the first time, with a number of unknown risks.

Again, this implies that the act of removing Aviator means you are
enforcing a zero tolerance policy. I don't think that's the case. There
can be infractions less serious than blowing an MMD.

Gerv

Ben Laurie

unread,
Nov 8, 2016, 6:42:45 AM11/8/16
to Gervase Markham, Ryan Sleevi, Certificate Transparency Policy
Gerv,

Whilst I would perhaps agree with your position had there been a
policy-compliant alternative to blowing the MMD, the fact is, there
wasn't, and I haven't seen you address this at all.

Gervase Markham

unread,
Nov 8, 2016, 7:15:30 AM11/8/16
to Ben Laurie, Ryan Sleevi, Certificate Transparency Policy
On 08/11/16 11:42, Ben Laurie wrote:
> Whilst I would perhaps agree with your position had there been a
> policy-compliant alternative to blowing the MMD, the fact is, there
> wasn't, and I haven't seen you address this at all.

The original incident report said:

"Google are using the lessons learned from this incident to improve
operational practices for the Pilot, Rocketeer, Submariner, Icarus and
Skydiver logs; in particular the sequencing operation has been tuned, as
have protections for the logs against flooding. Monitoring has been
revised to provide earlier warning of similar events in the future."

That rather implies that if the flooding protection had been correctly
configured, the log would have continued to operate within limits. Is
that an incorrect conclusion from that statement?

Reviewing:
https://sites.google.com/a/chromium.org/dev/Home/chromium-security/certificate-transparency/log-policy
I don't see anything which would prevent a log from responding very
slowly to requests from a particular source when they came in at high
volume. I think it could even refuse some of them. It does define an
outage as "a failure to accept new Certificates to be logged", but given
that this is as measured by Google, and Google is unlikely to flood a
log, I think that throttling high-volume clients is within the spirit
and not against the letter of the policy. Certainly reducing service to
one client is very different from not accepting submissions from any client.

Gerv

Ben Laurie

unread,
Nov 8, 2016, 7:33:07 AM11/8/16
to Gervase Markham, Ryan Sleevi, Certificate Transparency Policy
On 8 November 2016 at 12:15, Gervase Markham <ge...@mozilla.org> wrote:
> On 08/11/16 11:42, Ben Laurie wrote:
>> Whilst I would perhaps agree with your position had there been a
>> policy-compliant alternative to blowing the MMD, the fact is, there
>> wasn't, and I haven't seen you address this at all.
>
> The original incident report said:
>
> "Google are using the lessons learned from this incident to improve
> operational practices for the Pilot, Rocketeer, Submariner, Icarus and
> Skydiver logs; in particular the sequencing operation has been tuned, as
> have protections for the logs against flooding. Monitoring has been
> revised to provide earlier warning of similar events in the future."
>
> That rather implies that if the flooding protection had been correctly
> configured, the log would have continued to operate within limits. Is
> that an incorrect conclusion from that statement?

The result of the protection is that cert inclusion requests may
sometimes be delayed or denied, both of which are (probably, the
policy is vague) breaches of uptime/RFC compliance policy. However,
this seems like a less bad option to take and reflects what we (CT
team) currently think would be the most sensible policy.

> Reviewing:
> https://sites.google.com/a/chromium.org/dev/Home/chromium-security/certificate-transparency/log-policy
> I don't see anything which would prevent a log from responding very
> slowly to requests from a particular source when they came in at high
> volume.

Clearly in this case "very slowly" would eventually have been of the
order of 24 hours. I think you'd struggle to argue you were up if you
took that long to respond to a request. And then, there's the question
of kernel resource exhaustion if you hold connections open for such
long periods.

> I think it could even refuse some of them. It does define an
> outage as "a failure to accept new Certificates to be logged", but given
> that this is as measured by Google, and Google is unlikely to flood a
> log, I think that throttling high-volume clients is within the spirit
> and not against the letter of the policy.

But it _is_ against the letter, ultimately.

> Certainly reducing service to
> one client is very different from not accepting submissions from any client.

Agreed. However, a client wishing to DoS can look like as many clients
as they want...

Ryan Sleevi

unread,
Nov 8, 2016, 11:06:21 AM11/8/16
to Gervase Markham, Certificate Transparency Policy

On Nov 8, 2016 1:31 AM, "Gervase Markham" <ge...@mozilla.org> wrote:
>
> Hi Ryan,
>
> On 08/11/16 01:25, Ryan Sleevi wrote:
> > In favor of removal:
> >
> >   * Chrome has removed other logs for failing to comply with other
> >     aspects of the policy. Chrome should treat all policy violations the
> >     same, and thus remove Aviator.
>
> I don't think this quite captures it. An MMD blowout has been said in
> the past to be a serious thing; it is not required that one believes
> that "all policy violations should be treated the same" in order to
> believe that Aviator should be removed in this case.

I'm sorry, I don't understand the distinction you are making. Are you referring to the email Peter pointed out in which I said that, and to the multiple corrections from others pointing out that it isn't?

That is, the crux of the argument seems to be "You said something that was wrong, but you should now act as if it was right"

>
> >   * Chrome should remove Aviator for no other reason than to see what
> >     happens to the ecosystem when a popular log is removed.
>
> "For no other reason" also doesn't capture it; I don't think I would be
> arguing for this to happen to Aviator specifically if nothing had gone
> wrong with it. I would instead put this as:
>
> * It would be good for the ecosystem to see what happens when a popular
> log is removed; Aviator has violated the policy and so it's the obvious
> choice.

I was intentionally trying to avoid your value judgement that it would be good - you have neither articulated the benefits nor addressed the risks. However, it is also clear that even if you agree that Aviator should stay - for the reasons outlined below - that you're still suggesting it be removed just to "see what happens" - so that very much is "for no other reason".

If you feel otherwise, please help me understand, but I feel this certainly holds as a summary of the argument.

>
> > Against removal:
> >
> >   * The policy, as presently written, allows for logs to be DoS'd by CAs
> >     or by the public, by forcing a log to choose between blowing the MMD
> >     or blowing the uptime requirement. Therefore, the policy is
> >     unreasonable and should not be enforced for a single violation.
>
> Your summary here implies the following consequential logic:
>
> The policy is problematic and can be improved ->
> The policy should not be enforced in the way it was written at the time
> of the incident.

That is part of the argument being advanced, yes.

>
> I don't believe that A implies B in this way. It is possible to believe
> both that the policy can be improved, and that it should be enforced as
> written at the time of the incident.

Why don't you believe A implies B? Or, put differently, which of the impacts have you considered and then discarded or ignored?

This was an attempt to be a summary, but the argument goes that removal IS impactful, it is not meant to be a light thing to be done for fun, because it affects the whole ecosystem, and therefore there should be a high bar for removal. As such, in the event of bad policies, good faith should be extended.

It's unclear if you're simply arguing for an absolutist interpretation or if you disagree with the statement that removing logs is impactful and not to be done lightly.

>
> >   * Removing Aviator for a single violation would discourage other log
> >     operators from operating logs, because it offers them no flexibility
> >     to learn and improve implementations, instead requiring a perfect
> >     implementation the first time, with a number of unknown risks.
>
> Again, this implies that the act of removing Aviator means you are
> enforcing a zero tolerance policy. I don't think that's the case. There
> can be infractions less serious than blowing an MMD.

Can you name examples? It would be useful to understand your perspective here, particularly as to why you view the MMD as serious enough to cross a line for you, while viewing other violations as less serious.

Regardless of your agreement, it is something captured by the onlist replies, and more seriously articulated off list, that the optics here are that it is a zero tolerance policy, because this is one of the few elements - unlike, say, split views - which can be fully induced by a remote attacker even in a perfectly implemented system with infinite scale. So if that isn't within the realm of discussion for leniency, what is?

Ryan Sleevi

unread,
Nov 8, 2016, 11:24:10 AM11/8/16
to Ryan Sleevi, Gervase Markham, Certificate Transparency Policy
On Tue, Nov 8, 2016 at 8:06 AM, Ryan Sleevi <rsl...@chromium.org> wrote:

> >   * Chrome should remove Aviator for no other reason than to see what

> >     happens to the ecosystem when a popular log is removed.
>
> "For no other reason" also doesn't capture it; I don't think I would be
> arguing for this to happen to Aviator specifically if nothing had gone
> wrong with it. I would instead put this as:
>
> * It would be good for the ecosystem to see what happens when a popular
> log is removed; Aviator has violated the policy and so it's the obvious
> choice.

I was intentionally trying to avoid your value judgement that it would be good - you have neither articulated the benefits nor addressed the risks. However, it is also clear that even if you agree that Aviator should stay - for the reasons outlined below - that you're still suggesting it be removed just to "see what happens" - so that very much is "for no other reason".

If you feel otherwise, please help me understand, but I feel this certainly holds as a summary of the argument.


I should also add that I hope you can articulate what you believe is the "unknown" here that would be learned from. We know what happens when a log is distrusted. We know what happens when a quorum of logs are distrusted. We know these are undesirable and impactful, and thus should be taken carefully based on the weight of the evidence, not taken lightly just to keep pulling strings on a sweater and finding when it unravels.

What we don't know is what happens when a log gracefully shuts down. That is, these are two distinct events - a log going into read-only mode, but still being trusted to operate in that read-only mode, versus a log being distrusted, and no longer sufficient to accurately reflect things. We know that the time for experimenting with that mode is, in the best interest of users, deferred until after inclusion proof checking is in place. So I don't feel particularly compelled to do that experiment yet, while we finish out the implementation for that via the DNS servers.

But if you believe there's a piece of information useful to the ecosystem that isn't already known, and whose impact isn't already understood, perhaps you could elaborate.

Gervase Markham

unread,
Nov 8, 2016, 11:37:27 AM11/8/16
to rsl...@chromium.org, Certificate Transparency Policy
On 08/11/16 16:06, Ryan Sleevi wrote:
>> I don't think this quite captures it. An MMD blowout has been said in
>> the past to be a serious thing; it is not required that one believes
>> that "all policy violations should be treated the same" in order to
>> believe that Aviator should be removed in this case.
>
> I'm sorry, I don't understand the distinction you are making. Are you
> referring to the email Peter pointed out in which I said that, and to
> the multiple corrections from others pointing out that it isn't?
>
> That is, the crux of the argument seems to be "You said something that
> was wrong, but you should now act as if it was right"

Well, OK then. _I_ think an MMD blowout is a serious thing. :-) 24 hours
is really a maximum. (See below for more on this.) But what I'm
objecting to in your wording is what seems to be the implication that
wanting to remove Aviator because it's violated the policy is because
one is "treating all policy violations the same". But the other position
is that one could want to remove Aviator because an MMD blowout is a
reasonably big deal.

Not all the arguments in favour of removal have to be compatible with
each other, of course, as they could be advanced by different people. So
perhaps what I'm saying is that you should add to your list.

* An MMD blowout is a serious enough violation of the policy that
Aviator should be removed.

> I was intentionally trying to avoid your value judgement that it would
> be good - you have neither articulated the benefits nor addressed the
> risks. However, it is also clear that even if you agree that Aviator
> should stay - for the reasons outlined below - that you're still
> suggesting it be removed just to "see what happens" - so that very much
> is "for no other reason".

Well, I think it would be wise to remove a log to see what happens, but
I would not be arguing that it had to be Aviator if Aviator had not
violated the policy. So I don't think it's fair to summarise it as "for
no other reason" where Aviator is concerned. But it's your summary :-)

>> I don't believe that A implies B in this way. It is possible to believe
>> both that the policy can be improved, and that it should be enforced as
>> written at the time of the incident.
>
> Why don't you believe A implies B?

The laws of logic - there is a coherent position which can be held which
endorses A but repudiates B. Therefore one cannot imply the other. You
may not agree with it, but "implement the policy as written at the time,
and then improve it if necessary" is a coherent position.

> This was an attempt to be a summary, but the argument goes that removal
> IS impactful, it is not meant to be a light thing to be done for fun,
> because it affects the whole ecosystem, and therefore there should be a
> high bar for removal. As such, in the event of bad policies, good faith
> should be extended.

Fair enough; IMO you should add that logical step to the explanation of
the argument.

> Can you name examples? It would be useful to understand your perspective
> here, particularly as to why you view the MMD as serious enough to cross
> a line for you, while viewing other violations as less serious.

I think that logs should choose to throttle write access rather than
blow the MMD. If someone's uptime dipped below 99% because they were
DoSed, I would view that as a less serious infraction.

Why do we have an MMD? So logs can't issue SCTs and then not incorporate
them for an arbitrary amount of time. How long do you wait before there
is a trust problem? That amount of time should be set as the MMD. If
we've set it too low, and 26 hours is not a trust problem, we should
change it. If we've set it right, then blowing the MMD is a trust problem.

To put it another way: if we keep Aviator, and we don't change the MMD
in the policy, then the MMD is not actually an _M_MD.

Gerv

Gervase Markham

unread,
Nov 8, 2016, 11:39:12 AM11/8/16
to rsl...@chromium.org, Certificate Transparency Policy
On 08/11/16 16:23, Ryan Sleevi wrote:
> I should also add that I hope you can articulate what you believe is the
> "unknown" here that would be learned from. We know what happens when a
> log is distrusted. We know what happens when a quorum of logs are
> distrusted.

Well, as the architects and drivers of the system, you are in a better
position to evaluate what you know and don't know than I am. I would
have thought that the relatively small logs which have shut down so far
might not tell you much about what would happen if a big one shut down -
but if you think you've learned all you need right now about log
shutdowns, I'm happy to take your word for it.

Gerv

Ryan Sleevi

unread,
Nov 8, 2016, 11:57:14 AM11/8/16
to Gervase Markham, Ryan Sleevi, Certificate Transparency Policy
On Tue, Nov 8, 2016 at 8:37 AM, Gervase Markham <ge...@mozilla.org> wrote:
I think that logs should choose to throttle write access rather than
blow the MMD. If someone's uptime dipped below 99% because they were
DoSed, I would view that as a less serious infraction.

Why do we have an MMD? So logs can't issue SCTs and then not incorporate
them for an arbitrary amount of time. How long do you wait before there
is a trust problem? That amount of time should be set as the MMD. If
we've set it too low, and 26 hours is not a trust problem, we should
change it. If we've set it right, then blowing the MMD is a trust problem.

To put it another way: if we keep Aviator, and we don't change the MMD
in the policy, then the MMD is not actually an _M_MD.

This doesn't seem to incorporate the observations and feedback provided by Andrew Ayer on MMD. Do you disagree with his analysis in https://groups.google.com/a/chromium.org/d/msg/ct-policy/ZZf3iryLgCo/3_fX4-ngAQAJ

I found it particularly compelling, and didn't see any disagreement on substance or conclusion.

Gervase Markham

unread,
Nov 8, 2016, 12:39:45 PM11/8/16
to rsl...@chromium.org, Certificate Transparency Policy
On 08/11/16 16:56, Ryan Sleevi wrote:
> This doesn't seem to incorporate the observations and feedback provided
> by Andrew Ayer on MMD. Do you disagree with his analysis
> in https://groups.google.com/a/chromium.org/d/msg/ct-policy/ZZf3iryLgCo/3_fX4-ngAQAJ

I must have not taken that message in.

Treating an MMD blowout as counting towards the downtime requirement is
an interesting idea.

But if that had been the situation for Aviator, and assuming we had
taken Andrew's other advice of dropping the MMD (which is now the
max-time-before-it-counts-as-downtime) to, say, 4 hours, Aviator would
still have blown its 99% uptime requirement as measured over 3 months.
(99% uptime allows 7 hours downtime per month; 4 + (7 * 3) = 25. 25 < 26.2.)
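
For concreteness, a minimal sketch of that budget arithmetic (illustrative only; it assumes Andrew's proposed accounting, where merge delay beyond a shortened MMD is charged against the uptime budget, and a 30-day month, so the monthly budget comes out at roughly 7.2 hours rather than the rounded 7 used above):

    # Rough sketch of the budget arithmetic; all figures are illustrative.
    mmd_hours = 4.0                            # hypothetical shortened MMD
    budget_per_month = 0.01 * 30 * 24          # 99% uptime ~= 7.2 h/month
    budget_3_months = 3 * budget_per_month     # ~21.6 h over a quarter

    observed_merge_delay = 26.2                # hours, from the incident report
    charged_as_downtime = observed_merge_delay - mmd_hours   # ~22.2 h

    print(charged_as_downtime > budget_3_months)   # True -> still a violation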

Gerv

Ryan Sleevi

unread,
Nov 8, 2016, 2:42:59 PM11/8/16
to Gervase Markham, Ryan Sleevi, Certificate Transparency Policy
I think you're conflating two things.

Consider the scenario, permitted today under the policy:
At T=minus 1, issue STH
At T=0, issue SCT
At T=23:59:59 (e.g. 1 second before the MMD for the SCT), go offline
At T=24:00:00, the MMD is blown - but no one can observe that, because the log is offline
At T=27:00:00, the system coalesces, and an STH for T=24:00:00 is produced. The log now comes back 'online'

From the perspective of observers, the log did not blow its MMD, and encountered 3:00:01 of downtime. And this is fully acceptable by the policy today, because it satisfies both properties - MMD and 99% uptime.

So if we accept that, then what we're saying is, Aviator should be removed because it didn't implement that 'trick' to hide a blown MMD. Which, naturally, means that every remaining log operator should totally implement that trick if they want to avoid a fate similar to Aviator's, since the argument and expectation is now that a single blown MMD is grounds for removal. If any other logs were to be kept after that, the Google CT team could cry foul - and similarly, if it ended up being that only *some* logs were removed for a blown MMD (perhaps due to other, additional evidence), they would have a greater claim of inconsistency. Thus, it naturally sets up a zero tolerance policy for not being 'clever' in hiding mistakes.

I think that's against both the spirit and intent of CT, and I don't think we want to punish those who weren't clever enough to hide mistakes. Instead, we should take them as learning opportunities - and in this case, there's a lot of opportunity for both log operators and policy maintainers to learn what expectations are reasonable.

I understood Andrew's argument to be that because, today, with no modifications to the policy, it's permissible, we should take that into account. Further, because it's permissible, we may want to lower the MMD so that the 'effective' MMD is shorter, while affording more flexibility to borrow from the 'uptime' requirement for abnormal situations. However, we definitely should hear more from log operators about their concerns with the MMD, before we go about shortening it - or else we run the risk of more disqualifications, which does more harm than good.
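
To make the competing readings of the timeline scenario above concrete, a minimal sketch of how a monitor might account for it (all names and data shapes here are hypothetical, not drawn from any real monitor):

    # An "observation" is (fetched_at_ms, sth_timestamp_ms, tree_size).
    MMD_MS = 24 * 60 * 60 * 1000

    def covered_by_sth_timestamp(sct_ts, entry_index, observations):
        # One reading: the MMD is met if some STH whose own timestamp falls
        # within the window covers the entry, even if the log was
        # unreachable when that window closed.
        return any(size > entry_index and sth_ts <= sct_ts + MMD_MS
                   for _, sth_ts, size in observations)

    def covered_by_observation_time(sct_ts, entry_index, observations):
        # The stricter reading (argued later in the thread): what matters is
        # when a covering STH could actually be observed.
        return any(size > entry_index and fetched_at <= sct_ts + MMD_MS
                   for fetched_at, _, size in observations)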

Matt Palmer

unread,
Nov 8, 2016, 4:21:19 PM11/8/16
to Certificate Transparency Policy
On Tue, Nov 08, 2016 at 08:06:17AM -0800, Ryan Sleevi wrote:
> This was an attempt to be a summary, but the argument goes that removal IS
> impactful, it is not meant to be a light thing to be done for fun, because
> it affects the whole ecosystem, and therefore there should be a high bar
> for removal.

It is my belief that the entire design of CT is that there should be a *low*
bar for log removal. The single biggest problem with the CA ecosystem is
that there is a *very* high bar for distrust because of the impact on users
that such an act creates; replicating such an arrangement in CT logging
seems... less than optimal.

> As such, in the event of bad policies, good faith should be
> extended.

Except that it's Google's own log that was caught up in the bad policy
problem; extending good faith to "yourself" isn't quite the same as
extending it to a neutral third party.

> It's unclear if you're simply arguing for an absolutist interpretation or
> if you disagree with the statement that removing logs is impactful and not
> to be done lightly.

For myself, I believe that removing logs *shouldn't* be impactful, and if it
is, that should be fixed. Thus, arguing against the removal because it
*would* be impactful is arguing from a false premise.

- Matt

Matt Palmer

unread,
Nov 8, 2016, 5:01:33 PM11/8/16
to Certificate Transparency Policy
On Tue, Nov 08, 2016 at 11:42:42AM +0000, 'Ben Laurie' via Certificate Transparency Policy wrote:
> Whilst I would perhaps agree with your position had there been a
> policy-compliant alternative to blowing the MMD, the fact is, there
> wasn't, and I haven't seen you address this at all.

Dip into the downtime budget and deny requests for inclusion.

- Matt

Matt Palmer

unread,
Nov 8, 2016, 5:40:15 PM11/8/16
to Certificate Transparency Policy
On Tue, Nov 08, 2016 at 11:42:17AM -0800, Ryan Sleevi wrote:
> Consider the scenario, permitted today under the policy:
> At T=minus 1, issue STH
> At T=0, issue SCT
> At T=23:59:59 (e.g. 1 second before the MMD for the SCT), go offline
> At T=24:00:00, the MMD is blown - but no one can observe that, because the
> log is offline
> At T=27:00:00, the system coalesces, and an STH for T=24:00:00 is produced.
> The log now comes back 'online'
>
> From the perspective of observers, the log did not blow its MMD

If I were observing that system, I'd definitely say the log blew its MMD,
because no STH was observed that incorporated the logged entry within the
MMD period. Failing to respond to requests has been previously deemed by
Google equivalent to blowing the MMD for the purposes of log removal, so
it's not even a novel interpretation of the policy as written.

- Matt

Ryan Sleevi

unread,
Nov 8, 2016, 5:44:29 PM11/8/16
to Matt Palmer, Certificate Transparency Policy
Refresh my memory - are you speaking about https://bugs.chromium.org/p/chromium/issues/detail?id=534745#c16 or something else? 

Matt Palmer

unread,
Nov 8, 2016, 5:53:31 PM11/8/16
to Certificate Transparency Policy
Something else -- a private communication from a Google employee.

- Matt

Pierre Phaneuf

unread,
Nov 9, 2016, 9:33:44 AM11/9/16
to Matt Palmer, Certificate Transparency Policy
In that case, anyone might have the ability to take down logs more or
less immediately after their acceptance. Just gather all the valid
certs from all the other logs, and submit them all as quickly as
possible. Either the new log will accept them (and likely blow their
MMD), or they'll have to appear down for add-chain (and I expect it
would take more than 21 hours to submit the 40+ million certs, so then
they blow their availability budget).

With the policy as it is now, you only get to choose which way you'll go down?
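
For scale, a quick back-of-envelope sketch of the rate those numbers imply (using only the figures quoted above):

    certs = 40_000_000
    window_hours = 21
    print(certs / (window_hours * 3600))   # ~529 add-chain calls per second, sustained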

Ryan Sleevi

unread,
Nov 9, 2016, 2:37:16 PM11/9/16
to Certificate Transparency Policy, mpa...@hezmatt.org


On Wednesday, November 9, 2016 at 6:33:44 AM UTC-8, Pierre Phaneuf wrote:
In that case, anyone might have the ability to take down logs more or
less immediately after their acceptance. Just gather all the valid
certs from all the other logs, and submit them all as quickly as
possible. Either the new log will accept them (and likely blow their
MMD), or they'll have to appear down for add-chain (and I expect it
would take more than 21 hours to submit the 40+ million certs, so then
they blow their availability budget).

With the policy as it is now, you only get to choose which way you'll go down?

It might be useful if, in another thread, you and other log operators could share more details as to why accepting a large number of certificates correlates with blowing an MMD. Certainly, I think that point is subtle and non-obvious for those who haven't implemented CT logs, and clarifying that point might provide greater understanding about where the policy fails.

For example, isn't it "just" a matter of replicating that many certificates in that amount of time? If we assume that each cert is 2K (that is, rounding up), and someone logs 20 million certificates, isn't that "just" a matter of geographically replicating 40GB? Are you suggesting it takes more than 24 hours to do that?

It's also unclear why submitting 40+ million certs means that the log is unable to produce STHs. For example, couldn't the log produce an STH of the first 1 million certs, and keep repeating in batches, since the requirement is that it integrate each certificate for which it gave an SCT within the MMD - not that it integrate all pending certificates within the MMD?
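
To make that batching suggestion concrete, a minimal sketch (assuming a single sequencer and hypothetical helper names; replication - roughly 20M x 2 KB, about 40 GB by the estimate above - is abstracted away entirely):

    MMD_MS = 24 * 60 * 60 * 1000
    BATCH = 1_000_000

    def sequence_in_batches(pending, now_ms, sign_sth):
        """pending: list of (sct_timestamp_ms, entry) in submission order."""
        tree_size = 0
        while pending:
            batch, pending = pending[:BATCH], pending[BATCH:]
            # The oldest entry in this batch has the earliest deadline.
            if now_ms() > batch[0][0] + MMD_MS:
                raise RuntimeError("MMD already blown for oldest pending entry")
            tree_size += len(batch)
            sign_sth(tree_size)   # each batch's STH covers everything sequenced so far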

Ryan Sleevi

unread,
Nov 9, 2016, 2:51:42 PM11/9/16
to Certificate Transparency Policy, mpa...@hezmatt.org


On Tuesday, November 8, 2016 at 1:21:19 PM UTC-8, Matt Palmer wrote:
On Tue, Nov 08, 2016 at 08:06:17AM -0800, Ryan Sleevi wrote:
> This was an attempt to be a summary, but the argument goes that removal IS
> impactful, it is not meant to be a light thing to be done for fun, because
> it affects the whole ecosystem, and therefore there should be a high bar
> for removal.

It is my belief that the entire design of CT is that there should be a *low*
bar for log removal.  The single biggest problem with the CA ecosystem is
that there is a *very* high bar for distrust because of the impact on users
that such an act creates; replicating such an arrangement in CT logging
seems... less than optimal.

Oh, I see the disagreement here. While it's true CT is meant to have a *lower* bar for removal, that's not to say it's a *low* bar, especially as parts of the ecosystem are built out. The difference between CT and the CA ecosystem is that CT's design is to provide a *verifiable* reason for removal (as much as possible) - rather than the CA ecosystem (particularly as it existed pre-CT), in which there was limited information and even fewer cryptographic assurances of the accuracy of that information. We've already seen several CAs, even in a post-CT world, lie or materially misrepresent or misunderstand issues, for which CT provided the verifiable evidence.

By CT's design, it's meant to ensure that the assurances it provides, to the extent possible, can be independently cryptographically verified. However, not everything can meet that bar - uptime/availability is a great example of this, because it depends on the observer(s) to quantify it, and it's not a cryptographic issue.

To the more general point about 'ease of removal', as with requirements like "One Google/One Non-Google", parts of the current CT world require more trust than desired, as part of bootstrapping the system onto a path of less trust. That is, One Google exists because, in the absence of SCT inclusion proof checking, you're essentially trusting the logs to be honest. Since we (Chrome) aren't entirely satisfied with trusting arbitrary logs (... which are often run by the same CAs whose operations we may be suspicious of or need to supervise), we've decided that to mitigate that, we trust Google. However, with inclusion proof checking, that need diminishes.

The same is here with the situation before us - in the absence of some of the more robust controls, such as log freezing, we need to decide to what extent we simply operate on trust, and what parts we deem are simply too risky to trust. And in doing that, we have to keep in mind what that impact may have on the ecosystem at large - because if you are unwilling to extend any trust at all, then you can't bootstrap CT, and you end up back in a situation where not only do you not trust anyone, but you don't have a path to trusting anyone.
 
> As such, in the event of bad policies, good faith should be
> extended.

Except that it's Google's own log that was caught up in the bad policy
problem; extending good faith to "yourself" isn't quite the same as
extending it to a neutral third party.

Is there any third party truly neutral?
 
For myself, I believe that removing logs *shouldn't* be impactful, and if it
is, that should be fixed.  Thus, arguing against the removal because it
*would* be impactful is arguing from a false premise.

I disagree on the conclusion - but I agree with the goal. Removing logs should be less impactful - but it will never *not* be impactful, plain and simple, which is why policies like the minimum number of SCTs exist, to try to minimize that impact as much as possible. The only way to fully isolate that impact is to imagine a perfect world, where SCTs were delivered via OCSP or TLS extensions, servers had no bugs, and everyone was updated and universally consistent. Then we could have an impact-free system. However, that's unlikely to happen today or tomorrow, and so we must make decisions on the basis of the world we live in, not the world we want. Thus, we have to weigh the decisions in the current world, in which decisions can be impactful - both to clients and to the broader health of the ecosystem. You can see the concerns on the list, and I've had similar concerns offered offlist, that "zero tolerance because we want removal to be free of impact" would likely see fewer logs, less openness, and less adoption. And zero tolerance itself isn't and wasn't the goal. That's just the world we live in.

Matt Palmer

unread,
Nov 9, 2016, 4:31:39 PM11/9/16
to Certificate Transparency Policy
Given that there aren't any logs that accept all roots, it ain't anywhere
close to 40M+ certs. Another option is for the log to prime itself with
known certs chaining to roots it trusts, giving it existing SCTs to return
for all future submissions of those certs (if it isn't caching SCTs to
handle resubmissions, well, sucks to be you).

> With the policy as it is now, you only get to choose which way you'll go down?

The policy is not a secret. "Logs can receive a lot of submissions" is not
(or at least, it *should* not be) a surprising entry in the risk register.
If you haven't scoped the problem and engineered your log such that it can
handle entirely predictable peak traffic levels and remain in compliance
with the policy, how is that the policy's problem?

- Matt

Linus Nordberg

unread,
Nov 9, 2016, 7:25:00 PM11/9/16
to Matt Palmer, Certificate Transparency Policy
Matt Palmer <mpa...@hezmatt.org> wrote
Thu, 10 Nov 2016 08:31:33 +1100:

> On Wed, Nov 09, 2016 at 02:33:41PM +0000, 'Pierre Phaneuf' via
> Certificate Transparency Policy wrote:
>> On Tue, Nov 8, 2016 at 10:01 PM, Matt Palmer <mpa...@hezmatt.org> wrote:
>> > On Tue, Nov 08, 2016 at 11:42:42AM +0000, 'Ben Laurie' via
>> > Certificate Transparency Policy wrote:
>> >> Whilst I would perhaps agree with your position had there been a
>> >> policy-compliant alternative to blowing the MMD, the fact is, there
>> >> wasn't, and I haven't seen you address this at all.
>> >
>> > Dip into the downtime budget and deny requests for inclusion.
>>
>> In that case, anyone might have the ability to take down logs more or
>> less immediately after their acceptance. Just gather all the valid
>> certs from all the other logs, and submit them all as quickly as
>> possible. Either the new log will accept them (and likely blow their
>> MMD), or they'll have to appear down for add-chain (and I expect it
>> would take more than 21 hours to submit the 40+ million certs, so then
>> they blow their availability budget).
>
> Given that there aren't any logs that accept all roots, it ain't anywhere
> close to 40M+ certs. Another option is for the log to prime itself with
> known certs chaining to roots it trusts, giving it existing SCTs to return

Agreed. Setting up a public log without priming it with all the cert
chains you can find that match your set of known roots is, well, asking
for it. And if it would be 40M+ certs, deal with it.


> for all future submissions of those certs (if it isn't caching SCTs to
> handle resubmissions, well, sucks to be you).

And even without SCT caching, the hard part -- the distribution -- has
been dealt with.

Ryan Sleevi

unread,
Nov 9, 2016, 7:29:44 PM11/9/16
to Linus Nordberg, Matt Palmer, Certificate Transparency Policy
On Wed, Nov 9, 2016 at 4:25 PM, Linus Nordberg <li...@nordu.net> wrote:
Agreed. Setting up a public log without priming it with all the cert
chains you can find that match your set of known roots is, well, asking
for it. And if it would be 40M+ certs, deal with it.

I'm sorry, but I'm not sure I buy this - especially when you consider the revocation status of some certificates (like the US FPKI cross-sign), or the ability to obtain a technically constrained sub-CA and then do whatever you want with it.

So I'm not sure I buy the 'victim blaming' approach here - it truly is an unknown as to what the extent is, and as currently written in RFC 6962/6962-bis, log operators don't really have the flexibility here that's needed to make more accurate predictions about the load and number of new certificates that might be logged - let alone preexisting.

And does it serve the community well to have all new logs effectively be lower bounded by the set of extant certificates? I don't think that necessarily helps build a healthy ecosystem, so I'm hesitant to suggest "deal with it" is the right answer.

Matt Palmer

unread,
Nov 9, 2016, 7:57:36 PM11/9/16
to Certificate Transparency Policy
On Wed, Nov 09, 2016 at 11:51:41AM -0800, Ryan Sleevi wrote:
> > > As such, in the event of bad policies, good faith should be
> > > extended.
> >
> > Except that it's Google's own log that was caught up in the bad policy
> > problem; extending good faith to "yourself" isn't quite the same as
> > extending it to a neutral third party.
>
> Is there any third party truly neutral?

I wouldn't say there are any truly "neutral" logs at present, no.

> health of the ecosystem. You can see the concerns on the list, and I've had
> similar concerns offered offlist, where the "zero tolerance because we want
> removal to be free of impact" would likely see fewer logs, less openness,
> and less adoption.

On the other hand, policy ambiguity and the appearance of favouritism
towards some logs over others would also likely see fewer logs, less
openness, and less adoption. That's not a hypothetical point, by the way:
in the wake of the Chromium all-CT-all-the-time announcement, I've been
trying to gather financial support for the operation of open-root-policy
logs, and between people saying to me, "CT is just a Google thing", and "why
would you want to run open-root logs when Google already does?", it's
*already* a hard sell; adding "Google is more lenient on their own logs than
the competition" to the arguments against supporting truly independent logs
is just making things even harder.

- Matt

Linus Nordberg

unread,
Nov 9, 2016, 8:00:29 PM11/9/16
to Ryan Sleevi, Certificate Transparency Policy
Ryan Sleevi <rsl...@chromium.org> wrote
Wed, 9 Nov 2016 16:29:02 -0800:
Maybe we're just disagreeing on what "dealing with it" means? I mean
being prepared to store that many cert chains, and if you don't put them
in your log before you make it public, make sure you can handle them
being submitted. I realise that "that many" isn't a known number.

I don't see how a log operator can avoid dealing with the amount of
certificates signed by the set of known roots they have configured their
log to accept. I'm not familiar with the "US FPKI" case but would
suggest that log operators who identify that root(s) as problematic
avoid including it/them in their log. What am I missing?

Related, I should add that the idea of fiddling with limiting the set of
known roots temporarily, as a way of _technically_ complying with a
policy seems bad to me. It does bring up an old question, which I hope
has been answered and that I've just missed it, about whether or not the
set of known roots is part of what is accepted by Chrome when a new log
is accepted. Since you were the one suggesting it, earlier in this
thread, I suppose it's _not_ viewed as part of the static metadata about
a log that mustn't change without reapplying for inclusion. Can you
clarify?

Matt Palmer

unread,
Nov 9, 2016, 8:05:58 PM11/9/16
to Certificate Transparency Policy
On Wed, Nov 09, 2016 at 04:29:02PM -0800, Ryan Sleevi wrote:
> And does it serve the community well to have all new logs effectively be
> lower bounded by the set of extant certificates? I don't think that
> necessarily helps build a healthy ecosystem, so I'm hesitant to suggest
> "deal with it" is the right answer.

Log stuffing may not be the "right" (as in, "ideal") answer, but it *is* an
answer to the question that was posed: how does a new log deal with
processing the volume of extant certificates, if they were all to be
submitted in a large batch by someone.

- Matt

Ryan Sleevi

unread,
Nov 9, 2016, 8:22:39 PM11/9/16
to Matt Palmer, Certificate Transparency Policy
On Wed, Nov 9, 2016 at 4:57 PM, Matt Palmer <mpa...@hezmatt.org> wrote:
adding "Google is more lenient on their own logs than
the competition" to the arguments against supporting truly independent logs
is just making things even harder.

Can you concretely point to what you believe is an identical case?

If not, then we're not really talking about leniency - we're talking about there being degrees of issues - and in number. And that's not leniency, that's nuance. 

Ryan Sleevi

unread,
Nov 9, 2016, 8:27:38 PM11/9/16
to Linus Nordberg, Ryan Sleevi, Certificate Transparency Policy
On Wed, Nov 9, 2016 at 5:00 PM, Linus Nordberg <li...@nordu.net> wrote:
I don't see how a log operator can avoid dealing with the amount of
certificates signed by the set of known roots they have configured their
log to accept. I'm not familiar with the "US FPKI" case but would
suggest that log operators who identify that root(s) as problematic
avoid including it/them in their log. What am I missing?

Two publicly trusted CAs (Identrust and Symantec) cross-certified the US FPKI, which is a huge bridge topology of extant certificates that are otherwise not intended to be publicly trusted. Identrust revoked the certificate, Symantec refused to, instead deciding to let it expire. Because of the former being revoked, logs which still accept Identrust (which is also the cross-certifier of Let's Encrypt) effectively will accept certs from the US FPKI, even though that cross-cert is revoked. Because that's what 6962/6962-bis say to do.

So how should log operators identify root(s) as problematic?

Are you suggesting that, as was the case with Turktrust, if a CA misissues a subordinate CA cert, it should be immediately removed from all logs? Or, as with Symantec/Identrust, if they issue a single cross-certificate to an 'undesired' PKI, should they be forever banned from logs?

If not, then how is the log going to prevent the holder of that subCA cert from abusing the log - such as generating 40M certs offline, and then submitting them all at once?
If so, how does that help the ecosystem? It moves CT from being a detection mechanism to an enforcement mechanism - forcing a rollover to a new CA hierarchy in order to be included again in logs, and therefore have certificates trusted.
 
Related, I should add that the idea of fiddling with limiting the set of
known logs temporarily, as a way of _technically_ complying with a
policy seems bad to me. It does bring up an old question, which I hope
has been answered and that I've just missed it, about whether or not the
set of known roots is part of what is accepted by Chrome when a new log
is accepted. Since you were the one suggesting it, earlier in this
thread, I suppose it's _not_ viewed as part of the static metadata about
a log that mustn't change without reapplying for inclusion. Can you
clarify?

The set of roots is part of the data in the inclusion application, and log operators are expected to update it when their set of accepted roots changes. We monitor this for compliance. Several logs have failed on occasion, but we haven't yet removed them - instead, working with them to improve the notification process.

Matt Palmer

unread,
Nov 9, 2016, 9:33:53 PM11/9/16
to Certificate Transparency Policy
On Wed, Nov 09, 2016 at 05:21:58PM -0800, Ryan Sleevi wrote:
> On Wed, Nov 9, 2016 at 4:57 PM, Matt Palmer <mpa...@hezmatt.org> wrote:
> > adding "Google is more lenient on their own logs than
> > the competition" to the arguments against supporting truly independent logs
> > is just making things even harder.
>
> Can you concretely point to what you believe is an identical case?

Well, that's the problem, isn't it? No two cases are *ever* identical; it
is always possible to construct an argument that this case is different to
that case is different to the other case so it should be treated
differently. Heck, even if you found two cases whose facts were identical
(two logs blew their MMDs by exactly the same number of seconds in exactly
the same way) the mere fact that they happen at different *times* is enough
to provide leverage for arguing they should be treated differently, because
even though we did *that* back then, nowadays we wouldn't do it that way
because we've learnt X, Y, and Z.

I'm talking about *perception* here; if the wider community merely
*perceives* Google as treating its own logs more leniently, that's all
that's needed for the damage to be done. It doesn't have to be deliberate,
it doesn't even have to be *true*; it only needs to be perceived as such,
and CT loses support and traction.

Again, this isn't a hypothetical possibility I'm talking about here: it is
what I have been told, first hand, by "the internet" that Google wants CT to
belong to.

- Matt

Ryan Sleevi

unread,
Nov 9, 2016, 9:38:33 PM11/9/16
to Matt Palmer, Certificate Transparency Policy
On Wed, Nov 9, 2016 at 6:33 PM, Matt Palmer <mpa...@hezmatt.org> wrote:
I'm talking about *perception* here; if the wider community merely
*perceives* Google as treating its own logs more leniently, that's all
that's needed for the damage to be done.  It doesn't have to be deliberate,
it doesn't even have to be *true*; it only needs to be perceived as such,
and CT loses support and traction.

And isn't that equally applicable, then, to the concern expressed by log operators regarding zero tolerance policies? That the perception of zero tolerance - whether or not it *is* zero tolerance - for any issue, intentional or otherwise, would make it untenable to run logs?

In any event, it doesn't seem like we're sharing new information or perspective here, so I think I'll try to wrap up and conclude discussions tomorrow.

Matt Palmer

unread,
Nov 9, 2016, 10:28:32 PM11/9/16
to Certificate Transparency Policy
On Wed, Nov 09, 2016 at 06:37:51PM -0800, Ryan Sleevi wrote:
> On Wed, Nov 9, 2016 at 6:33 PM, Matt Palmer <mpa...@hezmatt.org> wrote:
> > I'm talking about *perception* here; if the wider community merely
> > *perceives* Google as treating its own logs more leniently, that's all
> > that's needed for the damage to be done. It doesn't have to be deliberate,
> > it doesn't even have to be *true*; it only needs to be perceived as such,
> > and CT loses support and traction.
>
> And isn't that equally applicable, then, to the concern expressed by log
> operators regarding zero tolerance policies? That the perception of zero
> tolerance - whether or not it *is* zero tolerance - for any issue,
> intentional or otherwise, would make it untenable to run logs?

On the contrary, I would expect it to produce the opposite effect: visibly
equal enforcement of the policy against all infractions by all logs would
make for clear expectations, reduced uncertainty, and, above all,
transparency, which is, I believe, generally considered a good thing.

- Matt

Ryan Sleevi

unread,
Nov 9, 2016, 10:30:33 PM11/9/16
to Matt Palmer, Certificate Transparency Policy
On Wed, Nov 9, 2016 at 7:28 PM, Matt Palmer <mpa...@hezmatt.org> wrote:
On the contrary, I would expect it to produce the opposite effect: visibly
equal enforcement of the policy against all infractions by all logs would
make for clear expectations, reduced uncertainty, and, above all,
transparency, which is, I believe, generally considered a good thing.

I'm not sure why you mentioned what you expect, when the evidence is right here on this thread about the concern - and the increased uncertainty and lack of flexibility it causes.

Matt Palmer

unread,
Nov 9, 2016, 11:30:56 PM11/9/16
to Certificate Transparency Policy
I've just reviewed the thread archives, and I can't find any messages that
state that enforcing the strict letter of the CT policy is, in and of
itself, a cause for uncertainty or lack of flexibility. Could you point me
to the messages which you interpret as doing so? I do see plenty of
messages which express concern about various details of the current policy,
and an inability to comply with the policy as currently written, but that is
a different problem: the policy should be amended to more precisely
delineate what Chromium considers acceptable vs unacceptable log behaviour,
taking into account the experiences of log operators thus far and the
various forms of entirely predictable misbehaviour on the part of the
Internet.

- Matt

Peter Bowen

unread,
Nov 10, 2016, 1:01:12 AM11/10/16
to Ryan Sleevi, Matt Palmer, Certificate Transparency Policy
On Wed, Nov 9, 2016 at 6:37 PM, Ryan Sleevi <rsl...@chromium.org> wrote:
>
>
> On Wed, Nov 9, 2016 at 6:33 PM, Matt Palmer <mpa...@hezmatt.org> wrote:
>>
>> I'm talking about *perception* here; if the wider community merely
>> *perceives* Google as treating its own logs more leniently, that's all
>> that's needed for the damage to be done. It doesn't have to be
>> deliberate,
>> it doesn't even have to be *true*; it only needs to be perceived as such,
>> and CT loses support and traction.
>
>
> And isn't that equally applicable, then, to the concern expressed by log
> operators regarding zero tolerance policies? That the perception of zero
> tolerance - whether or not it *is* zero tolerance - for any issue,
> intentional or otherwise, would make it untenable to run logs?

I think there are two key things here.

1) The only fair way to apply policy is to apply what existed at the
time of an incident. Maybe that is bad policy, but it is fair. One
of the results of reviewing the issue could be revisions to the policy
going forward, but that should not impact review of the issue at hand.

2) Google should hold itself to a higher standard. MMD is _maximum_
merge delay and it is _minimum_ acceptable uptime. Google has some of
the best SREs in the industry. If they can't keep a log well above
policy, then I don't think anyone can.

What it feels like is happening is that the rules are changing as we
go, which I think has the result of discouraging people from running
logs.

Thanks,
Peter

Ryan Sleevi

unread,
Nov 10, 2016, 1:37:57 AM11/10/16
to Peter Bowen, Ryan Sleevi, Matt Palmer, Certificate Transparency Policy
On Wed, Nov 9, 2016 at 10:01 PM, Peter Bowen <pzb...@gmail.com> wrote:
I think there are two key things here.

1) The only fair way to apply policy is to apply what existed at the
time of an incident.  Maybe that is bad policy, but it is fair.  One
of the results of reviewing the issue could be revisions to the policy
going forward, but that should not impact review of the issue at hand.

I don't think there's any question that Aviator failed to live up to the policies.

More to the point, there seems to be some confusion in the community: the policies were not and are not intended to be 'single strike and you're out'. That suggests this could be made clearer, but the intent was to provide guidance about acceptable behaviours, make it clear that action may be taken, and still leave room for discretion and judgement in evaluating the situation.

So the disagreement is not whether it violated the policy, but whether the policy violation is significant. This is similar to the policy requiring that log operators keep up-to-date information - which not all did. Or that they file new bugs (as previously written), which only one did.

I mean, I thought the "Policy Violations" section sufficiently captured this, but this seems like an area for improvement. The entire goal of *not* spelling out explicitly all the things that could get you hard removed (such as providing a split view, as shown with Izenpe) was to avoid the notion that it's a closed set - that everything not specifically enumerated as bad was therefore acceptable. Rather, the situations would be judged on their merits and context and the information and implementation available at the time. This is, understandably, far from perfect, but that's the issue - the policy is far from perfect, and while we've sketched out the general outline for what might be deemed 'expected' behaviour, that's just it - it's meant to evolve over time as we learn more, from log operators and the community.
 
2) Google should hold itself to a higher standard.  MMD is _maximum_
merge delay and it is _minimum_ acceptable uptime.  Google has some of
the best SREs in the industry.  If they can't keep a log well above
policy, then I don't think anyone can.

I'm not sure this is accurately capturing the issue. I think, for any log operator, mistakes can and will happen. Bugs in code happen. Operational mistakes happen. Part of the question is understanding the risk and context. Should these have been reasonably anticipated? Were there ways to detect this? Did other log operators take steps to prevent issues like this? Especially in this early stage, I'm more willing to accept that there are a lot more unknowns - the system itself is and has been maturing, and hiccups are neither unexpected nor intended to be fatal to the whole system. Rather, they're learning experiences for all involved - whether it's learning something to be forbidden by policy or learning something to relax or redefine.

What it feels like is happening is that the rules are changing as we
go, which I think has the result of discouraging people from running
logs.

I suspect this is where I struggle with understanding this perspective. There's no question Aviator violated the policy. We've seen other logs violate policy - such as failing to provide responses conforming to RFC 6962 (Symantec). We've seen logs with several policy violations get removed (SSLWatcher). So it's not as if the policy is changing, or its interpretation is changing. Aviator unquestionably was out of compliance.

With respect to inclusion disqualifications, there's no question that getting included represents a high bar for competence, given the overheads and risks of removal. But the longer a log is operated, the greater confidence can be built to determine if a violation was an exception or the norm. For lack of a better metaphor, the log's whuffie goes up over time.

But should we bounce a log for changing its set of root certs and not updating the bug? The policy doesn't say the timeline for when they have to, so if they update 2 days after the fact, are they in compliance or not? What about 2 months after the fact?

This is the area where I think I've been trying to get feedback, and where I think the discussion about "Well, it's a policy violation, even if it's a bad policy" isn't particularly helpful. That, to me, has never been the question. It's "Is this a serious policy violation? And if so, why?"

Andrew's point about uptime I think is a good example of where answering this is tricky. We know what the threat model is (a blown MMD represents a window in which a certificate can be used undetectably), but that doesn't necessarily indicate the relative severity of that risk, versus, say, the risk of removing every log the first time they violate their MMD. Nor of the risk relative to a pending inclusion log (for which fewer data points suggest it's a much greater risk) versus a log that's been included for a long time (indicating that, over the whole of the data, the risk is relatively low). These are some of the points that I was hoping to get more understanding and feedback on - because they shape what the response should be when the policy is violated.

Gervase Markham

unread,
Nov 10, 2016, 4:31:15 AM11/10/16
to Ryan Sleevi, Certificate Transparency Policy, mpa...@hezmatt.org
On 09/11/16 19:37, Ryan Sleevi wrote:
> It's also unclear why submitting 40+ million certs means that the log is
> unable to produce STHs. For example, couldn't the log produce an STH of
> the first 1 million certs, and continue repeating in batches, since the
> requirement is that it integrate the certificate it gave the SCT to
> within the MMD - not that it integrate all pending certificates within
> the MMD.

I'm not a log operator, but surely if you've accepted a cert (and so
it's "pending") that means you've responded to the call to add-chain or
add-pre-chain with an SCT. And therefore, it must be added to the log
within the MMD. There's no queue inside a log of certificates which are
submitted but for which an SCT has not been issued.

So if you accept 40+M certs in one hour, you can issue intermediate STHs
if you like, but it doesn't help - they all need to be integrated in the
next 23 hours, or you've blown your MMD.

Perhaps that suggests a solution, albeit one a bit late for 6962bis. Why
not split the add-chain and add-pre-chain calls into two parts -
add-chain and get-sct? The client would then call add-chain, and the
server would return success, and an estimated number of seconds before
which get-sct would succeed. After that many seconds, the client could
call get-sct (passing the cert's SHA-256 or some other identifier) and
the server would either return an SCT or a "try later" response. The log
would only be committed to merging the SCT into the log within the MMD
starting from the point of the timestamp in the SCT, not starting from
the point where the chain was submitted.

This means that you could deal with flooding simply by filling up a
queue on your disk, and issuing ever-later "try times". You would then
add them to the log as fast as you can, storing the SCTs for later
sending-back to whoever requested them.
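
A minimal sketch of what that two-step interface might look like on the log side (every name and constant here is hypothetical, for illustration only, and not part of RFC 6962):

    import hashlib
    from collections import deque

    SECONDS_PER_ENTRY = 0.01   # assumed sequencing rate, used only for the estimate
    queue = deque()            # chains accepted but not yet given an SCT
    issued = {}                # chain_hash -> SCT, filled in as sequencing catches up

    def add_chain(chain_der):
        # Only queues the chain; no SCT is issued yet, so no MMD clock starts.
        h = hashlib.sha256(chain_der).hexdigest()
        queue.append((h, chain_der))
        return {"status": "queued",
                "retry_after_seconds": int(len(queue) * SECONDS_PER_ENTRY)}

    def get_sct(chain_hash):
        if chain_hash in issued:
            return issued[chain_hash]   # MMD runs from the timestamp in this SCT
        return {"status": "try_later"}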

Gerv

Gervase Markham

unread,
Nov 10, 2016, 5:15:02 AM11/10/16
to rsl...@chromium.org, Linus Nordberg, Matt Palmer, Certificate Transparency Policy
On 10/11/16 00:29, Ryan Sleevi wrote:
> I'm sorry, but I'm not sure I buy this - especially when you consider
> the revocation status of some certificates (like the US FPKI
> cross-sign), or the ability to obtain a technically constrained sub-CA
> and then do whatever you want with it.

Does the RFC allow you to say "I accept certs which chain up to this
root, but only if the chain does not go through this intermediate"? One
could read it as implying that you MUST accept all certs under a root if
you accept the root, but I can't see it written explicitly. It says one
MAY accept revoked certs, which suggests that one also MAY NOT, and
therefore one might also reject chains which go through a revoked
intermediate.

All this is to say: are logs allowed to "blacklist" the FPKI cross-signs
to keep all those certs out of their log, while still accepting other
certs under that root?

Gerv