SUMMARY:
On 16 October 2016 the Google 'Aviator' log exceeded its stated Maximum Merge Delay (MMD). The apparent merge delay rose over the course of three days from its usual level of about 1.5 hours to 26.2 hours. This serious operational issue arose because of an unusually large backlog of certificates submitted to the log over the preceding day; Aviator was not able to incorporate those submissions into new Signed Tree Heads quickly enough.
Exceeding the stated Maximum Merge Delay is a violation of RFC 6962, section 3:
The log MUST incorporate a certificate in its Merkle Tree
within the Maximum Merge Delay period after the issuance of the SCT.
It is also a violation of the Chromium Certificate Transparency policy:
Log Operators must ... incorporate a certificate for which an SCT has been issued by the Log within the MMD.
IMPACT:
Five consecutive runs by Aviator's signer failed to incorporate recently submitted chains within MMD. These are as follows:
Submitted chains in index range [35910127, 35919332) were not incorporated within MMD for STH signed at timestamp 1476653510423 for tree size 35936627 (34.7% of entries sequenced for that STH).
Submitted chains in index range [35936627, 35962877) were not incorporated within MMD for STH signed at timestamp 1476657100609 for tree size 35962877 (100% of entries sequenced for that STH).
Submitted chains in index range [35962877, 35982377) were not incorporated within MMD for STH signed at timestamp 1476663459670 for tree size 35982377 (100% of entries sequenced for that STH).
Submitted chains in index range [35982377, 36081668) were not incorporated within MMD for STH signed at timestamp 1476671114710 for tree size 36084877 (96.9% of entries sequenced for that STH).
Submitted chains in index range [36084877, 36113179) were not incorporated within MMD for STH signed at timestamp 1476675980176 for tree size 36255877 (16.6% of entries sequenced for that STH).
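For what it's worth, the percentages above can be reproduced from the index ranges and tree sizes. This quick sketch assumes each STH's newly sequenced entries span from the previous STH's tree size up to its own size, with the first "previous size" taken as the start of its range:

```python
# Sanity-check the reported percentages: the late entries are those in the
# given index range; the denominator is the number of entries newly
# sequenced for that STH (tree size minus the previous tree size).
sths = [
    # (range_start, range_end, tree_size, prev_tree_size)
    (35910127, 35919332, 35936627, 35910127),
    (35936627, 35962877, 35962877, 35936627),
    (35962877, 35982377, 35982377, 35962877),
    (35982377, 36081668, 36084877, 35982377),
    (36084877, 36113179, 36255877, 36084877),
]
for start, end, size, prev in sths:
    late = end - start
    sequenced = size - prev
    print(f"{late}/{sequenced} = {100 * late / sequenced:.1f}%")
# -> 34.7%, 100.0%, 100.0%, 96.9%, 16.6%
```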
ROOT CAUSE:
A large backlog of added certificates built up over a 6-hour period during the early hours (PDT) of Sunday 16 October, caused by the (non-malicious) actions of some high-volume clients. Aviator's signer could not sequence the submitted certificate chains quickly enough to clear the backlog, a problem exacerbated by the fact that our protection against flooding did not activate when expected.
REMEDIATION AND PREVENTION:
During the impact period, Google's engineers worked to bring down the size of Aviator's backlog of submitted certs, in an attempt to avoid a policy violation. As part of that effort the /ct/v1/add-chain and /ct/v1/add-pre-chain end-points were made temporarily unavailable; this was not announced.
/add-chain was unavailable from 2016-10-16 16:42 PDT to 2016-10-16 21:30 PDT (4.75hrs);
/add-pre-chain was unavailable from 2016-10-16 16:42 PDT to 2016-10-16 17:42 PDT (1hr).
Google are using the lessons learned from this incident to improve operational practices for the Pilot, Rocketeer, Submariner, Icarus and Skydiver logs; in particular the sequencing operation has been tuned, as have protections for the logs against flooding. Monitoring has been revised to provide earlier warning of similar events in the future.
--
You received this message because you are subscribed to the Google Groups "Certificate Transparency Policy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ct-policy+unsubscribe@chromium.org.
To post to this group, send email to ct-p...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/ct-policy/CA%2BcU71%3Dxdj0vu20hTv1nYR1UruUV3ir_fOrxfnMqR83ZpGWm1w%40mail.gmail.com.
I don't see any mention of Aviator's status with Chrome. Is this still
being decided, or is no news to be taken as no change will be made?
Would Chrome reconsider the portion of the CT policy that says "SCT from a log qualified at the time of check is presented"? If there is
a known good STH up to a certain point, what is the risk of simply
accepting SCTs included in the log as of that STH? The advantage of
this is a cert with, say, Aviator and Izenpe embedded SCTs would still
be trusted without having to find a server that does SCT delivery via
TLS extension or OCSP stapling.
The important thing for me is that the policy as currently stated is impossible to comply with: if a source of new, valid certificates logs them as rapidly as it can, then either we have to get behind in sequencing, or we have to throttle the input (i.e. become unresponsive). Both of these actions are policy violations. I don't see any other choice.
I think we may be approaching this from the wrong end. Chrome published a policy but has never published the risk (e.g. threat model) that the policy is trying to mitigate.
Given the current log
ecosystem has multiple operators with more than one log, we really
should be talking about both the operator and the log in these
discussions. Is the non-compliance with CT policy a log issue or a
log operator issue? What is the purpose of disqualifying a log (even
for downtime/availability) if the operator is allowed to immediately
resubmit?
Assuming the problem was a fluke outside the operator's
control (e.g. a DDoS attack on their DNS provider), is the answer "don't
change anything"?
Certainly, I want to find solutions, but I'm not sure I agree with statements that it's "impossible" to comply with.
At first I was strongly in favor of zero tolerance, and Caesar's wife and all that, but after reading everything so far I changed my mind. Tweak the policy to allow "infrequent" MMD misses, record them, and move on.
I admit that this is a possibility that had not occurred to me. It doesn't really feel in keeping with the intent of CT, but I agree that the log technically could temporarily have a reduced or empty CA list in order to throttle input (note, though, that the required notification would have to be made [the policy doesn't state how timely that has to be, btw]). Is that what you would recommend, or do you have other ideas?
My preference would be for logs to be permitted under (demonstrable, at least for new certs!) high load to return a "come back later" response.
Require an incident report if the frequency gets too high.
On 21 October 2016 at 23:13, Ryan Sleevi <rsl...@chromium.org> wrote:
> On Fri, Oct 21, 2016 at 2:51 PM, 'Ben Laurie' via Certificate Transparency Policy <ct-p...@chromium.org> wrote:
>> The important thing for me is that the policy as currently stated is impossible to comply with: if a source of new, valid certificates logs them as rapidly as it can, then either we have to get behind in sequencing, or we have to throttle the input (i.e. become unresponsive). Both of these actions are policy violations. I don't see any other choice.
>
> I don't think that's a fair statement, and I'm surprised to hear you state it. You can throttle input, which is effectively an 'outage', as provided for in https://www.chromium.org/Home/chromium-security/certificate-transparency/log-policy . So long as the MMD is maintained, and the overall outage does not regress past the 99% uptime, this is still compliant.
>
> I suspect you're more specifically thinking of "What happens when a single certificate is presented, and the option is either blow MMD or blow 99% uptime", which is a possible situation, but one would have hoped that the Log Operator took appropriate steps to avoid that situation, since a variety of options exist - up to and including no longer including CAs as accepted by the Log until the Log Operator is able to scale.

I admit that this is a possibility that had not occurred to me. It doesn't really feel in keeping with the intent of CT,
> My preference would be for logs to be permitted under (demonstrable, at least for new certs!) high load to return a "come back later" response.
From Chrome's viewpoint, it is as simple as that. From the log developer's viewpoint, guidance would be helpful, but "throw hardware at it" is one possibility.
On Fri, Oct 21, 2016 at 4:17 PM, Ben Laurie <be...@google.com> wrote:
> I admit that this is a possibility that had not occurred to me. It doesn't really feel in keeping with the intent of CT, but I agree that the log technically could temporarily have a reduced or empty CA list in order to throttle input (note, though, that the required notification would have to be made [the policy doesn't state how timely that has to be, btw]). Is that what you would recommend, or do you have other ideas?

Right, I see there being two sources of load for a Log Operator, at least w/r/t MMD-impacting events:
- New submissions
- "Backfill" submissions

The issue is that if you decide to accept a given root, then you are, in effect, agreeing to log everything that root ever issued (unless, of course, your policy makes some statement otherwise, such as examining the validity dates, or only N # of certs, etc). We haven't had any logs do so, but it's not impossible to think of them wanting to do so.

This also comes up with things like cross-signs. For example, if you agree to log Identrust or Symantec certs, then you also potentially agree to log the entirety of the US FPKI - which is quite a few certs! Now, if your log implementation checks revocation status, it could decide to reject such certs (not complying with policy), but now we get into that gray area of how much or how willing should a log be to log everything.

For "new" submissions - that is, new certificates being issued - it seems unlikely in general that a CA will cause serious pressure; even a CA like Let's Encrypt. If it does/is, then that's something that should be discussed, and is precisely the thing that is meaningful to the community to solve.
But my gut and sense from discussions with log operators (including Google) is that even CAs such as Let's Encrypt do not place unmanageable load on reasonably developed logs, nor would they be anticipated to.

From what I understand from Ryan's post, it's most likely that this was a 'backfill' sort of operation. At that point, the submitter has the full ability to affect the QPS at which they log, and the upper scale of how many QPS may come in is, presuming a distributed enough submitter, equivalent to the totality of the WebPKI that the log operator accepts. That'd be huge!
My suggestion of 'removing a CA' was moreso with respect to thinking about the 'new' submissions case, and an example of how you could mitigate some of the challenge, if no other solution existed. For addressing the 'backfill' case, the answer would have to be some form of (D)DoS mitigation, which seems to fit within the reasonable bounds of mitigation, and is distinct from a 'total' outage. So even if a submitter tried to log N million certificates in 1 second, you could reject once you exceeded the QPS budget that ensured you hit your MMD budget.

A log operator could also seek to mitigate this issue with acceptance policies (as mentioned above), or by 'pre' backfilling the log contents, such that it started from a known state. Of course, as the PKI grows, I expect that the former will be more popular than the latter, but I suspect both fit within a spectrum of option and degree, such that it's not either/or.
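To make the "QPS budget" idea concrete, here is a minimal token-bucket sketch. It is purely illustrative (no real log is claimed to work this way): the refill rate would be whatever sustained sequencing rate the operator has measured their signer can handle while still meeting the MMD.

```python
import time

class QpsBudget:
    """Token-bucket admission control for add-chain requests.

    `rate` is the sustainable submissions/second the sequencer can absorb
    while meeting the MMD; `burst` allows short spikes above that rate.
    """
    def __init__(self, rate, burst):
        self.rate = rate
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def try_admit(self):
        # Refill tokens based on elapsed time, capped at the burst size.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # accept the chain and issue an SCT
        return False      # reject (e.g. HTTP 503) rather than risk the MMD
```

Rejected submissions never receive an SCT, so they create no MMD obligation; this is what makes throttling distinct from blowing the merge delay.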
> My preference would be for logs to be permitted under (demonstrable, at least for new certs!) high load to return a "come back later" response.

Given that we've seen CAs backdating certificates, how do you define 'new' certs? :)
> Require an incident report if the frequency gets too high.

Who reports? :) With respect to adding precerts, only CAs trusted by the log can do that. With respect to MMD measurements, only those who successfully obtain an SCT can quantify that, and it may be that multiple parties are logging, and they aren't aware that they're all getting "come back laters" (e.g. 10 parties are seeing an 'outage'). Should they report unconditionally every failure?
On the one hand, I can see a very compelling argument to remove the Aviator log (or more aptly, treat it as no longer trusted within the next few weeks). This is consistent with a strict zero tolerance policy, which, despite all the negatives that zero tolerance involves, has the benefit that it hopefully rises beyond even the slightest suggestion of treating Google logs different.
I had been planning a more thorough write-up of these concerns, but then I remembered I provided much of this back in June - https://groups.google.com/a/chromium.org/d/msg/ct-policy/AH9JHYDljpU/f4I9vQLACwAJ
When we examine the nature of this specific failure - the failure to integrate the SCTs within the MMD - we have to think about the impact.
Unlike an uptime issue,
so long as the STH is eventually consistent, there's no ability to hide misissued certificates longer than the window of non-compliance (which, as I understand, is 2.2 hours). As the STH eventually integrated all of these SCTs, the first and foremost concern - the ability for the log to mask misissuance - seems to have been limited, and that window is significantly less than the window that a stapled OCSP response could be used to assert 'goodness' (roughly 3.5 days).
That said, I wouldn't want to see a scenario in which logs routinely blow MMDs - that creates a system in which the reliability of the system becomes suspect, and increases the window of detection of misissuance.
I'm curious how the broader community feels, and, as I indicated above, I can see arguments for both. I think the ecosystem can support either action as well, so this is largely about understanding the threat model, community norms, and whether enforcing a zero tolerance policy provides the greatest benefit to the ecosystem, which it may very well do.
On that note, I haven't seen anyone discuss on-thread whether a 24 hour MMD is perhaps just unreasonable at scale. While there are clearly security ramifications to extending the allowable window before a certificate is logged, maybe a 36-hour MMD is better for security than weakening the other guarantees of a log, and/or better than allowing the MMD requirement to get more "blurry" as small incidents are allowed?
On 22 Oct 2016 14:03, "Ryan Sleevi" <rsl...@chromium.org> wrote:
> Whether or not this is true, of course, is largely a factor of log operators providing feedback to the ecosystem about the challenges, but I'm otherwise disinclined to just change it to "see if it helps," without understanding the first principles about what challenges log operators are running into, and why.

Would it be helpful to share some data on merge times for the logs Google operates?
Or are you thinking of a more qualitative description of the factors that make sequencing and signing take the time they do?
Therefore, I propose that every distinct non-overlapping period of time
during which Aviator was unable to provide an inclusion proof that it
should have been able to provide be counted against Aviator's uptime
requirement. If Aviator drops below 99%, kick it out. Otherwise,
keep it in. The log policy should be updated to codify this.
That said, the MMD ought to be reduced (consider that Aviator's
normal merge delay is only 1.5 hours) and log operators should be encouraged
to throttle/disable their submission endpoint rather than let their
MMD skyrocket.
The reason is that refusal to accept a misissued
certificate means it won't be accepted by TLS clients (fail closed),
whereas accepting a misissued certificate but delaying its incorporation
deprives domain owners of critical time to respond, during which time
TLS clients will accept it (fail open).
To that end, perhaps Chrome
should have a separate, laxer uptime requirement for submission
availability. Submission availability strikes me as an issue primarily
between CAs and the operators of the logs which they depend on for
issuance. As a practical solution to the issuance availability
problem, logs could prioritize submissions from the IP addresses of
trusted CAs so that an influx of submissions from anonymous sources
doesn't affect the timely issuance of new certificates.
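A minimal sketch of that prioritization idea (the class, the trusted-IP set, and the addresses are hypothetical, purely for illustration; a real log would need operational criteria for deciding which sources get priority):

```python
import collections

class SubmissionQueue:
    """Two-tier intake: chains from known CA addresses are drained first,
    so a flood of anonymous bulk submissions cannot starve issuance-time
    logging by the CAs that depend on the log for SCTs."""
    def __init__(self, trusted_ips):
        self.trusted_ips = set(trusted_ips)
        self.priority = collections.deque()
        self.bulk = collections.deque()

    def enqueue(self, source_ip, chain):
        q = self.priority if source_ip in self.trusted_ips else self.bulk
        q.append(chain)

    def dequeue(self):
        # Always serve the CA queue before the anonymous queue.
        if self.priority:
            return self.priority.popleft()
        if self.bulk:
            return self.bulk.popleft()
        return None
```

Note this trades away neutrality between submitters, which is exactly the tension raised in the replies below about biasing the log towards CAs.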
> > Aviator's uptime requirement. If Aviator drops below 99%, kick it
> > out. Otherwise, keep it in. The log policy should be updated to
> > codify this.
> >
>
> Just to be clear, the Chromium policy is "Have 99% uptime, with no
> outage lasting longer than the MMD (as measured by Google)," not just
> 99% uptime.
>
> Do I understand your proposal is that we should let logs hide
> certificates that they've given SCTs for for up to 2*MMD? That would
> be 48 hours in this case. That seems like way too long for me.
That would be counted as 24 hours of downtime, which, assuming the
uptime is measured over a month, would put the log below 99% uptime
(which allows ~7 hours of downtime a month). I think even 31 hours is
too long, but that can be solved by requiring shorter MMDs.
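For reference, the arithmetic behind those figures, for the monthly window mentioned here and the 90-day rolling window that comes up later in the thread:

```python
# Downtime tolerated by a 99% uptime requirement over a given window.
allowed = {days: 0.01 * days * 24 for days in (30, 90)}
for days, hours in allowed.items():
    print(f"{days}-day window: {hours:.1f} hours of downtime")
# 30-day window: 7.2 hours; 90-day window: 21.6 hours
```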
> It would be bad to have a policy where a CA could monopolize the
> submission bandwidth of a CT log, preventing third parties from
> logging certificates, so I don't think there should be a bias
> towards CAs. Remember the purpose of CT is to keep CAs in check, and
> with that in mind the bias should be towards ensuring third parties
> can get their certificates logged, not towards CAs.
I see your point, but the full value of CT is not realized until CAs
log all certificates themselves, at issuance time, and TLS clients
reject certificates without SCTs. This is the best check on CAs that CT
can provide. Once this becomes a reality (for Chrome at least) one
year from now, do you think the third party submission case will be as
important?
You're correct - it's a 90 day rolling window according to
this email:
https://groups.google.com/a/chromium.org/forum/#!msg/ct-policy/ccfVGhPR6g0/ZQJRLIVLBAAJ
I don't see this written down in the policy, which ought to be
rectified.
Of course, CAs are already in a position to DoS logs by issuing and
logging infinite certificates.
I think scope is the key question. What do people think about this?
If Aviator is disqualified then there is a direct impact on the qualified log servers available to satisfy Chrome's SCT diversity policy, which requires at least one SCT from a Google log server.
Though a number of qualified log servers are available today, there are still only a few Google-operated log servers.
In light of this incident and the recently announced Chrome CT policy (https://groups.google.com/a/chromium.org/forum/?utm_medium=email&utm_source=footer#!topic/ct-policy/78N3SMcqUGw), it would be good for the Chrome team to revisit the SCT diversity policy so that certificate issuance adhering to the Chrome CT policy is not exposed to a systemic risk. SCT diversity can be achieved without tying it to a specific log operator.
On 25/10/16 20:08, Rob Stradling wrote:
> On 25/10/16 20:03, Brian Smith wrote:
> <snip>
>> It may also be good for logs to offer an alternative submission point
>> for backfilling and other bulk operations that doesn't return SCTs to
>> the submitter and so isn't subject to the MMD.
>
> Interesting idea.
>
> Adding an optional "no_sct_required" boolean input to the add-chain API
> would be one way to accomplish that. (My list of things I wish we'd
> added to 6962-bis before Last Call keeps growing...)
I discussed this with Eran. He objected on the grounds that:
"Because
even that queue on the log's side can't be infinite, it does not fully
relieve submitters of their duty to handle log throttling - so we don't
gain much. So for this reason I don't think the no_sct_required
parameter should be added - if the ultimate goal is enabling throttling
/ dealing with load, why don't we add an error code explicitly for that
or designate an http error code to indicate that?"
So, inspired by OCSP, I propose that we add a "tryLater" error code for
add-chain and add-pre-chain.
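To sketch what the submitter's side of such a signal might look like (everything here is illustrative: RFC 6962 defines no "tryLater" code, so the TryLater exception, the injectable `post`/`sleep` callables, and the backoff parameters are assumptions, not an existing API):

```python
import time

class TryLater(Exception):
    """Raised when the log returns the hypothetical 'come back later'
    signal (e.g. HTTP 503 with an optional Retry-After value)."""
    def __init__(self, retry_after=None):
        self.retry_after = retry_after

def submit_with_backoff(post, max_attempts=5, sleep=time.sleep):
    """post() performs one /ct/v1/add-chain request and either returns
    the SCT or raises TryLater; sleep is injectable for testing."""
    delay = 1.0
    for _ in range(max_attempts):
        try:
            return post()
        except TryLater as e:
            # Honor the log's hint if it gave one, else back off ourselves.
            sleep(e.retry_after if e.retry_after is not None else delay)
            delay *= 2  # exponential backoff between attempts
    raise RuntimeError("log still throttling after %d attempts" % max_attempts)
```

The key property is that no SCT exists until post() succeeds, so a throttled submission never starts the MMD clock.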
As promised on Monday October 17, here are the findings of our investigation of the recent Google Aviator incident.
On Friday, 21 October 2016 12:19:52 UTC+1, Paul Hadfield wrote:
> As promised on Monday October 17, here are the findings of our investigation of the recent Google Aviator incident.
Reading all that has been said here, it seems to me that the correct approach is to demonstrate that everyone's treated equally by doing the following:
* un-trusting Aviator
* re-starting the same codebase and infra with a new key and re-applying for inclusion (as the issue is now fixed, so why not)
* making policy changes as necessary
Exactly what the policy change is, is tricky to decide. I agree that 24 hours is an absolute worst case figure, and "it was only 2 hours over" is not much of an argument. But I also think that logs should have legitimate means of dealing with having their infra overwhelmed. After all, there are N million certs in the largest logs, and a few thousand in the smallest - could any of those be knocked out of trust by an attacker who simply submitted all from the former to the latter at high speed, forcing the log to either blow its uptime requirement by refusing submissions, or blow its MMD by accepting them?
Gerv
On 01/11/16 09:40, Ben Laurie wrote:
> That makes no sense to me - if policy changes are necessary, then
> Aviator should be judged by the revised policy.
IMO, this is precisely what shouldn't happen, because it leaves one open
to the charge of changing the rules to suit oneself in a post-hoc
fashion. Later, when other logs violate the policy and are removed, it
would be hard to avoid the accusation of "one rule for Google and one
for everyone else". As the main driver of CT, and the one who lays down
these stringent requirements for logs which have already seen several
logs either not included or disqualified, it seems to me that Google
needs to be seen to be above reproach and scrupulously fair when judging
its own logs by the standard it has set.
If a law is found to be producing unjust results, it gets changed. But
it's rare that everyone tried under that law gets hauled back for a retrial.
CT is designed to be resilient to the failure of logs; if un-trusting
Aviator would have significant ecosystem repercussions, we have a
problem, and we need to look at what those are and see how they can be
made not to happen in future when other logs are untrusted. But if there
are not significant ecosystem repercussions, and the system is working
as designed, then there should be no problem with re-starting and
re-qualifying the log.
Perhaps the difference of opinion is partly rooted in the fact that
those outside Google see the Chrome people and the CT people as
"Google", whereas those inside see two independent teams. So an external
view might be that "Google made these rules that they've held everyone
to until they fall foul of them themselves, and then they suggest
changing them", and the internal view is "the CT team is just another
set of logs to the Chromium team, treated no differently to anyone else
- why are we not allowed to make a case that the rules are unreasonable?"
I can see why other log operators would prefer the policy remain
flexible so they can argue for mercy when something goes wrong. And I'm
not saying it shouldn't be flexible. But I think it should be _least_
flexible for Google, who not only make the rules, but should hold
themselves to the highest standards, and are inevitably and unavoidably
open to perceptions of conflict of interest.
> As noted, there are ecosystem repercussions with untrusting and
> retrusting logs, such that it should be the last resort, not the first
> resort.
Well, OK, but if we want to be a bit utilitarian, the alternative
argument would be "CT needs to be able to cope with a commonly-used log
going away; now - before we make CT mandatory for everyone - is the best
time to see what happens in practice when we do that. After all, the log
has definitely violated the policy, so is a good choice for such a
test". Heck, the Chaos Monkey principle suggests that if none of the 3
oldest Google logs had violated the policy by now, you should kill
(well, un-trust and restart under a new key) one anyway, to see what
happens.
If it turns out that un-trusting Aviator breaks a lot of stuff, I
suspect valuable lessons will be learned from the process. If it
doesn't, that's a good validation of CT's design and implementation.
Ryan,
I finally got the opportunity to go back and read the threads you
linked. What really stood out to me was
https://groups.google.com/a/chromium.org/d/msg/ct-policy/Itoq0YUZTlA/abf6cmjyCwAJ
and subsequent messages in that thread where you repeatedly bring up
blowing the MMD in a manner that makes it clear you believed that
doing so was a clear ground for distrust. Quoting you:
"As explained on the other thread behind the reasoning, uptime has
security impact:
A significant downtime event can cause an MMD to be blown."
"For example, consider if you find a vulnerability that allows an SCT
to be issued that isn't incorporated in the MMD? That's discoverable
within 24 hours - and is reasonably serious enough to be grounds for
disqualifying the log (as we've seen)"
While I can't know all the private discussions you have had with log
operators, the MMD has been held up as a critical portion of CT log
trust. As pointed out elsewhere in this discussion, the standard
merge delay is an order of magnitude shorter than the MMD.
While I agree removing a log should not happen for trivial things,
Google has repeatedly stated blowing MMD is not trivial.
That is true. Chrome certainly has discretion to do whatever it likes -
as you note, the word "may" is deployed. But on the other hand, the
point of writing a policy is to make it clear what sort of behaviours
are considered unacceptable. If the MMD isn't the "maximum merge delay",
but is in fact the "merge delay after which the Chrome team will frown
at you a little bit and suggest you don't do that again", then the name
is a bit misleading. :-)
On Nov 8, 2016 1:31 AM, "Gervase Markham" <ge...@mozilla.org> wrote:
>
> Hi Ryan,
>
> On 08/11/16 01:25, Ryan Sleevi wrote:
> > In favor of removal:
> >
> > * Chrome has removed other logs for failing to comply with other
> > aspects of the policy. Chrome should treat all policy violations the
> > same, and thus remove Aviator.
>
> I don't think this quite captures it. An MMD blowout has been said in
> the past to be a serious thing; it is not required that one believes
> that "all policy violations should be treated the same" in order to
> believe that Aviator should be removed in this case.
I'm sorry, I don't understand the distinction you are making. Are you referring to the email Peter pointed out, in which I said that, and the multiple corrections from others pointing out that it isn't?
That is, the crux of the argument seems to be "You said something that was wrong, but you should now act as if it was right".
>
> > * Chrome should remove Aviator for no other reason than to see what
> > happens to the ecosystem when a popular log is removed.
>
> "For no other reason" also doesn't capture it; I don't think I would be
> arguing for this to happen to Aviator specifically if nothing had gone
> wrong with it. I would instead put this as:
>
> * It would be good for the ecosystem to see what happens when a popular
> log is removed; Aviator has violated the policy and so it's the obvious
> choice.
I was intentionally trying to avoid your value judgement that it would be good - you have neither articulated the benefits nor addressed the risks. However, it is also clear that even if you agree that Aviator should stay - for the reasons outlined below - that you're still suggesting it be removed just to "see what happens" - so that very much is "for no other reason".
If you feel otherwise, please help me understand, but I feel this certainly holds as a summary of the argument.
>
> > Against removal:
> >
> > * The policy, as presently written, allows for logs to be DoS'd by CAs
> > or by the public, by forcing a log to choose between blowing the MMD
> > or blowing the uptime requirement. Therefore, the policy is
> > unreasonable and should not be enforced for a single violation.
>
> Your summary here implies the following consequential logic:
>
> The policy is problematic and can be improved ->
> The policy should not be enforced in the way it was written at the time
> of the incident.
That is part of the argument being advanced, yes.
>
> I don't believe that A implies B in this way. It is possible to believe
> both that the policy can be improved, and that it should be enforced as
> written at the time of the incident.
Why don't you believe A implies B? Or, put differently, which of the impacts have you considered, and which are you ignoring or discarding?
This was an attempt at a summary, but the argument goes that removal IS impactful; it is not a light thing to be done for fun, because it affects the whole ecosystem, and therefore there should be a high bar for removal. As such, in the event of bad policies, good faith should be extended.
It's unclear if you're simply arguing for an absolutist interpretation or if you disagree with the statement that removing logs is impactful and not to be done lightly.
>
> > * Removing Aviator for a single violation would discourage other log
> > operators from operating logs, because it offers them no flexibility
> > to learn and improve implementations, instead requiring a perfect
> > implementation the first time, with a number of unknown risks.
>
> Again, this implies that the act of removing Aviator means you are
> enforcing a zero tolerance policy. I don't think that's the case. There
> can be infractions less serious than blowing an MMD.
Can you name examples? It would be useful to understand your perspective here, particularly why you view the MMD as serious enough to cross a line while viewing other things as less serious.
Regardless of whether you agree, it is something captured by the on-list replies, and more seriously articulated off-list, that the optics here are those of a zero-tolerance policy, because this is one of the few elements - unlike, say, split views - that can be fully induced by a remote attacker even in a perfectly implemented system with infinite scale. So if that isn't within the realm of discussion for leniency, what is?
I think that logs should choose to throttle write access rather than
blow the MMD. If someone's uptime dipped below 99% because they were
DoSed, I would view that as a less serious infraction.
Why do we have an MMD? So logs can't issue SCTs and then not incorporate
them for an arbitrary amount of time. How long do you wait before there
is a trust problem? That amount of time should be set as the MMD. If
we've set it too low, and 26 hours is not a trust problem, we should
change it. If we've set it right, then blowing the MMD is a trust problem.
To put it another way: if we keep Aviator, and we don't change the MMD
in the policy, then the MMD is not actually an _M_MD.
In that case, anyone might have the ability to take down logs more or
less immediately after their acceptance. Just gather all the valid
certs from all the other logs, and submit them all as quickly as
possible. Either the new log will accept them (and likely blow their
MMD), or they'll have to appear down for add-chain (and I expect it
would take more than 21 hours to submit the 40+ million certs, so then
they blow their availability budget).
With the policy as it is now, you only get to choose which way you'll go down.
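A rough back-of-the-envelope sketch of the arithmetic behind that scenario, using the thread's own figures (40M+ certs, a roughly 21-hour submission window, a 24-hour MMD); the signer's sequencing rate here is an assumed illustrative number, not a measured one:

```python
# Back-of-the-envelope check of the flooding scenario described above.
# Figures from the thread: ~40 million existing certs, roughly 21 hours
# of submission headroom, and a 24-hour Maximum Merge Delay.

CERTS = 40_000_000     # chains harvested from existing logs
WINDOW_HOURS = 21      # submission window mentioned in the thread
MMD_HOURS = 24         # stated Maximum Merge Delay

# Sustained add-chain rate needed to deliver the flood within the window.
needed_rate = CERTS / (WINDOW_HOURS * 3600)

# If the log's signer can only sequence, say, 200 entries/sec (an assumed
# figure), the backlog left over at the end of the flood is:
SIGNER_RATE = 200
backlog = CERTS - SIGNER_RATE * WINDOW_HOURS * 3600
hours_to_clear = backlog / SIGNER_RATE / 3600

print(f"required submission rate: {needed_rate:.0f}/sec")
print(f"residual backlog: {backlog:,} entries")
print(f"hours to clear backlog: {hours_to_clear:.1f} (MMD is {MMD_HOURS})")
```

With these assumed numbers the attacker needs to sustain about 530 submissions/sec, and the residual backlog takes well over 24 hours to clear, i.e. the MMD is blown even though nothing in the log is broken.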
On Tue, Nov 08, 2016 at 08:06:17AM -0800, Ryan Sleevi wrote:
> This was an attempt to be a summary, but the argument goes that removal IS
> impactful, it is not meant to be a light thing to be done for fun, because
> it affects the whole ecosystem, and therefore there should be a high bar
> for removal.
It is my belief that the entire design of CT is that there should be a *low*
bar for log removal. The single biggest problem with the CA ecosystem is
that there is a *very* high bar for distrust because of the impact on users
that such an act creates; replicating such an arrangement in CT logging
seems... less than optimal.
> As such, in the event of bad policies, good faith should be
> extended.
Except that it's Google's own log that was caught up in the bad policy
problem; extending good faith to "yourself" isn't quite the same as
extending it to a neutral third party.
For myself, I believe that removing logs *shouldn't* be impactful, and if it
is, that should be fixed. Thus, arguing against the removal because it
*would* be impactful is arguing from a false premise.
Agreed. Setting up a public log without priming it with all the cert
chains you can find that match your set of known roots is, well, asking
for it. And if it would be 40M+ certs, deal with it.
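A minimal sketch of what that priming looks like in practice: page through an existing log's entries with the RFC 6962 get-entries endpoint and resubmit each chain via add-chain. The endpoint paths are from RFC 6962; the source log URL and page size are placeholders, and the actual fetch/submit network calls are omitted:

```python
# Sketch of "priming" a new log with existing chains, per the suggestion
# above. Only the get-entries request paging is shown; fetching the
# responses and POSTing each chain to the new log's add-chain endpoint
# is left out.

SOURCE_LOG = "https://ct.example.net"  # hypothetical source log URL
BATCH = 1000                           # assumed per-request page size

def get_entries_urls(tree_size: int, batch: int = BATCH):
    """Yield the get-entries URLs covering entry indices [0, tree_size)."""
    start = 0
    while start < tree_size:
        end = min(start + batch, tree_size) - 1  # 'end' is inclusive
        yield f"{SOURCE_LOG}/ct/v1/get-entries?start={start}&end={end}"
        start = end + 1

# For a 40M-entry source log this is 40,000 page requests per source:
urls = list(get_entries_urls(40_000_000))
```

Note that logs typically cap the number of entries returned per get-entries call, so the real page size is whatever the source log serves, not necessarily what was requested.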
Adding "Google is more lenient on their own logs than
the competition" to the arguments against supporting truly independent logs
just makes things even harder.
I don't see how a log operator can avoid dealing with the number of
certificates signed by the set of known roots they have configured their
log to accept. I'm not familiar with the "US FPKI" case but would
suggest that log operators who identify that root(s) as problematic
avoid including it/them in their log. What am I missing?
Related, I should add that the idea of fiddling with temporarily
limiting the set of known roots, as a way of _technically_ complying
with a policy, seems bad to me. It does bring up an old question, which
I hope has been answered and that I've just missed: whether or not the
set of known roots is part of what is accepted by Chrome when a new log
is accepted. Since you were the one suggesting it, earlier in this
thread, I suppose it's _not_ viewed as part of the static metadata about
a log that mustn't change without reapplying for inclusion. Can you
clarify?
I'm talking about *perception* here; if the wider community merely
*perceives* Google as treating its own logs more leniently, that's all
that's needed for the damage to be done. It doesn't have to be deliberate,
it doesn't even have to be *true*; it only needs to be perceived as such,
and CT loses support and traction.
On the contrary, I would expect it to produce the opposite effect: visibly
equal enforcement of the policy against all infractions by all logs would
make for clear expectations, reduced uncertainty, and, above all,
transparency, which is, I believe, generally considered a good thing.
I think there are two key things here.
1) The only fair way to apply policy is to apply what existed at the
time of an incident. Maybe that is bad policy, but it is fair. One
of the results of reviewing the issue could be revisions to the policy
going forward, but that should not impact review of the issue at hand.
2) Google should hold itself to a higher standard. MMD is _maximum_
merge delay and it is _minimum_ acceptable uptime. Google has some of
the best SREs in the industry. If they can't keep a log well above
policy, then I don't think anyone can.
What it feels like is happening is that the rules are changing as we
go, which I think has the effect of discouraging people from running
logs.