SUMMARY:
On 16 October 2016 the Google 'Aviator' log exceeded its stated Maximum Merge Delay (MMD). The apparent merge delay rose over the course of three days from its usual level of about 1.5 hours to 26.2 hours. This serious operational issue arose because an unusually large number of certificates had been submitted to the log over the preceding day, and Aviator was not able to incorporate those submissions into new Signed Tree Heads (STHs) quickly enough.
Exceeding the stated Maximum Merge Delay is a violation of RFC 6962, section 3:
The log MUST incorporate a certificate in its Merkle Tree
within the Maximum Merge Delay period after the issuance of the SCT.
It is also a violation of the Chromium Certificate Transparency policy:
Log Operators must ... incorporate a certificate for which an SCT has been issued by the Log within the MMD.
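For concreteness, here is a minimal monitor-side sketch of that requirement (in Go; my illustration, with an assumed STH struct, not code from Aviator itself): an entry whose SCT was issued at time T is MMD-compliant if some STH covering the entry's index was signed no later than T plus the MMD (24 hours for Aviator).

package main

import (
	"fmt"
	"time"
)

// STH is a minimal stand-in for an RFC 6962 Signed Tree Head.
type STH struct {
	TreeSize  uint64
	Timestamp uint64 // milliseconds since the UNIX epoch, per RFC 6962
}

const mmd = 24 * time.Hour // Aviator's stated MMD

// mmdViolated reports whether the entry at index, whose SCT was issued at
// sctTimestamp (ms), missed the MMD given the STHs observed so far.
func mmdViolated(index, sctTimestamp uint64, sths []STH) bool {
	deadline := sctTimestamp + uint64(mmd/time.Millisecond)
	for _, sth := range sths {
		if sth.TreeSize > index && sth.Timestamp <= deadline {
			return false // incorporated in time
		}
	}
	return true
}

func main() {
	// The first STH listed in the IMPACT section below was signed more
	// than 24h after SCTs issued ~26h earlier, so this reports true.
	sths := []STH{{TreeSize: 35936627, Timestamp: 1476653510423}}
	fmt.Println(mmdViolated(35910127, 1476653510423-26*3600*1000, sths))
}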
IMPACT:
Five consecutive runs of Aviator's signer failed to incorporate recently submitted chains within the MMD. These are as follows (a quick check of the quoted percentages appears after the list):
Submitted chains in index range [35910127, 35919332) were not incorporated within MMD for STH signed at timestamp 1476653510423 for tree size 35936627 (34.7% of entries sequenced for that STH).
Submitted chains in index range [35936627, 35962877) were not incorporated within MMD for STH signed at timestamp 1476657100609 for tree size 35962877 (100% of entries sequenced for that STH).
Submitted chains in index range [35962877, 35982377) were not incorporated within MMD for STH signed at timestamp 1476663459670 for tree size 35982377 (100% of entries sequenced for that STH).
Submitted chains in index range [35982377, 36081668) were not incorporated within MMD for STH signed at timestamp 1476671114710 for tree size 36084877 (96.9% of entries sequenced for that STH).
Submitted chains in index range [36084877, 36113179) were not incorporated within MMD for STH signed at timestamp 1476675980176 for tree size 36255877 (16.6% of entries sequenced for that STH).
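The percentages can be reproduced from the data above: the number of entries sequenced for an STH is its tree size minus the previous STH's tree size, and the non-compliant fraction is the size of the late index range divided by that count. The previous tree size for the first STH is not stated, so this sketch (same assumptions as the earlier one) covers the last four rows:

package main

import "fmt"

func main() {
	rows := []struct {
		lo, hi     uint64 // half-open index range [lo, hi) of late entries
		prev, size uint64 // previous and new STH tree sizes
	}{
		{35936627, 35962877, 35936627, 35962877},
		{35962877, 35982377, 35962877, 35982377},
		{35982377, 36081668, 35982377, 36084877},
		{36084877, 36113179, 36084877, 36255877},
	}
	for _, r := range rows {
		late := r.hi - r.lo          // late entries for this STH
		sequenced := r.size - r.prev // entries sequenced for this STH
		fmt.Printf("%d/%d = %.1f%%\n", late, sequenced,
			100*float64(late)/float64(sequenced))
	}
	// Prints 100.0%, 100.0%, 96.9% and 16.6%, matching the report.
}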
ROOT CAUSE:
A large backlog of added certificates was generated over a 6-hour period during the early hours (PDT) of Sunday 16 October, caused by the (non-malicious) actions of some high-volume clients. Aviator's signer could not sequence the submitted certificate chains quickly enough to clear the backlog, a problem exacerbated by the fact that our protection against flooding did not activate when expected.
REMEDIATION AND PREVENTION:
During the impact period, Google's engineers worked to reduce Aviator's backlog of submitted certificates, in an attempt to avoid a policy violation. As part of that effort the /ct/v1/add-chain and /ct/v1/add-pre-chain endpoints were made temporarily unavailable; this was not announced.
/add-chain was unavailable from 2016-10-16 16:42 PDT to 2016-10-16 21:30 PDT (4.75hrs);
/add-pre-chain was unavailable from 2016-10-16 16:42 PDT to 2016-10-16 17:42 PDT (1hr).
Google are using the lessons learned from this incident to improve operational practices for the Pilot, Rocketeer, Submariner, Icarus and Skydiver logs; in particular the sequencing operation has been tuned, as have protections for the logs against flooding. Monitoring has been revised to provide earlier warning of similar events in the future.
--
I don't see any mention of Aviator's status with Chrome. Is this still being decided, or should no news be taken to mean that no change will be made?
Would Chrome reconsider the portion of the CT policy that says "SCT from a log qualified at the time of check is presented"? If there is
a known good STH up to a certain point, what is the risk of simply
accepting SCTs included in the log as of that STH? The advantage of
this is a cert with, say, Aviator and Izenpe embedded SCTs would still
be trusted without having to find a server that does SCT delivery via
TLS extension or OCSP stapling.
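For what it's worth, the check being proposed is mechanical: given a known-good STH, a client can fetch an inclusion proof for the certificate's leaf hash at that tree size and verify it against the STH's root hash, regardless of the log's current qualification status. A hedged sketch using the RFC 6962 section 4.5 endpoint (the log URL is Aviator's published prefix; the leaf hash in main is a placeholder, and verifying the returned audit path is left as noted):

package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
)

// getProofByHash calls the RFC 6962 section 4.5 endpoint and returns the
// audit path and leaf index. Verifying the path against the STH's root
// hash (RFC 6962 section 2.1.1) is a separate step, omitted here.
func getProofByHash(logURL string, leafHash []byte, treeSize uint64) ([][]byte, uint64, error) {
	q := url.Values{}
	q.Set("hash", base64.StdEncoding.EncodeToString(leafHash))
	q.Set("tree_size", fmt.Sprint(treeSize))
	resp, err := http.Get(logURL + "/ct/v1/get-proof-by-hash?" + q.Encode())
	if err != nil {
		return nil, 0, err
	}
	defer resp.Body.Close()
	var body struct {
		LeafIndex uint64   `json:"leaf_index"`
		AuditPath [][]byte `json:"audit_path"` // JSON base64 decodes into []byte
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return nil, 0, err
	}
	return body.AuditPath, body.LeafIndex, nil
}

func main() {
	// Placeholder leaf hash; a real client would compute the Merkle leaf
	// hash of the certificate entry it holds an SCT for.
	path, idx, err := getProofByHash("https://ct.googleapis.com/aviator",
		make([]byte, 32), 35936627)
	fmt.Println(idx, len(path), err)
}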
--
The important thing for me is that the policy as currently stated is impossible to comply with: if a source of new, valid certificates logs them as rapidly as it can, then either we have to get behind in sequencing, or we have to throttle the input (i.e. become unresponsive). Both of these actions are policy violations. I don't see any other choice.
I think we may be approaching this from the wrong end. Chrome published a policy but has never published the risk (e.g. threat model) that the policy is trying to mitigate.
Given the current log ecosystem has multiple operators with more than one log, we really should be talking about both the operator and the log in these discussions. Is the non-compliance with CT policy a log issue or a log operator issue? What is the purpose of disqualifying a log (even for downtime/availability) if the operator is allowed to immediately resubmit?
Assuming the problem was a fluke outside the operator's control (e.g. a DDoS attack on their DNS provider), is the answer "don't change anything"?
Certainly, I want to find solutions, but I'm not sure I agree with statements that it's "impossible" to comply with.
At first I was strongly in favor of zero tolerance, and Caesar's wife and all that, but after reading everything so far I changed my mind. Tweak the policy to allow "infrequent" MMD misses, record them, and move on.
--
I admit that this is a possibility that had not occurred to me. It doesn't really feel in keeping with the intent of CT, but I agree that the log technically could temporarily have a reduced or empty CA list in order to throttle input (note, though, that the required notification would have to be made [the policy doesn't state how timely that has to be, btw]). Is that what you would recommend, or do you have other ideas?
My preference would be for logs to be permitted under (demonstrable, at least for new certs!) high load to return a "come back later" response.
Require an incident report if the frequency gets too high.
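To make the suggestion concrete, a "come back later" would most naturally be an HTTP 503 with a Retry-After header on the add endpoints. A minimal sketch, assuming an illustrative backlog gauge and limit (neither is from any real log implementation), and noting that the policy would have to be amended before a log could do this without it counting as an outage:

package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

var pending atomic.Int64 // entries submitted but not yet sequenced

const maxPending = 500000 // illustrative backlog limit

func addChain(w http.ResponseWriter, r *http.Request) {
	if pending.Load() > maxPending {
		w.Header().Set("Retry-After", "300") // seconds
		http.Error(w, "log overloaded, retry later", http.StatusServiceUnavailable)
		return
	}
	pending.Add(1)
	// ... queue the chain for sequencing and return the SCT ...
	fmt.Fprintln(w, `{"sct": "..."}`) // placeholder response body
}

func main() {
	http.HandleFunc("/ct/v1/add-chain", addChain)
	http.ListenAndServe(":8080", nil)
}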
On 21 October 2016 at 23:13, Ryan Sleevi <rsl...@chromium.org> wrote:
> On Fri, Oct 21, 2016 at 2:51 PM, 'Ben Laurie' via Certificate Transparency Policy <ct-p...@chromium.org> wrote:
>> The important thing for me is that the policy as currently stated is impossible to comply with: if a source of new, valid certificates logs them as rapidly as it can, then either we have to get behind in sequencing, or we have to throttle the input (i.e. become unresponsive). Both of these actions are policy violations. I don't see any other choice.
> I don't think that's a fair statement, and I'm surprised to hear you state it. You can throttle input, which is effectively an 'outage', as provided for in https://www.chromium.org/Home/chromium-security/certificate-transparency/log-policy . So long as the MMD is maintained, and the overall outage does not regress past the 99% uptime, this is still compliant.
> I suspect you're more specifically thinking of "What happens when a single certificate is presented, and the option is either blow MMD or blow 99% uptime", which is a possible situation, but one would have hoped that the Log Operator took appropriate steps to avoid that situation, since a variety of options exist - up to and including no longer including CAs as accepted by the Log until the Log Operator is able to scale.

> I admit that this is a possibility that had not occurred to me. It doesn't really feel in keeping with the intent of CT,
> My preference would be for logs to be permitted under (demonstrable, at least for new certs!) high load to return a "come back later" response.
From Chrome's viewpoint, it is as simple as that. From the log developer's viewpoint, guidance would be helpful, but "throw hardware at it" is one possibility.
On Fri, Oct 21, 2016 at 4:17 PM, Ben Laurie <be...@google.com> wrote:
> I admit that this is a possibility that had not occurred to me. It doesn't really feel in keeping with the intent of CT, but I agree that the log technically could temporarily have a reduced or empty CA list in order to throttle input (note, though, that the required notification would have to be made [the policy doesn't state how timely that has to be, btw]). Is that what you would recommend, or do you have other ideas?

Right, I see there being two sources of load for a Log Operator, at least w/r/t MMD-impacting events: new submissions, and "backfill" submissions.

The issue is that if you decide to accept a given root, then you are, in effect, agreeing to log everything that root ever issued (unless, of course, your policy makes some statement otherwise, such as examining the validity dates, or accepting only N certs, etc). We haven't had any logs do so, but it's not impossible to think of them wanting to do so.

This also comes up with things like cross-signs. For example, if you agree to log IdenTrust or Symantec certs, then you also potentially agree to log the entirety of the US FPKI - which is quite a few certs! Now, if your log implementation checks revocation status, it could decide to reject such certs (not complying with policy), but then we get into that gray area of how much, or how willing, a log should be to log everything.

For "new" submissions - that is, new certificates being issued - it seems unlikely in general that a CA will cause serious pressure, even a CA like Let's Encrypt. If one does, then that's something that should be discussed, and is precisely the thing that is meaningful for the community to solve. But my gut, and my sense from discussions with log operators (including Google), is that even CAs such as Let's Encrypt do not place unmanageable load on reasonably developed logs, nor would they be anticipated to.

From what I understand from Ryan's post, it's most likely that this was a 'backfill' sort of operation. At that point, the submitter has full control over the QPS at which they log, and the upper bound on how many QPS may come in is, presuming a distributed enough submitter, equivalent to the totality of the WebPKI that the log operator accepts. That'd be huge!
My suggestion of 'removing a CA' was more so with respect to the 'new' submissions case, and an example of how you could mitigate some of the challenge if no other solution existed. For addressing the 'backfill' case, the answer would have to be some form of (D)DoS mitigation, which seems to fit within the reasonable bounds of mitigation, and is distinct from a 'total' outage. So even if a submitter tried to log N million certificates in 1 second, you could reject requests once you exceeded the QPS budget that ensures you meet your MMD budget.

A log operator could also seek to mitigate this issue with acceptance policies (as mentioned above), or by 'pre' backfilling the log contents, such that it started from a known state. Of course, as the PKI grows, I expect that the former will be more popular than the latter, but I suspect both fit within a spectrum of options and degrees, such that it's not either/or.
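The QPS budget described above can be estimated with back-of-the-envelope arithmetic: if the signer can sequence S entries per second and the MMD is M seconds, the backlog must never be allowed to exceed roughly S x M (less a safety margin), and sustained admission must stay below S. A sketch with an assumed sequencing rate (not Aviator's real figure):

package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		sequencerRate = 200.0          // entries/second the signer can sequence (assumed)
		mmd           = 24 * time.Hour // stated MMD
		safetyMargin  = 0.5            // keep half the MMD budget in reserve
	)
	// Largest backlog that can still be cleared within the MMD budget;
	// submissions beyond this point would get the "come back later".
	maxBacklog := sequencerRate * mmd.Seconds() * safetyMargin
	fmt.Printf("admit at most %.0f entries/s sustained; reject above a backlog of %.0f entries\n",
		sequencerRate, maxBacklog)
}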
> My preference would be for logs to be permitted under (demonstrable, at least for new certs!) high load to return a "come back later" response.

Given that we've seen CAs backdating certificates, how do you define 'new' certs? :)
> Require an incident report if the frequency gets too high.

Who reports? :) With respect to adding precerts, only CAs trusted by the log can do that. With respect to MMD measurements, only those who successfully obtain an SCT can quantify that, and it may be that multiple parties are logging without being aware that they are all getting "come back later" responses (e.g. 10 parties each seeing an 'outage'). Should they report every failure unconditionally?
--
On the one hand, I can see a very compelling argument to remove the Aviator log (or, more aptly, to treat it as no longer trusted within the next few weeks). This is consistent with a strict zero-tolerance policy, which, despite all the negatives that zero tolerance involves, has the benefit that it hopefully rises above even the slightest suggestion of treating Google logs differently.
I had been planning a more thorough write-up of these concerns, but then I remembered I provided much of this back in June - https://groups.google.com/a/chromium.org/d/msg/ct-policy/AH9JHYDljpU/f4I9vQLACwAJ
When we examine the nature of this specific failure - the failure to integrate the SCTs within the MMD - we have to think about the impact. Unlike an uptime issue, so long as the STH is eventually consistent, there's no ability to hide misissued certificates longer than the window of non-compliance (which, as I understand it, is 2.2 hours: the observed 26.2-hour delay less the 24-hour MMD). As the STH eventually integrated all of these SCTs, the first and foremost concern - the ability for the log to mask misissuance - seems to have been limited, and that window is significantly less than the window during which a stapled OCSP response could be used to assert 'goodness' (roughly 3.5 days).
That said, I wouldn't want to see a scenario in which logs routinely blow MMDs - that creates a system in which the reliability of the system becomes suspect, and increases the window of detection of misissuance.
I'm curious how the broader community feels, and, as I indicated above, I can see arguments for both. I think the ecosystem can support either action as well, so this is largely about understanding the threat model, community norms, and whether enforcing a zero tolerance policy provides the greatest benefit to the ecosystem, which it may very well do.
--