Mozilla delayed revocation incident expectations

Mike Shaver

Jun 26, 2024, 1:33:49 PM
to MDSP
I have two questions about the Mozilla expectations for CAs who have delayed revocation beyond the BR limits.


My first question concerns the requirements for detail about Subscriber requests for delayed revocation, which are currently:
  • The decision and rationale for delaying revocation will be disclosed in the form of a preliminary incident report immediately; preferably before the BR-mandated revocation deadline. The rationale must include detailed and substantiated explanations for why the situation is exceptional. Responses similar to “we do not deem this non-compliant certificate to be a security risk” are not acceptable. When revocation is delayed at the request of specific Subscribers, the rationale must be provided on a per-Subscriber basis.
Question: should this not require per-certificate detail? A subscriber may have certificates deployed in some systems that can't tolerate prompt revocation, but that shouldn't prevent revocation of other certificates on systems of lesser criticality. The report could group certificates that are all part of a given critical system, for ease and brevity, but I think we shouldn't give the impression that the Subscriber is the scope of the decision to delay revocation, rather than the individual certificates.

My second question concerns the expectation that the CA work with their auditor and the root programs:
  • Your CA will work with your auditor (and supervisory body, as appropriate) and the Root Store(s) that your CA participates in to ensure your analysis of the risk and plan of remediation is acceptable.
Question: is the expectation that the CA consults with the auditor and root programs as part of making the decision to delay, or is it instead meant to be a post-hoc consultation to get feedback on the risk analysis and remediation plan that the CA established independently for the incident? If it's the former, then I think that language should be changed to make that explicit. If it's the latter, then I think it should be explicit that the analysis and feedback be posted in the incident bug, so that other CAs and root community members can learn from the process as well.

If there is agreement that these are worth considering, I'll open appropriate GitHub issues to discuss the details and make an actual decision.

Thanks,

Mike

Zacharias Björngren

Jun 26, 2024, 2:40:33 PM
to Mike Shaver, MDSP
I’ve been thinking a lot about the first question, ever since I noticed how many of the certificates I sampled from the certificate lists in delayed revocation incidents included clear markers of not being a production environment: dev, test, preprod, uat. It felt like a majority of them did, which makes sense if using the webPKI for such environments is commonplace. I would have thought that more organizations used a private CA for these scenarios, but apparently not.
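
A minimal sketch of how that kind of sampling could be made repeatable, assuming the certificate list from an incident is available as plain hostnames. The marker list is just the four labels mentioned above, and the function names are hypothetical, not taken from any incident report or tool.

```python
import re

# Assumption: these four markers are the ones named in the message above;
# a real survey would probably need a longer, curated list.
NON_PROD_MARKERS = ("dev", "test", "preprod", "uat")

# Match a marker only as a whole token inside a DNS name, optionally followed
# by digits, so "dev.example.com" and "api-uat.example.com" match but
# "device.example.com" and "latest.example.com" do not.
_MARKER_RE = re.compile(
    r"(?:^|[.\-_])(?:" + "|".join(NON_PROD_MARKERS) + r")\d*(?=$|[.\-_])",
    re.IGNORECASE,
)


def looks_non_production(hostname: str) -> bool:
    """Return True if the hostname carries a non-production marker."""
    return _MARKER_RE.search(hostname) is not None


def non_production_share(hostnames) -> float:
    """Fraction of hostnames that carry a non-production marker."""
    names = list(hostnames)
    if not names:
        return 0.0
    flagged = sum(1 for name in names if looks_non_production(name))
    return flagged / len(names)
```

A name-based heuristic like this only gives a rough lower bound: a host without a marker may still be non-production, and a marker does not prove that a host is unimportant.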

I think that requiring the reasons for delayed revocation (the risk of significant harm or internet outages) to be more closely tied to the systems is a good idea. But at the same time I think that it’s important to consider priorities and how they might be reversed in a security scenario compared to misissuance.

If there is a security issue, then delayed revocation of certificates protecting critical infrastructure could risk greater harm than just revoking. To incentivize prompt action for the production environments, delaying revocation for non-production environments could give organizations a chance to focus without having to worry about non-critical hosts.

In a case of misissuance, it could be better to allow an organization to delay revocation for their production environments so that they can first verify, using their test environments, that their reissued certificates work.

I also reacted to how many of the certificates were for domains that did not appear to have public DNS entries. And I’ve noticed some reluctance to reveal more detailed information about why a system is critical, with references to security. I think that if the claim being made is "yes, this host is part of critical infrastructure; no, it is not publicly available; and no, we won't tell you what it does," then the only proper answer is to hand them instructions to set up their own private CA. But that is a separate issue, I guess.

Br
Zacharias



--
You received this message because you are subscribed to the Google Groups "dev-secur...@mozilla.org" group.
To unsubscribe from this group and stop receiving emails from it, send an email to dev-security-po...@mozilla.org.
To view this discussion on the web visit https://groups.google.com/a/mozilla.org/d/msgid/dev-security-policy/CADQzZqt8w5Em8nZc78ML4yQgcPNhBzAoUZjW71vzaa5_JvYb2g%40mail.gmail.com.

Mike Shaver

Jun 26, 2024, 3:00:58 PM
to Zacharias Björngren, MDSP
On Wed, Jun 26, 2024 at 2:40 PM Zacharias Björngren <zacharias...@gmail.com> wrote:
> I’ve been thinking a lot about the first question, ever since I noticed how many of the certificates I sampled from the certificate lists in delayed revocation incidents included clear markers of not being a production environment: dev, test, preprod, uat. It felt like a majority of them did, which makes sense if using the webPKI for such environments is commonplace. I would have thought that more organizations used a private CA for these scenarios, but apparently not.

There is no open source software package that rolls up all the CA activity and the policy management for root registration on browsers and servers, etc. I think such a package would spur a lot more use of private CAs.

> If there is a security issue, then delayed revocation of certificates protecting critical infrastructure could risk greater harm than just revoking. To incentivize prompt action for the production environments, delaying revocation for non-production environments could give organizations a chance to focus without having to worry about non-critical hosts.
>
> In a case of misissuance, it could be better to allow an organization to delay revocation for their production environments so that they can first verify, using their test environments, that their reissued certificates work.

Non-production services aren’t critical, and should just be trivially revoked until they can be replaced. (Or the non-production services should be configured to ignore revocations in some cases, I guess.)

Reasoning about attention splitting and bandwidth requires more knowledge of how the organization is structured and their capabilities than I am likely to have. 

> I also reacted to how many of the certificates were for domains that did not appear to have public DNS entries. And I’ve noticed some reluctance to reveal more detailed information about why a system is critical, with references to security. I think that if the claim being made is "yes, this host is part of critical infrastructure; no, it is not publicly available; and no, we won't tell you what it does," then the only proper answer is to hand them instructions to set up their own private CA. But that is a separate issue, I guess.

I think that “get off the web PKI” is indeed the right remediation for a system that cannot tolerate web PKI revocation policies, and unlike “educate customers” has a somewhat observable outcome (revocation of the web PKI certs in question). In a nice alignment of interests, some CAs also offer products and services related to building and operating a private PKI.

(“Pinning” should only be allowed to delay revocation once for a given organization. Replace it with validation of a key the operator controls. A CA that knowingly issues a cert that will be pinned in a too-big-to-break context is bumping up against some good-faith issues IMO.)

Mike

Zacharias Björngren

Jun 26, 2024, 4:15:22 PM
to Mike Shaver, MDSP
“Non-production services aren’t critical, and should just be trivially revoked until they can be replaced. (Or the non-production services should be configured to ignore revocations in some cases, I guess.)”

While I see your point, I also think that non-production services could still be part of the greater system, where the purpose of those non-production environments is to enable testing and verification of changes with much lower stakes. But then again, I’m confused about how replacing a certificate could involve the amount of effort and bureaucracy that is being claimed in all these delayed revocation incidents.

Ben Wilson

Jun 26, 2024, 5:12:01 PM
to Zacharias Björngren, Mike Shaver, MDSP
I think it would be good to collect and analyze use-case environments from subscribers who have requested delayed revocation, if anyone has bandwidth.
Thanks,
Ben


Mike Shaver

Jun 26, 2024, 5:19:45 PM
to Ben Wilson, MDSP, Zacharias Björngren
On Wed, Jun 26, 2024 at 5:11 PM Ben Wilson <bwi...@mozilla.com> wrote:
> I think it would be good to collect and analyze use-case environments from subscribers who have requested delayed revocation, if anyone has bandwidth.

I’m not sure that I have bandwidth, but I *am* sure that I don’t know how to go about effectively contacting subscribers to find out the details. There have been a *lot* of such subscribers lately, even if only counting the ones who had their request granted.

Do you propose that CAs provide contact information for subscribers who made that request?

Sadly, that level of detail is generally not provided in the CA incident report, although my reading of the Mozilla incident policy says that it should be…

Mike

Tyrel

Jun 26, 2024, 6:54:01 PM
to dev-secur...@mozilla.org
While I agree that having detailed use-case environments that result in subscribers requesting delayed revocation might be fascinating to read, I think it will be in practice very difficult to gather given the lack of specificity that seems to be publicly provided:


None of these (or others) have the level of detail needed to really understand what the use case is, or why that use case is critical (not critical to revenue generation activities of the subscriber, but to society/the webPKI community). Very few report the specific details of why certificates cannot be replaced within the 24-hour/5-day timeline. Generalities such as "we need to get permission from a government agency" or "we need to provide the certificate to all relying parties on the other end" or "we do not have full-time IT and need time to get our MSP to do it" seem common. But when pushed for details (such as which specific regulations or which regulators), all we see is... silence. A rare example of a more specific reason is: "we are contractually required to notify those who depend on it 90 days in advance of replacement," but that seems like an obvious case where the CA should then refuse to issue new certs to the subscriber.

Having manually inspected the entire certificate list for several of the recent Entrust bugs, my own assessment is that 0.0% of them meet the critical bar. Sure, there are many examples of a certificate being critical for the subscriber's operations (e.g. Delta ticketing machines), and examples that fall under a very general definition of "critical infrastructure" from CISA or others, but none that are "something really bad for society in general or the webPKI community in particular will happen if these are revoked on time."

In fact, Tim Callan's comment over in pub...@ccaddb.org rings very true to me: "...but when the CA simply offers that this revocation will occur in a specified timeframe, the Subscribers miraculously get the new certificates deployed.  Every time.  We see this miracle occur over and over again."

Miracles indeed. Especially as, as far as I can tell, most government regulations are not actually in conflict with fast revocation timelines, and in fact for cases where the use is "revenue generation critical," trade bodies seem to take great pains to make clear certs can be revoked in as little as 24 hr, and businesses should plan accordingly. For example, page 9 of: https://www.openbanking.exchange/wp-content/uploads/eidas-qualified-certificates-faq.pdf .

So while I agree that some understanding of what is being considered critical will be fascinating to read, I don't think it is necessary to determine a path forward. My specific suggestion is to remove entirely the Mozilla carve-out for revocation delays. There are plenty of CAs who get things revoked on time without society crashing down. If there is then a truly exceptional case (like "if we didn't delay, the reactors would melt down, making a continent a wasteland"), a CA can still choose to delay, and put their fate in the hands of the community.

Tyrel

Mike Shaver

Jun 26, 2024, 7:39:57 PM
to Tyrel, dev-secur...@mozilla.org
On Wed, Jun 26, 2024 at 6:54 PM Tyrel <tmcque...@gmail.com> wrote:
> While I agree that having detailed use-case environments that result in subscribers requesting delayed revocation might be fascinating to read, I think it will be in practice very difficult to gather given the lack of specificity that seems to be publicly provided:
>
> None of these (or others) have the level of detail needed to really understand what the use-case is, or why that use case is critical (not critical to revenue generation activities of the subscriber, but to society/the webPKI community).

Yes, these incident reports are not, IMO, in compliance with Mozilla’s policies.

While the existence of the delayed revocation protocol might make delayed revocation seem more acceptable, I think that it currently serves a useful purpose (or could, if complied with in good faith) in helping other CAs identify scenarios that they should prepare for themselves, and can shine light on cases where the CA is misaligned with the purpose and requirements of the Mozilla root program. Imagine how much harder people would have to fight to get useful information about a failure to revoke on time, if there wasn’t that set of Mozilla expectations to start from…

Mike

Tyrel

Jun 27, 2024, 11:54:48 AM
to dev-secur...@mozilla.org
Mike,
 
> While the existence of the delayed revocation protocol might make delayed revocation seem more acceptable, I think that it currently serves a useful purpose (or could, if complied with in good faith) in helping other CAs identify scenarios that they should prepare for themselves, and can shine light on cases where the CA is misaligned with the purpose and requirements of the Mozilla root program. Imagine how much harder people would have to fight to get useful information about a failure to revoke on time, if there wasn’t that set of Mozilla expectations to start from…

I'm not sure I fully agree. I absolutely agree most delayed revocation reports are not up to how I interpret Mozilla's current expectations, and certainly fully see the merits with respect to it being useful in shining a light on which CAs are not aligned with the purpose and spirit of being a webPKI CA before it gets to the point of malicious misissuance (as we have seen in spades in a recent example). Yet I think the large number of delayed revocation events that explicitly cite the Mozilla policy first and foremost for why there is a delay is evidence that this policy is having a detrimental effect on the webPKI. 

If this language did not exist, the pressure on CAs would be to "not delay revocation at all, or, if we do, risk being distrusted [or have our certificate lifetimes capped]." And if they decided that it truly was an exceptional case, then to avoid that distrust action (or lifetime capping or whatever) they would be strongly incentivized to provide as much detail as possible on why these specific certificates were not revoked on time. Instead, we have a policy that, due to the wide definition of "critical infrastructure," is taken as carte blanche permission to delay, which clearly seems to be the opinion of some CAs.

Put another way: as you so eloquently described above, ultimately it is up to each subscriber -- not the CA -- to weigh the risks and benefits of how they are acquiring and deploying certificates, and to act accordingly. That only works if the subscriber is exposed to the actual "risks" associated with webPKI certificates, which in turn comes from CAs, as custodians of public trust, actually enforcing the rules. The current Mozilla delayed revocation policy, based on the evidence in the many delayed revocation reports, is providing way too much leeway to CAs who are, in turn, not applying appropriate back pressure on subscribers to assess risks and plan accordingly. 

If the objective is to make sure other CAs and the community are learning as much as possible from events when they happen, it might instead be worth incorporating language requiring that incident reports [of any variety] include per-certificate and per-subscriber breakdowns of root causes / effects. This would be beneficial not just in revocation cases, but in others as well.