Broader lessons from 1718771

Watson Ladd

Jul 16, 2021, 12:21:49 AM
to dev-secur...@mozilla.org
Dear all,

I'm writing because of
https://bugzilla.mozilla.org/show_bug.cgi?id=1718771. This bug is
interesting for the astonishing number of certificates revoked, the
difficulty in identifying them, and the initially much larger
estimated impact that was later revised downward through more precise
identification. It also raises the question of whether other CAs may
have similarly relied improperly on cached DCV information.

It seems that Sectigo had a hard time answering the question "what
were all the certificates that had DCV done by this method more than X
days before issuance?" This looks like an issue with the degree of
recordkeeping required by the BRs: I would like to think that for
every issuance there would be a perpetual record of the validation
method employed and when the validation was carried out, accurate
enough to always determine the circumstances of issuance.

Looking at the BRs, the closest requirement is in section 3.2.2.4,
which states that "CAs SHALL maintain a record of which domain
validation method, including relevant BR version number, they used to
validate every domain." That should have made the task of finding the
certificates easy, yet when the eventuality arose it apparently
required close examination of an operational database to figure out
what happened, because the maintained records weren't detailed enough.
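
For illustration, here is a minimal sketch of the kind of record that
would make the question a single query. The schema and every name in
it are hypothetical, purely to illustrate the idea, and not any CA's
actual design (Python with SQLite):

    import sqlite3

    # Hypothetical schema: one row per validation event, and each
    # issuance points at the exact validation event it relied on.
    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE validation_event (
        id           INTEGER PRIMARY KEY,
        domain       TEXT NOT NULL,
        method       TEXT NOT NULL,   -- which BR 3.2.2.4 sub-method
        br_version   TEXT NOT NULL,   -- BR version in force at the time
        validated_at TEXT NOT NULL    -- ISO 8601 timestamp
    );
    CREATE TABLE issuance (
        serial        TEXT NOT NULL,
        domain        TEXT NOT NULL,
        issued_at     TEXT NOT NULL,
        validation_id INTEGER NOT NULL REFERENCES validation_event(id)
    );
    """)

    # The question from the incident: all certificates where DCV for
    # some name used a given method more than X days before issuance.
    method, x_days = "3.2.2.4.6", 30   # illustrative values
    suspects = con.execute("""
        SELECT DISTINCT i.serial
        FROM issuance i
        JOIN validation_event v ON v.id = i.validation_id
        WHERE v.method = ?
          AND julianday(i.issued_at) - julianday(v.validated_at) > ?
    """, (method, x_days)).fetchall()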

More broadly, I'd like to see CAs think about their response to
misissuance: for any certificate, can you determine what was
requested, what happened to it, and why it was allowed through? And if
a problem is identified, can you find all of the certificates with the
same problem in minutes, not days? Perhaps more needs to be done to
ensure this level of playback is easily possible.

I'd like to thank Sectigo for a very thorough writeup of what happened
and why, and hope that it can make future issues easier to solve for
everyone.

Sincerely,
Watson Ladd

Matthias van de Meent

Jul 19, 2021, 9:28:08 AM
to Watson Ladd, dev-secur...@mozilla.org
On Fri, 16 Jul 2021 at 06:21, Watson Ladd <watso...@gmail.com> wrote:
> It seems that Sectigo had a hard time answering the question "what
> were all the certificates that had DCV done by this method more than X
> days before issuance?" This looks like an issue with the degree of
> recordkeeping required by the BRs: I would like to think that for
> every issuance there would be a perpetual record of the validation
> method employed and when the validation was carried out, accurate
> enough to always determine the circumstances of issuance.
>
> Looking at the BRs, the closest requirement is in section 3.2.2.4,
> which states that "CAs SHALL maintain a record of which domain
> validation method, including relevant BR version number, they used to
> validate every domain." That should have made the task of finding the
> certificates easy, yet when the eventuality arose it apparently
> required close examination of an operational database to figure out
> what happened, because the maintained records weren't detailed enough.

You probably overlooked the BR section that covers audit logs and the
requirement that specific actions be logged. The audit-logged events
for subscriber certificates include "All verification activities
stipulated in these Requirements and the CA's Certification Practice
Statement" (BR s5.4.1 (2)(2)), which should cover the re-use of
validation information (if any) for that subscriber certificate. These
logs should then be retained for at least two years after the
revocation or expiration of the subscriber certificate (BR s5.4.3
(2)).
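
As a minimal sketch of what such a record could look like when written
in a machine-searchable form (the field names here are my own
illustration; the BRs do not prescribe any particular format), in
Python:

    import json
    from datetime import datetime, timezone

    def log_verification_activity(log, serial, domain, method,
                                  reused_from=None):
        """Append one BR s5.4.1 (2)(2)-style verification record as a
        structured line; reuse of prior validation data is recorded
        explicitly rather than left implicit."""
        log.write(json.dumps({
            "event": "domain_validation",
            "logged_at": datetime.now(timezone.utc).isoformat(),
            "certificate_serial": serial,
            "domain": domain,
            "method": method,
            "reused_validation_id": reused_from,  # None if fresh DCV
        }) + "\n")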

If a CA can't find its re-use of validation information in their audit
logs (as described in BR s5.4), then I believe that BR s5.4 was not
correctly implemented by that CA.

Kind regards,

Matthias van de Meent.

Ryan Sleevi

Jul 19, 2021, 10:33:58 AM
to Matthias van de Meent, Watson Ladd, dev-secur...@mozilla.org
On Mon, Jul 19, 2021 at 9:28 AM 'Matthias van de Meent' via dev-secur...@mozilla.org <dev-secur...@mozilla.org> wrote:
> If a CA can't find its re-use of validation information in their audit
> logs (as described in BR s5.4), then I believe that BR s5.4 was not
> correctly implemented by that CA.

I wish it were that simple.

At least one major CA (representing a non-trivial amount of issuance) has stated that it maintains its audit logs as paper records. This is also why changes to validation methods and reuse have faced stiff opposition in the past: some CAs are concerned about the cost and time simply to determine who would be affected.

This is, sadly, the distinction between "logged" and "searchable".

We've equally seen a number of CA incidents where CAs maintain the data in databases or so-called "data lakes", but then find it difficult to search. Sectigo's bug is an example of the complexity of searching across disparate datasets.

At present, we largely rely on CAs to "Do the Right Thing": to prepare for the worst case and to design their systems to support investigations robustly and rapidly. In practice, however, we know that's often far from the case. The Detailed Control Reports specifically aim to provide greater insight into system design and how it's measured, to allow good practice to be developed and harmonized, but a number of CAs oppose them on cost grounds. Worse, certain large audit firms are concerned that such detailed reports would jeopardize their audit business, because of the reputational risk of revealing that their audits are of lower quality than both their competitors' and the overarching goal.

Tim Callan

Aug 5, 2021, 7:06:58 PM
to dev-secur...@mozilla.org, Ryan Sleevi, watso...@gmail.com, Matthias van de Meent

We’d like to offer our own perspective on this issue, having lived it firsthand, in case this perspective is valuable to the community.

It's important to understand that while the total number of affected certificates was on the order of 100,000, the actual number of affected domains was about 1% of that. It just happened that a large number of certificates used a few of these domain names. That matters because the exercise was to detect what ultimately turned out to be about 1,000 domains with a DCV problem among the vast number of domains for which we perform DCV every year, and then to isolate and eliminate the certificates containing one or more incorrectly validated domains, rather than to make a sweeping revocation that would have caught all the affected domains along with a large number of unaffected certificates.

This last point is important. It would have been fast and easy to create a query that caught 100% of this misissuance but would also have revoked an order of magnitude more certificates that were perfectly fine. What slowed the investigation down was examining all domains in our corpus of active certificates for the many possible ways that DCV could have occurred. The tangled skein we had to investigate included these factors:

  • The numbers of domains and certificates in question were both very large.
  • The same domain name could have undergone DCV multiple times in multiple ways for multiple different certificates.
  • Many certificates contain multiple (and in some cases hundreds of) SANs.
  • Isolating certificates for revocation had to occur on a certificate-by-certificate basis, not a domain-by-domain basis.
  • Frequent reissuance among hosting partners means the total number of certificates to be tracked and DCV events to be traced is that much higher and more complex.

The key idea here is that the first DCV result returned by an initial query may not be the only DCV event that actually occurred. We did have records of all these events, which is ultimately how we were able to execute this task, but as Ryan points out, this was one of those "data lake" situations, and we had to dig back into deeper records of our systems' behavior.
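
As a rough illustration of that reconciliation (the data structures
here are simplified and hypothetical, not our production systems): the
decision is per certificate, every DCV event for a domain has to be
considered rather than just the first one found, and one bad SAN is
enough to require revocation.

    from datetime import timedelta

    REUSE_WINDOW = timedelta(days=825)  # BR limit on reuse of DCV data

    def certs_to_revoke(certs, dcv_events):
        """certs: {serial: (issued_at, [san, ...])}
        dcv_events: {domain: [datetime of each compliant DCV success]}"""
        revoke = set()
        for serial, (issued_at, sans) in certs.items():
            for san in sans:
                events = dcv_events.get(san, [])
                # Any compliant event inside the window saves this SAN;
                # the first event a query returns is not necessarily it.
                if not any(issued_at - REUSE_WINDOW <= e <= issued_at
                           for e in events):
                    revoke.add(serial)
                    break  # one incorrectly validated SAN is enough
        return revoke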

It was, in fact, straightforward and reasonably fast to create that first list of suspect certificates for which we could not confirm that the “DCV reuse” had occurred within 825 days.  In another circumstance that might have been the end of the query and we would have had our results.  The problem here was that the reliance on DCV reuse was the very part of the system that was suspect, and so to put it under the magnifying glass we had to go to the very bottom of the data lake.

In other words, the fact that a particular certificate had DCV reuse marked incorrectly didn’t necessarily mean that DCV hadn’t occurred for that same domain in the specified time period, just that our primary record for that certificate didn’t indicate that this had happened.  In response to that problem we have a ticket in to create a new table that will log our successful BR-compliant DCV checks in a manner that will make this kind of search considerably faster and easier to perform in the future.
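
Roughly speaking (the names here are illustrative only; the actual
design is part of that ticket and not shown here), such a table makes
"what is the most recent compliant DCV for this domain?" a single
indexed lookup:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE dcv_success (
        domain       TEXT NOT NULL,
        method       TEXT NOT NULL,
        br_version   TEXT NOT NULL,
        validated_at TEXT NOT NULL   -- ISO 8601 timestamp
    );
    CREATE INDEX dcv_by_domain_time
        ON dcv_success (domain, validated_at DESC);
    """)

    latest = con.execute("""
        SELECT method, validated_at FROM dcv_success
        WHERE domain = ?
        ORDER BY validated_at DESC LIMIT 1
    """, ("example.com",)).fetchone()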

Likewise, if the exercise had been to look at a single certificate, or relatively few certificates, we could have found the answer very quickly, in the "minutes not days" that Watson asks about. However, expecting every large-volume, global CA to be able, at any time, to perform an expansive search of every active certificate against any single, unpredictable criterion that may be thrown its way, and to return a result in minutes, is a very difficult bar to meet under any and all circumstances. Another way to think about it: is the CA's database meant to be something where all conceivable questions must be answerable immediately, or is it more reasonable to expect that, for unexpected and complex questions involving large numbers of certificates, the CA can perform a data investigation and return with answers after "days not minutes"?
