Duplicate certs (keys, actually)


David Champion

Nov 10, 2023, 11:59:41 AM
to crt.sh

Hello,

I’m running a daily query that scans all logged certs for cryptographic vulnerabilities (see here, if interested, for elaboration). Recently I discovered a lot of duplication among the certs I’m scanning: for example, I frequently find leaf-certificate/precertificate pairs logged within seconds of one another.

All that I am interested in is the public keys, so it seems I have been wasting a good deal of compute scanning duplicate keys.

Here is the current query, i.e. the one that returns duplicate keys:

SELECT regexp_replace(encode(CERTIFICATE, 'base64'), E'\\n', '', 'g'),
       ID 
  FROM certificate
 WHERE ID >= LAST_SCANNED_ID+1 AND ID <= LAST_SCANNED_ID+10000

And here is the proposed replacement, i.e. the one that omits duplicate keys:

WITH no_dups AS (
        SELECT min(ID) ID,
               x509_tbscert_strip_ct_ext(CERTIFICATE) STRIPPED_CERT,
               x509_serialNumber(CERTIFICATE) SERIAL,
               ISSUER_CA_ID
          FROM certificate
         WHERE ID BETWEEN LAST_SCANNED_ID+1 AND LAST_SCANNED_ID+10000
         GROUP BY STRIPPED_CERT, SERIAL, ISSUER_CA_ID
     )
    SELECT regexp_replace(encode(CERTIFICATE, 'base64'), E'\\n', '', 'g'),
           ID
      FROM certificate
     WHERE ID IN (SELECT ID FROM no_dups)

Can somebody please (a) vet my findings re: duplicate public keys, and (b) if they are correct, vet my proposed replacement query?

Above all, in optimizing the scan I do not want to miss ANY public keys signed by CAs.

Thanks!

Dave

r...@sectigo.com

Nov 16, 2023, 10:58:55 AM
to crt.sh
Hi Dave.

Your proposed approach only deduplicates public keys within the scope of the target range of 10,000 certificate IDs, rather than across the database as a whole. Also, there are plenty of cases where the same public key is used across multiple (pre)certs, so deduplicating only cert/precert pairs seems suboptimal compared to deduplicating based on the SPKI (SubjectPublicKeyInfo).

Below is my suggested approach. Note that, however you approach deduplication, the extra effort required is going to make the query run more slowly.

SELECT regexp_replace(encode(c_new.CERTIFICATE, 'base64'), E'\\n', '', 'g'),
       ID
  FROM certificate c_new
 WHERE c_new.ID BETWEEN LAST_SCANNED_ID+1 AND LAST_SCANNED_ID+10000
   AND NOT EXISTS (
         SELECT 1
           FROM certificate c_old
          WHERE digest(x509_publicKey(c_new.CERTIFICATE), 'sha256') = digest(x509_publicKey(c_old.CERTIFICATE), 'sha256')
            AND c_old.ID < c_new.ID
       );
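The keep-the-first-cert-per-SPKI-digest logic in that query can also be sketched client-side in Python. This is a hypothetical post-processing step, not part of the crt.sh query: it assumes the DER-encoded SubjectPublicKeyInfo of each cert has already been extracted (e.g. the bytes x509_publicKey would return on the server side).

```python
import hashlib

def dedupe_by_spki(certs):
    """Keep the first certificate seen for each distinct public key.

    `certs` is an iterable of (cert_id, spki_der) pairs, where spki_der
    is the DER-encoded SubjectPublicKeyInfo (a hypothetical input,
    pre-extracted from each certificate).
    """
    seen = set()     # SHA-256 fingerprints already encountered
    unique = []
    for cert_id, spki_der in certs:
        fingerprint = hashlib.sha256(spki_der).digest()
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append((cert_id, spki_der))
    return unique
```

Like the NOT EXISTS query, this retains the lowest-ID holder of each key and drops every later repeat, but only within whatever batch it is handed.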


David Champion

Nov 16, 2023, 2:27:32 PM
to crt.sh
Hi Rob, and thanks for your thoughtful response.

I am aware that the scope of de-duplication is confined to the range of the query, but there is duplication even within these ranges. In fact, my testing revealed that as the range increases, so too does the proportion of duplicate keys (e.g., 3.5% duplication at 1k rows, 7.5% at 10k rows, 10% at 50k rows, etc.).
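The kind of measurement behind those percentages can be sketched as follows (a hypothetical client-side check, assuming the raw SPKI bytes for a batch are in hand; the batch sizes and rates reported above are from the actual testing, not from this sketch):

```python
import hashlib

def duplication_rate(spkis):
    """Fraction of keys in a batch whose SPKI duplicates an earlier one."""
    if not spkis:
        return 0.0
    seen = set()
    dups = 0
    for spki in spkis:
        fp = hashlib.sha256(spki).digest()
        if fp in seen:
            dups += 1
        else:
            seen.add(fp)
    return dups / len(spkis)
```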

I was NOT aware of the x509_publicKey function, however, which appears to address precisely the problem I am trying to solve. I wonder if it is the hashing you believe will overwhelm the de-duplication savings, the query plan, a combination of the two, and/or something else? If the answer is primarily the query (and not the hashing), then I wonder further if my approach would not be more efficient (or if not, just plain wrong, in which case I would like to know that too before committing to it).

Thanks again,

Dave

David Champion

Nov 16, 2023, 2:38:31 PM
to crt.sh
Correction: If the answer is primarily hashing (and not the query), then I wonder...