Fix for inefficient pre-certificate de-duplication logic

83 views
Skip to first unread message

Craig Ringer

unread,
Jul 18, 2023, 7:10:16 PM7/18/23
to crt.sh
Hi

My organisation has been (trying to) use queries derived from crt.sh code to try to compute how many certs we've issued against a particular domain so we can alert when we approach Let's Encrypt rate limits.

I've found a couple of issues with the certificate de-duplication logic in our query and found that the same issues are present in the crt.sh code at https://github.com/crtsh/certwatch_db/blob/072a681239b508c18a0add9698026dd921b94f8c/fnc/web_apis.fnc#L3829 so I want to share my fixes with you.

This code:

AND NOT EXISTS (  
    SELECT 1                                                                          
    FROM certificate c2                                                                              
    WHERE x509_serialNumber(c2.CERTIFICATE) = x509_serialNumber(cai.CERTIFICATE)  
        AND c2.ISSUER_CA_ID = cai.ISSUER_CA_ID
        AND c2.ID < cai.CERTIFICATE_ID                                            
        AND x509_tbscert_strip_ct_ext(c2.CERTIFICATE) = x509_tbscert_strip_ct_ext(cai.CERTIFICATE)
    LIMIT 1
)

appears to try to detect pre-certificates by finding certs with the same serial and extension-stripped payload. But this method is very inefficient because of the use of a correlated subquery and a CPU-expensive comparison of a large blob of bytea data that must be  extracted from each candidate certificate for comparison.

The correlated subquery has a LIMIT clause which prevents postgres from optimising it into a left-anti-join.

It can be expressed MUCH more efficiently as a left anti-join like

LEFT JOIN certificate c2 ON (
   x509_serialNumber(c2.CERTIFICATE) = x509_serialNumber(cai.CERTIFICATE)'
  AND c2.ISSUER_CA_ID = cai.ISSUER_CA_ID'
AND c2.ID < cai.CERTIFICATE_ID'
AND x509_tbscert_strip_ct_ext(c2.CERTIFICATE) = x509_tbscert_strip_ct_ext
)
-- Only retain rows of c1 where no corresponding c2 was found
WHERE c2.id IS NULL

but for this case there's an even easier and more efficient way that avoids the need to join entirely. The query can filter-out pre-certificates with the following SQL term that checks for the pre-certificate x.509 critical extension that "poision"s them for clients, and discards the certs if the extension is present:

AND NOT x509_hasExtension(CERTIFICATE, '1.3.6.1.4.1.11129.2.4.3', true)

This can be made cleaner and more readable (IMO) if expressed as

LEFT JOIN LATERAL x509_hasExtension(CERTIFICATE, '1.3.6.1.4.1.11129.2.4.3', true) AS precert_status(is_precertificate)
...
other sql here
...
WHERE
...
AND NOT is_precertificate

so that you can also easily surface an "is_precertificate" flag in the SELECT output clause for other analytical needs.

It'd also be helpful if a comment in the code there explicitly stated that the dedup in question is to exclude pre-certs, it's _not_ to detect and exclude cert renewals, as I was confused by that for a while. I now understand that at least for Lets Encrypt there are never actually any true renewals, every cert is a new cert issued for the same FQDN set (and even if there were renewals they'd have a different serial).


Craig Ringer

unread,
Jul 18, 2023, 9:23:31 PM7/18/23
to crt.sh
Whoops, that should've been CROSS JOIN LATERAL not LEFT JOIN LATERAL. Sorry, editing error when adapting the query snippet.

If just using it for filtering-out, an

  INNER JOIN LATERAL
        x509_hasExtension(CERTIFICATE, '1.3.6.1.4.1.11129.2.4.3', true) precert_status(is_precertificate)  
        ON (NOT precert_status.is_precertificate)

would be sufficient instead.

Craig Ringer

unread,
Jul 19, 2023, 12:29:23 AM7/19/23
to crt.sh
I think I see why it's done this way now.

The final certs are delayed several days behind pre-certs in the cert transparency log.

So the idea of the query was probably to prefer a final cert if one could be found, otherwise match the pre-cert.

This can still be done much more efficiently using the left anti-join I showed above, rather than a correlated subquery.

Craig Ringer

unread,
Jul 19, 2023, 5:59:46 PM7/19/23
to crt.sh
May I suggest an is_x509PreCertificate function in libx509pq, defined as

CREATE OR REPLACE FUNCTION is_x509Precertificate(cert bytea)
RETURNS boolean
LANGUAGE sql
IMMUTABLE
RETURNS NULL ON NULL INPUT
AS $$
SELECT x509_hasExtension($1, '1.3.6.1.4.1.11129.2.4.3', true)
$$;

COMMENT ON FUNCTION is_x509Precertificate(bytea) IS $$
Is this a Certificate Transparency pre-certificate? Check for the X.509
pre-certificate "poision" critical extension. Each cert in the certificate
transparently log is recorded twice, first as a pre-cert then often much
later as a final cert, so this can be used to filter out these duplicates.
See the certificate transparency
spec for details.$$;

then indexing it as

CREATE INDEX ON certificate ((is_x509Precertificate(CERTIFICATE));

?

It's unlikely to be worth using it as a partial index predicate like

CREATE INDEX ON certificate (x509_serialnumber(certificate)) WHERE (is_x509Precertificate(CERTIFICATE)

because it'll have almost exactly 50% cardinality, so the index will only be roughly 1/2 the size, which isn't worth maintaining an extra index for.

But it'd be really handy to make common de-duplicate queries much (much) faster than the join method, especially if postgres can use a bitmap index scan to combine an is_precertificate index with other indexes. It'd come at the cost of only returning the pre-certificates not the final certificates from such queries, but for many applications that doesn't matter.

r...@sectigo.com

unread,
Jul 20, 2023, 9:54:31 AM7/20/23
to crt.sh
Hi Craig.

The idea of the query is to select either a certificate or a corresponding precertificate, but never both.  Sometimes crt.sh will be aware of a precertificate but won't be aware of the corresponding certificate (usually because there's no requirement to log the corresponding certificate); sometimes crt.sh will be aware of a certificate for which a pre-certificate was not issued; and sometimes crt.sh will be aware of both a certificate and its corresponding precertificate.  Often a precertificate will have a lower certificate ID than the corresponding certificate, but this is not always the case.  Selecting the record with the lowest certificate ID makes the query a bit more deterministic.

> It can be expressed MUCH more efficiently as a left anti-join

I'll take a look at this.  Thanks!

r...@sectigo.com

unread,
Jul 20, 2023, 9:59:57 AM7/20/23
to crt.sh
> It'd come at the cost of only returning the pre-certificates not the final certificates from such queries, but for many applications that doesn't matter.

Why would this not matter?  RFC6962 provides 3 ways to achieve CT compliance, only one of which requires a precertificate.  It should not be assumed that every certificate must have a corresponding precertificate.

r...@sectigo.com

unread,
Jul 20, 2023, 11:50:36 AM7/20/23
to crt.sh
> The correlated subquery has a LIMIT clause which prevents postgres from optimising it into a left-anti-join.

I see the same execution plan for the original query as I see for a version that's adapted according to your suggestion.

Craig Ringer

unread,
Jul 23, 2023, 7:31:43 PM7/23/23
to crt.sh
> The correlated subquery has a LIMIT clause which prevents postgres from optimising it into a left-anti-join.

I see the same execution plan for the original query as I see for a version that's adapted according to your suggestion.


Interesting that you see the same plan. Odd. Maybe it's sensitive to the stats, so my inputs tend to produce a different plan, or maybe the further processing of the results to do FQDN-set based de-dup for Lets Encrypt renewals is tipping the plan choice for the inner query. Or it might be due to the size of the data subset I'm querying. I'll dig up and share a plan comparison in a bit.

I was seeing a very (very) slow nested correlated subquery that was hitting all the partitions for every execution of the join. When looping over 10,000+ certs, this usually resulted in a timeout of the query.

> The idea of the query is to select either a certificate or a corresponding precertificate, but never both.  Sometimes crt.sh will be aware of a precertificate but won't be aware of the corresponding certificate (usually because there's no requirement to log the corresponding certificate); sometimes crt.sh will be aware of a certificate for which a pre-certificate was not issued; and sometimes crt.sh will be aware of both a certificate and its corresponding precertificate.  Often a precertificate will have a lower certificate ID than the corresponding certificate, but this is not always the case.  Selecting the record with the lowest certificate ID makes the query a bit more deterministic.
>
> > It'd come at the cost of only returning the pre-certificates not the final certificates from such queries, but for many applications that doesn't matter.
>
> Why would this not matter?  RFC6962 provides 3 ways to achieve CT compliance, only one of which requires a precertificate.  It should not be assumed that every certificate must have a corresponding precertificate.

Because I'm ignorant of the full scope of the problem it was trying to solve, and didn't properly connect that logic to the relevant parts of the spec. I've been focused on Let's Encrypt, for which this assumption appears to be correct at least based on the data available in the public cert transparency log entries I examined and queried. I found no rows at least for my org's domains in which there was a final cert without a corresponding pre-cert, and the final certs lagged days behind the pre-certs. Thankyou for explaining that this isn't the case for all issuers and may not necessarily always be the case for Let's Encrypt either.

Comments associating https://github.com/crtsh/certwatch_db/blob/5f97893a7a42470726943bf32484723e430760c4/fnc/web_apis.fnc#L4156 and https://github.com/crtsh/certwatch_db/blob/5f97893a7a42470726943bf32484723e430760c4/fnc/web_apis.fnc#L3839 might help future readers understand that together these form a left-anti-join to filter out duplicate certs, and that it may return either a final or pre-cert depending on details of the order in which the certs appear in the certificate transparency log.

Speaking of which, I'm a bit confused as to why:

    x509_tbscert_strip_ct_ext(c2.CERTIFICATE) = x509_tbscert_strip_ct_ext(cai.CERTIFICATE)

is required given that the join tests certificate serial number equality. But I'm still a bit of an X.509 and CT/ACME ignoramus.

r...@sectigo.com

unread,
Aug 18, 2023, 7:17:59 AM8/18/23
to crt.sh
> Speaking of which, I'm a bit confused as to why:
>    x509_tbscert_strip_ct_ext(c2.CERTIFICATE) = x509_tbscert_strip_ct_ext(cai.CERTIFICATE)
> is required given that the join tests certificate serial number equality. But I'm still a bit of an X.509 and CT/ACME ignoramus.

Sharing the same serial number is not the only requirement for a certificate and a precertificate to be considered a valid pair.

See the "tbs_certificate" definition in https://datatracker.ietf.org/doc/html/rfc6962#section-3.2 for some more details.
Reply all
Reply to author
Forward
0 new messages