Efficiently matching FQDN sets

Craig Ringer

Jul 19, 2023, 5:38:48 PM
to crt.sh
Hi

As part of the recent work I posted about regarding Let's Encrypt quota monitoring, I found that it is difficult to efficiently build a complete set of fully qualified domain names (FQDNs) for a given certificate using the current libx509pq APIs.

The identities() function is useful, but not suitable for matching exact sets of certificate FQDNs for equality when comparing whether two certs are for the same set of domains.

If I assume that the cert commonName is also present in the alternateNames, I can use this logic to build an ordered array of FQDN subjects for a certificate:

  CROSS JOIN LATERAL (
    -- Sort by the lower-cased name so that two certificates with the
    -- same name set always yield identical arrays, and de-duplicate.
    SELECT array_agg(DISTINCT lower(altname) ORDER BY lower(altname) COLLATE "C")
    FROM x509_altNames(c.CERTIFICATE) can(altname)
  ) AS cans(fqdn_set)

... but it's not what you might call optimally efficient to do this in SQL with arrays. And I see that the identities() function also consults both x509_altNames_raw and x509_nameAttributes_raw so presumably a complete solution for all kinds of certs should do that too.
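For context, a complete query built on that fragment might look like the sketch below. The certificate table and CERTIFICATE/ID columns follow the crt.sh schema; the specific ID in the WHERE clause is just a placeholder:

  -- Sketch only: find other certificates covering exactly the same FQDN
  -- set as a given certificate, by comparing sorted lower-cased arrays.
  SELECT c2.ID
  FROM certificate c1
  CROSS JOIN LATERAL (
    SELECT array_agg(DISTINCT lower(a1.altname) ORDER BY lower(a1.altname) COLLATE "C")
    FROM x509_altNames(c1.CERTIFICATE) a1(altname)
  ) s1(fqdn_set)
  JOIN certificate c2 ON c2.ID <> c1.ID
  CROSS JOIN LATERAL (
    SELECT array_agg(DISTINCT lower(a2.altname) ORDER BY lower(a2.altname) COLLATE "C")
    FROM x509_altNames(c2.CERTIFICATE) a2(altname)
  ) s2(fqdn_set)
  WHERE c1.ID = 123456789  -- placeholder certificate ID
    AND s1.fqdn_set = s2.fqdn_set;

Without an index on the computed arrays, a query like this has to scan and parse every certificate, which is part of my motivation here.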

I'm wondering if you'd be potentially open to a PR adding a new libx509pq C function to build an ordered set of distinct names for a certificate. This could be exposed at the SQL level as functions that generate a digest of all certificate subjects (for fast lossy indexing and comparison), an array (for identity comparison, unnesting, etc.), or a string (for faster identity comparison than arrays).
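To make that concrete, the SQL-level surface could look something like the declarations below. Every function name here is hypothetical; none of them exist in libx509pq today:

  -- Hypothetical SQL-level functions, all backed by one C implementation:
  -- x509_subjectNames(cert bytea) RETURNS text[]
  --     sorted, distinct, lower-cased names (commonName plus altNames)
  -- x509_subjectNamesDigest(cert bytea) RETURNS bytea
  --     e.g. SHA-256 over the sorted names, for lossy indexing/comparison
  -- x509_subjectNamesText(cert bytea) RETURNS text
  --     the sorted names joined with a separator, for cheap equality tests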

Then the identities() function could use this new function instead of its current CPU-expensive PL/pgSQL implementation. While that implementation isn't usually called during queries, thanks to the available index, it is costly for GIN index maintenance operations.

And an index on the digest of the set of FQDNs could be added fairly cheaply to the certificate table. Or even - albeit at much greater disk space and index-maintenance CPU cost - a GiST or GIN index on the sorted array of FQDNs, for use with the indexable @> and <@ array-containment operators rather than just full-set equality.
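Concretely, assuming hypothetical x509_subjectNamesDigest() and x509_subjectNames() functions (neither exists in libx509pq today, and both would have to be marked IMMUTABLE to be usable in an expression index), the two options might be:

  -- Cheap: b-tree on a digest of the sorted name set, for full-set
  -- equality lookups only (function name is hypothetical):
  CREATE INDEX c_fqdn_digest ON certificate (x509_subjectNamesDigest(CERTIFICATE));

  -- Costlier: GIN on the sorted array, supporting @> and <@ containment
  -- queries as well as equality (function name is hypothetical):
  CREATE INDEX c_fqdn_set ON certificate USING gin (x509_subjectNames(CERTIFICATE));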

I can't promise to be able to prepare such a PR, but I'm happy to give it a go if you think it'd be useful and be open to accepting it.

r...@sectigo.com

Aug 18, 2023, 7:35:14 AM
to crt.sh
> I'm wondering if you'd be potentially open to a PR adding a new libx509pq C function to build an ordered set of distinct names for a certificate.

I'm open to adding any new libx509pq C function for which anyone expresses an interest.

> Then the identities() function could use this function instead of its current CPU-expensive PL/PgSQL implementation. While that implementation isn't usually called during queries because of the available index, it's going to be costly for those gin index maintenance operations.

Yes, it's definitely costly there.  For some time, crt.sh has been falling further and further behind in terms of ingesting new log entries, and it's the INSERTs into the "certificate" table that are by far the slowest component (and I presume that's mainly due to having to update the GIN index).

A few months ago I discovered that I could improve throughput by running the following every minute in a separate connection:

  SELECT gin_clean_pending_list(psui.INDEXRELID)
  FROM pg_stat_user_indexes psui
  WHERE psui.RELNAME LIKE 'certificate%'
    AND psui.INDEXRELNAME LIKE '%identities%';

The CPU is still the bottleneck though, so I'm currently working on enhancing the ct_monitor application to be able to use multiple writer goroutines (and hence multiple CPUs).

Unsurprisingly, I'm interested in considering ideas on how to optimise the identities() function.  :-)

> And an index on the digest of the set of FQDNs could be added fairly cheaply to the certificate table. Or even - albeit at much greater disk space and index maintenance CPU cost - a GiST or GIN index on the sorted array of FQDNs, for use with the indexable @> and <@ array-containment operators not just full-set equality.

To be honest, I can't really consider adding any new indexes whilst the log ingestion backlog problem remains.  The system needs to be able to handle its current workload first!