Hi,
I recently did some checks with the CRL data from CCADB contained in
the AllCertificateRecordsCSVFormatv4 file and noted some
inconsistencies.
A CA's CRL information can either be a single URL value (column "Full CRL
Issued By This CA") or a JSON list (column "JSON Array of Partitioned CRLs").
* In the JSON list, it appears multiple different values are used to
indicate that the field is empty. It is a mix of empty strings (""),
JSON lists with an empty string ('[""]'), or JSON lists with a
double-double-quoted empty string ('[""""]'). In one particularly
peculiar case (DigiCert/Microsoft TLS G1 ECC CA 01), it is a list
containing a double-double-quoted zero-width space
('[""\\u200b""]').
* In the single URL column, there are two cases that are missing the
protocol, i.e., no http:// or https://:
www.acabogacia.org/crl/aca_arl.crl and
ssl.gpki.go.kr/certs/ssl-ca.cer
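
To illustrate, here is a minimal Python sketch of how a consumer of the
file might collapse all of the "empty" variants listed above into one
state. The function name is made up for this example, and it assumes the
values reach the script exactly as quoted above (after CSV parsing); it
is not how CCADB handles the field.

```python
import re

# Everything that carries no information in this column: JSON brackets,
# quotes, commas, whitespace (which includes U+00A0), a real zero-width
# space (U+200B), and the literal escape text "\u200b" with any number
# of backslashes.
_NOISE = re.compile(r'\\+u200b|[\[\]",\s\u200b]', re.IGNORECASE)


def is_effectively_empty(raw: str) -> bool:
    """Return True if the partitioned-CRL field contains no actual URL."""
    return _NOISE.sub("", raw) == ""


if __name__ == "__main__":
    for value in ['', '[""]', '[""""]', '[""\\u200b""]',
                  '["http://example.com/part1.crl"]']:
        print(repr(value), is_effectively_empty(value))
```

This deliberately does not try to parse the JSON, since some of the
variants above may not be valid JSON in the first place.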
I would suggest adding some basic sanity checks to the data. I don't
care which symbol is used to indicate an empty field for the JSON
column, but I think it should be consistent. Furthermore, I'd suggest
checking that URLs are URLs, and possibly also rejecting
Unicode/non-ASCII characters.
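
A rough sketch of what such a URL check could look like, in Python. The
function name and the choice to accept only http and https are my
assumptions for this example, not existing CCADB behavior.

```python
from urllib.parse import urlsplit


def crl_url_problems(value: str) -> list:
    """Return a list of problems found in a CRL URL (empty list if none)."""
    problems = []
    if not value.isascii():
        problems.append("non-ASCII characters")
    try:
        parts = urlsplit(value)
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs
        return problems + ["not parseable as a URL"]
    if parts.scheme not in ("http", "https"):
        problems.append("missing or unexpected scheme")
    elif not parts.netloc:
        problems.append("missing host")
    return problems


if __name__ == "__main__":
    for url in ["www.acabogacia.org/crl/aca_arl.crl",
                "ssl.gpki.go.kr/certs/ssl-ca.cer",
                "http://example.com/ca.crl"]:
        print(url, crl_url_problems(url))
```

Both of the scheme-less entries mentioned above would be flagged by a
check like this, because without "http://" the whole string is parsed
as a path rather than as a host.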
Note that there's a somewhat related issue that many of these CRLs are
not reliably accessible due to dubious blocking based on user-agents,
and that they are often served with incorrect MIME types. That's
recently been discussed on mdsp:
https://groups.google.com/a/mozilla.org/g/dev-security-policy/c/PZTEB49qsHY/m/8vm3-C3oFgAJ
--
Hanno Böck - Independent security researcher
https://itsec.hboeck.de/
https://badkeys.info/