Effective download of certificate transparency logs for research purposes

Nate

Jul 22, 2021, 6:31:27 AM
to certificate-transparency
Hi,

We are working on research about certificates, and as part of it we need to download all the logs.
What would be the most efficient way to do so? We can download from the logs in small chunks, but that seems inefficient.

Is there any place from which we can take a DB dump or ZIP of the logs?
Thanks!

Pavel Kalinnikov

Aug 6, 2021, 7:57:45 AM
to certificate-...@googlegroups.com
Hi Nate,

On the Google CT team's side, we have all the CT certs ingested internally, but at the moment we don't have an easily accessible dump for you. The available option is to call the get-entries endpoint of the logs (and mirrors) repeatedly, possibly in parallel to make it quicker. CT monitors, who download all the logs, might also have this data in some form; try reaching out to them.
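
For illustration, the repeated, parallel get-entries approach could be sketched roughly as follows. This assumes a log that exposes the RFC 6962 endpoints `/ct/v1/get-sth` and `/ct/v1/get-entries`, and that the log may return fewer entries than requested. The helper names (`chunk_ranges`, `fetch_range`, `download_log`), the chunk size, and the worker count are hypothetical choices, not anything the logs prescribe.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def chunk_ranges(tree_size, chunk_size):
    """Split [0, tree_size) into inclusive (start, end) index pairs."""
    return [(start, min(start + chunk_size, tree_size) - 1)
            for start in range(0, tree_size, chunk_size)]


def http_get_json(url):
    """Default fetcher: plain HTTP GET, JSON-decoded."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def fetch_range(log_url, start, end, get_json=http_get_json):
    """Fetch entries start..end, asking again when the log returns fewer."""
    entries = []
    while start <= end:
        url = f"{log_url}/ct/v1/get-entries?start={start}&end={end}"
        batch = get_json(url)["entries"]
        if not batch:
            raise RuntimeError(f"log returned no entries at index {start}")
        entries.extend(batch)
        start += len(batch)
    return entries


def download_log(log_url, chunk_size=256, workers=8, get_json=http_get_json):
    """Download every entry of one log, in log order (held in memory)."""
    tree_size = get_json(f"{log_url}/ct/v1/get-sth")["tree_size"]
    ranges = chunk_ranges(tree_size, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the order of `ranges`, so log order is kept.
        chunks = pool.map(
            lambda r: fetch_range(log_url, r[0], r[1], get_json), ranges)
        return [entry for chunk in chunks for entry in chunk]
```

Each returned entry carries base64-encoded `leaf_input` and `extra_data` fields. A crawler for full-size logs would likely also need retries, rate limiting, and writing each chunk to disk instead of collecting everything in memory.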

Do you mind answering a few questions so that we understand your use-case better, and see if we have other options?
  1. Do you just need a large dataset of "real world" certs, or specifically the ones stored in CT?
  2. Do you need only the certificates, or the corresponding chains too?
  3. Do you need the certs ordered exactly as in CT logs, so that you could recompute and validate the hashes of the Merkle trees?
  4. Can this dataset be incomplete or outdated? By how much?
  5. When you say "all logs", do you mean all "production" logs accepted by browser CT policies? Which policies (Chrome, Apple, etc)?
  6. Maybe you have a specific "query" / statistical data point that you are looking for?
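
To illustrate why ordering matters for question 3: a minimal sketch of the RFC 6962 Merkle Tree Hash, assuming `leaves` is the list of base64-decoded `leaf_input` values in exact log order. The function names are hypothetical. The result would be compared against the `sha256_root_hash` of a signed tree head of the same size.

```python
import hashlib


def leaf_hash(leaf_input):
    """Leaf hash: SHA-256 over a 0x00 prefix plus the leaf bytes."""
    return hashlib.sha256(b"\x00" + leaf_input).digest()


def merkle_root(leaves):
    """Merkle Tree Hash over an ordered list of leaf byte strings."""
    n = len(leaves)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return leaf_hash(leaves[0])
    # Split at the largest power of two strictly smaller than n.
    k = 1
    while k * 2 < n:
        k *= 2
    left = merkle_root(leaves[:k])
    right = merkle_root(leaves[k:])
    # Interior node: SHA-256 over a 0x01 prefix plus both child hashes.
    return hashlib.sha256(b"\x01" + left + right).digest()
```

Swapping any two distinct leaves changes the root, so a dataset that drops or reorders entries cannot be checked against the log's tree heads.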
Thank you,
Pavel

Nate

Aug 7, 2021, 6:08:50 PM
to certificate-transparency
Thank you for the reply, Pavel.

1. We are especially interested in all the certificates that were issued (and logged), even though some of them were never used publicly. If you can provide access to another dataset, that would be great (especially if it is not contained in the logged certs).
2. Mainly the certs themselves, although we also plan to analyze the chains.
3. No. We mainly need to make sure that we have all of them. If we get other indications, the index may matter.
4. A partial dataset would help, especially if we know which logs, or which parts of them, it covers (even if it comes from one or more sources).
5. We will start with Chrome, but plan to continue with the others as well (the parts that do not overlap).
6. We want to preprocess them and then explore the use/misuse/abuse of certificates. There is no particular query that we can mention.

Thanks again!