Archiving Certificate Transparency Logs

Filippo Valsorda

Sep 22, 2025, 6:35:51 AM
to Certificate Transparency Policy
Hello fellow mortals,

After chatting with some of you, including Philippe, Rasmus, and Ben, I built a little prototype of what long-term archival of CT log data could look like. Google currently operates mirrors of many historical logs, but that's not a sustainable solution in perpetuity. The idea is to package logs up as Static CT tiles and upload them to the Internet Archive.

  • vanity-mirror downloads an RFC 6962 log into the Static CT format. It expects a log.v3.json file in the current directory and takes a mirror URL as an argument. It does parallel get-entries and rebuilds the Merkle tree, eventually checking it against the STH, which it verifies and converts to a checkpoint.
  • photocamera-archiver compresses a Static CT log into a series of zip files (000.zip, 001.zip, ...), each containing a subtree of height 24 (16 Mi entries, roughly 11.4 GiB). Every archive also contains a README, the checkpoint, the log.v3.json file, issuers, level 3+ tiles, and the partial tiles on the right edge, so each archive is self-verifying.
It would have been nice to upload the Static CT log uncompressed to a single Internet Archive item, but the IA system struggles with more than a few files per item. (It took a couple of days to delete ~1000 files from an initial attempt.)

A sample archive, using a small 2018 vintage DigiCert log, is at https://archive.org/details/ct_digicert_yeti2018. It clocks in at 61.2 GiB over six zip files for 90,785,920 entries.

Since zip files are seekable, I am planning to make filippo.io/sunlight.Client capable of pulling entries and hashes directly from a set of zip files, without unpacking them.
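To illustrate the seekability point with the standard library: archive/zip reads the central directory up front and then extracts only the requested file. A minimal sketch, with a hypothetical tile path:

```go
// A minimal sketch of random access into an archive: archive/zip reads
// the central directory, then extracts only the requested file. The
// tile path below is hypothetical.
package main

import (
	"archive/zip"
	"fmt"
	"io"
)

func main() {
	r, err := zip.OpenReader("000.zip")
	if err != nil {
		panic(err)
	}
	defer r.Close()

	// zip.Reader implements fs.FS, so a single tile can be pulled out
	// without unpacking the rest of the archive.
	f, err := r.Open("tile/0/000") // hypothetical tile path
	if err != nil {
		panic(err)
	}
	defer f.Close()

	tile, err := io.ReadAll(f)
	if err != nil {
		panic(err)
	}
	fmt.Printf("read %d bytes\n", len(tile))
}
```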

I'm looking for feedback on the strategy, format, and tools. Ideally log operators would archive their own logs, so I would especially like feedback from log operators.

Cheers,
Filippo

P.S. The tools took about an afternoon. Then I spent 3-4 days over multiple weekends debugging why the IA kept rejecting 005.zip (and only 005.zip) with "Uploaded content is unacceptable. - error checking archive file". I excruciatingly bisected the log size (since an almost empty 005.zip was acceptable, but an almost complete 005.zip was not) down to a single tile, tile/data/x351/191. When I removed that file from the full-size archive, it uploaded. Then, bizarrely, if I added it back using zip(1), it still uploaded. I have no idea wtf is going on, and I almost went mad over it. I'm hoping it's just a fluke and won't happen again with other logs, but FYI.

Filippo Valsorda

Dec 3, 2025, 8:50:23 AM
to Certificate Transparency Policy
Hello mortals and everyone else,

An update on the Certificate Transparency log archiving side-quest: the archival tools have a new home, and I am almost done archiving everything that's still available. 

https://github.com/geomys/ct-archive hosts the tools with a fresh README, as well as a table of archived logs. PRs welcome! (I also found the issue that was breaking uploads to the IA! It's a decade-old archive/zip bug.)

With Philippe's help, I am almost done uploading to the Internet Archive all the Google mirrors and the Rejected shards that are still available. Let's Encrypt kindly offered to archive their Oak shards themselves, IPng is archiving their halloumi2026h2 shard to S3, and Google is exploring archiving their Rejected logs to GCS if they are ever turned down.

Going forward, it would be amazing if log operators could archive their shards when turning them down, either to their own storage or to the Internet Archive. The ct-archive README has detailed instructions, and I'm happy to help. If that is not an option, please continue announcing shard turn-downs here with some notice, so that I or someone else can archive the log instead.

That leaves, I believe (do double-check my work!), the following logs that were at some point Qualified but are neither live nor archived. Does anyone have ideas to reconstruct them?


Cheers,
Filippo

Elaine Cubit

Dec 3, 2025, 9:36:38 AM
to Filippo Valsorda, Certificate Transparency Policy
Hi Filippo,


> Does anyone have ideas to reconstruct them?

Censys has archived some of the logs you are looking for, and we'd be happy to give you access to our dataset for this. It's available via Google BigQuery and also as a (very large) Avro download. Either will include all certificates Censys has archived; unfortunately, we don't have a way to split out the individual logs you need. Feel free to email me directly if this would be helpful to you, and I will see what needs to be done to get you access. We don't have the original data, only the certificates, their indices, and their addition timestamps, so I know this might not be ideal.

We have a table of logs we've archived at https://platform.censys.io/certificates/logs that you can reference (it's behind auth, unfortunately). I believe we have most or all of the data from the logs you're looking for:

- nimbus2018
- nimbus2019
- nimbus2020
- nimbus2021
- yeti2024
- nessie2024
- oak2024h1
- oak2024h2
- sabre2025h1
- trustasia_log_2022
- trustasia_log_2023

Thanks for doing all this,

Elaine

Philippe Boneff

Dec 3, 2025, 10:32:35 AM
to Elaine Cubit, Filippo Valsorda, Certificate Transparency Policy
Amazing if you can get hold of leaves like that. Assuming that Censys's pipeline and ours had the same final view of the log (and I hope they did!), I should be able to recover the corresponding STHs, which should be enough to reconstruct the Merkle trees? We'd still be missing extra_data, but it's better than nothing.
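Reconstructing a tree and checking it against a recovered STH comes down to the RFC 6962 §2.1 hash definitions. A minimal sketch of those definitions in Go (not code from the archive tooling):

```go
// RFC 6962 §2.1 tree hashing, to check a reconstructed tree against a
// recovered STH root hash.
package ctarchive

import (
	"crypto/sha256"
	"math/bits"
)

// leafHash computes SHA-256(0x00 || leaf) over a serialized MerkleTreeLeaf.
func leafHash(leaf []byte) [32]byte {
	return sha256.Sum256(append([]byte{0x00}, leaf...))
}

// nodeHash computes SHA-256(0x01 || left || right).
func nodeHash(left, right [32]byte) [32]byte {
	h := sha256.New()
	h.Write([]byte{0x01})
	h.Write(left[:])
	h.Write(right[:])
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// treeHash computes the Merkle Tree Hash of the given leaf hashes:
// split at the largest power of two smaller than n, then recurse.
func treeHash(hashes [][32]byte) [32]byte {
	switch n := len(hashes); n {
	case 0:
		return sha256.Sum256(nil) // MTH({}) = SHA-256()
	case 1:
		return hashes[0]
	default:
		k := 1 << (bits.Len(uint(n-1)) - 1) // largest power of two < n
		return nodeHash(treeHash(hashes[:k]), treeHash(hashes[k:]))
	}
}
```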

Cheers,
Philippe

Filippo Valsorda

Dec 3, 2025, 10:48:33 AM
to Elaine Cubit, Certificate Transparency Policy
Hi Elaine,

Thank you for the offer! Do the precert entries include the issuer_key_hash as well? It gets hashed into the Merkle tree too, and while we might be able to reconstruct it, it'd be very painful to debug if it didn't work.
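If the issuer certificates can be found, reconstruction is mechanical: RFC 6962 defines issuer_key_hash as the SHA-256 hash of the issuing CA's DER-encoded SubjectPublicKeyInfo. A minimal sketch, with a hypothetical file path:

```go
// Recomputing issuer_key_hash per RFC 6962: the SHA-256 hash of the
// issuing CA's DER-encoded SubjectPublicKeyInfo.
package main

import (
	"crypto/sha256"
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"os"
)

func main() {
	pemBytes, err := os.ReadFile("issuer.pem") // hypothetical path
	if err != nil {
		panic(err)
	}
	block, _ := pem.Decode(pemBytes)
	if block == nil {
		panic("no PEM block found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		panic(err)
	}
	issuerKeyHash := sha256.Sum256(cert.RawSubjectPublicKeyInfo)
	fmt.Printf("issuer_key_hash: %x\n", issuerKeyHash)
}
```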

Losing the extra_data is not the end of the world. There is nothing cryptographically binding it to the STH, and technically it could be empty if the certificate was in the roots.

Cheers,
Filippo

Rob Stradling

Dec 3, 2025, 11:03:30 AM
to Elaine Cubit, Philippe Boneff, Filippo Valsorda, Certificate Transparency Policy
> We don't have the original data, only the certificates, their indices, and their addition timestamp, so I know this might not be ideal.

Hi Filippo.  crt.sh can offer the same ^^ for all of the logs you listed.  I haven't kept the extra_data, but the issuer_key_hash values could be easily calculated.

One caveat:
Historically, some logs have accepted certificates with malformed data in the "outer" certificate signature parameters (which aren't covered by the CA's signature). On crt.sh I have on occasion converted such certificates to their canonical form, usually by "merging" the record in the certificate table with an already-existing record for the canonical form of the same certificate. IINM, successfully reconstructing the Merkle trees from crt.sh's data would require undoing these edits, but unfortunately I haven't kept the malformed data.


Elaine Cubit

Dec 3, 2025, 11:15:53 AM
to Rob Stradling, Filippo Valsorda, Elaine Cubit, Philippe Boneff, Certificate Transparency Policy
Unfortunately, we have not kept the issuer_key_hash from precert entries. We have only the actual certificate data, indices, and timestamps.

Kurt Roeckx

Dec 3, 2025, 4:42:54 PM
to ct-p...@chromium.org
Hi,

I should have a copy of all (or most) entries of all of those logs in a postgres database. If you need help in recovering something, let me know.

Kurt

Pim van Pelt

Dec 3, 2025, 8:51:10 PM
to ct-p...@chromium.org
Hoi Filippo, colleagues,

On 22.09.2025 12:34, Filippo Valsorda wrote:
> I'm looking for feedback on the strategy, format, and tools. Ideally log operators would archive their own logs, so I would especially like feedback from log operators.

Thanks for the tool! I've used it to archive the previously corrupted halloumi2026h2 log to https://ct.ipng.ch/archive/. As an aside, 150 GB of storage turned into a 4.2 GB zip file, nice :) I am wondering: before I destroy the original copy and reclaim the space for future shards, is there a way to check the integrity of the zip files in the archive (other than unzip -t)?
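For what it's worth, one standard-library option along the lines of unzip -t: reading every entry through Go's archive/zip verifies each stored CRC-32, so a full read pass doubles as an integrity check. (A stronger check would re-verify the tiles against the checkpoint, since the archives are self-verifying.)

```go
// CRC-32 integrity pass over a zip archive: archive/zip returns an
// error at EOF of each file if the stored checksum does not match.
package main

import (
	"archive/zip"
	"io"
	"log"
	"os"
)

func main() {
	r, err := zip.OpenReader(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()
	for _, f := range r.File {
		rc, err := f.Open()
		if err != nil {
			log.Fatalf("%s: %v", f.Name, err)
		}
		// Reading to EOF triggers the CRC-32 check inside archive/zip.
		if _, err := io.Copy(io.Discard, rc); err != nil {
			log.Fatalf("%s: %v", f.Name, err)
		}
		rc.Close()
	}
	log.Println("all entries OK")
}
```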


> P.S. The tools took about an afternoon. Then I spent 3-4 days over multiple weekends debugging why the IA kept rejecting 005.zip [...]
Thanks for suffering through that!

groet,
Pim
-- 
Pim van Pelt <p...@ipng.ch>
PBVP1-RIPE https://ipng.ch/

Filippo Valsorda

Dec 4, 2025, 6:59:02 PM
to Certificate Transparency Policy
2025-12-03 19:52 GMT+01:00 Kurt Roeckx <ku...@roeckx.be>:
> Hi,
>
> I should have a copy of all (or most) entries of all of those logs in a postgres database. If you need help in recovering something, let me know.

Hi Kurt,

If you have all the original data, this might be the easiest source to reconstruct the trees from. If I built you a little tool that makes a Static CT log from your Postgres database, would you be down to run photocamera-archiver on them and upload them to the IA?

If so, I think I'll only need the schema and a small sample.
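The read side of such a tool could be as simple as streaming rows in index order. A hypothetical sketch against a made-up schema (table and column names invented; the actual schema is exactly what's being asked for here):

```go
// Hypothetical export loop for building a Static CT log from a
// Postgres database. The "entries" table and its columns are invented
// for illustration; only the shape of the iteration matters.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres driver
)

func main() {
	db, err := sql.Open("postgres", "dbname=ctlogs sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Stream entries in index order, the order the tree was built in.
	rows, err := db.Query(`SELECT idx, timestamp, cert, issuer_key_hash
	                       FROM entries WHERE log_id = $1 ORDER BY idx`, "yeti2024")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var idx, ts int64
		var cert, ikh []byte
		if err := rows.Scan(&idx, &ts, &cert, &ikh); err != nil {
			log.Fatal(err)
		}
		// Here the real tool would serialize the MerkleTreeLeaf and
		// append the entry to the Static CT log being built.
		fmt.Printf("entry %d: %d cert bytes\n", idx, len(cert))
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```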

Cheers,
Filippo

Filippo Valsorda

Dec 6, 2025, 5:24:42 PM
to Certificate Transparency Policy
Hello leaves and roots,

I have two pieces of good news and one piece of bad news.

The first good news is that I finished archiving all the Google mirrors and all the still-live non-DigiCert logs. There's a nice table at https://github.com/geomys/ct-archive.

The second good news is that as of v0.6.4-0.20251206201658-6074c64f2bb8, filippo.io/sunlight.Client can operate on Static CT logs on the local filesystem, or even directly on a set of archival zip files, without unpacking them. Client has methods to pick out and iterate entries, always checking inclusion proofs and STHs. At a lower level, filippo.io/torchwood@v0.8.0 can do that for any tlog-tiles log, too. At an even lower level, filippo.io/torchwood.TileArchiveFS is just an fs.FS that reads tiles and files from the right zip file automatically. I'm so happy with how little code that is.
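The fs.FS shape makes that easy to picture. A minimal sketch of the routing idea, assuming nothing about torchwood's actual API beyond what's described above:

```go
// A minimal sketch of the TileArchiveFS idea (not torchwood's actual
// API): an fs.FS that serves each path out of whichever zip archive
// contains it. zip.Reader is seekable, so nothing is unpacked.
package ctarchive

import (
	"archive/zip"
	"io/fs"
)

// multiZipFS tries each archive in turn. (The real layout could route
// by tile index instead of probing.)
type multiZipFS struct {
	archives []*zip.ReadCloser // 000.zip, 001.zip, ...
}

func (m *multiZipFS) Open(name string) (fs.File, error) {
	for _, a := range m.archives {
		if f, err := a.Open(name); err == nil {
			return f, nil
		}
	}
	return nil, &fs.PathError{Op: "open", Path: name, Err: fs.ErrNotExist}
}
```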

The bad news is that my week off is over, and I won't be able to prioritize working on the remaining DigiCert logs or on reconstructing the missing logs. This is a great opportunity to help out if you've been looking for a project!

For the live DigiCert logs, it's just a matter of running the ct-archive tools (there's a README), uploading the results to the Internet Archive, and then submitting a PR to the ct-archive table. You might need to fiddle with batch sizes to get a good rate out of DigiCert's rate limits.
For the missing logs, we have three leads on the missing data: Censys has leaf data but no issuer_key_hash (which needs to be reconstructed to get the Merkle leaf right); crt.sh can reconstruct issuer_key_hash but has (very reasonably) deduplicated certificates that have different leaf hashes; and Kurt has leaf data with issuer_key_hash and unsorted issuers (which are not that critical for the archives). I think Kurt's db might be the best lead. If anyone wants to give this a try (and Censys/crt.sh/Kurt have time!), it should be pretty easy to repurpose vanity-mirror: just replace the part that fetches batches of entries via get-entries with reads from the new data source.
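For anyone picking this up, the reason issuer_key_hash is load-bearing: for precert entries it is serialized directly into the RFC 6962 §3.4 MerkleTreeLeaf, and the leaf hash is SHA-256 of a 0x00 byte followed by those bytes. A sketch of the encoding (not vanity-mirror's actual code):

```go
// RFC 6962 §3.4 MerkleTreeLeaf encoding for precert entries, showing
// where issuer_key_hash ends up: it is part of the bytes hashed into
// the tree, so it cannot be skipped.
package ctarchive

import "encoding/binary"

// precertLeaf serializes a MerkleTreeLeaf for a precert_entry:
// Version(1) || MerkleLeafType(1) || Timestamp(8) || LogEntryType(2) ||
// issuer_key_hash(32) || TBSCertificate(3-byte length prefix) ||
// CtExtensions(2-byte length prefix).
func precertLeaf(timestamp uint64, issuerKeyHash [32]byte, tbs, extensions []byte) []byte {
	var b []byte
	b = append(b, 0)                                // Version: v1
	b = append(b, 0)                                // MerkleLeafType: timestamped_entry
	b = binary.BigEndian.AppendUint64(b, timestamp) // Timestamp
	b = binary.BigEndian.AppendUint16(b, 1)         // LogEntryType: precert_entry
	b = append(b, issuerKeyHash[:]...)              // issuer_key_hash
	// opaque TBSCertificate<1..2^24-1>: 3-byte big-endian length prefix.
	b = append(b, byte(len(tbs)>>16), byte(len(tbs)>>8), byte(len(tbs)))
	b = append(b, tbs...)
	// CtExtensions<0..2^16-1>: 2-byte big-endian length prefix.
	b = binary.BigEndian.AppendUint16(b, uint16(len(extensions)))
	b = append(b, extensions...)
	return b
}
```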

I'll still obviously be around here and on the Slacks if anyone wants to give this a try and needs help.

I hope going forward log operators will archive their logs for us, too <3

Cheers,
Filippo