That said, having to unpack the .zips and then serve the files can be problematic, given the sheer sizes involved.
No doubt your plan was to add an HTTP server component to serve the archives at some point, but since one wasn't yet available, and to meet my own needs, I've created a project that serves the content of the zip file collections over HTTP directly, without extraction.
It's still very much under development, so if anyone is interested in using it, feedback and issues for any bugs or problems encountered would be very welcome.
Hopefully, by providing an HTTP interface into the zips that is compatible with existing CT clients, it will also encourage/allow people to keep seeding, particularly for archives that are too large for the IA to host, as you mentioned on another thread recently.
Rationale:
1. No duplicate storage required: no need to unzip 10+ TB of zip files only to end up with ~25+ TB (guesstimate) of disk utilisation
2. Serves static tiled log archives via the HTTP interface that existing CT client code already uses with large log sets, *without* having to add and maintain code for extracting tiles etc. from zip files
3. Helps enable offline/isolated development and testing of client code against multiple large logs with minimal storage requirements
The PoC in the GitHub repo:
1. Automatically serves the log archives via HTTP after discovering matching folder names (ct_\* by default, as named in the torrents), each containing at minimum a 000.zip with a valid structure and retrievable checkpoint and log.v3.json files inside
2. Automatically generates and publishes /logs.v3.json describing all the valid logs found so far, and periodically discovers and adds newly downloaded logs once a valid 000.zip is available
3. Single Go binary in a minimal container
4. Prometheus metrics
5. Optional example headless qBittorrent configuration and container (refer to compose-all.yml), pre-configured with the URL to torrents.rss and auto-download rules to download everything
There's no TLS/HTTPS support in the Go binary: it acts purely as a plaintext HTTP zip-content server, with no rate limiting or other extras, focusing solely on serving the files from within the zips as fast as possible.
My thinking is that anyone who wants HTTPS, caching, rate limiting, successful-request logging and so on would put a reverse proxy in front of it to meet whatever their needs actually are.
There's one customisation worth noting.
Given that some of the archives lack /issuers/ files, the generated /monitor.json includes an extra per-log boolean field simply named "has_issuers". It defaults to false and is set to true if ct-archive-serve detects at least one file in 000.zip whose path starts with "/issuers/".
I was running into issues with my client repeatedly trying to retrieve CA certs from under /issuers/, to the point that it was really hurting performance.