--
You received this message because you are subscribed to the Google Groups "Certificate Transparency Policy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ct-policy+...@chromium.org.
To view this discussion visit https://groups.google.com/a/chromium.org/d/msgid/ct-policy/CAAUDTJgUYYURscKFxDFkz-BAQO4sw-WJjTMwqJEjzvA9%2BzbVJw%40mail.gmail.com.
Hi folks,
We've identified and fixed the issue in the Azul-based Raio static-ct-api logs that caused missing tiles.
A recent refactor (f10355f) introduced the bug by changing how the log recovered from 'fatal' sequencing errors (e.g., failing to write tiles to object storage) that require reloading the log to get back into a good state. After the refactor, the log attempted to immediately reload state after a sequencing failure, but critically did not correctly handle the error if reloading failed as well. This allowed the log to continue in a bad state despite the missing tiles. The fix (d38ccac) is to force the log to reload before the next sequencing operation so that the recovery mechanisms kick in.
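The recovery pattern behind the fix can be sketched roughly as follows. This is a minimal illustration, not the actual Azul code: `Log`, `FlakyStorage`, `TransientStorageError`, and the `needs_reload` flag are hypothetical names invented for the example. The point it shows is that a failed sequencing step only marks the log as needing a reload, and the reload is retried at the start of every later sequencing call until it succeeds.

```python
class TransientStorageError(Exception):
    """Hypothetical error for a failed object-storage read or write."""


class FlakyStorage:
    """Toy object store that fails the scripted operations, in order."""

    def __init__(self, failures):
        self.failures = list(failures)  # e.g. ["write", "load"]
        self.committed = 0              # number of entries durably written
        self.calls = []                 # record of attempted operations

    def _maybe_fail(self, op):
        self.calls.append(op)
        if self.failures and self.failures[0] == op:
            self.failures.pop(0)
            raise TransientStorageError(op)

    def load_state(self):
        self._maybe_fail("load")
        return self.committed

    def write_tiles(self, state, entries):
        self._maybe_fail("write")
        self.committed = state + len(entries)
        return self.committed


class Log:
    """Toy log; names and structure are illustrative, not taken from Azul."""

    def __init__(self, storage):
        self.storage = storage
        self.state = None
        self.needs_reload = True  # nothing has been loaded yet

    def sequence(self, entries):
        # Never sequence on top of state that may be stale.
        if self.needs_reload:
            # If this raises, needs_reload stays True, so the next call
            # retries the reload instead of sequencing in a bad state.
            self.state = self.storage.load_state()
            self.needs_reload = False
        try:
            new_state = self.storage.write_tiles(self.state, entries)
        except TransientStorageError:
            # Defer recovery to the start of the next sequencing call.
            self.needs_reload = True
            raise
        self.state = new_state
```

With `FlakyStorage(["write", "load"])`, the first call to `sequence` fails on the write, the second fails on the reload, and the third reloads successfully before writing anything. That is the two-failure scenario described above, handled without the log continuing in a bad state.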
Thus, the bug could be triggered by two consecutive transient object-storage failures: one during sequencing (e.g., a failed tile write) and a second during the reload that immediately followed. The regression quickly surfaced for the raio2025h2a log shard at 2025-07-30T10:24:00Z (approximately 18 hours after we submitted the inclusion request to the Chrome and Apple CT programs), resulting in missing tiles and an irrecoverable integrity violation. We subsequently withdrew the inclusion requests.
Timeline:
[2025-07-21T13:01Z] Bug introduced in commit f10355f.
[2025-07-28T16:17Z] Raio log shards deployed.
[2025-07-28T17:42Z] Start monitoring Raio log shards with CT monitor.
[2025-07-29T13:40Z] Start adding artificial load on test logs with entries from equivalent Tuscolo logs.
[2025-07-29T16:37Z] Submit Raio log shards for inclusion in Chrome and Apple CT programs.
[2025-07-30T10:24Z] Bug triggered for raio2025h2a log shard.
[2025-07-30T11:11Z] Bug detected by Cloudflare ct-monitor as a 404 error message (no alerts fired).
[2025-07-31T13:14Z] Missing tiles reported by Google.
[2025-07-31T13:48Z] Inclusion requests withdrawn.
[2025-08-05T12:13Z] Bug fix merged in commit d38ccac.
[2025-08-05T14:07Z] Bug fix deployed for all Raio log shards. raio2025h2b launched to replace raio2025h2a.
As the fix is now in place, we will resume the inclusion requests for the following log shards, with raio2025h2b replacing raio2025h2a as that was the only log shard for which the bug triggered.
What went poorly:
- Existing tests were insufficient to catch the regression in the log's safe recovery mechanisms. The library itself has tests for safe recovery, but the bug was in the application code calling the library. Adding end-to-end tests of the full application with built-in network fault injection could have helped to surface the issue sooner.
- The commit that introduced the bug was part of a larger refactoring, which made it easier for the bug to slip through manual review.
- Our own CT monitor detected the issue as it was unable to retrieve the missing tiles, but it did not fire alerts. Thus, we only learned about the bug from external reports.
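As a rough illustration of the monitoring gap in the last point, the sketch below lists the Merkle tree tiles a log of a given size should be serving and treats any 404 among them as an alert condition rather than a logged error. It assumes the tlog-tiles path layout that static-ct-api builds on (`tile/<L>/<N>[.p/<W>]`, 256 hashes per full tile). The function names and the `fetch_status` callable are hypothetical and do not reflect how our ct-monitor is implemented.

```python
def encode_index(n):
    """Encode a tile index as 3-digit path elements, e.g. 1234067 -> x001/x234/067."""
    parts = []
    while True:
        parts.append(f"{n % 1000:03d}")
        n //= 1000
        if n == 0:
            break
    parts.reverse()
    # Every element except the last carries an "x" prefix.
    return "/".join(["x" + p for p in parts[:-1]] + [parts[-1]])


def expected_tile_paths(tree_size):
    """List the hash tile paths needed to cover a tree of the given size."""
    paths = []
    level = 0
    nodes = tree_size  # number of hashes available at the current level
    while nodes > 0:
        full, partial = divmod(nodes, 256)
        for i in range(full):
            paths.append(f"tile/{level}/{encode_index(i)}")
        if partial:
            # The rightmost tile at this level is partial, with width `partial`.
            paths.append(f"tile/{level}/{encode_index(full)}.p/{partial}")
        nodes = full  # each full tile contributes one hash to the next level
        level += 1
    return paths


def find_missing_tiles(tree_size, fetch_status):
    """Return the expected tiles for which fetch_status(path) reports a 404."""
    return [p for p in expected_tile_paths(tree_size) if fetch_status(p) == 404]


def check_log(tree_size, fetch_status, alert):
    """Fire the alert callback if any tile covered by the checkpoint is missing."""
    missing = find_missing_tiles(tree_size, fetch_status)
    if missing:
        alert(f"{len(missing)} tile(s) missing, first: {missing[0]}")
    return missing
```

Here `fetch_status` would wrap an HTTP GET against the log's monitoring prefix and `alert` would page an operator. Enumerating every tile is only practical for small trees, so a real monitor would more likely check tiles incrementally as the checkpoint advances.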
What went well:
- The bug was reported to us quickly thanks to the diligence of Andrew Ayer and the Chrome team.
- The artificial load we've been putting on the log (cross-pollinating entries from the equivalent Tuscolo log shard -- thanks Filippo!) helped to surface the bug while the impact was low.
Going forward, we’ll work on addressing the shortcomings mentioned above so that bugs are less likely to occur and so that we are quicker to detect and react to operational issues.
Best,
Luke