Hello CT Policy Folks,
We at Let's Encrypt are writing to keep everyone informed about Oak, and the 2026h2 shard in particular. Availability on that shard has declined recently; we want to acknowledge that and describe what we are doing to improve it. Over the weekend of 19 October, Oak 2026h2's 30-day availability on the AddChain and AddPreChain endpoints dropped as low as 98.5% by our calculation. The shard's availability appears to have been declining intermittently during the week leading up to the 19th, and over the weekend ct-woodpecker alerted us to submission errors. Our Get endpoints are fronted by CTile and so consistently remained above 99% availability, but even those endpoints returned errors to some clients when a request needed something from the database.
First, we identified a few clients that were hammering shard resources with no apparent backoff. We have temporarily rate-limited those clients by IP at the ingress, which gave significant breathing room back to other clients.
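For client operators, here is a minimal sketch of capped exponential backoff with full jitter, the kind of retry behavior that avoids this pattern. It is illustrative only: the function names, the 1-second base, the 60-second cap, and the attempt count are assumptions for the example, not values Let's Encrypt has published or requires.

```python
import random


def backoff_delays(base=1.0, cap=60.0, attempts=6, rng=random.random):
    """Yield sleep durations: capped exponential backoff with full jitter.

    All defaults are hypothetical and chosen only for illustration.
    """
    for attempt in range(attempts):
        # The ceiling doubles on each attempt until it reaches the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick a random delay in [0, ceiling) so that many
        # clients retrying at once do not synchronize.
        yield rng() * ceiling


def submit_with_backoff(submit, sleep, **kwargs):
    """Call submit() until it returns True or the attempts run out.

    submit and sleep are passed in by the caller (hypothetical hooks),
    which keeps the sketch independent of any HTTP library.
    """
    for delay in backoff_delays(**kwargs):
        if submit():
            return True
        sleep(delay)
    return False
```

In real use, `submit` would wrap the client's AddChain or AddPreChain request and `sleep` would be `time.sleep`. Passing them in as arguments keeps the retry logic easy to exercise without network access.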
Second, we want to reduce the latency of DB operations. To that end we are spending more on RDS capacity. We have increased Provisioned IOPS (with some room still to go, if needed), and we have moved the shards to a newer RDS instance class. Availability has looked much better over the last ~48 hours, but we are keeping the option of stepping up an instance class on the table as we continue to monitor the shards.
With the changes made so far, Oak 2026h2 has remained above 99.6% availability on all endpoints for the last 48+ hours, which is just starting to bring the 30-day availability calculations back up.
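To illustrate why a 30-day figure recovers slowly, here is a rough sketch of a time-weighted blend. It assumes an even request rate across the window, and the inputs (672 hours at 98.5%, 48 hours at 99.6%) are illustrative numbers only, not our actual measurements or the method any availability monitor uses.

```python
def blended_availability(old_avail, old_hours, new_avail, new_hours):
    """Time-weighted average of two periods within one rolling window.

    Assumes requests arrive at a uniform rate, so hours can stand in
    for request counts. This is a simplification for illustration.
    """
    total_hours = old_hours + new_hours
    return (old_avail * old_hours + new_avail * new_hours) / total_hours


# Hypothetical 30-day (720 h) window: 672 h at 98.5%, then 48 h at 99.6%.
window = blended_availability(0.985, 672, 0.996, 48)
print(f"{window:.4%}")
```

Under these assumptions the window moves only from 98.5% to roughly 98.57%, since 48 hours is a small share of 720.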
Thank you,
--