DigiCert Yeti2025 issues

Chuck Blevins

Apr 29, 2025, 3:32:33 PM
to Certificate Transparency Policy
DigiCert's Yeti2025 log is experiencing increasing database timeouts and a growing backlog of pending entries, with the majority of the load being caused by reads on the DB cluster.

We have temporarily stopped accepting new entries while we troubleshoot. The initial change was to increase the cluster to 30K IOPS, which did not show improvement.

The team is considering next steps; I'll update as I have more information.

Cheers
Chuck

Chuck Blevins

Apr 29, 2025, 4:16:44 PM
to Certificate Transparency Policy, Chuck Blevins
The pause on new entries has not had the effect we'd hoped; for now we are going to pause get-entries as well to allow the log to catch up.
We are finalizing plans to address the underlying issue while we are catching up.

Cheers
Chuck

Chuck Blevins

Apr 29, 2025, 5:13:19 PM
to Certificate Transparency Policy, Chuck Blevins, Chuck Blevins
Get-entries was re-enabled about 45 minutes ago. We are not seeing additional difficulty and continue to monitor.

-Chuck

Chuck Blevins

Apr 29, 2025, 6:23:20 PM
to Certificate Transparency Policy
We have turned submissions back on and tweaked the nodes. Things are looking stable now.

We plan to add nodes in the short term.

Andrew Ayer

Apr 29, 2025, 6:42:53 PM
to Chuck Blevins, Certificate Transparency Policy
On Tue, 29 Apr 2025 18:23:05 -0400
Chuck Blevins <ch...@chuckblevins.com> wrote:

> We have turned submissions back on and tweaked the nodes. Things are
> looking stable now.

Hi Chuck,

I'm still getting so many Internal Server Errors as well as 15 second or higher response times from get-entries that I can only get 30 entries per second out of the log. I don't think it's stable enough to be accepting submissions again, because it's going to cause massive backlogs in monitors and eventually an MMD violation.

Regards,
Andrew

Chuck Blevins

Apr 29, 2025, 6:44:16 PM
to Andrew Ayer, Certificate Transparency Policy
Thanks for the feedback. I'll review with the team and respond.

Chuck Blevins

Apr 29, 2025, 7:00:52 PM
to Andrew Ayer, Certificate Transparency Policy
We are turning off submissions again and monitoring.
-Chuck

Rick Roos

Apr 29, 2025, 8:00:36 PM
to Certificate Transparency Policy, Chuck Blevins, Certificate Transparency Policy, Andrew Ayer

I just wanted to chime in on this issue and give a few more details on the technical challenges we are facing with this log. The main challenge is the IOPS on our Cassandra nodes, mostly due to the load from the get-entries endpoint. This traffic is not only slowing down get-entries calls but is also preventing the backlog of new entries from being merged into the tree. When we turn off the get-entries endpoint, the signer is able to empty the backlog and catch up; when we turn it back on, the signing backlog grows again. We agree with Andrew's suggestion and have turned off accepting any new entries again to protect against an MMD violation.

We are committed to doing everything we can to save this log, but to do so the immediate need is to increase our IOPS and the number of Cassandra nodes. However, this puts us in a difficult spot: in order to add more nodes, we will need to keep the add-chain and add-pre-chain endpoints off for a significant amount of time.

Cassandra only allows one node to be added to the cluster at a time, each node could take up to a day to complete (due to the IO load it takes to add a new node), and we are thinking we would need to add up to 12 new nodes. This means the whole process could take about 12 days to complete. We will also most likely need to take the get-entries endpoint on and off during this time, depending on the IO needed to add the new nodes.

This would obviously put us well outside our uptime requirements, and thus we would like to get everyone's feedback and thoughts on allowing this log to take this amount of downtime to get into a more stable state.


Thanks,
Rick

Matt Palmer

Apr 29, 2025, 9:54:33 PM
to ct-p...@chromium.org
On Tue, Apr 29, 2025 at 05:00:35PM -0700, Rick Roos wrote:
> We are committed to doing everything we can to save this log, but to
> do so the immediate need is to increase our IOPS and the number of
> Cassandra nodes. However, this puts us in a difficult spot: in order
> to add more nodes, we will need to keep the add-chain and
> add-pre-chain endpoints off for a significant amount of time.
>
> Cassandra only allows one node to be added to the cluster at a time,
> each node could take up to a day to complete (due to the IO load it
> takes to add a new node), and we are thinking we would need to add up
> to 12 new nodes. This means the whole process could take about 12
> days to complete. We will also most likely need to take the
> get-entries endpoint on and off during this time, depending on the IO
> needed to add the new nodes.
>
> This would obviously put us well outside our uptime requirements, and
> thus we would like to get everyone's feedback and thoughts on
> allowing this log to take this amount of downtime to get into a more
> stable state.

Per https://github.com/GoogleChrome/CertificateTransparency/blob/main/log_policy.md#incident-detection-and-response:

> In some circumstances, such as responding to an incident, log
> operators may be forced to make a decision that favors compliance with
> one requirement over another. To ensure that SCTs remain a strong
> indicator that a certificate was issued transparently, Log Operators
> are encouraged to implement mechanisms in their Logs that prioritize
> the timely inclusion and availability of log entries over the
> continued availability of add-chain and add-pre-chain APIs.

It sounds like that's at least a tacit admission that turning off
add-chain and add-pre-chain for a period is better than blowing the MMD,
or violating uptime requirements on other endpoints.

- Matt

Filippo Valsorda

Apr 30, 2025, 9:30:13 AM
to Rick Roos, Certificate Transparency Policy, Chuck Blevins, Andrew Ayer
How would DigiCert, the community, and the relying parties feel about turning this into a Static CT log?

I had been playing with a design for a Sunlight-to-RFC 6962 proxy, Photosphere (a layer around the core that emits sunlight!). It's essentially the reverse of Andrew's Sunglasses. I could maybe build a prototype by EOD.

The idea is to issue RFC 6962 read requests (get-sth for checkpoints, get-entries for tile levels -1 and 0, at most 128 get-entry-and-proof requests for tile levels 1+) when a client requests a tile, upload the result (checkpoint, or tile and any new issuers) to S3, and then redirect the client to S3, keeping track of which tiles are already in S3. DigiCert could run this next to the log, and then ratelimit or even block get-entries requests from the outside.

Keeping track of what tiles are already uploaded requires less than 1MB of bitmap. Fetching level 1 tiles by doing 128 get-entry-and-proof requests is annoying (maybe someone has a better idea!), but there are "only" less than 6M of those, and we can save a request for any level 0 tile we already fetched by keeping less than 200MB of hashes in cache.
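
Roughly, the data-tile read path could look like the sketch below (illustrative only: fetchEntries and uploadTile are placeholder names for the RFC 6962 get-entries call and the S3 PUT, and a real implementation would use the bitmap rather than a map):

// Sketch only: serve a data tile from S3 if we already uploaded it,
// otherwise build it once from the log's own get-entries endpoint.
package photosphere

import (
    "fmt"
    "net/http"
    "sync"
)

const tileWidth = 256 // entries per tile

var (
    mu       sync.Mutex
    uploaded = map[int64]bool{} // which full tiles are already in S3
)

func fetchEntries(start, end int64) ([][]byte, error) { return nil, nil } // get-entries?start=..&end=..
func uploadTile(index int64, entries [][]byte) error  { return nil }      // PUT the tile to S3

func handleDataTile(w http.ResponseWriter, r *http.Request, tile int64) {
    mu.Lock()
    done := uploaded[tile]
    mu.Unlock()

    if !done {
        // First request for this tile: read it out of the log, cache it
        // in S3, and remember that it is there.
        start := tile * tileWidth
        entries, err := fetchEntries(start, start+tileWidth-1)
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadGateway)
            return
        }
        if err := uploadTile(tile, entries); err != nil {
            http.Error(w, err.Error(), http.StatusBadGateway)
            return
        }
        mu.Lock()
        uploaded[tile] = true
        mu.Unlock()
    }

    // Every read after the first is served by S3, not the log's database.
    http.Redirect(w, r, fmt.Sprintf("https://BUCKET.s3.amazonaws.com/tile/data/%d", tile), http.StatusFound)
}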

Since there are no approved Static CT logs yet, I believe no certificates will fall out of compliance.

The main problem with this idea, and maybe a deal breaker, is the lack of the leaf_index SCT extension, which will require keeping the get-proof-by-hash endpoint up, and a policy exception.

WDYT? Is this worth prototyping?

Luke Valenta

Apr 30, 2025, 10:12:37 AM
to Filippo Valsorda, Rick Roos, Certificate Transparency Policy, Chuck Blevins, Andrew Ayer
Nice, I love the Photosphere idea (and was just thinking about how Sunglasses is such a good project name). But I think we can go simpler?

Another option to consider is a CT-specific caching proxy in front of the get-entries endpoint, backed by S3-compatible storage. It wouldn't help with other read API endpoints like get-proof-by-hash, but those are rarely requested anyway so don't contribute meaningfully to DB load. Over the past week, we've had just over 3K get-proof-by-hash requests and zero get-entry-and-proof requests across all nimbus logs, versus over 1B get-entries and about 170M get-sth requests.

This is a sort of halfway point between static CT and the RFC6962 API, but wouldn't require any API or policy changes. We've considered implementing this for the nimbus logs, but never prioritized it with static CT on the horizon. It's possible other CT logs have implemented it already, so if you have please shout! It would work as follows:

1. Proxy receives a get-entries request, say 'get-entries?start=41&end=100', and normalizes the request to align with 32-entry boundaries (or whatever the log's max get-entries response size is), so 'get-entries?start=32&end=63' in this case. These are now basically static CT data tiles.
2. Proxy checks the S3 cache. If cached, skip to step 4.
3. Proxy sends the normalized request to the DB and adds the response to the cache if successful.
4. Proxy processes the response to match the start parameter of the original get-entries request, so 'get-entries?start=41&end=63', and returns that to the client.
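
As a rough sketch of steps 1 and 4 (illustrative only, not an existing implementation), the normalization and trimming are just modular arithmetic:

// Sketch of the get-entries normalization and trimming described above.
package ctcache

const maxEntries = 32 // the log's maximum get-entries response size

// normalize maps an arbitrary start index onto the aligned window that
// contains it; the aligned range doubles as the cache key.
func normalize(start int64) (alignedStart, alignedEnd int64) {
    alignedStart = start - start%maxEntries
    return alignedStart, alignedStart + maxEntries - 1
}

// trim drops the leading entries the client did not ask for, so the
// returned slice begins at the original start parameter.
func trim(entries [][]byte, alignedStart, origStart int64) [][]byte {
    return entries[origStart-alignedStart:]
}

For the example above, normalize(41) gives the aligned range 32..63; the cached response for that range is then trimmed so the client receives entries 41..63.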

(I believe Rick has already been looking into this, but he can say more.)

Best,
Luke

--
Luke Valenta
Systems Engineer - Research

Andrew Ayer

Apr 30, 2025, 10:15:11 AM
to Filippo Valsorda, Rick Roos, Certificate Transparency Policy, Chuck Blevins
On Wed, 30 Apr 2025 15:29:47 +0200
"Filippo Valsorda" <fil...@ml.filippo.io> wrote:

> I had been playing with a design for a Sunlight-to-RFC 6962 proxy,
> Photosphere (a layer around the core that emits sunlight!). It's
> essentially the reverse of Andrew's Sunglasses
> <https://github.com/AGWA/sunglasses>. I could maybe build a prototype
> by EOD.

Happy to see the naming theme continue :-D

> WDYT? Is this worth prototyping?

So for *this particular log* I see a couple issues:

1. It doesn't support get-entry-and-proof.

2. Given that the growth rate of 2025h1 shards has slowed considerably, I assume that most of the certificates being submitted to Yeti2025 are expiring in 2025h2 and could be submitted to one of DigiCert's 2 Trillian logs instead. Therefore, I'm skeptical that it's worth expending any resources returning this log to a writable state, either through Photosphere or by DigiCert executing their Cassandra upgrade plan. Since being made read-only, the performance of get-entries has improved considerably. I think it makes more sense to leave it in this state for the rest of 2025, and for DigiCert to spend the resources setting up Sunlight logs.

That said, I think Photosphere could be valuable if other RFC6962 logs begin struggling with read load. I think we'd want to add some text to the static-ct-api spec along the lines of:

"If the log was converted from a RFC 6962 log, it MAY omit the leaf_index extension. If so, it MUST implement the <submission prefix>/ct/v1/get-proof-by-hash API endpoint according to RFC 6962, Section 4.5. Auditors can use this endpoint to verify inclusion of an SCT if the leaf_index extension is absent."

This does impose additional complexity on auditors who verify SCTs, but it can be removed in the future when static-ct-api is ubiquitous.
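
For illustration, the auditor-side check is roughly a get-proof-by-hash call plus the standard RFC 6962 audit-path verification. A sketch (assumed names; the MerkleTreeLeaf hashing and STH retrieval are omitted):

// Sketch only: fetch an inclusion proof for a leaf hash from an RFC 6962
// log and verify it against a known root hash.
package audit

import (
    "bytes"
    "crypto/sha256"
    "encoding/base64"
    "encoding/json"
    "fmt"
    "net/http"
    "net/url"
)

type proofResponse struct {
    LeafIndex int64    `json:"leaf_index"`
    AuditPath [][]byte `json:"audit_path"` // base64 strings decode into []byte
}

// getProofByHash calls <logURL>/ct/v1/get-proof-by-hash (RFC 6962, Section 4.5).
func getProofByHash(logURL string, leafHash [32]byte, treeSize int64) (*proofResponse, error) {
    q := url.Values{
        "hash":      {base64.StdEncoding.EncodeToString(leafHash[:])},
        "tree_size": {fmt.Sprint(treeSize)},
    }
    resp, err := http.Get(logURL + "/ct/v1/get-proof-by-hash?" + q.Encode())
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    var pr proofResponse
    if err := json.NewDecoder(resp.Body).Decode(&pr); err != nil {
        return nil, err
    }
    return &pr, nil
}

// verifyInclusion checks a Merkle audit path (RFC 6962 / RFC 9162 algorithm).
func verifyInclusion(leafHash [32]byte, leafIndex, treeSize int64, path [][]byte, root []byte) bool {
    if leafIndex >= treeSize {
        return false
    }
    fn, sn := leafIndex, treeSize-1
    r := leafHash[:]
    for _, p := range path {
        if sn == 0 {
            return false
        }
        if fn%2 == 1 || fn == sn {
            r = nodeHash(p, r)
            if fn%2 == 0 {
                for fn%2 == 0 && fn != 0 {
                    fn >>= 1
                    sn >>= 1
                }
            }
        } else {
            r = nodeHash(r, p)
        }
        fn >>= 1
        sn >>= 1
    }
    return sn == 0 && bytes.Equal(r, root)
}

func nodeHash(left, right []byte) []byte {
    h := sha256.New()
    h.Write([]byte{0x01}) // interior-node prefix
    h.Write(left)
    h.Write(right)
    return h.Sum(nil)
}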

Regards,
Andrew

Aaron Gable

Apr 30, 2025, 11:04:52 AM
to Luke Valenta, Filippo Valsorda, Rick Roos, Certificate Transparency Policy, Chuck Blevins, Andrew Ayer
Let's Encrypt has developed and deployed exactly this form of caching proxy, called ctile:

It's been incredibly successful for us, essentially singlehandedly turning our CT infrastructure from a constantly-burning low level fire into something we can mostly ignore. Improvements included:
- a 90% reduction in open database connections;
- an 80% reduction in submission latency;
- a 90% reduction in 99th percentile read latency for all non-get-entries reads

Take a look and see if you think it can be deployed in front of your logs as well.

Aaron

Luke Valenta

Apr 30, 2025, 12:01:38 PM
to Aaron Gable, Filippo Valsorda, Rick Roos, Certificate Transparency Policy, Chuck Blevins, Andrew Ayer
Thanks Aaron! I thought I remembered that someone had implemented this, but didn't remember the details.

Rasmus Dahlberg

Apr 30, 2025, 1:57:39 PM
to Filippo Valsorda, Rick Roos, Certificate Transparency Policy, Chuck Blevins, Andrew Ayer
FWIW: I think the reverse of sunglasses sounds like a great idea.  It
would be even better if every RFC-6962 log had a Photosphere proxy,
i.e., not just the ones which need better get-entries performance.  A
selling point would be that monitors can implement one API, not two.  I
would argue that this helps quite a bit with the migration story.

The framing I see is basically: Photosphere could be a mirror that
helps monitors download the logs and it happens to use static-ct.

-Rasmus

Rick Roos

Apr 30, 2025, 2:04:46 PM
to Certificate Transparency Policy, Certificate Transparency Policy
I just wanted to say thank you for all your feedback. These have been great suggestions, and we will be implementing some type of caching as soon as we can, as that will give the log the longer-term relief it needs. While we work on that, we have also started to add more Cassandra nodes, and we'll continue to give updates on our progress toward accepting entries again.

Thanks,
Rick

Rick Roos

May 2, 2025, 12:11:03 AM
to Certificate Transparency Policy, Rick Roos, Certificate Transparency Policy
I wanted to give an update on our progress. We were able to add the 12 new nodes more quickly than expected, and we are ready to turn on accepting entries again, but I'd like to get some feedback from the browsers first. We'd hate to turn it back on for CAs to submit new certs only to get SCTs that won't be trusted. Is there any guidance from the browsers about turning acceptance of new entries back on?

Thanks,
Rick

Mustafa Emre Acer

May 2, 2025, 5:04:49 PM
to Certificate Transparency Policy, Rick Roos, Certificate Transparency Policy
Hi Rick,

Thanks for the update and quick turnaround. We don't have any concerns about this log starting to accept new entries again.

At least for Chrome's CT policy, even if for some reason the log failed in a way that required transitioning it to Retired, any existing issued SCTs should still be valid. This is handled by this specific case:

> In order to contribute to a certificate’s CT Compliance, an SCT must have been issued before the Log’s Retired timestamp, if one exists. Chrome uses the earliest SCT among all SCTs presented to evaluate CT compliance against CT Log Retired timestamps. This accounts for edge cases in which a CT Log becomes Retired during the process of submitting certificate logging requests.

There is also some flexibility in how we pick the timestamp for transitions to the Retired state, and in the past we have tried to pick them to minimize breakage (based on known extant certificates).

Thanks,
Mustafa, on behalf of the Chrome CT Team

Rick Roos

May 2, 2025, 7:59:55 PM
to Certificate Transparency Policy, Certificate Transparency Policy
Hi Mustafa,

Thank you for the feedback. We have turned back on accepting new entries and are monitoring the system.

Thanks,
Rick
