Stricter rate limits enabled on Sabre and Mammoth log shards


Martijn Katerbarg

Feb 7, 2025, 10:45:28 AM
to ct-p...@chromium.org

All,

We're monitoring increased merge delays for both the mammoth2025h1 and sabre2025h1 log shards. As our history shows, these logs have taken performance hits repeatedly. While we work on deploying and submitting new, PostgreSQL-backed CT logs to resolve this long term, we also need to deal with the ongoing issues in the meantime.

To that end, and to avoid future MMD violations, we have set stricter rate limits on all of our log shards. These are currently set to 30 requests per second per IP address, with an additional global limit of 125 requests per second.
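
For those curious how such a two-tier limit composes, here is a minimal sketch in Go (illustrative only, not our production implementation), pairing a per-IP token bucket with a shared global bucket via golang.org/x/time/rate:

package main

// Minimal sketch of a two-tier rate limiter (per-IP plus global).
// Illustrative only -- not our production configuration. The numbers
// mirror the limits described above.

import (
    "net"
    "net/http"
    "sync"

    "golang.org/x/time/rate"
)

type twoTierLimiter struct {
    global *rate.Limiter
    mu     sync.Mutex
    perIP  map[string]*rate.Limiter
}

func newTwoTierLimiter() *twoTierLimiter {
    return &twoTierLimiter{
        global: rate.NewLimiter(rate.Limit(125), 125), // 125 req/s shared by all clients
        perIP:  make(map[string]*rate.Limiter),
    }
}

// limiterFor returns (creating on first sight) the limiter for one
// client IP. Eviction of idle entries is omitted for brevity.
func (l *twoTierLimiter) limiterFor(ip string) *rate.Limiter {
    l.mu.Lock()
    defer l.mu.Unlock()
    lim, ok := l.perIP[ip]
    if !ok {
        lim = rate.NewLimiter(rate.Limit(30), 30) // 30 req/s per IP
        l.perIP[ip] = lim
    }
    return lim
}

func (l *twoTierLimiter) middleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        ip, _, err := net.SplitHostPort(r.RemoteAddr)
        if err != nil {
            ip = r.RemoteAddr
        }
        // A request must fit in both the per-IP and the global budget.
        if !l.limiterFor(ip).Allow() || !l.global.Allow() {
            http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
            return
        }
        next.ServeHTTP(w, r)
    })
}

func main() {
    lim := newTwoTierLimiter()
    http.ListenAndServe(":8080", lim.middleware(http.HandlerFunc(
        func(w http.ResponseWriter, r *http.Request) { w.Write([]byte("ok\n")) })))
}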

We will monitor the performance of the different log shards over the next few days, potentially weeks, and may adjust these limits, with or without further notice to this group.

Regards,

Martijn Katerbarg
Sectigo

Matt Palmer

Feb 9, 2025, 5:05:24 PM
to ct-p...@chromium.org
On Fri, Feb 07, 2025 at 03:45:19PM +0000, 'Martijn Katerbarg' via Certificate Transparency Policy wrote:
> We're monitoring increased merge delays for both the mammoth2025h1 and sabre2025h1 log shards. As our history shows, these logs have taken performance hits repeatedly. While we work on deploying and submitting new, PostgreSQL-backed CT logs to resolve this long term, we also need to deal with the ongoing issues in the meantime.
>
> To that end, and to avoid future MMD violations, we have set stricter rate limits on all of our log shards. These are currently set to 30 requests per second per IP address, with an additional global limit of 125 requests per second.

If I'm understanding that rate limit correctly, five monitors (or one
bad actor with five EC2 instances) can essentially stop the CT
monitoring ecosystem from effectively monitoring these logs for
correctness. Am I misunderstanding your description of the rate limits
you've put in place?

Restricting the ability of monitors to observe the log's behaviour seems
antithetical to the purpose of CT. Rather than restrict monitors, did
you instead consider restricting submissions?

To my way of thinking, rate-limiting submissions is a load management
mechanism far more in line with the principles of CT. A log
occasionally failing to accept a submission should not be a big deal,
because there are, by design, multiple logs that can receive
submissions. On the other hand, a log that cannot be effectively
monitored is worse than no log at all, since it is providing a
misleading assurance of security.

- Matt

Devon O'Brien

Feb 10, 2025, 7:12:56 PM
to Certificate Transparency Policy, Martijn Katerbarg
Hi Martijn,

Thanks for looking into the merge delay backlog for the Sabre and Mammoth logs, and for taking action to reduce the likelihood of either log blowing its MMD. Regarding the rate limits, I have two questions for you:

1. While updating ct-policy@ isn't a requirement, Chrome's CT Log Policy requires all changes to CT logs' policies and acceptance criteria to be reported to the log's crbug.com application bug (historically, this includes substantive changes to rate limits). Could you please keep track of relevant changes to log operation in the corresponding bug?

2. In the days since this announcement, we've seen the availability numbers for mammoth2025h1 (now 98.41%) and sabre2025h1 (now 94.71%) continue to drop, with the most recent drops attributable to failed add-chain and add-pre-chain calls from our log monitoring infrastructure. In his reply to your post, Matt Palmer raises some good points about the relative importance of per-API log availability, but we are already starting to see even write availability drop. Do you have a sense as to whether this rate limiting can be lifted once the pending certificate backlog shrinks, or is it intended as a longer-term mitigation?

-Devon

Martijn Katerbarg

Feb 12, 2025, 7:03:12 AM
to Certificate Transparency Policy, Devon O'Brien

Hi Matt, Devon,



> Am I misunderstanding your description of the rate limits you've put in place?

No, your understanding is correct.

We have only lowered the thresholds on rate limits that were already in place. Previously, the thresholds were set at 60 requests/IP/s, with a global limit of 250/s.

> Rather than restrict monitors, did you instead consider restricting submissions?

We felt that swift action had become necessary, and we were able to quickly tweak the thresholds of the existing rate limits. We have a pending task for our DevOps team to implement separate rate limit thresholds for GETs versus POSTs.
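
Conceptually that change just keys the limiter on the HTTP method as well. A rough sketch of the idea (illustrative only, not the actual change our DevOps team will deploy):

package ratelimit

// Sketch: pick a limiter by HTTP method so reads (GETs) and writes
// (POSTs) are throttled independently. The thresholds are whatever the
// caller passes in; nothing here is a deployed configuration.

import (
    "net/http"

    "golang.org/x/time/rate"
)

func methodLimited(next http.Handler, getLim, postLim *rate.Limiter) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        var lim *rate.Limiter
        switch r.Method {
        case http.MethodGet: // get-sth, get-entries, get-roots, ...
            lim = getLim
        case http.MethodPost: // add-chain, add-pre-chain
            lim = postLim
        default:
            http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
            return
        }
        if !lim.Allow() {
            w.Header().Set("Retry-After", "1")
            http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
            return
        }
        next.ServeHTTP(w, r)
    })
}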

> If I'm understanding that rate limit correctly, five monitors (or one bad actor with five EC2 instances) can essentially stop the CT monitoring ecosystem from effectively monitoring these logs for correctness.

You are correct. That's why we want to monitor this and tweak the limits until we find appropriate settings (and, in the end, move to PostgreSQL-backed logs).



> While updating ct-policy@ isn't a requirement, Chrome's CT Log Policy requires all changes to CT logs' policies and acceptance criteria to be reported to the log's crbug.com application bug (historically, this includes substantive changes to rate limits). Could you please keep track of relevant changes to log operation in the corresponding bug?

Thank you for this reminder. We have updated the crbug.com application bugs accordingly.



> In the days since this announcement, we've seen the availability numbers for mammoth2025h1 (now 98.41%) and sabre2025h1 (now 94.71%) continue to drop, with the most recent drops attributable to failed add-chain and add-pre-chain calls from our log monitoring infrastructure. In his reply to your post, Matt Palmer raises some good points about the relative importance of per-API log availability, but we are already starting to see even write availability drop. Do you have a sense as to whether this rate limiting can be lifted once the pending certificate backlog shrinks, or is it intended as a longer-term mitigation?

We expect this availability drop is the result of an increase in “429 Too Many Requests” responses due to these rate limits. Once we've implemented separate rate limit thresholds for GETs versus POSTs, we expect at least the GET thresholds to be raised in the short term. Since you're also seeing an increase in failed add-chain and add-pre-chain calls, we will certainly look at what we can do there without running into other problems.

We agree that Matt raises a good point. We can't risk running into the MMD every X days/weeks, but we also can't set rate limits that are too strict. Given the very limited headroom in our currently deployed software stack, there's a delicate balance to strike here.

Regards,

Martijn Katerbarg
Sectigo


On Tuesday, 11 February 2025 at 01:12:56 UTC+1, Devon O'Brien wrote:

Mustafa Emre Acer

May 2, 2025, 5:59:14 PM
to Certificate Transparency Policy, Martijn Katerbarg, Devon O'Brien
Hi Martin,

We've recently seen the availability numbers for mammoth2025h2 drop below 99% as well, due to an increased number of 429s on all endpoints. Just wanted to check whether you've seen any reduction in the pending queue size in these logs so far. If so, do you expect to raise the rate limits again in the near future?

Thanks,
Mustafa, on behalf of the Chrome CT Team

Martijn Katerbarg

May 7, 2025, 6:08:33 AM
to Mustafa Emre Acer, Certificate Transparency Policy, Devon O'Brien

Hi Mustafa, all, 

We don't have much additional news at this moment, but we are currently rolling out a change to support different rate limits for POST and GET requests, which we expect to be completed before the end of this week. I'd suggest we follow up with more details about a week after that deployment has completed, so we can (hopefully) gather some additional statistics.

Regards,

Martijn
Sectigo

 


Mustafa Emre Acer

May 9, 2025, 3:25:16 PM
to Martijn Katerbarg, Certificate Transparency Policy, Devon O'Brien
Thanks for the update, Martijn (and apologies for my typo in the previous message).

- Mustafa
 

James Thomas

May 14, 2025, 12:08:34 PM
to Certificate Transparency Policy, Martijn Katerbarg
I'm curious if the rate limit change has been implemented yet?

We're seeing a lot of 429s (and a few 502s and 504s) on Sabre and especially on Mammoth. We're nowhere near 30 requests/sec/IP, so I assume it's the global limit that's being triggered. We're hitting enough 429s that our crawl rate is pretty slow because we're spending so much time on retries/backoffs.
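
For reference, our fetcher's retry loop looks roughly like this (a simplified sketch; fetchWithBackoff and its constants are ours, not from any CT library), which is where all that crawl time goes:

package main

// Simplified sketch of a crawler retry loop: retry 429s and 5xxs with
// capped exponential backoff plus jitter, honouring Retry-After when
// the server sends one. The shard URL below is just an example.

import (
    "fmt"
    "math/rand"
    "net/http"
    "strconv"
    "time"
)

func fetchWithBackoff(url string, maxAttempts int) (*http.Response, error) {
    backoff := time.Second
    for attempt := 1; ; attempt++ {
        resp, err := http.Get(url)
        if err == nil && resp.StatusCode == http.StatusOK {
            return resp, nil // caller closes the body
        }
        retryable := err != nil ||
            resp.StatusCode == http.StatusTooManyRequests || // 429
            resp.StatusCode >= 500 // the 502s/504s mentioned above
        if resp != nil {
            // Honour Retry-After (seconds form) if supplied.
            if s := resp.Header.Get("Retry-After"); s != "" {
                if secs, perr := strconv.Atoi(s); perr == nil && secs > 0 {
                    backoff = time.Duration(secs) * time.Second
                }
            }
            resp.Body.Close()
        }
        if !retryable || attempt >= maxAttempts {
            return nil, fmt.Errorf("giving up on %s after %d attempts", url, attempt)
        }
        // Sleep with jitter, then double the delay, capped at a minute.
        time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff/2))))
        if backoff < time.Minute {
            backoff *= 2
        }
    }
}

func main() {
    resp, err := fetchWithBackoff("https://sabre2025h1.ct.sectigo.com/ct/v1/get-sth", 8)
    if err != nil {
        fmt.Println(err)
        return
    }
    defer resp.Body.Close()
    fmt.Println(resp.Status)
}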

Martijn Katerbarg

May 14, 2025, 3:42:14 PM
to James Thomas, Certificate Transparency Policy

Hi James,

 

> I'm curious if the rate limit change has been implemented yet?

 

It has. We've also since discussed this further and are looking at additional changes; we will expand on those in a few days. For now, we're updating the rate limits once more to fine-tune them and observe the effect. I hope this will be deployed before the end of the day today, though I cannot guarantee it.

Please expect further details from us before the end of the week.

Regards,

Martijn

Sectigo

 


Martijn Katerbarg

May 16, 2025, 4:10:00 AM
to James Thomas, Certificate Transparency Policy

All,


Yesterday we made further changes to all of our logs.

First up, we have added a CTFE-based rate-limiting mechanism (see https://github.com/google/certificate-transparency-go/pull/1698) for older certificates. The reason for this is that we've seen quite a lot of submissions of certificates that were issued several days ago. This rate limit uses the notBefore date of the certificate. The threshold for considering a certificate "old" is set to 28 hours, with the global (per shard) rate limit set to 40 req/sec and a burst of 10.
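
Schematically, the mechanism classifies each submission by the age of its notBefore and sends "old" submissions through a dedicated, stricter limiter. A sketch of the idea with the numbers above (an illustration, not the PR's actual code):

package ctfe

// Sketch of notBefore-based rate limiting: submissions whose notBefore
// is older than a threshold draw from a separate, stricter budget.

import (
    "crypto/x509"
    "time"

    "golang.org/x/time/rate"
)

// Certificates with a notBefore older than this count as "old".
const oldCertThreshold = 28 * time.Hour

// Global (per shard) budget for old-certificate submissions:
// 40 req/sec with a burst of 10.
var oldCertLimiter = rate.NewLimiter(rate.Limit(40), 10)

// allowSubmission reports whether an add-(pre-)chain call should
// proceed or be answered with 429. Fresh certificates remain subject
// only to the ordinary per-IP and global limits, enforced elsewhere.
func allowSubmission(leaf *x509.Certificate, now time.Time) bool {
    if now.Sub(leaf.NotBefore) <= oldCertThreshold {
        return true
    }
    return oldCertLimiter.Allow()
}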

 

In addition, we've increased our global rate limits for both POST and GET requests, while simultaneously lowering the per-IP limits.

We are aware that per-IP limits are easily bypassed by simply using multiple IP addresses, but we hope they will deter some heavy submitters and requesters, especially in combination with the CTFE-based rate limiting.


For both GET and POST requests, the per-IP limit is now set at 20 requests per second.

For GET requests, the global limit is set at 400 requests per second. For POST requests, the global limit is set at 200 requests per second.

 

We feel these are reasonable limits given the current issuance rate of the WebPKI. We would like to raise the global rate limits further, at a minimum, but want to keep an eye on resource usage.
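
For a rough sense of scale (the issuance figure here is an assumed round number for illustration, not a measurement): at roughly 10,000,000 certificates per day WebPKI-wide,

    10,000,000 certs/day / 86,400 s/day ≈ 116 new certs/s

so even if every precertificate were submitted to a single shard, the average load would be about 116 POSTs/s against a 200 POSTs/s global budget. That leaves some headroom, though not much for bursts, final-certificate logging, or backlog replays, which is part of why we would like to raise the global limits further.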

Regards,

Martijn

 


James Thomas

May 18, 2025, 11:03:42 AM
to Certificate Transparency Policy, Martijn Katerbarg, James Thomas
We're still seeing a lot of 429s and 502s from Mammoth and Elephant. We're using a single crawl thread (~1-3 requests per second), so I assume it's the global rate limit that's triggering the 429s (although we're getting more 502s than 429s).

Matt Palmer

May 18, 2025, 7:45:29 PM
to ct-p...@chromium.org
On Fri, May 16, 2025 at 08:09:53AM +0000, 'Martijn Katerbarg' via Certificate Transparency Policy wrote:
> For both GET and POST requests, the per-IP limit is now set at 20
> requests per second. For GET requests, the global limit is set at 400
> requests per second. For POST requests, the global limit is set at
> 200 requests per second.
>
> We feel these are reasonable limits given the current issuance rate
> of the WebPKI. We would like to raise the global rate limits further,
> at a minimum, but want to keep an eye on resource usage.

What are Sectigo's plans for bringing these logs into compliance with
the Chrome CT Log policy's new recommendations around rate-limiting,
which view "rate-limiting during normal operation" as a form of log
unavailability?

- Matt

Pierre Barre

May 19, 2025, 7:09:40 AM
to James Thomas, Certificate Transparency Policy, Martijn Katerbarg
All Sectigo logs are effectively down on my end, as it's exceedingly hard to complete even a single request without hitting the global rate limit.

Best,
Pierre

Rob Stradling

May 20, 2025, 6:15:38 AM
to Certificate Transparency Policy, Pierre Barre, Martijn Katerbarg, James Thomas

Soon after the scheduled maintenance concluded on Saturday, some of our front-end load balancers started hitting CPU resource limits. This affected the majority of requests to our Mammoth and Elephant logs, as well as some requests from certain localities to our Sabre and Tiger logs.

More CPU resources were allocated at approximately midday UTC yesterday, after which our logs show a dramatic increase in HTTP 200 responses and a dramatic decrease in 50x responses.

Are folks seeing the effect of this improvement?

James Thomas

May 20, 2025, 9:40:43 AM
to Certificate Transparency Policy, Rob Stradling, Pierre Barre, Martijn Katerbarg, James Thomas
We're seeing some improvements but are still getting errors (a mix of 429s, 500s, and 504s) on mammoth2025h2 and sabre2025h2.

Joe DeBlasio

May 20, 2025, 7:50:58 PM
to James Thomas, Certificate Transparency Policy, Rob Stradling, Pierre Barre, Martijn Katerbarg
Looking only at requests made today (2025-05-20, PDT) until roughly now, Chrome's compliance monitoring infrastructure is still seeing availability around 94% on mammoth2025h2, and around 91% on sabre2025h2. All other shards have been fine. The failures we've seen have been a mix of 429s, 504s, and 500s (in decreasing order of frequency).

We're pleased to see the availability numbers improving, but they're still not really where Chrome needs them to be, so thank you for your continued work on this. We were in a somewhat precarious spot a little while ago, when several operators were having issues simultaneously -- thankfully it looks like most operators have largely recovered, so we're keen to see Sectigo's logs follow suit.

Joe

James Thomas

May 29, 2025, 9:42:06 PM
to Certificate Transparency Policy, Certificate Transparency Policy, Rob Stradling, Martijn Katerbarg
Just checking in to see if there's any update/news on this. We've been mostly unable to consume Mammoth or Sabre due to high error rates.

Pierre Barre

Jun 5, 2025, 3:18:55 AM
to Mustafa Emre Acer, Certificate Transparency Policy, Martijn Katerbarg, Devon O'Brien, mhutc...@google.com
Hi,

A piece of information that may be useful or interesting: while running my test endpoint at https://compact-log.pre-test.ct.merklemap.com/ I noticed that someone is preloading around 100-200 entries a second (almost exclusively add-entry, not add-pre-entry).

Traffic is originating from a netcup server and a single IP address. The user-agent is go-http-client/1.1 (ctl...@google.com).
Martin Hutchinson informed me it's not them (it would be surprising if they used a netcup server anyway).

If you are hit by the same user, this "useless" traffic is probably hurting your log.

Best,
Pierre