All,
We're monitoring increased merge delays for both the mammoth2025h1 and sabre2025h1 log shards. As our history shows, these logs have repeatedly taken performance hits. While we work on deploying and submitting new, PostgreSQL-backed CT logs to resolve this long term, we also need to deal with the ongoing issues in the meantime.
To that end, and to avoid future MMD violations, we have set stricter rate limits on all of our log shards. These are for the moment set to 30 requests per second per IP address, with an additional global limit of 125 requests per second.
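For illustration, a per-IP limit combined with a global limit can be modelled as two token buckets that every request must pass. This is only a sketch using the figures announced above (30 req/s per IP, 125 req/s global); it is not Sectigo's actual implementation, and all class and parameter names are invented for the example.

```python
import time
from collections import defaultdict

class TokenBucket:
    """Simple token bucket: refills at `rate` tokens/second, capped at `burst`."""
    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = None  # timestamp of the previous call

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        if self.last is not None:
            # Refill proportionally to elapsed time, never above the cap.
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

class LogRateLimiter:
    """Hypothetical front-end check: per-IP bucket first, then a shared global bucket."""
    def __init__(self, per_ip_rate=30, global_rate=125):
        self.per_ip = defaultdict(lambda: TokenBucket(per_ip_rate, per_ip_rate))
        self.global_bucket = TokenBucket(global_rate, global_rate)

    def allow(self, ip, now=None):
        # A request must pass both buckets; anything else would be
        # answered with 429 Too Many Requests. Short-circuiting means a
        # request rejected per-IP does not consume a global token.
        return self.per_ip[ip].allow(now) and self.global_bucket.allow(now)
```

Checking the per-IP bucket before the global one means one noisy client cannot burn the global budget with requests that would be rejected anyway.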
We will monitor performance of the different log shards over the next few days, potentially weeks, and may adjust these limits, with or without future notification to this group.
Regards,
Martijn Katerbarg
Sectigo
Hi Matt, Devon,
> Am I misunderstanding your description of the rate limits you've put in
place?
No, your understanding is correct.
We have only lowered the thresholds on rate limits that were already in place. Previously, the thresholds were set at 60 requests/IP/s, with a global limit of 250/s.
> Rather than restrict monitors, did you instead consider restricting
submissions?
We felt that swift action had become necessary, and we were able to quickly tweak the thresholds for the existing rate limits. We have a pending task for our DevOps team to implement separate rate limit thresholds for GETs versus POSTs.
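Splitting thresholds by HTTP method could look roughly like the sketch below: one limiter for read endpoints and another for writes. The limiter scheme and all names here are hypothetical; at this point in the thread the actual GET/POST thresholds had not yet been chosen.

```python
import time

class FixedWindowLimiter:
    """Counts requests in fixed one-second windows; simpler than a
    token bucket, but enough to illustrate the method split."""
    def __init__(self, limit):
        self.limit = limit
        self.window = None
        self.count = 0

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        window = int(now)
        if window != self.window:
            # New one-second window: reset the counter.
            self.window, self.count = window, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False

class MethodAwareLimiter:
    """Separate thresholds for read endpoints (GET: get-sth, get-entries, ...)
    and write endpoints (POST: add-chain, add-pre-chain)."""
    def __init__(self, get_limit, post_limit):
        self.limiters = {
            "GET": FixedWindowLimiter(get_limit),
            "POST": FixedWindowLimiter(post_limit),
        }

    def allow(self, method, now=None):
        limiter = self.limiters.get(method)
        return limiter.allow(now) if limiter else False
```

The point of the split is that monitors (mostly GETs) and CAs (POSTs) draw from separate budgets, so heavy reads no longer starve writes and vice versa.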
> If I'm understanding that rate limit correctly, five monitors (or one bad
actor with five EC2 instances) can essentially stop the CT monitoring ecosystem
from effectively monitoring these logs for correctness.
You are correct. That is why we want to monitor this and tweak the limits until we find appropriate settings (and, ultimately, move to PostgreSQL-backed logs).
> While updating ct-policy@ isn't a requirement, Chrome's CT Log Policy
requires all changes to CT logs' policies and acceptance criteria to the log's
crbug.com application bug (historically, this includes substantive changes to
rate limits). Could you please keep track of relevant changes to log operation
in the corresponding bug?
Thank you for this reminder. We have updated the crbug.com application bugs accordingly.
> In the days since this announcement, we've seen the availability numbers
for mammoth2025h1 (now 98.41%) and sabre2025h1 (now 94.71%) continue to drop,
with the most recent drops attributable to failed add-chain and add-pre-chain
calls from our log monitoring infrastructure. In his reply to your post, Matt
Palmer raises some good points about the relative importance of log
availability per-api, but we are already starting to see even write
availability drop. Do you have a sense as to whether this rate limiting is
something that can be lifted after pending certificate backlogs reduce, or is
this intended as a longer-term mitigation?
We expect this availability drop is the result of an increase in “429 Too Many Requests” responses due to these rate limits. Once we’ve implemented separate rate limit thresholds for GETs versus POSTs, we do expect at least the GET rate limit thresholds to be raised in the short term. Since you’re also seeing an increase in failed add-chain and add-pre-chain calls, we will certainly look at what we can do there without running into other problems.
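On the monitor side, one common way to cope with 429 responses is exponential backoff between retries. Below is a minimal sketch; the `fetch` callable and the retry parameters are invented for the example, and a production monitor would also honour any Retry-After header the log returns.

```python
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call `fetch` (which returns an HTTP status code) and retry on 429,
    doubling the wait each time. Returns the final status code."""
    for attempt in range(max_retries):
        status = fetch()
        if status != 429:
            return status
        # Rate-limited: back off for base_delay * 2^attempt seconds.
        sleep(base_delay * (2 ** attempt))
    return 429
```

Injecting `sleep` as a parameter keeps the sketch testable without real delays; in production a small random jitter is usually added so many monitors don't retry in lockstep.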
We do agree that Matt raises a good point. We can’t risk running into the MMD every X days or weeks, but we also can’t set rate limits that are too strict. Given the very limited headroom available in our currently deployed software stack, there’s a delicate balance to strike here.
Regards,
Martijn Katerbarg
Sectigo
Hi Mustafa, all,
We don't have much additional news at this moment, but we are currently rolling out a change to support different rate limits for POST and GET requests, which we expect to complete before the end of this week. I'd like to suggest we follow up with more details about a week after that deployment completes, so we can (hopefully) gather some additional statistics.
Regards,
Martijn
Sectigo
From: Mustafa Emre Acer <mea...@chromium.org>
Date: Friday, 2 May 2025 at 23:53
To: Certificate Transparency Policy <ct-p...@chromium.org>
Cc: Martijn Katerbarg <martijn....@sectigo.com>, Devon O'Brien <asymm...@google.com>
Subject: Re: Stricter rate limits enabled on Sabre and Mammoth log shards
> Hi Martijn, We've recently seen the availability numbers for mammoth2025h2 drop below 99% as well, due to an increased number of 429s on all endpoints. Just wanted to check if you saw any reduction in the pending queue size so far in these
Hi James,
> I'm curious if the rate limit change has been implemented yet?
They have. We’ve also since discussed this further and are looking at additional changes. We will expand on that in a few days. For now, we’re working on updating the rate limits once more to further fine-tune them and see the effect. I hope this will be deployed before the end of the day today, though I cannot guarantee it.
Please expect further details from us before the end of the week.
Regards,
Martijn
Sectigo
From: 'James Thomas' via Certificate Transparency Policy <ct-p...@chromium.org>
Date: Wednesday, 14 May 2025 at 18:08
To: Certificate Transparency Policy <ct-p...@chromium.org>
Cc: Martijn Katerbarg <martijn....@sectigo.com>
Subject: [ct-policy] Re: Stricter rate limits enabled on Sabre and Mammoth log shards
I'm curious if the rate limit change has been implemented yet? We're seeing a lot of 429s (and a few 502s and 504s) on Sabre and especially on Mammoth. We're nowhere near 30 requests/sec/IP, so I assume it's the global limit that's being triggered.
--
You received this message because you are subscribed to the Google Groups "Certificate Transparency Policy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to
ct-policy+...@chromium.org.
To view this discussion visit
https://groups.google.com/a/chromium.org/d/msgid/ct-policy/93b7f376-7246-4a32-9114-ee0761e7fc98n%40chromium.org.
All,
Yesterday we made further changes to all of our logs.
First, we have added a CTFE-based rate limiting mechanism (see https://github.com/google/certificate-transparency-go/pull/1698) for older certificates. The reason for this is that we’ve seen quite a lot of submissions to our logs of certificates that were issued several days earlier. This rate limit uses the notBefore date of the certificate. The threshold for considering a certificate "old" is set to 28 hours, with the global (per shard) rate limit set to 40 req/sec, burst=10.
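A simplified sketch of the notBefore check described above (the real mechanism is the CTFE change linked in the pull request; the function name and bucket routing here are invented for illustration):

```python
from datetime import datetime, timedelta, timezone

# Threshold from the thread: submissions whose notBefore is more than
# 28 hours old go through a stricter bucket (cited as 40 req/s, burst 10,
# per shard).
OLD_CERT_THRESHOLD = timedelta(hours=28)

def is_old_submission(not_before, now=None):
    """True if the submitted certificate's notBefore is more than
    28 hours in the past, meaning it should be routed through the
    stricter 'old certificate' rate limit."""
    now = now or datetime.now(timezone.utc)
    return now - not_before > OLD_CERT_THRESHOLD
```

Keying the limit on notBefore rather than on submission time targets backfill-style bulk submissions of old certificates while leaving freshly issued certificates, which are under MMD pressure, on the normal path.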
In addition, we’ve increased our global rate limit for both POST and GET requests, while simultaneously lowering the per-IP limits.
We are aware that per-IP limits are easy to bypass by simply using multiple IP addresses, but we hope they may deter some heavy submitters and requesters, especially when combined with the CTFE-based rate limiting.
For both GET and POST requests, the per-IP limit is now set at 20 requests per second.
For GET requests, the global limit is set at 400 requests per second. For POST requests, the global limit is set at 200 requests per second.
We feel these are reasonable limits based on the current issuance rate of the WebPKI. We would like to raise at least the global rate limits further, but want to keep an eye on resource usage first.
Regards,
Martijn
From: 'Martijn Katerbarg' via Certificate Transparency Policy <ct-p...@chromium.org>
Date: Wednesday, 14 May 2025 at 21:42
To: James Thomas <jamesth...@proton.me>, Certificate Transparency Policy <ct-p...@chromium.org>
Subject: Re: [ct-policy] Re: Stricter rate limits enabled on Sabre and Mammoth log shards
Soon after the scheduled maintenance concluded on Saturday, some of our front-end load balancers started hitting CPU resource limits. This affected the majority of requests to our Mammoth and Elephant logs, as well as some requests from certain localities to our Sabre and Tiger logs.
More CPU resources were allocated at approximately midday UTC yesterday, after which our logs show a dramatic increase in HTTP 200 responses and a dramatic decrease in 50x responses.
Are folks seeing the effect of this improvement?