Parsing CT Logs


Shaukat

Nov 20, 2024, 4:37:43 AM
to certificate-transparency
Dear All,

I am currently working on parsing Certificate Transparency (CT) logs and developing a solution using standard tools. While implementing the process, I have encountered some unexpected behaviors in the responses from CT log servers. Below is a brief overview of my algorithm:

1. Retrieve a list of usable servers.
2. For each usable server, query the current tree_size.
3. Send requests to the endpoint ct/v1/get-entries?start={start}&end={end}, keeping the difference between start and end at 24 (a sketch of this step follows below).
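
For concreteness, here is a minimal sketch of one iteration of step 3 in Python with the requests library (the log URL is one of those listed further below; start and end are inclusive per RFC 6962, so a difference of 24 asks for 25 entries):

import requests

# One window of step 3: start/end are inclusive, so end - start = 24
# requests 25 entries.
resp = requests.get(
    "https://ct.googleapis.com/logs/eu1/xenon2025h1/ct/v1/get-entries",
    params={"start": 0, "end": 24},
    timeout=30,
)
resp.raise_for_status()
entries = resp.json()["entries"]
print(f"requested 25 entries, got {len(entries)}")  # may be fewer, see below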
During this process, I observed the following behaviors, for which I could not find adequate documentation or explanation:

1. Server Status Changes
Some servers frequently change their statuses between usable and rejected. Can CT entries still be reliably retrieved from servers marked as rejected?

2. Tree Size Decrease
On certain servers, the tree_size occasionally decreases instead of increasing. What could be the reason for this behavior, and what does it signify?

3. Limits on CT Entries per Request
It appears some servers impose limits on the number of entries returned per request. Where can I find official or detailed documentation regarding these limits?
Below are dynamically calculated limits from recent observations:

{
    "https://sphinx.ct.digicert.com/2026h2": 255,
    "https://ct.googleapis.com/logs/eu1/xenon2024": 31,
    "https://sabre2026h2.ct.sectigo.com": 255,
    "https://ct.googleapis.com/logs/eu1/xenon2025h1": 31
}
4. Inconsistent Response Limits
The number of entries returned seems to vary dynamically. For example:

A request with a range of 100 might return 50 entries.
A request with a range of 150 might return all 150 entries.
What is the best practice for efficiently retrieving CT entries given this variability?
I would greatly appreciate any guidance, clarification, or references to relevant documentation on these matters.

Best regards,
Shaukat

Philippe Boneff

Nov 20, 2024, 5:50:30 AM
to certificate-...@googlegroups.com
Hi Shaukat,

You're raising a lot of good points!


> Some servers frequently change their statuses between usable and rejected. Can CT entries still be reliably retrieved from servers marked as rejected?
Once a log is REJECTED by Chrome, its operator is no longer required to meet availability targets. Eventually the operator may turn the log down, at which point its entries won't be accessible directly from the log anymore.

> On certain servers, the tree_size occasionally decreases instead of increasing. What could be the reason for this behavior, and what does it signify?
This is very likely due to propagation delays in globally distributed systems, and it will eventually converge. Chrome's policy states that an SCT must be integrated within 24 hours, so what matters here is:
 - that all the entries within a given tree size are available 24 hours after the corresponding STH has been published;
 - that consistency of the tree is maintained. If all the STHs are consistent with one another, all good.
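
In practice, a monitor can simply treat a smaller tree_size as a stale view rather than a rollback. A rough sketch in Python (not a full verifier; a real monitor would also check STH signatures and consistency proofs):

import requests

def observed_tree_size(log_url, last_seen=0):
    # Never let the observed tree size go backwards: a smaller value from
    # one frontend is treated as stale, not as the tree shrinking.
    sth = requests.get(log_url + "/ct/v1/get-sth", timeout=30).json()
    return max(last_seen, sth["tree_size"])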


> It appears some servers impose limits on the number of entries returned per request. Where can I find official or detailed documentation regarding these limits?
I think you're looking at the right page; the "API and server behaviours" section is meant to clarify this. There are multiple reasons why responses may be truncated; some libraries know how to deal with them and others don't: that's what the "Dynamic indexes" column is about.

Cheers,
Philippe


Shaukat

Nov 20, 2024, 7:26:35 AM
to certificate-transparency
Hi Philippe,

My deepest gratitude for your prompt and accurate responses. Now I know which steps of my algorithm should be reconsidered.

Shaukat 

On Wednesday, November 20, 2024 at 13:50:30 UTC+3, phbo...@google.com wrote:

Shaukat

Nov 25, 2024, 5:21:32 AM
to certificate-transparency

Hello everyone,

I am encountering an issue while parsing CT logs. I have implemented an auto-calculation mechanism for the step size used when retrieving CT log batches. However, I am now receiving 429 responses from the CT log servers, indicating that the per-minute request limit has been exceeded.

As a result, I am falling behind in processing. Could you please suggest the best solution to avoid this issue?

Thank you for your assistance.

Best regards,
Shaukat

On Wednesday, November 20, 2024 at 15:26:35 UTC+3, Shaukat wrote:

Matt Palmer

Nov 25, 2024, 7:06:19 PM
to certificate-...@googlegroups.com
On Mon, Nov 25, 2024 at 02:21:32AM -0800, Shaukat wrote:
> I am encountering an issue while parsing CT logs. I have implemented an
> auto-calculation mechanism for the step size used when retrieving CT log
> batches.

Why? Just do what everyone else does: ask for a huge chunk, and
increment the start entry value by however many entries you get back.
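
In Python, that loop is roughly (a sketch; the chunk size is arbitrary,
since the log truncates each response to its own per-request limit):

import requests

def scrape(log_url, tree_size, start=0, chunk=1000):
    # Ask for a big range; the server truncates to its own limit, so
    # the limit never needs to be known in advance.
    while start < tree_size:
        end = min(start + chunk - 1, tree_size - 1)
        resp = requests.get(log_url + "/ct/v1/get-entries",
                            params={"start": start, "end": end},
                            timeout=30)
        resp.raise_for_status()
        entries = resp.json()["entries"]
        if not entries:
            break  # defensive: don't spin on an empty response
        yield from entries
        start += len(entries)  # advance by what actually came back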

> However, I am now facing a problem with receiving a 429 response from the
> CT log servers, indicating that the request limit per minute has been
> exceeded.
>
> As a result, I am falling behind in processing. Could you please suggest
> the best solution to avoid this issue?

Make sure you're getting as many entries per request as you possibly
can, primarily. Also, don't wait too long after getting a 429 before
trying again -- any time after the rate limit resets that you're not
making requests is time wasted.
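
For example (a sketch; it assumes that Retry-After, when present, is
given in seconds rather than as an HTTP date):

import time
import requests

def get_with_retry(url, params, min_wait=1.0):
    # Retry promptly after a 429: honor Retry-After if the server sends
    # one, otherwise wait a short fixed interval and try again.
    while True:
        resp = requests.get(url, params=params, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp
        time.sleep(float(resp.headers.get("Retry-After", min_wait)))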

Some logs do set their rate limits low enough to *almost* not allow
monitors to keep up, but none that I can see set it too low. My
scraping keeps up with publication rates; the exact algorithm I'm using
is encoded in the source (https://github.com/mpalmer/scrape-ct-log), so
you can compare against that as a means of determining what you might be
doing wrong.

- Matt

Luke Valenta

unread,
Nov 25, 2024, 7:14:07 PM11/25/24
to certificate-...@googlegroups.com
See  for tips on avoiding rate limits altogether and crawling logs more efficiently. Basically, add a sleep of 100 ms or so between requests.
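
Sketched in Python (fetch_range is a hypothetical stand-in for a
get-entries call; the constants are illustrative):

import time

CHUNK = 256          # illustrative batch size
TREE_SIZE = 100_000  # illustrative target

def fetch_range(start, end):
    ...  # hypothetical stand-in for a get-entries request

for start in range(0, TREE_SIZE, CHUNK):
    fetch_range(start, start + CHUNK - 1)
    time.sleep(0.1)  # ~100 ms pause between requests, per the tip above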

Luke Valenta
Systems Engineer - Research



Adrian Wiedemann

Dec 2, 2024, 8:19:31 AM
to certificate-...@googlegroups.com
On 26.11.24 at 01:06, Matt Palmer wrote:

>> I am encountering an issue while parsing CT logs. I have implemented an
>> auto-calculation mechanism for the step size used when retrieving CT log
>> batches.
>
> Some logs do set their rate limits low enough to *almost* not allow
> monitors to keep up, but none that I can see set it too low.

We've noticed those tight limits too. All I can say is that, at least
from a German vantage point, there are logs that are hard to keep up
with. We're aware that requests to US-based logs take longer, and we can
clearly see the latency difference between European and US-based
locations on other logs where keeping pace is not an issue. We do add
delays between subsequent requests, but that does not prevent falling
behind: the bottleneck is the total duration of each request. And since
rate limiting is deployed, parallelizing requests is detected and
blocked (via 429).

Best regards, Adrian

--
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstrasse 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99
Geschäftsführer: Christoph Fischer HRB105469 Mannheim

Matt Palmer

Dec 2, 2024, 5:42:19 PM
to certificate-...@googlegroups.com
On Mon, Dec 02, 2024 at 02:16:16PM +0100, Adrian Wiedemann wrote:
> On 26.11.24 at 01:06, Matt Palmer wrote:
>
> > > I am encountering an issue while parsing CT logs. I have implemented an
> > > auto-calculation mechanism for the step size used when retrieving CT log
> > > batches.
> >
> > Some logs do set their rate limits low enough to *almost* not allow
> > monitors to keep up, but none that I can see set it too low.
>
> We've noticed those tight limits too. All I can say is that, at least
> from a German vantage point, there are logs that are hard to keep up
> with. We're aware that requests to US-based logs take longer, and we can
> clearly see the latency difference between European and US-based
> locations on other logs where keeping pace is not an issue. We do add
> delays between subsequent requests, but that does not prevent falling
> behind: the bottleneck is the total duration of each request. And since
> rate limiting is deployed, parallelizing requests is detected and
> blocked (via 429).

My log scraper runs out of Germany, too, and while request latency is
impacted, I haven't noticed any problems with rate limits (since request
rate and service latency are orthogonal).
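
One way to act on that orthogonality, sketched with asyncio/aiohttp
(the interval, window, and function names are assumptions, and the
schedule must stay within the log's advertised rate limit; a log that
counts in-flight parallelism against you, as Adrian describes, may
still return 429s):

import asyncio
import aiohttp

async def fetch_range(session, log_url, start, end):
    async with session.get(log_url + "/ct/v1/get-entries",
                           params={"start": start, "end": end}) as resp:
        resp.raise_for_status()
        return (await resp.json())["entries"]

async def paced_scrape(log_url, starts, window=255, interval=0.25):
    # Fire requests on a fixed schedule: the request *rate* stays
    # constant even when individual round-trips are slow.
    async with aiohttp.ClientSession() as session:
        tasks = []
        for start in starts:
            tasks.append(asyncio.create_task(
                fetch_range(session, log_url, start, start + window)))
            await asyncio.sleep(interval)
        return await asyncio.gather(*tasks)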

- Matt
