Google CT log - getting entries


Cedric De Vroey

Mar 22, 2023, 7:36:33 PM
to certificate-transparency
Hi all,

Since this is my first message in this group, let me first introduce myself briefly: my name is Cedric, I'm a security researcher living in Belgium, and for my latest project I have set out to look into the Certificate Transparency logs.

For my application to work, I'm setting up my own mirror of the logs for efficient searching and querying. I have created jobs that can consume the logs from various publishers, but with the ones from Google I am struggling with the rate limiting. Can someone explain the rules to me a bit?

What I am currently seeing: I can give start and end parameters with the get-entries request, but the CT logs from Google will not allow me to fetch more than 32 entries at a go. They do allow me to run up to 16 GET requests in parallel from one IP, so if I split my requests over two IPs I can get 2 × 16 × 32 = 1024 certs from the logs per cycle. Wouldn't it be more efficient to allow get-entries requests to return a larger number of entries, for example 1024, in order to limit the number of requests needed to consume the historical backlog?
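
To make that concrete, here is roughly what one of my requests looks like, as a minimal Go sketch (the log URL and indices are just examples; the server decides how many entries it actually returns):

  package main

  import (
      "encoding/json"
      "fmt"
      "net/http"
  )

  func main() {
      // Ask for entries [0, 1023]; Google's logs currently trim this to 32.
      url := "https://ct.googleapis.com/logs/argon2023/ct/v1/get-entries?start=0&end=1023"
      resp, err := http.Get(url)
      if err != nil {
          panic(err)
      }
      defer resp.Body.Close()

      var body struct {
          Entries []json.RawMessage `json:"entries"`
      }
      if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
          panic(err)
      }
      fmt.Printf("requested 1024 entries, got %d\n", len(body.Entries))
  }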

Please do let me know if my assumptions are wrong; maybe I'm just doing something wrong, or perhaps I should rate-limit my requests in order to get 1024 entries per request. It would be interesting if someone in the know could shed some light on this.

Kind regards,

Cedric

Mohammadamin Karbasforushan (Amin Karbas)

Mar 22, 2023, 10:30:10 PM
to certificate-...@googlegroups.com
Hello,

From what I understand, the simplest way is more IPs; there are many layers of rate limits at different parts of your path into Google’s infrastructure, and it’s not the simplest thing to get around.

Also see this if you haven’t: https://community.letsencrypt.org/t/enabling-coerced-get-entries/114436

Cheers,
Amin


Cedric De Vroey

Mar 23, 2023, 7:33:00 AM
to certificate-...@googlegroups.com
Oh, I have indeed observed this behaviour. Now, how would I know what the ideal request size would be? 32? 64? 256? 1024?



On Thu, Mar 23, 2023 at 03:30, Mohammadamin Karbasforushan (Amin Karbas) <k.moham...@gmail.com> wrote:

Pierre Phaneuf

Mar 23, 2023, 8:25:59 AM
to certificate-...@googlegroups.com
There is no "ideal" request size, it depends on both the client and the server. For example, a client that would do streaming JSON parsing might not really have any limitations on the maximum number of entries in a get-entries response, but a client that buffers the data before processing it might want to limit its memory usage or the latency to receive the response. And similarly on the server side, various implementation details (which can change over time!) might dictate how many entries the server is willing to return in a single request.

The best way to think of the number of entries per get-entries request is as a negotiation: the client asks for as many entries as it is prepared to receive, and the server does its best to fulfill this request, within its own limitations (and the client should be prepared for this possibility). Note that those server limitations are not necessarily fixed, and that a server which returned 32 entries today might return 512 tomorrow (or the other way around)!

If you think your client could handle 8192 entries, given the memory and other resources available, then it should request 8192 entries and be prepared to repeat the operation if more entries are needed afterwards (for example, if the server only returned 256 entries). In many cases a client needs more entries than it is prepared to handle in a single response (for example, a million entries fetched 1024 at a time), so this logic is almost certainly necessary anyway.
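
To illustrate (a minimal sketch only, with an example log URL; real code would also want timeouts and retries), that loop could look like this in Go:

  package main

  import (
      "encoding/json"
      "fmt"
      "net/http"
  )

  type entry struct {
      LeafInput string `json:"leaf_input"`
      ExtraData string `json:"extra_data"`
  }

  // getEntries asks for [start, end]; the server may return fewer entries.
  func getEntries(logURL string, start, end int64) ([]entry, error) {
      url := fmt.Sprintf("%s/ct/v1/get-entries?start=%d&end=%d", logURL, start, end)
      resp, err := http.Get(url)
      if err != nil {
          return nil, err
      }
      defer resp.Body.Close()
      var body struct {
          Entries []entry `json:"entries"`
      }
      if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
          return nil, err
      }
      return body.Entries, nil
  }

  func main() {
      logURL := "https://ct.googleapis.com/logs/argon2023"
      start, end := int64(0), int64(8191) // ask big; the server may trim
      for start <= end {
          entries, err := getEntries(logURL, start, end)
          if err != nil || len(entries) == 0 {
              panic(fmt.Sprintf("stopped at index %d: %v", start, err))
          }
          fmt.Printf("got %d entries starting at %d\n", len(entries), start)
          start += int64(len(entries)) // continue where the server stopped
      }
  }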

Cedric De Vroey

Mar 23, 2023, 8:41:28 AM
to certificate-...@googlegroups.com
Thank you for that full explanation.

I had made my job pull a fixed 32 entries at a go, but I will change it to first try to get 1024, count the number of entries in that first response, and then use that number as the batch size for consecutive requests.

To shed some light on what I am building: I first fetch the STH for each log and store it with the log-info records I hold in a database. If the last index I have in the database is lower than the tree size, a job starts queuing get-entries requests for that log onto a Redis queue, one per batch. Then I have two container environments on two IPs that run the get-entries jobs from that Redis queue. It works pretty well, but at the rate I was going it would take a long time to fetch the entire backlog of the most prominent CT logs.
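
The queuing step, roughly (a simplified sketch assuming github.com/redis/go-redis/v9; the queue key and job format here are illustrative, not my exact code):

  package main

  import (
      "context"
      "fmt"

      "github.com/redis/go-redis/v9"
  )

  // enqueueBatches pushes one get-entries job per batch onto a Redis list,
  // covering the indices from lastStored+1 up to treeSize-1.
  func enqueueBatches(ctx context.Context, rdb *redis.Client, logURL string,
      lastStored, treeSize, batchSize int64) error {
      for start := lastStored + 1; start < treeSize; start += batchSize {
          end := start + batchSize - 1
          if end >= treeSize {
              end = treeSize - 1 // tree size is a count; the last index is treeSize-1
          }
          job := fmt.Sprintf(`{"log":%q,"start":%d,"end":%d}`, logURL, start, end)
          if err := rdb.RPush(ctx, "ct:get-entries", job).Err(); err != nil {
              return err
          }
      }
      return nil
  }

  func main() {
      rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
      err := enqueueBatches(context.Background(), rdb,
          "https://ct.googleapis.com/logs/argon2023", -1, 1_000_000, 1024)
      if err != nil {
          panic(err)
      }
  }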

Thanks for all your help and advice, much appreciated!

Kind regards,

Cedric

On Thu, Mar 23, 2023 at 13:26, 'Pierre Phaneuf' via certificate-transparency <certificate-...@googlegroups.com> wrote:

Rasmus Dahlberg

Mar 23, 2023, 9:43:25 AM
to certificate-...@googlegroups.com
Hi Cedric,

What download time do you consider to be "a long time"?

-Rasmus


Cedric De Vroey

Mar 23, 2023, 9:56:14 AM
to certificate-...@googlegroups.com
My calculations might be off, but I estimated that it would take me about 50 days to fetch all 4.5B entries from the prominent logs at a rate of 1024 entries/sec.




On Thu, Mar 23, 2023 at 14:43, Rasmus Dahlberg <rasmus.gd...@gmail.com> wrote:

Rasmus Dahlberg

Mar 23, 2023, 11:41:12 AM
to certificate-...@googlegroups.com
Argon2023 has ~0.86B entries. That means a download time of ~10 days
if you can handle 1024 entries/s (0.86e9 / 1024 ≈ 840,000 seconds ≈
9.7 days). In my experience, this is what you can expect from
Argon2023 on a single-IP machine with concurrent fetchers that back
off exponentially on rate-limit errors.

You may find this part of the CT/go repo useful:

  https://github.com/google/certificate-transparency-go/blob/master/scanner/scanner.go#L314-L326
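
And as a minimal illustration of the back-off idea (a simplified sketch, not the linked scanner code; the URL is just an example):

  package main

  import (
      "encoding/json"
      "fmt"
      "net/http"
      "time"
  )

  // getEntriesWithBackoff retries on HTTP 429, doubling the wait each time
  // (capped at one minute), and decodes the entries otherwise.
  func getEntriesWithBackoff(url string) ([]json.RawMessage, error) {
      wait := time.Second
      for {
          resp, err := http.Get(url)
          if err != nil {
              return nil, err
          }
          if resp.StatusCode == http.StatusTooManyRequests {
              resp.Body.Close()
              time.Sleep(wait)
              if wait < time.Minute {
                  wait *= 2 // exponential back-off on rate-limit errors
              }
              continue
          }
          var body struct {
              Entries []json.RawMessage `json:"entries"`
          }
          err = json.NewDecoder(resp.Body).Decode(&body)
          resp.Body.Close()
          if err != nil {
              return nil, err
          }
          return body.Entries, nil
      }
  }

  func main() {
      url := "https://ct.googleapis.com/logs/argon2023/ct/v1/get-entries?start=0&end=1023"
      entries, err := getEntriesWithBackoff(url)
      if err != nil {
          panic(err)
      }
      fmt.Printf("got %d entries\n", len(entries))
  }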

FWIW I consider 10 days a very fast download time. Unless there is a
good motivation for downloading faster, I would not recommend that a
research project try to side-step rate limits with multiple IPs!

-Rasmus

Kurt Roeckx

Mar 23, 2023, 12:29:51 PM
to certificate-...@googlegroups.com, Rasmus Dahlberg
In my experience, 1024/s is unrealistic for some of Google's logs. If I remember correctly, Argon is slow and Xenon is fast. Xenon is located in the EU, and so am I, but Argon is in the US and slow.


Kurt

Cedric De Vroey

Mar 23, 2023, 1:37:05 PM
to certificate-...@googlegroups.com
OK, going from your feedback I conclude that the way I'm approaching it now is pretty much how it works, and that I will just need to sit out the ride and wait. It already helps to know that I am not doing something completely wrong.

Thanks all!

Kind regards,
Cedric



On Thu, Mar 23, 2023 at 17:29, Kurt Roeckx <ku...@roeckx.be> wrote:

Pierre Phaneuf

Apr 28, 2023, 11:37:16 AM
to certificate-...@googlegroups.com
On Thu, 23 Mar 2023 at 12:41, 'Cedric De Vroey' via certificate-transparency <certificate-...@googlegroups.com> wrote:

This came back on topic recently, which made me notice something that should be clarified, if only for posterity (and future reference!)...

> I had made my job pull a fixed 32 entries at a go, but I will change it to first try to get 1024, count the number of entries in that first response, and then use that number as the batch size for consecutive requests.

I would recommend NOT using the number of entries received in that first response for the following requests, because the server is allowed to adjust the number of entries it returns on a per-request basis.

For example, compare the results from these two requests for the same number of entries on the same log:

https://ct.googleapis.com/logs/argon2023/ct/v1/get-entries?start=1018820000&end=1018821000
https://ct.googleapis.com/logs/argon2023/ct/v1/get-entries?start=1018820031&end=1018821031

The number of entries returned can vary, whether deterministically (like here, due to some implementation details) or non-deterministically (based on database replication state, memory usage, server load, or other such variable factors).
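
You can see it for yourself with a small sketch like this (Go; the counts depend on the log's state when you run it):

  package main

  import (
      "encoding/json"
      "fmt"
      "net/http"
  )

  // count fetches one get-entries URL and returns how many entries came back.
  func count(url string) int {
      resp, err := http.Get(url)
      if err != nil {
          panic(err)
      }
      defer resp.Body.Close()
      var body struct {
          Entries []json.RawMessage `json:"entries"`
      }
      if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
          panic(err)
      }
      return len(body.Entries)
  }

  func main() {
      base := "https://ct.googleapis.com/logs/argon2023/ct/v1/get-entries"
      // Both requests ask for the same number of entries, offset by 31.
      fmt.Println(count(base + "?start=1018820000&end=1018821000"))
      fmt.Println(count(base + "?start=1018820031&end=1018821031"))
  }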

So the best thing for a client to do is to request the number of entries it is comfortable handling in a single response (more entries usually improve throughput, but may increase latency and the client-side memory needed to buffer the larger responses), and follow up with further requests as needed.

If your client can handle 1024 entries, keep requesting that amount (unless fewer entries are needed, of course!), let the server limit the number of entries, and if conditions on the server improve, you might see the throughput of your system improve automatically, without any intervention! :-)

See this example code: https://go.dev/play/p/__GHoay-5vR



Cedric De Vroey

Apr 29, 2023, 2:10:57 PM
to certificate-transparency
That doesn't work for me, since I start multiple downloads in parallel and therefore need to keep track of the number of entries coming in. To be honest, the variable number of records in the responses is a serious pain for this reason; it creates all kinds of operational inefficiencies on the client side.

On Friday, April 28, 2023 at 17:37:16 UTC+2, ppha...@google.com wrote:

Ben Laurie

Apr 30, 2023, 6:24:41 AM
to certificate-...@googlegroups.com
On Sat, 29 Apr 2023 at 19:09, 'Cedric De Vroey' via certificate-transparency <certificate-...@googlegroups.com> wrote:
> That doesn't work for me, since I start multiple downloads in parallel and therefore need to keep track of the number of entries coming in. To be honest, the variable number of records in the responses is a serious pain for this reason; it creates all kinds of operational inefficiencies on the client side.

This is pretty easy to deal with, surely; e.g., have one client fetch even-numbered batches of 102,400 entries and the other odd-numbered ones.
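
As a rough sketch (Go; the numbers are just examples): batch i covers [i*batchSize, i*batchSize+batchSize-1], and worker w takes every batch with i % nWorkers == w.

  package main

  import "fmt"

  // workerRanges returns the [start, end] index ranges assigned to one
  // worker by striping fixed-size batches across nWorkers workers.
  func workerRanges(workerID, nWorkers, treeSize, batchSize int64) [][2]int64 {
      var ranges [][2]int64
      for i := workerID; i*batchSize < treeSize; i += nWorkers {
          start := i * batchSize
          end := start + batchSize - 1
          if end >= treeSize {
              end = treeSize - 1
          }
          ranges = append(ranges, [2]int64{start, end})
      }
      return ranges
  }

  func main() {
      // Two workers splitting 1,000,000 entries into 102,400-entry batches:
      // worker 0 gets the even-numbered batches, worker 1 the odd-numbered.
      for w := int64(0); w < 2; w++ {
          fmt.Println(w, workerRanges(w, 2, 1_000_000, 102_400))
      }
  }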
 


Pierre Phaneuf

May 2, 2023, 7:54:39 AM
to certificate-...@googlegroups.com
As Ben says, this isn't that difficult, and it is in any case absolutely necessary: since this is how the protocol is defined, handling this possibility is not optional.

We have a few parallel fetching systems, and essentially they all split up the entries to fetch into batches in whatever way they judge appropriate (some systems want the oldest entries first as much as possible, others just want to get entries as quickly as possible), and within each of those batches, the algorithm I described in my previous message is used.

