Infrastructure of crt.sh


Sergio Garcia

Aug 4, 2023, 2:42:08 AM
to crt.sh
Hi everyone,

I was looking at the source code for crt.sh and was very impressed by how much of the work is done just using PostgreSQL. I was wondering whether any details of the hardware powering the main DB are published anywhere.

Thanks,

Sergio Garcia

r...@sectigo.com

Aug 18, 2023, 7:05:57 AM
to crt.sh
Hi Sergio.  We haven't published details anywhere, AFAIK.

The primary DB is currently running on a bare metal server with 16x "Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz" CPUs and 96GB RAM, according to /proc/cpuinfo.  Does that help?

Sergio Garcia

Aug 18, 2023, 4:04:42 PM
to crt.sh
Thanks for the info.

Bret McGee

Jan 9, 2024, 10:26:21 AM
to crt.sh
I've been looking to work on CT logs as a personal project for a long time and have found that even my "home" hardware has worked well so far. Disk storage space is the main need for me: 30 TiB of raw response data and counting, ready for import at some point.

I was looking at MariaDB, but it seems you can index expressions in PostgreSQL, which looks very interesting (I've never used PG before; I'm an MS SQL Server user). Possibly that would save a lot of space, given that I want to keep the original leaf verbatim as well as index most (all?) of its attributes.
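To make that concrete, here's a rough sketch of the kind of thing I have in mind - Python and psycopg2 purely for illustration, with an invented schema and connection string, and assuming PostgreSQL 11+ for the built-in sha256():

import hashlib
import psycopg2  # assumption: psycopg2 is installed and a "ct" database exists

DDL = """
CREATE TABLE IF NOT EXISTS leaf (
    id  bigserial PRIMARY KEY,
    der bytea NOT NULL                    -- the original leaf, byte-for-byte
);
-- Index the expression itself; no duplicated fingerprint column is stored.
CREATE INDEX IF NOT EXISTS leaf_sha256_idx ON leaf (sha256(der));
"""

with psycopg2.connect("dbname=ct") as conn, conn.cursor() as cur:
    cur.execute(DDL)
    # A query that repeats the indexed expression can use leaf_sha256_idx.
    fp = hashlib.sha256(b"example DER bytes").digest()
    cur.execute("SELECT id FROM leaf WHERE sha256(der) = %s", (fp,))
    print(cur.fetchall())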

Out of interest how much storage does crt.sh use?

(I found the logs are very large and extra_data is, of course, highly redundant, so careful design can minimise storage costs.)

As an aside, I have found massive differences in performance and behaviour between the various logs/operators (around what triggers an HTTP 429 response). Cloudflare takes anything you can throw at it, whilst DigiCert is the most fragile.
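Roughly the sort of back-off I mean (a Python sketch only - the log URL and timings are placeholders, and each operator's real limits clearly differ):

import time
import requests

LOG = "https://ct.example.com/log"   # placeholder, not a real log URL

def get_entries(start, end, max_retries=8):
    # Fetch one RFC 6962 get-entries page, backing off naively on HTTP 429.
    delay = 1.0
    for _ in range(max_retries):
        r = requests.get(f"{LOG}/ct/v1/get-entries",
                         params={"start": start, "end": end}, timeout=30)
        if r.status_code == 429:
            # Honour Retry-After if the operator sends one, otherwise double up.
            delay = float(r.headers.get("Retry-After", delay * 2))
            time.sleep(delay)
            continue
        r.raise_for_status()
        return r.json()["entries"]
    raise RuntimeError("still throttled after retries")

print(len(get_entries(0, 255)))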

r...@sectigo.com

Jan 9, 2024, 10:40:55 AM
to crt.sh
> Out of interest how much storage does crt.sh use?

Currently, 28 terabytes (uncompressed).

Bret McGee

Jan 9, 2024, 11:00:58 AM
to crt.sh
Thanks Rob, I hope to get some time to work on this project this year!

Sergio Garcia

Jan 9, 2024, 11:18:34 AM
to Bret McGee, crt.sh
Bret,

In this file you can check how much concurrency you can throw at each server before being throttled:


Please note that the page size here is not exactly the number of items you get when calling the API, due to coerced paging (e.g. if the coerced page size is 32 and you ask for 32 starting at 16, you will get only 16). Please look at this post for more details:


If the server doesn't document its coerced page size, just try a big enough request starting at 0; the number of items you get back is the coerced page size.
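For example (a Python sketch only; the log URL is a placeholder):

import requests

log = "https://ct.example.com/log"   # placeholder, not a real log URL

resp = requests.get(f"{log}/ct/v1/get-entries",
                    params={"start": 0, "end": 4095}, timeout=30)
resp.raise_for_status()
# However many entries come back is the log's coerced page size.
print("coerced page size:", len(resp.json()["entries"]))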

Also, take a look at the project [https://github.com/sergiogarciadev/ctmon] for downloading entries: it can grab the maximum number of items from all CT logs at maximum speed, using multiple IPs (a feature that is only needed if your local copy is too far behind), and it is simple enough to be easy to understand.

Also, if you are using PostgreSQL with libx509pq (https://github.com/crtsh/libx509pq) and a ZFS filesystem with compression, you only need a small server with enough storage. Your pain point will be data ingestion (and never try to reindex it), since PostgreSQL index maintenance doesn't parallelize well; this can be mitigated with partitioning, but for old data you will still have tons of data in each partition.
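For illustration only (this is not the crt.sh schema - the names and date ranges are invented), partitioning looks roughly like this:

import psycopg2  # assumption: psycopg2 installed, PostgreSQL 11+ and a "ct" database

DDL = """
CREATE TABLE IF NOT EXISTS ct_entry (
    entry_index bigint NOT NULL,
    logged_at   timestamptz NOT NULL,
    der         bytea NOT NULL
) PARTITION BY RANGE (logged_at);

CREATE TABLE IF NOT EXISTS ct_entry_2023 PARTITION OF ct_entry
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
CREATE TABLE IF NOT EXISTS ct_entry_2024 PARTITION OF ct_entry
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

-- An index declared on the parent is created on each partition separately.
CREATE INDEX IF NOT EXISTS ct_entry_logged_at_idx ON ct_entry (logged_at);
"""

with psycopg2.connect("dbname=ct") as conn, conn.cursor() as cur:
    cur.execute(DDL)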





Bret McGee

Jan 9, 2024, 11:41:09 AM
to crt.sh
Very interesting, Sergio. Yes, I had worked out the page sizes quite quickly. I suspect the concurrency might be slightly different for xenon2024, however. Xenon's behaviour seems to be more coarse-grained, in that I think it might limit transfer size (or the number of requests) over a given window of time. For example, I sometimes have to back off for well over a minute before it will allow ANY new connections. Yeti2024 is by far the slowest, averaging just 20 Mb per second.

Thanks for your comment about coerced entries - I'll have a good read later, but I had noticed that too. My downloads now always start on a multiple-of-4096 boundary, which seems to work for all operators (page sizes of 32, 256 and 1024), and you always get full pages. (This is also how I've been storing the raw data - in gzipped chunks of 4096 results.)
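Sketched in Python (not my actual code), the alignment and chunk storage boil down to something like:

import gzip
import json

CHUNK = 4096   # a multiple of every coerced page size seen so far (32, 256, 1024)

def chunk_bounds(entry_index):
    # The (start, end) range of the 4096-entry chunk containing entry_index.
    start = (entry_index // CHUNK) * CHUNK
    return start, start + CHUNK - 1

def store_chunk(start, entries, directory="."):
    # entries: the raw JSON records for exactly one full chunk, in order.
    path = f"{directory}/entries_{start:012d}.json.gz"
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(entries, f)
    return path

print(chunk_bounds(10_000))   # -> (8192, 12287)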

Having the verbatim responses means I can re-load as many times as I like as the database design and code progress, since I don't want to be downloading the logs again and again (wasteful and just too slow). I just need to start storing some tree heads and proofs, as it is taking my system about 12 hours per billion leaves to verify (single thread and a mechanical drive).

My current code (only a proof of concept, really) is in C#/.NET, and I'm using zlib to compress the raw leaves (first removing the Base64 encoding) before inserting.
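Translated to Python for illustration (my real code is the C#/.NET equivalent), the packing step is roughly:

import base64
import zlib

def pack_leaf(leaf_input_b64):
    # Decode the Base64 first so zlib works on the raw bytes rather than
    # the ~33%-larger Base64 text.
    return zlib.compress(base64.b64decode(leaf_input_b64), 9)

def unpack_leaf(blob):
    return base64.b64encode(zlib.decompress(blob)).decode("ascii")

leaf_input = base64.b64encode(b"\x00" * 64).decode("ascii")   # stand-in leaf
packed = pack_leaf(leaf_input)
assert unpack_leaf(packed) == leaf_input
print(len(leaf_input), "->", len(packed), "bytes")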