New, faster CT log monitor application

Rob Stradling

Sep 3, 2018, 7:18:59 AM
to crt.sh
Over the past few months, crt.sh has struggled to keep up with the ever-increasing combined growth rate of the monitored CT logs.  To rescue crt.sh from becoming irrelevant, I've completely rewritten the https://github.com/crtsh/ct_monitor application in Go.  The new ct_monitor has been running for the past 3 weeks or so, and the total backlog (see https://crt.sh/monitored-logs) - which at one time exceeded 200,000,000 log entries(!) - has been gradually coming down.

m...@alexbouma.me

Sep 3, 2018, 12:09:11 PM
to crt.sh
Hey Rob,

Happy to see the backlog slowly but surely slinking back to zero!

I am running a mirror of crt.sh and am wondering if you are planning to open source the new monitor at some point... :)

On Monday, September 3, 2018 at 13:18:59 UTC+2, Rob Stradling wrote:

Rob Stradling

Sep 5, 2018, 9:04:30 AM
to crt.sh
Hi Alex.

Firstly, well done on getting a mirror of crt.sh running given the complete lack of documentation.  :-D

Yes, I'll open-source the new monitor soon.

Alex Bouma

Sep 5, 2018, 9:08:00 AM
to crt.sh
Thanks, it definitely was not easy to figure out ;)

It worked really well until the backlog exploded (at first I thought it was my instance not having enough I/O). Happy to see I can continue running it with the new monitor (if I can get that to work :D)!

On Wednesday, September 5, 2018 at 15:04:30 UTC+2, Rob Stradling wrote:

testtlnls

Sep 8, 2018, 3:00:13 AM
to crt.sh
The CT log monitor has stopped.

On Monday, September 3, 2018 at 8:18:59 PM UTC+9, Rob Stradling wrote:

Rob Stradling

Sep 10, 2018, 5:35:04 AM
to crt.sh
Actually, crt.sh's log monitor is still running just fine.

The problem is that the replication from the master database to the front-end slave databases stalled over the weekend.  Our ops team are fixing it.

Rob Stradling

Sep 10, 2018, 12:57:15 PM
to crt.sh
For reasons unknown, a bunch of WALs have been prematurely deleted from the backup server that we use to ship them out to the front-end datacentres.  This means that the replication to the current slave database instances is irreparably broken.

To fix this, our ops team are currently taking a fresh backup of the master database, after which they will stand up some fresh slave database instances.  This may take 2 or 3 days to complete, I'm afraid.
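(For anyone running a streaming replica of their own mirror, a quick and generic way to see whether WAL replay has stalled on a standby is the sketch below; this is standard PostgreSQL, nothing crt.sh-specific.)

-- Run on the standby: how long ago the most recently replayed transaction was committed.
-- A value that keeps growing while the master is busy suggests replay has stalled.
SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;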

Ben Sjoberg

Sep 10, 2018, 2:07:39 PM
to crt.sh
Thanks for keeping us updated! :)

Rob Stradling

Oct 8, 2018, 8:09:49 AM
to crt.sh

Alex Bouma

Oct 9, 2018, 4:54:16 AM
to crt.sh
Hi Rob,

Awesome, looking good! I do however find that my backlog is still going the wrong way :)

This seems to process with only 1 thread (at least, only 1 thread is doing the processing work on the database).

baseconfigfile:default.toml conninfo:<redacted> connopen:8 connidle:2 connlife:10m0s interval:1s batch:5000 concurrent:100 chunk:500 httptimeout:30s

I thought opening more connections would add more processing power, but only one core of the database machine is working. Could you maybe explain a bit more about what the parameters configure and what I should aim for?

Before, I was running:

cat /var/lib/postgresql/ct_monitor/logs.txt | parallel -j 10 timeout 600s /var/lib/postgresql/ct_monitor/ct_monitor {}

This would run 10 jobs on the 10-core machine, nicely eating up its available CPU and I/O.

However, this new one does seem to use _a lot_ more network & I/O, which would suggest it's faster, so maybe my machine is just way too underpowered to even begin doing this right :)

On Monday, October 8, 2018 at 14:09:49 UTC+2, Rob Stradling wrote:

Rob Stradling

Oct 9, 2018, 6:51:46 AM
to crt.sh
Hi Alex.  I've found that bloat on some of the small, regularly used tables can affect the performance of ct_monitor, so I've been doing the following fairly regularly:

-- Rewriting each table in the order of its index also reclaims the bloat:
CLUSTER ca USING ca_uniq;
CLUSTER ct_log USING ctl_pk;
CLUSTER crl USING crl_pk;
CLUSTER ocsp_responder USING or_pk;

Does that help?

Rob Stradling

Oct 9, 2018, 8:14:51 AM
to crt.sh
Newly added certificates are immediately run through the 3 linters (cablint, x509lint and zlint), which is relatively time consuming.  If this isn't useful for you, you could disable it by commenting out this PERFORM query:
https://github.com/crtsh/certwatch_db/blob/master/process_new_entries.fnc#L167
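(In case it helps, the change is just commenting out that one statement inside the PL/pgSQL function, roughly as sketched below; the call shown here is illustrative only, so check the linked line for the exact PERFORM statement and its arguments.)

-- Inside process_new_entries.fnc, comment out the statement that invokes the linting, e.g.:
--   PERFORM lint_new_certificate(t_certificateID);  -- hypothetical name and argument
-- The rest of the function can stay as-is.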

I'm currently working on implementing the linting within a new goroutine in ct_monitor.go (rather than on the DB server).  Once I've completed this, I'm hoping to be able to re-enable the linting for newly added certs issued by Let's Encrypt (see https://github.com/crtsh/certwatch_db/issues/40).

Alex Bouma

Oct 10, 2018, 6:11:14 AM
to crt.sh
Hi Rob,

Many thanks for the pointers.

1. Those CLUSTER commands do not seem to have any effect. I ran them about every 60 minutes for a day to test, but that had no effect on CPU/I/O/throughput.

2. I was actually never running the linters, since that was the only part I could never get working (and I also don't need them), so I had already disabled the linting in that query.

On Tuesday, October 9, 2018 at 14:14:51 UTC+2, Rob Stradling wrote:

Rob Stradling

Oct 11, 2018, 9:30:51 AM
to crt.sh
Hi Alex.  Then I think your machine is just underpowered for this task, I'm afraid.

Alex Bouma

Oct 13, 2018, 4:15:48 PM
to crt.sh
Hi Rob,

Again, thanks for all the info, and for the application in the first place. I figured out what the problem was; you might like this :P

I was still using the old backlog count (or rather, the latest_entry_id field in the ct_log table) to calculate my backlog, and that is why it looked like it was going up instead of down... the tree_size was still updating, but the latest_entry_id was not, since the new monitor does not use that field...

I noticed it after I changed my app to use the crt.sh public DB for a bit, and the backlog was as high as the tree_size, since the latest_entry_id field was removed from your DB (defaulting to 0 in my calculations)... I feel a bit stupid now (note to self: read the commit logs).

So after changing some things around and using the correct counting method, the backlog is going down. Not too fast (ingesting about 150k rows into Postgres per hour), but it's definitely going down, and faster than before (it used to be <10k rows per hour) :)

So... sorry for wasting your time; I could have figured this out myself. But I did implement the CLUSTER commands just in case, as they can't hurt.
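(For anyone else mirroring: the corrected calculation I ended up with is roughly the sketch below. The tree_size comes from the ct_log table, and progress is inferred from the entries already ingested locally; the ct_log_entry columns shown are assumptions, so adjust them to your own schema.)

-- Rough per-log backlog: how far the local mirror is behind each log's tree_size.
-- Assumes a ct_log_entry table with (ct_log_id, entry_id); column names may differ.
SELECT ctl.name,
       ctl.tree_size - (COALESCE(MAX(ctle.entry_id), -1) + 1) AS backlog
  FROM ct_log ctl
       LEFT JOIN ct_log_entry ctle ON ctle.ct_log_id = ctl.id
 GROUP BY ctl.id, ctl.name, ctl.tree_size
 ORDER BY backlog DESC;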

On Thursday, October 11, 2018 at 15:30:51 UTC+2, Rob Stradling wrote:

Rob Stradling

Oct 16, 2018, 1:19:30 PM
to crt.sh
On Saturday, October 13, 2018 at 9:15:48 PM UTC+1, Alex Bouma wrote:
Hi Rob,

Again, thanks for all the info, and for the application in the first place. I figured out what the problem was; you might like this :P

I was still using the old backlog count (or rather, the latest_entry_id field in the ct_log table) to calculate my backlog, and that is why it looked like it was going up instead of down... the tree_size was still updating, but the latest_entry_id was not, since the new monitor does not use that field...

:-)
 

I noticed it after I changed my app to use the crt.sh public DB for a bit, and the backlog was as high as the tree_size, since the latest_entry_id field was removed from your DB (defaulting to 0 in my calculations)... I feel a bit stupid now (note to self: read the commit logs).

Sorry that the code and the commit logs are the only documentation.


So after changing some things around and using the correct counting method, the backlog is going down. Not too fast (ingesting about 150k rows into Postgres per hour), but it's definitely going down, and faster than before (it used to be <10k rows per hour) :)

:-)
 

So... sorry for wasting your time; I could have figured this out myself. But I did implement the CLUSTER commands just in case, as they can't hurt.

No problem.  Glad you figured it out.

Jaime Hablutzel

Dec 22, 2023, 11:21:56 AM
to crt.sh
Hi Alex. Could you share some pointers on how you set up a crt.sh mirror (absolute PostgreSQL newbie here)? Was it a full mirror? I see that the database is currently around 25TB (according to `pg_database_size`).
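(For reference, the check I mean is just the sketch below; the database name is an assumption, so substitute whatever your instance uses.)

-- Human-readable on-disk size of the database; 'certwatch' is an assumed name.
SELECT pg_size_pretty(pg_database_size('certwatch'));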

Hi Rob, is there currently any documented method to get a full crt.sh mirror, or a usable subset of it, for local tests involving long queries? Are you publishing snapshots or something like that anywhere?

Regards.

r...@sectigo.com

Dec 22, 2023, 11:47:55 AM
to crt.sh
No, I'm afraid there's no documented method for mirroring the database, and we don't publish any snapshots.