
Sabre 2025h1 MMD violation


Andrew Ayer

Oct 21, 2024, 12:05:53 PM
to ct...@sectigo.com, Certificate Transparency Policy
Sabre 2025h1 is incorporating entries with a delay of about 28 hours;
e.g. entry 68595972 has a timestamp of 1729425110336 (2024-10-20
11:51:50+00:00), but the STH at that size (attached) has a timestamp of
1729525724217 (2024-10-21 15:48:44+00:00). My monitor first observed
this approximately 18 hours ago when the merge delay was around 26
hours. Maybe the log should be temporarily made read-only until the
backlog of unincorporated entries is under control?

Regards,
Andrew
sth.json
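
The roughly 28-hour figure can be reproduced from the two timestamps quoted in the message. This is a minimal sketch, not Andrew's monitoring code: the helper names `to_utc` and `merge_delay_hours` are made up for illustration, and the 24-hour MMD is an assumption based on the value commonly required of CT logs.

```python
from datetime import datetime, timezone

MMD_HOURS = 24  # assumed maximum merge delay (not stated in the thread)


def to_utc(ms: int) -> datetime:
    """Convert a CT timestamp (milliseconds since the Unix epoch) to UTC."""
    return datetime.fromtimestamp(ms // 1000, tz=timezone.utc)


def merge_delay_hours(entry_ms: int, sth_ms: int) -> float:
    """Hours between an entry's timestamp and the STH that incorporates it."""
    return (sth_ms - entry_ms) / 3_600_000


entry_ms = 1729425110336  # timestamp of entry 68595972, from the message
sth_ms = 1729525724217    # timestamp of the STH at that tree size

delay = merge_delay_hours(entry_ms, sth_ms)
print("entry:", to_utc(entry_ms).isoformat())
print("sth:  ", to_utc(sth_ms).isoformat())
print(f"merge delay: {delay:.2f} h, exceeds MMD: {delay > MMD_HOURS}")
```

The two timestamps differ by 100,613,881 ms, which is just under 28 hours and therefore past a 24-hour MMD.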

Joe DeBlasio

Oct 21, 2024, 12:12:39 PM
to Andrew Ayer, ct...@sectigo.com, Certificate Transparency Policy
Thanks, Andrew. We have also observed this, via both our SCT auditing and standard compliance monitoring systems. We agree that preventing the log from issuing new SCTs until the issue can be resolved would be prudent.

(And as always, once this is resolved, we'd love a postmortem of what happened, posted to ct-policy@.)

Best,
Joe


Matthew McPherrin

Oct 21, 2024, 12:38:14 PM
to Joe DeBlasio, Andrew Ayer, ct...@sectigo.com, Certificate Transparency Policy
Let's Encrypt will stop submitting now, which should hopefully provide some breathing room to the log if it's just overloaded.

Martijn Katerbarg

Oct 21, 2024, 12:45:08 PM
to Matthew McPherrin, Joe DeBlasio, Andrew Ayer, #CTOps, Certificate Transparency Policy

All,

Just acknowledging we're aware of this and are currently investigating.

> Let's Encrypt will stop submitting now, which should hopefully provide some breathing room to the log if it's just overloaded.

Thank you! That is our preliminary thought.

Regards,
Martijn Katerbarg
Sectigo

Martijn Katerbarg

Oct 21, 2024, 3:33:26 PM
to Matthew McPherrin, Joe DeBlasio, Andrew Ayer, #CTOps, Certificate Transparency Policy

All,

Just a short update. Around 17:10 UTC we blocked all POST calls to sabre2025h1 in order for it to catch up. We will follow up later with more details and announce when it will once again be available.

Regards,
Martijn Katerbarg
Sectigo

Martijn Katerbarg

Oct 23, 2024, 4:53:18 AM
to Matthew McPherrin, Joe DeBlasio, Andrew Ayer, #CTOps, Certificate Transparency Policy

All,

The backlog has been cleared, and we’ve been able to add additional resources to sabre2025h1 so it should be able to cope with the volume going forward. As of 07:54 UTC, Sabre2025h1 is once again fully operational.

Regards,
Martijn Katerbarg
Sectigo

Joe DeBlasio

Oct 25, 2024, 3:40:05 PM
to Martijn Katerbarg, Matthew McPherrin, Andrew Ayer, #CTOps, Certificate Transparency Policy
Thanks, Martijn!
  • Was this issue caught by any internal monitoring, or was Andrew's email the first time you became aware of the issue? If the latter, are you able to add monitoring to detect backups before violating MMDs?
  • Presumably this occurred due to a gradually-increasing load on the log over time (or was it something bursty?). You mention that you were able to give additional resources to the log -- if this overloading was due to a gradual increase of load over time, do you have an estimate of how much runway these additional resources should buy you?
Thanks,
Joe

Rob Stradling

Oct 30, 2024, 6:30:40 AM
to Certificate Transparency Policy, Joe DeBlasio, Matthew McPherrin, Andrew Ayer, #CTOps, Certificate Transparency Policy, Martijn Katerbarg
Hi Joe.

Andrew's email was the first time we became aware of the issue.  We were already using Prometheus to record the relevant Trillian metrics, but we lacked effective alerting that would have drawn our attention to the issue sooner.

The attached images show Sabre2025h1's merge delay over the past 7 days and 30 days.  It seems likely that 90-day certificates being in scope for Sabre2025h1 since October 3rd is a major factor, although we don't know why it took nearly two weeks after that date for the sequencing backlog to start to get out of control.  We don't recall facing such large sequencing backlogs with any previous log shards since we moved from SuperDuper to Trillian.

It's not yet clear what difference, if any, the extra resources we've allocated to Sabre2025h1 will make.  Early signs, as you can see from the attached images, are that Sabre2025h1 is still struggling.

It seems clear that the MariaDB database backing Sabre2025h1 is the bottleneck.  In a previous incident report earlier this year, I mentioned that we're keen to move away from MariaDB and that we had already started work on implementing a PostgreSQL backend for Trillian.  That effort reached a milestone yesterday: PR #3644 is now ready for review.  I've not yet been able to directly compare the sequencing throughput of the MySQL/MariaDB backend against the PR #3644 PostgreSQL backend, but the integration tests show something that makes me hopeful: the PostgreSQL QPS is ~10X higher than the CockroachDB QPS!  (PR #3644 uses various PostgreSQL performance tricks I've learnt through implementing several iterations of crt.sh's log monitor :-) ).
sabre2025h1_sequencermergedelay_7days.png
sabre2025h1_sequencermergedelay_30days.png
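
One way to close the alerting gap described in this message is to compare the observed merge delay against fractions of the MMD, so that an alert fires well before a violation. The sketch below is illustrative only: the function name `classify_merge_delay`, the 25% and 50% thresholds, and the 24-hour MMD are assumptions rather than Sectigo's configuration, and the delay value is assumed to come from whatever merge-delay metric the operator already records.

```python
MMD_SECONDS = 24 * 60 * 60  # assumed 24-hour MMD


def classify_merge_delay(
    delay_seconds: float,
    warn_fraction: float = 0.25,  # hypothetical early-warning threshold
    page_fraction: float = 0.50,  # hypothetical paging threshold
) -> str:
    """Map an observed merge delay to an alert level relative to the MMD."""
    if delay_seconds < 0:
        raise ValueError("merge delay cannot be negative")
    ratio = delay_seconds / MMD_SECONDS
    if ratio >= 1.0:
        return "violation"
    if ratio >= page_fraction:
        return "page"
    if ratio >= warn_fraction:
        return "warn"
    return "ok"


# 26 h is the delay Andrew's monitor first observed; the other
# inputs are made-up values chosen to exercise each level.
for hours in (1, 7, 14, 26):
    print(f"{hours:>2} h -> {classify_merge_delay(hours * 3600)}")
```

With thresholds like these, a backlog would raise a warning at 6 hours and page someone at 12 hours, leaving time to react before the 24-hour limit.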

Joe DeBlasio

Oct 31, 2024, 4:56:23 PM
to Rob Stradling, Certificate Transparency Policy, Matthew McPherrin, Andrew Ayer, #CTOps, Martijn Katerbarg
Thanks, Rob. A postgres backend does seem encouraging, though I'm a bit worried about sabre2025h1's ability to keep it together long enough for a migration to occur. Our merge delay measurements of sabre2025h1 track with yours, with the most recent delays now around 13.5 hours. If the trend continues, the log will start violating MMDs again by Monday. I'd encourage you to check in with TrustFabric, and maybe Let's Encrypt, to see if there are any other tips for squeezing performance out of the current database, but in any event, please do keep us informed.

Joe 
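
A projection like the one in this message amounts to a linear extrapolation of the merge delay. The sketch below shows the arithmetic only. The 13.5-hour starting point comes from the message, but the growth rate is a hypothetical placeholder (the thread does not state one), and both the function name `hours_until_mmd` and the 24-hour MMD are assumptions.

```python
MMD_HOURS = 24.0  # assumed 24-hour MMD


def hours_until_mmd(current_delay_hours: float, growth_per_hour: float) -> float:
    """Hours until the merge delay reaches the MMD, assuming linear growth.

    growth_per_hour is how many hours of delay are added per wall-clock hour.
    """
    if growth_per_hour <= 0:
        raise ValueError("delay is not growing; no violation is projected")
    remaining = MMD_HOURS - current_delay_hours
    return max(remaining, 0.0) / growth_per_hour


current_delay = 13.5       # from the message
hypothetical_growth = 0.12  # placeholder value, NOT a measurement from the thread

eta = hours_until_mmd(current_delay, hypothetical_growth)
print(f"projected time to MMD violation: {eta:.1f} h ({eta / 24:.1f} days)")
```

Under that placeholder rate the remaining 10.5 hours of headroom would be used up in a little under four days; a real estimate would need the slope taken from the operator's own merge-delay measurements.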
