Yeti CT log database issue

Jeremy Rowley

Sep 28, 2022, 4:43:56 PM
to Certificate Transparency Policy
Hey all, 

Just a heads up that something went wrong while we were preparing to move Yeti to a new, larger cluster to handle additional load. We don't believe the existing database is recoverable. We've stopped accepting new log submissions and think Yeti should be removed as a viable log. I'll post an update when our investigation is complete.

Jeremy

Clint Wilson

Sep 28, 2022, 5:17:01 PM
to Jeremy Rowley, Certificate Transparency Policy
Hi Jeremy,

Does this impact all Yeti shards (2022-2 [Usable], 2023 [Usable], 2024 [Qualified])?

Thank you!
-Clint

--
You received this message because you are subscribed to the Google Groups "Certificate Transparency Policy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ct-policy+...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/ct-policy/d1decbf3-6ad9-4bd9-9834-813821c799b3n%40chromium.org.

Jeremy Rowley

Sep 28, 2022, 5:53:43 PM
to Clint Wilson, Certificate Transparency Policy
Yes. It impacts all Yeti shards.

Joe DeBlasio

Sep 28, 2022, 7:39:10 PM
to Jeremy Rowley, Clint Wilson, Certificate Transparency Policy
Hi Jeremy,

Yeti shards seem to still be responding to get-entries queries and issuing new STHs, which seems inconsistent with an unrecoverable database failure. Can you offer any more insight on what's failing and what's still working?

Thanks!
Joe

Jeremy Rowley

Sep 28, 2022, 7:48:34 PM
to Joe DeBlasio, Clint Wilson, Certificate Transparency Policy
There's an issue with the database that we're working on. We think
there is still a slim chance of keeping the log operational, but we
think that if the machine reboots, the database issue will cause the
log to collapse. We don't plan on restarting the machine right away,
but we're pretty sure that will be the outcome for at least the 2022
shard because of its size. Like I said, we're still investigating, so
I'm not 100% sure what will happen. However, I wanted to give the
community a heads up about the issue in case the log stops working.

Kurt Roeckx

Sep 29, 2022, 10:01:26 AM
to Joe DeBlasio, Jeremy Rowley, Clint Wilson, Certificate Transparency Policy
On Wed, Sep 28, 2022 at 04:38:48PM -0700, Joe DeBlasio wrote:
> Hi Jeremy,
>
> Yeti shards seem to still be responding to get-entries queries and issuing
> new STHs, which seems inconsistent with an unrecoverable database failure.
> Can you offer any more insight on what's failing and what's still working?

I think the new STHs were just a result of DigiCert's 12-hour delay
between generating an STH and publishing it. They seem to have stopped
publishing new STHs now.
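
For readers following along, the STH age check behind this observation can be sketched as below. This is a minimal sketch, not DigiCert's or any monitor's actual tooling: it assumes an RFC 6962 get-sth response already parsed into a dict, and the function names and the 24-hour MMD default are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def sth_age(sth: dict, now: datetime) -> timedelta:
    """Return how long ago the STH was signed.

    RFC 6962 STH timestamps are milliseconds since the Unix epoch.
    """
    signed_at = datetime.fromtimestamp(sth["timestamp"] / 1000, tz=timezone.utc)
    return now - signed_at


def is_stale(sth: dict, now: datetime, mmd: timedelta = timedelta(hours=24)) -> bool:
    """True if the newest published STH is older than the assumed MMD."""
    return sth_age(sth, now) > mmd
```

With a 12-hour publishing delay, a healthy log would be expected to show ages of roughly 12 hours or more; an age that keeps growing past the MMD suggests the log has stopped publishing new STHs.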


Kurt

Kurt Roeckx

Sep 29, 2022, 10:03:08 AM
to Joe DeBlasio, Jeremy Rowley, Clint Wilson, Certificate Transparency Policy
They all return HTTP code 500 now ...
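
A status check like this one can be reproduced with a few lines of Python. The sketch below is an illustration only: the helper names are hypothetical, and the log base URL is whatever the caller supplies (the thread does not list the Yeti endpoints). Only the /ct/v1/get-sth path comes from RFC 6962.

```python
import urllib.error
import urllib.request


def get_sth_url(log_base: str) -> str:
    """Build the RFC 6962 get-sth URL for a log's base URL."""
    return log_base.rstrip("/") + "/ct/v1/get-sth"


def probe_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET request to url.

    urlopen raises HTTPError for 4xx/5xx responses, so the code is read
    from the exception in that case. Connection failures (URLError) are
    left to propagate to the caller.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def describe(status: int) -> str:
    """Coarse label for a status code."""
    if status == 200:
        return "ok"
    if 500 <= status <= 599:
        return "server error"
    return "other"
```

Calling describe(probe_status(get_sth_url(base))) for each shard's base URL would give one label per shard; a shard in the state described here would report "server error".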


Kurt

Jeremy Rowley

Sep 29, 2022, 10:45:05 AM
to Kurt Roeckx, Joe DeBlasio, Clint Wilson, Certificate Transparency Policy
Hey Kurt and Clint - we managed to recover Yeti 2024 and 2025. Those
should be operating as normal. 2023 is in the process of recovering.
Although its downtime exceeds the desired level, we think we can
recover it this morning and have it operating as normal. The 2022
shard was too big to recover and is dead. There is no threat
associated with the incident: the database became unavailable because
of an internal issue. Therefore, I'd like to reverse my previous
statement about removing all shards. Removal should be limited to the
2022 shard. I'll post more details as soon as the RCA is available.

Clint Wilson

Sep 29, 2022, 10:47:17 AM
to Jeremy Rowley, Kurt Roeckx, Joe DeBlasio, Certificate Transparency Policy
Thank you for the update Jeremy! We’ll, of course, wait for the final report, but appreciate the interim information.
-Clint

Rob Stradling

Sep 29, 2022, 2:13:19 PM
to Jeremy Rowley, Joe DeBlasio, Clint Wilson, Certificate Transparency Policy, Kurt Roeckx
Hi Jeremy.  Is it possible to turn yeti2022-2 back on enough to allow get-entries and get-sth calls to be made for the cert and precert entries that were added before the database issue occurred?  Or is it too dead even for that?

crt.sh had fallen behind considerably on ingesting entries from this log shard, and obviously I would like crt.sh's dataset to be as complete as possible.

It seems (I guessed the URL) that Google has a mirror for this log shard, but it too seems to have fallen somewhat behind - see https://ct.googleapis.com/logs/eu1/mirrors/digicert_yeti2022_2/ct/v1/get-sth.
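
To make the "fallen behind" idea concrete, here is a rough sketch of how a monitor might work out how far it lags a log and which get-entries ranges it still needs. It is not crt.sh's actual ingestion code: the function names and the batch size of 256 are assumptions. The only protocol facts relied on are from RFC 6962: get-sth reports tree_size, and get-entries takes zero-based, inclusive start and end indices.

```python
from typing import Iterator, Tuple


def entries_behind(ingested: int, tree_size: int) -> int:
    """Number of log entries not yet ingested, given the STH tree_size."""
    return max(tree_size - ingested, 0)


def get_entries_ranges(
    next_index: int, tree_size: int, batch: int = 256
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) pairs covering entries next_index..tree_size-1.

    Both bounds are inclusive, matching the get-entries parameters.
    """
    start = next_index
    while start < tree_size:
        end = min(start + batch, tree_size) - 1
        yield start, end
        start = end + 1
```

A log may return fewer entries than requested, so a real client would advance by the number of entries actually received rather than by the requested range.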

Jeremy Rowley

Sep 29, 2022, 6:21:39 PM
to Rob Stradling, Joe DeBlasio, Clint Wilson, Certificate Transparency Policy, Kurt Roeckx
It's totally dead, and I don't think we can recover it. The sheer
number of records in the 2022 shard is making recovery of the database
pretty much impossible.

Here's the incident report:

On the morning of Sept 28, SRE was setting up a new Cassandra cluster
for the Yeti CT logs to migrate to. During this work, a blanket delete
command was inadvertently issued to all Cassandra servers in both the
new and old clusters. This resulted in the deletion of all files under
/var/lib/cassandra/data.

The mistake was caught immediately, but not before the command had
finished running on all Cassandra servers. At 19:22 GMT the Yeti CT
log applications were configured to stop accepting new entries to
prevent any further data loss. After stopping the damage, we attempted
to recover the files and started exporting data for the 2023-2025
shards. We also stopped the signer applications for all Yeti shards.
The 2024 and 2025 exports were quite fast, as those shards had few or
no certs. We managed to stand those logs back up within 2 hours of
starting the export, although they weren't live for new signing for
about 3 hours. 2023 had more data, and the attempts to export it
failed. We figured out the issue with the export, fixed it, and
started exporting again.

Please let me know if there are any follow-up questions I can answer.

Jeremy

Kurt Roeckx

Sep 30, 2022, 9:21:45 AM
to Jeremy Rowley, Joe DeBlasio, Clint Wilson, Certificate Transparency Policy
On Thu, Sep 29, 2022 at 08:44:52AM -0600, Jeremy Rowley wrote:
> Hey Kurt and Clint - we managed to recover Yeti 2024 and 2025. Those
> should be operating as normal. 2023 is in the process of recovering.
> Although the downtime exceeds the desired level, we think we can
> recover it this morning, and have it operating as normal.

Hi Jeremy,

Do you have an update about 2023? I'm still seeing error 500.


Kurt

سید محمد رضا مقدم

Oct 1, 2022, 2:14:38 PM
to Certificate Transparency Policy, Kurt Roeckx, Joe DeBlasio - Google, Clint, Certificate Transparency Policy, Jeremy Rowley
Connecting to the server

On Friday, September 30, 2022 at 16:51:45 (UTC+3:30), Kurt Roeckx wrote:

سید محمد رضا مقدم

Oct 1, 2022, 2:15:56 PM
to Certificate Transparency Policy, سید محمد رضا مقدم, Kurt Roeckx, Joe DeBlasio - Google, Clint, Certificate Transparency Policy, Jeremy Rowley
Please connect me to the server? 🙏😘👉

On Saturday, October 1, 2022 at 21:44:38 (UTC+3:30), سید محمد رضا مقدم wrote:

Clint Wilson

Oct 4, 2022, 1:42:37 PM
to Jeremy Rowley, Joe DeBlasio, Certificate Transparency Policy, Kurt Roeckx
Following up on this, are there any additional updates regarding 2023 that you can provide?

-Clint

Jeremy Rowley

Oct 4, 2022, 2:31:40 PM
to Clint Wilson, Certificate Transparency Policy, Joe DeBlasio, Kurt Roeckx
What else would you like to know? We found 5 certs missing from the 2023 log. We corrected the log and have it running in read-only mode so people can download the information. Our plan is to keep it operational until Apple and Google have removed the shard. At that point, we’ll decommission the shard.
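
The thread does not say how the missing entries were detected or how the log was corrected. As an illustration only, one check an outside party could run against a read-only log is to recompute the RFC 6962 Merkle tree hash over the downloaded leaves and compare it with the root in the STH. In the sketch below, the function names are hypothetical, and `leaves` is assumed to hold the base64-decoded leaf_input values from get-entries, in log order.

```python
import base64
import hashlib
from typing import Sequence


def merkle_root(leaves: Sequence[bytes]) -> bytes:
    """Merkle Tree Hash as defined in RFC 6962, section 2.1."""
    n = len(leaves)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        # Leaf hashes are prefixed with 0x00.
        return hashlib.sha256(b"\x00" + leaves[0]).digest()
    # k is the largest power of two strictly smaller than n.
    k = 1
    while k * 2 < n:
        k *= 2
    left = merkle_root(leaves[:k])
    right = merkle_root(leaves[k:])
    # Interior node hashes are prefixed with 0x01.
    return hashlib.sha256(b"\x01" + left + right).digest()


def matches_sth(leaves: Sequence[bytes], sth: dict) -> bool:
    """True if the leaves reproduce the STH's tree_size and root hash."""
    if len(leaves) != sth["tree_size"]:
        return False
    expected = base64.b64decode(sth["sha256_root_hash"])
    return merkle_root(leaves) == expected
```

The recursive form mirrors the RFC definition and copies slices at each level, so it is suited to small examples rather than a shard with many millions of entries.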

Clint Wilson

Oct 4, 2022, 2:43:10 PM
to Jeremy Rowley, Certificate Transparency Policy, Joe DeBlasio, Kurt Roeckx
Thanks Jeremy!