Retiring Let’s Encrypt Testflume and Oak 2019/2020 CT shards


Phil Porada

Feb 23, 2021, 3:58:37 PM
to Certificate Transparency Policy
Hi everybody,

Let's Encrypt will be freezing and eventually deleting the following CT shards. These particular shards have long since stopped accepting new certificate issuances. This data purge will free up disk space for future shards, reduce replication time to new databases, and reduce Let’s Encrypt operating costs.

Shards:
- Testflume 2019
- Testflume 2020
- Oak 2019
- Oak 2020

We will freeze the aforementioned shards on March 1st, 2021. “Freeze” means setting these shards to read-only mode, also known in the Trillian software as a “soft delete”. Users will still be able to hit the RFC 6962 get-* endpoints on these particular shards and have data returned. On April 1st, 2021, the shards will be automatically garbage collected by Trillian and “hard deleted”. Users will be unable to request data from the listed shards after the hard deletion.
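
For context, here is a minimal sketch of what that soft delete looks like at the database layer, assuming the stock Trillian MySQL schema (the `Trees` table and its column names come from that schema, not from anything stated in this announcement): a frozen shard is a tree in state FROZEN, and a soft-deleted one is simply a flagged row that Trillian's garbage collector later hard-deletes.

```
-- Assumes the stock Trillian MySQL schema; column names are from that
-- schema and are not confirmed anywhere in this thread.
-- Frozen shards are trees in state 'FROZEN'; soft-deleted ones have the
-- Deleted flag set and are hard-deleted later by the garbage collector.
SELECT TreeId, DisplayName, TreeState, Deleted, DeleteTimeMillis
FROM Trees
WHERE TreeState = 'FROZEN' OR Deleted = TRUE;
```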

Google has graciously provided the ecosystem with read-only log mirrors for production logs (e.g. Oak and NOT Testflume). Users are welcome to continue retrieving shard data from these mirrors. Please keep in mind that Google has not published any SLA or guarantee about the lifecycle of these mirrors.
- https://ct.googleapis.com/logs/eu1/mirrors/letsencrypt_oak2019/
- https://ct.googleapis.com/logs/eu1/mirrors/letsencrypt_oak2020/

Kurt Roeckx

Mar 14, 2021, 8:59:31 AM
to Phil Porada, Certificate Transparency Policy
On Tue, Feb 23, 2021 at 12:58:36PM -0800, Phil Porada wrote:
> Hi everybody,
>
> Let's Encrypt will be freezing and eventually deleting the following CT
> shards. These particular shards have long since stopped accepting new
> certificate issuances. This data purge will free up disk space for future
> shards, reduce replication time to new databases, and reduce Let’s Encrypt
> operating costs.
>
> Shards:
> - Testflume 2019
> - Testflume 2020
> - Oak 2019
> - Oak 2020
>
> We will freeze the aforementioned shards on *March 1st, 2021*. Freeze means
> setting these shards to read-only mode also known in the Trillian software
> as a “soft delete”. Users will still be able to hit RFC 6962 get-*
> endpoints on these particular shards and have data returned. On *April 1st,
> 2021* the shards will be automatically garbage collected by Trillian and
> “hard deleted”. Users will be unable to request data from the listed shards
> after the hard deletion.

It seems testflume 2019 is now returning 404. I last had contact
with it on 2021-03-09 02:50 UTC. Did it get deleted sooner than
expected?

What I'm also seeing is that new signatures still seem to be
generated, but with the same timestamp. For example, for
testflume 2020 I get 2 new STHs first seen on 2021-03-09 02:53 UTC,
and I have 10 STHs with timestamp 2021-03-01 12:41:49.225+00.

oak 2019 and 2020 still seem to be generating new STHs; the last
timestamp for oak 2019 is 2021-03-14 01:55:03.956+00, and for oak 2020
it is 2021-03-14 11:58:07.53+00.
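
A minimal sketch of the kind of query that surfaces the duplicate-timestamp behaviour described above, assuming a hypothetical monitoring table `sth(log_name, sth_timestamp, root_hash)` rather than the actual setup behind these observations:

```
-- Hypothetical monitoring table; the real schema behind these numbers is
-- not shown in this thread.
-- Lists (log, timestamp) pairs for which more than one distinct STH
-- (i.e. more than one root hash) shares the same STH timestamp.
SELECT log_name, sth_timestamp, COUNT(DISTINCT root_hash) AS distinct_sths
FROM sth
GROUP BY log_name, sth_timestamp
HAVING COUNT(DISTINCT root_hash) > 1;
```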


Kurt


Phil Porada

Mar 18, 2021, 5:18:50 PM
to Certificate Transparency Policy, Phil Porada, Kurt Roeckx
Kurt,

TL;DR - A database table ran out of disk space, and the early deletion of Testflume 2019 was intentional.

The Testflume log contains 2B+ certificates/precerts spread across the 2019-2023 shards. Certificate and precertificate data that has not yet been incorporated into the Trillian Merkle tree is stored in the `LeafData` table. Without yearly shard pruning, this table grows unbounded and eventually consumes all available disk space. We had neglected to prune shards and it bit us. Technically, the same thing can happen to our Oak log. At this time, Oak 2019 and 2020 have yet to be frozen, so we still have time to clean up the existing Oak shards, but we want to be especially careful with how we proceed.

You may be asking, "Wait, what?" We use MariaDB databases on Amazon RDS with InnoDB file-per-table tablespaces enabled. We have disk utilization monitoring and RDS storage autoscaling in place, but neither of those helped here. Our monitoring only checked the disk space available on the storage mount and alerted us at a particular threshold, rather than alerting per tablespace. The storage autoscaling automatically grows the disk when the entire mount containing all of the file-per-table files approaches ~90% full. https://mariadb.com/kb/en/innodb-file-per-table-tablespaces/

On the night of March 8th, we discovered that RDS has a 16TB limit per tablespace when using file-per-table; a single table hit that limit well before our mount-level disk space alerting and autoscaling thresholds would have triggered. Instead, our on-call was alerted that sequencing latency was rising and endpoint errors had increased. https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/MySQL.KnownIssuesAndLimitations.html#MySQL.Concepts.Limits.FileSize
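
A rough sketch of the kind of per-table check that would have surfaced this earlier, assuming the Trillian tables live in a database named `trillian` (the schema name is an assumption, not something stated in this thread):

```
-- Approximate on-disk size per table; with file-per-table enabled, each of
-- these is its own tablespace, so alerting on this (rather than on the
-- whole storage mount) flags a single table creeping toward the 16TB
-- file-size ceiling. TABLE_SCHEMA = 'trillian' is an assumed database name.
SELECT TABLE_NAME,
       ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024 / 1024, 1) AS approx_size_gib
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'trillian'
ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC;
```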

In our currently deployed infrastructure, and as is stock-standard for Trillian, each CT log stores all of its shards' data in the same database. Deleting the Testflume 2019 shard ahead of schedule would allow us to bring the remaining 2020-2023 shards back online. The Trillian MySQL schema uses `... ON DELETE CASCADE;` to slowly remove the 2019 shard data, but the OS still considers that disk space to be in use. Technically, those delete statements are still running as of this post. To regain some of the disk space, we ran an `optimize table LeafData;`, which locked the `LeafData` table until we killed the query. As a result, the 2020-2023 shards had an outage because they were unable to incorporate data into the Merkle tree. We removed Testflume 2019 from the certificate-transparency-go frontend so that shard is no longer served.
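
A hedged illustration of the two effects described above, again assuming the stock Trillian MySQL schema and an assumed database name `trillian`: the cascading delete is triggered by removing the tree's row from `Trees`, and the space it frees stays inside the InnoDB tablespace (visible as `DATA_FREE`) until the table is rebuilt.

```
-- Assumption: per-tree tables such as LeafData reference Trees(TreeId)
-- with ON DELETE CASCADE, so removing the tree row fans out to all of the
-- shard's rows. On a shard this large, the cascade runs for a long time.
-- DELETE FROM Trees WHERE TreeId = 123123123123123123;  -- placeholder TreeId

-- Space freed by those deletes is reusable inside the tablespace but is not
-- returned to the OS until the table is rebuilt (e.g. by OPTIMIZE TABLE,
-- which is what locked LeafData here).
SELECT TABLE_NAME,
       ROUND(DATA_FREE / 1024 / 1024 / 1024, 1) AS reclaimable_gib
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'trillian' AND TABLE_NAME = 'LeafData';
```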

An example of how Trillian stores all `LeafData` information in a single table in one database:
```
> select TreeId, hex(LeafIdentityHash), QueueTimestampNanos from LeafData LIMIT 5;
+--------------------+------------------------------------------------------------------+---------------------+
| TreeId             | hex(LeafIdentityHash)                                            | QueueTimestampNanos |
+--------------------+------------------------------------------------------------------+---------------------+
| 123123123123123123 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | 1600020128099737092 |
| 456456456456456456 | yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy | 1597434125325190272 |
| 789789789789789789 | zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz | 1605081733706674075 |
| 123123123123123123 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx | 1600020128099794300 |
| 456456456456456456 | yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy | 1597434125325242342 |
+--------------------+------------------------------------------------------------------+---------------------+
```
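
Since every shard's leaves share that one table, a quick way to see how the rows break down per shard is to group by `TreeId` (a minimal sketch; the tree IDs returned would be the same placeholders as above):

```
-- Per-shard row counts in the shared LeafData table. On a multi-billion
-- row table this scan is expensive, so it is better run against a replica.
SELECT TreeId, COUNT(*) AS leaf_rows
FROM LeafData
GROUP BY TreeId;
```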

We're currently planning what to do for the existing Testflume and Oak shards, how to best prevent an Oak disqualification event, and what to do for future shards. This was an unfortunate, but valuable, learning experience given that it happened to the test log.