2021-12-08 Oak CT Log Database Migration Post-mortem


Phil Porada

Dec 15, 2021, 6:58:31 PM
to Certificate Transparency Policy
Summary:
The Let’s Encrypt Oak 2022 CT log was returning no data for the get-entries endpoint and errors for get-proof-by-hash. Upon investigation we found missing rows, missing indexes, and a missing constraint in the new database. The missing rows caused lookup failures for old data, and the missing indexes degraded database performance. We inserted all missing rows from the original Oak copy of the log database and created the missing indexes and constraints. Once the data was restored, all get-* endpoints began responding correctly.

During this period, Oak 2022 continued sequencing submissions and issuing SCTs.

Details:
Let's Encrypt created an internal plan to move from a single database containing all temporal shards to a distinct database per shard. This plan addressed an initial planning oversight: MariaDB in AWS RDS limits each tablespace to 16TB: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/MySQL.KnownIssuesAndLimitations.html#MySQL.Concepts.Limits.FileSize

Non-RDS MariaDB databases, which we operate in the Let's Encrypt secure datacenters, have a maximum tablespace size of 64TB. We projected that the Oak database containing temporal shards 2019-2023 would run out of space in January 2022. We first detected the tablespace size limitation with our Testflume testing log: https://groups.google.com/a/chromium.org/g/ct-policy/c/madcUcQZ1IQ .
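
Growth of this sort can be tracked with a query against information_schema. A minimal sketch, with the connection details as placeholders:
```
# Approximate on-disk size of each table, largest first.
$ mysql ${DBNAME} \
    --host=${DBHOST} \
    --user=${DBUSER} \
    --password=${DBPASS} \
    --execute="SELECT table_name,
                      ROUND((data_length + index_length) / POW(1024, 4), 2) AS size_tib
               FROM information_schema.tables
               WHERE table_schema = '${DBNAME}'
               ORDER BY data_length + index_length DESC;"
```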

To prevent sudden Oak death, we chose to use AWS Database Migration Service (DMS) with replication filters to copy only the 2022 and 2023 shard data into the new Oak database. The process involved creating a source endpoint and a target endpoint. We used the following configurations and documents:
* https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.MySQL.html#CHAP_Target.MySQL.Prerequisites
* https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MySQL.html
```
Source: (MariaDB)
Endpoint specific settings:
eventsPollInterval=5,CleanSrcMetadataOnMismatch=true

Target: (MariaDB)
Endpoint specific settings:
targetDbType=SPECIFIC_DATABASE,parallelLoadThreads=5,initstmt=SET FOREIGN_KEY_CHECKS=0,CleanSrcMetadataOnMismatch=true
```
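
For illustration only, endpoints with these settings can be created with the AWS CLI; the identifiers and connection details below are placeholders, and the CLI form separates extra connection attributes with semicolons:
```
$ aws dms create-endpoint \
    --endpoint-identifier oak-source \
    --endpoint-type source \
    --engine-name mariadb \
    --server-name ${DBHOST_ORIG} --port 3306 \
    --username ${DBUSER_ORIG} --password ${DBPASS_ORIG} \
    --extra-connection-attributes "eventsPollInterval=5;CleanSrcMetadataOnMismatch=true"

$ aws dms create-endpoint \
    --endpoint-identifier oak-target \
    --endpoint-type target \
    --engine-name mariadb \
    --server-name ${DBHOST_NEW} --port 3306 \
    --username ${DBUSER_NEW} --password ${DBPASS_NEW} \
    --extra-connection-attributes "targetDbType=SPECIFIC_DATABASE;parallelLoadThreads=5;initstmt=SET FOREIGN_KEY_CHECKS=0"
```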

On 2021-12-08 we began cutover maintenance on Oak: https://groups.google.com/a/chromium.org/g/ct-policy/c/8Dd3VTHmvM0 . Our validation checks included naive content checks of each migrated database table, several get-entries calls against recently sequenced data, and confirmation that data continued to be sequenced. We closed our maintenance window and wrapped up for the night.
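
A recent-data spot check of that kind amounts to an RFC 6962 get-entries call; the index range here is a placeholder:
```
$ curl "https://oak.ct.letsencrypt.org/2022/ct/v1/get-entries?start=${START}&end=${END}"
```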

On 2021-12-09, Philippe Boneff from Google contacted us stating that their monitoring of the Oak 2022 CT log had encountered errors for certain operations, including get-entries and get-proof-by-hash, while the same checks against Google's mirrored copy of the log succeeded. This implied that the tree itself was still OK. The errors began at 2021-12-08 20:24, near the end of our scheduled Oak log maintenance.
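
The failing operations correspond to RFC 6962 calls like the following; the leaf hash and tree size are placeholders:
```
# Old entries returned no data after the migration.
$ curl "https://oak.ct.letsencrypt.org/2022/ct/v1/get-entries?start=0&end=31"

# Inclusion proof lookups errored; the hash is a base64 Merkle leaf hash.
$ curl -G "https://oak.ct.letsencrypt.org/2022/ct/v1/get-proof-by-hash" \
    --data-urlencode "hash=${LEAF_HASH_B64}" \
    --data-urlencode "tree_size=${TREE_SIZE}"
```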

Upon investigation we found missing rows in the Subtree and LeafData tables, missing indexes on the SequencedLeafData and TreeHead tables, and a missing constraint on the Unsequenced table. The missing indexes and constraint were caused by a DMS limitation: during a full table load, DMS does not automatically create the secondary indexes that the Trillian schema requires. https://docs.aws.amazon.com/dms/latest/userguide/CHAP_BestPractices.html
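
Missing rows of this kind show up in per-tree row counts compared across the two hosts. A sketch, using the same connection variables as the fix commands later in this post:
```
# The totals should match between the old and new databases.
$ mysql ${DBNAME} --host=${DBHOST_ORIG} --user=${DBUSER_ORIG} --password=${DBPASS_ORIG} \
    --execute="SELECT COUNT(*) FROM LeafData WHERE TreeId = ${TREEID};"
$ mysql ${DBNAME} --host=${DBHOST_NEW} --user=${DBUSER_NEW} --password=${DBPASS_NEW} \
    --execute="SELECT COUNT(*) FROM LeafData WHERE TreeId = ${TREEID};"
```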

From 2021-12-09 13:59 UTC to 2021-12-10 19:19 UTC we copied the missing data from the old database and created the missing indexes and constraints on the affected tables. This resolved the lookup failures for get-entries and the timeout errors for get-proof-by-hash.

Improvements:
1. Use an automated tool to review the entire tree after migration

If we were ever to use this plan again, we would run the `scanlog` tool against each shard; doing so here would have alerted us to the missing data.
```
 scanlog -start_index 0 -end_index 249583813 -log_uri https://oak.ct.letsencrypt.org/2022/
```

Our running ct-woodpecker instance continued monitoring from the tree head and detected no issues. It was able to submit certificates and verify that those new certificates were sequenced. We should have started a new copy of ct-woodpecker to begin checking entries from the tree root.

2. Ensure that schemas are identical after using AWS DMS.
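
One way to do this is to diff schema-only dumps from both hosts, which surfaces anything DMS did not carry over:
```
$ mysqldump ${DBNAME} --no-data --skip-lock-tables \
    --host=${DBHOST_ORIG} --user=${DBUSER_ORIG} --password=${DBPASS_ORIG} > schema-orig.sql
$ mysqldump ${DBNAME} --no-data --skip-lock-tables \
    --host=${DBHOST_NEW} --user=${DBUSER_NEW} --password=${DBPASS_NEW} > schema-new.sql
# Missing indexes/constraints appear as differences in the CREATE TABLE statements.
$ diff schema-orig.sql schema-new.sql
```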

3. Use a single database host per temporal shard.

When freezing a log with Trillian, following https://github.com/google/trillian/blob/master/docs/howto/freeze_a_ct_log.md , an administrator sets a flag in the Trees table to indicate a soft delete; at a configured later time, the data for that shard is hard deleted. The hard delete issues individual `DELETE x FROM y` queries against each table, which can severely impact performance for other shards on the same database host. We determined that this impact on Oak would cause a disqualification event and didn't pursue that option further. Moving forward we will use a separate database host per shard and delete the whole host when the shard ages out.
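
For a sense of scale, the hard delete is equivalent to per-table statements like these (a sketch; table names from the Trillian MySQL schema), each of which can touch hundreds of millions of rows on a host still serving the other shards:
```
$ mysql ${DBNAME} --host=${DBHOST} --user=${DBUSER} --password=${DBPASS} --execute="
    DELETE FROM SequencedLeafData WHERE TreeId = ${TREEID};
    DELETE FROM LeafData          WHERE TreeId = ${TREEID};
    DELETE FROM Subtree           WHERE TreeId = ${TREEID};
    DELETE FROM TreeHead          WHERE TreeId = ${TREEID};
    DELETE FROM Trees             WHERE TreeId = ${TREEID};"
```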

Database Fix Details:
The following shell commands were used to stream data from the old Oak database into the new Oak database. All told, this was 1.34TiB of LeafData and ~365GiB of Subtree data. As soon as these two tables were loaded into the new Oak database, the get-entries endpoint began returning all of the previously missing entries.

```
# Stream the missing LeafData rows for this tree from the old Oak database
# into the new one.
$ mysqldump ${DBNAME} \
          LeafData \
          --host=${DBHOST_ORIG} \
          --user=${DBUSER_ORIG} \
          --password=${DBPASS_ORIG} \
          --insert-ignore \
          --quick \
          --no-create-info \
          --skip-lock-tables \
          --complete-insert \
          --hex-blob \
          --compact \
          --order-by-primary \
          --where "TreeId=${TREEID}" \
| pv | \
mysql ${DBNAME} \
    --host=${DBHOST_NEW} \
    --user=${DBUSER_NEW} \
    --password=${DBPASS_NEW}

# Repeat for the Subtree table.
$ mysqldump ${DBNAME} \
          Subtree \
          --host=${DBHOST_ORIG} \
          --user=${DBUSER_ORIG} \
          --password=${DBPASS_ORIG} \
          --insert-ignore \
          --quick \
          --no-create-info \
          --skip-lock-tables \
          --complete-insert \
          --hex-blob \
          --compact \
          --order-by-primary \
          --where "TreeId=${TREEID}" \
| pv | \
mysql ${DBNAME} \
    --host=${DBHOST_NEW} \
    --user=${DBUSER_NEW} \
    --password=${DBPASS_NEW}
```

To generate the missing indexes and constraints while continuing to serve traffic, we used pt-online-schema-change. We had enough storage available in the new database to create a copy of each table that we ran the tool against. RDS does not provide SUPER privileges; instead, a parameter group change was required, as described in this Percona blog post: https://www.percona.com/blog/2016/07/01/pt-online-schema-change-amazon-rds/ . As soon as these indexes and constraints were generated, the get-proof-by-hash endpoint began returning successfully. For posterity, here are our pt-online-schema-change incantations:

```
# Parameter group change before running the tool
log_bin_trust_function_creators = 1

# Recreate the two missing secondary indexes on SequencedLeafData.
$ pt-online-schema-change \
-u${DBUSER} \
-p${DBPASS} \
--max-load Threads_running=350 --critical-load Threads_running=500 \
--recursion-method=none \
--nocheck-replication-filters \
--noanalyze-before-swap \
--execute \
--alter="ADD KEY TreeId (TreeId,LeafIdentityHash), ADD KEY SequencedLeafMerkleIdx (TreeId,MerkleLeafHash)" \
D=${DBNAME},t=SequencedLeafData,h=${DBHOST}

# Recreate the missing unique index on TreeHead.
$ pt-online-schema-change \
-u${DBUSER} \
-p${DBPASS} \
--max-load Threads_running=350 --critical-load Threads_running=500 \
--recursion-method=none \
--nocheck-replication-filters \
--noanalyze-before-swap \
--execute \
--alter="ADD UNIQUE TreeHeadRevisionIdx (TreeId,TreeRevision)" \
D=${DBNAME},t=TreeHead,h=${DBHOST}

# Recreate the missing unique constraint on Unsequenced.
$ pt-online-schema-change \
-u${DBUSER} \
-p${DBPASS} \
--max-load Threads_running=350 --critical-load Threads_running=500 \
--recursion-method=none \
--nocheck-replication-filters \
--noanalyze-before-swap \
--execute \
--alter="ADD UNIQUE QueueID (QueueID)" \
D=${DBNAME},t=Unsequenced,h=${DBHOST}

# Parameter group change after the tool completes
log_bin_trust_function_creators = 0
```