Incident Report
Summary & Impact
On January 5th, we were notified by Let’s Encrypt that they were receiving a high rate of HTTP 500 (Internal Server Error) responses when attempting to submit entries to our mammoth2024h1 CT log.
A table in mammoth2024h1’s database became corrupted after the underlying filesystem ran out of disk space. As a result of this corruption, the log entries corresponding to at least two issued SCTs can never be incorporated. We deem this an unrecoverable error that necessitates the retirement of this log.
Mammoth2024h1 has been put in permanent read-only mode, and we anticipate that it will transition to Retired in the Chrome and Apple log lists in the near future.
Timeline
All times are UTC.
2024-01-05:
- 16:03:44 Mammoth2024h1’s MariaDB database runs out of disk space.
- 16:17:52.3313 Earliest unincorporated SCT timestamp, if a corrupted index can be believed.
- 23:02 Let’s Encrypt notifies our CT Ops team that mammoth2024h1 has been emitting a high rate of submission errors (HTTP 500) since around 16:04 UTC.
2024-01-06:
- 09:39 Our Compliance team’s nascent monitoring system also observes the problem and escalates the matter to our Operations team.
2024-01-08:
- 09:48 Our Operations team reports that the database ran out of disk space, and that the issue has been resolved by allocating more disk space.
2024-01-09:
- 00:45 The Chrome CT team notifies our CT Ops team that they had observed stale STHs being served by mammoth2024h1 over the previous weekend, and that they had seen merge delays exceeding 24 hours for four SCTs, two of which had still not been incorporated into the log.
- 10:01 We begin investigating the Chrome CT team’s report.
- 16:07 We confirm that the two leaf hashes provided by the Chrome CT team have still not been included in the log.
- 16:27 We observe that although new submissions are being processed in a timely fashion, the “Unsequenced” table appears to contain ~1,300 rows that have timestamps dating back as far as 2024-01-05T16:17:52.3313Z.
- 17:17 We ask the Google CT team for help with processing the stuck unsequenced rows.
- 20:44 We realize that the “Unsequenced” table has become corrupted. Whilst its primary key index still includes the ~1,300 unsequenced rows, the table itself no longer contains them. The index only contains a subset of the table’s columns, which is not enough to recover the full data necessary to include the promised entries in the log.
- 21:05 We inform the Chrome CT team that we have begun investigating their report with urgency.
2024-01-10:
- 11:37 With further input from the Google CT team, we begin attempting to recover the data from the MariaDB redo logs.
- 16:44 We notify the Chrome and Apple CT teams that, although we have not yet exhausted all recovery options, we believe it will not be possible to restore mammoth2024h1 to a state worthy of continued trust, and that in anticipation of its retirement we will begin work to spin up a replacement log.
- 18:52 The Chrome CT team requests that mammoth2024h1 stop accepting new entries.
- 20:34 Our Operations team puts mammoth2024h1 into a read-only state by blocking POST requests. Subsequent add-chain and add-pre-chain requests return HTTP 403 (Forbidden).
- 21:03 We notify ct-policy that mammoth2024h1 has stopped accepting submissions.
2024-01-11:
- 11:23 We abandon our efforts to recover the missing unsequenced rows from the MariaDB redo logs, having concluded that recovery is not possible.
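For context on the merge delays noted in the timeline: a CT log must incorporate each promised entry within its Maximum Merge Delay (24 hours for this log). A minimal sketch of that check, assuming SCT timestamps expressed as milliseconds since the Unix epoch (the function names here are our own illustration, not part of any particular tool):

```python
from datetime import datetime, timezone

MMD_SECONDS = 24 * 60 * 60  # Maximum Merge Delay: 24 hours

def merge_delay_seconds(sct_timestamp_ms: int, now: datetime) -> float:
    """Seconds elapsed since the SCT was issued."""
    issued = datetime.fromtimestamp(sct_timestamp_ms / 1000, tz=timezone.utc)
    return (now - issued).total_seconds()

def violates_mmd(sct_timestamp_ms: int, now: datetime) -> bool:
    """True if an entry still unincorporated at `now` is past the MMD."""
    return merge_delay_seconds(sct_timestamp_ms, now) > MMD_SECONDS

# Earliest unincorporated SCT (2024-01-05T16:17:52.331Z), checked when the
# Chrome CT team reported the problem (2024-01-09T00:45Z):
sct_ms = int(datetime(2024, 1, 5, 16, 17, 52, 331000,
                      tzinfo=timezone.utc).timestamp() * 1000)
checked_at = datetime(2024, 1, 9, 0, 45, tzinfo=timezone.utc)
print(violates_mmd(sct_ms, checked_at))  # prints True: over 3 days, well past the MMD
```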
Root Cause Analysis
The internal server errors and merge delays occurred due to the underlying filesystem running out of available disk space.
The rapid growth of mammoth2024h1 caught us by surprise. The log only became Usable on November 26th 2023, and based on our previous experience running CT logs we did not expect to exhaust its disk space within six weeks. While monitoring of disk usage was in place, an effective mechanism for alerting the CT Ops and Operations teams had not yet been implemented.
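The gap here was the alerting step, not the measurement itself. Purely as an illustration (the 85% threshold and the path are hypothetical, not our production configuration), a periodic check might look like:

```python
import shutil

ALERT_THRESHOLD = 0.85  # illustrative threshold, not our production value

def disk_usage_fraction(path: str) -> float:
    """Fraction of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def should_alert(path: str, threshold: float = ALERT_THRESHOLD) -> bool:
    """True when usage has crossed the alerting threshold."""
    return disk_usage_fraction(path) >= threshold

# Run periodically against the database volume; on alert, page the
# CT Ops and Operations teams rather than only recording a metric.
if should_alert("/"):
    print("ALERT: database filesystem nearly full")
```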
The failure to incorporate at least two SCTs into the log occurred due to corruption of the “Unsequenced” database table, which was caused by MariaDB being susceptible to corruption when the underlying filesystem runs out of disk space.
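The corruption was visible as a disagreement between the primary-key index and the table itself: the index still reported rows the table no longer contained. A sketch of that consistency check, using SQLite here purely for illustration (the table and index names are invented; MariaDB's equivalent index hint is FORCE INDEX):

```python
import sqlite3

def counts_agree(cur: sqlite3.Cursor, table: str, index: str) -> bool:
    """Compare the row count seen via an index scan with a full table scan.

    On a healthy table the two agree; on the corrupted "Unsequenced" table
    the index claimed ~1,300 rows that the table no longer contained.
    """
    via_index = cur.execute(
        f"SELECT COUNT(*) FROM {table} INDEXED BY {index}").fetchone()[0]
    via_table = cur.execute(
        f"SELECT COUNT(*) FROM {table} NOT INDEXED").fetchone()[0]
    return via_index == via_table

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE unsequenced (id INTEGER, payload BLOB)")
cur.execute("CREATE UNIQUE INDEX unsequenced_pk ON unsequenced(id)")
cur.executemany("INSERT INTO unsequenced VALUES (?, ?)",
                [(i, b"leaf") for i in range(3)])
print(counts_agree(cur, "unsequenced", "unsequenced_pk"))  # True on a healthy table
```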
Lessons Learned
What went well
- Although it was unfortunate that the incident began late on a Friday, our Operations team resolved the disk space problem promptly as soon as the working week began.
What didn’t go well
- Although disk usage monitoring is in place, effective alerting was lacking.
- MariaDB databases are vulnerable to corruption when there is no more disk space available.
Where we got lucky
- Disk space for sabre2024h1 was increased in the last week of 2023, avoiding the same fate that mammoth2024h1 has now suffered.
- Only the “Unsequenced” table was corrupted, so we did not lose any information relating to entries that had already been included. Therefore, the log’s various GET endpoints still return useful data for included entries.
Action Items
| Action Item | Kind | Due Date |
| --- | --- | --- |
| Increase disk capacity on all Mammoth and Sabre log databases by a factor of 8. | Mitigate | Done |
| Submit inclusion requests to Chrome and Apple for a “mammoth2024h1b” replacement log. | Mitigate | Today |
| Inventory our current monitoring and alerting mechanisms for our CT logs, and draft a plan to improve their effectiveness. This plan will look at prioritizing metrics and alerts that could result in a fatal error, possibilities for reducing alert fatigue, and the practicalities of making the alerts more visible so that they don’t get overlooked. | Detect & Mitigate | 2024-01-31 |
| Our teams have very little expertise with MySQL/MariaDB, whereas we use PostgreSQL extensively. Prior to this incident, we had already begun work on implementing a PostgreSQL storage backend for Trillian, the open source software that underpins our CT logs. Interestingly, it appears that PostgreSQL is not vulnerable to corruption when disk space runs out, which makes us even more keen to complete this project. | Prevent | TBD |