Incident Report for TrustAsia log2025a and log2025b MMD violations
This incident had two phases: 1) 2025-07-08: approximately 11 hours of intermittent 5xx errors and short pauses of the signing service; 2) 2025-07-12 to 2025-07-15: a log merge backlog. Both were chain reactions triggered by a sudden surge in log submission and retrieval requests.
Impact and Details
1. 2025-07-08 Signing service paused
2. 2025-07-08 HTTP 500/502/504 errors occurred for approximately 4% of total requests
3. 2025-07-08 The database behind our log service reported the error “host is blocked because of many connection errors”
4. 2025-07-15 Log merge backlog: the tree merge service was running but could not keep pace with the submission rate, so submitted log entries were incorporated with delays in excess of 24 hours
Impact Window
1. 2025-07-08 04:20 +0800 ~ 2025-07-08 15:30 +0800
2. 2025-07-12 16:30 +0800 ~ 2025-07-15 20:30 +0800
Resolution
1. Blocked some IPs that were submitting in large volume and restarted services. The signing service briefly recovered but paused again, and HTTP 500/504 errors persisted
2. Temporary solution: suspended the add-cert and add-pre-cert interfaces. After certificate submission was suspended, the signing service recovered and HTTP 500/504 errors ceased. However, the 500/504 errors returned when certificate submission was reopened
3. Added CPU resources for database servers
4. Phase two actions: temporarily disabled certificate submission interfaces
5. Phase two actions: upgraded database server disks for performance
Root Cause Analysis
1. Phase one: a large number of submissions began arriving at log2025a/log2025b around 2025-07-08 04:00 +0800. The surge increased database query volume and pushed database CPU usage to its limit. This resulted in a query backlog and ultimately triggered the error “host is blocked because of many connection errors”. After we added CPU resources to the database servers, the 5xx errors were mitigated.
2. Phase two: after the added CPU resources briefly stabilized the system on 2025-07-08, a further increase in requests gradually raised database IOPS demand. Sustained heavy querying led to an IOPS bottleneck, causing a sharp decline in database processing capability. This eventually affected the operation of trillian_log_signer, slowing the tree merge rate and causing the backlog to accumulate.
3. We monitor merge delays using the sequencer_merge_delay_count and sequencer_sequenced metrics. However, during periods of high overall system pressure, the values of these metrics stagnated and no longer reflected the actual merge delay.
4. Our existing monitoring of certificate submission and tree signing remained operational. Because merging was still progressing, although too slowly to keep up, no alert rules were triggered.
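The two monitoring gaps described in items 3 and 4 can be illustrated with a minimal sketch. It is not our production alerting code: the Sample type, the function names, the 30-minute window in the usage note, and the warning fraction are hypothetical, and the 24-hour limit is taken from the delay figure quoted in this report. The idea is to alert when a metric stops changing, and separately to alert on the age of the oldest entry that is still waiting to be merged, so that slow but continuous merging is still caught.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence

# Merge delay limit, taken from the 24-hour figure quoted in this report.
MMD = timedelta(hours=24)


@dataclass
class Sample:
    at: datetime   # when the metric was scraped
    value: float   # metric value at that time


def is_stagnant(samples: Sequence[Sample], window: timedelta) -> bool:
    """Return True if the metric has not changed over the most recent window.

    Samples are assumed to be sorted by time, oldest first.
    """
    if not samples:
        return True  # no data at all is treated as stagnant
    latest = samples[-1]
    # Only judge stagnation once the history actually spans the window.
    covers_window = latest.at - samples[0].at >= window
    recent = [s for s in samples if latest.at - s.at <= window]
    return covers_window and all(s.value == latest.value for s in recent)


def merge_delay_alert(
    oldest_unsequenced: Optional[datetime],
    now: datetime,
    warn_fraction: float = 0.5,  # hypothetical: alert at half of the limit
) -> bool:
    """Return True if the oldest pending entry has waited too long."""
    if oldest_unsequenced is None:
        return False  # nothing is waiting to be merged
    return now - oldest_unsequenced >= MMD * warn_fraction
```

In this sketch, is_stagnant(samples, timedelta(minutes=30)) would flag a metric whose value stayed flat for half an hour, and merge_delay_alert would fire once the oldest pending entry is 12 hours old, regardless of whether merging is still making progress.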
Timeline
1. 2025-07-08 04:30 +0800 Alerts for HTTP 500 and 504 errors were posted in our internal workplace tool.
2. 2025-07-08 05:11 +0800 A staff member raised a manual phone alert about request timeouts on log2025b. Investigation revealed that log2025a had begun receiving a surge of requests at 2025-07-08 04:00 +0800. We suspected that large-volume certificate submissions were the cause, so we identified a batch of suspicious IPs and temporarily blocked them.
3. 2025-07-08 06:00 +0800 We found that HTTP 500 and 504 errors were still occurring on log2025b and that its signing service kept pausing after each short recovery. We restarted the services and rebooted the business server, but the signing service still paused again after a short recovery.
4. 2025-07-08 09:00 +0800 After inspecting the database and connection middleware, we identified the error message “host is blocked because of many connection errors”. We attempted to adjust database runtime parameters and restart the database. The signing service briefly recovered but failed again, confirming insufficient CPU resources on the database servers.
5. 2025-07-08 11:00 +0800 Suspended add-cert and add-pre-cert interfaces. Checked the database status after it stabilized.
6. 2025-07-08 14:20 +0800 Prepared to add CPU resources to the database servers.
7. 2025-07-08 15:10 +0800 Completed adding CPU resources and the service began to gradually recover.
8. 2025-07-08 15:30 +0800 The HTTP 500 and 504 errors were confirmed to have ceased and the signing service resumed normal operation.
9. 2025-07-11 08:00 +0800 ~ 2025-07-15 10:00 +0800 A growing backlog of unsigned entries was observed in the merge tree.
10. 2025-07-15 09:27 +0800 Received an external email notification that log entries were being incorporated with delays in excess of 24 hours.
11. 2025-07-15 09:42 +0800 Suspended certificate submission interface.
12. 2025-07-15 16:00 +0800 Implemented monitoring and alerting mechanisms for the Unsequenced table in the database.
13. 2025-07-15 19:20 +0800 Started upgrading the database disk resources.
14. 2025-07-15 20:12 +0800 Completed the disk upgrade.
15. 2025-07-15 20:26 +0800 Restored the certificate submission interface and started accepting submissions again. Waited for the system to reach stable operation.
16. 2025-07-16 17:00 +0800 The monitoring system indicated stable operation.
Actions
1. Added database CPU resources, upgraded database disk performance and enhanced database processing capabilities.
2. In addition to using the sequencer_merge_delay_count and sequencer_sequenced metrics to monitor merge delays, monitoring and alerts have been added for the Unsequenced table.
3. Implemented a new operational strategy: the add-cert interface is automatically suspended when the Unsequenced table reaches a threshold, preventing excessive backlog accumulation.
4. Plan to add new external monitoring for more comprehensive coverage of the API interfaces.
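The automatic suspension in action 3 can be sketched as a simple gate with two thresholds. This is an illustration under stated assumptions, not the deployed implementation: the SubmissionGate class, its field names, and the numbers in the usage note are hypothetical, and the separate lower resume threshold is an assumption added here so that the interface does not flap on and off when the backlog hovers near the limit.

```python
from dataclasses import dataclass


@dataclass
class SubmissionGate:
    suspend_at: int          # hypothetical: row count at which submission is suspended
    resume_at: int           # hypothetical: row count at which submission is restored
    suspended: bool = False  # current state of the gate

    def update(self, unsequenced_rows: int) -> bool:
        """Feed the current Unsequenced row count.

        Returns True if submissions should currently be accepted.
        """
        if not self.suspended and unsequenced_rows >= self.suspend_at:
            # Backlog reached the limit: stop accepting new submissions.
            self.suspended = True
        elif self.suspended and unsequenced_rows <= self.resume_at:
            # Backlog has drained far enough: accept submissions again.
            self.suspended = False
        return not self.suspended
```

For example, with SubmissionGate(suspend_at=1000, resume_at=200), submissions are rejected once the count reaches 1000 and stay rejected at 600, and are accepted again only after the count falls to 200 or below.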