Key Conclusion: The incident had no impact on the database. The Merkle Tree structure remains intact, and data consistency is fully preserved.
Affected APIs: add-chain, add-pre-chain, get-entries.
Error Rate: Approximately 6% of requests returned HTTP 500 during the incident window.
Affected Window: 2025-12-24 22:55 +0800 to 2025-12-25 21:00 +0800.
2025-12-24 22:55: The system began exhibiting sporadic HTTP 500 errors.
2025-12-25 00:55: Error frequency increased, intermittently triggering system alert thresholds.
2025-12-25 10:07: Our team decided to suspend external certificate submissions to investigate potential risks.
2025-12-25 (Daytime): Investigation confirmed that the Merkle Tree and database data were not compromised. The issue was isolated to a specific component, and the database connection configuration for trillian_log_server was adjusted.
2025-12-25 19:00: After the configuration adjustments were deployed, HTTP 500 errors effectively ceased.
2025-12-25 21:00: After successful verification, certificate submission services were resumed, marking the end of the incident.
The root cause is a behavioral defect in the connection pool (pgxpool) of the github.com/jackc/pgx library under specific concurrency scenarios.
Trigger Scenario: ct-server issues a GetLeavesByRange gRPC call (GetLeavesByRangeRequest) to trillian_log_server, which runs a complex SELECT query against the backend database inside a transaction.
Abnormal Interruption: When the external HTTP request is interrupted unexpectedly or the gRPC call times out, the context is canceled, which aborts the in-flight database query and forces the transaction into rollback.
Connection Contamination: pgx has a known issue (see Issue #2100) in which a connection that is still rolling back can be handed out by the pool to the next request without being properly cleaned up or checked for state.
Cascading Error: A subsequent query executed on such a dirty, still-rolling-back connection fails with "TX rollback error: failed to deallocate cached statement(s): conn closed", and the failed SQL execution surfaces to the client as an HTTP 500. The sequence is illustrated by the sketch below.
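The following is a minimal Go sketch of this interaction pattern. The connection string, query, and timeout are hypothetical placeholders, and it is not a guaranteed reproducer of the upstream issue; it only mirrors the request sequence described above.

```go
// Minimal sketch of the failure pattern described above. The DSN, query, and
// timeout are illustrative placeholders; this is not a guaranteed reproducer
// of pgx Issue #2100.
package main

import (
	"context"
	"log"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	pool, err := pgxpool.New(context.Background(),
		"postgres://user:pass@localhost:5432/trillian")
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer pool.Close()

	// Request 1: a transactional query driven by a request-scoped context that
	// is canceled before the query finishes (an interrupted HTTP request or an
	// expired gRPC deadline). The cancellation forces the transaction to roll
	// back while the query is still in flight.
	reqCtx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	if tx, err := pool.Begin(reqCtx); err == nil {
		_, execErr := tx.Exec(reqCtx, "SELECT pg_sleep(5)") // exceeds the deadline
		log.Printf("query on canceled context: %v", execErr)
		_ = tx.Rollback(context.Background())
	}

	// Request 2: the next caller acquires a connection from the same pool. If
	// it is handed the connection that is still rolling back, the query fails
	// with an error like "failed to deallocate cached statement(s): conn closed",
	// which surfaces to the client as an HTTP 500.
	var one int
	if err := pool.QueryRow(context.Background(), "SELECT 1").Scan(&one); err != nil {
		log.Printf("follow-up query failed: %v", err)
	}
}
```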
Default Configuration Risk: MaxConns in pgxpool defaults to the number of OS CPU cores (with a lower bound of 4 in current releases). Under peak traffic, such a small pool exacerbates connection-reuse contention and increases the probability of hitting a dirty connection.
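For reference, the effective default on a given host can be inspected with a short sketch like the one below (the DSN is a placeholder and is only parsed, never connected to):

```go
// Sketch: print the pool size pgxpool would use by default on this host.
package main

import (
	"fmt"
	"log"
	"runtime"

	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	cfg, err := pgxpool.ParseConfig("postgres://user:pass@localhost:5432/trillian")
	if err != nil {
		log.Fatalf("parse config: %v", err)
	}
	fmt.Printf("default MaxConns: %d (runtime.NumCPU() = %d)\n",
		cfg.MaxConns, runtime.NumCPU())
}
```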
Remediation and Mitigation
Horizontal Scaling: Deployed additional trillian_log_server instances to handle traffic load.
Connection Pool Tuning: Explicitly increased the MaxConns parameter of the pgx connection pool used by trillian_log_server. A larger pool reduces how often connections are reused, diluting the impact of any dirty connection, and no further errors have occurred since deployment (see the sketch below).
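A hedged sketch of this kind of adjustment is shown below. The DSN and the value 64 are illustrative only, not the production settings, and how the DSN is actually supplied to trillian_log_server is deployment-specific.

```go
// Sketch: explicitly raising the pool ceiling instead of relying on the
// CPU-count default. The DSN and the value 64 are illustrative only.
package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	// Option 1: via the connection string understood by pgxpool.
	dsn := "postgres://user:pass@db:5432/trillian?pool_max_conns=64"

	cfg, err := pgxpool.ParseConfig(dsn)
	if err != nil {
		log.Fatalf("parse config: %v", err)
	}

	// Option 2: programmatic override on the parsed config.
	cfg.MaxConns = 64

	pool, err := pgxpool.NewWithConfig(context.Background(), cfg)
	if err != nil {
		log.Fatalf("create pool: %v", err)
	}
	defer pool.Close()
	log.Printf("pool ready with MaxConns=%d", cfg.MaxConns)
}
```

Note that a larger ceiling does not remove the underlying defect; it only lowers the probability that a freshly contaminated connection is the next one handed out.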
A fundamental resolution would require deep modifications to the underlying pgx library, which carries significant difficulty and risk.
We will continue to monitor the community discussion (Issue #2100) and evaluate whether to introduce patches in future versions or to adopt alternative connection-management strategies (one candidate strategy is sketched below).
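One candidate strategy, sketched here purely as an illustration and not as a committed change, is a pool-level health check that discards a connection before handing it to a caller. BeforeAcquire is part of the pgxpool configuration; the DSN is a placeholder.

```go
// Sketch: reject unhealthy connections at acquire time so a contaminated
// connection is discarded instead of being reused. Illustration only.
package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	cfg, err := pgxpool.ParseConfig("postgres://user:pass@db:5432/trillian")
	if err != nil {
		log.Fatalf("parse config: %v", err)
	}

	// Returning false tells the pool to destroy this connection and try another.
	cfg.BeforeAcquire = func(ctx context.Context, conn *pgx.Conn) bool {
		return conn.Ping(ctx) == nil
	}

	pool, err := pgxpool.NewWithConfig(context.Background(), cfg)
	if err != nil {
		log.Fatalf("create pool: %v", err)
	}
	defer pool.Close()
	log.Println("pool configured with acquire-time health check")
}
```

A per-acquire ping adds a round trip to every checkout, so this trades some latency for robustness and would need benchmarking before adoption.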