Key Conclusion: The incident had no impact on the database. The Merkle Tree structure remains intact, and data consistency is fully preserved.
Affected APIs: add-chain, add-pre-chain, get-entries.
Error Rate: Approximately 6% of requests returned HTTP 500 during the incident window.
Affected Window: 2025-12-24 22:55 +0800 to 2025-12-25 21:00 +0800.
2025-12-24 22:55: The system began exhibiting sporadic HTTP 500 errors.
2025-12-25 00:55: Error frequency increased, intermittently triggering system alert thresholds.
2025-12-25 10:07: Our team decided to suspend external certificate submissions to investigate potential risks.
2025-12-25 (Daytime): Investigation confirmed that the Merkle Tree and database data were not compromised. The issue was isolated to a specific component, and the database connection configuration for trillian_log_server was adjusted.
2025-12-25 19:00: After the configuration adjustments were deployed, HTTP 500 errors effectively ceased.
2025-12-25 21:00: After successful verification, certificate submission services were resumed, marking the end of the incident.
The root cause is a behavioral defect in the connection pool (pgxpool) of the github.com/jackc/pgx library under specific concurrency scenarios.
Trigger Scenario: ct-server issues a GetLeavesByRange gRPC call (GetLeavesByRangeRequest) to trillian_log_server, which runs a complex SELECT query against the backend database inside a transaction.
Abnormal Interruption: When the external HTTP request is interrupted unexpectedly or the gRPC call times out, the context is canceled, which aborts the in-flight database query and forces the transaction into rollback.
Connection Contamination: pgx has a known issue (see Issue #2100) in which a connection that is still rolling back can be handed out by the pool to the next request without being properly cleaned up or checked for state.
Cascading Error: A subsequent query executed on such a dirty, still-rolling-back connection fails with "TX rollback error: failed to deallocate cached statement(s): conn closed", and the failed SQL execution surfaces to the client as an HTTP 500. The sequence is illustrated by the sketch below.
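The following is a minimal Go sketch of this interaction pattern. The connection string, query, and timeout are hypothetical placeholders, and it is not a guaranteed reproducer of the upstream issue; it only mirrors the request sequence described above.

```go
// Minimal sketch of the failure pattern described above. The DSN, query, and
// timeout are illustrative placeholders; this is not a guaranteed reproducer
// of pgx Issue #2100.
package main

import (
	"context"
	"log"
	"time"

	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	pool, err := pgxpool.New(context.Background(),
		"postgres://user:pass@localhost:5432/trillian")
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer pool.Close()

	// Request 1: a transactional query driven by a request-scoped context that
	// is canceled before the query finishes (an interrupted HTTP request or an
	// expired gRPC deadline). The cancellation forces the transaction to roll
	// back while the query is still in flight.
	reqCtx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	if tx, err := pool.Begin(reqCtx); err == nil {
		_, execErr := tx.Exec(reqCtx, "SELECT pg_sleep(5)") // exceeds the deadline
		log.Printf("query on canceled context: %v", execErr)
		_ = tx.Rollback(context.Background())
	}

	// Request 2: the next caller acquires a connection from the same pool. If
	// it is handed the connection that is still rolling back, the query fails
	// with an error like "failed to deallocate cached statement(s): conn closed",
	// which surfaces to the client as an HTTP 500.
	var one int
	if err := pool.QueryRow(context.Background(), "SELECT 1").Scan(&one); err != nil {
		log.Printf("follow-up query failed: %v", err)
	}
}
```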
Default Configuration Risk: MaxConns in pgxpool defaults to the number of OS CPU cores (with a lower bound of 4 in current releases). Under peak traffic, such a small pool exacerbates connection-reuse contention and increases the probability of hitting a dirty connection.
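For reference, the effective default on a given host can be inspected with a short sketch like the one below (the DSN is a placeholder and is only parsed, never connected to):

```go
// Sketch: print the pool size pgxpool would use by default on this host.
package main

import (
	"fmt"
	"log"
	"runtime"

	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	cfg, err := pgxpool.ParseConfig("postgres://user:pass@localhost:5432/trillian")
	if err != nil {
		log.Fatalf("parse config: %v", err)
	}
	fmt.Printf("default MaxConns: %d (runtime.NumCPU() = %d)\n",
		cfg.MaxConns, runtime.NumCPU())
}
```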
Remediation and Mitigation
Horizontal Scaling: Deployed additional trillian_log_server instances to handle traffic load.
Connection Pool Tuning: Explicitly increased the MaxConns parameter of the pgx connection pool used by trillian_log_server. A larger pool reduces how often connections are reused, diluting the impact of any dirty connection, and no further errors have occurred since deployment (see the sketch below).
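A hedged sketch of this kind of adjustment is shown below. The DSN and the value 64 are illustrative only, not the production settings, and how the DSN is actually supplied to trillian_log_server is deployment-specific.

```go
// Sketch: explicitly raising the pool ceiling instead of relying on the
// CPU-count default. The DSN and the value 64 are illustrative only.
package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	// Option 1: via the connection string understood by pgxpool.
	dsn := "postgres://user:pass@db:5432/trillian?pool_max_conns=64"

	cfg, err := pgxpool.ParseConfig(dsn)
	if err != nil {
		log.Fatalf("parse config: %v", err)
	}

	// Option 2: programmatic override on the parsed config.
	cfg.MaxConns = 64

	pool, err := pgxpool.NewWithConfig(context.Background(), cfg)
	if err != nil {
		log.Fatalf("create pool: %v", err)
	}
	defer pool.Close()
	log.Printf("pool ready with MaxConns=%d", cfg.MaxConns)
}
```

Note that a larger ceiling does not remove the underlying defect; it only lowers the probability that a freshly contaminated connection is the next one handed out.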
A fundamental resolution would require deep modifications to the underlying pgx library, which carries significant difficulty and risk.
We will continue to monitor the community discussion (Issue #2100) and evaluate whether to introduce patches in future versions or to adopt alternative connection-management strategies (one candidate strategy is sketched below).
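One candidate strategy, sketched here purely as an illustration and not as a committed change, is a pool-level health check that discards a connection before handing it to a caller. BeforeAcquire is part of the pgxpool configuration; the DSN is a placeholder.

```go
// Sketch: reject unhealthy connections at acquire time so a contaminated
// connection is discarded instead of being reused. Illustration only.
package main

import (
	"context"
	"log"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgxpool"
)

func main() {
	cfg, err := pgxpool.ParseConfig("postgres://user:pass@db:5432/trillian")
	if err != nil {
		log.Fatalf("parse config: %v", err)
	}

	// Returning false tells the pool to destroy this connection and try another.
	cfg.BeforeAcquire = func(ctx context.Context, conn *pgx.Conn) bool {
		return conn.Ping(ctx) == nil
	}

	pool, err := pgxpool.NewWithConfig(context.Background(), cfg)
	if err != nil {
		log.Fatalf("create pool: %v", err)
	}
	defer pool.Close()
	log.Println("pool configured with acquire-time health check")
}
```

A per-acquire ping adds a round trip to every checkout, so this trades some latency for robustness and would need benchmarking before adoption.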