TrustAsia CT log hetu2027 suspending certificate submissions


Xiaoming Yang

Dec 24, 2025, 9:07:56 PM
to Certificate Transparency Policy

"Our CT log hetu2027 ( https://hetu2027.trustasia.com/hetu2027/ ) is encountering a 'GetEntries handler error: failed to fix log leaf: leaves:{merkle_leaf_hash' error. We are suspending certificate submissions to investigate the issue."

Xiaoming Yang

Dec 25, 2025, 9:40:26 AM
to Certificate Transparency Policy, Xiaoming Yang
We've investigated the issue with CT log hetu2027 and identified the cause. A temporary fix is in place, and we are gradually reopening submissions. Further validation is underway, and a full report will follow.

Fusion Android

Dec 31, 2025, 1:29:36 AM
to Certificate Transparency Policy, Xiaoming Yang
Hi Xiaoming Yang,

Would this cause a certificate issued around the time of the incident to be absent from the v3 log list?

Regards,
Sean

Xiaoming Yang

Jan 1, 2026, 11:05:03 AM
to Certificate Transparency Policy, Fusion Android, Xiaoming Yang
We have examined the Merkle tree and found no corruption so far. The issue lies in the Go pgx library; it affects only the application and has no impact on the database.

Xiaoming Yang

Jan 9, 2026, 1:24:57 AM
to Certificate Transparency Policy, Xiaoming Yang
Intermittent Service Errors Caused by the pgx Library
  • Incident Overview 
This incident manifested as intermittent HTTP 500 errors on specific APIs (add-chain, add-pre-chain, get-entries). The root cause was traced to the pgx layer, a database driver library for the Go ecosystem.
Key Conclusion: The incident had no impact on the database. The Merkle Tree structure remains intact, and data consistency is fully preserved. 
  • Impact Scope 
Affected APIs:  add-chain, add-pre-chain, get-entries.
Error Rate:  Approximately 6% of requests returned HTTP 500 during the incident window.
Affected Window:  2025-12-24 22:55 +0800 to 2025-12-25 21:00 +0800.
  • Timeline (UTC+8)
2025-12-24 22:55:  The system began exhibiting sporadic HTTP 500 errors.
2025-12-25 00:55:  Error frequency increased, intermittently triggering system alert thresholds.
2025-12-25 10:07:  Our team decided to suspend external certificate submissions to investigate potential risks.
2025-12-25 (Daytime):  Investigation confirmed that the Merkle Tree and database data were not compromised. The issue was isolated to a specific component, and database connection configurations for trillian_log_server were adjusted.
2025-12-25 19:00:  Following deployment of the configuration adjustments, HTTP 500 errors effectively ceased.
2025-12-25 21:00:  After successful verification, certificate submission services were resumed, marking the end of the incident. 
  • Root Cause Analysis
The root cause is a behavioral defect in the connection pool (pgxpool) of the github.com/jackc/pgx library under specific concurrent scenarios. A minimal sketch of the failure sequence follows this list.
Trigger Scenario:  ct-server initiates a GetLeavesByRangeRequest gRPC call to trillian_log_server. The backend database executes a complex SELECT query within a transaction.
Abnormal Interruption:  When an external HTTP request is unexpectedly interrupted or a gRPC call times out, the context is canceled. This forces the currently executing database query into rollback.
Connection Contamination:  pgx has a known issue (see Issue #2100) where a connection still in rollback may be handed by the pool to the next query without being properly cleaned up or checked for state.
Cascading Error:  Subsequent queries execute on this "rolling back" dirty connection, triggering "TX rollback error: failed to deallocate cached statement(s): conn closed". The SQL execution fails, surfacing as an HTTP 500 from the API.
Default Configuration Risk:  MaxConns in pgxpool defaults to the number of OS CPU cores. During peak traffic, this exacerbates connection reuse contention and increases the probability of hitting a dirty connection. 
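A minimal sketch of the trigger (our reconstruction, assuming pgx/v5; pg_sleep stands in for Trillian's long SELECT, and DATABASE_URL is illustrative):

    package main

    import (
        "context"
        "log"
        "os"
        "time"

        "github.com/jackc/pgx/v5/pgxpool"
    )

    func main() {
        pool, err := pgxpool.New(context.Background(), os.Getenv("DATABASE_URL"))
        if err != nil {
            log.Fatal(err)
        }
        defer pool.Close()

        // Simulate the gRPC deadline: the query outlives the caller's context.
        ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
        defer cancel()

        // Stand-in for the complex SELECT issued for GetLeavesByRangeRequest.
        if _, err := pool.Exec(ctx, "SELECT pg_sleep(5)"); err != nil {
            log.Printf("query canceled as expected: %v", err)
        }

        // The next acquire may be handed the connection that was still rolling
        // back (Issue #2100); this is where the cascading 500s originated.
        if _, err := pool.Exec(context.Background(), "SELECT 1"); err != nil {
            log.Printf("follow-up query failed on a dirty connection: %v", err)
        }
    }

Whether the second query actually fails depends on timing and pool state, which matches the intermittent, roughly 6% error profile observed during the incident.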
  • Remediation and Mitigation
    • Short-term Fixes
Horizontal Scaling:  Deployed additional trillian_log_server instances to handle traffic load.
Connection Pool Tuning:  Explicitly increased the MaxConns parameter for the pgx connection pool used by trillian_log_server (see the configuration sketch after this list). This adjustment reduced the frequency of connection reuse, effectively diluting the impact of dirty connections and preventing further errors. 
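How the tuning looks in code (a sketch, assuming pgx/v5; the DSN and the value 64 are illustrative, not our production figures):

    package main

    import (
        "context"
        "log"

        "github.com/jackc/pgx/v5/pgxpool"
    )

    func main() {
        cfg, err := pgxpool.ParseConfig("postgres://trillian@db/trillian")
        if err != nil {
            log.Fatal(err)
        }
        // Default MaxConns is tied to the CPU count; raising it lowers the
        // chance that a fresh request is handed a connection still rolling back.
        cfg.MaxConns = 64

        pool, err := pgxpool.NewWithConfig(context.Background(), cfg)
        if err != nil {
            log.Fatal(err)
        }
        defer pool.Close()
    }

The same knob can also be set via the pool_max_conns parameter in the connection string.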
    • Long-term Plan
A fundamental resolution involves deep modifications to the underlying pgx library, which carries significant difficulty and risk.
We will continue to monitor the community discussion (Issue #2100) and evaluate whether to adopt patches in future versions or alternative connection management strategies (one candidate is sketched below). 
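A sketch of one such strategy, assuming pgx/v5 (our illustration, not a committed plan): use pgxpool's BeforeAcquire hook to ping each connection before it is handed out, so a contaminated connection is destroyed by the pool rather than reused, at the cost of one extra round trip per acquire.

    package main

    import (
        "context"
        "log"

        "github.com/jackc/pgx/v5"
        "github.com/jackc/pgx/v5/pgxpool"
    )

    func newCheckedPool(ctx context.Context, dsn string) (*pgxpool.Pool, error) {
        cfg, err := pgxpool.ParseConfig(dsn)
        if err != nil {
            return nil, err
        }
        // Returning false tells the pool to destroy the connection instead of
        // handing it out; a connection stuck mid-rollback should fail the ping.
        cfg.BeforeAcquire = func(ctx context.Context, conn *pgx.Conn) bool {
            return conn.Ping(ctx) == nil
        }
        return pgxpool.NewWithConfig(ctx, cfg)
    }

    func main() {
        // The DSN is illustrative.
        pool, err := newCheckedPool(context.Background(), "postgres://trillian@db/trillian")
        if err != nil {
            log.Fatal(err)
        }
        defer pool.Close()
    }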
