Tiger2026h1 degraded performance - now resolved

Rob Stradling

Nov 20, 2025, 9:20:12 AM
to Certificate Transparency Policy
Yesterday we detected and resolved an issue with Tiger2026h1 that was causing a high number of HTTP 50x responses.

At 2025-11-19 10:32, our internal monitoring alerted us that the Trillian log-server pods for Tiger2026h1 were regularly restarting, and we immediately began an investigation.  We first considered whether this was a recurrence of the problem that affected Tiger2025h2 in early September, but concluded that it was not the cause this time.  The relevant log-server logs showed numerous internal timeouts on database connections.  We also observed that the PostgreSQL VM was running on a Proxmox node that was under high CPU load.

At 2025-11-19 13:30, we performed a live migration of the affected PostgreSQL VM to a less busy Proxmox node.  Close monitoring of the system afterwards confirmed that this VM migration had resolved the problem.

At 2025-11-19 19:10, the Chrome CT Team notified us that since "2025-11-17 11:57 UTC, and significantly ramping up yesterday, we've started seeing a significant number of HTTP 50x status codes (502, 503, 504) on requests to tiger2026h1 across all endpoints. This has resulted in tiger2026h1 falling below the 99% 90-day availability threshold".
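For context, the 99% 90-day availability threshold is a simple ratio of successful responses to total requests over the window. The sketch below illustrates the arithmetic; the request counts are made-up numbers for illustration, not measured Tiger2026h1 figures:

```python
# Illustrative 90-day availability calculation.
# The counts below are hypothetical, not real Tiger2026h1 data.
total_requests = 1_000_000
http_50x_responses = 15_000  # 502/503/504 responses in the window

availability = (total_requests - http_50x_responses) / total_requests
print(f"{availability:.2%}")  # 98.50% -- below the 99% threshold
```

Because the window is 90 days, even a few days of elevated 50x rates can pull a log below the threshold for weeks afterwards.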

We don't currently have any alerting for 50x spikes, but we intend to look into adding this capability.
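For anyone curious what such alerting might look like, a Prometheus-style rule is one common approach. This is only a hedged sketch: the metric name (`http_requests_total`), labels, and thresholds are assumptions for illustration, not a description of our actual monitoring stack:

```yaml
# Hypothetical Prometheus alerting rule for a 50x spike.
# Metric names and labels are illustrative assumptions.
groups:
  - name: ct-log-availability
    rules:
      - alert: High50xRate
        expr: |
          sum(rate(http_requests_total{job="tiger2026h1", code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="tiger2026h1"}[5m])) > 0.01
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "More than 1% of tiger2026h1 responses are HTTP 5xx"
```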