Tiger2025h2 submission failures in early September

Rob Stradling

Sep 18, 2025, 6:12:36 AM
to Certificate Transparency Policy
On September 2nd, the Chrome CT team notified Sectigo that:
"We've been tracking an increase in failures across endpoints in tiger2025h2 over the past several days. Though most of the dip is relatively minor, add-chain seems particularly impacted, which has resulted in tiger2025h2 falling below Chrome's requirements of 99% availability per-endpoint.
These issues have taken the form of requests returning HTTP 504 Gateway Timeout or 503 Service Unavailable."

Upon investigating, we found that Tiger2025h2's log-server pods had begun to restart regularly.  The application logs showed evidence of database connection timeouts.  The corresponding PostgreSQL database showed many blocked sessions coming and going, and in every case both the blocking and the blocked sessions were executing the count_estimate() function, which is part of the Trillian Quota Manager's PostgreSQL implementation.
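
For anyone wanting to reproduce this kind of diagnosis, a query along these lines (illustrative, not the exact one we used) lists each blocked session alongside the session blocking it, using the standard pg_stat_activity view and the pg_blocking_pids() function:

    -- Show blocked sessions, their blockers, and what each is executing.
    SELECT blocked.pid   AS blocked_pid,
           blocked.query AS blocked_query,
           blocker.pid   AS blocking_pid,
           blocker.query AS blocking_query
    FROM pg_stat_activity AS blocked
    JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON TRUE
    JOIN pg_stat_activity AS blocker ON blocker.pid = b.pid;

In our case, count_estimate() showed up as both the blocked and the blocking query.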

The count_estimate() function implements two strategies for counting/estimating the number of rows in a table.  The first strategy is fast when the number of rows is "small" but very slow when it is "large", whereas the second strategy is slower when "small" but faster when "large", because it is expected to run in approximately constant time regardless of the number of rows being estimated.  Which strategy is used depends on a fixed row-count threshold.
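
As a rough sketch of that pattern (this is not our actual code; the function name, the table-name handling and the statistics lookup below are simplified assumptions), the logic looks something like this, with the cut-over threshold as a named constant:

    CREATE OR REPLACE FUNCTION count_estimate_sketch(table_name text)
    RETURNS bigint
    LANGUAGE plpgsql AS $$
    DECLARE
        threshold   CONSTANT bigint := 1000;  -- the value we later raised to 10,000
        exact_count bigint;
        estimated   bigint;
    BEGIN
        -- Strategy 1: exact count of up to threshold+1 rows; cheap, and
        -- exact whenever the table is "small".
        EXECUTE format('SELECT count(*) FROM (SELECT 1 FROM %I LIMIT %s) AS t',
                       table_name, threshold + 1)
            INTO exact_count;
        IF exact_count <= threshold THEN
            RETURN exact_count;
        END IF;

        -- Strategy 2: refresh the planner statistics, then read the row
        -- estimate; roughly constant time regardless of table size, but
        -- the ANALYZE takes a lock that concurrent callers queue behind.
        EXECUTE format('ANALYZE %I', table_name);
        SELECT reltuples::bigint INTO estimated
        FROM pg_class
        WHERE oid = format('%I', table_name)::regclass;
        RETURN estimated;
    END;
    $$;

While the table stays below the threshold, calls return quickly from the first branch; once it stays above the threshold, every call pays for the ANALYZE.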

Spot-checking Tiger2025h2's "Unsequenced" database table showed that it was regularly exceeding the fixed threshold of 1,000 rows, presumably because one or more users had recently begun to submit (pre)certificates at a faster rate.  Exceeding this threshold meant that the second strategy was being used, and it turned out that the "slower" runtime of the second strategy, combined with the fact that each session was doing a blocking ANALYZE, meant that the database simply could not keep up with the rate of submissions.
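
For context, the spot-check itself was just a periodic row count of the backlog (illustrative; the exact identifier and quoting of the Unsequenced table may differ in other deployments):

    -- Snapshot of the submission backlog, to compare against the
    -- count_estimate() threshold (1,000 at the time).
    SELECT count(*) AS unsequenced_rows FROM "Unsequenced";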

The original threshold of 1,000 had been set arbitrarily, and we quickly concluded that it was too low.  To resolve the problem, we increased the threshold from 1,000 to 10,000 (see this Pull Request) and then recompiled the count_estimate() function.  This change was rolled out to all of Sectigo's PostgreSQL-based logs, and since then we have not observed any performance problems.