TrustAsia log2025a Performance optimization and problem investigation


Jerry Hou

Apr 24, 2025, 2:06:28 AM
to Certificate Transparency Policy

Hi all,

We have been monitoring a large number of requests submitting previously logged certificates to TrustAsia log2025a, as well as repeated log-pulling requests, which have resulted in 502 errors on our CT service.

In order to improve the response performance of our CT services, on April 23, 2025 (UTC+8) we enabled HTTP/2 for the network connection between the data center and the CDN.

After the change, we observed a significant decrease in TCP connections, and 502 errors dropped by 99%. However, on April 24, 2025 (UTC+8), we observed an availability decline on some of the endpoints in Google's CT uptime monitoring; the published availability figures may also be affected by caching. We have already rolled back, and our engineers are investigating the problem further.


We will keep you posted. Thank you.

Jerry Hou

May 7, 2025, 1:37:01 AM
to Certificate Transparency Policy, Jerry Hou


Hi all,


Please find below the incident report for the issue reported on April 24, 2025.


1. Overview

We have been monitoring a large number of requests submitting previously logged certificates to TrustAsia log2025a, as well as repeated log-pulling requests, which resulted in 502 errors on log2025a.

In order to improve the response performance of our CT services, on April 23, 2025 (UTC+8) we deployed an enhanced CT Proxy (built on Pingora) for the TrustAsia log2025a log service, enabled HTTP/2 for the network connection between the CDN and the CT Proxy, and increased the data caching capacity. After the change, we observed a significant decrease in TCP connections, and 502 errors dropped by 99%.

However, on April 24, 2025 (UTC+8), we observed an availability decline on some endpoints in the Google CT uptime monitor; the published availability figures may also be affected by caching. We rolled back, then fixed the underlying issue and verified the fix. The services have recovered and are working normally now.

2. Impact

From 2025-04-23 09:00 +0800 to 2025-04-24 10:00 +0800:

· A large number of 429 responses on some endpoints;
· Some gzip-compressed responses were cached by the CT Proxy and returned to clients that do not support gzip.

3. Timeline

· Since February 2025, there has been a large increase in data submissions and data requests; we had been planning to implement an enhanced CT Proxy;
· March 2025: the new CT Proxy was under testing and functional validation in the test environment;
· 2025-04-22 (UTC+8): A large increase in submissions (870k certs per hour) and pulling requests on log2025a exhausted the server's TCP connection resources and caused multiple HTTP 5xx errors in our monitor;
· 2025-04-23 09:00 +0800: Deployed the new version of the CT Proxy for log2025a;
· 2025-04-23 10:00 +0800: Intermittent alerts for log2025a from our uptime monitor (AWS Lambda);
· 2025-04-23 10:24 +0800: TCP connections dropped significantly and 5xx errors decreased;
· 2025-04-23 22:00 +0800: Modified the gateway configuration to disable caching for get-sth; the alerts stopped;
· 2025-04-24 04:00 +0800: Availability of some endpoints in the Google uptime monitor (get-proof-by-hash) declined from 99.8% to 98.7% (get-sth stayed at 100%);
· 2025-04-24 10:00 +0800: Rolled back the CDN deployment and reverted to the old version of the gateway (HTTP/2 off);
· 2025-04-24 10:10 +0800: TCP connections increased greatly and 5xx errors increased;
· 2025-04-25 04:00 +0800: Availability of some endpoints in the Google uptime monitor (get-proof-by-hash) continued to decline, reaching 98.1% (get-sth stayed at 100%);
· 2025-04-25 23:40 +0800: Updated the CT Proxy with the issues fixed; services recovered, TCP connections decreased greatly, and 5xx errors disappeared;
· 2025-04-27 04:00 +0800: Availability of some endpoints in the Google uptime monitor began to recover.

4. Root Cause Analysis

a. To address the recent surge in abnormal requests, which consumed excessive socket connections and led to an increase in 5xx errors, we implemented an enhanced CT Proxy. This enhancement included enabling HTTP/2, gzip compression, and a caching mechanism. However, after these features were enabled, two issues occurred:

i. There was a problem with the caching implementation: the cache key did not differentiate by Accept-Encoding, so some clients that do not support gzip received compressed data that they could not decode.
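The CT Proxy itself is built on Pingora, but this defect is generic to any caching layer. The Python sketch below (with hypothetical helper names such as make_cache_key; not the actual proxy code) illustrates the idea behind the fix: the cache key must include a normalized form of the client's Accept-Encoding, otherwise a gzip-compressed body cached for one client can be served to another client that only accepts identity encoding.

```python
import hashlib

def normalize_accept_encoding(header_value: str) -> str:
    """Collapse Accept-Encoding into the single variant the cache may store.

    Only two variants are kept here: gzip and identity. Anything that does
    not explicitly offer gzip falls back to identity.
    """
    encodings = {token.split(";")[0].strip().lower()
                 for token in (header_value or "").split(",")}
    return "gzip" if "gzip" in encodings else "identity"

def make_cache_key(method: str, path: str, query: str, accept_encoding: str) -> str:
    """Build a cache key that varies by response encoding (hypothetical helper).

    The original defect was equivalent to omitting the last component, so all
    clients shared one cached body regardless of whether it was compressed.
    """
    variant = normalize_accept_encoding(accept_encoding)
    raw = f"{method}\n{path}\n{query}\n{variant}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Example: these two requests must not share a cache entry.
assert make_cache_key("GET", "/ct/v1/get-sth", "", "gzip, deflate, br") \
    != make_cache_key("GET", "/ct/v1/get-sth", "", "")
```

This mirrors what the Vary: Accept-Encoding response header asks downstream caches to do.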


ii. Abnormal requests were triggered at the CDN level, resulting in some 429 responses. Specifically, under HTTP/2 the Content-Length header did not match the actual content length (Content-Length was smaller), which caused the CDN to resend the same request multiple times within a short period. The upstream CT Proxy then rate-limited these requests and returned 429 responses.
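One plausible way such a mismatch arises is when the proxy transforms the body (for example, compressing it on the fly) but forwards the upstream Content-Length unchanged. Since HTTP/2 delimits the body with its own framing, the straightforward fix, and the one listed in the follow-up actions, is to drop Content-Length whenever the body may have been re-encoded. The following is a minimal Python sketch of that header handling, assuming a hypothetical filter_response hook; the actual proxy is a Pingora (Rust) service.

```python
import gzip

def filter_response(headers: dict, body: bytes, client_accepts_gzip: bool,
                    is_http2: bool) -> tuple[dict, bytes]:
    """Hypothetical response filter illustrating the header fix.

    If the proxy compresses the body on the fly, the upstream Content-Length
    no longer describes the bytes that will actually be sent. Under HTTP/2
    the frame layer delimits the body, so the header is dropped rather than
    recomputed; on HTTP/1.1 it is recomputed so it stays consistent.
    """
    out = dict(headers)

    if client_accepts_gzip and "Content-Encoding" not in out:
        body = gzip.compress(body)              # the body length changes here
        out["Content-Encoding"] = "gzip"

    if is_http2:
        out.pop("Content-Length", None)         # let HTTP/2 framing delimit the body
    else:
        out["Content-Length"] = str(len(body))  # keep the header truthful on HTTP/1.1

    return out, body
```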


b. Why wasn’t this detected in time?


i. After deploying the enhanced CT Proxy, the number of 5xx alerts dropped significantly, although our monitor still occasionally reported 5xx errors. While continuing to observe, we contacted the CDN provider to investigate the intermittent alerts, and it was discovered that some endpoints exhibited abnormal behavior between the CDN and the origin server. Additionally, our external monitoring tool (running on AWS Lambda) could not decode gzip-compressed responses and thus reported 5xx errors of its own, which overlapped with the genuine increase in 5xx alerts. This overlap delayed our response to the issue.
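On the monitoring side, the failure mode is a probe that receives a gzip-compressed body but tries to parse it as plain JSON. Below is a hedged sketch of a probe that handles both cases using only the Python standard library; the function name and base URL parameter are illustrative, not the actual Lambda code.

```python
import gzip
import json
import urllib.request

def fetch_sth(base_url: str) -> dict:
    """Fetch get-sth and decode the body whether or not it is gzip-compressed."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/ct/v1/get-sth",
        headers={"Accept-Encoding": "gzip"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = resp.read()
        # urllib does not transparently decompress, so check the header ourselves.
        if (resp.headers.get("Content-Encoding") or "").lower() == "gzip":
            body = gzip.decompress(body)
    return json.loads(body)
```

A probe written this way only raises an alert when the log itself misbehaves, not when compression happens to be negotiated.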


5. Follow-up Actions

a. Update the CT Proxy:

    i. Remove the Content-Length header on HTTP/2 responses;

    ii. Fix the caching issue by differentiating cache entries by Accept-Encoding.
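Both fixes can be verified from the client side by requesting the same endpoint with and without gzip support and checking that the response is self-consistent. A minimal sketch, assuming a placeholder log URL (substitute the real log2025a endpoint):

```python
import urllib.request

def check(base_url: str, accept_gzip: bool) -> None:
    """Request get-sth and assert the response matches what was negotiated."""
    enc = "gzip" if accept_gzip else "identity"
    req = urllib.request.Request(base_url.rstrip("/") + "/ct/v1/get-sth",
                                 headers={"Accept-Encoding": enc})
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = resp.read()
        encoding = (resp.headers.get("Content-Encoding") or "").lower()

        # Fix ii: a client that did not offer gzip must never receive gzip.
        if not accept_gzip:
            assert encoding != "gzip", "cached gzip body served to a non-gzip client"

        # Fix i: if Content-Length is present it must match the bytes received.
        declared = resp.headers.get("Content-Length")
        if declared is not None:
            assert int(declared) == len(body), "Content-Length does not match the body"

if __name__ == "__main__":
    base = "https://log.example.com"  # placeholder; not the actual log URL
    check(base, accept_gzip=True)
    check(base, accept_gzip=False)
```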

Joe DeBlasio

May 8, 2025, 8:02:01 PM
to Jerry Hou, Certificate Transparency Policy
Thank you very much for the investigation, fixes, and report!

Regarding the availability decline from Chrome's compliance monitor: a current limitation of the compliance monitor is that publicly-published availability data may lag by as much as a day behind what is being measured internally, so the behavior you saw is expected. (Our hope is to address this issue at some point this year. Thank you for your patience until then.)

Best,
Joe, on behalf of the Chrome CT team

