Overview:
Around 2020-11-04 22:35 CST our ELB monitor alerts that a large number of http 500 errors are occurring in the production environment. Looking at the specific error messages returned, we see a large number of "Error 1040: Too many connections". We think this is a burst of requests causing too many connections to the database. Since we are using an older version of Trillian code and have not set a database connection limit, we have taken steps to restart a single server to relieve the connection pressure.
About 2020-11-05 12:31 CST Our ELB monitoring alerts are monitoring for a large number of 500 errors. Still too many database connections are causing the error. So we temporarily expanded and upgraded the database service at 2020-11-05 12:40 CST.
Root Cause:
After checking and comparing, we finally found that the cause of this failure was that the machine used by the test cluster for elastic scaling was incorrectly connected to the production cluster, due to our test cluster incorrectly using the production environment ETCD address and the test machine used for elastic scaling not properly setting up the network isolation during the test. This results in a large number of writes to the production database and a small portion of the data that should have been stored in the production database is stored in the test database.
Path: ct_submit -> ct_server(prod) -> etcd(prod) -> log_server(test) -> database(test).
Detail Information:
Following the unexplained ETCD failures of 2020-10-14 to 2020-10-17, we conducted upgrade drill, stress tests, and failure drills in order to reproduce the ETCD failure problem and to simulate the production environment. We cloned the production environment for use as a test cluster. The ETCD cluster and storage database for this test cluster also used the cloned set. The initial stress tests and failure simulation tests performed normally with no surprises.
On or about 2020-11-04 21:40 CST, the machine we use for resilient scaling goes online for testing in the test cluster. Approximately 2020-11-04 21:45 CST the traffic from our stress test started to enter the test cluster and 2020-11-04 22:50 CST the test stress stopped with an average of approximately 36,000 pre-certificate submission requests per minute and a peak of 64,000 pre-certificate submission requests per minute. This resulted in a large number of http 500 alarms on our production system for the first time and the "Error 1040: Too many connections" error was detected. The alarms stopped after the test pressure stopped.
At approximately 2020-11-05 11:23 CST we begin a second resilience scaling validation and stress test, which continues until 2020-11-05 17:09 CST. The average number of pre-certificate submission requests per minute is approximately 41,000, with a peak of approximately 70,000 pre-certificate submission requests per minute. This caused us to monitor the online http 500 alert for the second time and the "Error 1040: Too many connections" error. We upgraded our database to handle this.
Around 2020-11-05 17:50 CST, finally after troubleshooting, we found this misconfiguration and misconnected to the production cluster's Elastic Scaling walkthrough machine. And we stopped this machine that had the error.
Problems caused by this issue:
1. Caused the production server database resources to be overloaded and generate a large number of query connections, resulting in database overload and several brief http 500 alarms.
2. Caused a small number of production servers to incorrectly deposit data into the test database and the production environment Merkle tree had consistency issues.
3. This issue affects Trust Asia 2020/Trust Asia 2021 and not Trust Asia 2022/Trust Asia 2023.
The Trust Asia 2020/2023 data is very small. We observed that at the beginning of the accident (about 2020-11-04 21:40 CST), SCT without production environment entered the test environment:
Trust Asia 2022 last SCT: 2020-10-21T02:32:13Z
Trust Asia 2023 last SCT: 2020-10-21T02:32:11Z
--
You received this message because you are subscribed to the Google Groups "Certificate Transparency Policy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ct-policy+...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/ct-policy/d17cbb81-fd76-4503-8833-8d18828dbdc8n%40chromium.org.