Trust Asia 2021 has produced inconsistent STHs

325 views
Skip to first unread message

Andrew Ayer

unread,
Nov 4, 2020, 10:09:19 AM11/4/20
to ct-p...@chromium.org
The Trust Asia 2021 log (ID Z422Wz50Q7bzo3DV4TqxtDvgoNNR98p0IlDHxvpRqIo=)
has signed the following STHs (also attached):

{"timestamp": 1604492463992, "tree_size": 228935, "sha256_root_hash": "5BNLr36AuR0z8U3FBQ92aA356Gn45STdjllpCRKtSVA=", "tree_head_signature": "BAMASDBGAiEAhC6Sqvs0e2AkisPtcX3BbRO7MDpEQx03wlT5MxMqzNMCIQCGlROYA/mAj2SajTJNlldpbZ3WsM1hFyCVhGDMF+bEoQ=="}

{"timestamp": 1604500136203, "tree_size": 455962, "sha256_root_hash": "o79vMSP0u9jN20OB3qM48trR4T4UGlTBkntLMI+fY4c=", "tree_head_signature": "BAMARzBFAiEA2vgS1U5t6svUsMZBlWSi2oWDWked8mT//QpURGLJogACIG4/zNGLpZKrI1B0Ee10CYuHH5Bq8NHlRoNn8qmG3Ytu"}

However, it is unable to produce a valid consistency proof between
these STHs. In the sequence of entries which this log presented
to my monitor, the root hash at tree size 228935 was really
kuX66A0IvdfjbFHTTrNFtwyL/gLAt662fDxAMGckgmI=.

Regards,
Andrew
sth1.json
sth2.json

Jerry Hou

unread,
Nov 5, 2020, 4:10:24 AM11/5/20
to Certificate Transparency Policy, Andrew Ayer
Hi Andrew,

Thank you for analyzing this problem, Can you share your diagnostic procedures? Let's confirm the problem.

Andrew Ayer

unread,
Nov 5, 2020, 4:19:07 PM11/5/20
to Jerry Hou, Certificate Transparency Policy
On Thu, 5 Nov 2020 01:10:24 -0800 (PST)
Jerry Hou <yry...@gmail.com> wrote:

> Hi Andrew,
>
> Thank you for analyzing this problem, Can you share your diagnostic
> procedures? Let's confirm the problem.

Hi Jerry,

You can use the ctclient program from
https://github.com/google/certificate-transparency-go to verify the
consistency of two STHs. It outputs "Failed to VerifyConsistencyProof"
when you use it with the data from the two STHs:

ctclient -log_uri https://ct.trustasia.com/log2021 -prev_size 228935 -size 455962 -prev_hash 5BNLr36AuR0z8U3FBQ92aA356Gn45STdjllpC
RKtSVA= -tree_hash o79vMSP0u9jN20OB3qM48trR4T4UGlTBkntLMI+fY4c= consistency

Unfortunately, this is not something which can be fixed with the
current log. After inconsistent STHs are produced, the log's Merkle
Tree integrity is permanently compromised and the log needs to be
rebuilt with a new key. However, it would be very valuable to
understand what went wrong so it can be avoided in the future. I would
suggest looking for anomalies in your infrastructure around the time
that these STHs were produced (according to the timestamps, 2020-11-04
12:21:03+00:00 and 2020-11-04 14:28:56+00:00). It might also be helpful
to review the discussion of a previous incident:
https://groups.google.com/a/chromium.org/d/msg/ct-policy/ohtZ64gLN3I/q4nSPkrWCQAJ

Regards,
Andrew

Devon O'Brien

unread,
Nov 5, 2020, 6:46:54 PM11/5/20
to Certificate Transparency Policy, Andrew Ayer, Certificate Transparency Policy, yry...@gmail.com
Hi Andrew,

Thank you for reporting this. We have confirmed this specific inconsistent STH set and are currently investigating whether there are additional consistency issues among the other CT Logs in this set.

TrustAsia - Inconsistent STHs are a serious issue that breaks the required append-only and verifiability properties of CT Logs, and producing such inconsistency violates the ongoing requirements of CT Log Operators as defined in the Chrome CT Policy: https://chromium.github.io/ct-policy/log_policy.html. Please provide a detailed post-mortem that includes an analysis of the root cause of this issue as well as the scope of this issue, including whether other your other CT Logs were affected as well.

-Devon

Kurt Roeckx

unread,
Nov 6, 2020, 5:16:53 PM11/6/20
to Andrew Ayer, ct-p...@chromium.org
Looking at my collection of STHs, I see the following:
tree_size | timestamp | sha256_root_hash |
454612 | 2020-11-04 13:23:04.358+01 | \x0528eff05bfdcc72108eee4110536e89a59fc89ccdb0752db43f92b126656394 |
454612 | 2020-11-04 13:23:04.358+01 | \x0528eff05bfdcc72108eee4110536e89a59fc89ccdb0752db43f92b126656394 |
454606 | 2020-11-04 13:22:04.352+01 | \x0a13b3a41d983bdf381ca43f27a15e42678a2785343feedf3959be8175bb5d2e |
454606 | 2020-11-04 13:22:04.352+01 | \x0a13b3a41d983bdf381ca43f27a15e42678a2785343feedf3959be8175bb5d2e |
454600 | 2020-11-04 13:21:04.327+01 | \x532645f0e4b3747aebf95c72685031da3aa111f96befa3d9d1e85e5a78e64726 |
228935 | 2020-11-04 13:21:03.992+01 | \xe4134baf7e80b91d33f14dc5050f76680df9e869f8e524dd8e59690912ad4950 |
454589 | 2020-11-04 13:20:04.213+01 | \x599dc4a4cb44d78f7c4ce7d5b58fb0e3e872012c40bd75d1c51e51152c2411b7 |
454589 | 2020-11-04 13:20:04.213+01 | \x599dc4a4cb44d78f7c4ce7d5b58fb0e3e872012c40bd75d1c51e51152c2411b7 |
454588 | 2020-11-04 13:19:56.099+01 | \x5b7ad2fbe90e658a58e578d0167e9b9ddd65c265fbe83ff0d34aa6c9b5152170 |
454575 | 2020-11-04 13:18:56.005+01 | \x3ce6d08b1379381fc81950c50b228f9c54843a2e87e04cdca2809afa75b05bf7 |
454570 | 2020-11-04 13:17:55.974+01 | \x035df4f3ddb76a0d8e0938cff8994dd0e9fbda68a612260473d6cc386e1a5489 |
454570 | 2020-11-04 13:17:55.974+01 | \x035df4f3ddb76a0d8e0938cff8994dd0e9fbda68a612260473d6cc386e1a5489 |
[...]
228990 | 2020-10-21 04:38:02.341+02 | \xb773a58109a44def3331463ef9875d6ac684872f856f0f4b0ec5319d10f6f736 |
228990 | 2020-10-21 04:38:02.341+02 | \xb773a58109a44def3331463ef9875d6ac684872f856f0f4b0ec5319d10f6f736 |
228980 | 2020-10-21 04:37:02.416+02 | \xc013668d5b67ca5ab75ba1892bc5d8195b5b733b22cb2d9a83e8252eeda0576f |
228980 | 2020-10-21 04:37:02.416+02 | \xc013668d5b67ca5ab75ba1892bc5d8195b5b733b22cb2d9a83e8252eeda0576f |
228968 | 2020-10-21 04:36:02.262+02 | \xf97671f9953a08d9e0cf18c710e5c10178eeeceda3fffa5a2c75dee3c617f88d |
228957 | 2020-10-21 04:35:02.307+02 | \x54abb304dc8cbd3fdbad815eccadc622b0c70cbe318620e7cc7b1d0c0ab7d48b |
228947 | 2020-10-21 04:34:02.567+02 | \x58baa2b1138872880863b594e895c5f4076b57d4bb1ea2fc41197805f430dc42 |
228924 | 2020-10-21 04:30:13.041+02 | \xafe37396b272621724bbec3d02484154a2c7ff8bc8c59d8ad0ff047b82c684be |
228924 | 2020-10-21 04:30:13.041+02 | \xafe37396b272621724bbec3d02484154a2c7ff8bc8c59d8ad0ff047b82c684be |
228924 | 2020-10-21 04:30:13.041+02 | \xafe37396b272621724bbec3d02484154a2c7ff8bc8c59d8ad0ff047b82c684be |
228916 | 2020-10-21 04:29:13.11+02 | \x8de64bbf088312345025f82d7e7e4059f75158b4a5a87f669f7fb309d4995749 |
228916 | 2020-10-21 04:29:13.11+02 | \x8de64bbf088312345025f82d7e7e4059f75158b4a5a87f669f7fb309d4995749 |
228902 | 2020-10-21 04:28:13.068+02 | \xefdc9f7b6f6d055f1074a84355e50cb8c9c3a654518d05167f524b88466cbaee |
228902 | 2020-10-21 04:28:13.068+02 | \xefdc9f7b6f6d055f1074a84355e50cb8c9c3a654518d05167f524b88466cbaee |

Note that the size of 228935 should be have been around 2020-10-21
04:30+02, but was generated on 2020-11-04 13:21:03.992+01 instead.

The entries with the same timestamp and sha256_root_hash actually
have a different signature. For the samples I looked that, it's
the only case I see 2 STHs signed so close to each other. Usually
it's only signed once every minute. But the good one and the bad
one seem to have signed within 1 second of each other.


Kurt

Jerry Hou

unread,
Nov 6, 2020, 11:47:04 PM11/6/20
to Certificate Transparency Policy, Kurt Roeckx, ct-p...@chromium.org, Andrew Ayer
Hello,

We carried out an investigation into the accident.

Overview:

Around 2020-11-04 22:35 CST our ELB monitor alerts that a large number of http 500 errors are occurring in the production environment. Looking at the specific error messages returned, we see a large number of "Error 1040: Too many connections". We think this is a burst of requests causing too many connections to the database. Since we are using an older version of Trillian code and have not set a database connection limit, we have taken steps to restart a single server to relieve the connection pressure.


About 2020-11-05 12:31 CST Our ELB monitoring alerts are monitoring for a large number of 500 errors. Still too many database connections are causing the error. So we temporarily expanded and upgraded the database service at 2020-11-05 12:40 CST.

 

Root Cause:

After checking and comparing, we finally found that the cause of this failure was that the machine used by the test cluster for elastic scaling was incorrectly connected to the production cluster, due to our test cluster incorrectly using the production environment ETCD address and the test machine used for elastic scaling not properly setting up the network isolation during the test. This results in a large number of writes to the production database and a small portion of the data that should have been stored in the production database is stored in the test database. 

Path: ct_submit -> ct_server(prod) -> etcd(prod) -> log_server(test) -> database(test).

 

Detail Information:

Following the unexplained ETCD failures of 2020-10-14 to 2020-10-17, we conducted upgrade drill, stress tests, and failure drills in order to reproduce the ETCD failure problem and to simulate the production environment. We cloned the production environment for use as a test cluster. The ETCD cluster and storage database for this test cluster also used the cloned set. The initial stress tests and failure simulation tests performed normally with no surprises.


 On or about 2020-11-04 21:40 CST, the machine we use for resilient scaling goes online for testing in the test cluster. Approximately 2020-11-04 21:45 CST the traffic from our stress test started to enter the test cluster and 2020-11-04 22:50 CST the test stress stopped with an average of approximately 36,000 pre-certificate submission requests per minute and a peak of 64,000 pre-certificate submission requests per minute. This resulted in a large number of http 500 alarms on our production system for the first time and the "Error 1040: Too many connections" error was detected. The alarms stopped after the test pressure stopped.


At approximately 2020-11-05 11:23 CST we begin a second resilience scaling validation and stress test, which continues until 2020-11-05 17:09 CST. The average number of pre-certificate submission requests per minute is approximately 41,000, with a peak of approximately 70,000 pre-certificate submission requests per minute. This caused us to monitor the online http 500 alert for the second time and the "Error 1040: Too many connections" error. We upgraded our database to handle this.


Around 2020-11-05 17:50 CST, finally after troubleshooting, we found this misconfiguration and misconnected to the production cluster's Elastic Scaling walkthrough machine. And we stopped this machine that had the error.

 

Problems caused by this issue:

1. Caused the production server database resources to be overloaded and generate a large number of query connections, resulting in database overload and several brief http 500 alarms.

2. Caused a small number of production servers to incorrectly deposit data into the test database and the production environment Merkle tree had consistency issues. 

3. This issue affects Trust Asia 2020/Trust Asia 2021 and not Trust Asia 2022/Trust Asia 2023.

The Trust Asia 2020/2023 data is very small. We observed that at the beginning of the accident (about 2020-11-04 21:40 CST), SCT without production environment entered the test environment:

Trust Asia 2022 last SCT: 2020-10-21T02:32:13Z

Trust Asia 2023 last SCT: 2020-10-21T02:32:11Z


test_2023_certlist.txt
test_2022_certlist.txt

Al Cutter

unread,
Nov 9, 2020, 4:20:05 AM11/9/20
to Jerry Hou, Certificate Transparency Policy, Kurt Roeckx, Andrew Ayer, google-ct-logs
Hi Jerry,

Many thanks for sending this (and of course to Kurt and Andrew for detailed and vigilant observation and investigation as always)! 

Reading through made me wonder how we could help folks avoid this situation in future.

Normally, I would have expected that the Trillian tree IDs were different in your production and test environments - in the failure scenario you describe I'd have expected prod/test CTFEs to have simply returned 500 errors for all requests going to an incorrect logserver (i.e. test->prod or prod->test) because the tree IDs sent with each request should have been unknown to the Trillian logserver.
 
Unfortunately this was not the case because your test instance was built on cloned data from your production setup - even if you didn't clone the private key material for the prod ctfe instances they were unable to tell the difference between the "forked" trees in the two environments.

Looking forward, we do have a shared resource with some guidance and potential pitfalls to avoid for log operators here: https://github.com/google/certificate-transparency-go/blob/master/trillian/docs/Operation.md
but it could do with being embraced and expanded so we can all learn from each other - to that end it would be good to update it with a general note about these sorts of testing scenarios too.

Similarly and more generally, I wonder if there are lessons from past post-mortems/general guidance from other log operators which could also be incorporated?

Anyway, for what it's worth, my personal thanks for a very open investigation report, and despite the unfortunate situation kudos for running the stress tests and drills as part of your log operation.

Cheers,
Al.
--
You received this message because you are subscribed to the Google Groups "Certificate Transparency Policy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ct-policy+...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/ct-policy/d17cbb81-fd76-4503-8833-8d18828dbdc8n%40chromium.org.

Devon O'Brien

unread,
Nov 17, 2020, 2:19:00 AM11/17/20
to Certificate Transparency Policy, Al Cutter, Certificate Transparency Policy, Kurt Roeckx, Andrew Ayer, google-ct-logs, yry...@gmail.com
I echo Al's thanks for the observations and reports from the community as well as TrustAsia's investigation and subsequent analysis of the incident. Due to the nature of this failure (inconsistent STHs), Chrome will be moving to Retire the affected Logs, which will be announced in a separate thread very shortly.

TrustAsia is welcome to re-apply for inclusion for a 2021 CT Log shard to ensure contiguous coverage of their expiry ranges along side their 2022 and 2023 shards.

-Devon
Reply all
Reply to author
Forward
0 new messages