Firstly, apologies for the delay in sending this out. We decided to do a more thorough impact analysis / breakdown and that took up some extra time.
On October 24th 2018, all Trillian-based Google CT logs experienced an approximately 40-minute impact on availability. This occurred from roughly 04:22 to 05:03 Pacific Time (11:22 to 12:03 GMT) and affected the argon20XX and xenon20XX logs, as well as the solera20XX and crucible test logs.
Within this time there was a shorter 18-minute window during which 3653 requests in total (from 157 unique IPv4/v6 addresses) received HTTP 502 responses. The 35153 requests that succeeded during this 18-minute window may have experienced much higher and more variable latency than normal. Over the full ~40-minute impact, 131240 requests (from 224 unique IP addresses) were subject to potentially higher latencies.
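For readers who want the 18-minute window expressed as a rate, here is a minimal sketch of the arithmetic. It assumes the 3653 failed and 35153 successful requests quoted above together make up all traffic in that window; the report does not say whether other response codes were also served, so treat the result as approximate.

```python
# Figures quoted in the report for the 18-minute window.
failed_502 = 3653
successful = 35153

# Assumption: these two figures cover every request in the window.
total_in_window = failed_502 + successful
error_rate = failed_502 / total_in_window

print(f"requests in window: {total_in_window}")
print(f"share answered with 502: {error_rate:.2%}")
```

Under that assumption, a little under one request in ten failed during the worst part of the incident.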
The root cause was an unexpected behavioural change in a network library that we depend on for routing external requests to our servers. As a result, the servers began rejecting all inbound traffic. Automatic checks on the new release binary gave different results across several runs. We believe this was because traffic patterns differed at the time of each run and because internal traffic bypassed the failure. Consequently, the new release was briefly set live before being manually rolled back to the previous version.
Summary timeline of events:
04:22 PT OUTAGE BEGINS - Rollout of new release begins
04:28 PT DETECTION TIME - First warning bug received for raised error rates
04:33 PT ESCALATION TIME - First page for raised error rates
04:34 PT Rollout of new release reaches approx. 90% complete
04:34 PT Rollout aborted
04:36 PT (approx.) On-call requests immediate rollback
04:39 PT Rollback process is initiated to restore previous release
04:45 PT Rollback begins to take effect
05:03 PT OUTAGE ENDS - Rollback complete
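The intervals implied by the timeline can be worked out directly from its timestamps. The sketch below only re-uses the times listed above; the event names in the dictionary are labels chosen for illustration, not terms from any Google tooling.

```python
from datetime import datetime

# Timestamps (Pacific Time) copied from the timeline above.
TIMELINE = {
    "outage_begins": "04:22",
    "detection": "04:28",
    "escalation": "04:33",
    "rollback_initiated": "04:39",
    "outage_ends": "05:03",
}


def minutes_since_start(event: str) -> int:
    """Whole minutes between the start of the outage and the given event."""
    fmt = "%H:%M"
    start = datetime.strptime(TIMELINE["outage_begins"], fmt)
    at = datetime.strptime(TIMELINE[event], fmt)
    return int((at - start).total_seconds() // 60)


for event in TIMELINE:
    print(f"{event:>20}: +{minutes_since_start(event)} min")
```

On these figures, detection came 6 minutes in, the first page 11 minutes in, and the rollback took roughly 24 minutes from initiation to completion.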
As of November 5th 2018, the lowest 90-day availability among the argon20XX logs is 99.9907%, for argon2018.
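To put that percentage in more familiar terms, the following sketch converts it into an equivalent amount of downtime. It assumes availability is measured as a fraction of wall-clock time over a 90-day window; the report does not state how the figure is computed, and it may well be based on request success rates instead.

```python
# Figure quoted in the report for argon2018.
availability_percent = 99.9907
window_days = 90

# Assumption: availability is a fraction of wall-clock time in the window.
window_minutes = window_days * 24 * 60
downtime_minutes = (1 - availability_percent / 100) * window_minutes

print(f"equivalent downtime: {downtime_minutes:.1f} minutes in {window_days} days")
```

That works out to about 12 minutes, well below the ~40 minute impact window. This would be consistent with the figure being computed per request rather than per minute, though that is an inference and not something the report confirms.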
The following list shows the number of 502s we served for each of the affected log endpoints. A few malformed requests have been omitted.
Endpoint | 502s Returned |
/logs/argon2017/ct/v1 | 1 |
/logs/argon2018/ct/v1/get-roots | 1 |
/logs/argon2019/ct/v1/add-chain | 1 |
/logs/argon2021/ct/v1/add-chain | 1 |
/logs/argon2021/ct/v1/get-roots | 1 |
/logs/solera2018/ct/v1/get-entries | 1 |
/logs/solera2019/ct/v1/get-entries | 1 |
/logs/solera2021/ct/v1/get-entries | 1 |
/logs/xenon2019/ct/v1/add-pre-chain | 1 |
/logs/xenon2020/ct/v1/get-roots | 1 |
/logs/xenon2021/ct/v1/add-pre-chain | 1 |
/logs/xenon2022/ct/v1/get-entries | 1 |
/logs/xenon2018/ct/v1/get-entries | 2 |
/logs/xenon2019/ct/v1/get-entries | 2 |
/logs/xenon2021/ct/v1/get-entries | 3 |
/logs/xenon2020/ct/v1/get-entries | 4 |
/logs/argon2018/ct/v1/add-pre-chain | 5 |
/logs/argon2021/ct/v1/get-entries | 8 |
/logs/argon2019/ct/v1/add-pre-chain | 22 |
/logs/argon2021/ct/v1/add-pre-chain | 25 |
/logs/argon2020/ct/v1/get-entries | 36 |
/logs/solera2021/ct/v1/get-sth | 68 |
/logs/solera2018/ct/v1/get-sth | 69 |
/logs/solera2019/ct/v1/get-sth | 69 |
/logs/solera2020/ct/v1/get-sth | 70 |
/logs/solera2022/ct/v1/get-sth | 73 |
/logs/xenon2021/ct/v1/get-sth | 87 |
/logs/xenon2020/ct/v1/get-sth | 88 |
/logs/argon2022/ct/v1/get-sth | 89 |
/logs/crucible/ct/v1/get-sth | 91 |
/logs/xenon2019/ct/v1/get-sth | 98 |
/logs/xenon2018/ct/v1/get-sth | 113 |
/logs/argon2017/ct/v1/get-sth | 115 |
/logs/xenon2022/ct/v1/get-sth | 115 |
/logs/argon2019/ct/v1/get-entries | 137 |
/logs/argon2019/ct/v1/get-sth | 207 |
/logs/argon2018/ct/v1/get-sth | 212 |
/logs/argon2020/ct/v1/get-sth | 216 |
/logs/argon2021/ct/v1/get-sth | 231 |
/logs/argon2020/ct/v1/add-pre-chain | 394 |
/logs/argon2018/ct/v1/get-entries | 523 |
Total | 3184 |
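As a cross-check, the table above can be parsed and re-totalled, and grouped by API method to show where the 502s were concentrated. This is a minimal sketch: the rows are copied verbatim from the table, while the helper names (`parse_rows`, `method_of`) are made up for illustration.

```python
from collections import Counter

# Rows copied verbatim from the table above (header and total omitted).
TABLE = """
/logs/argon2017/ct/v1 | 1 |
/logs/argon2018/ct/v1/get-roots | 1 |
/logs/argon2019/ct/v1/add-chain | 1 |
/logs/argon2021/ct/v1/add-chain | 1 |
/logs/argon2021/ct/v1/get-roots | 1 |
/logs/solera2018/ct/v1/get-entries | 1 |
/logs/solera2019/ct/v1/get-entries | 1 |
/logs/solera2021/ct/v1/get-entries | 1 |
/logs/xenon2019/ct/v1/add-pre-chain | 1 |
/logs/xenon2020/ct/v1/get-roots | 1 |
/logs/xenon2021/ct/v1/add-pre-chain | 1 |
/logs/xenon2022/ct/v1/get-entries | 1 |
/logs/xenon2018/ct/v1/get-entries | 2 |
/logs/xenon2019/ct/v1/get-entries | 2 |
/logs/xenon2021/ct/v1/get-entries | 3 |
/logs/xenon2020/ct/v1/get-entries | 4 |
/logs/argon2018/ct/v1/add-pre-chain | 5 |
/logs/argon2021/ct/v1/get-entries | 8 |
/logs/argon2019/ct/v1/add-pre-chain | 22 |
/logs/argon2021/ct/v1/add-pre-chain | 25 |
/logs/argon2020/ct/v1/get-entries | 36 |
/logs/solera2021/ct/v1/get-sth | 68 |
/logs/solera2018/ct/v1/get-sth | 69 |
/logs/solera2019/ct/v1/get-sth | 69 |
/logs/solera2020/ct/v1/get-sth | 70 |
/logs/solera2022/ct/v1/get-sth | 73 |
/logs/xenon2021/ct/v1/get-sth | 87 |
/logs/xenon2020/ct/v1/get-sth | 88 |
/logs/argon2022/ct/v1/get-sth | 89 |
/logs/crucible/ct/v1/get-sth | 91 |
/logs/xenon2019/ct/v1/get-sth | 98 |
/logs/xenon2018/ct/v1/get-sth | 113 |
/logs/argon2017/ct/v1/get-sth | 115 |
/logs/xenon2022/ct/v1/get-sth | 115 |
/logs/argon2019/ct/v1/get-entries | 137 |
/logs/argon2019/ct/v1/get-sth | 207 |
/logs/argon2018/ct/v1/get-sth | 212 |
/logs/argon2020/ct/v1/get-sth | 216 |
/logs/argon2021/ct/v1/get-sth | 231 |
/logs/argon2020/ct/v1/add-pre-chain | 394 |
/logs/argon2018/ct/v1/get-entries | 523 |
"""


def parse_rows(text):
    """Turn 'path | count |' lines into (path, count) pairs."""
    rows = []
    for line in text.strip().splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        rows.append((cells[0], int(cells[1])))
    return rows


def method_of(path):
    """Return the part after '/ct/v1/'; one row in the table has none."""
    _, _, tail = path.partition("/ct/v1")
    return tail.lstrip("/") or "(none)"


rows = parse_rows(TABLE)
total = sum(count for _, count in rows)
by_method = Counter()
for path, count in rows:
    by_method[method_of(path)] += count

print("total:", total)
for method, count in by_method.most_common():
    print(f"{method:>15}: {count}")
```

The rows do sum to the stated total of 3184, and get-sth accounts for the majority of the 502s, followed by get-entries and add-pre-chain.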
We apologize for this interruption to serving and will be introducing additional deployment checks and monitoring to guard against a future recurrence.
--
You received this message because you are subscribed to the Google Groups "certificate-transparency" group.
To unsubscribe from this group and stop receiving emails from it, send an email to certificate-transp...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/certificate-transparency/CAK76_KVJO4U6y2ax1_F4tzkPGOD8nN%2B1oqZkKUsYRd1RX7u%2B%2BQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
This provides a useful breakdown and impact analysis, but does not appear to rise to the level of a postmortem that helps the community understand root causes and mitigations.
During the recent Apple-hosted CT Policy Days, Martin gave an excellent presentation that arguably could serve as a model post-mortem in its level of detail in analyzing the root cause, its explanation of the architectural considerations, the steps being taken to mitigate those issues, and sufficient context to provide insight into other potential issues. I'm curious if there are plans to share that more broadly, either as part of the minutes from CT Policy Days or as a further follow-up to this incident report.
--
You received this message because you are subscribed to the Google Groups "Certificate Transparency Policy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to ct-policy+...@chromium.org.
To post to this group, send email to ct-p...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/ct-policy/CACM%3D_Ocyry-Z4mw6fVnX55QxYTvU9fm9TWF7UXfsvqB2zsmWtg%40mail.gmail.com.
On Tue, Nov 13, 2018 at 8:55 AM 'Al Cutter' via Certificate Transparency Policy <ct-p...@chromium.org> wrote:
You received this message because you are subscribed to the Google Groups "Google CT Logs" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-ct-log...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/google-ct-logs/CACvaWvYsS8ad94xpw-8EuKZw5etu9OaFQFzcWmMwgbSL-mtYkA%40mail.gmail.com.