Mammoth scheduled maintenance window and incident report

Martijn Katerbarg

Mar 13, 2024, 4:35:05 PM
to Certificate Transparency Policy

All,

Through this message we would like to inform the community of an upcoming maintenance window affecting all Mammoth sharded CT logs and update the community on a recent availability incident.

Scheduled Maintenance

Sectigo has a scheduled maintenance window on March 16th, 2024 starting at 09:00 UTC. The maintenance window is expected to take no longer than 8 hours. During this maintenance window the logs will be completely unavailable.

Incident Report

On March 5th we were notified by the Chrome Certificate Transparency team that they were observing intermittent availability issues with all Mammoth CT log shards.

Our own logging shows that an anomaly started on March 4th at 16:20 UTC. By 18:13 on March 6th, our logging reported a return to a normal state.

Unfortunately, we were unable to confirm a definitive root cause of the availability issue. Our internal monitoring system did not detect any issues directly with the CT logs; however, several tests performed from outside our network confirmed the issues reported by the Chrome CT team.
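For readers who want to watch a log from outside the operator's network, here is a minimal sketch of an external probe. It assumes Python and the `get-sth` endpoint defined in RFC 6962; the function names, the timeout, and any log URL passed in are hypothetical and are not a description of Sectigo's or Chrome's monitoring.

```python
import urllib.error
import urllib.request
from typing import Iterable


def probe_get_sth(log_url: str, timeout: float = 10.0) -> bool:
    """Return True if the log answers get-sth (RFC 6962) with HTTP 200.

    log_url is the base URL of a CT log shard (hypothetical input).
    """
    url = log_url.rstrip("/") + "/ct/v1/get-sth"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection errors, timeouts, and HTTP error codes count as failures.
        return False


def availability(results: Iterable[bool]) -> float:
    """Fraction of successful probes; 0.0 if there were no probes."""
    results = list(results)
    return sum(results) / len(results) if results else 0.0
```

Running `probe_get_sth` periodically from a host outside the cluster and feeding the results into `availability` would give a rough external view that can be compared against internal monitoring.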

Upon reviewing our system logs, we discovered that the incident started right after we increased the available memory and CPU limits for all our CT logs. We didn’t believe that those increases themselves could have caused this issue, but this discovery led us to investigate the control plane nodes of our Kubernetes cluster. We performed restarts of the control plane nodes, which resolved the issue.

Based on further investigation, we believe that after our Kubernetes cluster renewed several certificates, at least some of the control plane nodes did not automatically switch to using the renewed certificates. Our best guess is that the intermittent availability issues were due to the previous certificates still being in use after they had expired. Automation is great when it works, but when it fails unexpectedly it can sometimes be harder to detect and diagnose the problem than it would have been if a manual mechanism had been used instead!
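As an illustration of the kind of check that could surface a component still presenting an expired certificate, here is a minimal sketch in Python. It assumes the certificate's `notAfter` value has already been obtained in the textual form returned by `ssl.SSLSocket.getpeercert()`; the function names and the sample timestamps are hypothetical and are not taken from the incident.

```python
import ssl
from datetime import datetime, timezone


def seconds_until_expiry(not_after: str, now: datetime) -> float:
    """Seconds from `now` until the certificate expires.

    `not_after` is a string such as 'Jan 15 12:00:00 2024 GMT'.
    `now` must be timezone-aware. The result is negative if the
    certificate has already expired.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    return expires_at - now.timestamp()


def is_expired(not_after: str, now: datetime) -> bool:
    """True if the certificate's notAfter is at or before `now`."""
    return seconds_until_expiry(not_after, now) <= 0


# Hypothetical values, for illustration only.
sample_not_after = "Jan 15 12:00:00 2024 GMT"
one_hour_later = datetime(2024, 1, 15, 13, 0, 0, tzinfo=timezone.utc)
print(is_expired(sample_not_after, one_hour_later))
```

Checking the certificate a running process actually serves, and not only the renewed file on disk, is what would distinguish a successful renewal from one that was never picked up.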

Regards,

Martijn Katerbarg
Sectigo
