Sectigo Mammoth outage drops log to under 99% availability


Devon O'Brien

Feb 14, 2022, 12:55:14 PM
to Certificate Transparency Policy, #CTOps
Hello ct-policy@ and Sectigo CT Ops,

Late last week, we detected a large downtime event for Sectigo Mammoth (https://mammoth.ct.comodo.com/). It reduced the trailing 90-day availability to below 99% and triggered alerts from our SCT auditing infrastructure that numerous certificates were issued SCTs from Mammoth but were not verifiably included within the 24-hour MMD. As of 14 Feb 2022, we are observing a trailing 90-day uptime of 98.13% (https://www.gstatic.com/ct/compliance/uptime.csv), down from 99.65% on 10 Feb 2022.
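
For a rough sense of scale, the drop in uptime can be translated into hours of downtime. The sketch below is only a back-of-the-envelope estimate: it assumes availability is a simple fraction of time over a fixed 90-day window, whereas the published figure may be computed differently (for example, from probe success rates), and older downtime also ages out of the window as it slides. The function name `implied_downtime_hours` is hypothetical and used for illustration only.

```python
WINDOW_DAYS = 90  # trailing window length mentioned in the message


def implied_downtime_hours(availability_pct: float,
                           window_days: int = WINDOW_DAYS) -> float:
    """Hours of downtime implied by an availability percentage,
    assuming availability is a plain time fraction over the window."""
    return (100.0 - availability_pct) / 100.0 * window_days * 24


before = implied_downtime_hours(99.65)  # figure reported for 10 Feb 2022
after = implied_downtime_hours(98.13)   # figure reported for 14 Feb 2022

# Difference between the two snapshots, under the assumption above.
print(f"additional downtime: about {after - before:.1f} hours")
```

Under these assumptions, the change between the two snapshots corresponds to a little under 33 hours of additional downtime.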

Sectigo CT Ops - Can you please reply to this thread with a postmortem of this incident, covering root cause, remediation, and scope of impact? Was this outage related to past incidents, or was it caused by a new failure mode?

Thanks!
-Devon

Smitty

Jul 6, 2022, 8:28:00 PM
to Certificate Transparency Policy, Devon O'Brien, #CTOps
It appears Mammoth is unavailable again: I am getting 502 Bad Gateway when trying to fetch an STH, and uptime.csv puts Mammoth at 98.9032% uptime.

Devon O'Brien

Jul 20, 2022, 1:26:19 PM
to Certificate Transparency Policy, Devon O'Brien, #CTOps
Hello Sectigo CT Ops,

Can you please look into Mammoth's ongoing availability issues and provide a rundown here? I haven't received a response to my off-thread outreach from earlier this month, when Mammoth dropped below 99%. Availability appears to be continuing to degrade and currently stands at 98.11%. Please let us know whether this issue is ongoing and whether we can expect Mammoth's uptime to level off soon.

-Devon

David Colon

Jul 20, 2022, 9:38:47 PM
to Certificate Transparency Policy, Devon O'Brien, #CTOps
Hello,

Thanks for everyone's patience; it's been quite busy on our end. The 2022/07/06 outage was similar to past outages (i.e., crashes of the binary across our fleet within the "cold boot" time). The outage earlier this week (2022/07/18) was due to a network outage: our edge routers shut off interfaces after hitting critical temperatures. Temperatures rose in the data center because of the unprecedented heat in the region (United Kingdom). Temperature control in the data center is now considered stable and the issue resolved.

We have a tentative date of Q4 of this year for moving to Trillian. We are currently using available cycles, so we might pull this forward if things go well. Our last migration run was earlier this month for our Dodo CT log (a test log), and the results seem promising (the last remaining item is to investigate why two certificates were missing from the tree). We will gladly reach out to the Trillian team for assistance when appropriate.

Best,
Dave