mammoth.ct.comodo.com is down (04/18/2018)


David Colon

Apr 18, 2018, 1:59:54 PM
to Certificate Transparency Policy
Hello all,

I just wanted to write a brief report and alert that mammoth.ct.comodo.com is currently down.

Quick brief:
We were alerted to Mammoth's downtime at around 16:00 UTC.  A quick look through the warning, error, and fatal logs of the processes that were running before the crash turned up nothing useful.  Our monitoring automatically restarted the dead processes between 16:01 and 16:04 UTC.  Unfortunately, due to the size of Mammoth (59MM+), I expect a downtime of ~3-5 hours, based on our recent experience restarting the "ct-server" binary with a new "ca-roots.pem" file this past weekend.  For those unaware, we are still running "SuperDuper" and have yet to switch to a Trillian-based log.

I will update this thread as soon as I have more information about Mammoth's downtime.  I will most likely be contacting the Google CT development team for any support they can provide in deciphering the INFO logs that "SuperDuper" produces.

Best,
David Colon
Comodo CA

David Colon

Apr 18, 2018, 4:12:11 PM
to Certificate Transparency Policy
Mammoth is now back up since 20:52 UTC.

Rob Stradling

Apr 19, 2018, 5:56:58 AM
to David Colon, Certificate Transparency Policy
Nit: The downtime was actually between 17:00 UTC and 19:52 UTC.

Time zones are hard.
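
For illustration only, here is a minimal Python sketch of the conversion that is easy to get wrong. The UTC-4 offset and the 1:00 PM local example time are assumptions made for this sketch; only the 17:00 and 19:52 UTC figures come from the correction above.

```python
from datetime import datetime, timedelta, timezone

# Assumption for this sketch: the reporter's local clock is at UTC-4.
LOCAL = timezone(timedelta(hours=-4), "UTC-4")


def to_utc(local_time: datetime) -> datetime:
    """Convert a timezone-aware datetime to UTC; reject naive values."""
    if local_time.tzinfo is None:
        raise ValueError("refusing to convert a naive datetime")
    return local_time.astimezone(timezone.utc)


# Hypothetical local reading of 1:00 PM at UTC-4.
alert_local = datetime(2018, 4, 18, 13, 0, tzinfo=LOCAL)
alert_utc = to_utc(alert_local)

# Corrected outage window, both values already in UTC.
down = datetime(2018, 4, 18, 17, 0, tzinfo=timezone.utc)
up = datetime(2018, 4, 18, 19, 52, tzinfo=timezone.utc)
outage = up - down

print(alert_utc.isoformat())  # 2018-04-18T17:00:00+00:00
print(outage)                 # 2:52:00
```

Attaching the offset to the timestamp and letting the library do the arithmetic avoids adding or subtracting hours by hand.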

On 18/04/18 21:12, David Colon wrote:
> Mammoth is now back up since 20:52 UTC.

--
Rob Stradling
Senior Research & Development Scientist
Email: R...@ComodoCA.com

David Colon

Apr 25, 2018, 11:08:07 PM
to Certificate Transparency Policy, david.col...@gmail.com
Hello again,

I debated creating a new thread to discuss the downtime that occurred on 4/20/18.  mammoth.ct.comodo.com was plagued by another set of dead processes, with no clear errors or warnings in the logs.  I suspected that the OOM killer might have been triggered; however, the OS on these servers is set to panic and reboot instead of killing a process.  In this particular case, the ct-server binary was simply dead, and the server did not reboot.  Health graphs showed everything to be fine, so I hypothesized that a lack of memory is what caused ct-server to crash.  Mammoth has been stable for the past few days since we increased its memory (to almost double).  We are planning to do the same with Sabre tomorrow to avoid any unnecessary downtime for that particular CT log.
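
As a rough sketch of how the "panic instead of kill" behaviour can be checked, the Python snippet below interprets the Linux `vm.panic_on_oom` sysctl. This assumes a Linux host, which the message does not state, and the helper names are hypothetical; it is not a description of how these servers are actually configured.

```python
from pathlib import Path
from typing import Optional

# Linux exposes the vm.panic_on_oom sysctl at this path (assumes a Linux host).
PANIC_ON_OOM = Path("/proc/sys/vm/panic_on_oom")

# Meanings of the documented values of vm.panic_on_oom.
MEANINGS = {
    0: "OOM killer kills a process; the kernel does not panic",
    1: "kernel panics on system-wide OOM, but may still kill a task "
       "when the OOM is confined by cpusets or a memory policy",
    2: "kernel always panics on OOM",
}


def describe_panic_on_oom(raw: str) -> str:
    """Translate the raw sysctl text into a human-readable description."""
    value = int(raw.strip())
    return MEANINGS.get(value, f"unrecognised value {value}")


def read_setting(path: Path = PANIC_ON_OOM) -> Optional[str]:
    """Return the raw setting, or None when it cannot be read."""
    try:
        return path.read_text()
    except OSError:
        return None


if __name__ == "__main__":
    raw = read_setting()
    print("setting not available" if raw is None else describe_panic_on_oom(raw))
```

With a setting of 1 or 2, a process that dies without a reboot points away from a kernel-level OOM event, which matches the observation above.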

- David Colon